
Article · July 23, 2024 · about a 4 minute read

Databricks Station - InterSystems Cloud SQL

 

A Quick Start to InterSystems Cloud SQL Data in Databricks

Getting up and running in Databricks against InterSystems Cloud SQL consists of four parts.

  • Obtaining Certificate and JDBC Driver for InterSystems IRIS
  • Adding an init script and external library to your Databricks Compute Cluster
  • Getting Data
  • Putting Data

 

Download X.509 Certificate/JDBC Driver from Cloud SQL

Navigate to the overview page of your deployment. If you do not have external connections enabled, enable them, then download your certificate and the JDBC driver from the overview page.

 

I have used intersystems-jdbc-3.8.4.jar and intersystems-jdbc-3.7.1.jar successfully in Databricks; both are available from Driver Distribution.

Init Script for your Databricks Cluster

The easiest way to import one or more custom CA certificates into your Databricks cluster is to create an init script that adds the entire CA certificate chain to both the Linux SSL and Java default certificate stores, and sets the REQUESTS_CA_BUNDLE property. Paste the contents of your downloaded X.509 certificate into the top block of the following script:

import_cloudsql_certificate.sh
#!/bin/bash

cat << 'EOF' > /usr/local/share/ca-certificates/cloudsql.crt
-----BEGIN CERTIFICATE-----
<PASTE>
-----END CERTIFICATE-----
EOF

# Refresh the Linux CA store so the new certificate is trusted system-wide
update-ca-certificates

PEM_FILE="/etc/ssl/certs/cloudsql.pem"
PASSWORD="changeit"
JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")
KEYSTORE="$JAVA_HOME/lib/security/cacerts"
CERTS=$(grep 'END CERTIFICATE' $PEM_FILE| wc -l)

# To process multiple certs with keytool, you need to extract
# each one from the PEM file and import it into the Java KeyStore.
for N in $(seq 0 $(($CERTS - 1))); do
  ALIAS="$(basename $PEM_FILE)-$N"
  echo "Adding to keystore with alias:$ALIAS"
  cat $PEM_FILE |
    awk "n==$N { print }; /END CERTIFICATE/ { n++ }" |
    keytool -noprompt -import -trustcacerts \
            -alias $ALIAS -keystore $KEYSTORE -storepass $PASSWORD
done
echo "export REQUESTS_CA_BUNDLE=/etc/ssl/certs/ca-certificates.crt" >> /databricks/spark/conf/spark-env.sh
echo "export SSL_CERT_FILE=/etc/ssl/certs/ca-certificates.crt" >> /databricks/spark/conf/spark-env.sh

Now that you have the init script, upload it to a Volume in Unity Catalog.
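
If you prefer code over the UI upload, you can also write the script contents straight to a Volume from any notebook with dbutils. The sketch below assumes a hypothetical Volume path /Volumes/main/default/scripts; adjust the catalog, schema, and volume names to your own.

# Write the init script to a Unity Catalog Volume (hypothetical catalog/schema/volume names)
init_script = """#!/bin/bash
# ... paste the full contents of import_cloudsql_certificate.sh from above ...
"""

dbutils.fs.put(
    "/Volumes/main/default/scripts/import_cloudsql_certificate.sh",
    init_script,
    overwrite=True,
)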

Once the script is on a Volume, you can add it to the cluster as an init script under the Advanced options of your cluster configuration.


Next, add the InterSystems JDBC driver/library to the cluster...

...and either start or restart your compute.

Databricks Station - Inbound InterSystems IRIS Cloud SQL

 

Create a Python notebook in your workspace, attach it to your cluster, and test pulling data inbound to Databricks. Under the hood, Databricks is using PySpark, if that is not immediately obvious.

The following Spark DataFrame construction is all you should need; you can grab your connection info from the overview page as before.

df = (spark.read
  .format("jdbc")
  .option("url", "jdbc:IRIS://k8s-05868f04-a4909631-ac5e3e28ef-6d9f5cd5b3f7f100.elb.us-east-1.amazonaws.com:443/USER")
  .option("driver", "com.intersystems.jdbc.IRISDriver")
  .option("dbtable", "(SELECT name, category, review_point FROM SQLUser.scotch_reviews) AS temp_table")
  .option("user", "SQLAdmin")
  .option("password", "REDACTED")
  .option("connection security level", "10")
  .option("sslConnection", "true")
  .load())

df.show()

Illustrating the dataframe output from data in Cloud SQL... boom!
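
Beyond show(), a couple of standard PySpark calls make for a quick sanity check; nothing here is IRIS-specific:

# Quick sanity checks on the DataFrame pulled from Cloud SQL
df.printSchema()                 # column names and types inferred from the IRIS table
print(df.count(), "rows read")   # triggers a full read, so use sparingly on large tables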

Databricks Station - Outbound InterSystems IRIS Cloud SQL

 

Let's now take what we read from IRIS and write it back with Databricks. If you recall, we read only three fields into our DataFrame, so let's write that straight back and specify "overwrite" mode.

df = (spark.read
  .format("jdbc")
  .option("url", "jdbc:IRIS://k8s-05868f04-a4909631-ac5e3e28ef-6d9f5cd5b3f7f100.elb.us-east-1.amazonaws.com:443/USER")
  .option("driver", "com.intersystems.jdbc.IRISDriver")
  .option("dbtable", "(SELECT TOP 3 name, category, review_point FROM SQLUser.scotch_reviews) AS temp_table")
  .option("user", "SQLAdmin")
  .option("password", "REDACTED")
  .option("connection security level", "10")
  .option("sslConnection", "true")
  .load())

df.show()

mode = "overwrite"
properties = {
    "user": "SQLAdmin",
    "password": "REDACTED",
    "driver": "com.intersystems.jdbc.IRISDriver",
    "sslConnection": "true",
    "connection security level": "10",
}

df.write.jdbc(
    url="jdbc:IRIS://k8s-05868f04-a4909631-ac5e3e28ef-6d9f5cd5b3f7f100.elb.us-east-1.amazonaws.com:443/USER",
    table="databricks_scotch_reviews",
    mode=mode,
    properties=properties,
)
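
To confirm the round trip, you can read the newly written table straight back with the same url and properties; a minimal sketch:

# Read the table we just wrote back out of IRIS Cloud SQL to verify the write
written_df = spark.read.jdbc(
    url="jdbc:IRIS://k8s-05868f04-a4909631-ac5e3e28ef-6d9f5cd5b3f7f100.elb.us-east-1.amazonaws.com:443/USER",
    table="databricks_scotch_reviews",
    properties=properties,
)
written_df.show()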

Executing the Notebook

 
Illustrating the data in InterSystems Cloud SQL!

Things to Consider

  • By default, PySpark writes data using multiple concurrent tasks, which can result in partial writes if one of the tasks fails.
  • To ensure that the write operation is atomic and consistent, you can configure PySpark to write data using a single task (i.e., set the number of partitions to 1), as sketched after this list, or use an IRIS-specific feature like transactions.
  • Additionally, you can push filtering and aggregation into the query you pass to the JDBC reader (as in the dbtable subqueries above), or apply them with PySpark's DataFrame API before writing, to reduce the amount of data that needs to be transferred over the network.
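
A minimal sketch of that single-task write, reusing the df and properties from above (the filter and the target table name here are just examples, not from the article):

# Collapse the DataFrame to one partition so the JDBC write runs as a single task,
# avoiding partial writes if an individual task were to fail
single_task_df = df.filter(df.review_point > 90).coalesce(1)   # optional filter before writing

single_task_df.write.jdbc(
    url="jdbc:IRIS://k8s-05868f04-a4909631-ac5e3e28ef-6d9f5cd5b3f7f100.elb.us-east-1.amazonaws.com:443/USER",
    table="databricks_scotch_reviews_filtered",  # example table name
    mode="overwrite",
    properties=properties,
)
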
2 Comments
Question · July 23, 2024

JWT and CORS

Hi

I'm trying to use JWT authentication on a REST application in IRIS. The login APIs are correctly "injected" into the application. Login works fine with Postman and other REST clients, and subsequent calls to my REST API using the bearer token work fine (correctly authenticated). So far, so good.

The problem is that it doesn't work with Axios, so I can only test it; I cannot integrate it into my application. I found out the reason for this is that Axios applies CORS whereas Postman and other REST test clients do not; that is, they don't send the "preflight" OPTIONS request, they send the POST request directly, and apparently IRIS is happy not having to deal with CORS in this case (in other words, it does not check the "allow origin" header, probably because there was no OPTIONS call?). Unfortunately, the reverse is not true. Axios sends the OPTIONS preflight, to which IRIS responds with a 500 internal server error and no "Access-Control-Allow-Origin" response header. Axios still attempts to send the POST, but it gets an NS_ERROR_DOM_BAD_URI error because it failed CORS validation.

As far as I can tell based on the CORS specification, Axios is right to apply CORS in this case, as this query does not (and CAN not) match the criteria for not using CORS (as described on MDN: https://developer.mozilla.org/en-US/docs/Web/HTTP/CORS). There does not seem to be a way to force Axios not to send the pre-flight when the situation is legitimate for CORS.

I have correctly configured the dispatch class to handle CORS; without JWT authentication, CORS is handled correctly. However the dispatch class is never invoked for the JWT authentication login API (not even OnPreHTTP). Apparently the CSP engine is handling the login API on its own and never delegates to the dispatch class. So as far as I can tell, there's no way to intercept the OPTIONS call to somehow ignore or accept it.

Is there a way to make this work in IRIS? Or conversely, does anyone know of a (minimally invasive) way to force Axios to bypass the pre-flight for this specific call?

Thanks

9 Comments
Question · July 23, 2024

Issue with SOAP Service After Enabling Basic Authentication

Hi Team,

My SOAP functions were working perfectly before enabling basic authentication. To set up basic authentication, I created web applications for the SOAP service, checked the password option, and assigned a user to this web application. However, after enabling basic authentication, the SOAP service stopped working.

3 Comments
Question · July 22, 2024

SQL Gateway getClob() errors in Redshift and Postgresql

I'm trying to use the EnsLib.SQL.Operation.GenericOperation component in a production to read a column from a Redshift table that is set up as VARCHAR(65535) and am getting the following error.  

An error was received : ERROR #5023: Remote Gateway Error: JDBC Gateway getClob(0,1) errorRemote JDBC error: Cannot convert the column of type VARCHAR to requested type long..

The query I'm using is a simple 'SELECT column_name FROM table_name'.  I've done a little research and it sounds like Redshift doesn't support getClob().  Is there anything I can do to force the gateway not to use getClob() on this Redshift column, or some other workaround?  Casting the column in my select statement doesn't work... it seems the getClob() call is made under the hood by the gateway, and I would potentially need some way to override that?

Thanks,

Don Martin, Sanford Health

5 Comments
Question · July 21, 2024

Having trouble replicating IrisVectorStore Llama Index demo from iris-vector-search for my program's user table

I'll preface this by saying I'm not sure whether I've found a bug or I'm somehow misusing IrisVectorStore.

Basically I have code from the regular llama-index module working in my Python project which has SimpleDirectoryReader objects similar in nature to the demo I mentioned (https://github.com/intersystems-community/iris-vector-search/blob/main/d...). And I have other code working that can add new users to a SQL table in Iris.

I tried to use IRISVectorStore in a manner similar to the excerpt below from the demo code, but I changed the table name to the name of my user table, and I also changed the documents object in that code to my own SimpleDirectoryReader object.

However, no matter how many times I run it with those changes, I get a flurry of exceptions whose trace makes little sense to me. I can confirm that the code I have in place to connect to my user table locally does work.

I won't attach the trace yet unless someone asks, but my question is basically: does anybody know for sure that IRISVectorStore can successfully extract information from a user table? Or might I have hit some weird edge case in trying to use it?

# StorageContext captures how vectors will be stored
vector_store = IRISVectorStore.from_params(
    connection_string = url,
    table_name = "paul_graham_essay",
    embed_dim = 1536,  # openai embedding dimension
    engine_args = { "connect_args": {"sslcontext": sslcontext} }
)
2 Comments