Create a Connection to Apache Spark
You can use Apache Spark as a data source with the Privitar Data Security Platform.
To connect to Apache Spark, you must:
Meet the Apache Spark Connection Prerequisites
Build an Apache Spark Connection String
Authenticate to Apache Spark
Meet the Apache Spark Connection Prerequisites
Note
Most of the settings for the Spark Thrift server are the same as those for HiveServer2. To learn more, see https://spark.apache.org/docs/latest/sql-distributed-sql-engine.html
Before you connect to Apache Spark, you must:
Have a system user that can authenticate to Apache Spark with a username and password and that has read access to the relevant databases and tables
Have access to the SSL certificate used to encrypt the connection (or the relevant certificate authority certificates)
If your Secure Sockets Layer (SSL) data source uses privately signed server certificates, you must modify your data plane's truststore to trust those certificates, as follows:
Obtain the SSL certificate from the data source.
Convert the SSL certificate to a JKS truststore (see the sketch after this procedure).
Copy the truststore into the shared/truststores/ location of your data plane configuration mounted volume (the volume used to store JDBC drivers).
Note
You will need to refer to this truststore when configuring the SSL JDBC properties. By default, the truststore is mounted on /config/shared/truststores/truststore.jks.
The mounted volume's directory structure should look similar to the following:
├─ shared/
|    └── jdbc-drivers/
|         └── hive-42.2.23.jar
|    └── truststores/
|         └── truststore.jks
├─ data-agent
|    └── EMPTY
├── data-proxy
|    └── EMPTY
Download the JDBC JAR driver that you will use to connect to the data source.
Place the JDBC JAR driver into the shared/jdbc-drivers/ location of your data plane configuration mounted volume (the volume used to store JDBC drivers).
For example, the SSL settings for Spark might look like the following:
jdbc:hive2://ip-172-31-26-172.eu-west-2.compute.internal:10000/default;ssl=true;sslTrustStore=/config/shared/truststores/truststore.jks;trustStorePassword=changeit
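The truststore conversion above is commonly performed with the JDK's keytool utility. Purely as an illustrative sketch of that step, the following Java program builds a JKS truststore from a PEM-encoded server certificate using the standard java.security APIs; the file names, alias, and changeit password are example values only:

```java
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.security.KeyStore;
import java.security.cert.Certificate;
import java.security.cert.CertificateFactory;

public class BuildTruststore {
    public static void main(String[] args) throws Exception {
        // Load the PEM-encoded server certificate obtained from the data source
        // (the file name is an example).
        Certificate cert;
        try (FileInputStream in = new FileInputStream("spark-server.pem")) {
            cert = CertificateFactory.getInstance("X.509").generateCertificate(in);
        }

        // Create an empty JKS truststore and add the certificate as a trusted entry.
        KeyStore trustStore = KeyStore.getInstance("JKS");
        trustStore.load(null, null);
        trustStore.setCertificateEntry("spark-server", cert);

        // Write the truststore. The password must match the trustStorePassword
        // value used in the SSL JDBC connection string.
        try (FileOutputStream out = new FileOutputStream("truststore.jks")) {
            trustStore.store(out, "changeit".toCharArray());
        }
    }
}
```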
Build an Apache Spark Connection String
The following is an example of a complete Apache Spark connection string:
jdbc:hive2://localhost:10000/database1
Note
The Spark Thrift server uses the same JDBC driver as HiveServer2.
To build an Apache Spark connection string, use the following format, which has these segments:
jdbc:hive2://<host>:<port>/<dbName>;<sessionConfs>?<hiveConfs>#<hiveVars>
If you configured SSL as described in the previous section, the SSL settings for Spark might look like the following:
jdbc:hive2://ip-172-31-26-172.eu-west-2.compute.internal:10000/default;ssl=true;sslTrustStore=/config/shared/truststores/truststore.jks;trustStorePassword=changeit
The following table describes each segment.

String Segment | Description
---|---
<host> | The Spark server hosting node. Required.
<port> | The port that the Spark server listens on. Required.
<dbName> | The name of the Hive database. Required.
<sessionConfs> | Key-value pairs for the JDBC driver, in the format key1=value1;key2=value2. Optional.
<hiveConfs> | Key-value pairs for Hive, in the format key1=value1;key2=value2. Optional.
<hiveVars> | Key-value pairs for Hive variables, in the format key1=value1;key2=value2. Optional.
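For example, a connection string that sets a session configuration, a Hive configuration property, and a Hive variable might look like the following (the specific keys shown here are illustrative, not required by the platform):
jdbc:hive2://localhost:10000/database1;transportMode=binary?hive.exec.parallel=true#env=dev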
Authenticate to Apache Spark
The Privitar Data Security Platform currently supports username/password authentication for Apache Spark.
Enter the system user's Apache Spark credentials in the Username and Password fields on the platform's Connections page.
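The platform makes this connection for you; purely as an illustration of the underlying mechanism, the following minimal sketch shows standard JDBC username/password authentication against a Spark Thrift server using the Hive JDBC driver (the URL, credentials, and query are placeholder values):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class SparkThriftAuthExample {
    public static void main(String[] args) throws Exception {
        // Connection string in the format described above (placeholder host,
        // port, and database).
        String url = "jdbc:hive2://localhost:10000/database1";

        // Username/password authentication with the system user's credentials,
        // matching what is entered on the platform's Connections page.
        try (Connection conn = DriverManager.getConnection(url, "system_user", "password");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SHOW TABLES")) {
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
        }
    }
}
```

The Hive JDBC driver (the same driver used for HiveServer2) must be on the classpath for DriverManager to locate it.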