Administering in a Hadoop Cluster

This section contains important details about administering the Privitar platform in a Hadoop cluster.

Supported Character Encodings

Privitar requires all input data to be encoded as UTF-8. This is a consequence of the current level of character-encoding support in Spark.
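If the encoding of an input file is in doubt, it can be checked and converted before ingestion. The following is an illustrative sketch using standard Unix tools; the file name and source encoding are placeholders:

# Report the detected character encoding of the input file (name is a placeholder).
file -i input.csv

# Convert from a suspected legacy encoding (ISO-8859-1 here, as an example) to UTF-8.
iconv -f ISO-8859-1 -t UTF-8 input.csv > input-utf8.csv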

Sensitive Information Leakage After YARN Abnormal Termination

While Privitar Spark jobs run in the cluster, they write temporary files to the directories set by the YARN yarn.nodemanager.local-dirs parameter. Because of the nature of the data that Privitar processes, these files may include sensitive information.

If YARN terminates abnormally, these files may remain in YARN's working directories and will not be automatically cleaned up.

In this situation, the files must be removed manually. To do this, delete the cache files for any Privitar jobs that failed due to the YARN termination. The cache files are located under the following directory:

${yarn.nodemanager.local-dirs}/usercache/${username}/appcache/${applicationId} 

In this path, ${username} is the user that ran the job and ${applicationId} is the ID of the terminated application.
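As an illustration, the following sketch removes the leftover cache for one failed application on a single NodeManager host. The local directory, user name, and application ID shown are placeholders; substitute the values from your cluster configuration and the failed job. Note that yarn.nodemanager.local-dirs may list several directories, each of which must be checked.

# Placeholder values; substitute your own.
LOCAL_DIR=/data/yarn/nm-local-dir
APP_USER=privitar
APP_ID=application_1700000000000_0042

# Remove the leftover application cache for the failed job.
rm -rf "${LOCAL_DIR}/usercache/${APP_USER}/appcache/${APP_ID}"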

Configuring HDFS-level compression

HDFS-level compression is transparent to Privitar and requires only cluster-level configuration.

To enable HDFS-level compression, see the Cloudera HDFS Administration Guide.
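Before enabling a codec cluster-wide, it can be useful to confirm that the corresponding native libraries are available on the cluster nodes. A quick check, using a standard Hadoop command:

# Lists the native libraries (zlib, snappy, etc.) available to Hadoop on this node.
hadoop checknative -a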

Compression inside Avro files

To enable a compression codec inside Avro files, modify the spark-overrides.properties file in:

<processor-root>/config 

as follows:

spark.sql.avro.compression.codec = uncompressed|snappy|deflate
# Compression level for the deflate codec; applies only when deflate is selected.
spark.sql.avro.deflate.level = 5
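After changing the property and rerunning a job, the chosen codec can be verified on an output file. A sketch using Avro's command-line tools (the jar location and file path are placeholders):

# Prints the file's metadata; the avro.codec entry shows the compression codec in use.
java -jar avro-tools.jar getmeta /path/to/output/part-00000.avro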

Internal Spark compression

Spark can use compression internally (for RDD partitions, broadcast variables and shuffle output). This is transparent to Privitar and requires only cluster-level configuration.
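As an illustrative sketch, the relevant cluster-level settings (typically placed in spark-defaults.conf) look like the following. The property names are standard Spark configuration keys; the values shown are examples, not recommendations:

# Codec used for internal compression (e.g. lz4, snappy, zstd).
spark.io.compression.codec = lz4
# Whether to compress serialized RDD partitions, broadcast variables and shuffle output.
spark.rdd.compress = true
spark.broadcast.compress = true
spark.shuffle.compress = true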

For more information about configuring compression in Spark, see the Apache Spark Configuration Guide.