Administering in a Hadoop Cluster
This section contains important details about administering the platform in a Hadoop cluster.
Supported Character Encodings
The platform requires all input data to be encoded in UTF-8. This requirement reflects the current level of character-encoding support in Spark.
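If source files are stored in another encoding, they can be converted before they are loaded into the platform. The following is a minimal sketch using the standard iconv utility; the source encoding (ISO-8859-1) and file names are examples only:
# Convert a Latin-1 file to UTF-8 before ingestion (encoding and file names are illustrative).
iconv -f ISO-8859-1 -t UTF-8 input.csv > input-utf8.csv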
Sensitive Information Leakage After YARN Abnormal Termination
While Spark jobs for the platform are running in the cluster, they write temporary files to the directories specified by the YARN yarn.nodemanager.local-dirs
parameter. Due to the nature of the platform, these files may include sensitive information.
If YARN terminates abnormally, these files may remain in YARN's working directories and are not cleaned up automatically.
In this situation, the files must be removed manually. To do so, delete the cache files for any platform jobs that failed because of the YARN termination. The cache files are located under the following directory:
${yarn.nodemanager.local-dirs}/usercache/${username}/appcache/${applicationId}
In this path, ${username} is the user that ran the job, and ${applicationId} is the YARN application ID of the failed job.
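For example, assuming a single local directory of /yarn/nm and a failed job run by user jdoe, the cache files could be removed as follows; the local directory, user name, and application ID shown are illustrative only:
# Example only: the local-dirs path, user, and application ID below are placeholders.
rm -r /yarn/nm/usercache/jdoe/appcache/application_1577000000000_0042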
Configuring HDFS-level compression
HDFS-level compression is transparent to the platform and requires only cluster-level configuration.
To enable HDFS-level compression, see the Cloudera HDFS Administration Guide.
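As an illustration of what cluster-level configuration typically involves (the Cloudera guide remains the authoritative procedure), compression codecs are registered in core-site.xml on the cluster, for example:
<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.SnappyCodec</value>
</property>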
Compression inside Avro files
To enable a compression codec inside Avro files, modify the spark-overrides.properties
file in:
<processor-root>/config
as follows:
spark.sql.avro.compression.codec = uncompressed|snappy|deflate
spark.sql.avro.deflate.level = 5   # if using deflate
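For example, to enable Snappy compression inside the generated Avro files, the override would look like the following (the choice of Snappy here is illustrative):
spark.sql.avro.compression.codec = snappy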
Internal Spark compression
Spark can use compression internally (for RDD partitions, broadcast variables and shuffle output). This is transparent to the platform and requires only cluster-level configuration.
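As an illustration, these are the standard Spark properties that govern internal compression; the values shown are the defaults in recent Spark versions and would normally be set at the cluster level rather than in the platform's configuration:
spark.rdd.compress = false
spark.broadcast.compress = true
spark.shuffle.compress = true
spark.io.compression.codec = lz4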
For more information about configuring compression in Spark, see the Apache Spark Configuration Guide.