Administering in a Hadoop Cluster
This section contains important details about administering the platform in a Hadoop cluster.
Supported Character Encodings
The platform requires all input data to be encoded in UTF-8. This requirement reflects the current level of character-encoding support in Spark.
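If source files are stored in another encoding, they can be converted before they are loaded into the platform. The following is a minimal sketch using the standard iconv utility; the source encoding (ISO-8859-1) and file names are examples only:
# Convert a Latin-1 file to UTF-8 before ingestion (encoding and file names are illustrative).
iconv -f ISO-8859-1 -t UTF-8 input.csv > input-utf8.csv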
Sensitive Information Leakage After YARN Abnormal Termination
While Spark jobs for the platform are running in the cluster, they write temporary files to the directories specified by the YARN yarn.nodemanager.local-dirs
parameter. Due to the nature of the platform, these files may include sensitive information.
If YARN terminates abnormally, these files may remain in YARN's working directories and are not cleaned up automatically.
In this situation, the files must be removed manually. To do so, delete the cache files for any platform jobs that failed because of the YARN termination. The cache files are located under the following directory:
${yarn.nodemanager.local-dirs}/usercache/${username}/appcache/${applicationId}
In this path, ${username} is the user that ran the job, and ${applicationId} is the YARN application ID of the failed job.
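For example, assuming a single local directory of /yarn/nm and a failed job run by user jdoe, the cache files could be removed as follows; the local directory, user name, and application ID shown are illustrative only:
# Example only: the local-dirs path, user, and application ID below are placeholders.
rm -r /yarn/nm/usercache/jdoe/appcache/application_1577000000000_0042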
Configuring HDFS-level compression
HDFS-level compression is transparent to the platform and requires only cluster-level configuration.
To enable HDFS-level compression, see the Cloudera HDFS Administration Guide.
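As an illustration of what cluster-level configuration typically involves (the Cloudera guide remains the authoritative procedure), compression codecs are registered in core-site.xml on the cluster, for example:
<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.SnappyCodec</value>
</property>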
Compression inside Avro files
To enable a compression codec inside Avro files, modify the spark-overrides.properties
file in:
<processor-root>/config
as follows:
spark.sql.avro.compression.codec = uncompressed|snappy|deflate
spark.sql.avro.deflate.level = 5   # if using deflate
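For example, to enable Snappy compression inside the generated Avro files, the override would look like the following (the choice of Snappy here is illustrative):
spark.sql.avro.compression.codec = snappy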
Internal Spark compression
Spark can use compression internally (for RDD partitions, broadcast variables and shuffle output). This is transparent to the platform and requires only cluster-level configuration.
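As an illustration, these are the standard Spark properties that govern internal compression; the values shown are the defaults in recent Spark versions and would normally be set at the cluster level rather than in the platform's configuration:
spark.rdd.compress = false
spark.broadcast.compress = true
spark.shuffle.compress = true
spark.io.compression.codec = lz4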
For more information about configuring compression in Spark, see the Apache Spark Configuration Guide.