User Guide

HDFS Batch Jobs

HDFS Batch Jobs are created to process data stored as Avro, Parquet, or CSV files in HDFS. They can also process data stored on equivalent filesystems in the cloud, such as the following (example input paths are shown after the list):

  • Amazon S3 buckets using EMRFS

  • Google Cloud Storage (GCS)

  • Microsoft Azure Blob storage
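
These storage locations are typically referenced with Hadoop-compatible URIs. The paths below are a hedged illustration only; the bucket, container, and host names are hypothetical, and the exact URI scheme depends on your cluster and connector configuration.

    # Hypothetical input locations; names are examples only.
    input_paths = [
        "hdfs://namenode:8020/data/customers/",   # native HDFS
        "s3://example-bucket/data/customers/",    # Amazon S3 via EMRFS
        "gs://example-bucket/data/customers/",    # Google Cloud Storage
        "wasbs://container@account.blob.core.windows.net/data/customers/",  # Azure Blob storage
    ]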

Bad Record Handling

Batch Jobs can be configured to detect errors during execution and to stop if some or all records cannot be processed, whether because of bad data or misconfiguration.
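
As a conceptual sketch only (the function name and threshold below are hypothetical and not Privitar configuration), this behaviour can be thought of as a tolerance on the proportion of records that fail to process:

    def check_failure_tolerance(processed, failed, max_failed_fraction=0.0):
        # Stop once the proportion of unprocessable records exceeds the
        # configured tolerance (0.0 means stop on the first bad record).
        total = processed + failed
        if total and failed / total > max_failed_fraction:
            raise RuntimeError(
                f"Stopping job: {failed} of {total} records could not be processed"
            )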

Partitioned Data

Data is often split into manageable chunks according to a partitioning scheme based on a variable such as ingest date or country of origin. Where this data is organised hierarchically in HDFS, Privitar can respect the partitioned layout and process the data accordingly.
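
For example, a hierarchically partitioned dataset might use one directory level per partition key. The layout below is illustrative only; the actual directory and key names depend on how the data was written.

    # Illustrative Hive-style partitioned layout.
    partitioned_files = [
        "/data/transactions/ingest_date=2021-06-01/country=GB/part-00000.parquet",
        "/data/transactions/ingest_date=2021-06-01/country=FR/part-00000.parquet",
        "/data/transactions/ingest_date=2021-06-02/country=GB/part-00000.parquet",
    ]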

Empty and Hidden Files Handling

Empty files and hidden files (including files in hidden directories) are skipped during processing. If all files and directories in the input path are empty or hidden, the Job will fail.
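
The sketch below illustrates one common interpretation of these rules, assuming the usual Hadoop convention that hidden names begin with "." or "_"; it uses local filesystem calls purely for illustration.

    import os

    def is_hidden(path):
        # Assumption: a path is hidden if any component starts with "." or "_"
        # (the common Hadoop convention, e.g. _SUCCESS marker files).
        return any(part.startswith((".", "_")) for part in path.split(os.sep) if part)

    def should_skip(path):
        # Skip zero-byte files and anything under a hidden name.
        return os.path.getsize(path) == 0 or is_hidden(path)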