HDFS Batch Jobs
HDFS Batch Jobs are created to process data stored as Avro, Parquet, or CSV files in HDFS. They can also process data stored on equivalent filesystems in the cloud (see the input path sketch after this list), such as:
Amazon S3 buckets using EMRFS
Google Cloud Storage (GCS)
Microsoft Azure Blob storage
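The same job logic can point at any Hadoop-compatible filesystem, because the scheme in the input path URI selects the underlying storage connector. The sketch below uses the generic Hadoop FileSystem API (not anything Privitar-specific) to show how those URI schemes typically look; the hostnames, bucket and container names are placeholders, and the relevant connector (EMRFS on EMR, the GCS connector, or the Azure storage driver) is assumed to be available on the cluster.

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class InputPathSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Hadoop-compatible input paths; hostnames, buckets and containers are placeholders.
        String[] inputPaths = {
            "hdfs://namenode:8020/data/customers",                            // native HDFS
            "s3://example-bucket/data/customers",                             // Amazon S3 via EMRFS (on EMR)
            "gs://example-bucket/data/customers",                             // Google Cloud Storage connector
            "wasbs://container@account.blob.core.windows.net/data/customers"  // Azure Blob storage
        };

        for (String p : inputPaths) {
            // The URI scheme selects the FileSystem implementation at runtime,
            // so the same processing code works against HDFS or cloud object stores.
            FileSystem fs = FileSystem.get(URI.create(p), conf);
            System.out.println(new Path(p) + " -> " + fs.getClass().getSimpleName());
        }
    }
}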
Bad Record Handling
Batch Jobs can be configured to detect errors during execution and to stop if some or all records cannot be processed, whether because of bad data or misconfiguration.
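The exact behaviour is configured on the Job. As a generic illustration of the same idea, the sketch below uses Apache Spark's CSV read modes, where FAILFAST aborts on the first malformed record while PERMISSIVE and DROPMALFORMED tolerate or discard bad records; the path and application name are placeholders, and this is not Privitar's configuration mechanism.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class BadRecordModeSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("bad-record-handling-sketch")
                .getOrCreate();

        // "FAILFAST"      : abort the whole job on the first malformed record.
        // "PERMISSIVE"    : keep going; unparsable fields become null (Spark's default).
        // "DROPMALFORMED" : silently discard malformed records.
        Dataset<Row> input = spark.read()
                .option("header", "true")
                .option("mode", "FAILFAST")
                .csv("hdfs://namenode:8020/data/customers");   // placeholder path

        input.show(10);
        spark.stop();
    }
}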
Partitioned Data
Data is often split into manageable chunks using a partitioning scheme based on a variable such as ingest date or country of origin. Where this data is organised hierarchically in HDFS, Privitar can respect the partitioned layout and process the data appropriately.
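As a rough illustration, the sketch below assumes a Hive-style layout in which each level of the directory hierarchy encodes one partitioning variable, and uses Apache Spark's partition discovery (via the basePath option) to read a single partition while keeping the partition columns in the schema. The paths, column names, and namenode address are placeholders, and the snippet is not Privitar's own processing logic.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class PartitionedReadSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("partitioned-read-sketch")
                .getOrCreate();

        // Hypothetical Hive-style layout, partitioned by ingest date and country:
        //   /data/transactions/ingest_date=2024-05-01/country=GB/part-00000.parquet
        //   /data/transactions/ingest_date=2024-05-01/country=FR/part-00000.parquet
        //   /data/transactions/ingest_date=2024-05-02/country=GB/part-00000.parquet

        // Read one partition while keeping the partition columns
        // (ingest_date, country) in the schema via the basePath option.
        Dataset<Row> gbForOneDay = spark.read()
                .option("basePath", "hdfs://namenode:8020/data/transactions")
                .parquet("hdfs://namenode:8020/data/transactions/ingest_date=2024-05-01/country=GB");

        gbForOneDay.printSchema();
        spark.stop();
    }
}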
Empty and Hidden Files Handling
Empty files and hidden files (including files inside hidden directories) are skipped during processing. If all of the files and directories in the input path are empty or hidden, the Job will fail.
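A minimal sketch of the same filtering rule, assuming the usual Hadoop convention that names beginning with '.' or '_' are hidden: it lists an input directory with the generic Hadoop FileSystem API, skips empty and hidden entries, and fails when nothing processable remains. The path is a placeholder and the code is an illustration, not Privitar's implementation.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListProcessableFiles {
    // Common Hadoop "hidden" convention: names starting with '.' or '_'.
    static boolean isHidden(Path p) {
        String name = p.getName();
        return name.startsWith(".") || name.startsWith("_");
    }

    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path input = new Path("/data/customers");   // placeholder input path
        int processable = 0;

        // Single-level, non-recursive listing to keep the sketch short.
        for (FileStatus status : fs.listStatus(input)) {
            // Skip hidden entries and zero-length files.
            if (isHidden(status.getPath()) || (status.isFile() && status.getLen() == 0)) {
                continue;
            }
            processable++;
            System.out.println("Would process: " + status.getPath());
        }

        if (processable == 0) {
            // Mirrors the documented behaviour: nothing to process means the Job fails.
            throw new IllegalStateException("No non-empty, non-hidden files found under " + input);
        }
    }
}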