What are Batch Jobs?
Batch Jobs apply a Policy in bulk to datasets located on a cluster. Each Batch Job references a Policy and the location of the input data to be processed. Batch Jobs are defined and executed from Privitar or via the Policy Manager Automation APIs.
The progress of a Batch Job can be monitored and any error details retrieved. Batch Jobs also provide a way to re-run specific processing when new data is received or a Policy is changed.
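As a rough sketch of driving Batch Jobs programmatically, the following Python snippet shows how a Job might be submitted and monitored over HTTP using the requests library. The host, endpoint paths, payload fields and identifiers are hypothetical placeholders, not the documented Policy Manager Automation API; refer to the API reference for the actual calls.

    # Hypothetical sketch only: the endpoints and fields below are placeholders,
    # not the documented Policy Manager Automation API.
    import requests

    BASE_URL = "https://policy-manager.example.com/api"  # placeholder host
    HEADERS = {"Authorization": "Bearer <api-token>"}     # placeholder auth

    # Submit a run of an existing Batch Job definition against new input data.
    run = requests.post(
        f"{BASE_URL}/batch-jobs/my-batch-job/runs",       # hypothetical endpoint
        json={"inputPath": "/data/raw/2024-06-01"},       # hypothetical payload
        headers=HEADERS,
    ).json()

    # Poll the run for progress and any error details.
    status = requests.get(
        f"{BASE_URL}/batch-jobs/my-batch-job/runs/{run['id']}",
        headers=HEADERS,
    ).json()
    print(status["state"], status.get("errorDetails"))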
The Schema used by the Policy contains a reference to the data item that is to be read, and the Batch Job points to its actual location. Batch Jobs are started from Privitar, but they are physically executed on the compute infrastructure (either Hadoop or AWS Glue) configured in the Privitar Environment.
When the Batch Job is run, the Policy is applied to the data and the resulting output data is published to a specific Protected Data Domain (PDD), in the PDD's output directory or Hive database.
After a Batch Job has run, its Job definition remains as a record in Privitar, so the Batch Job can be re-run on new data at any time. This is useful when the source data has changed, new data has been added, or the Policy has been updated and the privacy processing performed by the Batch Job needs to be repeated.
Running Batch Jobs against the same PDD guarantees data consistency for each rule used in the Policy, provided the Preserve data consistency option is enabled.
There are three types of Batch Job that can be run from Privitar:
HDFS Batch Jobs process data from an HDFS location and write out the processed data into a PDD that is also stored in an HDFS location.
Hive Batch Jobs process data stored in a Hive table or Hive view and write out the processed data into a PDD that is stored in the Hive database as a Hive table.
AWS Glue Batch Jobs process data from an S3 location and write out the processed data into a PDD that is also stored in an S3 location.
HDFS Batch Jobs
HDFS Batch Jobs are created to process data stored as Avro, Parquet, or CSV files in HDFS. HDFS Batch Jobs can also process data stored on equivalent filesystems in the Cloud, such as:
Amazon S3 buckets using EMRFS
Google Cloud Storage (GCS)
Microsoft Azure Blob storage
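For illustration, input locations on these filesystems are typically addressed with URIs of the following forms (the bucket, container and path names are placeholders, and the exact schemes available depend on how the cluster is configured):

    hdfs:///data/input/                                           HDFS
    s3://my-bucket/data/input/                                    Amazon S3 via EMRFS
    gs://my-bucket/data/input/                                    Google Cloud Storage (GCS)
    wasbs://my-container@myaccount.blob.core.windows.net/input/   Microsoft Azure Blob storage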
Bad Record Handling
Batch Jobs can be configured to detect errors during execution and to stop if some or all records cannot be processed, whether due to bad data or misconfiguration.
Partitioned Data
Data is often split into manageable chunks partitioned on a variable such as ingest date or country of origin. Where this data is organised hierarchically in HDFS, Privitar can respect the partitioned layout and process the data appropriately.
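For example, data partitioned by ingest date might be laid out as follows (the paths and file names are illustrative):

    /data/transactions/ingest_date=2024-06-01/part-00000.parquet
    /data/transactions/ingest_date=2024-06-02/part-00000.parquet
    /data/transactions/ingest_date=2024-06-03/part-00000.parquet

Privitar can respect a layout like this rather than treating the input as a single undifferentiated directory of files.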
Empty and Hidden Files Handling
Empty files and hidden files (files located in hidden directories) are skipped during processing. If all files and directories in the input path are empty or hidden, the Job will fail.
Hive Batch Jobs
Hive Batch Jobs are created to process data in the cluster that can be accessed via Hive.
Read and write data via Hive
Data is accessed from the cluster using Hive and written directly to a Hive database. Privitar creates safe output data that is ready for immediate use by data consumers, consistent with their existing data access patterns.
Efficient processing of partitioned data
Partitioned data is naturally represented as database columns in Hive, so handling of such data requires no special consideration when creating the Privitar Schema or Policy.
Inferring Schemas from Hive
When inferring a Schema from Hive, Privitar automatically discovers all the tables and columns present at the specified location and creates default Schema names that match the names of the Hive tables and columns.
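For example (the table and column names here are illustrative), inferring a Schema from a Hive database containing a table named transactions with columns id, amount and transaction_date, partitioned by country, would produce a Schema table called transactions whose columns match those names, with the partition column country appearing like any other column.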
Support for querying Hive views
Privitar can query Hive views as readily as Hive tables, allowing for more flexible application of privacy protection to subsets of data that are defined as Hive views.
Support for filtering of data
Privitar supports row-level filtering of Hive data. For each column (both data and partition columns), a filter can be set to enable specific subsets of the data to be processed. The available filters are SQL-like comparison operators such as Greater than and Less than. For example, a filter could be applied to a column containing date information to include only rows of data from a specific date.
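As an illustrative case (the column name is hypothetical), applying a Greater than filter with the value 2023-01-01 to a transaction_date column restricts the Job to rows dated after 1 January 2023, much as a SQL WHERE clause would.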
AWS Glue Batch Jobs
Batch Jobs in an AWS Glue Environment run using the AWS Glue service.
AWS Glue Batch Jobs can read data from and write data to AWS S3 buckets.
Serverless Data Processing
The AWS Glue ETL service is used as a serverless data processing environment to run AWS Glue Batch Jobs. This means that Privitar submits data processing jobs to the AWS Glue service, which will run them on ephemeral infrastructure managed by AWS.
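For background, the following sketch shows how a job run on the AWS Glue service is typically started and monitored using the AWS SDK for Python (boto3). It illustrates the underlying Glue behaviour only; Privitar performs the submission itself, and the job name shown is a placeholder.

    # Illustration of the AWS Glue service that executes AWS Glue Batch Jobs.
    # Privitar submits and manages these runs itself; this is not Privitar's API.
    import boto3

    glue = boto3.client("glue")

    # Start a run of an existing Glue job (the job name is a placeholder).
    run = glue.start_job_run(JobName="example-privitar-batch-job")

    # Glue provisions ephemeral workers for the run and tears them down afterwards;
    # the run can be polled until it reaches a terminal state.
    status = glue.get_job_run(JobName="example-privitar-batch-job",
                              RunId=run["JobRunId"])
    print(status["JobRun"]["JobRunState"])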