User Guide

What are Batch Jobs?

Batch Jobs apply a Policy in bulk to datasets located on a cluster. A Batch Job contains a reference to a Policy and the location of specific input data. Batch Jobs are defined and executed from the platform or via the Policy Manager Automation APIs.

You can monitor the progress of a Batch Job and obtain details of any errors it encounters. Batch Jobs also provide a way to re-run specific processing when new data is received or a Policy changes.
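For illustration only, the Python sketch below shows how a Batch Job might be defined, submitted, and monitored through a REST-style automation API. The base URL, endpoint paths, field names (policyId, pddId, inputPath) and status values are assumptions made for the sketch, not the actual Policy Manager Automation API contract; consult the API reference for the real calls.

    import time

    import requests

    # Hypothetical base URL and token; substitute values for your platform
    # environment. The endpoint paths and field names below are illustrative
    # only, not the actual Policy Manager Automation API contract.
    BASE_URL = "https://policy-manager.example.com/api"
    HEADERS = {"Authorization": "Bearer <api-token>"}

    # Define a Batch Job: a reference to an existing Policy plus the location
    # of the input data and the target PDD.
    job_definition = {
        "name": "customers-daily-run",
        "policyId": "pol-123",
        "pddId": "pdd-456",
        "inputPath": "hdfs:///data/raw/customers/2024-05-01",
    }

    # Submit the job for execution on the configured compute infrastructure
    # (Hadoop or AWS Glue).
    resp = requests.post(f"{BASE_URL}/batch-jobs", json=job_definition, headers=HEADERS)
    resp.raise_for_status()
    job_id = resp.json()["id"]

    # Poll until the job finishes, surfacing any error details it reports.
    while True:
        status = requests.get(f"{BASE_URL}/batch-jobs/{job_id}", headers=HEADERS).json()
        if status["state"] in ("COMPLETED", "FAILED"):
            print(status["state"], status.get("errorDetails", ""))
            break
        time.sleep(30)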

The Schema used by the Policy contains a reference to the data item that is to be read, and the Batch Job points to its actual location. Batch Jobs are started from the platform, but they are physically executed on the compute infrastructure (either Hadoop or AWS Glue) configured in the platform environment.

When the Batch Job is run, the Policy is applied to the data and the resulting output data is published to a specific Protected Data Domain (PDD), in the PDD's output directory or Hive database.

After a Batch Job has run, its Job definition remains as a record in the platform, so the Batch Job can be re-run on new data at any time. This is useful when the source data has changed, new data has been added, or the Policy has been updated and the privacy processing performed by the Batch Job needs to be repeated.

When the Preserve data consistency option is used, running Batch Jobs against the same PDD guarantees data consistency for each rule used in the Policy.
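The sketch below is a conceptual illustration of what "consistent" means here, using a deterministic keyed tokenization as a stand-in; it is not the platform's actual mechanism, and the key handling shown is purely hypothetical. The point is that the same source value always maps to the same output value, so the results of repeated Batch Jobs against the same PDD remain joinable.

    import hashlib
    import hmac

    # Hypothetical per-PDD key; a deterministic keyed mapping yields the same
    # token for the same input value every time it is applied.
    PDD_SECRET = b"per-pdd-secret"

    def tokenize(value: str) -> str:
        # Same input + same key -> same token, regardless of which run
        # produced it.
        return hmac.new(PDD_SECRET, value.encode(), hashlib.sha256).hexdigest()[:12]

    run_1 = tokenize("alice@example.com")   # first Batch Job
    run_2 = tokenize("alice@example.com")   # later re-run against the same PDD
    assert run_1 == run_2                   # consistent output across runs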

There are three types of Batch Job that can be run from the platform (a comparison sketch follows the list):

  • HDFS Batch Jobs process data from an HDFS location and write out the processed data into a PDD that is also stored in an HDFS location.

  • Hive Batch Jobs process data stored in a Hive table or Hive view and write out the processed data into a PDD that is stored in the Hive database as a Hive table.

  • AWS Glue Batch Jobs process data from an Amazon Simple Storage Service (Amazon S3) location and write out the processed data into a PDD that is also stored in an Amazon S3 location.
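As a rough comparison of the three types, the sketch below shows how their job definitions might differ. The field names are hypothetical, not the platform's actual schema; the point is that only the input location and the PDD's storage change between types, while the Policy reference stays the same.

    # Illustrative job definitions only: field names and values are
    # assumptions, not the platform's actual schema.

    hdfs_job = {
        "type": "HDFS",
        "policyId": "pol-123",
        "pddId": "pdd-456",
        "input": {"path": "hdfs:///data/raw/customers"},
        # Output is written to the PDD's output directory in HDFS.
    }

    hive_job = {
        "type": "HIVE",
        "policyId": "pol-123",
        "pddId": "pdd-456",
        "input": {"database": "raw", "table": "customers"},
        # Output is written as a Hive table in the PDD's Hive database.
    }

    glue_job = {
        "type": "AWS_GLUE",
        "policyId": "pol-123",
        "pddId": "pdd-789",
        "input": {"path": "s3://raw-bucket/customers/"},
        # Output is written to the PDD's Amazon S3 output location.
    }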