User Guide

Overview of AWS Glue Batch Jobs

Running an AWS Glue Batch Job involves three stages: specifying a Policy and the location of the data to be processed, applying the Policy to that data, and publishing the output to a Protected Data Domain (PDD).

Note

AWS Glue Batch Jobs can only be defined for Policies that are defined on single-table Schemas.

To create and run an AWS Glue Batch Job, follow these steps:

  1. Check that the AWS Glue Environment has been set up correctly. (See Checking AWS Glue Environment setup.)

  2. Import the Schema. (See Creating a Schema from an AWS Glue Data Catalog.)

  3. Create the Policy. (See Creating a Policy.)

  4. (Optional) If you are not using an existing PDD, create a new one. (See Creating a PDD.)

  5. Create the Batch Job. (See Creating a Batch Job.)

  6. Choose a name for the Batch Job and specify the Policy to use. (See Choosing a Name and Policy for a Job.)

  7. Specify the input Data Locations. (See Specifying Input Data Locations (HDFS and Cloud Storage).)

  8. Review the Advanced settings and change them if required. (See Advanced settings (AWS Glue Batch Jobs).)

  9. Run the Batch Job and select the location for the PDD. (See Running a Batch Job.)

  10. Review the output of the Job. (See Batch Job Output Visualisations (HDFS and Hive).)

Checking AWS Glue Environment setup

The AWS Glue Environment must be set up correctly in Privitar before an AWS Glue Batch Job can be run. In particular, you must have deployed the Privitar Platform in an AWS environment, with the correct permissions to access the data and submit Jobs to the AWS Glue service.

If you are not sure about the AWS Glue setup, contact your local system administrator before running an AWS Glue Batch Job. For more information, refer to AWS Glue Environment Configuration.
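
As a rough first check before contacting an administrator, you can confirm that a set of AWS credentials is able to read the Glue Data Catalog. The sketch below is not part of Privitar and does not verify the Privitar configuration itself. It only assumes the standard boto3 Glue client and its `get_tables` call. The helper name `list_glue_tables` and the database name `sales_db` are hypothetical, chosen for illustration.

```python
from typing import Any, List


def list_glue_tables(glue_client: Any, database_name: str) -> List[str]:
    """Return the names of all tables in one Glue Data Catalog database.

    Any error raised by the client (for example an access-denied error)
    propagates to the caller and suggests that the credentials in use
    lack the Glue permissions a Batch Job would need.
    """
    names: List[str] = []
    kwargs = {"DatabaseName": database_name}
    while True:
        # get_tables returns results in pages; follow NextToken until exhausted.
        page = glue_client.get_tables(**kwargs)
        names.extend(table["Name"] for table in page.get("TableList", []))
        token = page.get("NextToken")
        if not token:
            return names
        kwargs["NextToken"] = token


# Example usage (requires boto3 and valid AWS credentials; "sales_db" is a
# hypothetical database name, replace it with your own):
#
#   import boto3
#   glue = boto3.client("glue")
#   print(list_glue_tables(glue, "sales_db"))
```

If the call succeeds and lists the tables you expect, the credentials can at least read the catalog. Whether the Privitar Platform itself can also access the data and submit Jobs must still be confirmed against its own configuration.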