Overview of HDFS Batch Jobs
Running an HDFS Batch Job involves specifying a Policy together with the location of the data to be processed, applying the Policy to that data, and then publishing the output to a Protected Data Domain (PDD).
To create and run an HDFS Batch Job, follow these steps:
1. Check that the HDFS Environment has been set up correctly. (See Checking HDFS Environment setup, below.)
2. Import the Schema from HDFS. (See Specifying Input Data Locations (HDFS and Cloud Storage).)
3. Create the Policy. (See Creating a Policy.)
4. (Optional) If you are not using an existing PDD, create a new PDD. (See Creating a PDD.)
5. Create the Batch Job. (See Creating a Batch Job.)
6. Choose a name for the Batch Job and specify the Policy to use. (See Choosing a Name and Policy for a Job.)
7. Specify the input Data Locations. (See Specifying Input Data Locations (HDFS and Cloud Storage).)
8. Check any Advanced settings. (See Advanced settings (HDFS Batch Jobs).)
9. Run the Batch Job and select the location for the PDD. (See Running a Batch Job.)
10. Review the output of the Job. (See Batch Job Output Visualisations (HDFS and Hive).)
Checking HDFS Environment setup
The HDFS Environment must be set up correctly in Privitar before an HDFS Batch Job can be run. In particular, you must have read access to the HDFS location you are reading data from, and write access to the location where the processed data will be written.
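As a quick sanity check of these permissions from the command line, a minimal sketch along the following lines can be used. It assumes the hdfs CLI is available on your PATH, and the input and output paths shown are hypothetical placeholders; substitute the locations your Batch Job will actually use.

```python
import subprocess
import uuid

# Hypothetical example paths; replace with your Job's actual data locations.
INPUT_PATH = "/data/input/customers"
OUTPUT_PATH = "/data/output/pdd"

def run(cmd):
    """Run an hdfs CLI command and return True if it exits successfully."""
    return subprocess.run(cmd, capture_output=True).returncode == 0

# Read access: listing the input location fails if we cannot read it.
can_read = run(["hdfs", "dfs", "-ls", INPUT_PATH])

# Write access: try to create (and then remove) a temporary marker file.
marker = f"{OUTPUT_PATH}/.write_check_{uuid.uuid4().hex}"
can_write = run(["hdfs", "dfs", "-touchz", marker])
if can_write:
    run(["hdfs", "dfs", "-rm", marker])

print(f"Read access to {INPUT_PATH}:  {'OK' if can_read else 'MISSING'}")
print(f"Write access to {OUTPUT_PATH}: {'OK' if can_write else 'MISSING'}")
```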
If you are not sure about the HDFS setup, contact your local system administrator before running an HDFS Batch Job. For more information, refer to Hadoop Cluster Environment Configuration.