User Guide

Overview of Hive Batch Jobs

Running a Hive Batch Job involves specifying a Policy together with the location of the data to be processed, applying the Policy to that data and then publishing the output to a Protected Data Domain (PDD).

To create and run a Hive Batch Job, follow these steps:

  1. Check that the Hive Environment has been set up correctly. (See Checking Hive Environment Setup, below.)

  2. Import the Schema from Hive. (See Creating a Schema from a Hive database.)

  3. Create the Policy. (See Creating a Policy.)

  4. (Optional.) If you are not using an existing PDD, create a new one. (See Creating a PDD.)

  5. Create the Batch Job. (See Creating a Batch Job.)

  6. Choose a name for the Batch Job and specify the Policy to use. (See Choosing a Name and Policy for a Job.)

  7. Specify the input Data Locations. (See Specifying Input Data Locations (Hive).)

  8. Check the Advanced settings. (See Advanced Settings (Hive Batch Jobs).)

  9. Run the Batch Job and select the location for the PDD. (See Running a Batch Job.)

  10. Review the output of the Job. (See Batch Job Output Visualisations (HDFS and Hive).)

Checking Hive Environment Setup

The Hive Environment must be set up in Privitar before a Hive Batch Job can be run. This will have been set up by your system administrator. (For more information, see Hadoop Cluster Environment Configuration.)

In particular, the Job output root location setting determines how the Hive PDD location is specified when you run a Batch Job. The PDD location in the Hive Environment can be defined in two parts:

  • Hive Database: the name of the Hive database in which the tables will be created.

  • HDFS location: the path in HDFS (or supported Cloud location) where the data for the Hive tables will be stored.

The Job output root location setting can be used to define a single location for all Hive PDDs. If this location has been defined, then the HDFS location path for the Hive PDD will not need to be specified when running a Hive Batch Job.
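
The rule described above can be pictured with a short Python sketch. This is illustrative only and is not part of Privitar: the names `HivePddLocation`, `resolve_pdd_location`, `hdfs_path` and `job_output_root` are hypothetical, and the sketch assumes the root location is used as-is when it has been defined. How the product actually derives the final storage path from the root is not covered here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HivePddLocation:
    """The two parts of a Hive PDD location (hypothetical model)."""
    database: str   # Hive database in which the tables are created
    hdfs_path: str  # HDFS (or supported Cloud) path holding the table data


def resolve_pdd_location(
    database: str,
    hdfs_path: Optional[str] = None,
    job_output_root: Optional[str] = None,
) -> HivePddLocation:
    """Decide which HDFS location applies when a Hive Batch Job is run.

    Assumption: if the Environment defines a Job output root location,
    it applies to all Hive PDDs and no path has to be supplied.
    """
    if job_output_root:
        # Root is defined in the Environment: the user supplies only the database.
        return HivePddLocation(database, job_output_root)
    if not hdfs_path:
        # No root defined: the HDFS location must be given explicitly.
        raise ValueError("HDFS location is required when no Job output root location is set")
    return HivePddLocation(database, hdfs_path)
```

For example, `resolve_pdd_location("sales_pdd", job_output_root="/data/pdd")` succeeds without an explicit path, while `resolve_pdd_location("sales_pdd")` raises an error because neither a root nor a path is available. The database name and paths here are made-up values.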

If you are not sure about the Hive setup, contact your local system administrator before running a Hive Batch Job.