User Guide

Hive Batch Jobs

Hive Batch Jobs process data in the cluster that is accessible via Hive.

Read and write data via Hive

Data is read from the cluster using Hive and written directly to a Hive database. Privitar produces safe output data that is ready for immediate use by data consumers and consistent with their existing data access patterns.

Efficient processing of partitioned data

Partitioned data is naturally represented as database columns in Hive, so handling of such data requires no special consideration when creating the Privitar Schema or Policy.
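For illustration, the sketch below (table and column names are hypothetical, not from this guide) shows why no special handling is needed: in Hive, a partition key behaves like an ordinary column in queries, so a Schema or Policy can reference it like any other column.

```sql
-- Hypothetical table partitioned by ingestion date.
CREATE TABLE customer_events (
  customer_id STRING,
  event_type  STRING,
  amount      DOUBLE
)
PARTITIONED BY (event_date STRING)
STORED AS PARQUET;

-- The partition key appears as a regular column to queries:
SELECT customer_id, event_type
FROM customer_events
WHERE event_date = '2021-01-15';
```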

Inferring Schemas from Hive

When inferring a Schema from Hive, Privitar automatically discovers all tables and columns present at the specified location and assigns default Schema names that match the names of the Hive tables and columns.
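As a rough sketch, the table and column metadata that Schema inference draws on is the same metadata Hive itself exposes (database and table names below are hypothetical examples):

```sql
-- List the tables present in a database:
SHOW TABLES IN sales_db;

-- Show the columns and types of one table:
DESCRIBE sales_db.customer_events;
```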

Support for querying Hive views

Privitar can query Hive views as readily as Hive tables, allowing for more flexible application of privacy protection to subsets of data that are defined as Hive views.
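For example, a view can expose just the subset of a table that should be protected, and Privitar can then read that view exactly as it would a table. The names below are hypothetical:

```sql
-- Hypothetical view exposing only recent rows and selected columns.
-- Privitar can query this view like any Hive table, so a Policy
-- applies only to the subset the view defines.
CREATE VIEW sales_db.recent_events AS
SELECT customer_id, event_type, amount
FROM sales_db.customer_events
WHERE event_date >= '2021-01-01';
```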

Support for filtering of data

Privitar supports row-level filtering of Hive data. For each column (both data and partition columns), a filter can be set so that only specific subsets of the data are processed. The available filters are SQL-like comparison operators, such as Greater than and Less than. For example, a filter could be applied to a column containing date information to include only rows of data from a specific date onwards.
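In SQL terms, a Greater-than filter on a date column is equivalent to reading only the matching subset of the source table. A minimal sketch, using hypothetical table and column names:

```sql
-- Equivalent of a "Greater than" filter on a date column:
-- only rows after the given date are processed.
SELECT *
FROM sales_db.customer_events
WHERE event_date > '2021-01-01';
```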