
Specifying Input Data Locations (HDFS and Cloud Storage)

The Data Locations section specifies the locations of the Job's input data. The data items referenced in this section must conform to the Schema of the Job's Policy.

To complete the Data Locations section:

  1. Specify a valid pathname in the Input Root Folder box. All files to be read by the Job must reside under this location.

  2. For each table displayed, specify the following:

    • The type of file, in the Type column. This can be CSV, Avro, or Parquet.

    • The path to the table in the Relative Path box.

      The input files are referenced by a path that is resolved relative to the folder specified in the Input Root Folder box. The path can identify either a single file, or multiple files if wildcards are used (see the example after these steps). For more information about Relative Paths, see Partitioned Data and Wildcards (HDFS).

    If the selected input type is CSV or Avro, additional Settings for these file types can be configured by clicking on the Cog icon in the Settings column.
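
As an illustration of how the Input Root Folder and Relative Path combine, the following sketch (not part of Privitar; all folder, table, and file names are hypothetical) shows how a Relative Path containing wildcards selects files beneath the root folder:

    # Illustrative sketch only: shows how a Relative Path containing wildcards
    # selects files beneath the Input Root Folder. Names below are hypothetical.
    from fnmatch import fnmatch
    from posixpath import join

    input_root_folder = "/data/landing"              # value entered in Input Root Folder
    relative_path = "customers/2023-*/part-*.csv"    # value entered in Relative Path

    # Hypothetical listing of files available under the root folder.
    files_under_root = [
        "/data/landing/customers/2023-01/part-0000.csv",
        "/data/landing/customers/2023-02/part-0000.csv",
        "/data/landing/orders/2023-01/part-0000.csv",
    ]

    # A file is read by the Job only if it matches the root + relative path pattern.
    pattern = join(input_root_folder, relative_path)
    selected = [f for f in files_under_root if fnmatch(f, pattern)]
    print(selected)
    # ['/data/landing/customers/2023-01/part-0000.csv',
    #  '/data/landing/customers/2023-02/part-0000.csv']

In this example, only the files under the customers folder would be read by the Job; the orders files do not match the pattern.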

Reading from Cloud Storage

When Privitar is deployed in a cloud environment, Batch Jobs can read and write data directly from native cloud storage platforms. This is configured in the Data Locations section of the Batch Job.

The following table defines the syntax to use when specifying a data location for each of the cloud storage platforms supported by Privitar.

Compute Platform   | Container | Location syntax
-------------------|-----------|---------------------------------------------------------------
Amazon EMR Cluster | S3        | s3://<bucket-name>/
AWS Glue           | S3        | s3://<bucket-name>/
Azure HDInsight    | Blob      | wasb://<container-name>@<account-name>.blob.core.windows.net/
Google Dataproc    | Bucket    | Prefix the path with gs://
BlueData           | DataTap   | Prefix the path with dtap://
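
For example, on an Amazon EMR cluster reading from a hypothetical S3 bucket named acme-input, the Input Root Folder could be specified as:

    s3://acme-input/landing/

and on an Azure HDInsight cluster, for a hypothetical Blob container named input in the storage account acmestore:

    wasb://input@acmestore.blob.core.windows.net/landing/

The bucket, container, account, and folder names above are illustrative only; Relative Paths would then be resolved against these locations in the same way as for HDFS.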