
Specifying Input Data Locations (HDFS and Cloud Storage)

The Data Locations section specifies the locations of the Job's input data. The data items referenced in this section must conform to the Schema of the Job's Policy.

To complete the Data Locations section:

  1. Specify a valid pathname in the Input Root Folder box. All files to be read by the Job must reside under this location.

  2. For each table displayed, specify the following:

    • The type of file, in the Type column. This can be CSV, Avro, or Parquet.

    • The path to the table in the Relative Path box.

      The input files are referenced by a path that is resolved relative to the folder specified in the Input Root Folder box. The path can identify either a single file, or multiple files if wildcards are used (see the example after these steps). For more information about Relative Paths, see Partitioned Data and Wildcards (HDFS).

    If the selected input type is CSV or Avro, additional Settings for these file types can be configured by clicking on the Cog icon in the Settings column.
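
As an illustration of how the Input Root Folder and Relative Path combine, the following sketch (not part of Privitar; all folder, table, and file names are hypothetical) shows how a Relative Path containing wildcards selects files beneath the root folder:

    # Illustrative sketch only: shows how a Relative Path containing wildcards
    # selects files beneath the Input Root Folder. Names below are hypothetical.
    from fnmatch import fnmatch
    from posixpath import join

    input_root_folder = "/data/landing"              # value entered in Input Root Folder
    relative_path = "customers/2023-*/part-*.csv"    # value entered in Relative Path

    # Hypothetical listing of files available under the root folder.
    files_under_root = [
        "/data/landing/customers/2023-01/part-0000.csv",
        "/data/landing/customers/2023-02/part-0000.csv",
        "/data/landing/orders/2023-01/part-0000.csv",
    ]

    # A file is read by the Job only if it matches the root + relative path pattern.
    pattern = join(input_root_folder, relative_path)
    selected = [f for f in files_under_root if fnmatch(f, pattern)]
    print(selected)
    # ['/data/landing/customers/2023-01/part-0000.csv',
    #  '/data/landing/customers/2023-02/part-0000.csv']

In this example, only the files under the customers folder would be read by the Job; the orders files do not match the pattern.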

Reading from Cloud Storage

When Privitar is deployed in a cloud environment, Batch Jobs can read and write data directly from native cloud storage platforms. This is configured in the Data Locations section of the Batch Job.

The following table defines the syntax to use when specifying a data location for each of the cloud storage platforms supported by Privitar.

Compute Platform   | Container | Location syntax
-------------------|-----------|---------------------------------------------------------------
Amazon EMR Cluster | S3        | s3://<bucket-name>/
AWS Glue           | S3        | s3://<bucket-name>/
Azure HDInsight    | Blob      | wasb://<container-name>@<account-name>.blob.core.windows.net/
Google Dataproc    | Bucket    | Prefix the path with gs://
BlueData           | DataTap   | Prefix the path with dtap://
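
For example, on an Amazon EMR cluster reading from a hypothetical S3 bucket named acme-input, the Input Root Folder could be specified as:

    s3://acme-input/landing/

and on an Azure HDInsight cluster, for a hypothetical Blob container named input in the storage account acmestore:

    wasb://input@acmestore.blob.core.windows.net/landing/

The bucket, container, account, and folder names above are illustrative only; Relative Paths would then be resolved against these locations in the same way as for HDFS.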