Custom directory structure of initial load output files on Amazon S3, Google Cloud Storage, and ADLS Gen2 targets
You can configure a custom directory structure for the output data files that initial load jobs write to Amazon S3, Google Cloud Storage, and Microsoft Azure Data Lake Storage (ADLS) Gen2 targets if you do not want to use the default structure.
By default, initial load jobs write output files to tablename_timestamp subdirectories under the parent directory. For all targets except Google Cloud Storage, the parent directory is specified in the target connection properties if the Connection Directory as Parent check box is selected on the Target page of the task wizard. In an Amazon S3 connection, this parent directory is specified in the Folder Path field. In an ADLS Gen2 connection, the parent directory is specified in the Directory Path field. For Google Cloud Storage targets, the parent directory is the bucket container specified in the Bucket field on the Target page of the task wizard.
You can customize the directory structure to suit your needs. For example, you can write the output files under a root directory or directory path that is different from the parent directory specified in the connection properties to better organize the files for your environment or to find them more easily. Or you can consolidate all output files for an object directly in a directory with the object name rather than write the files to separate timestamped subdirectories, for example, to facilitate automated processing of all of the files.
To configure a directory structure, use the Data Directory field on the Target page of the ingestion task wizard. The default value is {TableName}_{Timestamp}, which causes output files to be written to tablename_timestamp subdirectories under the parent directory. You can configure a custom directory path by creating a directory pattern that consists of any combination of case-insensitive placeholders and directory names. The placeholders are:
{TableName} for a target table name
{Timestamp} for the date and time, in the format yyyymmdd_hhmissms, at which the initial load job started to transfer data to the target
{Schema} for the target schema name
{YY} for a two-digit year
{YYYY} for a four-digit year
{MM} for a two-digit month value
{DD} for a two-digit day in the month
A pattern can also include the following functions:
toLower() to use lowercase for the values represented by the placeholder in parentheses
toUpper() to use uppercase for the values represented by the placeholder in parentheses
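To make the placeholder and function rules concrete, here is a minimal sketch in Python of how such a pattern could be resolved. This is not Informatica's implementation; the expand_pattern function, its case-insensitive matching, and the example timestamp value are illustrative assumptions based on the rules described above.

```python
import re
from datetime import datetime

def expand_pattern(pattern, table_name, schema_name, started_at):
    """Illustrative sketch: resolve the documented placeholders and
    the toLower()/toUpper() functions, case-insensitively."""
    values = {
        "tablename": table_name,
        "schema": schema_name,
        # yyyymmdd_hhmissms: date, time, and milliseconds of the job start
        "timestamp": started_at.strftime("%Y%m%d_%H%M%S")
                     + f"{started_at.microsecond // 1000:03d}",
        "yyyy": f"{started_at.year:04d}",
        "yy": f"{started_at.year % 100:02d}",
        "mm": f"{started_at.month:02d}",
        "dd": f"{started_at.day:02d}",
    }

    def sub_placeholders(text):
        # Replace {Name} tokens, ignoring case; leave unknown tokens as-is.
        return re.sub(r"\{(\w+)\}",
                      lambda m: values.get(m.group(1).lower(), m.group(0)),
                      text)

    def sub_functions(m):
        # Expand the placeholders inside the parentheses, then apply the case.
        inner = sub_placeholders(m.group(2))
        return inner.lower() if m.group(1).lower() == "tolower" else inner.upper()

    pattern = re.sub(r"(toLower|toUpper)\(([^)]*)\)", sub_functions,
                     pattern, flags=re.IGNORECASE)
    return sub_placeholders(pattern)

started = datetime(2024, 5, 7, 9, 30, 15, 123000)  # example job start time
print(expand_pattern("data_dir/{TableName}_{Timestamp}", "ORDERS", "DEMO", started))
# data_dir/ORDERS_20240507_093015123
print(expand_pattern("toLower({Schema})/{YYYY}/{MM}/{DD}", "ORDERS", "DEMO", started))
# demo/2024/05/07
```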
By default, the target schema is also written to the data directory. If you want to use a different directory for the schema, you can define a directory pattern in the Schema Directory field.
Example 1
You are using an Amazon S3 target and want to write output files and the target schema to the same directory, which is under the parent directory specified in the Folder Path field of the connection properties. In this case, the parent directory is idr-test/DEMO. You want to write all of the output files for an object to a directory that has a name matching the table name, without a timestamp. You must complete the Data Directory field and select the Connection Directory as Parent check box. The following image shows this configuration on the Target page of the task wizard:
Based on this configuration, the resulting directory structure is:
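As a rough sketch of the paths this configuration could produce, assuming the Data Directory field is set to {TableName} and the table is named ORDERS (both illustrative assumptions; the join logic below is not the service's actual behavior):

```python
# Illustrative assumptions: Data Directory = {TableName}, table named ORDERS.
parent = "idr-test/DEMO"   # Folder Path in the S3 connection properties
data_directory = "ORDERS"  # {TableName} resolved, no timestamp

# Output files and the schema land together under the parent directory.
output_path = f"{parent}/{data_directory}"
print(output_path)  # idr-test/DEMO/ORDERS
```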
Example 2
You are using an Amazon S3 target and want to write output data files to a custom directory path and write the target schema to a separate directory path. To use the directory specified in the Folder Path field in the Amazon S3 connection properties as the parent directory for the data directory and schema directory, select Connection Directory as Parent. In this case, the parent directory is idr-test/DEMO. In the Data Directory and Schema Directory fields, define directory patterns by using a specific directory name, such as data_dir and schema_dir, followed by the default {TableName}_{Timestamp} placeholder value. The placeholder creates tablename_timestamp destination directories. The following image shows this configuration on the Target page of the task wizard:
Based on this configuration, the resulting data directory structure is:
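As a rough sketch of the paths this configuration could produce, assuming a table named ORDERS and an illustrative resolved timestamp (the join logic below is an assumption, not the service's actual behavior):

```python
# Illustrative assumptions: table named ORDERS, example timestamp value.
parent = "idr-test/DEMO"                # Folder Path in the S3 connection
resolved = "ORDERS_20240507_093015123"  # example {TableName}_{Timestamp} value

# Data and schema patterns each prepend their own directory name.
data_path = f"{parent}/data_dir/{resolved}"
schema_path = f"{parent}/schema_dir/{resolved}"
print(data_path)    # idr-test/DEMO/data_dir/ORDERS_20240507_093015123
print(schema_path)  # idr-test/DEMO/schema_dir/ORDERS_20240507_093015123
```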