Custom directory structure for output files on Amazon S3, Google Cloud Storage, Flat File, and ADLS Gen2 targets
You can configure a custom directory structure for the output files that initial load or incremental load jobs write to Amazon S3, Google Cloud Storage, Flat File, or Microsoft Azure Data Lake Storage (ADLS) Gen2 targets if you do not want to use the default structure.
Initial loads
By default, initial load jobs write output files to tablename_timestamp subdirectories under the parent directory. For Amazon S3, Flat File, and ADLS Gen2 targets, the parent directory is specified in the target connection properties if the Connection Directory as Parent check box is selected on the Target page of the task wizard.
In an Amazon S3 connection, this parent directory is specified in the Folder Path field.
In a Flat File connection, the parent directory is specified in the Directory field.
In an ADLS Gen2 connection, the parent directory is specified in the Directory Path field.
For Google Cloud Storage targets, the parent directory is the bucket container specified in the Bucket field on the Target page of the task wizard.
You can customize the directory structure to suit your needs. For example, for initial loads, you can write the output files under a root directory or directory path that is different from the parent directory specified in the connection properties to better organize the files for your environment or to find them more easily. Or you can consolidate all output files for a table directly in a directory with the table name rather than write the files to separate timestamped subdirectories, for example, to facilitate automated processing of all of the files.
To configure a directory structure, you must use the Data Directory field on the Target page of the ingestion task wizard. The default value is {TableName}_{Timestamp}, which causes output files to be written to tablename_timestamp subdirectories under the parent directory. You can configure a custom directory path by creating a directory pattern that consists of any combination of case-insensitive placeholders and directory names. The placeholders are:
{TableName} for a target table name
{Timestamp} for the date and time, in the format yyyymmdd_hhmissms, at which the initial load job started to transfer data to the target
{Schema} for the target schema name
{YY} for a two-digit year
{YYYY} for a four-digit year
{MM} for a two-digit month value
{DD} for a two-digit day in the month
A pattern can also include the following functions:
toLower() to use lowercase for the values represented by the placeholder in parentheses
toUpper() to use uppercase for the values represented by the placeholder in parentheses
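As an illustration of the substitution rules above, the following Python sketch expands a directory pattern containing the documented placeholders and the toLower()/toUpper() functions. This is a hypothetical re-implementation for clarity, not the product's actual code:

```python
import re
from datetime import datetime

def expand_pattern(pattern, table, schema, started_at):
    """Expand a directory pattern such as {toLower(Schema)}/{TableName}_{Timestamp}.

    Illustrative sketch of the substitution rules described above; the
    ingestion job performs this substitution internally.
    """
    values = {
        "tablename": table,
        "schema": schema,
        # yyyymmdd_hhmissms: date, time, and milliseconds of the job start
        "timestamp": started_at.strftime("%Y%m%d_%H%M%S%f")[:-3],
        "yy": started_at.strftime("%y"),
        "yyyy": started_at.strftime("%Y"),
        "mm": started_at.strftime("%m"),
        "dd": started_at.strftime("%d"),
    }

    def substitute(match):
        func = (match.group("func") or "").lower()
        value = values[match.group("name").lower()]
        if func == "tolower":
            return value.lower()
        if func == "toupper":
            return value.upper()
        return value

    # Placeholders are case-insensitive and may be wrapped in toLower()/toUpper().
    return re.sub(
        r"\{(?:(?P<func>tolower|toupper)\()?(?P<name>[a-z]+)\)?\}",
        substitute,
        pattern,
        flags=re.IGNORECASE,
    )

print(expand_pattern("{toLower(Schema)}/{TableName}_{Timestamp}",
                     "ORDERS", "SALES", datetime(2024, 5, 7, 9, 30, 15, 123000)))
# sales/ORDERS_20240507_093015123
```

The table name, schema name, and start time here are arbitrary sample values; any valid pattern built from the placeholders listed above expands the same way.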
By default, the target schema is also written to the data directory. If you want to use a different directory for the schema, you can define a directory pattern in the Schema Directory field.
Example 1
You are using an Amazon S3 target and want to write output files and the target schema to the same directory, which is under the parent directory specified in the Folder Path field of the connection properties. In this case, the parent directory is idr-test/DEMO. You want to write all of the output files for a table to a directory that has a name matching the table name, without a timestamp. You must complete the Data Directory field and select the Connection Directory as Parent check box. The following image shows this configuration on the Target page of the task wizard:
Based on this configuration, the resulting directory structure is:
Example 2
You are using an Amazon S3 target and want to write output data files to a custom directory path and write the target schema to a separate directory path. To use the directory specified in the Folder Path field in the Amazon S3 connection properties as the parent directory for the data directory and schema directory, select Connection Directory as Parent. In this case, the parent directory is idr-test/DEMO. In the Data Directory and Schema Directory fields, define directory patterns by using a specific directory name, such as data_dir and schema_dir, followed by the default {TableName}_{Timestamp} placeholder value. The placeholder creates tablename_timestamp destination directories. The following image shows this configuration on the Target page of the task wizard:
Based on this configuration, the resulting data directory structure is:
And the resulting schema directory structure is:
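As a concrete illustration of the two initial load examples above (the table name and timestamp values are hypothetical):

```python
# Illustrative only: how the Example 1 and Example 2 patterns resolve.
# The parent directory comes from the Folder Path connection property;
# the table name and timestamp below are hypothetical sample values.
parent = "idr-test/DEMO"
table = "CUSTOMERS"
timestamp = "20240507_093015123"

# Example 1: Data Directory = "{TableName}", so all files for the table
# land in one directory, with no timestamp subdirectory.
example1_data = f"{parent}/{table}"

# Example 2: Data Directory = "data_dir/{TableName}_{Timestamp}" and
#            Schema Directory = "schema_dir/{TableName}_{Timestamp}".
example2_data = f"{parent}/data_dir/{table}_{timestamp}"
example2_schema = f"{parent}/schema_dir/{table}_{timestamp}"

print(example1_data)     # idr-test/DEMO/CUSTOMERS
print(example2_data)     # idr-test/DEMO/data_dir/CUSTOMERS_20240507_093015123
print(example2_schema)   # idr-test/DEMO/schema_dir/CUSTOMERS_20240507_093015123
```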
Incremental loads
By default, incremental load jobs write cdc-cycle files and cdc-data files to subdirectories under the parent directory. However, you can create a custom directory structure to organize the files to best suit your organization's requirements.
This feature applies to database ingestion incremental load tasks that have Amazon S3, Google Cloud Storage, or Microsoft Azure Data Lake Storage (ADLS) Gen2 targets. It also applies to application ingestion incremental load jobs that have a Salesforce source and any one of these target types. This feature does not apply to combined initial and incremental load tasks.
For all targets except Google Cloud Storage, the parent directory is set in the target connection properties if the Connection Directory as Parent check box is selected on the Target page of the task wizard.
In an Amazon S3 connection, the parent directory is specified in the Folder Path field.
In an ADLS Gen2 connection, the parent directory is specified in the Directory Path field.
For Google Cloud Storage targets, the parent directory is the bucket container specified in the Bucket field on the Target page of the task wizard.
You can customize the directory structure to suit your needs. For example, you can write the cdc-data and cdc-cycle files under a target directory for the task instead of under the parent directory specified in the connection properties. Alternatively, you can 1) consolidate table-specific data and schema files under a subdirectory that includes the table name, 2) partition the data files and summary contents and completed files by CDC cycle, or 3) create a completely customized directory structure by defining a pattern that includes literal values and placeholders. For example, if you want to run SQL-type expressions to process the data based on time, you can write all data files directly to timestamp subdirectories without partitioning them by CDC cycle.
To configure a custom directory structure for an incremental load task, define a pattern for any of the following optional fields on the Target page of the ingestion task wizard:
Field: Task Target Directory
Description: Name of a root directory to use for storing output files for an incremental load task. If you select the Connection Directory as Parent option, you can still optionally specify a task target directory. It is appended to the parent directory to form the root for the data, schema, cycle completion, and cycle contents directories. This field is required if the {TaskTargetDirectory} placeholder is specified in patterns for any of the following directory fields.
Default: None

Field: Connection Directory as Parent
Description: Select this check box to use the parent directory specified in the connection properties.
Default: Selected

Field: Data Directory
Description: Path to the subdirectory that contains the cdc-data data files. In the directory path, the {TableName} placeholder is required if data and schema files are not partitioned by CDC cycle.
Default: {TaskTargetDirectory}/cdc-data/{TableName}/data

Field: Schema Directory
Description: Path to the subdirectory in which to store the schema file if you do not want to store it in the data directory. In the directory path, the {TableName} placeholder is required if data and schema files are not partitioned by CDC cycle.
Default: {TaskTargetDirectory}/cdc-data/{TableName}/schema

Field: Cycle Completion Directory
Description: Path to the directory that contains the cdc-cycle completed file.
Default: {TaskTargetDirectory}/cdc-cycle/completed

Field: Cycle Contents Directory
Description: Path to the directory that contains the cdc-cycle contents files.
Default: {TaskTargetDirectory}/cdc-cycle/contents

Field: Use Cycle Partitioning for Data Directory
Description: Causes a timestamp subdirectory to be created for each CDC cycle, under each data directory. If this option is not selected, individual data files are written to the same directory without a timestamp, unless you define an alternative directory structure.
Default: Selected

Field: Use Cycle Partitioning for Summary Directories
Description: Causes a timestamp subdirectory to be created for each CDC cycle, under the summary contents and completed subdirectories.
Default: Selected

Field: List Individual Files in Contents
Description: Lists individual data files under the contents subdirectory. If Use Cycle Partitioning for Summary Directories is cleared, this option is selected by default, and all of the individual files are listed in the contents subdirectory unless you configure custom subdirectories by using placeholders, such as for the timestamp or date. If Use Cycle Partitioning for Data Directory is selected, you can still optionally select this check box to list individual files and group them by CDC cycle.
Default: Not selected if Use Cycle Partitioning for Summary Directories is selected. Selected if Use Cycle Partitioning for Summary Directories is cleared.
A directory pattern consists of any combination of case-insensitive placeholders, shown in curly brackets { }, and specific directory names. The following placeholders are supported:
{TaskTargetDirectory} for a task-specific base directory on the target to use instead of the directory specified in the connection properties
{TableName} for a target table name
{Timestamp} for the date and time, in the format yyyymmdd_hhmissms
{Schema} for the target schema name
{YY} for a two-digit year
{YYYY} for a four-digit year
{MM} for a two-digit month value
{DD} for a two-digit day in the month
When specified in patterns for the data, contents, and completed directories, the timestamp, year, month, and day placeholders indicate the time at which the CDC cycle started. When specified in the schema directory pattern, they indicate the time at which the CDC job started.
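As an illustration of how the default incremental load patterns from the table above resolve, the following sketch substitutes a hypothetical task target directory and table name (cycle-partitioning timestamp subdirectories would be appended under these paths when partitioning is enabled):

```python
# Illustrative expansion of the default incremental load directory patterns.
# "s3_target" (task target directory) and "T1" (table) are hypothetical names.
task_target = "s3_target"
table = "T1"

default_patterns = {
    "data":      "{TaskTargetDirectory}/cdc-data/{TableName}/data",
    "schema":    "{TaskTargetDirectory}/cdc-data/{TableName}/schema",
    "completed": "{TaskTargetDirectory}/cdc-cycle/completed",
    "contents":  "{TaskTargetDirectory}/cdc-cycle/contents",
}

resolved = {
    name: pattern.replace("{TaskTargetDirectory}", task_target)
                 .replace("{TableName}", table)
    for name, pattern in default_patterns.items()
}

for name, path in resolved.items():
    print(f"{name:<10} {path}")
# data       s3_target/cdc-data/T1/data
# schema     s3_target/cdc-data/T1/schema
# completed  s3_target/cdc-cycle/completed
# contents   s3_target/cdc-cycle/contents
```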
Example 1
You want to accept the default directory settings for incremental load jobs as displayed in the task wizard. The target type is Amazon S3. Because the Connection Directory as Parent check box is selected by default, the parent directory path that is specified in the Folder Path field of the Amazon S3 connection properties is used. This parent directory is idr-test/dbmi. You also must specify a task target directory name, in this case, s3_target, because the {TaskTargetDirectory} placeholder is used in the default patterns in the subsequent directory fields. The files in the data directory and schema directory will be grouped by table name because the {TableName} placeholder is included in their default patterns. Also, because cycle partitioning is enabled, the files in the data directory, schema directory, and cycle summary directories will be subdivided by CDC cycle. The following image shows the default configuration settings on the Target page of the task wizard, except for the specified task target directory name:
Based on this configuration, the resulting data directory structure is:
If you drill down on the cdc_data folder and then on a table in that folder (pgs001_src_allint_init), you can access the data and schema subdirectories:
If you drill down on the data folder, you can access the timestamp directories for the data files:
If you drill down on cdc-cycle, you can access the summary contents and completed subdirectories:
Example 2
You want to create a custom directory structure for incremental load jobs that adds the subdirectories "demo" and "d1" in all of the directory paths except the schema directory so that you can easily find the files for your demos. Because the Connection Directory as Parent check box is selected, the parent directory path (idr-test/dbmi) that is specified in the Folder Path field of the Amazon S3 connection properties is used. You also must specify the task target directory because the {TaskTargetDirectory} placeholder is used in the patterns in the subsequent directory fields. The files in the data directory and schema directory will be grouped by table name. Also, because cycle partitioning is enabled, the files in the data, schema, and cycle summary directories will be subdivided by CDC cycle. The following image shows the custom configuration on the Target page of the task wizard:
Based on this configuration, the resulting data directory structure is: