Mass Ingestion

Guidelines for Amazon S3, Google Cloud Storage, and Microsoft Azure Data Lake Storage Gen2 targets

Consider the following guidelines when you use Amazon S3, Google Cloud Storage, and Microsoft Azure Data Lake Storage Gen2 targets:
  • When you configure an application ingestion task for an Amazon S3, Google Cloud Storage, or Microsoft Azure Data Lake Storage Gen2 target, you can select either CSV or AVRO as the format for the output files that contain the source data to be applied to the target.
  • If you select CSV as the output file format, Mass Ingestion Applications creates the following files on the target for each source object:
    • A schema.ini file that describes the schema of the object. The file also includes some settings for the output file on the target.
    • Output files that contain the data stored in the source object. Mass Ingestion Applications names the output files based on the name of the source object with an appended date and time.
    The schema.ini file lists the sequence of columns for the rows in the corresponding output file. The following table describes the columns in the schema.ini file:

    Column                  Description
    ColNameHeader           Indicates whether the source data files include column headers.
    Format                  Format of the output files. Mass Ingestion Applications uses a comma (,) to delimit column values.
    CharacterSet            Character set that is used for the corresponding output file. By default, Mass Ingestion Applications generates the files in the UTF-8 character set.
    COL<sequence_number>    Name and data type of the source field.

    For a sample schema.ini file, see the example after this list.
    • You must not edit the schema.ini file.
    • If you select the Add Before Images check box in the Advanced section of the Target page, the application ingestion job creates a column_name_OLD column to store the UNDO data and a column_name_NEW column to store the REDO data for each source field.
  • If you select AVRO as the output file format, you can specify the serialization format of the Avro output file and write the output data in uncompressed Parquet format. Additionally, you can specify the file compression type, the Avro data compression type, and the directory where the Avro schema definitions that are generated for each source object are stored. For an example of an Avro schema definition, see the sketch after this list.
  • For application ingestion tasks configured for Microsoft Azure Data Lake Storage Gen2 targets, Mass Ingestion Applications creates an empty directory on the target for each empty source object.
  • For Amazon S3 targets, if you do not specify an access key and secret key in the connection properties, Mass Ingestion Applications tries to find the AWS credentials by using the default credential provider chain that is implemented by the DefaultAWSCredentialsProviderChain class. For more information, see the Amazon Web Services documentation. For a sketch of the credential sources that the chain checks, see the example after this list.
  • When an incremental load job that is configured for a target that uses the CSV output format propagates an Update operation that changed primary key values on the source, the job performs a Delete operation on the associated target row and then performs an Insert operation on the same row to replicate the change made to the source object. The Delete operation writes the before image to the target and the subsequent Insert operation writes the after image to the target. For a worked example, see the sketch after this list.
    For Update operations that do not change primary key values, application ingestion jobs process each Update operation as a single operation and write only the after image to the target.
    If a source object does not contain a primary key, Mass Ingestion Applications considers all fields of the object to be part of the primary key. In such scenarios, Mass Ingestion Applications processes each Update operation performed on the source as a Delete operation followed by an Insert operation on the target.
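
The following is a minimal sketch of a schema.ini file for a hypothetical source object named ACCOUNT with two fields, ID and NAME. The section name, entry values, and column list are illustrative only; the actual file that Mass Ingestion Applications generates depends on the source object and the task configuration, and you must not edit the generated file.

    [ACCOUNT_20240115_103000.csv]
    ColNameHeader=False
    Format=Delimited(,)
    CharacterSet=UTF-8
    COL1=ID Text
    COL2=NAME Text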
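
Avro schema definitions use the standard Avro schema (.avsc) JSON format. The following is a minimal sketch of such a schema for the same hypothetical ACCOUNT object; the actual record name, field names, field types, and any additional metadata fields depend on the source object and the task configuration:

    {
      "type": "record",
      "name": "ACCOUNT",
      "fields": [
        {"name": "ID", "type": ["null", "string"], "default": null},
        {"name": "NAME", "type": ["null", "string"], "default": null}
      ]
    }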
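
When no access key and secret key are set in the Amazon S3 connection, the default credential provider chain looks for credentials in the standard AWS locations, such as the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables, the aws.accessKeyId and aws.secretKey Java system properties, the shared credentials file, and instance or container profile credentials. For example, a shared credentials file at ~/.aws/credentials with a default profile similar to the following sketch can be picked up by the chain; the key values are placeholders:

    [default]
    aws_access_key_id = <your_access_key>
    aws_secret_access_key = <your_secret_key>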
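
As a worked example of how an incremental load job replicates an Update that changes a primary key value to a CSV target, assume a hypothetical source object with a primary key field ID whose value changes from 100 to 200 in a row that also has a NAME field. The layout below is illustrative only and omits any metadata columns that the output files might contain:

    Operation    ID     NAME
    Delete       100    Acme Corp     (before image of the row)
    Insert       200    Acme Corp     (after image of the row)

An Update on the same row that leaves the ID value unchanged would instead be written as a single operation that carries only the after image.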
