Mass Ingestion

Back Next

Amazon S3 target properties

When you define an

application ingestion

task, you must specify the properties for your Amazon S3 target on the

Target

page of the task wizard.

The following table describes the Amazon S3 target properties that appear in

Target

section:

Property	Description
Output Format	Select the format of the output file. Options are: CSV AVRO PARQUET The default value is CSV . Output files in CSV format use double-quotation marks ("") as the delimiter for each field.
Add Headers to CSV File	If CSV is selected as the output format, select this check box to add a header with source column names to the output CSV file.
Parquet Compression Type	If the PARQUET output format is selected, you can select a compression type that is supported by Parquet. Options are: None Gzip Snappy The default value is None , which means no compression is used.
Avro Format	If you selected AVRO as the output format, select the format of the Avro schema that will be created for each source table. Options are: Avro-Flat . This Avro schema format lists all Avro fields in one record. Avro-Generic . This Avro schema format lists all columns from a source table in a single array of Avro fields. Avro-Nested . This Avro schema format organizes each type of information in a separate record. The default value is Avro-Flat .
Avro Serialization Format	If AVRO is selected as the output format, select the serialization format of the Avro output file. Options are: None Binary JSON The default value is Binary .
Avro Schema Directory	If AVRO is selected as the output format, specify the local directory where Mass Ingestion Applications stores Avro schema definitions for each source table. Schema definition files have the following naming pattern: `schemaname`_`tablename`.txt If this directory is not specified, no Avro schema definition file is produced.
File Compression Type	Select a file compression type for output files in CSV or AVRO output format. Options are: None Deflate Gzip Snappy The default value is None , which means no compression is used.
Avro Compression Type	If AVRO is selected as the output format, select an Avro compression type. Options are: None Bzip2 Deflate Snappy The default value is None , which means no compression is used.
Deflate Compression Level	If Deflate is selected in the Avro Compression Type field, specify a compression level from 0 to 9. The default value is 0.
Add Directory Tags	For incremental load tasks, select this check box to add the "dt=" prefix to the names of apply cycle directories to be compatible with the naming convention for Hive partitioning. This check box is cleared by default.
Data Directory	For initial load tasks, define a directory structure for the directories where Mass Ingestion Applications stores output data files and optionally stores the schema. In this value, you can include the following types of entries to define directory patterns: The placeholders {SchemaName}, {TableName), {Timestamp}, {YY}, {YYYY}, {MM}, and {DD}, where {YY}, {YYYY}, {MM}, and {DD} are for date elements. The (Timestamp) values are in the format yyyymmdd_hhmissms. The generated dates and times in the directory paths indicate when the initial load job starts to transfer data to the target. Specific directory names. The toUpper() and toLower() functions, which force the values for an associated (`placeholder`) to uppercase or lowercase. Examples: myDir1/{SchemaName}/{TableName} myDir1/myDir2/{SchemaName}/{YYYY}/{MM}/{TableName}_{Timestamp} myDir1/{toLower(SchemaName)}/{TableName}_{Timestamp} This value is not case sensitive. The default format is {TableName)_{Timestamp}. For Amazon S3 and Microsoft Azure Data Lake Storage Gen2 targets, Mass Ingestion Applications uses the directory specified in the target connection properties as the root for the data directory path. For Google Cloud Storage targets, Mass Ingestion Applications uses the Bucket name that you specify in the target properties for the ingestion task.
Connection Directory as Parent	For initial load and incremental load tasks, select this check box to use the directory value that is specified in the target connection properties as the parent directory for the custom directory paths specified in the task target properties. For initial load tasks, the parent directory is used in the Data Directory and Schema Directory . For incremental load tasks, the parent directory is used in the Data Directory , Schema Directory , Cycle Completion Directory , and Cycle Contents Directory . This check box is selected by default. If you clear it, for initial loads, define the full path to the output files in the Data Directory field. For incremental loads, optionally specify a root directory for the task in the Task Target Directory .
Schema Directory	For initial load and incremental load tasks, you can specify a custom directory in which to store the schema file if you want to store it in a directory other than the default directory. For initial loads, previously used values if available are shown in a drop-down list for your convenience. This field is optional. For initial loads, the schema is stored in the data directory by default. For incremental loads, the default directory for the schema file is {TaskTargetDirectory}/cdc-data/{TableName}/schema You can use the same placeholders as for the Data Directory field. Ensure that you enclose placeholders with curly brackets { }. If you include the toUpper or toLower function, put the placeholder name in parentheses and enclose the both the function and placeholder in curly brackets, for example: {toLower(SchemaName)} Schema is written only to output data files in CSV format. Data files in Parquet and Avro formats contain their own embedded schema.

The following table describes the Amazon S3 advanced target properties that appear in

Advanced

section:

Field	Description
Add Operation Type	Select this check box to add a metadata column that includes the source SQL operation type in the output that the job propagates to the target. For incremental loads, the job writes "I" for insert, "U" for update, or "D" for delete. For initial loads, the job always writes "I" for insert. By default, this check box is cleared.
Add Operation Time	Select this check box to add a metadata column that includes the source SQL operation time in the output that the job propagates to the target. For initial loads, the job always writes the current date and time. By default, this check box is cleared.
Add Operation Owner	Select this check box to add a metadata column that includes the owner of the source SQL operation in the output that the job propagates to the target. For initial loads, the job always writes "INFA" as the owner. By default, this check box is cleared.
Add Operation Transaction Id	Select this check box to add a metadata column that includes the source transaction ID in the output that the job propagates to the target for SQL operations. For initial loads, the job always writes "1" as the ID. By default, this check box is cleared.
Add Before Images	Select this check box to include UNDO data in the output that an incremental load job writes to the target. For initial loads, the job writes nulls. By default, this check box is cleared.

Rename Saved Search

Table of Contents

Mass Ingestion

Mass Ingestion

Amazon S3 target properties

Amazon S3 target properties