Table of Contents


  1. Introducing Mass Ingestion
  2. Getting Started with Mass Ingestion
  3. Connectors and Connections
  4. Mass Ingestion Applications
  5. Mass Ingestion Databases
  6. Mass Ingestion Files
  7. Mass Ingestion Streaming
  8. Monitoring Mass Ingestion Jobs
  9. Asset Management
  10. Troubleshooting

Mass Ingestion

Google Cloud Storage target properties

When you define a database ingestion task that has a Google Cloud Storage target, you must enter some target properties on the Target tab of the task wizard.

Under Target, you can enter the following Google Cloud Storage target properties:
Output Format
Select the format of the output file. Options are:
  • CSV
  • AVRO
  • PARQUET
The default value is CSV.
Output files in CSV format use double-quotation marks ("") as the delimiter for each field.
Add Headers to CSV File
If CSV is selected as the output format, select this check box to add a header with source column names to the output CSV file.
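For illustration, here is a minimal Python sketch of this CSV layout, in which every field is enclosed in double-quotation marks and an optional header row carries the source column names. The column names and data are hypothetical, not Mass Ingestion output:

    import csv

    # Hypothetical source columns and rows; Mass Ingestion Databases derives
    # these from the source table.
    columns = ["ORDER_ID", "CUSTOMER", "AMOUNT"]
    rows = [[1001, "Acme Corp", 250.00], [1002, "Globex", 99.95]]

    with open("orders.csv", "w", newline="") as f:
        writer = csv.writer(f, quoting=csv.QUOTE_ALL)  # double-quote every field
        writer.writerow(columns)  # header row, as when Add Headers to CSV File is set
        writer.writerows(rows)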
Avro Format
If you selected AVRO as the output format, select the format of the Avro schema that will be created for each source table. Options are:
  • Avro-Flat. This Avro schema format lists all Avro fields in one record.
  • Avro-Generic. This Avro schema format lists all columns from a source table in a single array of Avro fields.
  • Avro-Nested. This Avro schema format organizes each type of information in a separate record.
The default value is Avro-Flat.
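The following Python sketches suggest how the three schema formats might differ for a hypothetical two-column table SALES.ORDERS. These are hedged illustrations only; the exact schemas that Mass Ingestion Databases generates may differ:

    # Avro-Flat: all Avro fields listed in one record.
    avro_flat = {
        "type": "record", "name": "ORDERS",
        "fields": [
            {"name": "ORDER_ID", "type": "long"},
            {"name": "CUSTOMER", "type": "string"},
        ],
    }

    # Avro-Generic: all source columns carried in a single array of Avro fields.
    avro_generic = {
        "type": "record", "name": "ORDERS",
        "fields": [
            {"name": "columns", "type": {"type": "array", "items": "string"}},
        ],
    }

    # Avro-Nested: each type of information organized in a separate record.
    avro_nested = {
        "type": "record", "name": "ORDERS",
        "fields": [
            {"name": "data", "type": {
                "type": "record", "name": "ORDERS_data",
                "fields": [
                    {"name": "ORDER_ID", "type": "long"},
                    {"name": "CUSTOMER", "type": "string"},
                ],
            }},
        ],
    }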
Avro Serialization Format
If AVRO is selected as the output format, select the serialization format of the Avro output file. Options are:
  • None
  • Binary
  • JSON
The default value is Binary.
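As an illustration of the difference between the Binary and JSON serialization formats, this sketch writes the same record both ways using the third-party fastavro package, which is an assumption for demonstration and not part of Mass Ingestion:

    import io
    from fastavro import json_writer, parse_schema, writer

    schema = parse_schema({
        "type": "record", "name": "ORDERS",
        "fields": [{"name": "ORDER_ID", "type": "long"}],
    })
    records = [{"ORDER_ID": 1001}]

    binary_buf = io.BytesIO()
    writer(binary_buf, schema, records)     # Binary: compact Avro container bytes

    json_buf = io.StringIO()
    json_writer(json_buf, schema, records)  # JSON: human-readable text encoding
    print(json_buf.getvalue())              # {"ORDER_ID": 1001}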
Avro Schema Directory
If AVRO is selected as the output format, specify the local directory where Mass Ingestion Databases stores Avro schema definitions for each source table. Schema definition files have the following naming pattern:
schemaname_tablename.txt
If this directory is not specified, no Avro schema definition file is produced.
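A minimal sketch of the documented naming pattern, with a hypothetical directory and hypothetical source identifiers:

    from pathlib import Path

    schema_dir = Path("/opt/infa/avro-schemas")  # hypothetical local directory
    schema_name, table_name = "SALES", "ORDERS"  # hypothetical source identifiers

    # Schema definition files follow the schemaname_tablename.txt pattern.
    schema_file = schema_dir / f"{schema_name}_{table_name}.txt"
    print(schema_file)  # /opt/infa/avro-schemas/SALES_ORDERS.txt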
File Compression Type
Select a file compression type for output files in CSV or AVRO output format. Options are:
  • None
  • Deflate
  • Gzip
  • Snappy
The default value is None, which means no compression is used.
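For example, a gzip-compressed CSV output file can be produced with the Python standard library. The file name and contents here are illustrative only, not Mass Ingestion's own naming:

    import csv
    import gzip

    with gzip.open("orders.csv.gz", "wt", newline="") as f:
        writer = csv.writer(f, quoting=csv.QUOTE_ALL)
        writer.writerow(["ORDER_ID", "CUSTOMER"])  # header row
        writer.writerow([1001, "Acme Corp"])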
Avro Compression Type
If AVRO is selected as the output format, select an Avro compression type. Options are:
  • None
  • Bzip2
  • Deflate
  • Snappy
The default value is None, which means no compression is used.
Parquet Compression Type
If the PARQUET output format is selected, you can select a compression type that is supported by Parquet. Options are:
  • None
  • Gzip
  • Snappy
The default value is None, which means no compression is used.
Deflate Compression Level
If Deflate is selected in the Avro Compression Type field, specify a compression level from 0 to 9. The default value is 0.
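The level trades speed for output size, as this standard-library sketch of Deflate compression illustrates (the payload is hypothetical):

    import zlib

    data = b"repetitive CDC payload " * 1000

    # Level 0 stores the data uncompressed; level 9 spends the most CPU for
    # the smallest output, mirroring the 0-9 range of this setting.
    for level in (0, 1, 9):
        print(level, len(zlib.compress(data, level)))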
Add Directory Tags
For incremental load tasks, select this check box to add the "dt=" prefix to the names of apply cycle directories to be compatible with the naming convention for Hive partitioning. This check box is cleared by default.
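A small sketch of the naming difference; the cycle timestamp and directory path are hypothetical:

    from datetime import datetime, timezone

    cycle_ts = datetime.now(timezone.utc).strftime("%Y%m%d_%H%M%S")

    untagged = f"cdc-data/ORDERS/data/{cycle_ts}"    # check box cleared
    tagged = f"cdc-data/ORDERS/data/dt={cycle_ts}"   # "dt=" prefix for Hive partitioning
    print(untagged, tagged, sep="\n")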
Bucket
Specifies the name of an existing bucket container that stores, organizes, and controls access to the data objects that you load to Google Cloud Storage.
Task Target Directory
For incremental load tasks, the root directory for the other directories that hold output data files, schema files, and CDC cycle contents and completed files. You can use it to specify a custom root directory for the task. If you enable the Connection Directory as Parent option, you can still optionally specify a task target directory to use with the parent directory specified in the connection properties.
This field is required if the {TaskTargetDirectory} placeholder is specified in patterns for any of the following directory fields.
Data Directory
For initial load tasks, define a directory structure for the directories where Mass Ingestion Databases stores output data files and optionally stores the schema. To define the directory pattern, you can use the following types of entries:
  • The placeholders {SchemaName}, {TableName}, {Timestamp}, {YY}, {YYYY}, {MM}, and {DD}, where {YY}, {YYYY}, {MM}, and {DD} are for date elements. The {Timestamp} values are in the format yyyymmdd_hhmissms. The generated dates and times in the directory paths indicate when the initial load job starts to transfer data to the target.
  • Specific directory names.
  • The toUpper() and toLower() functions, which force the value of the associated placeholder to uppercase or lowercase.
Placeholder values are not case sensitive.
Examples:
myDir1/{SchemaName}/{TableName}
myDir1/myDir2/{SchemaName}/{YYYY}/{MM}/{TableName}_{Timestamp}
myDir1/{toLower(SchemaName)}/{TableName}_{Timestamp}
The default directory pattern is {TableName}_{Timestamp}.
For incremental load tasks, define a custom path to the subdirectory that contains the cdc-data data files. To define the directory pattern, you can use the following types of entries:
  • The placeholders {TaskTargetDirectory}, {SchemaName}, {TableName}, {Timestamp}, {YY}, {YYYY}, {MM}, and {DD}, where {YY}, {YYYY}, {MM}, and {DD} are for date elements. The {Timestamp} values are in the format yyyymmdd_hhmissms. The generated dates and times in the directory paths indicate when the CDC cycle started.
    If you include the toUpper or toLower function, put the placeholder name in parentheses and enclose both the function and the placeholder in curly brackets, as shown in the preceding examples.
  • Specific directory names.
The default directory pattern is {TaskTargetDirectory}/cdc-data/{TableName}/data.
For Amazon S3, Flat File, and Microsoft Azure Data Lake Storage Gen2 targets, Mass Ingestion Databases uses the directory specified in the target connection properties as the root for the data directory path when Connection Directory as Parent is selected. For Google Cloud Storage targets, Mass Ingestion Databases uses the Bucket name that you specify in the target properties for the ingestion task.
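To make the pattern syntax concrete, here is a hypothetical resolver that expands the placeholders and the toUpper()/toLower() functions described above. It is a sketch of the documented behavior, not Informatica code:

    import re
    from datetime import datetime

    def resolve_pattern(pattern: str, values: dict) -> str:
        # Matches {Name} as well as {toUpper(Name)} and {toLower(Name)}.
        token = re.compile(r"\{(?:(?P<func>toUpper|toLower)\()?(?P<name>\w+)\)?\}")

        def expand(match: re.Match) -> str:
            value = values[match.group("name")]
            if match.group("func") == "toUpper":
                return value.upper()
            if match.group("func") == "toLower":
                return value.lower()
            return value

        return token.sub(expand, pattern)

    now = datetime.now()
    values = {
        "TaskTargetDirectory": "myTask",
        "SchemaName": "Sales",
        "TableName": "Orders",
        # yyyymmdd_hhmissms: date, time, and milliseconds
        "Timestamp": now.strftime("%Y%m%d_%H%M%S") + f"{now.microsecond // 1000:03d}",
        "YYYY": now.strftime("%Y"), "YY": now.strftime("%y"),
        "MM": now.strftime("%m"), "DD": now.strftime("%d"),
    }

    print(resolve_pattern("myDir1/{toLower(SchemaName)}/{TableName}_{Timestamp}", values))
    # e.g. myDir1/sales/Orders_20240101_123045123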
Schema Directory
For initial load and incremental load tasks, you can specify a custom directory in which to store the schema file if you want to store it in a directory other than the default directory. For initial loads, previously used values, if available, are shown in a drop-down list for your convenience. This field is optional.
For initial loads, the schema is stored in the data directory by default. For incremental loads, the default directory for the schema file is {TaskTargetDirectory}/cdc-data/{TableName}/schema.
You can use the same placeholders as for the Data Directory field. Ensure that you enclose placeholders in curly brackets { }.
If you include the toUpper or toLower function, put the placeholder name in parentheses and enclose both the function and the placeholder in curly brackets, for example: {toLower(SchemaName)}
A schema file is written only for output data files in CSV format. Data files in Parquet and Avro formats contain their own embedded schema.
Cycle Completion Directory
For incremental load tasks, the path to the directory that contains the cdc-cycle completed file. Default is {TaskTargetDirectory}/cdc-cycle/completed.
Cycle Contents Directory
For incremental load tasks, the path to the directory that contains the cdc-cycle contents files. Default is {TaskTargetDirectory}/cdc-cycle/contents.
Use Cycle Partitioning for Data Directory
For incremental load tasks, causes a timestamp subdirectory to be created for each CDC cycle, under each data directory.
If this option is not selected, individual data files are written to the same directory without a timestamp, unless you define an alternative directory structure.
Use Cycle Partitioning for Summary Directories
For incremental load tasks, causes a timestamp subdirectory to be created for each CDC cycle, under the summary contents and completed subdirectories.
List Individual Files in Contents
For incremental load tasks, lists individual data files under the contents subdirectory.
If Use Cycle Partitioning for Summary Directories is cleared, this option is selected by default. All of the individual files are listed in the contents subdirectory unless you configure custom subdirectories by using placeholders, such as for timestamp or date.
If Use Cycle Partitioning for Data Directory is selected, you can still optionally select this check box to list individual files and group them by CDC cycle.
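The two data-directory layouts can be sketched as follows; the cycle timestamp and file names are hypothetical examples built on the default directory patterns:

    cycle_ts = "20240101_123045123"  # hypothetical CDC cycle timestamp

    # With cycle partitioning: each cycle gets its own timestamp subdirectory.
    partitioned = f"myTask/cdc-data/ORDERS/data/{cycle_ts}/data_000.csv"

    # Without cycle partitioning: files for every cycle share one directory.
    unpartitioned = "myTask/cdc-data/ORDERS/data/data_000.csv"

    print(partitioned)
    print(unpartitioned)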
Under Advanced, you can enter the following Google Cloud Storage advanced target properties, which are primarily for incremental load jobs:
Add Operation Type
Select this check box to add a metadata column that includes the source SQL operation type in the output that the job propagates to the target.
For incremental loads, the job writes "I" for insert, "U" for update, or "D" for delete. For initial loads, the job always writes "I" for insert.
By default, this check box is cleared.
Add Operation Time
Select this check box to add a metadata column that includes the source SQL operation time in the output that the job propagates to the target.
For initial loads, the job always writes the current date and time.
By default, this check box is cleared.
Add Operation Owner
Select this check box to add a metadata column that includes the owner of the source SQL operation in the output that the job propagates to the target.
For initial loads, the job always writes "INFA" as the owner.
By default, this check box is cleared.
This property is not available for jobs that have a PostgreSQL source.
Add Operation Transaction Id
Select this check box to add a metadata column that includes the source transaction ID in the output that the job propagates to the target for SQL operations.
For initial loads, the job always writes "1" as the ID.
By default, this check box is cleared.
Add Before Images
Select this check box to include UNDO data in the output that an incremental load job writes to the target.
For initial loads, the job writes nulls.
By default, this check box is cleared.
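To visualize the effect of these check boxes, here is a hedged sketch of incremental-load output rows with the operation type, operation time, operation owner, and transaction ID metadata columns all enabled. The column order, value formats, and data are hypothetical:

    import csv
    import sys

    # Each row carries: operation type (I/U/D), operation time, operation
    # owner, transaction ID, and then the source columns.
    rows = [
        ["I", "2024-01-01 12:30:45", "APP_USER", "0009.0012.0345", "1001", "Acme Corp"],
        ["U", "2024-01-01 12:31:02", "APP_USER", "0009.0012.0360", "1001", "Acme Ltd"],
        ["D", "2024-01-01 12:35:17", "APP_USER", "0009.0012.0412", "1002", "Globex"],
    ]
    writer = csv.writer(sys.stdout, quoting=csv.QUOTE_ALL)
    writer.writerows(rows)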
