Mass Ingestion

Default directory structure for CDC files on Amazon S3, Google Cloud Storage, and Azure Data Lake Storage Gen2 targets

Database ingestion jobs create directories on Amazon S3, Google Cloud Storage, and Microsoft Azure Data Lake Storage Gen2 targets to store information about change data processing.

The following directory structure is created by default on the targets:

Bucket
└───connection_folder
    └───job_folder
        ├───Cdc-cycle
        │   ├───Completed
        │   │   ├───completed_cycle_folder
        │   │   │   └───Cycle-timestamp.csv
        │   │   │       ...
        │   │   └───completed_cycle_folder
        │   │       └───Cycle-timestamp.csv
        │   └───Contents
        │       ├───cycle_folder
        │       │   └───Cycle-contents-timestamp.csv
        │       │       ...
        │       └───cycle_folder
        │           └───Cycle-contents-timestamp.csv
        └───Cdc-data
            └───table_name
                ├───Data
                │   ├───cycle_folder
                │   │   └───table_name_timestamp.csv
                │   │        ...
                │   └───cycle_folder
                │       └───table_name_timestamp.csv
                └───Schema
                    └───V1
                        └───table_name.schema

The following table describes the directories in the default structure:

Folder	Description
`connection_folder`	Contains the Mass Ingestion Databases objects. This folder is specified in the Folder Path field of the Amazon S3 connection properties or in the Directory Path field of the Microsoft Azure Data Lake Storage Gen2 connection properties. This folder is not created for Google Cloud Storage targets.
`job_folder`	Contains job output files. This folder is specified in the Directory field on the Target page of the database ingestion task wizard.
Cdc-cycle/Completed	Contains a subfolder for each completed CDC cycle. Each cycle subfolder contains a completed cycle file.
Cdc-cycle/Contents	Contains a subfolder for each CDC cycle. Each cycle subfolder contains a cycle contents file.
Cdc-data	Contains output data files and schema files for each table.
Cdc-data/`table_name`/Schema/V1	Contains a schema file. Mass Ingestion Databases does not save a schema file in this folder if the output files use the Parquet format.
Cdc-data/`table_name`/Data	Contains a subfolder for each CDC cycle that produces output data files.
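
As an illustration of how the folders in the table combine into object paths, the following Python sketch builds the main folder paths for one table. The function name `cdc_paths` and its parameters are hypothetical and are not part of the product. The sketch assumes that path segments are joined with forward slashes and that `connection_folder` is omitted for Google Cloud Storage targets.

```python
import posixpath
from typing import Dict, Optional


def cdc_paths(job_folder: str, table_name: str,
              connection_folder: Optional[str] = None) -> Dict[str, str]:
    """Build the default CDC folder paths below the bucket (hypothetical helper)."""
    # connection_folder is not created for Google Cloud Storage targets,
    # so it is optional here.
    if connection_folder:
        root = posixpath.join(connection_folder, job_folder)
    else:
        root = job_folder
    return {
        "completed": posixpath.join(root, "Cdc-cycle", "Completed"),
        "contents": posixpath.join(root, "Cdc-cycle", "Contents"),
        "data": posixpath.join(root, "Cdc-data", table_name, "Data"),
        "schema": posixpath.join(root, "Cdc-data", table_name, "Schema", "V1"),
    }
```

For example, `cdc_paths("job1", "orders", "conn")["data"]` yields `conn/job1/Cdc-data/orders/Data`, where `job1`, `orders`, and `conn` are made-up names.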

Cycle directories

Mass Ingestion Databases uses the following pattern to name cycle directories:

[dt=]yyyy-mm-dd-hh-mm-ss

The "dt=" prefix is added to cycle folder names if you select the Add Directory Tags check box on the Target page of the database ingestion task wizard.
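
A consumer that reads cycle folders might need to turn a folder name back into a timestamp. The following Python sketch parses names that follow the pattern above, with or without the "dt=" prefix. The function name `parse_cycle_folder` is hypothetical, and the sketch assumes that the hh field uses a 24-hour clock, which this page does not state.

```python
from datetime import datetime

DT_PREFIX = "dt="


def parse_cycle_folder(name: str) -> datetime:
    """Parse a cycle folder name of the form [dt=]yyyy-mm-dd-hh-mm-ss."""
    # Strip the optional directory tag before parsing.
    if name.startswith(DT_PREFIX):
        name = name[len(DT_PREFIX):]
    # Assumption: hh is a 24-hour value, hence %H.
    return datetime.strptime(name, "%Y-%m-%d-%H-%M-%S")
```

Because the pattern is zero-padded and ordered from year to second, folder names with the same prefix setting also sort chronologically as plain strings.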

Cycle contents files

Cycle contents files are located in Cdc-cycle/Contents/cycle_folder subdirectories. Cycle contents files contain a record for each table that has had a DML event during the cycle. If no DML operations occurred on a table in the cycle, the table does not appear in the cycle contents file.

Mass Ingestion Databases uses the following pattern to name cycle contents files:

Cycle-contents-timestamp.csv

A cycle contents CSV file contains the following information:

Table name

Cycle name

Path to the cycle folder for the table

Start sequence for the table

End sequence for the table

Number of Insert operations

Number of Update operations

Number of Delete operations

Schema version

Path to the schema file for the schema version

If the output data files use the Parquet format, Mass Ingestion Databases does not save a schema file at the path that is specified in the cycle contents file. Instead, use the schema file in the folder that is specified in the Avro Schema Directory field on the Target page of the database ingestion task wizard.
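
The following Python sketch shows one way to read a cycle contents file into typed records. It assumes that the columns appear in the order listed above, that the file has no header row, and that the operation counts are integers; this page confirms none of these details, so check them against a real file. The names `CycleContentsRecord` and `read_cycle_contents` are hypothetical.

```python
import csv
import io
from typing import List, NamedTuple


class CycleContentsRecord(NamedTuple):
    table_name: str
    cycle_name: str
    cycle_folder_path: str
    start_sequence: str   # kept as text; the sequence format is not documented here
    end_sequence: str
    inserts: int
    updates: int
    deletes: int
    schema_version: str
    schema_path: str


def read_cycle_contents(text: str) -> List[CycleContentsRecord]:
    """Parse cycle contents CSV text (assumed column order, no header row)."""
    records = []
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue  # skip blank lines
        records.append(CycleContentsRecord(
            row[0], row[1], row[2], row[3], row[4],
            int(row[5]), int(row[6]), int(row[7]),
            row[8], row[9],
        ))
    return records
```

A table that had no DML operations in the cycle does not appear in the file, so a missing table name in the result means that no changes occurred for that table during the cycle.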

Completed cycle files

Completed cycle files are located in Cdc-cycle/Completed/completed_cycle_folder subdirectories. A database ingestion job creates a cycle file in this subdirectory after a cycle completes. If this file is not present, the cycle has not completed yet.

Mass Ingestion Databases uses the following pattern to name completed cycle files:

Cycle-timestamp.csv

A completed cycle CSV file contains the following information:

Cycle name

Cycle start time

Cycle end time

Current sequence number at the time the cycle ended

Path to the cycle contents file

Reason for the end of cycle

Valid reason values are:

NORMAL_COMMIT. A commit operation was encountered after the cycle had reached the DML limit or the end of the cycle interval. A cycle can end only on a commit boundary.

NORMAL_EXPIRY. The cycle ended because the cycle interval expired. The last operation was a commit.
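
Because a completed cycle file exists only after a cycle completes, a downstream process can use its presence as a readiness signal. The following Python sketch scans a list of object keys, obtained by whatever listing call your storage client provides, and returns the names of the completed cycle folders that contain a completed cycle file. The function name `completed_cycle_folders` is hypothetical, and the sketch assumes that keys use forward slashes as separators.

```python
from typing import Iterable, Set

COMPLETED_MARKER = "Cdc-cycle/Completed/"


def completed_cycle_folders(object_keys: Iterable[str]) -> Set[str]:
    """Return completed_cycle_folder names that hold a Cycle-timestamp.csv file."""
    done = set()
    for key in object_keys:
        idx = key.find(COMPLETED_MARKER)
        if idx == -1:
            continue  # not under Cdc-cycle/Completed
        # Expect exactly: completed_cycle_folder/Cycle-timestamp.csv
        parts = key[idx + len(COMPLETED_MARKER):].split("/")
        if (len(parts) == 2 and parts[1].startswith("Cycle-")
                and parts[1].endswith(".csv")):
            done.add(parts[0])
    return done
```

Cycle contents files are ignored even though their names also begin with "Cycle-", because they are stored under Cdc-cycle/Contents rather than Cdc-cycle/Completed.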

Output data files

The data files contain records that include the following information:

Operation type. Valid values are:

I for Insert operations

U for Update operations

D for Delete operations

Sortable sequence number

Data columns

Insert and Delete records contain only after images. Update records contain both before and after images.
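
As a rough illustration, the following Python sketch counts the records of each operation type in the text of a CSV output data file. It assumes that the operation type is the first field of each record and that the file has no header row; this page does not specify the column layout, so adjust `op_column` after you inspect a real file. The function name `count_operations` is hypothetical.

```python
import csv
import io
from collections import Counter

OPERATION_NAMES = {"I": "Insert", "U": "Update", "D": "Delete"}


def count_operations(text: str, op_column: int = 0) -> Counter:
    """Count Insert, Update, and Delete records in CSV data file text."""
    counts = Counter()
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue  # skip blank lines
        op = row[op_column]
        if op not in OPERATION_NAMES:
            # Fail loudly rather than silently miscount an unexpected layout.
            raise ValueError(f"Unexpected operation type: {op!r}")
        counts[OPERATION_NAMES[op]] += 1
    return counts
```

The totals for a table can then be compared with the Insert, Update, and Delete counts that the cycle contents file reports for the same cycle.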
