Default directory structure for CDC files on Amazon S3, Google Cloud Storage, and Azure Data Lake Storage Gen2 targets


Database ingestion jobs create directories on Amazon S3, Google Cloud Storage, and Microsoft Azure Data Lake Storage Gen2 targets to store information about change data processing.
The following directory structure is created by default on the targets:
Bucket
└───connection_folder
    └───job_folder
        ├───Cdc-cycle
        │   ├───Completed
        │   │   ├───completed_cycle_folder
        │   │   │   └───Cycle-timestamp.csv
        │   │   │   ...
        │   │   └───completed_cycle_folder
        │   │       └───Cycle-timestamp.csv
        │   └───Contents
        │       ├───cycle_folder
        │       │   └───Cycle-contents-timestamp.csv
        │       │   ...
        │       └───cycle_folder
        │           └───Cycle-contents-timestamp.csv
        └───Cdc-data
            └───table_name
                ├───Data
                │   ├───cycle_folder
                │   │   └───table_name_timestamp.csv
                │   │   ...
                │   └───cycle_folder
                │       └───table_name_timestamp.csv
                └───Schema
                    └───V1
                        └───table_name.schema
The following table describes the directories in the default structure:
| Folder | Description |
|---|---|
| connection_folder | Contains the Mass Ingestion Databases objects. This folder is specified in the Folder Path field of the Amazon S3 connection properties or in the Directory Path field of the Microsoft Azure Data Lake Storage Gen2 connection properties. This folder is not created for Google Cloud Storage targets. |
| job_folder | Contains job output files. This folder is specified in the Directory field on the Target page of the database ingestion task wizard. |
| Cdc-cycle/Completed | Contains a subfolder for each completed CDC cycle. Each cycle subfolder contains a completed cycle file. |
| Cdc-cycle/Contents | Contains a subfolder for each CDC cycle. Each cycle subfolder contains a cycle contents file. |
| Cdc-data | Contains output data files and schema files for each table. |
| Cdc-data/table_name/Schema/V1 | Contains a schema file. Mass Ingestion Databases does not save a schema file in this folder if the output files use the Parquet format. |
| Cdc-data/table_name/Data | Contains a subfolder for each CDC cycle that produces output data files. |
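
As an illustration of the layout described above, the following minimal Python sketch builds the default folder prefixes for one table. The function name `cdc_paths` and the sample folder names are hypothetical and are not part of the product. The connection folder is optional because it is not created for Google Cloud Storage targets.

```python
from posixpath import join


def cdc_paths(job_folder, table_name, connection_folder=None):
    """Return the default CDC folder prefixes for one table.

    Hypothetical helper: it only mirrors the documented directory
    layout and does not call any storage API.
    """
    # Skip the connection folder when it is absent (for example, on
    # Google Cloud Storage targets).
    parts = [p for p in (connection_folder, job_folder) if p]
    root = join(*parts)
    return {
        "completed": join(root, "Cdc-cycle", "Completed"),
        "contents": join(root, "Cdc-cycle", "Contents"),
        "data": join(root, "Cdc-data", table_name, "Data"),
        "schema": join(root, "Cdc-data", table_name, "Schema", "V1"),
    }
```

For example, `cdc_paths("job1", "ORDERS", connection_folder="conn")["data"]` yields `conn/job1/Cdc-data/ORDERS/Data`.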

Cycle directories

Mass Ingestion Databases uses the following pattern to name cycle directories:
[dt=]yyyy-mm-dd-hh-mm-ss
The "dt=" prefix is added to cycle folder names if you select the Add Directory Tags check box on the Target page of the database ingestion task wizard.
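
A downstream consumer might need to turn a cycle folder name back into a timestamp. The following Python sketch parses the naming pattern with or without the "dt=" prefix. The function name `parse_cycle_folder` is hypothetical, and the sketch assumes that the hh component is a 24-hour value, which the pattern does not state explicitly.

```python
from datetime import datetime


def parse_cycle_folder(name):
    """Parse a cycle folder name such as 'dt=2024-01-31-23-59-58'.

    Returns a (timestamp, has_directory_tag) tuple.
    Assumption: hh is a 24-hour value.
    """
    has_tag = name.startswith("dt=")
    stamp = name[3:] if has_tag else name
    # strptime raises ValueError if the name does not match the pattern.
    return datetime.strptime(stamp, "%Y-%m-%d-%H-%M-%S"), has_tag
```

Because the components run from year down to second with fixed widths, folder names that follow this pattern also sort chronologically as plain strings.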

Cycle contents files

Cycle contents files are located in Cdc-cycle/Contents/cycle_folder subdirectories. Cycle contents files contain a record for each table that has had a DML event during the cycle. If no DML operations occurred on a table in the cycle, the table does not appear in the cycle contents file.
Mass Ingestion Databases uses the following pattern to name cycle contents files:
Cycle-contents-timestamp.csv
A cycle contents CSV file contains the following information:
  • Table name
  • Cycle name
  • Path to the cycle folder for the table
  • Start sequence for the table
  • End sequence for the table
  • Number of Insert operations
  • Number of Update operations
  • Number of Delete operations
  • Schema version
  • Path to the schema file for the schema version
    If the output data files use the Parquet format, Mass Ingestion Databases does not save a schema file at the path that is specified in the cycle contents file. Instead, use the schema file in the folder that is specified in the Avro Schema Directory field on the Target page of the database ingestion task wizard.
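
The following Python sketch shows one way to read a cycle contents file into typed records. It rests on assumptions that the list above does not confirm: the fields appear in the listed order, the file is comma-delimited, and the three operation counts are plain integers. The names `CycleContentsRecord` and `read_cycle_contents` are hypothetical. Check the layout against a real file before relying on it.

```python
import csv
from typing import NamedTuple


class CycleContentsRecord(NamedTuple):
    # Field order is an assumption taken from the documented list.
    table_name: str
    cycle_name: str
    cycle_folder_path: str
    start_sequence: str
    end_sequence: str
    inserts: int
    updates: int
    deletes: int
    schema_version: str
    schema_path: str


def read_cycle_contents(lines):
    """Parse cycle contents rows from an iterable of text lines."""
    records = []
    for row in csv.reader(lines):
        # Skip blank rows and rows whose counts are not numeric
        # (for example, a header row, if one is present).
        if len(row) != 10 or not all(c.isdigit() for c in row[5:8]):
            continue
        counts = [int(c) for c in row[5:8]]
        records.append(CycleContentsRecord(*row[:5], *counts, *row[8:]))
    return records
```

An open file object can be passed directly as `lines`, because `csv.reader` accepts any iterable of strings.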

Completed cycle files

Completed cycle files are located in Cdc-cycle/Completed/completed_cycle_folder subdirectories. A database ingestion job creates a cycle file in this subdirectory after a cycle completes. If this file is not present, the cycle has not completed yet.
Mass Ingestion Databases uses the following pattern to name completed cycle files:
Cycle-timestamp.csv
A completed cycle CSV file contains the following information:
  • Cycle name
  • Cycle start time
  • Cycle end time
  • Current sequence number at the time the cycle ended
  • Path to the cycle contents file
  • Reason for the end of cycle
    Valid reason values are:
    • NORMAL_COMMIT. A commit operation was encountered after the cycle had reached the DML limit or the end of the cycle interval. A cycle can end only on a commit boundary.
    • NORMAL_EXPIRY. The cycle ended because the cycle interval expired. The last operation was a commit.
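
Because a missing completed cycle file means that the cycle has not completed, a consumer can use the file as a readiness marker. The following Python sketch scans a list of object keys, obtained by whatever listing call your storage client provides, and returns the completed cycle folders that contain a Cycle-timestamp.csv file. The function names and the sample keys are hypothetical, and the timestamp format inside the file name is not assumed.

```python
import re

# Matches .../Cdc-cycle/Completed/<completed_cycle_folder>/Cycle-<timestamp>.csv
_COMPLETED = re.compile(
    r"(?:^|/)Cdc-cycle/Completed/([^/]+)/Cycle-[^/]+\.csv$"
)


def completed_cycles(object_keys):
    """Return the set of completed cycle folder names found in the keys."""
    found = set()
    for key in object_keys:
        match = _COMPLETED.search(key)
        if match:
            found.add(match.group(1))
    return found


def is_cycle_complete(object_keys, completed_cycle_folder):
    """Return True if a completed cycle file exists for the given folder."""
    return completed_cycle_folder in completed_cycles(object_keys)
```

Cycle contents files are not matched by mistake, because the pattern requires the Cdc-cycle/Completed prefix.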

Output data files

The data files contain records that include the following information:
  • Operation type. Valid values are:
    • I for Insert operations
    • U for Update operations
    • D for Delete operations
  • Sortable sequence number
  • Data columns
    Insert and Delete records contain only after images. Update records contain both before and after images.
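
To illustrate how the operation type and the sortable sequence number might be used together, the following Python sketch replays change records against an in-memory copy of a table. It assumes that each record has already been parsed into a dictionary with the hypothetical keys `op`, `seq`, `key`, and `after`. The physical column layout of the data files, including how before and after images are arranged, is not described here and is not assumed.

```python
def apply_changes(state, records):
    """Apply change records to `state`, a dict keyed by primary key.

    Each record is a dict with hypothetical keys:
      op    - "I", "U", or "D"
      seq   - sortable sequence number
      key   - primary key value chosen by the caller
      after - after-image of the row (ignored for deletes)
    """
    # Replay in sequence order so that later changes win.
    for rec in sorted(records, key=lambda r: r["seq"]):
        op, key = rec["op"], rec["key"]
        if op in ("I", "U"):
            state[key] = rec["after"]
        elif op == "D":
            state.pop(key, None)
        else:
            raise ValueError(f"unknown operation type: {op!r}")
    return state
```

This sketch does not handle updates that change the primary key. Handling them would require the before image to locate the old row.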
