Default directory structure of CDC files on Amazon S3, Google Cloud Storage, and Microsoft Azure Data Lake Storage Gen2 targets

Application ingestion jobs create directories on Amazon S3, Google Cloud Storage, and Microsoft Azure Data Lake Storage Gen2 targets to store information about change data processing.
The following directory structure is created by default on the targets:
Bucket
└───connection_folder
    └───job_folder
        ├───Cdc-cycle
        │   ├───Completed
        │   │   ├───completed_cycle_folder
        │   │   │   └───Cycle-timestamp.csv
        │   │   │   ...
        │   │   └───completed_cycle_folder
        │   │       └───Cycle-timestamp.csv
        │   └───Contents
        │       ├───cycle_folder
        │       │   └───Cycle-contents-timestamp.csv
        │       │   ...
        │       └───cycle_folder
        │           └───Cycle-contents-timestamp.csv
        └───Cdc-data
            └───object_name
                ├───Data
                │   ├───cycle_folder
                │   │   └───object_name_timestamp.csv
                │   │   ...
                │   └───cycle_folder
                │       └───object_name_timestamp.csv
                └───Schema
                    └───V1
                        └───object_name.schema
The following table describes the directories in the default structure:
| Folder | Description |
| --- | --- |
| connection_folder | Contains the Mass Ingestion Applications objects. This folder is specified in the Folder Path field of the Amazon S3 connection properties or in the Directory Path field of the Microsoft Azure Data Lake Storage Gen2 connection properties. This folder is not created for Google Cloud Storage targets. |
| job_folder | Contains job output files. This folder is specified in the Directory field on the Target page of the application ingestion task wizard. |
| Cdc-cycle/Completed | Contains a subfolder for each completed CDC cycle. Each cycle subfolder contains a completed cycle file. |
| Cdc-cycle/Contents | Contains a subfolder for each CDC cycle. Each cycle subfolder contains a cycle contents file. |
| Cdc-data | Contains output data files and schema files for each object. |
| Cdc-data/object_name/Schema/V1 | Contains a schema file. Mass Ingestion Applications does not save a schema file in this folder if the output files use the Parquet format. |
| Cdc-data/object_name/Data | Contains a subfolder for each CDC cycle that produces output data files. |
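As a rough illustration of how the folders in the table fit together, the following Python sketch builds the key prefixes for one object. The function name cdc_prefixes and the sample folder names are hypothetical and are not part of the product. The sketch assumes "/" as the path separator and treats an empty connection folder as the Google Cloud Storage case, where that folder is not created.

```python
from posixpath import join
from typing import Dict


def cdc_prefixes(job_folder: str, object_name: str,
                 connection_folder: str = "") -> Dict[str, str]:
    """Return the default CDC folder prefixes for one object.

    Hypothetical helper: the folder names come from the default
    directory structure; the function itself is only an illustration.
    """
    # Google Cloud Storage targets have no connection folder,
    # so the job folder sits directly under the bucket.
    if connection_folder:
        root = join(connection_folder, job_folder)
    else:
        root = job_folder
    return {
        "completed": join(root, "Cdc-cycle", "Completed"),
        "contents": join(root, "Cdc-cycle", "Contents"),
        "data": join(root, "Cdc-data", object_name, "Data"),
        "schema": join(root, "Cdc-data", object_name, "Schema", "V1"),
    }


if __name__ == "__main__":
    # Sample names are made up for the example.
    for name, prefix in cdc_prefixes("orders_job", "Account", "landing").items():
        print(name, prefix)
```

The returned prefixes can be passed to whatever listing call the storage client provides. The sketch deliberately makes no storage API calls.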

Cycle directories

Mass Ingestion Applications uses the following pattern to name cycle directories:
[dt=]yyyy-mm-dd-hh-mm-ss
The "dt=" prefix is added to cycle folder names if you select the Add Directory Tags check box on the Target page of the application ingestion task wizard.
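A downstream consumer might need to turn a cycle folder name back into a timestamp, for example to process cycles in order. The following Python sketch parses and formats names that follow the pattern above. The function names are hypothetical, and the sketch assumes that every component is zero-padded and that the hour uses a 24-hour clock, which the pattern does not state explicitly.

```python
from datetime import datetime

DT_PREFIX = "dt="
# Assumption: zero-padded fields and a 24-hour clock.
CYCLE_FORMAT = "%Y-%m-%d-%H-%M-%S"


def parse_cycle_folder(name: str) -> datetime:
    """Parse a cycle folder name, with or without the 'dt=' prefix."""
    if name.startswith(DT_PREFIX):
        name = name[len(DT_PREFIX):]
    return datetime.strptime(name, CYCLE_FORMAT)


def format_cycle_folder(ts: datetime, add_directory_tags: bool = False) -> str:
    """Build a cycle folder name; the prefix mirrors Add Directory Tags."""
    prefix = DT_PREFIX if add_directory_tags else ""
    return prefix + ts.strftime(CYCLE_FORMAT)


if __name__ == "__main__":
    # Hypothetical folder names, sorted by the time they encode.
    folders = ["dt=2024-02-01-00-00-00", "dt=2024-01-31-23-59-59"]
    print(sorted(folders, key=parse_cycle_folder))
```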

Cycle contents files

Cycle contents files are located in Cdc-cycle/Contents/cycle_folder subdirectories. Cycle contents files contain a record for each object that has had a DML event during the cycle. If no DML operations occurred on an object in the cycle, the object does not appear in the cycle contents file.
Mass Ingestion Applications uses the following pattern to name cycle contents files:
Cycle-contents-timestamp.csv
A cycle contents CSV file contains the following information:
  • Object name
  • Cycle name
  • Path to the cycle folder for the object
  • Start sequence for the object
  • End sequence for the object
  • Number of Insert operations
  • Number of Update operations
  • Number of Delete operations
  • Schema version
  • Path to the schema file for the schema version
    If the output data files use the Parquet format, Mass Ingestion Applications does not save a schema file at the path that is specified in the cycle contents file. Instead, use the schema file in the folder that is specified in the Avro Schema Directory field on the Target page of the application ingestion task wizard.
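The fields listed above can be read into a typed record, as in the following Python sketch. It rests on assumptions that the documentation does not confirm: that the columns appear in the listed order, that the file has no header row, and that the three operation counts are plain integers. The class name, function name, and sample row are hypothetical.

```python
import csv
import io
from dataclasses import dataclass
from typing import List


@dataclass
class CycleContentsRecord:
    # Field order follows the list above (an assumption about column order).
    object_name: str
    cycle_name: str
    cycle_folder_path: str
    start_sequence: str
    end_sequence: str
    inserts: int
    updates: int
    deletes: int
    schema_version: str
    schema_file_path: str

    @property
    def total_changes(self) -> int:
        return self.inserts + self.updates + self.deletes


def read_cycle_contents(text: str) -> List[CycleContentsRecord]:
    """Parse the text of a cycle contents file (assumed to have no header)."""
    records = []
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue  # skip blank lines
        obj, cycle, folder, start, end, ins, upd, dele, version, schema = row
        records.append(CycleContentsRecord(
            obj, cycle, folder, start, end,
            int(ins), int(upd), int(dele), version, schema))
    return records


if __name__ == "__main__":
    # Made-up sample row for illustration only.
    sample = ("Account,2024-01-31-23-59-59,"
              "job/Cdc-data/Account/Data/2024-01-31-23-59-59,"
              "100,250,3,2,1,1,"
              "job/Cdc-data/Account/Schema/V1/Account.schema\n")
    for rec in read_cycle_contents(sample):
        print(rec.object_name, rec.total_changes)
```

If the real file includes a header row or a different column order, only the unpacking line in read_cycle_contents needs to change.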

Completed cycle files

Completed cycle files are located in Cdc-cycle/Completed/completed_cycle_folder subdirectories. An application ingestion job creates a cycle file in this subdirectory after a cycle completes. If this file is not present, the cycle has not completed yet.
Mass Ingestion Applications uses the following pattern to name completed cycle files:
Cycle-timestamp.csv
A completed cycle CSV file contains the following information:
  • Cycle name
  • Cycle start time
  • Cycle end time
  • Current sequence number at the time the cycle ended
  • Path to the cycle contents file
  • Reason for the end of cycle
    Valid reason values are:
    • NORMAL_COMMIT. A commit operation was encountered after the cycle had reached the DML limit or the end of the cycle interval. A cycle can end only on a commit boundary.
    • NORMAL_EXPIRY. The cycle ended because the cycle interval expired. The last operation was a commit.
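Because a completed cycle file carries the path to its cycle contents file, a consumer can read the completed cycle file first and follow that path only for finished cycles. The following Python sketch parses one completed cycle record and checks the end reason against the two documented values. It assumes the columns appear in the listed order with no header row. The class name, function name, and sample row are hypothetical.

```python
import csv
import io
from dataclasses import dataclass

# The two end-of-cycle reasons listed above.
VALID_END_REASONS = {"NORMAL_COMMIT", "NORMAL_EXPIRY"}


@dataclass
class CompletedCycle:
    # Field order follows the list above (an assumption about column order).
    cycle_name: str
    start_time: str
    end_time: str
    current_sequence: str
    contents_file_path: str
    end_reason: str


def parse_completed_cycle(text: str) -> CompletedCycle:
    """Parse the first record of a completed cycle file (no header assumed)."""
    row = next(csv.reader(io.StringIO(text)))
    if len(row) != 6:
        raise ValueError(f"expected 6 fields, got {len(row)}")
    cycle = CompletedCycle(*row)
    if cycle.end_reason not in VALID_END_REASONS:
        raise ValueError(f"unexpected end reason: {cycle.end_reason}")
    return cycle


if __name__ == "__main__":
    # Made-up sample row for illustration only.
    sample = ("2024-01-31-23-59-59,2024-01-31T23:44:59,2024-01-31T23:59:59,"
              "250,job/Cdc-cycle/Contents/2024-01-31-23-59-59/"
              "Cycle-contents-1706745599.csv,NORMAL_EXPIRY\n")
    done = parse_completed_cycle(sample)
    print(done.cycle_name, done.end_reason, done.contents_file_path)
```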

Output data files

The data files contain records that include the following information:
  • Operation type. Valid values are:
    • I for Insert operations
    • U for Update operations
    • D for Delete operations
  • Sortable sequence number
  • Data fields
    Insert and Delete records contain only after images. Update records contain both before and after images.
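A consumer typically orders change records by sequence number before applying them, and can compare the per-operation counts with the Insert, Update, and Delete counts in the cycle contents file. The following Python sketch shows one way to do that. It assumes a CSV layout that the documentation does not confirm: operation type in the first column, sequence number in the second, data fields after that, and no header row. It also assumes the sequence values sort correctly as strings. The function names and sample rows are hypothetical.

```python
import csv
import io
from collections import Counter
from typing import List, Tuple

VALID_OPERATIONS = {"I", "U", "D"}

# (operation type, sequence number, data fields)
Change = Tuple[str, str, List[str]]


def read_changes(text: str) -> List[Change]:
    """Parse change records and return them ordered by sequence number."""
    changes: List[Change] = []
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue  # skip blank lines
        op, sequence, *fields = row
        if op not in VALID_OPERATIONS:
            raise ValueError(f"unknown operation type: {op}")
        changes.append((op, sequence, fields))
    # Assumption: sequence numbers compare correctly as strings.
    changes.sort(key=lambda change: change[1])
    return changes


def count_operations(changes: List[Change]) -> Counter:
    """Count records per operation type (I, U, D)."""
    return Counter(op for op, _, _ in changes)


if __name__ == "__main__":
    # Made-up sample rows for illustration only.
    sample = "U,0003,42,old,42,new\nI,0001,42,old\nD,0002,7,gone\n"
    ordered = read_changes(sample)
    print([op for op, _, _ in ordered])
    print(dict(count_operations(ordered)))
```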
