Mass Ingestion

Guidelines for Amazon S3, Google Cloud Storage, and Microsoft Azure Data Lake Storage Gen2 targets

Consider the following guidelines when you use Amazon S3, Google Cloud Storage, and Microsoft Azure Data Lake Storage Gen2 targets:
  • When you configure an application ingestion task for an Amazon S3, Google Cloud Storage, or Microsoft Azure Data Lake Storage Gen2 target, you can select either CSV or AVRO as the format for the output files that contain the source data to be applied to the target.
  • If you select CSV as the output file format, Mass Ingestion Applications creates the following files on the target for each source object:
    • A schema.ini file that describes the schema of the object. The file also includes some settings for the output file on the target.
    • Output files that contain the data stored in the source object. Mass Ingestion Applications names the output files based on the name of the source object with an appended date and time.
    The schema.ini file lists the sequence of columns for the rows in the corresponding output file. The following table describes the columns in the schema.ini file:

    Column                  Description
    ColNameHeader           Indicates whether the source data files include column headers.
    Format                  Format of the output files. Mass Ingestion Applications uses a comma (,) to delimit column values.
    CharacterSet            Character set that is used for the corresponding output file. By default, Mass Ingestion Applications generates the files in the UTF-8 character set.
    COL<sequence_number>    Name and data type of the source field.

    For a sample schema.ini file, see the example after this list.
    • You must not edit the schema.ini file.
    • If you select the Add Before Images check box in the Advanced section of the Target page, the application ingestion job creates a column_name_OLD column to store the UNDO data and a column_name_NEW column to store the REDO data for each source field.
  • If you select AVRO as the output file format, you can specify the serialization format of the Avro output file and write the output data in uncompressed Parquet format. Additionally, you can specify the file compression type, the Avro data compression type, and the directory where the Avro schema definitions that are generated for each source object are stored. For an example of an Avro schema definition, see the sketch after this list.
  • For application ingestion tasks configured for Microsoft Azure Data Lake Storage Gen2 targets, Mass Ingestion Applications creates an empty directory on the target for each empty source object.
  • For Amazon S3 targets, if you do not specify an access key and secret key in the connection properties, Mass Ingestion Applications tries to find the AWS credentials by using the default credential provider chain that is implemented by the DefaultAWSCredentialsProviderChain class. For more information, see the Amazon Web Services documentation. For a sketch of the credential sources that the chain checks, see the example after this list.
  • When an incremental load job that is configured for a target that uses the CSV output format propagates an Update operation that changed primary key values on the source, the job performs a Delete operation on the associated target row and then performs an Insert operation on the same row to replicate the change made to the source object. The Delete operation writes the before image to the target and the subsequent Insert operation writes the after image to the target. For a worked example, see the sketch after this list.
    For Update operations that do not change primary key values, application ingestion jobs process each Update operation as a single operation and write only the after image to the target.
    If a source object does not contain a primary key, Mass Ingestion Applications considers all fields of the object to be part of the primary key. In such scenarios, Mass Ingestion Applications processes each Update operation performed on the source as a Delete operation followed by an Insert operation on the target.
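
The following is a minimal sketch of a schema.ini file for a hypothetical source object named ACCOUNT with two fields, ID and NAME. The section name, entry values, and column list are illustrative only; the actual file that Mass Ingestion Applications generates depends on the source object and the task configuration, and you must not edit the generated file.

    [ACCOUNT_20240115_103000.csv]
    ColNameHeader=False
    Format=Delimited(,)
    CharacterSet=UTF-8
    COL1=ID Text
    COL2=NAME Text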
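
Avro schema definitions use the standard Avro schema (.avsc) JSON format. The following is a minimal sketch of such a schema for the same hypothetical ACCOUNT object; the actual record name, field names, field types, and any additional metadata fields depend on the source object and the task configuration:

    {
      "type": "record",
      "name": "ACCOUNT",
      "fields": [
        {"name": "ID", "type": ["null", "string"], "default": null},
        {"name": "NAME", "type": ["null", "string"], "default": null}
      ]
    }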
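
When no access key and secret key are set in the Amazon S3 connection, the default credential provider chain looks for credentials in the standard AWS locations, such as the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables, the aws.accessKeyId and aws.secretKey Java system properties, the shared credentials file, and instance or container profile credentials. For example, a shared credentials file at ~/.aws/credentials with a default profile similar to the following sketch can be picked up by the chain; the key values are placeholders:

    [default]
    aws_access_key_id = <your_access_key>
    aws_secret_access_key = <your_secret_key>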
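
As a worked example of how an incremental load job replicates an Update that changes a primary key value to a CSV target, assume a hypothetical source object with a primary key field ID whose value changes from 100 to 200 in a row that also has a NAME field. The layout below is illustrative only and omits any metadata columns that the output files might contain:

    Operation    ID     NAME
    Delete       100    Acme Corp     (before image of the row)
    Insert       200    Acme Corp     (after image of the row)

An Update on the same row that leaves the ID value unchanged would instead be written as a single operation that carries only the after image.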
