

Mass Ingestion

Configuring the source
Configure the source on the Source page of the database ingestion task wizard.

For MongoDB sources only, the task wizard displays database instead of schema and displays collection instead of table. However, for simplicity, the terms schema and table are used in this documentation to cover all source types.
  1. In the Connection list, select the connection for the source system.
    The connection must be predefined in Administrator for a runtime environment that your organization uses.
    The list includes only the connection types that are valid for the load type selected on the Definition page. No connections are listed if you did not select a load type.
    If you change the load type and the selected connection is no longer valid, a warning message is issued and the Connection field is cleared. You must select another connection that is valid for the updated load type.
    After you deploy the ingestion task, you cannot change the connection without first undeploying the associated ingestion job. After you change the connection, you must deploy the task again.
  2. In the Schema list, select the source schema that includes the source tables.
    The list includes only the schemas that are available in the database that is accessed with the specified source connection. When you create a task that has an Oracle, Microsoft SQL Server, Netezza, or PostgreSQL source, the schema name that is specified in the connection properties is displayed by default.
    An expanded view of the Table Selection area appears.
  3. If you are defining a Db2 for i source for an incremental load task, in the Journal Name field, select the name of the journal that records the changes made to the source tables.
  4. If you are defining a PostgreSQL source for an incremental load task, specify the following fields:
    Replication Slot Name
    Specify the unique name of a PostgreSQL replication slot. A slot name can contain lowercase Latin alphanumeric characters and the underscore (_) character. The maximum length is 63 characters. Each database ingestion task must use a different replication slot.
    Replication Plugin
    Select a PostgreSQL replication plugin. Options are:
    • pgoutput. You can select this option only for PostgreSQL version 10 and later.
    • wal2json
    Publication
    If you selected pgoutput as the replication plugin, specify the publication name that the plugin uses. This field is not displayed if you selected wal2json as the replication plugin.
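    For context, the replication slot and publication are ordinary PostgreSQL objects that a DBA can create and inspect with standard SQL. The following is a minimal sketch, assuming PostgreSQL 10 or later; the names my_slot and my_pub are hypothetical examples, not values that the wizard requires.

      -- Minimal sketch, run on the source database (not in the task wizard).
      -- The slot and publication names below are hypothetical examples.

      -- Create a logical replication slot that uses the pgoutput plugin.
      SELECT pg_create_logical_replication_slot('my_slot', 'pgoutput');

      -- Create a publication for the pgoutput plugin to use.
      CREATE PUBLICATION my_pub FOR ALL TABLES;

      -- Verify the slot; it remains inactive until a job connects to it.
      SELECT slot_name, plugin, active FROM pg_replication_slots;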
  5. Under Table Selection, select one of the following options to indicate whether you want to select all tables in the specified schema or create table selection rules to define a subset of the source tables:
    • Select All. Selects all tables in the schema for data replication.
    • Rule-based Selection. Enables you to define rules to select only the tables that you want to replicate. You must also be in this mode to assign actions to columns in any selected table.
    The default is Rule-based Selection.
    If you switch to Select All and then switch back to Rule-based Selection, none of the rules that you previously defined are reinstated and all tables in the schema are listed. You can add new rules to filter the list.
  6. If you selected Rule-based Selection, create rules to select the tables that you want to replicate.
    By default, the Rules list contains a single Include rule with a condition that specifies only the asterisk (*) wildcard character. This rule selects all tables in the specified source schema.
    To create a rule, you can either use the entry boxes under Create Rule or click the Add Rule (+) icon and enter the rule within the Table Rule list.
    To create a rule under Create Rule:
    1. Select Table Selection as the general rule type.
    2. In the adjacent drop-down list, select Include or Exclude to create an inclusion or exclusion rule, respectively.
    3. In the condition field, enter a table name or a table-name mask that includes one or more wildcards to identify the source tables to include in or exclude from table selection. Use the following guidelines:
      • A mask can contain one or both of the following wildcards: an asterisk (*) to represent one or more characters and a question mark (?) to represent a single character. A wildcard can occur multiple times in a mask value and can occur anywhere in the value. For an illustration of how masks match table names, see the sketch after this step.
      • The task wizard is case sensitive. Enter the table names or masks in the case with which the tables were defined.
      • Do not include delimiters such as quotation marks or brackets, even if the source database uses them. For example, Oracle requires quotation marks around lowercase and mixed-case names to force the names to be stored in lowercase or mixed case. However, in the task wizard, you must enter the lowercase or mixed-case names without quotation marks.
      • If a table name includes special characters such as a backslash (\), asterisk (*), dollar sign ($), caret (^), or question mark (?), escape each special character with a backslash (\) when you enter the rule.
    4. Click Add Rule.
      The rule appears in the Rules list.
      The rules are processed in the order in which they are listed, from top to bottom. Use the arrow icons to change the order. For an example of using multiple rules, see Example of rules for selecting source tables.
      Tip: Click the Refresh icon next to the Total Tables Selected field to refresh the total number of tables that match all rules and to display the number of source tables that match each rule in the Tables Affected column.
    5. To preview the tables to be selected based on all rules, click Preview Selection.
      The tables are listed on the Selected Tables tab. The list shows the table names and column counts.
      Tip: Click the Refresh icon next to the Updated date to refresh the total tables count and the list of tables based on the current rules. You can check the results of new rules in this manner. Click the Settings icon to control the line spacing in the list of tables, from Comfortable (most spacing) to Compact (least spacing).
    6. To refine the table selection, you can switch back to the Rules tab to edit the rules or define additional rules. Then preview the table selection again.
    Mass Ingestion Databases might exclude an unsupported type of table from processing even if the table matches the selection rules.
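    To make the wildcard matching concrete: the asterisk (*) and question mark (?) behave like the SQL LIKE wildcards % and _. The following query is a hypothetical illustration only; the wizard does not run it, and the schema name MYSCHEMA and the mask SALES_* are invented for the example.

      -- Hypothetical illustration: an Include rule with the mask SALES_*
      -- selects approximately the tables that this query returns. The
      -- literal underscore in the mask must be escaped in LIKE because
      -- _ is itself a LIKE wildcard.
      SELECT table_name
      FROM information_schema.tables
      WHERE table_schema = 'MYSCHEMA'
        AND table_name LIKE 'SALES\_%' ESCAPE '\';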
  7. If you selected Select All, you can review the tables in the specified schema that will be replicated.
    A list of all tables in the schema appears, which shows the number of columns in each table. The Tables Selected field displays the total number of tables selected.
    Click the Refresh icon to refresh the list to reflect any added or dropped tables in the schema in the source database. Click the Settings icon to change the line spacing in the list of tables.
    If you switch back to Rule-based Selection, the list of tables remains displayed, but you can define rules to filter the list of tables.
  8. To perform trim actions on columns in source tables that were selected based on rules, create column action rules. You cannot create column action rules for MongoDB sources.
    To create a column action rule:
    1. Ensure that the Rule-based Selection option is selected.
    2. Under Create Rule, select Column Action.
    3. In the adjacent list, select one of the following action types:
      • LTRIM. Trims spaces to the left of character column values.
      • RTRIM. Trims spaces to the right of character column values.
      • TRIM. Trims spaces to the left of and to the right of character column values.
    4. In the condition field, enter a column name or a column-name mask that includes one or more asterisk (*) or question mark (?) wildcards. The value is matched against columns in the selected source tables to identify the columns to which the action applies.
    5. Click Add Rule.
      The rule appears in the actions list, below the table rules list.
    You can define multiple rules for different action types or for the same action type with different conditions. The rules are processed in the order in which they are listed, from top to bottom. Use the arrow icons to change the order.
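    The three action types mirror the standard SQL trim functions of the same names. The following minimal illustration shows their effect on a character value; it is an analogy for how the trim actions treat leading and trailing spaces, not a statement that the wizard runs.

      -- Standard SQL trim functions applied to the value '  abc  ':
      SELECT LTRIM('  abc  ');  -- returns 'abc  ' (leading spaces removed)
      SELECT RTRIM('  abc  ');  -- returns '  abc' (trailing spaces removed)
      SELECT TRIM('  abc  ');   -- returns 'abc'   (both ends trimmed)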
  9. If you are defining an initial load task that has a Db2 for i, Db2 for LUW, Db2 for z/OS, Microsoft SQL Server, MySQL, Oracle, PostgreSQL, or Teradata source and want to include database views as sources, select the Include Views check box.
    The views are then fetched and included in the Total Tables Selected or Tables Selected count. The views will also be included in the list of table names when you preview selections and when you download the list of table names.
  10. If you are defining an incremental load or combined initial and incremental load task that has a Db2 for i, Db2 for z/OS, Microsoft SQL Server, Oracle, PostgreSQL, or SAP HANA source and one or more of the selected source tables are not enabled for change data capture, you can generate a script for enabling CDC and then run or download the script.
    1. In the CDC Script field, select one of the following options:
      • Enable CDC for all columns. Enables CDC for all columns in the selected source tables. This option is the only valid option for a Db2 for i, Db2 for z/OS, PostgreSQL, SAP HANA, or SQL Server source.
        For source tables without a primary key, including any tables with unique indexes, CDC is enabled for all columns by default, regardless of which option is selected.
      • Enable CDC for primary key columns. Enables CDC only for primary key columns in the selected source tables. Do not use this option for a Db2 for i, Db2 for z/OS, PostgreSQL, or SQL Server source or for any task that has a Google Big Query target.
      The script enables CDC in the following ways, depending on the source type:
      • For Db2 for i sources, the script enables journaling on the source tables.
      • For Db2 for z/OS sources, the script sets DATA CAPTURE CHANGES for the source tables.
      • For Microsoft SQL Server sources, the script runs the stored procedures sys.sp_cdc_enable_db and sys.sp_cdc_enable_table to enable CDC on the source database and tables.
      • For Oracle sources, the script enables supplemental logging for all columns or for primary key columns in the selected source tables to log additional information in the redo logs.
      • For PostgreSQL sources, the script sets REPLICA IDENTITY FULL on the selected source tables to write all column values to the WAL file.
      • For SAP HANA sources, the script creates the required PKLOG, PROCESSED, and _CDC shadow tables. The script also creates three triggers and a sequence for each selected source table.
    2. For Microsoft SQL Server sources, complete the following fields:
      • In the Capture Filegroup field, enter the name of the filegroup to be used for the change table that is created for the capture. If you leave this field empty, the change table is located in the default filegroup of the database.
      • In the Gating Role field, enter the name of the database role that is used to gate access to the change data. If you leave this field empty, the database does not use a gating role.
    3. To run the script, click Execute.
      If you do not have a database role or privilege that allows you to run the script, click the Download icon to download the script and ask your database administrator to run it. The script file name has the format cdc_script_<taskname>_<number>.txt.
      Make sure that the script runs before you run the database ingestion task.
    If you change the CDC Script option later and run the script again, the script first drops CDC for the original set of columns and then enables CDC for the current set of columns. If the PROCESSED and PKLOG tables already exist, they are omitted from the new script. If the _CDC shadow table and triggers already exist for any table, the SQL statements for creating those objects are commented out in the new script.
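    For orientation, the following sketch shows the kinds of statements that such a script typically contains for several source types. The table name MYSCHEMA.ORDERS and the role name are hypothetical; always use the script that the wizard generates rather than this sketch.

      -- Sketch of typical CDC-enabling statements, assuming a hypothetical
      -- table MYSCHEMA.ORDERS. The generated script is authoritative.

      -- Microsoft SQL Server: enable CDC on the database, then on the table.
      EXEC sys.sp_cdc_enable_db;
      EXEC sys.sp_cdc_enable_table
          @source_schema = N'MYSCHEMA',
          @source_name   = N'ORDERS',
          @role_name     = N'CDC_GATING_ROLE';  -- or NULL if no gating role

      -- Oracle: enable supplemental logging for all columns of the table.
      ALTER TABLE MYSCHEMA.ORDERS ADD SUPPLEMENTAL LOG DATA (ALL) COLUMNS;

      -- Db2 for z/OS: set DATA CAPTURE CHANGES for the table.
      ALTER TABLE MYSCHEMA.ORDERS DATA CAPTURE CHANGES;

      -- PostgreSQL: write all column values to the WAL.
      ALTER TABLE myschema.orders REPLICA IDENTITY FULL;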
  11. To create and download a list of the source tables that match the table selection criteria, perform the following substeps:
    1. If you used rule-based table selection, in the Table Names list, select the type of selection rules that you want to use. Options are:
      • Include Rules Only
      • Exclude Rules Only
      • Include And Exclude Rules
    2. To list the columns, regardless of which table selection method you used, select the Include Columns check box.
      This option is not available for MongoDB sources.
    3. Click the Download icon.
      A downloaded list that includes columns has the following format:
      status,schema_name,table_name,object_type,column_name,comment
      The following fields are displayed in the downloaded list:
      status
      Indicates whether Mass Ingestion Databases excludes the source table or column from processing because it has an unsupported type. Valid values are:
      • E. The object is excluded from processing by an Exclude rule.
      • I. The object is included in processing.
      • X. The object is excluded from processing because it is an unsupported type of object. For example, unsupported types of objects include columns with unsupported data types and tables that include only unsupported columns. The comment field provides detail on the unsupported types.
      schema_name
      Specifies the name of the source schema.
      table_name
      Specifies the name of the source table.
      object_type
      Specifies the type of the source object. Valid values are:
      • C. Column.
      • T. Table.
      column_name
      Specifies the name of the source column. This information appears only if you selected the Include Columns check box.
      comment
      Specifies the reason why a source object of an unsupported type is excluded from processing even though it matches the selection rules.
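      For example, a downloaded list that includes columns might contain rows such as the following hypothetical excerpt, in which the schema, table, and column names and the comment text are invented for illustration:

        I,HR,EMPLOYEES,T,,
        I,HR,EMPLOYEES,C,EMPLOYEE_ID,
        X,HR,EMPLOYEES,C,BADGE_PHOTO,Unsupported data type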
  12. Under Advanced, set the advanced properties that are available for your source type and load type. All of the properties are optional.
    Disable Flashback
    Source and load type: Oracle - initial loads.
    Select this check box to prevent Mass Ingestion Databases from using Oracle Flashback when fetching data from the database. The use of Oracle Flashback requires users to be granted the EXECUTE ON DBMS_FLASHBACK privilege, which is not necessary for initial loads.
    This check box is selected by default for new initial load tasks. For existing initial load tasks, this check box is cleared by default, which causes Oracle Flashback to remain enabled.
    Include LOBs
    Source and load type: Oracle and PostgreSQL - initial loads with Amazon Redshift, Amazon S3, Google Big Query, Google Cloud Storage, Microsoft Azure Data Lake Storage Gen2, Microsoft Azure Synapse Analytics, and Snowflake targets. SQL Server - initial loads with the same targets, as well as incremental loads and combined initial and incremental loads with Snowflake targets.
    Select this check box if the source contains large-object columns from which you want to replicate data to an Amazon Redshift, Amazon S3, Google Big Query, Google Cloud Storage, Microsoft Azure Data Lake Storage Gen2, Microsoft Azure Synapse Analytics, or Snowflake target.
    The large-object data types are:
    • For Oracle: BLOB, CLOB, and NCLOB
    • For PostgreSQL: BYTEA, TEXT, and XML
    • For SQL Server: IMAGE, NTEXT, NVARCHAR(MAX), TEXT, VARBINARY(MAX), VARCHAR(MAX), and XML
    For initial loads:
    • BLOB, BYTEA, IMAGE, and VARBINARY(MAX) columns are truncated before being written to BINARY columns on the target:
      • For Amazon S3, Google Cloud Storage, and Microsoft Azure Data Lake Storage Gen2, the data is truncated to 16777216 bytes.
      • For Amazon Redshift, the data is truncated to 1024000 bytes.
      • For Microsoft Azure Synapse Analytics, the data is truncated to 1000000 bytes.
      • For Google Big Query and Snowflake, the data is truncated to 8388608 bytes.
    • CLOB, NCLOB, NTEXT, NVARCHAR(MAX), TEXT, VARCHAR(MAX), and XML columns are truncated before being written to VARCHAR columns on the target:
      • For Amazon S3, Google Cloud Storage, Microsoft Azure Data Lake Storage Gen2, and Snowflake, the data is truncated to 16777216 bytes.
      • For Amazon Redshift, the data is truncated to 65535 bytes.
      • For Microsoft Azure Synapse Analytics, the data is truncated to 500000 bytes.
      • For Google Big Query, the data is truncated to 8388608 bytes.
    For incremental loads and combined initial and incremental loads:
    • If large-object columns contain more than 8 KB of data, the data is truncated to 4000 bytes if stored inline, or to approximately 8000 bytes if stored out-of-line, before being written to a BINARY or VARCHAR column on the target.
    Enable Persistent Storage
    Source and load type: all sources except PostgreSQL - incremental loads and combined initial and incremental loads.
    Select this check box to enable persistent storage of transaction data in a disk buffer so that the data can be consumed continually, even when the writing of data to the target is slow or delayed. Benefits of using persistent storage are faster consumption of the source transaction logs, less reliance on log archives or backups, and the ability to still access the data persisted in disk storage after restarting a database ingestion job.
    For PostgreSQL CDC sources, persistent storage is required and cannot be disabled.
    Restart Point for Incremental Load
    Source and load type: all sources - incremental loads and combined initial and incremental loads.
    Set this field if you want to customize the position in the source logs from which the database ingestion job starts reading change records the first time it runs. The Earliest Available and Position options are not supported for MongoDB sources.
    Options are:
    • Earliest Available. The earliest available position in the database log or structure where changes are stored:
      • For Db2 for i, the start of the current journal.
      • For Db2 for z/OS, the earliest available record in the transaction log.
      • For Oracle, the earliest available record in the online redo log.
      • For PostgreSQL, the earliest available record in the replication slot.
      • For SAP HANA, the earliest available record in the PKLOG table.
      • For SQL Server, the earliest available record in the active transaction log.
    • Latest Available. The latest available position in the database log or structure.
    • Position. A valid Oracle SCN or SQL Server LSN that Mass Ingestion uses to determine a position in the change stream from which to start retrieving change records. The SCN or LSN must be equal to or less than the current SCN or LSN. An invalid value causes the job to fail. The default Position value is 0.
      For Oracle, 0 causes reading to start from the default restart point, which is the latest available point.
      For SQL Server, 0 causes reading to start from the earliest available point. A non-zero LSN that predates the beginning of the active transaction log causes data to be read from the CDC tables instead of from the transaction log.
      For example queries that return a current SCN or LSN value, see the sketch after this step.
    • Specific Date and Time. A date and time, in the format MM/DD/YYYY hh:mm AM|PM, that Mass Ingestion uses to determine the position in the change stream from which to start retrieving change records. Mass Ingestion retrieves only the changes that started after this date and time. If you enter a date and time earlier than the earliest date and time in the available archived logs, the job fails.
    The default is Latest Available.
    • The restart point option pertains only to the initial run of a job. Thereafter, if you resume a stopped or aborted job, the job begins propagating source data from where it last left off.
    • For combined initial and incremental load jobs, initial loading is not performed until the incremental processing of change data reaches the end of the current transaction log. For this reason, Informatica recommends that you select Latest Available as the start point.
      For Oracle combined initial and incremental load jobs, Oracle Flashback queries are used to get committed data that was current at a specific point in the change stream. Ensure that no source table is truncated during the initial load period. If truncation occurs, any DDL change performed during a flashback query causes the query to fail.
      For SQL Server, data changes are read from the active portion of the transaction log if the requested LSN is available there. If the LSN predates the active transaction log, the data changes are read from previously enabled CDC tables. Ensure that SQL Server CDC is enabled on the source tables.
    Fetch Size
    Source and load type: MongoDB - initial and incremental loads.
    For a MongoDB source, the number of records that a database ingestion job reads at a time from the source. Valid values are 1 to 2147483647. The default is 5000.
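    If you use the Position option, the following hypothetical queries show one way to look up a current position value on the source; they assume that you have privileges to query these system objects.

      -- Hypothetical helper queries for choosing a Position value.

      -- Oracle: return the current SCN.
      SELECT CURRENT_SCN FROM V$DATABASE;

      -- SQL Server: return the highest LSN that is available to CDC.
      SELECT sys.fn_cdc_get_max_lsn() AS max_lsn;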
  13. Under Custom Properties, you can specify custom properties that Informatica provides to meet your special requirements. To add a property, in the Create Property fields, enter the property name and value. Then click Add Property.
    Specify these properties only at the direction of Informatica Global Customer Support. Usually, these properties address unique environments or special processing needs. You can specify multiple properties, if necessary. A property name can contain only alphanumeric characters and the following special characters: periods (.), hyphens (-), and underscores (_).
    To delete a property, click the Delete icon at the right end of the property row in the list.
  14. Click Next.
