

Mass Ingestion

Configuring the source
Configure the source on the Source page of the database ingestion task wizard.

For MongoDB sources only, the task wizard displays database instead of schema and displays collection instead of table. However, for simplicity, the terms schema and table are used in this documentation to cover all source types.
  1. In the Connection list, select the connection for the source system.
    The connection must be predefined in Administrator for a runtime environment that your organization uses.
    The list includes only the connection types that are valid for the load type selected on the Definition page. No connections are listed if you did not select a load type.
    If you change the load type and the selected connection is no longer valid, a warning message is issued and the Connection field is cleared. You must select another connection that is valid for the updated load type.
    After you deploy the ingestion task, you cannot change the connection without first undeploying the associated ingestion job. After you change the connection, you must deploy the task again.
  2. In the Schema list, select the source schema that includes the source tables.
    The list includes only the schemas that are available in the database that is accessed with the specified source connection. When you create a task that has an Oracle, Microsoft SQL Server, Netezza, or PostgreSQL source, the schema name that is specified in the connection properties is displayed by default.
    An expanded view of the Table Selection area appears.
  3. If you are defining a Db2 for i source for an incremental load task, in the Journal Name field, select the name of the journal that records the changes made to the source tables.
  4. If you are defining a PostgreSQL source for an incremental load task, specify the following fields:
    Replication Slot Name
    Specify the unique name of a PostgreSQL replication slot. A slot name can contain lowercase Latin alphanumeric characters and the underscore (_) character. The maximum length is 63 characters. Each database ingestion task must use a different replication slot.
    Replication Plugin
    Select a PostgreSQL replication plugin. Options are:
    • pgoutput. You can select this option only for PostgreSQL version 10 and later.
    • wal2json
    Publication
    If you selected pgoutput as the replication plugin, specify the publication name that the plugin uses. This field is not displayed if you selected wal2json as the replication plugin.
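    For context, the replication slot and publication are ordinary PostgreSQL objects that a DBA can create and inspect with standard SQL. The following is a minimal sketch, assuming PostgreSQL 10 or later; the names my_slot and my_pub are hypothetical examples, not values that the wizard requires.

      -- Minimal sketch, run on the source database (not in the task wizard).
      -- The slot and publication names below are hypothetical examples.

      -- Create a logical replication slot that uses the pgoutput plugin.
      SELECT pg_create_logical_replication_slot('my_slot', 'pgoutput');

      -- Create a publication for the pgoutput plugin to use.
      CREATE PUBLICATION my_pub FOR ALL TABLES;

      -- Verify the slot; it remains inactive until a job connects to it.
      SELECT slot_name, plugin, active FROM pg_replication_slots;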
  5. Under Table Selection, select one of the following options to indicate whether you want to select all tables in the specified schema or create table selection rules to define a subset of the source tables:
    • Select All. Selects all tables in the schema for data replication.
    • Rule-based Selection. Enables you to define rules to select only the tables that you want to replicate. You must also be in this mode to assign actions to columns in any selected table.
    The default is Rule-based Selection.
    If you switch to Select All and then switch back to Rule-based Selection, none of the rules that you previously defined are reinstated and all tables in the schema are listed. You can add new rules to filter the list.
  6. If you selected Rule-based Selection, create rules to select the tables that you want to replicate.
    By default, the Rules list contains a single Include rule with a condition that specifies only the asterisk (*) wildcard character. This rule selects all tables in the specified source schema.
    To create a rule, you can either use the entry boxes under Create Rule or click the Add Rule (+) icon and enter the rule within the Table Rule list.
    To create a rule under Create Rule:
    1. Select Table Selection as the general rule type.
    2. In the adjacent drop-down list, select Include or Exclude to create an inclusion or exclusion rule, respectively.
    3. In the condition field, enter a table name or a table-name mask that includes one or more wildcards to identify the source tables to include in or exclude from table selection. Use the following guidelines:
      • A mask can contain one or both of the following wildcards: an asterisk (*) to represent one or more characters and a question mark (?) to represent a single character. A wildcard can occur multiple times in a mask value and can occur anywhere in the value. For an illustration of how masks match table names, see the sketch after this step.
      • The task wizard is case sensitive. Enter the table names or masks in the case with which the tables were defined.
      • Do not include delimiters such as quotation marks or brackets, even if the source database uses them. For example, Oracle requires quotation marks around lowercase and mixed-case names to force the names to be stored in lowercase or mixed case. However, in the task wizard, you must enter the lowercase or mixed-case names without quotation marks.
      • If a table name includes special characters such as a backslash (\), asterisk (*), dollar sign ($), caret (^), or question mark (?), escape each special character with a backslash (\) when you enter the rule.
    4. Click Add Rule.
      The rule appears in the Rules list.
      The rules are processed in the order in which they are listed, from top to bottom. Use the arrow icons to change the order. For an example of using multiple rules, see Example of rules for selecting source tables.
      Tip: Click the Refresh icon next to the Total Tables Selected field to refresh the total number of tables that match all rules and to display the number of source tables that match each rule in the Tables Affected column.
    5. To preview the tables to be selected based on all rules, click Preview Selection.
      The tables are listed on the Selected Tables tab. The list shows the table names and column counts.
      Tip: Click the Refresh icon next to the Updated date to refresh the total tables count and the list of tables based on the current rules. You can check the results of new rules in this manner. Click the Settings icon to control the line spacing in the list of tables, from Comfortable (most spacing) to Compact (least spacing).
    6. To refine the table selection, you can switch back to the Rules tab to edit the rules or define additional rules. Then preview the table selection again.
    Mass Ingestion Databases might exclude an unsupported type of table from processing even if the table matches the selection rules.
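    To make the wildcard matching concrete: the asterisk (*) and question mark (?) behave like the SQL LIKE wildcards % and _. The following query is a hypothetical illustration only; the wizard does not run it, and the schema name MYSCHEMA and the mask SALES_* are invented for the example.

      -- Hypothetical illustration: an Include rule with the mask SALES_*
      -- selects approximately the tables that this query returns. The
      -- literal underscore in the mask must be escaped in LIKE because
      -- _ is itself a LIKE wildcard.
      SELECT table_name
      FROM information_schema.tables
      WHERE table_schema = 'MYSCHEMA'
        AND table_name LIKE 'SALES\_%' ESCAPE '\';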
  7. If you selected Select All, you can review the tables in the specified schema that will be replicated.
    A list of all tables in the schema appears, which shows the number of columns in each table. The Tables Selected field displays the total number of tables selected.
    Click the Refresh icon to refresh the list to reflect any added or dropped tables in the schema in the source database. Click the Settings icon to change the line spacing in the list of tables.
    If you switch back to Rule-based Selection, the list of tables remains displayed, but you can define rules to filter the list of tables.
  8. To perform trim actions on columns in source tables that were selected based on rules, create column action rules. You cannot create column action rules for MongoDB sources.
    To create a column action rule:
    1. Ensure that the Rule-based Selection option is selected.
    2. Under Create Rule, select Column Action.
    3. In the adjacent list, select one of the following action types:
      • LTRIM. Trims spaces to the left of character column values.
      • RTRIM. Trims spaces to the right of character column values.
      • TRIM. Trims spaces to the left of and to the right of character column values.
    4. In the condition field, enter a column name or a column-name mask that includes one or more asterisk (*) or question mark (?) wildcards. The value is matched against columns in the selected source tables to identify the columns to which the action applies.
    5. Click Add Rule.
      The rule appears in the actions list, below the table rules list.
    You can define multiple rules for different action types or for the same action type with different conditions. The rules are processed in the order in which they are listed, from top to bottom. Use the arrow icons to change the order.
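    The three action types mirror the standard SQL trim functions of the same names. The following minimal illustration shows their effect on a character value; it is an analogy for how the trim actions treat leading and trailing spaces, not a statement that the wizard runs.

      -- Standard SQL trim functions applied to the value '  abc  ':
      SELECT LTRIM('  abc  ');  -- returns 'abc  ' (leading spaces removed)
      SELECT RTRIM('  abc  ');  -- returns '  abc' (trailing spaces removed)
      SELECT TRIM('  abc  ');   -- returns 'abc'   (both ends trimmed)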
  9. If you are defining an initial load task that has a Db2 for i, Db2 for LUW, Db2 for z/OS, Microsoft SQL Server, MySQL, Oracle, PostgreSQL, or Teradata source and want to include database views as sources, select the Include Views check box.
    The views are then fetched and included in the Total Tables Selected or Tables Selected count. The views will also be included in the list of table names when you preview selections and when you download the list of table names.
  10. If you are defining an incremental load or combined initial and incremental load task that has a Db2 for i, Db2 for z/OS, Microsoft SQL Server, Oracle, PostgreSQL, or SAP HANA source and one or more of the selected source tables are not enabled for change data capture, you can generate a script for enabling CDC and then run or download the script.
    1. In the CDC Script field, select one of the following options:
      • Enable CDC for all columns. Enables CDC for all columns in the selected source tables. This option is the only valid option for a Db2 for i, Db2 for z/OS, PostgreSQL, SAP HANA, or SQL Server source.
        For source tables without a primary key, including any tables with unique indexes, CDC is enabled for all columns by default, regardless of which option is selected.
      • Enable CDC for primary key columns. Enables CDC only for primary key columns in the selected source tables. Do not use this option for a Db2 for i, Db2 for z/OS, PostgreSQL, or SQL Server source or for any task that has a Google Big Query target.
      The script enables CDC in the following ways, depending on the source type:
      • For Db2 for i sources, the script enables journaling on the source tables.
      • For Db2 for z/OS sources, the script sets DATA CAPTURE CHANGES for the source tables.
      • For Microsoft SQL Server sources, the script runs the stored procedures sys.sp_cdc_enable_db and sys.sp_cdc_enable_table to enable CDC on the source database and tables.
      • For Oracle sources, the script enables supplemental logging for all columns or for primary key columns in the selected source tables to log additional information in the redo logs.
      • For PostgreSQL sources, the script sets REPLICA IDENTITY FULL on the selected source tables to write all column values to the WAL file.
      • For SAP HANA sources, the script creates the required PKLOG, PROCESSED, and _CDC shadow tables. The script also creates three triggers and a sequence for each selected source table.
    2. For Microsoft SQL Server sources, complete the following fields:
      • In the Capture Filegroup field, enter the name of the filegroup to be used for the change table that is created for the capture. If you leave this field empty, the change table is located in the default filegroup of the database.
      • In the Gating Role field, enter the name of the database role that is used to gate access to the change data. If you leave this field empty, the database does not use a gating role.
    3. To run the script, click Execute.
      If you do not have a database role or privilege that allows you to run the script, click the Download icon to download the script and ask your database administrator to run it. The script file name has the format cdc_script_<taskname>_<number>.txt.
      Make sure that the script runs before you run the database ingestion task.
    If you change the CDC Script option later and run the script again, the script first drops CDC for the original set of columns and then enables CDC for the current set of columns. If the PROCESSED and PKLOG tables already exist, they are omitted from the new script. If the _CDC shadow table and triggers already exist for any table, the SQL statements for creating those objects are commented out in the new script.
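    For orientation, the following sketch shows the kinds of statements that such a script typically contains for several source types. The table name MYSCHEMA.ORDERS and the role name are hypothetical; always use the script that the wizard generates rather than this sketch.

      -- Sketch of typical CDC-enabling statements, assuming a hypothetical
      -- table MYSCHEMA.ORDERS. The generated script is authoritative.

      -- Microsoft SQL Server: enable CDC on the database, then on the table.
      EXEC sys.sp_cdc_enable_db;
      EXEC sys.sp_cdc_enable_table
          @source_schema = N'MYSCHEMA',
          @source_name   = N'ORDERS',
          @role_name     = N'CDC_GATING_ROLE';  -- or NULL if no gating role

      -- Oracle: enable supplemental logging for all columns of the table.
      ALTER TABLE MYSCHEMA.ORDERS ADD SUPPLEMENTAL LOG DATA (ALL) COLUMNS;

      -- Db2 for z/OS: set DATA CAPTURE CHANGES for the table.
      ALTER TABLE MYSCHEMA.ORDERS DATA CAPTURE CHANGES;

      -- PostgreSQL: write all column values to the WAL.
      ALTER TABLE myschema.orders REPLICA IDENTITY FULL;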
  11. To create and download a list of the source tables that match the table selection criteria, perform the following substeps:
    1. If you used rule-based table selection, in the Table Names list, select the type of selection rules that you want to use. Options are:
      • Include Rules Only
      • Exclude Rules Only
      • Include And Exclude Rules
    2. To list the columns, regardless of which table selection method you used, select the Include Columns check box.
      This option is not available for MongoDB sources.
    3. Click the Download icon.
      A downloaded list that includes columns has the following format:
      status,schema_name,table_name,object_type,column_name,comment
      The following fields are displayed in the downloaded list:
      status
      Indicates whether Mass Ingestion Databases excludes the source table or column from processing because it has an unsupported type. Valid values are:
      • E. The object is excluded from processing by an Exclude rule.
      • I. The object is included in processing.
      • X. The object is excluded from processing because it is an unsupported type of object. For example, unsupported types of objects include columns with unsupported data types and tables that include only unsupported columns. The comment field provides detail on the unsupported types.
      schema_name
      Specifies the name of the source schema.
      table_name
      Specifies the name of the source table.
      object_type
      Specifies the type of the source object. Valid values are:
      • C. Column.
      • T. Table.
      column_name
      Specifies the name of the source column. This information appears only if you selected the Include Columns check box.
      comment
      Specifies the reason why a source object of an unsupported type is excluded from processing even though it matches the selection rules.
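      For example, a downloaded list that includes columns might contain rows such as the following hypothetical excerpt, in which the schema, table, and column names and the comment text are invented for illustration:

        I,HR,EMPLOYEES,T,,
        I,HR,EMPLOYEES,C,EMPLOYEE_ID,
        X,HR,EMPLOYEES,C,BADGE_PHOTO,Unsupported data type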
  12. Under Advanced, set the advanced properties that are available for your source type and load type. All of the properties are optional.
    Disable Flashback
    Source and load type: Oracle - initial loads.
    Select this check box to prevent Mass Ingestion Databases from using Oracle Flashback when fetching data from the database. The use of Oracle Flashback requires users to be granted the EXECUTE ON DBMS_FLASHBACK privilege, which is not necessary for initial loads.
    This check box is selected by default for new initial load tasks. For existing initial load tasks, this check box is cleared by default, which causes Oracle Flashback to remain enabled.
    Include LOBs
    Source and load type: Oracle and PostgreSQL - initial loads with Amazon Redshift, Amazon S3, Google Big Query, Google Cloud Storage, Microsoft Azure Data Lake Storage Gen2, Microsoft Azure Synapse Analytics, and Snowflake targets. SQL Server - initial loads with the same targets, as well as incremental loads and combined initial and incremental loads with Snowflake targets.
    Select this check box if the source contains large-object columns from which you want to replicate data to an Amazon Redshift, Amazon S3, Google Big Query, Google Cloud Storage, Microsoft Azure Data Lake Storage Gen2, Microsoft Azure Synapse Analytics, or Snowflake target.
    The large-object data types are:
    • For Oracle: BLOB, CLOB, and NCLOB
    • For PostgreSQL: BYTEA, TEXT, and XML
    • For SQL Server: IMAGE, NTEXT, NVARCHAR(MAX), TEXT, VARBINARY(MAX), VARCHAR(MAX), and XML
    For initial loads:
    • BLOB, BYTEA, IMAGE, and VARBINARY(MAX) columns are truncated before being written to BINARY columns on the target:
      • For Amazon S3, Google Cloud Storage, and Microsoft Azure Data Lake Storage Gen2, the data is truncated to 16777216 bytes.
      • For Amazon Redshift, the data is truncated to 1024000 bytes.
      • For Microsoft Azure Synapse Analytics, the data is truncated to 1000000 bytes.
      • For Google Big Query and Snowflake, the data is truncated to 8388608 bytes.
    • CLOB, NCLOB, NTEXT, NVARCHAR(MAX), TEXT, VARCHAR(MAX), and XML columns are truncated before being written to VARCHAR columns on the target:
      • For Amazon S3, Google Cloud Storage, Microsoft Azure Data Lake Storage Gen2, and Snowflake, the data is truncated to 16777216 bytes.
      • For Amazon Redshift, the data is truncated to 65535 bytes.
      • For Microsoft Azure Synapse Analytics, the data is truncated to 500000 bytes.
      • For Google Big Query, the data is truncated to 8388608 bytes.
    For incremental loads and combined initial and incremental loads:
    • If large-object columns contain more than 8 KB of data, the data is truncated to 4000 bytes if stored inline, or to approximately 8000 bytes if stored out-of-line, before being written to a BINARY or VARCHAR column on the target.
    Enable Persistent Storage
    Source and load type: all sources except PostgreSQL - incremental loads and combined initial and incremental loads.
    Select this check box to enable persistent storage of transaction data in a disk buffer so that the data can be consumed continually, even when the writing of data to the target is slow or delayed. Benefits of using persistent storage are faster consumption of the source transaction logs, less reliance on log archives or backups, and the ability to still access the data persisted in disk storage after restarting a database ingestion job.
    For PostgreSQL CDC sources, persistent storage is required and cannot be disabled.
    Restart Point for Incremental Load
    Source and load type: all sources - incremental loads and combined initial and incremental loads.
    Set this field if you want to customize the position in the source logs from which the database ingestion job starts reading change records the first time it runs. The Earliest Available and Position options are not supported for MongoDB sources.
    Options are:
    • Earliest Available. The earliest available position in the database log or structure where changes are stored:
      • For Db2 for i, the start of the current journal.
      • For Db2 for z/OS, the earliest available record in the transaction log.
      • For Oracle, the earliest available record in the online redo log.
      • For PostgreSQL, the earliest available record in the replication slot.
      • For SAP HANA, the earliest available record in the PKLOG table.
      • For SQL Server, the earliest available record in the active transaction log.
    • Latest Available. The latest available position in the database log or structure.
    • Position. A valid Oracle SCN or SQL Server LSN that Mass Ingestion uses to determine a position in the change stream from which to start retrieving change records. The SCN or LSN must be equal to or less than the current SCN or LSN. An invalid value causes the job to fail. The default Position value is 0.
      For Oracle, 0 causes reading to start from the default restart point, which is the latest available point.
      For SQL Server, 0 causes reading to start from the earliest available point. A non-zero LSN that predates the beginning of the active transaction log causes data to be read from the CDC tables instead of from the transaction log.
      For example queries that return a current SCN or LSN value, see the sketch after this step.
    • Specific Date and Time. A date and time, in the format MM/DD/YYYY hh:mm AM|PM, that Mass Ingestion uses to determine the position in the change stream from which to start retrieving change records. Mass Ingestion retrieves only the changes that started after this date and time. If you enter a date and time earlier than the earliest date and time in the available archived logs, the job fails.
    The default is Latest Available.
    • The restart point option pertains only to the initial run of a job. Thereafter, if you resume a stopped or aborted job, the job begins propagating source data from where it last left off.
    • For combined initial and incremental load jobs, initial loading is not performed until the incremental processing of change data reaches the end of the current transaction log. For this reason, Informatica recommends that you select Latest Available as the start point.
      For Oracle combined initial and incremental load jobs, Oracle Flashback queries are used to get committed data that was current at a specific point in the change stream. Ensure that no source table is truncated during the initial load period. If truncation occurs, any DDL change performed during a flashback query causes the query to fail.
      For SQL Server, data changes are read from the active portion of the transaction log if the requested LSN is available there. If the LSN predates the active transaction log, the data changes are read from previously enabled CDC tables. Ensure that SQL Server CDC is enabled on the source tables.
    Fetch Size
    Source and load type: MongoDB - initial and incremental loads.
    For a MongoDB source, the number of records that a database ingestion job reads at a time from the source. Valid values are 1 to 2147483647. The default is 5000.
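    If you use the Position option, the following hypothetical queries show one way to look up a current position value on the source; they assume that you have privileges to query these system objects.

      -- Hypothetical helper queries for choosing a Position value.

      -- Oracle: return the current SCN.
      SELECT CURRENT_SCN FROM V$DATABASE;

      -- SQL Server: return the highest LSN that is available to CDC.
      SELECT sys.fn_cdc_get_max_lsn() AS max_lsn;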
  13. Under Custom Properties, you can specify custom properties that Informatica provides to meet your special requirements. To add a property, in the Create Property fields, enter the property name and value. Then click Add Property.
    Specify these properties only at the direction of Informatica Global Customer Support. Usually, these properties address unique environments or special processing needs. You can specify multiple properties, if necessary. A property name can contain only alphanumeric characters and the following special characters: periods (.), hyphens (-), and underscores (_).
    To delete a property, click the Delete icon at the right end of the property row in the list.
  14. Click Next.
