Mass Ingestion

Troubleshooting a database ingestion task

If you change an unsupported data type of a source column to a supported data type, the change might not be replicated to the target.
This problem occurs when the Modify column schema drift option is set to Replicate and the Add column option is set to Ignore. Mass Ingestion Databases does not create target columns for source columns that have unsupported data types when you deploy a task. If you change the unsupported data type to a supported data type for the source column later, Mass Ingestion Databases processes the modify column operation on the source but does not replicate the change to the target. When Mass Ingestion Databases tries to add a column with the supported data type to the target, the operation is ignored because the Add column schema drift option is set to Ignore.
To handle this situation, perform the following steps:
  1. On the Schedule and Runtime Options page in the database ingestion task wizard, under Schema Drift Options, set the Add column option to Replicate.
  2. Change the source column data type to a supported data type again so that the database ingestion job can detect this schema change.
    The database ingestion job processes the DDL operation and creates the new target column.
    The database ingestion job does not propagate the column values that were added prior to changing the source column data type.
  3. If you want to propagate all of the values from the source column to the target, resynchronize the target table with the source.
If you change a primary key constraint on the source, Mass Ingestion Databases stops processing the source table on which the DDL change occurred.
This problem occurs if you add or drop a primary key constraint, or if you add or drop a column from an existing primary key.
To resume processing the source table for combined initial and incremental jobs, resynchronize the target table with the source.
To resume processing the source table for incremental jobs, perform the following steps:
  1. On the Source tab in the database ingestion task definition, add a table selection rule to exclude the source table.
  2. Redeploy the task.
    Mass Ingestion Databases deploys the edited task and deletes the information about the primary keys of the excluded table.
  3. Edit the task again to delete the table selection rule that excluded the source table.
  4. Redeploy the task.
If a DDL column-level change causes a source table subtask to stop or be in error and then you resume the database ingestion job, the expected change in the table state is delayed.
If a DDL column-level change on a source table causes a table subtask to stop or be in error and then you resume the database ingestion job, the state of the table subtask might remain unchanged until a DML operation occurs on the table. For example, if you set a schema drift option to Stop Table for an incremental or initial and incremental database ingestion task and then deploy and run the job, when a DDL change occurs on a source table, the job monitoring details show the table subtask to be in the Error state. If you stop the job and then resume it with a schema drift override to replicate the DDL change, the table subtask temporarily remains in the Error state until the first DML operation occurs on the source table.
Mass Ingestion Databases fails to deploy a task that has a Snowflake target with the following error:
Information schema query returned too much data. Please repeat query with more selective predicates.
This error occurs because of a known Snowflake issue related to schema queries. For more information, see the Snowflake documentation.
In Mass Ingestion Databases, the error can cause the deployment of a database ingestion task that has a Snowflake target to fail when a large number of source tables are selected.
To handle the deployment failure, drop the target tables, update the database ingestion task to select fewer source tables for generating the target tables, and then try to deploy the task again.
A database ingestion job that runs on Linux ends abnormally with the following out-of-memory error:
java.lang.OutOfMemoryError: unable to create new native thread
The maximum number of user processes that is set for the operating system might have been exceeded. If the Linux ulimit value for maximum user processes is not already set to unlimited, set it to unlimited or a higher value. Then resume the job.
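For example, assuming the ingestion job runs under a service account named infauser (a hypothetical name used only for this sketch), you might check and raise the limit with commands similar to the following. Raising the hard limit typically requires root privileges, so verify the exact procedure for your Linux distribution:
ulimit -u                  # display the current maximum number of user processes
ulimit -u unlimited        # raise the limit for the current shell session, if the hard limit allows it
To make the setting persistent across sessions, you can add nproc entries for the service account to /etc/security/limits.conf, for example:
infauser    soft    nproc    unlimited
infauser    hard    nproc    unlimited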
If you copy an asset to another location that already includes an asset of the same name, the operation might fail with one of the following errors:
Operation succeeded on 1 artifacts, failed on 1 artifacts. Operation did not succeed on any of the artifacts.
If you try to copy an asset to another location that already has an asset of the same name, Mass Ingestion Databases displays a warning message that asks if you want to keep both assets, one with a suffix such as "- Copy 1". When you choose to keep both assets, Mass Ingestion Databases validates the name length to ensure that it will not exceed the maximum length of 50 characters after the suffix is added. If the name length would exceed 50 characters, the copy operation fails. In this case, you must copy the asset to another location, rename the copy, and then move the renamed asset back to the original location.
Schema drift options on the Schedule and Runtime Options page do not match the options in the job log
In the Mass Ingestion Databases Spring 2020 April release, the order of the Schema Drift options displayed on the Schedule and Runtime Options page changed but their values remained in the original order. If you use a database ingestion task that was created after the Spring 2020 April release and before the Fall 2020 December release, the schema drift options that the associated job uses at runtime, as reported under "schemaChangeRules" in the job log, might not match the Schema Drift options displayed on the Runtime Options page. In this case, reset each of the Schema Drift options on the Runtime Options page and save the task again.
A Kafka consumer ends with one of the following errors:
org.apache.avro.AvroTypeException: Invalid default for field meta_data: null not a {"type":"array"...
org.apache.avro.AvroTypeException: Invalid default for field header: null not a {"type":"record"...
This error might occur because the consumer has been upgraded to a new Avro version but still uses the Avro schema files from the older version.
To resolve the problem, use the new Avro schema files that Mass Ingestion Databases provides.
A database ingestion job that propagates incremental change data to a Kafka target that uses Confluent Schema Registry fails with the following error:
io.confluent.kafka.schemaregistry.client.rest.exceptions.RestClientException: Register operation timed out; error code: 50002
This problem might occur when the job is processing many source tables, which requires Confluent Schema Registry to process many schemas. To resolve the problem, try increasing the value of the Confluent Schema Registry kafkastore.timeout.ms option. This option sets the timeout for an operation on the Kafka store. For more information, see the Confluent Schema Registry documentation.
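For example, you might raise the timeout in the Schema Registry properties file. The file name and the 10-second value shown here are illustrative; choose a value that suits your deployment:
# schema-registry.properties
kafkastore.timeout.ms=10000
Restart the Schema Registry service after changing the property so that the new value takes effect.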
Subtasks of a database ingestion job that has a Google BigQuery target fail to complete initial load processing of source tables with the following error:
The job has timed out on the server. Try increasing the timeout value.
This problem occurs when the job is configured to process many source tables and the Google BigQuery target connection times out before initial load processing of the source tables is complete. To resolve this problem, increase the timeout interval in the Google BigQuery V2 target connection properties.
  1. In Administrator, open the Google BigQuery V2 connection that is associated with the database ingestion job in Edit mode.
  2. In the Provide Optional Properties field, set the timeout property to the required timeout interval in seconds. Use the following format, as shown in the example after these steps:
    "timeout": "<timeout_interval_in_seconds>"
  3. Save the connection.
  4. Redeploy the database ingestion task.
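For example, to allow up to 30 minutes for initial load processing, you might enter the following in the Provide Optional Properties field in step 2. The 1800-second value is illustrative; choose an interval that suits your data volumes:
"timeout": "1800"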
A database ingestion task with an Amazon Redshift target returns one of the following errors during deployment:
Mass Ingestion Databases could not find target table 'table_name' which is mapped to source table 'table_name' when deploying the database ingestion task.
com.amazon.redshift.util.RedshiftException: ERROR: Relation "table_name" already exists
This problem occurs because Amazon Redshift reads table and column names as lowercase by default.
To resolve the problem and allow case-sensitive identifiers on Amazon Redshift targets, perform the following steps:
  1. Disable the Amazon Redshift downcase_delimited_identifier parameter at the user level by using the following statement:
    ALTER USER username SET downcase_delimited_identifier TO off;
  2. Redeploy the database ingestion task.
To prevent these errors, you can set the enable_case_sensitive_identifier parameter to "true" when configuring the database parameter group.
Deployment of a database ingestion task fails if the source table or column names include multibyte or special characters and the target is Databricks Delta.
When a new Databricks Delta target table is created during deployment, an entry is added to the Hive metastore that Databricks Delta uses. The Hive metastore is typically a MySQL database. More specifically, column names are inserted into the TABLE_PARAMS table of the metastore. The collation of the PARAM_VALUE column in TABLE_PARAMS is latin1_bin, and its character set is latin1. This character set does not support Japanese characters. To resolve the problem, create an external metastore with UTF-8_bin as the collation and UTF-8 as the character set. For more information, see the Databricks Delta documentation at https://docs.microsoft.com/en-us/azure/databricks/kb/metastore/jpn-char-external-metastore and https://kb.databricks.com/metastore/jpn-char-external-metastore.html.
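As an illustration only, if the external metastore is a MySQL database named metastore_db (a hypothetical name), the database could be created with a UTF-8 character set and binary collation by using a statement similar to the following; follow the linked Databricks articles for the supported procedure and the exact charset variant for your MySQL version:
CREATE DATABASE metastore_db CHARACTER SET utf8 COLLATE utf8_bin;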
