
User Guide

Configuring a Data Flow Pipeline

A Data Flow Job execution is not initiated from Privitar; it is managed entirely externally, in the Data Flow platform of choice.

Privitar provides platform-specific plugins to execute Data Flow Jobs. The supported platforms are:

  • Apache NiFi

  • Apache Kafka Connect/Confluent Platform

  • StreamSets

For other platforms, Privitar provides an SDK that can be used to build a custom integration.

To configure a Data Flow platform, the following information is required:

  • The unique ID of the desired Data Flow Jobs. This can be found in the Jobs list. For more information, see Working with Existing Jobs.

  • The Privitar URL. This can be provided by your administrator.

  • The credentials of a Data Flow Job Operator.

    The operator must be set up as an API user with a Role that has the Run Data Flow permission in the Team in which the Job is defined (for example, the default Data Flow Operator Role).

    For more information, see Managing API Users.

  • The details of the input and output data formats. These can be obtained from the Jobs list in Avro Schema format. These files can be imported directly into an Avro Schema Registry, or used as a reference for configuring the Data Flow platform. A minimal example schema is sketched after this list.

    For more information, see Working with Existing Jobs.
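
The exported format files are standard Avro schemas. As an illustration only (the record and field names below are hypothetical, not taken from any real Job), the following Python sketch parses such a schema with the fastavro library and validates a sample record against it before the pipeline is configured:

    # Illustrative only: the record and field names below are hypothetical and
    # should be replaced with the schema exported from the Jobs list.
    from fastavro import parse_schema
    from fastavro.validation import validate

    input_schema = parse_schema({
        "type": "record",
        "name": "CustomerRecord",              # hypothetical record name
        "namespace": "com.example.dataflow",   # hypothetical namespace
        "fields": [
            {"name": "customer_id", "type": "string"},
            {"name": "email",       "type": "string"},
            {"name": "signup_date", "type": "string"},
        ],
    })

    # Confirm that a sample record conforms to the schema before configuring the pipeline.
    sample = {"customer_id": "c-123", "email": "jane@example.com", "signup_date": "2023-01-31"}
    validate(sample, input_schema)   # raises ValidationError if the record does not conform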

Data Flow Execution

On startup, the Data Flow plugin will connect to the configured Privitar instance to download the Data Flow Job configuration.

While the pipeline runs, the Data Flow plugin will continuously process the incoming data and apply the configured Policy.

The output data will be sent either to the next step in the pipeline or to the final destination of the data, depending on the specific configuration of the Data Flow pipeline.
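
As a conceptual sketch of this life cycle (this is not the actual plugin implementation; the endpoint path, job ID, credentials, and policy function are hypothetical placeholders), the startup and processing steps look roughly like this in Python:

    # Conceptual sketch of the plugin life cycle described above; this is not the
    # actual Privitar plugin code. The endpoint path, job ID, credentials, and
    # policy function are hypothetical placeholders.
    import requests

    PRIVITAR_URL = "https://privitar.example.com"   # provided by your administrator
    JOB_ID = "your-data-flow-job-id"                # unique ID from the Jobs list

    def download_job_configuration():
        """On startup: fetch the Data Flow Job configuration from Privitar."""
        response = requests.get(
            f"{PRIVITAR_URL}/api/jobs/{JOB_ID}",    # hypothetical endpoint
            auth=("api-user", "api-password"),      # Data Flow Job Operator credentials
            timeout=30,
        )
        response.raise_for_status()
        return response.json()

    def apply_policy(record, job_config):
        """Placeholder for the plugin's de-identification of a single record."""
        return record

    def run_pipeline(records, job_config):
        """While the pipeline runs: apply the configured Policy to each incoming record."""
        for record in records:
            yield apply_policy(record, job_config)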

Note

On NiFi and Apache Kafka/Confluent, any changes to Policies and Jobs will be dynamically applied after the refresh time interval (10 minutes by default). On StreamSets, any changes to Policies, Jobs, or the Data Flow Job configuration made while a pipeline is running will not take effect until the pipeline process is restarted.
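
A minimal sketch of that refresh behaviour, assuming the hypothetical download_job_configuration helper from the previous sketch, might cache the configuration and re-fetch it once the interval has elapsed:

    # Sketch of the time-based refresh described in the note above; the interval
    # and helper names are illustrative, and download_job_configuration() is the
    # hypothetical helper from the previous sketch.
    import time

    REFRESH_INTERVAL_SECONDS = 10 * 60   # default refresh interval: 10 minutes

    _cached_config = None
    _last_refresh = 0.0

    def current_job_configuration():
        """Return the cached Job configuration, re-downloading it once the interval elapses."""
        global _cached_config, _last_refresh
        if _cached_config is None or time.monotonic() - _last_refresh > REFRESH_INTERVAL_SECONDS:
            _cached_config = download_job_configuration()
            _last_refresh = time.monotonic()
        return _cached_config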

Data Flow supports the following authentication mechanisms between the Data Flow plugin and Privitar (a minimal connection sketch in Python follows this list):

  • Apache NiFi: Mutual TLS or Basic HTTP authentication.

  • Apache Kafka Connect/Confluent Platform: Basic HTTP authentication.

    Note

    For more information on how to configure the Apache Kafka Connect/Confluent Privitar Connector, see the separately provided Kafka Connect Reference Guide. (Please contact Privitar for further information about Apache Kafka Connect/Confluent integration.)

  • StreamSets: Mutual TLS or Basic HTTP authentication.

    Note

    For more information on how to configure the StreamSets Privitar Data Processor, see the separately provided StreamSets Reference Guide. (Please contact Privitar for further information about StreamSets integration.)
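
As a minimal sketch of both mechanisms, using the Python requests library (the URL, credentials, and certificate paths below are placeholders for your own configuration), a client can reach Privitar as follows:

    # Minimal sketch of both authentication mechanisms using the Python requests
    # library. The URL, credentials, and certificate paths are placeholders.
    import requests

    PRIVITAR_URL = "https://privitar.example.com/api/jobs/your-data-flow-job-id"  # hypothetical

    # Basic HTTP authentication (NiFi, Kafka Connect/Confluent, StreamSets):
    basic = requests.get(PRIVITAR_URL, auth=("api-user", "api-password"), timeout=30)

    # Mutual TLS (NiFi, StreamSets): present a client certificate and key, and
    # verify the server against the appropriate CA bundle.
    mtls = requests.get(
        PRIVITAR_URL,
        cert=("/etc/privitar/client.crt", "/etc/privitar/client.key"),
        verify="/etc/privitar/ca.crt",
        timeout=30,
    )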

Failed Records

The Data Flow plugin can optionally be configured with a failed records output, to which all input data records that Privitar failed to tokenize are sent. The details of how to configure the failed records flow are specific to the Data Flow platform in use.

On NiFi, it is recommended to place a ValidateRecord processor before the Privitar processor to ensure that the incoming data is valid (for example, that it uses the correct date formats).
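
Conceptually, this pre-validation mirrors the failed-records flow: records that do not match the expected formats are routed to a separate output instead of causing tokenization failures. A minimal sketch (the field name and date format are hypothetical) might look like this:

    # Conceptual sketch of pre-validating records so that malformed input is
    # routed to a separate output rather than failing tokenization. The field
    # name and date format are hypothetical.
    from datetime import datetime

    def is_valid(record):
        """Check the parts of the record that commonly cause tokenization failures."""
        try:
            datetime.strptime(record["signup_date"], "%Y-%m-%d")   # expected date format
            return True
        except (KeyError, TypeError, ValueError):
            return False

    def split_records(records):
        """Route valid records onward and invalid records to a failed-records output."""
        valid, failed = [], []
        for record in records:
            (valid if is_valid(record) else failed).append(record)
        return valid, failed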