About Partitioned Data

Consider the following input folder structure, in which data is partitioned across many files:

data
  2016-01
    20160101.avro
    20160102.avro
    ...
  2016-02
    20160201.avro
    20160202.avro
    ...
  ...

We wish to process all the Avro files into the same folder structure under the result folder:

result
  2016-01
    20160101.avro
    20160102.avro
    ...
  2016-02
    20160201.avro
    20160202.avro
    ...
  ...

Privitar supports this operation by recognizing the folder structure beneath a root input folder (/data in this example) and creating a parallel structure under an output folder (/result).

The files representing the partitions are referred to using a relative path expression that is matched with the input folder as a base, and that may contain wildcards. In this example, the input folder is /data, and the required expression is */*.avro.

There is no limit on the number of wildcard folder levels that can be used in relative paths.

Note

Each wildcard matches exactly one folder level; a single wildcard cannot match multiple folder levels. Use one wildcard per folder level to identify the files. For example, to match files one level deeper, use */*/*.avro.
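
The one-wildcard-per-level rule can be illustrated with a short Python sketch. This is not Privitar's implementation; the function name matches_expression is hypothetical, and the sketch assumes that paths use / as the separator and that each wildcard follows ordinary shell-style matching within a single folder level.

```python
from fnmatch import fnmatchcase


def matches_expression(relative_path: str, expression: str) -> bool:
    """Return True if relative_path matches expression level by level.

    Hypothetical helper for illustration only. Both arguments are split
    on "/" so that a wildcard can never span more than one folder level.
    """
    path_parts = relative_path.split("/")
    pattern_parts = expression.split("/")

    # A different number of levels can never match, because each
    # wildcard covers exactly one level.
    if len(path_parts) != len(pattern_parts):
        return False

    # Compare each folder (or file) name with the pattern at the same level.
    return all(
        fnmatchcase(part, pattern)
        for part, pattern in zip(path_parts, pattern_parts)
    )


# Paths are relative to the input folder (/data in this example).
print(matches_expression("2016-01/20160101.avro", "*/*.avro"))       # True
print(matches_expression("2016/01/20160101.avro", "*/*.avro"))       # False
print(matches_expression("2016/01/20160101.avro", "*/*/*.avro"))     # True
```

The second call returns False because the file sits two folder levels below the input folder, while the expression only describes one.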

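After processing, each output file is created in the corresponding position: the part of the input file's path that matched the relative path expression is resolved against the output folder. For example, /data/2016-01/20160101.avro is written to /result/2016-01/20160101.avro.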
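
As a rough sketch of how an output location is derived, the following Python snippet strips the input folder from a file's path and resolves the remainder against the output folder. The function name output_path_for is hypothetical and the code only illustrates the path arithmetic described above, not how Privitar performs it internally.

```python
from pathlib import PurePosixPath


def output_path_for(input_file: str, input_root: str, output_root: str) -> str:
    """Map an input file to its mirrored position under output_root.

    Hypothetical helper for illustration only.
    """
    # The part of the path below the input folder, e.g. "2016-01/20160101.avro".
    relative = PurePosixPath(input_file).relative_to(input_root)

    # Resolve that relative part against the output folder.
    return str(PurePosixPath(output_root) / relative)


print(output_path_for("/data/2016-01/20160101.avro", "/data", "/result"))
# /result/2016-01/20160101.avro
```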

Privitar mirrors the input partition structure exactly rather than repartitioning based on the data. This is because the user might choose to transform, obfuscate, or even remove the variables on which the input data was originally partitioned.