Metadata Generation

The make_metadata_from_data.py utility generates baseline CSVW-EO metadata from an existing dataset.


Purpose

The generator infers:

  • Datatypes
  • Nullable proportions
  • Numeric bounds
  • Public keys
  • Contribution assumptions
  • Optional dependencies
  • Optional partitions
  • Optional column groups
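To make the first three items concrete, here is a minimal sketch of how datatypes, nullable proportions, and numeric bounds could be inferred from a CSV file using only the Python standard library. It is not the generator's actual implementation: the function `infer_column_summary`, the sample data, and the output keys (`datatype`, `null_proportion`, `minimum`, `maximum`) are hypothetical names chosen for illustration and are not claimed to match the CSVW-EO vocabulary.

```python
import csv
import io


def infer_column_summary(rows, column):
    """Summarise one column: inferred datatype, null proportion, numeric bounds.

    Hypothetical helper for illustration; empty strings are treated as nulls.
    """
    values = [row[column] for row in rows]
    non_null = [v for v in values if v not in ("", None)]
    null_proportion = 1 - len(non_null) / len(values) if values else 0.0

    def all_parse(cast):
        # True if every non-null value can be converted with `cast`.
        try:
            for v in non_null:
                cast(v)
            return True
        except ValueError:
            return False

    if non_null and all_parse(int):
        datatype, numbers = "integer", [int(v) for v in non_null]
    elif non_null and all_parse(float):
        datatype, numbers = "double", [float(v) for v in non_null]
    else:
        datatype, numbers = "string", []

    summary = {"datatype": datatype, "null_proportion": null_proportion}
    if numbers:
        # Bounds are taken directly from the observed data.
        summary["minimum"] = min(numbers)
        summary["maximum"] = max(numbers)
    return summary


# Small in-memory dataset standing in for data.csv (made-up values).
sample = io.StringIO("user_id,age,city\n1,34,Bern\n2,,Basel\n3,51,Bern\n4,18,\n")
rows = list(csv.DictReader(sample))
summaries = {name: infer_column_summary(rows, name) for name in rows[0]}
for name, summary in summaries.items():
    print(name, summary)
```

Note that the bounds in this sketch are the exact observed minimum and maximum, which is one reason inferred metadata should be reviewed before it is published.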

Important

Automatically generated metadata is derived from the real data, so inferred values such as numeric bounds and public keys may leak sensitive information. Manual review is always required before publication.


Basic Usage

python make_metadata_from_data.py \
  data.csv \
  --privacy_unit user_id

Contribution Levels

Four contribution levels are supported:

Level             Description
table             Table-level DP metadata only
table_with_keys   Table-level metadata with public keys
column            Per-column DP contribution metadata
partition         Fine-grained partition-level metadata

Higher (finer-grained) levels incur more risk of disclosure. Always use the lowest metadata granularity sufficient for the use case.

Example: Table-Level Metadata

python make_metadata_from_data.py \
  data.csv \
  --privacy_unit user_id \
  --default_contributions_level table

Example: Continuous Partitions

python make_metadata_from_data.py \
  data.csv \
  --privacy_unit user_id \
  --continuous_partitions '{"age":[0,18,30,50,100]}'
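As a rough illustration of what a list of edges such as `[0,18,30,50,100]` means, the sketch below maps values to bins with `bisect`. The helper `assign_partition` is hypothetical, and the interval convention (half-open bins `[lower, upper)` with the last bin closed on the right) is an assumption; the generator may treat boundaries differently.

```python
from bisect import bisect_right


def assign_partition(value, edges):
    """Return the (lower, upper) bin containing value, or None if out of range.

    Assumption: bins are half-open [lower, upper), except the last bin,
    which also includes its upper edge.
    """
    if value < edges[0] or value > edges[-1]:
        return None
    if value == edges[-1]:
        return (edges[-2], edges[-1])
    index = bisect_right(edges, value) - 1
    return (edges[index], edges[index + 1])


age_edges = [0, 18, 30, 50, 100]
for age in (5, 18, 42, 100, 120):
    print(age, assign_partition(age, age_edges))
```

Under these assumptions, five edges define four partitions, and a value outside the outermost edges belongs to none of them.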

Example: Column Groups

python make_metadata_from_data.py \
  data.csv \
  --privacy_unit user_id \
  --column_groups '[["age","income"]]'
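JSON-valued arguments are easy to misquote by hand. One option, sketched here, is to build the argument in Python with `json.dumps` and let `shlex.join` handle shell quoting. The sketch only uses the flags shown above; it prints the command rather than running it, and the variable names are illustrative.

```python
import json
import shlex

# Column groups expressed as ordinary Python data.
column_groups = [["age", "income"]]

command = [
    "python", "make_metadata_from_data.py",
    "data.csv",
    "--privacy_unit", "user_id",
    # Compact separators reproduce the JSON formatting used in the example above.
    "--column_groups", json.dumps(column_groups, separators=(",", ":")),
]

# shlex.join quotes each argument so the line can be pasted into a POSIX shell.
print(shlex.join(command))
```

Passing `command` as a list to `subprocess.run` would avoid shell quoting altogether, since each element is delivered to the script as a single argument.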

Example: Dependency Detection

python make_metadata_from_data.py \
  data.csv \
  --privacy_unit user_id \
  --with_dependencies True
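This page does not describe how dependencies are detected. As one possible interpretation, the sketch below checks for functional dependencies, where each value of one column always appears with the same value of another. The function `determines` and the sample rows are hypothetical and are not taken from the generator.

```python
def determines(rows, source, target):
    """Return True if each value of `source` maps to exactly one value of `target`."""
    seen = {}
    for row in rows:
        key, value = row[source], row[target]
        # setdefault stores the first value seen for this key and returns it.
        if seen.setdefault(key, value) != value:
            return False
    return True


# Made-up rows for illustration.
rows = [
    {"zip_code": "8001", "city": "Zurich", "age": 34},
    {"zip_code": "8001", "city": "Zurich", "age": 51},
    {"zip_code": "3011", "city": "Bern", "age": 34},
]
columns = ["zip_code", "city", "age"]

# Test every ordered pair of distinct columns.
dependencies = [
    (a, b) for a in columns for b in columns
    if a != b and determines(rows, a, b)
]
print(dependencies)
```

A pairwise check like this scans the data once per ordered pair of columns, so its cost grows quadratically with the number of columns.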

Notes

  • Datetime columns are inferred automatically
  • Numeric bounds are inferred for numeric columns
  • Dependency detection may increase runtime
  • Fine-grained metadata increases disclosure risk