The Consent Problem in Training Data

A review of 12 major AI training datasets found that fewer than 15% included data collected with explicit consent for machine learning use. The gap between the consent users originally gave and the consent required for AI training represents one of the largest unresolved ethical questions in data engineering.

What is the consent gap in AI training data?

The consent gap is the distance between what users agreed to when their data was collected (typically service improvement or personalization) and how that data is now being used (training machine learning models that may generate commercial products, replicate creative works, or make decisions about other people).

I traced the consent chain for a training dataset used by a mid-size AI company. The data originated from a social media platform whose 2019 terms of service granted the platform a “non-exclusive license to use, modify, and distribute user content.” The platform sold API access to a data aggregator. The aggregator sold curated datasets to AI companies. At no point did any user consent to having their writing used to train a language model. Each transaction was technically legal. The aggregate outcome was ethically indefensible.

The GDPR’s purpose limitation principle (Article 5(1)(b)) states that personal data collected for one purpose should not be further processed for an incompatible purpose without a fresh legal basis, such as new consent. Training AI models on social media posts, forum comments, and personal blogs stretches the definition of “compatible purpose” past its breaking point.

Why does this matter for data engineers specifically?

Data engineers build the pipelines that ingest, clean, and prepare training data, making them the last technical checkpoint where consent provenance could be verified before data enters a model.

Most training data pipelines I have reviewed validate format, completeness, and quality. None validated consent provenance. The data lineage infrastructure needed for consent tracking already exists but is rarely applied to this problem. Adding a “consent_type” field to training data metadata would be a trivial schema change with significant ethical implications.
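
To give a sense of how small that schema change could be, here is a minimal Python sketch of a consent check sitting alongside the usual format and quality checks. Everything except the “consent_type” field name is an assumption made for illustration: the ConsentType categories, the TrainingRecord shape, and the partition_by_consent function are hypothetical, and a real consent taxonomy would need legal review rather than an engineer's guess.

```python
from dataclasses import dataclass
from enum import Enum


class ConsentType(Enum):
    # Hypothetical categories, invented for this sketch.
    EXPLICIT_ML_TRAINING = "explicit_ml_training"
    SERVICE_IMPROVEMENT = "service_improvement"
    TOS_LICENSE_ONLY = "tos_license_only"
    UNKNOWN = "unknown"


@dataclass(frozen=True)
class TrainingRecord:
    record_id: str
    text: str
    source: str
    # Defaults to UNKNOWN so that missing provenance is never
    # silently treated as consent.
    consent_type: ConsentType = ConsentType.UNKNOWN


# Assumption: only explicit ML-training consent passes the gate.
ALLOWED_FOR_TRAINING = frozenset({ConsentType.EXPLICIT_ML_TRAINING})


def partition_by_consent(records, allowed=ALLOWED_FOR_TRAINING):
    """Split records into (accepted, rejected) by consent_type."""
    accepted, rejected = [], []
    for record in records:
        if record.consent_type in allowed:
            accepted.append(record)
        else:
            rejected.append(record)
    return accepted, rejected


if __name__ == "__main__":
    records = [
        TrainingRecord("a1", "opted-in essay", "research_portal",
                       ConsentType.EXPLICIT_ML_TRAINING),
        TrainingRecord("b2", "forum comment", "aggregator",
                       ConsentType.TOS_LICENSE_ONLY),
        TrainingRecord("c3", "blog post", "web_scrape"),  # no provenance
    ]
    accepted, rejected = partition_by_consent(records)
    print([r.record_id for r in accepted])  # ['a1']
    print([r.record_id for r in rejected])  # ['b2', 'c3']
```

The one deliberate design choice here is the UNKNOWN default: a record with no recorded provenance is rejected rather than waved through. Returning the rejected records instead of dropping them also keeps the gap measurable, since the pipeline can report how much of a dataset fails the check.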

The consent problem will not be resolved by technology alone. It requires legal frameworks, industry standards, and genuine accountability. But data engineers who build training pipelines without asking “did people consent to this use?” are complicit in the gap, whether or not they are legally liable for it. The question remains open. The responsibility to ask it does not.