AI Ethics in the Supply Chain: The Training Data Provenance Problem
Why does AI training data have a provenance problem?
AI training datasets are assembled from dozens or hundreds of sources, often through intermediaries, and the ethical conditions under which the original data was collected, labeled, and consented to are rarely documented or verified at the point of model training.
I audited a sentiment analysis model used for customer feedback processing. The model was trained on a dataset of 2.3 million labeled text samples. The labeling had been outsourced to a data labeling company, which had subcontracted portions to freelancers in 4 countries. The base text came from web scraping (consent status unknown), customer reviews (consented for review publication, not for ML training), and social media posts (mixed consent depending on platform terms). Nobody in the organization that deployed the model knew these details until I traced the chain.
This is not unusual. It is standard. Most organizations I work with cannot fully trace their training data provenance because the supply chain is long, fragmented, and poorly documented. The ethical implications are significant: you may be training on data collected without consent, labeled under exploitative working conditions, or sourced from communities that would object to its use.
How does supply chain thinking apply to AI training data?
The same principles that govern physical supply chain ethics (transparency, traceability, labor standards, consent verification) apply directly to AI training data, and the absence of these practices is an ethical failure, not a technical limitation.
Physical supply chains underwent an ethics transformation over the past 2 decades. Companies now audit their suppliers for labor practices, environmental standards, and regulatory compliance. The AI training data supply chain has not undergone the same transformation. Data brokers operate with minimal transparency. Labeling companies compete on price, incentivizing practices that prioritize volume over labeler welfare. Web scraping tools collect data without regard for the original creators’ intent.
I apply the same audit framework to data supply chains that ethical sourcing teams apply to physical supply chains. For each data source, I document: where the data originated, what consent was given at collection, how many intermediaries handled it, what transformations were applied, who performed the labeling, and what working conditions existed for labelers. In the sentiment analysis audit, completing this documentation for all 47 sources took 3 weeks. The findings were uncomfortable. They were also necessary. This is an extension of the data-quality-as-trust principle: you cannot trust data whose origins you cannot verify.
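One audit record per data source can be kept as a simple structured type. This is a minimal sketch of the fields listed above; the class and field names are illustrative, not an established schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class DataSourceAudit:
    """One provenance record per data source, mirroring the audit fields
    described above. Names are illustrative, not a standard schema."""
    source: str                          # where the data originated
    consent_mechanism: str               # what consent was given at collection
    intermediaries: List[str]            # brokers and vendors that handled it
    transformations: List[str]           # cleaning, filtering, relabeling steps
    labeling_vendor: Optional[str]       # who performed the labeling, if any
    labeler_conditions: Optional[str]    # documented working conditions
    known_limitations: List[str] = field(default_factory=list)

    def is_fully_documented(self) -> bool:
        # Under the audit framework, a source with unknown consent status
        # cannot be treated as trustworthy, whatever its volume.
        return bool(self.consent_mechanism) and self.consent_mechanism.lower() != "unknown"
```

A scraped source whose consent status is recorded as "unknown" would fail `is_fully_documented()` and be flagged for exclusion or further tracing.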
What does ethical data sourcing look like in practice?
Ethical data sourcing requires a documented provenance chain, verified consent at every collection point, fair labor standards for data labelers, and a willingness to exclude data sources that cannot meet these standards even when they reduce training data volume.
- Data lineage documentation: Every training dataset should have a documented lineage showing its path from collection to training. I use a structured format that records the source, consent mechanism, intermediaries, transformations, and known limitations for each data component.
- Consent chain verification: For each data source, verify that consent was obtained for the intended use (ML training), not just for the original use (e.g., product review publication). I have excluded 15-20% of candidate training data in recent projects because the consent chain could not be verified.
- Labeler welfare standards: Establish minimum standards for labeling partners: fair wages (above local living wage), reasonable working hours, content exposure limits for harmful material, and mental health support. According to Time’s investigation into AI data labeling, the conditions for some labeling workers are exploitative.
- Provenance as model card requirement: Include data provenance information in every model card. If you cannot document where your training data came from, that is information your users deserve to know.
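The consent chain verification step above can be sketched as a filter over candidate sources: every consent grant along the chain must explicitly cover ML training, and sources with no verifiable chain are excluded. The scope labels and field names here are assumptions for illustration.

```python
def consent_covers_training(consent_chain):
    """Return True only if every consent grant in the chain explicitly
    includes ML training. consent_chain is a list of scope sets, one per
    collection or transfer point; the scope names are illustrative."""
    return all("ml_training" in scopes for scopes in consent_chain)

def verified_sources(candidates):
    """Keep only sources whose consent chain exists and covers training,
    mirroring the exclusion practice described above."""
    return [
        s for s in candidates
        if s.get("consent_chain") and consent_covers_training(s["consent_chain"])
    ]
```

A customer review consented for publication but not for training fails the check, as does a scraped source with no consent chain at all; this is how the 15-20% exclusion rate mentioned above arises in practice.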
What are the implications for the AI industry’s data practices?
The AI industry’s current data sourcing practices would not survive the supply chain scrutiny applied to physical goods, and the ethical reckoning is coming either through regulation or through public accountability.
The data contracts framework I advocate for applies here. Just as data contracts formalize the interface between data producers and consumers, AI data provenance contracts should formalize the ethical standards expected at each stage of the supply chain. The organizations that build robust provenance practices now will be better positioned when regulation arrives. The organizations that do not will face the same reputational and financial consequences that companies with exploitative physical supply chains have faced.
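A provenance contract of this kind can be expressed as required documentation fields per supply-chain stage, checked before training begins. This is a sketch under assumed stage and field names; no standard contract format exists yet.

```python
# Required provenance fields for each supply-chain stage. The stage and
# field names are illustrative assumptions, not an established standard.
PROVENANCE_CONTRACT = {
    "collection": {"source", "consent_mechanism", "collection_date"},
    "labeling":   {"vendor", "wage_standard", "content_exposure_policy"},
    "transfer":   {"intermediary", "license_terms"},
}

def contract_violations(stage_records):
    """Compare documented provenance against the contract and return the
    missing fields per stage, so gaps surface before model training."""
    violations = {}
    for stage, required in PROVENANCE_CONTRACT.items():
        documented = set(stage_records.get(stage, {}))
        missing = required - documented
        if missing:
            violations[stage] = sorted(missing)
    return violations
```

Running the check against a partially documented dataset returns exactly which ethical documentation is missing at which stage, which is the enforcement point a contract provides that a policy document does not.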
I am not optimistic that voluntary adoption will be sufficient. The economic incentives favor large, cheap training datasets over well-documented, ethically sourced ones. But I am certain that the trajectory of regulation, public awareness, and industry maturation will eventually require the same supply chain transparency for data that we now expect for physical goods. Building provenance infrastructure now is not just ethical. It is prudent.