The Ground Truth Problem: When Your Labels Are Wrong
What is the ground truth problem in machine learning?
The ground truth problem is that ML models learn from human-assigned labels that are treated as objective truth but are actually subjective interpretations, with error rates that compound through training and produce models that confidently reproduce human mistakes at scale.
I audited a sentiment classification model that had been in production for 14 months. Accuracy on the test set was 91%. The team was satisfied. But when I sampled 500 predictions that the model was most confident about (confidence above 95%) and had domain experts re-evaluate them, 8% turned out to be wrong, not because the model had deviated from its training data but because the original labels were wrong. The model had faithfully learned to reproduce labeling errors. It was accurate relative to its training data but inaccurate relative to reality.
This is the epistemological trap. We measure model performance against labels. If 10% of the labels are wrong, a model that reproduces the labels perfectly, errors included, scores 100% on the test set while being right about reality only 90% of the time. Performance metrics become circular: the model is good at predicting what humans wrote down, not necessarily what is true.
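The circularity can be made concrete with a small simulation. This is a minimal sketch, not a reconstruction of the audit above: it assumes a binary task, a hypothetical 10% label error rate, and a hypothetical model that memorizes its labels exactly.

```python
import random

def simulate_label_noise(n_examples=10_000, label_error_rate=0.10, seed=0):
    """Compare accuracy measured against noisy labels with accuracy against reality."""
    rng = random.Random(seed)
    true_labels = [rng.randint(0, 1) for _ in range(n_examples)]
    # Flip a fraction of the true labels to imitate annotator mistakes.
    noisy_labels = [1 - y if rng.random() < label_error_rate else y
                    for y in true_labels]
    # Hypothetical worst case: the model reproduces its labels exactly.
    predictions = list(noisy_labels)
    measured = sum(p == y for p, y in zip(predictions, noisy_labels)) / n_examples
    actual = sum(p == y for p, y in zip(predictions, true_labels)) / n_examples
    return measured, actual

measured_accuracy, true_accuracy = simulate_label_noise()
print(f"accuracy vs. labels:  {measured_accuracy:.3f}")
print(f"accuracy vs. reality: {true_accuracy:.3f}")
```

Under these assumptions the score against the labels is perfect while the score against reality sits near 90%. The gap stays invisible unless some labels are independently re-checked.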
Why are labeling errors so persistent?
Labeling errors persist because labeling is treated as a commodity task (low-paid, high-speed, minimal quality control) when it is actually a judgment task that requires domain expertise, clear guidelines, and systematic quality assurance.
I reviewed the labeling process for 3 production datasets. The pattern was consistent: guidelines were written by data scientists who understood the model’s purpose, labeling was done by contractors who did not, and quality checks were statistical (inter-annotator agreement) rather than epistemological (are the annotations correct relative to reality?). According to research published on label errors in benchmark datasets, even canonical ML benchmarks like ImageNet contain label errors at rates between 3% and 6%.
The economics make this worse. Labeling at scale means thousands of annotations per day. Each annotation gets seconds of attention. Ambiguous cases (is this review “neutral” or “slightly negative”?) are resolved by coin-flip decisions that introduce noise. That noise becomes signal when the model trains on it. The evaluation problem in AI engineering starts here, before any model is trained.
How can teams address label quality systematically?
Teams can address label quality through multi-annotator redundancy, expert adjudication for ambiguous cases, confidence-weighted training, and iterative relabeling of the examples that models find most confusing.
- Multi-annotator overlap: Every training example should be labeled by at least 3 annotators. Disagreements are not noise; they are information. I track inter-annotator agreement at the task level and the example level. Tasks with agreement below 70% need better guidelines. Examples with annotator disagreement need expert review.
- Expert adjudication: Ambiguous cases should be routed to domain experts, not resolved by majority vote. In a content moderation dataset I worked on, majority vote incorrectly labeled 12% of edge cases. Expert adjudication reduced that error to 3%.
- Confidence-weighted training: Instead of treating all labels as equally trustworthy, I assign confidence scores based on annotator agreement and annotator track record. High-agreement examples get full weight. Low-agreement examples get reduced weight. This prevents the model from overfitting to disputed labels.
- Active relabeling: After initial training, I identify the examples where the model is most uncertain (highest prediction entropy). These are systematically relabeled by experts. In one project, relabeling the top 5% most uncertain examples improved model accuracy by 4 percentage points, more than any architectural change.
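The practices in this list can be sketched in a few lines of Python. Everything here is illustrative: the annotations and model probabilities are made up, agreement is measured as the simple share of annotators choosing the most common label (the metric is an assumption; chance-corrected measures such as Cohen's kappa are alternatives), and annotator track record is left out of the weights to keep the sketch short.

```python
import math
from collections import Counter

def agreement(votes):
    """Share of annotators who picked the most common label for one example."""
    top_count = Counter(votes).most_common(1)[0][1]
    return top_count / len(votes)

def prediction_entropy(probs):
    """Shannon entropy (in nats) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def most_uncertain(prob_rows, fraction=0.05):
    """Indices of the highest-entropy examples, most uncertain first."""
    k = max(1, math.ceil(len(prob_rows) * fraction))
    order = sorted(range(len(prob_rows)),
                   key=lambda i: prediction_entropy(prob_rows[i]),
                   reverse=True)
    return order[:k]

# Hypothetical annotations: three annotators per example.
votes = {
    "review_1": ["negative", "negative", "negative"],
    "review_2": ["neutral", "negative", "negative"],
    "review_3": ["neutral", "negative", "positive"],
}

example_agreement = {ex: agreement(v) for ex, v in votes.items()}
task_agreement = sum(example_agreement.values()) / len(example_agreement)

# Any disagreement routes the example to an expert rather than a majority vote.
needs_expert_review = [ex for ex, a in example_agreement.items() if a < 1.0]

# Simplest possible confidence weight: the agreement share itself.
sample_weights = dict(example_agreement)

# Hypothetical class probabilities from a model after initial training.
predicted_probs = [
    [0.98, 0.01, 0.01],
    [0.40, 0.35, 0.25],
    [0.70, 0.20, 0.10],
]
relabel_queue = most_uncertain(predicted_probs, fraction=0.5)

print(f"task agreement: {task_agreement:.2f} (guidelines need work below 0.70)")
print("expert review:", needs_expert_review)
print("relabel first:", relabel_queue)
```

Weights like `sample_weights` can then be passed to a training routine that supports per-example weighting; many scikit-learn estimators, for instance, accept a `sample_weight` argument in `fit`. On a real dataset the `fraction` argument would be set to something like the 5% mentioned above rather than the 0.5 used for this three-row toy.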
What does this mean for the epistemology of data-driven systems?
If the labels that define “correct” are themselves uncertain, then the confidence scores our models produce are doubly uncertain, a compounding of model uncertainty on top of label uncertainty that most ML systems do not acknowledge or communicate.
The philosophical implication is significant. Every ML system makes a claim: “this input belongs to this category with this confidence.” That claim rests on the assumption that the training labels were correct. When they are not, the entire confidence framework is miscalibrated. A model that says “92% confident this is spam” might really be saying “92% confident this matches what our annotators labeled as spam, where annotators agreed only 85% of the time.” The effective confidence is lower than the stated confidence.
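A rough way to put a number on that gap is sketched below. It rests on three simplifications introduced here, not taken from any real system: the task is binary, the 85% agreement figure is treated as the probability that a label is correct, and model errors are independent of label errors.

```python
def effective_confidence(p_matches_label, p_label_correct):
    """Probability that a binary prediction is correct relative to reality.

    The prediction is right when it matches a correct label, or when it
    disagrees with an incorrect label (in the binary case there is only
    one other class). Assumes model and label errors are independent.
    """
    matches_correct_label = p_matches_label * p_label_correct
    rejects_wrong_label = (1 - p_matches_label) * (1 - p_label_correct)
    return matches_correct_label + rejects_wrong_label

stated = 0.92
effective = effective_confidence(stated, p_label_correct=0.85)
print(f"stated: {stated:.2f}, effective: {effective:.3f}")
```

With these inputs the stated 92% shrinks to roughly 79%. The exact figure depends entirely on the assumptions, but the direction does not: whenever the model's confidence is above 50% and the labels are imperfect, the confidence relative to reality is lower than the confidence relative to the labels.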
This connects directly to the epistemology of metrics in a broader organizational context. We trust numbers because they feel precise. But precision without accuracy is misleading. A model confident in wrong labels is precisely wrong, which is worse than being imprecisely right. The Popperian principle applies: we should spend as much time trying to prove our labels wrong as we spend training models to predict them.
Ground truth is a misnomer. What we have is ground consensus, the agreed-upon labels of fallible humans working under time pressure with imperfect guidelines. Building ML systems that acknowledge this uncertainty, that treat labels as hypotheses rather than facts, is not just better engineering. It is more honest epistemology. The models we build are only as truthful as the labels we give them, and the labels are less truthful than most teams want to admit.