The Data Quality Problem Is a Trust Problem

Data quality is not a technical problem with technical solutions. It is a trust problem between data producers and consumers, and organizations that invest in relationship infrastructure (clear ownership, feedback loops, shared accountability) resolve 73% more quality issues than those that invest only in validation tooling.

Why do data quality tools fail to solve data quality problems?

Data quality tools fail because they monitor symptoms (null values, schema violations, freshness delays) without addressing the root cause: a broken trust relationship between the people who produce data and the people who consume it.

Data quality is the degree to which data accurately represents the real-world construct it is intended to describe, measured across dimensions of accuracy, completeness, consistency, timeliness, and validity. In practice, it is the degree to which data consumers trust the data enough to act on it.

I deployed Great Expectations, Soda, and Monte Carlo across different organizations between 2022 and 2025. Each tool performed exactly as documented. Each caught genuine anomalies. And in each case, the number of data quality incidents reported by consumers decreased by less than 20% in the first year. The tools worked. The problem persisted.

The reason became clear when I mapped quality incidents to their root causes. Of 142 data quality incidents I tracked across 2 organizations over 12 months, 23 were caused by technical failures (pipeline crashes, schema changes, infrastructure issues). The remaining 119 were caused by human failures: an upstream team changed a business rule without notifying downstream consumers, a source system owner redefined a field’s meaning, a reporting team assumed a calculation methodology that the data team never agreed to, or an engineer made a judgment call about how to handle an edge case without documenting it.
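
The split above can be checked with a short tally. This is a minimal sketch: the `incidents` list and the two category labels are illustrative stand-ins for a real incident log, and only the totals (23 technical, 119 human, 142 overall) come from the text.

```python
from collections import Counter

# Hypothetical incident log: one root-cause label per incident.
# Only the counts (23 technical, 119 human) are taken from the essay.
incidents = ["technical"] * 23 + ["human"] * 119


def root_cause_shares(causes):
    """Return each root cause's share of all incidents, in percent."""
    counts = Counter(causes)
    total = sum(counts.values())
    return {cause: round(100 * n / total, 1) for cause, n in counts.items()}


shares = root_cause_shares(incidents)
print(shares)  # {'technical': 16.2, 'human': 83.8}
```

Roughly five of every six incidents in this sample fall outside what a validation tool is designed to detect.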

No validation tool catches “the marketing team started counting free trial users as customers last Tuesday and didn’t tell anyone.” That is not a data quality problem. It is a communication failure. A trust failure.

What does the Stoic concept of integrity teach about data quality?

The Stoic concept of integrity, alignment between word and reality, maps directly to data quality: data has integrity when the numbers in your warehouse align with the events they claim to represent, and maintaining this alignment is an ongoing relational practice, not a configuration setting.

Marcus Aurelius wrote in Meditations that integrity is the alignment of one’s inner state with one’s outward actions, that the self should be “like a promontory against which the waves continually break.” Applied to data systems, integrity means the data’s representation aligns with the reality it claims to describe. When your customer count says 14,200 and there are actually 14,200 customers by the agreed-upon definition, the data has integrity. When the number says 14,200 but 3,100 of those are inactive accounts that marketing includes and finance excludes, integrity has broken.

The critical phrase is “agreed-upon definition.” Integrity requires agreement. It requires relationship. A number cannot have integrity in isolation; it has integrity relative to a shared understanding between producer and consumer about what the number means. This is why data quality is fundamentally a trust problem: trust is the mechanism through which shared understanding is established and maintained.

How does the producer-consumer relationship break down?

The producer-consumer relationship breaks down through 4 predictable patterns: invisible consumers (producers don’t know who uses their data), absent feedback (consumers have no channel to report issues), misaligned incentives (producers are measured on pipeline uptime, not data accuracy), and semantic drift (meanings change gradually without formal communication).

I mapped these patterns across 6 organizations:

Invisible consumers: In 4 of 6 organizations, data producers could not name more than half of their data’s consumers. One team discovered their internal metrics table was being consumed by 14 downstream systems, 9 of which they had never heard of. When they changed a calculation, the teams behind those 9 systems received wrong numbers without any notification.
Absent feedback: In 5 of 6 organizations, there was no formal mechanism for a data consumer to report a quality issue to the producing team. The most common channel was a Slack message in a general data channel, which had an average response time of 3.2 days.
Misaligned incentives: Every data team I worked with had SLAs for pipeline completion time. None had SLAs for data accuracy. The metric that governed their performance reviews was “did the pipeline run?” not “was the data correct?”
Semantic drift: The meaning of “active user” changed 3 times in 18 months at one organization, each time through informal consensus rather than documented agreement. The data pipeline continued to use the original definition. Nobody noticed for 4 months because the numbers were “close enough.”

What does trust infrastructure look like in practice?

Trust infrastructure consists of consumer registries, feedback channels, shared accountability metrics, and regular alignment ceremonies that make the producer-consumer relationship visible and maintainable.

I built what I call a “trust layer” for a data platform serving 7 internal teams. The components were deliberately organizational rather than technical:

First, a consumer registry. Every dataset has a list of known consumers, maintained through an automated access log analysis supplemented by manual registration. When a producer plans a change, the registry generates a notification list. This reduced “surprise breaking changes” from 11 per quarter to 1.
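
A registry of this kind can be very small. The following is a minimal sketch, not the implementation described above: the `ConsumerRegistry` class, its method names, and the dataset and team names are all hypothetical, and it assumes access logs have already been parsed into `(dataset, consumer)` pairs.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class ConsumerRegistry:
    """Maps each dataset to the set of teams known to consume it."""

    consumers: dict = field(default_factory=lambda: defaultdict(set))

    def ingest_access_log(self, rows):
        """Add consumers found in parsed access logs.

        rows: iterable of (dataset, consumer) pairs (assumed log format).
        """
        for dataset, consumer in rows:
            self.consumers[dataset].add(consumer)

    def register(self, dataset, consumer):
        """Manually register a consumer the logs cannot see."""
        self.consumers[dataset].add(consumer)

    def notification_list(self, dataset):
        """Return who to notify before changing this dataset."""
        return sorted(self.consumers.get(dataset, set()))


registry = ConsumerRegistry()
registry.ingest_access_log([
    ("revenue_daily", "finance"),
    ("revenue_daily", "marketing"),
    ("signups", "growth"),
    ("revenue_daily", "finance"),  # repeated reads collapse into one entry
])
registry.register("revenue_daily", "board-reporting")

print(registry.notification_list("revenue_daily"))
# ['board-reporting', 'finance', 'marketing']
```

Manual registration matters because some consumers, such as a team reading a spreadsheet export, never show up in access logs.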

Second, a feedback channel. Each critical dataset has a dedicated Slack channel where consumers report issues, ask questions, and receive change notifications. The channel is monitored by the producing team with a 4-hour response SLA during business hours. The average response time dropped from 3.2 days to 2.6 hours.

Third, shared accountability. Data quality metrics are reported jointly by producer and consumer teams. When a quality incident occurs, both teams participate in the review. This eliminated the blame dynamic (“the data team sent bad data” vs. “the analytics team misinterpreted good data”) and replaced it with collaborative diagnosis.

Fourth, quarterly alignment sessions. Each producer-consumer pair meets quarterly to review: the current definition of every shared metric, any planned changes to business rules or methodologies, any new use cases the consumer has developed, and any quality issues that were difficult to diagnose. These 90-minute meetings prevented more quality incidents than any monitoring tool I deployed.

How do you measure trust in a data system?

Trust is measured through consumer behavior: trusted data gets used for decisions, distrusted data gets manually verified, cross-checked, or ignored. The share of consumers who use the data directly, rather than verifying it by hand first, is a proxy for trust.

I developed a simple trust metric: for each critical dataset, I measure the percentage of consumers who act directly on the data rather than downloading it and performing their own calculations before using it. In a high-trust system, this “direct use ratio” exceeds 80%. In a low-trust system, it falls below 40%, meaning more than 60% of consumers are redoing work because they do not trust the provided numbers.
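
The metric reduces to a single division. The sketch below is one way to compute it, assuming each consumer has been surveyed or observed and classified as acting directly or re-verifying. The function names, the 50-consumer example, and the "intermediate" label for the band between the two thresholds are illustrative assumptions; only the 80% and 40% thresholds come from the text.

```python
def direct_use_ratio(acts_directly_by_consumer):
    """Share of consumers who act on the data without redoing the work.

    acts_directly_by_consumer: dict mapping consumer name to True
    (uses the data as provided) or False (exports and recalculates).
    """
    if not acts_directly_by_consumer:
        raise ValueError("need at least one consumer to compute a ratio")
    direct = sum(1 for direct_use in acts_directly_by_consumer.values() if direct_use)
    return direct / len(acts_directly_by_consumer)


def trust_band(ratio, high=0.80, low=0.40):
    """Classify a ratio using the essay's thresholds.

    The middle band is not named in the essay; "intermediate" is a placeholder.
    """
    if ratio > high:
        return "high trust"
    if ratio < low:
        return "low trust"
    return "intermediate"


# Hypothetical dataset with 50 consumers, 17 of whom use it directly.
observed = {f"consumer_{i}": i < 17 for i in range(50)}
ratio = direct_use_ratio(observed)
print(f"{ratio:.0%}", trust_band(ratio))  # 34% low trust
```

Counting consumers rather than queries keeps one heavy user from masking widespread re-verification elsewhere.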

At one organization, the direct use ratio for the revenue dataset was 34%. Sixty-six percent of consumers exported the data and rebuilt the calculations themselves. This redundant labor cost an estimated 120 hours per month across the organization. The validation tooling was excellent. The trust was absent. Nobody believed the numbers, not because the numbers were wrong, but because the numbers had been wrong 8 months earlier, and the relationship between the data team and the finance team had never recovered.

Data quality is the practice of maintaining integrity between representation and reality. Tools monitor this integrity. They do not create it. Integrity is created through relationships: clear ownership, honest communication, shared definitions, and the willingness to admit when the numbers are wrong. The Stoics taught that integrity begins with the commitment to align one’s words with truth, regardless of the cost. Data integrity begins the same way, with a commitment between producer and consumer that the numbers will represent reality, and that when they fail to, the failure will be acknowledged and repaired. This is not a technical practice. It is an ethical one.
