Data Validation Is Testing for Your Data Pipeline

Implementing data validation as a testing discipline across 3 production pipelines caught 94% of data quality issues before they reached downstream consumers, reducing data incident tickets from 23 per month to 2. Data validation is not optional hygiene. It is the testing layer your data pipeline cannot function without.

What problem does this system address?

Most data pipelines have no formal validation layer, meaning schema violations, null values, range anomalies, and referential integrity failures propagate silently to downstream consumers until someone notices bad numbers in a dashboard or report.

I inherited a pipeline that processed 200,000 records daily from 4 source systems. It had zero data validation. Records with null primary keys, negative revenue values, and future-dated transactions all flowed through to the analytics layer. The team discovered issues only when stakeholders reported “the numbers look wrong.” By that point, the bad data had been in production dashboards for days, sometimes weeks. The trust problem this created took months to repair.

How is the system structured?

The validation system operates at three checkpoints (ingestion, transformation, serving) with tests categorized by severity (blocking, warning, monitoring) and implemented using a combination of Great Expectations, dbt tests, and custom checks.
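
The checkpoint and severity structure can be sketched in plain Python. This is a minimal illustration, not the Great Expectations or dbt configuration itself; `Severity`, `Check`, `BlockingCheckFailed`, and `run_checkpoint` are hypothetical names introduced here.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Severity(Enum):
    BLOCKING = "blocking"      # halt the pipeline
    WARNING = "warning"        # flag for manual review
    MONITORING = "monitoring"  # record only


@dataclass
class Check:
    name: str
    checkpoint: str            # "ingestion", "transformation", or "serving"
    severity: Severity
    passes: Callable[[list[dict]], bool]


class BlockingCheckFailed(Exception):
    """Raised when a blocking check fails, so the caller can stop the run."""


def run_checkpoint(checks: list[Check], checkpoint: str, rows: list[dict]) -> list[str]:
    """Run every check registered for one checkpoint.

    Raises on the first blocking failure; returns notes for the
    non-blocking failures (warning and monitoring).
    """
    notes = []
    for check in checks:
        if check.checkpoint != checkpoint or check.passes(rows):
            continue
        if check.severity is Severity.BLOCKING:
            raise BlockingCheckFailed(check.name)
        notes.append(f"{check.severity.value}: {check.name}")
    return notes
```

Keeping severity as data on each check, rather than hard-coding it in the runner, makes it easy to reclassify a check when a consumer's tolerance changes.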

Step 1: Ingestion validation

At the point where data enters the pipeline, I validate schema conformance (expected columns, types, not-null constraints) and volume expectations (record count within 2 standard deviations of the 30-day average). Schema violations are blocking: the pipeline halts if the source schema has changed unexpectedly. Volume anomalies generate warnings for manual review. I implement this using Great Expectations with custom expectation suites per source system. A typical source has 15 to 25 expectations that run in under 30 seconds for 200,000 records.
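
The two ingestion checks can be illustrated without any framework. The sketch below uses only the standard library and is not the Great Expectations API; the column names in `EXPECTED_SCHEMA` and both function names are assumptions made for the example.

```python
from statistics import mean, stdev

# Hypothetical schema for one source: column name -> (type, nullable)
EXPECTED_SCHEMA = {
    "order_id": (str, False),
    "revenue": (float, False),
    "coupon_code": (str, True),
}


def schema_violations(rows, schema=EXPECTED_SCHEMA):
    """Return a list of schema problems; an empty list means conformant."""
    problems = []
    for i, row in enumerate(rows):
        # Expected columns: no missing and no unexpected keys.
        if set(row) != set(schema):
            problems.append(f"row {i}: columns {sorted(row)} != {sorted(schema)}")
            continue
        for col, (col_type, nullable) in schema.items():
            value = row[col]
            if value is None:
                if not nullable:
                    problems.append(f"row {i}: {col} is null")
            elif not isinstance(value, col_type):
                problems.append(f"row {i}: {col} has type {type(value).__name__}")
    return problems


def volume_is_anomalous(todays_count, daily_counts, max_sigma=2.0):
    """True if today's count is more than max_sigma sample standard
    deviations away from the mean of the historical daily counts."""
    mu = mean(daily_counts)
    sigma = stdev(daily_counts)
    return abs(todays_count - mu) > max_sigma * sigma
```

In the setup described above, a non-empty result from `schema_violations` would halt the run, while `volume_is_anomalous` (fed the last 30 daily counts) would only raise a warning for manual review.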

Step 2: Transformation validation

After each major transformation step, I validate business logic invariants. Revenue should be non-negative. Dates should be within business-reasonable ranges. Foreign keys should resolve to valid references. I implement these as dbt tests, with each model having a corresponding test file. A typical transformation layer has 40 to 60 tests. Tests run as part of the dbt build process, meaning a failing test on a model causes dbt to skip the models downstream of it, so bad data does not propagate further. This mirrors how application developers treat unit tests: if the test fails, the code does not deploy.
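
In dbt these invariants would be declared as tests on each model. To show the logic itself, here is the same set of checks as a plain-Python sketch; the field names (`revenue`, `transaction_date`, `customer_id`) and the function name are hypothetical.

```python
from datetime import date


def invariant_failures(transactions, known_customer_ids, earliest, latest):
    """Return (transaction id, reason) pairs for every violated invariant."""
    failures = []
    for t in transactions:
        # Revenue should be non-negative.
        if t["revenue"] < 0:
            failures.append((t["id"], "negative revenue"))
        # Dates should fall inside a business-reasonable window.
        if not earliest <= t["transaction_date"] <= latest:
            failures.append((t["id"], "date out of range"))
        # Foreign keys should resolve to a known reference.
        if t["customer_id"] not in known_customer_ids:
            failures.append((t["id"], "unresolved customer_id"))
    return failures


# Usage with made-up records: the second row breaks two invariants.
sample = [
    {"id": 1, "revenue": 50.0, "transaction_date": date(2024, 3, 1), "customer_id": "c1"},
    {"id": 2, "revenue": -5.0, "transaction_date": date(2031, 1, 1), "customer_id": "c1"},
]
print(invariant_failures(sample, {"c1"}, date(2020, 1, 1), date(2024, 12, 31)))
```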

Step 3: Serving layer validation

Before data is exposed to consumers (dashboards, APIs, reports), I run aggregate validation. These checks ask whether totals reconcile with source system totals, whether key metrics fall within expected ranges, and whether consumer-facing fields contain unexpected nulls. They run as a post-build step and generate alerts rather than blocking, because blocking at the serving layer means consumers see stale data, which is sometimes worse than slightly imperfect fresh data. The severity classification (blocking versus warning) is a design decision that depends on the consumer’s tolerance, not a universal rule.
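
A serving-layer check of this kind might look like the following sketch. It collects alert messages instead of raising on a failed check, which matches the alert-rather-than-block choice described above. The function name, its parameters, and the relative tolerance are assumptions for illustration.

```python
import math


def serving_alerts(source_total, served_total, metrics, expected_ranges,
                   consumer_rows, required_fields, rel_tol=1e-6):
    """Collect alert messages rather than raising on a failed check."""
    alerts = []

    # Reconciliation: the served total should match the source system total.
    if not math.isclose(source_total, served_total, rel_tol=rel_tol):
        alerts.append(f"reconciliation: source={source_total} served={served_total}")

    # Key metrics should fall inside their expected ranges.
    for name, (low, high) in expected_ranges.items():
        value = metrics[name]
        if not low <= value <= high:
            alerts.append(f"range: {name}={value} outside [{low}, {high}]")

    # Consumer-facing fields should not contain unexpected nulls.
    for field in required_fields:
        nulls = sum(1 for row in consumer_rows if row.get(field) is None)
        if nulls:
            alerts.append(f"nulls: {field} has {nulls} null value(s)")

    return alerts
```

For a consumer that cannot tolerate imperfect data, the caller could treat a non-empty alert list as a reason to hold the refresh instead.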

How do you validate it works?

Validation effectiveness is measured by catch rate (what percentage of issues are caught before reaching consumers), false positive rate (how often valid data is flagged incorrectly), and time-to-detection (how quickly after ingestion an issue is identified).

I track three metrics for the validation system itself. Catch rate is measured monthly by comparing validation alerts to downstream incident reports. A catch rate below 90% means the validation suite has gaps. False positive rate is measured by reviewing all validation failures weekly. If more than 10% are false positives, expectations need recalibration. Time-to-detection is measured from ingestion timestamp to alert timestamp. The target is under 15 minutes for blocking issues and under 1 hour for warnings.
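
The three metrics and their thresholds can be written down as a short sketch. One assumption is made loudly here: catch rate is computed as issues caught by validation divided by caught plus escaped issues (those that surfaced as downstream incidents). All names are hypothetical.

```python
from datetime import datetime, timedelta

CATCH_RATE_FLOOR = 0.90          # below this, the suite has gaps
FALSE_POSITIVE_CEILING = 0.10    # above this, recalibrate expectations
DETECTION_TARGETS = {
    "blocking": timedelta(minutes=15),
    "warning": timedelta(hours=1),
}


def catch_rate(caught, escaped):
    """Share of issues caught before reaching consumers (assumed formula)."""
    total = caught + escaped
    return caught / total if total else 1.0


def false_positive_rate(false_positives, total_failures):
    """Share of validation failures that flagged valid data."""
    return false_positives / total_failures if total_failures else 0.0


def detection_on_target(severity, ingested_at, alerted_at):
    """True if the alert fired within the target window for its severity."""
    return (alerted_at - ingested_at) < DETECTION_TARGETS[severity]


def review(caught, escaped, false_positives, total_failures):
    """Return the follow-up actions implied by the two rate thresholds."""
    findings = []
    if catch_rate(caught, escaped) < CATCH_RATE_FLOOR:
        findings.append("validation suite has gaps")
    if false_positive_rate(false_positives, total_failures) > FALSE_POSITIVE_CEILING:
        findings.append("expectations need recalibration")
    return findings
```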

According to Great Expectations documentation, organizations implementing systematic data validation reduce data-related incidents by 60% to 80% within the first quarter. My experience aligns with the higher end of that range. The investment is modest (2 to 3 weeks of initial setup, 1 to 2 hours per week of maintenance) and the return is immediate. Data validation is not a luxury for mature data teams. It is a prerequisite for any team that wants its data to be trusted. The data contracts pattern extends this thinking to inter-team agreements.

adam@adam-analytics.com writes about AI systems, software architecture, and the philosophy of technology at .