Your Data Catalog Is Lying to You
How does a data catalog become a liability?
A data catalog becomes a liability when it presents stale metadata as current truth, leading analysts and engineers to make decisions based on descriptions, lineage, and quality indicators that no longer reflect the actual state of the data.
I trusted a data catalog entry that described a “customer_revenue” column as “total lifetime revenue in USD.” I built a quarterly report on it. Three weeks later, a senior analyst pointed out that the column had been redefined 8 months earlier to mean “revenue in the last 12 months, in the customer’s local currency.” The catalog was never updated. My report was wrong. Every decision made from that report for 3 weeks was based on incorrect data. The catalog did not just fail to help. It actively misled.
Why does metadata drift happen so consistently?
Metadata drift happens because catalog maintenance is treated as a one-time documentation project rather than a continuous operational discipline, and because the people who change data (engineers) are rarely the people who update the catalog (if anyone does).
The root cause is structural. Data engineers modify schemas, change transformation logic, and deprecate tables as part of their daily work. Updating the data catalog is an additional step with no automated enforcement. In 4 teams I have worked with, catalog updates were part of the “definition of done” checklist for pipeline changes. In all 4, the checklist step was skipped more than 60% of the time. Not because engineers were lazy, but because the catalog update was disconnected from the deployment workflow.
According to Gartner’s research on data catalog adoption, organizations that treat catalog maintenance as a project achieve less than 30% accuracy after 12 months. Organizations that embed catalog updates in CI/CD pipelines maintain accuracy above 85%. The difference is automation, not intention.
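Embedding the catalog check in CI/CD can be as simple as diffing the live schema against what the catalog recorded and failing the build on any mismatch. This is a minimal sketch of that idea: the table structures and column names are illustrative, and a real check would pull both sides from the warehouse and the catalog's API rather than from hard-coded dicts.

```python
# Sketch of a CI gate that blocks a deploy when the live schema and the
# catalog's recorded schema diverge. All names here are illustrative.

def schema_drift(live: dict, cataloged: dict) -> list[str]:
    """Return human-readable differences between two {column: type} maps."""
    drift = []
    for col, typ in live.items():
        if col not in cataloged:
            drift.append(f"added column: {col} ({typ})")
        elif cataloged[col] != typ:
            drift.append(f"type changed: {col} {cataloged[col]} -> {typ}")
    for col in cataloged:
        if col not in live:
            drift.append(f"dropped column: {col}")
    return drift

if __name__ == "__main__":
    live = {"customer_id": "bigint", "revenue_12m_local": "numeric"}
    cataloged = {"customer_id": "bigint", "customer_revenue": "numeric"}
    for line in schema_drift(live, cataloged):
        print(line)
    # A CI wrapper would exit nonzero when drift is non-empty,
    # forcing the catalog update into the deployment workflow.
```

The point is not the diffing logic, which is trivial, but where it runs: inside the same pipeline that deploys the schema change, so skipping the catalog update fails the build instead of silently accumulating drift.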
What does an honest data catalog require?
An honest data catalog requires automated metadata extraction, schema change detection, freshness tracking, and usage analytics that surface stale or unused entries, replacing human discipline with system enforcement.
- Automated schema syncing: The catalog should pull column names, types, and constraints directly from the database on a schedule (I use hourly). If a column is added, renamed, or dropped, the catalog reflects it without human intervention. Manual descriptions still require human input, but the structural metadata stays accurate
- Freshness indicators: Every catalog entry should display when it was last verified, not when it was last edited. I add a “verified_at” timestamp that resets only when someone explicitly confirms accuracy. Entries older than 90 days get flagged. Entries older than 180 days get a “stale” warning visible to all consumers
- Usage tracking: Catalog entries that nobody queries are candidates for deprecation. I instrument the query layer to track which tables and columns appear in production queries. A table with zero queries in 6 months either does not need a catalog entry or does not need to exist. This connects directly to the via negativa data architecture principle
- Schema change alerts: When a schema change is deployed, the catalog should automatically flag affected descriptions as “unverified.” This does not require the description to be wrong. It flags that the description might be wrong, which is the honest position
What is the organizational cost of catalog dishonesty?
The cost is measured in duplicated work, incorrect analyses, and eroded trust: analysts who distrust the catalog build their own documentation, creating parallel metadata systems that compound the fragmentation problem.
I surveyed 12 data analysts across 2 organizations with data catalogs. Nine of them maintained personal documentation (Notion pages, spreadsheets, Slack bookmarks) that they trusted more than the official catalog. This means the organization paid for a catalog tool ($30,000 to $120,000 per year depending on vendor), invested in initial population, and still ended up with fragmented, inconsistent metadata because the catalog was not trustworthy enough to be the single source of truth.
The deeper cost is cultural. When the catalog lies, data consumers stop trusting institutional documentation. They rely on tribal knowledge: asking the engineer who built the table, searching Slack history, reading dbt model SQL directly. This works until that engineer leaves. Then the knowledge disappears, and the organization discovers that its data observability was built on a person, not a system.
A data catalog is a promise that metadata reflects reality. Breaking that promise is worse than never making it, because false confidence produces worse outcomes than acknowledged uncertainty. If your catalog cannot be trusted, either fix the maintenance discipline or shut it down. A known absence of documentation is less dangerous than documentation that lies.