Why does governance belong in version control, not spreadsheets?
Data governance enforced through spreadsheets and committee meetings catches violations after deployment, when remediation is expensive. Governance embedded in CI/CD catches violations before merge, when remediation costs a code change.
I audited the governance process at a healthcare data company in 2024. Their compliance workflow: a data engineer builds a new pipeline, submits a change request form (Google Sheet), a governance committee reviews it in a biweekly meeting (average wait: 8 days), the committee approves or requests changes (40% revision rate), and the engineer implements revisions and resubmits. End-to-end cycle: 12 to 22 business days for a governance review.
The result was predictable. Engineers avoided the process. Of 47 pipeline changes deployed in Q3 2024, only 19 went through governance review. The other 28 were deployed directly, with engineers planning to “submit the form later.” Later never came. The governance process had an effective coverage rate of 40%, and the violations it caught were the ones that compliant engineers submitted voluntarily.
What does policy-as-code look like for data governance?
Policy-as-code translates governance rules into machine-executable checks that run automatically in CI/CD pipelines, treating compliance violations the same way linting tools treat code style violations: as pre-merge blockers.
Implementation framework using Open Policy Agent (OPA) and dbt:
- Step 1: Codify existing policies. Translate each governance rule from prose to a machine-readable policy. Example: “All PII columns must be masked in analytics models” becomes an OPA policy that checks dbt model YAML for masking annotations on columns tagged as PII. I codified 23 governance rules in 4 days
- Step 2: Integrate into CI. Add policy evaluation as a CI pipeline step that runs on every pull request. The step reads the proposed dbt model changes, evaluates them against the policy bundle, and fails the CI check if any policy is violated. The failure message includes the specific policy, the violating code, and the remediation instruction
- Step 3: Classify severity levels. Not all governance violations are equal. I classified policies into 3 tiers: blocking (PII exposure, data retention violations, access control gaps), which prevents merge; warning (documentation gaps, naming convention violations), which annotates the PR but doesn’t block it; and informational (optimization suggestions, deprecation notices), which appears as a PR comment
- Step 4: Version the policy bundle. Governance policies live in a Git repository alongside the data code. Policy changes go through the same PR review process as code changes. This creates an auditable history of governance evolution and prevents silent policy modifications
- Step 5: Generate compliance reports from CI logs. Every policy evaluation is logged. Monthly compliance reports are generated automatically from CI logs, replacing the manual spreadsheet tracking. Auditors receive a complete record of every governance check, its result, and the timestamp, without anyone filling out a form
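A minimal sketch of steps 1 through 3 in plain Python. It is a stand-in for an OPA policy bundle, not the Rego rules themselves. It assumes each changed model arrives as a dictionary shaped like a dbt properties entry (name, description, columns with tags and meta). The `masking_policy` meta key, the two policy names, and the sample model are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    BLOCKING = "blocking"            # prevents merge
    WARNING = "warning"              # annotates the PR, does not block
    INFORMATIONAL = "informational"  # appears as a PR comment


@dataclass(frozen=True)
class Violation:
    policy: str
    severity: Severity
    location: str
    remediation: str


def check_pii_masked(model: dict) -> list[Violation]:
    """Blocking: every column tagged 'pii' must carry a masking annotation."""
    violations = []
    for column in model.get("columns", []):
        tags = column.get("tags", [])
        meta = column.get("meta", {})
        # 'masking_policy' is a hypothetical meta key chosen for this sketch.
        if "pii" in tags and not meta.get("masking_policy"):
            violations.append(Violation(
                policy="pii_columns_must_be_masked",
                severity=Severity.BLOCKING,
                location=f"{model['name']}.{column['name']}",
                remediation="Add meta.masking_policy to the column.",
            ))
    return violations


def check_model_described(model: dict) -> list[Violation]:
    """Warning: every model should have a non-empty description."""
    if (model.get("description") or "").strip():
        return []
    return [Violation(
        policy="model_must_have_description",
        severity=Severity.WARNING,
        location=model["name"],
        remediation="Add a description to the model's YAML entry.",
    )]


# The "policy bundle": a versioned list of checks, reviewed like any other code.
POLICY_BUNDLE = [check_pii_masked, check_model_described]


def evaluate(models: list[dict]) -> tuple[list[Violation], bool]:
    """Run every policy on every model; any blocking violation blocks the merge."""
    violations = [v for m in models for policy in POLICY_BUNDLE for v in policy(m)]
    blocked = any(v.severity is Severity.BLOCKING for v in violations)
    return violations, blocked


# Example: one changed model with an unmasked PII column and no description.
changed_models = [{
    "name": "customer_health_score",
    "description": "",
    "columns": [
        {"name": "email", "tags": ["pii"], "meta": {}},
        {"name": "score_value", "tags": [], "meta": {}},
    ],
}]
found, merge_blocked = evaluate(changed_models)
for v in found:
    print(f"[{v.severity.value}] {v.policy} at {v.location}: {v.remediation}")
```

In a CI step, the script would exit non-zero when `merge_blocked` is true, so that only blocking-tier violations fail the check while warnings are merely reported.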
How do you handle governance for schema changes and data lineage?
Schema change governance requires automated impact analysis: before any schema modification merges, the CI pipeline must identify every downstream consumer that would be affected and verify that the change is backward-compatible or that consumers have been notified.
I built a schema change governance workflow that operates in 3 stages. First, the CI pipeline extracts the proposed schema change from the dbt model diff. Second, it queries the data lineage graph (maintained by dbt’s built-in lineage tracking) to identify all downstream models, dashboards, and API consumers that reference the changed columns. Third, it evaluates the change type: additive changes (new columns) pass automatically; type changes and removals require explicit acknowledgment from each affected consumer team, implemented as a required PR approval from the consuming team’s code owners.
This workflow caught 7 breaking schema changes in the first quarter, changes that would have previously been discovered only when downstream dashboards broke or API consumers reported errors. The average remediation cost for a schema change caught in CI was 45 minutes of engineering time. The average remediation cost for a schema change discovered in production was 6 hours.
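The three stages can be sketched as follows, under simplifying assumptions. Schemas are plain column-to-type mappings, the lineage graph is a hand-written dictionary of child nodes (in practice it would be derived from dbt's lineage metadata), and the `owners` mapping plays the role of a code owners file. All model, dashboard, and team names are hypothetical. The sketch is also deliberately conservative: it treats every downstream node as affected rather than narrowing to nodes that reference the changed columns, which would require column-level lineage.

```python
from collections import deque


def classify_changes(old: dict[str, str], new: dict[str, str]) -> dict[str, str]:
    """Compare column -> type mappings and label each changed column."""
    changes = {}
    for name, old_type in old.items():
        if name not in new:
            changes[name] = "removed"
        elif new[name] != old_type:
            changes[name] = "type_changed"
    for name in new:
        if name not in old:
            changes[name] = "added"
    return changes


def downstream_of(node: str, children: dict[str, list[str]]) -> list[str]:
    """Breadth-first walk of the lineage graph; returns all descendants, sorted."""
    seen, queue = set(), deque([node])
    while queue:
        for child in children.get(queue.popleft(), []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return sorted(seen)


def required_approvals(model, old, new, children, owners) -> set[str]:
    """Additive changes need no approval; removals and type changes need
    sign-off from the owner of every downstream node."""
    breaking = {c: k for c, k in classify_changes(old, new).items() if k != "added"}
    if not breaking:
        return set()
    return {owners[n] for n in downstream_of(model, children) if n in owners}


# Hypothetical lineage: a staging model feeds a score model, which feeds two consumers.
children = {
    "stg_customers": ["customer_health_score"],
    "customer_health_score": ["churn_dashboard", "scores_api"],
}
owners = {
    "customer_health_score": "analytics",
    "churn_dashboard": "bi-team",
    "scores_api": "platform",
}
old_schema = {"id": "int", "score_value": "float"}
new_schema = {"id": "int", "score_value": "int", "segment": "text"}

print(classify_changes(old_schema, new_schema))
print(sorted(required_approvals("stg_customers", old_schema, new_schema, children, owners)))
```

Here the new `segment` column alone would pass automatically, but the type change on `score_value` makes the PR require approval from all three downstream owners.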
What governance patterns should be automated first?
Automate high-frequency, low-ambiguity governance patterns first: PII detection, naming conventions, documentation requirements, and access control verification. Reserve human review for high-ambiguity decisions like data classification and retention policy exceptions.
- PII detection: Scan column names and sample values for patterns matching personally identifiable information (email, phone, SSN, name patterns). Flag any PII column that lacks a masking policy annotation. False positive rate in my implementation: 8%, manageable through a whitelist mechanism
- Documentation requirements: Enforce that every dbt model has a description, every column with business logic has a documentation block, and every model has an owner annotation. Missing documentation fails CI with a specific message: “Model customer_health_score is missing column documentation for score_value”
- Naming conventions: Enforce table and column naming standards (snake_case, prefix conventions, prohibited abbreviations) through regex-based policy checks. This eliminated the naming inconsistency that previously required manual review
- Access control verification: Verify that new models inherit appropriate access grants and that no model exposes data to a broader audience than its source tables permit. This policy prevented 3 access control escalations in 6 months
- Retention compliance: Flag models that reference tables with retention policies and verify that the derived model does not extend data retention beyond the source’s limit. Particularly relevant for GDPR and CCPA compliance
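Two of the patterns above, PII detection and naming conventions, reduce to regular-expression checks. The sketch below is illustrative only: it scans column names but not sample values, the keyword list and the snake_case rule are assumptions rather than the rules from my implementation, and the `masking_policy` meta key and the allowlisted column name are hypothetical. The allowlist stands in for the whitelist mechanism used to suppress known false positives.

```python
import re

# Match a PII keyword only as a whole underscore-delimited token,
# so "customer_email" matches but "filename" does not.
PII_NAME_PATTERN = re.compile(
    r"(^|_)(email|phone|ssn|first_name|last_name|full_name)(_|$)"
)

# Lowercase words separated by single underscores, starting with a letter.
SNAKE_CASE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")

# Known false positives, reviewed and suppressed by hand (hypothetical entry).
ALLOWLIST = {"email_campaign_id"}


def looks_like_pii(column_name: str) -> bool:
    """Heuristic name-based PII check with an allowlist override."""
    if column_name in ALLOWLIST:
        return False
    return PII_NAME_PATTERN.search(column_name.lower()) is not None


def unmasked_pii_columns(columns: list[dict]) -> list[str]:
    """Names of columns that look like PII and lack a masking annotation."""
    return [
        c["name"]
        for c in columns
        if looks_like_pii(c["name"])
        and not c.get("meta", {}).get("masking_policy")
    ]


def naming_violations(names: list[str]) -> list[str]:
    """Names that do not follow the snake_case convention."""
    return [n for n in names if not SNAKE_CASE.match(n)]


columns = [
    {"name": "phone_number"},
    {"name": "ssn", "meta": {"masking_policy": "hash"}},
    {"name": "email_campaign_id"},
    {"name": "score_value"},
]
print(unmasked_pii_columns(columns))
print(naming_violations(["score_value", "ScoreValue", "score__value"]))
```

Name-based matching is cheap enough to run on every pull request, which is why it makes a good first automation target. Value sampling catches columns with uninformative names, but it requires warehouse access from CI.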
How do you get engineering teams to adopt governance-as-code?
Adoption requires making governance faster than the alternative: if the automated CI check resolves in 3 minutes and the manual process takes 12 days, engineers will choose the automated path without persuasion.
I deployed the governance pipeline alongside the existing manual process for 1 month. Engineers could choose either path. In the first week, 30% used the automated path. By week 4, 92% did. Adoption was driven by speed, not by mandate: a governance check that returns in 3 minutes versus a committee cycle that takes 12 to 22 business days is not a close competition.
The remaining 8% who continued using the manual path had legitimate reasons: their changes involved novel data categories that no existing policy covered, or they needed exception approvals that required human judgment. This is the correct steady state. Automated governance handles the roughly 90% of changes that are routine compliance. Human judgment handles the roughly 10% that are novel. The committee, which was nominally responsible for all 47 changes per quarter but in practice saw only 19, now reviews 4 to 6 genuinely complex cases, giving each one the attention it deserves.
Data governance belongs in the same workflow where data code is written, reviewed, and deployed. Governance in spreadsheets is governance theater: it creates the appearance of control without the reality of enforcement. Policy-as-code makes governance real by making it automatic, auditable, and faster than non-compliance. The goal is not to make governance stricter. The goal is to make compliance the path of least resistance.