Data Privacy Engineering Is a Data Engineering Discipline
Why is privacy engineering treated as someone else’s problem?
Privacy engineering is treated as a legal or compliance function because most organizations still frame privacy as a policy problem rather than a technical implementation problem. The result is a gap between what privacy policies promise and what data systems actually enforce.
I spent 6 months on a project where the privacy policy stated that customer Social Security numbers were “encrypted at rest and masked in all non-production environments.” The reality: SSNs were encrypted in the production database but appeared in plaintext in 3 downstream analytics tables, 2 dbt staging models, and a shared Google Sheet that 14 people had access to. The policy was correct. The implementation was wrong. No one had connected the two because privacy was legal’s job and data pipelines were engineering’s job.
This gap is the norm, not the exception. According to NIST’s Privacy Framework, effective privacy protection requires “privacy by design” at the system level. That means privacy controls embedded in data pipelines, not in PDF documents reviewed quarterly by compliance.
What does privacy engineering look like inside a data pipeline?
Privacy engineering in data pipelines means implementing tokenization at ingestion, role-based masking at query time, differential privacy for aggregations, and automated PII detection as a continuous validation step.
I implement four privacy patterns in every pipeline I build:
- Tokenization at ingestion: PII fields (names, emails, SSNs, phone numbers) are replaced with deterministic tokens at the point of extraction. The mapping table is stored separately with restricted access. Downstream consumers never see raw PII. This adds approximately 200ms of latency per 10,000 records, a cost I have never seen anyone object to
- Role-based dynamic masking: For fields where some consumers need partial visibility (last 4 digits of SSN, domain of email), I implement query-time masking policies. The underlying data is stored fully, but the query layer applies masking rules based on the requester’s role. Snowflake and BigQuery both support this natively
- Differential privacy for aggregations: When publishing aggregate statistics that could be reverse-engineered to identify individuals, I add calibrated noise. For a dataset of 50,000 records, an epsilon of 1.0 provides meaningful privacy protection while keeping aggregate accuracy within 2% of true values
- Automated PII detection: A scanning job runs on every new table and every schema change, flagging columns that match PII patterns (regex for SSNs, emails, phone numbers, plus NLP classification for names and addresses). Flagged columns without tokenization are blocked from promotion to production. This caught 7 PII exposures in the first month of deployment
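The tokenization pattern above can be sketched in a few lines. This is a minimal illustration, not a production implementation: the field names, the vault dictionary, and the key handling are all hypothetical, and a real deployment would keep the key in a secrets manager and the mapping in a separately access-controlled table.

```python
import hmac
import hashlib

# Hypothetical key for illustration only; in practice this lives in a
# secrets manager, never in pipeline code.
TOKEN_KEY = b"replace-with-managed-secret"

def tokenize(value: str) -> str:
    """Deterministic token: the same input always yields the same token,
    so joins and deduplication still work downstream."""
    digest = hmac.new(TOKEN_KEY, value.encode("utf-8"), hashlib.sha256)
    return "tok_" + digest.hexdigest()[:32]

def tokenize_record(record: dict, pii_fields: set, vault: dict) -> dict:
    """Replace PII fields with tokens at ingestion and record the
    token -> raw value mapping in a separately secured vault."""
    out = dict(record)
    for field in pii_fields & record.keys():
        raw = record[field]
        token = tokenize(raw)
        vault[token] = raw  # restricted-access mapping table
        out[field] = token
    return out
```

Because the token is deterministic, two records with the same email tokenize identically, which preserves referential integrity across tables without exposing the raw value.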
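The role-based masking pattern can be expressed as query-layer functions. Snowflake and BigQuery implement the same idea as native SQL masking policies; the Python below is only a sketch of the logic, and the role names are hypothetical.

```python
def mask_ssn(value: str, role: str) -> str:
    """Full value for privileged roles, last 4 digits for everyone else."""
    if role in {"compliance", "fraud_analyst"}:
        return value
    return "***-**-" + value[-4:]

def mask_email(value: str, role: str) -> str:
    """Full address for privileged roles, domain only for everyone else."""
    if role == "compliance":
        return value
    return "****@" + value.split("@", 1)[1]
```

The key design point is that the stored data is untouched: masking is applied at read time, so granting a role wider visibility is a policy change, not a backfill.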
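The differential privacy pattern for a counting query can be sketched with the Laplace mechanism: a count has sensitivity 1, so noise drawn from Laplace(0, 1/epsilon) is added to the true value. This is a minimal illustration of the mechanism, not a full DP accounting framework.

```python
import random

def dp_count(true_count: int, epsilon: float = 1.0) -> float:
    """Laplace mechanism for a counting query.
    A count has sensitivity 1, so the noise scale is 1 / epsilon."""
    scale = 1.0 / epsilon
    # The difference of two exponential samples is Laplace-distributed.
    noise = random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)
    return true_count + noise
```

At epsilon 1.0 the noise has scale 1, so for an aggregate over tens of thousands of records the published value stays far inside the 2% accuracy band while still bounding what any single individual's presence can reveal.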
Why does this belong in data engineering specifically?
Privacy engineering belongs in data engineering because data engineers control the pipelines where PII flows, transforms, and replicates, making them the only team positioned to enforce privacy controls at every point where data moves.
Legal can write a policy. Compliance can audit after the fact. But only the data engineering team touches the code where PII is extracted, transformed, loaded, and queried. If privacy controls are not in that code, they do not exist in practice, regardless of what the policy says.
I treat privacy requirements the same way I treat data quality requirements: as tests that run on every pipeline execution. A pipeline that exposes PII should fail the same way a pipeline that produces null primary keys should fail. Both represent data integrity violations. Both are preventable with engineering discipline. The distinction between “quality” and “privacy” is organizational, not technical. In the pipeline, they are both validation checks.
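Treating a privacy check exactly like a data quality check can be sketched as a validation step that raises on exposure, failing the run the same way a null-primary-key assertion would. The regexes below cover only SSNs and emails and are illustrative; a real scanner adds phone numbers plus NLP classification for names and addresses.

```python
import re

# Illustrative PII patterns; real deployments cover more types.
PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

def scan_column(values) -> set:
    """Return the set of PII types found in a column sample."""
    found = set()
    for v in values:
        if not isinstance(v, str):
            continue
        for pii_type, pattern in PII_PATTERNS.items():
            if pattern.search(v):
                found.add(pii_type)
    return found

def validate_table(table: dict, tokenized_columns: set) -> None:
    """Fail the pipeline run if any non-tokenized column contains PII,
    exactly as a null-primary-key check would fail it."""
    for column, values in table.items():
        if column in tokenized_columns:
            continue
        hits = scan_column(values)
        if hits:
            raise ValueError(
                f"PII exposure in column {column!r}: {sorted(hits)}"
            )
```

Wired into the pipeline's existing test stage, this makes a PII leak a hard failure rather than a quarterly audit finding.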
What are the broader implications for data teams?
When data engineering owns privacy implementation, organizations move from reactive compliance (responding to breaches and audit findings) to proactive privacy (preventing exposure before it occurs), which reduces both risk and the cost of compliance.
The financial argument is straightforward. The average cost of a data breach involving PII was $4.88 million, according to IBM’s 2024 Cost of a Data Breach Report. The cost of implementing tokenization and automated PII detection in a typical data platform is under $50,000 in engineering time and infrastructure. That is nearly a 100:1 ratio of breach cost to prevention cost. Even discounted by probability, the ROI is clear.
Beyond cost, there is a professional argument. Data engineers who can implement privacy controls, who understand governance as code and can translate regulatory requirements into pipeline logic, are more valuable than those who cannot. Privacy engineering is not a distraction from data engineering. It is an extension of it, and increasingly a required one.
Privacy is not a checkbox. It is a property of a system that is either enforced technically or not enforced at all. Data engineers build the systems where privacy lives or dies. Treating privacy engineering as someone else’s discipline is an abdication of the responsibility that comes with controlling how data moves through an organization.