The Ethics of Data Collection at Scale
What happens when data collection outpaces ethical reasoning?
When data collection capability outpaces ethical reasoning, organizations default to collecting everything, creating liability, eroding trust, and building systems that treat human behavior as an extractable resource rather than a relationship.
I audited the data collection footprint of a mid-size SaaS platform in 2025. The platform collected 847 distinct data points per user session. When I asked the product team which of those data points were necessary for the product to function, the answer was 23. The remaining 824 were collected because “we might need them later” or “analytics wants everything.” That ratio, 23 necessary to 824 speculative, is the ethical problem in concrete terms.
The technical infrastructure to collect everything exists and is cheap. Cloud storage costs roughly $0.023 per GB per month. The marginal cost of collecting one more data point is effectively zero. But the ethical cost is not zero. Each data point represents a piece of someone’s behavior, captured without meaningful understanding of how it will be used.
Why does the consent model fail at scale?
The consent model fails because it was designed for simple, comprehensible transactions and is now applied to collection practices so complex that no human could meaningfully understand what they are consenting to.
According to a study published in the journal Information, Communication & Society, the average privacy policy takes 18 minutes to read. The average user spends 73 seconds on a consent screen. This is not informed consent. It is performative consent, a legal ritual that protects the collector while providing no meaningful agency to the collected.
I have built consent management systems for three organizations. In each case, the legal team wanted comprehensive consent (covering every possible future use), while the UX team wanted simple consent (one button, minimal friction). The result was always a compromise that served neither purpose: dense enough to be legally defensible, simple enough to be ethically meaningless. The data-governance-as-code approach helps technically but does not resolve the fundamental consent problem.
What do organizations actually owe the people whose data they collect?
Organizations owe data subjects three things: transparency about what is collected and why, genuine control over their data, and accountability when data is misused. Most organizations deliver none of these credibly.
Transparency means more than a privacy policy. It means making collection visible in the product experience. I worked on a system that displayed a real-time data collection indicator: a small counter showing how many data points had been collected during the current session. User trust scores increased by 31% in a 90-day pilot. The counter also reduced collection: when engineers saw the number climb, they questioned whether each data point was necessary.
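A per-session collection counter of the kind described above can be sketched in a few lines. This is a minimal illustration, not the system from the pilot: the class name `SessionCollectionCounter`, the category labels, and the summary wording are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class SessionCollectionCounter:
    """Tracks how many data points one session has collected, by category."""

    counts: Counter = field(default_factory=Counter)

    def record(self, category: str, n: int = 1) -> None:
        """Register n newly collected data points under a category."""
        if n < 0:
            raise ValueError("n must be non-negative")
        self.counts[category] += n

    @property
    def total(self) -> int:
        """Total data points collected in this session so far."""
        return sum(self.counts.values())

    def summary(self) -> str:
        """Text suitable for a small in-product indicator."""
        return f"{self.total} data points collected this session"


# Hypothetical usage: every collection call site reports to the counter.
counter = SessionCollectionCounter()
counter.record("page_view")
counter.record("mouse_position", 40)
counter.record("page_view")
print(counter.summary())  # 42 data points collected this session
```

Keeping the per-category breakdown, rather than a single integer, is what lets engineers see which kind of collection is driving the number up.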
Genuine control means more than a “delete my data” button. It means granular choices about what is collected, how long it is retained, and who can access it. The GDPR established a legal framework for this, but technical implementation remains inconsistent. In my experience, fewer than 20% of organizations can actually fulfill a complete data deletion request within the legally required timeframe, which under the GDPR is generally one month from receipt of the request.
How should data engineers think about collection ethics?
Data engineers should apply the principle of minimal collection: collect what is needed, justify what is retained, and delete what is not actively serving a defined purpose.
This is not an abstract moral argument. It is a practical engineering discipline. Every data point collected increases storage costs, schema complexity, privacy attack surface, and regulatory liability. I have seen organizations spend $40,000 per month storing data that no query had touched in 18 months. The ethical choice and the economical choice were identical: stop collecting what you do not need.
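One way to surface this kind of waste is to list datasets that no query has touched within a threshold and price them. The sketch below assumes a catalog that already records a last-queried date per dataset; the `Dataset` type, the dataset names, and the sizes are hypothetical, and the price constant reuses the storage figure quoted earlier in this article.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional

PRICE_PER_GB_MONTH = 0.023  # storage figure quoted earlier in this article


@dataclass(frozen=True)
class Dataset:
    name: str
    size_gb: float
    last_queried: Optional[date]  # None means no query has ever touched it


def months_between(earlier: date, later: date) -> int:
    """Whole calendar months elapsed from earlier to later."""
    months = (later.year - earlier.year) * 12 + (later.month - earlier.month)
    if later.day < earlier.day:
        months -= 1  # the final month is not yet complete
    return months


def stale_datasets(datasets: List[Dataset], today: date,
                   threshold_months: int = 18) -> List[Dataset]:
    """Datasets never queried, or not queried within the threshold."""
    return [
        d for d in datasets
        if d.last_queried is None
        or months_between(d.last_queried, today) >= threshold_months
    ]


def monthly_cost(datasets: List[Dataset]) -> float:
    """Estimated monthly storage cost in dollars."""
    return sum(d.size_gb for d in datasets) * PRICE_PER_GB_MONTH


# Hypothetical catalog entries, for illustration only.
catalog = [
    Dataset("clickstream_raw", 100_000, date(2023, 10, 15)),
    Dataset("billing_events", 500, date(2025, 5, 20)),
    Dataset("hover_events", 50_000, None),
]
stale = stale_datasets(catalog, today=date(2025, 6, 1))
print([d.name for d in stale], f"${monthly_cost(stale):,.2f} per month")
```

The output is a list of candidates for review, not an automatic deletion list: some rarely queried data is retained for legal or audit reasons.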
The via negativa approach to data architecture applies directly here. The most ethical data architecture is often the one that collects less, not more. Every field in a schema should justify its existence. Every retention policy should default to deletion, not preservation. Every new data collection should require a stated purpose that a reasonable person could understand.
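These three rules can be enforced in code rather than left to policy documents. The following is a minimal sketch under stated assumptions: `FieldSpec`, `filter_event`, `is_expired`, and the 30-day default are hypothetical names and values chosen for illustration, not an established library or a recommended retention period.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Any, Dict

DEFAULT_RETENTION_DAYS = 30  # assumption: a short default so deletion is the norm


@dataclass(frozen=True)
class FieldSpec:
    """A schema field that must state its purpose to exist at all."""

    name: str
    purpose: str
    retention_days: int = DEFAULT_RETENTION_DAYS

    def __post_init__(self) -> None:
        if not self.purpose.strip():
            raise ValueError(f"field {self.name!r} has no stated purpose")
        if self.retention_days <= 0:
            raise ValueError(f"field {self.name!r} needs a positive retention period")


def filter_event(event: Dict[str, Any], schema: Dict[str, FieldSpec]) -> Dict[str, Any]:
    """Keep only registered fields; anything unregistered is never stored."""
    return {key: value for key, value in event.items() if key in schema}


def is_expired(spec: FieldSpec, collected_on: date, today: date) -> bool:
    """True once a value has outlived its field's retention period."""
    return today > collected_on + timedelta(days=spec.retention_days)


# Hypothetical schema: each field carries a purpose a reasonable person could read.
schema = {
    spec.name: spec
    for spec in [
        FieldSpec("email", "account login and password reset", retention_days=365),
        FieldSpec("last_page", "resume the session where the user left off"),
    ]
}

event = {"email": "a@example.com", "last_page": "/settings", "mouse_x": 412}
stored = filter_event(event, schema)
print(stored)  # mouse_x is dropped because no one registered a purpose for it
```

The design choice is that collection is opt-in per field: adding a data point requires writing down a purpose and a retention period, so "we might need it later" has to be stated explicitly before anything is stored.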
The volume of data collection will continue to grow. The question is whether our ethical reasoning grows with it, or whether we continue treating consent as a checkbox, privacy as a legal department problem, and human behavior as a resource to be extracted. Data engineers are not neutral in this. We build the systems that collect. That makes the ethics of collection our professional responsibility, not someone else’s.