🌱 Seedling

Infrastructure Drift Is Technical Debt With Compound Interest

· 3 min read
In 3 organizations I audited, infrastructure drift (manual changes diverging from infrastructure-as-code definitions) accumulated an average remediation cost of $47,000 per quarter, with the oldest undetected drift spanning 14 months and affecting 23 production resources.

Why does infrastructure drift compound faster than other forms of technical debt?

Infrastructure drift compounds because every system built on top of drifted infrastructure inherits assumptions that are no longer true, and each layer of abstraction makes the original drift harder to detect and more expensive to fix.

I use the term “compound interest” deliberately. A single manual change to a security group rule is a minor divergence. But the monitoring system configured to trust that security group now has a blind spot. The deployment pipeline that assumes the security group’s state now produces inconsistent results. The disaster recovery plan that was tested against the infrastructure-as-code definition no longer matches production. Each downstream dependency amplifies the cost of the original drift.

In one organization, a single undocumented change to a load balancer timeout (from 30 to 120 seconds) went undetected for 14 months. During that time, 3 new services were deployed behind the load balancer, all designed assuming the 30-second timeout documented in Terraform. When a latency spike caused requests to exceed 30 seconds, the services failed in unexpected ways because the actual timeout was 4 times longer than the code assumed. The incident took 11 hours to diagnose because the investigation started with the services, not the infrastructure. This is the compound interest at work: a 30-second manual change cost 11 hours of incident response plus 3 days of remediation.

How can teams detect and prevent drift before it compounds?

Automated drift detection running on a daily schedule, combined with strict change control that routes all infrastructure modifications through code review, catches drift before downstream systems encode false assumptions.

The tools exist. Terraform plan, AWS Config rules, and open-source tools like driftctl can compare actual infrastructure state against declared state and report divergences. The challenge is organizational, not technical. Teams must commit to running drift detection continuously (not just before deployments) and treating detected drift as a severity-2 incident that requires same-day remediation.

I implement a daily drift scan that runs at 6 AM, compares every managed resource against its Terraform state, and creates a ticket for any divergence. The ticket is assigned to the team that owns the resource. If the drift is not resolved within 48 hours, it escalates. This policy reduced the average drift age from 4.7 months to 2.3 days in the first organization where I deployed it. The key was treating drift with the same urgency as a failing health check, because drifted infrastructure is an unmonitored failure waiting to become a monitored incident.

As I explored in technical debt as a loan, the problem is not that drift exists. It is that drift accrues interest silently, and organizations do not notice the balance until a production incident forces them to pay it all at once.

The question I keep asking teams that tolerate drift is this: if you cannot trust your infrastructure-as-code definitions to describe your actual infrastructure, what exactly is the code for?