The Productivity Placebo: METR’s AI Coding Study
What did METR’s study actually measure?
METR conducted the first rigorous randomized controlled trial of AI coding tool productivity, tracking 16 experienced developers across 246 real tasks on their own open-source repositories, and found a statistically significant slowdown of 19% when AI tools were used.
The study design matters. This was not a survey. It was not a self-reported time estimate. METR used randomized task assignment, where developers were told whether they could or could not use AI tools (primarily Cursor Pro with Claude 3.5 Sonnet) for each specific task. Tasks were real issues on repositories the developers maintained. Time was measured from task start to pull request submission. The result: tasks completed with AI assistance took 19% longer on average.
The developers predicted, before seeing the results, that AI tools would make them 24% faster. The gap between belief and reality, 43 percentage points (a predicted 24% speedup against a measured 19% slowdown), is not a rounding error. It is a structural misperception.
Why did experienced developers get slower with AI tools?
The slowdown emerged from 3 compounding factors: context-switching overhead between writing and reviewing AI suggestions, false starts from plausible but incorrect generated code, and the cognitive cost of integrating AI output into existing, complex codebases the developer already understood deeply.
I have observed this pattern in my own work. When I use AI coding assistants on codebases I know well, I spend measurable time evaluating suggestions that are syntactically correct but architecturally wrong. The model does not understand the 47 implicit conventions in a mature codebase. It does not know that we use factory patterns in module A but dependency injection in module B, or that the test suite has a specific initialization sequence that breaks if you add imports in the wrong order.
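To make that failure mode concrete, here is a minimal hypothetical sketch in Python (the class names are invented for illustration, not taken from the study or any real codebase): a module whose implicit convention is constructor injection, next to the kind of suggestion an AI tool plausibly produces.

```python
class StripeGateway:
    """Stub payment gateway, included only so the sketch runs."""
    def submit(self, amount):
        return {"ok": True, "amount": amount}


# Module B's (implicit, undocumented) convention: dependencies are
# injected, so the test suite can pass a fake gateway without
# touching the network.
class PaymentService:
    def __init__(self, gateway):
        self.gateway = gateway

    def charge(self, amount):
        return self.gateway.submit(amount)


# A typical AI suggestion: syntactically correct, architecturally wrong.
# Hard-coding the dependency breaks the fake-gateway test setup and
# violates a convention the model had no way to see.
class RefundService:
    def __init__(self):
        self.gateway = StripeGateway()  # direct instantiation, not injected

    def refund(self, amount):
        return self.gateway.submit(-amount)
```

Nothing in the second class would trip a linter or a compiler. The error only surfaces when you hold it against context that exists in the team's heads, which is exactly the evaluation work that accumulates invisibly.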
For these 16 developers, working on their own repositories, the AI tool was essentially a confident junior developer who had read the documentation but never attended a team standup. Every suggestion required evaluation against context the model could not see. That evaluation time, invisible in the moment, accumulated.
The study also noted that developers spent considerable time crafting prompts and iterating on AI-generated solutions. This “prompt iteration loop” is a hidden tax. When you write code directly, the feedback loop is between your intention and the compiler. When you write code through an AI intermediary, the feedback loop adds a translation layer: your intention, your prompt, the model’s interpretation, the generated code, your evaluation, your correction. Each additional node in that loop introduces latency and potential error.
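A back-of-the-envelope model makes the cost of that loop visible. All numbers below are invented for illustration, not drawn from the study; the point is only that a multiplicative retry loop can erase the benefit of near-instant generation.

```python
# Back-of-the-envelope model of the prompt iteration loop.
# All numbers are invented for illustration, not drawn from the study.

def expected_ai_time(prompt_cost, generation_wait, review_cost, p_accept):
    """Expected minutes per task when each prompt/generate/review round
    trip is accepted with probability p_accept (geometric: 1/p rounds)."""
    round_trip = prompt_cost + generation_wait + review_cost
    return round_trip / p_accept

direct = 30.0  # minutes to write the code yourself
via_ai = expected_ai_time(prompt_cost=3.0, generation_wait=0.5,
                          review_cost=8.0, p_accept=0.35)

# Generation is nearly free, yet the loop totals ~33 minutes.
print(f"direct: {direct:.1f} min, via AI: {via_ai:.1f} min")
```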
How does this connect to the Stoic concept of phantasia?
The Stoic concept of phantasia (initial impressions that precede rational judgment) explains why developers perceived productivity gains that did not exist: the constant stream of AI-generated code created an overwhelming impression of progress that bypassed critical evaluation.
Marcus Aurelius wrote in the Meditations that we should examine our initial impressions and not simply assent to them. The Stoics called this process “testing the phantasia,” subjecting each impression to rational scrutiny before accepting it as true. The METR study is a case study in what happens when an entire profession fails to test the phantasia.
AI coding tools produce a continuous stream of impressions that feel like productivity. Code appears on screen. Autocomplete suggestions arrive in milliseconds. The visual experience of watching an AI generate 40 lines of code in 2 seconds is viscerally compelling. It feels fast. The feeling is the phantasia, the raw impression that has not yet been subjected to rational evaluation.
What rational evaluation reveals, when conducted rigorously as METR did, is that the speed of generation is not the speed of completion. Generation is one phase of software development. Integration, testing, debugging, and review are the others. If AI accelerates generation by 300% but adds 50% overhead to each subsequent phase, the net effect can be negative. The developer, anchored to the vivid impression of rapid generation, does not perceive the distributed slowdown across the other phases.
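The arithmetic is worth making explicit. A worked example with assumed phase shares (reading "300% acceleration" as a 4x generation speedup; none of these shares come from METR):

```python
# Worked example with assumed phase shares; not METR's data.
# Suppose generation is 20% of a task and the remaining phases
# (integration, testing, debugging, review) are the other 80%.
generation_share, other_share = 0.20, 0.80

generation_time = generation_share / 4.0  # 4x faster generation -> 0.05
other_time = other_share * 1.5            # 50% overhead on the rest -> 1.20

total = generation_time + other_time      # 1.25: a net 25% slowdown
print(f"relative task time with AI: {total:.2f}x")
```

Under these assumptions, the vividly fast phase shrinks from 0.20 to 0.05 while the diffuse phases grow from 0.80 to 1.20, and the task as a whole gets slower even though generation got dramatically faster.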
Does this mean AI coding tools are useless?
No. It means the conditions under which AI coding tools provide genuine acceleration are narrower and more specific than the industry has acknowledged, and identifying those conditions requires the kind of rigorous measurement most teams have not performed.
I use AI coding tools daily. I find them genuinely useful for 4 specific categories of work: boilerplate generation where the pattern is well-established and the integration context is minimal, language translation tasks (porting Python logic to TypeScript, for example), documentation generation from existing code, and exploration of unfamiliar APIs. In each of these cases, the common factor is that the cost of evaluating the output is low relative to the cost of producing it from scratch.
Where I have measured slowdowns in my own work (and I do measure, using time-tracking at the task level), the common factor is the opposite: tasks where the integration context is high, where correctness depends on implicit knowledge, and where the cost of evaluating AI output approaches or exceeds the cost of writing the code directly. Refactoring a 500-line module in a codebase I designed is one such task. The AI cannot see the architectural intent. I spend more time explaining the constraints than I would spend making the changes.
METR’s study tested experienced developers on repositories they maintained. This is precisely the high-context, high-implicit-knowledge condition where AI tools are least likely to help. The study does not prove AI tools are universally slower. It proves they are slower in the exact scenario the industry most commonly markets them for: making expert developers more productive on their own code.
What should engineering organizations do with this evidence?
Engineering organizations should instrument their own productivity metrics before and after AI tool adoption, abandon subjective assessments as evidence, and design AI tool usage policies based on task-type analysis rather than blanket deployment.
- Measure, do not survey: Self-reported productivity estimates are unreliable. The METR study proved this with a 43-point perception gap. Use cycle time, defect rates, and PR throughput as objective proxies.
- Segment by task type: AI tools likely accelerate some tasks and decelerate others. Blanket productivity claims are meaningless without task-type segmentation. I categorize tasks into 4 quadrants based on context complexity (low/high) and output novelty (low/high), and measure AI impact in each (see the sketch after this list).
- Account for quality externalities: Speed is not the only metric. If AI-assisted code introduces more bugs, the downstream cost in debugging and review may exceed the upstream time savings. Apiiro’s research found AI-merged code had 322% more privilege escalation vulnerabilities.
- Resist the sunk cost of tool investment: Organizations that have purchased enterprise AI coding licenses have a financial incentive to believe those tools are working. This is a textbook sunk cost bias. The investment in the tool should not influence the evaluation of its impact.
The most useful thing METR’s study reveals is not about AI tools specifically. It is about the human capacity for self-deception when a technology aligns with our desire to feel productive. The Stoics understood this 2,000 years ago. Seneca warned that busyness is not the same as productivity, that the appearance of motion can mask the absence of progress. AI coding tools, with their constant generation of plausible output, are the most sophisticated busyness machines ever built. The discipline is in knowing when that busyness translates to genuine progress and when it does not.