
Copilot-Agent Spectrum: AI Autonomy Framework

This framework defines 5 AI autonomy levels from passive suggestion to full autonomy, with a scoring rubric based on reversibility, error cost, and decision frequency.

This framework adapts the operational design domain concept from autonomous vehicles to AI system design. It provides a structured rubric for deciding which of 5 autonomy levels (from passive suggestion to full autonomous action) an AI system should operate at, based on the reversibility, cost, and frequency of its decisions.

Why is the copilot-agent distinction too binary?

Framing AI systems as either “copilots” (human decides) or “agents” (AI decides) obscures the 5 distinct autonomy levels that span those extremes, each with different error profiles, cost structures, and appropriate use cases.

The industry talks about copilots and agents as if they are two categories. In practice, the systems I build occupy a spectrum with at least 5 meaningful positions. Treating this as a binary choice leads to systems that are either too constrained (wasting AI capability on suggestions no one reads) or too autonomous (executing actions that should require oversight). The framework below provides a structured approach to finding the right position on the spectrum for each specific task.

What are the 5 autonomy levels?

The 5 levels are: passive suggestion, active recommendation, conditional execution, supervised autonomy, and full autonomy, each defined by who makes the final decision and what happens when the decision is wrong.

  • Level 1: Passive Suggestion: The AI generates options, but takes no action. The human must actively select and execute. Example: code autocomplete that displays but does not insert. Appropriate when the cost of a wrong action exceeds $10,000 or involves irreversible consequences. Error cost: near zero (suggestions are ignored without consequence).
  • Level 2: Active Recommendation: The AI recommends a specific action with reasoning. The human approves or modifies before execution. Example: a draft email the human reviews and sends. Appropriate when actions are reversible but carry moderate cost ($1,000-$10,000 per error). Error cost: human review time (typically 30-120 seconds per recommendation).
  • Level 3: Conditional Execution: The AI executes automatically when its confidence exceeds a threshold; otherwise it escalates to human review. Example: a support ticket classifier that auto-routes high-confidence tickets and queues ambiguous ones. Appropriate when 70-90% of decisions are routine and the remaining 10-30% require judgment. Error cost: variable (routine errors are low-cost; escalation failures can be high-cost).
  • Level 4: Supervised Autonomy: The AI executes all actions autonomously, but a separate oversight system (human or AI) audits a sample and can retract actions. Example: an automated code reviewer that approves PRs but flags anomalies for human review. Appropriate when volume makes per-action human review impractical and actions are retractable within a time window. Error cost: retraction cost plus reputation damage from incorrect actions that reach users before retraction.
  • Level 5: Full Autonomy: The AI acts without oversight. Appropriate only when the cost of any individual error is less than the cost of human review, and the aggregate error rate is acceptable. Example: spam filtering, ad-hoc data formatting, log analysis. Error cost: the cost of the individual wrong action (which, by definition, must be low).
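Level 3 is the first level where the system itself decides whether to act, so it is worth seeing how small that decision point is. The following is a minimal sketch of the support ticket example, not a reference implementation. The names `Decision`, `conditional_execute`, and `CONFIDENCE_THRESHOLD`, and the 0.90 threshold value, are my own assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical threshold: the framework does not prescribe a value.
CONFIDENCE_THRESHOLD = 0.90


@dataclass
class Decision:
    action: str        # e.g. the queue a ticket should be routed to
    confidence: float  # model confidence in [0.0, 1.0]


def conditional_execute(
    decision: Decision,
    execute: Callable[[Decision], None],
    escalate: Callable[[Decision], None],
    threshold: float = CONFIDENCE_THRESHOLD,
) -> str:
    """Level 3: act when confidence exceeds the threshold, else escalate."""
    if decision.confidence > threshold:
        execute(decision)
        return "executed"
    escalate(decision)
    return "escalated"


# Usage: a high-confidence ticket is auto-routed, an ambiguous one is queued.
auto_routed: List[Decision] = []
review_queue: List[Decision] = []

outcomes = [
    conditional_execute(d, auto_routed.append, review_queue.append)
    for d in (Decision("billing", 0.97), Decision("legal", 0.55))
]
```

The escalation path is where the high-cost failures live, so in practice the `escalate` callback deserves at least as much testing as `execute`.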

How do you determine the right autonomy level for a task?

Score each task on 3 dimensions (reversibility, error cost, and decision frequency) and map the combined score to an autonomy level using the rubric below.

The scoring rubric:

  • Step 1: Score Reversibility (1-5): 1 = fully irreversible (deleted data, sent legal notice). 3 = partially reversible (sent email can be followed up, database change can be rolled back within 24 hours). 5 = fully reversible (draft document, internal classification, staging environment change).
  • Step 2: Score Error Cost (1-5): 1 = catastrophic (>$100K, regulatory violation, safety incident). 2 = severe ($10K-$100K, client relationship damage). 3 = moderate ($1K-$10K, rework required). 4 = minor ($100-$1K, inconvenience). 5 = negligible (<$100, trivially correctable).
  • Step 3: Score Decision Frequency (1-5): 1 = rare (<10/month). 2 = occasional (10-100/month). 3 = regular (100-1,000/month). 4 = frequent (1,000-10,000/month). 5 = continuous (>10,000/month).
  • Step 4: Calculate Autonomy Score: Autonomy Score = (Reversibility + Error Cost + Frequency) / 3, rounded to one decimal place so that every result falls inside a band. Map: 1.0-1.7 = Level 1, 1.8-2.5 = Level 2, 2.6-3.5 = Level 3, 3.6-4.3 = Level 4, 4.4-5.0 = Level 5.
  • Step 5: Apply Override Rules: If Error Cost = 1 (catastrophic), cap at Level 2 regardless of other scores. If Reversibility = 1 (irreversible), cap at Level 3. If the system has been in production for less than 30 days, reduce the level by 1 (minimum Level 1) until sufficient operational data exists.
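The five steps above can be sketched as a short function. This is one possible reading of the rubric, with three assumptions of mine that the rubric leaves open: the average is rounded to one decimal place (so that 13/3 = 4.33 lands in the 3.6-4.3 band), shared bucket boundaries such as 100/month go to the higher frequency score, and the two caps are applied before the 30-day reduction. The function names are hypothetical.

```python
def frequency_score(decisions_per_month: int) -> int:
    """Step 3: bucket monthly decision volume into a 1-5 score.

    Assumption: shared boundaries (100, 1,000) fall into the higher bucket.
    10,000 stays at 4 because the rubric reserves 5 for >10,000.
    """
    if decisions_per_month < 10:
        return 1
    if decisions_per_month < 100:
        return 2
    if decisions_per_month < 1_000:
        return 3
    if decisions_per_month <= 10_000:
        return 4
    return 5


def autonomy_level(
    reversibility: int,
    error_cost: int,
    frequency: int,
    days_in_production: int,
) -> int:
    """Steps 4-5: average the three scores, map to a level, apply overrides."""
    for name, score in (
        ("reversibility", reversibility),
        ("error_cost", error_cost),
        ("frequency", frequency),
    ):
        if not 1 <= score <= 5:
            raise ValueError(f"{name} must be between 1 and 5, got {score}")

    # Work in integer tenths (43 means 4.3) to avoid float comparisons.
    tenths = round((reversibility + error_cost + frequency) * 10 / 3)

    if tenths <= 17:
        level = 1
    elif tenths <= 25:
        level = 2
    elif tenths <= 35:
        level = 3
    elif tenths <= 43:
        level = 4
    else:
        level = 5

    # Override rules. Assumption: caps first, then the new-system reduction.
    if error_cost == 1:          # catastrophic error cost
        level = min(level, 2)
    if reversibility == 1:       # fully irreversible
        level = min(level, 3)
    if days_in_production < 30:  # not enough operational data yet
        level = max(level - 1, 1)
    return level


# Worked example: internal ticket routing, about 5,000 decisions per month,
# fully reversible (5), minor error cost (4), live for 90 days.
ticket_routing = autonomy_level(5, 4, frequency_score(5_000), 90)
```

Here the scores are (5 + 4 + 4) / 3 = 4.3, which maps to Level 4. The same task in its first 30 days of production would come out at Level 3.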

How does this connect to autonomous vehicle design domains?

Autonomous vehicle engineers define “operational design domains” that specify exactly where and when a vehicle can operate autonomously. AI systems need the same concept: explicit boundaries defining the conditions under which each autonomy level is safe.

A self-driving car does not operate with full autonomy everywhere. Mapped loosely onto the five levels above, it might operate at Level 5 on mapped highways in clear weather, Level 3 in construction zones, and Level 1 in unmapped areas. The operational design domain (ODD) specifies these boundaries precisely. AI systems should adopt the same discipline.

For each autonomy level assignment, define the ODD: what input types are covered, what edge cases trigger escalation, what environmental conditions (system load, data quality, model confidence) cause the system to downshift to a lower autonomy level. A customer service agent might operate at Level 4 for standard product questions but automatically downshift to Level 2 when it detects the customer is discussing a legal complaint. The ODD makes these transitions explicit, testable, and auditable.
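An ODD becomes testable once it is written down as data rather than prose. The sketch below encodes the customer service example under some assumptions of mine: the class name `OperationalDesignDomain`, its field names, the topic labels, the 0.80 confidence floor, and the choice to fall back to Level 1 for anything outside the domain are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet


@dataclass(frozen=True)
class OperationalDesignDomain:
    assigned_level: int                 # level used inside the domain
    covered_topics: FrozenSet[str]      # input types the assignment covers
    min_confidence: float               # below this, treat as out of domain
    downshift_triggers: Dict[str, int]  # topic -> forced lower level
    fallback_level: int = 1             # assumption: out of domain means Level 1


def effective_level(
    odd: OperationalDesignDomain, topic: str, confidence: float
) -> int:
    """Return the autonomy level that applies to a single input."""
    # Explicit triggers win, but never raise the level above the assignment.
    if topic in odd.downshift_triggers:
        return min(odd.assigned_level, odd.downshift_triggers[topic])
    # Uncovered input types or low model confidence fall outside the ODD.
    if topic not in odd.covered_topics or confidence < odd.min_confidence:
        return odd.fallback_level
    return odd.assigned_level


# Hypothetical ODD for the customer service agent described above.
support_odd = OperationalDesignDomain(
    assigned_level=4,
    covered_topics=frozenset({"product_question", "order_status"}),
    min_confidence=0.80,
    downshift_triggers={"legal_complaint": 2},
)

standard = effective_level(support_odd, "product_question", 0.95)
legal = effective_level(support_odd, "legal_complaint", 0.95)
unmapped = effective_level(support_odd, "medical_advice", 0.95)
```

Because every transition is a plain lookup, each boundary can be covered by a unit test and each downshift can be logged for audit.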

The cost of getting autonomy levels wrong flows in both directions. Too much autonomy creates risk. Too little autonomy creates waste. The framework above provides a structured path between those failure modes, grounded in the measurable properties of each specific task rather than the general capabilities of the AI system.

adam@adam-analytics.com writes about AI systems, software architecture, and the philosophy of technology.