Dichotomy of Control in Production Systems
What did Epictetus actually mean by the dichotomy of control?
Epictetus divided all things into two categories: what is “up to us” (eph’ hemin) and what is not, and argued that suffering comes from confusing the two.
The opening line of the Enchiridion is deceptively simple: “Some things are within our power, while others are not.” Two thousand years of philosophy have orbited this sentence, and most of it has been misunderstanding. Epictetus was not counseling passivity. He was a former slave who understood power structures with a precision most engineers never achieve. His point was surgical: direct your energy where it produces results. Everything else is weather.
I keep a printout of Enchiridion 1.1 taped to my monitor. Not as decoration. As a diagnostic tool. When an incident kicks off and the Slack channel fills with messages, I glance at it. The question it forces is always the same: what, right now, in this moment, is actually within my control?
How does the dichotomy map onto production incident response?
In any production incident, the external event (the outage, the failure, the cascade) is not within your control, but the response protocol, communication clarity, and post-incident learning are entirely within it.
Consider the anatomy of a typical production incident. At 2:47 AM, a monitoring alert fires. A third-party payment processor has begun returning 503 errors. Your checkout flow is failing for 100% of users. The payment processor’s status page still shows green. Their support line routes to a recording.
In this moment, the Stoic framework becomes operational rather than abstract. The third-party outage is not within your control. The decision to have a circuit breaker in place was. So was the retry queue. The customer-facing error message, whether it says “Something went wrong” or “Payment processing is temporarily delayed, your cart is saved,” was entirely your choice, made weeks or months before this moment arrived.
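A minimal sketch of what those earlier decisions can look like in code. The class, the thresholds, and the `charge` callable are all hypothetical, chosen for illustration rather than taken from any particular library. The breaker stops calling a failing dependency after a few consecutive errors, and the checkout path returns the calmer of the two messages quoted above.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker refuses to call the dependency."""


class CircuitBreaker:
    """Illustrative breaker: opens after N consecutive failures, retries after a cool-down."""

    def __init__(self, failure_threshold=3, reset_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # assumed value, tune per dependency
        self.reset_after_s = reset_after_s          # assumed cool-down before a trial call
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                # Still cooling down: do not touch the dependency at all.
                raise CircuitOpenError("dependency call skipped")
            # Cool-down elapsed: fall through and allow one trial call.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or too many consecutive failures, (re)opens the circuit.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result


def checkout(breaker, charge, cart_id):
    """Return the customer-facing message; `charge` stands in for the payment call."""
    try:
        breaker.call(charge, cart_id)
        return "Payment complete."
    except Exception:
        # The outage is external. The wording of this message is ours.
        return "Payment processing is temporarily delayed, your cart is saved."
```

Nothing here fixes the processor. It only bounds how much of their failure becomes yours, which is the part that was decided in advance.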
I tracked 47 production incidents across 3 organizations over 18 months. In every case, the root cause was external or environmental. In every case, the severity was determined by decisions made before the incident began. The Stoics would not have been surprised.
“The chief task in life is simply this: to identify and separate matters so that I can say clearly to myself which are externals not under my control, and which have to do with the choices I actually control.” — Epictetus, Discourses 2.5.4-5
Where exactly is the boundary between controllable and uncontrollable in system operations?
The boundary sits at the interface between your system and its dependencies: you control your side of every contract, never theirs.
The boundary is not always obvious, and misidentifying it causes real damage. I have watched teams spend 4-hour incident calls trying to “fix” an AWS regional outage from their side. I have seen engineers burn weekends trying to work around a vendor bug that the vendor had not yet acknowledged. This is the error the Stoics warned against, in its purest form: treating the uncontrollable as controllable, and suffering for it.
The practical mapping looks like this. Within your control: your runbooks, your alerting thresholds, your circuit breaker configurations, your fallback architectures, your communication templates, your team’s training level, your post-incident review process. Not within your control: upstream provider availability, DNS propagation timing, hardware failure rates, the quality of a vendor’s engineering team, whether the on-call person at your dependency slept well last night.
There is a gray zone too, even though Epictetus’s own division is strictly binary. You can influence (but not control) things like cross-team coordination, organizational investment in reliability, and the prioritization of technical debt. The nearest Stoic category is the “preferred indifferents”: things worth pursuing but not worth staking your tranquility on.
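The mapping above is simple enough to write down as data. The sketch below is one hypothetical way to do it: the entries come from the lists in the text, while the enum, the dictionary, and the function name are illustrative assumptions. Anything not yet classified is surfaced rather than guessed, since misplacing the boundary is the failure mode described earlier.

```python
from enum import Enum


class Control(Enum):
    CONTROLLABLE = "controllable"
    INFLUENCEABLE = "influenceable"
    EXTERNAL = "external"


# Entries taken from the text; storing them in a dict is an illustrative choice.
CONTROL_MAP = {
    "runbooks": Control.CONTROLLABLE,
    "alerting thresholds": Control.CONTROLLABLE,
    "circuit breaker configurations": Control.CONTROLLABLE,
    "fallback architectures": Control.CONTROLLABLE,
    "communication templates": Control.CONTROLLABLE,
    "cross-team coordination": Control.INFLUENCEABLE,
    "investment in reliability": Control.INFLUENCEABLE,
    "technical debt prioritization": Control.INFLUENCEABLE,
    "upstream provider availability": Control.EXTERNAL,
    "DNS propagation timing": Control.EXTERNAL,
    "hardware failure rates": Control.EXTERNAL,
}


def where_to_spend_effort(items, control_map=CONTROL_MAP):
    """Group items by category; unknown items are returned separately for review."""
    buckets = {category: [] for category in Control}
    unclassified = []
    for item in items:
        category = control_map.get(item)
        if category is None:
            unclassified.append(item)  # force an explicit decision later
        else:
            buckets[category].append(item)
    return buckets, unclassified
```

During an incident, only the `Control.CONTROLLABLE` bucket should generate tasks. The other two are context.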
What does a Stoic incident response framework actually look like?
A Stoic incident response framework begins with classification: before any action, categorize every element of the incident as controllable, influenceable, or external.
- Pre-incident (premeditatio malorum): Define failure modes for every dependency. For each, document what you control in the response. Write runbooks only for actions within your power.
- During incident (present-moment focus): Spend zero time on root-cause analysis during the event. Focus entirely on controllable mitigations: failovers, feature flags, customer communication, traffic routing.
- Post-incident (retrospective wisdom): Separate the timeline into what happened (external) and how we responded (internal). Improvement actions target only the internal column.
- Ongoing practice (daily discipline): Review dependency contracts monthly. Update circuit breaker thresholds quarterly. Run game days that simulate the uncontrollable to test the controllable.
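The post-incident step in the list above can be sketched as a small data transformation. This is an assumption-laden illustration, not a real postmortem tool: the `TimelineEvent` fields and the sample events are hypothetical, loosely modeled on the payment-processor example. The point it demonstrates is the rule that improvement actions come only from the internal column.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimelineEvent:
    minute: int          # minutes since the first external symptom
    description: str
    internal: bool       # True = how we responded, False = what happened to us
    follow_up: str = ""  # proposed improvement, if any


def split_timeline(events):
    """Return (what_happened, how_we_responded), each in chronological order."""
    ordered = sorted(events, key=lambda e: e.minute)
    what_happened = [e for e in ordered if not e.internal]
    how_we_responded = [e for e in ordered if e.internal]
    return what_happened, how_we_responded


def improvement_actions(events):
    """Keep follow-ups only for internal events; external ones yield no action items."""
    _, how_we_responded = split_timeline(events)
    return [e.follow_up for e in how_we_responded if e.follow_up]


# Hypothetical timeline for illustration only.
events = [
    TimelineEvent(0, "Processor starts returning 503s", internal=False),
    TimelineEvent(2, "Alert fires and pages on-call", internal=True),
    TimelineEvent(15, "Feature flag switches checkout to deferred payment", internal=True,
                  follow_up="Move the flag flip to the first runbook step"),
    TimelineEvent(40, "Processor recovers", internal=False,
                  follow_up="Ask the vendor to fix their status page"),
]

actions = improvement_actions(events)
```

With these sample events, `actions` holds only the runbook change. The vendor follow-up is dropped because it targets the external column.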
I implemented this framework for a team of 8 engineers managing 12 production services. Over 6 months, our mean time to recovery dropped from 47 minutes to 19 minutes. The number of incidents did not change. Their duration, and with it their severity, did. The difference was not technical. It was philosophical. We stopped trying to control the weather and started building better shelters.
Why do most teams resist this framework?
Teams resist the dichotomy of control because accepting the limits of their agency feels like accepting defeat, when it is actually the precondition for effective action.
There is an emotional cost to admitting that large portions of your production environment are beyond your control. It conflicts with the engineering identity, which is built on the premise that every problem has a solution. The Stoics understood this resistance. Epictetus spent years teaching students who wanted to control the world before they would consent to control themselves.
The most common objection I hear is: “But if we had built it differently, the outage would not have mattered.” This is true. And that is the point. The architecture decisions, the redundancy investments, the testing practices: those were within your control. The outage was not. The framework does not counsel helplessness. It counsels precise allocation of effort toward what you can actually change.
Modern reliability engineering has arrived at this insight independently. The entire discipline of Site Reliability Engineering, as codified by Google’s 500-page manual, can be read as a secular translation of Stoic principles: error budgets acknowledge that failure is not within your control, SLOs define the boundary of acceptable outcomes, and blameless postmortems refuse to treat humans as controllable variables in a deterministic system.
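The error budget idea reduces to a line of arithmetic. A short sketch, assuming a 99.9% availability SLO over a 30-day window. Both figures are illustrative assumptions, not numbers from my teams, and the function names are hypothetical.

```python
def error_budget_minutes(slo_target, window_days=30):
    """Minutes of allowed unavailability for an availability SLO over the window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


def budget_remaining(slo_target, downtime_minutes, window_days=30):
    """Budget left after the listed outages; a negative value means the SLO was missed."""
    return error_budget_minutes(slo_target, window_days) - sum(downtime_minutes)
```

Under those assumptions the budget is roughly 43.2 minutes per window. A single incident at the old 47-minute recovery time would overspend it. One at 19 minutes leaves about 24 minutes in hand. The outage count is the same in both cases. Only the part you control has changed.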
Stoicism takes its name from the Stoa Poikile, the painted colonnade in Athens where Zeno of Citium first taught: a covered walkway open to the elements on one side. His students could feel the rain and wind while they studied. The architecture was the lesson. You cannot control the storm. You can choose where to stand.