Premeditatio Malorum: The Stoic Case for Chaos Engineering
What is premeditatio malorum and why did the Stoics practice it?
Premeditatio malorum is the deliberate mental rehearsal of worst-case scenarios, practiced not to produce anxiety but to eliminate the surprise that transforms manageable events into crises.
Seneca wrote to Lucilius: “It is precisely in times of immunity from care that the soul should toughen itself beforehand for occasions of greater stress.” This is not pessimism. Pessimism dwells on bad outcomes and stops there. Premeditatio malorum dwells on bad outcomes and asks: “What would I do? What is within my control? What can I prepare now that will serve me then?”
Marcus Aurelius made it a morning practice. In the Meditations, he tells himself to begin the day with a reminder: “Today I shall meet with interference, ingratitude, insolence, disloyalty, ill-will, and selfishness.” Not because he wanted to dwell on the worst, but because, having anticipated it, he could respond with reason rather than reaction. The surprise is what kills you. The event itself is just engineering.
How does chaos engineering implement the same principle?
Chaos engineering introduces controlled failures into production systems during working hours so that teams experience, diagnose, and respond to degradation before it arrives unannounced at 3 AM.
By 2011, Netflix was running Chaos Monkey in its production environment. The tool randomly terminated virtual machine instances during business hours, so the engineering teams had to build systems that survived these disruptions. Within 2 years, Netflix’s availability improved measurably, and the practice spawned an entire discipline.
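The core idea, random termination restricted to business hours, fits in a few lines. What follows is a minimal sketch of that idea and not Netflix's implementation: the 9-to-5 weekday window, the function names, and the instance IDs are all assumptions made for illustration, and nothing here calls a real cloud API.

```python
import random
from datetime import datetime, time
from typing import Optional, Sequence

# Assumed window: the text only says "business hours".
BUSINESS_START = time(9, 0)
BUSINESS_END = time(17, 0)


def in_business_hours(now: datetime) -> bool:
    """True on Monday-Friday between BUSINESS_START and BUSINESS_END."""
    return now.weekday() < 5 and BUSINESS_START <= now.time() < BUSINESS_END


def pick_victim(
    instances: Sequence[str], now: datetime, rng: random.Random
) -> Optional[str]:
    """Pick one instance to terminate, or None when no one is around to respond."""
    if not instances or not in_business_hours(now):
        return None
    return rng.choice(instances)


# Hypothetical fleet; a real tool would list instances from the provider.
fleet = ["i-a", "i-b", "i-c"]
victim = pick_victim(fleet, datetime(2024, 1, 3, 10, 30), random.Random(0))
```

The business-hours guard is the part that matters: the failure arrives when the people who can learn from it are at their desks.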
The structural parallel to Stoic practice is exact. The Stoic practitioner deliberately imagines failure (premeditatio). The chaos engineer deliberately introduces failure (fault injection). Both share the same logic: experiencing adversity in a controlled context builds the capacity to handle adversity in an uncontrolled one. Both assume that comfort produces fragility and that deliberate discomfort produces resilience.
I ran my first game day in 2019. We simulated the complete failure of our primary database during peak traffic. The team had 30 minutes of warning and 2 hours to respond. Three engineers had never experienced a database failover. By the end of the exercise, all three could execute the runbook from memory. When the actual database failure occurred 4 months later, at 11 PM on a Thursday, the recovery took 8 minutes. Without the game day, we estimated that recovery would have taken about 45 minutes, based on comparable incidents at peer companies.
“We should project our thoughts ahead of us at every turn and have in mind every possible eventuality instead of only the usual course of events.” — Seneca, Letters to Lucilius, Letter 91
What does a Stoic chaos engineering practice look like?
A Stoic chaos engineering practice goes beyond technical fault injection to include organizational resilience: testing communication channels, decision-making processes, and human responses under pressure.
Most chaos engineering programs focus on infrastructure: kill a node, saturate a network link, corrupt a disk. These are valuable, but they address only half of Seneca’s teaching. The Stoics were not concerned solely with external events. They were concerned with the human response to external events. A complete chaos engineering practice must test both the system and the people operating it.
- Infrastructure chaos (testing the system): Random instance termination, network partition simulation, dependency failure injection, data corruption scenarios. These test the technical resilience of your architecture.
- Process chaos (testing the response): Simulate an incident where the primary on-call is unreachable. Run a game day where the runbook is intentionally wrong. Inject a failure during a leadership meeting to test whether communication protocols work under real organizational constraints.
- Cognitive chaos (testing the human): Present the team with an incident that has no runbook. Simulate cascading failures where the second failure arrives before the first is resolved. Test the team’s ability to make decisions with incomplete information.
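One way to keep a program from sliding back into infrastructure-only testing is to tag every planned experiment with its layer and check the plan for gaps. The sketch below assumes a simple in-memory plan; the `Layer`, `Experiment`, and `missing_layers` names are hypothetical, and the experiment names are taken from the list above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, List


class Layer(Enum):
    INFRASTRUCTURE = "infrastructure"  # tests the system
    PROCESS = "process"                # tests the response
    COGNITIVE = "cognitive"            # tests the human


@dataclass(frozen=True)
class Experiment:
    name: str
    layer: Layer


def missing_layers(plan: Iterable[Experiment]) -> List[Layer]:
    """Return the layers that no experiment in the plan exercises."""
    covered = {exp.layer for exp in plan}
    return [layer for layer in Layer if layer not in covered]


plan = [
    Experiment("random instance termination", Layer.INFRASTRUCTURE),
    Experiment("primary on-call unreachable", Layer.PROCESS),
]
print([layer.value for layer in missing_layers(plan)])  # ['cognitive']
```

A plan that reports a missing layer is testing the system or the process while leaving the human response unrehearsed.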
I implemented all 3 layers for a team managing 7 production services. Over 12 months, we ran 24 game days. The results: infrastructure recovery time decreased by 62%. Process failures (wrong person paged, unclear communication, missing runbook) decreased by 78%. The most important metric: team confidence during real incidents, measured by post-incident survey, increased from 3.1 to 4.4 on a 5-point scale.
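For anyone tracking the same metrics, the arithmetic is a plain relative change against the baseline. The sketch below applies it to the survey scores above; the recovery times of 50 and 19 minutes are hypothetical values chosen only to illustrate what a 62% decrease looks like, not measurements from that team.

```python
def percent_change(before: float, after: float) -> float:
    """Relative change from before to after, in percent (negative = decrease)."""
    if before == 0:
        raise ValueError("baseline must be non-zero")
    return round((after - before) * 100 / before, 1)


# Survey scores reported above: 3.1 before, 4.4 after, on a 5-point scale.
confidence_gain = percent_change(3.1, 4.4)

# Hypothetical recovery times in minutes, chosen to illustrate a 62% drop.
recovery_change = percent_change(50, 19)

print(confidence_gain, recovery_change)  # 41.9 -62.0
```

Read this way, the move from 3.1 to 4.4 is a gain of roughly 42% over the starting confidence score.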
Why do organizations resist premeditation of failure?
Organizations resist premeditated failure for the same reason individuals resist Stoic premeditation: it forces confrontation with vulnerability that optimism prefers to conceal.
Every time I propose a game day to leadership, I encounter the same objection: “What if we break something?” This is the anti-Stoic position in its purest form. The fear of a controlled failure prevents the preparation for an uncontrolled one. Seneca addressed this directly: “It is not that we dare not because things are difficult, but things are difficult because we dare not.”
The resistance runs deeper than risk calculation. Game days reveal uncomfortable truths. They show that the monitoring has blind spots. They show that the runbooks are outdated. They show that 2 of the 5 team members do not know how to access the production database. These truths exist whether or not you run the game day. The game day just makes them visible.
I tracked the reasons given for canceling or postponing scheduled game days across 4 organizations over 2 years. The top 3 reasons: “We have a big release coming up” (35% of cancellations), “The team is too busy” (28%), and “Leadership is nervous about risk” (22%). In each case, the stated reason was a form of avoidance. The big release is exactly when resilience testing matters most. The busy team is exactly the team that will be overwhelmed by an unplanned incident. The nervous leadership is exactly the leadership that has not confronted the reality of their system’s fragility.
How do you start a premeditatio malorum practice for your engineering team?
Start with the smallest possible failure in the most controlled possible environment, then gradually increase scope as the team builds the muscle memory of response.
- Week 1, mental rehearsal: Gather the team and ask: “If our primary database went down right now, what would we do?” Map the response on a whiteboard. Identify every gap. This is pure premeditatio, no actual failure required.
- Week 4, tabletop exercise: Present a written scenario and walk through the response step by step. No production systems involved. Test the process, not the infrastructure.
- Week 8, staging chaos: Inject a real failure in a staging environment. Execute the runbook. Measure recovery time. Debrief.
- Week 12, production game day: With full organizational awareness and rollback plans in place, introduce a controlled failure in production during business hours with all hands on deck.
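The schedule above works as a ladder: each rung earns the right to attempt the next. A minimal sketch of that gating follows, assuming completion is tracked by stage name; the `Stage` and `next_stage` names are hypothetical, and the weeks and stage names come from the list above.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass(frozen=True)
class Stage:
    week: int
    name: str
    touches_production: bool


LADDER = [
    Stage(1, "mental rehearsal", False),
    Stage(4, "tabletop exercise", False),
    Stage(8, "staging chaos", False),
    Stage(12, "production game day", True),
]


def next_stage(completed: Set[str]) -> Optional[Stage]:
    """Return the first stage not yet completed, so no rung is skipped."""
    for stage in LADDER:
        if stage.name not in completed:
            return stage
    return None  # every rung done; repeat the cycle with a wider scope


upcoming = next_stage({"mental rehearsal", "tabletop exercise"})
print(upcoming.week, upcoming.name)  # 8 staging chaos
```

Because the lookup always returns the earliest unfinished stage, a team cannot reach the production game day without having done the rehearsal, the tabletop, and the staging run first.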
Seneca did not practice premeditatio malorum because he enjoyed imagining catastrophe. He practiced it because he understood that the mind, like any system, performs under stress only as well as it has been trained. The Chaos Monkey does not enjoy terminating instances. It does so because Netflix understood that a system that has never experienced failure will not survive its first encounter with it. The ancient discipline and the modern practice share a single insight: resilience is not an attribute of systems that have never failed. It is an attribute of systems that have failed often and deliberately, and that have learned from every failure.