What is the Guardian Agent Pattern?
The Guardian Agent Pattern is a system architecture in which a dedicated AI agent (the guardian) evaluates every action proposed by operational agents against a defined set of safety policies before execution is permitted, functioning as an independent oversight layer within the agent system.
The pattern has 3 components. First, the operational agent (or fleet of agents) that performs the primary task: answering questions, processing documents, executing workflows. Second, the guardian agent, which receives every proposed action from operational agents and evaluates it against a policy set. Third, the policy set itself, a structured collection of rules ranging from hard constraints (“never execute a database DELETE without human approval”) to soft guidelines (“prefer conservative estimates in financial projections”).
I implemented this pattern in 3 production systems in 2025. The guardian agent operates as middleware in the agent execution pipeline. When an operational agent generates an action (a tool call, a user-facing response, a data modification), the action is serialized and passed to the guardian before execution. The guardian evaluates the action against the policy set and returns one of 3 verdicts: approve (action proceeds), modify (action proceeds with specific changes), or block (action is rejected with an explanation routed back to the operational agent).
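A minimal sketch of that approve/modify/block flow, assuming a single in-process pipeline. The names `Action`, `Decision`, `toy_guardian`, and `run_with_guardian` are hypothetical and chosen only for illustration; the toy guardian stands in for real policy evaluation.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class Verdict(Enum):
    APPROVE = "approve"
    MODIFY = "modify"
    BLOCK = "block"


@dataclass
class Action:
    kind: str       # e.g. "tool_call", "response", "data_modification"
    payload: dict


@dataclass
class Decision:
    verdict: Verdict
    action: Optional[Action] = None  # replacement action, set only for MODIFY
    explanation: str = ""            # feedback for the operational agent on BLOCK


def toy_guardian(action: Action) -> Decision:
    """Hypothetical stand-in for policy evaluation; not a real policy engine."""
    sql = action.payload.get("sql", "")
    if sql.upper().startswith("DELETE"):
        return Decision(Verdict.BLOCK, explanation="DELETE requires human approval")
    if action.kind == "response" and "@" in action.payload.get("text", ""):
        redacted = Action(action.kind, {"text": "[contact details removed]"})
        return Decision(Verdict.MODIFY, action=redacted)
    return Decision(Verdict.APPROVE)


def run_with_guardian(
    action: Action,
    guardian: Callable[[Action], Decision],
    execute: Callable[[Action], object],
) -> dict:
    """Execute an action only after the guardian has returned a verdict."""
    decision = guardian(action)
    if decision.verdict is Verdict.BLOCK:
        # Nothing runs; the explanation goes back to the operational agent.
        return {"executed": False, "feedback": decision.explanation}
    # MODIFY runs the guardian's replacement; APPROVE runs the original.
    to_run = decision.action if decision.verdict is Verdict.MODIFY else action
    return {"executed": True, "result": execute(to_run)}
```

The point of the shape is that `execute` is only ever reached from inside `run_with_guardian`, so a blocked action returns feedback instead of side effects.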
How does the guardian evaluate actions without becoming a bottleneck?
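The guardian uses a tiered evaluation strategy: a fast rule-based pre-filter approves roughly 70% of actions in under 20ms, and only the remaining 30% or so (those that match risk patterns) receive full LLM-based policy evaluation. A small share of that flagged group, about 2% of all actions, is escalated to a human.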
- Tier 1: Rule-Based Pre-Filter (sub-20ms): Pattern-matching rules check for obvious violations: SQL DELETE/DROP statements, PII in user-facing responses (regex-based detection), actions targeting production databases, and requests exceeding cost thresholds. Actions that pass all rules and match no risk patterns are approved instantly. This handles approximately 70% of actions.
- Tier 2: LLM Policy Evaluation (150-300ms): Actions flagged by the pre-filter receive a full evaluation by a dedicated guardian model (I use Claude 3.5 Haiku for its speed-to-reasoning tradeoff). The guardian model receives the proposed action, the policy set, and relevant context, then provides a structured verdict with reasoning. This catches nuanced violations that rules cannot detect: responses that are technically accurate but misleading, actions that are individually safe but risky in sequence, and edge cases where policy intent conflicts with literal compliance.
- Tier 3: Human Escalation: When the guardian model’s confidence in its verdict falls below 0.7, or when the action matches an “always escalate” category (financial transactions above $5,000, actions affecting more than 100 records, any action a policy explicitly marks as requiring human approval), the action is queued for human review. This accounts for approximately 2% of actions.
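The three tiers can be sketched as a single routing function. This is an illustration under stated assumptions, not a reference implementation: the action is a plain dict, the field names (`text`, `amount_usd`, `records_affected`, `requires_human_approval`) are hypothetical, the regexes are deliberately crude, and `llm_evaluate` is a caller-supplied callable standing in for the guardian model. Checking the "always escalate" categories before the pre-filter is also an assumption about ordering.

```python
import re
from dataclasses import dataclass
from typing import Callable

# Tier 1 risk patterns (illustrative only, far from exhaustive).
RISK_PATTERNS = [
    re.compile(r"\b(DELETE|DROP)\b", re.IGNORECASE),  # destructive SQL
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),             # SSN-shaped PII
]

CONFIDENCE_FLOOR = 0.7   # below this, the model's verdict is not trusted
AMOUNT_LIMIT_USD = 5000  # financial transactions above this always escalate
RECORD_LIMIT = 100       # actions touching more records always escalate


@dataclass
class ModelVerdict:
    verdict: str       # "approve", "modify", or "block"
    confidence: float  # 0.0 to 1.0, as reported by the guardian model
    reasoning: str


def must_escalate(action: dict) -> bool:
    """'Always escalate' categories, independent of model confidence."""
    return (
        action.get("amount_usd", 0) > AMOUNT_LIMIT_USD
        or action.get("records_affected", 0) > RECORD_LIMIT
        or action.get("requires_human_approval", False)
    )


def matches_risk_pattern(action: dict) -> bool:
    text = action.get("text", "")
    return any(p.search(text) for p in RISK_PATTERNS)


def evaluate(action: dict, llm_evaluate: Callable[[dict], ModelVerdict]) -> str:
    """Return 'approve', 'modify', 'block', or 'escalate'."""
    if must_escalate(action):
        return "escalate"              # Tier 3 by category
    if not matches_risk_pattern(action):
        return "approve"               # Tier 1 fast path, no model call
    model = llm_evaluate(action)       # Tier 2, only for flagged actions
    if model.confidence < CONFIDENCE_FLOOR:
        return "escalate"              # Tier 3 by low confidence
    return model.verdict
```

Because the model call sits behind the pre-filter, unflagged actions never pay the Tier 2 latency.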
How does this connect to the Stoic concept of the inner citadel?
Marcus Aurelius described an “inner citadel” of rational judgment that mediates between external events and internal responses. The guardian agent serves exactly this function: it mediates between the operational agent’s impulses and the system’s actions, ensuring that responses are governed by policy rather than raw capability.
The Stoic inner citadel is not a wall that blocks all action. It is a deliberative faculty that subjects each impulse to rational evaluation before granting assent. The guardian agent mirrors this structure. It does not prevent the operational agent from proposing any action. It evaluates each proposed action against a rational framework (the policy set) and grants or withholds assent based on that evaluation.
This metaphor has practical implications. The inner citadel is effective because it is always active, not invoked selectively. A guardian agent that can be bypassed is not a safety mechanism. It is a suggestion box. In my implementations, the guardian sits in the execution pipeline, not alongside it. There is no code path from the operational agent to action execution that does not pass through the guardian. This is a non-negotiable architectural constraint.
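One way to make the "no bypass" constraint structural rather than conventional is to wrap every tool at registration time, so the operational agent is only ever handed the guarded version. The sketch below assumes a hypothetical `GuardedToolRegistry` and reduces the guardian to a yes/no callable for brevity. Python cannot truly hide the raw function, so this illustrates the shape and is not an enforcement guarantee.

```python
from typing import Callable, Dict


class GuardedToolRegistry:
    """Hypothetical registry: tools are wrapped once, on registration."""

    def __init__(self, guardian: Callable[[str, dict], bool]):
        self._guardian = guardian
        self._tools: Dict[str, Callable[..., object]] = {}

    def register(self, name: str, fn: Callable[..., object]) -> None:
        guardian = self._guardian

        def guarded(**kwargs):
            # The guardian check and the tool call live in one closure,
            # so the registry never exposes the unwrapped function.
            if not guardian(name, kwargs):
                raise PermissionError(f"guardian blocked {name}")
            return fn(**kwargs)

        self._tools[name] = guarded

    def call(self, name: str, /, **kwargs):
        """The only entry point given to operational agents."""
        return self._tools[name](**kwargs)
```

A registry like this also gives a natural place for a test asserting that every exposed tool went through `register`.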
How do you define and maintain the policy set?
The policy set should be structured as a hierarchy of hard constraints (never violate), soft guidelines (prefer to follow), and domain-specific rules, stored as versioned documents with the same change management discipline as the prompt contracts they resemble.
- Step 1: Define hard constraints first. These are non-negotiable rules that the guardian must enforce with zero tolerance. Examples: “Never expose PII in user-facing responses,” “Never execute irreversible database operations without human approval,” “Never generate content that contradicts regulatory requirements.” Hard constraints should be few (10-20) and unambiguous.
- Step 2: Define soft guidelines. These are preferred behaviors that can be overridden when context justifies it. Examples: “Prefer conservative estimates,” “Avoid technical jargon in user-facing responses,” “Default to the most recent data source when multiple sources conflict.” Soft guidelines allow the guardian to exercise judgment.
- Step 3: Add domain-specific rules as the system operates. Every production incident should produce a new policy rule that prevents recurrence. After 3 months of operation, the typical policy set grows from 15-20 rules to 40-60 rules. This growth is healthy. It represents the system learning from experience.
- Step 4: Version and test the policy set. Every policy change should be evaluated against a test suite of 100+ scenarios (including the specific incident that motivated the change) to verify that the new rule catches the intended violation without blocking legitimate actions. I track the false positive rate (legitimate actions blocked) and target below 3%.
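The steps above can be sketched as a versioned policy object plus a small regression harness. Everything here is hypothetical naming (`Rule`, `PolicySet`, `Scenario`), rules are modeled as simple predicates over a dict, and only hard constraints block. The false positive rate is computed as blocked legitimate scenarios divided by all legitimate scenarios, which is one reasonable reading of the metric and an assumption on my part.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Rule:
    rule_id: str
    severity: str                     # "hard", "soft", or "domain"
    description: str
    violates: Callable[[dict], bool]  # True when the action breaks the rule


@dataclass(frozen=True)
class PolicySet:
    version: str                      # bumped on every policy change
    rules: Tuple[Rule, ...]

    def blocks(self, action: dict) -> bool:
        # Simplification: only hard constraints block outright.
        return any(r.severity == "hard" and r.violates(action) for r in self.rules)


@dataclass(frozen=True)
class Scenario:
    action: dict
    should_block: bool                # expected outcome, labeled by a human


def false_positive_rate(policy: PolicySet, scenarios: List[Scenario]) -> float:
    """Share of legitimate scenarios that the policy wrongly blocks."""
    legitimate = [s for s in scenarios if not s.should_block]
    if not legitimate:
        return 0.0
    wrongly_blocked = sum(1 for s in legitimate if policy.blocks(s.action))
    return wrongly_blocked / len(legitimate)


def missed_violations(policy: PolicySet, scenarios: List[Scenario]) -> List[Scenario]:
    """Scenarios that should be blocked but are not."""
    return [s for s in scenarios if s.should_block and not policy.blocks(s.action)]


def irreversible_without_approval(action: dict) -> bool:
    return action.get("op") in {"DELETE", "DROP"} and not action.get("human_approved", False)


POLICY = PolicySet(
    version="1.1.0",
    rules=(
        Rule("HC-001", "hard",
             "Never execute irreversible database operations without human approval",
             irreversible_without_approval),
    ),
)

SCENARIOS = [
    Scenario({"op": "SELECT"}, should_block=False),
    Scenario({"op": "DELETE"}, should_block=True),
    Scenario({"op": "DELETE", "human_approved": True}, should_block=False),
]
```

In a CI gate, a policy change would be rejected when `missed_violations` is non-empty or when `false_positive_rate` exceeds the chosen target (3% in the text above), with the scenario list extended by the incident that motivated the change.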
The Guardian Agent Pattern does not eliminate risk. It structures the management of risk into an architectural layer that is explicit, testable, and improvable. The alternative, distributing safety logic throughout the operational agent’s prompts and hoping it holds, is the approach most teams use today. It is also the approach that produces the incidents that fill AI safety discussions with anecdotes rather than architecture.