AI Engineering AA-001

Multi-Agent Orchestration System

A 3-agent CrewAI pipeline coordinated by a state machine that reads real data, generates validated Python code, and produces portfolio-ready HTML dashboards — with structured retry logic and output schemas enforced at every handoff.

01 — Problem

Single-Agent Pipelines Couldn’t Scale the Work

I was generating data dashboards from CSV files — read the data, write analysis code, produce a styled HTML report. A single LLM call could handle one step, but chaining all three produced brittle outputs. The code generation step would hallucinate column names. The visualization step would ignore the analysis findings. Each stage had different failure modes, and a monolithic prompt couldn’t address them independently.

I needed a pipeline where each cognitive task was isolated, validated, and recoverable. If the code agent produced invalid Python, only that stage should retry — not the entire run. The problem wasn’t intelligence. It was reliability across interdependent steps.

02 — Architecture

Three Agents, One State Machine

The system uses CrewAI to define three specialized agents, each with a constrained role and a Pydantic output schema:

Agent 1 — Data Analyst

Reads the source CSV, profiles column types, identifies statistical patterns, and produces a structured analysis brief. Output schema enforces that every claim references a specific column name and row count. No vague summaries allowed.
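A minimal sketch of what that contract could look like in Pydantic (the field names here are illustrative, not the actual schema):

```python
from pydantic import BaseModel, Field

class Finding(BaseModel):
    """One analysis claim; must cite a concrete column and row count."""
    column: str = Field(min_length=1)      # hypothetical: the column the claim references
    row_count: int = Field(ge=0)           # rows backing the claim
    statement: str = Field(min_length=10)  # rejects vague one-line summaries

class AnalysisBrief(BaseModel):
    """Illustrative output contract for the Data Analyst agent."""
    dataset_rows: int = Field(ge=1)
    findings: list[Finding] = Field(min_length=1)  # at least one structured finding
```

Because validation runs at construction time, a brief with zero findings or a claim missing its column name fails immediately rather than flowing downstream.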

Agent 2 — Code Engineer

Receives the analysis brief and writes Python visualization code using Plotly. The code is executed in a sandboxed subprocess. If it throws an exception, the agent receives the traceback and retries — up to 3 attempts before escalating to a fallback template.
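That execute-and-retry loop can be sketched as follows, with a `generate` callable standing in for the Code Engineer agent (the function names and feedback format are assumptions, not the production code):

```python
import subprocess
import sys
import tempfile

MAX_ATTEMPTS = 3  # matches the pipeline's retry budget

def run_generated_code(source: str, timeout: int = 30) -> tuple[bool, str]:
    """Execute generated Python in a subprocess; return (ok, stderr)."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(source)
        path = f.name
    try:
        proc = subprocess.run(
            [sys.executable, path], capture_output=True, text=True, timeout=timeout
        )
        return proc.returncode == 0, proc.stderr
    except subprocess.TimeoutExpired:
        return False, f"TimeoutExpired: code exceeded {timeout}s"

def generate_with_retry(generate, timeout: int = 30):
    """Feed each failure's traceback back into the next generation attempt."""
    feedback = ""
    for _ in range(MAX_ATTEMPTS):
        source = generate(feedback)
        ok, stderr = run_generated_code(source, timeout)
        if ok:
            return source
        feedback = stderr  # the traceback becomes context for the retry
    return None  # caller escalates to the fallback template
```

In the real pipeline the traceback would be appended to the agent's prompt; here it is simply passed back into `generate`.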

Agent 3 — Report Composer

Takes the analysis brief and rendered charts, then assembles a styled HTML dashboard with KPI cards, chart embeds, and narrative sections. Output is validated against an HTML structure schema before being written to disk.
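The structure check can be as simple as asserting that the required section ids are present before writing to disk; a stdlib sketch, with made-up id names:

```python
from html.parser import HTMLParser

REQUIRED_IDS = {"kpi-cards", "charts", "narrative"}  # illustrative section ids

class SectionCollector(HTMLParser):
    """Collects element ids so the dashboard skeleton can be asserted."""
    def __init__(self):
        super().__init__()
        self.ids = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "id":
                self.ids.add(value)

def validate_dashboard(html: str) -> bool:
    """True if every required dashboard section is present in the markup."""
    parser = SectionCollector()
    parser.feed(html)
    return REQUIRED_IDS <= parser.ids
```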

Key Design Decisions

Why CrewAI over LangGraph? CrewAI’s role-based agent definitions matched my mental model better — each agent has a backstory, goal, and constrained tool access. LangGraph’s graph-first approach felt over-engineered for a 3-node linear pipeline. I chose legibility over flexibility.

Why Pydantic schemas on every output? Without schema validation, agent outputs drift silently. The Data Analyst might return prose instead of structured findings. Pydantic catches this at the boundary, not three steps later when the Report Composer fails on malformed input.
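At each handoff this amounts to one parse call at the boundary. A sketch, assuming Pydantic v2's `model_validate_json` and a trimmed stand-in schema:

```python
from pydantic import BaseModel, ValidationError

class AnalysisHandoff(BaseModel):
    """Trimmed stand-in for the real handoff schema."""
    dataset_rows: int
    findings: list[str]

def accept_handoff(raw_output: str):
    """Validate an agent's raw JSON output at the boundary.

    Returns the parsed brief, or None so the caller can retry the stage
    immediately instead of passing malformed data downstream.
    """
    try:
        return AnalysisHandoff.model_validate_json(raw_output)
    except ValidationError:
        return None
```

Prose instead of JSON, a missing field, or a wrong type all fail here, at the stage that produced them.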

Why subprocess execution for code? Running LLM-generated Python in the main process is a reliability risk. A subprocess with a 30-second timeout isolates failures. If the generated code hangs or crashes, the parent process stays healthy and can trigger a retry.

03 — Outcomes

Measured Results

3 — Specialized Agents: each with isolated failure domains and retry logic

87% — First-Pass Validation: outputs conform to schema without retry on the first attempt

14s — Avg Pipeline Time: from CSV input to rendered HTML dashboard

0 — Manual Interventions: post-launch; retry logic handles all recoverable failures

04 — Reflection

Reliability Is the Architecture

The most important lesson from this project had nothing to do with AI. It was a distributed systems insight: treat each agent like a microservice with a contract. Define the input schema. Define the output schema. Handle failures at the boundary. The “intelligence” of the agent is secondary to the reliability of the handoff.

What I’d change: the Streamlit Mission Control GUI I built for monitoring runs is useful but overbuilt for a personal tool. A simple CLI with structured logging would have been sufficient, and the two weeks I spent on the GUI could have gone toward adding a fourth agent for automated data cleaning.

“The question isn’t whether your agents are smart enough. It’s whether your system is honest enough to tell you when they fail.”

Outcomes

3 specialized agents coordinated per run; 87% first-pass output validation rate; 14-second average pipeline completion; 0 manual interventions required post-launch