01 — Problem
AI Demos That Don’t Produce Anything Real
Most AI agent demonstrations generate text about analytics. They summarize hypothetical datasets or produce markdown tables with invented numbers. I needed the opposite — a pipeline that reads real files from disk, writes Python code that actually executes, and produces HTML dashboards with computed metrics grounded in actual data. If the code doesn’t run to exit 0, the output isn’t valid. No exceptions.
The constraint was operational: I wanted to feed the system a CSV of enrollment data at 11 PM and wake up to a rendered dashboard with real charts. Not a summary. Not a description of what a dashboard might contain. The artifact itself.
02 — Architecture
Three Agents, Sequential Handoff, Local LLM
The pipeline runs three CrewAI agents in sequence, each with a single responsibility and a validation gate before the handoff:
Scout — Data Reader
Reads raw data from the filesystem, profiles column types and distributions, and produces a structured analysis brief. The brief is schema-validated: every statistical claim must reference a specific column name and row count. This prevents the downstream agents from operating on hallucinated data descriptions.
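The schema gate can be sketched roughly like this. This is a minimal illustration of the idea, not the pipeline's actual schema — the field names and brief shape are assumptions; the real brief carries more profiling detail.

```python
from dataclasses import dataclass, field

@dataclass
class StatClaim:
    column: str      # must name a column that exists in the profiled file
    row_count: int   # must match the number of rows actually read
    statement: str   # the natural-language claim itself

@dataclass
class AnalysisBrief:
    columns: list[str]
    row_count: int
    claims: list[StatClaim] = field(default_factory=list)

def validate_brief(brief: AnalysisBrief) -> list[str]:
    """Reject any claim that doesn't reference a real column and the true row count."""
    errors = []
    for claim in brief.claims:
        if claim.column not in brief.columns:
            errors.append(f"unknown column: {claim.column}")
        if claim.row_count != brief.row_count:
            errors.append(f"row count mismatch: {claim.row_count} != {brief.row_count}")
    return errors
```

A brief that fails validation never reaches the Engineer — the claim is either traceable to real data or the handoff doesn't happen.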
Engineer — Code Writer + Executor
Receives the analysis brief and writes Python visualization code using Plotly. The code runs in a sandboxed subprocess with a 60-second timeout. If it throws an exception, the agent receives the full traceback and rewrites — up to 3 attempts. Quality validation is binary: exit 0 or retry. No LLM judges, no subjective scoring.
Presenter — Dashboard Assembler
Takes the analysis brief and rendered chart files, then composes a styled HTML dashboard with KPI cards, chart embeds, and narrative sections. The output is validated against an HTML structure schema to ensure all required elements are present before writing to disk.
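The structure check can be done with the standard-library HTML parser. A minimal sketch under assumed conventions — the class names (`kpi-card`, `chart-embed`, `narrative`) are illustrative, not the dashboard's actual markup.

```python
from html.parser import HTMLParser

class DashboardCheck(HTMLParser):
    """Count required structural elements before the dashboard is written to disk."""
    REQUIRED = {"kpi-card": 1, "chart-embed": 1, "narrative": 1}  # illustrative names

    def __init__(self):
        super().__init__()
        self.seen = {name: 0 for name in self.REQUIRED}

    def handle_starttag(self, tag, attrs):
        classes = dict(attrs).get("class", "").split()
        for name in self.seen:
            if name in classes:
                self.seen[name] += 1

def validate_dashboard(html: str) -> list[str]:
    """Return the list of required elements missing from the rendered HTML."""
    checker = DashboardCheck()
    checker.feed(html)
    return [f"missing element: {name}"
            for name, minimum in DashboardCheck.REQUIRED.items()
            if checker.seen[name] < minimum]
```

Same philosophy as the Engineer's gate: "are the required elements present?" is a yes/no question, so a malformed dashboard is caught before it ever lands on disk.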
Key Design Decisions
Why local LLM (Ollama + Qwen 2.5 Coder 14B) instead of API calls? Cost and iteration speed. During development, I ran 200+ pipeline executions tuning prompts and retry logic. At API pricing, that would have cost $40–60. Locally, the cost was electricity. The 4070 Ti handles 14B parameters comfortably with 8-second average inference.
Why binary validation instead of LLM-based quality scoring? Subjective quality scores introduce a second unreliable system to evaluate the first. “Does the code execute successfully?” is a deterministic question with no ambiguity. This made reliability tractable — I could reason about failure modes without modeling the evaluator’s behavior.
Why Streamlit for Mission Control? The GUI supports freeform and preset mission types with save/reload, real-time agent progress tracking, and output preview. It’s overbuilt for a personal tool — but it let me observe agent behavior during development in a way that log files couldn’t match.
03 — Outcomes
Measured Results
3 Specialized Agents
each with isolated failure domains and typed output schemas
14B Local LLM Parameters
Qwen 2.5 Coder via Ollama on RTX 4070 Ti — zero API cost
82% First-Pass Success
code executes to exit 0 without retry on initial generation
$0 Inference Cost
200+ pipeline runs during development at zero marginal cost
04 — Reflection
Constrain the Output, Not the Intelligence
The most important design choice was making validation binary rather than qualitative. Every attempt to use LLM-based quality scoring introduced more variance, not less. The pipeline became reliable when I stopped asking “is this output good?” and started asking “does this output meet a verifiable specification?” The distinction sounds subtle. In practice, it’s the difference between a demo and a tool.
What I’d change: the Scout agent’s data profiling is too verbose. It describes every column in detail, which wastes context window tokens for the downstream agents. A smarter Scout would identify only the 3–5 most analytically interesting columns and profile those deeply, rather than giving equal attention to every field.
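A smarter Scout might rank columns with a cheap heuristic before profiling. The sketch below is one possible scoring scheme, not something the pipeline does today: numeric columns score by coefficient of variation, categoricals by how far their cardinality sits from the trivial extremes (all-identical or all-unique).

```python
import statistics

def interesting_columns(rows: list[dict], k: int = 5) -> list[str]:
    """Return the k columns most worth profiling deeply, by a crude
    'analytic interest' score. rows is a list of dicts (e.g. csv.DictReader)."""
    if not rows:
        return []
    scores = {}
    for col in rows[0]:
        values = [r[col] for r in rows if r[col] not in ("", None)]
        try:
            nums = [float(v) for v in values]
            mean = statistics.fmean(nums)
            # numeric: relative spread — flat columns score near zero
            score = (statistics.pstdev(nums) / abs(mean)) if mean else 0.0
        except ValueError:
            distinct = len(set(values))
            # categorical: constant or all-unique columns carry little signal
            score = 0.0 if distinct in (1, len(values)) else distinct / len(values)
        scores[col] = score
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

Profiling only the top-k columns deeply would shrink the brief and leave more context window for the Engineer and Presenter — though the heuristic itself would need tuning (unique ID columns, for instance, can still sneak through on numeric spread).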
“Agent reliability comes from constraining output, not expanding capability. A system that does less but validates everything outperforms one that attempts more but verifies nothing.”
Outcomes
3 specialized agents with typed output schemas; 82% first-pass code execution success; 14B-parameter local LLM at $0 inference cost; 200+ pipeline runs during development