AI Engineering AA-007

NightShiftCrew v2 — Autonomous AI Agent Pipeline

A 3-agent CrewAI pipeline running on local LLM inference (Qwen 2.5 Coder 14B via Ollama) that reads real filesystem data, writes and executes Python code with binary validation, and produces portfolio-ready HTML dashboards — orchestrated through a Streamlit Mission Control GUI at zero API cost.

01 — Problem

AI Demos That Don’t Produce Anything Real

Most AI agent demonstrations generate text about analytics. They summarize hypothetical datasets or produce markdown tables with invented numbers. I needed the opposite — a pipeline that reads real files from disk, writes Python code that actually executes, and produces HTML dashboards with computed metrics grounded in actual data. If the code doesn’t run to exit 0, the output isn’t valid. No exceptions.

The constraint was operational: I wanted to feed the system a CSV of enrollment data at 11 PM and wake up to a rendered dashboard with real charts. Not a summary. Not a description of what a dashboard might contain. The artifact itself.

02 — Architecture

Three Agents, Sequential Handoff, Local LLM

The pipeline runs three CrewAI agents in sequence, each with a single responsibility and a validation gate before the handoff:

Scout — Data Reader

Reads raw data from the filesystem, profiles column types and distributions, and produces a structured analysis brief. The brief is schema-validated: every statistical claim must reference a specific column name and row count. This prevents the downstream agents from operating on hallucinated data descriptions.
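A gate like this can be sketched in plain Python. The function and field names below (`profile_csv`, `validate_brief`, the brief's `claims` shape) are illustrative, not the pipeline's actual code; the point is that every claim must resolve against real columns and a real row count:

```python
# Hypothetical sketch of the Scout's brief gate: a claim is valid only if
# it names a column that exists in the profiled data.

def profile_csv(rows: list[dict]) -> dict:
    """Minimal profile: the set of column names and the row count."""
    columns = set().union(*(r.keys() for r in rows)) if rows else set()
    return {"columns": columns, "row_count": len(rows)}

def validate_brief(brief: dict, profile: dict) -> list[str]:
    """Return a list of violations; an empty list means the brief passes."""
    errors = []
    if brief.get("row_count") != profile["row_count"]:
        errors.append("row_count does not match the source data")
    for claim in brief.get("claims", []):
        if claim.get("column") not in profile["columns"]:
            errors.append(f"claim references unknown column: {claim.get('column')!r}")
    return errors

rows = [{"student_id": 1, "course": "ML"}, {"student_id": 2, "course": "NLP"}]
profile = profile_csv(rows)
good = {"row_count": 2, "claims": [{"column": "course", "stat": "2 distinct values"}]}
bad = {"row_count": 2, "claims": [{"column": "gpa", "stat": "mean 3.4"}]}
```

Rejecting the brief here, before any code is written, is what keeps the downstream agents from building on hallucinated columns.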

Engineer — Code Writer + Executor

Receives the analysis brief and writes Python visualization code using Plotly. The code runs in a sandboxed subprocess with a 60-second timeout. If it raises an exception, the agent receives the full traceback and rewrites the code, up to 3 attempts. Quality validation is binary: exit 0 or retry. No LLM judges, no subjective scoring.
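The execute-or-retry loop is simple to sketch. In this hedged version, `rewrite` is a stand-in for the LLM call that receives the traceback; the real agent's interface will differ:

```python
# Sketch of the Engineer's binary validation loop: success is strictly
# exit code 0 from a subprocess with a hard timeout.
import os
import subprocess
import sys
import tempfile

def run_candidate(code: str, timeout: int = 60) -> tuple[bool, str]:
    """Execute candidate code in a subprocess; return (exit_ok, stderr)."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        proc = subprocess.run([sys.executable, path],
                              capture_output=True, text=True, timeout=timeout)
        return proc.returncode == 0, proc.stderr
    except subprocess.TimeoutExpired:
        return False, "timeout"
    finally:
        os.unlink(path)

def execute_with_retries(code: str, rewrite, max_attempts: int = 3):
    """Run, and on failure feed the traceback to `rewrite` (the LLM)."""
    for attempt in range(max_attempts):
        ok, err = run_candidate(code)
        if ok:
            return True, attempt + 1
        code = rewrite(code, err)  # in the real pipeline, the LLM rewrites
    return False, max_attempts
```

Because the verdict is a process exit code, the loop has no judgment to tune: it either terminates with a runnable artifact or reports failure after the attempt budget.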

Presenter — Dashboard Assembler

Takes the analysis brief and rendered chart files, then composes a styled HTML dashboard with KPI cards, chart embeds, and narrative sections. The output is validated against an HTML structure schema to ensure all required elements are present before writing to disk.
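A structural check of this kind can be done with the standard library's `html.parser`. The required element IDs below are invented for the example; the actual schema the Presenter validates against is not shown in this writeup:

```python
# Illustrative sketch of the Presenter's structural gate: the dashboard
# passes only if every required element ID is present in the HTML.
from html.parser import HTMLParser

REQUIRED_IDS = {"kpi-cards", "charts", "narrative"}  # hypothetical schema

class IdCollector(HTMLParser):
    """Collect every id attribute seen in the document."""
    def __init__(self):
        super().__init__()
        self.ids = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "id":
                self.ids.add(value)

def validate_dashboard(html: str) -> set[str]:
    """Return the set of required IDs missing from the document."""
    collector = IdCollector()
    collector.feed(html)
    return REQUIRED_IDS - collector.ids

page = '<html><body><div id="kpi-cards"></div><div id="charts"></div></body></html>'
```

Only a dashboard with an empty missing-ID set is written to disk; anything else goes back for reassembly.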

Key Design Decisions

Why local LLM (Ollama + Qwen 2.5 Coder 14B) instead of API calls? Cost and iteration speed. During development, I ran 200+ pipeline executions tuning prompts and retry logic. At API pricing, that would have cost $40–60. Locally, the cost was electricity. The RTX 4070 Ti handles the 14B-parameter model comfortably, averaging about 8 seconds per inference.
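For reference, a local Ollama server exposes a plain REST endpoint (`POST /api/generate`), so no SDK is required. This is a minimal stdlib sketch, not the pipeline's actual client; the model tag and helper names are assumptions:

```python
# Hedged sketch of calling a local Ollama server over its REST API.
import json
import urllib.request

def build_request(prompt: str, model: str = "qwen2.5-coder:14b") -> dict:
    """Payload for Ollama's /api/generate endpoint (non-streaming)."""
    return {"model": model, "prompt": prompt, "stream": False}

def generate(prompt: str, host: str = "http://localhost:11434") -> str:
    """Send the prompt to the local server and return the completion text."""
    payload = json.dumps(build_request(prompt)).encode()
    req = urllib.request.Request(f"{host}/api/generate", data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

Because the server runs on localhost, each of the 200+ development runs hit this endpoint at zero marginal cost.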

Why binary validation instead of LLM-based quality scoring? Subjective quality scores introduce a second unreliable system to evaluate the first. “Does the code execute successfully?” is a deterministic question with no ambiguity. This made reliability tractable — I could reason about failure modes without modeling the evaluator’s behavior.

Why Streamlit for Mission Control? The GUI supports freeform and preset mission types with save/reload, real-time agent progress tracking, and output preview. It’s overbuilt for a personal tool — but it let me observe agent behavior during development in a way that log files couldn’t match.

03 — Outcomes

Measured Results

3 Specialized Agents: each with isolated failure domains and typed output schemas

14B Local LLM Parameters: Qwen 2.5 Coder via Ollama on RTX 4070 Ti, zero API cost

82% First-Pass Success: code executes to exit 0 without retry on initial generation

$0 Inference Cost: 200+ pipeline runs during development at zero marginal cost

04 — Reflection

Constrain the Output, Not the Intelligence

The most important design choice was making validation binary rather than qualitative. Every attempt to use LLM-based quality scoring introduced more variance, not less. The pipeline became reliable when I stopped asking “is this output good?” and started asking “does this output meet a verifiable specification?” The distinction sounds subtle. In practice, it’s the difference between a demo and a tool.

What I’d change: the Scout agent’s data profiling is too verbose. It describes every column in detail, which wastes context window tokens for the downstream agents. A smarter Scout would identify only the 3–5 most analytically interesting columns and profile those deeply, rather than giving equal attention to every field.

“Agent reliability comes from constraining output, not expanding capability. A system that does less but validates everything outperforms one that attempts more but verifies nothing.”

Outcomes

3 specialized agents with typed output schemas; 82% first-pass code execution success; 14B-parameter local LLM at $0 inference cost; 200+ pipeline runs during development