Building an LLM Evaluation Pipeline That Outlasts Models
01
What problem did this system solve?
A legal technology company built their AI-powered contract analysis product on GPT-4 in early 2024. Within 14 months, they needed to evaluate and potentially migrate to 4 different models: GPT-4 Turbo, Claude 3.5 Sonnet, GPT-4o, and Claude 3.5 Sonnet (the October 2024 update). Each migration was a crisis. The team had no systematic way to evaluate whether the new model maintained the quality standards their 280 enterprise customers depended on.
Their evaluation process for the first migration (GPT-4 to GPT-4 Turbo) took 6 weeks: 3 weeks of manual testing by 4 domain experts, 2 weeks of customer-facing beta testing, and 1 week of incident response after subtle regressions surfaced in production. The second migration was equally painful. By the time the third model option emerged, the team was paralyzed. They knew they needed to switch (the new model was 40% cheaper with equivalent quality on general benchmarks), but they could not justify the migration cost.
I was engaged to build an evaluation pipeline that would make model migration a routine operation rather than a company-wide event.
02
How was the evaluation infrastructure designed?
The evaluation pipeline was designed around 4 principles: model-agnostic test cases, multi-dimensional scoring, automated regression detection, and human calibration loops.
The test suite contained 1,847 evaluation cases organized into 12 categories aligned with the product’s core capabilities: clause extraction (340 cases), risk classification (280 cases), obligation identification (220 cases), party detection (190 cases), date parsing (170 cases), defined term resolution (160 cases), and 6 additional categories covering edge cases and cross-cutting concerns. Each test case specified: the input document (or document section), the task instruction, the expected output (human-judged gold standard), and a scoring rubric with weighted dimensions.
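A test case of this shape can be sketched as a small dataclass. The field names here are illustrative; the write-up does not show the framework's actual schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EvalCase:
    """One evaluation case: input, task, gold standard, and rubric reference.

    Field names are illustrative, not the framework's actual schema.
    """
    case_id: str
    category: str          # one of the 12 capability categories
    document: str          # input document or document section
    instruction: str       # the task prompt sent to the model
    expected_output: str   # human-judged gold standard
    rubric_id: str         # which weighted scoring rubric applies

case = EvalCase(
    case_id="clause-0017",
    category="clause_extraction",
    document="The Supplier shall indemnify the Buyer against any losses...",
    instruction="Extract all indemnification clauses as JSON.",
    expected_output='{"clauses": ["..."]}',
    rubric_id="extraction-v3",
)
print(case.category)  # clause_extraction
```

Freezing the dataclass keeps gold-standard cases immutable once loaded, so a run cannot silently mutate the suite.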
The scoring system used 4 dimensions for each test case: factual accuracy (is the extracted information correct?), completeness (are all relevant items captured?), format compliance (does the output match the required schema?), and nuance handling (does the output correctly handle ambiguous or conditional language?). Each dimension was scored 0-5 by an LLM judge (a different model than the one being evaluated) using a detailed rubric with examples for each score level. The weighted composite score used weights calibrated against human expert judgments: accuracy (0.40), completeness (0.25), format (0.15), nuance (0.20).
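With those weights, the composite reduces to a weighted sum of the four 0-5 judge scores. A minimal sketch; the normalization to a 0-1 scale is my assumption, as the write-up does not state the composite's range:

```python
# Dimension weights calibrated against human expert judgment (from the text).
WEIGHTS = {"accuracy": 0.40, "completeness": 0.25, "format": 0.15, "nuance": 0.20}

def composite_score(scores: dict[str, float]) -> float:
    """Weighted composite of the four 0-5 dimension scores, scaled to 0-1."""
    if set(scores) != set(WEIGHTS):
        raise ValueError("all four dimensions must be scored")
    return sum(WEIGHTS[d] * s for d, s in scores.items()) / 5.0

score = composite_score({"accuracy": 5, "completeness": 4, "format": 5, "nuance": 3})
print(round(score, 2))  # 0.87
```

Raising on a missing dimension matters in practice: a judge that silently drops a dimension would otherwise deflate the composite and look like a model regression.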
I built the pipeline in Python using a custom framework (not LangSmith or Braintrust, which did not support the multi-dimensional scoring we needed at the time). The framework orchestrated: batch inference against the candidate model, parallel LLM-judge scoring across all 4 dimensions, score aggregation by category, statistical comparison against the baseline model’s scores, and report generation with per-category breakdowns and flagged regressions.
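The inference-then-judging loop at the heart of that orchestration can be sketched as follows, with `candidate_llm` and `judge_llm` as stand-in callables (prompt in, text out). The real framework batched and parallelized these calls; this sequential version only shows the data flow:

```python
import json

def judge_case(judge_llm, case, candidate_output):
    """Ask the judge model to score one output on the four rubric dimensions."""
    prompt = (
        "Score the actual output against the expected output on four "
        "dimensions (0-5 each): accuracy, completeness, format, nuance.\n"
        f"Task: {case['instruction']}\n"
        f"Expected: {case['expected_output']}\n"
        f"Actual: {candidate_output}\n"
        'Reply with JSON like {"accuracy": 4, "completeness": 5, '
        '"format": 5, "nuance": 3}.'
    )
    return json.loads(judge_llm(prompt))

def evaluate(candidate_llm, judge_llm, cases):
    """Run candidate inference, then judge scoring, for every case."""
    results = []
    for case in cases:
        output = candidate_llm(f"{case['instruction']}\n\n{case['document']}")
        results.append({
            "case_id": case["case_id"],
            "category": case["category"],
            "scores": judge_case(judge_llm, case, output),
        })
    return results
```

Keeping the category on each result row is what makes the per-category aggregation and regression comparison possible downstream.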
A full evaluation run against all 1,847 cases took 3.5 hours and cost approximately $180 in inference fees (for both the candidate model inference and the judge model scoring). This compared favorably to the 3-week, $45,000 cost of manual evaluation for the first migration.
03
What were the measurable outcomes?
- Model migration time: 9 days (down from 6 weeks)
- Regressions caught pre-production: 23
- Evaluation cases in test suite: 1,847
- Cost per full evaluation run: $180
- Customer-reported regressions post-migration: 0
- Model migrations completed: 4
The pipeline’s most valuable contribution was the 23 regressions it caught before they reached production. In the migration from GPT-4 Turbo to Claude 3.5 Sonnet, the pipeline identified that Claude’s clause extraction accuracy was 3 percentage points higher overall but 8 points lower on a specific category: indemnification clauses with nested conditions. This category contained only 28 test cases, making it invisible in aggregate metrics. Without per-category analysis, the regression would have reached production and affected the 40% of enterprise contracts that contain nested indemnification language.
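The per-category comparison that catches this kind of small-category regression can be sketched as follows; the data shape and the 3-point (0.03 on a 0-1 scale) flag threshold are illustrative:

```python
from collections import defaultdict
from statistics import mean

def flag_regressions(candidate, baseline, threshold=0.03):
    """Compare mean per-category scores and flag drops beyond `threshold`.

    `candidate` / `baseline` are lists of (category, score) pairs with
    scores on a 0-1 scale. Threshold and data shape are illustrative.
    """
    def by_category(results):
        groups = defaultdict(list)
        for category, score in results:
            groups[category].append(score)
        return {c: mean(s) for c, s in groups.items()}

    cand, base = by_category(candidate), by_category(baseline)
    return sorted(c for c in cand if cand[c] < base[c] - threshold)

# Candidate is better in aggregate but worse on one small category.
candidate = [("clause_extraction", 0.91), ("indemnification_nested", 0.78)]
baseline = [("clause_extraction", 0.88), ("indemnification_nested", 0.86)]
print(flag_regressions(candidate, baseline))  # ['indemnification_nested']
```

Averaging within categories before comparing is the whole point: pooled across all 1,847 cases, a 28-case category cannot move the aggregate enough to be noticed.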
The fix was straightforward: a prompt modification specific to the indemnification extraction task that included 3 few-shot examples of nested conditional clauses. This brought Claude’s performance on that category to parity with GPT-4 Turbo. The regression was discovered, diagnosed, and fixed in 2 days. Without the pipeline, it would have been discovered by a customer, probably weeks after migration.
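A fix of this kind amounts to prepending task-specific few-shot examples to the extraction prompt. A sketch with one invented example; the real prompt wording and the three production examples are not reproduced in this write-up:

```python
# Invented example; the production prompt used 3 real nested-condition clauses.
NESTED_INDEMNIFICATION_SHOTS = """\
Example:
Clause: "Supplier shall indemnify Buyer, provided that, where Buyer's own
negligence contributed to the loss, Supplier's liability is reduced
proportionally."
Extraction: {"type": "indemnification", "indemnitor": "Supplier",
"conditions": ["proportional reduction for Buyer negligence"]}
"""

def indemnification_prompt(instruction: str, document: str) -> str:
    """Prepend nested-condition few-shot examples to the base instruction."""
    return f"{instruction}\n\n{NESTED_INDEMNIFICATION_SHOTS}\n{document}"
```

Scoping the examples to one task keeps the other 11 categories' prompts unchanged, so the fix cannot regress categories that were already at parity.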
04
What would I change in hindsight?
I would have invested in human calibration from day one rather than week 4. The LLM judge’s scoring was reasonably correlated with human judgment (Pearson r = 0.81) out of the box, but there were systematic biases: the judge over-scored format compliance (because well-formatted but incorrect outputs looked good) and under-scored nuance handling (because the judge lacked the legal domain expertise to recognize subtle hedging language). The calibration process, comparing judge scores to human expert scores on a 200-case calibration set and adjusting the scoring rubric, brought correlation to r = 0.91. Those first 3 weeks of uncalibrated evaluation produced results I later had to re-evaluate.
I also should have built A/B testing into the pipeline from the start. The ability to route 5% of production traffic to a candidate model and compare real-user outcomes with evaluation predictions would have provided the ground truth needed to continuously improve the evaluation’s predictive validity. I added this capability in month 6. The data confirmed that the evaluation pipeline’s predictions matched production quality outcomes with 94% accuracy at the category level.
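Deterministic hash-based bucketing is one way to route 5% of traffic to a candidate model; a sketch under that assumption (names are mine, not the production implementation):

```python
import hashlib

def route(request_id: str, candidate_share: float = 0.05) -> str:
    """Stably assign ~5% of requests to the candidate model.

    Hashing the request ID keeps an ID's assignment fixed across retries,
    so a given request always sees output from the same model.
    """
    bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % 10_000
    return "candidate" if bucket < int(candidate_share * 10_000) else "baseline"

share = sum(route(f"req-{i}") == "candidate" for i in range(20_000)) / 20_000
print(f"candidate share: {share:.3f}")
```

Hash-based assignment also makes the experiment reproducible after the fact: given the request IDs, the exact candidate/baseline split can be reconstructed when comparing real-user outcomes with the evaluation's predictions.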
The fundamental lesson is that evaluation infrastructure is not a project. It is a product. It requires ongoing investment, calibration, and iteration. The test suite grew from 1,200 to 1,847 cases over 14 months as new edge cases surfaced. The scoring rubric was revised 6 times. The judge model was changed twice (from GPT-4 to Claude 3.5 Sonnet to GPT-4o, based on which provided the most calibrated scores for legal text). Treating evaluation as a one-time effort is like treating a CI pipeline as a one-time setup. It works until the first change, and then it becomes a liability.