Local LLMs as Strategic Infrastructure
01
What problem did this system solve?
A healthcare analytics company processed 45,000 daily queries against patient records, clinical notes, and diagnostic reports. Their existing architecture sent every query to GPT-4 via the OpenAI API. This created 3 compounding problems.
First, cost. At an average of 2,100 tokens per query (input plus output), the monthly inference bill was $8,400 and growing 12% month over month as usage expanded. Second, data sovereignty. Every query transmitted protected health information (PHI) to OpenAI’s servers. While OpenAI’s BAA technically covered HIPAA requirements, the company’s largest hospital client demanded that PHI never leave their cloud environment. Third, latency. API round-trip times averaged 1.2 seconds, with p95 at 3.4 seconds during peak hours, which made the system unusable for real-time clinical decision support.
The mandate was clear: process PHI-containing queries locally, use cloud APIs only for queries that required frontier reasoning and contained no sensitive data, and reduce costs by at least 50%.
02
How was the architecture designed?
I designed a 3-tier inference architecture with an intelligent router at its core.
The first tier was a local inference cluster running Llama 3 70B (quantized to 4-bit with GPTQ) on 4 NVIDIA A100 GPUs hosted in the client’s private cloud. This tier handled all queries containing PHI, which I identified using a combination of a fine-tuned NER model (trained on 12,000 labeled clinical text samples to detect PHI entities with 97.3% recall) and rule-based pattern matching for structured identifiers (MRNs, SSNs, dates of birth). The local cluster processed queries at an average latency of 340ms with p95 at 620ms.
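The hybrid PHI detection described above can be sketched as follows. `detect_phi_entities` is a hypothetical stand-in for the fine-tuned NER model, and the regex patterns are illustrative examples of the rule-based layer for structured identifiers (real MRN formats are site-specific):

```python
import re

# Rule-based patterns for structured identifiers (illustrative, not exhaustive).
# The MRN format below is a hypothetical example; real formats vary by site.
STRUCTURED_PHI_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "mrn": re.compile(r"\bMRN[:\s]*\d{6,10}\b", re.IGNORECASE),
    "dob": re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),
}

def detect_phi_entities(text: str) -> list[str]:
    """Placeholder for the fine-tuned NER model, which returns
    detected PHI entity labels (names, addresses, dates, etc.)."""
    return []

def contains_phi(text: str) -> bool:
    # A query is flagged if EITHER detector fires; the union of the
    # two detectors is what drives recall.
    if any(p.search(text) for p in STRUCTURED_PHI_PATTERNS.values()):
        return True
    return bool(detect_phi_entities(text))
```

Running both detectors and taking the union is what lets the rule-based layer backstop the NER model on structured identifiers it might miss.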
The second tier was Claude 3.5 Sonnet via Anthropic’s API, reserved for complex reasoning tasks that contained no PHI. These included clinical guideline interpretation, differential diagnosis reasoning, and research synthesis. The PHI classifier ensured no sensitive data reached this tier. Average latency was 1.1 seconds, acceptable for these non-real-time analytical tasks.
The third tier was a locally hosted Llama 3 8B model (also GPTQ quantized) running on 2 NVIDIA L40S GPUs. This handled simple, high-volume tasks: formatting structured data, generating standard report templates, translating between medical coding systems, and answering FAQ-type queries from the knowledge base. Latency averaged 95ms.
The router was a lightweight classifier (a fine-tuned DistilBERT model, 3ms inference time) that categorized each incoming query on 2 axes: PHI presence (yes/no) and reasoning complexity (simple/moderate/complex). The routing logic: if PHI is present, route to Tier 1 (local 70B) regardless of complexity. If no PHI and complex, route to Tier 2 (cloud Sonnet). If no PHI and simple/moderate, route to Tier 3 (local 8B). The router also maintained a fallback chain: if Tier 3 returned low-confidence output (measured by a calibrated confidence score), the query was promoted to Tier 1.
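The routing logic above, including the low-confidence promotion from Tier 3 to Tier 1, reduces to a small decision function. A minimal sketch (names and the confidence threshold are illustrative, not the production values):

```python
from dataclasses import dataclass
from enum import Enum

class Tier(Enum):
    LOCAL_70B = 1     # PHI-safe local cluster
    CLOUD_SONNET = 2  # frontier reasoning, PHI-free queries only
    LOCAL_8B = 3      # high-volume simple/moderate tasks

@dataclass
class Classification:
    has_phi: bool
    complexity: str  # "simple" | "moderate" | "complex"

def route(c: Classification) -> Tier:
    # PHI always stays local, regardless of complexity.
    if c.has_phi:
        return Tier.LOCAL_70B
    if c.complexity == "complex":
        return Tier.CLOUD_SONNET
    return Tier.LOCAL_8B

CONFIDENCE_FLOOR = 0.7  # illustrative calibrated-score threshold

def maybe_promote(tier: Tier, confidence: float) -> Tier:
    # Fallback chain: low-confidence Tier 3 output is re-run on Tier 1.
    if tier is Tier.LOCAL_8B and confidence < CONFIDENCE_FLOOR:
        return Tier.LOCAL_70B
    return tier
```

Keeping the routing decision in a pure function like this makes the PHI-always-local invariant easy to unit-test exhaustively, which matters when the router is a safety-critical component.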
I deployed the local models using vLLM for efficient batched inference, with dynamic batching that grouped up to 16 concurrent requests. The entire infrastructure ran in the client’s AWS VPC with no external data egress for PHI-containing queries. Monitoring used Prometheus for GPU utilization, vLLM metrics for inference throughput, and a custom dashboard tracking per-tier query volumes, latencies, and costs.
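As a rough sketch, the Tier 1 deployment corresponds to launching vLLM's OpenAI-compatible server along these lines (the model path and port are placeholders; the flag values mirror the setup described above):

```shell
# Tier 1: Llama 3 70B, GPTQ 4-bit, tensor-parallel across the 4 A100s,
# dynamic batching capped at 16 concurrent sequences.
python -m vllm.entrypoints.openai.api_server \
  --model /models/llama-3-70b-gptq \
  --quantization gptq \
  --tensor-parallel-size 4 \
  --max-num-seqs 16 \
  --port 8000
```

The OpenAI-compatible endpoint meant application code could switch between cloud and local tiers by changing only the base URL.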
03
What were the measurable outcomes?
$2,770
Monthly Cost (down from $8,400)
620ms
P95 Latency for PHI queries (down from 3.4s)
0
PHI Data Egress Incidents
45K
Daily Queries Processed
92%
Answer Quality Score (vs 94% all-cloud baseline)
73%
Queries Handled Locally
The cost breakdown: Tier 1 (local 70B) handled 31% of queries at a fixed GPU cost of $1,840/month. Tier 2 (cloud Sonnet) handled 27% at $930/month in API fees. Tier 3 (local 8B) handled 42% at a fixed GPU cost of $620/month. That totals $3,390/month; subtracting the $620/month of GPU capacity the client was already paying for nets $2,770/month in new spend.
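The breakdown above reduces to simple arithmetic, reproduced here as a sanity check:

```python
# Per-tier monthly costs from the breakdown above (USD).
tier_costs = {
    "tier1_local_70b": 1840,   # fixed GPU cost, 31% of queries
    "tier2_cloud_sonnet": 930, # API fees, 27% of queries
    "tier3_local_8b": 620,     # fixed GPU cost, 42% of queries
}
existing_gpu_credit = 620  # GPU capacity the client was already paying for

gross = sum(tier_costs.values())
net = gross - existing_gpu_credit
print(gross, net)  # 3390 2770

# The per-tier query shares should account for all traffic.
shares = [0.31, 0.27, 0.42]
assert abs(sum(shares) - 1.0) < 1e-9
```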
The 2-point quality drop (from 94% to 92% on the internal eval suite) came almost entirely from the Tier 3 local 8B model, which struggled with nuanced medical terminology in approximately 8% of its assigned queries. I addressed this by fine-tuning the 8B model on 4,200 domain-specific examples, which brought quality to 93.4%. The remaining gap was a deliberate tradeoff: the cost and sovereignty benefits justified a marginal quality concession on low-complexity tasks.
04
What would I change in hindsight?
I would have invested more heavily in the PHI classifier from day one. The initial version had 94.1% recall, which sounds high but meant that approximately 2,600 queries per month containing PHI were misclassified as clean and routed to the cloud tier. I caught this in the first week of production monitoring and expedited a retraining cycle that brought recall to 97.3%, but those first 7 days represented an unacceptable data handling gap. For healthcare applications, the PHI classifier should be treated as a safety-critical component with the same rigor as the core inference models.
I also underestimated the operational complexity of managing local GPU infrastructure. The A100 cluster required 2 firmware updates, 1 vLLM version migration, and 3 model quantization adjustments in the first 4 months. Each required coordinated downtime and traffic rerouting. Building automated failover from Tier 1 to a privacy-preserving cloud option (using Anthropic’s data processing agreement) would have provided resilience I did not have initially.
The broader lesson is that local LLMs are not “set and forget” infrastructure. They are living systems that require the same operational investment as any production database or compute cluster. The sovereignty and cost benefits are real, but they come with operational overhead that must be budgeted honestly.