RAG as Data Infrastructure, Not Feature
01
What problem did this system solve?
A B2B SaaS company serving 340 enterprise customers had bolted a “chat with your data” feature onto their existing product in Q2 2024. The implementation followed the standard tutorial pattern: embed documents, store in Pinecone, retrieve top-5 chunks, pass to GPT-4 with a system prompt. Within 3 months, support tickets related to incorrect AI answers consumed 22 hours per week of engineering time. Customer trust in the feature dropped to 34% based on in-app feedback ratings.
The core problem was architectural, not algorithmic. The RAG system had been designed as a feature: a self-contained module with its own data pipeline, its own embedding logic, and its own retrieval stack. It shared nothing with the platform’s existing data infrastructure. When a customer updated a document in the main application, the RAG index lagged by 4-18 hours. When a customer’s access permissions changed, the RAG system had no awareness. When product data schemas evolved, the embedding pipeline broke silently.
I was brought in to rebuild the system. The mandate was not “fix the chatbot.” It was “make retrieval a platform capability.”
02
How was the architecture designed?
The redesign treated RAG as a data infrastructure layer with 4 architectural principles: shared data pipelines, real-time synchronization, permission-aware retrieval, and evaluation-driven iteration.
The first decision was to eliminate the separate embedding pipeline entirely. Instead of a batch job that re-processed documents on a schedule, I integrated embedding generation into the platform’s existing change-data-capture (CDC) stream. Every document mutation (create, update, delete) in the primary PostgreSQL database triggered a Debezium event that flowed through Kafka to an embedding service. The embedding service chunked the document using a hybrid strategy (semantic paragraph boundaries with a 512-token maximum, 64-token overlap), generated embeddings via a locally-hosted BGE-large model, and wrote the vectors to a Qdrant cluster. End-to-end latency from document change to searchable vector: 2.3 seconds average.
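The chunking step above can be sketched in a few lines. This is a minimal illustration of the described strategy (paragraph boundaries, 512-token maximum, 64-token overlap), not the production code; token counting here is a whitespace-word approximation standing in for the real BGE tokenizer, and oversized single paragraphs pass through unsplit.

```python
MAX_TOKENS = 512
OVERLAP_TOKENS = 64

def count_tokens(text: str) -> int:
    # Approximation: one token per whitespace-separated word.
    # The production system would use the BGE tokenizer instead.
    return len(text.split())

def chunk_document(text: str) -> list[str]:
    # Split on paragraph boundaries, then pack paragraphs into chunks
    # of at most MAX_TOKENS, carrying a small overlap between chunks.
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    current_tokens = 0
    for para in paragraphs:
        ptoks = count_tokens(para)
        if current and current_tokens + ptoks > MAX_TOKENS:
            chunks.append("\n\n".join(current))
            # Seed the next chunk with the tail of the previous one.
            overlap_words = " ".join(current).split()[-OVERLAP_TOKENS:]
            current = [" ".join(overlap_words)]
            current_tokens = len(overlap_words)
        current.append(para)
        current_tokens += ptoks
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```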
Permission-aware retrieval was the second major architectural change. The original system retrieved documents regardless of whether the querying user had access to them in the main application. This was both a security vulnerability and a trust problem. I implemented a metadata filtering layer where every vector stored its associated permission scope (tenant ID, team IDs, user-level access flags) as payload metadata. At query time, the retrieval layer constructed a Qdrant filter clause from the user’s active permissions before executing the similarity search. This added 12ms of latency but eliminated an entire class of data leakage issues.
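The filter construction might look like the sketch below. The payload keys (`tenant_id`, `team_ids`, `allowed_user_ids`) are illustrative names, not the platform's actual schema; the dict follows Qdrant's JSON filter format, where `must` conditions all have to hold and at least one `should` condition has to match.

```python
def build_permission_filter(tenant_id: str, team_ids: list[str],
                            user_id: str) -> dict:
    # Build a Qdrant-style payload filter from the user's active
    # permissions. Payload key names here are hypothetical.
    return {
        "must": [
            # Hard tenant isolation: vectors outside the tenant never match.
            {"key": "tenant_id", "match": {"value": tenant_id}},
        ],
        "should": [
            # Within the tenant, the document must be visible to one of the
            # user's teams or be explicitly shared with the user.
            {"key": "team_ids", "match": {"any": team_ids}},
            {"key": "allowed_user_ids", "match": {"any": [user_id]}},
        ],
    }
```

Applied as a pre-filter at query time, this means the similarity search only ever scores vectors the user is allowed to see, rather than filtering results after retrieval.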
The retrieval pipeline itself was redesigned as a 3-stage process. First, a hybrid search combined dense vector similarity (BGE-large embeddings) with sparse keyword matching (BM25 via Elasticsearch) to handle both semantic and exact-match queries. Second, a cross-encoder reranker (a model fine-tuned on 8,400 labeled relevance pairs from the platform's domain) reordered the top 40 candidates. Third, a context assembly stage formatted the top 6 results with source metadata, confidence scores, and structured section boundaries.
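The hybrid stage needs a way to merge the dense and BM25 result lists into one candidate set for the reranker. The write-up does not name the fusion method, so this sketch uses reciprocal rank fusion (RRF), a common choice for exactly this situation: each document scores 1/(k + rank) in every list it appears in, and the scores are summed.

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60,
             top_n: int = 40) -> list[str]:
    # Reciprocal rank fusion: documents ranked highly in either the dense
    # or the sparse list rise to the top of the fused list, and documents
    # present in both lists get a boost from the summed scores.
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; these top candidates feed the reranker.
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

The `top_n=40` default mirrors the 40 candidates handed to the cross-encoder; `k=60` is the conventional RRF damping constant.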
I deployed the embedding service as a horizontally scalable Kubernetes deployment with autoscaling based on Kafka consumer lag. The Qdrant cluster ran on 3 nodes with replication factor 2. The entire infrastructure was observable via Prometheus metrics and Grafana dashboards tracking embedding throughput, retrieval latency percentiles, reranker accuracy, and end-to-end query quality scores.
03
What were the measurable outcomes?
Answer Accuracy: 91% (up from 67%)
Per-Query Cost: $0.03 (down from $0.12)
Index Sync Latency: 2.3s (down from 4-18 hours)
Monthly Queries Served: 2.3M
Permission Leakage Incidents: 0 (down from 3/month)
User Trust Rating: 78% (up from 34%)
The cost reduction came from 3 sources. First, switching from OpenAI’s text-embedding-ada-002 ($0.0001/1K tokens) to a self-hosted BGE-large model eliminated per-query embedding costs entirely. The BGE model ran on 2 A10 GPUs at a fixed monthly cost of $1,200, processing an average of 89,000 embedding requests per day. Second, the reranker reduced the number of tokens sent to the generation model by 41% by selecting more relevant context, which translated directly to lower inference costs. Third, I replaced GPT-4 with Claude 3.5 Sonnet for the generation step after evaluation showed equivalent quality at 60% lower per-token cost for this specific task.
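A rough consistency check on those numbers, assuming the two generation-side savings compound multiplicatively and ignoring the (eliminated) per-query embedding cost:

```python
# Back-of-the-envelope check using only figures from the text: the reranker
# cut context tokens by 41%, and the model swap cut per-token cost by 60%.
old_cost_per_query = 0.12
token_reduction = 0.41     # fewer context tokens after reranking
per_token_saving = 0.60    # Claude 3.5 Sonnet vs GPT-4 for this task

predicted = old_cost_per_query * (1 - token_reduction) * (1 - per_token_saving)
# 0.12 * 0.59 * 0.40 ≈ $0.028 per query, in line with the reported $0.03.
```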
The accuracy improvement came primarily from 3 factors: real-time synchronization (eliminating stale data), the cross-encoder reranker (improving retrieval precision from 0.52 to 0.84), and structured context assembly (reducing model confusion from poorly formatted inputs).
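The precision figures above are presumably precision@k over the labeled relevance pairs; a minimal version of that metric, for reference (the exact evaluation protocol is not described in the text):

```python
def precision_at_k(retrieved: list[str], relevant: set[str],
                   k: int) -> float:
    # Fraction of the top-k retrieved documents that are labeled relevant.
    top = retrieved[:k]
    return sum(1 for doc_id in top if doc_id in relevant) / k
```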
04
What would I change in hindsight?
I underinvested in the evaluation infrastructure during the first 6 weeks. I built 340 test cases before launch, but the majority were synthetic. I should have invested the time to curate 200+ real user queries with human-judged relevance labels from day one. The synthetic test cases caught obvious failures but missed subtle domain-specific accuracy issues that only surfaced after production deployment. It took 3 weeks of post-launch iteration to close the gap.
I also should have implemented A/B testing infrastructure from the start. When I wanted to compare the BM25+dense hybrid approach against dense-only retrieval, I had to run offline evaluations because there was no mechanism for live traffic splitting. Building the A/B framework took 2 weeks that could have been saved with upfront planning.
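The core of such a live-traffic split is small: hash each user ID into a bucket so every user deterministically and stably sees one retrieval variant. A sketch, with the salt and variant names as illustrative placeholders:

```python
import hashlib

def assign_variant(user_id: str, variants: list[str],
                   salt: str = "retrieval-ab-1") -> str:
    # Deterministic bucketing: the same user always lands in the same
    # variant for a given salt, so per-user metrics stay comparable.
    # Changing the salt reshuffles users into fresh buckets.
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % len(variants)
    return variants[bucket]
```

With this in place, comparing dense-only against hybrid retrieval becomes a matter of logging the assigned variant alongside each query's quality score.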
The broader lesson is that RAG systems are not chatbot features. They are data infrastructure. They touch indexing, permissions, real-time synchronization, search, and evaluation. Treating them as a feature leads to the same problems as treating your database as a feature: it works until scale and complexity reveal every shortcut you took. The organizations that will build reliable AI applications are the ones that recognize retrieval as a platform capability and invest in it accordingly.