01 — Problem
The Documents Knew More Than the Search Engine
I had accumulated over 2,000 documents across my workforce education work — program proposals, compliance reports, enrollment analyses, vendor contracts, meeting transcripts. Every week, I’d need to find a specific policy detail or reference a decision made months ago. Full-text search returned too many results, most irrelevant. Keyword matching couldn’t understand that “tuition reimbursement policy” and “employer education benefit guidelines” referred to the same concept.
I needed a retrieval system that understood the semantic relationships between documents — one that could answer “what did we decide about the enrollment cap for the fall cohort?” by finding the relevant paragraph across 47 meeting transcripts, not by matching the word “enrollment” in 300 files.
02 — Architecture
Chunk, Embed, Retrieve, Verify
The processor follows a four-stage pipeline that transforms raw documents into a queryable semantic index:
Stage 1 — Intelligent Chunking
Documents are split using a recursive strategy that respects structural boundaries — section headings, paragraph breaks, and lists. A naive fixed-size chunker would split a table in half or separate a conclusion from its preceding argument. My chunker uses heading detection to keep semantically coherent units together, with a target size of 512 tokens and a 64-token overlap between adjacent chunks.
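The chunking stage can be sketched roughly as follows. This is a minimal illustration, not the production code: the function names are mine, headings are assumed to be Markdown-style, and a whitespace word count stands in for a real tokenizer.

```python
# Sketch of structure-aware chunking: split at headings first so no chunk
# crosses a section boundary, then pack paragraphs greedily toward the
# 512-token target with a 64-token overlap carried between adjacent chunks.
import re

TARGET_TOKENS = 512   # target chunk size from the text above
OVERLAP_TOKENS = 64   # overlap shared between adjacent chunks

def rough_token_count(text: str) -> int:
    # Whitespace split as a cheap token proxy; a real tokenizer is more accurate.
    return len(text.split())

def chunk_document(text: str) -> list[dict]:
    chunks = []
    # Zero-width split before each Markdown heading line keeps headings
    # attached to the section they introduce.
    sections = re.split(r"(?m)^(?=#{1,6}\s)", text)
    for section in sections:
        if not section.strip():
            continue
        lines = section.splitlines()
        heading = lines[0].strip() if lines and lines[0].startswith("#") else ""
        paragraphs = [p.strip() for p in section.split("\n\n") if p.strip()]
        current: list[str] = []
        size = 0
        for para in paragraphs:
            para_size = rough_token_count(para)
            if current and size + para_size > TARGET_TOKENS:
                chunks.append({"heading": heading, "text": "\n\n".join(current)})
                # Carry the tail of the finished chunk forward as overlap.
                tail = "\n\n".join(current).split()[-OVERLAP_TOKENS:]
                current = [" ".join(tail)]
                size = len(tail)
            current.append(para)
            size += para_size
        if current:
            chunks.append({"heading": heading, "text": "\n\n".join(current)})
    return chunks
```

Because the split happens at heading boundaries before any size-based packing, a table or argument under one heading is never divided between two sections — only between adjacent, overlapping chunks of the same section.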
Stage 2 — Embedding Generation
Each chunk is embedded using a sentence-transformer model (all-MiniLM-L6-v2 for speed, with an option to swap in a larger model for accuracy-critical collections). Embeddings are stored in Pinecone with metadata: source document, section heading, chunk position, and creation date. The metadata enables filtered queries — “find relevant chunks, but only from documents created after January 2025.”
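A sketch of how a chunk becomes a Pinecone record with that metadata. The metadata keys and record-ID scheme here are my illustration, not the author's exact schema; the sentence-transformers and Pinecone calls in the trailing comment follow each library's documented API shape.

```python
# Build the per-chunk metadata that enables filtered queries, e.g.
# "only documents created after January 2025".
from datetime import date

def build_records(chunks: list[dict], source_doc: str, created=None) -> list[dict]:
    """Attach retrieval metadata to each chunk before embedding/upsert."""
    created_iso = (created or date.today()).isoformat()
    records = []
    for position, chunk in enumerate(chunks):
        records.append({
            "id": f"{source_doc}::{position}",      # hypothetical ID scheme
            "metadata": {
                "source": source_doc,
                "heading": chunk.get("heading", ""),
                "position": position,
                "created": created_iso,             # supports date-filtered queries
            },
            "text": chunk["text"],
        })
    return records

# Embedding + upsert (requires `sentence-transformers` and a Pinecone index;
# shown for shape only):
#
# from sentence_transformers import SentenceTransformer
# model = SentenceTransformer("all-MiniLM-L6-v2")
# vectors = model.encode([r["text"] for r in records])
# index.upsert([(r["id"], v.tolist(), r["metadata"])
#               for r, v in zip(records, vectors)])
```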
Stage 3 — Hybrid Retrieval
Queries use a hybrid strategy: semantic similarity (cosine distance on embeddings) combined with keyword matching (BM25 on the raw text). The two scores are fused using reciprocal rank fusion. This handles both conceptual queries (“what’s our approach to credential stacking?”) and specific lookups (“Section 127 tax benefit amount”) — neither method alone covers both.
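The fusion step itself is small. A minimal reciprocal rank fusion sketch, with illustrative retriever outputs (the `k=60` constant is the value from the original RRF paper, not necessarily what the author used):

```python
# Reciprocal rank fusion: each retriever contributes 1 / (k + rank) per
# chunk, and the per-chunk contributions are summed across retrievers.
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of chunk ids into one ranking."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Example: the two retrievers disagree on order; fusion rewards the chunk
# that both rank highly.
semantic = ["c3", "c1", "c7"]   # cosine-similarity order
bm25 = ["c1", "c9", "c3"]       # keyword-match order
fused = reciprocal_rank_fusion([semantic, bm25])
```

Because RRF operates on ranks rather than raw scores, it needs no calibration between the cosine-similarity scale and the BM25 scale — a common failure mode when fusing heterogeneous retrievers by weighted score sums.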
Stage 4 — LLM Verification
Retrieved chunks are passed to an LLM with the original query for answer synthesis. Critically, the prompt requires the model to cite which chunks support each claim. If the model can’t ground an answer in the retrieved chunks, it says so rather than hallucinating. This isn’t foolproof, but it reduces confabulation significantly compared to unconstrained generation.
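The prompt construction might look like the sketch below. The wording is mine; the constraints it encodes are the two from the text — every claim must cite a supporting chunk, and the model must refuse rather than answer without grounding.

```python
# Illustrative grounded-answer prompt for Stage 4. Chunk dicts are assumed
# to carry "heading" and "text" keys from the earlier stages.
def build_verification_prompt(query: str, chunks: list[dict]) -> str:
    context = "\n\n".join(
        f"[{i}] ({c['heading']}) {c['text']}" for i, c in enumerate(chunks, 1)
    )
    return (
        "Answer the question using ONLY the numbered chunks below.\n"
        "After each claim, cite the supporting chunk number, e.g. [2].\n"
        "If the chunks do not contain the answer, reply exactly: "
        "'insufficient context'.\n\n"
        f"Chunks:\n{context}\n\nQuestion: {query}"
    )
```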
Key Design Decisions
Why hybrid retrieval instead of pure semantic search? Pure semantic search fails on exact-match queries. If someone asks for “Form W-2 reporting requirements,” semantic similarity might return chunks about general tax compliance — related but not precise. BM25 keyword matching catches the exact term. Fusing both consistently outperformed either alone in my test set of 200 queries.
Why Pinecone instead of a self-hosted vector database? I evaluated Chroma, Weaviate, and Pinecone. For a collection under 500K vectors, Pinecone’s managed service eliminated operational overhead (backups, index tuning, resource scaling) for roughly the same query performance. The tradeoff is vendor lock-in, which I mitigate by keeping the embedding generation separate from the storage layer.
03 — Outcomes
Measured Results
Documents Indexed: 2,147 across program proposals, transcripts, reports, and contracts
Retrieval Relevance: 89% — top-5 chunks contain the answer on the manual evaluation set
Query-to-Answer Time: 1.2s average, including embedding, retrieval, and LLM synthesis
Hybrid Lift: 14% relevance improvement of hybrid retrieval over semantic-only
04 — Reflection
The Hardest Part Is Chunking, Not Embedding
Everyone in the RAG space obsesses over embedding models and vector databases. In my experience, the chunking strategy has a larger impact on retrieval quality than any other component. A perfect embedding of a poorly chunked document still retrieves garbage. I spent more time tuning chunk boundaries than I did on any other stage — and that time paid the highest return.
What I’d change: the LLM verification step should include a confidence score. Right now it either produces an answer or says “insufficient context.” A numeric confidence would let me set thresholds — high-confidence answers get returned directly, low-confidence answers get flagged for human review. This would make the system more useful in time-sensitive workflows.
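A sketch of the proposed routing, assuming the verification step could emit a numeric confidence. The threshold value and field names are illustrative, not part of the current system:

```python
# Proposed change: route by confidence instead of the current binary
# answer / "insufficient context" output. Threshold is illustrative.
HIGH_CONFIDENCE = 0.8

def route_answer(answer: str, confidence: float) -> dict:
    """Return confident answers directly; flag the rest for human review."""
    if confidence >= HIGH_CONFIDENCE:
        return {"status": "answered", "answer": answer}
    return {"status": "needs_review", "answer": answer}
```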
“Search engines find documents. Retrieval systems find answers. The difference is in how much structure you impose before the query arrives.”
Outcomes
2,147 documents indexed; 89% retrieval relevance on evaluation set; 1.2s average query-to-answer time; 14% relevance lift from hybrid retrieval