The SEC §127 Education Benefits Intelligence Suite started with a simple question: which public companies offer tuition assistance? Answering it required processing 36,791 filings across 18,943 companies.
The False Positive Problem
The initial pipeline looked impressive on volume. It found thousands of “matches.” But when I audited the results, 58.2% were false positives — incidental keyword matches that had nothing to do with actual education benefit programs.
This is the trap of NLP on legal/financial text. The word “educational” appears in SEC filings for dozens of reasons that have nothing to do with tuition assistance. Building scoring systems that distinguish genuine programs from incidental mentions required domain expertise, not just pattern matching. The same SEC EDGAR data feeds into KalmSkills, where employer signals enrich career intelligence, and the validation challenges there are identical.
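To make the idea of scoring concrete, here is a minimal sketch of a context-aware scorer. It is not the pipeline's actual scoring system: the signal lists, weights, threshold, and the `score_passage` name are all hypothetical, chosen only to show how strong phrases, weak keywords, and negative context might be combined.

```python
import re
from dataclasses import dataclass

# Hypothetical signal lists and weights (assumptions for illustration only).
STRONG_SIGNALS = [
    r"tuition (?:assistance|reimbursement)",
    r"educational assistance program",
    r"section 127",
]
WEAK_SIGNALS = [r"\beducational\b", r"\beducation\b"]
NEGATIVE_CONTEXT = [
    r"educational (?:institution|services|products|materials)",
    r"investor education",
]


@dataclass
class Score:
    value: int
    is_candidate: bool


def score_passage(text: str, threshold: int = 3) -> Score:
    """Score a passage: strong phrases add 3, weak keywords add 1,
    negative context subtracts 2. Each pattern counts at most once."""
    lowered = text.lower()
    value = 0
    value += 3 * sum(bool(re.search(p, lowered)) for p in STRONG_SIGNALS)
    value += 1 * sum(bool(re.search(p, lowered)) for p in WEAK_SIGNALS)
    value -= 2 * sum(bool(re.search(p, lowered)) for p in NEGATIVE_CONTEXT)
    return Score(value=value, is_candidate=value >= threshold)


genuine = ("The Company offers a tuition reimbursement program to "
           "eligible employees under Section 127.")
incidental = "We sell educational materials to school districts."

print(score_passage(genuine))     # Score(value=6, is_candidate=True)
print(score_passage(incidental))  # Score(value=-1, is_candidate=False)
```

A bare keyword match would flag both passages. The negative-context list is where domain knowledge enters, and in practice it would need to be built from audited false positives rather than guessed.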
Validation as the Real Product
After the methodology overhaul, we went from 19,070 candidates to 1,852 verified companies — a 90.3% reduction. That sounds like a failure if you measure by volume. It’s a success if you measure by precision. The same validation-first mindset applies to event-driven systems — I explore the schema side of this problem in Building Event-Driven Data Pipelines.
In data intelligence, the pipeline that finds fewer, better results is almost always more valuable than the one that finds everything.
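The figures in this section can be checked with a few lines of arithmetic. The helper name `reduction_pct` is my own, and the precision line assumes the 58.2% false-positive rate from the audit applies to all of the initial matches.

```python
def reduction_pct(before: int, after: int) -> float:
    """Percentage drop from `before` to `after`."""
    return 100 * (before - after) / before


candidates, verified = 19_070, 1_852

print(f"{reduction_pct(candidates, verified):.1f}% reduction")       # 90.3% reduction
print(f"{100 * verified / candidates:.1f}% of candidates survived")  # 9.7% of candidates survived

# Precision implied by the audit, assuming 58.2% of matches were false positives.
false_positive_rate = 58.2
print(f"{100 - false_positive_rate:.1f}% audited precision")         # 41.8% audited precision
```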
Geographic Enrichment
Adding location data from SEC CIK address lookups was supposed to be a quick add-on. It became its own pipeline, ultimately achieving 90.1% geographic coverage across 17,191 companies. The state-level market concentration analysis this enabled was worth the effort.
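As a rough illustration of the two numbers this step produces, coverage and state-level concentration, here is a small sketch. It assumes the address lookup has already been reduced to a mapping from a company identifier to a state code (or `None` when no address was found). The function name, the input shape, and the sample records are hypothetical, not the pipeline's actual data model.

```python
from collections import Counter
from typing import Optional


def coverage_and_concentration(
    states_by_company: dict[str, Optional[str]],
) -> tuple[float, dict[str, float]]:
    """Return (coverage %, {state: share % of located companies})."""
    if not states_by_company:
        return 0.0, {}
    located = [s for s in states_by_company.values() if s]
    coverage = 100 * len(located) / len(states_by_company)
    counts = Counter(located)
    # most_common() orders states from largest to smallest share.
    shares = {state: 100 * n / len(located) for state, n in counts.most_common()}
    return coverage, shares


# Placeholder records, not real filers.
sample = {
    "company-a": "CA",
    "company-b": "CA",
    "company-c": "NY",
    "company-d": None,  # lookup returned no usable address
    "company-e": "TX",
}

coverage, shares = coverage_and_concentration(sample)
print(f"{coverage:.1f}% coverage")  # 80.0% coverage
print(shares)                       # {'CA': 50.0, 'NY': 25.0, 'TX': 25.0}
```

Shares are computed over located companies only, so missing addresses lower the coverage figure without distorting the concentration figures.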
The broader lesson — that the most valuable engineering work is often the most unglamorous — is something I return to in The Case for Boring Technology.
Related
Projects: SEC §127 Education Benefits Intelligence Suite · KalmSkills · Schema Evolution Engine · Real-Time Analytics Pipeline
Writing: Building Event-Driven Data Pipelines · The Case for Boring Technology