Data Systems AA-010

SEC §127 Education Benefits Intelligence Suite

End-to-end intelligence suite that scans 36,791 SEC EDGAR filings to identify companies offering §127 education benefits. After manual validation exposed a 58.2% false positive rate, XBRL validation cut the candidate list by 90.3% (19,070 candidates down to 1,852 verified companies). The suite also provides geographic enrichment at 90.1% coverage and per-company deal value computation across the public company universe.

01 — Problem

An Invisible Market Worth Billions, Buried in Filings

IRS Section 127 allows employers to provide up to $5,250 per employee per year in tax-free educational assistance. For workforce education providers, companies offering these benefits represent high-value prospects — employers who have already committed budget to talent development. The problem: no systematic way existed to identify which public companies offered §127 programs. The information was scattered across thousands of SEC filings in inconsistent formats — some mentioned “tuition reimbursement” in employee benefit disclosures, others referenced “educational assistance” in footnotes, and many used language so generic that keyword matching produced a 58% false positive rate.

I needed a pipeline that could scan the entire public company universe, identify genuine §127 programs with high precision, and size the revenue opportunity per company — not a one-time analysis, but a repeatable intelligence system.

02 — Architecture

Five Pipelines, One Intelligence Layer

The suite operates as five interconnected pipelines, each addressing a distinct phase of the intelligence cycle:

Pipeline 1 — SEC EDGAR Bulk Scanner

Downloads the full EDGAR submissions archive and scans 36,791 filing directories across 18,943 CIK-registered companies. Searches 10-K and 10-Q documents for §127, $5,250, tuition reimbursement, and educational assistance keywords. Implements polite HTTP access with retry/backoff per SEC fair-access guidelines. Output: a candidate list of companies with potential education benefit programs.
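The scanner's two core behaviors, keyword detection and retry with backoff, can be sketched as follows. This is a minimal illustration rather than the production code: the regex patterns, the function names, and the `fetch` callable are assumptions, and the network call is injected so the logic can be exercised offline.

```python
import re
import time
from typing import Callable

# Patterns for the four keyword families named above.
# The exact regexes are illustrative assumptions.
KEYWORD_PATTERNS = {
    "section_127": re.compile(r"(?:§|\bsection)\s*127\b", re.IGNORECASE),
    "irs_limit": re.compile(r"\$5,250\b"),
    "tuition_reimbursement": re.compile(r"\btuition\s+reimbursement\b", re.IGNORECASE),
    "educational_assistance": re.compile(r"\beducational\s+assistance\b", re.IGNORECASE),
}


def find_keywords(text: str) -> set[str]:
    """Return the names of every keyword family that appears in the text."""
    return {name for name, pattern in KEYWORD_PATTERNS.items() if pattern.search(text)}


def fetch_with_backoff(
    fetch: Callable[[str], str],
    url: str,
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call fetch(url), retrying on OSError with exponentially growing delays."""
    for attempt in range(max_attempts):
        try:
            return fetch(url)
        except OSError:
            if attempt == max_attempts - 1:
                raise  # out of retries: surface the last error
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
    raise ValueError("max_attempts must be at least 1")
```

A hit from `find_keywords` only marks a filing as a candidate. As the later sections show, keyword hits alone were far too permissive to count as evidence of a §127 program.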

Pipeline 2 — Web Enrichment Crawler

Crawls employer careers and benefits pages, ATS job boards (Workday, Greenhouse, Lever), and vendor/partner pages (Guild, EdAssist, InStride) to detect tuition reimbursement evidence beyond SEC filings. Each domain is probed through 40+ strategic seed URL paths, with robots.txt compliance and rate limiting. This surface captures companies that offer §127 benefits but don’t disclose them in financial filings.
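As a rough sketch of the robots.txt check, the standard library's `urllib.robotparser` can filter seed URLs before any request is made. The seed paths, user agent string, and function name below are hypothetical placeholders, not the crawler's actual configuration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical subset of seed paths; the real crawler uses 40+ per domain.
SEED_PATHS = ["/careers", "/careers/benefits", "/benefits", "/jobs"]


def allowed_seed_urls(domain: str, robots_lines: list[str],
                      user_agent: str = "benefits-research-bot") -> list[str]:
    """Return the seed URLs for a domain that its robots.txt permits."""
    parser = RobotFileParser()
    parser.parse(robots_lines)  # lines of an already-downloaded robots.txt
    urls = [f"https://{domain}{path}" for path in SEED_PATHS]
    return [url for url in urls if parser.can_fetch(user_agent, url)]


robots = ["User-agent: *", "Disallow: /jobs"]
print(allowed_seed_urls("example.com", robots))
```

Passing the robots.txt lines in directly, rather than fetching them inside the function, keeps the filter testable without network access.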

Pipeline 3 — XBRL Validation System

Analyzes SEC XBRL financial filings with education-specific detection rules. This was the critical reliability upgrade: after discovering the initial pipeline’s 58.2% false positive rate, I rebuilt the scoring system with conservative confidence thresholds and multi-criteria validation. The refined system reduced 19,070 initial candidates to 1,852 verified companies, cutting the candidate list by 90.3% (the figure reported as false positive reduction).
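The 90.3% figure follows directly from the two candidate counts. A quick arithmetic check (variable names are illustrative):

```python
initial_candidates = 19_070
verified_companies = 1_852

# Candidates removed by the rebuilt validation step.
removed = initial_candidates - verified_companies

# Share of the original candidate list that was removed, in percent.
reduction_pct = round(100 * removed / initial_candidates, 1)

print(removed, reduction_pct)
```

Note that this measures how much the candidate list shrank. It is a different quantity from the 58.2% false positive rate, which was estimated from a manually validated sample.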

Pipeline 4 — Geographic Intelligence

Enriches validated companies with location data from SEC CIK address lookups, achieving 90.1% geographic coverage across 17,191 companies. Includes international company filtering and state-level market concentration analysis for territory planning.
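A minimal sketch of the coverage and concentration calculations, assuming each company is a dictionary with hypothetical `state` and `country` fields filled in from the address lookup (the record shape and CIK values are invented for illustration):

```python
from collections import Counter


def geographic_summary(companies: list[dict]) -> tuple[float, list[tuple[str, int]]]:
    """Return (coverage fraction, US states ranked by company count)."""
    # A company counts as covered if the lookup produced any location at all.
    located = [c for c in companies if c.get("state") or c.get("country")]
    coverage = len(located) / len(companies) if companies else 0.0

    # International filtering: keep only US-based companies for territory planning.
    domestic = [c for c in located if c.get("country") == "US"]
    by_state = Counter(c["state"] for c in domestic if c.get("state"))
    return coverage, by_state.most_common()


sample = [
    {"cik": "0000000001", "state": "TX", "country": "US"},
    {"cik": "0000000002", "state": "TX", "country": "US"},
    {"cik": "0000000003", "state": "CA", "country": "US"},
    {"cik": "0000000004", "state": None, "country": "CA"},  # international filer
    {"cik": "0000000005", "state": None, "country": None},  # no address found
]
coverage, ranking = geographic_summary(sample)
print(coverage, ranking)
```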

Pipeline 5 — Deal Calculator

Computes per-company deal values using the $5,250/year IRS maximum, company-size-tiered participation rates (6–12%), and multi-year contract assumptions. Produces individual company annual exposure estimates and aggregate market sizing by geography and industry vertical.
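The core calculation can be sketched as below. Only the $5,250 cap and the 6-12% participation range come from the description above; the tier boundaries, the direction of the tiers (lower participation at larger companies), and the three-year default contract length are placeholder assumptions.

```python
IRS_SECTION_127_MAX = 5_250  # tax-free limit per employee per year, in dollars

# Hypothetical tiers: (minimum employee count, assumed participation rate).
# Ordered from largest to smallest so the first match wins.
PARTICIPATION_TIERS = [
    (10_000, 0.06),
    (1_000, 0.09),
    (0, 0.12),
]


def participation_rate(employees: int) -> float:
    """Pick the participation rate for a company of the given size."""
    if employees < 0:
        raise ValueError("employees must be non-negative")
    for min_employees, rate in PARTICIPATION_TIERS:
        if employees >= min_employees:
            return rate
    raise AssertionError("unreachable: the last tier starts at 0")


def deal_value(employees: int, contract_years: int = 3) -> dict[str, int]:
    """Estimate annual exposure and total contract value for one company."""
    participants = round(employees * participation_rate(employees))
    annual_exposure = participants * IRS_SECTION_127_MAX
    return {
        "participants": participants,
        "annual_exposure": annual_exposure,
        "contract_value": annual_exposure * contract_years,
    }


print(deal_value(2_000))
```

Because every participant is valued at the IRS maximum, the result is an upper-bound exposure estimate rather than a forecast of actual spend.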

Key Design Decisions

Why rebuild the scoring system instead of tuning thresholds? The original pipeline used single-keyword matching with low thresholds to maximize recall. This captured nearly every company that mentioned “education” anywhere in its filings, including companies discussing “education” in unrelated contexts (product education, investor education, regulatory education). Tuning thresholds alone would only have traded false positives for false negatives, because a single keyword cannot distinguish these contexts at any threshold. The rebuild used multi-criteria validation: a company needed §127-specific language AND benefit disclosure context AND employee-facing framing. This achieved high precision without sacrificing meaningful recall.
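The AND-of-three-criteria rule can be illustrated with a simplified, text-only sketch. The actual system works on XBRL filings with confidence thresholds; the regexes and function name here are assumptions chosen only to show why requiring all three signals rejects the unrelated "education" mentions.

```python
import re

# Each pattern stands in for one of the three criteria; all are illustrative.
SECTION_127_LANGUAGE = re.compile(
    r"(?:§|\bsection)\s*127\b|\btuition\s+reimbursement\b|\beducational\s+assistance\b",
    re.IGNORECASE,
)
BENEFIT_CONTEXT = re.compile(r"\bbenefits?\b|\bcompensation\b", re.IGNORECASE)
EMPLOYEE_FRAMING = re.compile(r"\bemployees?\b|\bworkforce\b", re.IGNORECASE)


def passes_validation(passage: str) -> bool:
    """True only when all three criteria appear in the same passage."""
    criteria = (SECTION_127_LANGUAGE, BENEFIT_CONTEXT, EMPLOYEE_FRAMING)
    return all(pattern.search(passage) for pattern in criteria)


print(passes_validation(
    "Employee benefits include tuition reimbursement of up to $5,250 per year."
))
print(passes_validation("Our product education programs benefit customers."))
```

The second passage mentions both "education" and "benefit", so a single-keyword matcher would flag it, but it fails the §127-language and employee-framing checks.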

Why 5 separate pipelines instead of a monolithic system? Each pipeline has different execution frequencies and failure modes. The EDGAR scanner runs quarterly when new filings drop. The web crawler runs monthly to catch careers page updates. The deal calculator runs on-demand for specific prospect lists. Separating them means a crawler failure doesn’t block financial analysis, and each can be independently tested and versioned.
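The isolation property can be sketched with a small runner in which each pipeline is an independent callable and a failure is recorded rather than propagated. The runner, the pipeline names, and the stub functions are hypothetical; in practice each pipeline also runs on its own schedule.

```python
from typing import Callable


def run_pipelines(pipelines: dict[str, Callable[[], object]]) -> dict[str, tuple[str, object]]:
    """Run each pipeline independently, recording failures instead of stopping."""
    results: dict[str, tuple[str, object]] = {}
    for name, run in pipelines.items():
        try:
            results[name] = ("ok", run())
        except Exception as exc:  # one failure must not block the others
            results[name] = ("failed", str(exc))
    return results


def broken_crawler() -> object:
    raise RuntimeError("careers page timed out")


def deal_calculator() -> object:
    return {"companies_priced": 3}


outcome = run_pipelines({
    "web_crawler": broken_crawler,
    "deal_calculator": deal_calculator,
})
print(outcome)
```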

03 — Outcomes

Measured Results

36,791 Filings Processed: across the full SEC EDGAR public company universe

90.3% False Positive Reduction: from 19,070 candidates to 1,852 verified companies

90.1% Geographic Coverage: 17,191 companies enriched with location data

18,943 Companies Analyzed: from initial CIK registration through deal value computation

04 — Reflection

The 58% That Changed Everything

The pivotal moment in this project was not a technical breakthrough. It was the discovery that more than half of my initial results were wrong. The first version of the pipeline looked impressive: 19,000+ companies identified, clean spreadsheets, geographic heat maps. It felt like a success. Then I manually validated 200 samples and found that 58.2% were false positives — companies that mentioned “education” in contexts completely unrelated to employee benefits.

That moment redefined the project’s purpose. It stopped being about volume and became about precision. The entire XBRL validation system was built in response to that failure. In data intelligence, a system that confidently returns wrong answers is worse than one that returns nothing — because bad data drives bad decisions with the authority of computed evidence.

“The initial pipeline produced 19,000 results and felt like a triumph. The validation step revealed that 11,000 of them were wrong. That’s not a pipeline failure — it’s a lesson about the difference between data and intelligence.”
