01 — Problem
Batch Analytics Couldn’t Keep Pace with the System
The workforce education programs I managed generated enrollment, attendance, and completion data across 4 separate platforms. Reports ran nightly — which meant every dashboard was at least 12 hours stale. When enrollment for a new cohort spiked unexpectedly, I wouldn’t see it until the next morning’s batch job completed. By then, the registration window had passed and support staff had already been overwhelmed without warning.
I needed event-level telemetry that could surface operational anomalies within seconds, not hours. The goal wasn’t a faster batch job. It was a fundamentally different relationship with the data — one where the system told me what was happening, rather than waiting for me to ask.
02 — Architecture
Stream First, Query Second
The pipeline separates ingestion from analysis using a three-tier architecture designed for independent scaling:
Tier 1 — Event Ingestion (Kafka)
Platform webhooks and scheduled API pulls feed into Kafka topics partitioned by event type: enrollments, completions, attendance, and system errors. Each event carries a timestamp, source system identifier, and a schema version tag. Kafka handles backpressure natively — if the downstream consumer falls behind, events queue rather than drop.
Tier 2 — Stream Processing (Go Microservices)
Lightweight Go consumers read from Kafka topics and perform transformations: deduplication, field normalization, and threshold alerting. I chose Go over Python for this layer because of its goroutine model — each partition gets its own consumer goroutine with minimal memory overhead. A single 2-core instance handles 4 topics concurrently.
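The per-partition consumer pattern can be sketched as below. Channels stand in for Kafka partitions so the example is self-contained; a real consumer would read from a Kafka client library's per-partition message stream, and the dedup map would need a bounded window rather than growing forever:

```go
package main

import (
	"fmt"
	"sync"
)

// consume drains one partition's events, dropping duplicate event IDs and
// emitting an alert once the unique-event count crosses a threshold.
func consume(partition <-chan string, threshold int, alerts chan<- string) {
	seen := make(map[string]bool) // dedup state; unbounded here for brevity
	count := 0
	for id := range partition {
		if seen[id] {
			continue // duplicate delivery, skip
		}
		seen[id] = true
		count++
		if count == threshold {
			alerts <- fmt.Sprintf("threshold %d reached", threshold)
		}
	}
}

func main() {
	partitions := make([]chan string, 4) // one channel per topic, as in the pipeline
	alerts := make(chan string, 4)
	var wg sync.WaitGroup
	for i := range partitions {
		partitions[i] = make(chan string, 8)
		wg.Add(1)
		go func(p <-chan string) { // one lightweight goroutine per partition
			defer wg.Done()
			consume(p, 3, alerts)
		}(partitions[i])
	}
	// Feed one partition duplicate-heavy traffic.
	for _, id := range []string{"a", "a", "b", "c", "c", "d"} {
		partitions[0] <- id
	}
	for _, p := range partitions {
		close(p)
	}
	wg.Wait()
	close(alerts)
	for a := range alerts {
		fmt.Println(a)
	}
}
```

This is the shape that makes a single small instance viable: each goroutine costs kilobytes, so four topics' worth of consumers run comfortably on 2 cores.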
Tier 3 — Analytical Storage (ClickHouse)
Processed events land in ClickHouse, a columnar database optimized for aggregate queries. Dashboard queries that took 8–12 seconds against PostgreSQL complete in under 200ms in ClickHouse. The tradeoff is that ClickHouse isn’t designed for transactional updates — it’s append-only by design, which fits an event stream perfectly.
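A minimal sketch of what the ClickHouse side might look like. The table name, columns, and partitioning scheme are my assumptions for illustration, not the actual schema:

```sql
-- Hypothetical events table; MergeTree is ClickHouse's standard
-- append-only engine for this kind of workload.
CREATE TABLE events (
    event_time     DateTime,
    event_type     LowCardinality(String),  -- enrollment, completion, ...
    source_system  LowCardinality(String),
    program        String,
    schema_version UInt8
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(event_time)
ORDER BY (event_type, event_time);

-- The aggregation pattern described below: counts grouped by day and program.
SELECT toDate(event_time) AS day, program, count() AS enrollments
FROM events
WHERE event_type = 'enrollment'
GROUP BY day, program
ORDER BY day;
```

The ORDER BY clause in the table definition is what does the heavy lifting: it co-locates rows by event type and time on disk, so the dashboard queries scan only the columns and ranges they touch.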
Key Design Decisions
Why Kafka instead of a simpler queue like Redis Streams? Durability and replay. Kafka retains events for a configurable period (I use 7 days). When I deploy a new consumer with different transformation logic, I can replay the last week of events to backfill the new schema. Redis Streams can do this in theory, but Kafka’s consumer group semantics make it reliable in practice.
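Operationally, a replay looks roughly like this with Kafka's stock tooling: stop the consumer group, rewind its offsets, and restart. The group and topic names here are placeholders:

```shell
# Rewind a (stopped) consumer group to the start of retained events,
# so a redeployed consumer reprocesses the full 7-day window.
kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --group enrollment-consumer \
  --topic enrollments \
  --reset-offsets --to-earliest \
  --execute
```

Running with --dry-run instead of --execute previews the new offsets without committing them, which is worth doing before any backfill.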
Why ClickHouse instead of TimescaleDB? I tested both. TimescaleDB’s compression was good, but ClickHouse’s vectorized query engine was 3–4x faster on my aggregation patterns (COUNT/SUM grouped by day and program). For a read-heavy analytics workload, raw query speed won.
03 — Outcomes
Measured Results
Event-to-Dashboard Latency: under 800ms, from webhook receipt to visible in a ClickHouse query
Source Systems Unified: 4 (LMS, CRM, scheduling, and registration platforms)
Dashboard Query Time: ~200ms, down from 8–12 seconds on the previous PostgreSQL setup
Event Replay Window: 7 days, enabling schema migrations without data loss
04 — Reflection
Latency Is a Decision About Trust
The technical challenge was straightforward — Kafka, Go, and ClickHouse are all well-documented and well-tested. The harder problem was organizational. Stakeholders had built their workflows around stale data. When the dashboard started showing real-time numbers, it surfaced operational gaps that batch reporting had been quietly hiding. A program with declining attendance looked stable in a daily report but alarming in a live feed.
What I’d change: the Go microservices are probably over-engineered for my actual throughput. A Python consumer with asyncio would have been adequate for the event volume I process. I chose Go partly to learn it, which is a legitimate reason — but I should be honest that it added complexity without proportional benefit at this scale.
“Real-time data doesn’t just change how fast you see problems. It changes whether you see them at all.”
Outcomes
Sub-800ms event-to-dashboard latency; 4 source systems unified; Dashboard queries reduced from 8–12s to 200ms; 7-day event replay window for schema migrations