01 — Problem
Batch Analytics Couldn’t Keep Pace with the System
The workforce education programs I managed generated enrollment, attendance, and completion data across 4 separate platforms. Reports ran nightly — which meant every dashboard was at least 12 hours stale. When enrollment for a new cohort spiked unexpectedly, I wouldn’t see it until the next morning’s batch job completed. By then, the registration window had passed and support staff had already been overwhelmed without warning.
I needed event-level telemetry that could surface operational anomalies within seconds, not hours. The goal wasn’t a faster batch job. It was a fundamentally different relationship with the data — one where the system told me what was happening, rather than waiting for me to ask.
02 — Architecture
Stream First, Query Second
The pipeline separates ingestion from analysis using a three-tier architecture designed for independent scaling:
Tier 1 — Event Ingestion (Kafka)
Platform webhooks and scheduled API pulls feed into Kafka topics partitioned by event type: enrollments, completions, attendance, and system errors. Each event carries a timestamp, source system identifier, and a schema version tag. Kafka handles backpressure natively — if the downstream consumer falls behind, events queue rather than drop.
Tier 2 — Stream Processing (Go Microservices)
Lightweight Go consumers read from Kafka topics and perform transformations: deduplication, field normalization, and threshold alerting. I chose Go over Python for this layer because of its goroutine model — each partition gets its own consumer goroutine with minimal memory overhead. A single 2-core instance handles 4 topics concurrently.
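The per-partition consumer pattern can be sketched as below. Channels stand in for Kafka partitions so the example is self-contained; a real consumer would read from a Kafka client library's per-partition message stream, and the dedup map would need a bounded window rather than growing forever:

```go
package main

import (
	"fmt"
	"sync"
)

// consume drains one partition's events, dropping duplicate event IDs and
// emitting an alert once the unique-event count crosses a threshold.
func consume(partition <-chan string, threshold int, alerts chan<- string) {
	seen := make(map[string]bool) // dedup state; unbounded here for brevity
	count := 0
	for id := range partition {
		if seen[id] {
			continue // duplicate delivery, skip
		}
		seen[id] = true
		count++
		if count == threshold {
			alerts <- fmt.Sprintf("threshold %d reached", threshold)
		}
	}
}

func main() {
	partitions := make([]chan string, 4) // one channel per topic, as in the pipeline
	alerts := make(chan string, 4)
	var wg sync.WaitGroup
	for i := range partitions {
		partitions[i] = make(chan string, 8)
		wg.Add(1)
		go func(p <-chan string) { // one lightweight goroutine per partition
			defer wg.Done()
			consume(p, 3, alerts)
		}(partitions[i])
	}
	// Feed one partition duplicate-heavy traffic.
	for _, id := range []string{"a", "a", "b", "c", "c", "d"} {
		partitions[0] <- id
	}
	for _, p := range partitions {
		close(p)
	}
	wg.Wait()
	close(alerts)
	for a := range alerts {
		fmt.Println(a)
	}
}
```

This is the shape that makes a single small instance viable: each goroutine costs kilobytes, so four topics' worth of consumers run comfortably on 2 cores.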
Tier 3 — Analytical Storage (ClickHouse)
Processed events land in ClickHouse, a columnar database optimized for aggregate queries. Dashboard queries that took 8–12 seconds against PostgreSQL complete in under 200ms in ClickHouse. The tradeoff is that ClickHouse isn’t designed for transactional updates — it’s append-only by design, which fits an event stream perfectly.
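A minimal sketch of what the ClickHouse side might look like. The table name, columns, and partitioning scheme are my assumptions for illustration, not the actual schema:

```sql
-- Hypothetical events table; MergeTree is ClickHouse's standard
-- append-only engine for this kind of workload.
CREATE TABLE events (
    event_time     DateTime,
    event_type     LowCardinality(String),  -- enrollment, completion, ...
    source_system  LowCardinality(String),
    program        String,
    schema_version UInt8
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(event_time)
ORDER BY (event_type, event_time);

-- The aggregation pattern described below: counts grouped by day and program.
SELECT toDate(event_time) AS day, program, count() AS enrollments
FROM events
WHERE event_type = 'enrollment'
GROUP BY day, program
ORDER BY day;
```

The ORDER BY clause in the table definition is what does the heavy lifting: it co-locates rows by event type and time on disk, so the dashboard queries scan only the columns and ranges they touch.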
Key Design Decisions
Why Kafka instead of a simpler queue like Redis Streams? Durability and replay. Kafka retains events for a configurable period (I use 7 days). When I deploy a new consumer with different transformation logic, I can replay the last week of events to backfill the new schema. Redis Streams can do this in theory, but Kafka’s consumer group semantics make it reliable in practice.
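Operationally, a replay looks roughly like this with Kafka's stock tooling: stop the consumer group, rewind its offsets, and restart. The group and topic names here are placeholders:

```shell
# Rewind a (stopped) consumer group to the start of retained events,
# so a redeployed consumer reprocesses the full 7-day window.
kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --group enrollment-consumer \
  --topic enrollments \
  --reset-offsets --to-earliest \
  --execute
```

Running with --dry-run instead of --execute previews the new offsets without committing them, which is worth doing before any backfill.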
Why ClickHouse instead of TimescaleDB? I tested both. TimescaleDB’s compression was good, but ClickHouse’s vectorized query engine was 3–4x faster on my aggregation patterns (COUNT/SUM grouped by day and program). For a read-heavy analytics workload, raw query speed won.
03 — Outcomes
Measured Results
Event-to-Dashboard Latency: under 800ms, from webhook receipt to visible in a ClickHouse query
Source Systems Unified: 4 (LMS, CRM, scheduling, and registration platforms)
Dashboard Query Time: ~200ms, down from 8–12 seconds on the previous PostgreSQL setup
Event Replay Window: 7 days, enabling schema migrations without data loss
04 — Reflection
Latency Is a Decision About Trust
The technical challenge was straightforward — Kafka, Go, and ClickHouse are all well-documented and well-tested. The harder problem was organizational. Stakeholders had built their workflows around stale data. When the dashboard started showing real-time numbers, it surfaced operational gaps that batch reporting had been quietly hiding. A program with declining attendance looked stable in a daily report but alarming in a live feed.
What I’d change: the Go microservices are probably over-engineered for my actual throughput. A Python consumer with asyncio would have been adequate for the event volume I process. I chose Go partly to learn it, which is a legitimate reason — but I should be honest that it added complexity without proportional benefit at this scale.
“Real-time data doesn’t just change how fast you see problems. It changes whether you see them at all.”
Outcomes
Sub-800ms event-to-dashboard latency; 4 source systems unified; Dashboard queries reduced from 8–12s to 200ms; 7-day event replay window for schema migrations