Infrastructure AA-004

Observability Platform

Unified observability stack combining Prometheus metrics, Loki log aggregation, and Grafana dashboards — replacing SSH-and-tail debugging with correlated telemetry, domain-specific alerting, and 91% faster root cause identification across 4 microservices.

01 — Problem

Failures Were Invisible Until They Were Catastrophic

I was running 4 microservices — an enrollment API, a notification worker, a reporting aggregator, and a credential verifier — across 2 servers. When something broke, I found out because a stakeholder emailed me. There was no centralized logging, no metric correlation, no alerting. Each service wrote its own log format to its own file. Debugging a failed enrollment meant SSH-ing into both servers, tailing 4 different log files, and mentally reconstructing the event timeline. A 5-minute outage took 45 minutes to diagnose.

I needed a single pane of glass where I could see the health of every service, correlate events across them, and get notified before a user did.

02 — Architecture

Two of the Three Pillars: Metrics and Logs

Of the classic three telemetry pillars (metrics, logs, traces), the platform collects the first two, each with a specialized storage and query layer, plus a visualization layer that ties them together:

Metrics — Prometheus

Each microservice exposes a /metrics endpoint in Prometheus exposition format. Prometheus scrapes these endpoints every 15 seconds and stores time-series data for 30 days. I defined custom metrics beyond the defaults: enrollment_processing_duration_seconds, credential_verification_failures_total, and notification_delivery_latency_seconds. These domain-specific metrics surface problems that generic CPU/memory monitoring would miss entirely.
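Instrumentation along these lines can be written with the official Python client, `prometheus_client`. This is a minimal sketch, not the production code: the metric names come from the text above, but the handler body and port are hypothetical.

```python
from prometheus_client import Counter, Histogram, start_http_server

# Domain-specific metrics, named after what can actually break.
enrollment_duration = Histogram(
    "enrollment_processing_duration_seconds",
    "Time spent processing one enrollment request",
)
verification_failures = Counter(
    "credential_verification_failures_total",
    "Credential verifications that did not succeed",
)
notification_latency = Histogram(
    "notification_delivery_latency_seconds",
    "Delay between enqueueing and delivering a notification",
)

@enrollment_duration.time()  # records each call's duration into the histogram
def process_enrollment(request):
    ...  # real enrollment logic goes here

if __name__ == "__main__":
    start_http_server(8000)  # serves /metrics for Prometheus to scrape
```

The client handles the exposition format, so each service only declares and updates its own metrics.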

Logs — Loki

All services emit structured JSON logs to stdout. Promtail agents ship these to Loki, which indexes them by service name, log level, and timestamp — but not by full-text content. This makes Loki dramatically cheaper to operate than Elasticsearch for my volume. The tradeoff is that ad-hoc text searches are slower, but I rarely need them. 90% of my queries filter by service + time range + log level.
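Emitting one JSON object per line to stdout needs nothing beyond the standard library. A sketch, assuming a hardcoded service label (in practice the label would come from configuration):

```python
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object per line.

    Promtail tails stdout and ships these lines to Loki, which indexes
    the service/level metadata rather than the message text itself.
    """

    def format(self, record):
        payload = {
            "ts": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "service": "enrollment-api",  # hypothetical service label
            "msg": record.getMessage(),
        }
        if record.exc_info:
            payload["exc"] = self.formatException(record.exc_info)
        return json.dumps(payload)

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
log = logging.getLogger("enrollment")
log.addHandler(handler)
log.setLevel(logging.INFO)

log.info("enrollment %s accepted", "abc-123")
```

Because every service uses the same shape, a single Promtail pipeline can parse all four log streams.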

Visualization — Grafana

Grafana dashboards combine Prometheus metrics and Loki logs into unified views. The enrollment health dashboard shows request rate, error rate, and p95 latency as time-series graphs, with log panels below showing the most recent errors in context. Alert rules fire when error rate exceeds 5% over a 5-minute window or when p95 latency exceeds 2 seconds.
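The two alert thresholds translate into Prometheus rule-file syntax roughly as follows. This is a sketch: `http_requests_total` and its `status` label are assumptions about the request instrumentation, while `enrollment_processing_duration_seconds` is the custom histogram described in the metrics section.

```yaml
groups:
  - name: enrollment-health
    rules:
      - alert: HighErrorRate
        # Fraction of 5xx responses over all responses, averaged over 5 minutes.
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error rate above 5% for 5 minutes"
      - alert: HighP95Latency
        # p95 estimated from the histogram buckets of the duration metric.
        expr: |
          histogram_quantile(0.95,
            sum(rate(enrollment_processing_duration_seconds_bucket[5m])) by (le)
          ) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "p95 enrollment latency above 2 seconds"
```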

Key Design Decisions

Why Loki instead of Elasticsearch? Cost and operational simplicity. Elasticsearch requires cluster management, index lifecycle policies, and significant RAM. Loki runs as a single binary, stores logs on disk, and indexes only metadata. For a team of one monitoring 4 services, this is the right tradeoff. I sacrifice full-text search performance to eliminate an entire class of operational burden.
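The query-pattern tradeoff is visible in LogQL itself. A sketch, assuming a `service` label attached by Promtail (the label names are illustrative):

```
# Fast path: selects only by indexed labels plus a time range.
{service="enrollment-api", level="error"}

# Slow path: |= is a line filter that scans log content
# after label selection, since Loki keeps no full-text index.
{service="enrollment-api"} |= "credential mismatch"
```

As long as most queries stay on the fast path, the missing full-text index is never felt.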

Why custom domain metrics? Default infrastructure metrics (CPU, memory, disk) tell you that something is wrong. Domain metrics tell you what is wrong. When credential_verification_failures_total spikes, I know exactly which service and which function to investigate. Generic metrics would only tell me that a container is using more CPU than usual.

03 — Outcomes

Measured Results

- 4 services monitored: unified under a single observability stack
- 91% faster diagnosis: mean time to identify root cause, from 45 min to 4 min
- 12 custom alerts: domain-specific rules that surface operational problems proactively
- 30-day retention window: for metrics and logs, enough to catch recurring patterns

04 — Reflection

You Can’t Fix What You Can’t See

The most valuable outcome of this project wasn’t faster debugging — it was confidence. Before the observability platform, deploying a change to any service carried an ambient anxiety. I wouldn’t know if something broke until hours later. After deployment, I could watch the metric graphs in real time and see immediately whether error rates or latency had changed. Confidence isn’t a soft metric. It directly affects how often you ship.

What I’d change: I’d add distributed tracing from the start. Prometheus and Loki tell me what happened and where it happened, but they can’t trace a single request across multiple services. Adding OpenTelemetry tracing would close that gap and let me correlate a failed enrollment to the specific downstream service call that caused it.

“Observability isn’t about collecting data. It’s about reducing the distance between a system’s behavior and your understanding of it.”
