01 — Problem
Failures Were Invisible Until They Were Catastrophic
I was running 4 microservices — an enrollment API, a notification worker, a reporting aggregator, and a credential verifier — across 2 servers. When something broke, I found out because a stakeholder emailed me. There was no centralized logging, no metric correlation, no alerting. Each service wrote its own log format to its own file. Debugging a failed enrollment meant SSH-ing into both servers, tailing 4 different log files, and mentally reconstructing the event timeline. A 5-minute outage took 45 minutes to diagnose.
I needed a single pane of glass where I could see the health of every service, correlate events across them, and get notified before a user did.
02 — Architecture
Three Pillars: Metrics, Logs, Traces
The platform collects three types of telemetry, each with a specialized storage and query layer:
Metrics — Prometheus
Each microservice exposes a /metrics endpoint in Prometheus exposition format. Prometheus scrapes these endpoints every 15 seconds and stores time-series data for 30 days. I defined custom metrics beyond the defaults: enrollment_processing_duration_seconds, credential_verification_failures_total, and notification_delivery_latency_seconds. These domain-specific metrics surface problems that generic CPU/memory monitoring would miss entirely.
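In practice the services would use a Prometheus client library, but the exposition format itself is plain text and easy to sketch with the standard library alone. The metric name below comes from the list above; the `reason` label values and the counts are hypothetical, purely to show what a scrape of `/metrics` returns.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# In-memory counters keyed by (metric name, label string). A real service
# would use prometheus_client; this sketch only illustrates the text
# exposition format that Prometheus scrapes every 15 seconds.
COUNTERS = {
    ("credential_verification_failures_total", 'reason="expired"'): 3.0,
    ("credential_verification_failures_total", 'reason="revoked"'): 1.0,
}

def render_exposition(counters):
    """Render counters in the Prometheus text exposition format."""
    lines = []
    seen = set()
    for (name, labels), value in sorted(counters.items()):
        if name not in seen:
            lines.append(f"# TYPE {name} counter")
            seen.add(name)
        lines.append(f"{name}{{{labels}}} {value}")
    return "\n".join(lines) + "\n"

class MetricsHandler(BaseHTTPRequestHandler):
    """Answers GET /metrics with the current counter values."""
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_exposition(COUNTERS).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    # A real service would serve this on a fixed port, e.g.:
    #   HTTPServer(("0.0.0.0", 8000), MetricsHandler).serve_forever()
    print(render_exposition(COUNTERS))
```

Prometheus's scrape config would then list each service's host and port as a target; everything after that (storage, retention, querying) is handled by Prometheus itself.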
Logs — Loki
All services emit structured JSON logs to stdout. Promtail agents ship these to Loki, which indexes them by service name, log level, and timestamp — but not by full-text content. This makes Loki dramatically cheaper to operate than Elasticsearch for my volume. The tradeoff is that ad-hoc text searches are slower, but I rarely need them. 90% of my queries filter by service + time range + log level.
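The "structured JSON to stdout" convention is simple enough to sketch with the standard library. The service name and log message below are illustrative; the point is that every line is a single JSON object whose fields Promtail can lift into Loki labels.

```python
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per line so Promtail can ship the stream to
    Loki and Loki can index by service, level, and timestamp."""
    def __init__(self, service):
        super().__init__()
        self.service = service

    def format(self, record):
        return json.dumps({
            "ts": self.formatTime(record),
            "service": self.service,   # becomes a Loki label via Promtail
            "level": record.levelname,
            "msg": record.getMessage(),
        })

def make_logger(service):
    logger = logging.getLogger(service)
    handler = logging.StreamHandler(sys.stdout)  # Promtail tails stdout
    handler.setFormatter(JsonFormatter(service))
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger

if __name__ == "__main__":
    log = make_logger("enrollment-api")  # service name is illustrative
    log.error("credential verification failed")
```

Because the fields are consistent across all four services, the same label-based query shape (service plus time range plus level) works everywhere, which is exactly the access pattern Loki's metadata-only index is built for.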
Visualization — Grafana
Grafana dashboards combine Prometheus metrics and Loki logs into unified views. The enrollment health dashboard shows request rate, error rate, and p95 latency as time-series graphs, with log panels below showing the most recent errors in context. Alert rules fire when error rate exceeds 5% over a 5-minute window or when p95 latency exceeds 2 seconds.
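The two alert conditions above translate fairly directly into Prometheus alerting rules. This is a sketch, not the actual config: it assumes the services export an `http_requests_total` counter with a `status` label and that `enrollment_processing_duration_seconds` is a histogram, so the `_bucket` series and the `job` label are assumptions.

```yaml
groups:
  - name: enrollment-health
    rules:
      - alert: HighErrorRate
        # Fraction of 5xx responses over the last 5 minutes, per service.
        expr: |
          sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
            / sum by (job) (rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "{{ $labels.job }} error rate above 5% for 5 minutes"
      - alert: HighLatencyP95
        # p95 from histogram buckets; assumes the metric is a histogram.
        expr: |
          histogram_quantile(0.95,
            sum by (job, le) (rate(enrollment_processing_duration_seconds_bucket[5m]))) > 2
        for: 5m
        labels:
          severity: warn
        annotations:
          summary: "{{ $labels.job }} p95 latency above 2 seconds"
```

The `for: 5m` clause is what keeps a single bad scrape from paging; the condition has to hold for the whole window before the alert fires.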
Key Design Decisions
Why Loki instead of Elasticsearch? Cost and operational simplicity. Elasticsearch requires cluster management, index lifecycle policies, and significant RAM. Loki runs as a single binary, stores logs on disk, and indexes only metadata. For a team of one monitoring 4 services, this is the right tradeoff. I sacrifice full-text search performance to eliminate an entire class of operational burden.
Why custom domain metrics? Default infrastructure metrics (CPU, memory, disk) tell you that something is wrong. Domain metrics tell you what is wrong. When credential_verification_failures_total spikes, I know exactly which service and which function to investigate. Generic metrics would only tell me that a container is using more CPU than usual.
03 — Outcomes
Measured Results
4 Services Monitored
unified under a single observability stack
91% Faster Diagnosis
mean time to identify root cause — from 45 min to 4 min
12 Custom Alerts
domain-specific rules that surface operational problems proactively
30-Day Retention Window
for metrics and logs — enough to catch recurring patterns
04 — Reflection
You Can’t Fix What You Can’t See
The most valuable outcome of this project wasn’t faster debugging — it was confidence. Before the observability platform, deploying a change to any service carried an ambient anxiety. I wouldn’t know if something broke until hours later. After deployment, I could watch the metric graphs in real time and see immediately whether error rates or latency had changed. Confidence isn’t a soft metric. It directly affects how often you ship.
What I’d change: I’d add distributed tracing from the start. Prometheus and Loki tell me what happened and where it happened, but they can’t trace a single request across multiple services. Adding OpenTelemetry tracing would close that gap and let me correlate a failed enrollment to the specific downstream service call that caused it.
“Observability isn’t about collecting data. It’s about reducing the distance between a system’s behavior and your understanding of it.”
Outcomes
4 services unified under single observability stack; 91% reduction in mean time to diagnose; 12 custom domain-specific alert rules; 30-day metric and log retention