Cloud Architecture Without Netflix’s Problems
01
What problem were they trying to solve?
The company had grown from a monolithic Rails application to a microservices architecture over 18 months. The motivation was reasonable: their monolith had become slow to deploy (22-minute build times) and difficult to scale during traffic spikes (Black Friday, product launches). But the architecture they adopted was modeled on Netflix’s publicly documented patterns: service mesh with Istio, event sourcing with Kafka, polyglot persistence with 4 different database technologies, and full Kubernetes orchestration across 3 environments.
The problem was that they were not Netflix. Netflix has over 2,500 engineers, processes 1.5 billion hours of video per quarter, and serves 260 million subscribers across 190 countries. This company had 14 engineers, processed 340,000 orders per month, and served customers in 2 countries. The architectural patterns designed for Netflix’s scale created operational overhead that consumed 45% of their engineering capacity, leaving only about 8 person-months per quarter for feature development from a team that should have been delivering 14.
02
How was the architecture redesigned for actual scale?
I conducted a 2-week architecture assessment. The first step was profiling actual system load. Peak traffic was 420 requests per second during a product launch. Average daily traffic was 38 requests per second. The system processed 11,000 orders per day at peak. These numbers are significant: they are well within the capacity of a single well-configured application server. The microservices architecture was solving a scaling problem that did not exist.
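To make the comparison concrete, the figures above can be put through some back-of-the-envelope arithmetic. The sketch below is illustrative only: the function names are invented for this example, and the 500 requests-per-second capacity is an assumed figure for a single application server, not a measurement from this system.

```python
# Figures quoted in the assessment above.
PEAK_RPS = 420                 # requests per second during a product launch
AVERAGE_RPS = 38               # average requests per second
PEAK_ORDERS_PER_DAY = 11_000   # orders on the busiest day

SECONDS_PER_DAY = 24 * 60 * 60


def per_second(events_per_day: float) -> float:
    """Convert a daily event count into an average per-second rate."""
    return events_per_day / SECONDS_PER_DAY


def peak_to_average_ratio(peak_rps: float, average_rps: float) -> float:
    """How many times larger the peak is than the average."""
    return peak_rps / average_rps


def headroom(capacity_rps: float, peak_rps: float) -> float:
    """Fraction of capacity left unused at peak (0.0 means saturated)."""
    return 1.0 - peak_rps / capacity_rps


if __name__ == "__main__":
    print(f"orders per second on a peak day: {per_second(PEAK_ORDERS_PER_DAY):.3f}")
    print(f"peak-to-average ratio: {peak_to_average_ratio(PEAK_RPS, AVERAGE_RPS):.1f}x")
    # ASSUMPTION: 500 rps is a hypothetical single-server capacity.
    # Replace it with a number measured in your own load test.
    print(f"headroom at an assumed 500 rps: {headroom(500, PEAK_RPS):.0%}")
```

The useful habit here is converting every "per day" figure into a per-second rate before comparing it with what one server can do. Daily totals sound large, while per-second rates are often small.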
The redesign followed 4 principles:
Principle 1: Consolidate services that share data. 8 of the 12 microservices accessed the same PostgreSQL database. They were not microservices. They were a distributed monolith with network calls where function calls should have been. I merged them into a single application with 8 internal modules, reducing inter-service network calls from 2.3 million per day to zero. Response time for the checkout flow dropped from 340 milliseconds (with 7 inter-service hops) to 45 milliseconds (single process, no network overhead).
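As a rough illustration of what "function calls where network calls used to be" can look like, the sketch below keeps module boundaries explicit while running everything in one process. It is not the code from this migration: the module names (Inventory, Pricing, Checkout), their methods, and the in-memory implementations are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Protocol


class Inventory(Protocol):
    """Boundary of a hypothetical inventory module."""

    def reserve(self, sku: str, quantity: int) -> bool: ...


class Pricing(Protocol):
    """Boundary of a hypothetical pricing module."""

    def total_cents(self, sku: str, quantity: int) -> int: ...


@dataclass
class InMemoryInventory:
    stock: Dict[str, int]

    def reserve(self, sku: str, quantity: int) -> bool:
        # Refuse the reservation if there is not enough stock.
        if self.stock.get(sku, 0) < quantity:
            return False
        self.stock[sku] -= quantity
        return True


@dataclass
class FlatPricing:
    prices_cents: Dict[str, int]

    def total_cents(self, sku: str, quantity: int) -> int:
        return self.prices_cents[sku] * quantity


@dataclass
class Checkout:
    # Dependencies are other modules in the same process, so each call
    # below is a plain method call with no serialization or network hop.
    inventory: Inventory
    pricing: Pricing

    def place_order(self, sku: str, quantity: int) -> Optional[int]:
        """Return the order total in cents, or None if stock is insufficient."""
        if not self.inventory.reserve(sku, quantity):
            return None
        return self.pricing.total_cents(sku, quantity)


if __name__ == "__main__":
    checkout = Checkout(InMemoryInventory({"mug": 2}), FlatPricing({"mug": 1200}))
    print(checkout.place_order("mug", 2))
```

The interfaces keep the modules independently testable, which preserves most of the organizational benefit of separate services. If one module later needs to be split out, its interface is the natural seam.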
Principle 2: Replace Kafka with PostgreSQL LISTEN/NOTIFY for internal events. The Kafka cluster processed 180,000 messages per day, well below the threshold where Kafka’s distributed log architecture provides value. Kafka’s operational overhead (3-node cluster, ZooKeeper management, topic configuration, consumer group management) required approximately 8 hours per week of engineering time. PostgreSQL’s built-in LISTEN/NOTIFY handled the same event volume with zero additional infrastructure. For the 2 use cases that needed durable message queuing (order fulfillment and inventory sync), I used a simple outbox pattern with a cron job.
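The outbox pattern mentioned above can be sketched in a few dozen lines. This is a minimal illustration, not the production code. It uses Python's built-in sqlite3 module as a stand-in for PostgreSQL so that it runs without any infrastructure. The table layout, the "order_fulfillment" topic name, and the publish callback are assumptions made for the example.

```python
import json
import sqlite3
from typing import Callable


def init_schema(conn: sqlite3.Connection) -> None:
    """Create a hypothetical orders table and its outbox table."""
    conn.executescript(
        """
        CREATE TABLE IF NOT EXISTS orders (
            id INTEGER PRIMARY KEY,
            sku TEXT NOT NULL,
            quantity INTEGER NOT NULL
        );
        CREATE TABLE IF NOT EXISTS outbox (
            id INTEGER PRIMARY KEY,
            topic TEXT NOT NULL,
            payload TEXT NOT NULL,
            delivered_at TEXT
        );
        """
    )


def place_order(conn: sqlite3.Connection, sku: str, quantity: int) -> int:
    """Write the order and its event in one transaction, so neither exists alone."""
    with conn:  # commits on success, rolls back if an exception is raised
        cur = conn.execute(
            "INSERT INTO orders (sku, quantity) VALUES (?, ?)", (sku, quantity)
        )
        order_id = cur.lastrowid
        payload = json.dumps({"order_id": order_id, "sku": sku, "quantity": quantity})
        conn.execute(
            "INSERT INTO outbox (topic, payload) VALUES (?, ?)",
            ("order_fulfillment", payload),
        )
    return order_id


def drain_outbox(
    conn: sqlite3.Connection,
    publish: Callable[[str, dict], None],
    batch_size: int = 100,
) -> int:
    """Body of the cron job: deliver pending events, oldest first."""
    rows = conn.execute(
        "SELECT id, topic, payload FROM outbox "
        "WHERE delivered_at IS NULL ORDER BY id LIMIT ?",
        (batch_size,),
    ).fetchall()
    delivered = 0
    for event_id, topic, payload in rows:
        # If publish raises, the row stays pending and the next run retries it.
        publish(topic, json.loads(payload))
        with conn:
            conn.execute(
                "UPDATE outbox SET delivered_at = datetime('now') WHERE id = ?",
                (event_id,),
            )
        delivered += 1
    return delivered
```

Because an event is marked delivered only after publish returns, a crash between the two steps causes the event to be sent again on the next run. Delivery is therefore at-least-once, and consumers should be idempotent.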
Principle 3: Replace Kubernetes with managed platform services. The Kubernetes cluster ran 34 pods across 6 nodes with Istio service mesh. I migrated the consolidated application to AWS ECS Fargate (3 tasks behind an Application Load Balancer) and the 4 remaining standalone services to AWS Lambda. Monthly infrastructure cost dropped from $8,400 to $4,900. Deployment time dropped from 12 minutes to 3 minutes. The team no longer needed to maintain Kubernetes manifests, Helm charts, or Istio configuration.
Principle 4: Consolidate to a single database technology. The system used PostgreSQL, Redis, MongoDB, and DynamoDB. The MongoDB instance stored product catalog data (1.2 GB). DynamoDB stored session data (average 40,000 active sessions). Redis served as both cache and message broker. I consolidated to PostgreSQL (with JSONB columns for catalog data) and Redis (for caching and sessions). Operational complexity decreased, backup procedures became simpler, and the team needed expertise in 2 technologies instead of 4.
03
What were the measurable outcomes?
Monthly infrastructure cost: $8,400 → $4,900
Deployment time: 12 → 3 minutes
Checkout flow latency: 340 → 45 ms
Engineering time on infrastructure: 45% → 18%
Services to maintain: 12 → 5 (1 consolidated application plus 4 standalone services)
Database technologies: 4 → 2
The team recovered 3.8 person-months per quarter of engineering capacity that had been consumed by infrastructure maintenance. In the 6 months after the migration, they shipped 14 features compared to 5 features in the 6 months before. Revenue per engineer, while not the only metric that matters, increased by 34%. The on-call rotation went from averaging 3 pages per week to 0.4 pages per week, because the system had fewer moving parts that could fail independently.
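For readers who want to check the arithmetic, the short script below recomputes the relative changes from the before-and-after figures quoted in this case study. One input is an interpretation rather than a quoted fact: it assumes the nominal capacity is the 14 person-months per quarter mentioned in section 01. Under that assumption, the drop from 45% to 18% of time spent on infrastructure matches the 3.8 person-months recovered.

```python
def percent_change(before: float, after: float) -> float:
    """Signed change from before to after, as a percentage of before."""
    return (after - before) / before * 100.0


# (before, after) pairs quoted in this case study.
OUTCOMES = {
    "monthly infrastructure cost (USD)": (8_400, 4_900),
    "deployment time (minutes)": (12, 3),
    "checkout latency (ms)": (340, 45),
    "on-call pages per week": (3, 0.4),
    "features shipped per 6 months": (5, 14),
}

# ASSUMPTION: nominal capacity of 14 person-months per quarter, read from
# the "should have been delivering 14" figure in section 01.
NOMINAL_CAPACITY_PERSON_MONTHS = 14
INFRA_SHARE_BEFORE_PCT = 45
INFRA_SHARE_AFTER_PCT = 18


def recovered_capacity(capacity: float, before_pct: float, after_pct: float) -> float:
    """Person-months freed when the infrastructure share of time drops."""
    return capacity * (before_pct - after_pct) / 100.0


def annual_savings(monthly_before: float, monthly_after: float) -> float:
    """Yearly saving implied by a change in monthly cost."""
    return (monthly_before - monthly_after) * 12


if __name__ == "__main__":
    for name, (before, after) in OUTCOMES.items():
        print(f"{name}: {percent_change(before, after):+.1f}%")
    recovered = recovered_capacity(
        NOMINAL_CAPACITY_PERSON_MONTHS, INFRA_SHARE_BEFORE_PCT, INFRA_SHARE_AFTER_PCT
    )
    print(f"capacity recovered: {recovered:.1f} person-months per quarter")
    print(f"annual infrastructure savings: ${annual_savings(8_400, 4_900):,.0f}")
```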
04
What would I change in hindsight?
I would have been more aggressive about consolidation. I left 4 services separate (payment processing, email delivery, image processing, and a third-party integration adapter) because I believed they had genuinely different scaling characteristics. In retrospect, only the image processing service warranted separation (it was CPU-intensive and benefited from Lambda’s per-invocation scaling). The other 3 could have been additional modules in the consolidated application, reducing the total service count from 5 (the consolidated application plus 4 standalone services) to 2.
I would also have started with a load test that proved the monolith could handle peak traffic before beginning the migration. I was confident based on the numbers, but the team was skeptical. A 30-minute load test demonstrating that a single ECS task could handle 500 requests per second (more than their peak of 420) would have built immediate confidence and reduced resistance to the consolidation approach.
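A load test of this kind does not need much machinery to get started. The sketch below is a minimal closed-loop harness using only the Python standard library. It is an illustration under stated assumptions, not the test that was run: the function and class names are invented, and any target URL is something you would supply yourself. A thread pool on one machine may not be able to generate 500 requests per second, so a real run would more likely use a dedicated load-testing tool.

```python
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class LoadTestResult:
    requests: int
    failures: int
    elapsed_seconds: float
    p95_latency_seconds: float

    @property
    def throughput_rps(self) -> float:
        """Completed requests per second over the whole run."""
        if self.elapsed_seconds <= 0:
            return 0.0
        return self.requests / self.elapsed_seconds


def http_get(url: str, timeout: float = 5.0) -> Callable[[], bool]:
    """Build a sender that reports True when the URL answers with a 2xx status."""
    def send() -> bool:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    return send


def run_load_test(
    send_request: Callable[[], bool], total_requests: int, concurrency: int
) -> LoadTestResult:
    """Send total_requests calls using a fixed number of concurrent workers."""
    if total_requests <= 0 or concurrency <= 0:
        raise ValueError("total_requests and concurrency must be positive")

    def timed_call(_: int) -> Tuple[bool, float]:
        start = time.perf_counter()
        try:
            ok = send_request()
        except Exception:
            ok = False  # timeouts and HTTP errors count as failures
        return ok, time.perf_counter() - start

    started = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        outcomes = list(pool.map(timed_call, range(total_requests)))
    elapsed = time.perf_counter() - started

    latencies = sorted(latency for _, latency in outcomes)
    # Nearest-rank 95th percentile, computed with integer arithmetic.
    p95_index = (95 * len(latencies) + 99) // 100 - 1
    failures = sum(1 for ok, _ in outcomes if not ok)
    return LoadTestResult(total_requests, failures, elapsed, latencies[p95_index])
```

Calling run_load_test(http_get(url), total_requests=..., concurrency=...) against a staging endpoint of your own gives throughput, failure count, and 95th-percentile latency. Because the harness is closed-loop, it measures what the server sustains at a fixed concurrency rather than at a fixed arrival rate, so the result should be read as a lower bound on confidence, not a capacity guarantee.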
The broader lesson is that architectural patterns are not universal truths. They are solutions to specific problems at specific scales. Netflix’s architecture solves Netflix’s problems. Your architecture should solve your problems. If your problems are “we have 14 engineers and 340,000 orders per month,” the solution is almost certainly not a service mesh and event sourcing. It is a well-structured application with good deployment automation and the discipline to keep it simple until the numbers demand otherwise.