
The Architecture of Feature Flags at Scale

A feature flag system for 180 services and 2,400 flags maintained evaluation latency under 3 milliseconds while reducing deployment-related incidents by 71%.

A feature flag system I architected for a platform with 180 services and 2,400 flags maintained flag evaluation latency under 3 milliseconds at the 99th percentile while supporting targeted rollouts, A/B testing, and kill switches. The system reduced deployment-related incidents by 71% by decoupling code deployment from feature activation.

What problem does this system address?

At scale, feature flags create their own category of problems: flag dependency conflicts, stale flags that are never cleaned up, testing combinatorial explosions, and the operational overhead of managing thousands of flags across hundreds of services. This system addresses these problems with structured governance and performant evaluation.

Feature flags started as a simple concept: wrap new code in an if statement, toggle it on when ready. At 10 flags, this works. At 2,400 flags across 180 services, it becomes a distributed configuration management challenge. I encountered systems where 40% of flags were stale (the feature had been fully launched months or years ago but the flag was never removed). Engineers could not determine which flags were safe to remove because no one tracked flag dependencies. A/B tests conflicted with each other because there was no coordination between flag assignments. The feature flag system, meant to reduce risk, was creating its own risk category.

How is the system structured?

The system has 4 layers: a centralized flag management service, local evaluation SDKs with caching, a lifecycle governance framework, and a dependency tracking system that prevents conflicting flag states.

Layer 1: Centralized Flag Management

All flags are defined in a centralized service with metadata: owner (team), creation date, expected removal date, flag type (release toggle, experiment, operational kill switch, permission gate), and targeting rules. The management UI shows flag status across all environments (development, staging, production) and tracks evaluation metrics (how many times each flag is evaluated per day, what percentage of evaluations return true). Flags with no evaluations for 30 days are automatically queued for cleanup review.
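A minimal sketch of what one flag record and the 30-day idle check could look like. The field names, the FlagType values, and the needs_cleanup_review helper are illustrative assumptions, not the actual schema of the service described here.

```python
from dataclasses import dataclass, field
from datetime import date, datetime, timedelta
from enum import Enum
from typing import Optional


class FlagType(Enum):
    RELEASE_TOGGLE = "release_toggle"
    EXPERIMENT = "experiment"
    KILL_SWITCH = "kill_switch"
    PERMISSION_GATE = "permission_gate"


@dataclass
class FlagDefinition:
    # Hypothetical field names mirroring the metadata listed above.
    key: str
    owner_team: str
    created_on: date
    expected_removal: date
    flag_type: FlagType
    targeting_rules: list = field(default_factory=list)
    last_evaluated_at: Optional[datetime] = None  # None: never evaluated


def needs_cleanup_review(flag: FlagDefinition, now: datetime,
                         idle_days: int = 30) -> bool:
    """Return True when the flag has had no evaluations for idle_days.

    A flag that was never evaluated is measured from its creation date
    (an assumption; the original system may treat this case differently).
    """
    reference = flag.last_evaluated_at or datetime.combine(
        flag.created_on, datetime.min.time()
    )
    return now - reference >= timedelta(days=idle_days)
```

Keeping the owner and expected removal date on the record itself is what makes the later governance steps possible: a reminder needs to know whom to notify and when.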

Layer 2: Local Evaluation SDK

Each service includes a lightweight SDK that caches flag rules locally and evaluates them in-process, with no network call per request. The SDK synchronizes with the central service every 30 seconds. This keeps evaluation latency under 3 milliseconds at the 99th percentile even during central service maintenance windows, because the local cache continues serving the last rules it synchronized, even if they are stale. The SDK handles targeting rules: percentage-based rollouts, user-segment targeting, geographic targeting, and custom attribute matching.
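The local evaluation idea can be sketched as follows. This is an illustration under stated assumptions, not the SDK itself: the LocalFlagCache class, the fetch_rules callable, the rollout_percent rule field, and the hash-based bucketing are all hypothetical, and only percentage rollouts are shown.

```python
import hashlib
import time


def bucket(flag_key: str, user_id: str) -> int:
    """Map (flag, user) deterministically to a bucket in [0, 100)."""
    digest = hashlib.sha256(f"{flag_key}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


class LocalFlagCache:
    """In-process rule cache; evaluation never touches the network."""

    def __init__(self, fetch_rules, sync_interval_s: float = 30.0):
        self._fetch_rules = fetch_rules  # hypothetical: returns {flag_key: rule}
        self._sync_interval_s = sync_interval_s
        self._rules = {}
        self._last_sync = None

    def sync_if_due(self, now=None) -> None:
        """Refresh rules if the interval has elapsed; keep old rules on failure."""
        now = time.monotonic() if now is None else now
        if (self._last_sync is not None
                and now - self._last_sync < self._sync_interval_s):
            return
        self._last_sync = now
        try:
            self._rules = dict(self._fetch_rules())
        except Exception:
            # Central service unreachable: keep serving the cached rules.
            pass

    def is_enabled(self, flag_key: str, user_id: str, default: bool = False) -> bool:
        rule = self._rules.get(flag_key)
        if rule is None:
            return default
        # Same user always lands in the same bucket for a given flag.
        return bucket(flag_key, user_id) < rule["rollout_percent"]
```

Hashing the flag key together with the user ID means a user's assignment is stable across requests and services, and raising the percentage only ever adds users to the enabled group.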

Layer 3: Lifecycle Governance

Every flag has a lifecycle: created, active, fully launched (100% rollout), and removed. The governance framework enforces transitions: a flag cannot move from “created” to “active” without a test plan. A flag that has been at 100% rollout for more than 14 days triggers an automated reminder to the owning team to remove the flag and clean up the conditional code. If the flag is not removed within 30 days, the reminder escalates to the team’s engineering manager. This governance reduced stale flags from 40% to 7% in the first 6 months.
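The lifecycle rules can be expressed as a small state machine. The sketch below makes two assumptions that the text does not settle: the exact set of allowed transitions (for example, that an active flag may be removed directly), and that the 30-day escalation window is counted from the day the flag reached 100% rollout. All names are hypothetical.

```python
from datetime import date
from enum import Enum
from typing import Optional


class State(Enum):
    CREATED = "created"
    ACTIVE = "active"
    FULLY_LAUNCHED = "fully_launched"
    REMOVED = "removed"


# Assumed transition table; the real framework may allow other moves.
ALLOWED = {
    State.CREATED: {State.ACTIVE},
    State.ACTIVE: {State.FULLY_LAUNCHED, State.REMOVED},
    State.FULLY_LAUNCHED: {State.REMOVED},
    State.REMOVED: set(),
}


def check_transition(current: State, target: State,
                     has_test_plan: bool = False) -> None:
    """Raise ValueError if the lifecycle transition is not permitted."""
    if target not in ALLOWED[current]:
        raise ValueError(f"cannot move from {current.value} to {target.value}")
    if current is State.CREATED and target is State.ACTIVE and not has_test_plan:
        raise ValueError("a test plan is required before activation")


def cleanup_action(fully_launched_on: date, today: date) -> Optional[str]:
    """Decide which cleanup nudge applies to a flag sitting at 100% rollout."""
    days = (today - fully_launched_on).days
    if days > 30:  # assumption: counted from reaching 100% rollout
        return "escalate_to_manager"
    if days > 14:
        return "remind_owner"
    return None
```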

Layer 4: Dependency Tracking

Flags can declare dependencies on other flags (“this flag should only be active when flag X is also active”) and conflicts (“this flag must not be active when flag Y is active”). The system validates these constraints before flag state changes. This prevents the A/B test conflicts that plagued the previous system, where two experiments could independently modify the same user experience and produce meaningless results. The dependency system caught 14 potential conflicts in the first quarter, each of which would have produced invalid experimental data.
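Constraint validation before a state change might look like the following sketch. The validate_change function and the shape of the requires and conflicts maps are assumptions for illustration; the sketch treats a conflict as symmetric, which the text implies but does not state.

```python
from typing import Dict, List, Set


def validate_change(flag: str, activate: bool, active: Set[str],
                    requires: Dict[str, Set[str]],
                    conflicts: Dict[str, Set[str]]) -> List[str]:
    """Return the constraint violations a proposed state change would cause.

    active    -- flags currently on
    requires  -- flag -> flags that must be on whenever it is on
    conflicts -- flag -> flags that must be off whenever it is on
    An empty list means the change is safe to apply.
    """
    violations = []
    if activate:
        for dep in sorted(requires.get(flag, set())):
            if dep not in active:
                violations.append(f"{flag} requires {dep}, which is inactive")
        for other in sorted(active):
            if other == flag:
                continue
            # A conflict declared on either side blocks activation.
            if other in conflicts.get(flag, set()) or flag in conflicts.get(other, set()):
                violations.append(f"{flag} conflicts with active flag {other}")
    else:
        # Turning a flag off must not strand an active flag that depends on it.
        for other in sorted(active):
            if other != flag and flag in requires.get(other, set()):
                violations.append(f"{other} requires {flag}; deactivate it first")
    return violations
```

For example, with conflicts = {"exp-a": {"exp-b"}} and "exp-a" already active, a request to activate "exp-b" returns one violation and the change is rejected before either experiment's data is contaminated.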

How do you validate it works?

Validation uses 3 metrics: flag evaluation latency (under 3 milliseconds at p99), stale flag percentage (target under 10%), and deployment-incident rate (71% reduction after implementation).
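For readers who want to reproduce these metrics on their own data, here is one way to compute them. The nearest-rank percentile definition is an assumption (monitoring tools differ in how they interpolate), and the function names are hypothetical.

```python
import math
from typing import Sequence


def percentile_nearest_rank(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def stale_percentage(total_flags: int, stale_flags: int) -> float:
    """Share of flags that are stale, as a percentage of all flags."""
    return 100.0 * stale_flags / total_flags


def percent_reduction(before: float, after: float) -> float:
    """Relative drop from a baseline count, e.g. incidents per quarter."""
    return 100.0 * (before - after) / before
```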

The deployment-incident reduction is the headline metric. By decoupling code deployment from feature activation, the system allows code to be deployed (and therefore exercised under real production conditions) without exposing the feature to users. By the time the feature is activated via its flag, the code has already been deployed, monitored, and validated in production. If the feature causes issues, it can be deactivated in under 5 seconds without a code rollback. Writing on feature toggles generally treats this decoupling of deployment from release as the primary value of feature flags. This system extends that value to enterprise scale while managing the governance challenges that scale introduces. The patterns connect directly to what I described in the architecture of rollback: feature flags are the most granular rollback mechanism available.

adam@adam-analytics.com writes about AI systems, software architecture, and the philosophy of technology.