Airflow 3.0 Migration: Event-Driven Orchestration

Migrating 47 DAGs to Airflow 3.0’s event-driven scheduling reduced median pipeline latency by 62%, from 23 minutes to 8.7 minutes. It also eliminated the 340 idle runs logged in a 4-week audit (about 85 per week), runs that were triggered by time-based schedules rather than data availability.

01

What problem did the legacy orchestration system create?

The existing Airflow 2.x deployment ran 47 production DAGs on fixed cron schedules. Every pipeline ran at a predetermined time, regardless of whether its upstream data had arrived. This created two categories of waste: idle runs (pipelines that executed, found no new data, and completed with zero useful work) and stale runs (pipelines that executed before upstream data was ready, processed incomplete datasets, and required manual reruns).

I audited 4 weeks of pipeline execution logs. Of 1,340 total DAG runs, 340 (25.4%) were idle runs that processed zero new records. Another 89 (6.6%) were stale runs that required manual intervention to reprocess after upstream data arrived late. The engineering team spent an average of 7.2 hours per week managing these timing failures: adjusting cron schedules, adding artificial delays, and building custom sensor operators that polled for data readiness.
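The audit classification can be sketched as follows. This is a minimal illustration, not the original audit script: the DagRun fields (new_records, manually_rerun) are hypothetical names, and the run list is synthetic, sized only to match the totals quoted above.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class DagRun:
    dag_id: str
    new_records: int       # records processed by this run (hypothetical field)
    manually_rerun: bool   # True if the run had to be reprocessed by hand


def classify(run: DagRun) -> str:
    """Label a run as 'stale', 'idle', or 'useful'."""
    if run.manually_rerun:
        return "stale"
    if run.new_records == 0:
        return "idle"
    return "useful"


def summarize(runs: list[DagRun]) -> dict[str, float]:
    """Return each category's share of all runs, as a percentage."""
    counts = Counter(classify(run) for run in runs)
    total = len(runs)
    return {label: round(100 * n / total, 1) for label, n in counts.items()}


# Synthetic runs sized to match the audit totals: 340 idle, 89 stale, 1,340 overall.
runs = (
    [DagRun("idle_example", 0, False)] * 340
    + [DagRun("stale_example", 120, True)] * 89
    + [DagRun("useful_example", 500, False)] * 911
)
shares = summarize(runs)
```

With these counts, the idle and stale shares come out to the 25.4% and 6.6% reported in the audit.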

The fundamental problem was architectural. Time-based scheduling assumes data arrives on a predictable schedule. In a system with 12 external data sources, 3 third-party APIs, and 2 partner feeds, data arrival times varied by 15 minutes to 4 hours on any given day. The cron schedule was a fiction. The team’s workarounds were an admission that the fiction didn’t hold.

02

How was the Airflow 3.0 event-driven architecture designed?

Airflow 3.0 supports native event-driven scheduling through its Asset-based triggering system (Assets are the renamed successor to the Datasets that Airflow 2.4 introduced for data-aware scheduling). Instead of cron expressions, DAGs declare their dependencies on data assets. When an upstream DAG updates an asset, downstream DAGs that depend on that asset are automatically triggered. This inverted the scheduling model: instead of “run at 6am and hope the data is there,” the pattern became “run when the data is confirmed present.”

I designed the migration in 3 phases over 11 weeks. Phase 1 (weeks 1 through 3): I mapped every DAG’s actual data dependencies using execution logs, not the documented dependencies (which were incomplete). This produced a dependency graph with 47 nodes and 83 edges. 12 dependencies were undocumented and discovered only through log analysis.
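One way to surface undocumented dependencies from logs is to infer producer-to-consumer edges from table reads and writes, then subtract the documented edges. The sketch below assumes a simplified log of (dag_id, action, table) tuples; that format and every name in it are hypothetical.

```python
def infer_edges(access_log):
    """Derive (producer, consumer) edges: a DAG that reads a table
    depends on the DAG that writes that table."""
    writers = {}
    for dag_id, action, table in access_log:
        if action == "write":
            writers[table] = dag_id

    edges = set()
    for dag_id, action, table in access_log:
        producer = writers.get(table)
        if action == "read" and producer is not None and producer != dag_id:
            edges.add((producer, dag_id))
    return edges


# Hypothetical log entries and documented dependencies.
access_log = [
    ("ingest_orders", "write", "raw.orders"),
    ("clean_orders", "read", "raw.orders"),
    ("clean_orders", "write", "clean.orders"),
    ("daily_report", "read", "clean.orders"),
    ("daily_report", "read", "raw.orders"),
]
documented = {
    ("ingest_orders", "clean_orders"),
    ("clean_orders", "daily_report"),
}

inferred = infer_edges(access_log)
undocumented = inferred - documented
```

Here the direct read of raw.orders by daily_report shows up as an edge that the documentation never mentioned.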

Phase 2 (weeks 4 through 7): I converted DAGs to asset-based scheduling, starting with leaf nodes (DAGs with no downstream dependents) and working backward toward root sources. Each conversion followed the same three-step pattern. First, replace the cron schedule_interval with an Asset dependency declaration (passed through the schedule argument, since Airflow 3.0 removes schedule_interval). Second, add an Asset outlet annotation to the DAG’s final task, confirming that the output dataset was updated. Third, run the old cron-triggered and new event-triggered versions in parallel for 5 days to validate equivalence.
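The leaf-first conversion order can be derived mechanically from the dependency graph. This is a sketch using Python's standard graphlib module; the edge list is hypothetical, and the original ordering may well have been worked out by hand.

```python
from graphlib import TopologicalSorter


def conversion_order(edges):
    """Given (upstream, downstream) edges, return DAG ids leaf-first."""
    dependents = {}
    for upstream, downstream in edges:
        dependents.setdefault(upstream, set()).add(downstream)
        dependents.setdefault(downstream, set())
    # TopologicalSorter emits a node only after every node mapped to it.
    # Mapping each DAG to its dependents therefore yields leaves first
    # and root sources last.
    return list(TopologicalSorter(dependents).static_order())


# Hypothetical edges: one source feeding a cleaner, which feeds two leaves.
edges = [
    ("ingest_orders", "clean_orders"),
    ("clean_orders", "daily_report"),
    ("clean_orders", "finance_export"),
]
order = conversion_order(edges)
```

The two leaf DAGs come first in either order, followed by clean_orders, with ingest_orders converted last.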

Phase 3 (weeks 8 through 11): I integrated dbt models through Cosmos, Astronomer’s dbt-Airflow integration library. Cosmos rendered each dbt model as an Airflow task, preserving dbt’s dependency graph while gaining Airflow’s scheduling and monitoring capabilities. The 23 dbt models that previously ran as a monolithic “dbt run” command were decomposed into individual tasks with explicit asset dependencies, enabling partial reruns and granular monitoring.

The critical design decision was the asset granularity. I defined assets at the table level, not the schema or database level. This meant a DAG that updated 3 tables published 3 asset events, and downstream DAGs could depend on individual tables rather than waiting for the entire upstream DAG to complete. This reduced unnecessary waiting by an average of 11 minutes per pipeline chain.
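The effect of granularity on start time can be illustrated with a small model. The table names and readiness times below are hypothetical and chosen only to show the mechanism; they are not measurements from the migration.

```python
def earliest_trigger(needed_tables, ready_minute, granularity):
    """Minute at which a downstream DAG can start.

    'table': start as soon as the tables it needs are ready.
    'dag':   start only after every table the upstream DAG writes is ready.
    """
    if granularity == "table":
        return max(ready_minute[table] for table in needed_tables)
    if granularity == "dag":
        return max(ready_minute.values())
    raise ValueError(f"unknown granularity: {granularity!r}")


# Hypothetical: minutes after the upstream DAG starts at which each table lands.
ready_minute = {"orders": 3, "customers": 8, "refunds": 12}

table_level = earliest_trigger({"orders"}, ready_minute, "table")
dag_level = earliest_trigger({"orders"}, ready_minute, "dag")
minutes_saved = dag_level - table_level
```

A consumer that needs only the first table to land no longer waits for the slowest one, which is where the saved minutes come from.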

03

What were the measurable outcomes?

Median latency reduction: 62%
Idle runs eliminated (per 4-week audit window): 340
Reduction in manual reruns: 89%
Weekly engineering time recovered: 6.4 hrs
DAGs migrated: 47
Total migration duration: 11 wks

Median pipeline latency dropped from 23 minutes to 8.7 minutes. The improvement came from two sources: eliminating artificial delays (sensors and sleep tasks that padded schedules to account for data arrival variance) and enabling immediate triggering when upstream data landed. The P95 latency improved even more dramatically, from 52 minutes to 14 minutes, because the worst-case scenarios (upstream data arriving 2 hours late) no longer cascaded into missed processing windows.

The 340 eliminated idle runs translated to approximately $1,800 per month in reduced compute costs (Kubernetes pod-hours and Snowflake warehouse credits that were previously consumed by pipelines processing empty datasets). The 89% reduction in manual reruns freed 6.4 hours of engineering time per week, time that was previously spent diagnosing timing failures and manually triggering corrective runs.
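The headline percentages can be rechecked from the figures quoted in this section. The hours calculation assumes the 6.4 hours is the 89% rerun reduction applied to the 7.2 hours per week previously spent on timing failures; that derivation is an inference from the numbers, not something stated explicitly.

```python
def pct_reduction(before: float, after: float) -> float:
    """Percentage drop from `before` to `after`, rounded to one decimal."""
    return round(100 * (before - after) / before, 1)


median_reduction = pct_reduction(23, 8.7)  # median latency, in minutes
p95_reduction = pct_reduction(52, 14)      # P95 latency, in minutes

# Assumed derivation: 89% of the 7.2 weekly hours spent on timing failures.
hours_freed = round(7.2 * 0.89, 1)
```

These evaluate to 62.2%, 73.1%, and 6.4 hours, matching the reported 62% median improvement and the larger P95 improvement.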

The dbt integration through Cosmos delivered an unexpected benefit: model-level observability. With each dbt model running as an individual Airflow task, I could monitor execution time, resource consumption, and failure rates at the model level rather than the project level. This revealed that 3 of 23 models consumed 71% of the total dbt execution time, a bottleneck that was invisible when dbt ran as a monolithic command.

04

What would I change in hindsight?

I would have the dependency map in place before the migration began, rather than building it as the migration’s first phase. I spent weeks 1 through 3 on dependency discovery because the existing documentation was incomplete. If I had maintained a living dependency map from the beginning of the Airflow 2.x deployment, the migration would have started at Phase 2, saving 3 weeks.

I underestimated the cultural adjustment. Engineers accustomed to cron schedules think in terms of “when does this run?” Event-driven scheduling requires thinking in terms of “what does this need before it can run?” This is a subtle but meaningful cognitive shift. Two engineers continued writing cron-based DAGs for 3 weeks after the migration because their mental model hadn’t shifted. I should have invested more time in training, focused less on Airflow 3.0 syntax (which is well-documented) and more on event-driven thinking as a design paradigm.

The asset granularity decision (table-level) was correct for most cases but created overhead for DAGs that produce 15 or more tables. Those DAGs now publish 15 separate asset events, and downstream DAGs that need all 15 tables require a compound dependency declaration. I would introduce an asset group abstraction for high-fanout producers, allowing downstream consumers to depend on “all outputs from DAG X” without enumerating every table.
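One possible shape for such an abstraction is a small registry that expands a producer DAG id into its table-level asset names. This is a sketch of the idea only: the registry, the function name, and the table names are all hypothetical, and Airflow itself is not involved in this snippet.

```python
# Hypothetical registry: producer DAG id -> table-level assets it publishes.
ASSET_GROUPS = {
    "dag_x": ["mart.orders", "mart.customers", "mart.refunds"],
}


def outputs_of(producer: str, groups: dict[str, list[str]] = ASSET_GROUPS) -> list[str]:
    """Expand a producer DAG id into the sorted asset names it publishes."""
    if producer not in groups:
        raise KeyError(f"no asset group registered for {producer!r}")
    return sorted(groups[producer])


# A downstream DAG asks for "all outputs from dag_x" without listing tables.
required_assets = outputs_of("dag_x")
```

In an Airflow DAG, each name would be wrapped in an Asset and the resulting list passed as the schedule, since a list of assets triggers the DAG only once all of them have been updated. The producer would keep publishing table-level events, so consumers that need a single table are unaffected.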

Finally, I would build the parallel validation framework before starting migration, not during Phase 2. Running old and new versions simultaneously was essential for confidence, but I built the comparison tooling ad-hoc for each DAG. A reusable validation harness that automatically compared cron-triggered and event-triggered outputs would have saved approximately 2 weeks of migration effort.
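The core of such a harness could be an order-independent comparison of the rows each version produced. The sketch below assumes outputs can be loaded as lists of dictionaries; the row shape and function names are hypothetical, and a real harness would also need to read from the warehouse and tolerate expected differences such as load timestamps.

```python
import hashlib
from collections import Counter


def row_fingerprint(row: dict) -> str:
    """Hash a row with its keys in sorted order, so column order is irrelevant."""
    canonical = "|".join(f"{key}={row[key]!r}" for key in sorted(row))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def compare_outputs(cron_rows: list[dict], event_rows: list[dict]) -> dict:
    """Compare two result sets as multisets of rows, ignoring row order."""
    cron = Counter(row_fingerprint(row) for row in cron_rows)
    event = Counter(row_fingerprint(row) for row in event_rows)
    return {
        "equivalent": cron == event,
        "only_in_cron": sum((cron - event).values()),
        "only_in_event": sum((event - cron).values()),
    }


# Hypothetical outputs: same rows in a different order, then one row missing.
cron_rows = [{"id": 1, "total": 10}, {"id": 2, "total": 20}]
same = compare_outputs(cron_rows, list(reversed(cron_rows)))
different = compare_outputs(cron_rows, cron_rows[:1])
```

Using a Counter rather than a set keeps duplicate rows significant, so a version that double-loads a record is reported as a mismatch.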