Real-Time and Batch Are the Same System
Why is the streaming vs. batch debate a false binary?
Streaming and batch are not different systems; they are the same computational operations observed at different timescales, and the perceived architectural gap between them is an artifact of tooling history, not a fundamental property of data processing.
Consider what a batch job does: it reads a bounded set of records, applies transformations, and writes results. Now consider what a streaming job does: it reads an unbounded set of records, applies transformations, and writes results. The only structural difference is whether the input is bounded (a file, a table partition) or unbounded (a Kafka topic, a change stream). The transformations are identical. The output semantics are identical. The difference is when results are materialized.
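Here is a minimal Python sketch of that claim. The record shape (`amount_cents`, `refunded`) and the function names are hypothetical, chosen only for illustration. The transformation is written once against a plain iterable, and only the caller decides whether that iterable is a finite list or a generator that never ends.

```python
from itertools import count, islice
from typing import Iterable, Iterator


def net_revenue(records: Iterable[dict]) -> Iterator[int]:
    """Yield the amount in cents of each non-refunded order.

    Written against a plain iterable, so it neither knows nor cares
    whether the input is bounded or unbounded.
    """
    for record in records:
        if not record["refunded"]:
            yield record["amount_cents"]


# Bounded input: stands in for a file or a table partition.
bounded = [
    {"amount_cents": 1000, "refunded": False},
    {"amount_cents": 2500, "refunded": True},
    {"amount_cents": 400, "refunded": False},
]


def unbounded() -> Iterator[dict]:
    """Endless source: stands in for a Kafka topic or a change stream."""
    for i in count():
        yield {"amount_cents": 100 * (i + 1), "refunded": i % 2 == 1}


# Batch: results are materialized once, after the input is exhausted.
batch_total = sum(net_revenue(bounded))

# Streaming: results are materialized as records arrive; here we take three.
stream_first_three = list(islice(net_revenue(unbounded()), 3))
```

The same `net_revenue` function serves both callers. What changes is when the caller chooses to materialize a result, not what the transformation computes.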
I maintained parallel implementations of the same revenue calculation for 14 months: a batch version that ran nightly in dbt and a streaming version that ran continuously in Flink. The SQL logic was 89% identical. The remaining 11% was windowing syntax. Two codebases, two deployment pipelines, two monitoring systems, two on-call rotations, for the same calculation at different latencies.
What would a unified architecture look like?
A unified architecture treats temporal granularity as a parameter: the same transformation logic runs against micro-batches for near-real-time use cases, daily partitions for analytical use cases, and weekly windows for reporting, with latency determined by configuration, not code.
Apache Beam proposed this with its unified batch-and-streaming model. Delta Live Tables in Databricks approaches it with tables that can be configured for triggered (batch) or continuous (streaming) execution. Apache Iceberg’s incremental reads enable the same table to be consumed as a batch source or a change stream. The building blocks exist. The mental model is what lags.
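To make "temporal granularity as a parameter" concrete, here is a small sketch in plain Python rather than in any of the tools above. The function name and the sample events are assumptions for illustration. The aggregation logic is fixed, and the window width is the only thing that varies between the near-real-time and the daily view.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def revenue_by_window(events, window: timedelta) -> dict:
    """Sum amounts into tumbling windows of the given width.

    `events` is an iterable of (timestamp, amount) pairs. Windows are
    aligned to the Unix epoch, so a given width always produces the
    same boundaries.
    """
    totals = defaultdict(int)
    for ts, amount in events:
        # Floor the timestamp to the start of its window.
        index = (ts - EPOCH) // window  # timedelta // timedelta is an int
        totals[EPOCH + index * window] += amount
    return dict(totals)


events = [
    (datetime(2024, 1, 1, 0, 0, 10, tzinfo=timezone.utc), 500),
    (datetime(2024, 1, 1, 0, 0, 40, tzinfo=timezone.utc), 700),
    (datetime(2024, 1, 2, 9, 0, 0, tzinfo=timezone.utc), 300),
]

# Same logic, two granularities: only the parameter changes.
near_real_time = revenue_by_window(events, timedelta(seconds=30))
daily = revenue_by_window(events, timedelta(days=1))
```

Both calls account for the same total revenue. The 30-second view spreads it over three windows, while the daily view spreads it over two.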
I collapsed the dual revenue pipeline into a single implementation using Spark Structured Streaming on Databricks with trigger configurations. The “batch” version uses trigger(availableNow=True), processing all available records in one pass and then stopping. The “streaming” version uses trigger(processingTime="30 seconds"), materializing results every 30 seconds. Same code. Same tests. Same monitoring. Different trigger parameter.
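A sketch of what that switch might look like follows. It is not the production code: the mode names, the `TRIGGERS` mapping, the target table, and the checkpoint path are all hypothetical. The keyword arguments are the ones accepted by PySpark's `DataStreamWriter.trigger`, and `availableNow` assumes Spark 3.3 or later.

```python
# Hypothetical mapping from a deployment-level latency setting to
# keyword arguments for PySpark's DataStreamWriter.trigger().
TRIGGERS = {
    "batch": {"availableNow": True},                # drain what is available, then stop
    "streaming": {"processingTime": "30 seconds"},  # run a micro-batch every 30 seconds
}


def trigger_options(mode: str) -> dict:
    """Return trigger kwargs for a mode, failing loudly on unknown modes."""
    try:
        return TRIGGERS[mode]
    except KeyError:
        raise ValueError(
            f"unknown mode {mode!r}; expected one of {sorted(TRIGGERS)}"
        ) from None


def start_revenue_query(revenue_df, mode: str, target_table: str, checkpoint_dir: str):
    """Start the one shared query; only the trigger differs between modes.

    `revenue_df` is assumed to be a streaming DataFrame that already
    carries the shared transformation logic. The table name and the
    checkpoint location are placeholders supplied by the caller.
    """
    return (
        revenue_df.writeStream
        .option("checkpointLocation", checkpoint_dir)
        .trigger(**trigger_options(mode))
        .toTable(target_table)
    )
```

Keeping the mode-to-trigger mapping in one place means the latency decision lives in configuration, and the transformation code never branches on it.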
Where does the distinction still matter?
The distinction matters at infrastructure boundaries: message brokers, exactly-once guarantees, and state management impose real constraints that configuration alone cannot erase, but these are operational concerns, not architectural ones.
Kafka introduces ordering guarantees and consumer group semantics that batch file processing doesn’t need. Streaming state management (windows, watermarks, late-arriving data handling) adds complexity that batch processing avoids by assuming complete data. These are real differences. But they are differences in operational machinery, not in the computational logic that data teams care about. The question is whether those operational concerns should drive architecture or be abstracted behind it.
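The late-data problem is the easiest of these to show in a few lines. The sketch below is a deliberately simplified watermark, not how any particular engine implements one: real systems typically advance watermarks per micro-batch and tie them to window state. The function name and the sample arrivals are assumptions for illustration.

```python
from datetime import datetime, timedelta


def split_on_watermark(events, allowed_lateness: timedelta):
    """Partition events into accepted and late, in arrival order.

    The watermark trails the largest event time seen so far by
    `allowed_lateness`; an event older than the watermark is late.
    """
    accepted, late = [], []
    max_event_time = None
    for event_time, amount in events:
        if max_event_time is not None and event_time < max_event_time - allowed_lateness:
            late.append((event_time, amount))
            continue
        accepted.append((event_time, amount))
        if max_event_time is None or event_time > max_event_time:
            max_event_time = event_time
    return accepted, late


t0 = datetime(2024, 1, 1, 12, 0, 0)
arrivals = [  # listed in arrival order, not event-time order
    (t0, 100),
    (t0 + timedelta(seconds=60), 200),
    (t0 + timedelta(seconds=10), 50),  # 50 s behind the newest event: late
    (t0 + timedelta(seconds=45), 25),  # 15 s behind: within tolerance
]

accepted, late = split_on_watermark(arrivals, timedelta(seconds=30))
streaming_total = sum(amount for _, amount in accepted)

# A batch run over the complete data never has to make this decision.
batch_total = sum(amount for _, amount in arrivals)
```

The summing logic is the same in both totals. The gap between them comes entirely from the operational decision about how long to wait for stragglers.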
The streaming vs. batch debate persists because tooling vendors benefit from the distinction (it doubles the addressable market) and because engineers form identities around their specialization (“I’m a streaming engineer”). But the data doesn’t care whether it’s processed in a tight loop or a daily sweep. It cares about correctness, and correctness is independent of temporal granularity. The interesting question is not “batch or stream?” but “what latency does this use case require, and what is the simplest architecture that delivers it?”