A vendor-neutral guide to the architectural patterns behind modern ETL pipelines - batch, streaming, micro-batch, and CDC - with a decision table for choosing the right one per data flow.
The "batch versus streaming" debate is mostly settled in the field, and the answer is "both." Walk into any data platform that has survived a few years of real traffic and you will find a nightly batch job feeding a warehouse, a Kafka topic feeding a real-time dashboard, and a change-data-capture stream keeping a replica in sync, all running side by side. The interesting question in 2026 is not which pattern wins. It is which pattern you should apply to a given data flow, and how to stop the patterns from multiplying into a maintenance nightmare.
This post is about architectural patterns, not tools. We will name tools as examples, but the point is the shape of the pipeline: where the latency comes from, where the cost lands, and what breaks at scale. If you want a tool-by-tool breakdown, that is a different post. Here we care about the decision: given a source, a freshness requirement, and a budget, which pattern do you reach for?
The Core Patterns: Batch, Streaming, Micro-Batch, CDC
Four patterns cover almost everything you will build. They differ less in what they do than in when and how often they move data.
Batch ETL processes data in scheduled, bounded chunks - an hourly or nightly job that reads a fixed window, transforms it, and writes the result. It is the oldest pattern and still the default for analytics that tolerate hours of staleness. Batch is cheap because you amortize fixed overhead (cluster startup, query planning, file listing) across a large volume of rows, and you run infrastructure only when the job runs.
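A minimal sketch of the bounded-window idea, assuming events arrive as `(timestamp, key, amount)` tuples. The function name `run_batch` and the sample data are hypothetical. The point is that a batch job reads a fixed, half-open window and can be rerun for the same window with the same result.

```python
from datetime import datetime, timedelta

def run_batch(events, window_start, window_end):
    """Aggregate one bounded window [window_start, window_end) of events."""
    totals = {}
    for ts, key, amount in events:
        # Half-open window: an event at exactly window_end belongs to the next run.
        if window_start <= ts < window_end:
            totals[key] = totals.get(key, 0) + amount
    return totals

# Hypothetical sample events: (timestamp, region, amount)
events = [
    (datetime(2026, 1, 1, 23, 59), "eu", 10),
    (datetime(2026, 1, 2, 0, 0), "eu", 5),
    (datetime(2026, 1, 2, 13, 30), "us", 7),
    (datetime(2026, 1, 3, 0, 0), "us", 99),
]

day = datetime(2026, 1, 2)
daily_totals = run_batch(events, day, day + timedelta(days=1))
print(daily_totals)
```

Only the two events inside the window for 2 January are counted; the events on either boundary fall into the previous and next runs.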
Streaming ETL processes each event, or small groups of events, as they arrive, with no fixed schedule. A stream processor like Apache Flink or Kafka Streams consumes from a log such as a Kafka topic, applies transformations, and emits results continuously. Latency is measured in milliseconds to seconds. The cost is a process that runs forever, plus the operational weight of stateful stream processing - checkpointing, watermarks, exactly-once sinks.
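To make the "stateful" part concrete, here is a toy event-at-a-time processor with a checkpoint. It is a sketch under simplifying assumptions (an in-memory list stands in for the log, and the class `RunningCount` is hypothetical), not how Flink or Kafka Streams implement checkpointing. It shows why state and read position must be saved together: after a crash, restoring both lets the processor resume without double-counting.

```python
import copy

class RunningCount:
    """Per-event processor that keeps a running count per event type."""

    def __init__(self):
        self.counts = {}
        self.offset = 0  # position of the next unread event in the log

    def process(self, event):
        self.counts[event] = self.counts.get(event, 0) + 1
        self.offset += 1
        return event, self.counts[event]

    def checkpoint(self):
        # State and offset are snapshotted together so they stay consistent.
        return copy.deepcopy({"counts": self.counts, "offset": self.offset})

    def restore(self, snapshot):
        self.counts = dict(snapshot["counts"])
        self.offset = snapshot["offset"]

log = ["click", "view", "click", "click"]  # hypothetical event log

proc = RunningCount()
emitted = [proc.process(e) for e in log[:2]]
snapshot = proc.checkpoint()

proc.process(log[2])       # progress made after the checkpoint...
proc = RunningCount()      # ...is lost in a simulated crash and restart
proc.restore(snapshot)

# Resume from the checkpointed offset; the third event is processed again,
# but against the restored state, so the final counts are still correct.
emitted += [proc.process(e) for e in log[proc.offset:]]
final_counts = proc.counts
```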
Micro-batch is the pragmatic middle. The engine buffers events for a short interval, then processes them as a tiny batch. Spark Structured Streaming is the canonical example: data arriving within a trigger interval is grouped into one batch, and you tune the interval from sub-second to minutes. Because multiple records are processed together, fixed overheads are amortized and the engine can use vectorized execution, while still keeping end-to-end latency low. The Spark documentation cites end-to-end latencies as low as 100 milliseconds for its micro-batch engine, with exactly-once guarantees via checkpointing and write-ahead logs (Spark Structured Streaming docs).
CDC-driven ETL is incremental by construction. Instead of re-reading a source table, it reads the database's transaction log and emits only the rows that changed. This is the most efficient way to keep a downstream copy fresh, because the work is proportional to the change volume, not the table size. Log-based CDC platforms deliver changes with sub-second to single-digit-second latency, while batch-oriented connectors typically surface them in minutes (Streamkap CDC tools comparison, 2026). CDC has its own production failure modes (schema drift, log retention, snapshot consistency) that deserve their own treatment. Our Debezium production CDC patterns guide goes deep on those, and our data migration tools compared roundup covers the connectors that implement this pattern.
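A minimal sketch of the consuming side, assuming change events have already been read from the log in commit order. The event shape (`op`, `key`, `row`) is a hypothetical simplification for illustration, not the envelope of any particular CDC tool. The loop touches one replica row per change, so the work scales with the number of changes rather than the size of the table.

```python
def apply_changes(replica, changes):
    """Apply log-ordered change events to a replica keyed by primary key."""
    for change in changes:
        op, key = change["op"], change["key"]
        if op in ("insert", "update"):
            # Upsert: the event carries the full new row image.
            replica[key] = change["row"]
        elif op == "delete":
            replica.pop(key, None)
        else:
            raise ValueError(f"unknown op: {op}")
    return replica

# Hypothetical replica state and change feed
replica = {1: {"name": "Ada"}, 2: {"name": "Bob"}}
changes = [
    {"op": "update", "key": 2, "row": {"name": "Robert"}},
    {"op": "insert", "key": 3, "row": {"name": "Cy"}},
    {"op": "delete", "key": 1},
]

apply_changes(replica, changes)
print(replica)
```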
Definition - Micro-batch processing: a stream-processing model that buffers incoming events for a short, fixed trigger interval and then processes the accumulated events as one small batch, trading a small amount of latency for the throughput and cost benefits of batched, vectorized execution.
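The definition can be illustrated with a toy bucketing function. This is a sketch only: it assumes events carry an arrival time in seconds and groups them after the fact, whereas a real engine fires on a timer as data arrives. The name `micro_batches` and the sample events are hypothetical.

```python
from collections import defaultdict

def micro_batches(events, trigger_interval_s):
    """Group (arrival_seconds, payload) events into trigger-interval buckets."""
    buckets = defaultdict(list)
    for ts, payload in events:
        # Floor division assigns each event to the interval it arrived in.
        buckets[int(ts // trigger_interval_s)].append(payload)
    # Emit non-empty batches in interval order, one per trigger.
    return [buckets[i] for i in sorted(buckets)]

events = [(0.1, "a"), (0.4, "b"), (1.2, "c"), (1.9, "d"), (3.0, "e")]
batches = micro_batches(events, trigger_interval_s=1.0)
print(batches)
```

With a one-second trigger the five events become three batches. Widening the interval produces fewer, larger batches, which is the latency-for-throughput trade the definition describes.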
How to Choose: A Decision Table
The choice is workload-driven, not ideological. Match the pattern to the freshness the business actually needs and the budget you actually have. A dashboard that humans read every morning does not need sub-second data, and paying for a 24/7 stream processor to feed it is waste.
| Pattern | Typical latency | Operational complexity | Cost profile | Best fit |
|---|---|---|---|---|
| Batch | Minutes to hours | Low | Cheap; pay only while jobs run | Periodic analytics, reporting, large historical reprocessing |
| Micro-batch | Seconds to minutes | Medium | Moderate; always-on but amortized | Near-real-time dashboards, frequent incremental loads |
| Streaming | Milliseconds to seconds | High | Higher; 24/7 stateful processes | Fraud detection, alerting, live features, event-driven systems |
| CDC | Sub-second to seconds (log-based) | Medium to high | Efficient at high change volume; cost scales with change rate | Database replication, keeping warehouses and search indexes in sync |
Two practical rules fall out of this table. First, freshness you do not need is freshness you pay for twice - once in infrastructure and once in operational toil. Second, cost does not track latency linearly. CDC can be cheaper and fresher than a full batch reload, because it moves only the deltas. The reverse also holds: if your data changes slowly or hourly updates are fine, a scheduled sync is often simpler and cheaper than standing up streaming infrastructure (Streamkap, 2026).
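The table can be read as a small decision function. The sketch below is deliberately crude: the thresholds (1 second, 15 minutes) and the function name `choose_pattern` are assumptions chosen for illustration, and a real decision would also weigh budget, change rate, and team experience.

```python
def choose_pattern(max_staleness_s, replicating_database=False):
    """Pick a pattern from the freshness the consumer can actually tolerate."""
    if replicating_database:
        # Keeping a copy of a database in sync is the CDC row of the table.
        return "cdc"
    if max_staleness_s < 1:
        return "streaming"
    if max_staleness_s < 15 * 60:  # assumed cutoff, tune per platform
        return "micro-batch"
    return "batch"

print(choose_pattern(0.2))                           # fraud check
print(choose_pattern(30))                            # near-real-time dashboard
print(choose_pattern(24 * 3600))                     # morning report
print(choose_pattern(5, replicating_database=True))  # search index sync
```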
The Hybrid Reality: Lambda, Kappa, and Why Teams Run Both
For years the reference design for "we need both fast and complete" was the Lambda architecture: a batch layer reprocessing all historical data for correctness, a speed layer handling fresh events with low latency, and a serving layer merging the two. It works, but it forces you to write and maintain the same transformation logic twice - once in batch, once in streaming - and to keep them consistent. That duplication is the pattern's defining tax.
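A minimal sketch of the serving-layer merge, assuming both layers produce per-key counts. The function `serve` and the sample views are hypothetical. Note what the sketch leaves out: the counting logic itself, which in a Lambda design exists once in the batch job and once in the streaming job.

```python
def serve(batch_view, speed_view):
    """Merge a complete-but-stale batch view with fresh speed-layer increments."""
    merged = dict(batch_view)
    for key, delta in speed_view.items():
        merged[key] = merged.get(key, 0) + delta
    return merged

batch_view = {"eu": 100, "us": 250}  # counts up to the last batch run
speed_view = {"us": 3, "apac": 1}    # counts since the last batch run

merged_view = serve(batch_view, speed_view)
print(merged_view)
```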
Kappa architecture was the response. Treat everything as a stream, append all events to an immutable log, and run a single stream-processing codebase. Need to reprocess history? Replay the log. One pipeline, one set of logic, no merge layer. Kappa removes the duplicated code that makes Lambda painful (RisingWave, 2026).
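The replay idea fits in a few lines. This sketch assumes an in-memory list as the immutable log and a hypothetical transform `to_dollars`; in a real system the log would be something like a Kafka topic with enough retention to replay from. Live processing and historical reprocessing call the same function and differ only in the starting offset.

```python
def process(log, transform, from_offset=0):
    """Run one transformation over the log, starting at any offset."""
    return [transform(event) for event in log[from_offset:]]

def to_dollars(event):
    # The single copy of the business logic.
    return round(event["amount_cents"] / 100, 2)

log = [{"amount_cents": 1250}, {"amount_cents": 300}, {"amount_cents": 99}]

live = process(log, to_dollars, from_offset=2)         # newest event only
reprocessed = process(log, to_dollars, from_offset=0)  # full replay of history
```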
In practice, most mature 2026 platforms land somewhere in between, and the dominant shape is not pure Lambda or pure Kappa. It is a streaming or micro-batch path for real-time views plus a lakehouse (Iceberg, Delta, or Hudi) holding the full history for analytics and reprocessing (Flexera, 2026). The streaming engine handles freshness; the table format handles correctness, time travel, and cheap historical scans. You get one source of truth in the lake without maintaining two divergent transformation codebases. Our Kafka, Flink, and ClickHouse blueprint walks through one concrete version of this hybrid.
The other half of the hybrid reality is ELT-in-warehouse. Because cloud warehouses can transform large volumes cheaply after loading, the default for a new analytics build in 2026 is to load raw data first and transform it in-place with a tool like dbt (Integrate.io, 2026). This flips the classic ETL order. You extract and load with a movement tool, then push transformation logic down into Snowflake, BigQuery, or Databricks. It is operationally simpler than a separate transform tier, but it relocates the cost: every transformation now burns warehouse compute. We cover the broader stack this sits inside in our modern data platform guide.
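The load-then-transform order can be sketched with Python's built-in `sqlite3` module standing in for the warehouse. That substitution is an assumption made to keep the example runnable; the table names and rows are hypothetical. Raw rows are landed untouched, and the transformation runs afterwards as SQL inside the engine, which is where the compute cost moves.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Load: land the raw rows as-is, with no transformation on the way in.
conn.execute(
    "CREATE TABLE raw_orders (id INTEGER, status TEXT, amount_cents INTEGER)"
)
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [(1, "paid", 1250), (2, "refunded", 300), (3, "paid", 750)],
)

# Transform: executed inside the engine, after loading.
conn.execute(
    """
    CREATE TABLE paid_revenue AS
    SELECT COUNT(*) AS orders, SUM(amount_cents) AS revenue_cents
    FROM raw_orders
    WHERE status = 'paid'
    """
)

paid_summary = conn.execute(
    "SELECT orders, revenue_cents FROM paid_revenue"
).fetchone()
print(paid_summary)
```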
Where the Cost Actually Lands
The reliable surprise on every pipeline cost review is the same: moving data is cheap, transforming it is not. With ELT, transformation runs as warehouse queries, and warehouse compute is the line item that grows. Teams routinely default to running dbt on oversized warehouses; one report notes a 15-minute run on an XLARGE Snowflake warehouse can cost roughly 16x the same run on the smallest size (Snowflake's hourly credit rate doubles with each size step, so the gap to SMALL at equal runtime is 8x), and that combining incremental models with warehouse right-sizing can cut spend 30-60% without losing speed (Medium / Manik Hossain, 2026).
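The size arithmetic is easy to check. A minimal sketch, assuming Snowflake's standard credit table (1 credit per hour for XSMALL, doubling with each size step) and equal runtime on every size. Price per credit varies by contract, so the example stays in credits, and the helper `run_credits` is hypothetical.

```python
# Credits per hour by warehouse size (standard warehouses; rate doubles per step).
CREDITS_PER_HOUR = {"XSMALL": 1, "SMALL": 2, "MEDIUM": 4, "LARGE": 8, "XLARGE": 16}

def run_credits(size, minutes):
    """Credits consumed by one run of the given length on the given size."""
    return CREDITS_PER_HOUR[size] * minutes / 60

xlarge = run_credits("XLARGE", 15)
small = run_credits("SMALL", 15)
xsmall = run_credits("XSMALL", 15)

print(xlarge / xsmall)  # multiplier versus the smallest size
print(xlarge / small)   # multiplier versus SMALL
```

At equal runtime the multiplier is 16x against XSMALL and 8x against SMALL. A larger warehouse only pays for itself when the run gets proportionally shorter.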
This is where pattern choice becomes cost engineering. Three habits matter more than the tool you pick:
- Make transforms incremental. A model that reprocesses the entire history on every run is the most common source of runaway warehouse bills. Process only new or changed rows, and the cost tracks the change rate, not the table size - the same principle that makes CDC efficient.
- Do not pay for freshness nobody reads. Streaming and micro-batch run infrastructure continuously. If the consumer is a daily report, a scheduled batch costs a fraction of an always-on stream for the same result.
- Put each transform in exactly one place. Duplicated logic across a batch path and a streaming path (the Lambda tax) doubles both compute and the bug surface. It is the single strongest argument for the lakehouse-centric hybrid.
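The first habit, incremental transforms, can be sketched with a high-watermark. This assumes every source row carries a monotonically increasing `updated_at` value and ignores late-arriving rows; the function `incremental_run`, the doubling "transform", and the sample rows are all hypothetical.

```python
def incremental_run(source_rows, target, watermark):
    """Process only rows updated after the stored watermark."""
    new_rows = [r for r in source_rows if r["updated_at"] > watermark]
    for r in new_rows:
        target[r["id"]] = r["amount"] * 2  # stand-in for the real transform
    # Advance the watermark only as far as the rows actually processed.
    new_watermark = max((r["updated_at"] for r in new_rows), default=watermark)
    return new_watermark, len(new_rows)

source = [
    {"id": 1, "updated_at": 10, "amount": 5},
    {"id": 2, "updated_at": 20, "amount": 7},
]
target = {}

watermark, processed_first = incremental_run(source, target, watermark=0)
source.append({"id": 3, "updated_at": 30, "amount": 1})
watermark, processed_second = incremental_run(source, target, watermark)
```

The second run touches one row even though the source now holds three, so the cost follows the change rate rather than the table size.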
For a deeper, diagnosis-first treatment of finding and fixing the expensive step in an existing pipeline, see our ETL process optimization field guide.
Anti-Patterns Worth Naming
A few failure modes show up often enough to call out by name.
Hidden batch in a "real-time" pipeline. A dashboard advertised as live, fed by a connector that actually polls every five minutes, is a micro-batch pipeline wearing a streaming label. This matters because you will design downstream SLAs around a freshness guarantee the source cannot meet. Know your real end-to-end latency, including the slowest hop, before you promise anything.
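A rough way to find the real number is to add up every hop. The sketch below assumes worst-case staleness per hop is its polling interval plus its processing time; apart from the five-minute poll mentioned above, the hop names and figures are hypothetical.

```python
def worst_case_staleness_s(hops):
    """Sum each hop's polling interval and processing time, in seconds."""
    return sum(h["poll_interval_s"] + h["processing_s"] for h in hops)

def hop_cost(hop):
    return hop["poll_interval_s"] + hop["processing_s"]

hops = [
    {"name": "source connector", "poll_interval_s": 300, "processing_s": 5},
    {"name": "stream processor", "poll_interval_s": 0, "processing_s": 1},
    {"name": "dashboard refresh", "poll_interval_s": 30, "processing_s": 2},
]

staleness = worst_case_staleness_s(hops)
slowest = max(hops, key=hop_cost)["name"]
print(staleness, slowest)
```

Here the stream processor contributes one second of a 338-second worst case. The polling connector sets the freshness guarantee, whatever the pipeline is called.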
Over-orchestration. Wrapping a single SQL transform in a multi-stage DAG with retries, sensors, and branching adds operational surface without adding value. Orchestration earns its keep when you have genuine cross-system dependencies, not when you are scheduling one query.
Duplicated transformations. The same business logic implemented once in Spark for the batch path and again in Flink for the streaming path will drift. When the two disagree, you get the worst debugging session of the quarter, because both pipelines are "correct" according to their own code.
Streaming for slow-moving data. Standing up Kafka, a stream processor, schema registry, and 24/7 monitoring to move a dimension table that changes twice a day is complexity you will maintain forever for freshness no one uses.
Key Takeaways
- The patterns are not competitors. Batch, micro-batch, streaming, and CDC each fit a specific freshness and cost profile, and mature platforms run several at once.
- Choose by workload: match latency to what the business reads, because freshness you do not need is paid for in both infrastructure and operational toil.
- CDC is incremental by design and is often both cheaper and fresher than a full batch reload, since the work scales with change volume, not table size.
- The 2026 hybrid default is a streaming or micro-batch path for live views plus a lakehouse for history, which sidesteps the Lambda tax of maintaining two transformation codebases.
- With ELT-in-warehouse, transformation is where the money goes. Incremental models and right-sized compute are the highest-leverage cost controls.
- Watch for hidden batch, over-orchestration, and duplicated transforms - the anti-patterns that quietly inflate cost and break trust in your freshness numbers.
Picking the right pattern per data flow is one of the higher-leverage decisions in a data platform, and it is easy to get wrong by defaulting to either extreme. If you are weighing batch against streaming for a specific workload, or untangling a pipeline that grew several patterns by accident, our data engineering team does exactly this kind of architecture work.