How data warehouse architecture works in 2026: storage and compute separated over object storage, open table formats as the warehouse interface, multi-engine query, streaming ingestion, and why the classic three-tier model no longer fits.
Most explainers on "data warehouse architecture" still draw the same picture they drew a decade ago: sources feed a staging area, ETL jobs clean and load the data, a warehouse stores it, and BI tools read from the warehouse. That diagram is not wrong so much as obsolete. Each of its boxes has been pulled apart and replaced by something else.
In 2026, a data warehouse is rarely a single storage engine. It is a query and governance interface sitting on top of open table formats, which themselves sit on commodity object storage. Storage and compute are separate services bought separately. Multiple engines read and write the same physical tables. Streaming data lands in those tables within seconds instead of waiting for a nightly batch. This post walks through that internal architecture layer by layer and explains why the three-tier model broke.
For the platform-wide view of how all of this fits together, see our companion piece on the modern data platform in 2026. If you want the category-level distinctions first, start with EDW vs data lake vs lakehouse.
Why the Classic Three-Tier Model Broke
The classical design was source -> staging -> warehouse -> BI, with the warehouse owning its own tightly coupled storage and compute. Teradata, early Redshift, and on-prem Oracle all worked this way: you sized a cluster, your data lived inside it, and adding query capacity meant adding storage you did not need (or the reverse). That coupling was the load-bearing assumption of the whole model, and it was the first thing to fall.
Three forces broke it. Cloud object storage made durable bulk storage roughly an order of magnitude cheaper than warehouse-attached disk, so keeping a second full copy of data inside the warehouse stopped making economic sense. Open table formats gave object storage the ACID transactions, schema evolution, and snapshot isolation that previously only a warehouse could offer. And the rise of streaming and machine learning workloads meant the same datasets had to be readable by SQL engines, Spark jobs, and Python notebooks at once, which a closed single-engine warehouse cannot do.
In a 2026 data warehouse architecture, the "warehouse" is no longer a storage engine. It is a semantic, query, and governance layer over open table formats stored in object storage, accessed by multiple compute engines that each scale independently of the data.
The result is that the tiers did not disappear. They moved. Staging became a streaming ingestion layer. The warehouse storage tier became object storage plus a table format. The compute tier became a fleet of interchangeable engines. The BI tier got a semantic layer in front of it. The boxes are still there; the walls between them are gone.
The Foundation: Object Storage and the Storage-Compute Split
Separation of storage and compute is the architectural decision everything else depends on. Data sits in cloud object storage (Amazon S3, Google Cloud Storage, Azure Data Lake Storage), and compute clusters are stateless services that read from and write to that storage over the network. You scale them independently. A reporting team can spin up a small warehouse, a backfill job can spin up a huge one, and both hit the same bytes without copying anything.
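To make the split concrete, here is a minimal sketch in plain Python. `ObjectStore` and `ComputeCluster` are hypothetical stand-ins invented for illustration, not any vendor's API: the store is a shared key-to-bytes map, and a cluster is nothing more than a worker count plus a reference to that store.

```python
from concurrent.futures import ThreadPoolExecutor


class ObjectStore:
    """Toy stand-in for S3/GCS/ADLS: a flat key -> bytes map shared by all clusters."""

    def __init__(self):
        self._objects = {}

    def put(self, key, data):
        self._objects[key] = data

    def get(self, key):
        return self._objects[key]

    def list_keys(self, prefix):
        return sorted(k for k in self._objects if k.startswith(prefix))


class ComputeCluster:
    """Stateless compute: a worker count plus a reference to storage, and no data."""

    def __init__(self, name, workers, store):
        self.name = name
        self.workers = workers
        self.store = store

    def scan_bytes(self, prefix):
        # Workers fetch objects from the shared store; nothing is copied into the cluster.
        keys = self.store.list_keys(prefix)
        with ThreadPoolExecutor(max_workers=self.workers) as pool:
            return sum(pool.map(lambda k: len(self.store.get(k)), keys))


store = ObjectStore()
for i in range(8):
    store.put(f"sales/part-{i:03d}.parquet", b"x" * 1024)

reporting = ComputeCluster("reporting", workers=2, store=store)
backfill = ComputeCluster("backfill", workers=16, store=store)

# Different cluster sizes, same bytes: both see the identical 8 KiB dataset.
print(reporting.scan_bytes("sales/"), backfill.scan_bytes("sales/"))  # 8192 8192
```

Resizing or discarding either cluster changes nothing about the data, which is the property the rest of the architecture builds on.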
Snowflake popularized this pattern commercially with its three-layer model of a storage layer, independent virtual warehouses for compute, and a cloud services layer for metadata and optimization. The open lakehouse generalizes the same idea but makes the storage layer non-proprietary: instead of Snowflake's internal micro-partition format, the data sits in open Parquet files that any engine can read. We covered the commercial version of this split in detail in our Databricks vs Snowflake 2026 comparison.
The trade-off is latency. Object storage is high-throughput but high-latency compared to local NVMe, so engines lean hard on caching, columnar formats, and predicate pushdown to compensate. For analytical scans this is a good deal. For single-row point lookups it is not, which is one reason transactional (OLTP) systems stay separate from analytical (OLAP) ones. That distinction is worth understanding on its own and is covered in OLTP vs OLAP in 2026.
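One way engines avoid paying object-storage latency is to skip files whose column statistics cannot match the query filter. The sketch below assumes each file carries min/max values for a timestamp column; `FileStats` and `prune_files` are illustrative names, not part of any engine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileStats:
    path: str
    min_ts: int
    max_ts: int


def prune_files(files, lo, hi):
    """Keep only files whose [min_ts, max_ts] range overlaps the query range [lo, hi]."""
    return [f for f in files if f.max_ts >= lo and f.min_ts <= hi]


files = [
    FileStats("part-000.parquet", 0, 99),
    FileStats("part-001.parquet", 100, 199),
    FileStats("part-002.parquet", 200, 299),
]

# WHERE ts BETWEEN 120 AND 180 only needs one of the three files.
selected = prune_files(files, 120, 180)
print([f.path for f in selected])  # ['part-001.parquet']
```

Two of three files are never fetched, so the slow round trips to object storage are spent only where matching rows can exist.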
The Contract Layer: Open Table Formats as the Warehouse Interface
A pile of Parquet files in a bucket is a data lake, and historically that meant no transactions, no safe concurrent writes, and a query planner that had to list every file to know what existed. Open table formats fixed this by adding a metadata layer that turns those files into a real table.
An open table format is a specification and metadata layer (Apache Iceberg, Delta Lake, or Apache Hudi) that sits over data files in object storage and provides ACID transactions, schema and partition evolution, time travel, and a manifest of which files belong to a table at a given snapshot, so that independent engines can read and write the same table safely.
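The snapshot idea in that definition can be sketched in a few lines. This is a toy model, not the Iceberg, Delta, or Hudi metadata layout: `ToyTable` and `Snapshot` are hypothetical names, and each snapshot simply records the full set of data files that make up the table at that point.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    files: frozenset  # immutable set of data file paths in this table version


class ToyTable:
    def __init__(self):
        self.snapshots = [Snapshot(0, frozenset())]

    @property
    def current(self):
        return self.snapshots[-1]

    def commit(self, added=(), removed=()):
        # A commit never edits an old snapshot; it appends a new one.
        files = (self.current.files - frozenset(removed)) | frozenset(added)
        snap = Snapshot(self.current.snapshot_id + 1, files)
        self.snapshots.append(snap)
        return snap

    def files_at(self, snapshot_id):
        # Time travel: look up the file set recorded for an earlier snapshot.
        for snap in self.snapshots:
            if snap.snapshot_id == snapshot_id:
                return snap.files
        raise KeyError(snapshot_id)


table = ToyTable()
table.commit(added=["a.parquet", "b.parquet"])            # snapshot 1
table.commit(added=["c.parquet"], removed=["a.parquet"])  # snapshot 2

print(sorted(table.current.files))  # ['b.parquet', 'c.parquet']
print(sorted(table.files_at(1)))    # ['a.parquet', 'b.parquet']
```

Because snapshots are immutable, a reader that started on snapshot 1 keeps a consistent view while a writer publishes snapshot 2. Real formats store this information in metadata and manifest files rather than in memory.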
This is the layer that now plays the role the old warehouse storage engine used to play: it is the contract. Instead of "the warehouse owns the data," the table format owns the data, and the warehouse is just one of several clients. Iceberg has become the de facto interoperability standard here, and the Apache Iceberg table format post explains why its design won over the Hive table layout it replaced. The shift away from the older Hive tables approach is the through-line connecting the prior generation to this one.
The catalog completes the contract. The Apache Iceberg REST Catalog protocol standardizes how engines discover tables and commit changes, and projects like Apache Polaris (donated by Snowflake to the Apache Software Foundation in 2024) implement it as vendor-neutral catalog services. A catalog that speaks the Iceberg REST API is what lets Trino, Spark, and Snowflake agree on what a table is.
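The reason independent engines can commit safely is that the catalog acts as a single point of atomic compare-and-swap. The sketch below shows only that idea, assuming a simple integer version per table; `ToyCatalog` and its methods are hypothetical, and the real REST protocol carries richer commit requirements and metadata updates than this.

```python
import threading


class CommitConflict(Exception):
    """Raised when the table changed since the writer last loaded it."""


class ToyCatalog:
    def __init__(self):
        self._lock = threading.Lock()
        self._versions = {}  # table name -> current metadata version

    def load(self, table):
        return self._versions.get(table, 0)

    def commit(self, table, expected_version):
        # Atomic compare-and-swap: succeed only if nobody committed in between.
        with self._lock:
            actual = self._versions.get(table, 0)
            if actual != expected_version:
                raise CommitConflict(
                    f"{table}: expected v{expected_version}, found v{actual}"
                )
            self._versions[table] = actual + 1
            return actual + 1


catalog = ToyCatalog()
base_a = catalog.load("db.orders")  # engine A reads version 0
base_b = catalog.load("db.orders")  # engine B reads the same base

print(catalog.commit("db.orders", base_a))  # 1 (engine A wins)

try:
    catalog.commit("db.orders", base_b)     # engine B is now stale
except CommitConflict as exc:
    print("conflict:", exc)

# Engine B reloads the table, reapplies its change, and retries.
print(catalog.commit("db.orders", catalog.load("db.orders")))  # 2
```

The losing writer never corrupts the table; it sees a conflict, rebases on the new state, and tries again. That optimistic pattern is what lets several engines share one table without a shared lock manager.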
The Compute Layer: Many Engines, One Set of Tables
Once data lives in open tables behind a shared catalog, compute stops being a single product and becomes a choice per workload. The same Iceberg tables can be queried by Trino for interactive SQL, processed by Spark for heavy batch jobs, written by Flink for streaming, and scanned by ClickHouse for low-latency analytics. None of them owns the data.
The commercial platforms have conceded this. Apache Iceberg tables for Snowflake reached general availability in June 2024 with Snowflake version 8.20, letting external engines read Snowflake-managed Iceberg tables. Databricks followed: at Data + AI Summit in June 2025 it announced full Apache Iceberg support in Unity Catalog, exposing managed Iceberg tables through the Iceberg REST Catalog API so engines like Trino, Snowflake, and Amazon EMR can read them. When the two largest analytics vendors both ship Iceberg interoperability, single-engine lock-in is no longer the default. Our practical guide to ClickHouse and Iceberg integration shows what this looks like for a fast OLAP engine reading the lakehouse directly.
Picking engines per workload has a cost: governance and observability now span multiple systems instead of one. Permissions, lineage, and cost attribution have to be enforced at the catalog and storage layer rather than inside a single product, which is harder and a frequent source of production gaps.
Streaming as a First-Class Input
In the old model, freshness was a batch-window problem: data was as current as the last nightly or hourly load. The 2026 architecture treats streaming as a primary ingestion path, not an afterthought. Events flow from Kafka through a stream processor (commonly Apache Flink) and land directly in Iceberg tables, where the same tables serve both the streaming writes and downstream analytical reads.
This collapses the old staging tier. Change Data Capture (CDC) streams from operational databases replay inserts, updates, and deletes into lakehouse tables continuously, so the warehouse reflects source-system state within seconds rather than hours. Iceberg's snapshot model and row-level deletes are what make this safe: a streaming writer can commit small files frequently while readers see consistent snapshots, and a background compaction job merges the small files later. The pairing is common enough that we wrote a dedicated guide on Flink and Iceberg for modern data lakes.
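The interaction between CDC replay and snapshot reads can be sketched with plain dictionaries. This is a deliberately simplified model: the event shape (`op`, `key`, `row`) is an assumption made for illustration, and a real table format writes new data and delete files per commit instead of copying the whole table state.

```python
def apply_batch(state, events):
    """Return a NEW table state with one micro-batch of change events applied in order."""
    new_state = dict(state)  # copy, so earlier snapshots are never mutated
    for event in events:
        op, key = event["op"], event["key"]
        if op in ("insert", "update"):
            new_state[key] = event["row"]
        elif op == "delete":
            new_state.pop(key, None)
        else:
            raise ValueError(f"unknown op: {op!r}")
    return new_state


snapshots = [{}]  # snapshot 0: empty table

batch_1 = [
    {"op": "insert", "key": 1, "row": {"status": "new"}},
    {"op": "insert", "key": 2, "row": {"status": "new"}},
]
snapshots.append(apply_batch(snapshots[-1], batch_1))

reader_view = snapshots[-1]  # a reader pins the snapshot current at query start

batch_2 = [
    {"op": "update", "key": 1, "row": {"status": "paid"}},
    {"op": "delete", "key": 2},
]
snapshots.append(apply_batch(snapshots[-1], batch_2))

print(snapshots[-1])  # {1: {'status': 'paid'}}
print(reader_view)    # {1: {'status': 'new'}, 2: {'status': 'new'}}
```

The writer commits twice in quick succession, yet the in-flight reader still sees exactly the state it started with. Ordering matters here: replaying the update and delete out of order against the source would produce a different table, which is why CDC pipelines preserve per-key order.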
The pitfall is file fragmentation. High-frequency streaming commits produce many small files and large delete manifests, which slowly degrade read performance until maintenance is run. Compaction, snapshot expiration, and manifest rewriting are not optional in a streaming lakehouse; they are operational requirements, as we cover in Iceberg table maintenance best practices.
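Compaction planning is essentially bin packing: group small files so each rewritten file lands near a target size. The sketch below is a greedy first-fit-decreasing heuristic written for illustration only; `plan_compaction` is a hypothetical helper, the sizes are made-up megabyte values, and production engines ship their own maintenance procedures with more sophisticated planning.

```python
def plan_compaction(file_sizes, target_size):
    """Group files smaller than target_size into rewrite groups of at most target_size."""
    small = sorted((s for s in file_sizes if s < target_size), reverse=True)
    groups = []
    for size in small:
        for group in groups:
            if sum(group) + size <= target_size:
                group.append(size)  # first group with room takes the file
                break
        else:
            groups.append([size])   # no group had room, so start a new one
    # A group of one file would be rewritten into itself, so drop those.
    return [g for g in groups if len(g) > 1]


# Sizes in MB; the 512 MB file is already large enough and is left alone.
sizes = [100, 90, 30, 20, 10, 512]
print(plan_compaction(sizes, target_size=128))  # [[100, 20], [90, 30]]
```

Five small files become two rewrite jobs, and the leftover 10 MB file waits for the next pass when more small files have accumulated. Running this kind of plan on a schedule is what keeps read performance from drifting downward.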
Choosing an Architecture: Lakehouse vs Cloud DW vs HTAP
The new model is not automatically the right one for every team. The decision turns on workload mix, engine diversity, and how much platform engineering you can staff. A single-engine cloud data warehouse is still the simplest thing that works for a team doing pure SQL analytics with no streaming or ML.
| Dimension | Open Lakehouse | Single-Engine Cloud DW | HTAP / Real-Time DB |
|---|---|---|---|
| Storage | Object storage + open table format | Vendor-managed (often proprietary) | Engine-managed |
| Compute | Multiple engines, same tables | One engine | One engine |
| Best for | Mixed SQL + Spark + streaming + ML | SQL-first analytics, BI | Low-latency serving on fresh data |
| Lock-in risk | Low (open format + REST catalog) | Higher (data inside the platform) | Higher |
| Operational burden | High (compaction, multi-engine governance) | Low | Medium |
| Freshness | Seconds (with streaming) | Minutes to hours (batch) | Sub-second |
| Examples | Iceberg + Trino/Spark/Flink | Snowflake, BigQuery, Redshift | ClickHouse, Apache Pinot, SingleStore |
A practical migration from a legacy Redshift- or Snowflake-only stack is incremental, not a rewrite. Land new and high-volume datasets in Iceberg first, point a query engine at both the warehouse and the lakehouse, and move workloads table by table as the open side proves out. The warehouse keeps serving what it already serves while the open layer grows underneath it. For the broader landscape this fits into, including the 2022 prior-generation reference, see our original architectures of a modern data platform post and the updated ClickHouse vs Snowflake comparison.
Key Takeaways
- The classic three-tier model (source -> staging -> warehouse -> BI) broke because cloud object storage, open table formats, and streaming/ML workloads each dissolved one of its tightly coupled walls.
- Separation of storage and compute is the foundational decision: data lives in S3/GCS/ADLS, and stateless compute engines scale independently of it.
- Open table formats (Iceberg, Delta, Hudi) are the new warehouse interface. They provide ACID transactions and schema evolution over object storage, and the table plus its catalog, not the engine, owns the data.
- Multiple engines now read and write the same tables. Snowflake (Iceberg GA in June 2024) and Databricks (full Iceberg support in Unity Catalog, June 2025) both expose Iceberg interop, eroding single-engine lock-in.
- Streaming via Kafka and Flink into Iceberg gives seconds-fresh data and collapses the old staging tier, at the cost of mandatory table maintenance (compaction, snapshot expiration).
- The lakehouse is not free. Multi-engine governance, observability, and operational upkeep are real costs; a single-engine cloud DW remains the simpler choice for SQL-only analytics.
Picking the right shape for your workloads, and migrating without a risky big-bang cutover, is exactly the kind of work BigData Boutique does with data teams. If you are weighing a lakehouse move or untangling a multi-engine stack, reach out.