A vendor-neutral reference architecture for the modern data platform: the functional layers every team needs - ingestion, storage, processing, serving, orchestration, governance, observability - and the real open and managed options at each one.
Most "modern data platform" diagrams are sales tools. A vendor draws seven boxes, and surprise: every box maps to a product they sell. That is not architecture. It is a pricing sheet with arrows.
This post takes the opposite stance. A modern data platform is a set of functional layers, not a list of products. If you understand the layers and what each one has to do, you can assemble a platform from open-source components, managed services, or a mix, and you can swap any single piece without rebuilding the rest. The job here is to give you that layer map and the real options at each layer, framed for the engineer who has to operate it - not for the vendor who wants to sell it.
This is the 2026 view. We published a prior-generation architecture map in 2022, and the shape of the stack has shifted since: streaming is the default ingestion mode, open table formats won the storage argument, and governance plus observability moved from "nice to have" to load-bearing. We will reference the older piece as a companion rather than repeat it.
The seven functional layers
A modern data platform is the set of functional layers that move data from where it is produced to where it is consumed, while keeping it correct, governed, and observable. Strip away brand names and every credible architecture exposes the same seven responsibilities: ingestion and change capture, storage and table format, processing and transformation, serving and query, orchestration, governance, and observability.
The layers are responsibilities, not servers. One tool can cover several layers - ClickHouse, for example, is both storage and a serving query engine - and one layer can be split across several tools. The value of thinking in layers is that it forces you to ask "is this responsibility covered?" before you ask "which product?". Industry write-ups tend to collapse this into five pillars (ingestion, storage, processing, consumption, governance), as in dataforest's 2026 state-of-architecture report. That is fine as a marketing summary. For engineers who carry the pager, orchestration and observability deserve to stand on their own, because they are where platforms quietly fail.
Here is the reference architecture as a layer map. Read it as "for each row, pick at least one tool; you do not need every column."
| Layer | What it has to do | Representative open / self-hosted | Representative managed |
|---|---|---|---|
| Ingestion + CDC | Get data in, capture row-level changes from source databases | Debezium, Kafka Connect, Airbyte, Apache NiFi | Fivetran, AWS DMS, Confluent Cloud, Estuary |
| Storage + table format | Durable, queryable storage with schema, snapshots, time travel | Apache Iceberg, Delta Lake, Apache Hudi on object storage | Snowflake, Databricks, BigQuery, S3 Tables |
| Processing + transformation | Reshape, join, and model raw data into trusted tables | Spark, Flink, dbt Core, SQLMesh | dbt Cloud, Databricks, Confluent Flink, EMR |
| Serving + query | Answer analytical and operational queries with low latency | ClickHouse, Trino, DuckDB, Apache Druid | ClickHouse Cloud, Athena, BigQuery, Snowflake |
| Orchestration | Schedule, sequence, retry, and backfill pipelines | Apache Airflow, Dagster, Prefect | Astronomer, Dagster+, MWAA, Google Cloud Composer |
| Governance | Catalog, lineage, access control, classification | DataHub, OpenMetadata, Apache Polaris, Unity Catalog OSS | Unity Catalog, Snowflake Horizon, Atlan, Collibra |
| Observability | Detect freshness, volume, and quality regressions before users do | OpenLineage, Great Expectations, Elementary | Monte Carlo, Metaplane, Bigeye |
A tool appearing in two rows is not a mistake. Convergence is the defining trend of this generation: storage engines grew query layers, query engines learned to write open table formats, and catalogs absorbed lineage and observability. The boundaries are blurring, which makes the layer model more useful, not less - it tells you what a converged tool is actually doing under the hood.
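To make the "is this responsibility covered?" question concrete, here is a minimal sketch of a coverage check. The layer names follow the table above; the example stack and its tool-to-layer assignments are illustrative assumptions, not a recommendation.

```python
from typing import Dict, List, Set

# The seven responsibilities from the layer map, in canonical order.
LAYERS = [
    "ingestion", "storage", "processing", "serving",
    "orchestration", "governance", "observability",
]


def uncovered_layers(stack: Dict[str, Set[str]]) -> List[str]:
    """Return the layers that no tool in the stack claims, in canonical order."""
    covered: Set[str] = set()
    for layers in stack.values():
        covered |= layers
    return [layer for layer in LAYERS if layer not in covered]


# Hypothetical stack: one tool may span several layers.
example_stack = {
    "Debezium": {"ingestion"},
    "ClickHouse": {"storage", "serving"},
    "dbt Core": {"processing"},
    "Airflow": {"orchestration"},
}

print(uncovered_layers(example_stack))
```

For this example stack the check reports governance and observability as uncovered, which is exactly the kind of gap the layer map is meant to surface before tool selection starts.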
Storage and the table format question
Storage is where the 2022-to-2026 shift is most visible. The argument over open table formats is effectively settled. An open table format is a metadata specification that turns a directory of Parquet files in object storage into a real table with schema evolution, ACID transactions, snapshot isolation, and time travel, readable by any compliant engine. Apache Iceberg, Delta Lake, and Apache Hudi are the three that matter.
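The snapshot and time-travel idea is easier to see in a toy model. The sketch below is not how Iceberg, Delta Lake, or Hudi lay out their metadata (real formats use manifests, statistics, and an atomic pointer swap); `ToyTable` and `Snapshot` are hypothetical names that only illustrate the principle that each commit produces a new immutable list of data files.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    files: Tuple[str, ...]  # data files visible in this snapshot


@dataclass
class ToyTable:
    snapshots: List[Snapshot] = field(default_factory=list)

    def commit(self, add: Iterable[str] = (), remove: Iterable[str] = ()) -> Snapshot:
        """Create a new snapshot; earlier snapshots are never mutated."""
        current = self.snapshots[-1].files if self.snapshots else ()
        removed = set(remove)
        files = tuple(f for f in current if f not in removed) + tuple(add)
        snapshot = Snapshot(len(self.snapshots) + 1, files)
        self.snapshots.append(snapshot)
        return snapshot

    def files_as_of(self, snapshot_id: Optional[int] = None) -> Tuple[str, ...]:
        """Time travel: list the files a reader would scan at a given snapshot."""
        if not self.snapshots:
            return ()
        if snapshot_id is None:
            return self.snapshots[-1].files
        for snapshot in self.snapshots:
            if snapshot.snapshot_id == snapshot_id:
                return snapshot.files
        raise KeyError(f"unknown snapshot {snapshot_id}")


table = ToyTable()
table.commit(add=["a.parquet", "b.parquet"])
table.commit(add=["c.parquet"], remove=["a.parquet"])  # e.g. a compaction or delete

print(table.files_as_of(1))  # state after the first commit
print(table.files_as_of())   # current state
```

Because a reader resolves one snapshot and scans only the files it lists, concurrent writers do not disturb it, which is the intuition behind snapshot isolation in the real formats.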
What changed is that the major warehouses stopped fighting it. In 2025 Databricks shipped full Apache Iceberg support through Unity Catalog's Iceberg REST Catalog API, meaning any REST-spec client - Spark, Flink, Trino - can read and write the same managed tables. Snowflake exposes the same surface through its Iceberg tables and the Polaris catalog. The practical result: your table format is now a more durable decision than your query engine. Pick the format deliberately, and engines become swappable clients against it. If you are weighing the two front-runners, we cover the trade-offs in Apache Iceberg vs Delta Lake.
The other half of storage is the hot path. A lakehouse on Iceberg is excellent for large historical scans and cheap retention, but a Parquet-on-S3 query that takes two seconds is a non-starter behind a user-facing dashboard. That is where a columnar serving engine earns its place. The common 2026 pattern is two tiers: an open table format for the durable, governed, everything-lives-here layer, and a fast columnar store (ClickHouse, Druid) for the sub-second hot data that powers product analytics and real-time dashboards. The lakehouse is the source of truth; the serving store is a materialized, queryable projection of the slice that needs to be fast. If you are still deciding between the warehouse, lake, and lakehouse shapes for this layer, we compare them head to head in EDW vs data lake vs lakehouse. For background on the lakehouse pattern itself, see what is a data lakehouse and the medallion architecture.
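One way to picture the two-tier split is as a routing rule. The sketch below is an assumption-laden illustration: the 30-day hot retention window, the `interactive` flag, and the tier names are all hypothetical, and a real platform would usually encode this in the application or a semantic layer rather than a single function.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the serving store keeps a rolling 30-day projection of the lakehouse.
HOT_RETENTION = timedelta(days=30)


def choose_tier(range_start: datetime, now: datetime, interactive: bool) -> str:
    """Send a query to the serving store only if it is interactive and its
    time range falls entirely inside the hot projection."""
    hot_cutoff = now - HOT_RETENTION
    if interactive and range_start >= hot_cutoff:
        return "serving_store"
    # Everything else reads the source of truth.
    return "lakehouse"


now = datetime(2026, 1, 31, tzinfo=timezone.utc)
print(choose_tier(now - timedelta(days=7), now, interactive=True))
print(choose_tier(now - timedelta(days=90), now, interactive=True))
```

The useful property is the fallback: anything the hot tier cannot answer still has a correct, if slower, home in the lakehouse.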
Ingestion: streaming is the default, batch is the special case
The biggest mental shift since the last generation is that streaming stopped being the exotic option. The honest framing for 2026 is that batch is a subset of streaming - a bounded stream you happen to process on a schedule. Most new pipelines start from a continuous event log and add batching only where latency genuinely does not matter.
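The "batch is a bounded stream" framing can be shown with plain Python iterators. In this sketch the same `running_totals` function handles a finite list and an endless generator; `event_stream` is a hypothetical stand-in for a real log consumer, and the event shape is an assumption.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator


def running_totals(events: Iterable[Dict]) -> Iterator[Dict[str, float]]:
    """Emit the updated per-user totals after every event, bounded or not."""
    totals: Dict[str, float] = {}
    for event in events:
        totals[event["user"]] = totals.get(event["user"], 0.0) + event["amount"]
        yield dict(totals)


def event_stream() -> Iterator[Dict]:
    """Hypothetical unbounded source, standing in for a Kafka consumer."""
    i = 0
    while True:
        yield {"user": f"u{i % 2}", "amount": 1.0}
        i += 1


# Batch: a bounded input, consumed to the end.
batch = [
    {"user": "u0", "amount": 5.0},
    {"user": "u1", "amount": 2.0},
    {"user": "u0", "amount": 1.0},
]
batch_result = list(running_totals(batch))[-1]

# Streaming: the same logic over an unbounded input, observed after four events.
stream_result = list(islice(running_totals(event_stream()), 4))[-1]

print(batch_result, stream_result)
```

The processing logic does not change between the two cases; only the point at which you stop reading does.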
Three ingestion patterns cover almost everything, and most real platforms run all three:
- Change data capture (CDC). Stream row-level inserts, updates, and deletes out of operational databases by reading the transaction log. Debezium into Kafka is the open-source standard; AWS DMS and Fivetran are the managed equivalents. CDC is how you get OLTP data into the analytics plane without hammering production with polling queries, a split we examine in depth in OLTP vs OLAP in 2026.
- Event streaming. Application and clickstream events land directly on Kafka or a Kafka-compatible log (Redpanda, Confluent Cloud), then fan out to storage and stream processors.
- Batch and zero-ETL. Periodic loads for legacy systems and slow-moving reference data, plus the managed "zero-ETL" replication paths cloud vendors now offer between their own services.
The pitfall here is treating CDC as fire-and-forget. CDC changelogs are not append-only event streams; a downstream consumer has to interpret update and delete semantics correctly or it will silently double-count. As Gunnar Morling documents in his walkthrough of ingesting Debezium events with Flink SQL, the connector and format together decide whether events are read as an append-only stream or a changelog - and getting that wrong corrupts every aggregate downstream. Design your ingestion contract before you pick the tool. For the broader picture, our primer on what a data pipeline is sets the foundations, and ETL process optimization covers tuning the batch paths you do keep.
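The double-counting failure is easy to reproduce. The sketch below uses simplified events shaped like Debezium's change event envelope (`op`, `before`, `after`); real events carry more fields, and the `id` and `amount` columns are assumptions for illustration. `naive_total` reads the changelog as if it were append-only, while `changelog_total` materializes the latest row per key before aggregating.

```python
from typing import Dict, List


def naive_total(events: List[Dict]) -> float:
    """Wrong: treats every change event as a brand-new fact."""
    return sum(e["after"]["amount"] for e in events if e["after"] is not None)


def changelog_total(events: List[Dict]) -> float:
    """Apply upsert and delete semantics per primary key, then aggregate."""
    current: Dict[int, Dict] = {}
    for e in events:
        if e["op"] == "d":
            current.pop(e["before"]["id"], None)
        else:  # "c" (create), "u" (update), "r" (snapshot read)
            current[e["after"]["id"]] = e["after"]
    return sum(row["amount"] for row in current.values())


events = [
    {"op": "c", "before": None, "after": {"id": 1, "amount": 100.0}},
    {"op": "u", "before": {"id": 1, "amount": 100.0}, "after": {"id": 1, "amount": 120.0}},
    {"op": "c", "before": None, "after": {"id": 2, "amount": 50.0}},
    {"op": "d", "before": {"id": 2, "amount": 50.0}, "after": None},
]

print(naive_total(events))      # counts the updated and deleted rows again
print(changelog_total(events))  # reflects the table as it stands now
```

On these four events the naive reading reports 270.0 while the table actually holds a single row worth 120.0. Nothing fails loudly, which is why the contract has to be settled before the tool is chosen.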
Processing, serving, and orchestration
These three layers are where data becomes usable, and they have consolidated hard around SQL.
Processing and transformation is dominated by declarative SQL modeling. dbt established the pattern - version-controlled SQL models, tests, generated docs - and SQLMesh extended it with column-level lineage and cheaper environment management. The consolidation became literal in late 2025: Fivetran and dbt Labs announced a merger, and Fivetran's earlier acquisition of Tobiko Data sent SQLMesh to the Linux Foundation. dbt Core remains Apache-2.0 licensed. The notable 2026 development is that the same dbt-style workflow now reaches streaming: adapters let you define Flink SQL views and streaming tables over Kafka topics as dbt models, so the skills built on Snowflake or BigQuery transfer to the real-time side instead of living in a separate Terraform-and-console world. For a concrete streaming-first stack, see our Kafka, Flink, and ClickHouse blueprint.
Serving and query is the layer your users actually touch. The split is between interactive analytics over the lakehouse (Trino and Athena, plus DuckDB for local and embedded work) and low-latency serving from a columnar engine (ClickHouse, Druid). The decision is latency-driven: if a human is waiting on the query inside an application, you want a purpose-built columnar store, not a lakehouse scan.
Orchestration ties it together. The question an orchestrator answers is not "what runs" but "what runs after what, what happens when it fails, and how do I backfill three months of history without a war room." Airflow remains the incumbent; Dagster competes on an asset-centric model and Prefect on a Pythonic developer experience. Pick based on whether your team thinks in tasks (Airflow) or in data assets (Dagster). Either way, this layer is non-optional the moment you have more than a handful of dependent jobs.
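The three questions an orchestrator answers (ordering, failure handling, backfill) fit in a few lines of standard-library Python. This is a sketch of the concepts, not of how Airflow, Dagster, or Prefect implement them; the task names and the daily partitioning are hypothetical.

```python
from datetime import date, timedelta
from graphlib import TopologicalSorter
from typing import Callable, Dict, Iterator, List, Set


def run_order(deps: Dict[str, Set[str]]) -> List[str]:
    """'What runs after what': each key maps a task to its upstream tasks."""
    return list(TopologicalSorter(deps).static_order())


def run_with_retry(task: Callable[[], None], max_attempts: int = 3) -> int:
    """'What happens when it fails': retry, then surface the last error.
    Returns the attempt number that succeeded."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be >= 1")
    for attempt in range(1, max_attempts + 1):
        try:
            task()
            return attempt
        except Exception:
            if attempt == max_attempts:
                raise
    raise AssertionError("unreachable")


def backfill_partitions(start: date, end: date) -> Iterator[date]:
    """'How do I backfill': enumerate the daily partitions to re-run, inclusive."""
    day = start
    while day <= end:
        yield day
        day += timedelta(days=1)


deps = {
    "load_orders": set(),
    "model_orders": {"load_orders"},
    "publish_dashboard": {"model_orders"},
}
print(run_order(deps))
print(len(list(backfill_partitions(date(2026, 1, 1), date(2026, 3, 31)))))
```

A real orchestrator adds scheduling, state persistence, concurrency limits, and alerting on top, but the dependency graph and the partitioned backfill are the parts worth modeling explicitly from the start.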
Governance and observability: the load-bearing layers
These two layers are what separate a platform from a pile of pipelines, and they are where most teams under-invest until an incident forces the issue.
Governance is the catalog, lineage, access control, and classification that let people find data and trust it. The open standard worth building around is OpenLineage, which defines how jobs across Airflow, Spark, Flink, and dbt emit structured lineage events as they run. Catalogs like DataHub and OpenMetadata consume those events and build navigable lineage graphs. The 2026 angle is that catalogs are becoming the context layer for AI agents too: DataHub now exposes its metadata graph over the Model Context Protocol so AI tools can read governed context and write back tags, descriptions, and ownership. Governance is no longer paperwork; it is the substrate that both humans and agents query.
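To show what a catalog does with lineage events, here is a sketch that turns run events into dataset-to-dataset edges. The event dictionaries are trimmed to the OpenLineage fields used here (`eventType`, `job`, `inputs`, `outputs`); a spec-compliant event carries more, and the namespaces, job names, and run IDs below are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def dataset_id(dataset: Dict[str, str]) -> str:
    return f'{dataset["namespace"]}/{dataset["name"]}'


def lineage_edges(events: List[Dict]) -> Set[Tuple[str, str, str]]:
    """Build (input dataset, job, output dataset) edges from completed runs."""
    edges: Set[Tuple[str, str, str]] = set()
    for event in events:
        if event["eventType"] != "COMPLETE":
            continue  # only trust lineage from runs that finished
        job = f'{event["job"]["namespace"]}.{event["job"]["name"]}'
        for source in event.get("inputs", []):
            for target in event.get("outputs", []):
                edges.add((dataset_id(source), job, dataset_id(target)))
    return edges


# Hypothetical events, trimmed for illustration.
events = [
    {
        "eventType": "START",
        "eventTime": "2026-01-31T02:00:00Z",
        "run": {"runId": "0195b7a0-0000-7000-8000-000000000001"},
        "job": {"namespace": "analytics", "name": "model_orders"},
        "inputs": [{"namespace": "shop_db", "name": "public.orders"}],
        "outputs": [],
    },
    {
        "eventType": "COMPLETE",
        "eventTime": "2026-01-31T02:05:00Z",
        "run": {"runId": "0195b7a0-0000-7000-8000-000000000001"},
        "job": {"namespace": "analytics", "name": "model_orders"},
        "inputs": [{"namespace": "shop_db", "name": "public.orders"}],
        "outputs": [{"namespace": "warehouse", "name": "analytics.orders_daily"}],
    },
]

print(lineage_edges(events))
```

The value of the standard is that every tool emits the same shape, so one small consumer like this can stitch Airflow, Spark, Flink, and dbt runs into a single graph.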
Observability is the monitoring layer for data itself - freshness, volume, schema drift, and quality, detected before a stakeholder notices a dashboard went stale. Open tools (Great Expectations, Elementary, the OpenLineage spine) and managed platforms (Monte Carlo, Metaplane) cover the same job: catch the regression at the pipeline, not in the boardroom. The practice of treating data pipelines with the same operational rigor as production software is what DataOps is about. If you take one thing from this section: instrument lineage and quality from day one. Retrofitting observability onto a platform that has already lost the trust of its users is far more expensive than building it in.
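Freshness and volume checks are simple enough to sketch directly. The thresholds, the trailing-mean baseline, and the function names below are assumptions for illustration; the tools named above use richer statistics, but the shape of the check is the same.

```python
from datetime import datetime, timedelta, timezone
from statistics import mean
from typing import List


def is_stale(last_loaded_at: datetime, now: datetime, max_lag: timedelta) -> bool:
    """Freshness: has the table gone longer than max_lag without a load?"""
    return now - last_loaded_at > max_lag


def volume_anomaly(history: List[int], today: int, tolerance: float = 0.5) -> bool:
    """Volume: flag today's row count if it deviates from the trailing mean
    by more than `tolerance`, expressed as a fraction of that mean."""
    baseline = mean(history)
    if baseline == 0:
        return today != 0
    return abs(today - baseline) / baseline > tolerance


now = datetime(2026, 1, 31, 9, 0, tzinfo=timezone.utc)
last_load = datetime(2026, 1, 31, 2, 0, tzinfo=timezone.utc)

print(is_stale(last_load, now, max_lag=timedelta(hours=6)))  # seven hours since load
print(volume_anomaly([1000, 1100, 900], today=400))          # 60% below baseline
```

Run checks like these as a pipeline step, so a failure blocks the publish instead of surfacing later on a dashboard.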
Right-sizing: what to skip
The full seven-layer architecture is what a platform looks like at scale. It is not what a five-person data team should build on day one. The most common failure mode is a startup assembling an enterprise stack - a separate tool per layer, a dedicated catalog, a streaming bus - for a workload a single Postgres replica and dbt could serve.
A reasonable progression:
- Early stage (one data engineer, sub-terabyte, a handful of analysts). Managed warehouse or a single columnar engine, dbt for transformation, scheduled loads. Skip the streaming bus, the standalone catalog, and the dedicated observability vendor. Lineage from dbt and warehouse-native access controls are enough.
- Growth stage (a platform team, multi-terabyte, real-time needs emerging). Introduce an open table format so you are not locked to one engine, add CDC and a streaming path for the workloads that need it, and stand up a real catalog and observability layer.
- Enterprise (central platform plus embedded teams, petabyte-scale, regulatory exposure). All seven layers, deliberately. Governance and observability become mandatory here, partly because regulation now requires them - the EU AI Act's 2026 enforcement cycle expects organizations to show that data used to train or prompt AI systems is documented and auditable.
The layer model helps precisely because it tells you what you are skipping. You are not omitting a feature; you are deferring a responsibility you have consciously decided you do not yet need.
Key takeaways
- A modern data platform is seven functional layers, not seven products: ingestion and CDC, storage and table format, processing, serving, orchestration, governance, and observability. Cover each responsibility; let one tool span several layers where it makes sense.
- The table format is now your most durable storage decision. With Databricks, Snowflake, and the open engines all converging on the Iceberg REST catalog, query engines became swappable clients against an open format you control.
- Streaming is the default ingestion mode, with CDC, event streaming, and batch as the three patterns. Treat CDC changelogs as changelogs, not append-only streams, or your aggregates will silently corrupt.
- Processing consolidated on declarative SQL (dbt, SQLMesh), and that workflow now reaches streaming via Flink SQL adapters. Serving is latency-driven: lakehouse scans for analytics, a columnar engine for sub-second product queries.
- Governance and observability are load-bearing, not optional. Build lineage (OpenLineage) and quality monitoring in from the start; retrofitting them after users lose trust is the expensive path.
- Right-size deliberately. Map your stage to the layers you actually need. The architecture above is the destination, not the starting line.