How to architect a modern data lake in 2026 - Apache Iceberg tables on S3 object storage, seconds-fresh streaming ingestion with Flink and Kafka, and federated REST catalogs (Polaris, Lakekeeper, Glue, Unity) that let multiple engines share one set of tables.

Data Lake Architecture in 2026: Iceberg on S3, Real-Time Ingestion, and Federated Catalogs

The phrase "data lake" used to mean a directory of Parquet files in a bucket and a Hive metastore that knew where they lived. That design still works, technically, and it still fails in all the usual ways: no atomic writes, no safe schema changes, queries that scan partitions they should skip, and a metastore that becomes a single point of contention. Building a data lake in 2026 means none of that. The architecture has consolidated around a small, stable set of decisions, and most of the hard problems now have boring, well-understood answers.

A modern data lake is Apache Iceberg tables sitting on object storage, fed by streaming ingestion, and fronted by a REST catalog that any engine can talk to. Object storage is the durable floor. Iceberg is the table abstraction that gives you transactions and time travel on top of immutable files. The catalog is the part that changed most recently, and the part most teams get wrong. This post walks through how to assemble those layers, which catalog options interoperate, how to land data with seconds of latency, and the operational work the design quietly assumes you will do.

This is the build piece. If you want the category-level distinctions first, read EDW vs data lake vs lakehouse. For the warehouse-side view of the same convergence, see data warehouse architecture in 2026. For how to organize the data once it lands, see the medallion architecture.

Object Storage Is the Foundation, Not a Detail

A modern data lake stores all table data and metadata as immutable objects in cloud object storage (Amazon S3, Google Cloud Storage, or Azure Data Lake Storage), and treats compute as a separate, stateless tier that can be scaled, replaced, or run in parallel without moving the data.

That separation is the whole point. Storage is cheap, durable, and effectively infinite; compute is expensive and bursty. Decoupling them lets you keep one physical copy of a dataset and point Trino, Spark, Flink, and a warehouse at it independently. Nobody copies data into an engine anymore just to query it.

Object storage gives you eleven nines of durability and near-linear read throughput, but it is not a filesystem, and treating it like one is where lakes go bad. Two physical details dominate performance. First, file size: aim for data files in the low-hundreds-of-megabytes range. Thousands of tiny files turn a query into thousands of HTTP GETs and balloon metadata; a handful of multi-gigabyte files kill read parallelism and make rewrites expensive. Second, layout: Iceberg's hidden partitioning means you no longer encode partition values in directory paths or in queries. Partition values are derived from column transforms and tracked in metadata, which makes it easier to avoid both the small-files explosion of over-partitioning and the full-table scans of under-partitioning.
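As a rough illustration of the file-size rule, the sketch below counts files that fall outside a healthy band and flags a partition for compaction. The 64 MiB and 1 GiB bounds, the 30% threshold, and every name in it are assumptions for illustration, not Iceberg defaults.

```python
from dataclasses import dataclass

MIB = 1024 * 1024
# Assumed bounds around "low hundreds of megabytes"; tune for your workload.
SMALL_FILE_BYTES = 64 * MIB
LARGE_FILE_BYTES = 1024 * MIB


@dataclass(frozen=True)
class FileSizeReport:
    total_files: int
    small_files: int
    large_files: int

    @property
    def small_fraction(self) -> float:
        # Share of files below the small-file bound (0.0 for an empty listing).
        return self.small_files / self.total_files if self.total_files else 0.0


def summarize_file_sizes(sizes_in_bytes: list[int]) -> FileSizeReport:
    """Count files that fall outside the assumed healthy size band."""
    small = sum(1 for size in sizes_in_bytes if size < SMALL_FILE_BYTES)
    large = sum(1 for size in sizes_in_bytes if size > LARGE_FILE_BYTES)
    return FileSizeReport(len(sizes_in_bytes), small, large)


def needs_compaction(report: FileSizeReport, max_small_fraction: float = 0.3) -> bool:
    """Flag a partition when too many of its files are small (hypothetical threshold)."""
    return report.small_fraction > max_small_fraction
```

Feeding it the data-file sizes listed in a table's metadata gives a cheap health signal per partition before any query gets slow.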

AWS made this easier in late 2024 with Amazon S3 Tables, a storage type that stores fully managed Iceberg tables and runs compaction and maintenance for you. It is a reasonable default if you are AWS-native and want the bucket to handle file optimization. The trade-off is that you give up some control over how and when that maintenance runs. On a self-managed lake you own that work, which we will come back to.

Iceberg Is the Table Layer

The table format is what turns a pile of objects into something you can safely write to from more than one job at a time. Apache Iceberg has become the default. It tracks table state through a tree of metadata files, manifest lists, and manifests that point at the actual data files, which is what gives it atomic commits, snapshot isolation, time travel, schema evolution, and partition evolution without rewriting data. Delta Lake and Apache Hudi solve the same problems; the gravity in 2026 is around Iceberg because the major catalogs and cloud vendors standardized on its REST API. For a format-level comparison see Iceberg vs Delta Lake, and for how the internals fit together, the Iceberg architecture deep dive.
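To make the metadata tree concrete, here is a toy in-memory model of it. It is a simplification, not the Iceberg implementation: each snapshot holds its manifests directly (standing in for the manifest list), and the class and field names are made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Manifest:
    data_files: tuple[str, ...]


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    manifests: tuple[Manifest, ...]  # stands in for the manifest list


@dataclass
class ToyTable:
    snapshots: dict[int, Snapshot] = field(default_factory=dict)
    current_snapshot_id: Optional[int] = None

    def append(self, new_files: list[str]) -> int:
        """Commit files by adding one manifest; existing manifests are reused, not rewritten."""
        inherited: tuple[Manifest, ...] = ()
        if self.current_snapshot_id is not None:
            inherited = self.snapshots[self.current_snapshot_id].manifests
        snapshot_id = len(self.snapshots) + 1
        new_manifest = Manifest(tuple(new_files))
        self.snapshots[snapshot_id] = Snapshot(snapshot_id, inherited + (new_manifest,))
        self.current_snapshot_id = snapshot_id
        return snapshot_id

    def scan(self, snapshot_id: Optional[int] = None) -> list[str]:
        """List data files for the current snapshot, or an older one (time travel)."""
        chosen = self.current_snapshot_id if snapshot_id is None else snapshot_id
        if chosen is None:
            return []
        return [f for m in self.snapshots[chosen].manifests for f in m.data_files]
```

Because old snapshots stay addressable and point at immutable files, reading the table as of an earlier snapshot ID is a metadata lookup, not a data restore.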

The practical payoff is concurrency. Multiple writers can commit to the same table using optimistic concurrency: each commit swaps a metadata pointer atomically, and a losing writer retries against the new snapshot. Readers always see a consistent snapshot and never observe a half-written commit. That is the property a raw Parquet-on-S3 lake cannot give you, and it is why "schema migration broke the nightly job" stops being a recurring incident.
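The commit loop is essentially compare-and-swap with retry. The sketch below models it with a toy in-process catalog. `ToyCatalog`, `CommitConflict`, and the string "metadata locations" are hypothetical stand-ins, and a real writer would also re-validate its changes against the new snapshot before retrying.

```python
import threading


class CommitConflict(Exception):
    """Raised when the metadata pointer moved since the writer read it."""


class ToyCatalog:
    def __init__(self, initial_location: str) -> None:
        self._location = initial_location
        self._lock = threading.Lock()

    def current(self) -> str:
        with self._lock:
            return self._location

    def swap(self, expected: str, new: str) -> None:
        """Atomically replace the pointer only if nobody else committed first."""
        with self._lock:
            if self._location != expected:
                raise CommitConflict(f"expected {expected}, found {self._location}")
            self._location = new


def commit_with_retry(catalog: ToyCatalog, writer_id: str, max_attempts: int = 5) -> str:
    """Rebase on the latest pointer and retry until the swap succeeds."""
    for _ in range(max_attempts):
        base = catalog.current()
        new_location = f"{base}+{writer_id}"  # pretend we wrote new metadata on top of base
        try:
            catalog.swap(base, new_location)
            return new_location
        except CommitConflict:
            continue  # another writer won; re-read and try again
    raise CommitConflict(f"{writer_id} gave up after {max_attempts} attempts")
```

The loser of a race never overwrites the winner's commit; it simply re-reads the pointer and tries again on top of it.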

What Iceberg does not do is manage itself. Every commit creates a new snapshot and may add small files; left alone, a busy table accumulates thousands of snapshots and millions of objects. Compaction, snapshot expiration, and orphan-file cleanup are not optional extras. Skip them and read latency climbs week over week until someone notices the lake is slow and nobody knows why. We cover the routine in Iceberg table maintenance best practices.
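Snapshot expiration usually combines an age cutoff with a minimum number of snapshots to keep. The function below sketches that selection rule in plain Python. It only picks candidates; the names and the `retain_last` default are illustrative assumptions, and the engine's own expiration procedure is what actually removes snapshots and unreferenced files.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class SnapshotInfo:
    snapshot_id: int
    committed_at: datetime


def select_expired(
    snapshots: list[SnapshotInfo],
    older_than: datetime,
    retain_last: int = 1,
) -> list[SnapshotInfo]:
    """Return snapshots older than the cutoff, always protecting the newest `retain_last`."""
    ordered = sorted(snapshots, key=lambda s: s.committed_at)
    # Guard against ordered[-0:], which would protect every snapshot.
    protected = {s.snapshot_id for s in ordered[-retain_last:]} if retain_last > 0 else set()
    return [
        s for s in ordered
        if s.committed_at < older_than and s.snapshot_id not in protected
    ]
```

A streaming table committing every few seconds will hit the age cutoff with thousands of candidates per day, which is why this job needs a schedule rather than an occasional manual run.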

The Catalog Layer: Federated, Not Locked In

This is the layer that changed most, and the one worth slowing down on. The catalog is the service that maps table names to their current metadata location and brokers commits. The old default, the Hive Metastore, is a Thrift service backed by a relational database; every engine integrated with it separately, and it became a bottleneck and a liability. The Iceberg REST Catalog specification replaced that model with a single OpenAPI-defined HTTP contract: implement the spec once on the server, and Spark, Flink, Trino, and PyIceberg all talk to it over the same protocol.

An Iceberg REST catalog is a catalog service that implements the Apache Iceberg REST Catalog OpenAPI specification, exposing a standard HTTP interface for creating, loading, and committing tables - so any compliant engine can read and write the same tables without an engine-specific catalog integration.
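To give a feel for what "a standard HTTP interface" means, the sketch below builds the path an engine would request to load a table. The route shape and the unit-separator encoding for multi-level namespaces reflect our reading of the spec and should be checked against the OpenAPI document for the version you deploy. The function name is hypothetical.

```python
from typing import Optional
from urllib.parse import quote

# Assumption: multi-level namespaces are joined with the 0x1F unit separator,
# which percent-encodes to %1F in the URL path.
NAMESPACE_SEPARATOR = "\x1f"


def load_table_path(
    namespace_levels: list[str],
    table: str,
    prefix: Optional[str] = None,
) -> str:
    """Build the path for loading a table from a REST catalog (sketch only)."""
    encoded_namespace = quote(NAMESPACE_SEPARATOR.join(namespace_levels), safe="")
    parts = ["v1"]
    if prefix:
        # Some catalogs hand clients a prefix (for example a warehouse name) at config time.
        parts.append(quote(prefix, safe=""))
    parts += ["namespaces", encoded_namespace, "tables", quote(table, safe="")]
    return "/" + "/".join(parts)
```

Every engine issuing the same request against the same route is what removes the need for engine-specific catalog plugins.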

"Federated" is the second shift. A federated catalog can present tables that physically live in other catalogs - Glue, Hive, Unity, or another REST endpoint - under one namespace and access model. Apache Polaris, donated to the Apache Software Foundation in August 2024 and promoted to a Top-Level Project in February 2026, can act as a catalog of catalogs and register external sources. On the cloud side, AWS shipped catalog federation in the Glue Data Catalog and a Glue Iceberg REST endpoint that lets engines reach S3 Tables and even federate to Databricks Unity Catalog for read access through the same standard API.

Pick the catalog before you pick the compute engine, because the catalog decides whether you are locked in. The comparison below covers the realistic options.

| Catalog | Type | Governance / lock-in | Best fit |
|---|---|---|---|
| Hive Metastore | Legacy, self-hosted | No federation; engine-specific quirks | Existing Hadoop estates being migrated off |
| AWS Glue Data Catalog | Managed (AWS) | REST endpoint + federation to Unity/external | AWS-native lakes, S3 Tables |
| Apache Polaris | Open source (ASF) | Vendor-neutral, federated, RBAC | Multi-engine or multi-cloud, avoiding lock-in |
| Lakekeeper | Open source | REST spec + OPA authorization, audit events | Self-hosted lakes wanting fine-grained governance |
| Databricks Unity Catalog | Managed (Databricks) | Iceberg read interop via REST; richest inside Databricks | Databricks-centric platforms |

Two open implementations are worth naming. Polaris is the vendor-neutral implementation with broad backing. Lakekeeper is a Rust-based REST catalog whose recent releases focus on authorization - OPA-based policies, audit events, and Trino rule extensions - which makes it attractive when governance is the deciding factor. Both implement the same REST spec, so an engine pointed at one can be repointed at the other with a config change rather than a rewrite.
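In practice the "config change" is a handful of connection properties. The sketch below builds two property sets that differ only in the endpoint. The `type`, `uri`, and `warehouse` keys follow PyIceberg's REST catalog configuration as we understand it; the URLs and warehouse name are hypothetical, and authentication properties are left out.

```python
def rest_catalog_properties(uri: str, warehouse: str) -> dict[str, str]:
    """Connection properties for an Iceberg REST catalog client (auth omitted)."""
    return {
        "type": "rest",
        "uri": uri,
        "warehouse": warehouse,
    }


# Hypothetical endpoints; substitute the URLs of your own deployments.
polaris = rest_catalog_properties("https://polaris.example.com/api/catalog", "analytics")
lakekeeper = rest_catalog_properties("https://lakekeeper.example.com/catalog", "analytics")

# Only the endpoint differs between the two configurations.
changed_keys = sorted(key for key in polaris if polaris[key] != lakekeeper[key])
```

With PyIceberg installed, either dictionary could be passed as keyword arguments to `load_catalog`, and the table-level code that follows would not need to change.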

Streaming Ingestion: Seconds, Not Hours

Batch ingestion into a lake is a solved, dull problem. The interesting question in 2026 is how fresh you can make the data, and the answer is seconds. Two patterns dominate, and they differ mainly in how much infrastructure you want to run.

The lighter option is the Iceberg Kafka Connect sink: a community-maintained connector that reads Kafka topics, buffers records, and commits to Iceberg on a configurable interval, with automatic table creation, schema evolution from the Schema Registry, and partition routing. It needs only a Kafka Connect cluster. The heavier and more capable option is Apache Flink, which aligns its checkpoint barriers with Iceberg snapshot commits to get exactly-once delivery, and handles CDC changelogs by writing them as equality deletes - so updates and deletes from an upstream database land correctly, not just appends. The 2025 Flink Dynamic Iceberg Sink extends this to route a single stream into many tables with schema evolution and no downtime, which removes the old one-job-per-table tax. We go deeper in Flink and Iceberg.
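The equality-delete idea is easier to see in miniature. In the toy model below, every change event gets its own sequence number, an update becomes a delete plus a new row, and a reader drops any row whose sequence number is strictly lower than a matching delete. This is a deliberate simplification (a real Flink sink commits once per checkpoint, and the names here are hypothetical), but the read-time rule mirrors how Iceberg applies equality deletes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DataRow:
    seq: int
    key: int
    value: str


@dataclass(frozen=True)
class EqualityDelete:
    seq: int
    key: int


Change = tuple[str, int, Optional[str]]  # (op, key, value); op is insert, update, or delete


def write_changelog(changes: list[Change]) -> tuple[list[DataRow], list[EqualityDelete]]:
    """Turn CDC events into appended rows plus equality deletes on the key."""
    rows: list[DataRow] = []
    deletes: list[EqualityDelete] = []
    for seq, (op, key, value) in enumerate(changes, start=1):
        if op in ("update", "delete"):
            deletes.append(EqualityDelete(seq, key))
        if op in ("insert", "update") and value is not None:
            rows.append(DataRow(seq, key, value))
    return rows, deletes


def read_table(rows: list[DataRow], deletes: list[EqualityDelete]) -> dict[int, str]:
    """Apply deletes at read time: a delete hides rows with a strictly lower sequence number."""
    latest_delete: dict[int, int] = {}
    for d in deletes:
        latest_delete[d.key] = max(d.seq, latest_delete.get(d.key, 0))
    return {r.key: r.value for r in rows if r.seq >= latest_delete.get(r.key, 0)}
```

Nothing is rewritten at write time, which keeps ingestion cheap; the cost moves to readers and compaction, which later merges the deletes into the data files.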

| Pattern | Latency | Exactly-once | CDC / upserts | Ops cost |
|---|---|---|---|---|
| Batch (Spark/Trino INSERT) | Minutes to hours | Via job idempotency | Manual MERGE | Low |
| Kafka Connect Iceberg sink | Seconds to minutes | Yes (KIP-447, Kafka 2.5+) | Append-oriented | Low (Connect cluster) |
| Flink streaming sink | Seconds | Yes (checkpoint = snapshot) | Native (equality deletes) | Higher (Flink cluster) |

The pitfall with streaming into Iceberg is the small-files problem made continuous. A sink committing every few seconds writes a steady drip of small files and a snapshot per commit. Without compaction running in the background, a streaming table degrades faster than a batch one. Treat maintenance as part of the ingestion design, not an afterthought - size your commit interval and your compaction schedule together. ClickHouse can also query Iceberg tables directly for low-latency analytics over this data; see our ClickHouse and Iceberg integration guide.
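A little arithmetic shows why commit interval and compaction have to be sized together. The sketch assumes each parallel writer emits one file per commit into an unpartitioned table, which understates the file count for partitioned tables. The function names and the 256 MiB target are illustrative.

```python
import math

SECONDS_PER_DAY = 86_400


def files_per_day(commit_interval_s: int, parallel_writers: int) -> int:
    """Files written per day if every writer emits one file per commit."""
    return (SECONDS_PER_DAY // commit_interval_s) * parallel_writers


def avg_file_size_mib(ingest_mib_per_s: float, commit_interval_s: int, parallel_writers: int) -> float:
    """Average size of each file the sink writes before compaction."""
    return ingest_mib_per_s * commit_interval_s / parallel_writers


def files_after_compaction(ingest_mib_per_s: float, target_file_mib: int = 256) -> int:
    """Files per day once the same data is rewritten at the target size."""
    return math.ceil(ingest_mib_per_s * SECONDS_PER_DAY / target_file_mib)


# Example: 5 MiB/s of ingest, 10-second commits, 4 parallel writers.
raw_files = files_per_day(commit_interval_s=10, parallel_writers=4)
raw_size = avg_file_size_mib(5.0, commit_interval_s=10, parallel_writers=4)
compacted_files = files_after_compaction(5.0)
```

Under those assumptions the sink writes 34,560 files a day at about 12.5 MiB each, while the same data compacts to roughly 1,700 files. Lengthening the commit interval shrinks the gap at the cost of freshness, and a tighter compaction schedule shrinks it at the cost of compute.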

A Reference Architecture and the Work It Hides

Put the layers together and an AWS-native lake looks like this: sources stream into Kafka; Flink or the Kafka Connect sink writes Iceberg tables to S3 (or S3 Tables); the Glue Data Catalog REST endpoint fronts those tables; and Trino, Spark, Athena, Redshift, and ClickHouse all query the same physical tables through that one catalog. A multi-cloud or vendor-neutral variant swaps Glue for Polaris or Lakekeeper and keeps everything else, which is the entire argument for the open REST spec: the catalog stops being a lock-in point.

The standard layering is five tiers - ingestion, storage, table format, catalog, and compute - with governance and lineage cutting across all of them. The compute tier is genuinely pluggable: Trino for interactive SQL, Spark for heavy ETL, Flink for streaming, ClickHouse for sub-second analytics, and external-table access from Snowflake or Redshift when a team lives in a warehouse. None of them owns the data.

What the clean diagram hides is operations, which is where lakes succeed or rot:

  • Maintenance is mandatory. Schedule compaction, snapshot expiration, and orphan-file cleanup per table. Streaming tables need it more often than batch ones.
  • Small files are the default failure mode. Over-partitioning and frequent streaming commits both produce them. Tune partition granularity and commit intervals deliberately.
  • Catalog choice is a one-way door if you pick a closed one. Standardize on the REST spec so the catalog and engines stay swappable.
  • Governance belongs in the catalog. Centralize access control, lineage, and audit at the catalog layer rather than per engine, so policy does not fork across query tools.

Key takeaways

  • A 2026 data lake is Iceberg tables on object storage, fed by streaming ingestion, fronted by a REST catalog - not raw files plus a Hive metastore.
  • Keep data files in the low-hundreds-of-megabytes range and use Iceberg hidden partitioning to avoid both small-files explosions and full scans.
  • The Iceberg REST Catalog spec is the interop layer; Polaris and Lakekeeper (open) and Glue federation give multi-engine access without lock-in, while Unity Catalog interops for reads.
  • Use the Kafka Connect Iceberg sink for simple seconds-fresh appends and Flink for exactly-once CDC and multi-table routing.
  • Budget for compaction, snapshot expiry, and orphan cleanup from day one - streaming makes the small-files problem continuous.
