A data lakehouse is a data architecture that combines the cheap, scalable storage of a data lake with the transactional guarantees, schema enforcement, and query performance that were previously exclusive to data warehouses. Instead of maintaining separate systems for raw data storage and structured analytics, a lakehouse uses open table formats like Apache Iceberg or Delta Lake on top of object storage to deliver both in a single layer.
The term gained traction around 2020 when Databricks formalized it, but the concept grew out of necessity. Organizations were running two expensive parallel systems -- a data lake for cheap storage and flexible schema, and a data warehouse for BI and SQL analytics -- with complex ETL pipelines shuttling data between them. The lakehouse eliminates that duplication.
How a Data Lakehouse Works
A lakehouse isn't a single product you install. It's an architecture pattern built from several layers:
- Object storage -- S3, GCS, ADLS, or HDFS hold all the raw data as files, usually Parquet or ORC. This is the cheapest tier and scales effectively without limit.
- Open table format -- Apache Iceberg, Delta Lake, or Apache Hudi adds a metadata layer on top of those files. This layer provides ACID transactions, schema evolution, time travel, and partition management -- the features that make raw files behave like managed database tables.
- Catalog -- AWS Glue, Apache Polaris (Iceberg REST Catalog), Unity Catalog, or Hive Metastore tracks table locations and metadata. Engines discover tables through the catalog.
- Query engines -- Spark for batch processing, Trino or Dremio for interactive SQL, Apache Flink for streaming, ClickHouse for real-time analytics. Multiple engines read and write the same tables concurrently.
- Governance and access control -- fine-grained permissions, data lineage, and audit logging sit across the stack.
The practical result: analysts run SQL queries with warehouse-like performance, data engineers ingest raw data at lake-like cost, and machine learning teams access the same data without copies or exports. One copy of data, many consumers.
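The core mechanism behind this is small enough to sketch. The toy class below (all names invented for illustration; real formats like Iceberg track manifests, schemas, and statistics through a catalog) shows the essential idea of an open table format: data files are immutable, and a commit is an atomic swap of a pointer to a new snapshot, which is also what makes time travel nearly free.

```python
class ToyLakehouseTable:
    """Illustrative only: immutable data files plus a snapshot pointer.

    Real table formats (Iceberg, Delta Lake, Hudi) are far richer; this
    sketch keeps just the core idea -- a commit publishes a new snapshot
    and atomically moves the 'current' pointer to it.
    """

    def __init__(self):
        self.data_files = {}   # file name -> rows (stand-in for Parquet on S3)
        self.snapshots = [[]]  # snapshot id -> list of file names
        self.current = 0       # the "catalog pointer" to the live snapshot

    def append(self, filename, rows):
        # Writers stage new immutable files, then commit by publishing
        # a new snapshot that includes them.
        self.data_files[filename] = rows
        new_snapshot = self.snapshots[self.current] + [filename]
        self.snapshots.append(new_snapshot)
        self.current = len(self.snapshots) - 1  # the atomic pointer swap

    def scan(self, snapshot=None):
        # Readers resolve the pointer once, then read immutable files,
        # so concurrent readers always see a consistent snapshot.
        sid = self.current if snapshot is None else snapshot
        return [row for f in self.snapshots[sid] for row in self.data_files[f]]

table = ToyLakehouseTable()
table.append("part-0001.parquet", [{"id": 1}, {"id": 2}])
table.append("part-0002.parquet", [{"id": 3}])
print(table.scan())            # all three rows from the latest snapshot
print(table.scan(snapshot=1))  # time travel: the state after the first commit
```

Because old snapshots are never mutated, a reader mid-query is unaffected by a concurrent commit, and querying yesterday's state is just reading an older snapshot id.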
Data Lakehouse vs Data Warehouse
The distinction matters when choosing an architecture:
| | Data Lakehouse | Data Warehouse |
|---|---|---|
| Storage | Open formats (Parquet, ORC) on object storage | Proprietary format, tightly coupled to compute |
| Cost | Low -- object storage pricing, pay-per-query compute | High -- bundled storage + compute licensing |
| Schema | Schema-on-read and schema-on-write | Schema-on-write only |
| Data types | Structured, semi-structured, unstructured | Structured only |
| Engine lock-in | None -- multiple engines read the same tables | Tied to the warehouse vendor |
| ACID transactions | Yes (via table format) | Yes (native) |
| Real-time ingestion | Streaming ingestion with Flink, Spark Streaming | Limited, often micro-batch |
| Maturity | Newer, evolving rapidly | Decades of optimization, mature tooling |
A warehouse still makes sense when your team is small, queries are mostly BI/SQL, and you want fully managed infrastructure with minimal operational overhead. A lakehouse makes sense when you need to support multiple workloads (BI, ML, streaming, ad-hoc exploration) on the same data, when vendor lock-in is a concern, or when storage costs at warehouse scale are unsustainable.
Data Lake vs Data Lakehouse
A data lake is storage without structure. It holds files -- Parquet, CSV, JSON, images, logs -- in a directory-like layout on object storage. There's no transaction management, no schema enforcement, and no guarantees about consistency. Two jobs writing to the same path can corrupt each other silently.
A data lakehouse adds the missing reliability layer. By introducing an open table format, the same object storage now supports:
- ACID transactions -- concurrent reads and writes don't corrupt data
- Schema enforcement and evolution -- columns have types, and you can safely add, rename, or drop them
- Time travel -- query data as it existed at any previous point
- Efficient file pruning -- metadata-level statistics skip irrelevant files without scanning them
The files are still Parquet on S3. The storage cost is identical. The difference is the metadata layer that makes those files usable as a managed analytical platform instead of a dumping ground.
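File pruning is the easiest of these features to see in miniature. In the sketch below (invented field names, not any real format's layout), each data file carries min/max statistics for a column, so a query can prove from metadata alone that a file holds no matching rows and skip fetching it -- the same idea table formats implement with manifest-level column statistics.

```python
# Illustrative sketch of metadata-based file pruning. Each "data file"
# record stands in for a Parquet file on object storage, with min/max
# statistics that a real table format would keep in manifest metadata.
data_files = [
    {"path": "part-0001.parquet", "min_ts": 100, "max_ts": 199, "rows": [100, 150, 199]},
    {"path": "part-0002.parquet", "min_ts": 200, "max_ts": 299, "rows": [210, 250]},
    {"path": "part-0003.parquet", "min_ts": 300, "max_ts": 399, "rows": [305, 390]},
]

def scan_where_ts_at_least(files, lower_bound):
    """Return matching rows, reading only files whose stats overlap the predicate."""
    matched, files_read = [], 0
    for f in files:
        if f["max_ts"] < lower_bound:
            continue      # pruned: metadata alone proves no row can match
        files_read += 1   # only now would we actually fetch the file from S3
        matched.extend(r for r in f["rows"] if r >= lower_bound)
    return matched, files_read

rows, files_read = scan_where_ts_at_least(data_files, 250)
print(rows)        # [250, 305, 390]
print(files_read)  # 2 -- the first file was skipped without being read
```

On a table with thousands of files, this is the difference between scanning terabytes and scanning only the handful of files a predicate can touch.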
Key Technologies Behind the Lakehouse
Apache Iceberg is the most broadly adopted open table format for lakehouse architectures. It supports the widest range of engines (Spark, Trino, Flink, Athena, BigQuery, Snowflake, Dremio, ClickHouse) and offers hidden partitioning, partition evolution, and the open REST Catalog standard. Its vendor-neutral governance model under the Apache Software Foundation makes it the default choice for multi-engine and multi-cloud setups.
Delta Lake is Databricks' open-source table format. It is a strong choice if you're building on Databricks, though it has historically been more tightly coupled to that ecosystem. Unity Catalog provides governance.
Apache Hudi specializes in record-level upserts and change data capture (CDC) pipelines. It excels when your workload is heavy on incremental updates rather than append-heavy analytics.
All three solve the same core problem. The choice usually follows your engine ecosystem -- Iceberg for broad compatibility, Delta Lake for Databricks-centric stacks, Hudi for CDC-heavy pipelines.
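The record-level upsert semantics that Hudi specializes in can be sketched as a copy-on-write merge: incoming records replace existing records that share a key, and the result is written as a new file version rather than mutating the old one. The function below is a toy illustration of that merge behavior, not Hudi's actual API.

```python
def copy_on_write_upsert(existing, incoming, key="id"):
    """Merge incoming records into existing ones by key, producing a new
    'file version' instead of mutating the old one (copy-on-write)."""
    merged = {r[key]: r for r in existing}  # index the current file version
    for record in incoming:
        merged[record[key]] = record        # update-or-insert per key
    return sorted(merged.values(), key=lambda r: r[key])

v1 = [{"id": 1, "status": "new"}, {"id": 2, "status": "new"}]
changes = [{"id": 2, "status": "shipped"}, {"id": 3, "status": "new"}]
v2 = copy_on_write_upsert(v1, changes)
print(v2)  # id 2 updated, id 3 inserted; v1 itself is untouched
```

Because the old version survives unchanged, readers of an earlier snapshot are unaffected by the upsert -- the same snapshot-isolation property the table formats provide for appends.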
When to Build a Data Lakehouse
A lakehouse architecture pays off when:
- You're running both a data lake and a data warehouse and paying to move data between them
- Multiple teams need the same data in different engines (SQL analytics, ML training, streaming)
- Storage costs in your warehouse are growing faster than query volume
- You want to avoid vendor lock-in on your core analytical data
- You need to support streaming ingestion alongside batch analytics
It's not the right fit if your workload is purely BI dashboards on well-structured data with a small team -- a managed warehouse like Snowflake or BigQuery will be simpler to operate.
BigDataBoutique and Data Lakehouse Architecture
We design, build, and optimize data lakehouse architectures for production workloads. Our team has deep experience with Apache Iceberg, Apache Flink, ClickHouse, and the broader data engineering stack. Whether you're migrating from a legacy warehouse, consolidating lake and warehouse into a lakehouse, or building a greenfield analytics platform, we can help with architecture design, implementation, and ongoing optimization.
See our data engineering consulting and Databricks consulting services, or get in touch to discuss your architecture.