A data lake is a centralized storage repository that holds large volumes of raw data -- structured, semi-structured, and unstructured -- in its native format until it's needed. Unlike a traditional data warehouse, a data lake doesn't require a predefined schema. You land data first, decide what to do with it later. That property, called schema-on-read, is what makes lakes flexible enough to serve analytics, machine learning, search, and exploratory workloads from a single store.
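To make schema-on-read concrete, here is a minimal sketch in plain Python. The sample records, the field names, and the `read_with_schema` helper are all hypothetical; the point is only that the raw lines are stored untouched and each reader applies its own schema when it reads them.

```python
import json

# Raw events as they might land in the lake: nothing is enforced at write time.
RAW_LINES = [
    '{"user_id": "42", "amount": "19.99", "country": "DE"}',
    '{"user_id": "43", "amount": "5", "extra_field": true}',
]

# A schema chosen by one reader; another reader could apply a different one.
ORDER_SCHEMA = {"user_id": int, "amount": float, "country": str}


def read_with_schema(lines, schema):
    """Parse raw JSON lines and cast fields to the reader's schema.

    Fields absent from a record become None; fields not in the schema are ignored.
    """
    rows = []
    for line in lines:
        record = json.loads(line)
        row = {}
        for name, cast in schema.items():
            value = record.get(name)
            row[name] = cast(value) if value is not None else None
        rows.append(row)
    return rows


orders = read_with_schema(RAW_LINES, ORDER_SCHEMA)
```

The second record is missing `country` and carries an unexpected `extra_field`, yet it was accepted at write time. Whether that is a feature or a problem depends on how much validation the reader does.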
The term was coined in 2010 by James Dixon, then CTO of Pentaho, who contrasted a "data mart" (bottled water, packaged for a specific use) with a "data lake" (a body of water in its natural state, available for many uses). The concept took off alongside Apache Hadoop and HDFS in the early 2010s, then shifted to cloud object storage -- Amazon S3, Google Cloud Storage, Azure Data Lake Storage -- as object storage became dramatically cheaper and more durable than on-premise clusters.
How a Data Lake Works
A data lake isn't a single product. It's an architectural pattern built from several distinct layers, each with its own technology choices.
Storage layer. This is the foundation -- cheap, durable, effectively unlimited capacity. On AWS that means S3; on Google Cloud it's GCS; on Azure it's ADLS Gen2. On-premise lakes still use HDFS, though the modern default is object storage even in private data centers (via systems like MinIO or Ceph). Data is stored as files: Parquet and ORC for analytical workloads, JSON and Avro for events, CSV and text for log-style data, plus images, audio, video, and anything else the business produces.
Ingestion layer. Data arrives from many sources -- transactional databases via change data capture, event streams via Apache Kafka or Kinesis, SaaS APIs via tools like Airbyte or Fivetran, application logs, IoT devices. Ingestion can be batch (periodic copies) or streaming (continuous), and most production lakes have both.
Cataloging and metadata layer. Without a catalog, a data lake quickly becomes a data swamp. AWS Glue Data Catalog, Apache Hive Metastore, Unity Catalog, and Apache Polaris track what tables exist, where their files live, what columns they contain, and who owns them. Modern lakes increasingly use open table formats -- Apache Iceberg, Delta Lake, Apache Hudi -- which add transactional metadata on top of the raw files.
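As a rough illustration of what a catalog records, the sketch below keeps table entries in memory. The `TableEntry` and `Catalog` classes, the bucket path, and the owner name are hypothetical and do not reflect the API of Glue, Hive Metastore, or any other real catalog.

```python
from dataclasses import dataclass, field


@dataclass
class TableEntry:
    location: str   # where the table's files live
    columns: dict   # column name -> type name
    owner: str      # team or person accountable for the table


@dataclass
class Catalog:
    tables: dict = field(default_factory=dict)

    def register(self, name, entry):
        """Add a table; refuse to silently overwrite an existing one."""
        if name in self.tables:
            raise ValueError(f"table already registered: {name}")
        self.tables[name] = entry

    def describe(self, name):
        """Return the entry for a table (raises KeyError if unknown)."""
        return self.tables[name]

    def find_by_column(self, column):
        """List tables that contain a given column, sorted by name."""
        return sorted(n for n, e in self.tables.items() if column in e.columns)


catalog = Catalog()
catalog.register(
    "sales.orders",
    TableEntry(
        location="s3://example-lake/sales/orders/",  # hypothetical bucket
        columns={"order_id": "bigint", "amount": "double"},
        owner="sales-data-team",
    ),
)
```

Even this toy version answers the questions that prevent a swamp: what exists, where it lives, what it contains, and who to ask about it.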
Processing and query layer. Multiple engines read the same files concurrently. Apache Spark and Apache Flink for distributed processing. Trino, Presto, Dremio, and Amazon Athena for interactive SQL. ClickHouse for sub-second analytics. Elasticsearch or OpenSearch for search and log analytics. The same raw data can feed all of them, and engines that read the files in place need no extra copies; search engines such as Elasticsearch or OpenSearch typically build their own index from the lake data.
Governance, security, and access control. IAM policies, bucket-level encryption (KMS), row- and column-level security via Lake Formation or Unity Catalog, audit logging, and lineage tracking via tools like OpenLineage or DataHub.
Data Lake vs Data Warehouse
The two architectures answer different questions, and most mature organizations end up running both -- or consolidating onto a lakehouse.
| | Data Lake | Data Warehouse |
|---|---|---|
| Data types | Structured, semi-structured, unstructured | Primarily structured, with some semi-structured support |
| Schema | Schema-on-read | Schema-on-write |
| Storage cost | Very low (object storage) | Higher (vendor-managed storage) |
| Compute | Decoupled, bring your own engine | Tightly coupled to vendor engine |
| Best for | ML, exploration, streaming, mixed workloads | BI, reporting, structured analytics |
| Query performance | Variable, depends on file layout | Highly optimized, predictable |
| Governance maturity | Improving rapidly | Long-established, mature tooling |
A warehouse like Snowflake, BigQuery, or Redshift gives you fast, predictable SQL performance on structured data, but the data lives in vendor-managed storage, you pay that vendor's rates for both storage and compute, and you're tied to one vendor's engine. A data lake decouples storage from compute -- you store data once in an open format and run any engine you want against it. The price is operational complexity: you own the metadata, the file layout, the partitioning strategy, and the optimization.
Data Lake vs Data Lakehouse
A data lake stores files. A lakehouse adds the missing reliability layer on top of those files. The same Parquet files in the same S3 bucket, but now wrapped by an open table format (Apache Iceberg, Delta Lake, or Apache Hudi) that provides ACID transactions, schema evolution, time travel, and metadata-driven file pruning.
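The idea behind transactions and time travel can be sketched with a toy snapshot log. This is a simplification under stated assumptions: the `SnapshotTable` class and file names are hypothetical, and real table formats keep this metadata in manifest files with an atomic pointer swap in the catalog rather than in a Python list.

```python
class SnapshotTable:
    """Toy table: every commit produces a new immutable snapshot of file paths."""

    def __init__(self):
        self.snapshots = [()]  # snapshot 0 is the empty table

    def commit(self, added=(), removed=()):
        """Create a new snapshot; readers of older snapshots are unaffected."""
        current = self.snapshots[-1]
        removed_set = set(removed)
        missing = removed_set - set(current)
        if missing:
            raise ValueError(f"cannot remove files not in table: {sorted(missing)}")
        new_files = tuple(f for f in current if f not in removed_set) + tuple(added)
        self.snapshots.append(new_files)  # one append stands in for the atomic switch
        return len(self.snapshots) - 1

    def files(self, snapshot_id=None):
        """List files in the latest snapshot, or in an older one (time travel)."""
        if snapshot_id is None:
            snapshot_id = len(self.snapshots) - 1
        return list(self.snapshots[snapshot_id])


table = SnapshotTable()
first = table.commit(added=["a.parquet", "b.parquet"])
# A rewrite replaces two files with one, in a single commit.
second = table.commit(added=["ab.parquet"], removed=["a.parquet", "b.parquet"])
```

After the second commit, `table.files()` returns only the rewritten file, while `table.files(first)` still returns the original two. The data files never change in place; only the metadata describing which files make up the table does.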
The practical effect: a lakehouse behaves like a warehouse for analytical workloads while keeping the cost profile and openness of a lake. Most teams building a new data platform today should plan for a lakehouse from day one rather than building a bare lake and migrating later. The migration -- converting Hive-style partitioned Parquet directories into managed Iceberg or Delta tables -- is doable but not free.
Key Technologies Behind Data Lakes
Object storage. Amazon S3 is the de facto default. Eleven nines of durability, virtually unlimited capacity, and a rich ecosystem of tools that integrate with it natively. Azure Data Lake Storage Gen2 and Google Cloud Storage offer equivalent functionality. On-premise alternatives include MinIO (S3-compatible) and Ceph.
File formats. Apache Parquet is the dominant columnar format for analytics -- efficient compression, predicate pushdown, and broad engine support. ORC is similar and widely used in the Hadoop ecosystem. Avro is common for streaming and event data where schema evolution matters. CSV and JSON still have a role for landing raw data before transformation.
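Predicate pushdown relies on per-file (or per-row-group) statistics such as minimum and maximum values. The sketch below shows the skipping logic only; the `FileStats` class, the file names, and the value ranges are hypothetical, and a real engine would read these statistics from the file footers or table metadata.

```python
from dataclasses import dataclass


@dataclass
class FileStats:
    path: str
    min_value: int  # smallest value of the filtered column in this file
    max_value: int  # largest value of the filtered column in this file


def files_for_range(stats, low, high):
    """Return paths whose [min, max] range overlaps the query range [low, high]."""
    return [s.path for s in stats if s.max_value >= low and s.min_value <= high]


stats = [
    FileStats("part-0.parquet", 1, 100),
    FileStats("part-1.parquet", 101, 200),
    FileStats("part-2.parquet", 201, 300),
]

# A filter like "order_id BETWEEN 150 AND 220" only needs two of the three files.
to_scan = files_for_range(stats, 150, 220)
```

The benefit depends on how the data is sorted: if every file spans the full value range, every file overlaps every query and nothing is skipped.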
Table formats. Apache Iceberg has emerged as the leading open table format for new lakehouse builds, with support across Spark, Trino, Flink, Athena, BigQuery, Snowflake, and ClickHouse. Delta Lake is dominant in Databricks-centric stacks. Apache Hudi specializes in record-level upserts and CDC workloads.
Catalogs. AWS Glue Data Catalog, Apache Hive Metastore, Unity Catalog, Apache Polaris (Iceberg REST), Nessie. The catalog is what turns a pile of files into queryable tables.
Query engines. Trino and Presto for interactive SQL across heterogeneous sources. Spark for large-scale batch transformations. Flink for streaming. Athena and BigQuery as serverless query layers. Dremio as a query accelerator with semantic layer features.
Orchestration. Apache Airflow, Dagster, and Prefect manage the pipelines that land, transform, and curate data in the lake.
Common Use Cases
Centralized analytics platform. A single source of truth for all of an organization's data -- transactional, behavioral, operational, third-party -- that BI tools, analysts, and data scientists query through SQL engines or notebook environments.
Machine learning and AI. ML training needs large volumes of historical data, often in non-tabular formats (images, text, audio). Data lakes naturally accommodate this. Feature stores often sit on top of lake storage. Training pipelines read directly from Parquet or image files in object storage.
Log and event analytics. Application logs, clickstreams, IoT telemetry, and security events land in the lake at high volume. Engines like ClickHouse, Athena, or OpenSearch query them for operational and security analytics.
Data archival and compliance. Cheap, durable storage for data that must be retained for regulatory reasons (financial transactions, healthcare records, audit logs) but is rarely queried. Lifecycle policies tier cold data to even cheaper storage classes like S3 Glacier.
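A lifecycle policy is just a small configuration document. The sketch below builds one as a Python dictionary in the shape S3 lifecycle configurations use, as far as we understand it; the `glacier_rule` helper, the `audit-logs/` prefix, and the 90-day threshold are assumptions for illustration, so verify the fields against the current S3 documentation before applying anything.

```python
import json


def glacier_rule(prefix, days):
    """Build one rule that moves objects under `prefix` to GLACIER after `days` days."""
    if days < 0:
        raise ValueError("days must be non-negative")
    return {
        "ID": "archive-" + prefix.strip("/").replace("/", "-"),
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Transitions": [{"Days": days, "StorageClass": "GLACIER"}],
    }


# Hypothetical policy: archive audit logs after 90 days.
lifecycle_configuration = {"Rules": [glacier_rule("audit-logs/", 90)]}
print(json.dumps(lifecycle_configuration, indent=2))
```

Retrieval from archival tiers is slower and billed separately, so this suits data that is retained for compliance rather than queried regularly.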
Data sharing and monetization. Open formats and object storage make it straightforward to share datasets with partners or customers -- via signed URLs, cross-account access, or Iceberg REST catalogs -- without exporting copies.
Challenges
Data swamps. Without governance, catalogs, and clear ownership, a data lake quickly becomes a dumping ground where nothing is documented, nothing is trusted, and nothing gets used. This is the single most common failure mode. Investment in cataloging, lineage, and data quality is not optional -- it's what separates a lake from a swamp.
Small files problem. Streaming ingestion and frequent micro-batch writes produce huge numbers of tiny files, which degrades query performance because engines spend more time listing and opening files than reading data. Compaction jobs (merging small files into larger ones) are a routine part of operating a lake. Table formats like Iceberg automate some of this, but the underlying problem remains.
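The planning half of a compaction job can be sketched as simple bin packing. The `plan_compaction` function, the thresholds, and the file names below are hypothetical; an actual job would then rewrite each batch into one larger file and commit the swap through the table format.

```python
def plan_compaction(file_sizes, target_bytes, small_threshold):
    """Group small files into batches of at most roughly `target_bytes` each.

    file_sizes: dict of path -> size in bytes.
    Returns a list of batches (lists of paths). Batches holding a single file
    are dropped, since rewriting one file on its own gains nothing.
    """
    small = sorted(p for p, size in file_sizes.items() if size < small_threshold)
    batches, current, current_size = [], [], 0
    for path in small:
        size = file_sizes[path]
        # Start a new batch when adding this file would exceed the target.
        if current and current_size + size > target_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(path)
        current_size += size
    if current:
        batches.append(current)
    return [batch for batch in batches if len(batch) > 1]


sizes = {"a": 10, "b": 20, "c": 30, "d": 500, "e": 40}
plan = plan_compaction(sizes, target_bytes=60, small_threshold=100)
```

Here `d` is already large enough to be left alone, `a`, `b`, and `c` are merged into one batch, and `e` stays as-is because it has nothing to merge with yet.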
Schema drift. Source systems change. Columns get added, renamed, or have their types altered. A lake without schema enforcement absorbs these changes silently and breaks downstream consumers later. Schema registries (Confluent Schema Registry) and table formats with schema evolution help.
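A basic drift check compares each incoming record against the schema downstream consumers expect. This is a minimal sketch with a hypothetical `detect_drift` helper and made-up field names; it is not how any particular schema registry works.

```python
def detect_drift(expected, record):
    """Compare a record against an expected schema (field name -> Python type).

    Returns the fields that were added, went missing, or changed type.
    """
    added = sorted(set(record) - set(expected))
    missing = sorted(set(expected) - set(record))
    retyped = sorted(
        name
        for name in set(expected) & set(record)
        if not isinstance(record[name], expected[name])
    )
    return {"added": added, "missing": missing, "retyped": retyped}


expected_schema = {"id": int, "email": str, "amount": float}
incoming = {"id": "17", "email": "a@example.com", "total": 9.5}

report = detect_drift(expected_schema, incoming)
```

Note that a renamed column (`amount` to `total` here) shows up as one missing field plus one added field. Telling a rename apart from an unrelated change takes extra context that a check like this does not have.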
Cost surprises. S3 storage is cheap, but request costs (PUT, GET, LIST) and cross-region transfer can dominate the bill on poorly designed pipelines. Engines that scan large amounts of data without effective partitioning or file pruning rack up compute costs. Understanding the cost model is essential.
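A back-of-the-envelope calculation shows how requests can outweigh storage. The default prices in this sketch are illustrative assumptions, not quoted rates, and the workload numbers are hypothetical; substitute current pricing for your region and storage class.

```python
def monthly_cost(stored_gb, put_requests, get_requests,
                 storage_per_gb=0.023, put_per_1k=0.005, get_per_1k=0.0004):
    """Estimate monthly storage and request cost in dollars.

    All prices are assumed placeholders; pass real rates for a real estimate.
    """
    storage = stored_gb * storage_per_gb
    requests = (put_requests / 1000) * put_per_1k + (get_requests / 1000) * get_per_1k
    return {"storage": round(storage, 2), "requests": round(requests, 2)}


# Hypothetical pipeline: 1 TB stored, 100 small files written per second for 30 days.
puts = 100 * 86_400 * 30        # 259,200,000 PUT requests
gets = 500_000_000              # assumed read volume from downstream engines
cost = monthly_cost(stored_gb=1000, put_requests=puts, get_requests=gets)
```

Under these assumptions storage comes to about $23 while requests come to about $1,496, which is why batching writes into fewer, larger files is usually the first cost fix.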
Query performance. Lakes don't have the indexing, caching, and statistics that warehouses build automatically. Performance depends on file layout, partitioning, sort orders, and table format features. Without active tuning, queries that took seconds in a warehouse can take minutes on a lake.
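Partitioning is the simplest layout lever. The sketch below prunes Hive-style `key=value` paths before any file is opened; the `parse_partitions` and `prune` helpers and the example paths are hypothetical, and real engines do the equivalent using catalog or table-format metadata.

```python
def parse_partitions(path):
    """Extract key=value segments from a Hive-style path into a dict."""
    parts = {}
    for segment in path.split("/"):
        if "=" in segment:
            key, _, value = segment.partition("=")
            parts[key] = value
    return parts


def prune(paths, **wanted):
    """Keep only paths whose partition values match every requested key."""
    return [
        p for p in paths
        if all(parse_partitions(p).get(k) == v for k, v in wanted.items())
    ]


paths = [
    "events/dt=2024-01-01/country=DE/part-0.parquet",
    "events/dt=2024-01-01/country=US/part-0.parquet",
    "events/dt=2024-01-02/country=DE/part-0.parquet",
]

# A filter on dt and country leaves a single file to scan.
selected = prune(paths, dt="2024-01-01", country="DE")
```

A query that filters on a column the data is not partitioned by gets no benefit from this and falls back to scanning everything, which is where file-level statistics and sort order come in.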
Governance and security at scale. Fine-grained access control across object storage, catalogs, and multiple query engines is operationally complex. Lake Formation, Unity Catalog, and similar tools help, but they require careful design and ongoing maintenance.
When to Build a Data Lake
A data lake (or, more practically, a lakehouse) makes sense when:
- You have multiple data consumers (BI, ML, search, streaming) and want them sharing one copy of data
- Your data volume is growing faster than your warehouse budget
- You need to store semi-structured or unstructured data alongside relational data
- Vendor lock-in on analytical data is a concern
- You're integrating data from many sources with varying schemas and structures
It's overkill when your workload is purely BI dashboards on well-structured data and your team is small -- a managed warehouse will be cheaper and simpler to operate. The honest answer for most growing organizations today: don't build a pure data lake. Build a lakehouse from the start.
BigDataBoutique and Data Lake Architecture
We design and build production data lakes and lakehouses on AWS, GCP, and Azure -- from green-field architecture through migration off legacy warehouses and Hadoop clusters. Our team works extensively with Apache Iceberg, Apache Flink, ClickHouse, OpenSearch, and the broader data engineering stack.