The Medallion Architecture: Bronze, Silver, and Gold on Open Lakehouses

How to implement the Bronze, Silver, and Gold medallion pattern on open table formats - Apache Iceberg, Delta Lake, and Apache Hudi - without Databricks lock-in, plus the trade-offs and when to skip Bronze entirely.

The medallion architecture is a layering convention for lakehouse data: raw ingestion lands in a Bronze layer, cleaned and conformed data moves to Silver, and business-ready aggregates live in Gold. The term comes from Databricks, which popularized it around 2019-2020 alongside its lakehouse messaging. Almost every top search result on the topic is a Databricks or Microsoft Fabric doc, which leaves a misleading impression: that medallion is a Databricks feature.

It is not. Bronze/Silver/Gold is a naming convention for progressive refinement, and it maps cleanly onto any open table format. You can build the exact same pattern on Apache Iceberg, Delta Lake, or Apache Hudi, querying with Trino, Spark, or Flink, without touching a proprietary runtime. This post covers what each layer is actually for, how the three open formats support the operations each layer needs, where the pattern earns its keep, and where it becomes cargo-culted complexity.

What the Medallion Architecture Actually Is

The medallion architecture is a data design pattern that organizes a lakehouse into three quality tiers - Bronze, Silver, and Gold - where data is progressively cleaned, conformed, and aggregated as it flows from one layer to the next. The "medallion" name follows the Olympic ordering: Bronze is the least refined, Gold is the most. It is a multi-hop pipeline, nothing more exotic than that.

The underlying idea predates Databricks by decades. Staging areas, operational data stores, and dimensional marts in classic data warehousing describe the same progression of raw to refined. What the lakehouse era added is the ability to keep all three tiers as queryable tables on cheap object storage, with ACID guarantees provided by an open table format rather than a warehouse engine.

Here is the layer-by-layer breakdown most teams converge on:

Layer	Purpose	Typical contents	Schema discipline	Common consumers
Bronze	Raw landing zone	Source data as-ingested, append-only, full history	Schema-on-read, minimal enforcement	Reprocessing, audit, replay
Silver	Cleaned and conformed	Deduplicated, typed, joined to reference data, SCDs applied	Enforced schema, quality gates	Data scientists, ad-hoc analytics
Gold	Business-ready	Aggregates, marts, metrics, ML features	Strict, modeled for consumption	BI dashboards, semantic layer, reporting

The boundary between Silver and Gold is the least standardized part of the whole pattern. Ask ten data engineers where conformed dimensions end and business marts begin, and you will get eleven answers. Treat the layer names as a shared vocabulary, not a specification.

Why It Is Table-Format-Neutral

Each layer needs a specific set of table capabilities. Bronze needs cheap appends and a durable history you can replay. Silver needs upserts, deletes, and merge logic to deduplicate and apply slowly changing dimensions. Gold needs fast reads and atomic overwrites for refreshed aggregates. Every one of those operations exists in Iceberg, Delta, and Hudi today. None of them is unique to Databricks.

What differs is how each format implements the operation and what it costs you. Iceberg leans on hidden partitioning and snapshot isolation, and lets you choose copy-on-write or merge-on-read per operation (delete, update, merge) through table properties. Delta Lake supports MERGE INTO as well and can use deletion vectors to mark rows as removed instead of rewriting whole files. Hudi was built around upserts from day one and exposes Copy-on-Write and Merge-on-Read as a first-class table type decision. We covered the Iceberg-versus-Delta decision in detail in our table format comparison, and Hudi's design in our introduction to Apache Hudi.

Capability needed	Apache Iceberg	Delta Lake	Apache Hudi
Append-only Bronze	Native, snapshot per commit	Native, transaction log	Native, supports bulk insert
Upsert / MERGE for Silver	`MERGE INTO`, COW or MOR	`MERGE INTO`, deletion vectors	`MERGE INTO`, COW or MOR table type
Time travel / replay	Snapshot history	Versioned log + time travel	Commit timeline, incremental queries
Partitioning	Hidden partitioning, evolves without rewrites	Explicit partition columns	Partition + record-level indexing
High-update workloads	Merge-on-read, compact periodically	Deletion vectors reduce rewrite cost	Merge-on-read designed for this

The practical takeaway: the medallion layers are a logical design, and the table format is an implementation detail you choose per workload. A streaming-heavy Silver layer with constant updates is a natural fit for Hudi or Iceberg merge-on-read. A read-mostly Gold layer is fine on copy-on-write in any of the three.
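To make the cost difference concrete, here is a deliberately simplified Python model of one commit under each strategy. It is a sketch, not how any of the three formats actually account for writes: the names WriteCost, copy_on_write_cost, and merge_on_read_cost are hypothetical, and the model assumes fixed-size files and a single delta file per merge-on-read commit.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class WriteCost:
    rows_written: int   # rows physically written by the commit
    files_written: int  # data files rewritten, or delta files added


def copy_on_write_cost(rows_per_file: int, updates_by_file: Dict[int, int]) -> WriteCost:
    """Every file holding at least one updated row is rewritten in full."""
    touched = [file_id for file_id, n in updates_by_file.items() if n > 0]
    return WriteCost(rows_written=rows_per_file * len(touched), files_written=len(touched))


def merge_on_read_cost(updates_by_file: Dict[int, int]) -> WriteCost:
    """Only the changed rows are written (assumption: one delta file per commit)."""
    changed = sum(updates_by_file.values())
    return WriteCost(rows_written=changed, files_written=1 if changed else 0)


# 100 updated rows scattered across 50 files of 1,000,000 rows each.
updates = {file_id: 2 for file_id in range(50)}
cow = copy_on_write_cost(1_000_000, updates)
mor = merge_on_read_cost(updates)
amplification = cow.rows_written // mor.rows_written
print(f"COW writes {amplification}x more rows than MOR for this commit")
```

The model leaves out the read side on purpose. Under merge-on-read, every query has to reconcile the delta with the base files until compaction folds them together, which is the cost a read-mostly Gold layer avoids by staying on copy-on-write.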

Bronze and Silver in Practice

Bronze is a landing zone. The discipline here is restraint: write source data with as little transformation as you can get away with, keep it append-only, and preserve enough history to reprocess everything downstream if a bug surfaces in Silver. Change-data-capture pipelines via Debezium and Kafka, or streaming ingestion through Flink, typically write straight into Bronze tables. If you are running Flink into Iceberg, our Flink and Iceberg guide walks through the connector setup.
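One way to practice that restraint is to wrap each source message in a thin envelope and leave the payload untouched. The Python sketch below is illustrative only: to_bronze_record and the _source and _payload_sha256 fields are hypothetical names, and _ingested_at is assumed to be the ingestion-time column that downstream jobs filter on.

```python
import hashlib
from datetime import datetime, timezone


def to_bronze_record(raw_payload: str, source: str, ingested_at: datetime) -> dict:
    """Wrap a source message for Bronze without parsing or altering it."""
    return {
        "_source": source,                          # where the message came from
        "_ingested_at": ingested_at.isoformat(),    # when it landed, not when it was produced
        "_payload_sha256": hashlib.sha256(raw_payload.encode("utf-8")).hexdigest(),
        "payload": raw_payload,                     # kept byte-for-byte as received
    }


# The untrimmed, mixed-case email is a Silver problem; Bronze stores it as-is.
event = '{"customer_id": 7, "email": "A@Example.com "}'
row = to_bronze_record(event, "crm.customers", datetime(2024, 1, 1, tzinfo=timezone.utc))
```

Keeping the payload as an opaque string means a malformed message still lands and can be replayed once the Silver logic is fixed. The hash is there so duplicates can be detected later without parsing anything.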

Silver is where the real engineering lives. This is deduplication, type coercion, joining to reference data, applying slowly changing dimensions, and handling late-arriving records. The workhorse operation is an idempotent upsert. On any of the three open formats this is a MERGE INTO:

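-- Collapse Bronze to the newest row per key first: Delta and Iceberg on Spark
-- raise an error when several source rows match one target row.
MERGE INTO silver.customers t
  USING (
    SELECT customer_id, name, email, updated_at
    FROM (
      SELECT customer_id, name, email, updated_at,
             row_number() OVER (PARTITION BY customer_id ORDER BY updated_at DESC) AS rn
      FROM bronze.customers_raw
      WHERE _ingested_at > (SELECT max(_processed_at) FROM silver._watermark)
    ) ranked
    WHERE rn = 1
  ) s
  ON t.customer_id = s.customer_id
  -- Only overwrite when the incoming row is newer, so replays are harmless.
  WHEN MATCHED AND s.updated_at > t.updated_at THEN UPDATE SET *
  WHEN NOT MATCHED THEN INSERT *;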

The same statement runs on Iceberg, Delta, and Hudi through Spark SQL. Trino accepts MERGE on its Iceberg and Delta Lake connectors, but it expects explicit column lists in place of the UPDATE SET * and INSERT * shorthand, so check the dialect before porting. The format choice governs cost, not syntax. On a write-heavy Silver table, copy-on-write rewrites every data file that contains a matched row on every merge, which is the main source of write amplification in medallion pipelines. Merge-on-read writes small delete and update deltas instead, accepting slower reads in exchange for cheaper writes until compaction catches up. Hudi was explicitly designed so that "merge-on-read is better suited for write- or change-heavy workloads with fewer reads," while copy-on-write suits read-heavy data that changes less often, per the Hudi documentation.
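What "idempotent" buys you is easiest to see outside any engine. The following Python sketch mimics the two MERGE branches above on a plain dictionary keyed by customer_id. It is a toy under stated assumptions, not any format's implementation: merge_into is a hypothetical helper, and updated_at is assumed to be an ISO-8601 string so that string comparison matches time order.

```python
from typing import Any, Dict, Iterable, Tuple

Row = Dict[str, Any]


def merge_into(target: Dict[int, Row], batch: Iterable[Row]) -> Tuple[int, int]:
    """Apply a batch to `target` in place and return (inserted, updated) counts."""
    # Collapse the batch to the newest row per key so each target row
    # is matched by at most one source row.
    latest: Dict[int, Row] = {}
    for row in batch:
        key = row["customer_id"]
        if key not in latest or row["updated_at"] > latest[key]["updated_at"]:
            latest[key] = row

    inserted = updated = 0
    for key, row in latest.items():
        current = target.get(key)
        if current is None:                              # WHEN NOT MATCHED THEN INSERT
            target[key] = dict(row)
            inserted += 1
        elif row["updated_at"] > current["updated_at"]:  # WHEN MATCHED AND newer THEN UPDATE
            target[key] = dict(row)
            updated += 1
    return inserted, updated


silver = {1: {"customer_id": 1, "email": "old@example.com", "updated_at": "2024-01-01"}}
batch = [
    {"customer_id": 1, "email": "new@example.com", "updated_at": "2024-02-01"},
    {"customer_id": 2, "email": "b@example.com", "updated_at": "2024-01-15"},
    {"customer_id": 2, "email": "b2@example.com", "updated_at": "2024-01-20"},
]
first = merge_into(silver, batch)   # one insert, one update
second = merge_into(silver, batch)  # replaying the same batch changes nothing
print(first, second)
```

The strict "newer than" comparison is what makes a replay safe: a retried job or an overlapping watermark window re-reads rows that are no longer newer than the target, so they fall through both branches.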

Whatever format you pick, Silver and Bronze accumulate snapshots, log files, and small files fast. Table maintenance - expiring snapshots, compacting files, clustering - is not optional at this layer. We wrote up the Iceberg side in Iceberg table maintenance best practices; Delta has OPTIMIZE and VACUUM, and Hudi runs compaction and cleaning on its timeline.
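Snapshot expiry usually boils down to two knobs: a maximum age and a minimum number of snapshots to keep regardless of age. The Python sketch below models that policy on an in-memory list. It is loosely shaped like the older-than and retain-last style of setting that expiry tooling tends to expose, but Snapshot and snapshots_to_expire are hypothetical names and the code does not call any real table format API.

```python
from datetime import datetime, timedelta
from typing import List, NamedTuple


class Snapshot(NamedTuple):
    snapshot_id: int
    committed_at: datetime


def snapshots_to_expire(snapshots: List[Snapshot], now: datetime,
                        max_age: timedelta, retain_last: int) -> List[Snapshot]:
    """Return snapshots older than `max_age`, always keeping the newest `retain_last`."""
    ordered = sorted(snapshots, key=lambda s: s.committed_at, reverse=True)
    protected = {s.snapshot_id for s in ordered[:retain_last]}
    cutoff = now - max_age
    return [s for s in ordered
            if s.committed_at < cutoff and s.snapshot_id not in protected]


# Ten daily snapshots, committed on 1 January through 10 January.
history = [Snapshot(day, datetime(2024, 1, day)) for day in range(1, 11)]
expired = snapshots_to_expire(history, now=datetime(2024, 1, 10),
                              max_age=timedelta(days=3), retain_last=5)
expired_ids = sorted(s.snapshot_id for s in expired)
print(expired_ids)
```

The retain-last floor matters for Bronze in particular. A table that only receives one commit a week would otherwise lose its entire replay history to an age-only rule.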

Gold, and the "Do You Even Need Bronze" Debate

Gold tables are the business-grade output: aggregates, dimensional marts, metrics, and ML features modeled for direct consumption. Reads dominate here, so copy-on-write and aggressive file compaction usually win. Many teams point a semantic layer or BI tool straight at Gold and never expose Silver to analysts. Atomic overwrites - swapping a full snapshot of a daily aggregate in one commit - are a natural fit for snapshot-isolated formats, since readers never see a half-written table.
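The atomic-overwrite guarantee comes down to building the new snapshot off to the side and publishing it with a single pointer swap. Here is a minimal in-memory Python sketch of that idea. SnapshotTable is a hypothetical class, and real formats do the swap through a catalog or transaction log commit, not a lock in one process.

```python
import threading
from typing import Dict, List, Tuple


class SnapshotTable:
    """Toy table whose readers always see one complete, immutable snapshot."""

    def __init__(self) -> None:
        self._snapshots: Dict[int, Tuple[dict, ...]] = {0: ()}
        self._current = 0
        self._lock = threading.Lock()

    def read(self) -> Tuple[dict, ...]:
        # A reader resolves the current pointer once and keeps that snapshot.
        return self._snapshots[self._current]

    def overwrite(self, rows: List[dict]) -> int:
        # Materialize the full replacement first, then publish it in one step.
        new_rows = tuple(dict(r) for r in rows)
        with self._lock:
            new_id = self._current + 1
            self._snapshots[new_id] = new_rows
            self._current = new_id
        return new_id


gold = SnapshotTable()
gold.overwrite([{"day": "2024-01-01", "revenue": 100}])
before = gold.read()  # a dashboard query that started before the refresh
gold.overwrite([{"day": "2024-01-01", "revenue": 120},
                {"day": "2024-01-02", "revenue": 90}])
after = gold.read()   # a query that started after the refresh
print(len(before), len(after))
```

The in-flight reader keeps its one-row snapshot even though the table has moved on. That is the property that lets a daily Gold refresh run while dashboards are being queried.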

Now the honest part. The medallion pattern is genuinely useful when multiple teams share infrastructure, when source data is messy and needs auditable reprocessing, or when regulatory replay matters. In those settings the intermediate layers buy you scalability and governance. But applying three layers reflexively to every dataset is one of the most common anti-patterns in the field. A small dataset feeding a simple dashboard does not need a multi-hop pipeline; you are just stacking storage cost, latency, and maintenance for no return.

The Bronze layer draws the sharpest criticism. InfoQ's "The End of the Bronze Age" argues that a raw copy-everything Bronze layer is frequently broken, forcing reprocessing and post-mortems, and that modern ingestion tooling can often validate and land data in a usable state without a separate raw tier. The counter-argument is durability: if your CDC source can drift or your transforms have bugs, an immutable Bronze gives you a replay point that nothing downstream can corrupt. Both positions are defensible. The decision should rest on whether you actually need to replay, not on whether a reference diagram has three boxes.

A useful test: each layer must answer who consumes it and what guarantee it adds over the layer before. If a layer has no distinct consumer and adds no new guarantee, it is overhead. Bronze/Silver/Gold is a mental model for progressive refinement, not a mandate to build three tables for every feed.
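That test is simple enough to write down as a check you could run over a pipeline inventory. The Python sketch below is one hypothetical encoding of it: the Layer fields and the overhead_layers function are illustrative, and what counts as a "guarantee" is whatever your team agrees to list.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Layer:
    name: str
    consumers: List[str] = field(default_factory=list)
    guarantees: List[str] = field(default_factory=list)


def overhead_layers(pipeline: List[Layer]) -> List[str]:
    """Flag layers with no consumer of their own and no guarantee beyond earlier layers."""
    flagged: List[str] = []
    seen_consumers: Set[str] = set()
    seen_guarantees: Set[str] = set()
    for layer in pipeline:  # ordered from rawest to most refined
        new_consumers = set(layer.consumers) - seen_consumers
        new_guarantees = set(layer.guarantees) - seen_guarantees
        if not new_consumers and not new_guarantees:
            flagged.append(layer.name)
        seen_consumers |= set(layer.consumers)
        seen_guarantees |= set(layer.guarantees)
    return flagged


# A small feed behind one dashboard, where nobody needs to replay raw data.
small_feed = [
    Layer("bronze"),
    Layer("silver", guarantees=["deduplicated", "typed"]),
    Layer("gold", consumers=["dashboard"], guarantees=["deduplicated", "typed"]),
]
flagged = overhead_layers(small_feed)
print(flagged)
```

For this feed only Bronze is flagged. Add "immutable replay point" to its guarantees, because an auditor or a reprocessing job actually relies on it, and the layer passes.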

Key Takeaways

The medallion architecture (Bronze raw, Silver cleaned, Gold business-ready) is a Databricks-coined naming convention for progressive data refinement, not a Databricks-only feature. It runs unchanged on open table formats.
Apache Iceberg, Delta Lake, and Apache Hudi all support the operations each layer needs - append for Bronze, MERGE INTO upserts for Silver, atomic overwrites and fast reads for Gold. The choice between them is about cost and workload shape, not capability.
Copy-on-write rewrites whole files and causes write amplification on update-heavy Silver tables; merge-on-read accepts slower reads in exchange for cheaper writes and is the better fit for streaming or change-heavy layers. Both need scheduled compaction.
Table maintenance (snapshot expiry, file compaction, clustering) is mandatory at Bronze and Silver, where small files and snapshots pile up quickly.
Not every dataset needs all three layers. Validate each layer against a distinct consumer and a distinct guarantee. The Bronze layer in particular is worth keeping only when you genuinely need an immutable replay point.

Designing a lakehouse and weighing whether medallion layering fits your workloads and chosen table format? Talk to us - we build open, vendor-neutral data platforms on Iceberg, Delta, and Hudi.