Apache Iceberg on AWS: Glue Catalog, Athena, EMR, and S3 Tables

A practical guide to running Apache Iceberg on AWS - choosing between the Glue Data Catalog and S3 Tables, querying with Athena and EMR, and wiring the pieces into an AWS-native lakehouse over the Iceberg REST protocol.

This post covers the AWS-native Iceberg stack: how to run Apache Iceberg tables on AWS, not how the format works internally. If you want the internals - manifests, snapshots, hidden partitioning - read the Apache Iceberg architecture deep dive first. For the vendor-neutral view of assembling a lakehouse on object storage, see data lake architecture in 2026. Here we stay concrete: Glue Data Catalog versus S3 Tables, querying with Athena and EMR, maintenance, and how the Iceberg REST endpoint lets all of it interoperate.

Iceberg on AWS

Apache Iceberg on AWS is the combination of three swappable layers: an open table format (Iceberg) that adds transactions and schema evolution to files in Amazon S3, an AWS-native catalog (the Glue Data Catalog or S3 Tables) that tracks table metadata and pointers to the current snapshot, and one or more AWS query engines (Athena, EMR, Redshift) that read and write those tables through the catalog.

The format and the storage are settled. Your data lands as Parquet files in S3, and Iceberg metadata files sit alongside them describing which files belong to which snapshot. What you choose is the catalog and the engines. The catalog is the part teams get wrong, because it is the one component that every engine has to agree on. Each catalog keeps its own pointer to a table's current metadata file, so two engines pointed at the same S3 prefix through different catalogs never see each other's commits and will corrupt each other's writes. One catalog, many engines, is the rule.
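To make the one-catalog rule concrete, here is a toy model of the commit step. It is a sketch, not Iceberg's actual implementation: `ToyCatalog`, `CommitConflict`, and the snapshot names are hypothetical, and the only idea borrowed from Iceberg is that a commit is an atomic swap of the catalog's pointer to the current table metadata.

```python
from dataclasses import dataclass, field


class CommitConflict(Exception):
    """Raised when a writer's base pointer is no longer the current one."""


@dataclass
class ToyCatalog:
    """Hypothetical stand-in for a catalog: one metadata pointer per table."""

    pointers: dict = field(default_factory=dict)

    def current(self, table):
        return self.pointers.get(table)

    def commit(self, table, expected, new):
        # Compare-and-swap: advance only if nobody committed since we read.
        found = self.pointers.get(table)
        if found != expected:
            raise CommitConflict(f"{table}: expected {expected!r}, found {found!r}")
        self.pointers[table] = new


# Case 1: two writers share one catalog. Both start from the same base.
shared = ToyCatalog()
base = shared.current("orders")
shared.commit("orders", base, "v1-spark")
try:
    shared.commit("orders", base, "v1-flink")  # stale base, so this is rejected
    conflict_detected = False
except CommitConflict:
    conflict_detected = True  # the second writer can now refresh and retry

# Case 2: each writer has its own catalog over the same files.
cat_a, cat_b = ToyCatalog(), ToyCatalog()
cat_a.commit("orders", None, "v1-spark")
cat_b.commit("orders", None, "v1-flink")  # nothing detects the collision
diverged = cat_a.current("orders") != cat_b.current("orders")
```

With a shared catalog the second writer gets a conflict it can retry. With two catalogs both commits succeed and the two pointers silently disagree about what the table contains.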

AWS gives you two catalog paths. The classic path is the AWS Glue Data Catalog managing Iceberg tables over buckets you own. The newer path, generally available since December 2024, is Amazon S3 Tables, a managed Iceberg service with its own bucket type and built-in maintenance. Both speak the Iceberg REST Catalog specification through the AWS Glue Iceberg REST endpoint, which is what lets open-source clients like PyIceberg and Spark talk to either one. Picking between them is the first real decision, so start there.

The Catalog Layer: Glue Data Catalog vs S3 Tables

The Glue Data Catalog is the AWS metastore most teams already use. It supports Iceberg natively, and since 2024 it can run managed table optimization on your behalf: a table optimizer continuously watches partitions and triggers compaction when a table or partition crosses a file-count threshold (the default kicks in past 100 files), then handles snapshot expiration and orphan-file cleanup. See AWS Glue auto compaction for Iceberg. You keep full control of the underlying buckets, file layout, and IAM, which matters when you have existing data, strict storage-tiering rules, or compliance requirements that demand owning the bucket.
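As a rough sketch of what opting in might look like from code, the snippet below builds one request per optimizer type for the Glue `create_table_optimizer` API and sends them through a client you pass in. The database, table, account ID, and role ARN are placeholders, and the role is assumed to already have the permissions the optimizer needs. Check the request shape against the current boto3 Glue documentation before relying on it.

```python
# Optimizer types as named by the Glue API: compaction, snapshot
# retention (expiry), and orphan-file deletion.
OPTIMIZER_TYPES = ("compaction", "retention", "orphan_file_deletion")


def optimizer_requests(database, table, role_arn, catalog_id):
    """Build one create_table_optimizer request per optimizer type."""
    return [
        {
            "CatalogId": catalog_id,
            "DatabaseName": database,
            "TableName": table,
            "Type": optimizer_type,
            "TableOptimizerConfiguration": {"roleArn": role_arn, "enabled": True},
        }
        for optimizer_type in OPTIMIZER_TYPES
    ]


def enable_optimizers(glue_client, requests):
    """Send each request; glue_client is expected to be boto3.client('glue')."""
    for request in requests:
        glue_client.create_table_optimizer(**request)


# Placeholder values for illustration only.
example_requests = optimizer_requests(
    database="sales",
    table="orders",
    role_arn="arn:aws:iam::111122223333:role/HypotheticalOptimizerRole",
    catalog_id="111122223333",
)
```

In a real job you would call `enable_optimizers(boto3.client("glue"), example_requests)`. Keeping the request builder separate from the call makes the configuration easy to review and unit test without touching AWS.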

Amazon S3 Tables is the opposite trade. It introduces table buckets, a purpose-built bucket type that stores tabular data and runs policy-driven maintenance - compaction, snapshot management, and unreferenced-file removal - automatically, with no optimizer to configure. AWS reports S3 Tables deliver up to 3x faster query throughput and up to 10x higher transactions per second than self-managed Iceberg tables, per the GA announcement. In 2025 the compaction engine gained sort and z-order strategies for both S3 Tables and Glue-optimized tables, per InfoQ's coverage. The cost is control: you give up direct ownership of the bucket and layout in exchange for AWS running the operational treadmill.

Dimension	Glue Data Catalog + S3 (self-managed)	Amazon S3 Tables
Bucket ownership	Your own S3 buckets and prefixes	Managed table buckets
Maintenance	Opt-in Glue table optimizer (compaction, expiry, orphan cleanup)	Built-in, policy-driven, automatic
Layout control	Full (file size, partitioning, tiering)	Limited; AWS manages physical layout
Setup effort	Higher - configure optimizer and IAM	Lower - create table bucket, start writing
Best for	Existing lakes, strict storage/compliance control	New analytics tables, teams that want zero ops
Catalog interop	Glue Iceberg REST endpoint	Glue Iceberg REST endpoint (after integration)

The pragmatic split: reach for S3 Tables when you are standing up new analytics tables and want maintenance to be someone else's problem. Stay on Glue plus your own buckets when you already have a lake, need fine control over storage tiering and file layout, or have governance rules that require you to own the bucket. Either way, do not skip maintenance entirely - unmaintained Iceberg tables accumulate small files and stale snapshots until reads slow to a crawl, which is exactly the failure mode the Iceberg table maintenance best practices post covers.

Querying Iceberg: Athena and EMR

Amazon Athena is the serverless front door. Athena engine version 3 supports the full Iceberg surface for tables that use Parquet and the Glue Data Catalog: reads, time travel, schema evolution, hidden partitioning, and transactional DML including MERGE INTO for upserts, per the Athena MERGE INTO docs. There is one limitation worth internalizing: Athena writes in merge-on-read mode only. Workloads that need copy-on-write semantics belong on Spark via EMR or Glue. For Athena tuning, see Athena cost and performance optimization.

A MERGE on an Iceberg table from Athena reads naturally:

MERGE INTO orders t
  USING staged_orders s
    ON t.order_id = s.order_id
  WHEN MATCHED THEN UPDATE SET status = s.status, updated_at = s.updated_at
  WHEN NOT MATCHED THEN INSERT (order_id, status, updated_at)
    VALUES (s.order_id, s.status, s.updated_at);
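If you want to run a statement like that from a pipeline rather than the console, one option is the Athena API through boto3. The helper below is a minimal sketch: the function name, the polling interval, and the idea of passing the client in are choices made for this example, while `start_query_execution` and `get_query_execution` are the standard Athena client calls.

```python
import time

# Athena reports QUEUED and RUNNING before one of these terminal states.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "CANCELLED"}


def run_athena_statement(athena, sql, database, workgroup, output_location,
                         poll_seconds=1.0):
    """Submit one SQL statement and block until Athena reports a terminal state.

    `athena` is expected to be boto3.client("athena"). Returns a
    (query_execution_id, final_state) tuple.
    """
    started = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        WorkGroup=workgroup,
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]
    while True:
        execution = athena.get_query_execution(QueryExecutionId=query_id)
        state = execution["QueryExecution"]["Status"]["State"]
        if state in TERMINAL_STATES:
            return query_id, state
        time.sleep(poll_seconds)
```

A call might look like `run_athena_statement(boto3.client("athena"), merge_sql, "sales", "primary", "s3://my-athena-results/")`, where the database, workgroup, and results location are placeholders. A production version would also add a timeout and surface the failure reason on `FAILED`.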

Amazon EMR is the heavy-compute path. EMR provisions clusters with Spark, Trino, Flink, and Hive, all of which can read and write the same Iceberg tables. This is where the canonical write path lives: Spark on EMR handles large backfills, copy-on-write tables, and complex transformations that Athena's serverless model is not built for; Flink on EMR handles streaming ingestion, including change-data-capture into Iceberg with equality deletes. The pairing of Flink and Iceberg is the standard way to land seconds-fresh data. Redshift rounds out the picture by reading the same tables through external schemas, so your warehouse queries the lake without a copy. The point that makes this stack worth the trouble: one Iceberg table, written once, is queryable by Athena, Spark, Trino, Flink, and Redshift concurrently because they all coordinate through a single catalog.
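On the Spark side, whether row-level operations rewrite data files or write delete files is controlled by Iceberg table properties. The sketch below only builds the DDL string, assuming a hypothetical table identifier `glue_catalog.sales.orders` and a Spark session already configured with an Iceberg catalog. The property names `write.delete.mode`, `write.update.mode`, and `write.merge.mode` come from Iceberg's table configuration.

```python
VALID_MODES = ("copy-on-write", "merge-on-read")


def write_mode_ddl(table, mode):
    """Return Spark SQL that pins DELETE, UPDATE, and MERGE to one write mode."""
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {VALID_MODES}, got {mode!r}")
    properties = ", ".join(
        f"'write.{operation}.mode'='{mode}'"
        for operation in ("delete", "update", "merge")
    )
    return f"ALTER TABLE {table} SET TBLPROPERTIES ({properties})"


# Hypothetical table name; run the result with spark.sql(...) on EMR.
cow_ddl = write_mode_ddl("glue_catalog.sales.orders", "copy-on-write")
```

Copy-on-write trades slower writes for cheaper reads, which tends to suit batch backfills. Merge-on-read keeps writes cheap and leans on compaction to keep reads fast.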

Interop and Federation Over the Iceberg REST Endpoint

The glue between every engine is the AWS Glue Iceberg REST endpoint. Once you integrate an S3 table bucket with the Glue Data Catalog, any Iceberg-compatible client connects to it through a standard REST catalog by pointing at https://glue.<region>.amazonaws.com/iceberg, enabling SigV4 signing with glue as the signing name, and setting the warehouse to <account-id>:s3tablescatalog/<table-bucket-name>. The official walkthrough is accessing S3 Tables via the Glue Iceberg REST endpoint. A PyIceberg catalog initialized against it looks like:

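from pyiceberg.catalog import load_catalog

# REST catalog backed by the AWS Glue Iceberg REST endpoint.
catalog = load_catalog(
    "s3tablescatalog",
    **{
        "type": "rest",
        # <account-id>:s3tablescatalog/<table-bucket-name>
        "warehouse": "111122223333:s3tablescatalog/my-table-bucket",
        "uri": "https://glue.us-east-1.amazonaws.com/iceberg",
        # Sign requests with SigV4, using "glue" as the signing name.
        "rest.sigv4-enabled": "true",
        "rest.signing-name": "glue",
        "rest.signing-region": "us-east-1",
    },
)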

Because that endpoint implements the open Iceberg REST spec, the same table is reachable from PyIceberg, Spark, or any other conformant client without an AWS-proprietary connector. Access control is enforced through a combination of IAM policies and AWS Lake Formation grants, which is where fine-grained, table- and column-level permissions for external engines are configured.
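The same endpoint settings carry over to Spark as catalog properties. The helper below is a sketch that assembles them as a plain dictionary and as `spark-submit` arguments. The catalog name `s3tables` is arbitrary, the account ID and bucket are placeholders, and the cluster is assumed to already have the Iceberg Spark runtime and AWS bundle on its classpath. Treat the exact property set as an assumption to verify against the AWS walkthrough for your Iceberg version.

```python
def rest_catalog_conf(catalog_name, account_id, table_bucket, region):
    """Spark properties for an Iceberg REST catalog on the Glue endpoint."""
    prefix = f"spark.sql.catalog.{catalog_name}"
    return {
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.type": "rest",
        f"{prefix}.uri": f"https://glue.{region}.amazonaws.com/iceberg",
        f"{prefix}.warehouse": f"{account_id}:s3tablescatalog/{table_bucket}",
        # SigV4 signing, mirroring the PyIceberg settings.
        f"{prefix}.rest.sigv4-enabled": "true",
        f"{prefix}.rest.signing-name": "glue",
        f"{prefix}.rest.signing-region": region,
    }


def as_spark_submit_args(conf):
    """Flatten a conf dict into ['--conf', 'key=value', ...]."""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


# Placeholder account and bucket, matching the PyIceberg example.
conf = rest_catalog_conf("s3tables", "111122223333", "my-table-bucket", "us-east-1")
submit_args = as_spark_submit_args(conf)
```

Inside a PySpark job the same dictionary can be applied with `SparkSession.builder.config(key, value)` in a loop, after which tables resolve as `s3tables.<namespace>.<table>`.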

Federation extends the reach the other way. Glue Data Catalog catalog federation lets AWS engines query remote Iceberg tables - those cataloged in another Iceberg REST catalog, including Databricks Unity Catalog - without copying or moving the data, per the catalog federation announcement. For a multi-engine read pattern beyond AWS's own services, the ClickHouse and Iceberg integration guide shows the same single-table-many-engines principle from a different angle.

Putting It Together: An AWS-Native Lakehouse

A reference shape for this stack reads cleanly from ingestion to query. Streaming events land in Amazon MSK (Kafka). Flink on EMR consumes them and writes Iceberg tables - applying CDC with equality deletes for mutable sources - into S3, registered in the Glue Data Catalog or an S3 Tables bucket. Batch sources flow through Spark on EMR or Glue ETL jobs into the same tables. From there, Athena serves ad-hoc SQL and MERGE upserts, EMR handles large transformations, and Redshift external schemas expose the tables to warehouse workloads. Lake Formation sits across the catalog enforcing access. Maintenance runs continuously, either as the Glue table optimizer or as the automatic S3 Tables service.

The decisions that determine whether this works in production:

One catalog per table, always. Every engine writing a given table must coordinate through the same catalog, or concurrent writes corrupt the table.
Pick the catalog by ownership needs. S3 Tables for hands-off new tables; Glue plus your own buckets when you need control over layout, tiering, or compliance.
Match the engine to the write. Athena for serverless SQL and merge-on-read DML; Spark on EMR for copy-on-write and backfills; Flink on EMR for streaming CDC.
Never skip maintenance. Compaction, snapshot expiry, and orphan-file cleanup are not optional; either let S3 Tables run them or enable the Glue optimizer.
Use the Iceberg REST endpoint for interop. It is the seam that lets PyIceberg, Spark, and federated catalogs reach the same tables without proprietary connectors.

Getting an Iceberg lakehouse right on AWS is less about the format and more about these operational seams - the catalog choice, the maintenance contract, and the federation boundaries. We help teams design and run exactly this stack. If you are weighing S3 Tables against a self-managed Glue setup, or untangling a multi-engine catalog, reach out.