Amazon Athena is a serverless, Trino-based SQL engine that queries data in Amazon S3 with no clusters to manage and pricing of $5 per TB scanned. This explainer covers its architecture, engine v3, Iceberg support, and when to pick it over Redshift Spectrum.
Most teams meet Amazon Athena the same way: data is already sitting in Amazon S3, someone needs to run a SQL query against it, and nobody wants to stand up a cluster to do it. You point Athena at the bucket, define a table, and run SELECT. There is nothing to provision, patch, or scale. That convenience is also the source of every surprise people hit later, from runaway scan bills to confusion about how it relates to Redshift, Glue, and the lakehouse table formats everyone is now standardizing on.
This is the conceptual pillar. If you are here for the tuning checklist - partitioning layouts, compression codecs, CTAS rewrites - we cover that in depth in our AWS Athena cost and performance optimization guide. This piece explains what Athena actually is, how the engine works, where Apache Iceberg fits, and how to decide between Athena and the Redshift options that look similar on paper.
What Amazon Athena is
Amazon Athena is a serverless, interactive query service that runs standard SQL directly against data stored in Amazon S3, with no infrastructure to manage and pricing based on the amount of data each query scans. You define a schema over files that already exist in S3, and Athena reads them in place. There is no ingestion step and no separate storage layer that you pay to keep running.
The model is "schema-on-read." Your data stays as Parquet, ORC, JSON, CSV, or Avro objects in a bucket. The table definition is metadata that tells Athena how to interpret those bytes - column names, types, file format, and partition layout. Because the table is just a view over files, you can drop and recreate it without touching a single object, and let multiple engines read the same files. Athena is the query layer, S3 is the storage layer, and the two scale independently. You are not sizing a warehouse to match peak load or paying for idle compute overnight. You pay per query, and only for the bytes that query reads.
How Athena works under the hood
Athena is built on open source query engines. The SQL engine is a managed fork that tracks Trino and Presto, so the dialect, functions, and execution model will feel familiar to anyone who has used either. When you submit a query, Athena parses it, builds a distributed plan, pulls table and partition metadata from a catalog, reads the relevant objects from S3, executes the plan across a pool of compute it manages for you, and writes results back to an S3 location you configure.
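The submit, wait, and read-results flow described above can be sketched from the client side. This is a minimal illustration, assuming a client object that behaves like `boto3.client("athena")`; the helper name `run_query` and the `poll_seconds` parameter are ours, not part of any AWS API.

```python
import time

# States after which Athena will not change the query's status again.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "CANCELLED"}


def run_query(client, sql, database, output_location, poll_seconds=1.0):
    """Submit a query, poll until it reaches a terminal state, return a summary.

    `client` is assumed to expose start_query_execution and
    get_query_execution the way boto3's Athena client does.
    """
    started = client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]

    while True:
        execution = client.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]
        state = execution["Status"]["State"]
        if state in TERMINAL_STATES:
            break
        time.sleep(poll_seconds)

    return {
        "query_id": query_id,
        "state": state,
        # Bytes read from S3, which is what the per-TB price is applied to.
        "bytes_scanned": execution.get("Statistics", {}).get("DataScannedInBytes", 0),
        # Where Athena wrote the result file in S3.
        "result_location": execution["ResultConfiguration"]["OutputLocation"],
    }
```

Passing the client in rather than creating it inside the function keeps the sketch easy to exercise with a stub, and makes it clear that the caller never touches the compute, only a query ID and an S3 result location.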
The catalog is where Glue enters the picture. Athena uses the AWS Glue Data Catalog as its metastore - the central registry of databases, tables, columns, and partition locations. A Glue crawler can populate the catalog automatically by inferring schema from your files, or you can define tables yourself with DDL. The same catalog is shared by Glue ETL jobs, Amazon EMR, and Redshift, which is why a table defined once can be queried by several services. The Data Catalog is billed separately from Athena under Glue pricing. For a deeper treatment of the catalog and Glue's job model, see our complete guide to AWS Glue.
One consequence of the serverless model: query compute is drawn from a shared, multi-tenant pool. Most of the time you will not notice, but during regional peak hours a large ad-hoc query can queue or run slower than it would on dedicated hardware. That trade-off - zero idle cost for less predictable latency - is the defining characteristic of the service, and it drives most of the "when to use it" decisions later.
Engine v2 vs v3
Athena engine version 3 is the current default for new workgroups. It moved to a continuous integration model that pulls improvements from upstream Trino and Presto faster, and AWS shipped it with over 90 query performance improvements, 50 new SQL functions, and 30 new features relative to v2. Practical additions include MATCH_RECOGNIZE for row pattern matching, listagg, reading LZ4 and ZSTD compressed Parquet, and faster Glue metadata retrieval for queries that touch many tables.
The upgrade is not free of edges. Engine v3 enforces ANSI SQL more strictly, so some queries that ran on v2 will fail until you adjust them. CONCAT now requires at least two arguments, nested columns in GROUP BY must be double quoted, and the Iceberg time-travel syntax changed from FOR SYSTEM_TIME AS OF to FOR TIMESTAMP AS OF. If you are migrating an older workgroup, test before you flip the engine version.
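One way to find affected queries before flipping the version is to scan saved SQL for the old time-travel clause. The sketch below is a rough, regex-based check rather than a SQL parser, so it will also match the phrase inside string literals or comments; the function names are hypothetical.

```python
import re

# Matches the engine v2 Iceberg time-travel clause, ignoring case and extra spacing.
_V2_TIME_TRAVEL = re.compile(r"\bFOR\s+SYSTEM_TIME\s+AS\s+OF\b", re.IGNORECASE)


def uses_v2_time_travel(sql: str) -> bool:
    """Return True if the query text contains the v2 time-travel clause."""
    return _V2_TIME_TRAVEL.search(sql) is not None


def to_v3_time_travel(sql: str) -> str:
    """Rewrite the v2 clause to the engine v3 form, leaving everything else as-is."""
    return _V2_TIME_TRAVEL.sub("FOR TIMESTAMP AS OF", sql)
```

Running `uses_v2_time_travel` over a directory of saved queries gives a quick list of candidates to retest; the rewritten text should still be reviewed and run against a v3 workgroup before it replaces the original.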
Pricing: you pay for bytes scanned
Standard Athena queries cost $5 per terabyte of data scanned, billed by the bytes actually read from S3, rounded up to the nearest megabyte, with a 10 MB minimum per query. You are not charged for failed queries or for DDL. There is no charge for the service when no queries are running, which is what makes Athena attractive for spiky, intermittent workloads.
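The billing rule is easy to express as arithmetic. The sketch below assumes binary units (1 MB = 1024² bytes, 1 TB = 1024⁴ bytes), which is our assumption rather than something stated above, and it does not model the cases that are free of charge, such as failed queries and DDL.

```python
MB = 1024 ** 2  # assumption: binary megabyte
TB = 1024 ** 4  # assumption: binary terabyte
PRICE_PER_TB_USD = 5.0
MIN_BILLED_BYTES = 10 * MB


def billed_bytes(bytes_scanned: int) -> int:
    """Round up to the nearest megabyte, then apply the 10 MB minimum."""
    rounded = -(-bytes_scanned // MB) * MB  # integer ceiling division
    return max(rounded, MIN_BILLED_BYTES)


def query_cost_usd(bytes_scanned: int) -> float:
    """Cost of one successful query at $5 per TB scanned."""
    return billed_bytes(bytes_scanned) / TB * PRICE_PER_TB_USD
```

The 10 MB floor matters mostly for workloads that fire many tiny queries: each is billed as if it scanned 10 MB, so thousands of small lookups can add up even though no single query reads much data.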
Because the meter is bytes scanned, your storage layout is your cost-control lever. Three choices dominate the bill:
- Columnar formats. A query that selects four columns from a 200-column Parquet table reads only those four columns. The same query against raw CSV or JSON has to read every row in full. Converting to Parquet or ORC routinely cuts scan volume by an order of magnitude.
- Partitioning. Partitioning by a column such as date lets a WHERE clause prune entire prefixes of S3 objects so they are never opened. A query over one day of a year's worth of logs should scan roughly 1/365 of the data.
- Compression. Athena scans the compressed bytes. ZSTD or Snappy Parquet means fewer bytes read, which means a smaller bill, on top of the columnar pruning.
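To see how these levers compound, here is a back-of-the-envelope estimate. It is a simplification under loud assumptions: every partition holds the same amount of data, every column takes the same share of the bytes, and `stored_bytes` is the size as stored in S3 (already compressed, if it is compressed). Real tables are rarely that uniform, so treat the output as an order-of-magnitude guide.

```python
TB = 1024 ** 4  # assumption: binary terabyte
PRICE_PER_TB_USD = 5.0


def estimated_scan_bytes(stored_bytes, partitions_read, partitions_total,
                         columns_read, columns_total, columnar):
    """Estimate bytes scanned given partition pruning and column pruning.

    Row formats such as CSV or JSON cannot skip columns, so the column
    fraction only applies when `columnar` is True.
    """
    partition_fraction = partitions_read / partitions_total
    column_fraction = columns_read / columns_total if columnar else 1.0
    return stored_bytes * partition_fraction * column_fraction


def estimated_cost_usd(scan_bytes):
    """Price the estimate at $5 per TB scanned (ignores rounding and minimums)."""
    return scan_bytes / TB * PRICE_PER_TB_USD


# A year of logs, 10 TB stored, 200 columns, one partition per day.
full_scan = estimated_scan_bytes(10 * TB, 365, 365, 4, 200, columnar=False)
one_day_four_cols = estimated_scan_bytes(10 * TB, 1, 365, 4, 200, columnar=True)
```

Under these assumptions the unpartitioned row-format scan reads all 10 TB, while the partitioned columnar query reads 1/365 of the table and then 4/200 of that, which is why layout changes usually matter more than any query-level tweak.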
A subtle point: SELECT * with no filter on a large table reads everything and bills accordingly, and so does the AWS console's habit of running a preview query. The concrete tuning patterns - file sizing, bucketing, predicate pushdown, CTAS to reshape data - live in our dedicated Athena cost and performance guide. For predictable, heavy usage there is also a capacity-based option, where you reserve compute as Data Processing Units rather than paying per scan, which can be cheaper and gives more consistent latency.
Partition projection
Partition projection is worth calling out because it changes both performance and operations. Instead of storing every partition's location in the Glue Data Catalog and looking them up at query time, you describe the partition scheme as table properties - a range and a type - and Athena computes partition values and locations in memory. The GetPartitions call to Glue disappears, in-memory calculation replaces a remote lookup, and you no longer need a crawler to register new partitions as data lands.
```sql
CREATE EXTERNAL TABLE logs (
  request_id string,
  status int,
  message string
)
PARTITIONED BY (dt string)
STORED AS PARQUET
LOCATION 's3://my-bucket/logs/'
TBLPROPERTIES (
  'projection.enabled' = 'true',
  'projection.dt.type' = 'date',
  'projection.dt.range' = '2023-01-01,NOW',
  'projection.dt.format' = 'yyyy-MM-dd',
  'projection.dt.interval' = '1',
  'projection.dt.interval.unit' = 'DAYS',
  'storage.location.template' = 's3://my-bucket/logs/${dt}/'
);
```
The catch is that projection only applies when the table is read through Athena. Read the same table from Redshift Spectrum or EMR and they fall back to standard catalog partitions. And if more than half of your projected partitions are empty, the in-memory enumeration can be slower than plain catalog partitions, so projection suits dense, predictable layouts such as daily logs rather than sparse ones.
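To make "computes partition values and locations in memory" concrete, here is a toy model of what the table properties above describe. It is an illustration of the idea, not Athena's implementation; the function name is hypothetical, and the `NOW` end of the range is represented by whatever end date the caller passes in (for example `date.today()`).

```python
from datetime import date, timedelta


def projected_partitions(start, end, template, interval_days=1):
    """Yield (dt, location) pairs for a date-typed projected partition column.

    Mirrors a 'yyyy-MM-dd' format with a DAYS interval unit: values are
    generated from the range, and each location comes from substituting
    the value into the storage location template.
    """
    current = start
    while current <= end:
        dt = current.strftime("%Y-%m-%d")
        yield dt, template.replace("${dt}", dt)
        current += timedelta(days=interval_days)


partitions = list(
    projected_partitions(
        date(2023, 1, 1),
        date(2023, 1, 3),
        "s3://my-bucket/logs/${dt}/",
    )
)
```

Nothing here consults a catalog: the partition list is a pure function of the range, interval, and template. That is also why sparse layouts fit poorly, since the calculation produces a location for every value in the range whether or not any data exists under it.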
Athena and Apache Iceberg
For years, Athena tables were read-mostly. You could append files, but row-level UPDATE and DELETE were not part of the picture, because plain Hive-style tables on S3 have no transaction layer. Apache Iceberg closes that gap, and Athena supports it natively.
Athena supports read, time travel, write, and DDL on Apache Iceberg tables, which brings ACID transactions, row-level INSERT, UPDATE, DELETE, and MERGE, schema evolution, hidden partitioning, and snapshot-based time travel to data sitting in S3. Athena creates and operates on Iceberg v2 tables. Time travel reads a consistent snapshot as of a timestamp or snapshot ID, which means you can query the table as it looked yesterday without keeping a separate copy:
```sql
-- Query the table as of a point in time
SELECT * FROM orders FOR TIMESTAMP AS OF (current_timestamp - interval '1' day);

-- Or as of a specific snapshot
SELECT * FROM orders FOR VERSION AS OF 949530903748831860;
```
Iceberg also changes how partitioning works. With hidden partitioning, the table tracks the relationship between a column and its partition transform in metadata, so queries do not need a special WHERE clause to benefit from pruning, and you can evolve the partition scheme without rewriting old data. That is a real operational improvement over the partition projection approach above, though projection still has a place for append-only log tables that do not need DML.
Athena's Iceberg support pairs naturally with managed table storage. Amazon S3 Tables (table buckets) provide fully managed Iceberg tables with automatic compaction and snapshot maintenance, which removes the table-housekeeping chores you would otherwise script yourself. If you want the conceptual grounding before choosing, our pieces on Iceberg vs Delta Lake and the broader data lake architecture put these formats in context, and we have a practical lakehouse integration guide for teams already running Iceberg.
Beyond SQL: federation and Apache Spark
Athena is not limited to S3. Through Athena Federated Query, a Lambda-based connector lets a single SQL statement join S3 data with sources such as Amazon DynamoDB, Amazon RDS and other relational databases, Amazon DocumentDB, OpenSearch, and Redshift. The connector translates Athena's requests into each source's native API, so you can correlate a data lake table with operational data without an ETL pipeline copying it into S3 first. The classic pattern here - exploratory analysis and ETL with Presto-style federation over Glue-cataloged data - is something we have written about in our Presto and AWS Glue walkthrough.
There is also Athena for Apache Spark, a serverless PySpark environment with notebooks in the Athena console for work that does not fit cleanly into SQL - complex transformations, ML feature engineering, iterative exploration. It uses the same Glue Data Catalog as the SQL engine, and Spark applications are billed at $0.35 per DPU-hour, where one DPU is 4 vCPUs and 16 GB of memory, with no charge for the notebook node. It gives you Spark's flexibility without operating an EMR cluster, with the same caveat as the rest of Athena: shared serverless capacity over guaranteed throughput.
Workgroups tie these capabilities together for governance. A workgroup isolates queries, sets the result-output location and encryption, and can enforce a per-query data-scanned limit so a single bad SELECT * cannot run up a four-figure bill. Use separate workgroups to split costs between teams and to apply different controls to ad-hoc analysts versus production pipelines.
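As a sketch of the per-query limit, the helper below builds the request for boto3's `create_work_group` call and shows the rough cost ceiling a given limit implies at $5 per TB. The workgroup name, bucket path, and helper names are placeholders of ours, and the binary-terabyte constant is an assumption.

```python
TB = 1024 ** 4  # assumption: binary terabyte
PRICE_PER_TB_USD = 5.0


def workgroup_request(name, output_location, max_bytes_per_query):
    """Build keyword arguments for the Athena create_work_group API."""
    return {
        "Name": name,
        "Configuration": {
            "ResultConfiguration": {"OutputLocation": output_location},
            # Make the workgroup settings override client-side settings.
            "EnforceWorkGroupConfiguration": True,
            # Queries that scan more than this many bytes are cancelled.
            "BytesScannedCutoffPerQuery": max_bytes_per_query,
        },
    }


def max_cost_per_query_usd(max_bytes_per_query):
    """Approximate worst-case scan charge for one query under the limit."""
    return max_bytes_per_query / TB * PRICE_PER_TB_USD


def create_workgroup(request):
    """Send the request to AWS. Needs boto3 and credentials, so it is not called here."""
    import boto3  # imported lazily so the helpers above work without AWS access

    boto3.client("athena").create_work_group(**request)


# Hypothetical ad-hoc workgroup capped at 1 TB scanned per query.
adhoc = workgroup_request("adhoc-analysts", "s3://my-bucket/athena-results/adhoc/", 1 * TB)
```

A 1 TB cutoff bounds a single query at roughly $5, whereas it takes about 200 TB of scanning to reach a four-figure charge, so even a generous limit removes the worst outcomes.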
When to use Athena, and when not to
Athena is the right tool when your data lives in S3, your query patterns are interactive or intermittent, and you would rather not run a warehouse around the clock. It shines for log analytics, ad-hoc exploration, lightweight ETL via CTAS, and serving as the SQL front door to a data lake. The decision that trips people up most often is Athena versus the Redshift options, because all three can query S3.
| | Amazon Athena | Redshift Spectrum | Redshift Serverless |
|---|---|---|---|
| Compute model | Fully serverless, shared pool | Runs on your Redshift cluster's compute | Serverless Redshift (RPU-based) |
| Where data lives | S3 (plus federated sources) | S3, joined with Redshift tables | Redshift managed storage, plus S3 via Spectrum |
| Pricing | $5 per TB scanned (or DPU capacity) | $5 per TB scanned on S3 + Redshift compute | Billed per Redshift Processing Unit-hour |
| Latency profile | Variable under load, no idle cost | Tied to cluster size, more consistent | Consistent, scales with usage |
| Best for | Ad-hoc and intermittent S3 queries | Joining S3 data with an existing warehouse | Warehouse workloads without cluster management |
| Setup overhead | Define a table, run SQL | Requires a Redshift cluster | Provision a serverless workgroup |
The short version: pick Athena for serverless, pay-per-query analytics over S3 with minimal setup, especially when usage is bursty. Pick Redshift Spectrum when queries are anchored to an existing Redshift warehouse and you need to join S3 data with warehouse tables or load results back in. Pick Redshift Serverless when you want full warehouse semantics - high concurrency, consistent low latency, materialized views, result caching - without managing a cluster. Athena and Spectrum share the $5 per TB external-scan price; the difference is whether you are also paying for Redshift compute and getting Redshift's predictability in return.
Athena is the wrong choice in a few clear cases. It is not built for sub-second BI dashboards refreshing for hundreds of concurrent users, where the shared pool and per-query planning show up as latency and a purpose-built warehouse serves better. It is not ideal for very high, steady concurrency, where reserved compute is faster and more predictable. And for large, constant workloads, per-TB scanning can cost more than a right-sized cluster running the same queries all day - the point where capacity reservations or Redshift Serverless win on price.
Key takeaways
- Athena is serverless SQL over S3. A managed Trino/Presto fork queries files in place using schema-on-read, with the Glue Data Catalog as the shared metastore. No clusters, no ingestion step.
- Engine v3 is the default and tracks upstream Trino closely, with stricter ANSI SQL that can break some v2 queries on upgrade. Test before flipping the version.
- You pay $5 per TB scanned. Columnar formats, partitioning, and compression are the cost levers; partition projection removes per-query Glue lookups for dense, predictable layouts.
- Iceberg brings ACID DML, schema evolution, and time travel to S3 data, and Athena supports it natively on Iceberg v2 tables, with S3 Tables handling the maintenance.
- Federated Query and Athena for Apache Spark extend Athena past S3 SQL to other data sources and to serverless PySpark, all on the same catalog.
- Choose Athena over Redshift Spectrum or Serverless when workloads are intermittent and S3-centric; choose Redshift when queries are warehouse-anchored or need consistent low latency at high concurrency.
If you are weighing Athena against Redshift, Snowflake, or an Iceberg-based lakehouse for a specific workload, our team helps data teams design and tune these architectures for cost and performance.