AWS Athena is great at querying large volumes of data stored in Amazon S3. However, its pricing structure can be complex and needs to be understood to prevent cost overruns. In this article, we share proven tactics for improving Athena's performance while keeping costs to a minimum.

What is AWS Athena?

Amazon Athena is a powerful tool that allows users to analyze data stored in Amazon S3 using SQL. One of its main strengths is its ease of use. Even if you're new to data analysis, you can quickly get started with Athena and begin exploring your data.

To query files, you need to map them to tables created in AWS Glue Data Catalog. The table definition in Glue stores the schema, file location, and additional metadata.

Here are some of the reasons why Athena is so easy to use:

  • Serverless architecture: Athena is a serverless service, so you don't need to manage any infrastructure, and you pay only for the data your queries scan. This makes it easy to get started and to scale your data analysis as needed.
  • SQL support: Athena supports standard SQL, so you can use the same skills you already have to analyze your data.
  • Integration with BI tools: Athena integrates with popular BI tools like Amazon QuickSight, Tableau, Looker, Qlik, Power BI, Metabase, and Superset. This makes it easy to visualize and explore your data.

If you're just getting started with building your data platform, Athena is a great tool for batch data transformation using SQL and data exploration. It's easy to use, powerful, and scalable.

How can AWS Athena performance and cost be optimized?

Despite being easy to get started with, Athena costs can quickly ramp up if you use it inefficiently.

To reduce costs and improve query performance, be sure to follow industry best practices for preparing data for analysis and writing queries. Let's take a look at seven specific tactics that will help optimize your Athena experience.

1. Minimize data scanning by using data partitioning

Athena charges per query based on the volume of data scanned, on top of what you pay for storing the underlying data in S3. So, minimizing the amount of data scanned is crucial for cost savings.

One typical optimization strategy is partitioning tables based on attributes like date or location to reduce the amount of data examined. For example, if your table contains daily sales data, partitioning it by date lets queries scan only the partitions corresponding to a specified date range rather than the entire table.

For Glue tables stored on S3, data partitioning organizes data into subdirectories within S3, typically based on one or more columns (e.g., date or region). This structure allows queries to read only relevant partitions, reducing the volume of data scanned and lowering query costs.

You can define a partition when creating a table using the PARTITIONED BY clause in the CREATE TABLE statement.

CREATE EXTERNAL TABLE users (
first string,
last string,
username string
)
PARTITIONED BY (id string)
STORED AS parquet
LOCATION 's3://DOC-EXAMPLE-BUCKET/folder/'

For Glue tables that don’t use a table format like Iceberg or Delta Lake, you need to register the table’s partitions in Glue, which stores their metadata.

If the table already exists but its partitions aren’t mapped correctly in the Glue Data Catalog, you can add partitions in two ways:

Discovering partitions automatically

To automatically discover partitions, use the MSCK REPAIR TABLE command. This works when the file layout follows the Hive-style directory convention, where each partition is a key=value subdirectory (e.g., s3://DOC-EXAMPLE-BUCKET/table/dt=2016-05-14/). For example:

MSCK REPAIR TABLE students

Adding partitions manually

To add partitions manually – likely in scenarios where the files aren’t organized using the Hive-style format – use the ALTER TABLE command for each partition separately. For example:

ALTER TABLE orders ADD
  PARTITION (dt = '2016-05-14', country = 'IN');

How can you use partitioning effectively in Athena? Here are a few tips:

  • Partitions reduce data scanning only when the columns they are mapped to are used as filters in queries (see the query sketch after this list).
  • Glue stores additional metadata for each partition, which Athena needs to read to plan queries. A large number of partitions can significantly slow down the query planning stage; Glue partition indexes can help here.
  • Partitions work best on columns with bounded cardinality. Excessive partitioning results in many small files and degrades performance.
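For example, here is a sketch of a query against the orders table from the ALTER TABLE example above. Because the WHERE clause filters on the partition columns dt and country, Athena scans only the files under that partition's S3 prefix:

SELECT *
FROM orders
WHERE dt = '2016-05-14'
  AND country = 'IN';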

2. Bucketing

In traditional Hive tables, buckets are implemented at the file level as separate files within each partition directory, each file representing a single bucket. The number of buckets is fixed, and rows are assigned to files based on the hash value of a bucketing column. This ensures data is distributed across these files for efficient querying.

When a query filters on the bucketing column, Athena can hash the filter value to determine which bucket contains the matching records and read only that bucket's files, significantly reducing the amount of data that needs to be read.

Bucketing is more efficient than partitioning for high-cardinality columns because it avoids creating a directory, and the associated metadata, for every distinct value; instead, data is organized into a fixed, manageable number of files while still supporting targeted queries on specific values.

To create a bucketed table, use CLUSTERED BY (column) INTO <number of buckets> BUCKETS in the CREATE TABLE statement. As with partitions, bucketing is only effective if your query filters use the columns listed in the CLUSTERED BY clause. For example:

CREATE EXTERNAL TABLE orders(id int, price decimal)
PARTITIONED BY(order_date date)
CLUSTERED BY(id) INTO 256 BUCKETS
STORED AS parquet
LOCATION 's3://DOC-EXAMPLE-BUCKET/orders/';
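If you create the table with a CTAS statement instead of external DDL, bucketing can be specified through the bucketed_by and bucket_count table properties. Here is a minimal sketch, assuming the orders table above as the source and a hypothetical output location:

CREATE TABLE orders_bucketed
WITH (
  format = 'PARQUET',
  external_location = 's3://DOC-EXAMPLE-BUCKET/orders_bucketed/',
  partitioned_by = ARRAY['order_date'],
  bucketed_by = ARRAY['id'],
  bucket_count = 256
) AS
SELECT id, price, order_date  -- partition column must come last in a CTAS SELECT
FROM orders;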

3. Columnar file formats

Another effective way to improve query performance in AWS Athena is to use columnar file formats such as Apache ORC and Apache Parquet, which use efficient column-oriented encodings and are splittable.

Benefits of Using Columnar Formats in AWS Athena

  • Data is stored in a compact binary format that is efficient to serialize and deserialize.
  • Smaller in size due to built-in compression and efficient encoding.
  • Stores schema and metadata within the file itself.
  • Ability to skip irrelevant columns that aren't selected in the query, which is crucial since Athena charges by the amount of data scanned.

Predicate Pushdown in columnar file formats

Predicate pushdown in Parquet and ORC enables Athena to retrieve only the necessary data blocks, improving query efficiency. These files are structured with a header, data blocks, and metadata. Each block contains columnar data along with metadata that includes statistics, such as minimum and maximum values. By leveraging these block-level statistics, Athena determines whether to read or skip specific blocks when querying specific column values.
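A common way to get these benefits is to rewrite raw data into Parquet with a CTAS statement. Here is a minimal sketch, assuming a hypothetical raw_events source table and output location:

CREATE TABLE events_parquet
WITH (
  format = 'PARQUET',
  external_location = 's3://DOC-EXAMPLE-BUCKET/events_parquet/'
) AS
SELECT *
FROM raw_events;

Queries against the new table then read only the selected columns and can skip blocks whose min/max statistics rule out the filter values.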

4. File sizes and compression

In distributed systems like Athena, each file that needs to be scanned for a query introduces overhead during query planning and execution, as the system must read metadata and establish connections for every file. This can significantly degrade performance, especially with a large number of small files. To optimize query performance and reduce overhead, it is recommended to aim for file sizes between 128 MB and 1 GB. This range strikes a balance between efficient data processing and manageable file handling in distributed environments.

Compression codecs play a critical role in query performance and file size optimization. Splittable codecs, such as Snappy and LZO, allow large files to be divided into smaller chunks that can be processed in parallel by multiple query workers, improving efficiency in distributed systems. Non-splittable codecs, like GZIP, often achieve better compression ratios, resulting in smaller file sizes but at the cost of processing efficiency. Because GZIP compresses files into a single stream, it prevents parallel processing, potentially creating bottlenecks during query execution. For use cases where query performance is critical, especially with columnar file formats like Parquet and ORC, splittable codecs are typically preferred. However, GZIP may be a good choice when minimizing storage costs is a priority.
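When writing data with CTAS, you can also set the compression codec explicitly. Here is a sketch that uses the write_compression table property together with the hypothetical events_parquet table from the previous example:

CREATE TABLE events_snappy
WITH (
  format = 'PARQUET',
  write_compression = 'SNAPPY',
  external_location = 's3://DOC-EXAMPLE-BUCKET/events_snappy/'
) AS
SELECT *
FROM events_parquet;

Swapping 'SNAPPY' for 'GZIP' trades query-time efficiency for smaller files.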

5. Optimizing SQL queries

Optimize Order By

When sorting data, Athena sends all of the rows to a single worker to perform the sort, which is memory intensive and can cause the query to fail.

To mitigate this, you can use the LIMIT clause to restrict the query to the top N results. This approach reduces the volume of data processed by the worker, improving both query efficiency and reliability.

Example:

SELECT * FROM items ORDER BY item_date LIMIT 150

Optimize complex WHERE clauses

When optimizing complex WHERE clauses, the order and type of predicates can significantly impact query performance. While SQL query planners typically rearrange predicates for efficiency, it's still a good practice to design queries with performance in mind. Place simpler and cheaper predicates—such as those that involve constants, equality comparisons, or partition/bucket columns—at the beginning of the clause. This allows the query engine to quickly eliminate a large portion of irrelevant data before evaluating more complex or expensive conditions, such as subqueries or functions. Structuring the WHERE clause thoughtfully can help reduce the amount of data processed, resulting in faster query execution.

Let’s look at an example of a query involving a predicate that uses geospatial functions. Consider a scenario where you want to filter rows based on whether a point lies within a polygon using the ST_Contains function in AWS Athena. For instance:

SELECT *  
FROM geolocation_data  
WHERE ST_Contains(ST_GeometryFromText('POLYGON((...))'), ST_Point(longitude, latitude));  

While this query works, the ST_Contains function can be computationally expensive, as it evaluates every point against the complex polygon. You can improve performance by adding a bounding box predicate that quickly eliminates points outside the polygon’s bounding box before invoking ST_Contains. For example:

SELECT *  
FROM geolocation_data  
WHERE longitude BETWEEN min_long AND max_long  
  AND latitude BETWEEN min_lat AND max_lat  
  AND ST_Contains(ST_GeometryFromText('POLYGON((...))'), ST_Point(longitude, latitude));  

The bounding box predicate (longitude BETWEEN min_long AND max_long and latitude BETWEEN min_lat AND max_lat) is a simpler filter that excludes a large number of irrelevant rows, reducing the workload for the more expensive ST_Contains function. This combination ensures the query is both accurate and efficient.

Optimize JOIN

Optimizing joins in Athena requires careful consideration, particularly for queries running without table statistics. Without statistics, Athena cannot dynamically reorder joins to prioritize the most selective filters or efficiently plan the join order, often resulting in suboptimal performance. It also cannot leverage broadcast joins, where smaller tables are distributed to all workers to speed up execution.

By collecting and analyzing table statistics, Athena gains insight into data distribution and table sizes, enabling it to reorder joins for better performance and decide when to use broadcast joins. This optimization reduces data shuffling and improves query execution time, especially for complex queries involving large datasets.

To enable Athena to optimize queries and improve join performance, you can use the ANALYZE statement to collect table statistics. For example:

ANALYZE table_name COMPUTE STATISTICS;  

This command gathers essential statistics such as the number of rows, data size, and column-level details (e.g., distinct values, null counts). These statistics help Athena make informed decisions about join order and the use of broadcast joins.
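To make this concrete, here is a sketch of the kind of query that benefits, assuming a hypothetical large sales fact table and a small dim_products dimension table. Once statistics are available, Athena can recognize that dim_products is small enough to broadcast to every worker instead of shuffling both sides of the join:

SELECT p.category, SUM(s.amount) AS total_sales
FROM sales s
JOIN dim_products p ON s.product_id = p.id
GROUP BY p.category;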

Optimize GROUP BY

Statistics also play a crucial role in optimizing GROUP BY operations. When performing a GROUP BY query, Athena needs to determine the most efficient way to aggregate and group the data. Without statistics, it may struggle to choose the best execution plan, leading to inefficient resource usage and longer query times. By collecting statistics, Athena gains insights into column distribution and the number of distinct values, which helps it estimate the number of groups and the data size of each group.

This allows Athena to choose a more efficient approach for the aggregation, reducing memory usage and improving query performance.

Additionally, when examining a large amount of data, you may need to determine the number of unique values in a column for each group. Use the following query:

SELECT group_col, COUNT(DISTINCT column_name) FROM table_name GROUP BY group_col

However, calculating the exact number of distinct values this way has a high memory requirement. So, if the exact number doesn’t matter that much, you can use the approx_distinct() function as an alternative. It estimates the count from hashed values, which saves memory and speeds up query execution.

SELECT group_col, approx_distinct(column_name) FROM table_name GROUP BY group_col

The disadvantage of approx_distinct() is its inaccuracy: it returns an approximation with a standard error of around 2.3%, and the actual error depends on the distribution of the data and the number of unique values in the column.

6. Using Iceberg

Apache Iceberg is an open table format for working with large datasets.

Iceberg enhances Athena's performance and usability by introducing ACID capabilities, which are not natively supported by Athena on top of vanilla Hive tables. This ensures reliable and consistent operations, such as row-level updates and deletes, even in distributed environments.

Additionally, Iceberg simplifies 'day 2' operations, like adding or renaming columns, repartitioning tables, and applying partition transforms, which can otherwise be cumbersome.

Furthermore, Iceberg allows you to generate detailed column statistics using the AWS Glue console or AWS CLI. These statistics provide valuable insights that enable Athena to optimize query execution, resulting in improved performance and more efficient resource utilization.

An example of creating an Iceberg table:

CREATE TABLE table_iceberg(
id int,
name string,
data string)
PARTITIONED BY (name)
LOCATION 's3://data/table_iceberg/'
TBLPROPERTIES (
'table_type'='ICEBERG',
'format'='parquet'
)   

Performing timely Iceberg table maintenance is also important for keeping query performance high and storage costs low.
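For example, Athena provides OPTIMIZE and VACUUM statements for Iceberg tables. Here is a sketch of a periodic maintenance run on the table_iceberg table created above:

-- Compact small data files into larger ones
OPTIMIZE table_iceberg REWRITE DATA USING BIN_PACK;

-- Expire old snapshots and remove files that are no longer referenced
VACUUM table_iceberg;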

7. Using Provisioned Capacity

When ramping up usage of Athena, rate limits can become a common challenge, particularly when dealing with high query volumes or complex workloads.

Using provisioned capacity can help mitigate these issues by providing dedicated compute resources that are not subject to the default rate limits, which keeps queuing to a minimum and query start latency low.

Multiple queries can share the provisioned capacity concurrently, and you can add capacity units as needed to align with evolving business demands.

Additionally, depending on the nature of your queries, provisioned capacity can also be more cost-effective than relying on Athena's on-demand model, especially for workloads that scan high volumes of data. By strategically leveraging provisioned capacity, you can optimize both performance and cost efficiency.

The optimal entry point for adopting provisioned capacity is when your monthly spending on Athena reaches or exceeds $100. For more information on how to create a capacity reservation, see the AWS Athena documentation.

Lastly

AWS Athena is a powerful tool for analyzing large data sets, but knowing how to use it efficiently is essential. Our world-class experts are here to help you run Athena at optimal performance while keeping your costs to a minimum. Contact us for more information.