A practical engineering guide to AWS Glue covering its architecture, job types, worker sizing, Data Catalog, cost optimization, and where Glue fits in a modern data platform.

AWS Glue is a fully managed, serverless data integration service that handles ETL workloads and provides a central metadata catalog for the AWS analytics ecosystem. Since its launch in August 2017, it has grown from a straightforward ETL offering into a core piece of the AWS data platform - powering schema management for Athena, EMR, and Redshift Spectrum while running Spark, Python, and Ray workloads with zero infrastructure to manage.

What Glue actually solves for data engineers: no more provisioning Spark clusters, no more running your own Hive Metastore, no more stitching metadata together across analytics services. You define sources and transformations; Glue pulls compute from a warm pool, runs the job, and tears it down. Billing is per-second, with a one-minute minimum on Glue 2.0 and later. But getting value out of it means understanding the components, the sizing model, and the trade-offs - particularly where Glue makes sense versus where other tools fit better.

Architecture and Core Components

AWS Glue is really two things that can be used independently: a metadata catalog and an ETL execution engine.

The Data Catalog is a persistent, Hive Metastore-compatible metadata repository - one per AWS account per region. It stores table definitions (column names, types, physical locations, SerDe info, partition keys) organized into databases. Athena, EMR, Redshift Spectrum, and Lake Formation all share it as their metastore. That makes it the metadata layer for your data lake, and it tends to outlive any individual ETL job or query engine you choose. Pricing is generous: the first million objects and million requests per month are free. Beyond that, $1 per 100,000 objects/month and $1 per million requests.

Crawlers scan data stores (S3, JDBC databases, DynamoDB, and others), infer schemas using built-in or custom classifiers, and populate the Data Catalog. They detect new files, schema changes, and new partitions. Convenient for bootstrapping a catalog. Less so long-term - many teams move to managing catalog entries via Terraform or boto3 once schemas stabilize, since crawlers can misidentify column types and add cost when running frequently on large datasets.

ETL Jobs are the compute layer. Submit a job, Glue provisions a Spark cluster (or Ray cluster), runs your code, releases the resources. Glue extends standard Spark with two abstractions worth knowing: GlueContext for catalog-aware read/write operations, and DynamicFrame for handling schema inconsistencies - say, a field that's an integer in some records and a string in others - without crashing the job.
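To make the choice-type idea concrete: the plain-Python sketch below (hypothetical records, not the Glue API) mimics what `resolveChoice(specs=[("id", "cast:long")])` does on a DynamicFrame when a field arrives with mixed types.

```python
# Hypothetical records where "id" arrives as int in some rows and str in
# others - the schema drift a plain DataFrame read would choke on, but a
# DynamicFrame carries as a "choice" type until you resolve it.
records = [
    {"id": 42, "name": "a"},
    {"id": "43", "name": "b"},   # same field, different type
]

def cast_choice_to_long(rows, field):
    """Mimic resolveChoice(specs=[(field, "cast:long")]): force one type."""
    return [{**r, field: int(r[field])} for r in rows]

resolved = cast_choice_to_long(records, "id")
print([r["id"] for r in resolved])  # [42, 43]
```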

Job Types and Worker Sizing

Four job types, each targeting a different workload profile.

Spark ETL jobs (glueetl) are the workhorse. PySpark or Scala on a managed Spark cluster, handling batch processing from megabytes to terabytes. From Glue 4.0 onward (Spark 3.3), Adaptive Query Execution (AQE) is on by default - it coalesces post-shuffle partitions and converts sort-merge joins to broadcast joins at runtime, using actual data statistics rather than static estimates.

Streaming ETL jobs (gluestreaming) run Spark Structured Streaming with 100-second micro-batches by default. Ingest from Kinesis Data Streams or Apache Kafka/MSK, transform in-flight, write to S3 or JDBC targets. Checkpointing tracks read positions here, not job bookmarks.

Python Shell jobs (pythonshell) run on a single instance - no Spark cluster. Good for API calls, small file processing, and lightweight orchestration. You choose either 0.0625 DPU (1/16 of a DPU) or 1 full DPU.

Ray jobs (glueray), introduced with Glue 4.0, use the Ray framework for Python-native distributed computing. ML inference, distributed Python, workloads where Spark is overkill or a poor fit.

Choosing Worker Types

Worker sizing has a direct impact on both cost and performance. The DPU (Data Processing Unit) is Glue's compute abstraction: 1 DPU = 4 vCPUs, 16 GB memory, billed at $0.44/DPU-hour.

Worker  DPU  vCPU  Memory   Disk    Best for
G.1X    1    4     16 GB    94 GB   Standard transforms, joins
G.2X    2    8     32 GB    138 GB  Moderate transforms, ML
G.4X    4    16    64 GB    256 GB  Large aggregations
G.8X    8    32    128 GB   512 GB  Most demanding workloads

If your jobs are hitting OOM errors, look at R-type workers before blindly scaling up. R.1X through R.8X offer a 1:8 vCPU-to-memory ratio instead of G-type's 1:4. An R.2X gives you 8 vCPUs and 64 GB memory at $0.52/DPU-hour - frequently cheaper than jumping to a G.4X just because you need more RAM.

Auto Scaling (Glue 3.0+) lets Glue dynamically adjust worker count. You set NumberOfWorkers as the ceiling, and Glue adds or removes executors based on Spark parallelism at each stage. AWS benchmarks showed up to 83% cost reduction on variable workloads.

Practical Patterns and Cost Optimization

The single most common performance problem in Glue: small files. Millions of tiny files in S3 will exhaust the Spark driver's memory just building the file index. The fix is the groupFiles option, which batches small files into larger in-memory partitions:

datasource = glueContext.create_dynamic_frame.from_catalog(
    database="my_db",
    table_name="my_table",
    additional_options={
        "groupFiles": "inPartition",
        "groupSize": "1048576"  # 1 MB target group size
    },
    transformation_ctx="datasource"
)

On the write side, coalesce(N) before output prevents producing thousands of tiny files. Unlike repartition(N), it skips the full shuffle.

Job bookmarks enable incremental processing between runs. For S3 sources, they track last_modified timestamps. For JDBC, a monotonically increasing key column. One gotcha worth memorizing: the transformation_ctx parameter is required for bookmarks to function. Omit it, and bookmarks silently stop working - no error, no warning. You also need both job.init() and job.commit() in your script.
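Besides the script-side requirements (transformation_ctx, job.init(), job.commit()), bookmarks must also be enabled at the job or run level via a default argument. A sketch of the run-level flag (job name is a placeholder; the boto3 call is shown but not executed):

```python
# Enable bookmarks for a run; the script must still pass transformation_ctx
# on each read and call job.init() / job.commit().
run_args = {
    "JobName": "nightly-transform",   # hypothetical job name
    "Arguments": {"--job-bookmark-option": "job-bookmark-enable"},
}
# import boto3
# boto3.client("glue").start_job_run(**run_args)
```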

For tables with many partitions, pair catalogPartitionPredicate with push_down_predicate. They do different things. catalogPartitionPredicate filters server-side in the Data Catalog via partition indexes. push_down_predicate kicks in after listing partitions but before reading S3 files:

datasource = glueContext.create_dynamic_frame.from_catalog(
    database="my_db",
    table_name="my_table",
    push_down_predicate="year='2025' and month='12'",
    additional_options={
        "catalogPartitionPredicate": "year='2025' and month='12'"
    },
    transformation_ctx="datasource"
)

Controlling Costs

Three strategies with the biggest payoff:

  1. Right-size workers. Start with Auto Scaling and a generous maximum. Watch glue.driver.aggregate.numActiveExecutors in CloudWatch. If peak usage sits at 15 out of 50 allocated workers, drop the max to 20.

  2. Flex execution for non-urgent jobs. $0.29/DPU-hour instead of $0.44 - a 34% cut. Flex jobs may start later and run longer since they draw from spare capacity, but for nightly batches and backfills, the math works out.

  3. Set explicit timeouts. The default for standard jobs is 2,880 minutes - that's 48 hours. A stuck job can burn DPU-hours for two full days before Glue kills it. Set the timeout to 2-3x your expected runtime.
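Putting numbers on the timeout trap makes the point: for a modest 10-DPU job, the math below (using the standard rate quoted above) compares a runaway job killed at the default timeout versus one capped at 3x a 30-minute expected runtime.

```python
# Cost of a stuck 10-DPU job left running until Glue's timeout fires.
def runaway_cost(dpus, rate_per_dpu_hour, timeout_minutes):
    return dpus * rate_per_dpu_hour * timeout_minutes / 60

default_timeout = runaway_cost(10, 0.44, 2880)   # the 48-hour default
explicit_timeout = runaway_cost(10, 0.44, 90)    # 3x a 30-minute job
print(f"default: ${default_timeout:.2f}  explicit: ${explicit_timeout:.2f}")
# default: $211.20  explicit: $6.60
```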

Where Glue Fits - and Where It Doesn't

Glue works best when you want serverless ETL tightly integrated with AWS analytics. Crawling S3, converting JSON/CSV to Parquet or Iceberg, making data queryable via Athena - that's the sweet spot. No clusters to manage, and the Data Catalog serves as the metadata backbone for everything downstream.

The current version, Glue 5.0 (December 2024), runs Spark 3.5.4 with Python 3.11 and Java 17. Open table format support (Apache Iceberg, Delta Lake, Apache Hudi) has been native since Glue 3.0 and ships with updated library versions in 5.0. Lake Formation integration brings Spark-native fine-grained access control. The Glue Schema Registry manages Avro, JSON Schema, and Protobuf schemas on Kafka/Kinesis streams at no extra cost.
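As a concrete example of the native table-format support, enabling Iceberg on a recent Glue version comes down to job arguments. The sketch below shows the general shape (warehouse path is a placeholder, and the Spark conf list is abbreviated - a production job typically also sets the Glue catalog and S3 IO implementations):

```python
# Partial sketch of Iceberg-enabling job arguments for a Glue 4.0+/5.0 job.
iceberg_args = {
    "--datalake-formats": "iceberg",
    "--conf": (
        "spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions"
        " --conf spark.sql.catalog.glue_catalog=org.apache.iceberg.spark.SparkCatalog"
        " --conf spark.sql.catalog.glue_catalog.warehouse=s3://my-bucket/warehouse/"  # placeholder
    ),
}
```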

Where it falls short. Long-running, cost-sensitive workloads at scale run cheaper on EMR, which offers more control and Spot instance support. Need Flink, Trino, or HBase? EMR. Databricks provides a more complete platform - collaborative notebooks, Photon's vectorized engine, MLflow for ML workflows. For pipeline orchestration beyond simple job chaining, Airflow (MWAA) or Step Functions are the better fit; Glue's built-in workflow capabilities only cover Glue jobs and crawlers.

It often comes down to team structure. Teams without dedicated platform engineers lean toward Glue for the zero-ops model. Teams running large-scale production pipelines tend to land on EMR or Databricks, where the engineering investment pays back through lower compute bills and broader flexibility.

Key Takeaways

  • AWS Glue is two things: a Hive Metastore-compatible Data Catalog shared by Athena, EMR, and Redshift Spectrum, and a serverless ETL engine running Spark, Python, or Ray jobs.
  • Start with G.1X workers and Auto Scaling. Adjust based on CloudWatch metrics. Reach for R-type workers when memory is the bottleneck, not CPU.
  • Address the small files problem early - groupFiles on input, coalesce() on output. Use partition indexes and catalogPartitionPredicate for heavily partitioned tables.
  • Flex execution saves 34% on non-urgent jobs. Always set explicit timeouts - the 48-hour default is a cost trap.
  • Glue 5.0 delivers native Iceberg/Delta/Hudi support and Lake Formation integration. For AWS-native data lake ETL, it's the lowest-friction option. For complex, cost-sensitive, or multi-engine workloads, look at EMR or Databricks.

For help designing your data platform, optimizing Glue jobs, or evaluating alternatives, contact our team.