AWS Glue is Amazon's serverless data integration service that orchestrates extract, transform, and load (ETL) workflows at scale. Built on Apache Spark, it eliminates infrastructure management while providing a Data Catalog as a central metadata repository, crawlers for schema inference, and visual and code-based ETL authoring. AWS Glue excels at preparing messy, semi-structured data for analytics—whether through batch jobs, streaming pipelines, or visual no-code transforms. Understanding the distinction between DynamicFrames (Glue's schema-flexible abstraction) and Spark DataFrames, mastering job bookmarks for incremental processing, and leveraging performance optimization techniques like pushdown predicates are essential for cost-effective, production-grade Glue implementations.
What This Cheat Sheet Covers
This topic spans 25 focused tables and 182 indexed concepts. The table-by-table outline below runs from foundational concepts through advanced details.
Table 1: Glue Job Types
| Type | Example | Description |
|---|---|---|
| Spark ETL | `job_type = 'glueetl'`, `worker_type = 'G.1X'` | • Runs Apache Spark on serverless Glue infrastructure • best for large-scale batch processing • uses DynamicFrames or standard Spark DataFrames • charged per DPU-hour. |
| Python shell | `job_type = 'pythonshell'`, `max_capacity = 1.0` | • Lightweight pure Python job (no Spark) • ideal for small datasets, API calls, or simple transformations • supports pandas and boto3 • cheaper than Spark jobs. |
| Ray | `job_type = 'glueray'`, `python_version = '3.9'` | • Distributed Ray.io framework for scaling Python workloads • supports parallel processing without Spark • new-customer support ends April 30, 2026. |
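
The three job types in Table 1 differ mainly in the command name and the capacity settings passed when a job is defined. The following is a minimal sketch, not a definitive setup: `build_job_args` is a hypothetical helper that only assembles keyword arguments in the shape boto3's Glue `create_job` call accepts. The Glue version, worker counts, the Ray runtime string and worker type, and all names, ARNs, and S3 paths are assumptions chosen for illustration.

```python
from typing import Any, Dict


def build_job_args(name: str, role_arn: str, script_s3_path: str,
                   job_type: str) -> Dict[str, Any]:
    """Assemble keyword arguments for a Glue job of the given type.

    job_type is one of the command names from Table 1:
    'glueetl', 'pythonshell', or 'glueray'.
    """
    args: Dict[str, Any] = {
        "Name": name,
        "Role": role_arn,
        "Command": {"Name": job_type, "ScriptLocation": script_s3_path},
    }
    if job_type == "glueetl":
        # Spark ETL: capacity is expressed as worker type plus worker count.
        args["Command"]["PythonVersion"] = "3"
        args["GlueVersion"] = "4.0"      # assumed version
        args["WorkerType"] = "G.1X"
        args["NumberOfWorkers"] = 2      # assumed size
    elif job_type == "pythonshell":
        # Python shell: no Spark, capacity is a single DPU figure.
        args["Command"]["PythonVersion"] = "3.9"
        args["MaxCapacity"] = 1.0
    elif job_type == "glueray":
        # Ray: runtime string and worker type below are assumptions.
        args["Command"]["PythonVersion"] = "3.9"
        args["Command"]["Runtime"] = "Ray2.4"
        args["GlueVersion"] = "4.0"
        args["WorkerType"] = "Z.2X"
        args["NumberOfWorkers"] = 2
    else:
        raise ValueError(f"unknown Glue job type: {job_type!r}")
    return args


if __name__ == "__main__":
    # Placeholder role ARN and script path; replace with real values.
    job = build_job_args(
        "nightly-batch",
        "arn:aws:iam::123456789012:role/ExampleGlueRole",
        "s3://example-bucket/scripts/nightly.py",
        "glueetl",
    )
    print(job["Command"]["Name"], job["WorkerType"])
    # With credentials configured, the dict could be passed on as:
    #   import boto3
    #   boto3.client("glue").create_job(**job)
```

Keeping the argument assembly separate from the API call makes the differences between job types easy to inspect without touching an AWS account.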