AWS Glue Cheat Sheet

Updated 2026-05-28

AWS Glue is Amazon's serverless data integration service that orchestrates extract, transform, and load (ETL) workflows at scale. Built on Apache Spark, it eliminates infrastructure management while providing a Data Catalog as a central metadata repository, crawlers for schema inference, and visual and code-based ETL authoring. AWS Glue excels at preparing messy, semi-structured data for analytics — whether through batch jobs, streaming pipelines, or visual no-code transforms. Understanding the distinction between DynamicFrames (Glue's schema-flexible abstraction) and Spark DataFrames, mastering job bookmarks for incremental processing, and leveraging performance optimization techniques like pushdown predicates are essential for cost-effective, production-grade Glue implementations. AWS Glue 5.1 (the current default, released November 2025) runs Spark 3.5.6 with Java 17 and adds Iceberg v3, materialized views, and an AI-powered Spark troubleshooting agent.

What This Cheat Sheet Covers

This topic spans 29 focused tables and 222 indexed concepts. Below is a complete table-by-table outline of this topic, spanning foundational concepts through advanced details.

Table 1: Glue Job Types
Table 2: Data Catalog and Crawlers
Table 3: DynamicFrames vs DataFrames
Table 4: Glue Studio Visual Job Components
Table 5: Glue Transformations
Table 6: Glue DataBrew Transformations
Table 7: AWS Glue Data Quality
Table 8: Job Bookmarks and Incremental Processing
Table 9: Glue Connections and Data Sources
Table 10: Glue Workflows and Orchestration
Table 11: Job Optimization Techniques
Table 12: Monitoring and Logging
Table 13: Spark and Python Versions
Table 14: Sensitive Data Detection
Table 15: Job Parameters and Arguments
Table 16: Security and IAM
Table 17: Streaming ETL Features
Table 18: Schema Registry
Table 19: Glue Flex and Auto Scaling
Table 20: Development and Testing
Table 21: Data Lake Table Formats
Table 22: Iceberg Operations
Table 23: Data Catalog Federation and External Sources
Table 24: Zero-ETL Integrations
Table 25: SageMaker Lakehouse and S3 Tables
Table 26: Generative AI Features
Table 27: Pricing Model
Table 28: Common Patterns and Best Practices
Table 29: Troubleshooting and Common Errors

Table 1: Glue Job Types

Choose the right execution model first — the job type determines available runtimes, DPU billing, and integration depth. Spark ETL handles the vast majority of workloads; Python Shell covers lightweight scripted automation; Streaming covers near-real-time pipelines.

Type	Example	Description
Spark ETL Job	`job_type = 'glueetl'` `worker_type = 'G.2X'`	• Runs Apache Spark on serverless Glue infrastructure • best for large-scale batch processing • uses DynamicFrames or standard Spark DataFrames • charged per DPU-hour
Streaming ETL Job	`streaming = True` `sources: Kinesis, Kafka`	• Continuous near-real-time processing • reads from Kinesis Data Streams or MSK • uses Spark Structured Streaming • supports checkpointing and micro-batching
Visual ETL Job	Created via Glue Studio UI	• Drag-and-drop interface for building ETL pipelines without code • auto-generates PySpark or Scala scripts • supports custom transforms and DataBrew recipes • available in SageMaker Unified Studio
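
The job types above map to the `Command.Name` field of a Glue job definition. The following is a minimal sketch, not a definitive implementation: it only builds the keyword arguments one might pass to boto3's `glue.create_job`, without calling AWS. The helper name `build_job_request`, the job name, role ARN, bucket path, and worker count are all hypothetical placeholders, and the Glue version string is taken from the introduction and should be adjusted to whatever your account supports.

```python
from typing import Any, Dict

# Command names used by the Glue CreateJob API for each job type.
COMMAND_NAMES = {
    "spark": "glueetl",
    "streaming": "gluestreaming",
    "python_shell": "pythonshell",
}


def build_job_request(
    name: str,
    role_arn: str,
    script_s3_path: str,
    glue_version: str,
    job_type: str = "spark",
    worker_type: str = "G.2X",
    num_workers: int = 10,
) -> Dict[str, Any]:
    """Build keyword arguments for a CreateJob call (hypothetical helper)."""
    if job_type not in COMMAND_NAMES:
        raise ValueError(f"unknown job type: {job_type!r}")

    request: Dict[str, Any] = {
        "Name": name,
        "Role": role_arn,
        "Command": {
            "Name": COMMAND_NAMES[job_type],
            "ScriptLocation": script_s3_path,
            "PythonVersion": "3",
        },
    }

    if job_type == "python_shell":
        # Python Shell jobs are sized with MaxCapacity (0.0625 or 1 DPU)
        # rather than a worker type and worker count.
        request["MaxCapacity"] = 0.0625
    else:
        # Spark and streaming jobs are sized by worker type and count.
        request["GlueVersion"] = glue_version
        request["WorkerType"] = worker_type
        request["NumberOfWorkers"] = num_workers

    return request


# Example with placeholder values; nothing is sent to AWS here.
example = build_job_request(
    name="nightly-etl",
    role_arn="arn:aws:iam::123456789012:role/ExampleGlueRole",
    script_s3_path="s3://example-bucket/scripts/job.py",
    glue_version="5.1",  # version named in the introduction; an assumption
)
```

With boto3 installed and credentials configured, the resulting dictionary could be passed as `boto3.client("glue").create_job(**example)`. Keeping the request construction separate from the API call makes the job-type differences easy to inspect and test locally.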
