AWS Glue Cheat Sheet

Updated 2026-04-12

AWS Glue is Amazon's serverless data integration service for building and orchestrating extract, transform, and load (ETL) workflows at scale. Its ETL engine is built on Apache Spark, and it removes infrastructure management while providing a Data Catalog as a central metadata repository, crawlers for schema inference, and both visual and code-based ETL authoring. AWS Glue is well suited to preparing messy, semi-structured data for analytics, whether through batch jobs, streaming pipelines, or visual no-code transforms. Cost-effective, production-grade Glue implementations depend on three things: understanding the distinction between DynamicFrames (Glue's schema-flexible abstraction) and Spark DataFrames, using job bookmarks for incremental processing, and applying performance optimizations such as pushdown predicates.

What This Cheat Sheet Covers

This topic spans 25 focused tables and 182 indexed concepts. Below is a complete table-by-table outline of this topic, spanning foundational concepts through advanced details.

• Table 1: Glue Job Types
• Table 2: Data Catalog and Crawlers
• Table 3: DynamicFrames vs DataFrames
• Table 4: Glue Studio Visual Job Components
• Table 5: Glue Transformations
• Table 6: Glue DataBrew Transformations
• Table 7: AWS Glue Data Quality
• Table 8: Job Bookmarks and Incremental Processing
• Table 9: Glue Connections and Data Sources
• Table 10: Glue Workflows and Orchestration
• Table 11: Job Optimization Techniques
• Table 12: Monitoring and Logging
• Table 13: Spark and Python Versions
• Table 14: Sensitive Data Detection
• Table 15: Job Parameters and Arguments
• Table 16: Security and IAM
• Table 17: Streaming ETL Features
• Table 18: Schema Registry
• Table 19: Glue Flex and Auto Scaling
• Table 20: Development and Testing
• Table 21: Data Lake Table Formats
• Table 22: Catalog Federation
• Table 23: Pricing and Cost Optimization
• Table 24: Common Patterns and Best Practices
• Table 25: Troubleshooting and Debugging

Table 1: Glue Job Types

Type | Example | Description
Spark ETL Job | job_type = 'glueetl', worker_type = 'G.1X' | Runs Apache Spark on serverless Glue infrastructure; best for large-scale batch processing; uses DynamicFrames or standard Spark DataFrames; charged per DPU-hour.
Python Shell Job | job_type = 'pythonshell', max_capacity = 1.0 | Lightweight pure-Python job (no Spark); ideal for small datasets, API calls, or simple transformations; supports pandas and boto3; typically cheaper than Spark jobs because it runs on 1 DPU or a fraction of one (0.0625).
Ray Job | job_type = 'glueray', python_version = '3.9' | Uses the distributed Ray framework to scale Python workloads; supports parallel processing without Spark; closed to new customers starting April 30, 2026.
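
The job types in Table 1 can be sketched as arguments for boto3's `glue.create_job` call. This is a minimal illustration, not a definitive setup: the helper names `build_job_definition` and `create_job` are hypothetical, and the `GlueVersion`, `NumberOfWorkers`, Ray `WorkerType`, and `Runtime` values are assumptions chosen for the example and not taken from the table.

```python
from typing import Any, Dict


def build_job_definition(name: str, role_arn: str, script_s3_path: str,
                         job_type: str = "glueetl") -> Dict[str, Any]:
    """Build keyword arguments for glue.create_job (hypothetical helper)."""
    command: Dict[str, Any] = {"Name": job_type, "ScriptLocation": script_s3_path}
    job: Dict[str, Any] = {"Name": name, "Role": role_arn, "Command": command}

    if job_type == "glueetl":
        # Spark ETL job: capacity is expressed as worker type and worker count.
        command["PythonVersion"] = "3"
        # GlueVersion and NumberOfWorkers are assumed values for illustration.
        job.update(GlueVersion="4.0", WorkerType="G.1X", NumberOfWorkers=2)
    elif job_type == "pythonshell":
        # Python shell job: no Spark, capacity is set with MaxCapacity (in DPUs).
        command["PythonVersion"] = "3.9"
        job["MaxCapacity"] = 1.0
    elif job_type == "glueray":
        # Ray job: the Runtime, WorkerType, and worker count are assumptions.
        command.update(PythonVersion="3.9", Runtime="Ray2.4")
        job.update(GlueVersion="4.0", WorkerType="Z.2X", NumberOfWorkers=2)
    else:
        raise ValueError(f"unsupported job type: {job_type!r}")
    return job


def create_job(definition: Dict[str, Any]) -> Dict[str, Any]:
    """Submit the definition to AWS Glue; requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder works without boto3 installed
    return boto3.client("glue").create_job(**definition)
```

Building the dictionary separately from the API call lets a definition be inspected or unit-tested before anything is created in AWS; calling `create_job(build_job_definition(...))` would then submit it.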
