Apache Airflow is a Python-based platform for programmatically authoring, scheduling, and monitoring workflows as directed acyclic graphs (DAGs). Originally developed at Airbnb and open-sourced in 2015, it has become a de facto standard for orchestrating data pipelines, including ETL and machine learning workflows; it is batch-oriented rather than a stream-processing engine. Airflow's core strength is its code-as-configuration approach: workflows are defined in Python, which enables version control, testing, and dynamic generation. Tasks are discrete units of work arranged in a DAG, with dependencies declared explicitly to ensure the correct execution order. This model scales from simple ETL pipelines to multi-team data platforms orchestrating thousands of workflows.
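The idea that explicit dependencies determine execution order can be illustrated without Airflow at all. The sketch below uses Python's standard-library `graphlib` to order three hypothetical tasks (`extract`, `transform`, `load`); it is not how Airflow's scheduler is implemented, only a minimal picture of the DAG model described above.

```python
from graphlib import TopologicalSorter

# Hypothetical task names; each key maps a task to the set of tasks it depends on.
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
}

# static_order() yields tasks so that every task appears after its dependencies.
execution_order = list(TopologicalSorter(dependencies).static_order())
print(execution_order)  # ['extract', 'transform', 'load']
```

A cycle in `dependencies` would raise `graphlib.CycleError`, which is the same constraint the "acyclic" in DAG expresses.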
What This Cheat Sheet Covers
This cheat sheet spans 25 focused tables and 212 indexed concepts. Below is a table-by-table outline, from foundational concepts through advanced details.
Table 1: DAG Configuration Parameters
| Parameter | Example | Description |
|---|---|---|
| `dag_id` | `dag_id='daily_etl_pipeline'` | Unique identifier for the DAG; must be unique across all DAGs in the same Airflow instance. |
| `schedule` | `schedule='@daily'` or `schedule='0 6 * * *'` | Defines when the DAG runs; accepts cron expressions, `timedelta` objects, presets (`@hourly`, `@daily`, `@weekly`, `@monthly`), timetables, or `None` for manual-only runs. |
| `start_date` | `start_date=datetime(2026, 1, 1)` | First logical date from which DAG runs can be scheduled; should be timezone-aware and is typically a static date in the past. |
| `catchup` | `catchup=False` | If `True`, the scheduler creates a run for every missed interval between `start_date` and the current date; if `False`, it creates a run only for the most recent interval. Set it explicitly to avoid an unintended backfill on first deploy. |
| `max_active_runs` | `max_active_runs=3` | Maximum number of concurrent runs allowed for this DAG; prevents resource exhaustion when a DAG is scheduled frequently. |
| `default_args` | `default_args={'retries': 2, 'retry_delay': timedelta(minutes=5)}` | Dictionary of default parameters applied to all tasks in the DAG unless overridden at the task level. |
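
The parameters in Table 1 are typically passed together when constructing a DAG. The following is a minimal sketch, assuming Airflow 2.4 or later (where the `schedule` argument is available). The values are the illustrative ones from the table, the UTC timezone is an assumption added to make `start_date` timezone-aware, and the `try`/`except` guard exists only so the snippet can be read and run without Airflow installed.

```python
from datetime import datetime, timedelta, timezone

# Keyword arguments mirroring Table 1; all values are illustrative.
dag_kwargs = {
    "dag_id": "daily_etl_pipeline",
    "schedule": "0 6 * * *",  # cron expression: every day at 06:00
    "start_date": datetime(2026, 1, 1, tzinfo=timezone.utc),  # UTC is an assumption
    "catchup": False,  # do not create runs for every missed interval
    "max_active_runs": 3,
    "default_args": {"retries": 2, "retry_delay": timedelta(minutes=5)},
}

try:
    from airflow import DAG  # requires Apache Airflow to be installed
except ImportError:
    DAG = None  # Airflow not available; dag_kwargs can still be inspected

# Build the DAG only when Airflow is importable.
dag = DAG(**dag_kwargs) if DAG is not None else None
```

Keeping the arguments in a plain dictionary is a choice made for this sketch; in a real DAG file they are more commonly written inline in the `DAG(...)` call or a `with DAG(...) as dag:` block.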