Dagster is a modern data orchestration platform designed around software-defined assets, a declarative approach where data pipelines are modeled as first-class objects rather than task-based workflows. Originally developed to address limitations in traditional orchestrators like Airflow, it provides data-aware orchestration with built-in observability, type-checking, and testing capabilities. Core to Dagster's philosophy is treating data assets (tables, files, models) as the primary abstraction rather than tasks, enabling automatic lineage tracking, easier debugging, and a more intuitive mental model for data engineers. The framework supports both asset-based and op-based (task-based) workflows, though assets are recommended for most use cases as they provide superior observability and composability out of the box.
What This Cheat Sheet Covers
This topic spans 16 focused tables and 116 indexed concepts. Below is a complete table-by-table outline of this topic, spanning foundational concepts through advanced details.
Table 1: Core Asset Definitions
| Concept | Example | Description |
|---|---|---|
| `@asset` | `@dg.asset`<br>`def customers(): return pd.read_csv("data.csv")` | Defines a software-defined asset: a Python function that computes and persists data. The asset key is derived from the function name. |
| `deps` | `@dg.asset(deps=[raw_customers])`<br>`def clean_customers(): ...` | Declares upstream dependencies using `deps`; Dagster ensures parent assets run first. Use when the upstream asset isn't consumed as a function input. |
| `AssetIn` | `@dg.asset(ins={"data": dg.AssetIn("source")})`<br>`def process(data): ...` | Explicitly configures input behavior for an upstream asset; allows custom partition mappings, metadata, or key overrides. |
| `@multi_asset` | `@dg.multi_asset(outs={"a": dg.AssetOut(), "b": dg.AssetOut()})`<br>`def multi(): yield dg.Output(val, "a")` | Defines multiple outputs from a single function; each output is tracked as a separate asset with distinct metadata. |
| Materialization | `dg.materialize([customers, orders])` | The act of executing an asset's function and persisting results to storage; can be triggered via UI, CLI, schedules, or sensors. |
| External asset | `s3_data = dg.AssetSpec("s3_data")` | Models assets produced outside Dagster (e.g., by Airflow or manual processes); allows lineage tracking without assuming orchestration control. |