Ray for Distributed AI and ML Cheat Sheet

Updated 2026-05-21


Ray is an open-source Python framework for distributed computing that provides a unified runtime for building and scaling AI and ML workloads across laptops, clusters, and cloud platforms. It solves the core pain point of distributed systems complexity by exposing three simple primitives — tasks, actors, and objects — that map naturally to any Python program. The key mental model is that Ray is not a framework you adopt wholesale but a layer you add incrementally: a single decorator turns a regular function into a distributed task, and the rest of your code stays unchanged.

What This Cheat Sheet Covers

This cheat sheet spans 13 focused tables and 128 indexed concepts. The table-by-table outline below runs from foundational concepts through advanced details.

Table 1: Ray Core — Tasks
Table 2: Ray Core — Actors
Table 3: Ray Core — Object Store and Patterns
Table 4: Ray Core — Scheduling and Placement Groups
Table 5: Ray Cluster Setup and Runtime Environments
Table 6: Ray Train — Distributed Training
Table 7: Ray Tune — Hyperparameter Search
Table 8: Ray Serve — Model Deployment and Composition
Table 9: Ray Data — Distributed Data Processing
Table 10: KubeRay — Ray on Kubernetes
Table 11: Ray Dashboard and Observability
Table 12: Fault Tolerance
Table 13: Design Patterns and Anti-patterns

Table 1: Ray Core — Tasks

Tasks are the fundamental unit of stateless parallelism in Ray. Decorating a Python function with @ray.remote registers it as a remote function that executes asynchronously on any available worker in the cluster. Calling .remote() returns an ObjectRef immediately, and the value is fetched only when you call ray.get(), so you can launch thousands of tasks without blocking the caller.

Technique	Example	Description
@ray.remote (task)	`@ray.remote` `def add(x, y):` `return x + y` `ref = add.remote(1, 2)`	Decorates a Python function to run as a distributed remote task; `.remote()` returns an `ObjectRef` immediately without blocking.
ray.get	`result = ray.get(ref)` `results = ray.get([r1, r2, r3])`	Blocks the caller and retrieves the concrete value(s) from one or more `ObjectRef`s; accepts a single ref or a list.
ray.wait	`ready, remaining = ray.wait(` `refs, num_returns=2, timeout=5.0)`	Returns two lists — refs that are ready and those still pending — without blocking indefinitely; use to process results as they finish rather than waiting for all.
ray.init	`ray.init()` `ray.init(address="ray://head:10001")` `ray.init(num_cpus=4)`	Starts a local Ray runtime or connects to an existing cluster; call once at the start of your program.
Resource specification (tasks)	`@ray.remote(num_cpus=2, num_gpus=1,` `memory=2_000_000_000)` `def train(): ...`	Declares the logical resources reserved for the task (`memory` is given in bytes); Ray uses these for scheduling decisions, not for enforcing physical limits.
