Kubeflow is an open-source, cloud-native MLOps platform built on Kubernetes that orchestrates the entire machine learning lifecycle, from interactive development in notebooks to distributed training, hyperparameter tuning, and production model serving. It addresses the core challenge of reproducibly moving ML workloads from a data scientist's laptop to scalable, multi-tenant infrastructure without rewriting pipelines. The critical mental model to carry through this cheat sheet is that almost everything in Kubeflow is a Kubernetes Custom Resource (InferenceService, PyTorchJob, Experiment, Notebook), so standard kubectl tooling, RBAC, and Kubernetes-native observability all apply directly.
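
To make the Custom Resource mental model concrete, the sketch below builds the `kubectl get` command for each kind named above. It is a minimal illustration, not an official client: the helper names and the `team-a` namespace are hypothetical, and the fully qualified resource names assume a typical installation (confirm them with `kubectl api-resources`).

```python
import shutil
import subprocess

# Kinds mentioned above, mapped to their plural, group-qualified resource names.
# ASSUMPTION: these groups match a typical install; verify with `kubectl api-resources`.
KUBEFLOW_RESOURCES = {
    "InferenceService": "inferenceservices.serving.kserve.io",
    "PyTorchJob": "pytorchjobs.kubeflow.org",
    "Experiment": "experiments.kubeflow.org",
    "Notebook": "notebooks.kubeflow.org",
}


def kubectl_get_command(kind, namespace):
    """Build the argv list for listing one Kubeflow kind in a namespace."""
    return ["kubectl", "get", KUBEFLOW_RESOURCES[kind], "--namespace", namespace]


def list_resources(kind, namespace):
    """Run the command if kubectl is on PATH; return its stdout, or None if it is not."""
    if shutil.which("kubectl") is None:
        return None
    result = subprocess.run(
        kubectl_get_command(kind, namespace),
        capture_output=True,
        text=True,
        check=False,  # let the caller inspect empty output instead of raising
    )
    return result.stdout


if __name__ == "__main__":
    # "team-a" is a hypothetical profile namespace.
    for kind in KUBEFLOW_RESOURCES:
        print(" ".join(kubectl_get_command(kind, "team-a")))
```

The same pattern extends to `kubectl describe`, `kubectl delete`, and RBAC rules, since each kind is addressed like any other Kubernetes resource.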
What This Cheat Sheet Covers
This cheat sheet spans 19 focused tables and 141 indexed concepts. The table-by-table outline below runs from foundational concepts through advanced details.
Table 1: Kubeflow Core Components Overview
Kubeflow is a suite of composable components rather than a monolithic framework; knowing which component does what prevents the confusion of treating it as a single tool. Each component addresses a distinct phase of the ML lifecycle and can be installed and used independently.
| Component | Example | Description |
|---|---|---|
| Kubeflow Pipelines (KFP) | `Compiler().compile()` | ML workflow orchestration: defines, compiles, and executes multi-step ML pipelines as Kubernetes pods; uses Argo Workflows as its execution engine |
| Kubeflow Notebooks | JupyterLab / VS Code / RStudio on a K8s pod | Spawns interactive IDE containers (JupyterLab, VS Code via code-server, RStudio) as Kubernetes pods inside a user's profile namespace |
| Training Operator | `PyTorchJob`, `TFJob`, `MPIJob` CRDs | Distributed training: Kubernetes operators that orchestrate multi-node, multi-GPU training jobs across PyTorch, TensorFlow, MPI, XGBoost, JAX, and more |
| KServe | `InferenceService` CRD | Model serving: standardized inference platform supporting serverless autoscaling, canary rollouts, multi-framework runtimes, and OpenAI-compatible APIs |
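
As a rough illustration of the training and serving rows above, the sketch below builds minimal `PyTorchJob` and `InferenceService` manifests as plain Python dictionaries and prints them as JSON, which `kubectl apply -f` accepts alongside YAML. The function names, resource names, namespace, container image, and storage URI are all hypothetical placeholders, and real manifests usually need more fields (resources, GPUs, service accounts).

```python
import json


def pytorch_job(name, namespace, image, workers):
    """Return a minimal PyTorchJob manifest with one master and `workers` workers."""

    def replica(count):
        return {
            "replicas": count,
            "restartPolicy": "OnFailure",
            "template": {
                "spec": {
                    # The training container is conventionally named "pytorch".
                    "containers": [{"name": "pytorch", "image": image}]
                }
            },
        }

    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "PyTorchJob",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "pytorchReplicaSpecs": {
                "Master": replica(1),
                "Worker": replica(workers),
            }
        },
    }


def inference_service(name, namespace, model_format, storage_uri):
    """Return a minimal KServe InferenceService manifest for a single predictor."""
    return {
        "apiVersion": "serving.kserve.io/v1beta1",
        "kind": "InferenceService",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "predictor": {
                "model": {
                    "modelFormat": {"name": model_format},
                    "storageUri": storage_uri,
                }
            }
        },
    }


if __name__ == "__main__":
    # All values below are hypothetical placeholders.
    job = pytorch_job("demo-train", "team-a", "example.com/train:latest", workers=2)
    svc = inference_service("demo-model", "team-a", "sklearn", "gs://example-bucket/model")
    print(json.dumps(job, indent=2))
    print(json.dumps(svc, indent=2))
```

Redirecting either JSON document to a file and running `kubectl apply -f` on it would submit the resource, assuming the corresponding operator is installed in the cluster.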