Mathematics for Machine Learning Cheat Sheet

Updated 2026-05-21

Mathematics for machine learning is the formal language underlying algorithms that learn from data. It spans linear algebra, multivariate calculus, probability theory, information theory, and optimization, topics that recur across research papers, frameworks, and model architectures. ML tools can be used as black boxes without this foundation, but understanding it is what separates practitioners who can debug, design, and innovate from those who can only configure. The crucial mental model: most ML is optimization of a scalar loss function over a high-dimensional parameter space, and every concept here, from eigenvectors to KL divergence, is a tool for understanding or improving that optimization.

What This Cheat Sheet Covers

This cheat sheet covers 18 focused tables and 133 indexed concepts. The table-by-table outline below runs from foundational concepts through advanced details.

Table 1: Vectors, Matrices, and Tensors
Table 2: Vector and Matrix Norms
Table 3: Linear Independence, Rank, and Vector Spaces
Table 4: Eigendecomposition and the Spectral Theorem
Table 5: Matrix Decompositions
Table 6: Vector Calculus — Gradients, Jacobians, Hessians
Table 7: Matrix Calculus Identities for ML
Table 8: Backpropagation and the Chain Rule in Neural Networks
Table 9: Probability Fundamentals
Table 10: Common Probability Distributions in ML
Table 11: Maximum Likelihood and MAP Estimation
Table 12: Information Theory Essentials
Table 13: Convexity and Optimization Theory
Table 14: Loss Functions and Their Gradients
Table 15: Regularization — L1, L2, and Elastic Net
Table 16: Gradient Descent and Optimizers
Table 17: Principal Component Analysis (PCA)
Table 18: Bias-Variance Tradeoff

Table 1: Vectors, Matrices, and Tensors

Linear algebra gives ML its fundamental data structures. A scalar is a single number, a vector is an ordered array, a matrix is a 2-D grid, and a tensor generalizes to arbitrary dimensions — these are the objects that represent data, parameters, and transformations throughout every ML pipeline.

Concept	Example	Description
Vector	$\mathbf{x} = [x_1, x_2, \ldots, x_n]^\top \in \mathbb{R}^n$	• Ordered list of $n$ real numbers • represents a data point, a weight vector, or an embedding
Matrix	$A \in \mathbb{R}^{m \times n}$ , element $A_{ij}$	• 2-D array of numbers • a dataset of $m$ examples with $n$ features forms an $m \times n$ matrix
Tensor	$\mathcal{T} \in \mathbb{R}^{H \times W \times C}$ (e.g., RGB image)	• Generalization of matrices to $k$ dimensions • the native data type in deep learning frameworks
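Dot product (inner product)	$\mathbf{u} \cdot \mathbf{v} = \sum_i u_i v_i = \lVert \mathbf{u} \rVert \lVert \mathbf{v} \rVert \cos\theta$	• Produces a scalar; $\theta$ is the angle between $\mathbf{u}$ and $\mathbf{v}$ • cosine similarity uses the normalized form $\cos\theta = \mathbf{u} \cdot \mathbf{v} / (\lVert \mathbf{u} \rVert \lVert \mathbf{v} \rVert)$ to measure directional similarity between vectors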
Matrix multiplication	$(AB)_{ij} = \sum_{k=1}^{p} A_{ik} B_{kj}$, $A \in \mathbb{R}^{m\times p}$, $B \in \mathbb{R}^{p\times n}$	• Core neural-network operation; the column count of $A$ must equal the row count of $B$ • each layer applies an affine transformation $A\mathbf{x} + \mathbf{b}$: a matrix-vector product followed by a bias shift
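
The operations in this table can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: it uses plain nested lists so it runs without dependencies, whereas real ML code would normally rely on an array library. The helper names (`dot`, `norm`, `cosine_similarity`, `matmul`, `affine`) and all sample values are assumptions made up for this example.

```python
import math

def dot(u, v):
    """Inner product: sum_i u_i * v_i (returns a scalar)."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(ui * vi for ui, vi in zip(u, v))

def norm(u):
    """Euclidean length of a vector."""
    return math.sqrt(dot(u, u))

def cosine_similarity(u, v):
    """Dot product divided by both norms, i.e. cos(theta)."""
    return dot(u, v) / (norm(u) * norm(v))

def matmul(A, B):
    """(AB)_ij = sum_k A_ik * B_kj; columns of A must equal rows of B."""
    if len(A[0]) != len(B):
        raise ValueError("inner dimensions must match")
    return [
        [sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
        for i in range(len(A))
    ]

def affine(A, x, b):
    """Layer-style map A x + b: one dot product per row of A, plus a bias."""
    return [dot(row, x) + bi for row, bi in zip(A, b)]

# Hypothetical sample data
u = [1.0, 0.0]
v = [1.0, 1.0]
A = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # 3 x 2 matrix
B = [[1.0, 0.0, 2.0], [0.0, 1.0, 3.0]]     # 2 x 3 matrix
x = [1.0, -1.0]
b = [0.5, 0.5, 0.5]

# A tiny H x W x C tensor: a 2 x 2 "image" with 3 color channels
image = [[[0, 0, 0], [255, 255, 255]], [[255, 0, 0], [0, 0, 255]]]
image_shape = (len(image), len(image[0]), len(image[0][0]))

print(dot(u, v))                # 1.0
print(cosine_similarity(u, v))  # roughly 0.7071, a 45-degree angle
print(matmul(A, B))             # [[1.0, 2.0, 8.0], [3.0, 4.0, 18.0], [5.0, 6.0, 28.0]]
print(affine(A, x, b))          # [-0.5, -0.5, -0.5]
print(image_shape)              # (2, 2, 3)
```

Multiplying the 3 x 2 matrix by the 2 x 3 matrix gives a 3 x 3 result, matching the shape rule in the matrix multiplication row. Swapping the arguments would give a 2 x 2 result instead, a reminder that matrix multiplication is not commutative.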
