Convolutional Neural Networks (CNNs) are a class of deep learning models designed to process data with a grid-like topology, most famously images, by exploiting local spatial correlations through learnable filters. They sit at the heart of modern computer vision, enabling tasks from image classification and object detection to medical imaging and autonomous driving. Unlike fully connected networks, CNNs achieve parameter sharing and translation equivariance by sliding the same filter across the entire input, which dramatically reduces parameter count while preserving spatial structure. The key mental model to hold throughout is that a CNN is a hierarchy of feature detectors: early layers learn edges and textures, middle layers learn parts, and deep layers learn semantics. Every design choice (kernel size, stride, normalization, skip connections) shapes how information flows through that hierarchy.
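To put a rough number on the parameter-sharing claim, the sketch below counts weights and biases for one fully connected layer and one convolutional layer. The input size (224x224 RGB), the 64 output units/filters, and the helper names are assumptions chosen for illustration. The comparison is loose: the dense layer emits 64 scalars, while the conv layer emits 64 full feature maps.

```python
def dense_params(in_h, in_w, in_c, out_units):
    """Weights plus biases of a fully connected layer on a flattened input."""
    return in_h * in_w * in_c * out_units + out_units


def conv_params(kernel, in_c, out_c):
    """Weights plus biases of a conv layer; independent of input height/width."""
    return kernel * kernel * in_c * out_c + out_c


# Hypothetical sizes: 224x224 RGB input, 64 dense units vs. 64 filters of 3x3.
dense = dense_params(224, 224, 3, 64)
conv = conv_params(3, 3, 64)

print(dense)  # 9633856
print(conv)   # 1792
```

Because the kernel is reused at every position, the conv layer's count does not grow with image resolution, whereas the dense layer's count scales with every input pixel.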
What This Cheat Sheet Covers
This topic spans 11 focused tables and 95 indexed concepts. Below is a complete table-by-table outline of this topic, spanning foundational concepts through advanced details.
Table 1: Convolution Operation Fundamentals
The convolution operation is the mathematical core of CNNs: a learnable filter slides across the input, computing dot products at each position to produce a feature map. Understanding the mechanics — how filters, stride, padding, and dilation interact — is the prerequisite for understanding every CNN architecture.
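As a minimal sketch of the sliding dot product described above, the following pure-Python function convolves a single-channel image with a square kernel. It is a simplification under stated assumptions: one input channel, one filter, no bias, no dilation, and a fixed kernel where a real CNN would learn the weights. The function name `conv2d_single_channel` is hypothetical. Like most deep learning frameworks, it computes cross-correlation (the kernel is not flipped).

```python
def conv2d_single_channel(image, kernel, stride=1, padding=0):
    """Naive 2D cross-correlation of one channel with one square kernel."""
    k = len(kernel)
    h, w = len(image), len(image[0])

    # Zero-pad the input on all four sides.
    padded = [[0.0] * (w + 2 * padding) for _ in range(h + 2 * padding)]
    for r in range(h):
        for c in range(w):
            padded[r + padding][c + padding] = image[r][c]

    # Output size: floor((W - K + 2P) / S) + 1 per dimension.
    out_h = (h - k + 2 * padding) // stride + 1
    out_w = (w - k + 2 * padding) // stride + 1

    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            top, left = i * stride, j * stride
            acc = 0.0
            # Dot product between the kernel and the patch under it.
            for u in range(k):
                for v in range(k):
                    acc += padded[top + u][left + v] * kernel[u][v]
            row.append(acc)
        out.append(row)
    return out


image = [
    [1, 2, 3, 0],
    [4, 5, 6, 1],
    [7, 8, 9, 2],
    [0, 1, 2, 3],
]
# Hypothetical fixed kernel: responds to horizontal intensity changes.
kernel = [
    [1, 0, -1],
    [1, 0, -1],
    [1, 0, -1],
]

feature_map = conv2d_single_channel(image, kernel)
print(feature_map)  # [[-6.0, 12.0], [-6.0, 8.0]]
```

The same nine kernel weights are applied at every output position, which is the weight sharing the section refers to. A multi-channel version would additionally sum the per-channel results at each position.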
| Operation | Example | Description |
|---|---|---|
| 2D convolution | `output[i, j] = sum(input[i:i+k, j:j+k] * kernel)`, summed over all input channels | Slides a learnable kernel across the input and, at each position, multiplies it element-wise with the patch underneath and sums the products (a dot product); the learned kernel weights are shared across all spatial positions. |
| Output size | $O = \lfloor (W - K + 2P) / S \rfloor + 1$ | Computes the output spatial dimension, where W = input size, K = kernel size, P = padding, S = stride; when (W - K + 2P) is not divisible by S, the floor drops the trailing positions the kernel cannot fully cover. |
| Same padding | $P = (K - 1) / 2$ (stride=1, odd K) | Adds zeros around the input borders so the output has the same spatial size as the input when stride=1; prevents boundary features from being underrepresented. |
| Valid padding | $P = 0$ → output shrinks by $K - 1$ | No padding is added, so at stride=1 the output shrinks by K-1 per dimension; used when spatial reduction is intentional. |
| Stride | `stride=2` roughly halves output H and W | Controls how many pixels the filter advances per step; stride > 1 downsamples the feature map, reducing computation and spatial resolution. |
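
The size rules in the table can be checked with a few lines of integer arithmetic. This is a sketch with hypothetical helper names (`conv_output_size`, `same_padding`), assuming one spatial dimension at a time, symmetric padding, no dilation, and an input at least as large as the kernel.

```python
def conv_output_size(w, k, p=0, s=1):
    """Output size along one dimension: floor((W - K + 2P) / S) + 1."""
    return (w - k + 2 * p) // s + 1


def same_padding(k):
    """Padding that preserves spatial size at stride 1 (odd kernels only)."""
    if k % 2 == 0:
        raise ValueError("symmetric same padding needs an odd kernel size")
    return (k - 1) // 2


sizes = {
    # No padding: 32 shrinks by K - 1 = 2.
    "valid": conv_output_size(32, 3),
    # P = (K - 1) / 2 = 1 keeps the size at 32.
    "same": conv_output_size(32, 3, p=same_padding(3)),
    # Stride 2 with padding 1 halves 32 to 16.
    "stride2": conv_output_size(32, 3, p=1, s=2),
    # (8 - 3) / 2 = 2.5 is floored to 2, so the output is 3.
    "floored": conv_output_size(8, 3, s=2),
}
print(sizes)  # {'valid': 30, 'same': 32, 'stride2': 16, 'floored': 3}
```

In the last case the division is not exact, so the final input column is never covered by the kernel. Checking sizes this way before building a model helps catch silent cropping of that kind.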