Convolutional Neural Networks (CNNs) Cheat Sheet

Updated 2026-05-20

Next Topic: Data Augmentation Strategies for Deep Learning Cheat Sheet

Convolutional Neural Networks are a class of deep learning models designed to process data with a grid-like topology — most famously images — by exploiting local spatial correlations through learnable filters. They sit at the heart of modern computer vision, enabling tasks from image classification and object detection to medical imaging and autonomous driving. Unlike fully connected networks, CNNs achieve parameter sharing and translation equivariance by sliding the same filter across the entire input, which dramatically reduces parameter count while preserving spatial structure. The key mental model to hold throughout is that a CNN is a hierarchy of feature detectors: early layers learn edges and textures, middle layers learn parts, and deep layers learn semantics — and every design choice (kernel size, stride, normalization, skip connections) shapes how information flows through that hierarchy.

What This Cheat Sheet Covers

This topic spans 11 focused tables and 95 indexed concepts. Below is a complete table-by-table outline of this topic, spanning foundational concepts through advanced details.

Table 1: Convolution Operation FundamentalsTable 2: Pooling LayersTable 3: Feature Maps, Channels, and DepthTable 4: Receptive FieldTable 5: Normalization TechniquesTable 6: Activation Functions in CNNsTable 7: Classic CNN ArchitecturesTable 8: ConvNeXt — Modern CNN DesignTable 9: Training Techniques and RegularizationTable 10: Transfer Learning with Pretrained BackbonesTable 11: Common Pitfalls and Debugging

Table 1: Convolution Operation Fundamentals

The convolution operation is the mathematical core of CNNs: a learnable filter slides across the input, computing dot products at each position to produce a feature map. Understanding the mechanics — how filters, stride, padding, and dilation interact — is the prerequisite for understanding every CNN architecture.

Operation	Example	Description
Convolution (cross-correlation)	`# output[i,j] = sum(input[i:i+k, j:j+k] * kernel)` `# over all channels`	Slides a learnable kernel across the input and computes the element-wise dot product at each position; the learned kernel weights are shared across all spatial positions.
Output size formula	$O = \lfloor (W - K + 2P) / S \rfloor + 1$	Computes output spatial dimension where $W$ = input size, $K$ = kernel size, $P$ = padding, $S$ = stride; must yield an integer or the configuration is invalid.
Zero padding (same padding)	`P = (K - 1) / 2` (stride=1)	Adds zeros around input borders so the output has the same spatial size as the input when stride=1; prevents boundary features from being underrepresented.
Valid padding	`P = 0` → output shrinks by `K-1`	No padding added; output spatial size shrinks by `K-1` per dimension; used when spatial reduction is intentional.
Stride	`stride=2` halves output H/W	Controls how many pixels the filter advances per step; stride > 1 downsamples the feature map, reducing computation and spatial resolution.

Table 1: Convolution Operation Fundamentals

Operation	Example	Description
Convolution (cross-correlation)	`# output[i,j] = sum(input[i:i+k, j:j+k] * kernel)` `# over all channels`	Slides a learnable kernel across the input and computes the element-wise dot product at each position; the learned kernel weights are shared across all spatial positions.
Output size formula	$O = \lfloor (W - K + 2P) / S \rfloor + 1$	Computes output spatial dimension where $W$ = input size, $K$ = kernel size, $P$ = padding, $S$ = stride; must yield an integer or the configuration is invalid.
Zero padding (same padding)	`P = (K - 1) / 2` (stride=1)	Adds zeros around input borders so the output has the same spatial size as the input when stride=1; prevents boundary features from being underrepresented.
Valid padding	`P = 0` → output shrinks by `K-1`	No padding added; output spatial size shrinks by `K-1` per dimension; used when spatial reduction is intentional.
Stride	`stride=2` halves output H/W	Controls how many pixels the filter advances per step; stride > 1 downsamples the feature map, reducing computation and spatial resolution.