Computer Vision Cheat Sheet

Updated 2026-04-28

Computer Vision is a field of artificial intelligence that enables machines to interpret and understand visual information from the world—images, videos, and camera streams. It powers applications from autonomous vehicles to medical imaging, bridging perception and decision-making. At its core, Computer Vision combines convolutional neural networks (CNNs), vision transformers, classical image processing, and foundation models to extract features, detect objects, and segment scenes. One critical insight: the choice of architecture and preprocessing directly determines whether your model generalizes to real-world variations in lighting, occlusion, and scale—clean training data and appropriate augmentation are not optional extras but foundational requirements.

What This Cheat Sheet Covers

This topic spans 20 focused tables and 199 indexed concepts. Below is a complete table-by-table outline of this topic, spanning foundational concepts through advanced details.

Table 1: Core Vision Architectures
Table 2: Object Detection Methods
Table 3: Image Segmentation Techniques
Table 4: Image Classification & Transfer Learning
Table 5: CNN Building Blocks
Table 6: Activation Functions
Table 7: Data Augmentation Techniques
Table 8: Loss Functions for Computer Vision
Table 9: Optimization & Training Techniques
Table 10: Evaluation Metrics
Table 11: Image Preprocessing
Table 12: Classical Feature Extraction
Table 13: Pose Estimation & Keypoint Detection
Table 14: Face Recognition & Detection
Table 15: 3D Computer Vision & Depth Estimation
Table 16: Motion & Video Analysis
Table 17: Image Enhancement & Restoration
Table 18: Model Interpretation & Visualization
Table 19: Advanced Training Techniques
Table 20: Specialized Applications

Table 1: Core Vision Architectures

These are the backbone networks that every vision pipeline is built on—from the CNNs that started the deep learning revolution (AlexNet, VGG, ResNet) to the transformer-based models (ViT, Swin) and self-supervised giants (DINOv2) that now dominate the field. Reading top to bottom roughly traces the historical progression, and the trade-off you keep running into is accuracy versus efficiency: a MobileNet runs on your phone, a Swin Transformer wins benchmarks but demands far more compute.
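The CNN end of that progression rests on one repeated stage: convolve, apply a nonlinearity, then pool. The sketch below walks a toy image through that stage in plain Python so the shapes and values are easy to follow. The 6×6 image and the vertical-edge filter are illustrative assumptions; a trained network learns its filters, and a real pipeline would use a framework layer rather than nested lists.

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image (stride 1, no padding) and sum the products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh) for dj in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def relu(fmap):
    """Zero out negative activations."""
    return [[max(0.0, v) for v in row] for row in fmap]


def max_pool2d(fmap, size=2):
    """Non-overlapping max pooling; rows/columns that do not fill a window are dropped."""
    out_h, out_w = len(fmap) // size, len(fmap[0]) // size
    return [
        [
            max(fmap[i * size + di][j * size + dj]
                for di in range(size) for dj in range(size))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


# Toy 6x6 image (assumed for illustration): dark left half, bright right half.
image = [[0.0, 0.0, 0.0, 1.0, 1.0, 1.0] for _ in range(6)]

# Hand-picked vertical-edge filter (assumed; a CNN would learn this).
kernel = [[-1.0, 0.0, 1.0],
          [-1.0, 0.0, 1.0],
          [-1.0, 0.0, 1.0]]

conv_out = conv2d_valid(image, kernel)            # 6x6 -> 4x4
features = max_pool2d(relu(conv_out), size=2)     # 4x4 -> 2x2
print(features)
```

The 3×3 filter shrinks the 6×6 input to 4×4, and the 2×2 pooling halves that to 2×2. The filter responds only where it straddles the dark-to-bright edge, which is the sense in which convolutional filters extract local spatial features.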

Architecture	Example	Description
Convolutional Neural Network (CNN)	`Conv2D(32, (3,3)) → ReLU → MaxPool2D(2,2)`	• Feedforward network using convolutional filters to extract spatial hierarchies of features • the foundation of modern computer vision.
ResNet (Residual Network)	`x + F(x)`	• Introduces skip connections that enable training very deep networks (50–152 layers) by mitigating vanishing gradients • backbone for many tasks.
EfficientNet	`Compound scaling: depth + width + resolution`	• Scales network depth, width, and input resolution together using a single compound coefficient, starting from a baseline network (EfficientNet-B0) found by neural architecture search • strong accuracy/efficiency trade-off when introduced in 2019.
Vision Transformer (ViT)	`image → patches → self-attention`	• Applies transformer self-attention directly to a sequence of image patches • excels when pretrained on large datasets; lacks the locality and translation-equivariance biases built into convolutions.
Swin Transformer	`patches → local windows → shifted windows`	• Hierarchical ViT using shifted window self-attention for cross-window interaction; linear complexity with image size • ICCV 2021 best paper; dominant backbone for detection and segmentation.
ConvNeXt	`4×4 patchify → depthwise Conv7×7 → LayerNorm → GELU`	• Modernizes ResNet with transformer-inspired design choices (large kernels, LayerNorm, inverted bottleneck) • matches ViT performance while retaining CNN efficiency and hardware friendliness.
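The "linear complexity with image size" note in the Swin row can be checked with a little arithmetic. The sketch below uses the per-layer cost estimates given in the Swin Transformer paper for global self-attention and for window-restricted self-attention (softmax cost omitted). The function names are hypothetical, and the channel width of 96 and window size of 7 are assumed here as typical Swin-T settings.

```python
def global_attention_cost(h, w, c):
    """Cost estimate for global self-attention over an h x w token grid (ViT-style).

    4*h*w*c^2 covers the Q/K/V and output projections;
    2*(h*w)^2*c covers the attention matrix and its product with V.
    """
    n = h * w
    return 4 * n * c ** 2 + 2 * n ** 2 * c


def window_attention_cost(h, w, c, m=7):
    """Cost estimate when attention is confined to non-overlapping m x m windows (Swin-style).

    Each token attends to only m*m tokens, so the second term is linear in h*w.
    """
    n = h * w
    return 4 * n * c ** 2 + 2 * (m ** 2) * n * c


# Compare how both estimates grow as the token grid gets larger.
for side in (14, 28, 56):
    g = global_attention_cost(side, side, 96)
    wnd = window_attention_cost(side, side, 96)
    print(f"{side}x{side} tokens: global={g:,}  windowed={wnd:,}  ratio={g / wnd:.1f}")
```

Doubling the grid side quadruples the token count. The windowed estimate grows by exactly that factor, while the global estimate grows faster because of its quadratic term, which is why windowed attention is attractive for high-resolution detection and segmentation inputs.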
