Computer Vision is a field of artificial intelligence that enables machines to interpret and understand visual information from the world—images, videos, and camera streams. It powers applications from autonomous vehicles to medical imaging, bridging perception and decision-making. At its core, Computer Vision combines convolutional neural networks (CNNs), vision transformers, classical image processing, and foundation models to extract features, detect objects, and segment scenes. One critical insight: the choice of architecture and preprocessing directly determines whether your model generalizes to real-world variations in lighting, occlusion, and scale—clean training data and appropriate augmentation are not optional extras but foundational requirements.
What This Cheat Sheet Covers
This topic spans 20 focused tables and 199 indexed concepts. Below is a complete table-by-table outline of this topic, spanning foundational concepts through advanced details.
Table 1: Core Vision Architectures
| Architecture | Example | Description |
|---|---|---|
Conv2D(32, (3,3)) → ReLU → MaxPool2D(2,2) | • Feedforward network using convolutional filters to extract spatial hierarchies of features • the foundation of modern computer vision. | |
x + F(x) | • Introduces skip connections that enable training very deep networks (50–152 layers) by mitigating vanishing gradients • backbone for many tasks. | |
Compound scaling: depth + width + resolution | • Systematically scales network depth, width, and input resolution together using neural architecture search • state-of-the-art accuracy/efficiency trade-off. | |
image → patches → self-attention | • Applies transformer self-attention directly to image patches • excels with large datasets, bypasses convolutional inductive bias. | |
patches → local windows → shifted windows | • Hierarchical ViT using shifted window self-attention for cross-window interaction; linear complexity with image size • ICCV 2021 best paper; dominant backbone for detection and segmentation. | |
4×4 patchify → depthwise Conv7×7 → LayerNorm → GELU | • Modernizes ResNet with transformer-inspired design choices (large kernels, LayerNorm, inverted bottleneck) • matches ViT performance while retaining CNN efficiency and hardware friendliness. |