Vision Transformers (ViTs) apply the Transformer architecture — originally designed for NLP — directly to sequences of image patches, replacing convolutional inductive biases with global self-attention from the very first layer. Introduced in Google's "An Image is Worth 16×16 Words" paper (2020), ViTs have since become the backbone of choice for large-scale vision pretraining, powering models like DINOv2, CLIP, and modern detection and segmentation systems. Unlike CNNs, which build spatial hierarchies through local receptive fields, ViTs require large datasets or strong self-supervised pretraining to compensate for their weaker inductive bias — but once pretrained, they transfer exceptionally well and scale more predictably with data and compute.
What This Cheat Sheet Covers
This topic spans 18 focused tables and 100 indexed concepts. Below is a complete table-by-table outline, from foundational concepts through advanced details.
Table 1: Patch Embedding and Image Tokenization
The first transformation any ViT applies is splitting an image into non-overlapping fixed-size patches and projecting each into a learned embedding vector. This is the bridge between raw pixels and the sequence-of-tokens view that the Transformer encoder expects. Patch size is the primary speed/accuracy knob: smaller patches produce longer sequences with finer granularity, while larger patches are faster but coarser.
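The patch-size trade-off described above can be made concrete with a little arithmetic. The sketch below is illustrative only: the helper names `num_patch_tokens` and `relative_attention_cost` are hypothetical, and it assumes standard dense self-attention, whose score matrix is N×N and therefore grows quadratically with the token count.

```python
def num_patch_tokens(height: int, width: int, patch: int) -> int:
    """Number of non-overlapping patch tokens, N = (H/P) * (W/P)."""
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    return (height // patch) * (width // patch)


def relative_attention_cost(n_tokens: int, n_ref: int) -> float:
    """Attention-score cost relative to a reference, assuming cost ~ N^2."""
    return (n_tokens / n_ref) ** 2


# Compare patch sizes on a 224x224 image, using P=32 as the reference.
n_ref = num_patch_tokens(224, 224, 32)
for p in (32, 16, 8):
    n = num_patch_tokens(224, 224, p)
    print(f"P={p:>2}: {n:>3} tokens, attention cost x{relative_attention_cost(n, n_ref):g}")
# P=32:  49 tokens, attention cost x1
# P=16: 196 tokens, attention cost x16
# P= 8: 784 tokens, attention cost x256
```

Each halving of the patch size quadruples the token count and multiplies the attention-score term by about 16. The per-token linear and MLP layers grow only about 4×, so the overall slowdown of a real model falls somewhere between the two.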
| Technique | Example | Description |
|---|---|---|
| Patch embedding | `# 224×224 image, 16×16 patch → 196 tokens`<br>`proj = nn.Conv2d(3, d_model, 16, stride=16)`<br>`x = proj(img).flatten(2).transpose(1, 2)` | Divides the image into P×P patches, flattens each to a 1-D vector, and applies a linear projection (commonly implemented as a stride-P convolution); produces N = (H/P)×(W/P) patch tokens. |
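| Patch size trade-off | `# ViT-B/16 at 224×224 → 196 tokens (slower, more accurate; common default)`<br>`# ViT-B/32 at 224×224 → 49 tokens (faster, less accurate)` | Smaller patches generally yield higher accuracy at greater compute cost. For a square image the sequence length scales as (H/P)^2, so halving the patch size quadruples the token count. The attention score computation is quadratic in sequence length and grows roughly 16×, while the per-token linear and MLP layers grow roughly 4×. |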
| Convolutional patchify stem | `stem = nn.Conv2d(3, C, 4, stride=4)` | ConvNeXt-style 4×4 non-overlapping convolution used as a patch embedding stem; it mirrors ViT patch embedding without attention and is common in hybrid and pure-CNN models modernized from ViT design. |
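
The rows above can be tied together in a small PyTorch module. This is a minimal sketch rather than any particular library's implementation: the class name `PatchEmbed`, the helper `patch_embed_via_unfold`, and the default `d_model=768` (the ViT-Base width) are assumptions made for illustration. The helper rewrites the same projection as an explicit flatten-then-matmul to show that the stride-P convolution is a per-patch linear layer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PatchEmbed(nn.Module):
    """Split an image into PxP patches and project each to d_model dims."""

    def __init__(self, patch_size: int = 16, in_chans: int = 3, d_model: int = 768):
        super().__init__()
        self.patch_size = patch_size
        # A PxP kernel with stride P sees each patch exactly once.
        self.proj = nn.Conv2d(in_chans, d_model, kernel_size=patch_size, stride=patch_size)

    def forward(self, img: torch.Tensor) -> torch.Tensor:
        x = self.proj(img)        # (B, d_model, H/P, W/P)
        x = x.flatten(2)          # (B, d_model, N)
        return x.transpose(1, 2)  # (B, N, d_model)


def patch_embed_via_unfold(embed: PatchEmbed, img: torch.Tensor) -> torch.Tensor:
    """Same projection as explicit patch flattening followed by a matmul."""
    p = embed.patch_size
    # unfold returns (B, C*P*P, N); move the patch axis to the middle.
    patches = F.unfold(img, kernel_size=p, stride=p).transpose(1, 2)
    weight = embed.proj.weight.flatten(1)        # (d_model, C*P*P)
    return patches @ weight.T + embed.proj.bias  # (B, N, d_model)


torch.manual_seed(0)
embed = PatchEmbed()
img = torch.randn(2, 3, 224, 224)
tokens = embed(img)
print(tokens.shape)  # torch.Size([2, 196, 768])
```

Passing `patch_size=4` gives the same shape of computation as the 4×4 stem in the last row. ConvNeXt additionally applies a normalization layer after its stem, which this sketch omits.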