Vision Transformers (ViTs) Cheat Sheet

Updated 2026-05-21

Vision Transformers (ViTs) apply the Transformer architecture — originally designed for NLP — directly to sequences of image patches, replacing convolutional inductive biases with global self-attention from the very first layer. Introduced in Google's "An Image is Worth 16×16 Words" paper (2020), ViTs have since become the backbone of choice for large-scale vision pretraining, powering models like DINOv2, CLIP, and modern detection and segmentation systems. Unlike CNNs, which build spatial hierarchies through local receptive fields, ViTs require large datasets or strong self-supervised pretraining to compensate for their weaker inductive bias — but once pretrained, they transfer exceptionally well and scale more predictably with data and compute.

What This Cheat Sheet Covers

This cheat sheet spans 18 focused tables and 100 indexed concepts, from foundational concepts through advanced details. A table-by-table outline follows.

Table 1: Patch Embedding and Image Tokenization
Table 2: Positional Encoding
Table 3: Class Token and Pooling Strategies
Table 4: Multi-Head Self-Attention for Images
Table 5: ViT Architecture Variants and Model Sizes
Table 6: ViT vs CNN Tradeoffs
Table 7: Swin Transformer — Hierarchical Windowed ViT
Table 8: DeiT — Data-Efficient Training with Distillation
Table 9: ConvNeXt — CNN Modernized with Transformer Design Principles
Table 10: DINOv2 — Self-Supervised Pretraining
Table 11: CLIP — Contrastive Vision-Language Alignment
Table 12: Fine-Tuning ViT Backbones
Table 13: Self-Supervised Pretraining Methods
Table 14: Token Pruning and Merging for Speedup
Table 15: ViT for Dense Prediction (Detection and Segmentation)
Table 16: Efficient and Mobile Vision Transformers
Table 17: ViT for Video Understanding
Table 18: Training Techniques and Best Practices

Table 1: Patch Embedding and Image Tokenization

The first transformation any ViT applies is splitting an image into non-overlapping fixed-size patches and projecting each into a learned embedding vector — this is the bridge between raw pixels and the sequence-of-tokens view that the Transformer encoder expects. Patch size is the primary speed/accuracy knob: smaller patches produce longer sequences with finer granularity, larger patches are faster but coarser.
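To make the patch-size knob concrete, here is a minimal sketch in plain Python that counts tokens for a given image and patch size and estimates how cost scales relative to a reference. The helper names `patch_token_count` and `relative_cost` are illustrative, not from any library, and the cost factors are rough scaling estimates that ignore constant overheads such as the class token.

```python
def patch_token_count(height, width, patch):
    """Number of non-overlapping patch tokens, N = (H/P) * (W/P)."""
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by patch size")
    return (height // patch) * (width // patch)


def relative_cost(n_tokens, n_ref):
    """Rough cost scaling relative to a reference token count.

    Linear layers (projections, MLP) scale with N, while the
    attention score matrix scales with N^2.
    """
    ratio = n_tokens / n_ref
    return {"linear_layers": ratio, "attention_scores": ratio ** 2}


# Compare patch sizes 32, 16, and 8 on a 224x224 input, using /32 as reference.
reference = patch_token_count(224, 224, 32)
for patch in (32, 16, 8):
    n = patch_token_count(224, 224, patch)
    print(patch, n, relative_cost(n, reference))
```

On a 224×224 input this gives 49, 196, and 784 tokens for patch sizes 32, 16, and 8. Each halving of the patch size multiplies the token count by 4 and the attention score computation by 16.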

Technique	Example	Description
Patch embedding	`# 224×224 image, 16×16 patch → 196 tokens` `proj = nn.Conv2d(3, d_model, 16, stride=16)` `x = proj(img).flatten(2).transpose(1,2)`	• Divides image into P×P patches, flattens each to a 1-D vector, and applies a linear projection (commonly implemented as a stride-P convolution) • produces N = (H/P)×(W/P) patch tokens
Patch size tradeoff	`# ViT-B/16 → 196 tokens (slower, more accurate, common)` `# ViT-B/32 → 49 tokens (faster, less accurate)`	• Smaller patches yield higher accuracy at greater compute cost • sequence length scales as $(H/P)^2$, so halving patch size quadruples tokens: linear-layer cost grows about 4×, while the self-attention score computation, which is quadratic in sequence length, grows about 16×
Patchify stem	`stem = nn.Conv2d(3, C, 4, stride=4)`	• ConvNeXt-style 4×4 non-overlapping convolution used as a patch embedding stem • same operation as a ViT patch embedding with P = 4, but the stages that follow use convolutions rather than attention • common in hybrid models and in pure CNNs modernized with ViT design choices
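The split-and-flatten step that the stride-P convolution performs implicitly can be written out by hand. The sketch below uses plain Python lists and a hypothetical helper `patchify`. It covers only the tokenization, not the learned linear projection, and it flattens each patch in (row, column, channel) order, which is an arbitrary choice here because a learned projection would absorb any fixed ordering.

```python
def patchify(image, patch):
    """Split an H x W x C nested-list image into flattened P x P patches.

    Returns a list of N = (H/P) * (W/P) vectors, each of length P*P*C,
    in row-major patch order (left to right, then top to bottom).
    """
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by patch size")
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            flat = []
            for row in range(top, top + patch):
                for col in range(left, left + patch):
                    flat.extend(image[row][col])  # append this pixel's C channel values
            tokens.append(flat)
    return tokens


# Toy 4x4 single-channel image whose pixel value is its row-major index.
image = [[[r * 4 + c] for c in range(4)] for r in range(4)]
tokens = patchify(image, 2)
print(len(tokens), tokens[0])
```

With a 2×2 patch on this 4×4 toy image, the result is 4 tokens of length 4. The first token holds the top-left block of pixels, `[0, 1, 4, 5]`.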
