Vision Transformers (ViTs) Cheat Sheet

Updated 2026-05-21

Vision Transformers (ViTs) apply the Transformer architecture — originally designed for NLP — directly to sequences of image patches, replacing convolutional inductive biases with global self-attention from the very first layer. Introduced in Google's "An Image is Worth 16×16 Words" paper (2020), ViTs have since become the backbone of choice for large-scale vision pretraining, powering models like DINOv2, CLIP, and modern detection and segmentation systems. Unlike CNNs, which build spatial hierarchies through local receptive fields, ViTs require large datasets or strong self-supervised pretraining to compensate for their weaker inductive bias — but once pretrained, they transfer exceptionally well and scale more predictably with data and compute.

What This Cheat Sheet Covers

This topic spans 18 focused tables and 100 indexed concepts. Below is a complete table-by-table outline of this topic, spanning foundational concepts through advanced details.

  • Table 1: Patch Embedding and Image Tokenization
  • Table 2: Positional Encoding
  • Table 3: Class Token and Pooling Strategies
  • Table 4: Multi-Head Self-Attention for Images
  • Table 5: ViT Architecture Variants and Model Sizes
  • Table 6: ViT vs CNN Tradeoffs
  • Table 7: Swin Transformer — Hierarchical Windowed ViT
  • Table 8: DeiT — Data-Efficient Training with Distillation
  • Table 9: ConvNeXt — CNN Modernized with Transformer Design Principles
  • Table 10: DINOv2 — Self-Supervised Pretraining
  • Table 11: CLIP — Contrastive Vision-Language Alignment
  • Table 12: Fine-Tuning ViT Backbones
  • Table 13: Self-Supervised Pretraining Methods
  • Table 14: Token Pruning and Merging for Speedup
  • Table 15: ViT for Dense Prediction (Detection and Segmentation)
  • Table 16: Efficient and Mobile Vision Transformers
  • Table 17: ViT for Video Understanding
  • Table 18: Training Techniques and Best Practices

Table 1: Patch Embedding and Image Tokenization

The first transformation any ViT applies is splitting an image into non-overlapping fixed-size patches and projecting each into a learned embedding vector. This is the bridge between raw pixels and the sequence-of-tokens view that the Transformer encoder expects. Patch size is the primary speed/accuracy knob: smaller patches produce longer sequences with finer granularity, while larger patches are faster but coarser.
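The split, flatten, and project steps described above can be sketched explicitly in NumPy. This is a minimal illustration, not a reference implementation: the helper name `patchify`, the small width `d_model = 64`, and the random weights `W` and `b` are assumptions chosen for the example (a trained ViT learns the projection).

```python
import numpy as np

def patchify(img: np.ndarray, patch: int) -> np.ndarray:
    """Split a (C, H, W) image into flattened, non-overlapping patches.

    Returns an array of shape (N, C * patch * patch),
    where N = (H // patch) * (W // patch).
    """
    c, h, w = img.shape
    if h % patch or w % patch:
        raise ValueError("H and W must be divisible by the patch size")
    gh, gw = h // patch, w // patch
    # (C, H, W) -> (C, gh, P, gw, P): split each spatial axis into grid x patch
    x = img.reshape(c, gh, patch, gw, patch)
    # -> (gh, gw, C, P, P): one (C, P, P) block per grid cell, row-major order
    x = x.transpose(1, 3, 0, 2, 4)
    # -> (N, C*P*P): flatten each patch to a 1-D vector
    return x.reshape(gh * gw, c * patch * patch)

rng = np.random.default_rng(0)
img = rng.standard_normal((3, 224, 224))   # stand-in for a normalized RGB image
d_model = 64                               # hypothetical embedding width
W = rng.standard_normal((3 * 16 * 16, d_model)) * 0.02  # random, untrained
b = np.zeros(d_model)

tokens = patchify(img, 16) @ W + b         # linear projection of each patch
print(tokens.shape)                        # (196, 64)
```

A stride-P convolution with a P×P kernel computes the same per-patch linear map in a single call, which is why it is the usual implementation.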

Technique | Example | Description
Patch embedding
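import torch, torch.nn as nn
# 224×224 image, 16×16 patches → 14×14 = 196 tokens
d_model = 768  # ViT-Base embedding width
proj = nn.Conv2d(3, d_model, kernel_size=16, stride=16)
img = torch.randn(1, 3, 224, 224)
x = proj(img).flatten(2).transpose(1, 2)  # shape (1, 196, 768)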
Divides image into P×P patches, flattens each to a 1-D vector, and applies a linear projection (commonly implemented as a stride-P convolution); produces N = (H/P)×(W/P) patch tokens.
Patch size tradeoff
# ViT-B/16 → 196 tokens (slower, more accurate; the common default)
# ViT-B/32 → 49 tokens (faster, less accurate)
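Smaller patches yield higher accuracy at greater compute cost; for a square image the sequence length scales as (H/P)^2, so halving the patch size quadruples the token count. The per-token layers (projections and MLP) then cost roughly 4× as much, while the attention-matrix computation, which is quadratic in sequence length, costs roughly 16× as much.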
Patchify stem
stem = nn.Conv2d(3, C, 4, stride=4)
ConvNeXt-style 4×4, stride-4 non-overlapping convolution used as the network stem; it performs the same operation as ViT patch embedding with P = 4, but feeds convolutional stages rather than attention blocks. Common in hybrid models and in pure-CNN designs modernized with ViT ideas.
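The patch-size tradeoff in the table can be checked with quick arithmetic. The sketch below assumes a 224×224 input and counts only patch tokens (no class token); the function names `num_tokens` and `relative_cost` are hypothetical, and the cost model is deliberately rough: per-token layers scale linearly with sequence length and the attention matrix scales quadratically, with constant factors ignored.

```python
def num_tokens(height: int, width: int, patch: int) -> int:
    """Number of non-overlapping patch tokens: (H / P) * (W / P)."""
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    return (height // patch) * (width // patch)

def relative_cost(n_new: int, n_ref: int) -> dict:
    """Rough cost of n_new tokens relative to n_ref tokens."""
    ratio = n_new / n_ref
    return {
        "per_token_layers": ratio,       # projections and MLP: linear in N
        "attention_matrix": ratio ** 2,  # QK^T and attention-weighted sum: quadratic in N
    }

reference = num_tokens(224, 224, 16)     # ViT-B/16 style tokenization
for patch in (32, 16, 8):
    n = num_tokens(224, 224, patch)
    print(patch, n, relative_cost(n, reference))
```

Relative to 16×16 patches, 8×8 patches give 4× the tokens and about 16× the attention-matrix work, while 32×32 patches give a quarter of the tokens.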
