Vision-Language Models (VLMs) are multimodal AI systems that combine visual perception with natural language understanding, so a single model can reason over images, videos, and text together. They power applications ranging from visual question answering to zero-shot image classification. At their core, VLMs learn a shared embedding space in which semantically similar images and text descriptions cluster together, a capability that grew out of the contrastive learning approach popularized by CLIP and refined by later architectures. A critical insight: the quality of vision-language alignment depends not only on model architecture, but also on how well the model bridges the modality gap between continuous visual features and discrete linguistic tokens through its fusion mechanism, and on how well the training data covers the visual scenes and instructions encountered at inference time.
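The shared embedding space described above is typically trained with a symmetric contrastive (InfoNCE-style) objective. The following is a minimal NumPy sketch of that idea, not CLIP's actual training code: the function names, the toy random embeddings, and the fixed `temperature=0.07` are assumptions made for illustration (in practice the temperature is usually a learned parameter and the embeddings come from real image and text encoders).

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    # Scale each embedding to unit length so dot products are cosine similarities.
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)

def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def symmetric_info_nce(image_emb, text_emb, temperature=0.07):
    """Average of image-to-text and text-to-image cross-entropy.

    Row i of image_emb and row i of text_emb are assumed to be a matched pair;
    every other row in the batch acts as a negative.
    """
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_emb)
    logits = img @ txt.T / temperature        # (N, N) similarity matrix
    diag = np.arange(logits.shape[0])         # matched pairs sit on the diagonal
    loss_i2t = -log_softmax(logits, axis=1)[diag, diag].mean()
    loss_t2i = -log_softmax(logits, axis=0)[diag, diag].mean()
    return 0.5 * (loss_i2t + loss_t2i)

# Toy data (hypothetical): image embeddings are noisy copies of the text embeddings.
rng = np.random.default_rng(0)
text_emb = rng.normal(size=(4, 8))
aligned_images = text_emb + 0.01 * rng.normal(size=(4, 8))
shuffled_images = aligned_images[[1, 2, 3, 0]]  # break the pairing on purpose

loss_aligned = symmetric_info_nce(aligned_images, text_emb)
loss_shuffled = symmetric_info_nce(shuffled_images, text_emb)
```

With this setup the loss for correctly paired embeddings should come out lower than the loss for the shuffled pairing, which is the signal that pulls matching images and captions together during training.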
What This Cheat Sheet Covers
This topic spans 16 focused tables and 130 indexed concepts. Below is a complete table-by-table outline of this topic, spanning foundational concepts through advanced details.
Table 1: Foundation VLM Architectures
The VLM landscape has evolved through three architectural eras: early contrastive models (CLIP), efficient connector-based models (BLIP-2, LLaVA), and the current generation, which treats vision as a native LLM modality with dynamic resolution and strong reasoning. Understanding each generation's design tradeoffs (frozen encoders vs. joint training, contrastive vs. generative objectives, lightweight connectors vs. heavy cross-attention) is essential before working with any modern VLM.
| Model | Example | Description |
|---|---|---|
| CLIP | `image_encoder = ViT-L/14`<br>`text_encoder = Transformer` | • Dual-encoder architecture trained on 400M image-text pairs with an InfoNCE contrastive loss<br>• Enables zero-shot transfer to downstream tasks without fine-tuning |
| LLaVA | `vision_encoder = CLIP ViT`<br>`llm = Vicuna-13B`<br>`projection = linear` | • Large Language and Vision Assistant trained on GPT-4-generated instruction data<br>• A simple linear projection layer maps visual features into the LLM token space for multimodal conversations |
| LLaVA-1.5 | `vision_encoder = CLIP-ViT-L-336px`<br>`connector = MLP_2-layer`<br>`llm = Vicuna-13B` | • Replaces the linear projection with a 2-layer MLP connector and adds academic-task VQA data<br>• Achieved SOTA across 11 benchmarks at release using only 1.2M public training samples, trained in ~1 day on 8 A100s |
| BLIP-2 | `Q-Former + frozen vision encoder + frozen LLM` | • Connects a frozen image encoder and a frozen LLM via a lightweight Querying Transformer (Q-Former) with 32 learnable queries<br>• Achieved SOTA at release with 54× fewer trainable parameters than Flamingo |
| Flamingo | `Perceiver Resampler + gated cross-attention`<br>`few_shot_examples = 4` | • Few-shot learning specialist that uses a Perceiver Resampler to compress visual tokens<br>• Integrates images into a frozen LLM via gated cross-attention layers inserted between Transformer blocks |
| InstructBLIP | `BLIP-2 + instruction_tuning`<br>`datasets = 26_transformed` | • Instruction-tuned version of BLIP-2 trained on 26 vision-language datasets<br>• Follows natural language instructions for diverse tasks without task-specific heads |
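
The connector styles in the table (linear projection, 2-layer MLP, and learnable-query compression as used in Q-Former-like and Perceiver-Resampler-like modules) differ mainly in how many visual tokens reach the LLM and how they are transformed. The NumPy sketch below illustrates that difference under stated assumptions: the widths `vision_dim=64` and `llm_dim=128` are toy values (real models are far wider), the weights are random rather than trained, the GELU activation in the MLP is an assumption, and `query_resampler` is a single-head simplification, not the actual Q-Former or Perceiver Resampler.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def init_weight(rng, d_in, d_out):
    # Random weights stand in for trained parameters (illustrative only).
    return rng.normal(scale=d_in ** -0.5, size=(d_in, d_out))

def linear_connector(patches, w, b):
    # One affine map per patch token: vision width -> LLM width.
    return patches @ w + b

def mlp_connector(patches, w1, b1, w2, b2):
    # Two affine maps with a nonlinearity in between.
    return gelu(patches @ w1 + b1) @ w2 + b2

def query_resampler(patches, queries, w_q, w_k, w_v):
    # Learnable queries cross-attend to all patches and return a fixed,
    # smaller number of visual tokens (one per query).
    q, k, v = queries @ w_q, patches @ w_k, patches @ w_v
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)  # (num_queries, num_patches)
    return attn @ v

image_size, patch_size = 336, 14
num_patches = (image_size // patch_size) ** 2   # 24 x 24 patch grid
vision_dim, llm_dim, num_queries = 64, 128, 32  # toy widths, hypothetical

rng = np.random.default_rng(0)
patch_features = rng.normal(size=(num_patches, vision_dim))

linear_tokens = linear_connector(
    patch_features, init_weight(rng, vision_dim, llm_dim), np.zeros(llm_dim))
mlp_tokens = mlp_connector(
    patch_features,
    init_weight(rng, vision_dim, llm_dim), np.zeros(llm_dim),
    init_weight(rng, llm_dim, llm_dim), np.zeros(llm_dim))
resampled_tokens = query_resampler(
    patch_features,
    rng.normal(size=(num_queries, llm_dim)),
    init_weight(rng, llm_dim, llm_dim),
    init_weight(rng, vision_dim, llm_dim),
    init_weight(rng, vision_dim, llm_dim))

# Visual tokens are placed alongside the text token embeddings as LLM input.
text_tokens = rng.normal(size=(12, llm_dim))
llm_input = np.concatenate([mlp_tokens, text_tokens], axis=0)
```

The projection-style connectors keep one token per image patch, so the LLM sees the full patch grid, while the query-based module compresses the image to `num_queries` tokens. That is the tradeoff between visual detail and sequence length that separates these designs.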