Vision-Language Models (VLMs) Cheat Sheet

Updated 2026-05-25

Vision-Language Models (VLMs) are multimodal AI systems that combine visual perception with natural language understanding, so a single model can reason over images, video, and text together. They power applications ranging from visual question answering to zero-shot image classification. Contrastive VLMs such as CLIP learn a shared embedding space in which semantically similar images and text descriptions cluster together; later architectures build on those aligned visual features, typically by connecting the vision encoder to a large language model. A critical insight: the quality of vision-language alignment depends not only on model architecture but also on (1) how effectively the fusion mechanism bridges the modality gap between continuous visual features and discrete linguistic tokens, and (2) how well the training data covers the diversity of visual scenes and instructions encountered at inference time.
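The shared embedding space described above can be illustrated with a minimal sketch of a CLIP-style symmetric contrastive (InfoNCE) loss. This is an illustration under stated assumptions, not CLIP's actual training code: the function names, the toy 8-dimensional embeddings, and the fixed temperature of 0.07 are all chosen here for demonstration (CLIP learns its temperature during training).

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale each embedding to unit length so dot products are cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)

def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def symmetric_info_nce(image_emb, text_emb, temperature=0.07):
    """Average of image-to-text and text-to-image cross-entropy.

    Row i of image_emb and row i of text_emb are assumed to be a matching pair;
    every other row in the batch acts as a negative.
    """
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_emb)
    logits = img @ txt.T / temperature           # (N, N) similarity matrix
    idx = np.arange(logits.shape[0])             # matching pairs sit on the diagonal
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_i2t + loss_t2i)

# Toy data (hypothetical): image embeddings that nearly coincide with their captions.
rng = np.random.default_rng(0)
text_emb = rng.normal(size=(4, 8))
aligned_images = text_emb + 0.01 * rng.normal(size=(4, 8))
shuffled_images = aligned_images[[1, 2, 3, 0]]   # every image paired with the wrong caption

loss_aligned = symmetric_info_nce(aligned_images, text_emb)
loss_shuffled = symmetric_info_nce(shuffled_images, text_emb)
print(f"aligned: {loss_aligned:.4f}  shuffled: {loss_shuffled:.4f}")
```

The loss is low when each image sits closest to its own caption and high when pairs are mismatched, which is the pressure that makes matching images and text cluster together.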

What This Cheat Sheet Covers

This topic spans 16 focused tables and 130 indexed concepts. Below is a complete table-by-table outline of this topic, spanning foundational concepts through advanced details.

Table 1: Foundation VLM Architectures
Table 2: Vision Encoders
Table 3: Text Encoders
Table 4: Multimodal Fusion Strategies
Table 5: Pre-training Objectives
Table 6: Core VLM Tasks
Table 7: Alignment & Training Strategies
Table 8: Attention Mechanisms
Table 9: Key Architectural Components
Table 10: Loss Functions & Optimization
Table 11: Evaluation Metrics
Table 12: Benchmark Datasets
Table 13: Advanced Techniques
Table 14: Common Implementation Details
Table 15: Video Understanding in VLMs
Table 16: Hallucination & Safety in VLMs

Table 1: Foundation VLM Architectures

The VLM landscape has evolved through three architectural eras: early contrastive models (CLIP), efficient connector-based models (BLIP-2, LLaVA), and the current generation treating vision as a native LLM modality with dynamic resolution and strong reasoning. Understanding each generation's design tradeoffs — frozen encoders vs. joint training, contrastive vs. generative objectives, lightweight connectors vs. heavy cross-attention — is essential before working with any modern VLM.

Model	Example	Description
CLIP	`image_encoder = ViT-L/14` `text_encoder = Transformer`	• Dual-encoder architecture trained on 400M image-text pairs using InfoNCE contrastive loss • enables powerful zero-shot transfer to downstream tasks without fine-tuning
LLaVA	`vision_encoder = CLIP ViT` `llm = Vicuna-13B` `projection = linear`	• Large Language and Vision Assistant using GPT-4-generated instruction data • simple linear projection layer maps visual features to LLM token space for multimodal conversations
LLaVA-1.5	`vision_encoder = CLIP-ViT-L-336px` `connector = MLP_2-layer` `llm = Vicuna-13B`	• Replaces linear projection with 2-layer MLP connector and adds academic-task VQA data • achieved SOTA on 11 benchmarks at release using only 1.2M public training samples in ~1 day on 8 A100s
BLIP-2	`Q-Former + frozen vision encoder` `+ frozen LLM`	• Connects frozen image encoder and frozen LLM via lightweight Querying Transformer (Q-Former) with 32 learnable queries • outperforms Flamingo-80B on zero-shot VQAv2 with 54× fewer trainable parameters
Flamingo	`Perceiver Resampler + gated cross-attention` `few_shot_examples = 4`	• Few-shot learning specialist using Perceiver Resampler to compress visual tokens • integrates images into frozen LLM via gated cross-attention layers inserted between Transformer blocks
InstructBLIP	`BLIP-2 + instruction_tuning` `datasets = 26_transformed`	• Instruction-tuned version of BLIP-2 built from 26 vision-language datasets converted to instruction format (13 held-in for training, 13 held-out for zero-shot evaluation) • follows natural language instructions for diverse tasks without task-specific heads
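
The connector rows in the table (linear projection in LLaVA, 2-layer MLP in LLaVA-1.5) can be sketched roughly as follows. This is a simplified illustration, not either project's code: the function names, the toy widths (16 and 32), and the random weights are assumptions made here, and the visual tokens are simply prepended to the text tokens, whereas real implementations splice them in at an image placeholder position in the prompt. The patch count uses the table's 336px input with the 14-pixel patch size of ViT-L/14.

```python
import numpy as np

def gelu(x):
    """Tanh approximation of GELU, used here as the MLP nonlinearity."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def num_patch_tokens(image_size, patch_size):
    """Number of non-overlapping square patches a ViT produces for a square image."""
    return (image_size // patch_size) ** 2

def linear_connector(visual, w, b):
    """LLaVA-style connector: one linear map from vision width to LLM width."""
    return visual @ w + b

def mlp_connector(visual, w1, b1, w2, b2):
    """LLaVA-1.5-style connector: linear -> GELU -> linear."""
    return gelu(visual @ w1 + b1) @ w2 + b2

rng = np.random.default_rng(0)
vision_dim, llm_dim = 16, 32                     # toy widths (hypothetical)
n_patches = num_patch_tokens(336, 14)            # 24 x 24 = 576 patch tokens

# Stand-in for patch features from a frozen vision encoder.
visual = rng.normal(size=(n_patches, vision_dim))

# Randomly initialised connector weights; in training these are the learned part.
w1 = rng.normal(scale=0.02, size=(vision_dim, llm_dim))
b1 = np.zeros(llm_dim)
w2 = rng.normal(scale=0.02, size=(llm_dim, llm_dim))
b2 = np.zeros(llm_dim)

visual_tokens_linear = linear_connector(visual, w1, b1)
visual_tokens_mlp = mlp_connector(visual, w1, b1, w2, b2)

# Stand-in for 10 embedded text tokens; the LLM sees one combined sequence.
text_tokens = rng.normal(size=(10, llm_dim))
llm_input = np.concatenate([visual_tokens_mlp, text_tokens], axis=0)
print(n_patches, visual_tokens_mlp.shape, llm_input.shape)
```

Both connectors leave the number of visual tokens unchanged and only change their width, which is why a 336px image costs 576 positions of LLM context in this setup. Query-based designs such as Q-Former or the Perceiver Resampler instead compress the visual sequence to a fixed number of tokens.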
