Vision-Language Models (VLMs) are multimodal AI systems that combine visual perception with natural language understanding, so a single model can reason jointly over images, video, and text. They power applications ranging from visual question answering to zero-shot image classification. At their core, VLMs learn a shared embedding space in which semantically similar images and text descriptions cluster together, a capability that grew out of the contrastive learning techniques popularized by CLIP and refined by subsequent architectures. A critical insight: the quality of vision-language alignment depends not only on model architecture but also on how effectively the fusion mechanism bridges the modality gap between continuous visual features and discrete linguistic tokens.
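
The shared embedding space described above is typically trained with a symmetric contrastive (InfoNCE-style) objective: matched image-text pairs are pulled together and mismatched pairs pushed apart. The following is a minimal pure-Python sketch of that idea, not any particular model's training code. The function names, the toy 2-D embeddings, and the temperature of 0.07 are illustrative assumptions; real systems use a tensor library and learned encoders.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def cosine_logits(image_embs, text_embs, temperature=0.07):
    """Return the (num_images x num_texts) matrix of cosine similarity / temperature."""
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    return [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts] for i in imgs]

def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)

def symmetric_info_nce(image_embs, text_embs, temperature=0.07):
    """Average of the image-to-text and text-to-image contrastive losses."""
    logits = cosine_logits(image_embs, text_embs, temperature)
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(transposed))

if __name__ == "__main__":
    # Toy embeddings (assumed values): pair i of images matches pair i of texts.
    images = [[1.0, 0.0], [0.0, 1.0]]
    aligned_texts = [[0.9, 0.1], [0.1, 0.9]]
    shuffled_texts = [[0.1, 0.9], [0.9, 0.1]]
    print("aligned loss: ", symmetric_info_nce(images, aligned_texts))
    print("shuffled loss:", symmetric_info_nce(images, shuffled_texts))
```

When the text embeddings line up with their images the loss is close to zero, and it grows when the pairing is shuffled, which is the signal that drives the two encoders toward a common space.
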
What This Cheat Sheet Covers
This topic spans 14 focused tables and 91 indexed concepts. Below is a table-by-table outline, from foundational concepts through advanced details.
Table 1: Foundation VLM Architectures
| Model | Example | Description |
|---|---|---|
| CLIP | image_encoder = ViT-L/14<br>text_encoder = Transformer | • Dual-encoder architecture trained on 400M image-text pairs using InfoNCE contrastive loss • enables powerful zero-shot transfer to downstream tasks without fine-tuning. |
| BLIP | model = MED(unimodal + multimodal)<br>data = CapFilt bootstrapping | • Unified encoder-decoder framework with Multimodal Encoder-Decoder (MED) • uses image-text contrastive loss, matching loss, and captioning loss • includes CapFilt for noisy data filtering. |
| BLIP-2 | Q-Former + frozen vision encoder + frozen LLM | • Connects frozen image encoder and LLM via lightweight Querying Transformer (Q-Former) with learnable queries • achieves SOTA with 54× fewer trainable parameters than Flamingo. |
| Flamingo | Perceiver Resampler + gated cross-attention<br>few_shot_examples = 4 | • Few-shot learning specialist using Perceiver Resampler to compress visual tokens • integrates images into frozen LLM via gated cross-attention layers inserted between Transformer blocks. |
| LLaVA | vision_encoder = CLIP ViT<br>llm = Vicuna-13B<br>projection = linear | • Large Language and Vision Assistant using GPT-4 generated instruction data • simple linear projection layer maps visual features to LLM token space for multimodal conversations. |
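
Two of the fusion mechanisms in the table are simple enough to sketch directly: the linear projection that maps visual features into the LLM token space, and the gate applied to cross-attention output before it is added to a frozen LLM's hidden states. The snippet below is an illustrative pure-Python sketch under stated assumptions, not either model's actual code. The function names and toy shapes are hypothetical, the cross-attention output is passed in as a placeholder rather than computed, and the tanh gate initialized at zero reflects how Flamingo-style gating is usually described.

```python
import math

def linear_project(visual_tokens, weight, bias):
    """Map each visual feature vector (width d_v) to the LLM embedding width (d_llm).

    weight has shape d_llm x d_v (one row per output dimension); bias has length d_llm.
    """
    return [
        [sum(w * x for w, x in zip(row, token)) + b for row, b in zip(weight, bias)]
        for token in visual_tokens
    ]

def tanh_gated_residual(hidden, cross_attn_out, gate):
    """Return hidden + tanh(gate) * cross_attn_out, element-wise.

    With gate = 0.0 the visual pathway contributes nothing, so the frozen
    language model's behaviour is unchanged at the start of training.
    """
    g = math.tanh(gate)
    return [
        [h + g * c for h, c in zip(h_row, c_row)]
        for h_row, c_row in zip(hidden, cross_attn_out)
    ]

if __name__ == "__main__":
    # One visual token of width 2 projected to an assumed LLM width of 3.
    visual_tokens = [[1.0, 2.0]]
    weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    bias = [0.0, 0.0, 0.5]
    print(linear_project(visual_tokens, weight, bias))

    # Placeholder hidden states and cross-attention output for one text token.
    hidden = [[1.0, 2.0]]
    cross_attn_out = [[5.0, 5.0]]
    print(tanh_gated_residual(hidden, cross_attn_out, gate=0.0))
    print(tanh_gated_residual(hidden, cross_attn_out, gate=1.0))
```

The design trade-off is visible even at this scale: projection turns image features into extra input tokens and leaves the LLM untouched, while gating injects visual information inside the network and relies on the gate to phase it in gradually.
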