© 2026 CheatGrid™. All rights reserved.

Vision-Language Models (VLMs) Cheat Sheet

Updated 2026-03-17

Vision-Language Models (VLMs) are multimodal AI systems that combine visual perception with natural language understanding, so a single model can reason jointly over images, video, and text. They power applications ranging from visual question answering to zero-shot image classification. Many VLMs learn a shared embedding space in which semantically similar images and text descriptions cluster together, a capability that grew out of the contrastive learning approach popularized by CLIP and refined by later architectures. A key point: the quality of vision-language alignment depends not only on model architecture but also on how well the model bridges the modality gap between continuous visual features and discrete linguistic tokens through its fusion mechanism.
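To make the shared embedding space concrete, here is a minimal sketch of zero-shot classification by cosine similarity. It assumes the image and each candidate caption have already been encoded into the same space; the 3-dimensional vectors, the `zero_shot_classify` helper, and the temperature value are hypothetical illustrations, not any model's actual output or API.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_classify(image_embedding, text_embeddings, temperature=0.07):
    """Score each candidate caption against one image embedding.

    text_embeddings maps caption -> embedding. The temperature is an
    illustrative assumption; it only sharpens the distribution.
    """
    labels = list(text_embeddings)
    logits = [
        cosine_similarity(image_embedding, text_embeddings[label]) / temperature
        for label in labels
    ]
    return dict(zip(labels, softmax(logits)))


# Hypothetical toy embeddings; real encoders emit hundreds of dimensions.
image_embedding = [0.9, 0.1, 0.2]
text_embeddings = {
    "a photo of a dog": [0.8, 0.2, 0.1],
    "a photo of a cat": [0.1, 0.9, 0.3],
    "a photo of a car": [0.2, 0.1, 0.9],
}

probs = zero_shot_classify(image_embedding, text_embeddings)
best_label = max(probs, key=probs.get)
print(best_label)
```

No classifier is trained here: the class names are written as captions, and the caption closest to the image in the shared space is taken as the prediction.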

What This Cheat Sheet Covers

This cheat sheet spans 14 tables and 91 indexed concepts. The outline below lists every table, from foundational concepts through implementation details.

  • Table 1: Foundation VLM Architectures
  • Table 2: Vision Encoders
  • Table 3: Text Encoders
  • Table 4: Multimodal Fusion Strategies
  • Table 5: Pre-training Objectives
  • Table 6: Core VLM Tasks
  • Table 7: Alignment & Training Strategies
  • Table 8: Attention Mechanisms
  • Table 9: Key Architectural Components
  • Table 10: Loss Functions & Optimization
  • Table 11: Evaluation Metrics
  • Table 12: Benchmark Datasets
  • Table 13: Advanced Techniques
  • Table 14: Common Implementation Details

Table 1: Foundation VLM Architectures

| Model | Example | Description |
| --- | --- | --- |
| CLIP | image_encoder = ViT-L/14; text_encoder = Transformer | Dual-encoder architecture trained on 400M image-text pairs with an InfoNCE-style contrastive loss; enables zero-shot transfer to downstream tasks without fine-tuning. |
| BLIP | model = MED(unimodal + multimodal); data = CapFilt bootstrapping | Unified understanding-and-generation framework built on a Multimodal mixture of Encoder-Decoder (MED); trained jointly with image-text contrastive, image-text matching, and language modeling (captioning) losses; CapFilt generates synthetic captions and filters out noisy ones. |
| BLIP-2 | Q-Former + frozen vision encoder + frozen LLM | Connects a frozen image encoder to a frozen LLM through a lightweight Querying Transformer (Q-Former) with learnable queries; reported to outperform Flamingo-80B on zero-shot VQAv2 with 54× fewer trainable parameters. |
| Flamingo | Perceiver Resampler + gated cross-attention; few_shot_examples = 4 | Few-shot learner that uses a Perceiver Resampler to compress visual features into a fixed number of visual tokens; feeds them into a frozen LLM through gated cross-attention layers inserted between its Transformer blocks. |
| LLaVA | vision_encoder = CLIP ViT; llm = Vicuna-13B; projection = linear | Large Language and Vision Assistant, instruction-tuned on GPT-4-generated instruction data; a simple linear projection layer maps visual features into the LLM token embedding space for multimodal conversations. |
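The InfoNCE contrastive loss named in the CLIP row can be sketched as a symmetric cross-entropy over a batch similarity matrix. This is a simplified illustration, not CLIP's training code: the `symmetric_info_nce` helper and the toy similarity matrices are hypothetical, and the temperature is fixed here whereas CLIP learns it during training.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax over a list of floats."""
    peak = max(row)
    log_total = peak + math.log(sum(math.exp(v - peak) for v in row))
    return [v - log_total for v in row]


def symmetric_info_nce(similarity, temperature=0.07):
    """Symmetric InfoNCE loss for a batch of N image-text pairs.

    similarity[i][j] is the cosine similarity between image i and text j;
    matched pairs sit on the diagonal. The fixed temperature is an
    illustrative assumption.
    """
    n = len(similarity)
    logits = [[s / temperature for s in row] for row in similarity]

    # Image-to-text: each image (row) should select its own caption.
    image_to_text = -sum(log_softmax(logits[i])[i] for i in range(n)) / n

    # Text-to-image: each caption (column) should select its own image.
    columns = [list(col) for col in zip(*logits)]
    text_to_image = -sum(log_softmax(columns[j])[j] for j in range(n)) / n

    return (image_to_text + text_to_image) / 2


# Hypothetical similarity matrices for a batch of three pairs.
aligned = [
    [1.0, 0.1, 0.0],
    [0.1, 1.0, 0.2],
    [0.0, 0.2, 1.0],
]
# Same values, but the first two images are closest to each other's captions.
misaligned = [
    [0.1, 1.0, 0.0],
    [1.0, 0.1, 0.2],
    [0.0, 0.2, 1.0],
]

loss_aligned = symmetric_info_nce(aligned)
loss_misaligned = symmetric_info_nce(misaligned)
loss_uniform = symmetric_info_nce([[0.5] * 3 for _ in range(3)])
print(loss_aligned < loss_misaligned)
```

When every pair looks equally similar, the loss equals log(N), the cost of guessing uniformly among the N candidates; training pushes it below that by raising diagonal similarities and lowering off-diagonal ones.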
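The gated cross-attention in the Flamingo row can be illustrated with a tanh-gated residual connection. The sketch below is a deliberate simplification under stated assumptions: it uses single-head attention with no learned projections and omits the gated feed-forward sublayer, and the function names and toy vectors are hypothetical. What it shows is the gating behaviour itself: with the gate parameter at zero, the visual branch contributes nothing, so the frozen language model's hidden states pass through unchanged.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(text_states, visual_tokens):
    """Each text state attends over the visual tokens.

    Simplification: queries, keys, and values are the raw vectors,
    with no learned projection matrices.
    """
    dim = len(visual_tokens[0])
    outputs = []
    for query in text_states:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in visual_tokens
        ]
        weights = softmax(scores)
        attended = [
            sum(w * token[d] for w, token in zip(weights, visual_tokens))
            for d in range(dim)
        ]
        outputs.append(attended)
    return outputs


def gated_cross_attention(text_states, visual_tokens, alpha):
    """Residual update scaled by tanh(alpha), a learnable scalar gate."""
    gate = math.tanh(alpha)
    attended = cross_attention(text_states, visual_tokens)
    return [
        [x + gate * a for x, a in zip(state, att)]
        for state, att in zip(text_states, attended)
    ]


# Hypothetical 2-d hidden states and visual tokens.
text_states = [[0.5, -0.2], [0.1, 0.3]]
visual_tokens = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

at_init = gated_cross_attention(text_states, visual_tokens, alpha=0.0)
after_training = gated_cross_attention(text_states, visual_tokens, alpha=1.0)
print(at_init == text_states)
```

Starting the gate at zero means the combined model initially behaves exactly like the frozen LLM, and visual information is blended in only as the gate moves away from zero during training.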
