Vision-Language Models (VLMs) Cheat Sheet

Updated 2026-05-25

Vision-Language Models (VLMs) are multimodal AI systems that combine visual perception with natural language understanding, so a single model can reason over images, video, and text together. They power applications ranging from visual question answering to zero-shot image classification. At their core, many VLMs learn a shared embedding space in which semantically similar images and text descriptions cluster together, an approach popularized by CLIP's contrastive training and refined by later architectures. The quality of vision-language alignment depends on more than model architecture: it also depends on how well the fusion mechanism bridges the modality gap between continuous visual features and discrete linguistic tokens, and on how well the training data covers the visual scenes and instructions the model meets at inference time.
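To make the shared-embedding idea concrete, here is a minimal sketch of zero-shot classification by cosine similarity. The 3-dimensional vectors, the prompt strings, and the `temperature` value are hypothetical stand-ins chosen for illustration; a real model would produce the embeddings with its image and text encoders.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_classify(image_emb, text_embs, temperature=0.01):
    """Return a probability per text prompt via softmax over scaled similarities.

    `temperature` is an assumed value for this toy example.
    """
    logits = [cosine(image_emb, t) / temperature for t in text_embs.values()]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return {label: e / total for label, e in zip(text_embs, exps)}


# Hypothetical embeddings: one per candidate class prompt.
text_embs = {
    "a photo of a cat": [0.9, 0.1, 0.0],
    "a photo of a dog": [0.1, 0.9, 0.0],
    "a photo of a car": [0.0, 0.1, 0.9],
}
image_emb = [0.8, 0.2, 0.1]  # hypothetical image embedding

probs = zero_shot_classify(image_emb, text_embs)
best = max(probs, key=probs.get)
print(best)
```

No classifier head is trained here: the class set is defined entirely by the text prompts, which is what makes the approach zero-shot.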

What This Cheat Sheet Covers

This topic spans 16 focused tables and 130 indexed concepts. Below is a complete table-by-table outline of this topic, spanning foundational concepts through advanced details.

  • Table 1: Foundation VLM Architectures
  • Table 2: Vision Encoders
  • Table 3: Text Encoders
  • Table 4: Multimodal Fusion Strategies
  • Table 5: Pre-training Objectives
  • Table 6: Core VLM Tasks
  • Table 7: Alignment & Training Strategies
  • Table 8: Attention Mechanisms
  • Table 9: Key Architectural Components
  • Table 10: Loss Functions & Optimization
  • Table 11: Evaluation Metrics
  • Table 12: Benchmark Datasets
  • Table 13: Advanced Techniques
  • Table 14: Common Implementation Details
  • Table 15: Video Understanding in VLMs
  • Table 16: Hallucination & Safety in VLMs

Table 1: Foundation VLM Architectures

The VLM landscape has evolved through three architectural eras: early contrastive models (CLIP), efficient connector-based models (BLIP-2, LLaVA), and the current generation treating vision as a native LLM modality with dynamic resolution and strong reasoning. Understanding each generation's design tradeoffs — frozen encoders vs. joint training, contrastive vs. generative objectives, lightweight connectors vs. heavy cross-attention — is essential before working with any modern VLM.
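One way to see the connector tradeoff is to count how many visual tokens each design hands to the LLM. The sketch below assumes a projection-style connector passes every ViT patch token through (excluding any class token), while a query-based connector emits a fixed number of tokens; the helper name `patch_tokens` is hypothetical, and the sizes are the ones listed for the models in Table 1.

```python
def patch_tokens(image_size, patch_size):
    """Number of patch tokens a ViT produces for a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    side = image_size // patch_size
    return side * side


# Projection-style connector: one LLM input token per patch.
tokens_224 = patch_tokens(224, 14)  # ViT-L/14 at 224px -> 256
tokens_336 = patch_tokens(336, 14)  # ViT-L/14 at 336px -> 576

# Query-based connector: fixed output size regardless of resolution.
qformer_tokens = 32  # learnable queries, as listed for BLIP-2

print(tokens_224, tokens_336, qformer_tokens)
print(tokens_336 / qformer_tokens)  # 18.0
```

Under these assumptions the projection approach keeps all spatial detail but spends 18× more of the LLM's context on a single 336px image than a 32-query connector does.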

| Model | Example | Description |
| --- | --- | --- |
| CLIP | image_encoder = ViT-L/14; text_encoder = Transformer | Dual-encoder architecture trained on 400M image-text pairs with an InfoNCE contrastive loss; enables zero-shot transfer to downstream tasks without fine-tuning. |
| LLaVA | vision_encoder = CLIP ViT; llm = Vicuna-13B; projection = linear | Large Language and Vision Assistant trained on GPT-4-generated instruction data; a single linear projection layer maps visual features into the LLM token space for multimodal conversation. |
| LLaVA-1.5 | vision_encoder = CLIP-ViT-L-336px; connector = MLP_2-layer; llm = Vicuna-13B | Replaces the linear projection with a 2-layer MLP connector and adds academic-task VQA data; reports state-of-the-art results on 11 benchmarks using only 1.2M public training samples, trained in about 1 day on 8 A100s. |
| BLIP-2 | Q-Former + frozen vision encoder + frozen LLM | Connects a frozen image encoder and a frozen LLM through a lightweight Querying Transformer (Q-Former) with 32 learnable queries; outperforms Flamingo-80B on zero-shot VQAv2 with 54× fewer trainable parameters. |
| Flamingo | Perceiver Resampler + gated cross-attention; few_shot_examples = 4 | Few-shot learning specialist: a Perceiver Resampler compresses visual features into a fixed number of visual tokens, and gated cross-attention layers inserted between the frozen LLM's Transformer blocks let the text attend to them. |
| InstructBLIP | BLIP-2 + instruction_tuning; datasets = 26_transformed | Instruction-tuned version of BLIP-2 built on 26 vision-language datasets converted to instruction format; follows natural-language instructions across diverse tasks without task-specific heads. |
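The InfoNCE contrastive loss listed for CLIP can be sketched in plain Python as a symmetric cross-entropy over an image-text similarity matrix. This is a toy illustration, not CLIP's actual training code: the function names, the 2-dimensional embeddings, and the fixed `temperature` value are assumptions made for the example (in practice the temperature is typically a learned parameter and the computation runs on batched tensors).

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def log_softmax(row):
    """Numerically stable log-softmax of a list of logits."""
    peak = max(row)
    log_total = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - log_total for x in row]


def symmetric_info_nce(image_embs, text_embs, temperature=0.07):
    """Average of image-to-text and text-to-image cross-entropy.

    Pair k (image k, text k) is the positive; every other pairing in the
    batch is treated as a negative. `temperature` is an assumed constant.
    """
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    n = len(imgs)
    logits = [
        [sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
        for i in imgs
    ]
    # Rows: each image scores all texts. Columns: each text scores all images.
    loss_i2t = -sum(log_softmax(logits[k])[k] for k in range(n)) / n
    columns = [list(col) for col in zip(*logits)]
    loss_t2i = -sum(log_softmax(columns[k])[k] for k in range(n)) / n
    return (loss_i2t + loss_t2i) / 2


# Hypothetical batch of two pairs.
images = [[1.0, 0.0], [0.0, 1.0]]
matched_texts = [[0.9, 0.1], [0.1, 0.9]]     # text k describes image k
mismatched_texts = [[0.1, 0.9], [0.9, 0.1]]  # captions swapped

loss_matched = symmetric_info_nce(images, matched_texts)
loss_mismatched = symmetric_info_nce(images, mismatched_texts)
print(loss_matched < loss_mismatched)
```

The loss is small when each image is most similar to its own caption and large when the captions are swapped, which is the signal that pulls matching pairs together in the shared embedding space.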
