Multimodal AI Cheat Sheet

Updated 2026-04-05

Multimodal AI represents the convergence of vision, language, audio, and other sensory modalities within unified machine learning systems, enabling models to process and generate diverse data types—text, images, audio, and video—simultaneously. The field has evolved from contrastive dual-encoder models like CLIP to natively multimodal architectures where all modalities share a unified token vocabulary and are processed by a single transformer end-to-end, and to omni-models capable of real-time audio-visual conversation. The key architectural insight—alignment in a shared embedding space via contrastive learning, cross-attention, or joint pre-training—now extends to any-to-any generation through early-fusion token-based and unified diffusion-autoregressive frameworks. By 2025, multimodal AI drives applications from document intelligence and chart understanding to GUI automation, robotics, and scientific reasoning.
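To make the "unified token vocabulary" idea concrete, here is a minimal sketch of how an early-fusion model might place text ids and discrete image codes in one id space. All sizes and names (`TEXT_VOCAB_SIZE`, `IMAGE_CODEBOOK_SIZE`, `BOI`, `EOI`, `build_sequence`) are hypothetical and chosen for illustration; real systems use their own tokenizers and layouts.

```python
TEXT_VOCAB_SIZE = 1000      # hypothetical text vocabulary size
IMAGE_CODEBOOK_SIZE = 256   # hypothetical codebook size of a discrete image tokenizer
BOI = TEXT_VOCAB_SIZE + IMAGE_CODEBOOK_SIZE  # begin-of-image marker id
EOI = BOI + 1                                # end-of-image marker id
UNIFIED_VOCAB_SIZE = EOI + 1


def image_codes_to_tokens(codes):
    """Shift image codebook indices past the text ids so the ranges never collide."""
    for code in codes:
        if not 0 <= code < IMAGE_CODEBOOK_SIZE:
            raise ValueError(f"image code {code} is outside the codebook")
    return [TEXT_VOCAB_SIZE + code for code in codes]


def build_sequence(segments):
    """Flatten interleaved ("text", ids) / ("image", codes) segments into one token list."""
    tokens = []
    for kind, values in segments:
        if kind == "text":
            tokens.extend(values)
        elif kind == "image":
            tokens.append(BOI)
            tokens.extend(image_codes_to_tokens(values))
            tokens.append(EOI)
        else:
            raise ValueError(f"unknown segment kind: {kind}")
    return tokens


def modality_of(token):
    """Recover which modality a unified token id belongs to."""
    if token < TEXT_VOCAB_SIZE:
        return "text"
    if token < BOI:
        return "image"
    return "marker"


# Two text tokens, a three-code image, then one more text token.
sequence = build_sequence([("text", [12, 7]), ("image", [0, 255, 3]), ("text", [42])])
modalities = [modality_of(t) for t in sequence]
print(sequence)
print(modalities)
```

Because every id falls below `UNIFIED_VOCAB_SIZE`, a single transformer with one embedding table and one output head could, under these assumptions, read and predict both modalities in the same sequence.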

What This Cheat Sheet Covers

This topic spans 15 focused tables and 149 indexed concepts. Below is a complete table-by-table outline of this topic, spanning foundational concepts through advanced details.

Table 1: Core Multimodal Architectures
Table 2: Vision-Language Pre-training Methods
Table 3: Multimodal Fusion Strategies
Table 4: Vision Encoding Techniques
Table 5: Text Encoding for Multimodal Models
Table 6: Key Multimodal Models and Systems
Table 7: Vision-Language Tasks
Table 8: Contrastive Learning Components
Table 9: Advanced Techniques and Optimizations
Table 10: Datasets for Multimodal Learning
Table 11: Evaluation Metrics and Benchmarks
Table 12: Training Techniques
Table 13: Common Challenges and Solutions
Table 14: Specialized Multimodal Capabilities
Table 15: Audio-Visual Models and Capabilities

Table 1: Core Multimodal Architectures

Architecture	Example	Description
Vision-Language Pre-training (VLP)	`CLIP: train(image_enc, text_enc)` `BLIP: train(ViT, BERT, captioning)`	• Joint training of vision and language encoders on large-scale image-text pairs to learn aligned representations • foundational for zero-shot transfer.
Large Vision-Language Model (LVLM)	`LLaVA: CLIP-ViT → projector → Vicuna` `MiniGPT-4: ViT + Q-Former → linear projection → Vicuna`	• Integrates a frozen or fine-tuned LLM with visual input via projection layers or adapters • supports complex reasoning, instruction-following, and chain-of-thought prompting.
Dual-Encoder with Contrastive Learning	`CLIP: maximize(cos_sim(img, text))`	• Learns separate encoders for images and text, aligning their embeddings via contrastive loss • enables zero-shot classification and cross-modal retrieval.
Encoder-Decoder Transformer	`T5Gemma: encoder(image) → decoder(text)` `Pix2Struct: visual encoder → T5 decoder`	• Uses a vision encoder to extract features and a language decoder to generate text • effective for image captioning, VQA, and generative tasks.
Cross-Attention Fusion	`Flamingo: text_tokens attend to img_features (gated cross-attention)` `ViLBERT: co-attentional layers`	• Injects visual information into language models via cross-attention layers in which text tokens attend to image features • keeps the vision encoder and language model as separate, reusable modules.
Q-Former Bridging Architecture	`BLIP-2: frozen_img_enc ← Q-Former → frozen_LLM`	• Uses a lightweight Querying Transformer to bridge frozen vision and language models • reduces trainable parameters while maintaining strong performance.
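
The dual-encoder row above can be illustrated with a small sketch of a symmetric contrastive loss over a batch of paired embeddings. This is a pure-Python illustration under stated assumptions, not the implementation used by any particular model: the function names, the toy 2-D embeddings, and the fixed `temperature=0.07` are all chosen for the example.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy(logits, target):
    """Cross-entropy of one row of logits against the index of the true match."""
    peak = max(logits)  # subtract the max for numerical stability
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum_exp - logits[target]


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss; image i is assumed to match text i."""
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    n = len(images)
    # logits[i][j] = cosine similarity of image i and text j, divided by temperature
    logits = [[sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
              for img in images]
    # Image-to-text direction: each row should peak on its diagonal entry.
    image_to_text = sum(cross_entropy(logits[i], i) for i in range(n)) / n
    # Text-to-image direction: same idea over the columns.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    text_to_image = sum(cross_entropy(columns[j], j) for j in range(n)) / n
    return (image_to_text + text_to_image) / 2


# Toy 2-D embeddings standing in for encoder outputs (made-up values).
image_embs = [[1.0, 0.1], [0.1, 1.0], [-1.0, 0.2]]
aligned_text = [[0.9, 0.0], [0.0, 0.8], [-0.7, 0.1]]
shuffled_text = [aligned_text[1], aligned_text[2], aligned_text[0]]

aligned_loss = contrastive_loss(image_embs, aligned_text)
shuffled_loss = contrastive_loss(image_embs, shuffled_text)
print(f"aligned: {aligned_loss:.4f}  shuffled: {shuffled_loss:.4f}")
```

The correctly paired batch yields a much lower loss than the shuffled one, which is the signal that pulls matching image and text embeddings together during training. In practice the loss is computed on tensors with automatic differentiation, and the temperature is often a learned parameter.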
