Multimodal AI represents the convergence of vision, language, audio, and other sensory modalities within unified machine learning systems, enabling models to process and generate diverse data types—text, images, audio, and video—simultaneously. The field has evolved from contrastive dual-encoder models like CLIP to natively multimodal architectures where all modalities share a unified token vocabulary and are processed by a single transformer end-to-end, and to omni-models capable of real-time audio-visual conversation. The key architectural insight—alignment in a shared embedding space via contrastive learning, cross-attention, or joint pre-training—now extends to any-to-any generation through early-fusion token-based and unified diffusion-autoregressive frameworks. By 2025, multimodal AI drives applications from document intelligence and chart understanding to GUI automation, robotics, and scientific reasoning.
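
The contrastive route to a shared embedding space can be illustrated with a small, dependency-free sketch. It assumes a batch of already-computed image and text embeddings in which row i of each list forms a matched pair; the function names and the temperature of 0.07 are illustrative choices rather than details given above, and real systems compute this over large batches with learned encoders.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum - row[i]
    return total / len(logits)

def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch of matched image-text pairs."""
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    # logits[i][j] = cosine similarity of image i and text j, divided by temperature
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
              for i in imgs]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image direction
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))

# Toy embeddings (assumed values): matched pairs point in similar directions.
images = [[1.0, 0.0], [0.0, 1.0]]
aligned = contrastive_loss(images, [[1.0, 0.1], [0.1, 1.0]])
shuffled = contrastive_loss(images, [[0.1, 1.0], [1.0, 0.1]])
print(aligned < shuffled)
```

The loss is low when each image is most similar to its own caption and high when the pairing is scrambled, which is the pressure that pulls matched pairs together in the shared space.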
What This Cheat Sheet Covers
This topic spans 15 focused tables and 149 indexed concepts. Below is a complete table-by-table outline, from foundational concepts through advanced details.
Table 1: Core Multimodal Architectures
| Architecture | Example | Description |
|---|---|---|
| Joint vision-language pre-training | `CLIP: train(image_enc, text_enc)`; `BLIP: train(ViT, BERT, captioning)` | Jointly trains vision and language encoders on large-scale image-text pairs to learn aligned representations; foundational for zero-shot transfer. |
| Multimodal LLM with visual projection | `LLaVA: CLIP-ViT → projector → Vicuna`; `GPT-4o: vision encoder → GPT-4o` | Integrates a frozen or fine-tuned LLM with visual input via projection layers or adapters; supports complex reasoning, instruction-following, and chain-of-thought. |
| Contrastive dual encoder | `CLIP: maximize(cos_sim(img, text))` | Learns separate encoders for images and text and aligns their embeddings via a contrastive loss; enables zero-shot classification and cross-modal retrieval. |
| Vision encoder with language decoder | `T5Gemma: encoder(image) → decoder(text)`; `Pix2Struct: visual encoder → T5 decoder` | Uses a vision encoder to extract features and a language decoder to generate text; effective for image captioning, VQA, and generative tasks. |
| Cross-attention fusion | `Flamingo: img_features ⊗ text_tokens`; `ViLBERT: co-attentional layers` | Injects visual information into language models via cross-attention layers that attend from text to image features; preserves a modular design. |
| Q-Former bridge | `BLIP-2: frozen_img_enc ← Q-Former → frozen_LLM` | Uses a lightweight Querying Transformer to bridge frozen vision and language models; reduces trainable parameters while maintaining strong performance. |
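
The cross-attention mechanism described in the table, where text positions attend to image features, can be sketched roughly as follows. This is a minimal single-head version under simplifying assumptions: the learned query, key, and value projections, gating, and multi-head structure used by real models are omitted, and the names `cross_attention`, `text_tokens`, and `image_feats` are hypothetical. Replacing the text tokens with a small set of learnable query vectors gives the flavor of the querying idea in the last row, though an actual Querying Transformer contains considerably more than this.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention: each query attends over all keys."""
    d = len(keys[0])
    fused = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d)
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Output is the attention-weighted average of the value vectors
        fused.append([sum(w * v[j] for w, v in zip(weights, values))
                      for j in range(len(values[0]))])
    return fused

# Toy inputs (assumed values): two image patch features, three text positions.
image_feats = [[1.0, 0.0], [0.0, 1.0]]
text_tokens = [[10.0, 0.0], [0.0, 10.0], [0.0, 0.0]]

# Text supplies the queries; image features supply both keys and values.
fused = cross_attention(text_tokens, image_feats, image_feats)
print(len(fused), len(fused[0]))
```

Each text position receives a blend of image features weighted by similarity, so the language side can read visual information without the vision encoder itself being modified.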