Llama Models (Meta) Cheat Sheet

Updated 2026-05-21

Meta's Llama (Large Language Model Meta AI) is a family of open-weight large language models that has evolved from a research-only release in 2023 into one of the most widely deployed model families in the world. Llama models range from compact 1B-parameter edge models to massive mixture-of-experts architectures exceeding 400B total parameters, so deployment targets run from a single smartphone to multi-GPU server clusters. What makes the family distinctive is its open weights under a commercial-friendly community license, which lets developers fine-tune, self-host, and build products without vendor lock-in. Understanding the family requires tracking several parallel axes at once: model generation (3.1, 3.2, 3.3, 4), size tier, modality (text-only vs. vision), and variant type (base vs. instruct). Each combination has distinct capabilities, prompt formats, and deployment requirements.

What This Cheat Sheet Covers

This cheat sheet spans 15 focused tables and 98 indexed concepts. The table-by-table outline below runs from foundational concepts through advanced details.

Table 1: Model Generations and Release Timeline
Table 2: Model Sizes and Parameter Tiers
Table 3: Base vs. Instruct Variants
Table 4: Context Windows
Table 5: Architecture and Technical Design
Table 6: Llama 3.x Prompt Format and Special Tokens
Table 7: Llama 4 Prompt Format and Special Tokens
Table 8: Vision and Multimodal Capabilities
Table 9: Code Llama Family
Table 10: Llama Guard (Safety Classification)
Table 11: Purple Llama Safety Suite
Table 12: Hosting and Deployment Options
Table 13: Quantization Formats
Table 14: Fine-Tuning Methods
Table 15: License and Commercial Terms

Table 1: Model Generations and Release Timeline

Each Llama generation introduced a major architectural or capability leap. Knowing which generation a model belongs to immediately signals its context window, multimodal support, and license terms.

Model	Example	Description
Llama 3 (April 2024)	`meta-llama/Meta-Llama-3-8B-Instruct`	• 8B and 70B dense decoder-only transformers • 128K-token vocabulary (up from 32K in Llama 2) • 8K context • trained on ~15T tokens • GQA across all sizes • strong reasoning and code
Llama 3.1 (July 2024)	`meta-llama/Llama-3.1-405B-Instruct`	• Adds a 405B model • expands context to 128K tokens • multilingual support (8 languages) • native tool/function calling • 405B intended as teacher model for distillation • same dense architecture as Llama 3
Llama 3.2 (September 2024)	`meta-llama/Llama-3.2-11B-Vision-Instruct`	• Adds 1B and 3B lightweight edge models plus 11B and 90B vision models • first multimodal Llama • 128K context • vision models use cross-attention adapter architecture
Llama 3.3 (December 2024)	`meta-llama/Llama-3.3-70B-Instruct`	• Single-size 70B release approaching Llama 3.1 405B performance at 70B compute cost • 128K context • 8-language support • text-only • released Dec 6, 2024
Llama 4 Scout (April 2025)	`meta-llama/Llama-4-Scout-17B-16E-Instruct`	• First MoE Llama • 17B active / 109B total params • 16 experts • 10M-token context window (iRoPE architecture) • natively multimodal • fits on a single H100 with INT4 quantization
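
The single-H100 claim for Llama 4 Scout can be sanity-checked with weight-only arithmetic. The sketch below is a rough estimate, not a deployment guide: it assumes the 80 GB H100 variant, counts only the bytes needed to store the weights, and ignores KV cache, activations, and quantization overhead such as scales and zero points. The helper name `weight_memory_gb` is hypothetical. A mixture-of-experts model must keep every expert resident in memory, so the estimate uses the 109B total parameter count rather than the 17B active count.

```python
def weight_memory_gb(total_params: float, bits_per_weight: int) -> float:
    """Approximate memory for the weights alone, in decimal gigabytes."""
    return total_params * bits_per_weight / 8 / 1e9


SCOUT_TOTAL_PARAMS = 109e9  # total (not active) parameters, from the table
H100_MEMORY_GB = 80         # assumption: the 80 GB H100 variant

for label, bits in [("BF16", 16), ("INT8", 8), ("INT4", 4)]:
    gb = weight_memory_gb(SCOUT_TOTAL_PARAMS, bits)
    verdict = "fits" if gb < H100_MEMORY_GB else "does not fit"
    print(f"{label}: ~{gb:.1f} GB of weights, {verdict} in {H100_MEMORY_GB} GB")
```

Under these assumptions only the INT4 row fits, leaving roughly 25 GB of headroom for the KV cache and activations. How much of the 10M-token context is usable in that headroom depends on the serving stack and is not estimated here.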
