DeepSeek and Qwen Models Cheat Sheet

Updated 2026-05-21

DeepSeek (by DeepSeek-AI) and Qwen (by Alibaba Cloud) are the two most prominent open-weight large language model families from Chinese AI labs, and together they define much of the frontier of non-proprietary AI in 2025–2026. The flagship models in both families use Mixture-of-Experts (MoE) architectures that activate only a fraction of total parameters per token (for example, 37B of 671B in DeepSeek-V3), which lowers per-token inference compute while keeping frontier-level performance. The key insight for practitioners: these models are not monolithic. Each family spans general-purpose, reasoning-specialized, code-specialized, and multimodal variants, each with its own prompt formatting, context window, and licensing terms, so choosing correctly requires understanding the full taxonomy.

What This Cheat Sheet Covers

This cheat sheet contains 14 focused tables covering 103 indexed concepts. The outline below lists every table, from foundational concepts through advanced details.

Table 1: DeepSeek Model Family Overview
Table 2: Qwen Model Family Overview
Table 3: DeepSeek Core Architecture — MoE and Attention
Table 4: Qwen Core Architecture — MoE and Reasoning
Table 5: DeepSeek-R1 Distilled Variants
Table 6: Qwen2.5 and Qwen3 Dense Model Sizes
Table 7: Prompt Formatting — ChatML and Special Tokens
Table 8: DeepSeek and Qwen API Access
Table 9: Self-Hosting with vLLM
Table 10: DeepSeek-Coder-V2 and Qwen Code Models
Table 11: Qwen Vision-Language (VL) Models
Table 12: Licensing and Open-Weight Status
Table 13: Recommended Inference Parameters
Table 14: Key Benchmarks and Performance Reference

Table 1: DeepSeek Model Family Overview

The DeepSeek family spans several distinct model lines, each optimized for a different task profile. Choosing which line to reach for, and why, is the first decision a practitioner makes before deployment.

Model	Example	Description
DeepSeek-V3	`model="deepseek-chat"` (API)	• 671B total / 37B active MoE general-purpose model • 128K context • pre-trained on 14.8T tokens • MIT license
DeepSeek-R1	`model="deepseek-reasoner"` (API)	• 671B total / 37B active reasoning model • chain-of-thought via `<think>` blocks • 128K context • MIT license
DeepSeek-V3.1	toggle via chat template	• Hybrid model combining V3 direct answers and R1 chain-of-thought in one 671B checkpoint • 128K context
DeepSeek-R1-0528	`model="deepseek-ai/DeepSeek-R1-0528"`	• Updated R1 checkpoint (May 2025) • AIME 2025 score 87.5% vs 70.0% original • uses ~23K tokens per reasoning trace
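
The table notes that DeepSeek-R1 emits its chain-of-thought inside `<think>` blocks. When working with raw completions (for example, from a self-hosted checkpoint), it is often useful to separate that trace from the final answer. The following is a minimal sketch, assuming the raw text contains at most one `<think>...</think>` block; the helper name `split_reasoning` is hypothetical and not part of any DeepSeek library. Hosted APIs may return the trace in a separate field instead, in which case no parsing is needed.

```python
import re

# Assumption: the reasoning trace is wrapped in a single <think>...</think> block.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(raw: str) -> tuple[str, str]:
    """Split a raw completion into (reasoning, answer).

    Returns an empty reasoning string when no <think> block is present.
    """
    match = THINK_RE.search(raw)
    if match is None:
        return "", raw.strip()
    reasoning = match.group(1).strip()
    # Everything outside the <think> block is treated as the final answer.
    answer = (raw[:match.start()] + raw[match.end():]).strip()
    return reasoning, answer


raw_output = "<think>2 + 2 is 4.</think>\nThe answer is 4."
reasoning, answer = split_reasoning(raw_output)
print("reasoning:", reasoning)
print("answer:", answer)
```

Keeping the trace separate makes it easy to log or discard it; since the table cites roughly 23K tokens per trace for DeepSeek-R1-0528, dropping the trace from stored conversation history can save considerable context.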
