Small Language Models (SLMs) are compact AI models, typically with 1B-13B parameters, designed for efficient deployment on edge devices and in resource-constrained environments. Unlike larger models that usually require cloud infrastructure, SLMs enable on-device inference with lower latency and stronger privacy, making them well suited to mobile, IoT, and offline applications. The critical insight: SLMs trade broad general knowledge for domain-specific expertise and efficiency, reaching roughly 70-90% of LLM performance on their target tasks while using a fraction of the resources, through techniques such as quantization, distillation, and pruning.
What This Cheat Sheet Covers
This topic spans 12 focused tables and 89 indexed concepts. The table-by-table outline below runs from foundational concepts through advanced details.
Table 1: Core SLM Characteristics
Understanding what defines a small language model and how size correlates with deployment constraints helps determine when SLMs are the right choice over large models.
| Characteristic | Example | Description |
|---|---|---|
| Parameter count | 1B-13B parameters | Most SLMs have 1 billion to 13 billion parameters, though some definitions extend down to about 100 million; models above 13B are generally classified as LLMs • smaller parameter counts enable faster inference and a lower memory footprint |
| Memory footprint | FP16: ~2 GB (1B) to ~26 GB (13B); INT4: ~0.5 GB (1B) to ~6.5 GB (13B) | Size depends on numeric precision: FP16 needs ~2 bytes per parameter, INT4 ~0.5 bytes • critical for determining whether a model fits in device memory |
| Training data | 500B-9T tokens | Phi-4 (14B, just above the usual 13B cutoff) was trained on about 9 trillion tokens • smaller models compensate for size through high-quality, curated datasets and longer training |
| Inference latency | <100 ms per token on edge devices | SLMs can reach sub-100 ms per-token latency on mobile CPUs/GPUs • 2-5x faster than streaming from cloud LLMs because network overhead is eliminated |
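
The memory figures in the table follow from multiplying parameter count by bytes per parameter. The snippet below is a minimal sketch of that arithmetic, not a sizing tool: the function names, the FP32 and INT8 entries, and the 20% headroom default are illustrative assumptions, and the estimate covers weights only (no activations, KV cache, or runtime overhead).

```python
# Approximate bytes needed to store one parameter at each precision.
# FP16 and INT4 match the table; FP32 and INT8 are added for comparison.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_footprint_gb(num_params: float, precision: str) -> float:
    """Estimate weight storage in GB (1 GB = 1e9 bytes), weights only."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


def fits_in_memory(num_params: float, precision: str,
                   device_gb: float, headroom: float = 0.2) -> bool:
    """Hypothetical check: keep a fraction of device memory free
    (assumed 20% by default) for activations and runtime overhead."""
    usable_gb = device_gb * (1.0 - headroom)
    return weight_footprint_gb(num_params, precision) <= usable_gb


if __name__ == "__main__":
    # Reproduce the FP16 and INT4 endpoints quoted in the table.
    for params in (1e9, 13e9):
        for precision in ("fp16", "int4"):
            size = weight_footprint_gb(params, precision)
            print(f"{params / 1e9:.0f}B @ {precision}: {size:.1f} GB")
```

Under these assumptions, a 13B model at INT4 needs about 6.5 GB for weights, so it would not pass the check on an 8 GB device with the default headroom, while a 1B model at FP16 (about 2 GB) would fit comfortably on a 4 GB device.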