Small Language Models (SLMs) Cheat Sheet

Updated 2026-05-18

Small Language Models (SLMs) are compact AI models, typically with 1B-13B parameters, designed for efficient deployment on edge devices and in resource-constrained environments. Unlike larger models that usually require cloud infrastructure, SLMs enable on-device inference with lower latency and enhanced privacy, which makes them well suited to mobile, IoT, and offline applications. The critical insight: SLMs trade broad general knowledge for domain-specific expertise and efficiency, and are often reported to reach 70-90% of LLM performance on targeted tasks while using a fraction of the resources through techniques like quantization, distillation, and pruning.

What This Cheat Sheet Covers

This topic spans 12 focused tables and 89 indexed concepts. The table-by-table outline below runs from foundational concepts through advanced details.

Table 1: Core SLM Characteristics
Table 2: Major SLM Model Families
Table 3: Model Compression Techniques
Table 4: Quantization Methods
Table 5: Knowledge Distillation Approaches
Table 6: Parameter-Efficient Fine-Tuning (PEFT)
Table 7: On-Device Deployment Frameworks
Table 8: Hardware and Memory Requirements
Table 9: Inference Optimization Techniques
Table 10: Evaluation and Benchmarking
Table 11: Domain Specialization Strategies
Table 12: SLM vs LLM Decision Framework

Table 1: Core SLM Characteristics

Understanding what defines a small language model and how size correlates with deployment constraints helps determine when SLMs are the right choice over large models.

Characteristic	Example	Description
Parameter Count	`1B-13B parameters`	• Typically 1 billion to 13 billion parameters, though some definitions extend down to about 100 million; models above 13B are generally classified as LLMs • smaller parameter counts enable faster inference and a lower memory footprint
Model Size	`FP16: ~2GB (1B) to ~26GB (13B)` `INT4: ~0.5GB (1B) to ~6.5GB (13B)`	• Weight size in GB depends on precision: FP16 requires ~2 bytes per parameter, INT4 ~0.5 bytes • critical for determining whether a model fits in device memory
Training Data Volume	`500B-9T tokens`	• Phi-4 (14B, just above the 13B cutoff used here) was trained on over 9 trillion tokens • smaller models compensate for size through high-quality, curated datasets and longer training
Inference Latency	`<100ms per token on edge devices`	• SLMs can reach sub-100ms per-token latency on mobile CPUs/GPUs, depending on model size and hardware • often cited as 2-5x faster than streaming from cloud LLMs because on-device inference eliminates network overhead
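
The Model Size figures in Table 1 follow from multiplying parameter count by bytes per parameter. The minimal Python sketch below reproduces that arithmetic for model weights only. The names `weight_size_gb` and `fits_in_memory`, the INT8 entry, and the 20% headroom reserved for KV cache and runtime overhead are illustrative assumptions, not values taken from the table.

```python
# Approximate bytes per parameter. FP16 and INT4 come from Table 1;
# the INT8 entry is an added assumption for illustration.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_size_gb(num_params: float, precision: str) -> float:
    """Estimate weight storage in GB (1 GB = 1e9 bytes). Weights only."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


def fits_in_memory(num_params: float, precision: str,
                   device_memory_gb: float, headroom: float = 0.2) -> bool:
    """Hypothetical check: keep `headroom` (a fraction of device memory)
    free for KV cache, activations, and the runtime itself."""
    usable_gb = device_memory_gb * (1.0 - headroom)
    return weight_size_gb(num_params, precision) <= usable_gb


if __name__ == "__main__":
    for params in (1e9, 13e9):
        for precision in ("fp16", "int4"):
            size = weight_size_gb(params, precision)
            print(f"{params / 1e9:.0f}B @ {precision}: ~{size:.1f} GB")
```

Under these assumptions, a 13B model at INT4 (~6.5 GB of weights) would not pass the check on an 8 GB device, because only 6.4 GB is treated as usable. Real footprints vary with the quantization format, context length, and runtime.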
