Model pruning is a neural network compression technique that removes weights, neurons, channels, or entire structures from a trained network to reduce computational cost and memory footprint while preserving accuracy. Originally inspired by biological synaptic pruning, modern pruning methods trade sparsity (the percentage of parameters removed) against accuracy loss, enabling deployment on resource-constrained devices and reducing inference latency. Unlike quantization, which stores parameters at lower precision, or knowledge distillation, which trains a smaller model to mimic a larger one, pruning directly eliminates redundant parameters. The lottery ticket hypothesis suggests that dense networks contain sparse subnetworks ("winning tickets") that, when trained in isolation from their original initialization, can match or exceed the original network's accuracy, which offers one explanation for why over-parameterized networks train successfully.
What This Cheat Sheet Covers
This topic spans 13 focused tables and 85 indexed concepts. The table-by-table outline below runs from foundational concepts through advanced details.
Table 1: Core Pruning Types
Pruning methods differ mainly in what they remove and how regular the resulting sparsity is. Unstructured approaches drop individual weights for maximum compression but need special kernels to run fast, while structured ones—cutting whole channels, filters, neurons, attention heads, or even entire layers—keep the math dense and so deliver real speedups on ordinary hardware. The granularity you pick is the central trade-off between compression ratio and practical inference gains.
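
The granularity trade-off described above can be made concrete with a minimal sketch. The tensor shape, the 50% pruning ratio, and the L1-norm ranking are illustrative assumptions, not values from this cheat sheet. The point is only that masking individual weights leaves the tensor shape unchanged, while dropping whole output channels produces a smaller dense tensor.

```python
import torch

torch.manual_seed(0)

# Hypothetical conv weight: 8 output channels, 4 input channels, 3x3 kernels.
weight = torch.randn(8, 4, 3, 3)

# Unstructured: zero out the 50% of weights with the smallest magnitude.
# The tensor keeps its shape, so a dense kernel still does the same work.
k = weight.numel() // 2
threshold = weight.abs().flatten().kthvalue(k).values
mask = weight.abs() > threshold
unstructured = weight * mask.to(weight.dtype)

# Structured: rank output channels by L1 norm and keep the strongest half.
# Indexing with the kept channels yields a physically smaller dense tensor.
channel_norms = weight.abs().sum(dim=(1, 2, 3))
keep = channel_norms.topk(4).indices.sort().values
structured = weight[keep]

unstructured_sparsity = (unstructured == 0).float().mean().item()
print("unstructured:", tuple(unstructured.shape), "sparsity", unstructured_sparsity)
print("structured:  ", tuple(structured.shape))
```

In a real network, removing output channels from one layer also requires slicing the matching input channels of the next layer; that bookkeeping is omitted here.
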
| Type | Example | Description |
|---|---|---|
| Unstructured (weight) pruning | `mask = torch.abs(weight) > threshold` | Removes individual weights below a magnitude threshold; creates irregular sparse matrices with the highest compression but requires specialized sparse kernels for speedup. |
| Channel pruning | `prune_channels(conv_layer, indices)  # Remove entire output channels` | Removes entire output channels in convolutional layers, reducing both FLOPs and actual inference time on standard hardware without sparse kernel support. |
| Filter pruning | `remove_filters(conv_layer, bottom_k)  # Prune complete 3D filters` | Eliminates complete convolutional filters (kernels), reducing model width and feature map dimensions; hardware-friendly and maintains dense matrix operations. |
| Neuron pruning | `mask_neurons(fc_layer, importance < t)` | Removes entire neurons from fully connected layers based on activation patterns or output contribution; more aggressive than weight pruning but preserves layer structure. |
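
The helper names in the Example column are illustrative pseudocode. As a hedged sketch of how the same pruning types might be tried out, the block below uses PyTorch's built-in `torch.nn.utils.prune` utilities instead. The layer sizes and pruning amounts are assumptions chosen for the demonstration, and norm-based ranking stands in for whatever importance criterion a real pipeline would use.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

torch.manual_seed(0)

# Hypothetical layers, sized only for this demonstration.
conv_u = nn.Conv2d(in_channels=4, out_channels=8, kernel_size=3)
conv_s = nn.Conv2d(in_channels=4, out_channels=8, kernel_size=3)
fc = nn.Linear(16, 10)

# Unstructured: mask the 50% of individual weights with the smallest |w|.
prune.l1_unstructured(conv_u, name="weight", amount=0.5)

# Filter / output-channel pruning: mask the 50% of filters (dim=0 of the
# conv weight) with the smallest L2 norm.
prune.ln_structured(conv_s, name="weight", amount=0.5, n=2, dim=0)

# Neuron pruning: mask the 50% of output neurons (rows of the linear
# weight) with the smallest L2 norm.
prune.ln_structured(fc, name="weight", amount=0.5, n=2, dim=0)

conv_sparsity = (conv_u.weight == 0).float().mean().item()
pruned_filters = int((conv_s.weight.abs().sum(dim=(1, 2, 3)) == 0).sum())
pruned_neurons = int((fc.weight.abs().sum(dim=1) == 0).sum())
print(conv_sparsity, pruned_filters, pruned_neurons)

# Fold each mask into the weight and drop the reparametrization.
for module in (conv_u, conv_s, fc):
    prune.remove(module, "weight")
```

These utilities only zero values through a mask; tensor shapes stay the same, and biases are left untouched. To get the dense-hardware speedups attributed to structured pruning, the masked filters or neurons still have to be physically removed by rebuilding smaller layers.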