Automatic Speech Recognition (ASR) converts spoken language into text with neural models that process acoustic features, align temporal sequences, and decode linguistic content. ASR powers voice assistants, transcription services, accessibility tools, and real-time communication platforms across 100+ languages. Transformer-based architectures (Whisper, Conformer) and self-supervised pre-training (wav2vec 2.0, HuBERT) have brought modern ASR to near-human accuracy on clean speech, and in 2025–2026 new entrants such as IBM Granite Speech 3.3 8B (topping the Hugging Face Open ASR Leaderboard in 2026) and Mistral's Voxtral Realtime are pushing accuracy and native streaming further. The field is split between offline models that optimize accuracy on pre-recorded audio and streaming models that balance latency against accuracy for real-time transcription. This trade-off shapes architecture choices, from CTC-based systems to Token-and-Duration Transducers (TDT).
What This Cheat Sheet Covers
This topic spans 21 focused tables and 135 indexed concepts. Below is a complete table-by-table outline, from foundational concepts through advanced details.
Table 1: Core ASR Architectures
The six dominant end-to-end architectures differ most sharply in how they align audio frames to output tokens: CTC assumes per-frame outputs are conditionally independent given the audio, attention learns a soft alignment with no monotonic constraint, and transducers (including TDT) model alignment jointly with a language (label-history) predictor. These trade-offs determine latency, accuracy, and streaming viability, so it helps to understand them before picking a model or toolkit.
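To make the CTC frame-independence assumption concrete, here is a minimal sketch of greedy CTC decoding in plain Python. It assumes the blank symbol has index 0 and uses a made-up 3-symbol vocabulary with hand-written per-frame scores; real toolkits may use a different blank id and operate on tensors rather than lists.

```python
from itertools import groupby

BLANK = 0  # assumed blank index; toolkits differ on which id is the blank


def ctc_greedy_decode(frame_scores, blank=BLANK):
    """Pick the best symbol for each frame on its own, then collapse the path."""
    # Frame-independent argmax: no frame's decision looks at any other frame.
    best_path = [
        max(range(len(scores)), key=lambda i: scores[i]) for scores in frame_scores
    ]
    # CTC collapse rule: merge consecutive repeats first, then drop blanks.
    merged = [symbol for symbol, _ in groupby(best_path)]
    return [symbol for symbol in merged if symbol != blank]


# Toy input: 6 frames over the hypothetical vocabulary {0: blank, 1: "a", 2: "b"}.
frames = [
    [0.10, 0.80, 0.10],  # "a"
    [0.10, 0.70, 0.20],  # "a" again (merged with the previous frame)
    [0.90, 0.05, 0.05],  # blank (separates the two "a" tokens)
    [0.20, 0.70, 0.10],  # "a" (a new token because a blank intervened)
    [0.10, 0.20, 0.70],  # "b"
    [0.80, 0.10, 0.10],  # blank
]
decoded = ctc_greedy_decode(frames)
print(decoded)  # [1, 1, 2]
```

Because every frame is decided in isolation, the decoder has no notion of which token came before. That is the gap the attention decoder and the transducer prediction network fill.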
| Architecture | Example | Description |
|---|---|---|
| Attention-based encoder-decoder | encoder: mel-spectrogram → embeddings; decoder: autoregressive text generation | • Processes the full audio context through the encoder, then generates text token by token • Used in Whisper and Speech Transformer • Achieves the highest accuracy on offline transcription but adds latency because attention spans the entire sequence |
| RNN-Transducer (RNN-T) | prediction network + encoder + joint network; outputs tokens frame by frame | • Streaming-first architecture that emits tokens incrementally without waiting for the end of the utterance • Powers Google Assistant and Apple Siri • The encoder processes audio, the prediction network models language, and the joint network decides when to emit; enables sub-200 ms latency |
| Token-and-Duration Transducer (TDT) | predicts token + duration simultaneously; skips blank frames | • Extends RNN-T by jointly predicting the text token and how many encoder frames to skip, avoiding wasted blank-frame computation • Up to 64% faster inference than RNN-T at similar accuracy • Used in NVIDIA Parakeet-TDT models; outperforms conventional transducers on noisy speech |
| Conformer | convolution + multi-head self-attention; combines local and global context | • Hybrid block that merges CNN local feature extraction with Transformer global dependencies • State-of-the-art results on LibriSpeech at publication (WER ~2% on test-clean) • Balances parameter efficiency with accuracy; widely used in production systems (NVIDIA NeMo, AssemblyAI) |
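
The difference between frame-by-frame transducer decoding and TDT's duration-based frame skipping, as described in the table above, can be sketched with a toy greedy decoder. Everything here is hypothetical: `FRAME_LABELS` is an invented oracle standing in for the encoder, and the two `*_like_joint` functions are stand-ins for a trained joint network, not any library's API. The sketch only counts how many joint-network calls each scheme needs; it says nothing about real-world speedups.

```python
BLANK = 0
# Invented oracle: which token (if any) each encoder frame "contains".
FRAME_LABELS = [0, 0, 3, 0, 0, 0, 5, 0]


def rnnt_like_joint(t, emitted_here):
    """Conventional transducer: a token keeps the frame, a blank advances by one."""
    if FRAME_LABELS[t] != BLANK and emitted_here == 0:
        return FRAME_LABELS[t], 0
    return BLANK, 1


def tdt_like_joint(t, emitted_here, max_duration=4):
    """TDT-style: also predict how many frames to jump (capped at max_duration)."""
    nxt = t + 1
    while nxt < len(FRAME_LABELS) and FRAME_LABELS[nxt] == BLANK:
        nxt += 1
    return FRAME_LABELS[t], min(nxt - t, max_duration)


def greedy_decode(num_frames, joint, max_symbols_per_frame=4):
    """Greedy loop shared by both schemes; returns (tokens, joint-call count)."""
    t, hyp, joint_calls, emitted_here = 0, [], 0, 0
    while t < num_frames:
        token, duration = joint(t, emitted_here)
        joint_calls += 1
        if token != BLANK:
            hyp.append(token)
            emitted_here += 1
        # Guarantee progress: a blank, or too many tokens on one frame, must advance.
        if token == BLANK or emitted_here >= max_symbols_per_frame:
            duration = max(duration, 1)
        if duration > 0:
            t += duration
            emitted_here = 0
    return hyp, joint_calls


rnnt_result = greedy_decode(len(FRAME_LABELS), rnnt_like_joint)
tdt_result = greedy_decode(len(FRAME_LABELS), tdt_like_joint)
print(rnnt_result)  # ([3, 5], 10)
print(tdt_result)   # ([3, 5], 3)
```

Both schemes recover the same tokens, but the duration-aware joint reaches the end of the audio in fewer steps because it jumps over runs of blank frames instead of visiting each one.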