Automatic Speech Recognition (ASR) converts spoken language into text through neural models that process acoustic features, align temporal sequences, and decode linguistic content. ASR powers voice assistants, transcription services, accessibility tools, and real-time communication platforms across 100+ languages. Modern ASR achieved a breakthrough in 2022β2025 with transformer-based architectures (Whisper, Conformer) and self-supervised pre-training (wav2vec2, HuBERT) reaching near-human accuracy on clean speech, though challenges remain in noisy environments, accented speech, and low-resource languages. The field is split between offline models optimizing for accuracy on pre-recorded audio and streaming models balancing latency with real-time transcriptionβa trade-off that fundamentally shapes architecture choices from CTC-based systems to RNN-Transducers.
What This Cheat Sheet Covers
This topic spans 20 focused tables and 114 indexed concepts. Below is a complete table-by-table outline of this topic, spanning foundational concepts through advanced details.
Table 1: Core ASR Architectures
| Architecture | Example | Description |
|---|---|---|
encoder: mel-spectrogram β embeddingsdecoder: autoregressive text generation | β’ Processes full audio context through encoder then generates text token-by-token β’ used in Whisper, Speech Transformer β’ achieves highest accuracy on offline transcription but introduces latency due to attention over entire sequence. | |
prediction network + encoder + joint networkoutputs tokens frame-by-frame | β’ Streaming-first architecture that emits tokens incrementally without waiting for utterance end β’ powers Google Assistant, Apple Siri β’ encoder processes audio, prediction network models language, joint network decides when to emit β’ enables sub-200ms latency for real-time applications. | |
convolution + multi-head self-attentioncombines local and global context | β’ Hybrid block merging CNN local feature extraction with Transformer global dependencies β’ state-of-the-art on LibriSpeech (WER ~2%) β’ balances parameter efficiency with accuracy β’ widely used in production systems (NVIDIA NeMo, AssemblyAI). | |
listener: BiLSTM encoderspeller: attention-based decoder | β’ Early end-to-end model using attention to align encoder states with output characters β’ introduced sequence-to-sequence ASR (2015) β’ slower than RNN-T due to full-utterance attention β’ foundation for modern encoder-decoder designs but largely superseded by Transformers. |