Speech-to-Text (ASR) Models Cheat Sheet

Updated 2026-05-25

Automatic Speech Recognition (ASR) converts spoken language into text using neural models that extract acoustic features, align them over time, and decode linguistic content. ASR powers voice assistants, transcription services, accessibility tools, and real-time communication platforms across 100+ languages. Modern ASR entered a new phase in 2025–2026: transformer-based architectures (Whisper, Conformer) and self-supervised pre-training (wav2vec 2.0, HuBERT) reach near-human accuracy on clean speech, while new entrants such as IBM Granite Speech 3.3 8B (topping the Hugging Face Open ASR Leaderboard in 2026) and Mistral's Voxtral Realtime push accuracy and native streaming further. The field is split between offline models that optimize accuracy on pre-recorded audio and streaming models that balance accuracy against latency for real-time transcription. That trade-off shapes architecture choices, from CTC-based systems to Token-and-Duration Transducers (TDT).

What This Cheat Sheet Covers

This topic spans 21 focused tables and 135 indexed concepts. The table-by-table outline below runs from foundational concepts through advanced details.

Table 1: Core ASR Architectures
Table 2: Pre-Trained Foundation Models
Table 3: Self-Supervised Learning Techniques
Table 4: Training Objectives & Loss Functions
Table 5: Acoustic Feature Extraction
Table 6: Decoding Strategies
Table 7: Language Model Integration
Table 8: Evaluation Metrics
Table 9: Benchmark Datasets
Table 10: Streaming vs Offline Transcription
Table 11: Data Augmentation Techniques
Table 12: Noise Robustness & Enhancement
Table 13: Speaker Diarization
Table 14: Multilingual & Code-Switching ASR
Table 15: Domain Adaptation & Fine-tuning
Table 16: Production Deployment
Table 17: ASR Toolkits & Frameworks
Table 18: Advanced Techniques
Table 19: Punctuation & Post-Processing
Table 20: Emerging Trends & Research Frontiers
Table 21: Voice Activity Detection (VAD)

Table 1: Core ASR Architectures

The six dominant end-to-end architectures differ most sharply in how they align audio frames to output tokens: CTC treats per-frame outputs as conditionally independent given the audio, attention learns a soft alignment that need not be monotonic, and transducers (including TDT) model alignment jointly with language. These trade-offs determine latency, accuracy, and streaming viability, so it pays to understand them before picking a model or toolkit.
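To make the CTC alignment idea concrete, here is a minimal sketch of the standard CTC collapse rule (merge consecutive repeats, then drop blanks). The blank symbol `_`, the function name `ctc_collapse`, and the example frame paths are assumptions chosen for illustration, not taken from any particular toolkit.

```python
from itertools import groupby

BLANK = "_"  # hypothetical blank symbol used only in this sketch


def ctc_collapse(frame_labels):
    """Map a per-frame label path to text: merge consecutive repeats, then drop blanks."""
    merged = [label for label, _ in groupby(frame_labels)]
    return "".join(label for label in merged if label != BLANK)


# Two different frame-level alignments that collapse to the same transcript.
path_a = list("hh_e_ll_lo")
path_b = list("_he__l_llo_")
print(ctc_collapse(path_a))
print(ctc_collapse(path_b))
```

Because many frame paths collapse to the same text, CTC training sums over all of them. Note that a blank between the two `l` frames is what keeps the double letter from being merged.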

Architecture	Example	Description
Encoder-Decoder Transformer	`encoder: mel-spectrogram → embeddings` `decoder: autoregressive text generation`	• Processes the full audio context through the encoder, then generates text token by token • Used in Whisper and Speech Transformer • Achieves high accuracy on offline transcription but adds latency because attention spans the entire sequence.
RNN-Transducer (RNN-T)	`prediction network + encoder + joint network` `outputs tokens frame-by-frame`	• Streaming-first architecture that emits tokens incrementally without waiting for the utterance to end • Powers Google Assistant, Apple Siri • The encoder processes audio, the prediction network models language, and the joint network decides when to emit • Enables sub-200 ms latency.
Token-and-Duration Transducer (TDT)	`predict token + duration simultaneously` `skip blank frames intelligently`	• Extends RNN-T by jointly predicting the text token and how many encoder frames to skip, avoiding wasted computation on blank frames • Up to 64% faster inference than RNN-T at similar accuracy • Used in NVIDIA Parakeet-TDT models • Outperforms conventional transducers on noisy speech.
Conformer	`convolution + multi-head self-attention` `combines local and global context`	• Hybrid block that merges CNN local feature extraction with Transformer global dependencies • Reached roughly 2% WER on LibriSpeech test-clean when introduced • Balances parameter efficiency with accuracy • Widely used in production systems (NVIDIA NeMo, AssemblyAI).
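The difference between RNN-T and TDT decoding in the table can be sketched with a toy greedy loop. This is an illustration under strong assumptions, not NVIDIA's implementation: the `emissions` dictionary stands in for a trained joint network (it scripts which frame emits which token), the function names are hypothetical, and the TDT oracle always predicts a duration of at least 1, whereas real TDT models may also predict a duration of 0.

```python
def rnnt_greedy(num_frames, emissions):
    """Toy RNN-T greedy loop: a blank advances exactly one frame per joint call."""
    t, out, calls = 0, [], 0
    emitted = set()
    while t < num_frames:
        calls += 1  # one joint-network evaluation
        if t in emissions and t not in emitted:
            out.append(emissions[t])  # emit a token and stay on the same frame
            emitted.add(t)
        else:
            t += 1  # blank: move to the next frame
    return out, calls


def tdt_greedy(num_frames, emissions, max_duration=4):
    """Toy TDT greedy loop: each joint call yields a token (or blank) plus a frame skip."""
    t, out, calls = 0, [], 0
    while t < num_frames:
        calls += 1  # one joint-network evaluation
        later = [f for f in emissions if f > t]
        next_event = min(later) if later else num_frames
        # Scripted duration: jump toward the next emitting frame, capped at max_duration.
        duration = max(1, min(next_event - t, max_duration))
        if t in emissions:
            out.append(emissions[t])
        t += duration
    return out, calls


# Hypothetical 10-frame utterance with tokens scripted at frames 1 and 6.
script = {1: "hi", 6: "there"}
print(rnnt_greedy(10, script))
print(tdt_greedy(10, script))
```

Both loops produce the same tokens, but the RNN-T loop evaluates the joint once per frame plus once per token, while the TDT loop skips runs of blank frames. In a real model the savings depend on how well the duration head is trained.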