Speech-to-Text (ASR) Models Cheat Sheet

Updated 2026-03-17

Automatic Speech Recognition (ASR) converts spoken language into text using neural models that process acoustic features, align temporal sequences, and decode linguistic content. ASR powers voice assistants, transcription services, accessibility tools, and real-time communication platforms across 100+ languages. Between 2020 and 2025, transformer-based architectures (Whisper, Conformer) and self-supervised pre-training (wav2vec 2.0, HuBERT) brought ASR to near-human accuracy on clean speech, though challenges remain in noisy environments, accented speech, and low-resource languages. The field is split between offline models that optimize for accuracy on pre-recorded audio and streaming models that balance accuracy against latency for real-time transcription, a trade-off that fundamentally shapes architecture choices from CTC-based systems to RNN-Transducers.
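To make the "align temporal sequences, then decode" step concrete, here is a minimal sketch of greedy CTC decoding, the simplest decoder for the CTC-based systems mentioned above. It assumes the acoustic model has already produced one best label per audio frame. The function name, the `_` blank symbol, and the frame labels are illustrative assumptions, not taken from any particular toolkit.

```python
from itertools import groupby

BLANK = "_"  # assumed blank symbol for this sketch; real models use a reserved token id


def ctc_greedy_collapse(frame_labels, blank=BLANK):
    """Turn per-frame labels into text: merge consecutive repeats, then drop blanks."""
    # Step 1: consecutive identical labels come from the same emitted symbol.
    merged = [label for label, _ in groupby(frame_labels)]
    # Step 2: blanks only separate symbols, so they are removed from the output.
    return "".join(label for label in merged if label != blank)


# Hypothetical per-frame argmax output for a short utterance.
frames = ["_", "h", "h", "_", "e", "l", "l", "_", "l", "o", "o", "_"]
print(ctc_greedy_collapse(frames))  # prints "hello"
```

The blank between the two runs of `l` is what lets CTC emit a genuine double letter; without it the repeats would merge into a single `l`.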

What This Cheat Sheet Covers

This topic spans 20 focused tables and 114 indexed concepts. The outline below lists every table, from foundational concepts through advanced details.

  • Table 1: Core ASR Architectures
  • Table 2: Pre-Trained Foundation Models
  • Table 3: Self-Supervised Learning Techniques
  • Table 4: Training Objectives & Loss Functions
  • Table 5: Acoustic Feature Extraction
  • Table 6: Decoding Strategies
  • Table 7: Language Model Integration
  • Table 8: Evaluation Metrics
  • Table 9: Benchmark Datasets
  • Table 10: Streaming vs Offline Transcription
  • Table 11: Data Augmentation Techniques
  • Table 12: Noise Robustness & Enhancement
  • Table 13: Speaker Diarization
  • Table 14: Multilingual & Code-Switching ASR
  • Table 15: Domain Adaptation & Fine-tuning
  • Table 16: Production Deployment & Optimization
  • Table 17: ASR Toolkits & Frameworks
  • Table 18: Advanced Techniques
  • Table 19: Punctuation & Post-processing
  • Table 20: Emerging Trends & Research Directions

Table 1: Core ASR Architectures

| Architecture | Example | Description |
|---|---|---|
| Encoder-Decoder Transformer | encoder: mel-spectrogram → embeddings; decoder: autoregressive text generation | Processes the full audio context through the encoder, then generates text token by token; used in Whisper and Speech Transformer; reaches among the highest accuracy on offline transcription but adds latency because attention spans the entire sequence. |
| RNN-Transducer (RNN-T) | prediction network + encoder + joint network; outputs tokens frame by frame | Streaming-first architecture that emits tokens incrementally without waiting for the end of the utterance; powers Google Assistant and Apple Siri; the encoder processes audio, the prediction network models language, and the joint network decides when to emit; enables sub-200 ms latency for real-time applications. |
| Conformer | convolution + multi-head self-attention; combines local and global context | Hybrid block that merges CNN local feature extraction with Transformer global dependencies; state-of-the-art on LibriSpeech (WER ~2%); balances parameter efficiency with accuracy; widely used in production systems (NVIDIA NeMo, AssemblyAI). |
| Listen Attend Spell (LAS) | listener: BiLSTM encoder; speller: attention-based decoder | Early end-to-end model that uses attention to align encoder states with output characters; introduced sequence-to-sequence ASR (2015); slower than RNN-T because attention needs the full utterance; the foundation for modern encoder-decoder designs but largely superseded by Transformers. |
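The Conformer entry in Table 1 quotes a word error rate (WER) of about 2%. As a reminder of what that figure measures, here is a minimal sketch of WER: the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and a model hypothesis, divided by the number of reference words. The function name and example sentences are illustrative assumptions. Real evaluations usually also normalize case and punctuation, which this sketch skips.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words (row-by-row Levenshtein distance).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words to match an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match (cost 0) or substitution (cost 1)
            ))
        prev = curr
    return prev[-1] / len(ref)


# Made-up example: one substitution ("sat" -> "sit") and one deletion ("the").
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"{wer:.2%}")  # prints "33.33%"
```

On this scale, a WER of about 2% means roughly two word errors per hundred reference words. WER can exceed 100% when the hypothesis contains many inserted words.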
