Text-to-Speech (TTS) Synthesis Cheat Sheet

Updated 2026-05-25

Text-to-Speech (TTS) synthesis is a branch of speech processing and deep learning that converts written text into natural-sounding human speech. The field has shifted from rule-based and concatenative systems to end-to-end generative models that produce highly realistic, expressive speech, and some recent systems report human-level naturalness ratings on standard single-speaker benchmarks such as LJSpeech. Modern TTS typically involves two stages: converting text into an intermediate acoustic representation (mel-spectrograms or discrete tokens), then generating raw audio through a vocoder or codec decoder. For practitioners, the key choice is the generation paradigm: autoregressive (higher naturalness, slower), non-autoregressive (faster, parallel), or flow/diffusion-based (probabilistic, controllable). That choice largely determines the quality-speed trade-off. Codec language models (VALL-E, Orpheus, Chatterbox) add a further paradigm that treats TTS as conditional language modeling over discrete audio tokens.
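The two-stage structure can be made concrete with a toy sketch. Nothing here is a real model: `toy_acoustic_model` and `toy_vocoder` are hypothetical stand-ins that only reproduce the shapes involved, and the sample rate, hop length, and mel-bin count are assumed (though commonly seen) values. The point is the bookkeeping: stage 1 emits a sequence of mel frames, and stage 2 turns each frame into `HOP_LENGTH` audio samples.

```python
from typing import List

SAMPLE_RATE = 22050  # assumed sample rate in Hz
HOP_LENGTH = 256     # assumed number of audio samples per mel frame
N_MELS = 80          # assumed number of mel bins per frame


def toy_acoustic_model(text: str, frames_per_char: int = 5) -> List[List[float]]:
    """Stage 1 stand-in: map text to a (frames x N_MELS) 'mel-spectrogram'.

    A real acoustic model predicts the frame values; this placeholder
    only produces all-zero frames of the right shape.
    """
    n_frames = len(text) * frames_per_char
    return [[0.0] * N_MELS for _ in range(n_frames)]


def toy_vocoder(mel: List[List[float]]) -> List[float]:
    """Stage 2 stand-in: expand each mel frame into HOP_LENGTH samples (silence)."""
    return [0.0] * (len(mel) * HOP_LENGTH)


def synthesize(text: str) -> List[float]:
    """Run both stages: text -> mel frames -> waveform samples."""
    mel = toy_acoustic_model(text)
    return toy_vocoder(mel)


audio = synthesize("hello")
duration_s = len(audio) / SAMPLE_RATE
print(f"{len(audio)} samples, {duration_s:.3f} s of audio")
```

End-to-end models such as VITS fold both stages into one network, but the frame-to-sample relationship still applies internally.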

What This Cheat Sheet Covers

This topic spans 17 focused tables and 152 indexed concepts. Below is a complete table-by-table outline of this topic, spanning foundational concepts through advanced details.

  • Table 1: Foundational TTS Architectures
  • Table 2: Neural Vocoder Architectures
  • Table 3: Generative Model Paradigms
  • Table 4: Codec Language Models & Token-Based TTS
  • Table 5: Text Preprocessing & Input Representations
  • Table 6: Acoustic Feature Representations
  • Table 7: Prosody & Expressive Control
  • Table 8: Multi-Speaker & Voice Cloning
  • Table 9: Duration & Alignment Modeling
  • Table 10: Training Techniques & Loss Functions
  • Table 11: Inference & Real-Time Optimization
  • Table 12: Quality Evaluation Metrics
  • Table 13: Model Compression & Deployment
  • Table 14: Multilingual & Cross-Lingual TTS
  • Table 15: Advanced Voice Cloning & Adaptation
  • Table 16: Training Data & Dataset Requirements
  • Table 17: Responsible AI & Safety in TTS

Table 1: Foundational TTS Architectures

Modern TTS architectures span from seq2seq acoustic models to end-to-end flow-matching and LLM-based systems. Knowing which architecture underpins a model tells you what trade-offs to expect in terms of speed, quality, controllability, and voice cloning capability.
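Speed in comparisons like these is commonly reported as a real-time factor (RTF): synthesis time divided by the duration of the audio produced, so values below 1 mean faster than real time. Below is a minimal sketch of how one might measure it. The `synthesize` callable is a hypothetical placeholder for whatever model is under test, and the helper names are illustrative rather than taken from any library.

```python
import time
from typing import Callable, List


def real_time_factor(synth_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent synthesizing / duration of the audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synth_seconds / audio_seconds


def measure_rtf(
    synthesize: Callable[[str], List[float]],  # hypothetical: text -> samples
    text: str,
    sample_rate: int,
) -> float:
    """Time one synthesis call and convert the result to an RTF."""
    start = time.perf_counter()
    audio = synthesize(text)
    elapsed = time.perf_counter() - start
    return real_time_factor(elapsed, len(audio) / sample_rate)


# Example: 1.5 s of compute for 10 s of audio gives an RTF of 0.15.
rtf_example = real_time_factor(1.5, 10.0)
```

A single timed call is noisy. In practice one would warm the model up first and average over several utterances of varied length, on the same hardware for every model being compared.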

| Model | Pipeline | Description |
| --- | --- | --- |
| VITS | text → conditional VAE + flows → raw waveform | End-to-end model combining a conditional VAE, normalizing flows, and adversarial training to generate waveforms without a separate vocoder. Generates in parallel, supports real-time synthesis, and is widely used as a production backbone. |
| Tacotron 2 | text → encoder → attention → decoder → mel-spectrogram | Foundational autoregressive seq2seq model that generates mel-spectrograms frame by frame using location-sensitive attention. High quality but slow inference. |
| FastSpeech 2 | text → pitch/energy/duration predictors → mel-spectrogram | Non-autoregressive model that predicts pitch, energy, and duration directly from text. Simpler training pipeline than FastSpeech (no teacher-student distillation) and faster to train. |
| StyleTTS 2 | text → style diffusion → SLM discriminator → waveform | Reported to surpass human recordings on LJSpeech in listener evaluations. Models speaking style as a latent random variable sampled via diffusion and uses a pre-trained speech language model (WavLM) as a discriminator for end-to-end training. |
| F5-TTS | text (padded with fillers) → DiT flow matching → mel-spectrogram | Fully non-autoregressive flow-matching model built on a Diffusion Transformer (DiT). Needs no duration model or phoneme aligner. Reported inference RTF of 0.15; trained on about 100K hours of multilingual data, with Sway Sampling applied at inference. |
| FastSpeech | text → FFT blocks → length regulator → mel-spectrogram | Early non-autoregressive TTS model using explicit duration prediction and parallel generation. Reported ~270× faster mel-spectrogram generation than an autoregressive Transformer TTS baseline. Superseded by FastSpeech 2. |
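The length regulator in the FastSpeech pipeline is what makes parallel generation possible: each phoneme's hidden vector is repeated according to its predicted duration, so the expanded sequence already has one entry per mel frame. Here is a minimal pure-Python sketch of that idea, using made-up hidden states and durations; real implementations operate on batched tensors and get the durations from a trained predictor.

```python
from typing import List, Sequence


def length_regulate(
    hidden: Sequence[Sequence[float]],
    durations: Sequence[int],
) -> List[List[float]]:
    """Repeat each phoneme-level vector `durations[i]` times.

    The output has sum(durations) rows, one per mel frame.
    """
    if len(hidden) != len(durations):
        raise ValueError("need exactly one duration per hidden vector")
    frames: List[List[float]] = []
    for vec, d in zip(hidden, durations):
        if d < 0:
            raise ValueError("durations must be non-negative")
        # Copy the vector so repeated frames do not share one list object.
        frames.extend(list(vec) for _ in range(d))
    return frames


# Illustrative values only: three phonemes with 2-dimensional hidden states.
phoneme_states = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
durations = [2, 1, 3]  # predicted mel frames per phoneme

frames = length_regulate(phoneme_states, durations)
print(len(frames))  # total number of frames equals sum(durations)
```

Scaling the durations before expansion is the usual handle for speaking-rate control: multiplying them by a factor above 1 slows the speech down, while a factor below 1 speeds it up.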
