Text-to-Speech (TTS) Synthesis Cheat Sheet

Updated 2026-05-25

Text-to-Speech (TTS) synthesis is a branch of speech processing and deep learning that converts written text into natural-sounding speech. The field has shifted from rule-based and concatenative systems to end-to-end generative models that produce realistic, expressive speech, and the strongest systems are now reported to match or exceed human recordings on single-speaker benchmarks such as LJSpeech. Modern TTS typically involves two stages: converting text into an intermediate acoustic representation (mel-spectrograms or discrete tokens), then generating raw audio with a vocoder or codec decoder. For practitioners, the key choice is the generation paradigm, which shapes the quality-speed trade-off: autoregressive models (higher naturalness, slower), non-autoregressive models (faster, parallel), and flow- or diffusion-based models (probabilistic, controllable). Codec language models (VALL-E, Orpheus, Chatterbox) add a newer paradigm that treats TTS as conditional language modeling over discrete audio tokens.
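
The speed side of this trade-off is commonly reported as a real-time factor (RTF): synthesis time divided by the duration of the audio produced, so values below 1.0 mean faster than real time. The sketch below is only an illustration of that arithmetic; the function names, the 24 kHz sample rate, and the timing values are assumptions, not measurements of any model.

```python
def audio_duration_seconds(num_samples, sample_rate):
    """Duration of a waveform given its sample count and sample rate."""
    if sample_rate <= 0:
        raise ValueError("sample_rate must be positive")
    return num_samples / sample_rate


def real_time_factor(synthesis_seconds, audio_seconds):
    """RTF = time spent synthesizing / duration of the generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


# Hypothetical run: 240,000 samples at 24 kHz is 10 s of audio,
# and we assume it took 1.5 s of wall-clock time to generate.
duration = audio_duration_seconds(240_000, 24_000)
rtf = real_time_factor(1.5, duration)
print(f"audio: {duration:.1f} s, RTF: {rtf:.2f}")
```

When comparing published RTF figures, check the hardware and batch size they were measured on, since both change the number substantially.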

What This Cheat Sheet Covers

This topic spans 17 focused tables and 152 indexed concepts. Below is a complete table-by-table outline of this topic, spanning foundational concepts through advanced details.

Table 1: Foundational TTS Architectures
Table 2: Neural Vocoder Architectures
Table 3: Generative Model Paradigms
Table 4: Codec Language Models & Token-Based TTS
Table 5: Text Preprocessing & Input Representations
Table 6: Acoustic Feature Representations
Table 7: Prosody & Expressive Control
Table 8: Multi-Speaker & Voice Cloning
Table 9: Duration & Alignment Modeling
Table 10: Training Techniques & Loss Functions
Table 11: Inference & Real-Time Optimization
Table 12: Quality Evaluation Metrics
Table 13: Model Compression & Deployment
Table 14: Multilingual & Cross-Lingual TTS
Table 15: Advanced Voice Cloning & Adaptation
Table 16: Training Data & Dataset Requirements
Table 17: Responsible AI & Safety in TTS

Table 1: Foundational TTS Architectures

Modern TTS architectures span from seq2seq acoustic models to end-to-end flow-matching and LLM-based systems. Knowing which architecture underpins a model tells you what trade-offs to expect in terms of speed, quality, controllability, and voice cloning capability.

Model	Example	Description
VITS	`text → conditional VAE + flows → raw waveform`	• End-to-end model combining VAE, normalizing flows, and adversarial training to generate waveforms without a separate vocoder • achieves real-time synthesis and is widely used as a production backbone
Tacotron 2	`text → encoder → attention → decoder → mel-spectrogram`	• Foundational autoregressive seq2seq model generating mel-spectrograms frame-by-frame using location-sensitive attention • high quality but slow inference
FastSpeech 2	`text → pitch/energy/duration predictors → mel-spectrogram`	• Non-autoregressive model that predicts duration, pitch, and energy with a variance adaptor and trains directly on ground-truth targets • drops FastSpeech's teacher-student distillation, giving a simpler pipeline and faster training
StyleTTS 2	`text → style diffusion → decoder → waveform` (SLM discriminator used in training only)	• Reported to surpass human recordings on LJSpeech in listening tests • models speaking style as a latent random variable sampled via diffusion and uses a pre-trained speech language model (WavLM) as a discriminator for end-to-end training
F5-TTS	`text (padded with fillers) → DiT flow matching → mel-spectrogram`	• Fully non-autoregressive flow-matching model using a Diffusion Transformer • no duration model or phoneme aligner needed • reported RTF 0.15 • trained on about 100K hours of multilingual data • Sway Sampling is applied at inference time
FastSpeech	`text → FFT blocks → length regulator → mel-spectrogram`	• First non-autoregressive TTS using explicit duration prediction and parallel generation • reported ~270× faster mel-spectrogram generation than an autoregressive Transformer TTS baseline • superseded by FastSpeech 2
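
The length regulator in the FastSpeech row is what makes parallel generation possible: each phoneme's hidden state is repeated for as many frames as its predicted duration, so the decoder sees a frame-length sequence all at once. A minimal NumPy sketch of that expansion follows; `length_regulate` and the toy values are illustrative assumptions, not code from any FastSpeech implementation.

```python
import numpy as np


def length_regulate(hidden, durations):
    """Expand phoneme-level hidden states to frame level.

    hidden:    array of shape (num_phonemes, hidden_dim)
    durations: integer frame count per phoneme, shape (num_phonemes,)
    """
    hidden = np.asarray(hidden)
    durations = np.asarray(durations, dtype=np.int64)
    if hidden.shape[0] != durations.shape[0]:
        raise ValueError("one duration per phoneme is required")
    if (durations < 0).any():
        raise ValueError("durations must be non-negative")
    # Repeat row i of `hidden` durations[i] times along the time axis.
    return np.repeat(hidden, durations, axis=0)


# Three phonemes with 2-dimensional hidden states (toy values).
phoneme_states = np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
# The second phoneme gets zero frames, so it is dropped from the output.
frame_states = length_regulate(phoneme_states, [2, 0, 3])
print(frame_states.shape)  # (5, 2)
```

Scaling the durations before expansion (for example multiplying by a speed factor and rounding) is one simple way such a design exposes speaking-rate control.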
