Text-to-Speech (TTS) synthesis is a branch of speech processing and deep learning that converts written text into natural-sounding human speech. The field has shifted from rule-based and concatenative systems to end-to-end generative models capable of producing highly realistic, expressive, emotionally rich speech, with the best systems now reported to reach human parity on standard benchmarks. Modern TTS typically involves two stages: converting text into intermediate acoustic representations (mel-spectrograms or discrete tokens), then generating raw audio through a vocoder or codec decoder. A critical insight for practitioners: the choice between autoregressive (higher naturalness, slower), non-autoregressive (faster, parallel), and flow/diffusion-based (probabilistic, controllable) paradigms fundamentally shapes the quality-speed trade-off, while the emergence of codec language models (VALL-E, Orpheus, Chatterbox) has opened a new paradigm that treats TTS as conditional language modeling over discrete audio tokens.
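The speed side of the quality-speed trade-off is commonly reported as a real-time factor (RTF): synthesis time divided by the duration of the audio produced, so values below 1.0 mean faster than real time. The sketch below is a minimal illustration; the function names `audio_duration_seconds` and `real_time_factor` and the sample numbers are assumptions made for this example, not part of any TTS library.

```python
def audio_duration_seconds(num_samples: int, sample_rate: int) -> float:
    """Duration of a waveform given its sample count and sample rate (Hz)."""
    if sample_rate <= 0:
        raise ValueError("sample_rate must be positive")
    return num_samples / sample_rate


def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent synthesizing / duration of the synthesized audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


# Hypothetical measurement: 240,000 samples at 24 kHz (10 s of audio)
# that took 1.5 s of wall-clock time to generate.
audio_seconds = audio_duration_seconds(num_samples=240_000, sample_rate=24_000)
rtf = real_time_factor(synthesis_seconds=1.5, audio_seconds=audio_seconds)
print(f"RTF = {rtf:.2f}")
```

In practice the synthesis time would come from timing a real model call (for example with `time.perf_counter()`), and RTF figures are only comparable when measured on the same hardware and batch size.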
What This Cheat Sheet Covers
This cheat sheet spans 17 focused tables and 152 indexed concepts. The table-by-table outline below runs from foundational concepts through advanced details.
Table 1: Foundational TTS Architectures
Modern TTS architectures range from seq2seq acoustic models to end-to-end flow-matching and LLM-based systems. Knowing which architecture underpins a model tells you what trade-offs to expect in speed, quality, controllability, and voice-cloning capability.
| Model | Example | Description |
|---|---|---|
| VITS | text → conditional VAE + flows → raw waveform | • End-to-end model combining a VAE, normalizing flows, and adversarial training to generate waveforms without a separate vocoder • achieves real-time synthesis and is widely used as a production backbone |
| Tacotron 2 | text → encoder → attention → decoder → mel-spectrogram | • Foundational autoregressive seq2seq model generating mel-spectrograms frame by frame using location-sensitive attention • high quality but slow inference |
| FastSpeech 2 | text → pitch/energy/duration predictors → mel-spectrogram | • Non-autoregressive model predicting pitch, energy, and duration directly from text • simpler pipeline and roughly 3× faster training than FastSpeech |
| StyleTTS 2 | text → style diffusion → SLM discriminator → waveform | • First TTS to surpass human recordings on LJSpeech • models speaking style as a latent random variable via diffusion and uses WavLM as a discriminator for end-to-end training |
| F5-TTS | text (padded with fillers) → DiT flow matching → mel-spectrogram | • Fully non-autoregressive flow-matching model using a Diffusion Transformer • no duration model or phoneme aligner needed • RTF 0.15, trained on 100K multilingual hours with Sway Sampling inference |
| FastSpeech | text → FFT blocks → length regulator → mel-spectrogram | • First non-autoregressive TTS using explicit duration prediction and parallel generation • ~270× faster mel-spectrogram generation than an autoregressive Transformer TTS baseline • superseded by FastSpeech 2 |
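The length regulator in the last row is what makes parallel generation possible: each phoneme-level hidden vector is repeated for as many mel frames as its predicted duration, so the decoder sees a frame-length sequence in one pass. The following is a minimal pure-Python sketch of that idea, assuming integer durations are already available; the function name `length_regulate` and the toy vectors are hypothetical, and real implementations operate on batched tensors.

```python
from typing import List, Sequence


def length_regulate(
    phoneme_states: Sequence[Sequence[float]],
    durations: Sequence[int],
) -> List[List[float]]:
    """Expand phoneme-level vectors to frame level by repeating each one.

    phoneme_states[i] is repeated durations[i] times, so the output length
    equals sum(durations), the number of mel-spectrogram frames to decode.
    """
    if len(phoneme_states) != len(durations):
        raise ValueError("need exactly one duration per phoneme state")

    frames: List[List[float]] = []
    for state, n_frames in zip(phoneme_states, durations):
        if n_frames < 0:
            raise ValueError("durations must be non-negative")
        # Copy the vector so frames do not alias the same list object.
        frames.extend(list(state) for _ in range(n_frames))
    return frames


# Toy example: three phonemes with 2-dimensional hidden states.
states = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
durations = [2, 0, 3]  # a zero duration drops that phoneme entirely
frames = length_regulate(states, durations)
print(len(frames))
```

Because the expansion depends only on the predicted durations, every output frame can be decoded in parallel, in contrast to the frame-by-frame loop of an autoregressive decoder.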