Text-to-Speech (TTS) synthesis is a branch of speech processing that converts written text into natural-sounding speech, today mostly with deep learning. Neural TTS has moved from rule-based and concatenative systems to end-to-end generative models that produce realistic, expressive speech. A typical pipeline has two stages: an acoustic model converts text into an intermediate acoustic representation (such as a mel-spectrogram), and a vocoder then generates the raw audio waveform from it. Architectures such as Tacotron, FastSpeech, and VITS have driven gains in both quality and inference speed. A critical insight: the choice between autoregressive models (sequential, often more natural but slower) and non-autoregressive models (parallel, faster) shapes the trade-off between naturalness and speed, while the vocoder largely determines final audio fidelity and computational cost.
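The autoregressive versus non-autoregressive distinction can be illustrated with a toy sketch. Nothing here is a real TTS model: `generate_autoregressive`, `generate_parallel`, and the lambda "models" are hypothetical stand-ins chosen only to show the data dependency. In the autoregressive case each frame needs the previous frame, so the loop is inherently sequential. In the non-autoregressive case each frame depends only on its position, so all frames could be computed at once.

```python
from typing import Callable, List

Frame = List[float]  # one toy "mel-spectrogram frame"


def generate_autoregressive(
    step: Callable[[Frame], Frame], first: Frame, n_frames: int
) -> List[Frame]:
    """Produce frames one at a time; frame t is computed from frame t-1."""
    if n_frames <= 0:
        return []
    frames = [first]
    while len(frames) < n_frames:
        # Sequential dependency: this call cannot start before the
        # previous frame exists.
        frames.append(step(frames[-1]))
    return frames


def generate_parallel(frame_fn: Callable[[int], Frame], n_frames: int) -> List[Frame]:
    """Produce every frame independently from its index.

    Written as a comprehension for clarity; because no frame reads another
    frame, a real model can evaluate all positions in one batched pass.
    """
    return [frame_fn(t) for t in range(n_frames)]


# Toy stand-ins (assumptions for illustration, not real acoustic models).
ar_frames = generate_autoregressive(lambda prev: [prev[0] + 1.0], [0.0], 4)
nar_frames = generate_parallel(lambda t: [float(t)], 4)
print(ar_frames)
print(nar_frames)
```

Both calls yield the same four toy frames. The difference is that the first needs four dependent steps, while the second has no dependency between frames.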
What This Cheat Sheet Covers
This topic spans 15 focused tables and 124 indexed concepts. Below is a complete table-by-table outline of this topic, spanning foundational concepts through advanced details.
Table 1: Foundational TTS Architectures
| Model | Example | Description |
|---|---|---|
| Tacotron 2 | text → encoder → attention → decoder → mel-spectrogram | • Autoregressive seq2seq model that generates mel-spectrograms frame by frame • uses location-sensitive attention for text-to-audio alignment • high quality but slow inference. |
| FastSpeech | text → FFT blocks → length regulator → mel-spectrogram | • Non-autoregressive model that generates mel-spectrograms in parallel • uses explicit duration prediction • reported about 270x faster mel-spectrogram generation than an autoregressive Transformer TTS baseline. |
| FastSpeech 2 | text → pitch/energy/duration predictors → mel-spectrogram | • Improves on FastSpeech by training directly on ground-truth targets instead of a teacher model's output • adds predictors for pitch, energy, and duration as conditioning inputs • simplifies the training pipeline and improves voice quality. |
| VITS | text → conditional VAE → raw waveform (end-to-end) | • End-to-end model that combines a conditional variational autoencoder with normalizing flows and adversarial training • generates waveforms directly without a separate vocoder • supports real-time synthesis. |
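
The length regulator in the non-autoregressive row above can be sketched in a few lines. This is a minimal pure-Python illustration, not the published implementation: `length_regulate`, `phoneme_states`, and the duration values are hypothetical names and toy data. The idea is that each input position (for example, one phoneme) carries a predicted duration in frames, and its hidden vector is repeated that many times so the sequence length matches the mel-spectrogram length.

```python
from typing import List, Sequence


def length_regulate(
    hidden: Sequence[Sequence[float]], durations: Sequence[int]
) -> List[List[float]]:
    """Expand position-level vectors to frame level.

    hidden[i] is repeated durations[i] times; a duration of 0 drops
    that position from the output.
    """
    if len(hidden) != len(durations):
        raise ValueError("need exactly one duration per input position")
    frames: List[List[float]] = []
    for vec, dur in zip(hidden, durations):
        if dur < 0:
            raise ValueError("durations must be non-negative")
        # Copy the vector so each frame is an independent list.
        frames.extend(list(vec) for _ in range(dur))
    return frames


# Toy data (assumed for illustration): 3 positions, 2-dimensional vectors.
phoneme_states = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
durations = [2, 0, 3]

frames = length_regulate(phoneme_states, durations)
print(len(frames))  # total frames = sum(durations)
```

Because the output length is fixed by the durations before any frame is generated, the decoder can process all frames in parallel. Scaling the durations up or down is also a simple way to slow or speed the resulting speech.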