Blog posts

2024

FastSpeech 2 and 2s: Fast, High-Quality, and Fully End-to-End TTS

14 minute read

Published:

FastSpeech 2 simplifies the TTS training pipeline by eliminating the teacher-student distillation process and adding pitch, energy, and duration as explicit conditioning features. FastSpeech 2s takes this a step further by generating the waveform directly, in a fully end-to-end manner.

Transformer TTS

7 minute read

Published:

In this post, I’ll highlight Transformer TTS, which brought the Transformer architecture to neural text-to-speech (speech synthesis). The model directly addresses two major limitations of RNN-based systems like Tacotron 2: poor parallelism and weak long-range dependency modeling. Instead of relying on recurrence, it uses self-attention throughout. This allows the model to train much faster without sacrificing output quality, and in some cases even improves it.