
Authors

Paper

Abstract

Several fully end-to-end text-to-speech (TTS) models have been proposed that outperform cascade models (i.e., models whose acoustic and vocoder components are trained separately). However, they often generate unstable pitch contours with audible artifacts when the dataset contains emotional attributes, i.e., a large diversity of pronunciation and prosody. To address this problem, we propose Period VITS, a novel end-to-end TTS model that incorporates an explicit periodicity generator. In the proposed method, we introduce a frame pitch predictor that predicts prosodic features, such as pitch and voicing flags, from the input text. From these features, the proposed periodicity generator produces a sample-level sinusoidal source that enables the waveform decoder to accurately reproduce the pitch. Finally, the entire model is jointly optimized in an end-to-end manner with variational inference and adversarial objectives. As a result, the decoder becomes capable of generating more stable, expressive, and natural output waveforms. The experimental results showed that the proposed model significantly outperforms baseline models in terms of naturalness, with improved pitch stability in the generated samples.
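As a rough illustration of the sample-level sinusoidal source mentioned in the abstract, the sketch below upsamples frame-level pitch and voicing flags to the sample rate and integrates the instantaneous frequency into a phase. It is not the authors' implementation: the function name `sine_source`, the nearest-neighbour upsampling, the hop size, sample rate, and amplitude are all assumptions chosen for illustration.

```python
import numpy as np


def sine_source(f0_frames, vuv_frames, hop_size=256, sample_rate=24000, amplitude=0.1):
    """Hypothetical helper: frame-level F0 (Hz) and V/UV flags -> sample-level sine.

    hop_size, sample_rate, and amplitude are illustrative defaults, not values
    taken from the paper.
    """
    # Repeat each frame value hop_size times (nearest-neighbour upsampling).
    f0 = np.repeat(np.asarray(f0_frames, dtype=np.float64), hop_size)
    vuv = np.repeat(np.asarray(vuv_frames, dtype=np.float64), hop_size)
    # Phase is the running sum of the per-sample frequency in cycles.
    phase = 2.0 * np.pi * np.cumsum(f0 / sample_rate)
    # Silence the sine in unvoiced regions (an assumption of this sketch).
    return amplitude * np.sin(phase) * vuv


# Three frames: voiced at 120 Hz, unvoiced, voiced at 180 Hz.
source = sine_source([120.0, 0.0, 180.0], [1, 0, 1])
print(source.shape)
```

Accumulating the phase, rather than evaluating each frame's sine independently, keeps the waveform continuous when the pitch changes between frames.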

Demo

Systems

The table below summarizes the systems compared. E2E: end-to-end model. FPN: frame prior network in VISinger [1]. FS2: FastSpeech2 [2]. P-VITS: Period VITS (i.e., our proposed model). *: Not the same but a similar architecture.

| Model | Type | Duration input | FPN | Pitch input | Periodicity generator input |
|---|---|---|---|---|---|
| VITS [3] | E2E | No | No | None | - |
| FPN-VITS | E2E | Yes | Yes | None | - |
| CAT-P-VITS | E2E | Yes | Yes | Frame-level to decoder | - |
| Sine-P-VITS | E2E | Yes | Yes | Sample-level to decoder | Sine-wave |
| P-VITS | E2E | Yes | Yes | Sample-level to decoder | Sine-wave + V/UV + noise |
| FS2+P-HiFi-GAN [4] | Cascade | Yes | * | Sample-level to decoder | Sine-wave + V/UV + noise |
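One way to read the "Periodicity generator input" column is as a channel layout: Sine-P-VITS feeds a single sine channel, while P-VITS and FS2+P-HiFi-GAN feed the sine together with a V/UV signal and noise. The sketch below stacks these signals accordingly. The function name `stack_generator_input`, the channel order, and the use of unit-variance Gaussian noise are assumptions made for illustration and are not stated on this page.

```python
import numpy as np


def stack_generator_input(sine, vuv, variant="P-VITS", seed=0):
    """Hypothetical helper: build a (channels, samples) periodicity generator input.

    `sine` and `vuv` are sample-level arrays of equal length. The channel
    order and the Gaussian noise are assumptions of this sketch.
    """
    sine = np.asarray(sine, dtype=np.float64)
    vuv = np.asarray(vuv, dtype=np.float64)
    if sine.shape != vuv.shape:
        raise ValueError("sine and vuv must have the same sample-level shape")
    if variant == "Sine-P-VITS":
        # Sine-wave only: one channel.
        return sine[np.newaxis, :]
    if variant in ("P-VITS", "FS2+P-HiFi-GAN"):
        # Sine-wave + V/UV + noise: three channels.
        noise = np.random.default_rng(seed).standard_normal(sine.shape)
        return np.stack([sine, vuv, noise], axis=0)
    raise ValueError(f"variant {variant!r} has no periodicity generator input")


# Dummy sample-level signals, for shape checking only.
sine = np.sin(np.linspace(0.0, 2.0 * np.pi, 16))
vuv = np.ones(16)
print(stack_generator_input(sine, vuv, "Sine-P-VITS").shape)
print(stack_generator_input(sine, vuv, "P-VITS").shape)
```

Variants marked "-" in the table (VITS, FPN-VITS, CAT-P-VITS) have no such input, so the sketch raises an error for them instead of returning an empty array.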

Audio samples (Japanese)

Neutral style

Model Sample 1 (Female) Sample 2 (Male)
Reference
VITS
FPN-VITS
CAT-P-VITS
Sine-P-VITS
P-VITS
FS2+P-HiFi-GAN

Happiness style

Model Sample 1 (Female) Sample 2 (Male)
Reference
VITS
FPN-VITS
CAT-P-VITS
Sine-P-VITS
P-VITS
FS2+P-HiFi-GAN

Sadness style

Model Sample 1 (Female) Sample 2 (Male)
Reference
VITS
FPN-VITS
CAT-P-VITS
Sine-P-VITS
P-VITS
FS2+P-HiFi-GAN

Acknowledgements

This work was supported by Clova Voice, NAVER Corp., Seongnam, Korea.

References