TalkNet 2: Non-Autoregressive Depth-Wise Separable Convolutional Model for Speech Synthesis with Explicit Pitch and Duration Prediction

Stanislav Beliaev; Boris Ginsburg

arXiv:2104.08189 · eess.AS · June 21, 2021 · 5 cites

TL;DR

TalkNet 2 introduces a non-autoregressive, efficient convolutional model for speech synthesis that explicitly predicts pitch and duration, achieving high-quality speech with fewer parameters and faster inference.

Contribution

The paper presents TalkNet 2, a novel non-autoregressive convolutional model with explicit pitch and duration prediction for improved speech synthesis efficiency and quality.

Findings

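01. Achieves a mean opinion score (MOS) of 4.08 on LJSpeech, nearly matching the best auto-regressive models.

02. Uses only 13.2M parameters, almost 2x fewer than the state-of-the-art text-to-speech models it is compared against.

03. Enables fast training and inference suitable for embedded devices.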
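
The small parameter count is consistent with the use of depth-wise separable convolutions. As a rough illustration only, the sketch below compares the weight count of a standard 1D convolution with that of a depth-wise separable one. The channel width (256) and kernel size (5) are hypothetical values chosen for the arithmetic, not figures from the paper, and the function names are made up for this example.

```python
def conv1d_params(c_in: int, c_out: int, kernel: int, bias: bool = True) -> int:
    """Parameter count of a standard 1D convolution."""
    return c_in * c_out * kernel + (c_out if bias else 0)


def separable_conv1d_params(c_in: int, c_out: int, kernel: int, bias: bool = True) -> int:
    """Parameter count of a depth-wise separable 1D convolution.

    Depth-wise step: one length-`kernel` filter per input channel.
    Point-wise step: a 1x1 convolution that mixes channels.
    """
    depthwise = c_in * kernel + (c_in if bias else 0)
    pointwise = c_in * c_out + (c_out if bias else 0)
    return depthwise + pointwise


# Hypothetical layer size, not taken from the paper.
channels, kernel = 256, 5
standard = conv1d_params(channels, channels, kernel)
separable = separable_conv1d_params(channels, channels, kernel)

print(standard)   # 327936
print(separable)  # 67328
print(round(standard / separable, 2))  # 4.87
```

The saving grows with the kernel size, because the kernel length multiplies only the depth-wise term rather than the full channel-by-channel product.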

Abstract

We propose TalkNet, a non-autoregressive convolutional neural model for speech synthesis with explicit pitch and duration prediction. The model consists of three feed-forward convolutional networks. The first network predicts grapheme durations; the input text is then expanded by repeating each symbol according to its predicted duration. The second network predicts a pitch value for every mel frame. The third network generates a mel-spectrogram from the expanded text, conditioned on the predicted pitch. All networks are based on a 1D depth-wise separable convolutional architecture. Explicit duration prediction eliminates word skipping and repeating. The quality of the generated speech nearly matches that of the best auto-regressive models: TalkNet trained on the LJSpeech dataset achieved a MOS of 4.08. The model has only 13.2M parameters, almost 2x fewer than the present state-of-the-art text-to-speech models. The…
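
The expansion step described in the abstract can be illustrated with a minimal sketch. It assumes each predicted duration is a non-negative integer number of mel frames; the function name `expand_by_duration` and the example durations are hypothetical and are not taken from the paper or its code.

```python
from typing import List, Sequence


def expand_by_duration(symbols: Sequence[str], durations: Sequence[int]) -> List[str]:
    """Repeat each input symbol according to its duration (in frames)."""
    if len(symbols) != len(durations):
        raise ValueError("symbols and durations must have the same length")
    expanded: List[str] = []
    for symbol, duration in zip(symbols, durations):
        if duration < 0:
            raise ValueError("durations must be non-negative")
        # A duration of 0 drops the symbol from the expanded sequence.
        expanded.extend([symbol] * duration)
    return expanded


# Hypothetical durations; in the model they would come from the first network.
symbols = list("cat")
durations = [2, 3, 1]
frames = expand_by_duration(symbols, durations)

print(frames)       # ['c', 'c', 'a', 'a', 'a', 't']
print(len(frames))  # 6
```

The expanded sequence has one entry per output frame, so its length equals the sum of the durations. That alignment is what allows a pitch value and a mel-spectrogram frame to be predicted for every position without an attention mechanism.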

Code & Models

Repositories

rishikksh20/TalkNet2-pytorch
pytorch

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Music and Audio Processing