A$^3$T: Alignment-Aware Acoustic and Text Pretraining for Speech Synthesis and Editing

He Bai; Renjie Zheng; Junkun Chen; Xintong Li; Mingbo Ma; Liang Huang

arXiv:2203.09690·eess.AS·June 22, 2022·1 cites

TL;DR

A$^3$T is a novel pretraining framework that enhances speech synthesis and editing by reconstructing masked acoustic signals with alignment-aware training, leading to high-quality speech generation and improved multi-speaker synthesis.

Contribution

The paper introduces A$^3$T, an alignment-aware pretraining method for speech that improves synthesis and editing without relying on an external speaker verification model.

Findings

1. Outperforms state-of-the-art models in speech editing.
2. Enhances multi-speaker speech synthesis quality.
3. Enables high-quality speech reconstruction and editing.

Abstract

Recently, speech representation learning has improved many speech-related tasks such as speech recognition, speech classification, and speech-to-text translation. However, all of these tasks are in the direction of speech understanding; for the inverse direction, speech synthesis, the potential of representation learning is yet to be realized because generating high-quality speech is challenging. To address this problem, we propose Alignment-Aware Acoustic-Text Pretraining (A$^3$T), which reconstructs masked acoustic signals with text input and acoustic-text alignment during training. In this way, the pretrained model can generate high-quality reconstructed spectrograms, which can be applied directly to speech editing and unseen-speaker TTS. Experiments show that A$^3$T outperforms SOTA models on speech editing, and improves multi-speaker speech synthesis…
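
The masked-reconstruction idea in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the segment format, the `mask_ratio` value, the function names, and the choice of an L1 loss over masked frames are all assumptions made for illustration. The sketch masks whole aligned segments (so mask boundaries follow the acoustic-text alignment) and scores a reconstruction only on the masked frames.

```python
import random


def alignment_aware_mask(segments, num_frames, mask_ratio=0.5, seed=0):
    """Mask whole aligned segments until about mask_ratio of frames are covered.

    segments: list of (token, start_frame, end_frame), end exclusive.
    This format is hypothetical; a forced aligner would normally supply it.
    Returns a list of booleans, True where a frame is masked.
    """
    rng = random.Random(seed)
    order = list(range(len(segments)))
    rng.shuffle(order)

    mask = [False] * num_frames
    target = int(mask_ratio * num_frames)
    masked = 0
    for idx in order:
        if masked >= target:
            break
        _, start, end = segments[idx]
        for t in range(start, end):
            mask[t] = True
        masked += end - start
    return mask


def masked_l1_loss(predicted, reference, mask):
    """Mean absolute error computed over masked frames only (assumed loss)."""
    total, count = 0.0, 0
    for pred_frame, ref_frame, is_masked in zip(predicted, reference, mask):
        if is_masked:
            total += sum(abs(p - r) for p, r in zip(pred_frame, ref_frame))
            count += len(ref_frame)
    return total / count if count else 0.0


# Toy alignment: four phonemes covering 14 spectrogram frames.
segments = [("HH", 0, 3), ("AH", 3, 7), ("L", 7, 9), ("OW", 9, 14)]
mask = alignment_aware_mask(segments, num_frames=14, mask_ratio=0.5)
```

In a real model the unmasked frames and the text would be fed to the encoder, and `predicted` would come from the decoder; here the two functions only show how the alignment constrains where masking happens and where the loss is measured.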

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling