A$^3$T: Alignment-Aware Acoustic and Text Pretraining for Speech Synthesis and Editing
He Bai, Renjie Zheng, Junkun Chen, Xintong Li, Mingbo Ma, Liang Huang

TL;DR
A$^3$T is a novel pretraining framework that enhances speech synthesis and editing by reconstructing masked acoustic signals with alignment-aware training, leading to high-quality speech generation and improved multi-speaker synthesis.
Contribution
The paper introduces A$^3$T, a new alignment-aware pretraining method for speech that improves synthesis and editing without external speaker verification.
Findings
Outperforms state-of-the-art models in speech editing.
Enhances multi-speaker speech synthesis quality.
Enables high-quality speech reconstruction and editing.
Abstract
Recently, speech representation learning has improved many speech-related tasks such as speech recognition, speech classification, and speech-to-text translation. However, all the above tasks are in the direction of speech understanding, but for the inverse direction, speech synthesis, the potential of representation learning is yet to be realized, due to the challenging nature of generating high-quality speech. To address this problem, we propose our framework, Alignment-Aware Acoustic-Text Pretraining (A$^3$T), which reconstructs masked acoustic signals with text input and acoustic-text alignment during training. In this way, the pretrained model can generate high quality reconstructed spectrogram, which can be applied to the speech editing and unseen speaker TTS directly. Experiments show A$^3$T outperforms SOTA models on speech editing, and improves multi-speaker speech synthesis…
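The abstract describes the training objective only at a high level: mask acoustic frames, then reconstruct them given the text and the acoustic-text alignment. The following is a minimal NumPy sketch of that idea, not the authors' implementation. The function names (`mask_aligned_frames`, `masked_l1_loss`), the representation of the alignment as per-phoneme `(start, end)` frame spans, the zero mask value, and the L1 loss restricted to masked frames are all assumptions made for illustration.

```python
import numpy as np


def mask_aligned_frames(mel, alignment, masked_phonemes, mask_value=0.0):
    """Mask every spectrogram frame aligned to the chosen phonemes.

    mel:             array of shape (num_frames, num_mel_bins)
    alignment:       list of (start, end) frame spans, one per phoneme,
                     with `end` exclusive (hypothetical format)
    masked_phonemes: indices into `alignment` to mask
    Returns the masked copy and a boolean mask over frames.
    """
    frame_mask = np.zeros(mel.shape[0], dtype=bool)
    for idx in masked_phonemes:
        start, end = alignment[idx]
        frame_mask[start:end] = True
    masked = mel.copy()
    masked[frame_mask] = mask_value
    return masked, frame_mask


def masked_l1_loss(pred, target, frame_mask):
    """Mean absolute error computed over masked frames only (assumed loss)."""
    if not frame_mask.any():
        return 0.0
    return float(np.abs(pred[frame_mask] - target[frame_mask]).mean())


# Toy data: 10 frames, 4 mel bins, 3 phonemes covering all frames.
rng = np.random.default_rng(0)
mel = rng.normal(size=(10, 4))
alignment = [(0, 3), (3, 7), (7, 10)]

# Mask the frames of the second phoneme; a model would be trained to
# reconstruct them from the unmasked frames plus the phoneme sequence.
masked_mel, frame_mask = mask_aligned_frames(mel, alignment, masked_phonemes=[1])

# Stand-in for a model output: the masked input itself.
loss = masked_l1_loss(masked_mel, mel, frame_mask)
```

Because the mask follows phoneme boundaries rather than arbitrary frame windows, the same mechanism could in principle be reused at inference time for editing: mask the frames aligned to the words being replaced and let the model fill them in from the new text.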
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling
