Preserving Speaker Information in Direct Speech-to-Speech Translation with Non-Autoregressive Generation and Pretraining
Rui Zhou, Akinori Ito, Takashi Nose

TL;DR
This paper introduces a pretraining-enhanced non-autoregressive speech-to-speech translation method that effectively preserves speaker identity, improves translation quality, and maintains fast inference speed.
Contribution
The paper proposes a self-supervised pretraining approach and feature fusion strategies to better preserve speaker information in non-autoregressive S2ST, building on the authors' previous work, SC-S2UT.
Findings
BLEU score improved by 1.14 over previous methods
Significant improvements in MOS and speaker similarity
Minimal increase in inference time (0.04 s per utterance)
Abstract
Speech-to-Speech Translation (S2ST) refers to the conversion of speech in one language into semantically equivalent speech in another language, facilitating communication between speakers of different languages. Speech-to-Discrete Unit Translation (S2UT), a mainstream approach for end-to-end S2ST, addresses challenges such as error propagation across modules and slow inference speed often encountered in traditional cascade systems. However, as discrete units primarily capture content information, conventional S2UT methods fail to retain speaker-specific characteristics from the source. Our previous work, SC-S2UT, introduced a speaker adapter and a unit-to-mel structure, enabling the preservation of speaker information and non-autoregressive speech generation. Building on this foundation, this study proposes a self-supervised pretraining method to enrich the information extracted by both…
Taxonomy
Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis
