Preserving Speaker Information in Direct Speech-to-Speech Translation with Non-Autoregressive Generation and Pretraining

Rui Zhou; Akinori Ito; Takashi Nose

arXiv:2412.07316 · cs.SD · November 10, 2025

TL;DR

This paper introduces a pretraining-enhanced non-autoregressive speech-to-speech translation method that effectively preserves speaker identity, improves translation quality, and maintains fast inference speed.

Contribution

The paper proposes a self-supervised pretraining approach and feature fusion strategies to better preserve speaker information in non-autoregressive S2ST, building on the authors' previous SC-S2UT model.

Findings

01 BLEU score improved by 1.14 over previous methods

02 Significant enhancements in MOS and speaker similarity

03 Minimal increase in inference time (0.04 s per utterance)

Abstract

Speech-to-Speech Translation (S2ST) refers to the conversion of speech in one language into semantically equivalent speech in another language, facilitating communication between speakers of different languages. Speech-to-Discrete Unit Translation (S2UT), a mainstream approach for end-to-end S2ST, addresses challenges such as error propagation across modules and slow inference speed often encountered in traditional cascade systems. However, as discrete units primarily capture content information, conventional S2UT methods fail to retain speaker-specific characteristics from the source. Our previous work, SC-S2UT, introduced a speaker adapter and a unit-to-mel structure, enabling the preservation of speaker information and non-autoregressive speech generation. Building on this foundation, this study proposes a self-supervised pretraining method to enrich the information extracted by both…

Code & Models

Repositories

zhouruitohoku99/sc-s2ut (PyTorch, official)

Taxonomy

Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis