Analyzing Speech Unit Selection for Textless Speech-to-Speech Translation

Jarod Duret (LIA); Yannick Estève (LIA); Titouan Parcollet (CAM)

arXiv:2407.18332 · eess.AS · July 29, 2024

Open Access

TL;DR

This paper investigates how the selection of discrete speech units affects the performance of textless speech-to-speech translation systems, revealing that optimal units for resynthesis differ from those for translation quality.

Contribution

The paper provides a detailed analysis of the criteria for selecting target speech units and highlights the discrepancy between units optimized for resynthesis and units optimized for translation performance.

Findings

01

Units good for speech resynthesis do not always improve translation.

02

The mismatch between resynthesis and translation criteria affects translation system performance.

03

The study covers downstream tasks including automatic speech recognition, speech synthesis, speaker recognition, and emotion recognition.

Abstract

Recent advancements in textless speech-to-speech translation systems have been driven by the adoption of self-supervised learning techniques. Although most state-of-the-art systems adopt a similar architecture to transform source language speech into sequences of discrete representations in the target language, the criteria for selecting these target speech units remain an open question. This work explores the selection process through a study of downstream tasks such as automatic speech recognition, speech synthesis, speaker recognition, and emotion recognition. Interestingly, our findings reveal a discrepancy in the optimization of discrete speech units: units that perform well in resynthesis are not necessarily those that enhance translation efficacy. This discrepancy underscores the nuanced complexity of target feature selection and its impact on the…

Taxonomy

Topics: Speech Recognition and Synthesis

Methods: Feature Selection