Leveraging Unit Language Guidance to Advance Speech Modeling in Textless Speech-to-Speech Translation

Yuhao Zhang; Xiangnan Ma; Kaiqi Kou; Peizhuo Liu; Weiqiao Shan; Benyou Wang; Tong Xiao; Yuxin Huang; Zhengtao Yu; Jingbo Zhu

arXiv:2505.15333 · cs.CL · May 22, 2025

TL;DR

This paper introduces a unit language approach for textless speech-to-speech translation, addressing cross-modal and cross-lingual challenges, and demonstrates significant improvements in multilingual speech translation performance.

Contribution

The paper proposes a novel unit language representation and task prompt modeling to improve speech modeling in textless S2ST, overcoming key cross-modal and cross-lingual challenges.

Findings

01. Significant performance improvements over baseline models.

02. Achieves results comparable to text-based models.

03. Effective mitigation of source-target unit language conflict.

Abstract

The success of building textless speech-to-speech translation (S2ST) models has attracted much attention. However, S2ST still faces two main challenges: 1) extracting linguistic features from varied speech signals, called cross-modal (CM), and 2) learning the alignment of different languages in long sequences, called cross-lingual (CL). We propose the unit language to overcome these two modeling challenges. The unit language can be considered a text-like representation format, constructed using $n$-gram language modeling. We implement multi-task learning to utilize the unit language in guiding the speech modeling process. Our initial results reveal a conflict when applying source and target unit languages simultaneously. We propose task prompt modeling to mitigate this conflict. We conduct experiments on four languages of the VoxPopuli dataset. Our method demonstrates significant improvements…
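To make the idea of a text-like representation built from $n$-gram statistics more concrete, here is a minimal Python sketch. It is not the authors' algorithm: the abstract only says that the unit language is constructed using $n$-gram language modeling. The sketch assumes that speech has already been converted to integer unit IDs, that consecutive repeated units are collapsed, and that word-like tokens are formed by greedily taking the longest $n$-gram that is frequent in the corpus. The function names (dedup, count_ngrams, segment), the min_count threshold, and the toy corpus are all hypothetical.

```python
from collections import Counter


def dedup(units):
    """Collapse runs of identical consecutive units (an assumed preprocessing step)."""
    out = []
    for u in units:
        if not out or out[-1] != u:
            out.append(u)
    return out


def count_ngrams(seqs, max_n):
    """Count every n-gram of length 1..max_n over all unit sequences."""
    counts = Counter()
    for seq in seqs:
        for n in range(1, max_n + 1):
            for i in range(len(seq) - n + 1):
                counts[tuple(seq[i:i + n])] += 1
    return counts


def segment(seq, counts, max_n, min_count=2):
    """Greedy left-to-right segmentation into word-like tokens.

    At each position, take the longest n-gram seen at least min_count
    times; fall back to a single unit when no longer n-gram qualifies.
    """
    words = []
    i = 0
    while i < len(seq):
        for n in range(min(max_n, len(seq) - i), 0, -1):
            gram = tuple(seq[i:i + n])
            if n == 1 or counts[gram] >= min_count:
                words.append("_".join(str(u) for u in gram))
                i += n
                break
    return words


# Toy corpus of discrete unit IDs (hypothetical values, not real data).
corpus = [dedup(s) for s in ([5, 5, 9, 9, 2, 7], [5, 9, 2, 2, 7, 7], [3, 5, 9, 4])]
counts = count_ngrams(corpus, max_n=3)
unit_language = [segment(s, counts, max_n=3) for s in corpus]
```

On this toy corpus, the first sequence is segmented as ['5_9_2', '7'] and the third as ['3', '5_9', '4'], which is shorter than the raw unit sequence and closer in shape to a sentence of words. A real system would likely score segmentations with a trained $n$-gram language model rather than raw counts.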

Code & Models

Repositories

xiaozhang521/Unit_Language
PyTorch · Official

Videos

Leveraging Unit Language Guidance to Advance Speech Modeling in Textless Speech-to-Speech Translation · Underline

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Speech and dialogue systems