SeamlessExpressiveLM: Speech Language Model for Expressive Speech-to-Speech Translation with Chain-of-Thought
Hongyu Gong, Bandhav Veluri

TL;DR
SeamlessExpressiveLM is a unified speech language model that improves expressive speech-to-speech translation by decomposing the task into semantic translation and style transfer steps, outperforming cascaded models in quality and efficiency.
Contribution
It introduces a single model with chain-of-thought prompting for expressive S2ST, eliminating the need for style-aligned data and enhancing translation and style transfer performance.
Findings
Outperforms cascaded LMs in semantic quality
Achieves better style transfer accuracy
Uses fewer parameters for comparable or better results
Abstract
Expressive speech-to-speech translation (S2ST) is a key research topic in seamless communication, which focuses on the preservation of semantics and speaker vocal style in translated speech. Early works synthesized speaker style aligned speech in order to directly learn the mapping from speech to target speech spectrogram. Without reliance on style aligned data, recent studies leverage the advances of language modeling (LM) and build cascaded LMs on semantic and acoustic tokens. This work proposes SeamlessExpressiveLM, a single speech language model for expressive S2ST. We decompose the complex source-to-target speech mapping into intermediate generation steps with chain-of-thought prompting. The model is first guided to translate target semantic content and then transfer the speaker style to multi-stream acoustic units. Evaluated on Spanish-to-English and Hungarian-to-English…
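The two-step decomposition described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the separator tokens, the `generate` callback, and the toy stand-in model are all hypothetical, and multi-stream acoustic units are simplified to a single stream of string tokens. It only shows the control flow of a single model that first produces target semantic units and then, conditioned on them plus a style prompt, produces acoustic units.

```python
from typing import Callable, List, Tuple

# Hypothetical stage markers; the names are illustrative, not taken from the paper.
SEM_TGT = "<tgt_semantic>"
ACOUSTIC = "<tgt_acoustic>"

# A generation callback: given the running context and a stage marker,
# return the tokens produced for that stage. Assumed interface.
Generate = Callable[[List[str], str], List[str]]


def chain_of_thought_s2st(
    source_semantic: List[str],
    style_prompt: List[str],
    generate: Generate,
) -> Tuple[List[str], List[str]]:
    """Run two generation steps with one model over one growing context."""
    context = list(source_semantic)

    # Step 1: semantic translation (source units -> target semantic units).
    context.append(SEM_TGT)
    target_semantic = generate(context, SEM_TGT)
    context.extend(target_semantic)

    # Step 2: style transfer, conditioned on the intermediate semantic
    # output and a style prompt taken from the source speaker.
    context.extend(style_prompt)
    context.append(ACOUSTIC)
    target_acoustic = generate(context, ACOUSTIC)
    return target_semantic, target_acoustic


def toy_generate(context: List[str], stage: str) -> List[str]:
    """Toy stand-in for a speech LM, used only to make the sketch runnable."""
    if stage == SEM_TGT:
        # Pretend translation: relabel every source unit.
        return ["en_" + tok for tok in context if tok.startswith("es_")]
    # Pretend acoustic generation: one unit per target semantic unit,
    # tagged with the first style-prompt token found in the context.
    style = next(tok for tok in context if tok.startswith("spk_"))
    n_units = sum(1 for tok in context if tok.startswith("en_"))
    return [f"ac_{i}_{style}" for i in range(n_units)]


semantic, acoustic = chain_of_thought_s2st(["es_1", "es_2"], ["spk_a"], toy_generate)
```

Because both steps share one context, the acoustic step can attend to the intermediate semantic output, which is the property that distinguishes this setup from a cascade of separate models.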
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
