SeamlessExpressiveLM: Speech Language Model for Expressive Speech-to-Speech Translation with Chain-of-Thought

Hongyu Gong; Bandhav Veluri

arXiv:2405.20410·cs.CL·June 3, 2024·1 cites

Open Access

TL;DR

SeamlessExpressiveLM is a unified speech language model that improves expressive speech-to-speech translation by decomposing the task into semantic translation and style transfer steps, outperforming cascaded models in quality and efficiency.

Contribution

It introduces a single model with chain-of-thought prompting for expressive S2ST, eliminating the need for style-aligned data and enhancing translation and style transfer performance.

Findings

01 Outperforms cascaded LMs in semantic quality

02 Achieves better style transfer accuracy

03 Uses fewer parameters for comparable or better results

Abstract

Expressive speech-to-speech translation (S2ST) is a key research topic in seamless communication, which focuses on the preservation of semantics and speaker vocal style in translated speech. Early works synthesized speaker-style-aligned speech in order to directly learn the mapping from source speech to the target speech spectrogram. Without reliance on style-aligned data, recent studies leverage the advances of language modeling (LM) and build cascaded LMs on semantic and acoustic tokens. This work proposes SeamlessExpressiveLM, a single speech language model for expressive S2ST. We decompose the complex source-to-target speech mapping into intermediate generation steps with chain-of-thought prompting. The model is first guided to translate target semantic content and then transfer the speaker style to multi-stream acoustic units. Evaluated on Spanish-to-English and Hungarian-to-English…
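The two-step decomposition described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the separator tokens, the `chain_of_thought_s2st` helper, and the `toy_model` stand-in are all hypothetical, and the multi-stream acoustic units are simplified to a single stream. It only shows the ordering idea, in which one model is called twice on a growing context, first for target semantic units and then for acoustic units.

```python
from typing import Callable, Dict, List

# Hypothetical separator tokens; the abstract does not specify a prompt format.
SEP_SEMANTIC = "<target_semantic>"
SEP_ACOUSTIC = "<target_acoustic>"

# A "model" here is any function mapping a token context to a continuation.
Model = Callable[[List[str]], List[str]]


def chain_of_thought_s2st(source_units: List[str], model: Model) -> Dict[str, List[str]]:
    """Call one model twice on a growing context: semantics first, then acoustics."""
    # Step 1: condition on source speech units and ask for target semantic units.
    context = list(source_units) + [SEP_SEMANTIC]
    semantic = model(context)

    # Step 2: keep the source and the generated semantics in context,
    # then ask for acoustic units (single stream in this simplified sketch).
    context = context + semantic + [SEP_ACOUSTIC]
    acoustic = model(context)

    return {"semantic": semantic, "acoustic": acoustic, "context": context + acoustic}


def toy_model(context: List[str]) -> List[str]:
    """Stand-in for a trained LM: picks its behavior from the last separator token."""
    if context[-1] == SEP_SEMANTIC:
        # Pretend translation: tag each source unit as a target semantic unit.
        return ["sem_" + unit for unit in context[:-1]]
    if context[-1] == SEP_ACOUSTIC:
        # Pretend style transfer: derive one acoustic unit per semantic unit.
        return ["ac_" + unit for unit in context if unit.startswith("sem_")]
    return []


result = chain_of_thought_s2st(["s1", "s2"], toy_model)
print(result["semantic"])
print(result["acoustic"])
```

Using a single callable for both steps mirrors the abstract's contrast between one speech language model and cascaded LMs; a cascaded design would instead pass a separate model to each step.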

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling