MSLM-S2ST: A Multitask Speech Language Model for Textless Speech-to-Speech Translation with Speaker Style Preservation
Yifan Peng, Ilia Kulikov, Yilin Yang, Sravya Popuri, Hui Lu, Changhan Wang, Hongyu Gong

TL;DR
This paper introduces MSLM, a multitask decoder-only speech language model that enables multilingual speech-to-speech translation without text data, preserving speaker style across languages.
Contribution
The work presents a novel multitask speech language model that performs textless multilingual S2ST with speaker style preservation, without relying on text training data.
Findings
Supports multilingual S2ST without text data
Preserves speaker style during translation
Operates effectively in a multitask setting
Abstract
There has been growing research interest and progress in speech-to-speech translation (S2ST), which translates utterances from one language to another. This work proposes the Multitask Speech Language Model (MSLM), a decoder-only speech language model trained in a multitask setting. Without relying on text training data, our model supports multilingual S2ST with speaker style preserved.
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques
