MSLM-S2ST: A Multitask Speech Language Model for Textless Speech-to-Speech Translation with Speaker Style Preservation

Yifan Peng; Ilia Kulikov; Yilin Yang; Sravya Popuri; Hui Lu; Changhan Wang; Hongyu Gong

arXiv:2403.12408 · cs.CL · March 20, 2024 · 1 citation

Open Access

TL;DR

This paper introduces MSLM, a multitask decoder-only speech language model that enables multilingual speech-to-speech translation without text data, preserving speaker style across languages.

Contribution

The work presents a multitask, decoder-only speech language model that performs multilingual S2ST while preserving speaker style, without relying on text training data.

Findings

1. Supports multilingual S2ST without text data
2. Preserves speaker style during translation
3. Operates effectively in a multitask setting

Abstract

There has been growing research interest and progress in speech-to-speech translation (S2ST), the task of translating utterances from one language to another. This work proposes the Multitask Speech Language Model (MSLM), a decoder-only speech language model trained in a multitask setting. Without relying on text training data, our model supports multilingual S2ST while preserving speaker style.

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques