SLM-S2ST: A multimodal language model for direct speech-to-speech translation

Yuxuan Hu; Haibin Wu; Ruchao Fan; Xiaofei Wang; Heng Lu; Yao Qian; Jinyu Li

arXiv:2506.04392 · eess.AS · February 12, 2026

Open Access

TL;DR

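This paper introduces SLM-S2ST, a multimodal language model for direct speech-to-speech translation built on the open-source Phi4-MM model. An audio transformer head predicts audio tokens and a streaming vocoder synthesizes the waveform. On CVSS-C it surpasses baselines trained on the same dataset, and with more training data and a larger model it performs on par with the current SOTA model.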

Contribution

The paper presents SLM-S2ST, which extends the open-source Phi4-MM speech-aware language model to generate translated speech directly: an audio transformer head predicts audio tokens with a delay relative to the text tokens, and a streaming vocoder converts them into a waveform.

Findings

01

SLM-S2ST significantly outperforms baseline models trained on the same CVSS-C dataset.

02

Scaling up the training data and model size brings SLM-S2ST on par with the current SOTA model.

03

Speech output comes from an audio transformer head whose audio tokens lag the text tokens, followed by a streaming vocoder for waveform synthesis.

Abstract

Speech-aware language models (LMs) have demonstrated capabilities in understanding spoken language while generating text-based responses. However, enabling them to produce speech output efficiently and effectively remains a challenge. In this paper, we present SLM-S2ST, a multimodal LM for direct speech-to-speech translation (S2ST), built on the open-source Phi4-MM model. SLM-S2ST extends its predecessor by generating translated speech using an audio transformer head that predicts audio tokens with a delay relative to text tokens, followed by a streaming vocoder for waveform synthesis. Our experimental results on the CVSS-C dataset demonstrate SLM-S2ST's superior performance, significantly surpassing existing baseline models trained on the same dataset. Furthermore, when we scale up the training data and the model size, SLM-S2ST reaches on-par performance with the current SOTA model.
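The idea of predicting audio tokens "with a delay relative to text tokens" can be pictured with a small alignment sketch. This is only an illustration of a delayed token stream, not the paper's implementation: the abstract does not state the delay length, the token vocabulary, or how padding is handled, so `delay_align`, `PAD`, the toy tokens, and the delay of 2 are all hypothetical.

```python
from itertools import zip_longest

PAD = "<pad>"  # hypothetical filler token; not taken from the paper


def delay_align(text_tokens, audio_tokens, delay):
    """Pair text and audio tokens per decoding step, with audio shifted `delay` steps later."""
    if delay < 0:
        raise ValueError("delay must be non-negative")
    # Prepend `delay` padding slots so the audio stream starts after the text stream.
    shifted_audio = [PAD] * delay + list(audio_tokens)
    # Pad whichever stream ends first so both cover the same number of steps.
    return list(zip_longest(text_tokens, shifted_audio, fillvalue=PAD))


# Toy example with made-up tokens and an assumed delay of 2 steps.
steps = delay_align(["Hello", "world", "<eos>"], ["a0", "a1", "a2", "a3"], delay=2)
for t, (text_tok, audio_tok) in enumerate(steps):
    print(t, text_tok, audio_tok)
```

In this toy layout the first audio token is only emitted at step 2, after two text tokens are already available, which is the intuition behind letting the audio stream trail the text stream.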

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling