SALM-Duplex: Efficient and Direct Duplex Modeling for Speech-to-Speech Language Model

Ke Hu; Ehsan Hosseini-Asl; Chen Chen; Edresson Casanova; Subhankar Ghosh; Piotr Żelasko; Zhehuai Chen; Jason Li; Jagadeesh Balam; Boris Ginsburg

arXiv:2505.15670 · cs.CL · July 28, 2025

TL;DR

This paper introduces SALM-Duplex, a novel speech-to-speech model that enables real-time, continuous dialogue with improved reasoning and turn-taking, achieved with less data and no speech pretraining.

Contribution

It presents the first duplex S2S model with continuous input, channel fusion, and reduced bitrate, simplifying development and enhancing performance over previous models.

Findings

01 Outperforms previous duplex models in reasoning and turn-taking.

02 Halves the bitrate relative to prior work, to 0.6 kbps.

03 Requires less speech data by skipping speech pretraining.
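To make the bitrate finding concrete, the sketch below applies the standard relation for a discrete audio codec: bitrate = frame rate × number of codebooks × bits per code. The function name `codec_bitrate_bps` and the example values (12.5 frames per second, 4 codebooks, 4096 entries each) are illustrative assumptions chosen because they land on 0.6 kbps; the summary above does not state the codec configuration the paper actually uses.

```python
import math


def codec_bitrate_bps(frame_rate_hz: float, num_codebooks: int, codebook_size: int) -> float:
    """Bitrate in bits per second of a codec emitting discrete codes.

    Each frame carries one index per codebook, and each index costs
    log2(codebook_size) bits.
    """
    bits_per_code = math.log2(codebook_size)
    return frame_rate_hz * num_codebooks * bits_per_code


# Hypothetical configuration (an assumption, not taken from the paper).
example_bps = codec_bitrate_bps(frame_rate_hz=12.5, num_codebooks=4, codebook_size=4096)
print(f"{example_bps / 1000:.1f} kbps")

# Doubling the number of codebooks at the same frame rate doubles the bitrate.
doubled_bps = codec_bitrate_bps(frame_rate_hz=12.5, num_codebooks=8, codebook_size=4096)
print(f"{doubled_bps / 1000:.1f} kbps")
```

Under these assumed values, halving the bitrate amounts to halving any one factor, for example the number of codebooks per frame, which also shortens the token sequence the language model must generate for the agent stream.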

Abstract

Spoken dialogue is an intuitive form of human-computer interaction, yet current speech language models often remain constrained to turn-based exchanges, lacking real-time adaptability such as user barge-in. We propose a novel duplex speech-to-speech (S2S) architecture featuring continuous user inputs and codec agent outputs with channel fusion that directly models simultaneous user and agent streams. Using a pretrained streaming encoder for user input enables the first duplex S2S model without requiring speech pretraining. Separate architectures for agent and user modeling facilitate codec fine-tuning for better agent voices and halve the bitrate (0.6 kbps) compared to previous works. Experimental results show that the proposed model outperforms previous duplex models in reasoning, turn-taking, and barge-in abilities. The model requires significantly less speech data, as speech pretraining is…

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and dialogue systems · Speech and Audio Processing