DuplexSLA: A Full-Duplex Spoken Language Model with Synchronized Speech, Language, and Action

Haoyang Zhang, Jun Chen, Donghang Wu, Yuxin Li, Yuxin Zhang, Xiangyu Tony Zhang, Che Liu, Qingjian Lin, Yizhou Peng, Hexin Liu, Eng Siong Chng, Chao Yan, Boyong Wu, Yechang Huang, Xuerui Yang, and Fei Tian

arXiv:2605.20755·eess.AS·May 21, 2026

TL;DR

DuplexSLA is a native full-duplex speech-language-action model that listens, speaks, plans, and calls tools continuously within a real-time conversation.

Contribution

The paper presents a dual-stream, three-channel architecture in which a single backbone handles in-conversation planning and tool calling, with no external modules.

Findings

01. Joint decoding of speech and actions on a shared timeline.

02. Semantic-driven turn-taking control within the backbone.

03. Constructed DuplexSLA-Bench for comprehensive evaluation.

Abstract

Recent advances in spoken dialogue language models have shifted from turn-based to full-duplex designs, where the model continuously listens to the user while generating responses. However, existing duplex backbones still lack a native channel for in-conversation planning and tool calling, leaving real-time agentic behaviour either tied to turn boundaries or relegated to an external cascade. We propose DuplexSLA, a native full-duplex Speech-Language-Action foundation model that decodes assistant audio together with a structured action stream on a shared 160 ms chunk timeline. DuplexSLA is built on a dual-stream three-channel formulation: a continuous user audio channel, a discrete assistant audio channel, and a rate-limited textual action channel, all decoded jointly by a single backbone, so that listening, speaking, planning, and tool calling unfold on one shared clock. Two…
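The shared-clock idea in the abstract can be illustrated with a toy sketch. It is not the authors' implementation: only the 160 ms chunk duration and the three channels come from the abstract. The names `Chunk`, `run_timeline`, `toy_decode_step`, and the `max_action_tokens` budget are hypothetical, and the rate limit is modelled here as a simple per-chunk token cap with overflow carried into later chunks, which is an assumption.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Deque, List, Tuple

CHUNK_MS = 160  # chunk duration stated in the abstract


@dataclass
class Chunk:
    """One step of the shared timeline (hypothetical structure)."""
    index: int
    user_audio: List[float]      # continuous user audio channel (placeholder features)
    assistant_tokens: List[int]  # discrete assistant audio channel
    action_tokens: List[str]     # rate-limited textual action channel

    @property
    def start_ms(self) -> int:
        return self.index * CHUNK_MS


def run_timeline(
    user_frames: List[List[float]],
    decode_step: Callable[[int, List[float]], Tuple[List[int], List[str]]],
    max_action_tokens: int = 2,  # assumed per-chunk budget, not from the paper
) -> List[Chunk]:
    """Advance all three channels one chunk at a time on a single clock."""
    pending: Deque[str] = deque()  # action tokens waiting for budget
    timeline: List[Chunk] = []
    for i, frame in enumerate(user_frames):
        audio_tokens, new_actions = decode_step(i, frame)
        pending.extend(new_actions)
        # Emit at most max_action_tokens this chunk; the rest waits.
        budget = min(max_action_tokens, len(pending))
        emitted = [pending.popleft() for _ in range(budget)]
        timeline.append(Chunk(i, frame, audio_tokens, emitted))
    return timeline


def toy_decode_step(index: int, frame: List[float]) -> Tuple[List[int], List[str]]:
    """Stand-in for the backbone: one audio token per chunk, one tool call at chunk 1."""
    actions = ["<call>", "search", "(", ")"] if index == 1 else []
    return [index], actions


timeline = run_timeline([[0.0], [0.1], [0.2], [0.3]], toy_decode_step)
for chunk in timeline:
    print(chunk.start_ms, chunk.assistant_tokens, chunk.action_tokens)
```

In this toy run the four-token tool call requested at chunk 1 is spread across chunks 1 and 2, while assistant audio tokens keep flowing in every chunk, which is the kind of interleaving a shared chunk timeline allows.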

Code & Models

Repositories

hyzhang24/DuplexSLA (GitHub)