Discourse-Aware Dual-Track Streaming Response for Low-Latency Spoken Dialogue Systems

Siyuan Liu; Jiahui Xu; Feng Jiang; Kuang Wang; Zefeng Zhao; Chu-Ren Huang; Jinghang Gu; Changqing Yin; Haizhou Li

arXiv:2602.23266 · cs.CL · February 27, 2026

Open Access

TL;DR

This paper introduces DDTSR, a low-latency streaming framework for spoken dialogue systems that enables simultaneous listening, reasoning, and speaking, reducing response latency by up to 51% while maintaining discourse quality.

Contribution

The paper presents a discourse-aware dual-track streaming architecture for spoken dialogue systems. A small auxiliary model generates discourse connectives while a large model reasons in parallel, and ASR, LLM inference, and TTS overlap in a streaming fashion so that the system can begin speaking earlier.

Findings

01. Reduces response latency by up to 51%
02. Maintains discourse coherence and quality
03. Compatible with diverse LLM backbones

Abstract

Achieving human-like responsiveness is a critical yet challenging goal for cascaded spoken dialogue systems. Conventional ASR-LLM-TTS pipelines follow a strictly sequential paradigm, requiring complete transcription and full reasoning before speech synthesis can begin, which results in high response latency. We propose the Discourse-Aware Dual-Track Streaming Response (DDTSR) framework, a low-latency architecture that enables listen-while-thinking and speak-while-thinking. DDTSR is built upon three key mechanisms: (1) connective-guided small-large model synergy, where an auxiliary small model generates minimal-committal discourse connectives while a large model performs knowledge-intensive reasoning in parallel; (2) streaming-based cross-modal collaboration, which dynamically overlaps ASR, LLM inference, and TTS to advance the earliest speakable moment; and (3) curriculum-learning-based…
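The dual-track idea in mechanism (1) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the fixed connective "Well,", and the delay constants are all hypothetical stand-ins, and real ASR, LLM, and TTS components are replaced by `asyncio.sleep` calls. The sketch only shows that when a fast small-model track and a slow large-model track start together, the first speakable output arrives at the small model's latency rather than the large model's.

```python
import asyncio
import time

# Hypothetical latencies in seconds; real values depend on the models used.
SMALL_DELAY = 0.05  # stand-in for the auxiliary small model
LARGE_DELAY = 0.30  # stand-in for the large reasoning model


async def small_model_connective(transcript: str) -> str:
    """Stand-in small model: returns a low-commitment discourse connective."""
    await asyncio.sleep(SMALL_DELAY)
    return "Well,"


async def large_model_answer(transcript: str) -> str:
    """Stand-in large model: returns the knowledge-intensive part of the reply."""
    await asyncio.sleep(LARGE_DELAY)
    return f"here is an answer to: {transcript}"


async def dual_track_response(transcript: str) -> list[tuple[float, str]]:
    """Run both tracks concurrently and record when each piece is speakable."""
    start = time.perf_counter()
    # Start both tracks at the same moment so they overlap in time.
    connective_task = asyncio.create_task(small_model_connective(transcript))
    answer_task = asyncio.create_task(large_model_answer(transcript))

    events = []
    # The connective is ready first and could be sent to TTS immediately.
    connective = await connective_task
    events.append((time.perf_counter() - start, connective))
    # The large model has been working in the background the whole time.
    answer = await answer_task
    events.append((time.perf_counter() - start, answer))
    return events


def run_demo(transcript: str = "what is the capital of France"):
    """Synchronous entry point that returns (elapsed_seconds, text) pairs."""
    return asyncio.run(dual_track_response(transcript))


if __name__ == "__main__":
    for elapsed, text in run_demo():
        print(f"{elapsed:.2f}s  {text}")
```

In a strictly sequential pipeline the first audible output would wait for the large model to finish. Here the connective is available as soon as the small model returns, while the large model's work proceeds in the background.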

Taxonomy

Topics: Speech and dialogue systems · Topic Modeling · Speech Recognition and Synthesis