LTS-VoiceAgent: A Listen-Think-Speak Framework for Efficient Streaming Voice Interaction via Semantic Triggering and Incremental Reasoning

Wenhao Zou; Yuwei Miao; Zhanyu Ma; Jun Xu; Jiuchong Gao; Jinghua Hao; Renqing He; Jingwen Xu

arXiv:2601.19952 · cs.SD · January 29, 2026

Open Access

TL;DR

LTS-VoiceAgent is a framework for streaming voice interaction that separates listening, thinking, and speaking, using semantic triggers and incremental reasoning to produce real-time responses more efficiently and accurately.

Contribution

The paper proposes the LTS-VoiceAgent framework, which combines a Dynamic Semantic Trigger with a Dual-Role Stream Orchestrator to improve streaming voice interaction.

Findings

01 Achieves better accuracy-latency trade-off than baselines

02 Enables parallel thinking and speaking without blocking responses

03 Performs well on multiple benchmark datasets

Abstract

Real-time voice agents face a dilemma: end-to-end models often lack deep reasoning, while cascaded pipelines incur high latency by executing ASR, LLM reasoning, and TTS strictly in sequence, unlike human conversation where listeners often start thinking before the speaker finishes. Since cascaded architectures remain the dominant choice for complex tasks, existing cascaded streaming strategies attempt to reduce this latency via mechanical segmentation (e.g., fixed chunks, VAD-based splitting) or speculative generation, but they frequently either break semantic units or waste computation on predictions that must be rolled back. To address these challenges, we propose LTS-VoiceAgent, a Listen-Think-Speak framework that explicitly separates when to think from how to reason incrementally. It features a Dynamic Semantic Trigger to detect meaningful prefixes, and a Dual-Role Stream…
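The latency argument in the abstract can be illustrated with simple arithmetic. The sketch below is not the paper's method: the function names and all millisecond figures are hypothetical, chosen only to show why starting to think before the speaker finishes shortens the delay a user perceives after they stop talking.

```python
# Toy latency model. All names and numbers are illustrative assumptions,
# not measurements or APIs from the LTS-VoiceAgent paper.

def sequential_delay_ms(asr_ms: int, think_ms: int, tts_ms: int) -> int:
    """Delay after the user stops speaking when ASR, LLM reasoning,
    and TTS run strictly one after another."""
    return asr_ms + think_ms + tts_ms


def overlapped_delay_ms(asr_ms: int, think_ms: int, tts_ms: int,
                        think_done_while_listening_ms: int) -> int:
    """Delay when part of the reasoning already happened on a partial
    transcript while the user was still speaking."""
    # Only the reasoning that is still outstanding adds to the delay.
    remaining_think_ms = max(0, think_ms - think_done_while_listening_ms)
    return asr_ms + remaining_think_ms + tts_ms


if __name__ == "__main__":
    # Hypothetical stage costs in milliseconds.
    print(sequential_delay_ms(200, 1000, 300))
    print(overlapped_delay_ms(200, 1000, 300, 600))
```

The model ignores the cost the abstract warns about: reasoning done on a prefix that later turns out to be wrong has to be discarded, so the overlap only pays off when the early thinking starts at a meaningful point in the utterance.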

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and dialogue systems · Topic Modeling