VoxMind: An End-to-End Agentic Spoken Dialogue System

Tianle Liang, Yifu Chen, Shengpeng Ji, Yijun Chen, Zhiyang Jia, Jingyu Lu, Fan Zhuo, Xueyi Pu, Yangzhuo Li, and Zhou Zhao

arXiv:2604.15710 · cs.SD · April 20, 2026

TL;DR

VoxMind is an end-to-end spoken dialogue system that adds agentic capabilities (tool use, structured reasoning, and multi-agent tool management), raising the task completion rate from 34.88% to 74.57% while maintaining conversational quality.

Contribution

The paper introduces VoxMind, a framework that equips end-to-end spoken dialogue models with tool use, structured reasoning, and asynchronous multi-agent management.

Findings

01. Task completion rate increased from 34.88% to 74.57%.

02. Outperforms Gemini-2.5-Pro on spoken agent tasks.

03. Achieves significant improvements while maintaining conversational quality.
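
The headline gain in the first finding can be restated as an absolute and a relative change. The following is a minimal Python sketch of that arithmetic, using only the two rates reported above; the variable names are illustrative and do not come from the paper.

```python
# Task completion rates reported in the findings (percent).
# Variable names are illustrative, not taken from the paper.
rate_before = 34.88
rate_after = 74.57

# Absolute gain, in percentage points.
absolute_gain = round(rate_after - rate_before, 2)

# Ratio of the later rate to the earlier one.
relative_factor = round(rate_after / rate_before, 2)

print(f"absolute gain: {absolute_gain} percentage points")
print(f"relative factor: {relative_factor}x")
```

With these figures the gain is 39.69 percentage points, so the reported rate is roughly 2.14 times the starting rate.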

Abstract

Recent end-to-end spoken dialogue models enable natural interaction. However, as user demands become increasingly complex, models that rely solely on conversational abilities often struggle to cope. Incorporating agentic capabilities is therefore essential: by enabling tool use, these models can extend their knowledge boundaries and better solve real-world tasks. Yet, existing research has largely concentrated on core perception and generation, with comparatively limited exploration of such tool-augmented extensions. To bridge this gap, we present VoxMind, an integrated framework designed to equip end-to-end spoken dialogue models with comprehensive agentic abilities. Leveraging our curated 470-hour AgentChat dataset, we incorporate a "Think-before-Speak" mechanism, enabling the model to internalize structured reasoning as a critical prerequisite for planning and response generation.…

Code & Models

Repositories

MM-Speech/VoxMind (GitHub)

Models

leungtianle/VoxMind (Hugging Face model; 12 downloads, 1 like)