TL;DR
VoxMind is an end-to-end spoken dialogue system that integrates agentic capabilities, structured reasoning, and multi-agent tool management, raising task completion rates while maintaining conversational quality.
Contribution
The paper introduces VoxMind, a novel framework that combines tool use, reasoning, and asynchronous multi-agent management to enhance spoken dialogue systems.
Findings
Task completion rate increased from 34.88% to 74.57%, a gain of 39.69 percentage points.
Outperforms Gemini-2.5-Pro on spoken agent tasks.
Maintains conversational quality alongside the task completion gains.
Abstract
Recent end-to-end spoken dialogue models enable natural interaction. However, as user demands become increasingly complex, models that rely solely on conversational abilities often struggle to cope. Incorporating agentic capabilities is therefore essential: by enabling tool use, these models can extend their knowledge boundaries and better solve real-world tasks. Yet, existing research has largely concentrated on core perception and generation, with comparatively limited exploration of such tool-augmented extensions. To bridge this gap, we present VoxMind, an integrated framework designed to equip end-to-end spoken dialogue models with comprehensive agentic abilities. Leveraging our curated 470-hour AgentChat dataset, we incorporate a "Think-before-Speak" mechanism, enabling the model to internalize structured reasoning as a critical prerequisite for planning and response generation.…
