Speech-Hands: A Self-Reflection Voice Agentic Approach to Speech Recognition and Audio Reasoning with Omni Perception

Zhen Wan; Chao-Han Huck Yang; Jinchuan Tian; Hanrong Ye; Ankita Pasad; Szu-wei Fu; Arushi Goel; Ryo Hachiuma; Shizhe Diao; Kunal Dhawan; Sreyan Ghosh; Yusuke Hirota; Zhehuai Chen; Rafael Valle; Chenhui Chu; Shinji Watanabe; Yu-Chiang Frank Wang; Boris Ginsburg

arXiv:2601.09413 · cs.SD · May 19, 2026

TL;DR

Speech-Hands is a self-reflective, voice-agentic framework that improves speech recognition and audio reasoning by learning when to trust the model's own output and when to consult external audio perception, which makes it more robust and accurate.

Contribution

The paper presents a learnable self-reflection mechanism for audio understanding models: the model makes an explicit decision about whether to trust itself or to consult external audio perception, which improves accuracy and robustness on speech recognition and audio reasoning tasks.

Findings

01. Outperforms strong baselines by 12.1% WER on seven benchmarks from the OpenASR leaderboard

02. Achieves 77.37% accuracy on audio QA tasks

03. Generalizes from speech recognition to complex, multiple-choice audio reasoning
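
For readers unfamiliar with the metric behind the first finding, word error rate (WER) is conventionally the word-level edit distance between a reference transcript and a hypothesis, divided by the number of reference words. The sketch below is a generic illustration of that definition, not the paper's evaluation code; the function name `word_error_rate` and the whitespace tokenization are assumptions, and the listing does not say whether the 12.1% figure is an absolute or a relative difference.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: words are separated by whitespace and compared exactly,
    with no text normalization (real leaderboards usually normalize first).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of four reference words.
    print(word_error_rate("a b c d", "a x c d"))
```

Lower is better, so an improvement in WER means fewer substituted, deleted, or inserted words per reference word.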

Abstract

We introduce a voice-agentic framework that learns one critical omni-understanding skill: knowing when to trust itself versus when to consult external audio perception. Our work is motivated by a crucial yet counterintuitive finding: naively fine-tuning an omni-model on both speech recognition and external sound understanding tasks often degrades performance, as the model can be easily misled by noisy hypotheses. To address this, our framework, Speech-Hands, recasts the problem as an explicit self-reflection decision. This learnable reflection primitive proves effective in preventing the model from being derailed by flawed external candidates. We show that this agentic action mechanism generalizes naturally from speech recognition to complex, multiple-choice audio reasoning. Across the OpenASR leaderboard, Speech-Hands consistently outperforms strong baselines by 12.1% WER on seven…
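
The self-reflection decision described in the abstract can be pictured as a small routing step: given the model's own candidate and an external one, pick which to trust. The sketch below is only an illustration under stated assumptions. In the paper the decision is learned by the model; here a hand-written confidence margin stands in for it, and the names `Action`, `Candidate`, `reflect`, and `transcribe`, the two-way action set, and the confidence scores are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    # Hypothetical labels; the abstract does not list the actual action set.
    TRUST_SELF = "trust_self"
    USE_EXTERNAL = "use_external"


@dataclass
class Candidate:
    text: str
    confidence: float  # assumed to be a score in [0, 1]


def reflect(internal: Candidate, external: Candidate, margin: float = 0.1) -> Action:
    """Stand-in for the learned reflection step.

    Assumption: the external candidate is used only when it is more
    confident than the internal one by more than `margin`, so a noisy
    external hypothesis does not override a reasonable internal answer.
    """
    if external.confidence > internal.confidence + margin:
        return Action.USE_EXTERNAL
    return Action.TRUST_SELF


def transcribe(internal: Candidate, external: Candidate, margin: float = 0.1) -> str:
    """Return the text of whichever candidate the reflection step selects."""
    action = reflect(internal, external, margin)
    return external.text if action is Action.USE_EXTERNAL else internal.text


if __name__ == "__main__":
    own = Candidate("turn the lights on", confidence=0.9)
    noisy = Candidate("turn the light son", confidence=0.4)
    print(transcribe(own, noisy))
```

The point of making the choice explicit is that the final output is never a blend of both candidates: a flawed external hypothesis is either adopted on purpose or set aside.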

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Topic Modeling