Listen, Pause, and Reason: Toward Perception-Grounded Hybrid Reasoning for Audio Understanding
Jieyi Wang, Yazhe Niu, Dexuan Xu, and Zhongyu Wei

TL;DR
This paper introduces a perception-grounded hybrid reasoning framework for audio understanding, combining structured auditory scene perception with reasoning to improve robustness and multi-speaker comprehension.
Contribution
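It presents a hierarchical decoupling strategy (implemented in the PAQA dataset), the two-stage Hybrid Perception-Reasoning framework HyPeR, and two training techniques for improved audio reasoning: GRPO-based refinement of the model's deliberation and PAUSE tokens for latent computation.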
Findings
HyPeR outperforms baseline models on multiple benchmarks.
The model achieves performance comparable to large-scale models.
Perceptual grounding enhances multi-speaker and ambiguous audio understanding.
Abstract
Recent Large Audio Language Models have demonstrated impressive capabilities in audio understanding. However, they often suffer from perceptual errors, while reliable audio reasoning is unattainable without first grounding the model's perception in structured auditory scenes. Inspired by Auditory Scene Analysis, we first introduce a Perception-Aware Question Answering (PAQA) dataset. PAQA implements a hierarchical decoupling strategy that separates speech from environmental sound and distinguishes multiple speakers, providing explicit perceptual reasoning for training. Building on this, we propose HyPeR, a two-stage Hybrid Perception-Reasoning framework. In Stage I, we finetune the model on PAQA to perceive acoustic attributes in complex audio. In Stage II, we leverage GRPO to refine the model's internal deliberation. We also introduce PAUSE tokens to facilitate latent computation…
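As a rough illustration of the Stage II idea, GRPO (Group Relative Policy Optimization) scores each sampled response relative to the other responses drawn for the same prompt, rather than against a learned value function. The sketch below shows only that group-relative normalization step. It is not taken from the paper: the function name, the example rewards, the `eps` constant, and the use of the population standard deviation are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses to the same prompt.

    Hypothetical helper: each advantage is (reward - group mean) divided by
    the group standard deviation. `eps` guards against division by zero
    when every response in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled answers to one audio question
# (1.0 = correct answer, 0.0 = incorrect answer).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# A group with identical rewards carries no learning signal.
flat = group_relative_advantages([1.0, 1.0, 1.0])
```

In this toy group, the correct answers receive positive advantages and the incorrect ones negative advantages, so the policy update would favor the former. How HyPeR defines its reward and combines it with PAUSE tokens is not specified in the text above.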
