TL;DR
This paper introduces the Unified Audio Schema (UAS), which structures audio data into transcription, paralinguistics, and non-linguistic events, improving fine-grained perception in AudioLLMs without sacrificing reasoning ability.
Contribution
The paper proposes a holistic supervision framework that enhances AudioLLMs' acoustic perception by explicitly organizing audio information into a structured JSON format.
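To make the structured JSON format concrete, here is a minimal Python sketch of what a single UAS-style supervision target might look like. Only the three top-level components (transcription, paralinguistics, non-linguistic events) come from the paper's description. The exact key names, the nested attributes (`emotion`, `speaker_gender`), the example values, and the helper `build_uas_record` are hypothetical and chosen purely for illustration.

```python
import json


def build_uas_record(transcription, paralinguistics, events):
    """Assemble one illustrative UAS-style target.

    The key names below are assumptions; the paper's actual schema
    keys are not given in this summary.
    """
    return {
        "transcription": transcription,             # what was said
        "paralinguistics": dict(paralinguistics),   # how it was said
        "non_linguistic_events": list(events),      # what else was heard
    }


# Hypothetical example values, not taken from the paper's data.
record = build_uas_record(
    transcription="I can't believe we made it.",
    paralinguistics={"emotion": "excited", "speaker_gender": "female"},
    events=["applause", "door closing"],
)

# Serialize to a single JSON string that could serve as a text target.
target_text = json.dumps(record, ensure_ascii=False)
print(target_text)
```

Keeping all three components in one serialized object means a model trained on such a target would have to emit the paralinguistic and event fields alongside the transcript, rather than treating them as noise to be discarded.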
Findings
UAS-Audio improves perception accuracy by 10.9% on MMSU.
The framework maintains strong reasoning capabilities.
Experiments validate effectiveness across multiple architectures.
Abstract
Recent Audio Large Language Models (AudioLLMs) exhibit a striking performance inversion: while excelling at complex reasoning tasks, they consistently underperform on fine-grained acoustic perception. We attribute this gap to a fundamental limitation of ASR-centric training, which provides precise linguistic targets but implicitly teaches models to suppress paralinguistic cues and acoustic events as noise. To address this, we propose Unified Audio Schema (UAS), a holistic and structured supervision framework that organizes audio information into three explicit components -- Transcription, Paralinguistics, and Non-linguistic Events -- within a unified JSON format. This design achieves comprehensive acoustic coverage without sacrificing the tight audio-text alignment that enables reasoning. We validate the effectiveness of this supervision strategy by applying it to both discrete and…
