Beyond Transcription: Unified Audio Schema for Perception-Aware AudioLLMs

Linhao Zhang; Yuhan Song; Aiwei Liu; Chuhan Wu; Sijun Zhang; Wei Jia; Yuan Liu; Houfeng Wang; Xiao Zhou

arXiv:2604.12506·cs.CL·April 15, 2026

TL;DR

This paper introduces a unified audio schema (UAS) that structures audio data into transcription, paralinguistics, and non-linguistic events, improving fine-grained perception in AudioLLMs without sacrificing reasoning abilities.

Contribution

The paper proposes a holistic supervision framework that improves AudioLLMs' acoustic perception by organizing audio information into an explicit, structured JSON format.
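
To make the three-part structure concrete, the following is a minimal sketch of what a UAS-style record could look like when serialized to JSON. The key names (`transcription`, `paralinguistics`, `non_linguistic_events`), the helper names, and the example values are assumptions chosen for illustration only; the summary here does not give the actual schema keys, so the real format may differ.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import Dict, List

# Hypothetical key names: only the three component names come from the paper
# summary; the exact JSON keys used by the authors are not given here.
REQUIRED_KEYS = {"transcription", "paralinguistics", "non_linguistic_events"}


@dataclass
class UASRecord:
    """One audio clip described by the three UAS components (illustrative)."""
    transcription: str                                          # what was said
    paralinguistics: Dict[str, str] = field(default_factory=dict)   # how it was said
    non_linguistic_events: List[str] = field(default_factory=list)  # other sounds


def to_uas_json(record: UASRecord) -> str:
    """Serialize a record into a single JSON string usable as a training target."""
    return json.dumps(asdict(record), ensure_ascii=False, sort_keys=True)


def from_uas_json(text: str) -> UASRecord:
    """Parse a JSON string back into a record, rejecting incomplete objects."""
    data = json.loads(text)
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"missing UAS components: {sorted(missing)}")
    return UASRecord(
        transcription=data["transcription"],
        paralinguistics=dict(data["paralinguistics"]),
        non_linguistic_events=list(data["non_linguistic_events"]),
    )


if __name__ == "__main__":
    # Example values are invented for illustration, not taken from the paper.
    record = UASRecord(
        transcription="Could you close the window?",
        paralinguistics={"emotion": "neutral", "speaker_gender": "female"},
        non_linguistic_events=["traffic_noise"],
    )
    print(to_uas_json(record))
```

Keeping all three components in one object means a single text target carries the words along with the cues that ASR-only supervision tends to discard, which is the motivation the paper gives for the schema.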

Findings

01. UAS-Audio improves perception accuracy by 10.9% on MMSU.

02. The framework maintains strong reasoning capabilities.

03. Experiments validate effectiveness across multiple architectures.

Abstract

Recent Audio Large Language Models (AudioLLMs) exhibit a striking performance inversion: while excelling at complex reasoning tasks, they consistently underperform on fine-grained acoustic perception. We attribute this gap to a fundamental limitation of ASR-centric training, which provides precise linguistic targets but implicitly teaches models to suppress paralinguistic cues and acoustic events as noise. To address this, we propose Unified Audio Schema (UAS), a holistic and structured supervision framework that organizes audio information into three explicit components -- Transcription, Paralinguistics, and Non-linguistic Events -- within a unified JSON format. This design achieves comprehensive acoustic coverage without sacrificing the tight audio-text alignment that enables reasoning. We validate the effectiveness of this supervision strategy by applying it to both discrete and…

Code & Models

Repositories

Tencent/Unified_Audio_Schema (GitHub)

Models

tencent/Unified_Audio_Schema (Hugging Face model; 60 downloads, 9 likes)
