When Audio and Text Disagree: Revealing Text Bias in Large Audio-Language Models

Cheng Wang; Gelei Deng; Xianglin Yang; Han Qiu; Tianwei Zhang

arXiv:2508.15407·cs.CL·August 22, 2025

Open Access

TL;DR

This paper introduces MCR-BENCH, a benchmark showing that large audio-language models tend to follow the text rather than the audio when the two modalities conflict, which substantially degrades their performance on audio-centric tasks.

Contribution

The study presents the first comprehensive benchmark for evaluating modality prioritization in LALMs. It also analyzes the factors that influence text bias and examines strategies for mitigating it.

Findings

01. LALMs favor textual input over audio in conflicting scenarios

02. Performance drops significantly on audio-centric tasks with inconsistent data

03. Supervised finetuning can reduce but not eliminate text bias

Abstract

Large Audio-Language Models (LALMs) are enhanced with audio perception capabilities, enabling them to effectively process and understand multimodal inputs that combine audio and text. However, their performance in handling conflicting information between audio and text modalities remains largely unexamined. This paper introduces MCR-BENCH, the first comprehensive benchmark specifically designed to evaluate how LALMs prioritize information when presented with inconsistent audio-text pairs. Through extensive evaluation across diverse audio understanding tasks, we reveal a concerning phenomenon: when inconsistencies exist between modalities, LALMs display a significant bias toward textual input, frequently disregarding audio evidence. This tendency leads to substantial performance degradation in audio-centric tasks and raises important reliability concerns for real-world applications. We…
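
To make the idea of modality prioritization concrete, the following is a minimal sketch of how one might score a model's answers on conflicting audio-text pairs. It is not the scoring procedure used by MCR-BENCH: the `ConflictTrial` class, the `modality_rates` function, and the label values are hypothetical names introduced only for illustration, and the sketch assumes each trial has a single categorical answer.

```python
from dataclasses import dataclass


@dataclass
class ConflictTrial:
    """One conflicting audio-text pair (hypothetical structure)."""
    audio_label: str  # what the audio actually contains
    text_label: str   # what the conflicting text claims
    prediction: str   # the model's answer


def modality_rates(trials):
    """Return the fraction of answers that matched the audio, the text, or neither."""
    if not trials:
        raise ValueError("need at least one trial")
    for t in trials:
        if t.audio_label == t.text_label:
            raise ValueError("audio and text labels must differ in a conflict trial")
    n = len(trials)
    audio = sum(t.prediction == t.audio_label for t in trials)
    text = sum(t.prediction == t.text_label for t in trials)
    return {"audio": audio / n, "text": text / n, "other": (n - audio - text) / n}


# Illustrative, made-up trials (not data from the paper).
trials = [
    ConflictTrial("dog bark", "cat meow", "cat meow"),
    ConflictTrial("siren", "doorbell", "doorbell"),
    ConflictTrial("rain", "applause", "rain"),
    ConflictTrial("piano", "guitar", "violin"),
]
rates = modality_rates(trials)
print(rates)
```

Under this assumed scoring, a text-biased model would show a `text` rate well above its `audio` rate on the same set of conflicting pairs.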

Taxonomy

Topics: Music and Audio Processing · Speech Recognition and Synthesis · Speech and Audio Processing