When Audio and Text Disagree: Revealing Text Bias in Large Audio-Language Models
Cheng Wang, Gelei Deng, Xianglin Yang, Han Qiu, Tianwei Zhang

TL;DR
This paper introduces MCR-BENCH, a benchmark showing that large audio-language models (LALMs) tend to prioritize text over audio when the two modalities conflict, which degrades their performance on audio-centric tasks.
Contribution
The study presents the first comprehensive benchmark for evaluating modality prioritization in LALMs, and it analyzes the factors that influence text bias as well as strategies for mitigating it.
Findings
LALMs favor textual input over audio in conflicting scenarios
Performance drops significantly on audio-centric tasks when the audio and text inputs are inconsistent
Supervised finetuning can reduce but not eliminate text bias
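One way to make the first finding concrete is to count, over a set of conflicting audio-text pairs, how often a model's answer matches the text-implied label versus the audio-supported label. The sketch below is illustrative only and is not the paper's metric or code: the `ConflictTrial` structure, its field names, and the `modality_rates` function are hypothetical, and it assumes each trial has one answer supported by the audio and a different answer implied by the text.

```python
from dataclasses import dataclass


@dataclass
class ConflictTrial:
    """One conflicting audio-text example (hypothetical structure)."""
    audio_label: str  # answer supported by the audio clip
    text_label: str   # conflicting answer implied by the text input
    prediction: str   # answer produced by the model


def modality_rates(trials):
    """Return the fraction of predictions that follow text, audio, or neither."""
    if not trials:
        raise ValueError("at least one trial is required")
    for t in trials:
        # Assumption: the two modalities must genuinely disagree in every trial.
        if t.audio_label == t.text_label:
            raise ValueError("audio and text labels must conflict")
    n = len(trials)
    follows_text = sum(t.prediction == t.text_label for t in trials)
    follows_audio = sum(t.prediction == t.audio_label for t in trials)
    return {
        "text": follows_text / n,
        "audio": follows_audio / n,
        "other": (n - follows_text - follows_audio) / n,
    }


# Toy data invented for illustration; these are not results from the paper.
example_trials = [
    ConflictTrial("dog", "cat", "cat"),
    ConflictTrial("rain", "wind", "wind"),
    ConflictTrial("piano", "guitar", "piano"),
    ConflictTrial("happy", "sad", "sad"),
]
rates = modality_rates(example_trials)
```

Under this sketch, a "text" rate well above the "audio" rate on conflicting pairs would indicate the kind of text bias described above.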
Abstract
Large Audio-Language Models (LALMs) are enhanced with audio perception capabilities, enabling them to effectively process and understand multimodal inputs that combine audio and text. However, their performance in handling conflicting information between audio and text modalities remains largely unexamined. This paper introduces MCR-BENCH, the first comprehensive benchmark specifically designed to evaluate how LALMs prioritize information when presented with inconsistent audio-text pairs. Through extensive evaluation across diverse audio understanding tasks, we reveal a concerning phenomenon: when inconsistencies exist between modalities, LALMs display a significant bias toward textual input, frequently disregarding audio evidence. This tendency leads to substantial performance degradation in audio-centric tasks and raises important reliability concerns for real-world applications. We…
Taxonomy
Topics: Music and Audio Processing · Speech Recognition and Synthesis · Speech and Audio Processing
