When Audio-LLMs Don't Listen: A Cross-Linguistic Study of Modality Arbitration
Jayadev Billa

TL;DR
This study examines how speech-enabled language models arbitrate between conflicting audio and text inputs, finding a strong bias toward text and identifying factors that shape arbitration behavior across languages and models.
Contribution
Introduces ALME, a dataset of 57,602 controlled audio-text conflict stimuli across eight languages, along with the Text Dominance Ratio (TDR) metric, and analyzes arbitration behavior and biases in state-of-the-art audio-LLMs.
Findings
Models follow conflicting text 10 to 26 times more often when the competing source is audio than when the audio is replaced by its transcript.
Framing transcripts as corrupted reduces model reliance on text.
Arbitration behavior depends more on reasoning than on audio input quality.
Abstract
When audio and text conflict, speech-enabled language models follow text far more often than they do when arbitrating between two conflicting text sources, even under explicit instructions to trust the audio. We introduce ALME (Audio-LLM Modality Evaluation), a dataset of 57,602 controlled audio-text conflict stimuli across eight languages, together with Text Dominance Ratio (TDR), which measures how often a model follows conflicting text when instructed to follow audio. Gemini 2.0 Flash and GPT-4o show TDR 10 to 26 times higher than a baseline that replaces audio with its transcript under otherwise identical conditions (Gemini 2.0 Flash: 16.6% vs. 1.6%; GPT-4o: 23.2% vs. 0.9%). These results suggest that text dominance reflects not only information content, but also an asymmetry in arbitration accessibility, i.e., how easily the model can use competing representations at decision time.…
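To make the TDR figures concrete, here is a minimal sketch of how such a ratio could be computed and compared against the text-only baseline. The outcome labels ("text", "audio", "other"), the function names, and the choice to count every trial in the denominator are assumptions made for illustration; the paper's exact scoring procedure is not given in the abstract. Only the four percentages are taken from the abstract.

```python
from collections import Counter


def text_dominance_ratio(outcomes):
    """Fraction of conflict trials in which the model followed the text.

    `outcomes` is a list of hypothetical per-trial labels: "text" when the
    answer matched the conflicting text, "audio" when it matched the audio,
    and "other" for anything else. Counting "other" trials in the
    denominator is an assumption, not something the abstract specifies.
    """
    if not outcomes:
        raise ValueError("need at least one trial")
    counts = Counter(outcomes)
    return counts["text"] / len(outcomes)


def fold_increase(tdr_audio, tdr_baseline):
    """How many times higher TDR is with audio than with the transcript baseline."""
    return tdr_audio / tdr_baseline


# TDR percentages reported in the abstract: (audio condition, transcript baseline).
reported = {
    "Gemini 2.0 Flash": (16.6, 1.6),
    "GPT-4o": (23.2, 0.9),
}

for model, (audio, baseline) in reported.items():
    print(f"{model}: {fold_increase(audio, baseline):.1f}x")
# Gemini 2.0 Flash: 10.4x
# GPT-4o: 25.8x
```

The two ratios, roughly 10.4 and 25.8, are consistent with the 10 to 26 times range quoted in the abstract.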
Taxonomy
Topics: Topic Modeling · Speech Recognition and Synthesis · Natural Language Processing Techniques
