Can Large Audio Language Models Ignore Multilingual Distractors? An Evaluation of Their Selective Auditory Attention Capabilities
Heejoon Koo

TL;DR
This paper evaluates whether Large Audio Language Models (LALMs) can stay focused on target speech in the presence of multilingual distractors, and finds that their selective auditory attention breaks down, especially at severe signal-to-noise ratios (SNRs).
Contribution
Introduction of MUSA, a multilingual benchmark for assessing source-grounded spoken-language understanding and reasoning in LALMs under cocktail party scenarios.
Findings
Cocktail party accuracy degrades under severe SNRs, even for models with strong single-setting performance.
Source separation reduces acoustic overlap but doesn't improve source attribution.
Errors are mainly due to distractor-grounded source confusion.
Abstract
Robust selective auditory attention under multilingual interference is critical for reliable deployment of Large Audio Language Models (LALMs). We introduce MUSA, a cocktail party-inspired multilingual benchmark for source-grounded spoken-language understanding and reasoning. Each item pairs an English target dialogue with a semantically plausible distractor in English, Spanish, Korean, or Chinese, and evaluates models across (1) single, (2) source separation-based two-stage, and (3) end-to-end cocktail party settings under controlled SNRs. Evaluating two closed-source and four open-weight LALMs, we find that strong single performance does not ensure robust selective auditory attention: cocktail party accuracy degrades under severe SNRs, and errors are dominated by distractor-grounded source confusion. In addition, separation reduces acoustic overlap but leaves source attribution…
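The abstract does not spell out how target and distractor are combined at a "controlled SNR". As a rough illustration only, the sketch below shows one common convention: rescale the distractor so that the target-to-distractor RMS ratio matches a requested value in dB, then sum the two signals. The function names `rms` and `mix_at_snr` are hypothetical, and this is an assumption about the general technique, not the procedure used to build MUSA.

```python
import math


def rms(signal):
    """Root-mean-square amplitude of a non-empty sequence of samples."""
    return math.sqrt(sum(x * x for x in signal) / len(signal))


def mix_at_snr(target, distractor, snr_db):
    """Mix two mono signals so the target-to-distractor ratio equals snr_db.

    Hypothetical helper: assumes both inputs are lists of float samples at
    the same sample rate. The longer signal is truncated to the shorter one.
    Returns (mixture, scaled_distractor).
    """
    n = min(len(target), len(distractor))
    if n == 0:
        raise ValueError("both signals must contain samples")
    target, distractor = target[:n], distractor[:n]

    t_rms, d_rms = rms(target), rms(distractor)
    if t_rms == 0 or d_rms == 0:
        raise ValueError("both signals must be non-silent")

    # SNR_dB = 20 * log10(t_rms / (gain * d_rms)), solved for gain.
    gain = t_rms / (d_rms * 10 ** (snr_db / 20))
    scaled = [gain * d for d in distractor]
    mixture = [t + s for t, s in zip(target, scaled)]
    return mixture, scaled
```

With this convention, a negative `snr_db` makes the distractor louder than the target, which corresponds to the severe conditions under which the abstract reports degraded accuracy.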
