ALICE: A Multifaceted Evaluation Framework of Large Audio-Language Models' In-Context Learning Ability

Yen-Ting Piao; Jay Chiehen Liao; Wei-Tang Chien; Toshiki Ogimoto; Shang-Tse Chen; Yun-Nung Chen; Chun-Yi Lee; Shao-Yuan Lo

arXiv:2603.20433 · cs.SD · March 24, 2026

TL;DR

This paper introduces ALICE, a framework for evaluating how well large audio-language models (LALMs) learn from in-context examples under audio conditioning. It finds that demonstrations improve format adherence but do not improve core task performance.

Contribution

The paper presents ALICE, a three-stage evaluation framework that progressively reduces textual guidance to assess LALMs' in-context learning under audio conditioning, a setting the authors describe as previously unstudied.

Findings

01 In-context demonstrations improve format compliance but not task accuracy.

02 In-context demonstrations often degrade core task performance.

03 LALMs may struggle to leverage cross-modal semantic grounding to infer task objectives.

Abstract

While Large Audio-Language Models (LALMs) have been shown to exhibit degraded instruction-following capabilities, their ability to infer task patterns from in-context examples under audio conditioning remains unstudied. To address this gap, we present ALICE, a three-stage framework that progressively reduces textual guidance to systematically evaluate LALMs' in-context learning ability under audio conditioning. Evaluating six LALMs across four audio understanding tasks under two output constraint categories, we uncover a consistent asymmetry across all stages and LALMs: in-context demonstrations reliably improve format compliance but fail to improve, and often degrade, the core task performance. This suggests that LALMs can glean surface-level formatting patterns from demonstrations but may struggle to leverage cross-modal semantic grounding to reliably infer task objectives from…
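The asymmetry described in the abstract depends on scoring format compliance and task correctness separately. The following is a minimal sketch of that separation, not the authors' implementation. The JSON output constraint, the emotion labels, the lenient substring match, and all function names are assumptions chosen for illustration, and the model outputs are made-up toy strings rather than results from the paper.

```python
import json


def is_format_compliant(output: str) -> bool:
    """Hypothetical constraint: output must be a JSON object with a string "label"."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and isinstance(parsed.get("label"), str)


def is_task_correct(output: str, gold: str) -> bool:
    """Lenient check (an assumption): the gold label appears anywhere in the raw
    output, so correctness is judged independently of formatting."""
    return gold.lower() in output.lower()


def score(outputs, golds):
    """Return format compliance and task accuracy as separate rates."""
    n = len(outputs)
    compliance = sum(is_format_compliant(o) for o in outputs) / n
    accuracy = sum(is_task_correct(o, g) for o, g in zip(outputs, golds)) / n
    return {"format_compliance": compliance, "task_accuracy": accuracy}


# Toy data only: invented outputs for a hypothetical emotion-recognition task.
golds = ["happy", "sad", "angry", "neutral"]
zero_shot = [
    "The speaker sounds happy.",
    "sad",
    "I think the emotion is angry",
    "Neutral tone.",
]
few_shot = [
    '{"label": "happy"}',
    '{"label": "happy"}',
    '{"label": "sad"}',
    '{"label": "neutral"}',
]

print("zero-shot:", score(zero_shot, golds))
print("few-shot: ", score(few_shot, golds))
```

In this toy data the few-shot outputs all satisfy the format constraint while fewer of them contain the correct label, which is the kind of pattern a single combined metric would hide.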

Taxonomy

Topics: Music and Audio Processing · Speech Recognition and Synthesis · Speech and Audio Processing