AMUSE: Audio-Visual Benchmark and Alignment Framework for Agentic Multi-Speaker Understanding

Sanjoy Chowdhury; Karren D. Yang; Xudong Liu; Fartash Faghri; Pavan Kumar Anasosalu Vasu; Oncel Tuzel; Dinesh Manocha; Chun-Liang Li; Raviteja Vemulapalli

arXiv:2512.16250·cs.AI·December 19, 2025

TL;DR

AMUSE is a new benchmark for evaluating multimodal large language models on agentic, multi-speaker audio-visual understanding tasks. It exposes limitations of current models, and the paper also proposes RAFT, an alignment framework for improving reasoning.

Contribution

The paper introduces AMUSE, a comprehensive benchmark for agentic multi-speaker audio-visual tasks, and proposes RAFT, a novel data-efficient alignment framework for enhancing model reasoning.

Findings

01. Current models show weak multi-speaker reasoning.
02. Models exhibit inconsistent behavior in agentic settings.
03. RAFT improves accuracy by up to 39.52% on the benchmark.

Abstract

Recent multimodal large language models (MLLMs) such as GPT-4o and Qwen3-Omni show strong perception but struggle in multi-speaker, dialogue-centric settings that demand agentic reasoning: tracking who speaks, maintaining roles, and grounding events across time. These scenarios are central to multimodal audio-video understanding, where models must jointly reason over audio and visual streams in applications such as conversational video assistants and meeting analytics. We introduce AMUSE, a benchmark designed around tasks that are inherently agentic, requiring models to decompose complex audio-visual interactions into planning, grounding, and reflection steps. It evaluates MLLMs across three modes (zero-shot, guided, and agentic) and six task families, including spatio-temporal speaker grounding and multimodal dialogue summarization. Across all modes, current models exhibit weak…

Taxonomy

Topics: Multimodal Machine Learning Applications · Speech and dialogue systems · Social Robot Interaction and HRI