AV-Odyssey Bench: Can Your Multimodal LLMs Really Understand Audio-Visual Information?

Kaixiong Gong; Kaituo Feng; Bohao Li; Yibing Wang; Mofan Cheng; Shijia Yang; Jiaming Han; Benyou Wang; Yutong Bai; Zhuoran Yang; Xiangyu Yue

arXiv:2412.02611·cs.CV·December 4, 2024

Open Access 1 Repo 3 Datasets

TL;DR

This paper introduces AV-Odyssey Bench, a comprehensive audio-visual benchmark to evaluate whether multimodal large language models truly understand combined audio and visual information, revealing current limitations.

Contribution

The paper presents AV-Odyssey Bench, a new benchmark of 4,555 problems, each combining text, visual, and audio components, designed to objectively assess audio-visual understanding in multimodal large language models.

Findings

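01. Current MLLMs struggle with simple audio tasks that humans find trivial, such as judging which of two sounds is louder or which has a higher pitch.

02. The benchmark reveals significant limitations in existing models' audio-visual understanding.

03. The results provide insights for future dataset and model development.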

Abstract

Recently, multimodal large language models (MLLMs), such as GPT-4o, Gemini 1.5 Pro, and Reka Core, have expanded their capabilities to include vision and audio modalities. While these models demonstrate impressive performance across a wide range of audio-visual applications, our proposed DeafTest reveals that MLLMs often struggle with simple tasks humans find trivial: 1) determining which of two sounds is louder, and 2) determining which of two sounds has a higher pitch. Motivated by these observations, we introduce AV-Odyssey Bench, a comprehensive audio-visual benchmark designed to assess whether those MLLMs can truly understand the audio-visual information. This benchmark encompasses 4,555 carefully crafted problems, each incorporating text, visual, and audio components. To successfully infer answers, models must effectively leverage clues from both visual and audio inputs. To ensure…
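The two DeafTest tasks named in the abstract (which sound is louder, which has a higher pitch) can be illustrated with a small synthetic example. The following is a minimal sketch, not the authors' test construction: the pure sine tones, the sample rate, the amplitudes, and the helper names (`sine_tone`, `rms`, `zero_crossing_pitch`, `louder`, `higher_pitch`) are all assumptions made for illustration. It only shows how a ground-truth answer for such a two-clip comparison could be computed.

```python
import math

SAMPLE_RATE = 16_000  # Hz; assumed value for this sketch


def sine_tone(freq_hz, amplitude, duration_s=0.5, sample_rate=SAMPLE_RATE):
    """Generate a pure sine tone as a list of float samples."""
    n = int(duration_s * sample_rate)
    return [
        amplitude * math.sin(2 * math.pi * freq_hz * i / sample_rate)
        for i in range(n)
    ]


def rms(samples):
    """Root-mean-square level, used here as a simple loudness proxy."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def zero_crossing_pitch(samples, sample_rate=SAMPLE_RATE):
    """Rough pitch estimate for a pure tone: upward zero crossings per second."""
    crossings = sum(1 for a, b in zip(samples, samples[1:]) if a < 0 <= b)
    return crossings * sample_rate / len(samples)


def louder(clip_a, clip_b):
    """Ground-truth label for the 'which sound is louder' question."""
    return "A" if rms(clip_a) > rms(clip_b) else "B"


def higher_pitch(clip_a, clip_b):
    """Ground-truth label for the 'which sound has a higher pitch' question."""
    return "A" if zero_crossing_pitch(clip_a) > zero_crossing_pitch(clip_b) else "B"


# Hypothetical pair: clip A is louder, clip B is higher in pitch.
clip_a = sine_tone(freq_hz=440.0, amplitude=0.8)
clip_b = sine_tone(freq_hz=880.0, amplitude=0.2)

print("Louder:", louder(clip_a, clip_b))
print("Higher pitch:", higher_pitch(clip_a, clip_b))
```

A model under test would receive the two clips together with the question text, and its answer would be compared against these labels. Zero-crossing counting is only reliable for clean single-frequency tones; real recordings would need a proper pitch estimator.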

Code & Models

Repositories

AV-Odyssey/AV-Odyssey
pytorch

Taxonomy

Topics: Music and Audio Processing · Natural Language Processing Techniques · Video Analysis and Summarization