WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs

Jack Hong; Shilin Yan; Jiayin Cai; Xiaolong Jiang; Yao Hu; Weidi Xie

arXiv:2502.04326·cs.CV·March 3, 2026

TL;DR

WorldSense is a comprehensive benchmark designed to evaluate multi-modal video understanding across visual, audio, and text inputs, highlighting current models' limitations in real-world scenarios.

Contribution

We introduce WorldSense, the first benchmark for omni-modal video understanding with diverse tasks, high-quality annotations, and a focus on audio-visual synergy, to advance multimodal model evaluation.

Findings

01. Existing models achieve only 65.1% accuracy on real-world tasks.

02. Models struggle with understanding complex, multi-modal scenarios.

03. WorldSense reveals significant gaps in current multimodal understanding capabilities.

Abstract

We introduce WorldSense, the first benchmark to assess multi-modal video understanding that simultaneously encompasses visual, audio, and text inputs. In contrast to existing benchmarks, WorldSense has several features: (i) collaboration of omni-modality: we design the evaluation tasks to feature a strong coupling of audio and video, requiring models to effectively utilize the synergistic perception of omni-modality; (ii) diversity of videos and tasks: WorldSense encompasses a diverse collection of 1,662 audio-visual synchronised videos, systematically categorized into 8 primary domains and 67 fine-grained subcategories to cover broad scenarios, and 3,172 multi-choice QA pairs across 26 distinct tasks to enable comprehensive evaluation; (iii) high-quality annotations: all the QA pairs are manually labeled by 80 expert annotators with multiple rounds of correction to ensure…
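Since the benchmark consists of multi-choice QA pairs grouped into tasks, results are naturally reported as accuracy overall and per task. The following is a minimal sketch of that scoring, not the authors' evaluation code: the `QAItem` fields, the `accuracy_by_task` helper, and the task names in the usage note are all hypothetical, and the real release may use a different schema.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class QAItem:
    """One multiple-choice question (hypothetical field names)."""
    video_id: str
    task: str    # task label; the actual task names are not given here
    answer: str  # ground-truth option letter, e.g. "B"


def accuracy_by_task(items, predictions):
    """Return (overall_accuracy, {task: accuracy}).

    `predictions` holds one predicted option letter per item, in the
    same order as `items`. Comparison ignores case and outer whitespace.
    """
    if len(items) != len(predictions):
        raise ValueError("items and predictions must have the same length")

    correct = defaultdict(int)
    total = defaultdict(int)
    for item, pred in zip(items, predictions):
        total[item.task] += 1
        if pred.strip().upper() == item.answer.strip().upper():
            correct[item.task] += 1

    per_task = {task: correct[task] / total[task] for task in total}
    n = sum(total.values())
    overall = sum(correct.values()) / n if n else 0.0
    return overall, per_task
```

As a usage illustration with made-up data: four items split evenly across two placeholder tasks, with one wrong prediction in the first task, give per-task accuracies of 0.5 and 1.0 and an overall accuracy of 0.75. Reporting both views matters because tasks with few questions can be hidden by the overall number.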

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

- A novel audio-video benchmark focused on omni-modal understanding.
- Expert curated by annotators and manual QA design for better quality.
- Paper is well-written.

Weaknesses

- Although the dataset is diverse, the distribution of question difficulty across categories or cognitive levels is unclear. Some tasks might be more perception-heavy than reasoning-heavy, which could bias model comparison.

Reviewer 02 · Rating 10 · Confidence 5

Strengths

1. Solid motivation. The paper focuses on omnimodal reasoning, requiring models to use visual and audio inputs together to answer each question. This designed coupling makes it distinct from most existing MLLM benchmarks.
2. High-quality human-reviewed annotations. Unlike recent benchmarks, WorldSense is reviewed and revised by human experts instead of relying solely on LLMs. This guarantees the quality of the benchmark and is very rare these days.
3. Comprehensive experiments. T

Weaknesses

1. Video captions are not a good representative of the "text" modality, which makes the "omni" claim somewhat overstated. Given real-world constraints, audio–video may already suffice for evaluating omnimodality, as seen in emerging “world models.”
2. The question types are restricted to multiple-choice QA. This is common in existing benchmarks, but given the abilities of MLLMs, free-form answers could yield deeper insights and align better with end-user usage.
3. The benchmark is currently focusing on perception and r

Reviewer 03 · Rating 6 · Confidence 4

Strengths

1. Requiring both video and audio modalities for accurate responses to each question in WorldSense facilitates a more comprehensive evaluation of current MLLMs in omni-modal reasoning.
2. Experimentally, the performance drop of current video-audio MLLMs indicates that the fusion between modalities is ineffective or even detrimental.

Weaknesses

1. While WorldSense emphasizes real-world omni-modal perception, understanding, and reasoning, the benchmark consists primarily of QA pairs. Interactive question answering would be more practical for real-world scenarios. Moreover, isn’t the term "omni-modal" somewhat overstated, given that the benchmark only includes video, audio, and text modalities?
2. Lack of analysis on why the fusion of open-source audio and video models failed. While I wouldn’t tend to reject the paper for

Taxonomy

Topics: Semantic Web and Ontologies · Natural Language Processing Techniques · Speech and dialogue systems