VKnowU: Evaluating Visual Knowledge Understanding in Multimodal LLMs

Tianxiang Jiang; Sheng Xia; Yicheng Xu; Linquan Wu; Xiangyu Zeng; Limin Wang; Yu Qiao; Yi Wang

arXiv:2511.20272·cs.CV·November 26, 2025

TL;DR

This paper introduces VKnowU, a benchmark for evaluating visual knowledge in multimodal large language models, revealing current models' limitations and proposing a new dataset and baseline to enhance understanding of physical and social worlds.

Contribution

The paper presents VKnowU, a comprehensive benchmark for visual knowledge, and introduces VideoKnow+, a baseline model that improves understanding of physical and social principles in MLLMs.

Findings

01

Leading models still fall short of human performance on visual knowledge tasks.

02

VideoKnow+ achieves a +3.7% improvement on the VKnowU benchmark.

03

Visual knowledge is crucial for more generalizable multimodal models.

Abstract

While Multimodal Large Language Models (MLLMs) have become adept at recognizing objects, they often lack the intuitive, human-like understanding of the world's underlying physical and social principles. These high-level, vision-grounded semantics, which we term visual knowledge, form a bridge between perception and reasoning, yet remain underexplored in current MLLMs. To systematically evaluate this capability, we present VKnowU, a comprehensive benchmark featuring 1,680 questions in 1,249 videos, covering 8 core types of visual knowledge spanning both world-centric (e.g., intuitive physics) and human-centric (e.g., subjective intentions) categories. Evaluation of 23 SOTA MLLMs reveals that leading models still fall short of human performance, with particularly notable gaps on world-centric knowledge. To bridge this gap, we introduce a new dataset, VKnowQA, and VideoKnow+, a baseline model that…

Code & Models

Models

🤗
Eurayka/VideoKnow

Taxonomy

Topics: Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI) · Advanced Graph Neural Networks