VKnowU: Evaluating Visual Knowledge Understanding in Multimodal LLMs
Tianxiang Jiang, Sheng Xia, Yicheng Xu, Linquan Wu, Xiangyu Zeng, Limin Wang, Yu Qiao, Yi Wang

TL;DR
This paper introduces VKnowU, a benchmark for evaluating visual knowledge in multimodal large language models (MLLMs). It shows where current models fall short and proposes a new dataset and baseline model to improve their understanding of the physical and social world.
Contribution
The paper presents VKnowU, a comprehensive benchmark for visual knowledge, and introduces VideoKnow+, a baseline model that improves understanding of physical and social principles in MLLMs.
Findings
Across 23 SOTA MLLMs evaluated, leading models still fall short of human performance on visual knowledge tasks, with the most notable gaps on world-centric knowledge.
VideoKnow+ achieves a +3.7% improvement on the VKnowU benchmark.
Visual knowledge is crucial for more generalizable multimodal models.
Abstract
While Multimodal Large Language Models (MLLMs) have become adept at recognizing objects, they often lack the intuitive, human-like understanding of the world's underlying physical and social principles. These high-level vision-grounded semantics, which we term visual knowledge, form a bridge between perception and reasoning, yet remain underexplored in current MLLMs. To systematically evaluate this capability, we present VKnowU, a comprehensive benchmark featuring 1,680 questions in 1,249 videos, covering 8 core types of visual knowledge spanning both world-centric (e.g., intuitive physics) and human-centric (e.g., subjective intentions). Evaluation of 23 SOTA MLLMs reveals that leading models still fall short of human performance, with particularly notable gaps in world-centric knowledge. To bridge this gap, we introduce a new dataset, VKnowQA, and VideoKnow+, a baseline model that…
Taxonomy
Topics: Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI) · Advanced Graph Neural Networks
