MVP-Bench: Can Large Vision-Language Models Conduct Multi-level Visual Perception Like Humans?
Guanzhen Li, Yuxi Xie, Min-Yen Kan

TL;DR
This paper introduces MVP-Bench, a benchmark that evaluates how well large vision-language models perform multi-level visual perception. It reveals significant gaps relative to human perception, especially in high-level semantic understanding.
Contribution
The paper presents MVP-Bench, the first benchmark systematically assessing both low- and high-level visual perception of LVLMs across natural and synthetic images.
Findings
LVLMs perform poorly on high-level perception tasks compared to humans.
GPT-4o achieves only 56% accuracy on Yes/No questions in high-level perception.
Models struggle to generalize their understanding of visual semantics to synthetic images.
Abstract
Humans perform visual perception at multiple levels, including low-level object recognition and high-level semantic interpretation such as behavior understanding. Subtle differences in low-level details can lead to substantial changes in high-level perception. For example, substituting the shopping bag held by a person with a gun suggests violent behavior, implying criminal or violent activity. Despite significant advancements in various multimodal tasks, Large Visual-Language Models (LVLMs) remain unexplored in their capabilities to conduct such multi-level visual perceptions. To investigate the perception gap between LVLMs and humans, we introduce MVP-Bench, the first visual-language benchmark systematically evaluating both low- and high-level visual perception of LVLMs. We construct MVP-Bench across natural and synthetic images to investigate how manipulated content influences…
Taxonomy
Topics: Multimodal Machine Learning Applications · Visual Attention and Saliency Detection · Domain Adaptation and Few-Shot Learning
