MVP-Bench: Can Large Vision-Language Models Conduct Multi-level Visual Perception Like Humans?

Guanzhen Li; Yuxi Xie; Min-Yen Kan

arXiv:2410.04345·cs.CV·October 8, 2024

TL;DR

This paper introduces MVP-Bench, a benchmark that evaluates how well large vision-language models perform multi-level visual perception. It reveals significant gaps relative to human perception, especially in high-level semantic understanding.

Contribution

The paper presents MVP-Bench, the first benchmark systematically assessing both low- and high-level visual perception of LVLMs across natural and synthetic images.

Findings

1. LVLMs perform poorly on high-level perception tasks compared to humans.
2. GPT-4o achieves only 56% accuracy on Yes/No questions in high-level perception.
3. Models struggle to generalize understanding of synthetic images' semantics.
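To make the Yes/No accuracy figure above concrete, here is a minimal sketch of how per-level accuracy on Yes/No questions could be computed. It is not the benchmark's official scoring code: the `YesNoItem` record, the `"low"`/`"high"` level labels, the answer-normalization rule, and the sample data are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class YesNoItem:
    """One Yes/No question (hypothetical record layout, not the MVP-Bench schema)."""
    level: str       # assumed labels: "low" or "high" perception level
    gold: str        # reference answer: "yes" or "no"
    prediction: str  # raw free-form model output


def normalize(answer: str) -> str:
    """Map a free-form answer to 'yes', 'no', or 'unknown' using its first word."""
    words = answer.strip().lower().replace(",", " ").replace(".", " ").split()
    first = words[0] if words else ""
    return first if first in ("yes", "no") else "unknown"


def accuracy_by_level(items):
    """Return the fraction of correctly answered questions for each level."""
    totals, correct = {}, {}
    for item in items:
        totals[item.level] = totals.get(item.level, 0) + 1
        if normalize(item.prediction) == item.gold:
            correct[item.level] = correct.get(item.level, 0) + 1
    return {level: correct.get(level, 0) / n for level, n in totals.items()}


# Made-up example data, for illustration only.
items = [
    YesNoItem("low", "yes", "Yes, it is."),
    YesNoItem("low", "no", "No."),
    YesNoItem("high", "yes", "No, the person is shopping."),
    YesNoItem("high", "no", "no"),
]

scores = accuracy_by_level(items)
print(scores)
```

Answers that do not begin with "yes" or "no" are counted as wrong here, which is one possible design choice; a real evaluation may parse model outputs differently.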

Abstract

Humans perform visual perception at multiple levels, including low-level object recognition and high-level semantic interpretation such as behavior understanding. Subtle differences in low-level details can lead to substantial changes in high-level perception. For example, substituting the shopping bag held by a person with a gun suggests violent behavior, implying criminal or violent activity. Despite significant advancements in various multimodal tasks, Large Visual-Language Models (LVLMs) remain unexplored in their capabilities to conduct such multi-level visual perceptions. To investigate the perception gap between LVLMs and humans, we introduce MVP-Bench, the first visual-language benchmark systematically evaluating both low- and high-level visual perception of LVLMs. We construct MVP-Bench across natural and synthetic images to investigate how manipulated content influences…

Code & Models

Repositories

guanzhenli/mvp-bench
PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Visual Attention and Saliency Detection · Domain Adaptation and Few-Shot Learning