Do MLLMs Exhibit Human-like Perceptual Behaviors? HVSBench: A Benchmark for MLLM Alignment with Human Perceptual Behavior
Jiaying Lin, Shuquan Ye, Dan Xu, Wanli Ouyang, Rynson W.H. Lau

TL;DR
HVSBench is a large-scale benchmark designed to evaluate whether Multimodal Large Language Models (MLLMs) exhibit human-like perceptual behaviors across various visual tasks, revealing a significant perceptual gap compared to humans.
Contribution
This paper introduces HVSBench, the first large-scale benchmark for assessing MLLM alignment with human visual perception, with over 85,000 samples spanning 13 categories across 5 fields.
Findings
MLLMs achieve only moderate performance on HVSBench
Humans significantly outperform MLLMs in perceptual tasks
The benchmark highlights the perceptual gap and the need for more human-aligned models
Abstract
While Multimodal Large Language Models (MLLMs) excel at many vision tasks, it is unknown if they exhibit human-like perceptual behaviors. To evaluate this, we introduce HVSBench, the first large-scale benchmark with over 85,000 samples designed to test MLLM alignment with the human visual system (HVS). The benchmark covers 13 categories across 5 key fields: Prominence, Subitizing, Prioritizing, Free-Viewing, and Searching. Our comprehensive evaluation reveals a significant perceptual gap: even state-of-the-art MLLMs achieve only moderate results. In contrast, human participants demonstrate strong performance, significantly outperforming all models. This underscores the high quality of HVSBench and the need for more human-aligned AI. We believe our benchmark will be a critical tool for developing the next generation of explainable MLLMs.
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
