Beyond the Visible: Benchmarking Occlusion Perception in Multimodal Large Language Models
Zhaochen Liu, Kaiwen Gao, Shuyi Liang, Bin Xiao, Limeng Qiao, Lin Ma, Tingting Jiang

TL;DR
This paper introduces O-Bench, a new benchmark for evaluating multimodal large language models on occlusion perception, revealing significant gaps compared to human performance and identifying key failure patterns.
Contribution
The paper presents O-Bench, the first VQA benchmark for occlusion perception, whose images are built with a novel layered synthesis approach, together with a comprehensive evaluation of 22 MLLMs against a human baseline.
Findings
Current MLLMs lag behind humans in occlusion perception.
Neither model scaling nor thinking processes close the performance gap.
Identified failure patterns include conservative bias, fragile gestalt, and difficulty with quantitative tasks.
Abstract
Occlusion perception, a critical foundation for human-level spatial understanding, embodies the challenge of integrating visual recognition and reasoning. Though multimodal large language models (MLLMs) have demonstrated remarkable capabilities, their performance on occlusion perception remains under-explored. To address this gap, we introduce O-Bench, the first visual question answering (VQA) benchmark specifically designed for occlusion perception. Based on SA-1B, we construct 1,365 images featuring semantically coherent occlusion scenarios through a novel layered synthesis approach. Upon this foundation, we annotate 4,588 question-answer pairs in total across five tailored tasks, employing a reliable, semi-automatic workflow. Our extensive evaluation of 22 representative MLLMs against the human baseline reveals a significant performance gap between current MLLMs and humans, which, we…
