Beyond the Visible: Benchmarking Occlusion Perception in Multimodal Large Language Models

Zhaochen Liu; Kaiwen Gao; Shuyi Liang; Bin Xiao; Limeng Qiao; Lin Ma; Tingting Jiang

arXiv:2508.04059 · cs.CV · August 7, 2025

TL;DR

This paper introduces O-Bench, a new benchmark for evaluating multimodal large language models on occlusion perception, revealing significant gaps compared to human performance and identifying key failure patterns.

Contribution

The paper presents O-Bench, the first VQA benchmark for occlusion perception, with a novel layered synthesis approach and comprehensive evaluation of 22 MLLMs.

Findings

1. Current MLLMs lag behind humans in occlusion perception.

2. Model scaling and thinking processes do not close the performance gap.

3. Identified failure patterns include conservative bias, fragile gestalt, and difficulty with quantitative tasks.

Abstract

Occlusion perception, a critical foundation for human-level spatial understanding, embodies the challenge of integrating visual recognition and reasoning. Though multimodal large language models (MLLMs) have demonstrated remarkable capabilities, their performance on occlusion perception remains under-explored. To address this gap, we introduce O-Bench, the first visual question answering (VQA) benchmark specifically designed for occlusion perception. Based on SA-1B, we construct 1,365 images featuring semantically coherent occlusion scenarios through a novel layered synthesis approach. Upon this foundation, we annotate 4,588 question-answer pairs in total across five tailored tasks, employing a reliable, semi-automatic workflow. Our extensive evaluation of 22 representative MLLMs against the human baseline reveals a significant performance gap between current MLLMs and humans, which, we…
