SceneFunRI: Reasoning the Invisible for Task-Driven Functional Object Localization
Posheng Chen, Powen Cheng, Gueter Josmy Faure, Hung-Ting Su, Winston H. Hsu

TL;DR
SceneFunRI introduces a new benchmark for evaluating vision-language models' ability to infer the locations of occluded objects using reasoning and commonsense knowledge.
Contribution
The paper presents SceneFunRI, a benchmark dataset for reasoning about invisible objects that exposes the limitations of current models and is intended to guide future research.
Findings
Baseline models perform poorly on invisible object reasoning tasks: the strongest baseline (Gemini 3 Flash) achieves a CAcc@75 of only 15.20.
Prompting strategies improve model performance but leave a significant gap.
Invisible-region reasoning remains a challenging and unstable capability in current vision-language models.
Abstract
In real-world scenes, target objects may reside in regions that are not visible. While humans can often infer the locations of occluded objects from context and commonsense knowledge, this capability remains a major challenge for vision-language models (VLMs). To address this gap, we introduce SceneFunRI, a benchmark for Reasoning the Invisible. Based on the SceneFun3D dataset, SceneFunRI formulates the task as a 2D spatial reasoning problem via a semi-automatic pipeline and comprises 855 instances. It requires models to infer the locations of invisible functional objects from task instructions and commonsense reasoning. The strongest baseline model (Gemini 3 Flash) achieves only a CAcc@75 of 15.20, an mIoU of 0.74, and a Dist of 28.65. We group our prompting analysis into three categories: Strong Instruction Prompting, Reasoning-based Prompting, and Spatial Process of Elimination…
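The abstract reports mIoU and a distance score without defining them here. As a rough illustration only, the sketch below computes the standard intersection-over-union between two axis-aligned 2D boxes, its mean over a set of instances, and the Euclidean distance between box centers. The box format `(x_min, y_min, x_max, y_max)` and the function names `iou`, `mean_iou`, and `center_distance` are assumptions made for this example; the paper's exact definitions of mIoU, Dist, and CAcc@75 may differ and are not reproduced.

```python
from math import hypot
from typing import Sequence, Tuple

# Assumed box format (not taken from the paper): (x_min, y_min, x_max, y_max)
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Standard intersection-over-union of two axis-aligned boxes."""
    # Overlap extent along each axis, clamped at zero when boxes are disjoint
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def mean_iou(preds: Sequence[Box], gts: Sequence[Box]) -> float:
    """Mean IoU over paired predicted and ground-truth boxes."""
    if len(preds) != len(gts) or not preds:
        raise ValueError("preds and gts must be non-empty and of equal length")
    return sum(iou(p, g) for p, g in zip(preds, gts)) / len(preds)


def center_distance(a: Box, b: Box) -> float:
    """Euclidean distance between box centers (a hypothetical stand-in for Dist)."""
    ax, ay = (a[0] + a[2]) / 2, (a[1] + a[3]) / 2
    bx, by = (b[0] + b[2]) / 2, (b[1] + b[3]) / 2
    return hypot(ax - bx, ay - by)
```

A predicted region that misses the hidden object entirely gets an IoU of zero however close it is, which is one reason a distance-style score is a useful complement when targets are small or occluded.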
