From Where Things Are to What They Are For: Benchmarking Spatial-Functional Intelligence in Multimodal LLMs
Le Zhang, Jihan Yang, Soundarya Krishnan, Jimit Majmudar, Xiou Ge, Prasoon Puri, Prathamesh Nandkishor Saraf, Shruti Bhargava, Dhivya Piraviperumal, Yinan Ling, Cindy Pan, Hong Yu, Aishwarya Agrawal, Bo-Hsiang Tseng

TL;DR
This paper introduces SFI-Bench, a video-based benchmark designed to evaluate higher-order spatial and functional reasoning in multimodal large language models, highlighting current models' limitations in grounded intelligence.
Contribution
The paper presents SFI-Bench, a video-based benchmark of over 1,500 expert-annotated questions drawn from egocentric indoor video scans. It assesses two complementary dimensions of advanced reasoning in multimodal models: Structured Spatial Reasoning and Functional Reasoning.
Findings
Current MLLMs struggle with integrating spatial memory and functional reasoning.
SFI-Bench reveals significant gaps in models' ability to perform higher-order reasoning that goes beyond low-level geometric perception.
The benchmark provides a diagnostic tool for measuring progress in grounded multimodal intelligence.
Abstract
Human-level agentic intelligence extends beyond low-level geometric perception, evolving from recognizing where things are to understanding what they are for. While existing benchmarks effectively evaluate the geometric perception capabilities of multimodal large language models (MLLMs), they fall short of probing the higher-order cognitive abilities required for grounded intelligence. To address this gap, we introduce the Spatial-Functional Intelligence Benchmark (SFI-Bench), a video-based benchmark with over 1,500 expert-annotated questions derived from diverse egocentric indoor video scans. SFI-Bench systematically evaluates two complementary dimensions of advanced reasoning: (1) Structured Spatial Reasoning, which requires understanding complex layouts and forming coherent spatial representations, and (2) Functional Reasoning, which involves inferring object affordances and their…
