EmbSpatial-Bench: Benchmarking Spatial Understanding for Embodied Tasks with Large Vision-Language Models
Mengfei Du, Binhao Wu, Zejun Li, Xuanjing Huang, Zhongyu Wei

TL;DR
This paper introduces EmbSpatial-Bench, a benchmark for evaluating the spatial understanding of large vision-language models (LVLMs) in embodied tasks. It reveals the limitations of current models and proposes an instruction-tuning dataset to improve them.
Contribution
The paper presents EmbSpatial-Bench for assessing spatial understanding in LVLMs and introduces EmbSpatial-SFT, a dataset to enhance their embodied spatial reasoning capabilities.
Findings
Current LVLMs, including GPT-4V, show insufficient spatial understanding.
EmbSpatial-Bench effectively evaluates embodied spatial reasoning.
EmbSpatial-SFT improves LVLMs' spatial understanding after tuning.
Abstract
The recent rapid development of Large Vision-Language Models (LVLMs) has indicated their potential for embodied tasks. However, the critical skill of spatial understanding in embodied environments has not been thoroughly evaluated, leaving the gap between current LVLMs and qualified embodied intelligence unknown. Therefore, we construct EmbSpatial-Bench, a benchmark for evaluating embodied spatial understanding of LVLMs. The benchmark is automatically derived from embodied scenes and covers 6 spatial relationships from an egocentric perspective. Experiments expose the insufficient capacity of current LVLMs (even GPT-4V). We further present EmbSpatial-SFT, an instruction-tuning dataset designed to improve LVLMs' embodied spatial understanding.
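To make the idea of egocentric spatial relationships concrete, here is a minimal sketch of how relation labels between two objects might be derived from their positions in the camera frame. It is not the authors' pipeline: the `ObjectCenter` class, the `egocentric_relations` function, the coordinate convention, the `margin` threshold, and the six label names are all assumptions made for illustration, since the abstract states only that 6 relationships are covered.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ObjectCenter:
    """Object center in camera coordinates (assumed convention):
    x grows to the right, y grows upward, z is depth away from the camera."""
    name: str
    x: float
    y: float
    z: float


def egocentric_relations(a: ObjectCenter, b: ObjectCenter, margin: float = 0.1) -> List[str]:
    """Return the relations of `a` relative to `b` from the camera's viewpoint.

    `margin` is a hypothetical dead zone that suppresses labels when the two
    centers are nearly aligned along an axis.
    """
    relations = []
    # Horizontal axis: left / right
    if a.x < b.x - margin:
        relations.append("left")
    elif a.x > b.x + margin:
        relations.append("right")
    # Vertical axis: above / below
    if a.y > b.y + margin:
        relations.append("above")
    elif a.y < b.y - margin:
        relations.append("below")
    # Depth axis: closer to / farther from the observer
    if a.z < b.z - margin:
        relations.append("closer")
    elif a.z > b.z + margin:
        relations.append("farther")
    return relations


lamp = ObjectCenter("lamp", x=-1.0, y=0.5, z=2.0)
table = ObjectCenter("table", x=0.0, y=0.0, z=3.0)
print(egocentric_relations(lamp, table))
```

Under these assumptions, labels of this kind could be turned into questions about an image and compared against a model's answers. How the benchmark actually builds its questions is not described in this summary.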
Taxonomy
Topics: Multimodal Machine Learning Applications · Semantic Web and Ontologies
