EmbSpatial-Bench: Benchmarking Spatial Understanding for Embodied Tasks with Large Vision-Language Models

Mengfei Du; Binhao Wu; Zejun Li; Xuanjing Huang; Zhongyu Wei

arXiv:2406.05756·cs.AI·June 11, 2024

TL;DR

This paper introduces EmbSpatial-Bench, a benchmark for evaluating how well large vision-language models understand spatial relationships in embodied tasks. It reveals limitations in current models and proposes an instruction-tuning dataset to address them.

Contribution

The paper presents EmbSpatial-Bench for assessing spatial understanding in LVLMs and introduces EmbSpatial-SFT, a dataset to enhance their embodied spatial reasoning capabilities.

Findings

01 Current LVLMs, including GPT-4V, show insufficient spatial understanding.

02 EmbSpatial-Bench effectively evaluates embodied spatial reasoning.

03 EmbSpatial-SFT improves LVLMs' spatial understanding after tuning.

Abstract

The recent rapid development of Large Vision-Language Models (LVLMs) has indicated their potential for embodied tasks. However, the critical skill of spatial understanding in embodied environments has not been thoroughly evaluated, leaving the gap between current LVLMs and qualified embodied intelligence unknown. Therefore, we construct EmbSpatial-Bench, a benchmark for evaluating embodied spatial understanding of LVLMs. The benchmark is automatically derived from embodied scenes and covers 6 spatial relationships from an egocentric perspective. Experiments expose the insufficient capacity of current LVLMs (even GPT-4V). We further present EmbSpatial-SFT, an instruction-tuning dataset designed to improve LVLMs' embodied spatial understanding.
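
As an illustration of how results on a benchmark that covers several spatial relationships might be summarized, the following minimal sketch computes accuracy per relationship. It is not the authors' evaluation code. The record schema (`relation`, `answer`, `prediction`), the multiple-choice answer letters, and the relation labels `left` and `above` are all assumptions made for the example.

```python
from collections import defaultdict


def per_relation_accuracy(records):
    """Return accuracy per spatial relation.

    `records` is an iterable of dicts with the hypothetical keys
    'relation', 'answer', and 'prediction' (not the benchmark's real schema).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        relation = record["relation"]
        total[relation] += 1
        # A prediction counts as correct only on an exact match with the answer.
        if record["prediction"] == record["answer"]:
            correct[relation] += 1
    return {relation: correct[relation] / total[relation] for relation in total}


# Toy records for illustration only; these are not benchmark data.
records = [
    {"relation": "left", "answer": "A", "prediction": "A"},
    {"relation": "left", "answer": "B", "prediction": "C"},
    {"relation": "above", "answer": "D", "prediction": "D"},
]
scores = per_relation_accuracy(records)
```

Reporting accuracy per relationship, rather than a single overall number, makes it easier to see which kinds of spatial relationship a model handles poorly.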

Code & Models

Repositories

mengfeidu/embspatial-bench
Official

Datasets

Phineas476/EmbSpatial-Bench
dataset · 103 downloads

Taxonomy

Topics: Multimodal Machine Learning Applications · Semantic Web and Ontologies