Evaluating VLMs' Spatial Reasoning Over Robot Motion: A Step Towards Robot Planning with Motion Preferences
Wenxi Wu, Jingjing Zhang, Martim Brandão

TL;DR
This paper assesses the spatial reasoning abilities of state-of-the-art Vision-Language Models (VLMs) in robot motion planning, highlighting their potential and limitations in understanding user preferences and constraints in both zero-shot and fine-tuned settings.
Contribution
It provides a systematic evaluation of VLMs' spatial reasoning in robot motion tasks, introducing querying methods and analyzing performance trade-offs.
Findings
Qwen2.5-VL achieves 71.4% zero-shot accuracy
Fine-tuning improves accuracy to 75%
GPT-4o performs less effectively
Abstract
Understanding user instructions and object spatial relations in surrounding environments is crucial for intelligent robot systems to assist humans in various tasks. The natural language and spatial reasoning capabilities of Vision-Language Models (VLMs) have the potential to enhance the generalization of robot planners to new tasks, objects, and motion specifications. While foundation models have been applied to task planning, it remains unclear to what degree they have the spatial reasoning capability required to enforce user preferences or constraints on motion, such as desired distances from objects, topological properties, or motion style preferences. In this paper, we evaluate the capability of four state-of-the-art VLMs at spatial reasoning over robot motion, using four different querying methods. Our results show that, with the highest-performing querying method,…
Taxonomy
Topics: Multimodal Machine Learning Applications · Constraint Satisfaction and Optimization · Robotic Path Planning Algorithms
