Task-oriented Robotic Manipulation with Vision Language Models
Nurhan Bulus Guran, Hanchi Ren, Jingjing Deng, Xianghua Xie

TL;DR
This paper introduces a novel framework combining Vision Language Models and structured spatial reasoning to improve robotic manipulation by better understanding spatial relationships and object attributes.
Contribution
The work presents a new integration of VLMs with a spatial reasoning pipeline and a dataset with annotated spatial and attribute information, advancing robot understanding of complex scenes.
Findings
Enhanced spatial relationship comprehension in robots
Improved object manipulation accuracy
First method integrating VLMs with structured spatial reasoning
Abstract
Vision Language Models (VLMs) play a crucial role in robotic manipulation by enabling robots to understand and interpret the visual properties of objects and their surroundings, allowing them to perform manipulation based on this multimodal understanding. Accurately understanding spatial relationships remains a non-trivial challenge, yet it is essential for effective robotic manipulation. In this work, we introduce a novel framework that integrates VLMs with a structured spatial reasoning pipeline to perform object manipulation based on high-level, task-oriented input. Our approach transforms visual scenes into tree-structured representations that encode the spatial relations between objects. These trees are subsequently processed by a Large Language Model (LLM) to infer restructured configurations that determine how the objects should be organised for a given high-level task. To…
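The tree-structured scene representation described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `SceneNode` class, the `serialise` helper, the relation labels ("on"), and the example objects are all hypothetical, chosen only to show how spatial relations might be encoded as a tree and rendered as text for an LLM.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SceneNode:
    """Hypothetical scene object; `relation` says how it sits relative to its parent."""
    name: str
    relation: Optional[str] = None  # e.g. "on"; None for the root node
    children: List["SceneNode"] = field(default_factory=list)

    def add(self, child: "SceneNode") -> "SceneNode":
        """Attach a child object and return it so calls can be chained."""
        self.children.append(child)
        return child


def serialise(node: SceneNode, depth: int = 0) -> str:
    """Render the tree as indented text suitable for inclusion in an LLM prompt."""
    label = node.name if node.relation is None else f"{node.relation} -> {node.name}"
    lines = ["  " * depth + label]
    for child in node.children:
        lines.append(serialise(child, depth + 1))
    return "\n".join(lines)


# Illustrative scene (assumed, not from the paper): an apple on a plate, and a cup, on a table.
table = SceneNode("table")
plate = table.add(SceneNode("plate", relation="on"))
plate.add(SceneNode("apple", relation="on"))
table.add(SceneNode("cup", relation="on"))

scene_text = serialise(table)
print(scene_text)
```

In a pipeline of the kind the abstract outlines, text like `scene_text` would be passed to an LLM together with the high-level task, and the LLM's answer would be parsed back into a restructured tree. That step is omitted here because the abstract does not specify the prompt or output format.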
Taxonomy
Topics: Robotic Path Planning Algorithms · Robotics and Automated Systems · Robot Manipulation and Learning
