T-Rex: Task-Adaptive Spatial Representation Extraction for Robotic Manipulation with Vision-Language Models
Yiteng Chen, Wenbo Li, Shiyi Wang, Huiping Zhuang, Qingyao Wu

TL;DR
T-Rex is a framework that adapts spatial representation extraction to each robotic manipulation task using vision-language models, improving spatial understanding and efficiency without additional training.
Contribution
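T-Rex introduces a task-adaptive scheme for spatial representation extraction. It addresses the limitations of the fixed extraction schemes used in existing VLM-based robotic manipulation, which yield either insufficient representational capability or excessive extraction time.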
Findings
Enhanced spatial understanding in real-world robots
Improved efficiency and stability without additional training
Effective adaptation to task complexity
Abstract
Building a general robotic manipulation system capable of performing a wide variety of tasks in real-world settings is challenging. Vision-Language Models (VLMs) have demonstrated remarkable potential in robotic manipulation tasks, primarily due to the extensive world knowledge they gain from large-scale datasets. In this process, Spatial Representations (such as points representing object positions or vectors representing object orientations) act as a bridge between VLMs and the real-world scene, effectively grounding the reasoning abilities of VLMs and applying them to specific task scenarios. However, existing VLM-based robotic approaches often adopt a fixed spatial representation extraction scheme for all tasks, resulting in either insufficient representational capability or excessive extraction time. In this work, we introduce T-Rex, a Task-Adaptive Framework for Spatial…
Taxonomy
Topics: Multimodal Machine Learning Applications · Robotics and Automated Systems · Robot Manipulation and Learning
