Physically Grounded Vision-Language Models for Robotic Manipulation

Jensen Gao; Bidipta Sarkar; Fei Xia; Ted Xiao; Jiajun Wu; Brian Ichter; Anirudha Majumdar; Dorsa Sadigh

arXiv:2309.02561 · cs.RO · March 5, 2024 · 2 cites

TL;DR

This paper introduces PhysObjects, a large dataset of physical object concepts, and demonstrates that fine-tuning vision-language models on this data enhances robotic manipulation by improving physical reasoning and task success rates.

Contribution

The paper presents PhysObjects, a new object-centric dataset of physical concept annotations for common household objects, and shows that VLMs fine-tuned on it reason better about physical object properties and improve robotic manipulation.

Findings

01

Fine-tuning VLMs on PhysObjects enhances physical concept understanding.

02

Physically grounded VLMs improve robotic task planning and task success rates.

03

The dataset and methods are publicly available for further research.

Abstract

Recent advances in vision-language models (VLMs) have led to improved performance on tasks such as visual question answering and image captioning. Consequently, these models are now well-positioned to reason about the physical world, particularly within domains such as robotic manipulation. However, current VLMs are limited in their understanding of the physical concepts (e.g., material, fragility) of common objects, which restricts their usefulness for robotic manipulation tasks that involve interaction and physical reasoning about such objects. To address this limitation, we propose PhysObjects, an object-centric dataset of 39.6K crowd-sourced and 417K automated physical concept annotations of common household objects. We demonstrate that fine-tuning a VLM on PhysObjects improves its understanding of physical object concepts, including generalization to held-out concepts, by capturing…

Code & Models

Models

bidiptas/PG-InstructBLIP (Hugging Face model, ♡ 17)

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling