LLaRA: Supercharging Robot Learning Data for Vision-Language Policy

Xiang Li; Cristina Mata; Jongwoo Park; Kumara Kahatapitiya; Yoo Sung; Jang; Jinghuan Shang; Kanchana Ranasinghe; Ryan Burgert; Mu Cai; Yong Jae; Lee; Michael S. Ryoo

arXiv:2406.20095·cs.RO·January 31, 2025·2 cites

LLaRA: Supercharging Robot Learning Data for Vision-Language Policy

Xiang Li, Cristina Mata, Jongwoo Park, Kumara Kahatapitiya, Yoo Sung, Jang, Jinghuan Shang, Kanchana Ranasinghe, Ryan Burgert, Mu Cai, Yong Jae, Lee, Michael S. Ryoo

PDF

Open Access 1 Repo 5 Models 1 Video 3 Reviews

TL;DR

LLaRA introduces a framework that adapts pretrained vision-language models for robotic control by generating conversational instruction data from existing datasets and enhancing it with self-supervised tasks, enabling effective transfer and state-of-the-art performance.

Contribution

The paper presents a novel method to fine-tune vision-language models for robotics using automated data generation and self-supervised auxiliary tasks, improving robotic action decision-making.

Findings

01

LLaRA achieves state-of-the-art results on multiple robotic tasks.

02

The approach maintains the generalization capabilities of large language models.

03

Efficient transfer from vision-language models to robotic control is demonstrated.

Abstract

Vision Language Models (VLMs) have recently been leveraged to generate robotic actions, forming Vision-Language-Action (VLA) models. However, directly adapting a pretrained VLM for robotic control remains challenging, particularly when constrained by a limited number of robot demonstrations. In this work, we introduce LLaRA: Large Language and Robotics Assistant, a framework that formulates robot action policy as visuo-textual conversations and enables an efficient transfer of a pretrained VLM into a powerful VLA, motivated by the success of visual instruction tuning in Computer Vision. First, we present an automated pipeline to generate conversation-style instruction tuning data for robots from existing behavior cloning datasets, aligning robotic actions with image pixel coordinates. Further, we enhance this dataset in a self-supervised manner by defining six auxiliary tasks, without…

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01Rating 6Confidence 4

Strengths

The pursuit of robotics-oriented pretext tasks for training large-capacity open source models remains quite compelling.

Weaknesses

Section 2, Section 6: The manuscript introduces relevant contemporary approaches — some with strong inductive biases — but the experimental comparisons with the method proposed by the manuscript remain shallow. I would like to see some direct comparisons with existing VLAs in the experiments section, in addition to ablations on, e.g., different action space representations (e.g., versus RoboPoint). Section 2 (L146-148): The manuscript states, "Moreover, all the aforementioned studies lack of co

Reviewer 02Rating 6Confidence 4

Strengths

- The paper reformulates BC data into an image-text format compatible with VLMs. This is not a new idea (RT-2), but they do introduce a new output space and auxiliary objectives. - Restricting output to 2D coordinates sets the model apart from prior 3D-focused models like RT-2, aligning well with tasks where 3D positioning is unnecessary. Various curated datasets support 2D grounding and action prediction, aiding spatial and relational understanding. - The model outperforms versions of VIMA an

Weaknesses

- It was not directly clear to me which components contribute to performance in the RT-2 and VIMA comparison. RT-2 also uses auxiliary tasks and internet data; it's unclear how the proposed auxiliary tasks compares. An ablation of the auxiliary tasks used in RT-2 versus LLaRA would help isolate the contribution of each component. - The contributions of each dataset are unclear; a more interpretable format is needed to highlight key influences on performance.

Reviewer 03Rating 6Confidence 4

Strengths

They present these results on both a simulated and physical benchmark. The paper also includes a substantial set of results in the appendix.

Weaknesses

I'm concerned about the generality of the work, based in large part on the inconclusive trends presented. I'll provide a series of questions below but my primary concern is that it's unclear on training. It appears that training either has no effect or hurts performance in most cases. There's some improvement when two epochs are run (in some conditions, but not all). Similarly, how much data and when/where/why it helps are unclear. This doesn't detract from the fact that a nice system was d

Code & Models

Repositories

lostxine/llara
pytorchOfficial

Models

Videos

LLaRA: Supercharging Robot Learning Data for Vision-Language Policy· slideslive

Taxonomy

TopicsMultimodal Machine Learning Applications