LLaRA: Supercharging Robot Learning Data for Vision-Language Policy
Xiang Li, Cristina Mata, Jongwoo Park, Kumara Kahatapitiya, Yoo Sung, Jang, Jinghuan Shang, Kanchana Ranasinghe, Ryan Burgert, Mu Cai, Yong Jae, Lee, Michael S. Ryoo

TL;DR
LLaRA introduces a framework that adapts pretrained vision-language models for robotic control by generating conversational instruction data from existing datasets and enhancing it with self-supervised tasks, enabling effective transfer and state-of-the-art performance.
Contribution
The paper presents a novel method to fine-tune vision-language models for robotics using automated data generation and self-supervised auxiliary tasks, improving robotic action decision-making.
Findings
LLaRA achieves state-of-the-art results on multiple robotic tasks.
The approach maintains the generalization capabilities of large language models.
Efficient transfer from vision-language models to robotic control is demonstrated.
Abstract
Vision Language Models (VLMs) have recently been leveraged to generate robotic actions, forming Vision-Language-Action (VLA) models. However, directly adapting a pretrained VLM for robotic control remains challenging, particularly when constrained by a limited number of robot demonstrations. In this work, we introduce LLaRA: Large Language and Robotics Assistant, a framework that formulates robot action policy as visuo-textual conversations and enables an efficient transfer of a pretrained VLM into a powerful VLA, motivated by the success of visual instruction tuning in Computer Vision. First, we present an automated pipeline to generate conversation-style instruction tuning data for robots from existing behavior cloning datasets, aligning robotic actions with image pixel coordinates. Further, we enhance this dataset in a self-supervised manner by defining six auxiliary tasks, without…
Peer Reviews
Decision·ICLR 2025 Poster
The pursuit of robotics-oriented pretext tasks for training large-capacity open source models remains quite compelling.
Section 2, Section 6: The manuscript introduces relevant contemporary approaches — some with strong inductive biases — but the experimental comparisons with the method proposed by the manuscript remain shallow. I would like to see some direct comparisons with existing VLAs in the experiments section, in addition to ablations on, e.g., different action space representations (e.g., versus RoboPoint). Section 2 (L146-148): The manuscript states, "Moreover, all the aforementioned studies lack of co
- The paper reformulates BC data into an image-text format compatible with VLMs. This is not a new idea (RT-2), but they do introduce a new output space and auxiliary objectives. - Restricting output to 2D coordinates sets the model apart from prior 3D-focused models like RT-2, aligning well with tasks where 3D positioning is unnecessary. Various curated datasets support 2D grounding and action prediction, aiding spatial and relational understanding. - The model outperforms versions of VIMA an
- It was not directly clear to me which components contribute to performance in the RT-2 and VIMA comparison. RT-2 also uses auxiliary tasks and internet data; it's unclear how the proposed auxiliary tasks compares. An ablation of the auxiliary tasks used in RT-2 versus LLaRA would help isolate the contribution of each component. - The contributions of each dataset are unclear; a more interpretable format is needed to highlight key influences on performance.
They present these results on both a simulated and physical benchmark. The paper also includes a substantial set of results in the appendix.
I'm concerned about the generality of the work, based in large part on the inconclusive trends presented. I'll provide a series of questions below but my primary concern is that it's unclear on training. It appears that training either has no effect or hurts performance in most cases. There's some improvement when two epochs are run (in some conditions, but not all). Similarly, how much data and when/where/why it helps are unclear. This doesn't detract from the fact that a nice system was d
Code & Models
- 🤗variante/llava-1.5-7b-llara-D-inBC-VIMA-80kmodel· 6 dl· ♡ 16 dl♡ 1
- 🤗variante/llava-1.5-7b-llara-D-inBC-Aux-D-VIMA-80kmodel· 5 dl· ♡ 15 dl♡ 1
- 🤗variante/llara-maskrcnnmodel· ♡ 1♡ 1
- 🤗variante/llava-1.5-7b-llara-D-inBC-Aux-B-VIMA-80kmodel· 9 dl· ♡ 29 dl♡ 2
- 🤗variante/llava-1.5-7b-llara-D-RT2-Style-VIMA-80kmodel· 1 dl1 dl
Videos
Taxonomy
TopicsMultimodal Machine Learning Applications
