Knowledge Insulating Vision-Language-Action Models: Train Fast, Run Fast, Generalize Better

Danny Driess; Jost Tobias Springenberg; Brian Ichter; Lili Yu; Adrian Li-Bell; Karl Pertsch; Allen Z. Ren; Homer Walke; Quan Vuong; Lucy Xiaoyang Shi; Sergey Levine

arXiv:2505.23705 · cs.LG · May 30, 2025

Open Access

TL;DR

This paper introduces a method to preserve the semantic knowledge of large vision-language models during training of vision-language-action systems, enabling faster training, real-time inference, and better generalization for robotic control.

Contribution

The paper proposes a knowledge insulation technique that preserves the pretrained VLM's knowledge during VLA training, improving training speed and control performance.

Findings

01 Naive inclusion of action experts harms training and knowledge transfer.

02 Knowledge insulation mitigates knowledge degradation during training.

03 The method enhances real-time control and generalization in robotic systems.

Abstract

Vision-language-action (VLA) models provide a powerful approach to training control policies for physical systems, such as robots, by combining end-to-end learning with transfer of semantic knowledge from web-scale vision-language model (VLM) training. However, the constraints of real-time control are often at odds with the design of VLMs: the most powerful VLMs have tens or hundreds of billions of parameters, presenting an obstacle to real-time inference, and operate on discrete tokens rather than the continuous-valued outputs that are required for controlling robots. To address this challenge, recent VLA models have used specialized modules for efficient continuous control, such as action experts or continuous output heads, which typically require adding new untrained parameters to the pretrained VLM backbone. While these modules improve real-time and control capabilities, it remains…

Videos

Knowledge Insulating Vision-Language-Action Models: Train Fast, Run Fast, Generalize Better · SlidesLive

Taxonomy

Topics: Multimodal Machine Learning Applications · Robot Manipulation and Learning · Domain Adaptation and Few-Shot Learning

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Diffusion