ControlVLA: Few-shot Object-centric Adaptation for Pre-trained Vision-Language-Action Models

Puhao Li; Yingying Wu; Ziheng Xi; Wanlin Li; Yuzhe Huang; Zhiyuan Zhang; Yinghan Chen; Jianan Wang; Song-Chun Zhu; Tengyu Liu; Siyuan Huang

arXiv:2506.16211·cs.RO·June 23, 2025

TL;DR

ControlVLA introduces a novel few-shot learning framework that adapts pre-trained vision-language-action models to object-centric robotic manipulation tasks with minimal demonstrations, outperforming traditional methods.

Contribution

The paper proposes zero-initialized projection layers that introduce object-centric conditions into a pre-trained vision-language-action policy, allowing efficient fine-tuning without overwriting prior knowledge.

Findings

01 Achieves 76.7% success rate with only 10-20 demonstrations.

02 Outperforms traditional methods requiring over 100 demonstrations.

03 Extensible to long-horizon tasks and robust to unseen objects.

Abstract

Learning real-world robotic manipulation is challenging, particularly when limited demonstrations are available. Existing methods for few-shot manipulation often rely on simulation-augmented data or pre-built modules like grasping and pose estimation, which struggle with sim-to-real gaps and lack extensibility. While large-scale imitation pre-training shows promise, adapting these general-purpose policies to specific tasks in data-scarce settings remains unexplored. To address this, we propose ControlVLA, a novel framework that bridges pre-trained VLA models with object-centric representations via a ControlNet-style architecture for efficient fine-tuning. Specifically, to introduce object-centric conditions without overwriting prior knowledge, ControlVLA zero-initializes a set of projection layers, allowing them to gradually adapt the pre-trained manipulation policies. In real-world…
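The zero-initialization idea in the abstract can be illustrated with a minimal sketch. It is not the authors' architecture: the additive residual form, the class `ZeroInitProjection`, the helper `conditioned_features`, the dimensions, and the squared-error loss are all assumptions chosen for illustration. It only shows why a zero-initialized projection leaves the pre-trained output untouched at the start of fine-tuning and then adapts gradually.

```python
import random


class ZeroInitProjection:
    """Hypothetical linear projection whose weights start at exactly zero."""

    def __init__(self, in_dim, out_dim):
        self.W = [[0.0] * in_dim for _ in range(out_dim)]

    def __call__(self, cond):
        # Plain matrix-vector product W @ cond.
        return [sum(w * c for w, c in zip(row, cond)) for row in self.W]

    def sgd_step(self, cond, grad_out, lr):
        # For out = base + W @ cond, dL/dW[i][j] = grad_out[i] * cond[j].
        for i, g in enumerate(grad_out):
            for j, c in enumerate(cond):
                self.W[i][j] -= lr * g * c


def conditioned_features(base_feat, cond, proj):
    """Add the projected condition to frozen base features (assumed form)."""
    return [b + p for b, p in zip(base_feat, proj(cond))]


def squared_error(out, target):
    return 0.5 * sum((o - t) ** 2 for o, t in zip(out, target))


rng = random.Random(0)
base_feat = [rng.uniform(-1, 1) for _ in range(4)]  # stand-in for frozen pre-trained features
obj_cond = [rng.uniform(-1, 1) for _ in range(3)]   # stand-in for object-centric features
target = [rng.uniform(-1, 1) for _ in range(4)]     # stand-in for a demonstration target

proj = ZeroInitProjection(in_dim=3, out_dim=4)

# At initialization the condition branch contributes nothing,
# so the output equals the pre-trained features.
out_before = conditioned_features(base_feat, obj_cond, proj)
loss_before = squared_error(out_before, target)

# One gradient step on the projection only; the base features stay frozen.
grad_out = [o - t for o, t in zip(out_before, target)]
proj.sgd_step(obj_cond, grad_out, lr=0.1)

out_after = conditioned_features(base_feat, obj_cond, proj)
loss_after = squared_error(out_after, target)
```

Under these assumptions, only the projection weights change during adaptation, so the starting point of fine-tuning is exactly the pre-trained behavior rather than a randomly perturbed version of it.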

Taxonomy

Topics: Robot Manipulation and Learning · Reinforcement Learning in Robotics · Multimodal Machine Learning Applications