RL Token: Bootstrapping Online RL with Vision-Language-Action Models

Charles Xu; Jost Tobias Springenberg; Michael Equi; Ali Amin; Adnan Esmail; Sergey Levine; Liyiming Ke

arXiv:2604.23073 · cs.LG · May 4, 2026

TL;DR

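This paper presents RLT, a method for sample-efficient online reinforcement learning (RL) fine-tuning of pretrained vision-language-action (VLA) models. The VLA exposes an "RL token," a compact readout representation, and a small actor-critic head trained on that token improves real-robot task performance within a few hours of practice.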

Contribution

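The RL token interface: a compact readout representation exposed by the pretrained VLA, on which a small actor-critic head is trained. This makes online RL fine-tuning of even large VLAs fast and sample-efficient on real-world robotic tasks.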

Findings

01  RLT improves task speed by up to 3x on the hardest parts of the tasks.

02  Success rates increase significantly within minutes to hours of practice.

03  RLT can outperform human teleoperation on some tasks.

Abstract

Vision-language-action (VLA) models can learn to perform diverse manipulation skills "out of the box," but achieving the precision and speed that real-world tasks demand requires further fine-tuning, for example via reinforcement learning (RL). We introduce a lightweight method that enables sample-efficient online RL fine-tuning of pretrained VLAs using just a few hours of real-world practice. We (1) adapt the VLA to expose an "RL token," a compact readout representation that preserves task-relevant pretrained knowledge while serving as an efficient interface for online RL, and (2) train a small actor-critic head on this RL token to refine the actions, while anchoring the learned policy to the VLA. Online RL with the RL token (RLT) makes it possible to fine-tune even large VLAs with RL quickly and efficiently. Across four real-robot tasks (screw installation, zip tie fastening,…
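
The two steps named in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify the architecture or the loss, so everything below is an assumption made for illustration. In particular, the linear actor and critic, the residual form of the action refinement, the squared-distance anchoring term, and the names `TOKEN_DIM`, `ACTION_DIM`, `max_delta`, and `beta` are all hypothetical. The RL token and the VLA's proposed action are stood in for by random arrays.

```python
import numpy as np

TOKEN_DIM, ACTION_DIM = 8, 4  # hypothetical sizes, not from the paper


def init_linear(rng, in_dim, out_dim, scale):
    """Parameters of one linear layer (a stand-in for a small head)."""
    return {
        "W": scale * rng.standard_normal((in_dim, out_dim)),
        "b": np.zeros(out_dim),
    }


def actor(params, rl_token, base_action, max_delta=0.1):
    """Refine the VLA's action with a bounded correction (assumed residual form)."""
    delta = max_delta * np.tanh(rl_token @ params["W"] + params["b"])
    return base_action + delta


def critic(params, rl_token, action):
    """Score a (token, action) pair with a scalar value estimate."""
    x = np.concatenate([rl_token, action], axis=-1)
    return (x @ params["W"] + params["b"]).squeeze(-1)


def anchored_actor_loss(q_values, action, base_action, beta=1.0):
    """Maximize the critic's value while penalizing drift from the VLA action.

    The squared-distance penalty is an assumed form of "anchoring".
    """
    anchor = np.mean(np.sum((action - base_action) ** 2, axis=-1))
    return -np.mean(q_values) + beta * anchor


rng = np.random.default_rng(0)
# Zero-scale actor weights: the head starts out reproducing the VLA action.
actor_params = init_linear(rng, TOKEN_DIM, ACTION_DIM, scale=0.0)
critic_params = init_linear(rng, TOKEN_DIM + ACTION_DIM, 1, scale=0.1)

rl_token = rng.standard_normal((2, TOKEN_DIM))      # placeholder for the VLA readout
base_action = rng.standard_normal((2, ACTION_DIM))  # placeholder for the VLA's action

action = actor(actor_params, rl_token, base_action)
q_values = critic(critic_params, rl_token, action)
loss = anchored_actor_loss(q_values, action, base_action)
```

In this sketch the heads only see the compact token rather than the full VLA activations, which is one way a small head could be trained quickly. The zero-initialized actor starts from the pretrained behavior, and the `beta` term discourages it from drifting far from that behavior during online updates.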
