Interactive Post-Training for Vision-Language-Action Models

Shuhan Tan; Kairan Dou; Yue Zhao; Philipp Krähenbühl

arXiv:2505.17016·cs.LG·May 23, 2025

TL;DR

RIPT-VLA is a reinforcement-learning-based post-training method that significantly improves vision-language-action models' success rates with minimal supervision and demonstrates strong generalization and robustness.

Contribution

It introduces a scalable, interactive post-training paradigm for VLA models that enhances performance with sparse rewards and minimal data, outperforming existing methods.

Findings

01

Achieved a 97.5% success rate with the 7B OpenVLA-OFT model

02

Improved QueST model success by 21.2%

03

Enabled an SFT model with a 4% success rate to reach 97% success within 15 iterations

Abstract

We introduce RIPT-VLA, a simple and scalable reinforcement-learning-based interactive post-training paradigm that fine-tunes pretrained Vision-Language-Action (VLA) models using only sparse binary success rewards. Existing VLA training pipelines rely heavily on offline expert demonstration data and supervised imitation, limiting their ability to adapt to new tasks and environments under low-data regimes. RIPT-VLA addresses this by enabling interactive post-training with a stable policy optimization algorithm based on dynamic rollout sampling and leave-one-out advantage estimation. RIPT-VLA has the following characteristics. First, it applies to various VLA models, resulting in an improvement on the lightweight QueST model by 21.2%, and the 7B OpenVLA-OFT model to an unprecedented 97.5% success rate. Second, it is computationally efficient and data-efficient: with only one…
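The two ingredients named in the abstract, leave-one-out advantage estimation and dynamic rollout sampling, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that each context (task plus initial state) is rolled out several times, that each rollout receives a binary success reward, and that "dynamic sampling" means skipping contexts whose rollouts all received the same reward, since their leave-one-out advantages are all zero. The function names and the example task names are hypothetical.

```python
from typing import Dict, List


def leave_one_out_advantages(rewards: List[float]) -> List[float]:
    """Advantage of each rollout: its reward minus the mean reward of the
    other rollouts sampled from the same context (no learned critic)."""
    k = len(rewards)
    if k < 2:
        raise ValueError("need at least two rollouts per context")
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]


def keep_informative_contexts(
    groups: Dict[str, List[float]]
) -> Dict[str, List[float]]:
    """Assumed form of dynamic sampling: drop contexts whose rollouts all
    succeeded or all failed, because every advantage there is zero and the
    context would contribute no gradient signal."""
    return {
        ctx: leave_one_out_advantages(rs)
        for ctx, rs in groups.items()
        if len(set(rs)) > 1
    }


# Hypothetical binary success rewards for four rollouts per context.
rollout_rewards = {
    "task_a": [1.0, 0.0, 0.0, 1.0],  # mixed outcomes: kept
    "task_b": [1.0, 1.0, 1.0, 1.0],  # always succeeds: skipped
    "task_c": [0.0, 0.0, 0.0, 0.0],  # always fails: skipped
}
advantages = keep_informative_contexts(rollout_rewards)
print(advantages)
```

In this example only `task_a` survives the filter. Its successful rollouts get an advantage of 2/3 and its failed rollouts get -2/3. In a full training loop, skipped contexts would typically be replaced by freshly sampled ones until the batch is full.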

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 4

Strengths

• Clear, simple recipe for interactive post‑training after SFT. Algorithm and the dynamic sampling heuristic are easy to implement. Section 4 and Algorithm 1 are well written.
• Some gains in the areas that matter most for data scarcity. On LIBERO‑Long and few‑shot settings, the improvements over SFT are moderate for Stage 2 models, and the “1‑demo to workable policy” result is an interesting result if generalizable. Certainly gains are shown, though only one (seemingly easy) example was show

Weaknesses

• Binary reward is probably too constraining for many real settings, and it would perhaps be interesting to use a learned reward signal.
• External validity. All results are in simulators, with LIBERO and ML45, and the paper does not demonstrate the method on a physical robot or discuss how robust success detectors would be implemented in hardware. For example, how might you replay an exact context? This would need to include the environment state, so you'd have to do environmental resets, which se

Reviewer 02 · Rating 4 · Confidence 4

Strengths

The main strength of this paper is showing that a simple, critic-free RL algorithm can effectively fine-tune VLA models from sparse binary rewards. This is a practical and stable approach, and the results convincingly show its ability to boost performance, especially when expert demonstrations are scarce.

Weaknesses

1. The main contribution of this paper is the dynamic sampling strategy, which, while effective, is more of an engineering improvement than a fundamentally new RL algorithm. The novelty lies more in the successful application and adaptation of this critic-free paradigm to the VLA domain, rather than in inventing the RL method itself.
2. All experiments are conducted in simulation. While this is standard practice, policies trained with RL can be sensitive to the sim-to-real gap. The paper claims

Reviewer 03 · Rating 2 · Confidence 4

Strengths

This paper presents a simple yet efficient training recipe for VLA models, including RLOO [1] and dynamic sampling [2].

References:
[1] Buy 4 REINFORCE Samples, Get a Baseline for Free!
[2] DAPO: An Open-Source LLM Reinforcement Learning System at Scale.

Weaknesses

1. The main concern is the lack of novelty. Dynamic sampling is commonly used in LLMs, for example in DAPO [1], which is not cited properly.
2. The citation of RLOO is wrong: it should be [1], but the paper uses [2].
3. For VLA models, experiments on real robots should be considered.

References:
[1] DAPO: An Open-Source LLM Reinforcement Learning System at Scale.
[2] Buy 4 REINFORCE Samples, Get a Baseline for Free!
[3] Attention, Learn to Solve Routing Problems!

Code & Models

Models

tanshh97/RIPT_VLA

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Reinforcement Learning in Robotics

Methods: Shrink and Fine-Tune