Human-in-the-loop Online Rejection Sampling for Robotic Manipulation
Guanxing Lu, Rui Zhao, Haitao Lin, He Zhang, Yansong Tang

TL;DR
This paper introduces Hi-ORS, a rejection sampling-based method that enhances the stability and robustness of robotic manipulation policies by combining online fine-tuning with human-in-the-loop corrections, achieving rapid and effective learning.
Contribution
The paper presents Hi-ORS, a novel post-training approach that stabilizes value estimation and incorporates human feedback for efficient robotic manipulation policy fine-tuning.
Findings
Hi-ORS fine-tunes policies in 1.5 hours of real-world training.
Hi-ORS outperforms RL and IL baselines in both effectiveness and efficiency.
Policies exhibit strong error-recovery behaviors at test time.
Abstract
Reinforcement learning (RL) is widely used to produce robust robotic manipulation policies, but fine-tuning vision-language-action (VLA) models with RL can be unstable due to inaccurate value estimates and sparse supervision at intermediate steps. In contrast, imitation learning (IL) is easy to train but often underperforms due to its offline nature. In this paper, we propose Hi-ORS, a simple yet effective post-training method that utilizes rejection sampling to achieve both training stability and high robustness. Hi-ORS stabilizes value estimation by filtering out negatively rewarded samples during online fine-tuning, and adopts a reward-weighted supervised training objective to provide dense intermediate-step supervision. For systematic study, we develop an asynchronous inference-training framework that supports flexible online human-in-the-loop corrections, which serve as explicit…
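The two ingredients named in the abstract, rejecting negatively rewarded samples and training on the rest with a reward-weighted supervised objective, can be illustrated with a minimal sketch. This is not the authors' implementation: the episode layout (a dict with a scalar "reward" and per-step "log_probs" of the logged actions under the current policy), the acceptance threshold, and the function names filter_episodes and reward_weighted_nll are all hypothetical choices made for illustration, and the exact form of the paper's objective is not given in the text above.

```python
from typing import Dict, List


def filter_episodes(episodes: List[Dict], threshold: float = 0.0) -> List[Dict]:
    """Rejection step: keep only episodes whose reward exceeds the threshold.

    Assumption: each episode carries one scalar reward, and a reward at or
    below `threshold` counts as a rejected (negatively rewarded) sample.
    """
    return [ep for ep in episodes if ep["reward"] > threshold]


def reward_weighted_nll(episodes: List[Dict]) -> float:
    """Reward-weighted negative log-likelihood, averaged over all steps.

    Every step of an accepted episode contributes a supervised term, so the
    policy receives a training signal at intermediate steps rather than only
    at the end of the episode.
    """
    total_loss = 0.0
    total_steps = 0
    for ep in episodes:
        for log_prob in ep["log_probs"]:
            total_loss += ep["reward"] * (-log_prob)
            total_steps += 1
    # With no accepted samples there is nothing to train on.
    return total_loss / total_steps if total_steps else 0.0
```

In this sketch, filtering happens before any loss is computed, so rejected rollouts never influence the update and no learned value function is required. A human correction could be treated as one more accepted episode under the same layout, although how Hi-ORS actually incorporates corrections is not specified in the truncated abstract.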
