ReinforceGen: Hybrid Skill Policies with Automated Data Generation and Reinforcement Learning

Zihan Zhou; Animesh Garg; Ajay Mandlekar; Caelan Garrett

arXiv:2512.16861·cs.RO·December 19, 2025

Open Access · 3 Reviews

TL;DR

ReinforceGen introduces a hybrid approach combining task decomposition, imitation learning, and reinforcement learning to improve long-horizon robotic manipulation tasks, achieving high success rates with minimal demonstrations.

Contribution

The paper presents ReinforceGen, a novel system that integrates automated data generation, task segmentation, and reinforcement learning for enhanced robotic skill acquisition.

Findings

01

Reaches an 80% success rate on all Robosuite tasks with visuomotor control in the highest reset-range setting.

02

Ablation studies attribute an 89% average performance improvement to fine-tuning.

03

Effective combination of imitation and reinforcement learning.

Abstract

Long-horizon manipulation has been a long-standing challenge in the robotics community. We propose ReinforceGen, a system that combines task decomposition, data generation, imitation learning, and motion planning to form an initial solution, and improves each component through reinforcement-learning-based fine-tuning. ReinforceGen first segments the task into multiple localized skills, which are connected through motion planning. The skills and motion planning targets are trained with imitation learning on a dataset generated from 10 human demonstrations, and then fine-tuned through online adaptation and reinforcement learning. When benchmarked on the Robosuite dataset, ReinforceGen reaches an 80% success rate on all tasks with visuomotor controls in the highest reset range setting. Additional ablation studies show that our fine-tuning approaches contribute to an 89% average performance…
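The structure described in the abstract, localized skills connected by motion planning, can be pictured as a simple control loop. The sketch below is a toy illustration only and is not the authors' implementation: the 1-D `State`, the `Skill` fields, `motion_plan`, and `run_hybrid_policy` are all hypothetical names, and the straight-line planner stands in for a real motion planner.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

State = float  # toy 1-D "end-effector position" (assumption, not a real robot state)


@dataclass
class Skill:
    """One localized skill segment; every field is a hypothetical stand-in."""
    name: str
    predict_init_pose: Callable[[State], State]  # where the skill should begin
    policy: Callable[[State], float]             # closed-loop action (a position delta)
    should_terminate: Callable[[State], bool]    # termination check for the skill


def motion_plan(start: State, goal: State, step: float = 0.5) -> List[State]:
    """Toy planner: evenly spaced straight-line waypoints ending exactly at the goal."""
    path, s = [], start
    while abs(goal - s) > step:
        s += step if goal > s else -step
        path.append(s)
    path.append(goal)
    return path


def run_hybrid_policy(
    state: State, skills: List[Skill], max_skill_steps: int = 50
) -> Tuple[State, List[Tuple[str, str, State]]]:
    """Alternate a planned 'connect' segment with a learned skill segment."""
    log = []
    for skill in skills:
        # Connect segment: plan to the predicted initiation pose and follow the path.
        target = skill.predict_init_pose(state)
        for waypoint in motion_plan(state, target):
            state = waypoint
        log.append((skill.name, "connected", state))
        # Skill segment: run the closed-loop policy until the termination check fires.
        for _ in range(max_skill_steps):
            if skill.should_terminate(state):
                break
            state = state + skill.policy(state)
        log.append((skill.name, "finished", state))
    return state, log
```

In this toy setup, the initiation-pose predictor, the skill policy, and the termination check are plain callables; in a system like the one described, each would be a learned component that fine-tuning could improve independently.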

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 3

Strengths

- RL fine-tuning is a significant problem for MimicGen-style robotic manipulation policies.
- Dividing the long-horizon task into motion-planning stages and RL stages effectively reduces the burden of RL exploration.
- Experimental results demonstrate its effectiveness compared with the previous HSP method.

Weaknesses

- The RL pipeline only works in simulation, since privileged states are required to train the initial pose predictors and termination classifiers. Given this, is it really necessary to train these modules? A much simpler approach may also work: using ground-truth motion-planning poses and ground-truth success detectors for RL in simulation, then distilling all modules into an end-to-end policy.
- Another simple and direct approach is not compared against: apply MimicGen, BC, and residual RL for ea

Reviewer 02 · Rating 4 · Confidence 4

Strengths

1. Termination Classification: The authors introduce a learned termination classifier that minimizes the gap between training and deployment by rejecting low-confidence terminations, thereby reducing train–test mismatch during execution.
2. Initiation Pose Prediction: The initiation pose predictor is continuously updated during the connection segment and can trigger replanning when necessary, which substantially reduces pose error and improves task success rates.
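The two mechanisms this reviewer highlights can be illustrated with a minimal sketch. It assumes the termination classifier outputs a probability and that replanning is triggered when a refreshed pose prediction drifts from the planned target; the function names, the `0.9` threshold, and the `0.02` tolerance are hypothetical and are not taken from the paper.

```python
import math
from typing import Sequence


def accept_termination(p_done: float, threshold: float = 0.9) -> bool:
    """Accept a skill termination only when the classifier is confident enough.

    `p_done` is assumed to be the classifier's predicted probability that the
    skill has finished; lower-confidence predictions are rejected so the skill
    keeps running.
    """
    return p_done >= threshold


def needs_replan(
    planned_target: Sequence[float],
    new_prediction: Sequence[float],
    tol: float = 0.02,
) -> bool:
    """Trigger replanning when the refreshed initiation-pose prediction has
    moved more than `tol` (Euclidean distance) from the current plan's target."""
    return math.dist(planned_target, new_prediction) > tol
```

For example, under these assumed values a termination predicted with probability 0.6 would be rejected, and a prediction that shifts 5 cm from the planned target would trigger a new plan.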

Weaknesses

1. Rationale for Imitation Learning vs. Pure RL: It is unclear why imitation learning (IL) is necessary. How would a purely reinforcement-learning pipeline perform under the same training budget and environment settings? Does IL primarily improve computational efficiency (sample/compute efficiency) or final success rate—and by how much?
2. Gap to Stronger Oracles and Data Regimes: Although the method improves over a vanilla HSP baseline, it still underperforms settings that use privileged state

Reviewer 03 · Rating 2 · Confidence 4

Strengths

The method addresses a real problem in robot learning: learning to solve a task from a very small number of expert demonstrations. Learning is divided into offline and online parts. This makes sense, since the online part can reduce compounding error (in the same spirit as DAgger).

Weaknesses

**Major:**
- The work claims to contribute to robot learning, but no experiments with real robots are reported; everything is simulated.
- The work is a good piece of engineering in which parts from existing works are combined and evaluated, but it lacks scientific research questions or contributions.
- The results table does not include a comparison to existing works, even though related work has been published on the same tasks.

**Moderate:**
- You claim that during t

Taxonomy

Topics: Robot Manipulation and Learning · Reinforcement Learning in Robotics · Soft Robotics and Applications