IN-RIL: Interleaved Reinforcement and Imitation Learning for Policy Fine-Tuning
Dechen Gao, Hang Wang, Hanchu Zhou, Nejib Ammar, Shatadal Mishra, Ahmadreza Moradipari, Iman Soltani, and Junshan Zhang

TL;DR
IN-RIL introduces an interleaved approach combining imitation learning and reinforcement learning for more stable, sample-efficient policy fine-tuning in robotics, outperforming traditional two-step methods.
Contribution
The paper proposes a novel interleaving method for IL and RL with gradient separation, improving stability and efficiency during policy fine-tuning.
Findings
Significantly improves sample efficiency in robot tasks.
Reduces performance collapse during online fine-tuning.
Achieves up to 6.3x success rate improvement.
Abstract
Imitation learning (IL) and reinforcement learning (RL) each offer distinct advantages for robotics policy learning: IL provides stable learning from demonstrations, and RL promotes generalization through exploration. While existing robot learning approaches using IL-based pre-training followed by RL-based fine-tuning are promising, this two-step learning paradigm often suffers from instability and poor sample efficiency during the RL fine-tuning phase. In this work, we introduce IN-RIL, INterleaved Reinforcement learning and Imitation Learning, for policy fine-tuning, which periodically injects IL updates after multiple RL updates and hence can benefit from the stability of IL and the guidance of expert data for more efficient exploration throughout the entire fine-tuning process. Since IL and RL involve different optimization objectives, we develop gradient separation mechanisms to…
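The interleaving schedule described in the abstract (one IL update injected after several RL updates) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the names `interleaved_schedule`, `fine_tune`, and `rl_per_il`, the plain gradient step, and the toy quadratic objectives are all assumptions made for the example. The paper's gradient separation mechanisms are not shown, because the abstract above is cut off before describing them.

```python
from typing import Callable, Iterator, List

Vector = List[float]
GradFn = Callable[[Vector], Vector]


def interleaved_schedule(total_updates: int, rl_per_il: int) -> Iterator[str]:
    """Yield 'rl' for rl_per_il consecutive updates, then one 'il', and repeat."""
    if rl_per_il < 1:
        raise ValueError("rl_per_il must be at least 1")
    for step in range(total_updates):
        # Every (rl_per_il + 1)-th update is an IL update.
        yield "il" if (step + 1) % (rl_per_il + 1) == 0 else "rl"


def sgd_step(theta: Vector, grad: Vector, lr: float) -> Vector:
    """One plain gradient-descent step (stand-in for any optimizer)."""
    return [t - lr * g for t, g in zip(theta, grad)]


def fine_tune(theta: Vector, rl_grad: GradFn, il_grad: GradFn,
              total_updates: int, rl_per_il: int, lr: float = 0.1) -> Vector:
    """Alternate RL and IL updates instead of summing the two losses."""
    for kind in interleaved_schedule(total_updates, rl_per_il):
        grad_fn = il_grad if kind == "il" else rl_grad
        theta = sgd_step(theta, grad_fn(theta), lr)
    return theta


# Hypothetical toy objectives: the RL loss is minimized at 1.0,
# the IL loss at 0.0. Real losses would come from rollouts and demonstrations.
def toy_rl_grad(theta: Vector) -> Vector:
    return [2.0 * (t - 1.0) for t in theta]


def toy_il_grad(theta: Vector) -> Vector:
    return [2.0 * t for t in theta]


if __name__ == "__main__":
    print(list(interleaved_schedule(6, 2)))
    print(fine_tune([0.0], toy_rl_grad, toy_il_grad, total_updates=12, rl_per_il=3))
```

In this sketch the ratio `rl_per_il` is fixed. The key design point is that each objective gets its own update step, in contrast to regularization-style methods that optimize a weighted sum of the IL and RL losses.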
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
Originality
- The paper presents a novel non-linear approach to combining IL and RL through alternating optimization rather than linear combination, which represents a significant departure from existing regularization-based methods.
- It provides rigorous convergence analysis with Theorems 1-2 establishing optimal interleaving ratios and iteration complexity bounds under reasonable assumptions.
Quality
- Evaluation of IN-RIL spans 14 tasks across three diverse benchmarks (manipulation and loco
Major
1. The theoretical analysis relies on strong assumptions that lack empirical validation. Specifically, Assumption 1 (gradient relationship) is never measured during training, and it remains unclear when this assumption holds in practice.
2. Additionally, Theorem 1 suggests adaptive interleaving ratios based on gradient alignment, yet all experiments use fixed ratios without justification for why the simpler approach works despite theoretical recommendations. This gap between theory and p
* The concept of interleaving IL and RL updates, combined with the introduction of "gradient separation mechanisms," is a novel and interesting approach to a well-known problem. It presents a more sophisticated way to combine these learning paradigms than prior methods.
* A significant strength of IN-RIL is its design as a general "plug-in" that is compatible with various state-of-the-art RL algorithms, both on-policy and off-policy. This makes the method broadly applicable and potentially v
* The term "gradient separation mechanisms" is introduced as a key contribution but may be unfamiliar to many readers. Could the authors provide a more intuitive explanation of this concept? What does it mean in practice to separate the gradients, and how is it implemented (e.g., via gradient surgery or network separation)? A simple illustrative example would be very helpful.
* The setup described in Section 2, which covers the pre-training and fine-tuning phases, is largely standard. The pa
- The problem of combining IL and RL for sample-efficient and stable policy fine-tuning is very important and highly relevant to the robotics and RL communities.
- The authors provide a theoretical analysis to motivate the interleaving approach and clearly state their assumptions.
- IN-RIL is demonstrated to improve the performance of different underlying RL algorithms (e.g., DPPO, IDQL) across a variety of benchmarks.
- **Incomplete and Concerning Statistical Reporting:** This is a major weakness.
  - Many results reported in Tables 1 and 2 have a variability measure of ± 0.00. Several learning curves in the figures are missing error bands (e.g., Fig. 4, Walker2D for IN-RIL; Fig. 5, Mug Rack, One Leg Med). Some curves appear to be incomplete (e.g., Fig. 5, Lamp Low for the RL-only baseline). Finally, the number of seeds used and the meaning of the variability measure do not appear to be stated.
Theoretical analysis: The theoretical analysis is sound with appropriate assumptions, and the paper includes both theoretical and empirical results.
Task selection: The task selection makes sense and covers tasks of varying difficulty.
Novelty: The idea of doing phases of imitation learning and reinforcement learning is not new (https://arxiv.org/abs/2206.12030). While the theoretical angle is useful, it is hard to claim novelty.
Results: Looking at the results for most of the environments, it is hard to tell visually that the method performs better than RL only. There is no mention of the number of seeds the runs are averaged over or of error bars. There is also a lack of recent state-of-the-art baselines for RL methods and imitation learning.
Taxonomy
Topics: Reinforcement Learning in Robotics
