Reinforcement Learning via Implicit Imitation Guidance
Perry Dong, Alec M. Lessing, Annie S. Chen, Chelsea Finn

TL;DR
This paper introduces Data-Guided Noise (DGN), a novel reinforcement learning approach that uses prior demonstration data to guide exploration without explicit imitation constraints, significantly improving sample efficiency.
Contribution
The paper proposes DGN, a new method that leverages demonstration data solely for exploration guidance, avoiding the pitfalls of imitation learning objectives.
Findings
Achieves up to 2-3x improvement over prior methods
Effective in seven simulated continuous control tasks
Guides exploration without explicit imitation constraints
Abstract
We study the problem of sample efficient reinforcement learning, where prior data such as demonstrations are provided for initialization in lieu of a dense reward signal. A natural approach is to incorporate an imitation learning objective, either as regularization during training or to acquire a reference policy. However, imitation learning objectives can ultimately degrade long-term performance, as they do not directly align with reward maximization. In this work, we propose to use prior data solely for guiding exploration via noise added to the policy, sidestepping the need for explicit behavior cloning constraints. The key insight in our framework, Data-Guided Noise (DGN), is that demonstrations are most useful for identifying which actions should be explored, rather than forcing the policy to take certain actions. Our approach achieves up to 2-3x improvement over prior…
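The abstract's core idea, using demonstrations only to shape the exploration noise added to the policy, can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a single state-independent Gaussian fitted to the residuals between demonstration actions and the policy's mean actions, and the function names `fit_residual_noise` and `explore_action`, the `jitter` term, and the action bounds are all hypothetical choices made for illustration.

```python
import numpy as np


def fit_residual_noise(demo_actions, policy_means, jitter=1e-6):
    """Fit a Gaussian to residuals (demo action - policy mean action).

    Assumption: one state-independent Gaussian; a learned,
    state-conditioned noise model would replace this in practice.
    """
    residuals = demo_actions - policy_means
    mean = residuals.mean(axis=0)
    cov = np.atleast_2d(np.cov(residuals, rowvar=False))
    # Small diagonal jitter keeps the covariance positive definite.
    cov = cov + jitter * np.eye(cov.shape[0])
    return mean, cov


def explore_action(policy_mean, noise_mean, noise_cov, rng, low=-1.0, high=1.0):
    """Perturb the policy's mean action with demonstration-shaped noise."""
    noise = rng.multivariate_normal(noise_mean, noise_cov)
    # Clip to the (assumed) valid action range.
    return np.clip(policy_mean + noise, low, high)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic stand-ins for demonstration actions and policy outputs.
    demo_actions = rng.uniform(-1.0, 1.0, size=(256, 2))
    policy_means = np.zeros((256, 2))
    noise_mean, noise_cov = fit_residual_noise(demo_actions, policy_means)
    action = explore_action(np.zeros(2), noise_mean, noise_cov, rng)
    print(action)
```

In this sketch the demonstrations never enter the policy's loss; they only determine which directions in action space receive more exploration, which is the distinction from behavior cloning that the abstract draws.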
Peer Reviews
Decision: Submitted to ICLR 2026
* The core motivation is strong and well-articulated. The idea of using demonstrations to guide exploration without being rigidly constrained by them is a compelling direction for combining imitation and reinforcement learning.
* The paper's choice of baseline algorithms is comprehensive and appropriate. It compares against methods that represent key paradigms in the offline-to-online and imitation-augmented RL space (e.g., replay buffer initialization, explicit regularization, and reference
* The central mechanism of the paper lacks a clear theoretical or intuitive justification. It is not immediately obvious why the difference between the expert action and the current policy's mean action should define an optimal exploration distribution. While the results are strong, the paper would be more impactful if it provided more insight into *why* this specific formulation of exploration noise is so effective.
* The paper makes strong claims about the failure modes of imitation-based
- The paper proposes a novel method for leveraging demonstration data to guide RL exploration by shaping the noise distribution, rather than through an explicit imitation loss.
- The empirical evaluation is thorough. The method is compared against several relevant baselines (RLPD, RFT, IQL, IBRL), and the core components of the DGN framework are carefully validated through various ablation studies.
- **Contextualization w.r.t. related work**: The paper's discussion of combining Imitation Learning (IL) and Reinforcement Learning (RL) primarily focuses on two strategies: (1) combining IL/RL objectives and (2) using a separate IL policy to guide or propose actions. However, there are a couple of other approaches for guiding the exploration of the RL agent with demonstrations while avoiding the need to balance between explicit IL/RL objectives, or incurring state-action distribution shift issues
- The paper is easy to read and the idea is simple.
- The method outperforms the compared approaches, and the paper provides some ablation studies on a subset of design choices.
- Related work, combining imitation and reinforcement learning: I believe there is a line of research that uses hierarchical (inverse) RL [1-4] to better explore the environment. It would be nice if the paper included some discussion of these papers as well.
- Method: In section 4.1, it seems like the method will encourage the agent to explore indefinitely as the variance will be large asymptotically when the "expert" policy is suboptimal and the RL policy is optimal, or when there are multip
Taxonomy
Topics: Reinforcement Learning in Robotics · Robot Manipulation and Learning · Domain Adaptation and Few-Shot Learning
Methods: ALIGN
