Energy-Based Transfer for Reinforcement Learning
Zeyun Deng, Jasorsi Ghosh, Fiona Xie, Yuzhe Lu, Katia Sycara, Joseph Campbell

TL;DR
This paper introduces an energy-based transfer learning approach for reinforcement learning that uses out-of-distribution detection to improve sample efficiency by selectively guiding exploration based on the teacher's training distribution.
Contribution
It proposes a novel energy-based method that enables selective transfer in RL, addressing issues of sub-optimal guidance when tasks differ significantly.
Findings
Improved sample efficiency in RL tasks.
Enhanced performance in multi-task settings.
Energy scores reflect state-visitation density.
Abstract
Reinforcement learning algorithms often suffer from poor sample efficiency, making them challenging to apply in multi-task or continual learning settings. Efficiency can be improved by transferring knowledge from a previously trained teacher policy to guide exploration in new but related tasks. However, if the new task sufficiently differs from the teacher's training task, the transferred guidance may be sub-optimal and bias exploration toward low-reward behaviors. We propose an energy-based transfer learning method that uses out-of-distribution detection to selectively issue guidance, enabling the teacher to intervene only in states within its training distribution. We theoretically show that energy scores reflect the teacher's state-visitation density and empirically demonstrate improved sample efficiency and performance across both single-task and multi-task settings.
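The selective-guidance idea in the abstract can be illustrated with a minimal sketch. It assumes the common free-energy score from energy-based OOD detection, E(s) = -T · logsumexp(logits / T), computed on the teacher's action logits, and it assumes that guidance means substituting the teacher's greedy action. The function names, the sign convention (lower energy treated as in-distribution), and the threshold `tau` are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def energy_score(logits, temperature=1.0):
    """Free energy E(s) = -T * logsumexp(logits / T).

    Assumed convention: lower energy suggests the state lies closer to
    the teacher's training distribution.
    """
    z = np.asarray(logits, dtype=float) / temperature
    m = z.max(axis=-1, keepdims=True)  # subtract the max for numerical stability
    lse = m.squeeze(-1) + np.log(np.exp(z - m).sum(axis=-1))
    return -temperature * lse


def choose_action(student_action, teacher_logits, tau):
    """Hypothetical gate: follow the teacher only when its energy is at or below tau."""
    if energy_score(teacher_logits) <= tau:
        return int(np.argmax(teacher_logits))  # teacher's greedy action
    return student_action  # state looks out-of-distribution: student acts alone


# Confident teacher logits give low energy, so the teacher intervenes.
in_dist_action = choose_action(student_action=3, teacher_logits=[10.0, 0.0], tau=-5.0)
# Flat teacher logits give higher energy, so the student keeps its own action.
ood_action = choose_action(student_action=3, teacher_logits=[0.1, 0.0], tau=-5.0)
```

In practice the threshold would need to be calibrated, for example from energy scores on states the teacher visited during its own training; that calibration step is also an assumption here and is not described in the abstract.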
Peer Reviews
Decision · Submitted to ICLR 2026
1. Transfer learning in RL is an important and timely topic. 2. The proposed approach is simple yet effective, and conceptually easy to follow. 3. The paper is clearly written and well structured.
1. The central theoretical claim (Proposition 4.1) states that the logarithm of the stationary visitation density is proportional to the negative free energy $\phi(s)$. This relies on a very strong assumption, that is, the policy network optimized for reward maximization (e.g., PPO) implicitly forms a *realizable energy model* $p_{\theta^*}$ that perfectly fits the visitation distribution. In practice, the policy’s logits are trained for control, not density estimation, so equating them with a l
* This work tackles a good problem: in real-world settings, it has been shown that, with respect to a reward function, the teacher may be sub-optimal in parts of the task. * The authors discuss both same-task transfer and multi-task transfer, which allows insight into both perspectives, unlike other works which may focus on one or the other. * The performance seems to be good on two applicable grid-based environments and the authors clearly show that their method has advantage
Recently, methods specific to the RL domain have been introduced to detect whether a policy is out of distribution (see [1,2,3,4]). The energy function may not work in OOD scenarios, potentially in partially observable environments, and it might be necessary to take insight from [1,2,3,4] or discuss the expected changes from an unsupervised OOD method to a suitable RL OOD method. This may provide insight into the section "Higher covariate shift makes OOD detection more challenging". Although it is an
1. Originality: The paper introduces a novel energy-based mechanism to control knowledge transfer in RL. While energy models have been used in OOD detection, applying them to decide when to transfer across tasks is an original and creative idea. 2. Quality: The technical development is sound and theoretically grounded. The algorithm design is coherent with the underlying theory. Experiments are comprehensive, covering both single- and multi-task settings. The results consistently demonstrate th
1. Limited theoretical depth: The link between energy scores and teacher visitation density is only intuitively discussed, without formal guarantees on convergence or optimality. Providing stronger theoretical analysis (e.g., on transfer efficiency or sample complexity) would enhance rigor. 2. Incomplete ablation analysis: The effects of key components (energy threshold τ, decay schedule, regularization) are not fully disentangled. More systematic ablations would clarify each component’s contribut
Taxonomy
Topics: Reinforcement Learning in Robotics · Domain Adaptation and Few-Shot Learning · Stochastic Gradient Optimization Techniques
