TL;DR
This paper introduces PANI, a simple yet effective offline RL method that improves policy learning by injecting and penalizing noise in actions, inspired by diffusion models, leading to better performance on benchmarks.
Contribution
The paper proposes PANI, a novel noise injection technique for offline RL that enhances generalization without high computational costs, supported by a theoretical framework.
Findings
Significant performance improvements on multiple benchmarks.
Compatibility with various offline RL algorithms.
Theoretical validation via the noisy action MDP.
Abstract
Offline reinforcement learning (RL) optimizes a policy using only a fixed dataset, making it a practical approach in scenarios where interaction with the environment is costly. Due to this limitation, generalization ability is key to improving the performance of offline RL algorithms, as demonstrated by recent successes of offline RL with diffusion models. However, it remains questionable whether such diffusion models are necessary for highly performing offline RL algorithms, given their significant computational requirements during inference. In this paper, we propose Penalized Action Noise Injection (PANI), a method that simply enhances offline learning by utilizing noise-injected actions to cover the entire action space, while penalizing according to the amount of noise injected. This approach is inspired by how diffusion models have worked in offline RL algorithms. We provide a…
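The core idea in the abstract, perturbing dataset actions and penalizing the learning target according to how much noise was injected, can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the function name `penalized_noisy_targets`, the Gaussian noise, the squared-distance penalty, and the parameters `noise_scale` and `penalty_weight` are all assumptions chosen for illustration, and the paper's actual noise distribution and penalty may differ.

```python
import numpy as np


def penalized_noisy_targets(actions, td_targets, noise_scale, penalty_weight,
                            rng, low=-1.0, high=1.0):
    """Hypothetical sketch: build noisy actions and penalized Q-targets.

    actions:    (batch, action_dim) dataset actions
    td_targets: (batch,) ordinary TD targets computed for the dataset actions
    Returns (noisy_actions, penalized_targets).
    """
    # Assumption: isotropic Gaussian noise; the paper may use another distribution.
    noise = noise_scale * rng.standard_normal(actions.shape)
    # Keep perturbed actions inside the (assumed) bounded action space.
    noisy_actions = np.clip(actions + noise, low, high)
    # Assumption: penalty grows with the squared distance actually moved.
    sq_dist = np.sum((noisy_actions - actions) ** 2, axis=-1)
    penalized_targets = td_targets - penalty_weight * sq_dist
    return noisy_actions, penalized_targets


rng = np.random.default_rng(0)
actions = rng.uniform(-1.0, 1.0, size=(4, 2))
td_targets = np.ones(4)
noisy_actions, targets = penalized_noisy_targets(
    actions, td_targets, noise_scale=0.5, penalty_weight=1.0, rng=rng
)
```

In a sketch like this, a critic would be regressed toward `targets` at `noisy_actions` rather than only at the dataset actions, so actions far from the data receive lower target values without any diffusion model at inference time.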
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
* The OOD issue is a classic topic in offline RL, and it is appreciated that the authors consider it from a new perspective. PANI is compatible with other methods, and this merit is quite appealing.
* It is also appreciated that the authors formalize the PANI framework with the Noisy Action MDP and propose the concept of a hybrid noise distribution, which builds the theoretical foundation for the entire methodology.
* The reflected flow noise generator can produce complex m
1. Some related references are missing; it is suggested that the authors discuss the following related work in the manuscript.
   * https://arxiv.org/abs/2202.06239
   * https://arxiv.org/abs/1911.11361
   * https://ieeexplore.ieee.org/document/10432784
   * https://arxiv.org/abs/2301.12130
2. From the experiments, it seems that TD3-AN always performs better than IQL-AN. Is there any further explanation for this phenomenon? On the AntMaze benchmark, IQL-AN seems inferior to most alternatives; it is suggested that the
The paper clearly explains the intuition and methodology of the proposed algorithm. Empirical evaluations on effectiveness and ablation studies are provided. The introduction of the Noisy Action MDP provides a principled explanation for why penalized noisy updates improve robustness to OOD actions. The method is straightforward, general, and easy to integrate into existing offline RL algorithms, requiring minimal modification. Adjusting the penalty according to the distance between the noisy a
When the dataset covers a very narrow action distribution while the action space is large, the injected noise may still fail to expose the Q-function to sufficiently diverse actions, or take extra learning time to sample a sufficient number of actions to represent the OOD action space. While the authors partly address this by using hybrid noise distributions, a more detailed discussion on how PANI behaves with highly concentrated expert datasets would strengthen the paper. The paper could bene
1. The paper is easy to follow.
2. The theoretical proof is sound, and experiments are strong in some way.
3. The algorithm is simple. It is a lightweight, "drop-in" modification requiring only minimal changes to the standard Q-update step of existing algorithms.
1. Practical applicability to real-world cases is limited. In some critical scenarios, a small perturbation to actions can cause catastrophic failure.
2. In Table 1, PANI shows significant gains on older diffusion-free methods such as TD3 and IQL, but the comparison with QGPO is limited, especially on the challenging AntMaze tasks. Could the authors apply PANI to more advanced algorithms to demonstrate improvements over QGPO?
3. For Table 3, I am confused why the
## Strengths
- This paper is easy to read and easy to follow.
- This paper conducts experiments on numerous datasets and tasks, including D4RL locomotion datasets, Adroit datasets, AntMaze datasets, and OGBench datasets.
- This paper includes detailed learning curves in the Appendix, which can help readers understand the hyperparameter sensitivity of the proposed method.
## Weaknesses
- No code is attached, so it is not clear whether the results reported in the main text and the appendix are reproducible. The authors include an anonymous code link in Appendix C, but it does not work. Hence, it is difficult to judge the effectiveness of the AN method.
- There are numerous incomplete sentences in this paper, e.g.,
  - Line 40, *Although these methods achieve strong empirical results.*
  - Line 322, *Q2. Is PANI computationally more efficient than diffusion-based m
