Taming OOD Actions for Offline Reinforcement Learning: An Advantage-Based Approach
Xuyang Chen, Keyu Yan, Wenhan Cao, Lin Zhao

TL;DR
This paper introduces ADAC, a novel offline RL method that evaluates OOD actions using advantage-like functions, improving policy generalization and outperforming existing methods on benchmarks.
Contribution
The paper proposes Advantage-based Diffusion Actor-Critic (ADAC), a new approach that discriminatively evaluates OOD actions via advantage functions, enhancing offline RL performance.
Findings
ADAC achieves state-of-the-art results on D4RL benchmarks.
Advantage modulation effectively distinguishes superior and inferior OOD actions.
The method shows strong performance on challenging offline RL tasks.
Abstract
Offline reinforcement learning (RL) learns policies from fixed datasets without online interactions, but suffers from distribution shift, causing inaccurate evaluation and overestimation of out-of-distribution (OOD) actions. Existing methods counter this by conservatively discouraging all OOD actions, which limits generalization. We propose Advantage-based Diffusion Actor-Critic (ADAC), which evaluates OOD actions via an advantage-like function and uses it to modulate the Q-function update discriminatively. Our key insight is that the (state) value function is generally learned more reliably than the action-value function; we thus use the next-state value to indirectly assess each action. We develop a PointMaze environment to clearly visualize that advantage modulation effectively selects superior OOD actions while discouraging inferior ones. Moreover, extensive experiments on the D4RL…
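The idea of scoring an action by the value of the state it leads to can be illustrated with a small sketch. This is not the paper's implementation: the exact advantage definition is not reproduced on this page, so the κ-quantile baseline over behavior actions follows the reviewers' description, and the names `advantage_like`, `dynamics_fn`, `value_fn`, and the toy 1-D environment are all hypothetical.

```python
import numpy as np


def advantage_like(state, action, behavior_actions, dynamics_fn, value_fn, kappa=0.7):
    """Score an action by V(s') relative to a kappa-quantile baseline.

    Assumed form (illustrative only):
        A(s, a) = V(f(s, a)) - quantile_kappa{ V(f(s, a_b)) : a_b from behavior policy }
    where f is a learned dynamics model and V a learned state-value function.
    """
    # Value of the next state reached by the candidate (possibly OOD) action.
    next_value = value_fn(dynamics_fn(state, action))
    # Values of the next states reached by actions sampled from the behavior policy.
    baseline_values = np.array(
        [value_fn(dynamics_fn(state, a_b)) for a_b in behavior_actions]
    )
    baseline = np.quantile(baseline_values, kappa)
    return float(next_value - baseline)


# Hypothetical 1-D toy problem: the goal sits at x = 1.0.
def toy_dynamics(state, action):
    return state + action


def toy_value(state):
    # Stand-in for a learned V: negative distance to the goal.
    return -abs(1.0 - state)


# Behavior data only contains small steps toward the goal.
behavior_actions = np.array([0.0, 0.1, 0.2, 0.3, 0.4])

good = advantage_like(0.0, 0.8, behavior_actions, toy_dynamics, toy_value)
bad = advantage_like(0.0, -0.5, behavior_actions, toy_dynamics, toy_value)
print(f"OOD action +0.8: advantage {good:+.2f}")
print(f"OOD action -0.5: advantage {bad:+.2f}")
```

In this toy setting both candidate actions lie outside the behavior data, yet the one that moves toward the goal receives a positive score and the one that moves away receives a negative score. A sign of this kind is what would let a Q-update treat OOD actions discriminatively rather than penalizing all of them uniformly.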
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
- Paper is well-written and easy to read.
- A novel formulation of an OOD filter. Defining advantage via next-state value relative to a κ-quantile of behavior actions is simple, tunable, and aligns with the claim that V is often more reliable than Q in offline data. The analysis that expectile regression moves V toward a dataset-optimal value is helpful context.
- Compelling qualitative evidence. The PointMaze visualization clearly shows ADAC stitching suboptimal trajectories and discovering st
- Missing prior work: earlier work already links OOD action selection to “optimal next-state value.”
  - One example is POR (Policy-Guided Offline RL) [1], which trains a guide policy toward optimal next states and uses that signal to permit OOD generalization. Please cite and compare against POR.
- Accuracy of A(s,a) hinges on policy and model constraints. Although A(s,a) is intended to promote good OOD actions, its reliability depends on (i) the transition model producing realistic s’ for actions sampled from \p
- Good writing makes the paper easy to follow.
- Simple and effective idea.
- Various experiments to support the effectiveness of ADAC.
- The advantage computation in Eq. 9 seems ambiguous. The standard advantage is defined as $A(s,a) = Q(s,a)-V(s)$, which evaluates action $a$ relative to the other actions available in state $s$. The advantage in ADAC, however, is computed from the next state's value $V(s')$, which is quite different. If the authors want to demonstrate the effectiveness of this way of computing the advantage, an ablation study on it should be conducted.
- More recent offline RL frameworks should be
The paper is written in a clear and fluent manner, making it highly accessible. Its logic is solid, with rigorous and straightforward theoretical analyses. The experimental results are relatively comprehensive and demonstrate significant effectiveness.
1. The baselines provided in this paper are relatively outdated, as they all focus on works published before 2023.
2. One of the paper’s innovations lies in introducing a new calculation method for the reward function. However, the experiments **lack a comparison** between this new reward function and classical ones, making it unclear which part contributes to the improved algorithm performance.
3. The algorithm learns the value function, Q-function, dynamics, advantage function, and policy netw
> Comprehensive experimental tasks. The authors evaluate the proposed method on a wide range of benchmark tasks, and the empirical section includes substantial experimental data.
> Visualization. The PointMaze visualization clearly illustrates how ADAC distinguishes between good and bad OOD actions, providing intuitive insight into the method’s behavior.
> Clear motivation. The paper provides a reasonable motivation for selectively addressing OOD actions rather than uniformly penalizing them.
> Incomplete related work analysis. The discussion of recent related work is not sufficiently comprehensive. In particular, the paper overlooks recent studies on OOD detection, OOD state/action correction in offline RL. The relationship between ADAC and those works should be analyzed in more depth to clarify novelty.
> Unsubstantiated claim: “state value functions are easier to learn.” The statement that state-value functions are easier to learn than action-value functions is asserted without
* This paper finds that the (state) value function is generally learned more reliably than the Q-value function. It uses the next-state value to assess each action indirectly.
* The experiments show that ADAC achieves SOTA performance.
* ADAC needs to sample multiple actions from the behavior policy, which may increase the computational burden.
* Why is the V-function generally learned more reliably than the action-value function?
Taxonomy
Topics: Reinforcement Learning in Robotics · Domain Adaptation and Few-Shot Learning · Adversarial Robustness in Machine Learning
Methods: Diffusion
