TL;DR
RAD introduces a retrieval and generative modeling approach to improve offline reinforcement learning by dynamically retrieving high-quality states and planning towards them, enhancing generalization and decision-making in complex environments.
Contribution
The paper presents a novel method combining retrieval and diffusion-based generative modeling to improve trajectory stitching and generalization in offline RL.
Findings
RAD matches or outperforms baselines on most D4RL datasets (MuJoCo, AntMaze, Kitchen, Maze2D).
Retrieval-guided generation enhances decision-making in OOD states.
The approach improves long-horizon planning in offline RL.
Abstract
Offline reinforcement learning (RL) enables agents to learn policies from fixed datasets, avoiding costly or unsafe environment interactions. However, its effectiveness is often limited by dataset sparsity and the lack of transition overlap between suboptimal and expert trajectories, which makes long-horizon planning particularly challenging. Prior solutions based on synthetic data augmentation or trajectory stitching often fail to generalize to novel states and rely on heuristic stitching points. To address these challenges, we propose Retrieval High-quAlity Demonstrations (RAD) for decision-making, which combines non-parametric retrieval with diffusion-based generative modeling. RAD dynamically retrieves high-return states from the offline dataset as target states based on state similarity and return estimation, and plans toward them using a condition-guided diffusion model. Such…
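The retrieval step can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the rule described in the reviews on this page (filter dataset states by cosine similarity and high return, then pick the candidate with the longest remaining trajectory length). The function name `retrieve_target`, the `sim_threshold` and `return_quantile` parameters, and the array layout are all hypothetical.

```python
import numpy as np

def retrieve_target(query, states, returns, remaining,
                    sim_threshold=0.9, return_quantile=0.75):
    """Return the index of a target state for `query`, or None.

    Hypothetical sketch: `states` is (N, d), `returns` and `remaining`
    are (N,) arrays of return estimates and remaining trajectory lengths.
    """
    # Cosine similarity between the query state and every dataset state.
    norms = np.linalg.norm(states, axis=1) * np.linalg.norm(query)
    sims = states @ query / np.maximum(norms, 1e-8)

    # Keep states that are both similar to the query and high-return
    # (here: at or above an assumed quantile of the dataset returns).
    cutoff = np.quantile(returns, return_quantile)
    mask = (sims >= sim_threshold) & (returns >= cutoff)
    if not mask.any():
        return None

    # Among the candidates, pick the one with the longest remaining length.
    candidates = np.flatnonzero(mask)
    return int(candidates[np.argmax(remaining[candidates])])

# Toy data: four 2-D states with made-up returns and remaining lengths.
states = np.array([[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [1.0, 0.05]])
returns = np.array([1.0, 10.0, 20.0, 9.0])
remaining = np.array([5, 3, 50, 8])

target = retrieve_target(np.array([1.0, 0.0]), states, returns, remaining,
                         return_quantile=0.25)
```

In a full planner the selected state would then condition the diffusion model. Note that nothing in this filter checks reachability, which is the concern one reviewer raises.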
Peer Reviews
Decision·Submitted to ICLR 2026
1. Integrating state retrieval into offline RL is conceptually appealing, and the paper provides a clear motivation for using retrieved high-return states as adaptive guidance for policy improvement.
2. The experiments cover a wide range of D4RL tasks (MuJoCo, AntMaze, Kitchen, Maze2D) with solid baselines including model-free, model-based, and diffusion-based methods. RAD demonstrates competitive or superior performance on most datasets.
1. Several sections contain minor grammatical errors and redundant phrasing (e.g., “novelly integrates,” “makinga decision”). Figures could be improved for clarity and caption detail.
2. The distribution-shift test (training on Medium-Replay, testing with Random starts) is limited to three environments; more systematic OOD tests would strengthen claims.
3. Although the retrieval mechanism is new, the overall architecture largely reuses existing components from Diffuser/DiffuserLite, and the retr
The proposed RAD method combines target state retrieval with diffusion-based planning in offline RL. While trajectory stitching and diffusion planners exist, RAD’s idea of adaptive target retrieval at inference time (instead of static, offline augmentation) is a meaningful design point, appealing in sparse-reward or long-horizon domains (e.g., AntMaze), where “latching onto” good sub-goals helps escape low-value regions. The “target-then-plan” decomposition (TS → ES → PL) is conceptually clear,
The idea is straightforward, but lacks theoretical support. Firstly, “reachability” is asserted, not guaranteed. TS currently filters by cosine similarity and high return, then picks the candidate with the longest remaining length; there is no principled guarantee that the target is reachable without collisions/obstacles under the learned dynamics—especially salient in mazes or any environment with an inconsistent transition model (e.g., a wall separating the two near states exists and has not b
This work has reasonable novelty. It does not directly propose a diffusion model; instead it builds on other works' models and adds a method for guiding these models to create the training data. The experimental evidence is reasonable: the authors compare to many baselines on a reasonable set of experiments. I would really like to see 95% confidence intervals here (as in Table 1), since without them it is harder to distinguish great results from good ones. The actual results on the tasks perf
In general, I feel the writing could be clearer, and I think we are missing some information that I consider critical. I will put my questions in the question section, but I am not confident I understand how your method actually runs. Paragraph near 155: this paragraph is pretty hard to read and understand. What does "transit" mean here? Do you mean a trajectory from s_t to s_t^g? The limitations of this method are not properly addressed. I'll put specific questions in the question part a
