RAD: Retrieval High-quality Demonstrations to Enhance Decision-making

Lu Guo; Yixiang Shan; Zhengbang Zhu; Qifan Liang; Lichang Song; Ting Long; Weinan Zhang; Yi Chang

arXiv:2507.15356·cs.AI·July 22, 2025

3 Reviews

TL;DR

RAD introduces a retrieval and generative modeling approach to improve offline reinforcement learning by dynamically retrieving high-quality states and planning towards them, enhancing generalization and decision-making in complex environments.

Contribution

The paper presents a novel method combining retrieval and diffusion-based generative modeling to improve trajectory stitching and generalization in offline RL.

Findings

01. RAD outperforms baselines across multiple benchmarks.

02. Retrieval-guided generation enhances decision-making in OOD states.

03. The approach improves long-horizon planning in offline RL.

Abstract

Offline reinforcement learning (RL) enables agents to learn policies from fixed datasets, avoiding costly or unsafe environment interactions. However, its effectiveness is often limited by dataset sparsity and the lack of transition overlap between suboptimal and expert trajectories, which makes long-horizon planning particularly challenging. Prior solutions based on synthetic data augmentation or trajectory stitching often fail to generalize to novel states and rely on heuristic stitching points. To address these challenges, we propose Retrieval High-quAlity Demonstrations (RAD) for decision-making, which combines non-parametric retrieval with diffusion-based generative modeling. RAD dynamically retrieves high-return states from the offline dataset as target states based on state similarity and return estimation, and plans toward them using a condition-guided diffusion model. Such…
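
The retrieval step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes cosine similarity as the state-similarity measure and a longest-remaining-length tie-break, following Reviewer 02's summary of the method further down this page. The function name `retrieve_target_state`, the thresholds `sim_threshold` and `return_quantile`, and the array layout are hypothetical choices made for illustration.

```python
import numpy as np

def retrieve_target_state(query, states, returns, remaining,
                          sim_threshold=0.9, return_quantile=0.8):
    """Return the index of a candidate target state, or None if none qualifies.

    query:     (d,) current state
    states:    (n, d) states from the offline dataset
    returns:   (n,) return estimate attached to each dataset state
    remaining: (n,) number of steps left in the trajectory after each state
    All thresholds are illustrative assumptions, not values from the paper.
    """
    # Cosine similarity between the query and every dataset state.
    q = query / (np.linalg.norm(query) + 1e-8)
    s = states / (np.linalg.norm(states, axis=1, keepdims=True) + 1e-8)
    sims = s @ q

    # Keep states that are both similar to the query and high-return.
    return_cutoff = np.quantile(returns, return_quantile)
    candidates = np.flatnonzero((sims >= sim_threshold) & (returns >= return_cutoff))
    if candidates.size == 0:
        return None

    # Among the survivors, prefer the one with the longest remaining trajectory.
    return int(candidates[np.argmax(remaining[candidates])])
```

In a full pipeline, the selected state would then serve as the condition for the diffusion planner. That part is omitted here because this page does not specify it.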

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 6 · Confidence 3

Strengths

1. Integrating state retrieval into offline RL is conceptually appealing, and the paper provides a clear motivation for using retrieved high-return states as adaptive guidance for policy improvement.

2. The experiments cover a wide range of D4RL tasks (MuJoCo, AntMaze, Kitchen, Maze2D) with solid baselines including model-free, model-based, and diffusion-based methods. RAD demonstrates competitive or superior performance on most datasets.

Weaknesses

1. Several sections contain minor grammatical errors and redundant phrasing (e.g., “novelly integrates,” “makinga decision”). Figures could be improved for clarity and caption detail.

2. The distribution-shift test (training on Medium-Replay, testing with Random starts) is limited to three environments; more systematic OOD tests would strengthen claims.

3. Although the retrieval mechanism is new, the overall architecture largely reuses existing components from Diffuser/DiffuserLite, and the retr

Reviewer 02 · Rating 4 · Confidence 3

Strengths

The proposed RAD method combines target state retrieval with diffusion-based planning in offline RL. While trajectory stitching and diffusion planners exist, RAD’s idea of adaptive target retrieval at inference time (instead of static, offline augmentation) is a meaningful design point, appealing in sparse-reward or long-horizon domains (e.g., AntMaze), where “latching onto” good sub-goals helps escape low-value regions. The “target-then-plan” decomposition (TS → ES → PL) is conceptually clear,

Weaknesses

The idea is straightforward, but lacks theoretical support. Firstly, “reachability” is asserted, not guaranteed. TS currently filters by cosine similarity and high return, then picks the candidate with the longest remaining length; there is no principled guarantee that the target is reachable without collisions/obstacles under the learned dynamics—especially salient in mazes or any environment with an inconsistent transition model (e.g., a wall separating the two near states exists and has not b

Reviewer 03 · Rating 2 · Confidence 4

Strengths

This work has reasonable novelty. It does not directly propose a diffusion model but instead builds upon other works' models and adds a method for guiding these models to create the training data. The experimental evidence is reasonable. They compare to many baselines on a reasonable set of experiments. I would really like to see 95% confidence intervals here (like in Table 1), as without them it is harder to distinguish great results from good ones. The actual results on the tasks perf

Weaknesses

In general, I feel the writing could be clearer, and I think we are missing some information that I feel is critical. I will put my questions in the question section, but I don't feel confident I understand how your method actually runs. Paragraph near 155 - This paragraph is pretty hard to read/understand; what does "transit" mean here? Do you mean a trajectory from s_t to s_t^g? The limitations of this method are not properly addressed. I'll put specific questions in the question part a
