Accelerating Diffusion Planners in Offline RL via Reward-Aware Consistency Trajectory Distillation
Xintong Duan, Yutong He, Fahim Tajwar, Ruslan Salakhutdinov, J. Zico Kolter, Jeff Schneider

TL;DR
This paper introduces a reward-aware consistency trajectory distillation method for offline reinforcement learning that significantly speeds up inference and improves reward outcomes compared to existing diffusion-based decision models.
Contribution
It presents a novel consistency distillation approach that incorporates reward signals directly, enabling single-step sampling and higher-reward trajectories in offline RL.
Findings
Achieves 9.7% higher rewards than previous state-of-the-art methods.
Offers up to 142x faster inference than diffusion models.
Demonstrates effectiveness on MuJoCo, FrankaKitchen, and long horizon planning benchmarks.
Abstract
Although diffusion models have achieved strong results in decision-making tasks, their slow inference speed remains a key limitation. While consistency models offer a potential solution, existing applications to decision-making either struggle with suboptimal demonstrations under behavior cloning or rely on complex concurrent training of multiple networks under the actor-critic framework. In this work, we propose a novel approach to consistency distillation for offline reinforcement learning that directly incorporates reward optimization into the distillation process. Our method achieves single-step sampling while generating higher-reward action trajectories through decoupled training and noise-free reward signals. Empirical evaluations on the Gym MuJoCo, FrankaKitchen, and long horizon planning benchmarks demonstrate that our approach can achieve a 9.7% improvement over previous…
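The abstract's core idea, adding a reward objective to the distillation loss, can be illustrated with a minimal sketch. This is not the paper's implementation: the exact loss is not given here, so the additive form, the `reward_weight` parameter, and the names `reward_aware_distillation_loss` and `toy_reward` are all assumptions made for illustration. The sketch assumes the student maps a noisy trajectory to a clean one in a single step, that a teacher-derived target is available, and that a separately trained, frozen reward model scores the student's clean output (one reading of "decoupled training" and "noise-free reward signals").

```python
from typing import Callable, Sequence

Trajectory = Sequence[float]


def mse(a: Trajectory, b: Trajectory) -> float:
    """Mean squared error between two equal-length trajectories."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def reward_aware_distillation_loss(
    student_out: Trajectory,
    teacher_target: Trajectory,
    reward_fn: Callable[[Trajectory], float],
    reward_weight: float = 1.0,  # hypothetical trade-off weight
) -> float:
    """Consistency term plus a reward term (assumed additive form)."""
    # Pull the one-step student output toward the teacher's target.
    consistency = mse(student_out, teacher_target)
    # Reward is scored on the student's clean output, so maximizing it
    # means minimizing its negative.
    reward_term = -reward_fn(student_out)
    return consistency + reward_weight * reward_term


def toy_reward(traj: Trajectory) -> float:
    """Hypothetical frozen reward model: highest when every entry is 1.0."""
    return -mse(traj, [1.0] * len(traj))


if __name__ == "__main__":
    teacher = [0.0, 0.0]
    # A student that copies the teacher versus one that moves toward reward.
    print(reward_aware_distillation_loss([0.0, 0.0], teacher, toy_reward, 0.5))
    print(reward_aware_distillation_loss([1.0, 1.0], teacher, toy_reward, 0.5))
```

In this toy setting the weight controls how far the student may drift from the teacher in exchange for reward: a small `reward_weight` favors copying the teacher, while a larger one favors the higher-reward trajectory.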
Peer Reviews
Decision: ICLR 2026 Poster
1. The paper introduces a novel application of consistency distillation, resulting in a single-step model that effectively emulates multi-step diffusion processes, thereby significantly improving inference efficiency.
2. The proposed RACTD demonstrates strong empirical performance, substantially outperforming prior state-of-the-art diffusion-based reinforcement learning methods, particularly on the D4RL benchmark.
3. The paper presents comprehensive experiments and detailed implementation desc
1. The idea of the paper is interesting, and the contribution of accelerating the sampling stage through consistency distillation is clear. However, the work appears to rely heavily on existing consistency distillation techniques, with limited novelty beyond their direct application to diffusion-based planning.
2. The contribution of the proposed reward-aware consistency trajectory distillation is somewhat unclear. The method appears to employ a standard reward model as an auxiliary loss applie
- Clear experimental scope and benchmarks: Evaluation on widely used D4RL Gym-MuJoCo and FrankaKitchen datasets enhances relevance and comparability, with both offline and online model selection reported where applicable.
- Focus on sampling efficiency: Explicit reporting of NFE alongside performance indicates attention to practical efficiency, which is crucial for diffusion-based methods.
- Limited novelty and contribution: The core idea is to augment the distillation process with a cumulative reward maximization objective. This training pipeline has appeared in prior work (e.g., Flow Q-Learning), and the paper does not clearly isolate what is fundamentally new beyond this template.
- Central claim lacks rigorous empirical validation: The paper emphasizes incorporating a reward objective directly into consistency distillation rather than optimizing via a critic (e.g., Q-values or
- The paper is very well written and was genuinely an enjoyable read. Most questions that arise while reading are quickly addressed by the text, and all of the necessary context to follow the technical discussion is included.
- The motivation is strong, with clear explanations of the limitations of prior work such as the slow inference speed of diffusion models and the sensitivity of actor-critic frameworks to hyperparameters.
- The introduction of decoupled training in Section 3.4 is a genuine
- Not a substantial weakness, but there is a typo between lines 202 and 203 (struggle vs. struggles).
- In Table 3, the paper introduces CTD as a comparison point without tying this acronym to a particular method and without comparing CTD's results to anything else in the main text (unless I am mistaken). Presumably this is Consistency Trajectory Distillation. Reading back through the paper, it appears that CTD corresponds to the authors’ approach without the reward aware component, that is, Cons
Taxonomy
Topics: Reinforcement Learning in Robotics · Domain Adaptation and Few-Shot Learning · Adversarial Robustness in Machine Learning
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Diffusion
