Horizon Imagination: Efficient On-Policy Rollout in Diffusion World Models

Lior Cohen; Ofir Nabati; Kaixin Wang; Navdeep Kumar; Shie Mannor

arXiv:2602.08032·cs.LG·February 18, 2026


Open Access · 3 Reviews

TL;DR

Horizon Imagination is an on-policy imagination process for diffusion-based world models in reinforcement learning. It denoises multiple future observations in parallel, which reduces computational cost while maintaining control performance.

Contribution

The paper presents Horizon Imagination, a novel on-policy diffusion process with stabilization and a new sampling schedule, enabling efficient parallel denoising in world models.

Findings

01 · Maintains control performance with half the denoising steps.

02 · Achieves superior generation quality under varied schedules.

03 · Reduces computational costs in diffusion world models.

Abstract

We study diffusion-based world models for reinforcement learning, which offer high generative fidelity but face critical efficiency challenges in control. Current methods either require heavyweight models at inference or rely on highly sequential imagination, both of which impose prohibitive computational costs. We propose Horizon Imagination (HI), an on-policy imagination process for discrete stochastic policies that denoises multiple future observations in parallel. HI incorporates a stabilization mechanism and a novel sampling schedule that decouples the denoising budget from the effective horizon over which denoising is applied while also supporting sub-frame budgets. Experiments on Atari 100K and Craftium show that our approach maintains control performance with a sub-frame budget of half the denoising steps and achieves superior generation quality under varied schedules. Code is…

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 2

Strengths

The proposed horizon schedule is a neat design: by fixing ν while varying B, it breaks the tight coupling seen in pyramidal schedules, allowing consistent temporal denoising behavior across budgets and enabling sub-frame B < h operation. The stable action sampling a(π,ω) for discrete policies is elegant and theoretically justified: action changes between denoising steps are bounded below by total variation distance and above by a derived l_1 term, greatly reducing unnecessary flips during den

Weaknesses

The proposed method, especially the action sampling, applies only to discrete action spaces. This limits applicability to many continuous-control tasks. The paper would benefit from a per-stage runtime analysis and a real-time control throughput (fps) comparison to demonstrate the improvement. It would also help to connect the theoretical bound to returns: for example, does a reduced action-flip rate correlate with improved advantage estimates or policy gradient variance?
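The action-flip behaviour discussed in this review can be illustrated with a small experiment. The sketch below is not the paper's construction of a(π, ω); it assumes one common way to couple samples across denoising steps, namely inverse-CDF sampling with a fixed uniform ω, and the names `stable_action`, `total_variation`, and `flip_rates` as well as the two example policies are hypothetical. It relies on one standard fact: under any coupling, the probability that two samples differ is at least the total variation distance between the two distributions.

```python
import bisect
import random
from itertools import accumulate


def stable_action(probs, omega):
    """Map a fixed uniform omega in [0, 1) to an action index via the inverse CDF.

    Assumption: this shared-noise coupling stands in for a(pi, omega);
    the paper's actual sampler may differ.
    """
    cdf = list(accumulate(probs))
    # First index whose cumulative mass exceeds omega; clamp against rounding.
    return min(bisect.bisect_right(cdf, omega), len(probs) - 1)


def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def flip_rates(p, q, n=20000, seed=0):
    """Estimate how often the sampled action changes between policies p and q."""
    rng = random.Random(seed)
    shared = independent = 0
    for _ in range(n):
        omega = rng.random()
        # Shared noise: the same omega is reused for both policies.
        shared += stable_action(p, omega) != stable_action(q, omega)
        # Independent noise: each policy gets a fresh draw.
        independent += stable_action(p, rng.random()) != stable_action(q, rng.random())
    return shared / n, independent / n


# Hypothetical policies at two consecutive denoising steps.
pi_before = [0.50, 0.30, 0.20]
pi_after = [0.45, 0.33, 0.22]

shared_rate, independent_rate = flip_rates(pi_before, pi_after)
tv = total_variation(pi_before, pi_after)
print(f"TV distance:            {tv:.3f}")
print(f"flip rate, shared ω:    {shared_rate:.3f}")
print(f"flip rate, independent: {independent_rate:.3f}")
```

With shared noise the action changes only when ω falls between the two cumulative distributions, so the flip rate stays close to the total variation distance, while independent resampling flips the action far more often even though the policy has barely moved. Whether a lower flip rate translates into better advantage estimates or returns, as the reviewer asks, is not something this sketch addresses.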

Reviewer 02 · Rating 4 · Confidence 2

Strengths

- The paper tackles an interesting and important problem in world models for policy learning. I think the proposed method is quite important for the RL field.

Weaknesses

- I think the presentation of the paper can be improved.
- Please also see my questions.

Reviewer 03 · Rating 6 · Confidence 2

Strengths

1. The diffusion-based world model studied in this paper is a rapidly growing research area with significant value in both offline data generation and online policy learning. It holds great promise for substantially reducing the cost of real-world interactions.
2. The authors effectively resolve the instability that occurs when diffusion models and policies interact to jointly sample multi-step trajectories by introducing a theoretically grounded stable action sampling method. This approach sig

Weaknesses

1. The stable action sampling algorithm proposed by the authors is only applicable to discrete action spaces, which limits the applicability of the Horizon Imagination framework in more general environments with continuous action spaces.
2. A key weakness is the limited scope of the experimental comparisons. The control performance results in Section 5.2 are structured as an internal ablation study, comparing the proposed parallel method only against an autoregressive baseline within their own


Taxonomy

Topics: Reinforcement Learning in Robotics · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning