Compositional Visual Planning via Inference-Time Diffusion Scaling
Yixin Zhang, Yunhao Luo, Utkarsh Aashu Mishra, Woo Chul Shin, Yongxin Chen, Danfei Xu

TL;DR
This paper introduces a training-free, diffusion-based compositional planning method for long-horizon robot tasks that enforces boundary agreement on Tweedie estimates, leading to more stable and globally consistent plans.
Contribution
It proposes enforcing boundary agreement on the Tweedie estimates of diffusion models at inference time, which enables stable long-horizon planning without additional training.
Findings
Significant improvements over existing methods.
Effective generalization to unseen start-goal pairs.
No additional training required for global consistency.
Abstract
Diffusion models excel at short-horizon robot planning, yet scaling them to long-horizon tasks remains challenging due to computational constraints and limited training data. Existing compositional approaches stitch together short segments by separately denoising each component and averaging overlapping regions. However, this suffers from instability as the factorization assumption breaks down in noisy data space, leading to inconsistent global plans. We propose that the key to stable compositional generation lies in enforcing boundary agreement on the estimated clean data (Tweedie estimates) rather than on noisy intermediate states. Our method formulates long-horizon planning as inference over a chain-structured factor graph of overlapping video chunks, where pretrained short-horizon video diffusion models provide local priors. At inference time, we enforce boundary agreement through a…
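The contrast between agreement on noisy states and agreement on Tweedie estimates can be illustrated with a small NumPy sketch. This is not the paper's implementation. It assumes a variance-preserving forward process, uses the true noise in place of a learned noise predictor, and measures disagreement with a simple squared-difference penalty on overlapping frames. The names `tweedie_estimate`, `boundary_agreement_loss`, and `demo` are hypothetical.

```python
import numpy as np


def tweedie_estimate(x_t, eps_pred, alpha_bar_t):
    """Clean-data estimate x0_hat from a noisy sample.

    Assumes the variance-preserving forward process
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps.
    """
    return (x_t - np.sqrt(1.0 - alpha_bar_t) * eps_pred) / np.sqrt(alpha_bar_t)


def boundary_agreement_loss(chunks, overlap):
    """Sum of squared differences on the frames shared by adjacent chunks.

    Compares the last `overlap` frames of each chunk with the first
    `overlap` frames of the next one (an illustrative penalty, not the
    paper's composing rule).
    """
    loss = 0.0
    for left, right in zip(chunks[:-1], chunks[1:]):
        diff = left[-overlap:] - right[:overlap]
        loss += float(np.sum(diff ** 2))
    return loss


def demo(seed=0):
    """Return (disagreement on noisy states, disagreement on Tweedie estimates)."""
    rng = np.random.default_rng(seed)
    traj = rng.normal(size=(10, 3))        # 10 frames, 3 features each
    chunks = [traj[0:6], traj[4:10]]       # two chunks sharing 2 frames
    alpha_bar_t = 0.5
    # Each chunk is noised independently, as when chunks are denoised separately.
    eps = [rng.normal(size=c.shape) for c in chunks]
    noisy = [np.sqrt(alpha_bar_t) * c + np.sqrt(1.0 - alpha_bar_t) * e
             for c, e in zip(chunks, eps)]
    # Oracle noise stands in for a trained noise-prediction network (assumption).
    x0_hat = [tweedie_estimate(x, e, alpha_bar_t) for x, e in zip(noisy, eps)]
    return boundary_agreement_loss(noisy, 2), boundary_agreement_loss(x0_hat, 2)
```

In this toy setting the overlapping frames disagree in the noisy states even though both chunks come from the same clean trajectory, while the Tweedie estimates agree up to floating-point error. With a real, imperfect denoiser the estimates would not agree exactly, which is where an explicit agreement term at inference time would come in.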
Peer Reviews
Decision: ICLR 2026 Poster
- The paper presents a clear conceptual insight: enforcing boundary agreement on denoised Tweedie estimates rather than noisy diffusion states. This shift is both theoretically justified and empirically validated.
- The toy example in Figure 2 effectively illustrates the key failure mode of noisy-state composition (boundary drift) and how the proposed approach closes those gaps. It’s an unusually clear and intuitive visualization of the underlying problem.
- The proposed framework is training-fr
- While OOD performance is impressive, the method still relies on the base video diffusion model having seen the intermediate motion fragments during training. The generalization is compositional rather than truly extrapolative, which limits deployment in fully novel environments.
- The method requires a goal image as conditioning, restricting applicability to scenarios where both start and goal visual states are available. Many planning settings might not have explicit goal frames.
- The approa
1. While using $x_0$ prediction (Tweedie estimates) for guidance is not new, its novel application to compositional generation is an impressive contribution, using it to enforce boundary consistency between factors.
2. The paper provides a strong theoretical justification for why prior methods fail. By identifying and proving the "Noisy-Bethe Gap" (Theorem 1), it formally explains that simple averaging in the noisy $x_t$ space is fundamentally flawed.
3. It addresses a critical and timely probl
1. The justification for using Diffusion-Sphere Guidance (DSG) specifically is not entirely clear. The paper would be significantly strengthened by an ablation study that compares DSG to standard, simpler guidance mechanisms (e.g., conventional gradient-based guidance or other proposed guidance methods). This would help demonstrate that the proposed method is general enough to improve performance even when paired with various other guidance techniques, not just DSG.
2. The paper's claims regardi
- **Clear Articulation of Problem and Strong Theoretical Grounding:** The paper's primary strength lies in its clear theoretical articulation of a critical problem in compositional planning. The authors compellingly argue that prior methods (e.g., DiffCollage) are flawed due to their reliance on a Bethe approximation in the noisy state. This critique is not just qualitative; it is substantially supported by the "Noisy-Bethe Gap Theorem" (Theorem 1, detailed in Appendix A), which provides a stron
- **Insufficient Discussion on the Nature of the Contribution:** The paper attributes its success to composing in the Tweedie ($x_0$ estimate) space. However, the powerful composing rules ($L_\text{sync}$ / $L_\text{async}$) are themselves very strong, explicit guidance heuristics. A key insight may be that the Tweedie estimation space is the first domain that enables such strong, explicit guidance to be applied stably (whereas it would fail in the noisy state). The authors are encouraged to
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Reinforcement Learning in Robotics
