SWEET: Sparse World Modeling with Image Editing for Embodied Task Execution
Yiren Song, Yihan Wang, Xiyao Deng, Zhuoran Yan, Mike Zheng Shou

TL;DR
SWEET introduces a sparse visual planning framework using image editing to generate task-relevant keyframes for robot manipulation, reducing computational costs and improving reliability over dense video generation methods.
Contribution
The paper demonstrates that image editing models outperform video generation models for task-level state prediction and proposes SWEET, a novel sparse planning approach based on sequential image editing.
Findings
Image editing produces more reliable task keyframes than video generation.
SWEET improves keyframe prediction in unseen scenes.
The full pipeline derives robot actions from the predicted sequence of keyframes.
Abstract
Visual prediction has emerged as a promising paradigm for embodied control, where future observations are generated and then translated into actions. However, dense video generation is computationally expensive and often unnecessary for many manipulation tasks, whose progress can be summarized by a small number of task-relevant visual states. In this work, we study whether image editing models can serve as sparse visual world models for robot manipulation by predicting task-level future states without dense video rollout. We first conduct a controlled comparison between the video generation model Wan2.2 and the image editing model FLUX-Kontext under the same robotic data setting, and find that image editing produces more reliable task-level keyframes with better visual fidelity and substantially lower inference cost. Motivated by this observation, we propose SWEET, a one-shot sparse…
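As a rough illustration of the sparse planning idea described above, the sketch below rolls out a short chain of keyframes by applying an editing step once per sub-goal, instead of generating a dense video. It is not the authors' implementation: `rollout_keyframes`, `toy_edit`, the `Image` type, and the sub-goal strings are all hypothetical names introduced here, and the toy editor merely stands in for an image editing model such as FLUX-Kontext.

```python
from typing import Callable, List, Sequence

# Hypothetical stand-in for a camera frame: a 2D grid of integer pixel values.
Image = List[List[int]]
# Hypothetical editor interface: (current frame, text instruction) -> next keyframe.
EditFn = Callable[[Image, str], Image]


def rollout_keyframes(initial: Image, subgoals: Sequence[str], edit: EditFn) -> List[Image]:
    """Predict one keyframe per sub-goal by editing the previous keyframe."""
    frames = [initial]
    for instruction in subgoals:
        # Each step conditions on the most recent predicted state only,
        # so the cost grows with the number of sub-goals, not video frames.
        frames.append(edit(frames[-1], instruction))
    return frames


def toy_edit(image: Image, instruction: str) -> Image:
    """Placeholder editor (assumption): shifts every pixel by the instruction length."""
    delta = len(instruction)
    return [[pixel + delta for pixel in row] for row in image]


# Two sub-goals yield the initial observation plus two predicted keyframes.
keyframes = rollout_keyframes([[0, 0], [0, 0]], ["grasp cup", "lift cup"], toy_edit)
print(len(keyframes))
```

In a real system the editor would be a learned model and a separate component would turn consecutive keyframes into robot actions; both are outside this sketch.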
