SWEET: Sparse World Modeling with Image Editing for Embodied Task Execution

Yiren Song; Yihan Wang; Xiyao Deng; Zhuoran Yan; Mike Zheng Shou

arXiv:2605.19319 · cs.CV · May 20, 2026

TL;DR

SWEET introduces a sparse visual planning framework using image editing to generate task-relevant keyframes for robot manipulation, reducing computational costs and improving reliability over dense video generation methods.

Contribution

The paper reports that, in a controlled comparison under the same robotic data setting, the image editing model FLUX-Kontext produces more reliable task-level keyframes than the video generation model Wan2.2, and it proposes SWEET, a sparse planning approach based on sequential image editing.

Findings

01. Image editing produces more reliable task keyframes than video generation.
02. SWEET improves keyframe prediction in unseen scenes.
03. The full pipeline enables robot actions from sequential keyframes.
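
The idea behind the third finding, a short sequence of edited keyframes that is then translated into robot actions, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `plan_keyframes` and `keyframes_to_actions`, the callables `edit_step` and `infer_action`, and the choice to condition each edit on the previous keyframe are all assumptions made for illustration. In a real system, `edit_step` would wrap an image editing model and `infer_action` would wrap an action decoder or policy.

```python
from typing import Callable, List, TypeVar

Image = TypeVar("Image")
Action = TypeVar("Action")


def plan_keyframes(
    observation: Image,
    instruction: str,
    edit_step: Callable[[Image, str, int], Image],  # hypothetical editing-model wrapper
    num_keyframes: int,
) -> List[Image]:
    """Build a sparse visual plan by applying the edit step repeatedly.

    Assumption: each edit is conditioned on the previously predicted
    keyframe, not on the original observation.
    """
    keyframes: List[Image] = []
    current = observation
    for step in range(num_keyframes):
        current = edit_step(current, instruction, step)
        keyframes.append(current)
    return keyframes


def keyframes_to_actions(
    observation: Image,
    keyframes: List[Image],
    infer_action: Callable[[Image, Image], Action],  # hypothetical action decoder
) -> List[Action]:
    """Translate consecutive (current, target) frame pairs into actions."""
    actions: List[Action] = []
    previous = observation
    for target in keyframes:
        actions.append(infer_action(previous, target))
        previous = target
    return actions
```

With toy stand-ins, where integers play the role of images, `plan_keyframes(0, "stack the block", lambda img, text, i: img + 1, 3)` yields three successive states, and `keyframes_to_actions` turns them into three per-step actions. The point of the sketch is that the number of model calls scales with the number of keyframes, not with the number of video frames.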

Abstract

Visual prediction has emerged as a promising paradigm for embodied control, where future observations are generated and then translated into actions. However, dense video generation is computationally expensive and often unnecessary for many manipulation tasks, whose progress can be summarized by a small number of task-relevant visual states. In this work, we study whether image editing models can serve as sparse visual world models for robot manipulation by predicting task-level future states without dense video rollout. We first conduct a controlled comparison between the video generation model Wan2.2 and the image editing model FLUX-Kontext under the same robotic data setting, and find that image editing produces more reliable task-level keyframes with better visual fidelity and substantially lower inference cost. Motivated by this observation, we propose SWEET, a one-shot sparse…
