Show Me: Unifying Instructional Image and Video Generation with Diffusion Models
Yujiang Pu, Zhanbo Huang, Vishnu Boddeti, and Yu Kong

TL;DR
ShowMe is a unified diffusion-based framework for instructional image and video generation. It combines spatial and temporal modeling with structure and motion consistency rewards to improve consistency and realism.
Contribution
The paper presents a novel unified diffusion model that handles both image manipulation and video prediction, integrating spatial and temporal components with consistency rewards.
Findings
Outperforms expert models in instructional image generation
Achieves superior results in video prediction benchmarks
Enhances structural fidelity and temporal coherence
Abstract
Generating visual instructions in a given context is essential for developing interactive world simulators. While prior works address this problem through either text-guided image manipulation or video prediction, these tasks are typically treated in isolation. This separation reveals a fundamental issue: image manipulation methods overlook how actions unfold over time, while video prediction models often ignore the intended outcomes. To this end, we propose ShowMe, a unified framework that enables both tasks by selectively activating the spatial and temporal components of video diffusion models. In addition, we introduce structure and motion consistency rewards to improve structural fidelity and temporal coherence. Notably, this unification brings dual benefits: the spatial knowledge gained through video pretraining enhances contextual consistency and realism in non-rigid image edits,…
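The abstract's idea of selectively activating spatial and temporal components can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the function names (`spatial_layer`, `temporal_layer`, `block`, `run`), the list-of-floats "frames", and the simple averaging operations are hypothetical stand-ins, not the paper's architecture or code. The sketch only shows the routing pattern: an image task uses the per-frame (spatial) path alone, while a video task also runs the cross-frame (temporal) path.

```python
from typing import List

Frame = List[float]  # toy stand-in for a flattened frame of latent values


def spatial_layer(frame: Frame) -> Frame:
    # Hypothetical per-frame operation: blend each value with the frame mean.
    # It never looks at other frames, so it is usable for single images.
    mean = sum(frame) / len(frame)
    return [0.5 * x + 0.5 * mean for x in frame]


def temporal_layer(frames: List[Frame]) -> List[Frame]:
    # Hypothetical cross-frame operation: average each position with its
    # immediate temporal neighbours (window clipped at sequence boundaries).
    out = []
    last = len(frames) - 1
    for t in range(len(frames)):
        window = frames[max(0, t - 1):min(last, t + 1) + 1]
        out.append([
            sum(f[i] for f in window) / len(window)
            for i in range(len(frames[t]))
        ])
    return out


def block(frames: List[Frame], use_temporal: bool) -> List[Frame]:
    # The spatial path always runs; the temporal path is switched per task.
    hidden = [spatial_layer(f) for f in frames]
    if use_temporal:
        hidden = temporal_layer(hidden)
    return hidden


def run(frames: List[Frame], task: str) -> List[Frame]:
    # Assumed task names: "image" (manipulation) and "video" (prediction).
    if task == "image":
        return block(frames[:1], use_temporal=False)
    if task == "video":
        return block(frames, use_temporal=True)
    raise ValueError(f"unknown task: {task}")


if __name__ == "__main__":
    print(run([[1.0, 3.0]], "image"))
    print(run([[0.0, 0.0], [2.0, 2.0]], "video"))
```

In a real video diffusion model the two paths would be learned layers (for example spatial and temporal attention), and how ShowMe gates them is not specified in the text available here.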
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Human Motion and Animation
