Show Me: Unifying Instructional Image and Video Generation with Diffusion Models

Yujiang Pu, Zhanbo Huang, Vishnu Boddeti, and Yu Kong

arXiv:2511.17839 · cs.CV · November 25, 2025

TL;DR

ShowMe is a unified diffusion-based framework for instructional image and video generation that combines spatial and temporal modeling with structure and motion consistency rewards to improve structural fidelity and temporal coherence.

Contribution

The paper presents a unified diffusion model that handles both text-guided image manipulation and video prediction by selectively activating the spatial and temporal components of a video diffusion model, and adds structure and motion consistency rewards.

Findings

01. Outperforms expert models in instructional image generation
02. Achieves superior results in video prediction benchmarks
03. Enhances structural fidelity and temporal coherence

Abstract

Generating visual instructions in a given context is essential for developing interactive world simulators. While prior works address this problem through either text-guided image manipulation or video prediction, these tasks are typically treated in isolation. This separation reveals a fundamental issue: image manipulation methods overlook how actions unfold over time, while video prediction models often ignore the intended outcomes. To this end, we propose ShowMe, a unified framework that enables both tasks by selectively activating the spatial and temporal components of video diffusion models. In addition, we introduce structure and motion consistency rewards to improve structural fidelity and temporal coherence. Notably, this unification brings dual benefits: the spatial knowledge gained through video pretraining enhances contextual consistency and realism in non-rigid image edits,…

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Human Motion and Animation