ShowHowTo: Generating Scene-Conditioned Step-by-Step Visual Instructions
Tomáš Souček, Prajwal Gatti, Michael Wray, Ivan Laptev, Dima Damen, Josef Sivic

TL;DR
This paper introduces ShowHowTo, a model that generates scene-conditioned step-by-step visual instructions from an input image and a sequence of textual instructions. It is trained on a large-scale dataset collected automatically from instructional videos.
Contribution
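The paper presents an automatic approach for collecting step-by-step visual instruction data from instructional videos, yielding a dataset of 0.6M sequences of image-text pairs; a video diffusion model, ShowHowTo, for generating visual instructions; and a comprehensive evaluation showing state-of-the-art performance.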
Findings
Achieved high accuracy in step, scene, and task consistency.
Generated realistic and coherent instruction sequences.
Established a new benchmark for visual instruction generation.
Abstract
The goal of this work is to generate step-by-step visual instructions in the form of a sequence of images, given an input image that provides the scene context and the sequence of textual instructions. This is a challenging problem as it requires generating multi-step image sequences to achieve a complex goal while being grounded in a specific environment. Part of the challenge stems from the lack of large-scale training data for this problem. The contribution of this work is thus three-fold. First, we introduce an automatic approach for collecting large step-by-step visual instruction training data from instructional videos. We apply this approach to one million videos and create a large-scale, high-quality dataset of 0.6M sequences of image-text pairs. Second, we develop and train ShowHowTo, a video diffusion model capable of generating step-by-step visual instructions consistent with…
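The task setup described in the abstract (a scene image plus a sequence of textual instructions in, a sequence of images out) can be illustrated with a small data-structure sketch. This is not the authors' code: the names `Step`, `InstructionSequence`, and `make_training_example` are hypothetical, and treating the first image of each sequence as the scene-context input is an assumption made here for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Step:
    """One image-text pair (hypothetical layout, not the paper's format)."""
    image_path: str   # path to a frame taken from an instructional video
    instruction: str  # textual instruction that the frame depicts


@dataclass
class InstructionSequence:
    """An ordered sequence of image-text pairs for a single task."""
    task: str
    steps: List[Step]


def make_training_example(
    seq: InstructionSequence,
) -> Tuple[str, List[str], List[str]]:
    """Split a sequence into (scene image, instructions, target images).

    Assumption: the first image provides the scene context and the
    remaining images are the generation targets.
    """
    if len(seq.steps) < 2:
        raise ValueError("need one context step and at least one target step")
    scene_image = seq.steps[0].image_path
    instructions = [step.instruction for step in seq.steps]
    targets = [step.image_path for step in seq.steps[1:]]
    return scene_image, instructions, targets


# Usage with made-up file names and instructions.
example = InstructionSequence(
    task="make an omelette",
    steps=[
        Step("frame_000.jpg", "crack the eggs into a bowl"),
        Step("frame_001.jpg", "whisk the eggs"),
        Step("frame_002.jpg", "pour the eggs into the pan"),
    ],
)
scene, texts, targets = make_training_example(example)
```

Under these assumptions, a generative model would receive `scene` and `texts` and be trained to produce images matching `targets`, which is what grounds the generated steps in the user's specific environment.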
Taxonomy
Topics: Video Analysis and Summarization · Multimedia Communication and Technology · Advanced Vision and Imaging
Methods: Diffusion
