ShowHowTo: Generating Scene-Conditioned Step-by-Step Visual Instructions

Tomáš Souček; Prajwal Gatti; Michael Wray; Ivan Laptev; Dima Damen; Josef Sivic

arXiv:2412.01987 · cs.CV · March 26, 2025

TL;DR

This paper introduces ShowHowTo, a video diffusion model that generates scene-conditioned step-by-step visual instructions as a sequence of images, given an input image of the scene and a sequence of textual instructions. It is trained on a large-scale dataset collected automatically from instructional videos.

Contribution

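The paper presents an automatic approach for collecting step-by-step visual instruction data from instructional videos, applied to one million videos to produce a dataset of 0.6M sequences of image-text pairs; ShowHowTo, a video diffusion model for generating such instructions; and a comprehensive evaluation reporting state-of-the-art performance.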

Findings

01

Achieved high accuracy in step, scene, and task consistency.

02

Generated realistic and coherent instruction sequences.

03

Established a new benchmark for visual instruction generation.

Abstract

The goal of this work is to generate step-by-step visual instructions in the form of a sequence of images, given an input image that provides the scene context and the sequence of textual instructions. This is a challenging problem as it requires generating multi-step image sequences to achieve a complex goal while being grounded in a specific environment. Part of the challenge stems from the lack of large-scale training data for this problem. The contribution of this work is thus three-fold. First, we introduce an automatic approach for collecting large step-by-step visual instruction training data from instructional videos. We apply this approach to one million videos and create a large-scale, high-quality dataset of 0.6M sequences of image-text pairs. Second, we develop and train ShowHowTo, a video diffusion model capable of generating step-by-step visual instructions consistent with…

Code & Models

Repositories

soCzech/ShowHowTo (PyTorch, official)

Taxonomy

Topics: Video Analysis and Summarization · Multimedia Communication and Technology · Advanced Vision and Imaging

Methods: Diffusion