SneakPeek: Future-Guided Instructional Streaming Video Generation
Cheeun Hong, German Barquero, Fadime Sener, Markos Georgopoulos, Edgar Schönfeld, Stefan Popov, Yuming Du, Oscar Mañas, Albert Pumarola

TL;DR
SneakPeek is a diffusion-based autoregressive framework that generates coherent, controllable instructional videos from an initial image and structured text prompts. It does so by predicting the next frame, anticipating future keyframes, and maintaining temporal consistency across multiple action steps.
Contribution
The paper introduces a novel pipeline with predictive causal adaptation, future-guided self-forcing, and multi-prompt conditioning for improved instructional video generation.
Findings
Produces temporally coherent instructional videos
Maintains semantic fidelity to multi-step instructions
Enables dynamic prompt updates during generation
Abstract
Instructional video generation is an emerging task that aims to synthesize coherent demonstrations of procedural activities from textual descriptions. Such capability has broad implications for content creation, education, and human-AI interaction, yet existing video diffusion models struggle to maintain temporal consistency and controllability across long sequences of multiple action steps. We introduce a pipeline for future-driven streaming instructional video generation, dubbed SneakPeek, a diffusion-based autoregressive framework designed to generate precise, stepwise instructional videos conditioned on an initial image and structured textual prompts. Our approach introduces three key innovations to enhance consistency and controllability: (1) predictive causal adaptation, where a causal model learns to perform next-frame prediction and anticipate future keyframes; (2) future-guided…
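To make the streaming, multi-prompt setting concrete, the following is a minimal toy sketch of an autoregressive generation loop in which the active text prompt changes between action steps. It is not the authors' implementation: `stream_generate`, `toy_next_frame`, `context_len`, and the list-of-floats frame representation are all hypothetical names and simplifications introduced here for illustration. The real model is a video diffusion network, and the sketch does not model future keyframe anticipation or self-forcing.

```python
from typing import Callable, List, Sequence

# Toy stand-in for a video frame (or latent); an assumption for illustration only.
Frame = List[float]


def stream_generate(
    first_frame: Frame,
    prompt_schedule: Sequence[str],
    frames_per_prompt: int,
    next_frame: Callable[[Sequence[Frame], str], Frame],
    context_len: int = 4,
) -> List[Frame]:
    """Extend a video frame by frame, switching prompts between action steps."""
    frames = [first_frame]
    for prompt in prompt_schedule:
        for _ in range(frames_per_prompt):
            # Causal conditioning: only already-generated frames are visible.
            context = frames[-context_len:]
            frames.append(next_frame(context, prompt))
    return frames


def toy_next_frame(context: Sequence[Frame], prompt: str) -> Frame:
    """Hypothetical placeholder for a causal generator (not a diffusion model)."""
    last = context[-1]
    mean = [sum(f[i] for f in context) / len(context) for i in range(len(last))]
    # Blend the last frame with the context mean; the prompt only nudges the value.
    return [0.5 * a + 0.5 * m + 0.01 * len(prompt) for a, m in zip(last, mean)]


video = stream_generate(
    first_frame=[0.0, 0.0],
    prompt_schedule=["crack the egg", "whisk the egg"],
    frames_per_prompt=3,
    next_frame=toy_next_frame,
)
```

Because each new frame depends only on earlier frames and the current prompt, the prompt schedule can be extended or edited while generation is in progress, which is the property the "dynamic prompt updates" finding refers to.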
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Video Analysis and Summarization
