VideoSketcher: Video Models Prior Enable Versatile Sequential Sketch Generation
Hui Ren, Yuval Alaluf, Omer Bar Tal, Alexander Schwing, Antonio Torralba, Yael Vinker

TL;DR
This paper introduces a data-efficient method for generating sequential sketches by adapting pretrained text-to-video diffusion models, combining semantic stroke planning from language models with high-quality visual rendering.
Contribution
It presents a novel two-stage fine-tuning approach that leverages limited human sketch data and synthetic shapes to produce controllable, high-quality sequential sketches guided by text instructions.
Findings
High-quality sketches closely follow text-specified orderings
Method works with as few as seven human-drawn sketches
Extensions enable style conditioning and interactive drawing
Abstract
Sketching is inherently a sequential process, in which strokes are drawn in a meaningful order to explore and refine ideas. However, most generative models treat sketches as static images, overlooking the temporal structure that underlies creative drawing. We present a data-efficient approach for sequential sketch generation that adapts pretrained text-to-video diffusion models to generate sketching processes. Our key insight is that large language models and video diffusion models offer complementary strengths for this task: LLMs provide semantic planning and stroke ordering, while video diffusion models serve as strong renderers that produce high-quality, temporally coherent visuals. We leverage this by representing sketches as short videos in which strokes are progressively drawn on a blank canvas, guided by text-specified ordering instructions. We introduce a two-stage fine-tuning…
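The representation described in the abstract, a short video in which strokes are progressively drawn on a blank canvas, can be illustrated with a minimal sketch. This is not the authors' data pipeline. The polyline stroke format, the 64-pixel canvas, the one-segment-per-frame pacing, and the function names `draw_segment` and `strokes_to_frames` are all assumptions made for illustration.

```python
import numpy as np

def draw_segment(canvas, p0, p1):
    """Rasterize a straight segment in black by dense point sampling."""
    (x0, y0), (x1, y1) = p0, p1
    n = int(max(abs(x1 - x0), abs(y1 - y0))) + 1
    xs = np.rint(np.linspace(x0, x1, n)).astype(int)
    ys = np.rint(np.linspace(y0, y1, n)).astype(int)
    h, w = canvas.shape
    keep = (xs >= 0) & (xs < w) & (ys >= 0) & (ys < h)  # clip to canvas
    canvas[ys[keep], xs[keep]] = 0

def strokes_to_frames(strokes, size=64):
    """Return an array of shape (T, size, size).

    Frame 0 is the blank white canvas; each later frame adds one more
    segment, following the order in which strokes are listed.
    """
    canvas = np.full((size, size), 255, dtype=np.uint8)
    frames = [canvas.copy()]
    for stroke in strokes:
        for p0, p1 in zip(stroke[:-1], stroke[1:]):
            draw_segment(canvas, p0, p1)
            frames.append(canvas.copy())
    return np.stack(frames)

# Hypothetical two-stroke sketch: a square outline, then its diagonal.
strokes = [
    [(10, 10), (50, 10), (50, 50), (10, 50), (10, 10)],
    [(10, 10), (50, 50)],
]
frames = strokes_to_frames(strokes)
```

Reordering the entries of `strokes` changes the frame sequence but not the final image, which is the sense in which a text-specified stroke ordering governs the drawing process rather than the finished sketch.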
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Interactive and Immersive Displays · 3D Shape Modeling and Analysis
