CLIPDraw: Exploring Text-to-Drawing Synthesis through Language-Image Encoders
Kevin Frans, L.B. Soros, Olaf Witkowski

TL;DR
CLIPDraw is a zero-shot text-to-drawing synthesis method that uses a pre-trained CLIP language-image encoder to generate vector stroke drawings matching natural language descriptions, in diverse artistic styles and at varying levels of complexity.
Contribution
This work introduces CLIPDraw, which synthesizes drawings from text without any training. A pre-trained CLIP encoder scores how well a drawing matches the description, and restricting the drawing to vector strokes rather than pixels biases results towards simpler, human-recognizable shapes.
Findings
Satisfies ambiguous text descriptions in multiple ways
Reliably produces drawings in diverse artistic styles
Scales from simple to complex visual representations as stroke count increases
Abstract
This work presents CLIPDraw, an algorithm that synthesizes novel drawings based on natural language input. CLIPDraw does not require any training; rather, a pre-trained CLIP language-image encoder is used as a metric for maximizing similarity between the given description and a generated drawing. Crucially, CLIPDraw operates over vector strokes rather than pixel images, a constraint that biases drawings towards simpler human-recognizable shapes. Results compare CLIPDraw with other synthesis-through-optimization methods and highlight various interesting behaviors of CLIPDraw, such as satisfying ambiguous text in multiple ways, reliably producing drawings in diverse artistic styles, and scaling from simple to complex visual representations as stroke count is increased. Code for experimenting with the method is available at:…
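The optimization loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract only states that stroke parameters are adjusted to maximize encoder similarity with the text. Everything else here is an assumption made for illustration: strokes are modeled as cubic Bezier curves, `render` is a hypothetical Gaussian soft rasterizer, `make_encoder` is a fixed random linear projection standing in for the pre-trained CLIP image encoder, the "text embedding" is just a random vector, and gradients come from finite differences with a backtracking step rather than automatic differentiation.

```python
import math
import random

SIZE = 8        # toy canvas is SIZE x SIZE pixels
EMBED_DIM = 6   # width of the stand-in embedding space
SIGMA = 0.08    # stroke softness, in unit-square coordinates


def bezier_points(ctrl, samples=8):
    """Sample points along one cubic Bezier curve with 4 (x, y) control points."""
    pts = []
    for i in range(samples):
        t = i / (samples - 1)
        w = ((1 - t) ** 3, 3 * t * (1 - t) ** 2, 3 * t * t * (1 - t), t ** 3)
        pts.append((sum(wk * p[0] for wk, p in zip(w, ctrl)),
                    sum(wk * p[1] for wk, p in zip(w, ctrl))))
    return pts


def render(params):
    """Soft-rasterize strokes (flat list, 8 floats per stroke) to a flat image."""
    pts = []
    for s in range(0, len(params), 8):
        ctrl = [(params[s + 2 * k], params[s + 2 * k + 1]) for k in range(4)]
        pts.extend(bezier_points(ctrl))
    image = []
    for row in range(SIZE):
        for col in range(SIZE):
            x, y = (col + 0.5) / SIZE, (row + 0.5) / SIZE
            ink = sum(math.exp(-((x - px) ** 2 + (y - py) ** 2) / (2 * SIGMA ** 2))
                      for px, py in pts)
            image.append(math.tanh(ink))  # squash intensities into [0, 1)
    return image


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / (norm + 1e-12)


def make_encoder(rng):
    """ASSUMPTION: random linear projection standing in for CLIP's image encoder."""
    weights = [[rng.gauss(0, 1) for _ in range(SIZE * SIZE)]
               for _ in range(EMBED_DIM)]
    return lambda image: [sum(w * p for w, p in zip(row, image)) for row in weights]


def optimize(text_embedding, encode, n_strokes=2, steps=30, seed=0):
    """Adjust stroke control points to raise similarity with text_embedding."""
    rng = random.Random(seed)
    params = [rng.random() for _ in range(n_strokes * 8)]

    def score(p):
        return cosine(encode(render(p)), text_embedding)

    history = [score(params)]
    step_size, eps = 0.05, 1e-4
    for _step in range(steps):
        base = history[-1]
        # Forward finite differences stand in for autodiff through a rasterizer.
        grad = []
        for i in range(len(params)):
            bumped = params[:]
            bumped[i] += eps
            grad.append((score(bumped) - base) / eps)
        # Backtracking: halve the step until the similarity actually improves.
        for _attempt in range(6):
            trial = [p + step_size * g for p, g in zip(params, grad)]
            trial_score = score(trial)
            if trial_score > base:
                params, base = trial, trial_score
                break
            step_size *= 0.5
        history.append(base)
    return params, history


if __name__ == "__main__":
    rng = random.Random(1)
    encode = make_encoder(rng)
    text_embedding = [rng.gauss(0, 1) for _ in range(EMBED_DIM)]  # stand-in
    _, history = optimize(text_embedding, encode)
    print(f"similarity: {history[0]:.3f} -> {history[-1]:.3f}")
```

Because only the stroke control points are free parameters, the image is always a set of curves, which is the constraint the abstract credits for simpler, more recognizable shapes. Raising `n_strokes` gives the optimizer more curves to place, loosely mirroring the stroke-count scaling reported in the findings.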
Taxonomy
Topics: Handwritten Text Recognition Techniques · Human Motion and Animation · Multimodal Machine Learning Applications
