An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion
Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H. Bermano, Gal Chechik, Daniel Cohen-Or

TL;DR
This paper introduces Textual Inversion, a method that enables personalized text-to-image generation by learning new 'words' from just a few images, allowing users to create and modify unique concepts with natural language.
Contribution
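The paper presents a simple approach to personalizing text-to-image models: it learns new embeddings for a concept from only a few images (3-5, per the abstract) while the model itself stays frozen, so users can place that concept in new prompts with natural language.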
Findings
A single embedding can capture diverse concepts
Outperforms baseline methods in concept fidelity
Enables intuitive composition of personalized concepts
Abstract
Text-to-image models offer unprecedented freedom to guide creation through natural language. Yet, it is unclear how such freedom can be exercised to generate images of specific unique concepts, modify their appearance, or compose them in new roles and novel scenes. In other words, we ask: how can we use language-guided models to turn our cat into a painting, or imagine a new product based on our favorite toy? Here we present a simple approach that allows such creative freedom. Using only 3-5 images of a user-provided concept, like an object or a style, we learn to represent it through new "words" in the embedding space of a frozen text-to-image model. These "words" can be composed into natural language sentences, guiding personalized creation in an intuitive way. Notably, we find evidence that a single word embedding is sufficient for capturing unique and varied concepts. We compare our…
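The core idea in the abstract, learning a new "word" in the embedding space of a frozen model, can be illustrated with a toy sketch. This is not the authors' implementation: the fixed matrix `W` is a hypothetical stand-in for the frozen generative model, the squared-error loss stands in for a training objective the abstract does not spell out, and the names `learn_embedding`, `embed_prompt`, and the placeholder token `S*` are assumptions made for illustration. The only point it shows is that gradient descent updates the new embedding alone, and that the learned vector can then be looked up inside an ordinary prompt.

```python
import random

def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def mean_loss(W, v, targets):
    """Mean over targets of 0.5 * ||W v - t||^2 (toy reconstruction loss)."""
    out = matvec(W, v)
    total = sum(0.5 * sum((o - ti) ** 2 for o, ti in zip(out, t)) for t in targets)
    return total / len(targets)

def learn_embedding(W, targets, dim, steps=300, lr=0.05, seed=0):
    """Optimize only the new embedding v; W is read but never modified."""
    rng = random.Random(seed)
    v = [rng.uniform(-0.1, 0.1) for _ in range(dim)]
    history = [mean_loss(W, v, targets)]
    n = len(targets)
    for _ in range(steps):
        out = matvec(W, v)
        # Residual between the model output and the mean target.
        resid = [sum(out[i] - t[i] for t in targets) / n for i in range(len(out))]
        # Gradient of the mean loss with respect to v is W^T @ resid.
        grad = [sum(W[i][j] * resid[i] for i in range(len(W))) for j in range(dim)]
        v = [x - lr * g for x, g in zip(v, grad)]
        history.append(mean_loss(W, v, targets))
    return v, history

def embed_prompt(prompt, table):
    """Look up one embedding per whitespace-separated token."""
    return [table[tok] for tok in prompt.split()]

# Toy setup (all values are made up for illustration).
rng = random.Random(1)
dim, feat = 4, 6
W = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(feat)]  # frozen "model"
table = {tok: [rng.uniform(-1, 1) for _ in range(dim)] for tok in ["a", "photo", "of"]}

# Four noisy "images" of one hidden concept, mirroring the 3-5 image setting.
concept = [1.0, -0.5, 0.25, 0.75]
targets = [[y + rng.gauss(0, 0.05) for y in matvec(W, concept)] for _ in range(4)]

v_star, history = learn_embedding(W, targets, dim)
table["S*"] = v_star  # the new "word"; existing entries are untouched
prompt_vectors = embed_prompt("a photo of S*", table)
```

In this sketch `history` records the loss at each step, so comparing `history[0]` with `history[-1]` shows whether the new embedding moved toward the example images, while `W` and the pre-existing entries of `table` stay exactly as they were.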
Taxonomy
Topics: Multimodal Machine Learning Applications · Handwritten Text Recognition Techniques · Digital Humanities and Scholarship
