Training-Free Diffusion Priors for Text-to-Image Generation via Optimization-based Visual Inversion
Samuele Dell'Erba, Andrew D. Bagdanov

TL;DR
This paper introduces a training-free, optimization-based method called OVI to replace diffusion priors in text-to-image generation, improving visual quality without extensive training.
Contribution
Proposes a novel zero-shot visual inversion technique with regularization constraints as an alternative to trained diffusion priors.
Findings
OVI can replace traditional diffusion priors effectively.
Regularization constraints improve the realism of generated images.
Benchmark scores are comparable or superior to those of state-of-the-art methods.
Abstract
Diffusion models have established the state of the art in text-to-image generation, but their performance often relies on a diffusion prior network to translate text embeddings into the visual manifold for easier decoding. These priors are computationally expensive and require extensive training on massive datasets. In this work, we challenge the necessity of a trained prior at all by employing Optimization-based Visual Inversion (OVI), a training-free and zero-shot alternative, to replace the need for a prior. OVI initializes a latent visual representation from random pseudo-tokens and iteratively optimizes it to maximize the cosine similarity with the input textual prompt embedding. We further propose two novel constraints, a Mahalanobis-based and a Nearest-Neighbor loss, to regularize the OVI optimization process toward the distribution of realistic images. Our experiments, conducted…
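The optimization loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: it assumes the latent is a single vector optimized directly (the paper optimizes pseudo-tokens, and its exact loss forms and weights are not given in this summary), it uses plain gradient descent with hand-derived gradients, and the names `ovi_sketch`, `image_bank`, `lam_maha`, and `lam_nn` are hypothetical. The Mahalanobis term is assumed to be a squared distance to the mean of a bank of reference image embeddings, and the nearest-neighbor term a squared distance to the closest bank entry.

```python
import numpy as np

def cosine(v, t):
    """Cosine similarity between two 1-D vectors."""
    return float(v @ t / (np.linalg.norm(v) * np.linalg.norm(t)))

def cosine_grad(v, t):
    """Gradient of cosine(v, t) with respect to v."""
    nv, nt = np.linalg.norm(v), np.linalg.norm(t)
    return t / (nv * nt) - (v @ t) * v / (nv**3 * nt)

def ovi_sketch(text_emb, image_bank, steps=200, lr=0.1,
               lam_maha=0.01, lam_nn=0.01, seed=0):
    """Toy inversion loop (all names and defaults are assumptions).

    Minimizes
        -cos(v, t) + lam_maha * (v - mu)^T P (v - mu) + lam_nn * ||v - nn(v)||^2
    where mu and P are the mean and (regularized) inverse covariance of
    image_bank, and nn(v) is the bank entry closest to v.
    Returns the optimized vector and the cosine similarity after each step.
    """
    rng = np.random.default_rng(seed)
    dim = text_emb.shape[0]
    mu = image_bank.mean(axis=0)
    cov = np.cov(image_bank, rowvar=False) + 1e-3 * np.eye(dim)
    prec = np.linalg.inv(cov)

    v = rng.normal(size=dim)          # random initialization
    history = [cosine(v, text_emb)]
    for _ in range(steps):
        # Nearest reference embedding, treated as a constant for this step.
        dists = np.linalg.norm(image_bank - v, axis=1)
        nn = image_bank[np.argmin(dists)]
        grad = (-cosine_grad(v, text_emb)
                + lam_maha * 2.0 * (prec @ (v - mu))
                + lam_nn * 2.0 * (v - nn))
        v = v - lr * grad
        history.append(cosine(v, text_emb))
    return v, history

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    bank = rng.normal(size=(64, 16))   # stand-in for real image embeddings
    text = rng.normal(size=16)         # stand-in for a text embedding
    v, hist = ovi_sketch(text, bank)
    print(f"cosine before: {hist[0]:.3f}, after: {hist[-1]:.3f}")
```

Setting `lam_maha` and `lam_nn` to zero recovers the unregularized objective, which makes it easy to compare how the two constraints change the result on a given embedding bank.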
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Image Enhancement Techniques
