Reverse Stable Diffusion: What prompt was used to generate this image?
Florinel-Alin Croitoru, Vlad Hondru, Radu Tudor Ionescu, Mubarak Shah

TL;DR
This paper studies how to predict the prompt embedding of an image generated by a diffusion model, and reports that training a diffusion model on this prompt prediction task improves how well its generated images align with their prompts.
Contribution
The paper proposes a novel framework that combines prompt regression with multi-label vocabulary classification, trained with curriculum learning, to better predict prompts from diffusion-generated images.
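To make the idea of a joint objective concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. The specific loss terms (cosine distance for the embedding regression, per-word binary cross-entropy for the vocabulary classification), the `weight` parameter, and all function names are assumptions chosen for illustration.

```python
import math

def cosine_loss(pred, target):
    """Regression term (assumed form): 1 - cosine similarity of the embeddings."""
    dot = sum(p * t for p, t in zip(pred, target))
    norm_p = math.sqrt(sum(p * p for p in pred))
    norm_t = math.sqrt(sum(t * t for t in target))
    return 1.0 - dot / (norm_p * norm_t + 1e-12)

def multilabel_bce(logits, labels):
    """Classification term (assumed form): mean binary cross-entropy,
    one independent sigmoid per vocabulary word."""
    total = 0.0
    for z, y in zip(logits, labels):
        # Numerically stable -[y*log(s) + (1-y)*log(1-s)] with s = sigmoid(z).
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)

def joint_loss(pred_emb, target_emb, vocab_logits, vocab_labels, weight=1.0):
    """Hypothetical joint objective: regression term plus weighted classification term."""
    return cosine_loss(pred_emb, target_emb) + weight * multilabel_bce(vocab_logits, vocab_labels)
```

In this sketch, `vocab_labels` marks which vocabulary words occur in the ground-truth prompt, so the classification term pushes the model to recover individual words while the regression term matches the overall embedding.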
Findings
Improved prompt prediction accuracy on the DiffusionDB dataset.
Training diffusion models on prompt prediction enhances image-prompt alignment.
Code is publicly available for reproducibility.
Abstract
Text-to-image diffusion models have recently attracted the interest of many researchers, and inverting the diffusion process can play an important role in better understanding the generative process and how to engineer prompts in order to obtain the desired images. To this end, we study the task of predicting the prompt embedding given an image generated by a generative diffusion model. We consider a series of white-box and black-box models (with and without access to the weights of the diffusion network) to deal with the proposed task. We propose a novel learning framework comprising a joint prompt regression and multi-label vocabulary classification objective that generates improved prompts. To further improve our method, we employ a curriculum learning procedure that promotes the learning of image-prompt pairs with lower labeling noise (i.e. that are better aligned). We conduct…
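The curriculum learning idea mentioned in the abstract (favoring better-aligned, less noisy image-prompt pairs) can be sketched as follows. This is only an illustration under stated assumptions: the per-pair `alignment` scores are taken as precomputed, the linear pacing schedule and the `start_frac` parameter are hypothetical, and the function name is invented here rather than taken from the paper.

```python
def curriculum_subset(pairs, alignment, epoch, num_epochs, start_frac=0.25):
    """Return the training pairs admitted at `epoch` (0-indexed).

    `alignment[i]` is an assumed, precomputed score of how well image i
    matches its prompt; higher means less labeling noise.
    """
    # Rank pairs from best aligned to worst aligned.
    order = sorted(range(len(pairs)), key=lambda i: alignment[i], reverse=True)
    # Linear pacing (an assumption): grow from start_frac of the data
    # to the full dataset by the final epoch.
    progress = epoch / max(num_epochs - 1, 1)
    frac = start_frac + (1.0 - start_frac) * progress
    keep = max(1, round(frac * len(pairs)))
    return [pairs[i] for i in order[:keep]]
```

For example, with four pairs and four epochs, the first epoch trains only on the single best-aligned pair, and the last epoch trains on all four.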
Peer Reviews
Decision: ICLR 2024 Conference Withdrawn Submission
The paper is reasonably well-rounded in terms of completeness, the experiments are generally sufficient, and the figures and charts are quite clear.
The motivation in this paper is very weak, making it difficult to identify significant innovation or contributions. As I understand it, the motivation is to enhance the understanding of the diffusion model through embeddings. However, the method section only discusses how to obtain embeddings, without addressing how this enhances the understanding of the diffusion model. If the motivation is just to predict the text embeddings, this can already be achieved with image captioning + CLIP. Furthermore,
The considered topic of fine-tuning the text-to-image diffusion model toward better semantic alignment is important and of great interest. The proposed method seems to be simple.
- The main objective is unclear.
- A large body of the manuscript and the title itself focus on predicting the original text embedding. However, as long as the actual caption is not generated but only the text embedding is predicted, it is unclear what the point is. As for evaluating the text alignment, most works already report the CLIP similarity score, and the motive for making a new prediction model for text embedding seems to be weak. As for improving the U-net backbone via fine-tuning, o
1. The paper aims for an interesting task of predicting prompt embedding from Stable Diffusion generated images, and proposes multiple novel components to achieve the task.
2. The generated prompt-embeddings can later be paired with the images to fine-tune the generative model, which enhances the text-image alignment of the model.
1. I have doubts about the domain-adaptive kernel learning part:
- Is it really applicable? From the setup, it seems to require knowledge about the test domain. However, in a real-world task, the test domain could be unknown or infinitely large due to the generative power of a diffusion model. How should we define the test domain for this learning task?
- It requires a hyper-parameter $r$ to define the number of k-means centroids. However, this $r$ is not studied and h
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis
Methods: Diffusion
