EDITOR: Effective and Interpretable Prompt Inversion for Text-to-Image Diffusion Models
Mingzhe Li, Kejing Xia, Gehao Zhang, Zhenting Wang, Guanhong Tao, Siqi Pan, Juan Zhai, Shiqing Ma

TL;DR
This paper introduces EDITOR, a novel prompt inversion method for text-to-image diffusion models that improves image similarity, interpretability, and generalizability by leveraging captioning models and latent space refinement.
Contribution
EDITOR combines caption-based initialization with latent space reverse-engineering to enhance prompt inversion for diffusion models, outperforming existing methods.
Findings
Outperforms existing prompt inversion methods in image similarity and interpretability
Effective across multiple datasets including MS COCO, LAION, Flickr, and DiffusionDB
Enables applications like cross-concept synthesis and unsupervised segmentation
Abstract
Text-to-image generation models (e.g., Stable Diffusion) have achieved significant advancements, enabling the creation of high-quality and realistic images based on textual descriptions. Prompt inversion, the task of identifying the textual prompt used to generate a specific artifact, holds significant potential for applications including data attribution, model provenance, and watermarking validation. Recent studies introduced a delayed projection scheme to optimize for prompts representative of the vocabulary space, though challenges in semantic fluency and efficiency remain. Advanced image captioning models or visual large language models can generate highly interpretable prompts, but they often lack in image similarity. In this paper, we propose a prompt inversion technique called EDITOR for text-to-image diffusion models, which includes initializing embeddings using a pre-trained…
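The overall shape described above (start from a caption, refine in a continuous embedding space, then turn the result back into text) can be illustrated with a deliberately tiny sketch. Everything in it is an assumption made for illustration: the 2-D `VOCAB` embeddings, the weighted-sum `generate` function standing in for the diffusion model, the hard-coded caption standing in for a captioning model, and the nearest-neighbour `decode` standing in for a learned embedding-to-text model. It is not the paper's implementation.

```python
import math

# Hypothetical 2-D token embeddings; real text encoders use hundreds of dimensions.
VOCAB = {
    "cat": (1.0, 0.0),
    "dog": (0.0, 1.0),
    "tiger": (2.0, 0.2),
    "photo": (0.5, 0.5),
}
# Hypothetical per-position weights of the toy "generator".
WEIGHTS = (0.25, 0.75)

def generate(embs):
    """Toy stand-in for the diffusion model: a weighted sum of token embeddings."""
    return tuple(sum(w * e[d] for w, e in zip(WEIGHTS, embs)) for d in range(2))

def loss(embs, target):
    """Squared error between the toy generator output and the target feature."""
    out = generate(embs)
    return sum((out[d] - target[d]) ** 2 for d in range(2))

def refine(embs, target, lr=0.5, steps=100):
    """Plain gradient descent on the continuous embeddings (no discrete projection)."""
    embs = [list(e) for e in embs]
    for _ in range(steps):
        out = generate(embs)
        resid = [out[d] - target[d] for d in range(2)]
        for w, e in zip(WEIGHTS, embs):
            for d in range(2):
                # Gradient of the squared error with respect to this coordinate.
                e[d] -= lr * 2.0 * w * resid[d]
    return [tuple(e) for e in embs]

def decode(embs):
    """Stand-in for an embedding-to-text model: nearest vocabulary entry per position."""
    return [min(VOCAB, key=lambda w: math.dist(VOCAB[w], e)) for e in embs]

# A hard-coded caption plays the role of the captioning model's output.
caption = ["photo", "cat"]
init = [VOCAB[w] for w in caption]
# The "image" to invert was produced from a different prompt.
target = generate([VOCAB["photo"], VOCAB["tiger"]])

refined = refine(init, target)
prompt = decode(refined)
print(caption, "->", prompt)
```

In this toy setting the caption supplies a readable starting point, and the continuous refinement moves the second token's embedding close enough to "tiger" that decoding recovers it. Text is produced only once, after optimization has finished.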
Peer Reviews
Decision·Submitted to ICLR 2026
1. The work addresses an interesting problem of reverse engineering diffusion models. 2. Comprehensive evaluations and ablations show the effectiveness of the approach in comparison to prior work.
1. The work has limited novelty in the sense that it combines the gradient-based optimization of prior work with the latent space of an existing model. 2. The notation and equations are incorrect. The cross-entropy loss and the MLE loss are not correctly defined in Equations 4 and 6. 3. The approach does not consider recent multimodal architectures such as SD3. 4. A comparison to recent prompt inversion/search techniques such as [1] is missing. [1] STEPS: Sequential Probability Tensor Estimation for Text-
- The paper provides a clean, modular pipeline that others can readily reuse. - Better similarity and more fluent prompts than PEZ/PH2P and captioners across multiple datasets and model variants. - Produces prompts that are human-interpretable, aiding provenance/attribution and even downstream editing.
- The method composes established components; the main idea (optimize contextual embeddings, then decode to text) is a practical tweak rather than a new paradigm. - Only 100 images per dataset are used for evaluation; it's unclear how stable the gains are across broader distributions. - Protocol choices (initialization, token/step budgets) could affect PEZ/PH2P competitiveness; a standardized compute budget table would be good to have.
1. Optimizing contextual embeddings after the transformer and deferring discrete text decoding to an E2T model is novel. 2. EDITOR improves image similarity and prompt interpretability/text alignment.
1. EDITOR depends on a trained E2T module; this adds implementation and computation costs. 2. The embedding-to-text-to-embedding mapping may not be exact; paraphrases can drift semantically. The extent to which this harms re-generation fidelity and editability is under-measured, and the authors are encouraged to give more details. 3. Experiments focus on COCO/LAION/Flickr subsets. The scale of the dataset is relatively limited.
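One simple way to quantify the round-trip concern raised in the last review is to decode an embedding to text, re-encode that text, and report one minus the cosine similarity between the two embeddings. The sketch below is only an illustration of that measurement: `round_trip_drift`, `toy_decode`, `toy_encode`, and the two-word `TOY_VOCAB` are hypothetical names and data, not part of the paper or of any library.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def round_trip_drift(embedding, decode, encode):
    """1 - cosine similarity between an embedding and its re-encoded decoding."""
    return 1.0 - cosine(embedding, encode(decode(embedding)))

# Hypothetical toy codec: decoding snaps to the closest of two words, so it is lossy.
TOY_VOCAB = {"cat": (1.0, 0.0), "dog": (0.0, 1.0)}

def toy_decode(e):
    return max(TOY_VOCAB, key=lambda w: cosine(TOY_VOCAB[w], e))

def toy_encode(word):
    return TOY_VOCAB[word]

exact = round_trip_drift((1.0, 0.0), toy_decode, toy_encode)    # embedding is already a word
drifted = round_trip_drift((0.8, 0.6), toy_decode, toy_encode)  # embedding lies between words
print(exact, drifted)
```

A drift near zero means the decoded text maps back to essentially the same embedding; larger values indicate that decoding discarded information the optimization had found.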
Taxonomy
Topics: Speech Recognition and Synthesis
