TL;DR
This paper introduces Visually Guided Decoding (VGD), a gradient-free method that uses large language models and CLIP guidance to generate coherent, interpretable prompts for text-to-image models, improving control and relevance.
Contribution
VGD is a novel gradient-free approach that leverages LLMs and CLIP for effective prompt inversion, enhancing interpretability and semantic alignment without additional training.
Findings
VGD outperforms existing prompt inversion methods in coherence and relevance.
VGD improves interpretability and control in text-to-image generation.
VGD does not require additional training, increasing flexibility.
Abstract
Text-to-image generative models like DALL-E and Stable Diffusion have revolutionized visual content creation across various applications, including advertising, personalized media, and design prototyping. However, crafting effective textual prompts to guide these models remains challenging, often requiring extensive trial and error. Existing prompt inversion approaches, such as soft and hard prompt techniques, have limited effectiveness because the prompts they produce are incoherent and hard to interpret. To address these issues, we propose Visually Guided Decoding (VGD), a gradient-free approach that leverages large language models (LLMs) and CLIP-based guidance to generate coherent and semantically aligned prompts. In essence, VGD utilizes the robust text generation capabilities of LLMs to produce human-readable prompts. Further, by employing CLIP scores to ensure alignment with…
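The decoding idea described in the abstract, an LLM proposing fluent continuations while an image-text score steers them toward the target image, can be illustrated with a toy sketch. Everything below is an assumption made for illustration and not the authors' implementation: `TOY_LM` stands in for an LLM, `toy_image_text_score` stands in for a CLIP similarity, and the use of beam search, the additive objective, and the weight `alpha` are hypothetical choices.

```python
import math
from typing import Callable, Dict, List, Set, Tuple

# Hypothetical stand-in for an LLM: next-token probabilities given the last token.
TOY_LM: Dict[str, Dict[str, float]] = {
    "<s>": {"a": 0.6, "the": 0.4},
    "a": {"dog": 0.5, "cat": 0.5},
    "the": {"dog": 0.5, "cat": 0.5},
    "dog": {"running": 0.5, "sleeping": 0.5},
    "cat": {"running": 0.5, "sleeping": 0.5},
    "running": {"</s>": 1.0},
    "sleeping": {"</s>": 1.0},
}


def toy_next_token_logprobs(tokens: List[str]) -> Dict[str, float]:
    """Return log-probabilities of each possible next token (toy bigram model)."""
    return {tok: math.log(p) for tok, p in TOY_LM[tokens[-1]].items()}


def toy_image_text_score(image_concepts: Set[str], tokens: List[str]) -> float:
    """Stand-in for a CLIP score: fraction of image concepts named in the text."""
    return len(image_concepts & set(tokens)) / len(image_concepts)


def guided_beam_search(
    image_concepts: Set[str],
    next_logprobs: Callable[[List[str]], Dict[str, float]],
    image_text_score: Callable[[Set[str], List[str]], float],
    alpha: float = 5.0,   # assumed weight on the image-text score
    beam_width: int = 2,
    max_len: int = 5,
) -> str:
    """Gradient-free search: rank partial prompts by LM log-prob + alpha * score."""
    beams: List[Tuple[List[str], float]] = [(["<s>"], 0.0)]
    for _ in range(max_len):
        candidates: List[Tuple[List[str], float]] = []
        for tokens, lm_logprob in beams:
            if tokens[-1] == "</s>":
                candidates.append((tokens, lm_logprob))  # finished beam, keep as-is
                continue
            for tok, lp in next_logprobs(tokens).items():
                candidates.append((tokens + [tok], lm_logprob + lp))
        # Combined objective: language fluency plus visual alignment.
        candidates.sort(
            key=lambda c: c[1] + alpha * image_text_score(image_concepts, c[0]),
            reverse=True,
        )
        beams = candidates[:beam_width]
        if all(tokens[-1] == "</s>" for tokens, _ in beams):
            break
    best_tokens = beams[0][0]
    return " ".join(t for t in best_tokens if t not in ("<s>", "</s>"))


if __name__ == "__main__":
    print(guided_beam_search({"cat", "sleeping"}, toy_next_token_logprobs,
                             toy_image_text_score))
```

In a realistic setup the two stand-ins would be replaced by an actual LLM's next-token distribution and a CLIP image-text similarity. No gradients flow through either model in this sketch; the search only queries them for scores.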
Peer Reviews
Decision: ICLR 2025 Poster
The authors propose a pretty creative way to conduct gradient-free text-to-image prompt inversion and incorporate language priors in the process. The qualitative results also show some obvious improvement on CLIP-I scores when used with LLaVA.
My main concern about this paper lies in the experiments. In general, I am not very convinced by their result that this method significantly improves upon the existing literature. 1. The authors mainly conduct qualitative comparison with PEZ and textual inversion, and not with CLIP-Interrogator, which is very misleading given that CLIP-Interrogator is the best performing baseline based on Table 1 and it can also generate prompts that have similar human interpretability in comparison to the propo
- VGD produces coherent and human-readable prompts, facilitating user interaction and modification.
- The training-free method allows easy integration with different LLMs, enhancing adaptability.
- Demonstrates superior performance in generating contextually relevant prompts, as supported by both qualitative and quantitative results.
- This paper does not analyze bad cases.
- The evaluation lacks depth, especially in semantic aspect evaluation. No human evaluation was conducted. Since image generation is a complex, semantically rich task, CLIPScore may not fully capture true image-prompt alignment, and its classification granularity is limited. Style transfer also requires human evaluation, but the paper only shows a few examples.
- The paper evaluates only semantics without assessing image quality. There should be a discus
1. VGD generates fully interpretable prompts that enhance generalizability across tasks.
2. VGD is a gradient-free method, which makes it more flexible.
1. When applied to more complex open-source models with multiple text encoders, such as SDXL and SD3, the performance of the method declines, as mentioned in L220-222. Moreover, the method is quite limited for non-CLIP-based models that use T5 as the text encoder.
2. The current experimental analysis also appears insufficient. While the method shows superior performance compared to previous methods, the paper should also include experiments about the time and cost for more
Taxonomy
Methods: Diffusion · Contrastive Language-Image Pre-training
