Automated Black-box Prompt Engineering for Personalized Text-to-Image Generation
Yutong He, Alexander Robey, Naoki Murata, Yiding Jiang, Joshua Nathaniel Williams, George J. Pappas, Hamed Hassani, Yuki Mitsufuji, Ruslan Salakhutdinov, J. Zico Kolter

TL;DR
This paper presents PRISM, an automated black-box prompt engineering algorithm that generates human-interpretable, transferable prompts for text-to-image models, reducing manual effort and improving cross-model applicability.
Contribution
Introduction of PRISM, a novel algorithm leveraging LLM in-context learning to automatically produce effective prompts for T2I models without white-box access.
Findings
PRISM generates accurate prompts for diverse objects and styles.
PRISM works across multiple T2I models including Stable Diffusion, DALL-E, and Midjourney.
PRISM produces human-interpretable prompts that are transferable.
Abstract
Prompt engineering is an effective but labor-intensive way to control text-to-image (T2I) generative models. Its time-intensive nature and complexity have spurred the development of algorithms for automated prompt generation. However, these methods often struggle with transferability across T2I models, require white-box access to the underlying model, or produce non-intuitive prompts. In this work, we introduce PRISM, an algorithm that automatically produces human-interpretable and transferable prompts that can effectively generate desired concepts given only black-box access to T2I models. Inspired by large language model (LLM) jailbreaking, PRISM leverages the in-context learning ability of LLMs to iteratively refine the candidate prompt distribution built upon the reference images. Our experiments demonstrate the versatility and effectiveness of PRISM in generating accurate prompts…
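The iterative refinement described in the abstract can be pictured as a black-box loop: propose a prompt, generate an image, score it against the reference, and feed the scored history back to the proposer as in-context feedback. The sketch below is only an illustration under stated assumptions, not the authors' implementation. The names `refine_prompt`, `propose`, `generate`, `judge`, and the toy stand-ins are all hypothetical; a real setup would presumably plug in an LLM, a T2I model, and a judge model in their place.

```python
from typing import Any, Callable, List, Tuple

# Each history entry pairs a candidate prompt with the score it received.
History = List[Tuple[str, float]]


def refine_prompt(
    reference: Any,
    propose: Callable[[Any, History], str],
    generate: Callable[[str], Any],
    judge: Callable[[Any, Any], float],
    iterations: int = 5,
) -> Tuple[str, float, History]:
    """Black-box refinement loop (hypothetical sketch).

    Only the outputs of `generate` are used; no gradients or weights
    of the image model are accessed.
    """
    history: History = []
    best_prompt, best_score = "", float("-inf")
    for _ in range(iterations):
        # The proposer sees all earlier (prompt, score) pairs as context.
        prompt = propose(reference, history)
        image = generate(prompt)
        score = judge(reference, image)
        history.append((prompt, score))
        if score > best_score:
            best_prompt, best_score = prompt, score
    return best_prompt, best_score, history


# --- Toy stand-ins (assumptions, for illustration only) ---
# An "image" is modeled as a set of words so the loop runs without any model.

def toy_propose(reference: set, history: History) -> str:
    # Use one more reference word each round, in sorted order.
    words = sorted(reference)
    k = min(len(history) + 1, len(words))
    return " ".join(words[:k])


def toy_generate(prompt: str) -> set:
    return set(prompt.split())


def toy_judge(reference: set, image: set) -> float:
    # Jaccard similarity between reference and generated word sets.
    return len(reference & image) / len(reference | image)


reference = {"red", "vintage", "bicycle"}
best_prompt, best_score, history = refine_prompt(
    reference, toy_propose, toy_generate, toy_judge, iterations=5
)
print(best_prompt, best_score)
```

Because the loop only calls `generate` and reads its output, the same structure applies to any image model reachable through an API, which is the sense of "black-box" used above.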
Peer Reviews
Decision: Submitted to ICLR 2025
1. The paper is well written. The structure makes it clear and easy to follow. 2. The experiments are comprehensive and well executed. 3. Higher interpretability and transferability compared to other personalization / inversion methods.
1. The idea is not novel. Similar attempts exist, e.g., Manas et al 2024, Liu et al 2024, Yang et al 2023. Manas et al propose a very similar iterative algorithm based on in-context learning. The only (small) difference lies in the task, which here is personalization / inversion, while in Manas et al it is improving prompt-image consistency in general (with no reference images). Yang et al's work is similar in that it relies on GPT4V, which is used to evaluate and propose new candidates prom
* The PRISM method is presented with clarity, is intuitive, and straightforward to implement. * The experimentation applies PRISM to two useful use cases - text-to-image personalization and image inversion - demonstrating its utility. * The authors apply PRISM to three different families of models and several generations of models, demonstrating extensibility across model classes. * The analysis includes good coverage over other methodologies, and the PRISM method shows strong performance compare
1. The work would be strengthened by greater discussion of possible weaknesses of VLMs as judge models and how these concerns may be mitigated. For example, VLMs have known issues with compositionality and counting [1,2]. 2. While the ablation focusing on budget provides useful insights, it would be useful to contextualize how the improvements in metrics translate to visually perceptible improvements to better contextualize the extent to which the iterations are required for human-perceive
1. Unlike previous techniques that generate only a bag of words, PRISM generates fully human-readable prompts for image generation. 2. PRISM has a straightforward implementation, as it does not require model training. 3. The paper is written in a clear and accessible manner.
1. In Line 187 and Line 9 of the algorithm, you state that $\text{P}_{\theta_F}$ is updated; however, the method for updating the distribution is not specified in the method section. In the Appendix, it appears that this process simply involves prompting with the score generated by the discriminator. How, then, can this be interpreted as updating the distribution? This could lead to misunderstandings, as “updating the distribution” generally implies updating the model parameters. 2. In Figures 3,4, a
Taxonomy
Topics: Video Analysis and Summarization · Mathematics, Computing, and Information Processing · Multimedia Communication and Technology
Methods: Diffusion
