Proactive Agents for Multi-Turn Text-to-Image Generation Under Uncertainty

Meera Hahn; Wenjun Zeng; Nithish Kannen; Rich Galt; Kartikeya Badola; Been Kim; Zi Wang

arXiv:2412.06771 · cs.AI · October 27, 2025

TL;DR

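This paper introduces proactive text-to-image agents that ask clarification questions when uncertain and present their uncertainty as an editable belief graph, so that generated images align more closely with user intent. In the reported experiments, the agents reached higher VQAScore than standard models, and most users found them helpful.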

Contribution

It proposes a novel proactive agent framework with an interface for clarification and belief visualization, along with a scalable evaluation method and empirical validation across multiple datasets.

Findings

01. The agents ask informative questions to improve alignment with user intent.

02. The agents achieve at least 2x higher VQAScore than standard models.

03. 90% of users found the agents helpful in their workflows.

Abstract

User prompts for generative AI models are often underspecified, leading to a misalignment between the user intent and models' understanding. As a result, users commonly have to painstakingly refine their prompts. We study this alignment problem in text-to-image (T2I) generation and propose a prototype for proactive T2I agents equipped with an interface to (1) actively ask clarification questions when uncertain, and (2) present their uncertainty about user intent as an understandable and editable belief graph. We build simple prototypes for such agents and propose a new scalable and automated evaluation approach using two agents, one with a ground truth intent (an image) while the other tries to ask as few questions as possible to align with the ground truth. We experiment over three image-text datasets: ImageInWords (Garg et al., 2024), COCO (Lin et al., 2014) and DesignBench, a…
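
The two-agent evaluation described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the belief format (a dictionary of attributes, each with a probability distribution over candidate values), the `entropy` and `simulate_dialogue` helpers, and the "ask about the most uncertain attribute" rule are all assumptions made for illustration. The ground-truth intent is stood in for by a plain dictionary rather than an image.

```python
import math


def entropy(dist):
    """Shannon entropy (in bits) of a {value: probability} distribution."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def simulate_dialogue(belief, ground_truth, max_turns=5):
    """Toy two-agent clarification loop (hypothetical, for illustration only).

    belief:       {attribute: {candidate_value: probability}}
    ground_truth: {attribute: true_value}, playing the role of the
                  answering agent that knows the user's intent.
    Returns (attributes asked about in order, final belief).
    """
    # Copy so the caller's belief is not mutated.
    belief = {attr: dict(dist) for attr, dist in belief.items()}
    asked = []
    for _ in range(max_turns):
        # Question selection: pick the attribute the agent is least sure about.
        attr = max(belief, key=lambda a: entropy(belief[a]))
        if entropy(belief[attr]) == 0:
            break  # Nothing left to clarify.
        # The answering agent replies with the true value.
        answer = ground_truth[attr]
        # Belief update: collapse the distribution onto the answer.
        belief[attr] = {answer: 1.0}
        asked.append(attr)
    return asked, belief


belief = {
    "subject": {"dog": 1.0},
    "breed": {"corgi": 0.25, "poodle": 0.25, "husky": 0.25, "beagle": 0.25},
    "setting": {"park": 0.5, "beach": 0.5},
}
truth = {"subject": "dog", "breed": "husky", "setting": "beach"}
asked, final = simulate_dialogue(belief, truth)
print(asked)  # ['breed', 'setting']
```

In this sketch the number of questions asked serves as the cost the questioning agent tries to minimize, mirroring the abstract's goal of aligning with the ground truth in as few questions as possible.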

Code & Models

Repositories

google-deepmind/proactive_t2i_agents
none · Official

Datasets

meerahahn/DesignBench
dataset · 58 downloads

Videos

Proactive Agents for Multi-Turn Text-to-Image Generation Under Uncertainty · SlidesLive

Taxonomy

Topics: Image Retrieval and Classification Techniques · Multimodal Machine Learning Applications · Artificial Intelligence in Games

Methods: ALIGN