Rhetorical Text-to-Image Generation via Two-layer Diffusion Policy Optimization
Yuxi Zhang, Yueting Li, Xinyu Du, Sibo Wang

TL;DR
Rhet2Pix introduces a multi-step policy optimization framework with a two-layer diffusion module to improve rhetorical text-to-image generation, capturing hidden semantic meanings beyond literal interpretations.
Contribution
The paper presents Rhet2Pix, a two-layer diffusion-based approach that generates images from rhetorical language by incrementally elaborating prompts and optimizing diffusion trajectories.
Findings
Outperforms SOTA models like GPT-4o and Grok-3 in qualitative evaluations.
Effectively captures semantic richness in rhetorical prompts.
Mitigates reward sparsity in diffusion-based image generation.
Abstract
Generating images from rhetorical language remains a critical challenge for text-to-image models. Even state-of-the-art (SOTA) multimodal large language models (MLLMs) fail to generate images based on the hidden meaning inherent in rhetorical language, even though humans readily map such content to visual representations. A key limitation is that current models emphasize object-level word embedding alignment, so metaphorical expressions steer image generation towards their literal visuals and the intended semantic meaning is overlooked. To address this, we propose Rhet2Pix, a framework that formulates rhetorical text-to-image generation as a multi-step policy optimization problem, incorporating a two-layer MDP diffusion module. In the outer layer, Rhet2Pix converts the input prompt into incrementally elaborated sub-sentences and executes corresponding image-generation…
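The outer layer described above can be pictured with a toy sketch. The abstract is truncated here, so this shows only the visible part: a prompt is expanded into incrementally elaborated sub-sentences, and an image-generation step runs for each one. It is not the authors' implementation. Every name (`elaborate`, `run_outer_layer`, `toy_generate`, `toy_score`, `OuterStep`) is hypothetical, the generator and scorer are string-based stand-ins for a diffusion model and a reward model, and the example prompt is made up.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class OuterStep:
    sub_prompt: str  # elaborated sub-sentence used at this outer step
    image: str       # placeholder for the generated image
    reward: float    # placeholder score for this step


def elaborate(prompt: str, details: List[str]) -> List[str]:
    """Hypothetical elaboration: append one detail per stage."""
    stages, current = [], prompt
    for detail in details:
        current = f"{current}, {detail}"
        stages.append(current)
    return stages


def run_outer_layer(
    prompt: str,
    details: List[str],
    generate: Callable[[str], str],
    score: Callable[[str, str], float],
) -> List[OuterStep]:
    """Run one image-generation step per elaborated sub-sentence."""
    trajectory = []
    for sub_prompt in elaborate(prompt, details):
        image = generate(sub_prompt)  # stand-in for a diffusion rollout
        trajectory.append(OuterStep(sub_prompt, image, score(prompt, image)))
    return trajectory


def toy_generate(sub_prompt: str) -> str:
    """Assumed stub: returns a text tag instead of a real image."""
    return f"<image for: {sub_prompt}>"


def toy_score(prompt: str, image: str) -> float:
    """Assumed stub: a length-based score, not a real reward model."""
    return float(len(image))


if __name__ == "__main__":
    steps = run_outer_layer(
        "time is a thief",
        ["a clock", "stealing years from a calendar"],
        toy_generate,
        toy_score,
    )
    for step in steps:
        print(step.sub_prompt, step.reward)
```

Each outer step yields its own score, which is one way to picture how intermediate steps could supply a denser signal than a single end-of-trajectory reward. How Rhet2Pix actually defines and optimizes these rewards is not stated in the text available here.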
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Computational and Text Analysis Methods
Methods: Diffusion
