Rhetorical Text-to-Image Generation via Two-layer Diffusion Policy Optimization

Yuxi Zhang; Yueting Li; Xinyu Du; Sibo Wang

arXiv:2505.22792 · cs.CV · August 12, 2025

Open Access

TL;DR

Rhet2Pix introduces a multi-step policy optimization framework with a two-layer diffusion module to improve rhetorical text-to-image generation, capturing hidden semantic meanings beyond literal interpretations.

Contribution

The paper presents Rhet2Pix, a two-layer diffusion-based approach that generates images from rhetorical language by incrementally elaborating prompts and optimizing diffusion trajectories.

Findings

1. Outperforms SOTA models like GPT-4o and Grok-3 in qualitative evaluations.
2. Effectively captures semantic richness in rhetorical prompts.
3. Mitigates reward sparsity in diffusion-based image generation.

Abstract

Generating images from rhetorical language remains a critical challenge for text-to-image models. Even state-of-the-art (SOTA) multimodal large language models (MLLMs) fail to generate images based on the hidden meaning inherent in rhetorical language, even though humans readily map such content to visual representations. A key limitation is that current models emphasize object-level word-embedding alignment, so metaphorical expressions steer image generation toward their literal visuals and the intended semantic meaning is overlooked. To address this, we propose Rhet2Pix, a framework that formulates rhetorical text-to-image generation as a multi-step policy optimization problem, incorporating a two-layer MDP diffusion module. In the outer layer, Rhet2Pix converts the input prompt into incrementally elaborated sub-sentences and executes corresponding image-generation…
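As a rough illustration of the outer layer described in the abstract, the sketch below builds a sequence of incrementally elaborated sub-sentences from a prompt and runs one image-generation call per sub-sentence. It is a minimal sketch under stated assumptions, not the authors' implementation: the names `elaborate`, `run_outer_layer`, `toy_generate`, and `OuterStep` are hypothetical, the elaboration details are supplied by hand, and the generator is a string-returning stand-in for a diffusion model. The inner layer and the policy optimization are not modeled here.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class OuterStep:
    """One outer-layer step: a sub-sentence and the image generated for it."""
    sub_sentence: str
    image: str  # placeholder; a real system would hold pixels or latents


def elaborate(prompt: str, details: List[str]) -> List[str]:
    """Hypothetical helper: return incrementally elaborated sub-sentences.

    Each sub-sentence extends the previous one with one more detail.
    The details are given by hand here purely for illustration.
    """
    sub_sentences = []
    current = prompt
    for detail in details:
        current = f"{current}, {detail}"
        sub_sentences.append(current)
    return sub_sentences


def run_outer_layer(
    prompt: str,
    details: List[str],
    generate: Callable[[str], str],
) -> List[OuterStep]:
    """Run one image-generation call per elaborated sub-sentence."""
    trajectory = []
    for sub_sentence in elaborate(prompt, details):
        trajectory.append(OuterStep(sub_sentence, generate(sub_sentence)))
    return trajectory


def toy_generate(sub_sentence: str) -> str:
    """Stand-in for a diffusion model (assumption): returns a text tag."""
    return f"image<{sub_sentence}>"


if __name__ == "__main__":
    steps = run_outer_layer(
        "time is a thief",
        ["an hourglass leaking sand", "a shadowy hand taking coins"],
        toy_generate,
    )
    for step in steps:
        print(step.sub_sentence, "->", step.image)
```

Each step sees a longer, more explicit version of the prompt than the one before it, which is one way to picture "incrementally elaborated" conditioning; how Rhet2Pix actually produces and scores these sub-sentences is not specified in the text available here.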

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Computational and Text Analysis Methods

Methods: Diffusion