REVEALER: Reinforcement-Guided Visual Reasoning for Element-Level Text-Image Alignment Evaluation
Fulin Shi, Wenyi Xiao, Bin Chen, Liang Din, Leilei Gan

TL;DR
REVEALER introduces a reinforcement-guided visual reasoning framework for fine-grained, element-level evaluation of text-image alignment, improving interpretability and accuracy over existing methods.
Contribution
REVEALER presents a unified framework that combines reinforcement learning with structured reasoning for element-level alignment evaluation of text-to-image outputs.
Findings
Achieves state-of-the-art performance on four benchmarks.
Outperforms proprietary models and supervised baselines.
Demonstrates superior inference efficiency.
Abstract
Evaluating the alignment between textual prompts and generated images is critical for ensuring the reliability and usability of text-to-image (T2I) models. However, most existing evaluation methods rely on coarse-grained metrics or static QA pipelines, which lack fine-grained interpretability and struggle to reflect human preferences. To address this, we propose REVEALER, a unified framework for element-level alignment evaluation based on reinforcement-guided visual reasoning. Adopting a structured "grounding-reasoning-conclusion" paradigm, our method enables Multimodal Large Language Models (MLLMs) to explicitly localize semantic elements and derive interpretable alignment judgments. We optimize the model via Group Relative Policy Optimization (GRPO) using a composite reward function that incorporates structural format, grounding accuracy, and alignment fidelity. Extensive experiments…
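The composite reward and GRPO's group-relative scoring mentioned in the abstract can be illustrated with a minimal sketch. The abstract names only the three reward components; everything else here is an assumption made for illustration: the tag names, the use of IoU as the grounding score, exact label match as the alignment score, the weights, and the function names `composite_reward` and `group_advantages`. It is not the authors' implementation.

```python
import re
from statistics import mean, pstdev

# Hypothetical tag layout for a "grounding-reasoning-conclusion" response.
PATTERN = re.compile(
    r"<grounding>(.*?)</grounding>\s*"
    r"<reasoning>(.*?)</reasoning>\s*"
    r"<conclusion>(.*?)</conclusion>",
    re.DOTALL,
)


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def composite_reward(response, gold_box, gold_label, weights=(0.2, 0.4, 0.4)):
    """Weighted sum of format, grounding, and alignment terms (assumed weights)."""
    m = PATTERN.fullmatch(response.strip())
    if m is None:
        return 0.0  # malformed structure earns nothing in this sketch
    r_format = 1.0
    # Assumed grounding format: four numbers inside the <grounding> tag.
    nums = re.findall(r"-?\d+(?:\.\d+)?", m.group(1))
    r_ground = iou(tuple(map(float, nums)), gold_box) if len(nums) == 4 else 0.0
    # Assumed alignment score: exact match of the predicted label.
    r_align = 1.0 if m.group(3).strip().lower() == gold_label.lower() else 0.0
    w_f, w_g, w_a = weights
    return w_f * r_format + w_g * r_ground + w_a * r_align


def group_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: normalize each reward within its sampled group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    good = ("<grounding>[0, 0, 10, 10]</grounding>"
            "<reasoning>The cat is visible.</reasoning>"
            "<conclusion>aligned</conclusion>")
    rewards = [composite_reward(r, (0, 0, 10, 10), "aligned")
               for r in (good, "no structure at all")]
    print(rewards, group_advantages(rewards))
```

In this sketch, several responses sampled for the same prompt-image pair are scored, and each response's advantage is its reward relative to the group mean, scaled by the group standard deviation. A well-formed, correctly grounded response is therefore pushed up only to the extent that it beats its siblings.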
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Topic Modeling
