REVEALER: Reinforcement-Guided Visual Reasoning for Element-Level Text-Image Alignment Evaluation

Fulin Shi; Wenyi Xiao; Bin Chen; Liang Din; Leilei Gan

arXiv:2512.23169 · cs.CV · February 24, 2026

Open Access

TL;DR

REVEALER introduces a reinforcement-guided visual reasoning framework for fine-grained, element-level evaluation of text-image alignment, improving interpretability and accuracy over existing methods.

Contribution

The paper presents a unified framework that combines reinforcement learning with structured reasoning to evaluate text-image alignment at the element level for text-to-image models.

Findings

01. Achieves state-of-the-art performance on four benchmarks.

02. Outperforms proprietary models and supervised baselines.

03. Demonstrates superior inference efficiency.

Abstract

Evaluating the alignment between textual prompts and generated images is critical for ensuring the reliability and usability of text-to-image (T2I) models. However, most existing evaluation methods rely on coarse-grained metrics or static QA pipelines, which lack fine-grained interpretability and struggle to reflect human preferences. To address this, we propose REVEALER, a unified framework for element-level alignment evaluation based on reinforcement-guided visual reasoning. Adopting a structured "grounding-reasoning-conclusion" paradigm, our method enables Multimodal Large Language Models (MLLMs) to explicitly localize semantic elements and derive interpretable alignment judgments. We optimize the model via Group Relative Policy Optimization (GRPO) using a composite reward function that incorporates structural format, grounding accuracy, and alignment fidelity. Extensive experiments…
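The abstract names three reward components (structural format, grounding accuracy, alignment fidelity) and GRPO, but gives no formulas. The Python sketch below is one hypothetical way such a composite reward and GRPO's group-relative advantage could fit together. The tag names, the box format, the IoU-based grounding score, the exact-match alignment score, and the weights are all assumptions made for illustration, not the authors' definitions. The only part taken from GRPO as generally described is standardizing each reward against the mean and standard deviation of its sampled group.

```python
import re
from statistics import mean, pstdev

# Hypothetical response template for a "grounding-reasoning-conclusion" output.
# The tags and the comma-separated box format are assumptions, not the paper's.
PATTERN = re.compile(
    r"^<grounding>(.+?)</grounding>\s*"
    r"<reasoning>(.+?)</reasoning>\s*"
    r"<conclusion>(.+?)</conclusion>$",
    re.DOTALL,
)


def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def composite_reward(response, gold_box, gold_label, weights=(0.2, 0.4, 0.4)):
    """Weighted sum of format, grounding, and alignment terms (weights are assumed)."""
    match = PATTERN.match(response.strip())
    if match is None:
        return 0.0  # malformed output earns nothing in this sketch
    r_format = 1.0
    try:
        box = tuple(float(v) for v in match.group(1).split(","))
    except ValueError:
        box = ()
    r_ground = iou(box, gold_box) if len(box) == 4 else 0.0
    r_align = 1.0 if match.group(3).strip().lower() == gold_label else 0.0
    w_f, w_g, w_a = weights
    return w_f * r_format + w_g * r_ground + w_a * r_align


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of responses to the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

In a training loop, several responses would be sampled for the same prompt-image pair, each scored with `composite_reward`, and the resulting list passed to `group_relative_advantages`. The clipped policy-gradient update and KL penalty that GRPO applies on top of these advantages are omitted here.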

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Topic Modeling