T2I-Eval-R1: Reinforcement Learning-Driven Reasoning for Interpretable Text-to-Image Evaluation

Zi-Ao Ma; Tian Lan; Rong-Cheng Tu; Shu-Hang Liu; Heyan Huang; Zhijing Wu; Chen Xu; Xian-Ling Mao

arXiv:2505.17897 · cs.AI · May 26, 2025

TL;DR

This paper introduces T2I-Eval-R1, a reinforcement learning framework that trains open-source multimodal models to evaluate text-to-image generation quality with interpretable reasoning, reducing reliance on costly, high-quality critique datasets.

Contribution

The paper presents a novel reinforcement learning approach that enables open-source models to generate interpretable evaluation rationales using only coarse quality scores, improving scalability and interpretability.

Findings

01. Outperforms baseline methods in alignment with human judgments

02. Produces more accurate and interpretable evaluation rationales

03. Achieves higher robustness and discriminative power in evaluation

Abstract

The rapid progress in diffusion-based text-to-image (T2I) generation has created an urgent need for interpretable automatic evaluation methods that can assess the quality of generated images, thereby reducing the human annotation burden. To reduce the prohibitive cost of relying on commercial models for large-scale evaluation, and to improve the reasoning capabilities of open-source models, recent research has explored supervised fine-tuning (SFT) of multimodal large language models (MLLMs) as dedicated T2I evaluators. However, SFT approaches typically rely on high-quality critique datasets, which are either generated by proprietary LLMs (with potential issues of bias and inconsistency) or annotated by humans at high cost, limiting their scalability and generalization. To address these limitations, we propose T2I-Eval-R1, a novel reinforcement learning framework that trains open-source…

Taxonomy

Methods: Shrink and Fine-Tune