Enhancing Reward Models for High-quality Image Generation: Beyond Text-Image Alignment

Ying Ba; Tianyu Zhang; Yalong Bai; Wenyi Mo; Tao Liang; Bing Su; Ji-Rong Wen

arXiv:2507.19002·cs.CV·July 28, 2025

TL;DR

This paper introduces a new evaluation score, ICT (Image-Contained-Text), and an HP (High-Preference) score model trained on images alone, to better align image generation with human aesthetic preferences. Both are reported to surpass existing text-image alignment methods in accuracy and quality.

Contribution

The paper proposes a novel evaluation metric and a new preference model trained solely on the image modality, improving image aesthetic assessment beyond traditional text-image alignment.

Findings

01. ICT score surpasses existing evaluation metrics by over 10% in accuracy.

02. HP score model enhances image aesthetics and detail quality.

03. The approach improves state-of-the-art text-to-image model optimization.
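
This page does not define the accuracy figure in the first finding. One common way to compare preference scorers is pairwise preference accuracy: the share of human-annotated image pairs in which the scorer ranks the preferred image higher. The sketch below assumes that reading. The names `pairwise_accuracy`, `toy_score`, and the toy data are hypothetical and are not taken from the paper or its released models.

```python
from typing import Callable, Hashable, Sequence, Tuple

# One annotated comparison: (prompt, human-preferred image id, rejected image id).
Pair = Tuple[str, Hashable, Hashable]


def pairwise_accuracy(
    pairs: Sequence[Pair],
    score: Callable[[str, Hashable], float],
) -> float:
    """Fraction of pairs where `score` ranks the preferred image strictly higher."""
    if not pairs:
        raise ValueError("need at least one annotated pair")
    correct = sum(
        1
        for prompt, preferred, rejected in pairs
        if score(prompt, preferred) > score(prompt, rejected)
    )
    return correct / len(pairs)


# Made-up scores for illustration only; a real scorer would be a trained model.
toy_scores = {"img_a": 0.9, "img_b": 0.4, "img_c": 0.2, "img_d": 0.7}
toy_pairs = [
    ("a cat on a sofa", "img_a", "img_b"),  # scorer agrees with the annotator
    ("a dog in snow", "img_c", "img_d"),    # scorer disagrees with the annotator
]


def toy_score(prompt: str, image_id: Hashable) -> float:
    # Hypothetical stand-in scorer: ignores the prompt and looks up a fixed value.
    return toy_scores[image_id]


accuracy = pairwise_accuracy(toy_pairs, toy_score)
print(f"pairwise accuracy: {accuracy:.2f}")
```

Ties count as errors here because the comparison is strict. Whether the paper treats ties that way is not stated on this page.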

Abstract

Contemporary image generation systems have achieved high fidelity and superior aesthetic quality beyond basic text-image alignment. However, existing evaluation frameworks have failed to evolve in parallel. This study reveals that human preference reward models fine-tuned based on CLIP and BLIP architectures have inherent flaws: they inappropriately assign low scores to images with rich details and high aesthetic value, creating a significant discrepancy with actual human aesthetic preferences. To address this issue, we design a novel evaluation score, ICT (Image-Contained-Text) score, that achieves and surpasses the objectives of text-image alignment by assessing the degree to which images represent textual content. Building upon this foundation, we further train an HP (High-Preference) score model using solely the image modality to enhance image aesthetics and detail quality while…
