Multimodal LLMs as Customized Reward Models for Text-to-Image Generation

Shijie Zhou; Ruiyi Zhang; Huaisheng Zhu; Branislav Kveton; Yufan Zhou; Jiuxiang Gu; Jian Chen; Changyou Chen

arXiv:2507.21391·cs.CV·July 31, 2025

TL;DR

This paper presents LLaVA-Reward, a reward model that uses pretrained multimodal large language models (MLLMs) to evaluate text-to-image generations efficiently across multiple perspectives, and reports higher automatic evaluation accuracy than existing methods.

Contribution

The paper introduces LLaVA-Reward, a reward model that scores text-image pairs directly from MLLM hidden states and adds a Skip-connection Cross Attention (SkipCA) module to strengthen text-image correlation reasoning. It supports various types of preference data.

Findings

01 LLaVA-Reward outperforms existing methods in automatic evaluation accuracy.

02 It effectively scales inference-time evaluation for text-to-image generation.

03 The model enhances text-image correlation reasoning with the SkipCA module.

Abstract

We introduce LLaVA-Reward, an efficient reward model designed to automatically evaluate text-to-image (T2I) generations across multiple perspectives, leveraging pretrained multimodal large language models (MLLMs). Existing MLLM-based approaches require instruction-following data for supervised fine-tuning and evaluate generation quality by analyzing the model's text response, which makes them time-consuming and difficult to train. To address this problem, we propose LLaVA-Reward, which directly utilizes the hidden states of MLLMs given text-image pairs. To enhance the bidirectional interaction between visual and textual representations in decoder-only MLLMs, we further propose adding a Skip-connection Cross Attention (SkipCA) module. This design enhances text-image correlation reasoning by connecting early-layer visual features with later-layer hidden representations. In addition, LLaVA-Reward supports…
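The SkipCA idea in the abstract, connecting early-layer visual features with later-layer hidden representations and reading a reward off the result, can be illustrated with a minimal pure-Python sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: a single attention head, one late-layer hidden vector acting as the query, a residual addition of the attended visual features, and a linear head that maps the fused vector to a scalar score. The names `skip_cross_attention`, `reward_head`, and `random_matrix` are hypothetical.

```python
import math
import random


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [dot(row, vector) for row in matrix]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def skip_cross_attention(late_hidden, early_visual, w_q, w_k, w_v):
    """Single-head cross attention with a residual connection (assumed form).

    The query comes from a late-layer hidden state; keys and values come
    from early-layer visual token features.
    """
    query = matvec(w_q, late_hidden)
    keys = [matvec(w_k, token) for token in early_visual]
    values = [matvec(w_v, token) for token in early_visual]
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, key) / scale for key in keys])
    attended = [
        sum(w * value[i] for w, value in zip(weights, values))
        for i in range(len(late_hidden))
    ]
    # Residual: the late hidden state plus the attended early visual features.
    fused = [h + a for h, a in zip(late_hidden, attended)]
    return fused, weights


def reward_head(fused, weight, bias):
    """Assumed linear head mapping the fused vector to a scalar reward."""
    return dot(weight, fused) + bias


def random_matrix(rows, cols):
    """Hypothetical helper: random weights standing in for trained ones."""
    return [[random.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


if __name__ == "__main__":
    random.seed(0)
    dim, num_visual_tokens = 4, 3
    late_hidden = [random.uniform(-1.0, 1.0) for _ in range(dim)]
    early_visual = random_matrix(num_visual_tokens, dim)
    fused, weights = skip_cross_attention(
        late_hidden,
        early_visual,
        random_matrix(dim, dim),
        random_matrix(dim, dim),
        random_matrix(dim, dim),
    )
    score = reward_head(fused, [random.uniform(-1.0, 1.0) for _ in range(dim)], 0.0)
    print("attention weights:", weights)
    print("reward score:", score)
```

In this sketch the reward is read off in a single pass over the hidden vectors and no text response is generated. That matches the efficiency argument in the abstract, although the actual layers, token positions, and head design used by LLaVA-Reward may differ.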
