Multimodal LLMs as Customized Reward Models for Text-to-Image Generation
Shijie Zhou, Ruiyi Zhang, Huaisheng Zhu, Branislav Kveton, Yufan Zhou, Jiuxiang Gu, Jian Chen, Changyou Chen

TL;DR
LLaVA-Reward is a reward model that uses pretrained multimodal large language models (MLLMs) to score text-to-image generations efficiently across multiple perspectives. The authors report that it outperforms existing methods in automatic evaluation.
Contribution
LLaVA-Reward scores text-image pairs directly from MLLM hidden states, adds a Skip-connection Cross Attention (SkipCA) module to strengthen text-image correlation reasoning, and supports several types of preference data.
Findings
LLaVA-Reward outperforms existing methods in automatic evaluation accuracy.
It effectively scales inference-time evaluation for text-to-image generation.
The model enhances text-image correlation reasoning with the SkipCA module.
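One simple way a reward model can be used at inference time is best-of-N selection: generate several candidate images for a prompt, score each, and keep the highest-scoring one. The sketch below illustrates only that general idea and is not claimed to be the procedure used in the paper. The names `toy_reward` and `best_of_n`, and the representation of an image as a set of tags, are hypothetical stand-ins; a real setup would call the trained reward model on actual images.

```python
from typing import Callable, FrozenSet, List, Tuple

Candidate = Tuple[str, FrozenSet[str]]  # (image name, tags standing in for image content)


def toy_reward(prompt: str, tags: FrozenSet[str]) -> float:
    """Hypothetical stand-in for a reward model: fraction of prompt words found in the tags."""
    words = prompt.lower().split()
    return sum(w in tags for w in words) / len(words)


def best_of_n(
    prompt: str,
    candidates: List[Candidate],
    reward_fn: Callable[[str, FrozenSet[str]], float],
) -> Tuple[str, float]:
    """Score every candidate with reward_fn and return the (name, score) of the best one."""
    scored = [(name, reward_fn(prompt, tags)) for name, tags in candidates]
    return max(scored, key=lambda pair: pair[1])


prompt = "red cube on blue sphere"
candidates = [
    ("img_0", frozenset({"red", "sphere"})),
    ("img_1", frozenset({"red", "cube", "blue", "sphere"})),
    ("img_2", frozenset({"cube"})),
]
best_name, best_score = best_of_n(prompt, candidates, toy_reward)
print(best_name, best_score)
```

Because the selection step only needs a scalar score per candidate, a reward model that scores from hidden states, without generating a text response, keeps the per-candidate cost low as N grows.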
Abstract
We introduce LLaVA-Reward, an efficient reward model designed to automatically evaluate text-to-image (T2I) generations across multiple perspectives, leveraging pretrained multimodal large language models (MLLMs). Existing MLLM-based approaches require instruction-following data for supervised fine-tuning and evaluate generation quality by analyzing a generated text response, which makes them time-consuming to run and difficult to train. To address this problem, we propose LLaVA-Reward, which directly utilizes the hidden states of MLLMs given text-image pairs. To enhance the bidirectional interaction between visual and textual representations in decoder-only MLLMs, we further propose adding a Skip-connection Cross Attention (SkipCA) module. This design enhances text-image correlation reasoning by connecting early-layer visual features with later-layer hidden representations. In addition, LLaVA-Reward supports…
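The idea of connecting early-layer visual features with later-layer hidden representations can be sketched as a single-head cross-attention step with a residual connection, followed by a linear head that produces a scalar reward. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the choice of later-layer states as queries and early visual features as keys and values, the single head, the random weights, the last-token readout, and all names (`skip_cross_attention`, `reward_head`, `w_q`, `w_k`, `w_v`, `w_o`, `w_r`) are hypothetical.

```python
import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def skip_cross_attention(late_hidden, early_visual, w_q, w_k, w_v, w_o):
    """Single-head cross attention (assumed layout).

    late_hidden:  (T, d) later-layer hidden states, used as queries.
    early_visual: (V, d) early-layer visual features, used as keys and values.
    Returns the fused states (T, d) and the attention weights (T, V).
    """
    q = late_hidden @ w_q
    k = early_visual @ w_k
    v = early_visual @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])   # scaled dot-product scores, (T, V)
    attn = softmax(scores, axis=-1)           # each query row sums to 1
    fused = late_hidden + (attn @ v) @ w_o    # residual keeps the original hidden states
    return fused, attn


def reward_head(fused: np.ndarray, w_r: np.ndarray) -> float:
    """Map the last position's fused state to a scalar reward (assumed readout)."""
    return float(fused[-1] @ w_r)


rng = np.random.default_rng(0)
d, T, V = 16, 5, 9  # hidden size, number of late positions, number of visual tokens
late_hidden = rng.normal(size=(T, d))
early_visual = rng.normal(size=(V, d))
w_q, w_k, w_v, w_o = (rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(4))
w_r = rng.normal(size=d)

fused, attn = skip_cross_attention(late_hidden, early_visual, w_q, w_k, w_v, w_o)
score = reward_head(fused, w_r)
print(fused.shape, attn.shape)
```

The residual form means the module adds visual information on top of the existing hidden states rather than replacing them, so setting `w_o` to zero would recover the plain hidden-state reward.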
