MJ-Bench: Is Your Multimodal Reward Model Really a Good Judge for Text-to-Image Generation?
Zhaorun Chen, Yichao Du, Zichen Wen, Yiyang Zhou, Chenhang Cui, Zhenzhen Weng, Haoqin Tu, Chaoqi Wang, Zhengwei Tong, Qinglan Huang, Canyu Chen, Qinghao Ye, Zhihong Zhu, Yuqing Zhang, Jiawei Zhou, Zhuokai Zhao, Rafael Rafailov, Chelsea Finn, Huaxiu Yao

TL;DR
This paper introduces MJ-Bench, a comprehensive benchmark for evaluating multimodal judges used in guiding text-to-image models, highlighting their strengths and limitations across key aspects like safety, bias, and quality.
Contribution
The paper presents MJ-Bench, the first extensive benchmark dataset and evaluation framework for assessing the capabilities of various multimodal judges in image generation tasks.
Findings
Closed-source VLMs outperform open-source models in feedback quality.
Smaller models excel in alignment and image quality feedback.
VLMs provide more accurate safety and bias feedback, owing to their stronger reasoning abilities.
Abstract
While text-to-image models like DALLE-3 and Stable Diffusion are rapidly proliferating, they often encounter challenges such as hallucination, bias, and the production of unsafe, low-quality output. To effectively address these issues, it is crucial to align these models with desired behaviors based on feedback from a multimodal judge. Despite their significance, current multimodal judges frequently undergo inadequate evaluation of their capabilities and limitations, potentially leading to misalignment and unsafe fine-tuning outcomes. To address this issue, we introduce MJ-Bench, a novel benchmark which incorporates a comprehensive preference dataset to evaluate multimodal judges in providing feedback for image generation models across four key perspectives: alignment, safety, image quality, and bias. Specifically, we evaluate a large variety of multimodal judges including smaller-sized…
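As a rough illustration of what evaluating a judge against a preference dataset can look like, the sketch below computes pairwise accuracy: the fraction of (prompt, chosen, rejected) triples on which the judge scores the preferred image higher. This is a generic metric, not MJ-Bench's own evaluation code; `PreferencePair`, `preference_accuracy`, and `toy_judge` are hypothetical names, and images are stood in for by short text descriptions so the example runs on its own.

```python
from typing import Callable, Iterable, NamedTuple


class PreferencePair(NamedTuple):
    """One preference example (hypothetical schema, not the benchmark's)."""
    prompt: str
    chosen: str    # stand-in for the preferred image
    rejected: str  # stand-in for the dispreferred image


def preference_accuracy(
    judge: Callable[[str, str], float],
    pairs: Iterable[PreferencePair],
) -> float:
    """Fraction of pairs where the judge scores `chosen` strictly above `rejected`."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one preference pair")
    correct = sum(
        judge(p.prompt, p.chosen) > judge(p.prompt, p.rejected) for p in pairs
    )
    return correct / len(pairs)


def toy_judge(prompt: str, image: str) -> float:
    """Toy scorer: counts words shared by the prompt and the image description."""
    return float(len(set(prompt.split()) & set(image.split())))


toy_pairs = [
    # Judge agrees with the label: "red" and "apple" overlap with the chosen image.
    PreferencePair("a red apple", "red apple on table", "green pear"),
    # Judge disagrees with the label: the rejected description overlaps more.
    PreferencePair("two dogs", "one cat", "two dogs playing"),
]

accuracy = preference_accuracy(toy_judge, toy_pairs)
print(f"pairwise accuracy: {accuracy:.2f}")
```

In practice `judge` would wrap a real scoring model or VLM call, and accuracy would be reported separately for each perspective (alignment, safety, image quality, bias) rather than pooled.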
Code & Models
- 🤗 yichaodu/DiffusionDPO-alignment-claude3-opus · model · 14 dl
- 🤗 yichaodu/DiffusionDPO-alignment-gemini-1.5 · model · 17 dl · ♡ 1
- 🤗 yichaodu/DiffusionDPO-alignment-gpt-4o · model · 28 dl
- 🤗 yichaodu/DiffusionDPO-alignment-gpt-4v · model · 8 dl
- 🤗 yichaodu/DiffusionDPO-alignment-hps-2.1 · model · 13 dl
- 🤗 yichaodu/DiffusionDPO-alignment-internvl-1.5 · model · 25 dl
- 🤗 yichaodu/DiffusionDPO-safety-claude3-opus · model · 4 dl
- 🤗 yichaodu/DiffusionDPO-safety-gemini-1.5 · model · 6 dl
- 🤗 yichaodu/DiffusionDPO-safety-gpt-4o · model · 7 dl
- 🤗 yichaodu/DiffusionDPO-safety-gpt-4v · model · 11 dl
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Natural Language Processing Techniques
Methods: ALIGN · Diffusion
