EVALALIGN: Supervised Fine-Tuning Multimodal LLMs with Human-Aligned Data for Evaluating Text-to-Image Models
Zhiyu Tan, Xiaomeng Yang, Luozheng Qin, Mengping Yang, Cheng Zhang, Hao Li

TL;DR
EvalAlign is a supervised fine-tuning approach for multimodal large language models that creates accurate, stable, and fine-grained evaluation metrics for text-to-image models, closely aligning with human judgments.
Contribution
The paper introduces EvalAlign, a novel supervised fine-tuning method for MLLMs to produce precise, human-aligned evaluation metrics for text-to-image generation.
Findings
EvalAlign outperforms existing metrics in stability and accuracy.
It closely matches human preferences in model evaluation.
Its effectiveness is demonstrated across 24 text-to-image models.
Abstract
Recent advances in text-to-image generative models have been remarkable. Yet the field lacks evaluation metrics that accurately reflect the performance of these models, and in particular lacks fine-grained metrics that can guide their optimization. In this paper, we propose EvalAlign, a metric characterized by its accuracy, stability, and fine granularity. Our approach leverages the capabilities of Multimodal Large Language Models (MLLMs) pre-trained on extensive data. We develop evaluation protocols that focus on two key dimensions: image faithfulness and text-image alignment. Each protocol comprises a set of detailed, fine-grained instructions linked to specific scoring options, enabling precise manual scoring of the generated images. We apply supervised fine-tuning (SFT) to the MLLM to align it with human evaluative judgments, resulting in a robust evaluation…
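The abstract describes protocols made of fine-grained instructions, each linked to scoring options, along two dimensions. A minimal sketch of how such a protocol might be represented and aggregated is shown below. The class name `Question`, the function `score_image`, the example instructions, the option labels, the score values, and the use of a per-dimension mean are all illustrative assumptions, not the paper's actual protocol or scoring rule.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Question:
    """One fine-grained instruction with its scoring options (hypothetical schema)."""
    dimension: str           # e.g. "faithfulness" or "alignment"
    instruction: str         # the question shown to the annotator or MLLM
    options: dict[str, int]  # option label -> score (values are assumed)


def score_image(questions: list[Question], answers: list[str]) -> dict[str, float]:
    """Average the scores of the chosen options within each dimension.

    Using a plain mean is an assumption made for illustration.
    """
    if len(questions) != len(answers):
        raise ValueError("one answer is required per question")
    per_dimension: dict[str, list[int]] = {}
    for question, answer in zip(questions, answers):
        if answer not in question.options:
            raise ValueError(f"unknown option {answer!r}")
        per_dimension.setdefault(question.dimension, []).append(question.options[answer])
    return {dim: mean(scores) for dim, scores in per_dimension.items()}


# Hypothetical protocol with made-up instructions and option scores.
protocol = [
    Question("faithfulness", "Are the hands anatomically plausible?", {"A": 0, "B": 1, "C": 2}),
    Question("faithfulness", "Are the objects free of visible distortion?", {"A": 0, "B": 1}),
    Question("alignment", "Does the image contain every object in the prompt?", {"A": 0, "B": 1, "C": 2}),
]
result = score_image(protocol, ["C", "B", "B"])
```

In this sketch the answers could come from a human annotator or from a fine-tuned MLLM choosing among the same options, which is what makes the two directly comparable.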
Taxonomy
Topics: Image Retrieval and Classification Techniques · Handwritten Text Recognition Techniques · Multimodal Machine Learning Applications
Methods: Sparse Evolutionary Training · ALIGN · Focus
