EVALALIGN: Supervised Fine-Tuning Multimodal LLMs with Human-Aligned Data for Evaluating Text-to-Image Models

Zhiyu Tan; Xiaomeng Yang; Luozheng Qin; Mengping Yang; Cheng Zhang; Hao Li

arXiv:2406.16562·cs.CV·October 11, 2024

TL;DR

EvalAlign is a supervised fine-tuning approach for multimodal large language models that creates accurate, stable, and fine-grained evaluation metrics for text-to-image models, closely aligning with human judgments.

Contribution

The paper introduces EvalAlign, a novel supervised fine-tuning method for MLLMs to produce precise, human-aligned evaluation metrics for text-to-image generation.

Findings

01. EvalAlign outperforms existing metrics in stability and accuracy.

02. It closely matches human preferences in model evaluation.

03. Its effectiveness is demonstrated across 24 text-to-image models.

Abstract

Recent advances in text-to-image generative models have been remarkable. Yet the field lacks evaluation metrics that accurately reflect the performance of these models, and in particular lacks fine-grained metrics that can guide their optimization. In this paper, we propose EvalAlign, a metric characterized by its accuracy, stability, and fine granularity. Our approach leverages the capabilities of Multimodal Large Language Models (MLLMs) pre-trained on extensive data. We develop evaluation protocols that focus on two key dimensions: image faithfulness and text-image alignment. Each protocol comprises a set of detailed, fine-grained instructions linked to specific scoring options, enabling precise manual scoring of the generated images. We apply supervised fine-tuning (SFT) to the MLLM to align it with human evaluative judgments, resulting in a robust evaluation…
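The protocol structure described in the abstract (fine-grained instructions, each linked to scoring options, grouped under image faithfulness and text-image alignment) can be pictured with a minimal sketch. Everything in it is an assumption made for illustration: the `Question` class, the `score_image` helper, the example instructions, the option labels, and the 0 to 2 score scale are hypothetical and are not taken from the paper or its repository. Averaging per dimension is likewise just one plausible way to aggregate answers.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Question:
    """One fine-grained instruction with its scoring options (hypothetical schema)."""
    dimension: str          # "faithfulness" or "alignment"
    instruction: str        # text shown to the annotator or the evaluator model
    options: dict           # option label -> numeric score (assumed scale)


def score_image(questions, answers):
    """Average the chosen option scores per dimension for a single image."""
    per_dimension = {}
    for question, answer in zip(questions, answers):
        per_dimension.setdefault(question.dimension, []).append(question.options[answer])
    return {dim: mean(values) for dim, values in per_dimension.items()}


# Illustrative questions only; the wording is invented for this sketch.
QUESTIONS = [
    Question("faithfulness", "Are human hands anatomically plausible?",
             {"A": 0, "B": 1, "C": 2}),
    Question("faithfulness", "Are object shapes free of visible distortion?",
             {"A": 0, "B": 1, "C": 2}),
    Question("alignment", "Does the image contain every object named in the prompt?",
             {"A": 0, "B": 1, "C": 2}),
]

# One answer per question, as a human annotator or a fine-tuned MLLM might return.
scores = score_image(QUESTIONS, ["C", "A", "B"])
print(scores)
```

In this sketch the answers could come from either a human annotator or a model, which is the sense in which a fine-tuned evaluator can be compared against human scoring on the same questions.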

Code & Models

Repositories

sais-fuxi/evalalign (PyTorch, official)

Taxonomy

Topics: Image Retrieval and Classification Techniques · Handwritten Text Recognition Techniques · Multimodal Machine Learning Applications

Methods: Sparse Evolutionary Training · ALIGN · Focus