A-Bench: Are LMMs Masters at Evaluating AI-generated Images?

Zicheng Zhang; Haoning Wu; Chunyi Li; Yingjie Zhou; Wei Sun; Xiongkuo Min; Zijian Chen; Xiaohong Liu; Weisi Lin; Guangtao Zhai

arXiv:2406.03070·cs.CV·February 10, 2025·2 cites



TL;DR

A-Bench is a comprehensive benchmark designed to evaluate whether large multi-modal models (LMMs) can effectively assess AI-generated images, addressing the limitations of traditional benchmarks and the high costs of user studies.

Contribution

This paper introduces A-Bench, a new benchmark specifically for evaluating LMMs' ability to assess AI-generated images, emphasizing both semantic understanding and visual quality.

Findings

1. LMMs show varied performance in evaluating AIGIs.
2. A-Bench provides a comprehensive validation framework.
3. The benchmark includes 2,864 images from 16 models.

Abstract

How to accurately and efficiently assess AI-generated images (AIGIs) remains a critical challenge for generative models. Given the high costs and extensive time commitments required for user studies, many researchers have turned to large multi-modal models (LMMs) as AIGI evaluators, the precision and validity of which are still questionable. Furthermore, traditional benchmarks mostly use naturally captured content rather than AIGIs to test the abilities of LMMs, leaving a noticeable gap for AIGIs. Therefore, we introduce A-Bench in this paper, a benchmark designed to diagnose whether LMMs are masters at evaluating AIGIs. Specifically, A-Bench is organized under two key principles: 1) Emphasizing both high-level semantic understanding and low-level visual quality perception to address the intricate demands of AIGIs. 2) Various generative models are utilized for…
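A benchmark organized along these two dimensions would typically be scored per dimension. The sketch below is illustrative only: it assumes each benchmark item is a question with a single correct option, which this summary does not state, and the field names `dimension`, `answer`, and `prediction` are hypothetical, not taken from the A-Bench repository or dataset.

```python
from collections import defaultdict


def accuracy_by_dimension(items):
    """Return {dimension: fraction of items answered correctly}.

    Each item is a dict with hypothetical keys:
      "dimension"  - e.g. "semantic" (high-level) or "quality" (low-level)
      "answer"     - the ground-truth option
      "prediction" - the option chosen by the LMM under test
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for item in items:
        dim = item["dimension"]
        total[dim] += 1
        if item["prediction"] == item["answer"]:
            correct[dim] += 1
    return {dim: correct[dim] / total[dim] for dim in total}


# Made-up sample records, not real A-Bench data.
sample = [
    {"dimension": "semantic", "answer": "A", "prediction": "A"},
    {"dimension": "semantic", "answer": "B", "prediction": "C"},
    {"dimension": "quality", "answer": "D", "prediction": "D"},
    {"dimension": "quality", "answer": "A", "prediction": "A"},
]

scores = accuracy_by_dimension(sample)
print(scores)
```

Reporting the two dimensions separately, rather than as one pooled accuracy, keeps a model that is strong on semantics but weak on quality perception from looking uniformly competent.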

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01 · Rating 3 · Confidence 5

Strengths

1. The authors manually annotated a dataset containing 2864 image quality issues, which contributes to the development of AIGI evaluation.
2. The authors evaluate AIGI quality from high-level semantic aspects like counting and low-level aspects like distortion, providing valuable insights for subsequent general AIGI task evaluations.
3. The paper's A-Bench includes the evaluation performance of multiple LMMs, offering guidance for researchers who wish to use LMMs for AIGI quality assessment.

Weaknesses

1. Although A-Bench includes multiple LMMs, it lacks some of the latest SOTA models. Better models such as Qwen2-VL and MiniCPM-V 2.6 can be found on OpenCompass. The paper does not specify which version of GPT-4o was used, such as gpt-4o-2024-08-06 or gpt-4o-2024-05-13, which is crucial for future researchers.
2. The AIGI models used to generate the dataset are somewhat outdated, lacking relatively advanced image generation models such as SD3, PixArt, Flux, etc. Currently, the more outstanding AIGI m

Reviewer 02 · Rating 5 · Confidence 3

Strengths

1. The benchmark is undoubtedly useful. Given the growing reliance on LLMs to evaluate various AI-generated content like images, having a comprehensive, quantitative benchmark that assesses the effectiveness of LLMs in evaluation is highly valuable.
2. The paper tries to objectively define the underlying metrics of evaluation.
3. The benchmark development involved a rigorous process, starting with user studies to establish a baseline, followed by testing various LLMs, which adds credibility and

Weaknesses

1. While the metrics cover several important facets of semantic reasoning, they lack a rigorous scientific foundation, raising questions about whether they capture the full scope of semantic understanding as implicitly perceived by humans. Specific dimensions of semantic reasoning, such as cultural nuances or emotional depth, may be missing from the current metrics, which could impact the holistic evaluation of AI-generated images. As such, while the comparisons of different LLMs using these me

Reviewer 03 · Rating 8 · Confidence 5

Strengths

- The authors address a very important problem: Are current LLMs/LMMs good enough to be used as judges for generative models? This line of research can provide valuable insights for training better LMMs to understand AIGIs.
- A-Bench, along with standard LMM evaluation benchmarks, provides a complete picture of an LMM's capability to understand both real and AI-generated images.
- The paper is well written and very easy to follow, containing all the details necessary for reproduction.
- The experimental s

Weaknesses

- I didn't find any major weakness with this work.

Code & Models

Repositories

q-future/a-bench
Official

Datasets

q-future/A-Bench
dataset · 224 dl


Taxonomy

Topics: Medical Imaging and Analysis