AutoBench-V: Can Large Vision-Language Models Benchmark Themselves?

Han Bao; Yue Huang; Yanbo Wang; Jiayi Ye; Xiangqi Wang; Xiuying Chen; Yue Zhao; Tianyi Zhou; Mohamed Elhoseiny; Xiangliang Zhang

arXiv:2410.21259 · cs.CV · March 7, 2025

TL;DR

AutoBench-V introduces an automated, flexible framework that uses LVLMs and text-to-image models to evaluate large vision-language models' capabilities without extensive human effort.

Contribution

The paper presents AutoBench-V, a novel automated benchmarking framework that enables on-demand, flexible evaluation of LVLMs using self-generated visual data.

Findings

01. Effective evaluation of nine LVLMs across multiple capabilities.
02. AutoBench-V reduces human effort in benchmarking.
03. The framework demonstrates reliability and flexibility.

Abstract

Large Vision-Language Models (LVLMs) have become essential for advancing the integration of visual and linguistic information. However, evaluating LVLMs presents significant challenges: an evaluation benchmark typically demands substantial human effort to construct, and it remains static, lacking flexibility once constructed. Even though automatic evaluation has been explored in the textual modality, the visual modality remains under-explored. In this work, we therefore address a question: "Can LVLMs themselves be used to benchmark each other automatically in the visual domain?" We introduce AutoBench-V, an automated framework for serving evaluation on demand, i.e., benchmarking LVLMs based on specific aspects of model capability. AutoBench-V leverages text-to-image models to generate relevant image samples and then utilizes LVLMs to orchestrate visual question-answering (VQA)…
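The loop the abstract describes (pick a capability aspect, generate images with a text-to-image model, have an LVLM pose VQA questions, then score a candidate model) can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the authors' implementation: every function name, the `VQAItem` type, the aspect label, and the exact-match scoring are hypothetical, and the toy stand-ins replace real model calls so the sketch runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class VQAItem:
    aspect: str      # capability being probed (label is a placeholder)
    image: bytes     # rendered image; an opaque payload in this sketch
    question: str
    reference: str   # answer the examiner expects


def build_benchmark(
    aspect: str,
    n_items: int,
    describe: Callable[[str, int], str],    # hypothetical: examiner LVLM writes an image description
    render: Callable[[str], bytes],         # hypothetical: text-to-image model renders it
    ask: Callable[[str], Tuple[str, str]],  # hypothetical: examiner writes (question, reference answer)
) -> List[VQAItem]:
    """Generate n_items VQA test cases for one requested aspect."""
    items = []
    for i in range(n_items):
        description = describe(aspect, i)
        image = render(description)
        question, reference = ask(description)
        items.append(VQAItem(aspect, image, question, reference))
    return items


def evaluate(items: List[VQAItem], answer: Callable[[bytes, str], str]) -> float:
    """Accuracy under normalized exact match (an assumption made for this sketch)."""
    if not items:
        return 0.0
    correct = sum(
        answer(it.image, it.question).strip().lower() == it.reference.strip().lower()
        for it in items
    )
    return correct / len(items)


# Toy stand-ins so the sketch runs without any model; all are hypothetical.
def toy_describe(aspect: str, i: int) -> str:
    return f"{aspect} scene {i}: {i + 1} red cubes on a table"


def toy_render(description: str) -> bytes:
    return description.encode("utf-8")  # stands in for real image bytes


def toy_ask(description: str) -> Tuple[str, str]:
    count = description.split(": ")[1].split()[0]
    return "How many red cubes are on the table?", count


def toy_candidate(image: bytes, question: str) -> str:
    # Pretends to "see" the image by decoding the toy payload.
    return image.decode("utf-8").split(": ")[1].split()[0]


items = build_benchmark("spatial understanding", 3, toy_describe, toy_render, toy_ask)
score = evaluate(items, toy_candidate)
```

Passing the generator, renderer, and examiner in as callables keeps the sketch independent of any particular model API; in practice each would wrap a real model call.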

Peer Reviews

Decision · ICLR 2025 Conference Withdrawn Submission

Reviewer 01 · Rating 6 · Confidence 4

Strengths

The idea in this paper is excellent as it can reduce the human cost involved in constructing datasets and provides multiple evaluation categories.

Weaknesses

The paper should include more discussion on quality control for the dataset, such as avoiding issues with generating incoherent or low-quality images and questions.

Reviewer 02 · Rating 1 · Confidence 4

Strengths

The generation procedure is interesting and the generated samples look reasonable. The models are evaluated in detail.

Weaknesses

Unfortunately the presentation quality is absolutely not ready for publication. The very first sentence of the introduction already contains several errors: Authors cite the “Attention is all you need” from 2017 as a 2023 paper. LLaVA is cited as an LLM even though it is a vision model. And the formulation suggests that the LLM works from 2023 pave the way for NLP works from 2020, which cannot hold. This continues in the related work section, where this paper is cited as past work: (1) “Deep vi

Reviewer 03 · Rating 3 · Confidence 4

Strengths

The authors have shown tremendous efforts to build up the comprehensiveness of AutoBench-V's design. It is a fluent experience to read through the manuscript in order to interpret the flow of the pipeline.

Weaknesses

However, my biggest concern lies in the experiments. In fact, there is one critical issue that greatly undermines my overall impression regarding the value of this work: since all the prompts and the test cases are generated on the fly using GPT-4o, which is the sole Examiner/Judge in the paper's setting, **aren't the baseline performances in Table 2 by GPT-4o technically obtained by testing on the training set?** Even if the prevention actions of self-enhancement leakage are done on the imag

Reviewer 04 · Rating 5 · Confidence 4

Strengths

1. This study represents the automatic evaluation of multimodal large language models and has a certain degree of novelty. 2. The paper is elegantly written, and the automatic evaluation process designed in the methodology section is clearly written.

Weaknesses

1. "they lack the flexibility xxx" in 053 may be an overclaim. Some charts and flowcharts (ChartQA, ScienceQA) are difficult to draw with a T2I model, but this ability also needs to be considered. Many benchmarks consider charts, documents, and other pictures as input (MMT-Bench[1]). Although the authors construct many types of tasks, all of these input pictures seem to be natural images. So I don't think this can completely become the reason for the author to claim t

Code & Models

Repositories

wad3birch/AutoBench-V
none · Official

Taxonomy

Topics: Multimodal Machine Learning Applications