Automated Generation of Challenging Multiple-Choice Questions for Vision Language Model Evaluation

Yuhui Zhang; Yuchang Su; Yiming Liu; Xiaohan Wang; James Burgess; Elaine Sui; Chenyu Wang; Josiah Aklilu; Alejandro Lozano; Anjiang Wei; Ludwig Schmidt; Serena Yeung-Levy

arXiv:2501.03225 · cs.CV · April 10, 2025

TL;DR

This paper introduces AutoConverter, an automated framework for converting open-ended visual questions into challenging multiple-choice questions, enabling more reliable and scalable evaluation of vision language models.

Contribution

AutoConverter automates the creation of multiple-choice questions from existing datasets, facilitating consistent and scalable evaluation of VLMs.

Findings

01  AutoConverter generates multiple-choice questions that are both correct and challenging.

02  VLMs achieve similar or lower accuracy on AutoConverter-generated questions than on human-created ones.

03  VMCBench, built by converting 20 existing VQA datasets into a unified multiple-choice format, is used to evaluate 33 state-of-the-art VLMs.

Abstract

The rapid development of vision language models (VLMs) demands rigorous and reliable evaluation. However, current visual question answering (VQA) benchmarks often depend on open-ended questions, making accurate evaluation difficult due to the variability in natural language responses. To address this, we introduce AutoConverter, an agentic framework that automatically converts these open-ended questions into multiple-choice format, enabling objective evaluation while reducing the costly multiple-choice question creation process. Our experiments demonstrate that AutoConverter can generate correct and challenging multiple-choice questions, with VLMs demonstrating consistently similar or lower accuracy on these questions compared to human-created ones. Using AutoConverter, we construct VMCBench, a benchmark created by transforming 20 existing VQA datasets into a unified multiple-choice…
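To illustrate why the multiple-choice format makes scoring objective, here is a minimal sketch in Python. It is not taken from the AutoConverter repository: the `MCQuestion` class, the `extract_choice` helper, and the sample questions are all hypothetical and serve only to show that grading reduces to comparing a chosen letter against an answer key.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class MCQuestion:
    """Hypothetical container for one multiple-choice question."""
    question: str
    choices: Dict[str, str]  # option letter -> option text
    answer: str              # letter of the correct option


def extract_choice(response: str, letters: str = "ABCD") -> Optional[str]:
    """Return the first standalone uppercase option letter in a response.

    This is a deliberately simple heuristic (an assumption of this sketch);
    it would misread a response such as "A dog" as option A.
    """
    match = re.search(rf"\b([{letters}])\b", response.strip())
    return match.group(1) if match else None


def accuracy(questions: List[MCQuestion], responses: List[str]) -> float:
    """Fraction of responses whose extracted letter matches the answer key."""
    correct = sum(
        extract_choice(resp) == q.answer
        for q, resp in zip(questions, responses)
    )
    return correct / len(questions)


# Made-up example data, for illustration only.
questions = [
    MCQuestion("What animal is shown?", {"A": "cat", "B": "dog", "C": "fox", "D": "wolf"}, "B"),
    MCQuestion("What color is the car?", {"A": "red", "B": "blue", "C": "green", "D": "black"}, "C"),
]
responses = ["The answer is B.", "(A)"]
score = accuracy(questions, responses)
```

With an open-ended question, the same check would require judging free-form text such as "a small brown dog" against a reference answer, which is where the variability described in the abstract enters.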

Code & Models

Repositories

yuhui-zh15/autoconverter
PyTorch · Official

Datasets

suyc21/VMCBench
dataset · 670 downloads

Taxonomy

Topics: Geographic Information Systems Studies · Multimodal Machine Learning Applications