Bridging vision language model (VLM) evaluation gaps with a framework for scalable and cost-effective benchmark generation

Tim Rädsch; Leon Mayer; Simon Pavicic; A. Emre Kavur; Marcel Knopp; Barış Öztürk; Klaus Maier-Hein; Paul F. Jaeger; Fabian Isensee; Annika Reinke; Lena Maier-Hein

arXiv:2502.15563 · cs.CV · February 24, 2025

TL;DR

This paper introduces a scalable framework for creating domain-specific vision-language model (VLM) benchmarks, releases new benchmarks for seven domains, and evaluates 22 models, revealing performance variance across domains and models to guide future research.

Contribution

The paper presents a resource-efficient method that uses task augmentation to derive multiple diverse VLM tasks from a single existing task. It also provides new benchmarks for seven domains, built with one homogeneous protocol, and an evaluation of 22 state-of-the-art VLMs.

Findings

1. Performance varies significantly across domains and models.

2. The new benchmarks include over 162,000 human-validated answers.

3. The framework enables cost-effective, scalable benchmark creation.

Abstract

Reliable evaluation of AI models is critical for scientific progress and practical application. While existing VLM benchmarks provide general insights into model capabilities, their heterogeneous designs and limited focus on a few imaging domains pose significant challenges for both cross-domain performance comparison and targeted domain-specific evaluation. To address this, we propose three key contributions: (1) a framework for the resource-efficient creation of domain-specific VLM benchmarks enabled by task augmentation for creating multiple diverse tasks from a single existing task, (2) the release of new VLM benchmarks for seven domains, created according to the same homogeneous protocol and including 162,946 thoroughly human-validated answers, and (3) an extensive benchmarking of 22 state-of-the-art VLMs on a total of 37,171 tasks, revealing performance variances across domains…
