DatBench: Discriminative, Faithful, and Efficient VLM Evaluations

DatologyAI: Siddharth Joshi; Haoli Yin; Rishabh Adiga; Ricardo Monti; Aldo Carranza; Alex Fang; Alvin Deng; Amro Abbas; Brett Larsen; Cody Blakeney; Darren Teh; David Schwab; Fan Pan; Haakon Mongstad; Jack Urbanek; Jason Lee; Jason Telanoff; Josh Wills; Kaleigh Mentzer; Luke Merrick; Parth Doshi; Paul Burstein; Pratyush Maini; Scott Loftin; Spandan Das; Tony Jiang; Vineeth Dorna; Zhengping Wang; Bogdan Gaza; Ari Morcos; Matthew Leavitt

arXiv:2601.02316·cs.LG·January 15, 2026

TL;DR

This paper introduces DatBench, a refined evaluation suite for vision-language models that improves fidelity, discriminability, and efficiency by filtering and transforming existing benchmarks, revealing significant capability drops and reducing computational costs.

Contribution

The paper proposes a new evaluation framework for VLMs that addresses current shortcomings by filtering datasets and transforming tasks, enhancing evaluation reliability and efficiency.

Findings

1. Filtering and transforming existing datasets improves discriminability.

2. Converting multiple-choice questions to generative tasks reveals capability drops of up to 35%.

3. The new evaluation suite achieves up to a 50x speedup.

Abstract

Empirical evaluation serves as the primary compass guiding research progress in foundation models. Despite a large body of work focused on training frontier vision-language models (VLMs), approaches to their evaluation remain nascent. To guide their maturation, we propose three desiderata that evaluations should satisfy: (1) faithfulness to the modality and application, (2) discriminability between models of varying quality, and (3) efficiency in compute. Through this lens, we identify critical failure modes that violate faithfulness and discriminability, misrepresenting model capabilities: (i) multiple-choice formats reward guessing, poorly reflect downstream use cases, and saturate early as models improve; (ii) blindly solvable questions, which can be answered without images, constitute up to 70% of some evaluations; and (iii) mislabeled or ambiguous samples compromise up to 42% of…
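As a rough illustration of failure mode (ii), the sketch below flags questions that text-only runs answer correctly and reports the share of a benchmark they make up. It is a minimal sketch under stated assumptions, not the paper's procedure: the `Sample` record, the `blind_correct` field, and the `min_fraction` threshold are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Sample:
    # Hypothetical record: one benchmark question plus the outcomes of
    # answering it WITHOUT the image (one bool per text-only run).
    sample_id: str
    blind_correct: List[bool]


def is_blindly_solvable(sample: Sample, min_fraction: float = 0.5) -> bool:
    """Flag a sample when at least `min_fraction` of image-free runs are correct.

    The 0.5 default is an assumed threshold for illustration only.
    """
    if not sample.blind_correct:
        return False
    return sum(sample.blind_correct) / len(sample.blind_correct) >= min_fraction


def split_by_blind_solvability(
    samples: List[Sample], min_fraction: float = 0.5
) -> Tuple[List[Sample], List[Sample]]:
    """Return (kept, flagged): flagged samples do not appear to need the image."""
    kept: List[Sample] = []
    flagged: List[Sample] = []
    for s in samples:
        (flagged if is_blindly_solvable(s, min_fraction) else kept).append(s)
    return kept, flagged


# Toy data, invented for the example.
samples = [
    Sample("q1", [True, True, False]),    # 2 of 3 blind runs correct
    Sample("q2", [False, False, False]),  # never solved without the image
    Sample("q3", [True, False, False]),   # 1 of 3 blind runs correct
]
kept, flagged = split_by_blind_solvability(samples)
blind_rate = len(flagged) / len(samples)  # share of blindly solvable questions
```

With this toy data, only `q1` is flagged, so one third of the questions count as blindly solvable. A stricter or looser `min_fraction` trades off how aggressively image-independent questions are removed against how many samples remain.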

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Natural Language Processing Techniques