UniBench: Visual Reasoning Requires Rethinking Vision-Language Beyond Scaling
Haider Al-Tahan, Quentin Garrido, Randall Balestriero, Diane Bouchacourt, Caner Hazirbas, Mark Ibrahim

TL;DR
UniBench provides a comprehensive, unified framework to evaluate vision-language models across diverse capabilities, revealing that scaling data or models alone does not enhance reasoning skills and highlighting the importance of data quality and tailored objectives.
Contribution
The paper introduces UniBench, a unified implementation of more than 50 vision-language benchmarks grouped by capability, enabling systematic evaluation and comparison of models at scale.
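To make the idea of capability-level reporting concrete, here is a minimal sketch of how per-benchmark scores could be averaged within capability categories. It is not UniBench's actual API: the benchmark names, the category mapping, the function name `aggregate_by_capability`, and the scores are all hypothetical and exist only for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical mapping from benchmark name to capability category.
# These names are placeholders, not the benchmarks shipped with UniBench.
BENCHMARK_CAPABILITY = {
    "recognition_a": "object recognition",
    "recognition_b": "object recognition",
    "digits_a": "digit recognition",
    "counting_a": "counting",
}


def aggregate_by_capability(scores, capability_of=BENCHMARK_CAPABILITY):
    """Average per-benchmark scores within each capability category."""
    grouped = defaultdict(list)
    for benchmark, score in scores.items():
        grouped[capability_of[benchmark]].append(score)
    return {capability: mean(values) for capability, values in grouped.items()}


# Made-up accuracy values for a single model, used only to exercise the function.
example_scores = {
    "recognition_a": 0.70,
    "recognition_b": 0.90,
    "digits_a": 0.50,
    "counting_a": 0.30,
}
summary = aggregate_by_capability(example_scores)
```

Reporting one number per capability, rather than one per benchmark, is what lets a comparison across many models show whether a gain in one area (for example recognition) carries over to another (for example counting).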
Findings
Scaling data or model size improves many capabilities but not reasoning.
Current top models struggle with simple digit recognition tasks.
Data quality and tailored training objectives are more effective than scaling alone.
Abstract
Significant research efforts have been made to scale and improve vision-language model (VLM) training approaches. Yet, with an ever-growing number of benchmarks, researchers are tasked with the heavy burden of implementing each protocol, bearing a non-trivial computational cost, and making sense of how all these benchmarks translate into meaningful axes of progress. To facilitate a systematic evaluation of VLM progress, we introduce UniBench: a unified implementation of 50+ VLM benchmarks spanning a comprehensive range of carefully categorized capabilities from object recognition to spatial awareness, counting, and much more. We showcase the utility of UniBench for measuring progress by evaluating nearly 60 publicly available vision-language models, trained on scales of up to 12.8B samples. We find that while scaling training data or model size can boost many vision-language model…
Taxonomy
Topics: Language, Metaphor, and Cognition · Spatial Cognition and Navigation · Visual and Cognitive Learning Processes
Methods: Sparse Evolutionary Training
