UniBench: Visual Reasoning Requires Rethinking Vision-Language Beyond Scaling

Haider Al-Tahan; Quentin Garrido; Randall Balestriero; Diane Bouchacourt; Caner Hazirbas; Mark Ibrahim

arXiv:2408.04810·cs.CV·August 12, 2024

TL;DR

UniBench provides a comprehensive, unified framework to evaluate vision-language models across diverse capabilities, revealing that scaling data or models alone does not enhance reasoning skills and highlighting the importance of data quality and tailored objectives.

Contribution

The paper introduces UniBench, a unified implementation of 50+ vision-language benchmarks grouped by capability, enabling systematic evaluation and comparison of models.
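
To make the idea of a unified benchmark suite concrete, the following is a minimal sketch, not UniBench's actual API. It assumes a hypothetical setup in which every benchmark is a capability tag plus labeled examples, every model is a plain callable, and scores are averaged per capability. All names (`evaluate`, `toy_benchmarks`, `toy_model`) and all data are illustrative placeholders.

```python
from collections import defaultdict
from statistics import mean
from typing import Callable, Dict, List, Tuple

# Hypothetical types: a "model" maps an input to a predicted label,
# and a "benchmark" is a capability tag plus labeled examples.
Model = Callable[[str], str]
Benchmark = Tuple[str, List[Tuple[str, str]]]  # (capability, [(input, label), ...])


def accuracy(model: Model, examples: List[Tuple[str, str]]) -> float:
    """Fraction of examples where the model's prediction equals the label."""
    return mean(1.0 if model(x) == y else 0.0 for x, y in examples)


def evaluate(model: Model, benchmarks: Dict[str, Benchmark]) -> Dict[str, float]:
    """Run every benchmark through one protocol and average scores per capability."""
    per_capability: Dict[str, List[float]] = defaultdict(list)
    for _name, (capability, examples) in benchmarks.items():
        per_capability[capability].append(accuracy(model, examples))
    return {cap: mean(scores) for cap, scores in per_capability.items()}


# Toy stand-ins (not real UniBench data): strings play the role of images.
toy_benchmarks: Dict[str, Benchmark] = {
    "toy_objects_a": ("object recognition", [("cat photo", "cat"), ("dog photo", "dog")]),
    "toy_objects_b": ("object recognition", [("car photo", "car"), ("cat sketch", "cat")]),
    "toy_counting": ("counting", [("three apples", "3"), ("two pears", "2")]),
}


def toy_model(x: str) -> str:
    """Placeholder 'model': returns the first word, so it names objects but cannot count."""
    return x.split()[0]


scores = evaluate(toy_model, toy_benchmarks)
print(scores)
```

Grouping scores by capability rather than by individual benchmark is what lets a single table show where a model is strong (here, the toy object tasks) and where it fails (the toy counting task).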

Findings

01

Scaling data or model size improves many capabilities but not reasoning.

02

Current top models struggle with simple digit recognition tasks.

03

Data quality and tailored training objectives are more effective than scaling alone.

Abstract

Significant research efforts have been made to scale and improve vision-language model (VLM) training approaches. Yet, with an ever-growing number of benchmarks, researchers are tasked with the heavy burden of implementing each protocol, bearing a non-trivial computational cost, and making sense of how all these benchmarks translate into meaningful axes of progress. To facilitate a systematic evaluation of VLM progress, we introduce UniBench: a unified implementation of 50+ VLM benchmarks spanning a comprehensive range of carefully categorized capabilities from object recognition to spatial awareness, counting, and much more. We showcase the utility of UniBench for measuring progress by evaluating nearly 60 publicly available vision-language models, trained on scales of up to 12.8B samples. We find that while scaling training data or model size can boost many vision-language model…

Code & Models

Repositories

facebookresearch/unibench
jax · Official

Videos

UniBench: Visual Reasoning Requires Rethinking Vision-Language Beyond Scaling · slideslive

Taxonomy

Topics: Language, Metaphor, and Cognition · Spatial Cognition and Navigation · Visual and Cognitive Learning Processes

Methods: Sparse Evolutionary Training