AVA-Bench: Atomic Visual Ability Benchmark for Vision Foundation Models

Zheda Mai; Arpita Chowdhury; Zihe Wang; Sooyoung Jeon; Lemeng Wang; Jiacheng Hou; Jihyung Kil; Wei-Lun Chao

arXiv:2506.09082·cs.CV·May 5, 2026

TL;DR

AVA-Bench is a new benchmark that evaluates vision foundation models on 14 specific visual abilities, enabling precise identification of strengths and weaknesses for better model selection and development.

Contribution

The paper introduces AVA-Bench, the first benchmark to explicitly disentangle 14 atomic visual abilities (AVAs), so that VFMs can be evaluated more precisely.

Findings

01. A 0.5B LLM yields VFM rankings similar to those of a 7B LLM while using 8x less GPU.

02. AVA-Bench reveals ability-specific strengths and weaknesses of VFMs.

03. Evaluating abilities separately makes it clearer which ability is responsible for a model's errors.
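
One way to check a claim like the first finding is to compare the model orderings produced under two LLM heads with a rank correlation. The sketch below is a minimal illustration, not the paper's procedure: the choice of Spearman correlation is an assumption, and the model names and scores are made-up placeholders.

```python
from typing import Dict


def ranks(scores: Dict[str, float]) -> Dict[str, float]:
    """Rank models from best (1) to worst; tied scores share the average rank."""
    ordered = sorted(scores, key=lambda m: scores[m], reverse=True)
    result: Dict[str, float] = {}
    i = 0
    while i < len(ordered):
        j = i
        # Extend j to cover every model tied with the one at position i.
        while j + 1 < len(ordered) and scores[ordered[j + 1]] == scores[ordered[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[ordered[k]] = average_rank
        i = j + 1
    return result


def spearman(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Spearman correlation: Pearson correlation computed on the ranks."""
    if set(a) != set(b):
        raise ValueError("both score tables must cover the same models")
    models = sorted(a)
    ra, rb = ranks(a), ranks(b)
    xs = [ra[m] for m in models]
    ys = [rb[m] for m in models]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when all ranks are equal")
    return cov / (var_x * var_y) ** 0.5


# Placeholder scores for hypothetical VFMs under a small and a large LLM head.
scores_small_llm = {"vfm_a": 0.62, "vfm_b": 0.55, "vfm_c": 0.71, "vfm_d": 0.48}
scores_large_llm = {"vfm_a": 0.70, "vfm_b": 0.66, "vfm_c": 0.79, "vfm_d": 0.51}

rho = spearman(scores_small_llm, scores_large_llm)
print(f"Spearman rank correlation: {rho:.2f}")
```

A correlation near 1 means the two heads order the models almost identically even if the absolute scores differ, which is the sense in which a smaller head could stand in for a larger one.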

Abstract

The rise of vision foundation models (VFMs) calls for systematic evaluation. A common approach pairs VFMs with large language models (LLMs) as general-purpose heads, followed by evaluation on broad Visual Question Answering (VQA) benchmarks. However, this protocol has two key blind spots: (i) the instruction tuning data may not align with VQA test distributions, meaning a wrong prediction can stem from such data mismatch rather than a VFM's visual shortcomings; (ii) VQA benchmarks often require multiple visual abilities, making it hard to tell whether errors stem from lacking all required abilities or just a single critical one. To address these gaps, we introduce AVA-Bench, the first benchmark that explicitly disentangles 14 Atomic Visual Abilities (AVAs) -- foundational skills like localization, depth estimation, and spatial understanding that collectively support complex visual…
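
To make the idea of disentangled evaluation concrete, the sketch below groups test results by ability and reports one score per ability instead of a single aggregate. It is a simplified illustration under stated assumptions: every item is treated as simply correct or incorrect, which may not match the metrics the benchmark actually uses, and the records and function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def ability_profile(records: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Return the fraction of correct answers for each ability.

    Each record is (ability_name, was_correct). Binary correctness is an
    assumption made for illustration only.
    """
    totals: Dict[str, int] = defaultdict(int)
    hits: Dict[str, int] = defaultdict(int)
    for ability, correct in records:
        totals[ability] += 1
        hits[ability] += int(correct)
    return {ability: hits[ability] / totals[ability] for ability in totals}


def weakest_ability(profile: Dict[str, float]) -> str:
    """Name the ability with the lowest score in a profile."""
    return min(profile, key=lambda ability: profile[ability])


# Hypothetical results for one VFM; the ability names come from the abstract.
example_records = [
    ("localization", True),
    ("localization", True),
    ("localization", False),
    ("depth estimation", True),
    ("depth estimation", False),
    ("depth estimation", False),
    ("spatial understanding", True),
    ("spatial understanding", True),
]

profile = ability_profile(example_records)
for ability, score in profile.items():
    print(f"{ability}: {score:.2f}")
print("weakest:", weakest_ability(profile))
```

Because each ability has its own test items, a low score points at one specific skill rather than at an unknown mixture of skills.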
