CAPability: A Comprehensive Visual Caption Benchmark for Evaluating Both Correctness and Thoroughness

Zhihang Liu; Chen-Wei Xie; Bin Wen; Feiwu Yu; Jixuan Chen; Pandeng Li; Boqiang Zhang; Nianzu Yang; Yinglu Li; Zuan Gao; Yun Zheng; Hongtao Xie

arXiv:2502.14914 · cs.CV · November 27, 2025

TL;DR

CAPability introduces a comprehensive benchmark for evaluating visual captioning, assessing correctness and thoroughness across multiple views using human annotations, QA conversions, and new metrics to identify strengths and gaps in MLLMs.

Contribution

This work presents a multi-view benchmark of nearly 11K human-annotated images and videos, and introduces new metrics such as "know but cannot tell" to evaluate captioning performance more holistically.

Findings

01 CAPability evaluates both the correctness and the thoroughness of captions.

02 It reveals significant performance gaps in current MLLMs' captioning abilities.

03 The benchmark is intended to guide future improvements in multimodal captioning models.

Abstract

Visual captioning benchmarks have become outdated with the emergence of modern multimodal large language models (MLLMs), as the brief ground-truth sentences and traditional metrics fail to assess detailed captions effectively. While recent benchmarks attempt to address this by focusing on keyword extraction or object-centric evaluation, they remain limited to vague-view or object-view analyses and incomplete visual element coverage. In this paper, we introduce CAPability, a comprehensive multi-view benchmark for evaluating visual captioning across 12 dimensions spanning six critical views. We curate nearly 11K human-annotated images and videos with visual element annotations to evaluate the generated captions. CAPability stably assesses both the correctness and thoroughness of captions with precision and hit metrics. By converting annotations to QA pairs, we further…
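
The abstract names precision and hit metrics but does not give their formulas here. The following is a minimal sketch under stated assumptions, not the authors' implementation: it assumes each annotated visual element has already been judged as correctly described, wrongly described, or not mentioned in the caption, and that precision is computed over mentioned elements while hit is computed over all annotated elements. The label names and the function `precision_and_hit` are hypothetical.

```python
from collections import Counter

# Hypothetical judgment labels for one annotated visual element.
# The abstract does not specify how a caption is judged against an annotation.
CORRECT = "correct"  # element is mentioned and described correctly
WRONG = "wrong"      # element is mentioned but described incorrectly
MISSED = "missed"    # element is not mentioned in the caption


def precision_and_hit(judgments):
    """Return (precision, hit) for a list of per-element judgments.

    Assumed definitions (not confirmed by the text above):
      precision = correct / (correct + wrong)   -> correctness of what is said
      hit       = correct / all annotated items -> thoroughness of coverage
    """
    counts = Counter(judgments)
    mentioned = counts[CORRECT] + counts[WRONG]
    total = len(judgments)
    precision = counts[CORRECT] / mentioned if mentioned else 0.0
    hit = counts[CORRECT] / total if total else 0.0
    return precision, hit


if __name__ == "__main__":
    # Four annotated elements: two described correctly, one wrongly, one omitted.
    example = [CORRECT, CORRECT, WRONG, MISSED]
    precision, hit = precision_and_hit(example)
    print(f"precision={precision:.3f} hit={hit:.3f}")
```

Under these assumed definitions, a caption that says little but says it accurately scores high on precision and low on hit, which is why the two numbers are reported separately.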

Code & Models

Datasets

lntzm/CAPability (dataset · 712 downloads)

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling