CaptionQA: Is Your Caption as Useful as the Image Itself?

Shijia Yang; Yunong Liu; Bohan Zhai; Ximeng Sun; Zicheng Liu; Emad Barsoum; Manling Li; Chenfeng Xu

arXiv:2511.21025·cs.CV·April 17, 2026

TL;DR

CaptionQA introduces a comprehensive benchmark to evaluate how well image captions support downstream tasks across multiple domains, revealing significant gaps in current model capabilities.

Contribution

The paper presents a new utility-based benchmark, with extensive annotations and multiple-choice questions, for assessing how useful captions are in real-world downstream applications.

Findings

01

State-of-the-art models show up to 32% lower utility in caption-based tasks compared to image-based tasks.

02

CaptionQA covers 4 domains with 33,027 annotated questions, enabling detailed utility evaluation.

03

Models perform substantially worse on caption utility than on traditional image-QA benchmarks.

Abstract

Image captions serve as efficient surrogates for visual content in multimodal systems such as retrieval, recommendation, and multi-step agentic inference pipelines. Yet current evaluation practices miss a fundamental question: can captions stand in for images in real downstream tasks? We propose a utility-based benchmark, CaptionQA, to evaluate model-generated captions, in which a caption's quality is measured by how well it supports downstream tasks. CaptionQA is an extensible, domain-dependent benchmark covering 4 domains (Natural, Document, E-commerce, and Embodied AI), each with fine-grained taxonomies (25 top-level categories and 69 subcategories) that identify the information useful for domain-specific tasks. CaptionQA builds 33,027 densely annotated multiple-choice questions (50.3 per image on average) that explicitly require visual information to answer, providing a comprehensive probe of caption…
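
The utility idea in the abstract can be illustrated with a minimal sketch: answer the same multiple-choice questions once with the image available and once with only the caption, then compare accuracies. This is not the authors' evaluation code. The function names `accuracy` and `utility_gap`, the use of plain accuracy as the score, and the toy answer lists are all assumptions made for illustration; the scoring rule actually used by CaptionQA is not specified in this summary.

```python
from typing import Sequence


def accuracy(predicted: Sequence[int], gold: Sequence[int]) -> float:
    """Fraction of multiple-choice questions answered correctly."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        raise ValueError("need at least one question")
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


def utility_gap(
    image_preds: Sequence[int],
    caption_preds: Sequence[int],
    gold: Sequence[int],
) -> float:
    """Image-based accuracy minus caption-based accuracy (hypothetical metric)."""
    return accuracy(image_preds, gold) - accuracy(caption_preds, gold)


# Toy data, made up for illustration: index of the chosen option per question.
gold = [0, 2, 1, 3, 0]
image_preds = [0, 2, 1, 3, 1]    # answers given with the image available
caption_preds = [0, 2, 0, 1, 1]  # answers given with only the caption

image_acc = accuracy(image_preds, gold)
caption_acc = accuracy(caption_preds, gold)
gap = utility_gap(image_preds, caption_preds, gold)

print(f"image accuracy:   {image_acc:.2f}")
print(f"caption accuracy: {caption_acc:.2f}")
print(f"utility gap:      {gap:.2f}")
```

A larger gap means the caption dropped information that the questions needed. Whether such a gap is best reported as an absolute or a relative difference is a design choice this sketch leaves open.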

Code & Models

Repositories

bronyayang/CaptionQA
github

Datasets

Borise/CaptionQA
dataset · 1.3k downloads
