WikiVQABench: A Knowledge-Grounded Visual Question Answering Benchmark from Wikipedia and Wikidata
Basel Shbita, Pengyuan Li, Anna Lisa Gentile

TL;DR
WikiVQABench is a new knowledge-grounded VQA benchmark combining Wikipedia images, captions, and Wikidata, designed to evaluate models' ability to use external knowledge for visual question answering.
Contribution
It introduces a systematically constructed, human-curated benchmark that emphasizes external knowledge integration in visual question answering tasks.
Findings
An evaluation of 15 vision-language models (VLMs) yields accuracies ranging from 24.7% to 75.6%.
This wide spread indicates that the benchmark discriminates effectively between models on knowledge-intensive reasoning.
The dataset and code are publicly available.
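As a rough illustration of how accuracy on a multiple-choice benchmark of this kind is typically computed, the sketch below scores predicted option indices against gold answers. The `MCItem` record, its field names, and the toy questions are assumptions made for this example; they are not the released WikiVQABench schema or the authors' evaluation code.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class MCItem:
    """Hypothetical multiple-choice item; field names are illustrative only."""
    question: str
    choices: tuple[str, ...]
    answer_index: int  # position of the correct option in `choices`


def accuracy(items: Sequence[MCItem], predictions: Sequence[int]) -> float:
    """Return the fraction of items whose predicted index matches the gold index."""
    if len(items) != len(predictions):
        raise ValueError("items and predictions must have the same length")
    if not items:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(
        1 for item, pred in zip(items, predictions) if pred == item.answer_index
    )
    return correct / len(items)


# Toy data, invented for this sketch (not drawn from the benchmark).
items = [
    MCItem("Which option is correct?", ("A", "B", "C", "D"), 2),
    MCItem("Which option is correct?", ("A", "B", "C", "D"), 0),
    MCItem("Which option is correct?", ("A", "B", "C", "D"), 3),
    MCItem("Which option is correct?", ("A", "B", "C", "D"), 1),
]
predictions = [2, 0, 1, 1]  # hypothetical model outputs
score = accuracy(items, predictions)
```

With four options per item, as assumed here, uniform random guessing would land near 25%, which is a useful reference point when reading scores at the low end of a range.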
Abstract
Visual Question Answering (VQA) benchmarks have largely emphasized perception-based tasks that can be solved from visual content alone. In contrast, many real-world scenarios require external knowledge that is not directly observable in the image to answer correctly. We introduce WikiVQABench, a human-curated knowledge-grounded VQA benchmark constructed by systematically combining Wikipedia images, their associated article captions, and structured knowledge from Wikidata. Our pipeline uses large language models (LLMs) to generate candidate multiple-choice image-question-answer sets. All generated instances are subsequently reviewed and curated by human annotators to ensure factual correctness, visual-text consistency, and that each question requires external knowledge in addition to visual evidence for correct resolution. WikiVQABench comprises a substantial collection of Wikipedia…
