Pix2Fact: When Vision Is Not Enough -- Benchmarking Fine-Grained VQA with Web Verification on High-Resolution Real-World Scenes

Yifan Jiang; Cong Zhang; Bofei Zhang; Qiaofeng Zheng; Yifan Yang; Bingzhang Wang; Yew-Soon Ong

arXiv:2602.00593·cs.CV·May 21, 2026

TL;DR

Pix2Fact introduces a challenging benchmark for vision-language models that tests detailed visual grounding and external knowledge integration in high-resolution real-world scenes, exposing current models' limitations.

Contribution

The paper presents Pix2Fact, a new benchmark of expert-crafted questions on high-resolution images that evaluates fine-grained perception and knowledge search in VLMs and highlights their current shortcomings.

Findings

01

State-of-the-art models achieve only about 51.7% accuracy on Pix2Fact.

02

Models frequently make visual grounding errors even with ground truth.

03

Current models struggle with long-tail, unstructured local information retrieval.

Abstract

Despite progress on general tasks, vision-language models (VLMs) still struggle with challenges that demand both fine-grained visual grounding and external knowledge, a synergy overlooked by existing benchmarks that evaluate these abilities in isolation. To fill this void, we introduce Pix2Fact, a visual question-answering benchmark designed to assess expert-level visual perception and knowledge search. Pix2Fact comprises 1,000 high-resolution (4K+) images spanning eight scenarios. Its questions and answers are meticulously crafted by PhD-holding annotators from top global universities across diverse disciplines. Each question requires detailed visual grounding and the integration of external knowledge. Evaluating ten state-of-the-art VLMs, including proprietary models such as Gemini-3.1-Pro and GPT-5.4, we find that Pix2Fact poses a formidable challenge: the most advanced model…
