Pix2Fact: When Vision Is Not Enough -- Benchmarking Fine-Grained VQA with Web Verification on High-Resolution Real-World Scenes
Yifan Jiang, Cong Zhang, Bofei Zhang, Qiaofeng Zheng, Yifan Yang, Bingzhang Wang, Yew-Soon Ong

TL;DR
Pix2Fact introduces a challenging benchmark for vision-language models that tests detailed visual grounding and external knowledge integration in high-resolution real-world scenes, exposing current models' limitations.
Contribution
The paper presents Pix2Fact, a new benchmark of expert-crafted questions on high-resolution images that evaluates fine-grained perception and knowledge search in VLMs and highlights their current shortcomings.
Findings
State-of-the-art models achieve only around 51.7% accuracy on Pix2Fact.
Models frequently make visual grounding errors even with ground truth.
Current models struggle with long-tail, unstructured local information retrieval.
Abstract
Despite progress on general tasks, vision-language models (VLMs) still struggle with challenges that demand both fine-grained visual grounding and external knowledge, a synergy overlooked by existing benchmarks that evaluate these abilities in isolation. To fill this void, we introduce Pix2Fact, a visual question-answering benchmark designed to assess expert-level visual perception and knowledge search. Pix2Fact comprises 1,000 high-resolution (4K+) images spanning eight scenarios. Its questions and answers are meticulously crafted by PhD-holding annotators from top global universities across diverse disciplines. Each question requires detailed visual grounding and the integration of external knowledge. Evaluating ten state-of-the-art VLMs, including proprietary models such as Gemini-3.1-Pro and GPT-5.4, we find that Pix2Fact poses a formidable challenge: the most advanced model…
