VB: Visibility Benchmark for Visibility and Perspective Reasoning in Images
Neil Tripathi

TL;DR
VB is a benchmark that evaluates whether vision-language models can judge what is and is not visible in a photograph, reason about perspective, and abstain when the image does not support an answer. It combines confidence-aware scoring with controlled minimal edits to both image and text.
Contribution
The paper presents VB, a benchmark that tests models' visibility reasoning, their explanations of why a question is unanswerable (via reason codes), and whether their judgments change under minimal edits when and only when the evidence changes. This extends evaluation beyond prior unanswerable-VQA benchmarks.
Findings
GPT-4o and Gemini 3.1 Pro achieve top scores.
Open-source Gemma 3 12B surpasses some prior closed-source models.
Text-flip robustness is generally higher than image-flip robustness.
Abstract
We present VB, a benchmark that tests whether vision-language models can determine what is and is not visible in a photograph, and abstain when a human viewer cannot reliably answer. Each item pairs a single photo with a short yes/no visibility claim; the model must output VISIBLY_TRUE, VISIBLY_FALSE, or ABSTAIN, together with a confidence score. Items are organized into 100 families using a 2x2 design that crosses a minimal image edit with a minimal text edit, yielding 300 headline evaluation cells. Unlike prior unanswerable-VQA benchmarks, VB tests not only whether a question is unanswerable but why (via reason codes tied to specific visibility factors), and uses controlled minimal edits to verify that model judgments change when and only when the underlying evidence changes. We score models on confidence-aware accuracy with abstention (CAA), minimal-edit flip rate (MEFR),…
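To make the 2x2 family structure concrete, here is a minimal Python sketch of one family and a simple consistency check over pairs of cells that differ by a single edit. The `Cell` class, the `flip_consistency` function, and the example labels are all hypothetical and introduced only for illustration; this is not the paper's definition of CAA or MEFR, which the truncated abstract does not spell out.

```python
from dataclasses import dataclass
from itertools import combinations

# The three outputs named in the abstract.
LABELS = {"VISIBLY_TRUE", "VISIBLY_FALSE", "ABSTAIN"}


@dataclass(frozen=True)
class Cell:
    """One cell of a 2x2 family (hypothetical layout, not the paper's schema)."""
    image_variant: int  # 0 = base image, 1 = minimally edited image
    text_variant: int   # 0 = base claim, 1 = minimally edited claim
    gold: str           # reference label
    pred: str           # model output
    confidence: float   # model-reported confidence in [0, 1]

    def __post_init__(self):
        if self.gold not in LABELS or self.pred not in LABELS:
            raise ValueError("labels must be one of LABELS")


def single_edit_pairs(cells):
    """Yield pairs of cells that differ by exactly one edit (image or text)."""
    for a, b in combinations(cells, 2):
        n_diff = (a.image_variant != b.image_variant) + (a.text_variant != b.text_variant)
        if n_diff == 1:
            yield a, b


def flip_consistency(cells):
    """Fraction of single-edit pairs where the prediction changes
    if and only if the gold label changes (an assumed, simplified check)."""
    pairs = list(single_edit_pairs(cells))
    ok = sum((a.gold != b.gold) == (a.pred != b.pred) for a, b in pairs)
    return ok / len(pairs)


# Made-up example family: the model misses the flip in the doubly edited cell.
family = [
    Cell(0, 0, "VISIBLY_TRUE", "VISIBLY_TRUE", 0.9),
    Cell(1, 0, "VISIBLY_FALSE", "VISIBLY_FALSE", 0.8),
    Cell(0, 1, "VISIBLY_FALSE", "VISIBLY_FALSE", 0.7),
    Cell(1, 1, "VISIBLY_TRUE", "VISIBLY_FALSE", 0.6),
]
print(flip_consistency(family))
```

In this made-up family, two of the four single-edit pairs are handled consistently, so the check returns 0.5. A model that ignored the edits entirely would score 0 here, because every single-edit pair in this example changes the gold label.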
Taxonomy
Topics: Multimodal Machine Learning Applications · Visual Attention and Saliency Detection · Advanced Image and Video Retrieval Techniques
