DISSECT: Diagnosing Where Vision Ends and Language Priors Begin in Scientific VLMs
Dikshant Kukreja, Kshitij Sah, Karan Goyal, Mukesh Mohania, Vikram Goyal

TL;DR
DISSECT is a diagnostic benchmark for the perception-integration gap in vision-language models: cases where a model extracts visual information correctly but loses it during reasoning. Open-source models show this gap, while closed-source models do not.
Contribution
The paper presents DISSECT, a 12,000-question diagnostic benchmark that evaluates every question under five input modes, including a novel Model Oracle, to separate visual perception from reasoning in VLMs.
Findings
Chemistry questions are harder for visual reasoning than Biology.
Open-source models perform better with verbalized descriptions, indicating an integration bottleneck.
Closed-source models show no gap, suggesting better perception-integration bridging.
Abstract
When asked to describe a molecular diagram, a Vision-Language Model correctly identifies "a benzene ring with an -OH group." When asked to reason about the same image, it answers incorrectly. The model can see but it cannot think about what it sees. We term this the perception-integration gap: a failure where visual information is successfully extracted but lost during downstream reasoning, invisible to single-configuration benchmarks that conflate perception with integration under one accuracy number. To systematically expose such failures, we introduce DISSECT, a 12,000-question diagnostic benchmark spanning Chemistry (7,000) and Biology (5,000). Every question is evaluated under five input modes (Vision+Text, Text-Only, Vision-Only, Human Oracle, and a novel Model Oracle in which the VLM first verbalizes the image and then reasons from its own description), yielding diagnostic…
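The contrast between the Vision+Text mode and the Model Oracle mode can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and is not taken from the paper: the `VLM` callable interface, the `(image, question, gold answer)` item layout, the prompt wording, the exact-match scoring, and the definition of the gap as Model Oracle accuracy minus Vision+Text accuracy. The other three modes are left out because the excerpt above does not define their inputs.

```python
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical interface: (image identifier or None, prompt) -> model text output.
VLM = Callable[[Optional[str], str], str]
# Hypothetical item layout: (image identifier, question text, gold answer).
Item = Tuple[str, str, str]

# Assumed prompt wording; the paper's actual prompts are not given in this excerpt.
DESCRIBE_PROMPT = "Describe this image in detail."


def answer_vision_text(vlm: VLM, image: str, question: str) -> str:
    """Vision+Text: the model receives the image and the question together."""
    return vlm(image, question)


def answer_model_oracle(vlm: VLM, image: str, question: str) -> str:
    """Model Oracle: the model verbalizes the image, then reasons from text only."""
    description = vlm(image, DESCRIBE_PROMPT)
    prompt = f"Image description: {description}\n\nQuestion: {question}"
    return vlm(None, prompt)  # no image in the second call


def accuracy(predictions: Sequence[str], golds: Sequence[str]) -> float:
    """Case-insensitive exact-match accuracy (an assumed scoring rule)."""
    if not golds:
        raise ValueError("need at least one item")
    correct = sum(
        p.strip().lower() == g.strip().lower() for p, g in zip(predictions, golds)
    )
    return correct / len(golds)


def integration_gap(vlm: VLM, items: Sequence[Item]) -> float:
    """Model Oracle accuracy minus Vision+Text accuracy (assumed definition).

    A positive value means the model does better when reasoning from its own
    description than when reasoning from the image directly.
    """
    golds = [gold for _, _, gold in items]
    direct = [answer_vision_text(vlm, img, q) for img, q, _ in items]
    oracle = [answer_model_oracle(vlm, img, q) for img, q, _ in items]
    return accuracy(oracle, golds) - accuracy(direct, golds)
```

Under these assumptions, a positive `integration_gap` corresponds to the pattern reported for open-source models, and a value near zero corresponds to the pattern reported for closed-source models.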
