DISSECT: Diagnosing Where Vision Ends and Language Priors Begin in Scientific VLMs

Dikshant Kukreja; Kshitij Sah; Karan Goyal; Mukesh Mohania; Vikram Goyal

arXiv:2604.06250 · cs.CV · April 9, 2026

TL;DR

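DISSECT is a 12,000-question diagnostic benchmark for the perception-integration gap in vision-language models: cases where visual information is extracted correctly but lost during downstream reasoning. Open-source models perform better with verbalized descriptions, while closed-source models show no such gap.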

Contribution

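The paper presents DISSECT, a diagnostic benchmark that evaluates every question under five input modes to separate visual perception from reasoning in VLMs. One of the modes is a novel Model Oracle, in which the VLM first verbalizes the image and then reasons from its own description.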

Findings

01. Chemistry questions are harder for visual reasoning than Biology questions.

02. Open-source models perform better with verbalized descriptions, indicating that the bottleneck lies in integration rather than perception.

03. Closed-source models show no such gap, suggesting that they bridge perception and integration better.

Abstract

When asked to describe a molecular diagram, a Vision-Language Model correctly identifies "a benzene ring with an -OH group." When asked to reason about the same image, it answers incorrectly. The model can see but it cannot think about what it sees. We term this the perception-integration gap: a failure where visual information is successfully extracted but lost during downstream reasoning, invisible to single-configuration benchmarks that conflate perception with integration under one accuracy number. To systematically expose such failures, we introduce DISSECT, a 12,000-question diagnostic benchmark spanning Chemistry (7,000) and Biology (5,000). Every question is evaluated under five input modes -- Vision+Text, Text-Only, Vision-Only, Human Oracle, and a novel Model Oracle in which the VLM first verbalizes the image and then reasons from its own description -- yielding diagnostic…
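The multi-mode evaluation described in the abstract can be illustrated with a minimal sketch. It is not the authors' code. The `VLM` callable interface, the `Question` fields, the prompt wording, and the exact-match scoring are all assumptions made for illustration. Only three of the five modes are shown (Vision+Text, Text-Only, Model Oracle), because the excerpt does not say how Vision-Only and Human Oracle inputs are constructed. The final `gap` value, Model Oracle accuracy minus Vision+Text accuracy, is one plausible way to read a perception-integration gap, not a metric defined in the excerpt.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

# Hypothetical model interface: (prompt, optional image bytes) -> answer string.
VLM = Callable[[str, Optional[bytes]], str]

# Hypothetical prompt wording; the paper's actual prompts are not given here.
DESCRIBE_PROMPT = "Describe everything shown in this image."


@dataclass
class Question:
    text: str      # question text
    image: bytes   # raw image data
    answer: str    # gold answer (exact-match scoring is an assumption)


def answer_in_mode(vlm: VLM, q: Question, mode: str) -> str:
    """Query the model under one input mode."""
    if mode == "vision_text":
        return vlm(q.text, q.image)
    if mode == "text_only":
        return vlm(q.text, None)
    if mode == "model_oracle":
        # Stage 1: the model verbalizes the image.
        description = vlm(DESCRIBE_PROMPT, q.image)
        # Stage 2: the model reasons from its own description, without the image.
        prompt = f"Image description: {description}\n\nQuestion: {q.text}"
        return vlm(prompt, None)
    raise ValueError(f"unsupported mode: {mode}")


def accuracy_by_mode(vlm: VLM, questions: List[Question], modes: List[str]) -> Dict[str, float]:
    """Exact-match accuracy for each requested mode."""
    scores = {}
    for mode in modes:
        correct = sum(answer_in_mode(vlm, q, mode).strip() == q.answer for q in questions)
        scores[mode] = correct / len(questions)
    return scores


def toy_vlm(prompt: str, image: Optional[bytes]) -> str:
    """Toy stand-in that describes the image correctly but only reasons well from text."""
    if prompt == DESCRIBE_PROMPT:
        return "a benzene ring with an -OH group"
    if image is None and "benzene ring with an -OH group" in prompt:
        return "phenol"
    return "unknown"


questions = [Question("Which compound is shown?", b"<image bytes>", "phenol")]
scores = accuracy_by_mode(toy_vlm, questions, ["vision_text", "text_only", "model_oracle"])

# Assumed reading of the gap: accuracy gained by reasoning from the model's own description.
gap = scores["model_oracle"] - scores["vision_text"]
print(scores, gap)
```

With the toy model, the description stage succeeds while direct visual reasoning fails, so the Model Oracle score exceeds the Vision+Text score. That is the pattern the findings attribute to open-source models.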
