VisDoT : Enhancing Visual Reasoning through Human-Like Interpretation Grounding and Decomposition of Thought

Eunsoo Lee; Jeongwoo Lee; Minki Hong; Jangho Choi; Jihie Kim

arXiv:2603.11631·cs.AI·March 13, 2026

TL;DR

VisDoT enhances visual reasoning in vision-language models by grounding perception in human-like interpretation and decomposing questions into perception and logic sub-questions. It reports improvements on chart-based and open-domain VQA benchmarks, including +11.2% on ChartQA after fine-tuning.

Contribution

The paper introduces VisDoT, a framework that formalizes four perceptual tasks based on the theory of graphical perception and uses Decomposition-of-Thought (DoT) prompting to improve visual reasoning in large vision-language models (LVLMs).

Findings

01. +11.2% improvement on ChartQA with fine-tuning

02. Surpasses GPT-4o on the ChartQAPro benchmark

03. +33.2% improvement on the VisDoTQA benchmark

Abstract

Large vision-language models (LVLMs) struggle to reliably detect visual primitives in charts and align them with semantic representations, which severely limits their performance on complex visual reasoning. This lack of perceptual grounding constitutes a major bottleneck for chart-based reasoning. We propose VisDoT, a framework that enhances visual reasoning through human-like interpretation grounding. We formalize four perceptual tasks based on the theory of graphical perception, including position and length. Building on this foundation, we introduce Decomposition-of-Thought (DoT) prompting, which sequentially separates questions into visual perception sub-questions and logic sub-questions. Fine-tuning InternVL with VisDoT achieves a +11.2% improvement on ChartQA and surpasses GPT-4o on the more challenging ChartQAPro benchmark. On the newly introduced VisDoTQA benchmark, the model…
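As a rough illustration of the DoT idea described in the abstract, the sketch below builds a prompt in which visual perception sub-questions are listed before logic sub-questions. It is not the authors' implementation: the names `SubQuestion`, `order_sub_questions`, and `build_dot_prompt`, the prompt wording, and the perception-before-logic ordering are all assumptions made for this example, and the sub-questions are written by hand rather than generated by a model.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical labels for the two sub-question types named in the abstract.
VALID_KINDS = ("perception", "logic")


@dataclass
class SubQuestion:
    kind: str  # "perception" or "logic" (assumed labels)
    text: str


def order_sub_questions(subs: List[SubQuestion]) -> List[SubQuestion]:
    """Put perception sub-questions first, then logic ones.

    Relative order within each group is preserved. The ordering itself
    is an assumption about what "sequentially" means in the abstract.
    """
    for s in subs:
        if s.kind not in VALID_KINDS:
            raise ValueError(f"unknown sub-question kind: {s.kind!r}")
    perception = [s for s in subs if s.kind == "perception"]
    logic = [s for s in subs if s.kind == "logic"]
    return perception + logic


def build_dot_prompt(question: str, subs: List[SubQuestion]) -> str:
    """Render a plain-text prompt listing the ordered sub-questions."""
    lines = [f"Question: {question}", "Answer the sub-questions in order:"]
    for i, s in enumerate(order_sub_questions(subs), start=1):
        lines.append(f"{i}. [{s.kind}] {s.text}")
    lines.append("Final answer:")
    return "\n".join(lines)


if __name__ == "__main__":
    example = [
        SubQuestion("logic", "Which of the two values is larger?"),
        SubQuestion("perception", "What is the height of the bar for 2020?"),
        SubQuestion("perception", "What is the height of the bar for 2021?"),
    ]
    print(build_dot_prompt("Did the value rise from 2020 to 2021?", example))
```

The point of separating the two kinds is that each perception step reads one value off the chart, while the logic step only combines values that have already been read.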

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning