VisDoT: Enhancing Visual Reasoning through Human-Like Interpretation Grounding and Decomposition of Thought
Eunsoo Lee, Jeongwoo Lee, Minki Hong, Jangho Choi, Jihie Kim

TL;DR
VisDoT enhances visual reasoning in vision-language models by grounding perception in human-like interpretation and decomposing questions into perception and logic, leading to significant performance improvements on chart-based and open-domain VQA benchmarks.
Contribution
The paper introduces VisDoT, a framework that formalizes four perceptual tasks grounded in the theory of graphical perception and uses Decomposition-of-Thought (DoT) prompting to improve visual reasoning in LVLMs.
Findings
+11.2% improvement on ChartQA after fine-tuning InternVL with VisDoT
Surpasses GPT-4o on ChartQAPro benchmark
+33.2% improvement on VisDoTQA benchmark
Abstract
Large vision-language models (LVLMs) struggle to reliably detect visual primitives in charts and align them with semantic representations, which severely limits their performance on complex visual reasoning. This lack of perceptual grounding constitutes a major bottleneck for chart-based reasoning. We propose VisDoT, a framework that enhances visual reasoning through human-like interpretation grounding. We formalize four perceptual tasks based on the theory of graphical perception, including position and length. Building on this foundation, we introduce Decomposition-of-Thought (DoT) prompting, which sequentially separates questions into visual perception sub-questions and logic sub-questions. Fine-tuning InternVL with VisDoT achieves a +11.2% improvement on ChartQA and surpasses GPT-4o on the more challenging ChartQAPro benchmark. On the newly introduced VisDoTQA benchmark, the model…
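The DoT idea described in the abstract, splitting a question into visual perception sub-questions followed by logic sub-questions, can be illustrated with a small prompt-building sketch. This is not the authors' implementation: the `SubQuestion` class, the `build_dot_prompt` function, the prompt wording, and the example sub-questions are all hypothetical, chosen only to show the perception-then-logic ordering.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SubQuestion:
    """Hypothetical container for one decomposed sub-question."""
    kind: str  # either "perception" or "logic"
    text: str


def build_dot_prompt(question: str, sub_questions: List[SubQuestion]) -> str:
    """Assemble a prompt that lists perception sub-questions before logic ones.

    The layout is an assumption for illustration, not the paper's template.
    """
    for s in sub_questions:
        if s.kind not in ("perception", "logic"):
            raise ValueError(f"unknown sub-question kind: {s.kind!r}")

    # Perception sub-questions always come first, whatever the input order.
    perception = [s for s in sub_questions if s.kind == "perception"]
    logic = [s for s in sub_questions if s.kind == "logic"]

    lines = [f"Question: {question}", "Perception sub-questions:"]
    lines += [f"  P{i}. {s.text}" for i, s in enumerate(perception, 1)]
    lines.append("Logic sub-questions:")
    lines += [f"  L{i}. {s.text}" for i, s in enumerate(logic, 1)]
    lines.append("Answer the sub-questions in order, then give the final answer.")
    return "\n".join(lines)


if __name__ == "__main__":
    example = [
        SubQuestion("perception", "What is the height of the 2019 bar?"),
        SubQuestion("perception", "What is the height of the 2020 bar?"),
        SubQuestion("logic", "Which of the two values is larger?"),
    ]
    print(build_dot_prompt("Is the 2020 value higher than the 2019 value?", example))
```

Keeping the two kinds of sub-question in separate, ordered groups mirrors the sequential separation the abstract describes: values are first read off the chart, and only then combined by a reasoning step.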
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning
