Why Vision Language Models Struggle with Visual Arithmetic? Towards Enhanced Chart and Geometry Understanding
Kung-Hsiang Huang, Can Qin, Haoyi Qiu, Philippe Laban, Shafiq Joty, Caiming Xiong, Chien-Sheng Wu

TL;DR
This paper investigates why vision language models struggle with visual arithmetic tasks and introduces CogAlign, a post-training strategy that improves their reasoning abilities, especially in chart and geometry understanding, while using less training data than supervised fine-tuning.
Contribution
The paper identifies the root causes of visual arithmetic deficiencies in VLMs and proposes CogAlign, a novel post-training method inspired by cognitive development theory to enhance reasoning.
Findings
CogAlign significantly improves VLM performance on visual arithmetic tasks.
The method outperforms supervised fine-tuning with less training data.
Results show enhanced transfer to downstream tasks like chart and geometry understanding.
Abstract
Vision Language Models (VLMs) have achieved remarkable progress in multimodal tasks, yet they often struggle with visual arithmetic, seemingly simple capabilities like object counting or length comparison, which are essential for relevant complex tasks like chart understanding and geometric reasoning. In this work, we first investigate the root causes of this deficiency through a suite of probing tasks focusing on basic visual arithmetic. Our analysis reveals that while pre-trained vision encoders typically capture sufficient information, the text decoder often fails to decode it correctly for arithmetic reasoning. To address this, we propose CogAlign, a novel post-training strategy inspired by Piaget's theory of cognitive development. CogAlign trains VLMs to recognize invariant properties under visual transformations. We demonstrate that this approach significantly improves the…
Taxonomy
Topics: Categorization, perception, and language · Handwritten Text Recognition Techniques · Constraint Satisfaction and Optimization
