ChartPoint: Guiding MLLMs with Grounding Reflection for Chart Reasoning
Zhengzhuo Xu, SiNan Du, Yiyan Qi, Siwen Lu, Chengjin Xu, Chun Yuan, Jian Guo

TL;DR
This paper introduces ChartPoint, a method that enhances multimodal large language models' ability to reason about charts by integrating visual grounding through bounding boxes and re-rendering, addressing the limitations of OCR-based content extraction.
Contribution
It proposes PointCoT, a novel approach combining reflective reasoning with visual grounding, and creates a large dataset for training models to improve chart comprehension and reasoning.
Findings
Models trained with ChartPoint outperform state-of-the-art methods on chart benchmarks.
Introduction of a new dataset with step-by-step reasoning annotations.
Enhanced reasoning accuracy in chart comprehension tasks.
Abstract
Multimodal Large Language Models (MLLMs) have emerged as powerful tools for chart comprehension. However, they rely heavily on content extracted via OCR, which leads to numerical hallucinations when a chart's textual annotations are sparse. While existing methods focus on scaling instructions, they fail to address the fundamental challenge, i.e., reasoning with visual perception. In this paper, we identify a critical observation: MLLMs exhibit weak grounding in chart elements and proportional relationships, as evidenced by their inability to localize key positions to match their reasoning. To bridge this gap, we propose PointCoT, which integrates reflective interaction into chain-of-thought reasoning in charts. By prompting MLLMs to generate bounding boxes and re-render charts based on location annotations, we establish connections between textual reasoning steps and visual grounding…
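The re-rendering idea described in the abstract (drawing a predicted bounding box back onto the chart so the location can be checked against the reasoning step) can be illustrated with a minimal sketch. This is not the authors' pipeline: the function name `overlay_box`, the nested-list image representation, and the hard-coded box standing in for a model prediction are all assumptions made for illustration.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]          # RGB
Image = List[List[Pixel]]             # image[y][x], hypothetical in-memory chart
Box = Tuple[int, int, int, int]       # (x0, y0, x1, y1), inclusive pixel coordinates

WHITE: Pixel = (255, 255, 255)
RED: Pixel = (255, 0, 0)


def overlay_box(image: Image, box: Box, color: Pixel = RED) -> Image:
    """Return a copy of `image` with the outline of `box` drawn in `color`.

    Hypothetical helper: it only mimics the "re-render with a location
    annotation" step; a real system would draw on the rendered chart image.
    """
    height, width = len(image), len(image[0])
    xa, xb = sorted((box[0], box[2]))
    ya, yb = sorted((box[1], box[3]))
    # Clamp the box to the image bounds.
    xa, xb = max(0, xa), min(width - 1, xb)
    ya, yb = max(0, ya), min(height - 1, yb)

    out = [row[:] for row in image]   # leave the input untouched
    if xa > xb or ya > yb:            # box lies entirely outside the image
        return out

    for x in range(xa, xb + 1):       # top and bottom edges
        out[ya][x] = color
        out[yb][x] = color
    for y in range(ya, yb + 1):       # left and right edges
        out[y][xa] = color
        out[y][xb] = color
    return out


# Assumed example: a blank 6x6 "chart" and a box that a model might predict.
chart = [[WHITE for _ in range(6)] for _ in range(6)]
annotated = overlay_box(chart, (1, 1, 4, 4))
```

In a reflective loop, the annotated image would be shown back to the model alongside its reasoning step, so a mismatch between the highlighted region and the claimed value can be noticed and corrected.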
Taxonomy
Topics: Handwritten Text Recognition Techniques · Data Visualization and Analytics · Multimodal Machine Learning Applications
