START: Spatial and Textual Learning for Chart Understanding
Zhuoming Liu, Xiaofeng Gao, Feiyang Niu, Qiaozi Gao, Liu Liu, Robinson Piramuthu

TL;DR
START combines spatial learning (chart-element grounding) and textual learning (chart-to-code generation) to improve chart understanding in multimodal large language models (MLLMs), supported by a new training dataset (START-Dataset) and a benchmark for evaluation.
Contribution
The paper proposes START, a method integrating chart-element grounding and chart-to-code generation, along with a new dataset and benchmark for comprehensive chart understanding.
Findings
START achieves significant performance improvements over baseline models.
The START-Dataset enables effective training of spatial and textual chart understanding.
START surpasses previous state-of-the-art methods on benchmark evaluations.
Abstract
Chart understanding is crucial for deploying multimodal large language models (MLLMs) in real-world scenarios such as analyzing scientific papers and technical reports. Unlike natural images, charts pair a structured visual layout (spatial property) with an underlying data representation (textual property) -- grasping both is essential for precise, fine-grained chart reasoning. Motivated by this observation, we propose START, the Spatial and Textual learning for chART understanding. Specifically, we introduce (i) chart-element grounding and (ii) chart-to-code generation to strengthen an MLLM's understanding of both chart visual layout and data details. To facilitate spatial and textual learning, we propose the START-Dataset generated with a novel data-generation pipeline that first leverages an MLLM to translate real chart images into executable chart code, recovering the underlying…
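The two training signals named in the abstract can be pictured with a small data-structure sketch. This is an illustration only, not the authors' implementation: the class names, field names, prompt wording, box format, and file name below are all hypothetical. It assumes that each sample pairs a chart image with its chart code and with pixel bounding boxes for named chart elements.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ChartElement:
    # Hypothetical record: an element name and its pixel box
    # given as (x_min, y_min, x_max, y_max).
    name: str
    box: Tuple[int, int, int, int]


@dataclass
class ChartSample:
    # Hypothetical sample: one chart image, the code that draws it,
    # and the elements that can be grounded in the image.
    image_path: str
    code: str
    elements: List[ChartElement]


def grounding_pair(sample: ChartSample, element_name: str) -> Tuple[str, str]:
    """Spatial target: ask where an element is, answer with its box."""
    for element in sample.elements:
        if element.name == element_name:
            question = f"Locate the {element.name} in the chart."
            answer = "[{}, {}, {}, {}]".format(*element.box)
            return question, answer
    raise KeyError(element_name)


def chart_to_code_pair(sample: ChartSample) -> Tuple[str, str]:
    """Textual target: ask for the chart code, answer with the source."""
    return "Write Python code that reproduces this chart.", sample.code


# Made-up example values, for illustration only.
sample = ChartSample(
    image_path="chart_0001.png",
    code="import matplotlib.pyplot as plt\nplt.bar(['a', 'b'], [3, 5])\nplt.show()",
    elements=[ChartElement("legend", (400, 20, 520, 80))],
)

print(grounding_pair(sample, "legend"))
print(chart_to_code_pair(sample)[0])
```

Keeping both targets on one sample mirrors the abstract's framing: the same chart supplies a layout signal (where an element sits) and a data signal (the code that produced it).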
Taxonomy
Topics: Handwritten Text Recognition Techniques · Multimodal Machine Learning Applications · Data Visualization and Analytics
