In-Depth and In-Breadth: Pre-training Multimodal Language Models Customized for Comprehensive Chart Understanding

Wan-Cyuan Fan; Yen-Chun Chen; Mengchen Liu; Alexander Jacobson; Lu Yuan; Leonid Sigal

arXiv:2507.14298 · cs.CL · July 22, 2025

TL;DR

ChartScope is a vision-language model designed for comprehensive understanding of diverse scientific charts. It uses synthesized paired data and a Dual-Path training strategy to improve chart-data alignment and reasoning across many chart types.

Contribution

The paper introduces ChartScope, a new LVLM built with an efficient data generation pipeline and a Dual-Path training strategy for chart comprehension that is both broad (many chart types) and deep (grounded in the underlying data).

Findings

01. Significantly improves understanding across diverse chart types.
02. Outperforms existing models on the ChartDQA benchmark.
03. Demonstrates effective reasoning over underlying chart data.

Abstract

Recent methods for customizing Large Vision Language Models (LVLMs) for domain-specific tasks have shown promising results in scientific chart comprehension. However, existing approaches face two major limitations. First, they rely on paired data from only a few chart types, which limits generalization to a wide range of chart types. Second, they lack targeted pre-training for chart-data alignment, which hampers the model's understanding of the underlying data. In this paper, we introduce ChartScope, an LVLM optimized for in-depth chart comprehension across diverse chart types. We propose an efficient data generation pipeline that synthesizes paired data for a wide range of chart types, along with a novel Dual-Path training strategy that enables the model to succinctly capture essential data details while preserving robust reasoning capabilities by incorporating reasoning over the underlying…
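To make the idea of "paired data" more concrete, here is a minimal, hypothetical sketch of how one synthetic training sample could pair an underlying data table with a chart specification and two kinds of targets: a data-extraction target (loosely in the spirit of chart-data alignment) and question-answer pairs whose answers are computed from the same table (loosely in the spirit of reasoning over the underlying data). Everything here is an illustrative assumption and is not taken from the paper: the chart-type list, the `PairedSample` structure, the spec format, and the helpers `make_table`, `make_sample`, and `generate` are all hypothetical. It is not the authors' pipeline or their Dual-Path method.

```python
import json
import random
from dataclasses import dataclass

# Hypothetical chart types for illustration only; not the paper's list.
CHART_TYPES = ["bar", "line", "pie", "scatter"]


@dataclass
class PairedSample:
    """One hypothetical synthetic sample: data table, chart spec, and targets."""
    chart_type: str
    table: dict               # underlying data: label -> value
    chart_spec: dict          # hypothetical renderer-agnostic spec a plotting tool could draw
    data_target: str          # assumed data-extraction target (table serialized as CSV text)
    qa_pairs: list            # (question, answer) pairs answered from the table


def make_table(rng: random.Random, n_items: int) -> dict:
    """Create a random labeled table with values rounded to one decimal."""
    labels = [f"Item {i + 1}" for i in range(n_items)]
    return {label: round(rng.uniform(1, 100), 1) for label in labels}


def make_sample(rng: random.Random, chart_type: str, n_items: int = 5) -> PairedSample:
    """Build one sample whose chart spec and targets all derive from the same table."""
    table = make_table(rng, n_items)
    spec = {
        "mark": chart_type,
        "data": [{"label": k, "value": v} for k, v in table.items()],
    }
    # Serialize the underlying data so a model could be trained to recover it.
    data_target = "label,value\n" + "\n".join(f"{k},{v}" for k, v in table.items())
    # Answers are computed from the table, so they stay consistent with the chart.
    max_label = max(table, key=table.get)
    total = round(sum(table.values()), 1)
    qa_pairs = [
        ("Which item has the largest value?", max_label),
        ("What is the sum of all values?", str(total)),
    ]
    return PairedSample(chart_type, table, spec, data_target, qa_pairs)


def generate(num_samples: int, seed: int = 0) -> list:
    """Generate reproducible samples by seeding a private random generator."""
    rng = random.Random(seed)
    return [make_sample(rng, rng.choice(CHART_TYPES)) for _ in range(num_samples)]


if __name__ == "__main__":
    samples = generate(3)
    print(json.dumps(samples[0].chart_spec, indent=2))
    print(samples[0].data_target)
    print(samples[0].qa_pairs)
```

The key design choice in this sketch is that the chart, the extraction target, and the QA answers all come from a single table. As a result, the labels are correct by construction and new chart types only require a new `mark` value and renderer. A real pipeline would also render each spec to an image, which is omitted here.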
