On Pre-training of Multimodal Language Models Customized for Chart Understanding

Wan-Cyuan Fan; Yen-Chun Chen; Mengchen Liu; Lu Yuan; Leonid Sigal

arXiv:2407.14506·cs.CV·July 21, 2025·1 cites


Open Access · 3 Reviews

TL;DR

This paper enhances multimodal language models for scientific chart understanding by integrating raw data, textual representations, and data extraction steps, resulting in a specialized model called CHOPINLLM that outperforms existing approaches.

Contribution

The paper introduces training strategies for MLLMs that improve chart comprehension, including alignment pre-training on raw data values, random replacement of chart images with their textual representation, and data extraction steps, culminating in the CHOPINLLM model.

Findings

01. CHOPINLLM effectively interprets various chart types.
02. Incorporating raw data improves comprehension accuracy.
03. Textual representation transfer enhances reasoning skills.

Abstract

Recent studies customizing Multimodal Large Language Models (MLLMs) for domain-specific tasks have yielded promising results, especially in the field of scientific chart comprehension. These studies generally utilize visual instruction tuning with specialized datasets to enhance question and answer (QA) accuracy within the chart domain. However, they often neglect the fundamental discrepancy between natural image-caption pre-training data and digital chart image-QA data, particularly in the models' capacity to extract underlying numeric values from charts. This paper tackles this oversight by exploring the training processes necessary to improve MLLMs' comprehension of charts. We present three key findings: (1) Incorporating raw data values in alignment pre-training markedly improves comprehension of chart data. (2) Replacing images with their textual representation randomly during…
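
As an illustration of finding (2), the random replacement of a chart image with its textual representation can be sketched as a per-sample data-preparation step. This is a minimal sketch under stated assumptions, not the authors' implementation: the names `ChartSample`, `build_training_input`, and `replace_prob`, the use of a CSV-like string as the textual representation, and the output dictionary layout are all hypothetical and do not come from the paper.

```python
import random
from dataclasses import dataclass


@dataclass
class ChartSample:
    # All field names are hypothetical, chosen for illustration only.
    image_path: str   # path to the rendered chart image
    text_repr: str    # textual stand-in for the chart (assumed: raw data as text)
    question: str
    answer: str


def build_training_input(sample: ChartSample, replace_prob: float,
                         rng: random.Random) -> dict:
    """Build one training example.

    With probability `replace_prob` the chart image is dropped and its
    textual representation is prepended to the question instead.
    """
    if rng.random() < replace_prob:
        # Text-only variant: the model sees the textual representation.
        return {
            "image": None,
            "prompt": f"{sample.text_repr}\n{sample.question}",
            "target": sample.answer,
        }
    # Image variant: the model sees the chart image and the question.
    return {
        "image": sample.image_path,
        "prompt": sample.question,
        "target": sample.answer,
    }


if __name__ == "__main__":
    sample = ChartSample(
        image_path="chart_0001.png",
        text_repr="year,value\n2020,1\n2021,3",
        question="What is the value in 2021?",
        answer="3",
    )
    rng = random.Random(0)
    for _ in range(4):
        example = build_training_input(sample, replace_prob=0.5, rng=rng)
        print("text-only" if example["image"] is None else "image")
```

The replacement probability is left as a free parameter here because the excerpt above does not state what rate was used.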

Peer Reviews

Decision · Submitted to ICLR 2025

Reviewer 01 · Rating 5 · Confidence 4

Strengths

1. The paper introduces efficient training techniques that significantly enhance chart comprehension.
2. CHOPINLLM, a model for chart understanding, demonstrates strong performance with various chart types.
3. A benchmark is established to evaluate MLLMs' comprehension of different chart types, aiding future research.
4. The data generation pipeline uses text-only Large Language Models to efficiently create diverse datasets, reducing costs and complexity.

Weaknesses

1. CHOPINLLM did not achieve state-of-the-art (SOTA) performance in Table 4. While the authors claim that higher-performing models benefited from using more data and annotated datasets, there is no evidence showing that the proposed synthetic data offers performance gains when combined with existing datasets. Demonstrating that such a combination improves results would strengthen the contribution of the synthetic data. Otherwise, the benefit of using only synthetic data to build an underperformi

Reviewer 02 · Rating 6 · Confidence 5

Strengths

- Investigating ways to improve chart understanding of MLLMs from the “pretraining” (e.g., aligning the connector with captioning data) perspective is rarely explored, which sets this work apart from others that focus on chart understanding in supervised finetuning of the full model on chart QAs. Experiments demonstrate that having a curated chart understanding dataset for pretraining can significantly enhance the model’s performance when later supervised finetuned on the same set of visual QA d

Weaknesses

- A main argument of the paper seems to be that existing models could learn a shortcut that uses chart annotations to analyze the chart and answer questions (L73), while your methods result in a model that relies on them less (L478). Yet the paper contains no controlled experiments to support either claim.
- Lack of discussion and/or ablations on the effectiveness of orthogonal data and code generation compared with first generating the data and then the code. Generating code without knowing the dat

Reviewer 03 · Rating 5 · Confidence 5

Strengths

1. The authors present a clear and easy-to-understand workflow.
2. They provide a Chart instruction dataset that includes raw data and QA. The dataset creation process and its characteristics are well explained.
3. The authors offer a comprehensive summary and recommendations regarding MLLM training in the chart domain, particularly on instruction data selection and mixing.

Weaknesses

1. Training aligned with raw data is already widely adopted (e.g., ChartAst, ChartReformer). Similarly, extracting chart data before QA has been explored (e.g., OneChart).
2. The authors emphasize that their model handles unannotated charts well, but there is no specific design for addressing it. Furthermore, results on unannotated charts are not provided. Benchmarked datasets like PlotQA are overly simple and repetitive, while others such as MMC, ChartBench, and ChartX (all are provided in Tab

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling