On Pre-training of Multimodal Language Models Customized for Chart Understanding
Wan-Cyuan Fan, Yen-Chun Chen, Mengchen Liu, Lu Yuan, Leonid Sigal

TL;DR
This paper enhances multimodal language models for scientific chart understanding by integrating raw data, textual representations, and data extraction steps, resulting in a specialized model called CHOPINLLM that outperforms existing approaches.
Contribution
It introduces novel training strategies for MLLMs to improve chart comprehension, including data alignment, textual image replacement, and data extraction methods, culminating in the CHOPINLLM model.
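As a rough illustration of the data extraction idea, the sketch below builds a training target in which the underlying chart data is written out before the answer. The function name `format_extract_then_answer`, the `Data:`/`Answer:` layout, and the JSON serialization are assumptions made for this example; the paper's actual target format is not given here.

```python
import json


def format_extract_then_answer(raw_data: dict, answer: str) -> str:
    """Build a target string that states the chart data first, then the answer.

    Hypothetical layout: the 'Data:' and 'Answer:' prefixes are illustrative only.
    """
    # Sort keys so the serialized data is deterministic across runs.
    data_block = json.dumps(raw_data, sort_keys=True)
    return f"Data: {data_block}\nAnswer: {answer}"


if __name__ == "__main__":
    print(format_extract_then_answer({"2020": 3, "2019": 1}, "2020"))
```

Placing the data ahead of the answer means the answer tokens are conditioned on explicit values rather than only on the image features.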
Findings
CHOPINLLM effectively interprets various chart types.
Incorporating raw data improves comprehension accuracy.
Textual representation transfer enhances reasoning skills.
Abstract
Recent studies customizing Multimodal Large Language Models (MLLMs) for domain-specific tasks have yielded promising results, especially in the field of scientific chart comprehension. These studies generally utilize visual instruction tuning with specialized datasets to enhance question and answer (QA) accuracy within the chart domain. However, they often neglect the fundamental discrepancy between natural image-caption pre-training data and digital chart image-QA data, particularly in the models' capacity to extract underlying numeric values from charts. This paper tackles this oversight by exploring the training processes necessary to improve MLLMs' comprehension of charts. We present three key findings: (1) Incorporating raw data values in alignment pre-training markedly improves comprehension of chart data. (2) Replacing images with their textual representation randomly during…
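The random replacement of chart images with their textual representation, described in finding (2), might look like the sketch below. The `ChartSample` fields, the `replace_prob` value, and the returned dictionary layout are all assumptions for illustration and are not taken from the paper.

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass
class ChartSample:
    image_path: str   # rendered chart image
    chart_text: str   # hypothetical textual representation (e.g. a data table or plotting code)
    question: str
    answer: str


def build_training_input(
    sample: ChartSample,
    replace_prob: float = 0.5,          # assumed value, not from the paper
    rng: Optional[random.Random] = None,
) -> dict:
    """Return one training example, randomly swapping the image for its text form."""
    rng = rng or random.Random()
    if rng.random() < replace_prob:
        # Text-only variant: the chart is presented as text, with no image input.
        prompt = f"{sample.chart_text}\n{sample.question}"
        return {"image": None, "prompt": prompt, "target": sample.answer}
    # Image variant: the chart is presented as an image, as usual.
    return {"image": sample.image_path, "prompt": sample.question, "target": sample.answer}
```

Because the question and target are identical in both branches, the same QA supervision is applied whether the chart arrives as pixels or as text.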
Peer Reviews
Decision · Submitted to ICLR 2025
1. The paper introduces efficient training techniques that significantly enhance chart comprehension.
2. CHOPINLLM, a model for chart understanding, demonstrates strong performance with various chart types.
3. A benchmark is established to evaluate MLLMs' comprehension of different chart types, aiding future research.
4. The data generation pipeline uses text-only Large Language Models to efficiently create diverse datasets, reducing costs and complexity.
1. CHOPINLLM did not achieve state-of-the-art (SOTA) performance in Table 4. While the authors claim that higher-performing models benefited from using more data and annotated datasets, there is no evidence showing that the proposed synthetic data offers performance gains when combined with existing datasets. Demonstrating that such a combination improves results would strengthen the contribution of the synthetic data. Otherwise, the benefit of using only synthetic data to build an underperformi
- Investigating ways to improve chart understanding of MLLMs from the “pretraining” (e.g., aligning the connector with captioning data) perspective is rarely explored, which sets this work apart from others that focus on chart understanding in supervised finetuning of the full model on chart QAs. Experiments demonstrate that having a curated chart understanding dataset for pretraining can significantly enhance the model’s performance when later supervised finetuned on the same set of visual QA d
- A main argument of the paper seems to be that existing models could learn a shortcut that uses chart annotations to analyze the chart and answer questions (L73), while your methods result in a model that relies on them less (L478). Yet there are no controlled experiments in the paper to support either claim.
- Lack of discussion and/or ablations on the effectiveness of orthogonal data and code generation compared to first generating the data and then the code. Generating code without knowing the dat
1. The authors present a clear and easy-to-understand workflow.
2. They provide a Chart instruction dataset that includes raw data and QA. The dataset creation process and its characteristics are well explained.
3. The authors offer a comprehensive summary and recommendations regarding MLLM training in the chart domain, particularly on instruction data selection and mixing.
1. Training aligned with raw data is already widely adopted (e.g., ChartAst, ChartReformer). Similarly, extracting chart data before QA has been explored (e.g., OneChart).
2. The authors emphasize that their model handles unannotated charts well, but there is no specific design for addressing it. Furthermore, results on unannotated charts are not provided. Benchmarked datasets like PlotQA are overly simple and repetitive, while others such as MMC, ChartBench, and ChartX (all are provided in Tab
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
