Distill Visual Chart Reasoning Ability from LLMs to MLLMs
Wei He, Zhiheng Xi, Wanxu Zhao, Xiaoran Fan, Yiwen Ding, Zifei Shan, Tao Gui, Qi Zhang, Xuanjing Huang

TL;DR
This paper introduces a scalable method called Code-as-Intermediary Translation (CIT) to synthesize high-quality chart question-answering data, enabling multimodal large language models (MLLMs) to improve their visual reasoning on charts without expensive annotation.
Contribution
The paper presents CIT, a novel data synthesis approach that leverages code translation to distill visual reasoning skills into multimodal language models efficiently.
Findings
Models trained with ReachQA outperform baselines on chart reasoning tasks.
Fine-tuned models show improved general reasoning benchmark performance.
ReachQA dataset contains 3k charts and 20k Q&A pairs for training.
Abstract
Solving complex chart Q&A tasks requires advanced visual reasoning abilities in multimodal large language models (MLLMs), including recognizing key information from visual inputs and conducting reasoning over it. While fine-tuning MLLMs for reasoning is critical, collecting and annotating charts and questions is expensive, hard to scale, and often results in low-quality annotations. To address this, we propose Code-as-Intermediary Translation (CIT), a cost-effective, efficient and scalable data synthesis method for distilling visual reasoning abilities from LLMs to MLLMs. The code serves as an intermediary that translates visual chart representations into textual representations, enabling language models to understand cross-modal information and generate reasoning chains accordingly. In this way, we can employ text-based synthesizing techniques to expand chart-plotting code and generate…
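The idea of code as an intermediary can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the chart script, the helper names `extract_literals` and `make_qa`, and the sample data are all hypothetical, and the rule-based `make_qa` merely stands in for the LLM that would write questions and reasoning chains. The point it shows is that the plotting script is plain text, so a text-only model (here, a simple parser) can read the same information that the rendered chart displays.

```python
import ast

# Hypothetical chart-plotting script. In a CIT-style pipeline this text would be
# executed to render the image AND shown to a text-only LLM. It is not executed
# here, so matplotlib does not need to be installed.
CHART_CODE = '''
import matplotlib.pyplot as plt
years = [2020, 2021, 2022, 2023]
sales = [12.0, 15.5, 14.0, 21.0]
plt.plot(years, sales, marker="o")
plt.title("Annual sales (million USD)")
plt.savefig("chart.png")
'''


def extract_literals(code: str) -> dict:
    """Collect top-level `name = <literal>` assignments without running the code."""
    values = {}
    for node in ast.parse(code).body:
        if (
            isinstance(node, ast.Assign)
            and len(node.targets) == 1
            and isinstance(node.targets[0], ast.Name)
        ):
            try:
                values[node.targets[0].id] = ast.literal_eval(node.value)
            except ValueError:
                pass  # right-hand side is not a plain literal; skip it
    return values


def make_qa(data: dict) -> dict:
    """Assumed stand-in for the LLM step: build one Q&A pair with a rationale."""
    years, sales = data["years"], data["sales"]
    # Year-over-year change between consecutive data points.
    gains = [later - earlier for earlier, later in zip(sales, sales[1:])]
    best = max(range(len(gains)), key=gains.__getitem__)
    return {
        "question": "Between which consecutive years did sales grow the most?",
        "rationale": f"Year-over-year changes are {gains}; the largest is {gains[best]}.",
        "answer": f"{years[best]}-{years[best + 1]}",
    }


data = extract_literals(CHART_CODE)
qa = make_qa(data)
print(qa["question"])
print(qa["rationale"])
print(qa["answer"])
```

In a full pipeline the same script would also be run to produce the chart image, and the resulting (image, question, rationale, answer) tuple would become one training example for the MLLM.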
Peer Reviews
Decision · Submitted to ICLR 2025
Utilizing code as a generative engine is a sound approach to creating new chart images. Experimental results indicate that the new dataset significantly enhances specific performance metrics.
The use of code as a tool for rendering the chart dataset appears similar to the concept presented in [1]. Furthermore, data augmentation through code is reminiscent of [2]. Could you elaborate on how your approach differs from these works? The paper fine-tunes general-purpose MLLMs. How would performance change if an equivalent amount of chart data were selected from general datasets for fine-tuning the MLLMs? A comparative analysis might be insightful. Why is the chart type describe
1. The inspiration for the method is interesting: it applies the translation principle of using an intermediate language, and using code as that intermediate language is novel. 2. The steps behind the method are clear and reasonable: first use code to generate charts, then generate QA pairs. The authors use multiple methods, such as Evol-Instruct, Self-Instruct, and LLM-as-a-judge, to make the generated data high-quality and diverse. 3. The experim
1. The main concern is the motivation, or the story. The authors state that existing MLLMs struggle with recognition and reasoning, and then propose distilling LLMs' reasoning ability into MLLMs. However, Claude and GPT-4o are clearly strong in these respects. Therefore, the problem lies not with existing MLLMs in general but with open-source, or freely available, MLLMs. I encourage the authors to rephrase the story, which could then introduce the method more clearly
- The overall writing is clear and easy to follow. - The idea of CIT is straightforward, effective, and easily scalable. - CIT addresses an important gap in MLLM training: the lack of accurate textual annotations of visual diagrams. - The resulting dataset is of high quality. Although it contains only 3K images and 20K QAs, the performance of MLLMs improves by up to 35 percent. In addition, when trained on a mixture with a general multimodal dataset, the model effectively retains its performance on general multimodal benchmarks.
- As the authors claim, an MLLM's ability consists of two main parts: 1. recognizing key information from visual inputs, and 2. conducting reasoning over it. The design of ReachQA is rich in both parts, so it is unclear which part improves models' overall performance the most. The reviewer is aware that a similar analysis is conducted in Section 5, but an error analysis (similar to Figure 1) before and after ReachQA training would make this clearer. - Dataset volume is a concern. As a training dataset, only
Taxonomy
Topics: Mathematics, Computing, and Information Processing · Natural Language Processing Techniques · Semantic Web and Ontologies
