Plot2Code: A Comprehensive Benchmark for Evaluating Multi-modal Large Language Models in Code Generation from Scientific Plots

Chengyue Wu; Yixiao Ge; Qiushan Guo; Jiahao Wang; Zhixuan Liang; Zeyu Lu; Ying Shan; Ping Luo

arXiv:2405.07990·cs.CL·May 14, 2024·1 cites

TL;DR

Plot2Code is a new benchmark for evaluating multi-modal large language models' ability to generate executable code from scientific plots, highlighting current challenges and guiding future improvements.

Contribution

We introduce Plot2Code, a comprehensive benchmark with new automatic evaluation metrics for assessing MLLMs' visual coding capabilities on scientific plots.

Findings

01 Most MLLMs struggle with text-dense plots

02 Existing models heavily rely on textual instructions

03 Evaluation reveals significant challenges in visual coding tasks

Abstract

The remarkable progress of Multi-modal Large Language Models (MLLMs) has attracted significant attention due to their superior performance in visual contexts. However, their ability to turn visual figures into executable code has not been evaluated thoroughly. To address this, we introduce Plot2Code, a comprehensive visual coding benchmark designed for a fair and in-depth assessment of MLLMs. We collect 132 manually selected, high-quality matplotlib plots across six plot types from publicly available matplotlib galleries. For each plot, we provide its source code and a descriptive instruction summarized by GPT-4. This approach enables Plot2Code to extensively evaluate MLLMs' code capabilities across various input modalities. Furthermore, we propose three automatic evaluation metrics, including code pass rate, text-match ratio, and GPT-4V overall rating, for a…
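
The abstract names the metrics but this page does not define them, so the following is only a minimal sketch of how the first two might be computed, not the authors' implementation. It assumes that code pass rate is the fraction of generated scripts that exit without error, and that text-match ratio is the share of text strings in the reference plot that also appear in the generated plot. The function names, the timeout, and the use of the `MPLBACKEND` environment variable are illustrative choices.

```python
import os
import subprocess
import sys
from collections import Counter


def code_pass_rate(snippets, timeout=30):
    """Fraction of code snippets that run to completion with exit code 0.

    Assumption: "pass" means the script executes without raising;
    the paper may apply additional checks (e.g. that an image is produced).
    """
    if not snippets:
        return 0.0
    # Force a non-interactive matplotlib backend so plt.show() does not block.
    env = dict(os.environ, MPLBACKEND="Agg")
    passed = 0
    for code in snippets:
        try:
            result = subprocess.run(
                [sys.executable, "-c", code],
                capture_output=True,
                timeout=timeout,
                env=env,
            )
        except subprocess.TimeoutExpired:
            continue  # a hung script counts as a failure
        if result.returncode == 0:
            passed += 1
    return passed / len(snippets)


def text_match_ratio(reference_texts, generated_texts):
    """Share of reference text strings that also occur in the generated plot.

    Assumption: a recall-style multiset overlap over exact strings
    (titles, axis labels, tick labels); the paper's definition may differ.
    """
    if not reference_texts:
        return 0.0
    ref = Counter(reference_texts)
    gen = Counter(generated_texts)
    matched = sum((ref & gen).values())  # multiset intersection
    return matched / sum(ref.values())
```

Running each snippet in a separate interpreter keeps a crashing or hanging script from affecting the evaluator itself; the text strings passed to `text_match_ratio` would have to be extracted from the rendered figures by some other step not shown here.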

Code & Models

Datasets

TencentARC/Plot2Code
dataset · 2.3k downloads

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems

Methods: Attention Is All You Need · Linear Layer · Multi-Head Attention · Dense Connections · Position-Wise Feed-Forward Layer · Dropout · Label Smoothing · Residual Connection · Absolute Position Encodings · Byte Pair Encoding