Multimodal ArXiv: A Dataset for Improving Scientific Comprehension of Large Vision-Language Models

Lei Li; Yuqi Wang; Runxin Xu; Peiyi Wang; Xiachong Feng; Lingpeng Kong; Qi Liu

arXiv:2403.00231·cs.CV·June 4, 2024·1 cites

TL;DR

This paper introduces Multimodal ArXiv, which pairs a figure-caption dataset (ArXivCap) with a question-answering dataset (ArXivQA) built from scientific figures, to improve large vision-language models' understanding of scientific visuals, particularly mathematical reasoning and the semantics of academic figures.

Contribution

The paper creates the ArXivCap and ArXivQA datasets to enhance LVLMs' scientific comprehension and reasoning capabilities, addressing the scarcity of training data in scientific domains.

Findings

01 Training on ArXivQA gives open-source LVLMs a 10.4% absolute accuracy gain on a multimodal mathematical reasoning benchmark.

02 State-of-the-art LVLMs struggle with academic figure semantics.

03 Domain-specific training significantly boosts performance.

Abstract

Large vision-language models (LVLMs) excel across diverse tasks involving concrete images from natural scenes. However, their ability to interpret abstract figures, such as geometric shapes and scientific plots, remains limited due to a scarcity of training datasets in scientific domains. To fill this gap, we introduce Multimodal ArXiv, consisting of ArXivCap and ArXivQA, for enhancing LVLMs' scientific comprehension. ArXivCap is a figure-caption dataset comprising 6.4M images and 3.9M captions, sourced from 572K ArXiv papers spanning various scientific domains. Drawing from ArXivCap, we introduce ArXivQA, a question-answering dataset generated by prompting GPT-4V based on scientific figures. ArXivQA greatly enhances open-source LVLMs' mathematical reasoning capabilities, achieving a 10.4% absolute accuracy gain on a multimodal mathematical reasoning benchmark. Furthermore, employing…

Taxonomy

Topics: Biomedical Text Mining and Ontologies · Topic Modeling