Multimodal ArXiv: A Dataset for Improving Scientific Comprehension of Large Vision-Language Models
Lei Li, Yuqi Wang, Runxin Xu, Peiyi Wang, Xiachong Feng, Lingpeng Kong, Qi Liu

TL;DR
This paper introduces Multimodal ArXiv, a large dataset of scientific figures and captions, to improve large vision-language models' understanding of scientific visuals, especially in mathematical reasoning and complex semantics.
Contribution
The paper creates the ArXivCap and ArXivQA datasets to enhance LVLMs' scientific comprehension and reasoning capabilities, filling a critical gap in scientific-domain training data.
Findings
ArXivQA improves open-source LVLMs' accuracy on a multimodal mathematical reasoning benchmark by 10.4% absolute.
State-of-the-art LVLMs struggle with academic figure semantics.
Domain-specific training significantly boosts performance.
Abstract
Large vision-language models (LVLMs) excel across diverse tasks involving concrete images from natural scenes. However, their ability to interpret abstract figures, such as geometry shapes and scientific plots, remains limited due to a scarcity of training datasets in scientific domains. To fill this gap, we introduce Multimodal ArXiv, consisting of ArXivCap and ArXivQA, for enhancing LVLMs' scientific comprehension. ArXivCap is a figure-caption dataset comprising 6.4M images and 3.9M captions, sourced from 572K ArXiv papers spanning various scientific domains. Drawing from ArXivCap, we introduce ArXivQA, a question-answering dataset generated by prompting GPT-4V based on scientific figures. ArXivQA greatly enhances open-sourced LVLMs' mathematical reasoning capabilities, achieving a 10.4% absolute accuracy gain on a multimodal mathematical reasoning benchmark. Furthermore, employing…
Taxonomy
Topics: Biomedical Text Mining and Ontologies · Topic Modeling
