MAVIS: Mathematical Visual Instruction Tuning with an Automatic Data Engine

Renrui Zhang; Xinyu Wei; Dongzhi Jiang; Ziyu Guo; Shicheng Li; Yichi Zhang; Chengzhuo Tong; Jiaming Liu; Aojun Zhou; Bin Wei; Shanghang Zhang; Peng Gao; Chunyuan Li; Hongsheng Li

arXiv:2407.08739·cs.CV·November 5, 2024·1 cites

TL;DR

MAVIS introduces an automated pipeline for creating large-scale mathematical visual datasets to enhance multi-modal large language models' reasoning, visual encoding, and problem-solving abilities in math through comprehensive instruction tuning.

Contribution

It presents a fully automated data generation process and a multi-stage training pipeline for MLLMs, significantly improving their mathematical visual understanding and reasoning capabilities.

Findings

01 Curated two large datasets: MAVIS-Caption and MAVIS-Instruct.

02 Enhanced vision-language alignment and diagram encoding in MLLMs.

03 Improved problem-solving and reasoning skills in the resulting models.

Abstract

The mathematical capabilities of Multi-modal Large Language Models (MLLMs) remain under-explored with three areas to be improved: visual encoding of math diagrams, diagram-language alignment, and chain-of-thought (CoT) reasoning. This draws forth an urgent demand for an effective training paradigm and a large-scale, comprehensive dataset with detailed CoT rationales, which is challenging to collect and costly to annotate manually. To tackle this issue, we propose MAVIS, a MAthematical VISual instruction tuning pipeline for MLLMs, featuring an automatic data engine to efficiently create mathematical visual datasets. We design the data generation process to be entirely independent of human intervention or GPT API usage, while ensuring the diagram-caption correspondence, question-answer correctness, and CoT reasoning quality. With this approach, we curate two datasets, MAVIS-Caption (558K…

Taxonomy

Topics: Mathematics Education and Teaching Techniques

Methods: Attention Is All You Need · Linear Layer · Cosine Annealing · Adam · Attention Dropout · Multi-Head Attention · Weight Decay · Byte Pair Encoding · Dropout