MAVIS: Mathematical Visual Instruction Tuning with an Automatic Data Engine
Renrui Zhang, Xinyu Wei, Dongzhi Jiang, Ziyu Guo, Shicheng Li, Yichi Zhang, Chengzhuo Tong, Jiaming Liu, Aojun Zhou, Bin Wei, Shanghang Zhang, Peng Gao, Chunyuan Li, Hongsheng Li

TL;DR
MAVIS introduces an automated pipeline for creating large-scale mathematical visual datasets to enhance multi-modal large language models' reasoning, visual encoding, and problem-solving abilities in math through comprehensive instruction tuning.
Contribution
It presents a fully automated data generation process and a multi-stage training pipeline for MLLMs, significantly improving their mathematical visual understanding and reasoning capabilities.
Findings
Curated two large datasets: MAVIS-Caption and MAVIS-Instruct.
Enhanced vision-language alignment and diagram encoding in MLLMs.
Improved problem-solving and reasoning skills in the resulting models.
Abstract
The mathematical capabilities of Multi-modal Large Language Models (MLLMs) remain under-explored with three areas to be improved: visual encoding of math diagrams, diagram-language alignment, and chain-of-thought (CoT) reasoning. This draws forth an urgent demand for an effective training paradigm and a large-scale, comprehensive dataset with detailed CoT rationales, which is challenging to collect and costly to annotate manually. To tackle this issue, we propose MAVIS, a MAthematical VISual instruction tuning pipeline for MLLMs, featuring an automatic data engine to efficiently create mathematical visual datasets. We design the data generation process to be entirely independent of human intervention or GPT API usage, while ensuring the diagram-caption correspondence, question-answer correctness, and CoT reasoning quality. With this approach, we curate two datasets, MAVIS-Caption (558K…
Taxonomy
Topics: Mathematics Education and Teaching Techniques
Methods: Attention Is All You Need · Linear Layer · Cosine Annealing · Adam · Attention Dropout · Multi-Head Attention · Weight Decay · Byte Pair Encoding · Dropout
