MINT-CoT: Enabling Interleaved Visual Tokens in Mathematical Chain-of-Thought Reasoning
Xinyan Chen, Renrui Zhang, Dongzhi Jiang, Aojun Zhou, Shilin Yan, Weifeng Lin, Hongsheng Li

TL;DR
MINT-CoT introduces a novel method for interleaving visual tokens into mathematical reasoning steps in LLMs, significantly improving multimodal mathematical problem-solving capabilities.
Contribution
The paper proposes MINT-CoT, a new approach that adaptively interleaves visual tokens into reasoning, along with a large dataset and a three-stage training strategy for enhanced multimodal math reasoning.
Findings
MINT-CoT-7B outperforms baselines by over 23% on multiple math benchmarks.
The authors construct a 54K-problem dataset with token-level visual-region alignment.
They demonstrate effective visually interleaved reasoning in mathematical domains.
Abstract
Chain-of-Thought (CoT) has widely enhanced mathematical reasoning in Large Language Models (LLMs), but extending it to multimodal domains remains challenging. Existing works either adopt similar textual reasoning for image input or seek to interleave visual signals into mathematical CoT. However, they face three key limitations for math problem-solving: reliance on coarse-grained box-shaped image regions, limited perception of math content by vision encoders, and dependence on external capabilities for visual modification. In this paper, we propose MINT-CoT, introducing Mathematical INterleaved Tokens for Chain-of-Thought visual reasoning. MINT-CoT adaptively interleaves relevant visual tokens into textual reasoning steps via an Interleave Token, which dynamically selects visual regions of any shape within math figures. To empower this capability, we construct the…
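The abstract describes an Interleave Token that picks out visual regions of arbitrary shape rather than rectangular boxes. The sketch below is only an illustration of that idea and is not taken from the paper: it assumes the selection works by comparing the Interleave Token's hidden state against each visual token with cosine similarity and keeping those above a threshold. The function names, the threshold value, and the toy embeddings are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_visual_tokens(interleave_state, visual_tokens, threshold=0.7):
    """Return indices of visual tokens similar enough to the interleave state.

    Assumption: a similarity threshold decides which tokens are selected;
    this summary does not confirm the actual selection rule.
    """
    return [
        i
        for i, token in enumerate(visual_tokens)
        if cosine(interleave_state, token) > threshold
    ]


def interleave(step_text, selected_indices):
    """Build one reasoning step: its text followed by the selected visual tokens."""
    return [("text", step_text)] + [("visual", i) for i in selected_indices]


# Toy 2x2 grid of patch embeddings, flattened row by row (hypothetical values).
visual_tokens = [
    [1.0, 0.0],  # patch 0: top-left
    [0.0, 1.0],  # patch 1: top-right
    [0.0, 1.0],  # patch 2: bottom-left
    [1.0, 0.1],  # patch 3: bottom-right
]

selected = select_visual_tokens([1.0, 0.0], visual_tokens)
print(interleave("Consider the marked diagonal.", selected))
```

In this toy grid the selected patches are 0 and 3, which lie on a diagonal and so cannot be described by a single tight bounding box. That is the kind of non-rectangular region the abstract contrasts with coarse box-shaped selection.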
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Advanced Graph Neural Networks
Methods: ADaptive gradient method with the OPTimal convergence rate · Shrink and Fine-Tune
