MINT-CoT: Enabling Interleaved Visual Tokens in Mathematical Chain-of-Thought Reasoning

Xinyan Chen; Renrui Zhang; Dongzhi Jiang; Aojun Zhou; Shilin Yan; Weifeng Lin; Hongsheng Li

arXiv:2506.05331 · cs.CV · June 6, 2025

TL;DR

MINT-CoT introduces a novel method for interleaving visual tokens into mathematical reasoning steps in LLMs, significantly improving multimodal mathematical problem-solving capabilities.

Contribution

The paper proposes MINT-CoT, a new approach that adaptively interleaves visual tokens into reasoning, along with a large dataset and a three-stage training strategy for enhanced multimodal math reasoning.

Findings

01. MINT-CoT-7B outperforms baselines by over 23% on multiple math benchmarks.

02. Constructed a 54K-problem dataset with token-level visual-region alignment.

03. Demonstrated effective visual interleaved reasoning in mathematical domains.

Abstract

Chain-of-Thought (CoT) has widely enhanced mathematical reasoning in Large Language Models (LLMs), but extending it to multimodal domains remains challenging. Existing works either adopt similar textual reasoning for image input or seek to interleave visual signals into mathematical CoT. However, they face three key limitations for math problem-solving: reliance on coarse-grained box-shaped image regions, limited perception of vision encoders on math content, and dependence on external capabilities for visual modification. In this paper, we propose MINT-CoT, introducing Mathematical INterleaved Tokens for Chain-of-Thought visual reasoning. MINT-CoT adaptively interleaves relevant visual tokens into textual reasoning steps via an Interleave Token, which dynamically selects visual regions of any shape within math figures. To empower this capability, we construct the…
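
The abstract does not state how the Interleave Token chooses visual tokens, so the following is only a minimal sketch of one plausible reading: score each visual token against the interleave-token state and keep those above a cutoff. The cosine-similarity scoring, the 0.5 threshold, the toy embeddings, and the names `select_visual_tokens` and `interleave` are all assumptions made for illustration, not the authors' implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_visual_tokens(interleave_state, visual_tokens, threshold=0.5):
    """Return indices of visual tokens whose similarity to the
    interleave-token state exceeds `threshold` (hypothetical rule)."""
    return [
        i for i, tok in enumerate(visual_tokens)
        if cosine(interleave_state, tok) > threshold
    ]


def interleave(step_text, selected, visual_tokens):
    """Build one reasoning step: the text followed by the chosen visual tokens."""
    return [("text", step_text)] + [
        ("visual", i, visual_tokens[i]) for i in selected
    ]


# Toy 2-D embeddings for a 2x2 patch grid (row-major indices 0..3).
visual_tokens = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.9, 0.1]]
interleave_state = [1.0, 0.0]  # stand-in for the Interleave Token's hidden state

selected = select_visual_tokens(interleave_state, visual_tokens)
step = interleave("Consider triangle ABC.", selected, visual_tokens)
print(selected)
```

In this toy grid the selected patches are the two on the main diagonal, a set that no single bounding box smaller than the whole image would isolate, which is the kind of non-rectangular selection the abstract contrasts with box-shaped regions.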

Code & Models

Repositories

xinyan-cxy/mint-cot
PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Advanced Graph Neural Networks

Methods: ADaptive gradient method with the OPTimal convergence rate · Shrink and Fine-Tune