VisTIRA: Closing the Image-Text Modality Gap in Visual Math Reasoning via Structured Tool Integration

Saeed Khaki; Ashudeep Singh; Nima Safaei; Kamal Ginotra

arXiv:2601.14440 · cs.AI · March 18, 2026

Open Access

TL;DR

This paper introduces VisTIRA, a structured reasoning framework that combines tool integration and OCR grounding to significantly improve visual math reasoning in vision-language models, reducing the modality gap with text-based models.

Contribution

The paper presents VisTIRA, a novel framework that decomposes image-based math problems into natural language and executable steps, along with a LaTeX pipeline and synthetic data for training and evaluation.

Findings

01 Tool-integrated supervision enhances reasoning accuracy.
02 OCR grounding benefits smaller models more significantly.
03 The modality gap decreases as model size increases.
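To make the third finding concrete, the sketch below computes a modality gap as text accuracy minus image accuracy on the same set of problems. This definition is an assumption for illustration; the listing does not state the paper's exact formula. The function names and the toy answers are hypothetical and are not results from the paper.

```python
def accuracy(predictions, references):
    """Fraction of predictions that exactly match their reference answer."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not references:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


def modality_gap(text_preds, image_preds, references):
    """Text accuracy minus image accuracy on the same problems.

    Assumed definition: a positive value means the model does better when
    the problem is given as text than when it is given as an image.
    """
    return accuracy(text_preds, references) - accuracy(image_preds, references)


# Toy, made-up answers for four problems (not data from the paper).
references = ["4", "9", "16", "25"]
text_preds = ["4", "9", "16", "25"]   # all four correct from text input
image_preds = ["4", "9", "1", "2"]    # two of four correct from image input
gap = modality_gap(text_preds, image_preds, references)
```

Under this definition, a gap that shrinks toward zero as models grow would correspond to the trend the finding describes.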

Abstract

Vision-language models (VLMs) lag behind text-only language models on mathematical reasoning when the same problems are presented as images rather than text. We empirically characterize this as a modality gap: the same question in text form yields markedly higher accuracy than its visually typeset counterpart, due to compounded failures in reading dense formulas, layout, and mixed symbolic-diagrammatic context. First, we introduce VisTIRA (Vision and Tool-Integrated Reasoning Agent), a tool-integrated reasoning framework that enables structured problem solving by iteratively decomposing a given math problem (as an image) into natural language rationales and executable Python steps to determine the final answer. Second, we build a framework to measure and improve visual math reasoning: a LaTeX-based pipeline that converts chain-of-thought math corpora (e.g., NuminaMath) into challenging…
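The alternation of natural language rationales and executable Python steps described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `run_python_step`, `solve_with_tools`, and the hard-coded `example_steps` are hypothetical names, and the fixed step list stands in for what a vision-language model would generate turn by turn after reading the problem image.

```python
import contextlib
import io


def run_python_step(code, namespace):
    """Execute one code step in a shared namespace and return its stdout."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        # Assumption: plain exec() for brevity. Model-generated code is
        # untrusted, so a real system would run it in a sandbox.
        exec(code, namespace)
    return buffer.getvalue().strip()


def solve_with_tools(steps):
    """Walk through (rationale, code) pairs; return the trace and last tool output."""
    namespace = {}  # variables persist across steps
    trace = []
    last_output = ""
    for rationale, code in steps:
        trace.append(("rationale", rationale))
        if code:
            last_output = run_python_step(code, namespace)
            trace.append(("output", last_output))
    return trace, last_output


# Hypothetical stand-in for model output on an image of the problem
# "Solve x^2 - 5x + 6 = 0 and give the larger root."
example_steps = [
    ("Read the quadratic's coefficients from the image.",
     "a, b, c = 1, -5, 6"),
    ("Apply the quadratic formula and keep the larger root.",
     "import math\nroot = (-b + math.sqrt(b * b - 4 * a * c)) / (2 * a)\nprint(root)"),
]
trace, answer = solve_with_tools(example_steps)
```

In an actual tool-integrated loop, each new rationale and code step would presumably be generated conditioned on the previous tool outputs; the fixed list here only shows the control flow and the shared execution state.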

Taxonomy

Topics: Multimodal Machine Learning Applications · Mathematics, Computing, and Information Processing · Data Visualization and Analytics