VisTIRA: Closing the Image-Text Modality Gap in Visual Math Reasoning via Structured Tool Integration
Saeed Khaki, Ashudeep Singh, Nima Safaei, Kamal Ginotra

TL;DR
This paper introduces VisTIRA, a structured reasoning framework that combines tool integration and OCR grounding to improve visual math reasoning in vision-language models and narrow the modality gap with text-only models.
Contribution
The paper presents VisTIRA, a framework that decomposes image-based math problems into natural language rationales and executable Python steps, along with a LaTeX-based pipeline and synthetic data for training and evaluation.
Findings
Tool-integrated supervision enhances reasoning accuracy.
OCR grounding benefits smaller models more than larger ones.
The modality gap decreases as model size increases.
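The modality gap in these findings can be read as an accuracy difference between the text and image versions of the same problems. The sketch below is a minimal illustration under that assumption; the function name `modality_gap` and the toy correctness flags are hypothetical and are not results from the paper.

```python
from typing import Sequence


def accuracy(correct: Sequence[bool]) -> float:
    """Fraction of problems answered correctly."""
    if not correct:
        raise ValueError("need at least one problem")
    return sum(correct) / len(correct)


def modality_gap(text_correct: Sequence[bool], image_correct: Sequence[bool]) -> float:
    """Text accuracy minus image accuracy over the same problem set.

    Both sequences must be aligned: entry i refers to the same problem,
    once posed as text and once posed as a rendered image.
    """
    if len(text_correct) != len(image_correct):
        raise ValueError("text and image runs must cover the same problems")
    return accuracy(text_correct) - accuracy(image_correct)


# Toy flags for four problems (illustrative only, not data from the paper).
text_run = [True, True, False, True]
image_run = [True, False, False, True]
gap = modality_gap(text_run, image_run)
```

A positive value means the model does better on text than on images; a value near zero means the two modalities are on par for that problem set.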
Abstract
Vision-language models (VLMs) lag behind text-only language models on mathematical reasoning when the same problems are presented as images rather than text. We empirically characterize this as a modality gap: the same question in text form yields markedly higher accuracy than its visually typeset counterpart, due to compounded failures in reading dense formulas, layout, and mixed symbolic-diagrammatic context. First, we introduce VisTIRA (Vision and Tool-Integrated Reasoning Agent), a tool-integrated reasoning framework that enables structured problem solving by iteratively decomposing a given math problem (as an image) into natural language rationales and executable Python steps to determine the final answer. Second, we build a framework to measure and improve visual math reasoning: a LaTeX-based pipeline that converts chain-of-thought math corpora (e.g., NuminaMath) into challenging…
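The iterative decomposition into rationales and executable Python steps can be pictured as a loop that alternates between a model turn and a tool call. The sketch below is not the authors' implementation: the `solve` loop, the step format, and the `scripted_policy` stub standing in for the vision-language model are all assumptions made for illustration, and the stub ignores the problem image that a real model would receive.

```python
import contextlib
import io
from typing import Callable, List, Optional, Tuple

# One model turn: (rationale, code, final_answer).
# Either `code` or `final_answer` is set, the other is None.
Step = Tuple[str, Optional[str], Optional[str]]
# One executed turn kept in the history: (rationale, code, tool_output).
Record = Tuple[str, str, str]


def run_python(code: str, namespace: dict) -> str:
    """Execute one code step and return whatever it printed."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(code, namespace)  # no sandboxing here; a real system would need it
    return buffer.getvalue().strip()


def solve(policy: Callable[[List[Record]], Step], max_steps: int = 8) -> Optional[str]:
    """Alternate rationale, code execution, and feedback until an answer appears."""
    namespace: dict = {}          # variables persist across code steps
    history: List[Record] = []
    for _ in range(max_steps):
        rationale, code, final_answer = policy(history)
        if final_answer is not None:
            return final_answer
        if code is None:
            raise ValueError("a step must carry either code or a final answer")
        output = run_python(code, namespace)
        history.append((rationale, code, output))
    return None                   # step budget exhausted without an answer


def scripted_policy(history: List[Record]) -> Step:
    """Hypothetical stand-in for the model: solves x + 3 = 10 in two turns."""
    if not history:
        return ("Isolate x by subtracting 3 from 10.", "x = 10 - 3\nprint(x)", None)
    # Use the tool output from the previous step as the final answer.
    return ("The tool output is the value of x.", None, history[-1][2])


answer = solve(scripted_policy)
```

Keeping a shared namespace across steps lets later code reuse earlier variables, and feeding the printed output back into the history is what lets the next rationale build on the tool result.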
Taxonomy
Topics: Multimodal Machine Learning Applications · Mathematics, Computing, and Information Processing · Data Visualization and Analytics
