Simple Vision-Language Math Reasoning via Rendered Text
Matvey Skripkin, Elizaveta Goncharova, Andrey Kuznetsov

TL;DR
This paper introduces a simple, effective vision-language approach for solving math problems by rendering LaTeX equations into images paired with structured prompts, achieving state-of-the-art accuracy.
Contribution
It demonstrates that rendering fidelity and prompt design are key, enabling compact models to excel in math reasoning tasks with minimal complexity.
Findings
Achieves state-of-the-art accuracy on math benchmarks with compact multimodal models
Gains of up to 20% on general-domain tasks such as MMMU, ChartQA, and DocVQA
Matches or surpasses existing open-source and proprietary math-focused solvers with a simple rendering approach
Abstract
We present a lightweight yet effective pipeline for training vision-language models to solve math problems by rendering LaTeX-encoded equations into images and pairing them with structured chain-of-thought prompts. This simple text-to-vision augmentation enables compact multimodal architectures to achieve state-of-the-art reasoning accuracy. Through systematic ablations, we find that rendering fidelity and prompt design are the primary drivers of performance. Despite its simplicity, our approach consistently matches or surpasses both open-source and proprietary math-focused vision-language solvers on widely used benchmarks, while preserving broad general-domain competence, with gains of up to 20% on tasks such as MMMU, ChartQA, and DocVQA.
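The pairing step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the prompt wording, the `RenderedSample` record, the `wrap_latex` and `build_sample` helpers, and the default DPI are all assumptions made for illustration. The sketch only prepares the LaTeX source and the prompt. Compiling that source into an image is left to an external LaTeX toolchain.

```python
from dataclasses import dataclass

# Hypothetical prompt template; the paper's actual prompt wording is not shown here.
COT_TEMPLATE = (
    "You are given an image of a math problem.\n"
    "Reason step by step, then give the final answer on the last line "
    "in the form 'Answer: <value>'."
)


@dataclass
class RenderedSample:
    latex_document: str  # standalone LaTeX source to be compiled into an image
    dpi: int             # rasterization resolution, a rough proxy for rendering fidelity
    prompt: str          # structured chain-of-thought instruction paired with the image
    target: str          # reference solution text used as the training label


def wrap_latex(problem: str) -> str:
    """Wrap a problem statement in a minimal standalone LaTeX document."""
    return (
        "\\documentclass[preview]{standalone}\n"
        "\\usepackage{amsmath}\n"
        "\\begin{document}\n"
        f"{problem}\n"
        "\\end{document}\n"
    )


def build_sample(problem: str, solution: str, dpi: int = 200) -> RenderedSample:
    """Pair one LaTeX problem with the prompt template (assumed record layout)."""
    if dpi <= 0:
        raise ValueError("dpi must be positive")
    return RenderedSample(wrap_latex(problem), dpi, COT_TEMPLATE, solution)


sample = build_sample(r"Solve for $x$: $x^2 - 5x + 6 = 0$.", "x = 2 or x = 3")
print(sample.latex_document)
```

Keeping the resolution and the prompt as explicit fields of each record makes it straightforward to vary them independently, which is the kind of ablation over rendering fidelity and prompt design that the abstract reports.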
Taxonomy
Topics: Multimodal Machine Learning Applications · Model Reduction and Neural Networks · Mathematics, Computing, and Information Processing
