Simple Vision-Language Math Reasoning via Rendered Text

Matvey Skripkin; Elizaveta Goncharova; Andrey Kuznetsov

arXiv:2511.11704 · cs.LG · November 18, 2025

TL;DR

This paper introduces a simple vision-language approach to solving math problems: LaTeX equations are rendered into images and paired with structured chain-of-thought prompts, which lets compact multimodal models reach state-of-the-art reasoning accuracy.

Contribution

Systematic ablations show that rendering fidelity and prompt design are the primary drivers of performance, so compact models can excel at math reasoning with a simple pipeline.

Findings

01

Compact multimodal models reach state-of-the-art reasoning accuracy on widely used math benchmarks

02

Preserves general-domain competence, with gains of up to 20% on tasks such as MMMU, ChartQA, and DocVQA

03

Matches or surpasses both open-source and proprietary math-focused vision-language solvers with a simple rendering approach

Abstract

We present a lightweight yet effective pipeline for training vision-language models to solve math problems by rendering LaTeX-encoded equations into images and pairing them with structured chain-of-thought prompts. This simple text-to-vision augmentation enables compact multimodal architectures to achieve state-of-the-art reasoning accuracy. Through systematic ablations, we find that rendering fidelity and prompt design are the primary drivers of performance. Despite its simplicity, our approach consistently matches or surpasses both open-source and proprietary math-focused vision-language solvers on widely used benchmarks, while preserving broad general-domain competence, showing gains of up to 20% on tasks such as MMMU, ChartQA, and DocVQA.
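
As an illustration of the data-preparation idea in the abstract, the following is a minimal sketch of how one training record might be assembled: a LaTeX problem is wrapped in a standalone document ready to be rendered to an image, and paired with a structured chain-of-thought prompt. It is not the authors' implementation. The prompt wording, the `Sample` fields, the file-naming scheme, and the use of the LaTeX `standalone` class are all assumptions made for this example, and the rendering step itself (running a LaTeX toolchain on `latex_source`) is deliberately left out.

```python
from dataclasses import dataclass

# Hypothetical prompt template; the paper's actual prompt wording is not given here.
COT_TEMPLATE = (
    "The image shows a math problem.\n"
    "Reason step by step, then give the final answer on a line "
    "starting with 'Answer:'."
)


@dataclass
class Sample:
    latex_source: str  # standalone LaTeX document, to be rendered to an image
    image_path: str    # where the rendered image is expected to be written
    prompt: str        # structured chain-of-thought instruction
    target: str        # reference solution text


def to_standalone(problem_latex: str, border_pt: int = 4) -> str:
    """Wrap a LaTeX problem statement in a minimal standalone document."""
    return (
        f"\\documentclass[border={border_pt}pt]{{standalone}}\n"
        "\\usepackage{amsmath,amssymb}\n"
        "\\begin{document}\n"
        f"{problem_latex}\n"
        "\\end{document}\n"
    )


def build_sample(idx: int, problem_latex: str, solution: str) -> Sample:
    """Pair the to-be-rendered problem with a prompt and its reference solution."""
    return Sample(
        latex_source=to_standalone(problem_latex),
        image_path=f"render_{idx:06d}.png",  # assumed naming scheme
        prompt=COT_TEMPLATE,
        target=solution,
    )


sample = build_sample(
    0,
    r"Solve for $x$: $x^2 - 5x + 6 = 0$",
    "Factor: (x - 2)(x - 3) = 0. Answer: x = 2 or x = 3",
)
```

Because the abstract identifies rendering fidelity and prompt design as the main drivers of performance, a sketch like this keeps both as explicit, separately adjustable inputs (`to_standalone` and `COT_TEMPLATE`), which would make such ablations straightforward to run.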

Taxonomy

Topics: Multimodal Machine Learning Applications · Model Reduction and Neural Networks · Mathematics, Computing, and Information Processing