MathGen: Revealing the Illusion of Mathematical Competence through Text-to-Image Generation

Ruiyao Liu; Hui Shen; Ping Zhang; Yunta Hsieh; Yifan Zhang; Jing Xu; Sicheng Chen; Junchen Li; Jiawei Lu; Jianing Ma; Jiaqi Mo; Qi Han; Zhen Zhang; Zhongwei Wan; Jing Xiong; Xin Wang; Ziyuan Liu; Hangrui Cao; Ngai Wong

arXiv:2603.27959·cs.CV·April 1, 2026

TL;DR

MathGen is a benchmark showing that current text-to-image models often fail to generate mathematically accurate visuals: the best closed-source model reaches only 42.0% accuracy, and open-source models reach 1-11%.

Contribution

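The paper introduces MathGen, a benchmark of 900 problems spanning seven core domains, together with a Script-as-a-Judge evaluation protocol that pairs each problem with an executable verifier, to assess the mathematical visual generation ability of text-to-image (T2I) models.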

Findings

01. The best closed-source model achieves 42.0% accuracy.

02. Open-source models achieve 1-11% accuracy.

03. Models score near 0% on structured tasks.

Abstract

Modern generative models have demonstrated the ability to solve challenging mathematical problems. In many real-world settings, however, mathematical solutions must be expressed visually through diagrams, plots, geometric constructions, and structured symbolic layouts, where correctness depends on precise visual composition. This naturally raises the question of whether generative models can still solve such problems when the answer must be rendered visually rather than written in text. To study this problem, we introduce MathGen, a rigorous benchmark of 900 problems spanning seven core domains, each paired with an executable verifier under a Script-as-a-Judge protocol for deterministic and objective evaluation. Experiments on representative open-source and proprietary text-to-image models show that mathematical fidelity remains a major bottleneck: even the best closed-source model reaches only 42.0%…
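
The abstract describes pairing each problem with an executable verifier so that scoring is deterministic. The listing does not show the paper's actual scripts, so the following is only a minimal sketch of the general idea under loud assumptions: the generated image is stood in for by a tiny binary pixel grid, and every name here (`Problem`, `make_line_verifier`, `render_line`, `accuracy_by_domain`, the "plots" domain label) is hypothetical and not taken from MathGen.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Assumption: a generated image is reduced to a square grid of 0/1 pixels,
# with row 0 at the top. Real verifiers would work on actual model outputs.
Image = List[List[int]]


@dataclass
class Problem:
    domain: str                         # hypothetical domain label
    prompt: str                         # text prompt given to the T2I model
    verifier: Callable[[Image], bool]   # deterministic pass/fail script


def line_points(slope: int, intercept: int, size: int) -> set:
    """Grid cells (row, col) hit by y = slope*x + intercept on a size x size canvas."""
    points = set()
    for x in range(size):
        y = slope * x + intercept
        if 0 <= y < size:
            # Flip y so the mathematical origin sits at the bottom-left.
            points.add((size - 1 - y, x))
    return points


def make_line_verifier(slope: int, intercept: int, size: int) -> Callable[[Image], bool]:
    """Build a verifier that passes only if exactly the expected cells are drawn."""
    expected = line_points(slope, intercept, size)

    def verify(img: Image) -> bool:
        drawn = {(r, c) for r in range(size) for c in range(size) if img[r][c] == 1}
        return drawn == expected

    return verify


def render_line(slope: int, intercept: int, size: int) -> Image:
    """Stand-in for a model output: draw the given line on a blank canvas."""
    img = [[0] * size for _ in range(size)]
    for r, c in line_points(slope, intercept, size):
        img[r][c] = 1
    return img


def accuracy_by_domain(problems: List[Problem], outputs: List[Image]) -> Dict[str, float]:
    """Fraction of problems in each domain whose output passes its verifier."""
    passed: Dict[str, int] = {}
    total: Dict[str, int] = {}
    for problem, img in zip(problems, outputs):
        total[problem.domain] = total.get(problem.domain, 0) + 1
        passed[problem.domain] = passed.get(problem.domain, 0) + int(problem.verifier(img))
    return {domain: passed[domain] / total[domain] for domain in total}


problems = [
    Problem("plots", "Plot y = x + 1", make_line_verifier(1, 1, 8)),
    Problem("plots", "Plot y = 2x", make_line_verifier(2, 0, 8)),
]
# The second "output" draws y = x instead of y = 2x, so its verifier rejects it.
outputs = [render_line(1, 1, 8), render_line(1, 0, 8)]
scores = accuracy_by_domain(problems, outputs)
```

Because each verifier is an ordinary function with no randomness or learned judge, running it twice on the same output gives the same verdict, which is the property the abstract calls deterministic and objective evaluation. The exact-match check above is deliberately strict for illustration; a verifier for real images would presumably need tolerances.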
