DynaMath: A Dynamic Visual Benchmark for Evaluating Mathematical Reasoning Robustness of Vision Language Models
Chengke Zou, Xingang Guo, Rui Yang, Junyu Zhang, Bin Hu, Huan Zhang

TL;DR
DynaMath is a dynamic visual benchmark designed to evaluate the robustness of vision-language models in mathematical reasoning by testing their performance across varied visual and textual question variants.
Contribution
This paper introduces DynaMath, a novel dynamic benchmark with generated question variants to assess VLMs' reasoning robustness beyond static problem sets.
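As a rough illustration of what "generated question variants" could mean, the sketch below builds several concrete instances of one seed question from a random seed. The question template, the `Variant` type, and `make_variant` are hypothetical names chosen for this example; the summary does not describe the actual generation mechanism, and the visual side of the benchmark (rendered figures) is left out here.

```python
import random
from dataclasses import dataclass


@dataclass
class Variant:
    question: str  # text shown to the model
    answer: int    # ground-truth answer computed from the sampled parameters


def make_variant(seed: int) -> Variant:
    """Hypothetical seed question: evaluate a line at a point.

    The same seed always yields the same variant, so results are reproducible.
    """
    rng = random.Random(seed)
    slope = rng.choice([-3, -2, -1, 1, 2, 3])
    intercept = rng.randint(-5, 5)
    x = rng.randint(1, 4)
    question = (
        f"A line has slope {slope} and y-intercept {intercept}. "
        f"What is y at x = {x}?"
    )
    return Variant(question=question, answer=slope * x + intercept)


# Several variants of the same underlying problem.
variants = [make_variant(seed) for seed in range(5)]
```

Because the ground truth is computed from the sampled parameters, every variant carries a correct answer without manual labeling.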
Findings
VLMs score significantly lower in the worst case across a question's variants than in the average case.
Current models show limited robustness in mathematical reasoning tasks.
DynaMath reveals critical gaps in VLMs' ability to generalize across question variations.
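The gap between average-case and worst-case performance can be made concrete with a small sketch. The definitions used here are assumptions, since this summary does not spell them out: average-case accuracy is taken as the fraction of all variant attempts answered correctly, and worst-case accuracy as the fraction of seed questions for which every variant was answered correctly. The function names and the sample data are hypothetical.

```python
from typing import Dict, List


def average_case_accuracy(results: Dict[str, List[bool]]) -> float:
    """Fraction of all (question, variant) attempts answered correctly."""
    total = sum(len(outcomes) for outcomes in results.values())
    correct = sum(sum(outcomes) for outcomes in results.values())
    return correct / total


def worst_case_accuracy(results: Dict[str, List[bool]]) -> float:
    """Fraction of seed questions whose variants were ALL answered correctly."""
    solved = sum(all(outcomes) for outcomes in results.values())
    return solved / len(results)


# Made-up outcomes: seed question id -> correctness on each of its variants.
results = {
    "q1": [True, True, True],
    "q2": [True, False, True],
    "q3": [False, False, True],
    "q4": [True, True, True],
}

avg = average_case_accuracy(results)   # 9 correct out of 12 attempts
worst = worst_case_accuracy(results)   # 2 of 4 questions fully solved
```

In this toy data the model is right on 75% of attempts but reliably solves only half of the seed questions, which is the kind of discrepancy the findings describe.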
Abstract
The rapid advancements in Vision-Language Models (VLMs) have shown great potential in tackling mathematical reasoning tasks that involve visual context. Unlike humans who can reliably apply solution steps to similar problems with minor modifications, we found that SOTA VLMs like GPT-4o can consistently fail in these scenarios, revealing limitations in their mathematical reasoning capabilities. In this paper, we investigate the mathematical reasoning robustness in VLMs and evaluate how well these models perform under different variants of the same question, such as changes in visual numerical values or function graphs. While several vision-based math benchmarks have been developed to assess VLMs' problem-solving capabilities, these benchmarks contain only static sets of problems and cannot easily evaluate mathematical reasoning robustness. To fill this gap, we introduce DynaMath, a…
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques
Methods: Sparse Evolutionary Training
