DynaMath: A Dynamic Visual Benchmark for Evaluating Mathematical Reasoning Robustness of Vision Language Models

Chengke Zou; Xingang Guo; Rui Yang; Junyu Zhang; Bin Hu; Huan Zhang

arXiv:2411.00836 · cs.CV · February 25, 2025 · 2 cites

TL;DR

DynaMath is a dynamic visual benchmark designed to evaluate the robustness of vision-language models in mathematical reasoning by testing their performance across varied visual and textual question variants.

Contribution

This paper introduces DynaMath, a novel dynamic benchmark with generated question variants to assess VLMs' reasoning robustness beyond static problem sets.

Findings

01

VLMs perform significantly worse on worst-case question variants than in the average case.

02

Current models show limited robustness in mathematical reasoning tasks.

03

DynaMath reveals critical gaps in VLMs' ability to generalize across question variations.
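
The gap between average-case and worst-case performance in the findings above can be illustrated with a minimal sketch. It assumes that worst-case accuracy counts a seed question as solved only if every one of its variants is answered correctly, and that average-case accuracy is the fraction of all variants answered correctly. The function names and the `results` data are hypothetical and made up for illustration; they are not taken from the paper or its code.

```python
from typing import Dict, List


def average_case_accuracy(results: Dict[str, List[bool]]) -> float:
    """Fraction of all (seed question, variant) pairs answered correctly."""
    total = sum(len(variants) for variants in results.values())
    correct = sum(sum(variants) for variants in results.values())
    return correct / total


def worst_case_accuracy(results: Dict[str, List[bool]]) -> float:
    """Fraction of seed questions whose variants are ALL answered correctly.

    Assumption: this is one natural reading of "worst case"; a single
    failed variant makes the whole seed question count as failed.
    """
    solved = sum(all(variants) for variants in results.values())
    return solved / len(results)


# Hypothetical per-variant correctness for four seed questions.
results = {
    "seed_1": [True, True, True, True],
    "seed_2": [True, False, True, True],
    "seed_3": [False, False, True, False],
    "seed_4": [True, True, True, True],
}

avg = average_case_accuracy(results)   # 12 of 16 variants correct
worst = worst_case_accuracy(results)   # 2 of 4 seed questions fully correct
print(f"average-case: {avg:.2f}, worst-case: {worst:.2f}")
```

Under these definitions the worst-case score can never exceed the average-case score, so a large gap between the two indicates that a model often solves some variants of a question but not all of them.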

Abstract

The rapid advancements in Vision-Language Models (VLMs) have shown great potential in tackling mathematical reasoning tasks that involve visual context. Unlike humans who can reliably apply solution steps to similar problems with minor modifications, we found that SOTA VLMs like GPT-4o can consistently fail in these scenarios, revealing limitations in their mathematical reasoning capabilities. In this paper, we investigate the mathematical reasoning robustness in VLMs and evaluate how well these models perform under different variants of the same question, such as changes in visual numerical values or function graphs. While several vision-based math benchmarks have been developed to assess VLMs' problem-solving capabilities, these benchmarks contain only static sets of problems and cannot easily evaluate mathematical reasoning robustness. To fill this gap, we introduce DynaMath, a…

Code & Models

Datasets

DynaMath/DynaMath_Sample
dataset · 505 downloads

Videos

DynaMath: A Dynamic Visual Benchmark for Evaluating Mathematical Reasoning Robustness of Vision Language Models · SlidesLive

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques

Methods: Sparse Evolutionary Training