Toward Automated Robustness Evaluation of Mathematical Reasoning

Yutao Hou; Zeguan Xiao; Fei Yu; Yihan Jiang; Ma Shuguang; Zhaoqian Dai; Hailiang Huang; Yun Chen; Guanhua Chen

arXiv:2506.05038·cs.CL·April 27, 2026

TL;DR

This paper introduces MaSTer, an automated framework for generating adversarial variants to evaluate and improve the robustness of large language models in mathematical reasoning tasks.

Contribution

The paper presents MaSTer, an automated stress-testing framework that dynamically generates adversarial variants for each target LLM, both to probe and to improve its robustness while minimizing the risk of data contamination.

Findings

01. MaSTer effectively generates adversarial variants for GSM8K and MATH-500.

02. Variants produced by MaSTer can be used as fine-tuning data to improve model robustness.

03. The framework is adaptable to non-mathematical tasks, demonstrating broad applicability.

Abstract

Large Language Models (LLMs) have demonstrated remarkable capabilities in various reasoning-intensive tasks. However, these models exhibit unexpected brittleness, often failing on simple variations of the same underlying task. Existing robustness evaluations predominantly rely on hand-crafted templates or a limited set of perturbation rules. Consequently, such approaches lack the adaptability to probe latent vulnerabilities unique to specific models and remain susceptible to data contamination. To address this, we propose the Math Stress Tester (MaSTer), an automated framework inspired by software stress testing. MaSTer generates adversarial variants via a multi-round rewrite-verify loop, ensuring semantic consistency while successfully inducing model failure. Our framework generates benchmark variants dynamically for each LLM, thus minimizing the risk of data contamination. Experiments…
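
The multi-round rewrite-verify loop described in the abstract can be pictured with a minimal sketch. This is not the authors' implementation: the abstract does not specify how rewriting, verification, or failure detection are carried out, so every name here (`stress_test`, `rewrite`, `verify`, `target`, `max_rounds`) is a hypothetical stand-in, and the toy functions replace what would presumably be LLM calls.

```python
from typing import Callable, Optional


def stress_test(
    question: str,
    answer: str,
    rewrite: Callable[[str], str],       # hypothetical: proposes a new variant
    verify: Callable[[str, str], bool],  # hypothetical: checks semantic consistency
    target: Callable[[str], str],        # hypothetical: the model under test
    max_rounds: int = 3,                 # assumed round budget
) -> Optional[str]:
    """Return a variant that keeps the original meaning but makes the
    target answer incorrectly, or None if no such variant is found."""
    current = question
    for _ in range(max_rounds):
        candidate = rewrite(current)
        # Discard rewrites that drift from the original problem.
        if not verify(question, candidate):
            continue
        # A consistent variant that the target gets wrong is a success.
        if target(candidate) != answer:
            return candidate
        # Otherwise keep the valid rewrite and perturb it further.
        current = candidate
    return None


# Toy stand-ins so the sketch runs without any model access.
def toy_rewrite(q: str) -> str:
    # Appends an irrelevant distractor sentence.
    return q + " Note that the shop also sells pears."


def toy_verify(original: str, candidate: str) -> bool:
    # Treats a variant as consistent if the original text is preserved.
    return candidate.startswith(original)


def toy_target(q: str) -> str:
    # A pretend model that is thrown off once two distractors appear.
    return "5" if q.count("pears") < 2 else "7"


QUESTION = "Tom has 2 apples and buys 3 more. How many apples does he have?"
variant = stress_test(QUESTION, "5", toy_rewrite, toy_verify, toy_target)
```

With these toy functions the first round yields a consistent variant that the pretend model still answers correctly, so the loop perturbs it again and the second round produces the failing variant. Because variants are produced per target model at evaluation time, they are not drawn from a fixed, previously published test set.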
