Toward Automated Robustness Evaluation of Mathematical Reasoning
Yutao Hou, Zeguan Xiao, Fei Yu, Yihan Jiang, Ma Shuguang, Zhaoqian Dai, Hailiang Huang, Yun Chen, Guanhua Chen

TL;DR
This paper introduces MaSTer, an automated framework that generates adversarial variants of math problems to evaluate and improve the robustness of large language models on mathematical reasoning tasks.
Contribution
The paper presents MaSTer, an automated stress-testing framework that dynamically creates adversarial variants for each model to probe and improve LLM robustness while minimizing the risk of data contamination.
Findings
MaSTer effectively generates adversarial variants for GSM8K and MATH-500.
Variants produced by MaSTer can be used as fine-tuning data to improve model robustness.
The framework can be adapted to non-mathematical tasks, which suggests it applies beyond mathematical reasoning.
Abstract
Large Language Models (LLMs) have demonstrated remarkable capabilities in various reasoning-intensive tasks. However, these models exhibit unexpected brittleness, often failing on simple variations of the same underlying task. Existing robustness evaluations predominantly rely on hand-crafted templates or a limited set of perturbation rules. Consequently, such approaches lack the adaptability to probe latent vulnerabilities unique to specific models and remain susceptible to data contamination. To address this, we propose the Math Stress Tester (MaSTer), an automated framework inspired by software stress testing. MaSTer generates adversarial variants via a multi-round rewrite-verify loop, ensuring semantic consistency while successfully inducing model failure. Our framework generates benchmark variants dynamically for each LLM, thus minimizing the risk of data contamination. Experiments…
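The multi-round rewrite-verify loop mentioned in the abstract can be pictured with a minimal sketch. The abstract does not describe the implementation, so everything below is an assumption made for illustration: the `Problem` type, the `stress_test` function, the `max_rounds` parameter, and the three callables (`rewrite`, `is_consistent`, `solve`) are hypothetical names, and in practice the callables would presumably wrap LLM calls. The toy stand-ins at the bottom exist only so the sketch runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Problem:
    text: str
    answer: str  # reference answer, assumed unchanged by a valid rewrite


def stress_test(
    seed: Problem,
    rewrite: Callable[[str], str],                     # hypothetical rewriter
    is_consistent: Callable[[Problem, Problem], bool],  # hypothetical verifier
    solve: Callable[[str], str],                       # model under test
    max_rounds: int = 5,                               # assumed round budget
) -> Optional[Problem]:
    """Return a verified variant that the model fails on, or None."""
    current = seed
    for _ in range(max_rounds):
        candidate = Problem(rewrite(current.text), seed.answer)
        # Verify step: discard rewrites that change the underlying problem.
        if not is_consistent(seed, candidate):
            continue
        # Attack succeeded: the variant is consistent but the model is wrong.
        if solve(candidate.text) != candidate.answer:
            return candidate
        # Model still correct: keep rewriting from the accepted candidate.
        current = candidate
    return None


# Toy stand-ins (assumptions, not the paper's components).
def toy_rewrite(text: str) -> str:
    return text + " (Unrelated: a bus has 40 seats.)"


def toy_is_consistent(seed: Problem, cand: Problem) -> bool:
    return cand.text.startswith(seed.text)


def toy_solve(text: str) -> str:
    # Pretend the model gets distracted once two irrelevant facts appear.
    return "5" if text.count("Unrelated") < 2 else "40"


seed = Problem("Tom has 2 apples and buys 3 more. How many apples does he have?", "5")
variant = stress_test(seed, toy_rewrite, toy_is_consistent, toy_solve)
```

With these toy components the loop accepts the first rewrite, continues from it, and returns the second rewrite as the failing variant. Because the loop is driven by the failures of whichever `solve` callable is passed in, a different model would generally end up with a different set of variants, which is the per-model generation the abstract describes.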
