GSM-SEM: Benchmark and Framework for Generating Semantically Variant Augmentations
Jyotika Singh, Fang Tu, Aziza Mirzadova, Amit Agarwal, Hitesh Laxmichand Patel, Sandip Ghoshal, Miguel Ballesteros, Yassine Benajiba, Weiyi Sun, Graham Horwood, Sujith Ravi, Dan Roth

TL;DR
GSM-SEM is a stochastic framework that generates semantically diverse benchmark variants for evaluating language models, reducing memorization bias and better assessing true reasoning capabilities.
Contribution
The paper introduces a novel method for creating dynamic, semantically varied benchmark datasets that challenge models beyond static test sets.
Findings
Performance drops are observed across 14 SOTA LLMs under semantic perturbations.
Declines are larger when semantic and symbolic variations are combined.
GSM-SEM variants are publicly released as human-validated datasets.
Abstract
Benchmarks like GSM8K are popular measures of mathematical reasoning, but leaderboard gains can overstate true capability due to memorization of fixed test sets. Most robustness variants apply surface-level perturbations (paraphrases, renamings, number swaps, distractors) that largely preserve the underlying facts, and static releases can themselves become memorization targets over time. We introduce GSM-SEM, a reusable and stochastic framework for generating semantically diverse benchmark variants with substantially higher semantic variance than prior approaches. GSM-SEM perturbs problem statements by modifying entities, attributes, and/or relationships, frequently altering underlying facts and requiring models to recompute solutions under new conditions, while constraining generation to preserve the original calculations/answer and approximate problem difficulty. GSM-SEM generates…
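To make the idea of a semantic variant concrete, here is a minimal toy sketch in Python. It is not the authors' generator: the abstract does not say how variants are produced, so this version assumes a small set of hand-written scenario templates. The names `Problem`, `FRAMES`, `NAMES`, `make_variant`, and `generate_variants` are all hypothetical. Each template changes the entities, attributes, and relationship in the problem, while the calculation (a product of two numbers) and the final answer stay fixed. A seeded random generator stands in for the stochastic element.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Problem:
    text: str
    answer: int


# Hypothetical scenario frames (an assumption for illustration). Each one uses
# different entities and a different relationship, but all reduce to a * b.
FRAMES = [
    "{name} packs {a} crates with {b} melons in each crate. How many melons are packed?",
    "{name} reads {b} pages every evening for {a} evenings. How many pages are read in total?",
    "A printer owned by {name} produces {b} posters per hour. How many posters does it produce in {a} hours?",
]
NAMES = ["Ana", "Bo", "Chen"]


def make_variant(a: int, b: int, rng: random.Random) -> Problem:
    """Sample one reworded scenario that keeps the calculation a * b."""
    frame = rng.choice(FRAMES)
    text = frame.format(name=rng.choice(NAMES), a=a, b=b)
    return Problem(text=text, answer=a * b)


def generate_variants(original: Problem, a: int, b: int, n: int, seed: int) -> list[Problem]:
    """Draw n variants and check that each one preserves the original answer."""
    rng = random.Random(seed)
    variants = [make_variant(a, b, rng) for _ in range(n)]
    # Constraint check: a variant that changes the answer is a bug in this sketch.
    assert all(v.answer == original.answer for v in variants)
    return variants


original = Problem(
    "Sam has 3 boxes with 4 apples in each box. How many apples does Sam have?", 12
)
variants = generate_variants(original, a=3, b=4, n=5, seed=0)
for v in variants:
    print(v.text, "->", v.answer)
```

Passing a different `seed` draws a different set of wordings, which is the property that keeps a released set from becoming a fixed memorization target. A real pipeline would need a far richer generator and a difficulty check, neither of which this sketch attempts.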