ASyMOB: Algebraic Symbolic Mathematical Operations Benchmark
Michael Shalyt, Rotem Elimelech, Ido Kaminer

TL;DR
ASyMOB is a comprehensive benchmark for evaluating large language models' symbolic mathematics skills, revealing their strengths, weaknesses, and robustness, and highlighting the impact of integrated code execution on performance.
Contribution
Introduces ASyMOB, a large-scale symbolic math benchmark with analysis of LLM generalization, robustness, and the effects of code integration, advancing evaluation methods in symbolic mathematics.
Findings
LLMs show significant performance degradation under perturbations.
Models with code execution outperform those without, especially weaker models.
Advanced models demonstrate high accuracy and robustness, indicating a potential phase transition.
Abstract
Large language models (LLMs) are rapidly approaching the level of proficiency in university-level symbolic mathematics required for applications in advanced science and technology. However, existing benchmarks fall short in assessing the core skills of LLMs in symbolic mathematics, such as integration, differential equations, and algebraic simplification. To address this gap, we introduce ASyMOB, a novel assessment framework focused exclusively on symbolic manipulation, featuring 17,092 unique math challenges, organized by similarity and complexity. ASyMOB enables analysis of LLM generalization capabilities by comparing performance in problems that differ by simple numerical or symbolic 'perturbations'. Evaluated LLMs exhibit substantial degradation in performance for all perturbation types (up to -70.3%), suggesting reliance on memorized patterns rather than deeper understanding of…
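The idea of problems that differ only by a numerical or symbolic perturbation can be pictured with a minimal sketch. Everything in it is an illustrative assumption rather than the benchmark's actual generation code: the seed problem, the placeholder names `A` and `B`, and the helper functions are hypothetical.

```python
import random
from string import Template

# Hypothetical seed problem (not taken from ASyMOB); A and B mark the
# coefficient slots that a perturbation is allowed to change.
SEED = Template("Compute the integral of $A*x*cos($B*x) with respect to x")


def seed_problem():
    """Unperturbed problem: every coefficient slot is set to 1."""
    return SEED.substitute(A=1, B=1)


def numeric_variant(rng, low=2, high=9):
    """Numerical perturbation: fill each slot with a random integer."""
    return SEED.substitute(A=rng.randint(low, high), B=rng.randint(low, high))


def symbolic_variant():
    """Symbolic perturbation: keep the slots as free symbols A and B."""
    return SEED.substitute(A="A", B="B")


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the variants are reproducible
    print(seed_problem())
    print(numeric_variant(rng))
    print(symbolic_variant())
```

Comparing a model's accuracy on the seed problem with its accuracy on such variants is what lets a benchmark of this kind separate pattern recall from generalization.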
Peer Reviews
Decision·Submitted to ICLR 2026
* The work clearly explains how problems are built and expanded into symbolic, numeric, and equivalence variants, with worked examples for each.
* ASyMOB fills gaps in the existing literature, targeting symbolic manipulation (integration, limits, DEs, series, hypergeometrics) rather than text-to-math. It offers controlled difficulty via systematic perturbations and broad university-level problem coverage that previous benchmarks lack.
* Dataset instances are created with random transforms, a
* The work documents qualitative examples where CAS fails but LLMs succeed, and a case solvable only by an LLM + CAS hybrid (Figure 6). Further, it argues that symbolic perturbations hurt CAS more than LLMs. What’s missing is a dataset-level percentage or table partitioning successes into LLM-only, CAS-only, and hybrid categories across perturbations. Adding this would substantively strengthen the claim.
* Some of the perturbations appear to be somewhat contrived. This may not necessarily be a bad thing, but it
- ASyMOB isolates symbolic mathematical reasoning from linguistic understanding, providing a clean test of algebraic manipulation skills.
- The symbolic, numeric, and equivalence perturbations enable fine-grained evaluation of robustness and generalization.
- Dual symbolic–numeric verification ensures reliability, and the findings reveal meaningful trends such as a "phase transition" toward genuine reasoning in frontier LLMs.
- The scope is somewhat limited. The benchmark focuses narrowly on algebraic operations, omitting other mathematical reasoning domains, such as geometry or proofs.
- Some generated variants may be mathematically artificial and not representative of real-world symbolic problems.
- Several key conclusions, such as the role of code integration and hybrid tool use in improving LLM reasoning, have already been explored in prior work on tool-augmented or agentic LLMs, making the contributions more incr
- The benchmark is reasonable, as the symbolic and numeric versions can fully evaluate the ability of LLMs to address mathematical reasoning.
- The provided examples are well-motivated, as identifying cases where both LLMs and symbolic systems do not perform well can help guide further research directions.
- The novelty of this paper requires further clarification. As noted at the end of the paper, GSM-Symbolic conducted similar research and reached comparable conclusions. The authors should therefore clearly articulate the unique contribution and positioning of this work within the field, and carefully explain how their benchmark differs from existing work such as GSM-Symbolic.
- Additionally, some closely related studie
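The reviews above mention dual symbolic–numeric verification of answers. The numeric half of such a check can be sketched as follows. This is a minimal illustration, not the authors' verifier: the function name `numerically_equal`, the sampling interval, the tolerances, and the example answers are all assumptions, and the sketch ignores constants of integration.

```python
import math
import random


def numerically_equal(f, g, trials=20, low=0.5, high=2.0, rel_tol=1e-9, seed=0):
    """Return True if f and g agree at every randomly sampled point."""
    rng = random.Random(seed)  # fixed seed keeps the check reproducible
    for _ in range(trials):
        x = rng.uniform(low, high)
        if not math.isclose(f(x), g(x), rel_tol=rel_tol, abs_tol=1e-12):
            return False
    return True


# Reference antiderivative of x*cos(x), plus two hypothetical model answers.
def reference(x):
    return x * math.sin(x) + math.cos(x)


def equivalent_answer(x):
    # Same expression with the terms written in a different order.
    return math.cos(x) + math.sin(x) * x


def wrong_answer(x):
    # Sign error on the cosine term.
    return x * math.sin(x) - math.cos(x)


if __name__ == "__main__":
    print(numerically_equal(reference, equivalent_answer))
    print(numerically_equal(reference, wrong_answer))
```

Sampling catches most incorrect answers cheaply but cannot prove equivalence, which is one reason to pair it with a symbolic comparison.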
Taxonomy
Topics: Computability, Logic, AI Algorithms · Polynomial and algebraic computation
