ASyMOB: Algebraic Symbolic Mathematical Operations Benchmark

Michael Shalyt; Rotem Elimelech; Ido Kaminer

arXiv:2505.23851 · cs.CL · June 2, 2025

Open Access · 1 Repo · 1 Dataset · 3 Reviews

TL;DR

ASyMOB is a comprehensive benchmark for evaluating large language models' symbolic mathematics skills, revealing their strengths, weaknesses, and robustness, and highlighting the impact of integrated code execution on performance.

Contribution

Introduces ASyMOB, a large-scale symbolic math benchmark with analysis of LLM generalization, robustness, and the effects of code integration, advancing evaluation methods in symbolic mathematics.

Findings

01

LLMs show significant performance degradation under perturbations.

02

Models with code execution outperform those without, especially weaker models.

03

Advanced models demonstrate high accuracy and robustness, indicating a potential phase transition.

Abstract

Large language models (LLMs) are rapidly approaching the level of proficiency in university-level symbolic mathematics required for applications in advanced science and technology. However, existing benchmarks fall short in assessing the core skills of LLMs in symbolic mathematics, such as integration, differential equations, and algebraic simplification. To address this gap, we introduce ASyMOB, a novel assessment framework focused exclusively on symbolic manipulation, featuring 17,092 unique math challenges, organized by similarity and complexity. ASyMOB enables analysis of LLM generalization capabilities by comparing performance on problems that differ by simple numerical or symbolic 'perturbations'. Evaluated LLMs exhibit substantial degradation in performance for all perturbation types (up to -70.3%), suggesting reliance on memorized patterns rather than deeper understanding of…
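To make the idea of numerical and symbolic perturbations concrete, here is a minimal sketch of how variants of one seed problem might be enumerated. It is not the authors' generation pipeline: the seed integrand, the `Integrate(...)` string syntax, and the function names `numeric_variants` and `symbolic_variants` are all hypothetical, chosen only for illustration.

```python
from itertools import product

# Hypothetical seed problem with two coefficient slots, {a} and {b}.
# The string syntax is illustrative only, not the benchmark's format.
SEED = "Integrate({a}*x*exp({b}*x), x)"


def numeric_variants(values=(2, 3)):
    """Fill both slots with small integers (numerical perturbation)."""
    return [SEED.format(a=a, b=b) for a, b in product(values, repeat=2)]


def symbolic_variants(symbols=("A", "B")):
    """Fill each slot with either 1 or a free symbol (symbolic perturbation).

    The first entry (both slots set to 1) is the unperturbed seed.
    """
    a_sym, b_sym = symbols
    return [SEED.format(a=a, b=b) for a, b in product((1, a_sym), (1, b_sym))]


if __name__ == "__main__":
    for variant in numeric_variants() + symbolic_variants():
        print(variant)
```

Comparing a model's accuracy on the seed with its accuracy on such variants is one way to probe whether it generalizes beyond a memorized form.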

Peer Reviews

Decision · Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 3

Strengths

* The work clearly explains how problems are built and expanded into symbolic, numeric, and equivalence variants, with worked examples for each.
* ASyMOB fills gaps in existing datasets by targeting symbolic manipulation (integration, limits, differential equations, series, hypergeometrics) rather than text-to-math. It offers controlled difficulty via systematic perturbations and broad university-level problem coverage that previous benchmarks lack.
* Dataset instances are created with random transforms, a

Weaknesses

* The work documents qualitative examples where CAS fails but LLMs succeed, and a case solvable only by an LLM + CAS hybrid (Figure 6). Further, it argues that symbolics hurt CAS more than LLMs. What’s missing is a dataset-level percentage/table partitioning successes into LLM-only, CAS-only, and hybrid categories across perturbations. Adding this would substantively strengthen the claim.
* Some of the perturbations appear to be somewhat contrived. This may not necessarily be a bad thing, but it

Reviewer 02 · Rating 4 · Confidence 3

Strengths

- ASyMOB isolates symbolic mathematical reasoning from linguistic understanding, providing a clean test of algebraic manipulation skills.
- The symbolic, numeric, and equivalence perturbations enable fine-grained evaluation of robustness and generalization.
- Dual symbolic–numeric verification ensures reliability, and the findings reveal meaningful trends such as a "phase transition" toward genuine reasoning in frontier LLMs.
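The numeric half of a dual symbolic–numeric check can be sketched as follows. This is an illustration under stated assumptions, not the benchmark's actual verifier: the function name `numerically_equal`, the sampling interval, and the tolerances are hypothetical, and the symbolic half (which would need a computer algebra system) is omitted.

```python
import math
import random


def numerically_equal(f, g, n_points=20, lo=0.5, hi=2.0,
                      rel_tol=1e-9, abs_tol=1e-12, seed=0):
    """Return True if f and g agree at n_points random samples in [lo, hi].

    Agreement at sampled points is evidence of equivalence, not a proof.
    """
    rng = random.Random(seed)  # fixed seed keeps the check reproducible
    for _ in range(n_points):
        x = rng.uniform(lo, hi)
        if not math.isclose(f(x), g(x), rel_tol=rel_tol, abs_tol=abs_tol):
            return False
    return True


# Two algebraically equivalent forms of an antiderivative of x*exp(x),
# plus one incorrect candidate.
reference = lambda x: (x - 1) * math.exp(x)
candidate = lambda x: x * math.exp(x) - math.exp(x)
wrong = lambda x: x * math.exp(x)

if __name__ == "__main__":
    print(numerically_equal(reference, candidate))
    print(numerically_equal(reference, wrong))
```

A numeric check like this can accept answers that a symbolic simplifier fails to match, while a symbolic check guards against coincidental agreement at the sampled points, which is the usual motivation for combining the two.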

Weaknesses

- The scope is somewhat limited. The benchmark focuses narrowly on algebraic operations, omitting other mathematical reasoning domains, such as geometry or proofs.
- Some generated variants may be mathematically artificial and not representative of real-world symbolic problems.
- Several key conclusions, such as the role of code integration and hybrid tool use in improving LLM reasoning, have already been explored in prior work on tool-augmented or agentic LLMs, making the contributions more incr

Reviewer 03 · Rating 4 · Confidence 4

Strengths

- The benchmark is reasonable, as the symbolic and numeric versions can fully evaluate the ability of LLMs to address mathematical reasoning.
- The provided examples are well-motivated, as identifying cases where both LLMs and symbolic systems do not perform well can help guide further research directions.

Weaknesses

- The novelty of this paper requires further clarification. As noted at the end of the paper, GSM-Symbolic conducted similar research and reached comparable conclusions. The authors should therefore clearly articulate the unique contribution and positioning of this work, and carefully explain how their benchmark differs from existing work such as GSM-Symbolic. Additionally, some closely related studie

Code & Models

Repositories

RamanujanMachine/ASyMOB
Official

Datasets

Shalyt/ASyMOB-Algebraic_Symbolic_Mathematical_Operations_Benchmark
dataset · 313 downloads

Taxonomy

Topics: Computability, Logic, AI Algorithms · Polynomial and algebraic computation