AbstRaL: Augmenting LLMs' Reasoning by Reinforcing Abstract Thinking

Silin Gao; Antoine Bosselut; Samy Bengio; Emmanuel Abbe

arXiv:2506.07751·cs.CL·February 24, 2026

TL;DR

This paper introduces AbstRaL, a reinforcement learning-based method to enhance LLMs' abstract reasoning skills, improving robustness in grade school math tasks and general reasoning under distribution shifts.

Contribution

Proposes a novel RL-based approach to teach LLMs abstract reasoning, outperforming supervised fine-tuning in robustness and generalization for math and reasoning tasks.

Findings

01

AbstRaL significantly reduces performance drops on GSM benchmarks.

02

Abstract reasoning via RL improves LLMs' out-of-distribution mathematical reasoning.

03

Enhanced abstract thinking benefits general reasoning capabilities.

Abstract

Recent studies have shown that large language models (LLMs), especially smaller ones, often lack robustness in grade school math (GSM) reasoning. In particular, they tend to experience performance drops when faced with distribution shifts, such as changes to numerical or nominal variables, or insertions of distracting clauses. A possible strategy to address this involves generating synthetic data to further "instantiate" reasoning problems on potential variations. In this work, we instead focus on the strategy of "abstracting" reasoning problems. This not only helps counteract distribution shifts but also facilitates the connection to symbolic tools for deriving solutions. Focusing on GSM, we find that this abstraction process is better acquired through reinforcement learning (RL) than just supervised fine-tuning, which often fails to produce faithful abstractions. Our method, AbstRaL…
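To make the "abstracting" idea concrete, here is a minimal sketch in Python. It is not the authors' implementation: the symbol names (`in0`, `out0`), the step format, and the helpers `abstract_problem` and `solve` are all hypothetical, chosen only for illustration. The sketch replaces the numbers in a word problem with symbols, stores the reasoning as symbolic steps, and lets a small symbolic evaluator derive the answer, so the same abstraction can be reused when the numerical values change.

```python
import operator
import re
from fractions import Fraction

# Arithmetic operators allowed in an abstract reasoning step (illustrative set).
OPS = {"+": operator.add, "-": operator.sub,
       "*": operator.mul, "/": operator.truediv}


def abstract_problem(question):
    """Replace each number with a symbol in0, in1, ... and record its value."""
    conditions = {}

    def to_symbol(match):
        name = f"in{len(conditions)}"
        conditions[name] = Fraction(match.group())
        return f"[{name}]"

    abstract_question = re.sub(r"\d+(?:\.\d+)?", to_symbol, question)
    return abstract_question, conditions


def solve(steps, conditions):
    """Evaluate symbolic steps (out, left, op, right); return the last output."""
    env = dict(conditions)
    for out, left, op, right in steps:
        env[out] = OPS[op](env[left], env[right])
    return env[steps[-1][0]]


# Hypothetical example problem and its hand-written abstract reasoning chain.
QUESTION = ("A farm collects 16 eggs a day. The owner eats 3, bakes with 4, "
            "and sells the rest for 2 dollars each. How much is earned per day?")
STEPS = [
    ("out0", "in0", "-", "in1"),   # eggs left after eating
    ("out1", "out0", "-", "in2"),  # eggs left after baking
    ("out2", "out1", "*", "in3"),  # daily earnings
]

if __name__ == "__main__":
    abstract_question, conditions = abstract_problem(QUESTION)
    print(abstract_question)
    print(solve(STEPS, conditions))  # 18
    # A numerical distribution shift: new values, same abstract steps.
    shifted = {"in0": Fraction(20), "in1": Fraction(5),
               "in2": Fraction(6), "in3": Fraction(3)}
    print(solve(STEPS, shifted))     # 27
```

In this toy setup the reasoning steps never mention concrete numbers, so a change to the numerical values only changes the recorded conditions and leaves the steps untouched. Here the steps are written by hand; in the paper's setting, producing a faithful abstraction is the part the model has to learn.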

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 2 · Confidence 4

Strengths

1. The combination of learning abstractions using an SFT-RL framework seems interesting. 2. Though the authors have used only two main LLMs, the ablation studies are done thoughtfully. 3. Using the socratic version of GSM8k to rewrite the abstract intermediate steps (/decomposed questions) is interesting. This makes the data somewhat valuable.

Weaknesses

The paper suffers mainly from a lack of novelty, insufficient experimentation, and broad claims that are untrue. Novelty: Neither decomposition nor abstraction is new in symbolic math or math word problem solving. There are too many papers, with too many variations and ideas. The authors also do not specify any contributions, so I am not even sure whether this is a central claim. Lack of sufficient experimentation: Only two small LMs are chosen to support very broad claims. Even the variations of GSM chosen ar

Reviewer 02 · Rating 2 · Confidence 4

Strengths

1. The central idea of moving from "instantiation" to "abstraction" is a strong conceptual contribution. Explicitly teaching the model a reasoning schema that is invariant to surface details is a promising direction for improving OOD robustness. 2. The proposed AbstRaL framework is well-thought-out. The creation of the GranulAR data format, which blends socratic decomposition with symbolic abstraction, is a clever way to make the task more tractable for LLMs. The use of reinforcement learning

Weaknesses

1. The framework's dependence on a vastly more powerful model (Llama-3.3-70B) for critical data generation steps is a major weakness. Both the initial "Condition Recognition" and the subsequent "Abstract Reasoning Chain Rewriting" are performed by this model. This raises the concern that the improvements seen in smaller models are largely due to knowledge distillation from a superior model, rather than the inherent power of the AbstRaL learning scheme itself. 2. The method is evaluated exclusi

Reviewer 03 · Rating 8 · Confidence 4

Strengths

- This paper presents a quite interesting approach to math reasoning. Instead of reasoning directly in the text space, it leverages formula inference to derive the right result. This effectively turns math reasoning into a programming task (i.e., extract the parameters, write a program, and run the program to get the result) and assembles the results into the response. - The dataset construction leverages existing CoT to construct generalizable abstract traces. - Evaluation is pretty solid, especially with ablation

Weaknesses

- The proposed abstraction composition is quite complicated, especially given the rewriting + retrieval (I personally think it is better called extraction, as I confused it with document retrieval at first glance). Would a simpler way to construct reasoning, i.e., asking the model to directly generate formulas on top of basic CoT (e.g., in a separate section), achieve similar performance? Especially considering directly generate a small snippet of code for computation and ask LLM to assemble back t
