Beyond Accuracy: Dissecting Mathematical Reasoning for LLMs Under Reinforcement Learning

Jiayu Wang; Yifei Ming; Zixuan Ke; Caiming Xiong; Shafiq Joty; Aws Albarghouthi; Frederic Sala

arXiv:2506.04723 · cs.AI · October 27, 2025

TL;DR

This paper introduces SPARKLE, a detailed framework to analyze how reinforcement learning improves language models' reasoning, revealing that RL enhances internal strategy formulation and knowledge integration rather than external plan execution.

Contribution

The paper presents SPARKLE, a novel analytic framework for dissecting RL effects on language models' reasoning, and proposes SparkleRL-PSS for training with hard problems using partial scaffolding.

Findings

01. RL-tuned models lose less accuracy than base or SFT models when given explicit step-by-step plans.

02. RL improves models' ability to integrate knowledge.

03. Hard problems with partial scaffolding can be effectively reused for training.
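
To make the idea of partial scaffolding in the third finding concrete, here is a minimal sketch of how a hard problem might be paired with the first part of a reference solution before being reused for training. The function name `scaffold_prompt`, the `fraction` parameter, and the prompt wording are all hypothetical illustrations; this page does not describe how SparkleRL-PSS actually constructs its prompts.

```python
from typing import List


def scaffold_prompt(problem: str, steps: List[str], fraction: float) -> str:
    """Build a prompt that reveals only the leading share of a reference solution.

    Hypothetical helper for illustration; not taken from the paper.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must lie in [0, 1]")
    # Number of leading reference steps to reveal (rounded down).
    n_shown = int(len(steps) * fraction)
    lines = [f"Problem: {problem}"]
    if n_shown:
        lines.append("Partial solution:")
        lines.extend(f"{i}. {s}" for i, s in enumerate(steps[:n_shown], start=1))
    lines.append("Continue from here and give the final answer.")
    return "\n".join(lines)


# Toy example: reveal half of a four-step reference solution.
example = scaffold_prompt(
    "Solve x^2 - 5x + 6 = 0.",
    ["Factor the quadratic.", "Write (x - 2)(x - 3) = 0.",
     "Set each factor to zero.", "Conclude x = 2 or x = 3."],
    fraction=0.5,
)
print(example)
```

With `fraction=0.0` the model sees the bare problem, and with `fraction=1.0` it sees every reference step, so a single parameter spans the range from no scaffolding to full scaffolding.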

Abstract

Reinforcement learning (RL) has become the dominant paradigm for improving the performance of language models on complex reasoning tasks. Despite the substantial empirical gains demonstrated by RL-based training methods like GRPO, a granular understanding of why and how RL enhances performance is still lacking. To bridge this gap, we introduce SPARKLE, a fine-grained analytic framework to dissect the effects of RL across three key dimensions: (1) plan following and execution, (2) knowledge integration, and (3) chain of subproblems. Using this framework, we gain insights beyond mere accuracy. For instance, providing models with explicit human-crafted, step-by-step plans can surprisingly degrade performance on the most challenging benchmarks, yet RL-tuned models exhibit greater robustness, experiencing markedly smaller performance drops than base or SFT models. This suggests that RL may…
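
The plan-following comparison described in the abstract can be expressed as a simple accuracy difference. The sketch below assumes each model has been graded twice on the same problems, once without and once with a supplied plan. The helper names `accuracy` and `plan_drop` are hypothetical, and the boolean gradings are made-up placeholders, not results from the paper.

```python
from typing import Sequence


def accuracy(correct: Sequence[bool]) -> float:
    """Fraction of problems answered correctly."""
    if not correct:
        raise ValueError("need at least one graded answer")
    return sum(correct) / len(correct)


def plan_drop(without_plan: Sequence[bool], with_plan: Sequence[bool]) -> float:
    """Accuracy lost when an explicit plan is supplied (positive means the plan hurt)."""
    if len(without_plan) != len(with_plan):
        raise ValueError("both runs must cover the same problems")
    return accuracy(without_plan) - accuracy(with_plan)


# Placeholder gradings for illustration only (not data from the paper).
base_drop = plan_drop([True, True, False, False], [True, False, False, False])
rl_drop = plan_drop([True, True, True, False], [True, True, True, False])
print(f"base drop: {base_drop:.2f}, RL drop: {rl_drop:.2f}")
```

Under this measure, greater robustness means a drop closer to zero when the plan is added.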

Code & Models

Models

Datasets

sparkle-reasoning/sparkle_preview
dataset · 55 dl

Videos

Beyond Accuracy: Dissecting Mathematical Reasoning for LLMs Under Reinforcement Learning · SlidesLive

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Reinforcement Learning in Robotics

Methods: Balanced Selection