Boosting LLM Reasoning via Human-Inspired Reward Shaping
Wenze Lin, Zhen Yang, Xitai Jiang, Xiaoteng Ma, Gao Huang

TL;DR
This paper introduces T2T, a human-inspired dual-phase reward framework for LLM reasoning, which enhances learning by dynamically switching between exploration and consolidation phases.
Contribution
It proposes a novel reward shaping method inspired by human learning, improving reasoning in LLMs through stage-specific incentives.
Findings
T2T outperforms standard GRPO and recent baselines on mathematical benchmarks.
The dual-phase mechanism improves reasoning accuracy across 5 mainstream LLMs.
Extensive experiments validate the effectiveness of human-inspired reward shaping.
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a promising paradigm for enhancing reasoning in Large Language Models (LLMs). However, existing reward formulations typically treat exploration and consolidation as a monolithic process, resulting in entangled stage-wise learning dynamics. This contradicts how humans naturally learn. In human learning, individuals adopt distinct behavioral patterns toward mastered versus unfamiliar problems. When confronting unmastered challenges, humans prioritize broad exploration to seek viable solutions. By contrast, for well-mastered problems, they focus on reasoning condensation and knowledge abstraction to distill concise underlying principles. Motivated by this gap, we introduce T2T (Thickening-to-Thinning), a dynamic reward framework inspired by human learning processes. Specifically, it implements…
