Boosting LLM Reasoning via Human-Inspired Reward Shaping

Wenze Lin; Zhen Yang; Xitai Jiang; Xiaoteng Ma; Gao Huang

arXiv:2602.04265 · cs.LG · May 15, 2026

TL;DR

This paper introduces T2T (Thickening-to-Thinning), a human-inspired dual-phase reward framework for LLM reasoning that switches dynamically between an exploration phase and a consolidation phase.

Contribution

It proposes a novel reward shaping method inspired by human learning, improving reasoning in LLMs through stage-specific incentives.

Findings

1. T2T outperforms standard GRPO and recent baselines on mathematical benchmarks.

2. The dual-phase mechanism improves reasoning accuracy across 5 mainstream LLMs.

3. Extensive experiments validate the effectiveness of human-inspired reward shaping.

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a promising paradigm for enhancing reasoning in Large Language Models (LLMs). However, existing reward formulations typically treat exploration and consolidation as a monolithic process, resulting in entangled stage-wise learning dynamics. This contradicts the natural learning behavior of human learners. In human learning, individuals adopt distinct behavioral patterns toward mastered versus unfamiliar problems. When confronting unmastered challenges, humans prioritize broad exploration to seek viable solutions. By contrast, for well-mastered problems, they focus instead on reasoning condensation and knowledge abstraction to distill concise underlying principles. Motivated by this gap, we introduce T2T (Thickening-to-Thinning), a dynamic reward framework inspired by human learning processes. Specifically, it implements…
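The abstract is truncated before it states the actual reward formula, so the following is only an illustrative sketch of what a two-phase, mastery-dependent reward could look like. It is not the authors' method. Everything in it is an assumption made for illustration: the function name `shaped_rewards`, the use of the per-prompt pass rate as a mastery signal, the `mastery_threshold` and `alpha` parameters, and the use of response length as a stand-in for "thick" (exploratory) versus "thin" (condensed) reasoning.

```python
from typing import List, Sequence


def shaped_rewards(
    correct: Sequence[bool],
    lengths: Sequence[int],
    mastery_threshold: float = 0.5,  # hypothetical: pass rate at which a prompt counts as mastered
    alpha: float = 0.2,              # hypothetical: weight of the length-based shaping term
) -> List[float]:
    """Illustrative dual-phase reward for one group of sampled responses to a prompt.

    Assumed (not from the paper): mastery is estimated by the group's pass rate,
    and response length is a proxy for how exploratory a response is.
    """
    if not correct or len(correct) != len(lengths):
        raise ValueError("correct and lengths must be non-empty and equally long")
    max_len = max(lengths)
    if max_len <= 0:
        raise ValueError("at least one response must have positive length")

    pass_rate = sum(correct) / len(correct)
    rewards = []
    for ok, length in zip(correct, lengths):
        base = 1.0 if ok else 0.0      # verifiable correctness reward
        rel_len = length / max_len     # length normalised to (0, 1] within the group
        if pass_rate < mastery_threshold:
            # Unmastered prompt: small bonus for longer, more exploratory attempts.
            bonus = alpha * rel_len
        else:
            # Mastered prompt: bonus for brevity, given only to correct responses.
            bonus = alpha * (1.0 - rel_len) if ok else 0.0
        rewards.append(base + bonus)
    return rewards


# Unmastered prompt (1 of 4 correct): longer attempts score slightly higher.
hard = shaped_rewards([False, False, True, False], [10, 20, 40, 30])
# Mastered prompt (3 of 4 correct): shorter correct answers score higher.
easy = shaped_rewards([True, True, True, False], [10, 20, 40, 30])
```

In this sketch the correctness term always dominates the shaping term (`alpha` is below 1), so a correct response is never ranked below an incorrect one. Whether T2T enforces a similar constraint cannot be determined from the truncated abstract.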
