GEAR: Granularity-Adaptive Advantage Reweighting for LLM Agents via Self-Distillation
Sijia Li, Yuchen Huang, Zifan Liu, Yanping Li, Jingjing Fu, Li Zhao, Jiang Bian, Ling Zhang, Jun Zhang, Rui Wang

TL;DR
GEAR is an adaptive credit assignment framework for LLM agents that uses self-distillation signals to improve policy updates, especially on complex long-horizon tasks.
Contribution
The paper proposes GEAR, a method for adaptive-granularity credit assignment that uses self-distillation signals to improve reinforcement learning for LLM agents.
Findings
GEAR outperforms standard GRPO and other baselines across eight benchmarks.
Improvements over GRPO reach up to 20% on challenging tasks.
Adaptive segmentation based on divergence spikes improves credit assignment accuracy.
Abstract
Reinforcement learning has become a widely used post-training approach for LLM agents, where training commonly relies on outcome-level rewards that provide only coarse supervision. While finer-grained credit assignment is promising for effective policy updates, obtaining reliable local credit and assigning it to the right parts of the long-horizon trajectory remains an open challenge. In this paper, we propose Granularity-adaptivE Advantage Reweighting (GEAR), an adaptive-granularity credit assignment framework that reshapes the trajectory-level GRPO advantage using token- and segment-level signals derived from self-distillation. GEAR compares an on-policy student with a ground-truth-conditioned teacher to obtain a reference-guided divergence signal for identifying adaptive segment boundaries and modulating local advantage weights. This divergence often spikes at the onset of a semantic…
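The pipeline the abstract describes (a divergence signal between student and teacher, segment boundaries at divergence spikes, and local reweighting of the trajectory-level GRPO advantage) can be illustrated with a minimal sketch. Only the group-normalized advantage follows standard GRPO. The divergence proxy, the spike threshold `mean + k * std`, the mean-one normalization, and the choice to give high-divergence segments more weight are assumptions made for illustration, not the paper's formulas. All function names are hypothetical.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-8):
    """Group-normalized trajectory-level advantages, as in standard GRPO."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def token_divergence(student_logp, teacher_logp):
    """Assumed divergence proxy: absolute gap between the teacher's and the
    student's log-probability of each sampled token."""
    return [abs(t - s) for s, t in zip(student_logp, teacher_logp)]


def segment_at_spikes(div, k=1.0):
    """Assumed boundary rule: open a new segment at every token whose
    divergence exceeds mean + k * std over the trajectory."""
    threshold = mean(div) + k * pstdev(div)
    bounds = [0]
    for i, d in enumerate(div):
        if i > 0 and d > threshold:
            bounds.append(i)
    bounds.append(len(div))
    return [(bounds[j], bounds[j + 1]) for j in range(len(bounds) - 1)]


def reweight(advantage, div, segments):
    """Assumed weighting: each token gets its segment's mean divergence as a
    weight, rescaled so the weights average to 1 over the trajectory."""
    seg_w = []
    for start, end in segments:
        seg_w.extend([mean(div[start:end])] * (end - start))
    avg = mean(seg_w)
    if avg == 0:
        # No divergence anywhere: fall back to the uniform GRPO advantage.
        return [advantage] * len(div)
    return [advantage * w / avg for w in seg_w]


# Toy trajectory of six tokens; the student diverges sharply at token 3.
student_logp = [-0.1, -0.2, -0.1, -2.0, -0.3, -0.2]
teacher_logp = [-0.1, -0.2, -0.1, -0.1, -0.2, -0.2]

div = token_divergence(student_logp, teacher_logp)
segments = segment_at_spikes(div)
token_adv = reweight(1.0, div, segments)
```

In this toy case the spike at token 3 splits the trajectory into two segments, and the mean-one normalization keeps the average per-token advantage equal to the original trajectory-level value. A practical variant would likely need a floor on the weights so that low-divergence segments still receive some gradient signal.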
