GEAR: Granularity-Adaptive Advantage Reweighting for LLM Agents via Self-Distillation

Sijia Li; Yuchen Huang; Zifan Liu; Yanping Li; Jingjing Fu; Li Zhao; Jiang Bian; Ling Zhang; Jun Zhang; Rui Wang

arXiv:2605.11853·cs.LG·May 15, 2026

TL;DR

GEAR is an adaptive-granularity credit assignment framework for LLM agents that uses self-distillation signals to reshape policy updates, particularly on complex long-horizon tasks.

Contribution

The paper proposes GEAR, a method for adaptive-granularity credit assignment that uses self-distillation to reshape the trajectory-level GRPO advantage with token- and segment-level signals when training LLM agents with reinforcement learning.

Findings

01. GEAR outperforms standard GRPO and other baselines across eight benchmarks.

02. Improvements over GRPO reach up to 20% on challenging tasks.

03. Adaptive segmentation based on divergence spikes improves credit assignment accuracy.

Abstract

Reinforcement learning has become a widely used post-training approach for LLM agents, where training commonly relies on outcome-level rewards that provide only coarse supervision. While finer-grained credit assignment is promising for effective policy updates, obtaining reliable local credit and assigning it to the right parts of the long-horizon trajectory remains an open challenge. In this paper, we propose Granularity-adaptivE Advantage Reweighting (GEAR), an adaptive-granularity credit assignment framework that reshapes the trajectory-level GRPO advantage using token- and segment-level signals derived from self-distillation. GEAR compares an on-policy student with a ground-truth-conditioned teacher to obtain a reference-guided divergence signal for identifying adaptive segment boundaries and modulating local advantage weights. This divergence often spikes at the onset of a semantic…
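
The mechanism described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the formulas, so the KL direction, the spike rule (mean plus `k` standard deviations), the mean-one weight normalization, and all function names below are assumptions made for illustration only.

```python
import math
from statistics import mean, pstdev


def grpo_advantages(rewards):
    """Group-normalized outcome advantages, one scalar per trajectory."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + 1e-8) for r in rewards]


def token_kl(student_probs, teacher_probs):
    """KL(teacher || student) at one token position (direction is an assumption)."""
    return sum(
        t * math.log(t / max(s, 1e-12))
        for t, s in zip(teacher_probs, student_probs)
        if t > 0
    )


def segment_starts(divergence, k=1.0):
    """Hypothetical spike rule: open a new segment where the divergence
    rises above mean + k * std after having been at or below it."""
    threshold = mean(divergence) + k * pstdev(divergence)
    starts = [0]
    for i in range(1, len(divergence)):
        if divergence[i] > threshold and divergence[i - 1] <= threshold:
            starts.append(i)
    return starts


def reweight(advantage, divergence, starts):
    """Spread one trajectory-level advantage over tokens using segment weights.

    Each segment's weight is its mean divergence divided by the trajectory's
    mean divergence, so the weights average to 1 over tokens (an assumed
    normalization that preserves the total advantage mass).
    """
    ends = starts[1:] + [len(divergence)]
    overall = mean(divergence)
    token_advantages = []
    for s, e in zip(starts, ends):
        w = mean(divergence[s:e]) / overall if overall > 0 else 1.0
        token_advantages.extend([advantage * w] * (e - s))
    return token_advantages


# Toy example: four rollouts with binary outcome rewards, and a made-up
# per-token student/teacher divergence for the first rollout.
advantages = grpo_advantages([1.0, 0.0, 0.0, 1.0])
divergence = [0.1, 0.1, 0.9, 0.2, 0.1, 0.1, 0.8, 0.1]
starts = segment_starts(divergence)
token_advantages = reweight(advantages[0], divergence, starts)
```

In this toy setup, segments that begin at a divergence spike receive a larger share of the trajectory's advantage, while the sum over tokens stays equal to the unweighted case. In practice the divergence values would come from `token_kl` applied to the student's and the ground-truth-conditioned teacher's next-token distributions.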
