AMR-SD: Asymmetric Meta-Reflective Self-Distillation for Token-Level Credit Assignment
Zhenlin Wei, Pu Jian, Yingzhuo Deng, Xiaohan Wang, Jiajun Chai, Zhexin Hu, Wei Lin, Shanbin Zhang, Guojun Yin

TL;DR
This paper introduces AMR-SD, a self-distillation method that improves token-level credit assignment in large language models by combining a reflection bottleneck with causal information gain, which the authors report yields more stable training and stronger benchmark performance.
Contribution
AMR-SD combines a reflection bottleneck with causal information gain to sharpen token-level credit assignment in LLMs, addressing the over-conditioned teacher distributions and late-stage training collapse seen in on-policy self-distillation.
Findings
AMR-SD outperforms existing baselines on scientific, mathematical, and tool-use benchmarks.
It achieves robust long-horizon stability in complex reasoning tasks.
The method prevents late-stage training collapse.
Abstract
The alignment of Large Language Models (LLMs) for complex reasoning heavily relies on Reinforcement Learning with Verifiable Rewards (RLVR). However, standard algorithms like GRPO apply sequence-level rewards uniformly to all tokens, creating a severe credit-assignment bottleneck. While on-policy self-distillation attempts to resolve this by conditioning a self-teacher on privileged contexts, direct exposure to raw oracle solutions often induces over-conditioned teacher distributions, implicit answer leakage, and late-stage training collapse. To overcome these limitations, we propose Asymmetric Meta-Reflective Self-Distillation (AMR-SD). Instead of conditioning directly on raw reference traces, AMR-SD inserts a reflection bottleneck: it compresses diagnostic signals -- from verifier outcomes, peer rollouts, or reference feedback -- into concise, self-generated Socratic hints and…
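To make the credit-assignment bottleneck described in the abstract concrete, here is a minimal sketch of how a GRPO-style update spreads one sequence-level reward over every token of a rollout. It illustrates only the baseline behaviour the abstract criticizes, not AMR-SD itself. The function name `grpo_token_advantages`, the toy rewards, and the use of population standard deviation for group normalization are assumptions made for this example.

```python
from statistics import mean, pstdev

def grpo_token_advantages(rewards, lengths, eps=1e-6):
    """Hypothetical helper: group-normalize sequence rewards, then
    copy each sequence's advantage onto all of its tokens."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std over the group
    seq_adv = [(r - mu) / (sigma + eps) for r in rewards]
    # Every token in rollout i receives the same scalar advantage,
    # regardless of which tokens actually caused success or failure.
    return [[a] * n for a, n in zip(seq_adv, lengths)]

# Toy group of four rollouts with verifiable 0/1 rewards (made-up values).
rewards = [1.0, 0.0, 0.0, 1.0]
lengths = [3, 5, 2, 4]
token_adv = grpo_token_advantages(rewards, lengths)

for i, row in enumerate(token_adv):
    print(f"rollout {i}: {len(row)} tokens, advantage per token = {row[0]:+.3f}")
```

Because each row holds a single repeated value, a correct intermediate step inside a failed rollout is penalized exactly as much as the step that caused the failure. That is the gap token-level methods aim to close.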
