AMR-SD: Asymmetric Meta-Reflective Self-Distillation for Token-Level Credit Assignment

Zhenlin Wei; Pu Jian; Yingzhuo Deng; Xiaohan Wang; Jiajun Chai; Zhexin Hu; Wei Lin; Shanbin Zhang; Guojun Yin

arXiv:2605.18529·cs.AI·May 19, 2026

TL;DR

This paper introduces AMR-SD, a self-distillation method that improves token-level credit assignment in large language models by combining a reflection bottleneck with causal information gain. The authors report more stable training and stronger benchmark performance.

Contribution

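AMR-SD combines a reflection bottleneck with causal information gain to improve token-level credit assignment in LLMs. It targets the over-conditioned teacher distributions and late-stage training collapse that arise when a self-teacher is conditioned directly on raw oracle solutions.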

Findings

01 AMR-SD outperforms existing baselines on scientific, mathematical, and tool-use benchmarks.

02 It achieves robust long-horizon stability in complex reasoning tasks.

03 The method prevents late-stage training collapse.

Abstract

The alignment of Large Language Models (LLMs) for complex reasoning heavily relies on Reinforcement Learning with Verifiable Rewards (RLVR). However, standard algorithms like GRPO apply sequence-level rewards uniformly to all tokens, creating a severe credit-assignment bottleneck. While on-policy self-distillation attempts to resolve this by conditioning a self-teacher on privileged contexts, direct exposure to raw oracle solutions often induces over-conditioned teacher distributions, implicit answer leakage, and late-stage training collapse. To overcome these limitations, we propose Asymmetric Meta-Reflective Self-Distillation (AMR-SD). Instead of conditioning directly on raw reference traces, AMR-SD inserts a reflection bottleneck: it compresses diagnostic signals -- from verifier outcomes, peer rollouts, or reference feedback -- into concise, self-generated Socratic hints and…
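
The credit-assignment bottleneck described above can be made concrete with a minimal sketch of how a GRPO-style update spreads one sequence-level reward over every token. This is an illustration only, not the authors' code: the function name `grpo_token_advantages`, the toy rewards and lengths, and the use of the population standard deviation with a small `eps` are all assumptions made for the example.

```python
from statistics import mean, pstdev


def grpo_token_advantages(rewards, lengths, eps=1e-6):
    """Broadcast a group-normalized sequence reward to every token.

    rewards: one scalar verifier reward per sampled rollout in the group.
    lengths: number of generated tokens in each rollout.
    Returns one list of per-token advantages per rollout.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations vary
    advantages = []
    for reward, n_tokens in zip(rewards, lengths):
        # One scalar per rollout, normalized against the group.
        seq_advantage = (reward - mu) / (sigma + eps)
        # Every token in the rollout receives the same value, so a correct
        # intermediate step and a wrong one are credited identically.
        advantages.append([seq_advantage] * n_tokens)
    return advantages


# Hypothetical group of four rollouts with binary verifier rewards.
rewards = [1.0, 0.0, 0.0, 1.0]
lengths = [3, 5, 2, 4]
token_adv = grpo_token_advantages(rewards, lengths)
```

Each inner list in `token_adv` holds a single repeated value, which is the uniformity the abstract identifies as the bottleneck. A token-level method would replace that constant with a signal that varies across positions within a rollout.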
