Which Reasoning Trajectories Teach Students to Reason Better? A Simple Metric of Informative Alignment

Yuming Yang; Mingyoung Lai; Wanxu Zhao; Xiaoran Fan; Zhiheng Xi; Mingqi Wu; Chiyue Huang; Jun Zhao; Haijun Lv; Jian Tong; Yunhua Zhou; Yicheng Zou; Qipeng Guo; Tao Gui; Qi Zhang; Xuanjing Huang

arXiv:2601.14249·cs.CL·April 23, 2026

TL;DR

This paper introduces the Rank-Surprisal Ratio (RSR), a metric that scores how suitable a reasoning trajectory is for a given student LLM by capturing both its alignment with the student and its informativeness, so that better distillation data can be selected.

Contribution

The paper proposes RSR, a simple yet effective metric that outperforms existing measures in selecting informative reasoning trajectories for training student models.

Findings

1. RSR correlates strongly with reasoning performance (average Spearman 0.86).
2. RSR outperforms existing metrics in trajectory and teacher selection.
3. RSR is easy to compute and interpret across diverse models and trajectories.
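The first finding is stated as a Spearman rank correlation. As a minimal sketch of how such a figure is obtained, the snippet below computes Spearman's coefficient in plain Python by ranking both lists (ties get their average rank) and taking the Pearson correlation of the ranks. The function names and the five score/accuracy pairs are hypothetical illustrations, not data from the paper.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation = Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical example: one metric score per training set, and the accuracy
# of the student trained on that set. These numbers are made up.
metric_scores = [0.2, 0.5, 0.9, 0.4, 0.7]
student_accuracy = [31.0, 40.0, 52.0, 45.0, 47.0]
rho = spearman(metric_scores, student_accuracy)
```

A coefficient near 1 means the metric orders the training sets almost the same way the resulting student accuracy does, which is the property a data-selection metric needs.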

Abstract

Long chain-of-thought (CoT) trajectories provide rich supervision signals for distilling reasoning from teacher to student LLMs. However, both prior work and our experiments show that trajectories from stronger teachers do not necessarily yield better students, highlighting the importance of data-student suitability in distillation. Existing methods assess suitability primarily through student likelihood, favoring trajectories that align closely with the student model's current behavior but overlooking more informative ones. Addressing this, we propose Rank-Surprisal Ratio (RSR), a simple metric that captures both alignment and informativeness to assess the suitability of a reasoning trajectory. RSR is motivated by the observation that effective trajectories typically balance learning signal strength and behavioral alignment by combining low absolute probability with relatively…
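The abstract as shown here is cut off before it defines RSR, so the following is only a sketch under a loudly stated assumption: going by the metric's name, it treats RSR as the average rank of each trajectory token under the student divided by the average surprisal (negative log-probability) of those tokens. The paper's exact definition may differ (for example in rank clipping or averaging), and all function names and the toy logits are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def token_rank_and_surprisal(logits, target):
    """Rank (1 = most likely) and surprisal in nats of the target token."""
    log_probs = log_softmax(logits)
    rank = 1 + sum(1 for x in logits if x > logits[target])
    return rank, -log_probs[target]


def rank_surprisal_ratio(step_logits, targets):
    """ASSUMED form: mean token rank divided by mean token surprisal.

    step_logits: one list of student logits per trajectory position.
    targets: the trajectory's token id at each position.
    Surprisal is strictly positive for finite logits over 2+ tokens,
    so the division is safe in that setting.
    """
    pairs = [token_rank_and_surprisal(l, t) for l, t in zip(step_logits, targets)]
    mean_rank = sum(r for r, _ in pairs) / len(pairs)
    mean_surprisal = sum(s for _, s in pairs) / len(pairs)
    return mean_rank / mean_surprisal


# Toy example with a 3-token vocabulary and a 2-token trajectory.
toy_logits = [[2.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
toy_targets = [1, 0]
toy_rsr = rank_surprisal_ratio(toy_logits, toy_targets)
```

In practice the per-position logits would come from a forward pass of the student model over the teacher's trajectory; the plain lists here stand in for that output.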

Code & Models

Repositories

umeannever/RankSurprisalRatio
github

Datasets

Umean/RSR_data
dataset · 136 downloads
