In Good GRACEs: Principled Teacher Selection for Knowledge Distillation
Abhishek Panigrahi, Bingbin Liu, Sadhika Malladi, Sham Kakade, Surbhi Goel

TL;DR
This paper introduces GRACE, a lightweight, data-free score for selecting optimal teachers in knowledge distillation, which correlates strongly with student performance and guides key design choices.
Contribution
We propose GRACE, a novel score based on distributional properties of the student's gradients. It predicts teacher effectiveness without access to a verifier, teacher logits, teacher internals, or test data.
Findings
GRACE reaches up to 86% Spearman correlation with the performance of distilled students on GSM8K and MATH.
Using GRACE-selected teachers improves student accuracy by up to 7.4%.
GRACE guides distillation design choices like temperature and teacher selection.
Abstract
Knowledge distillation is an efficient strategy to use data generated by large "teacher" language models to train smaller capable "student" models, but selecting the optimal teacher for a specific student-task combination requires expensive trial-and-error. We propose a lightweight score called GRACE to quantify how effective a teacher will be for post-training a student model. GRACE measures distributional properties of the student's gradients without access to a verifier, teacher logits, teacher internals, or test data. From an information-theoretic perspective, GRACE connects to leave-one-out stability of gradient-based algorithms, which controls the generalization performance of the distilled students. On GSM8K and MATH, GRACE correlates strongly (up to 86% Spearman correlation) with the performance of the distilled LLaMA and OLMo students. In particular, training a student using…
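The Spearman correlation quoted in the abstract is a rank correlation between a per-teacher score and the resulting student performance. The sketch below is only an illustration of that statistic, not the authors' evaluation code: the function names (`average_ranks`, `pearson`, `spearman`) and every number in it are hypothetical, and it makes no assumption about whether a higher or lower score is better.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation applied to the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical values: one score and one student accuracy per candidate teacher.
teacher_scores = [0.42, 0.31, 0.55, 0.27, 0.48]
student_accuracy = [0.61, 0.66, 0.52, 0.64, 0.58]

rho = spearman(teacher_scores, student_accuracy)
print(f"Spearman correlation: {rho:.2f}")
```

Because only ranks matter, the statistic is unchanged by any monotone rescaling of the score, which is why it suits comparing a score against accuracy on a different scale.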
Peer Reviews
Decision: ICLR 2026 Poster
- GRACE offers an efficient method to identify the most suitable teacher for distillation, improving the distillation process without requiring extensive trial and error.
- The proposed GRACE strongly correlates with student performance, as demonstrated by its high correlation (up to 92%) with post-distillation results, outperforming traditional metrics like G-Vendi and G-Var.
- It provides actionable insights to practitioners, including optimal generation temperature and teacher selection un…
- While effective, GRACE's correlations with student performance are not perfect, indicating that further refinement and additional explanatory factors might improve its accuracy.
- The score's performance can be sensitive to hyperparameters, such as the number of prompts and generation dimension, which can complicate its application.
- The current evaluation is limited to math problem-solving, which is a small task within the NLP and AI domain. Is there any possibility of generalizing to o…
1. The paper is well-motivated. The trial-and-error approach for teacher selection is computationally costly, and a principled, efficient method is a valuable contribution.
2. The connection of the proposed score to CMI for generalization bound is clever.
3. A key strength is that GRACE is lightweight and does not require access to teacher logits or internal states. This makes the method broadly applicable, even in black-box or API-based distillation scenarios.
1. The theoretical motivation (Lemma 2.1 and Appendix A) connects GRACE to the generalization gap in terms of loss. However, the empirical experiments measure correlation with final test accuracy (Average-at-k). The authors also note in the appendix (lines 752-755) that "GRACE serves as a reliable predictor of student performance, even though it fails to correlate with loss-based quantities". This is a disconnect that undermines the theoretical argument and should be discussed more prominently.
1. The proposed metric is well motivated, reasonably extends G-Vendi, and is cheap to compute.
2. The proposed method achieves good predictive correlation and successfully predicts the best teacher model across multiple datasets and student models.
3. The proposed method correctly predicts generation temperature across different teacher model sizes, unlike some prior baselines which monotonically increase with temperature. With synthetic LLM-generated data and seqKD being…
1. Most of the experiments in the paper (except Figure 11) use temperature 1. Most LLMs, however, are typically used at much lower temperatures. The performance correlation of GRACE drops sharply for greedy sampling, which implies the method is indeed affected by temperature. This significantly weakens the impact of most of the analysis and experiments in this paper.
2. The proposed method uses an ad-hoc gradient scaling of $1/\log(L)$, motivated by a decrease in gradient norm with increasing sequence l…
Taxonomy
Topics: Intelligent Tutoring Systems and Adaptive Learning · Topic Modeling · Multimodal Machine Learning Applications
