Teaching-Assistant-in-the-Loop: Improving Knowledge Distillation from Imperfect Teacher Models in Low-Budget Scenarios
Yuhang Zhou, Wei Ai

TL;DR
This paper introduces a three-component framework that uses a teaching assistant model to improve knowledge distillation from imperfect large language models, especially in low-resource scenarios, by leveraging multiple signals including student self-consistency and confidence scoring.
Contribution
It proposes a novel teaching assistant framework with a two-stage training process that enhances sample efficiency and robustness in distillation from imperfect teachers.
Findings
Achieves up to a 20.79% relative improvement on complex reasoning tasks.
Uses multiple signals (student self-consistency and TA confidence scores for student and teacher outputs) to guide student training.
Demonstrates superiority over standard fine-tuning methods.
Abstract
There is increasing interest in distilling task-specific knowledge from large language models (LLMs) to smaller student models. Nonetheless, LLM distillation presents a dual challenge: 1) there is a high cost associated with querying the teacher LLM, such as GPT-4, to gather an ample number of demonstrations; 2) the teacher LLM might provide imperfect outputs that negatively affect the student's learning process. To enhance sample efficiency within resource-constrained, imperfect teacher scenarios, we propose a three-component framework leveraging three signal types. The first signal is the student's self-consistency (the consistency of the student's multiple outputs), which is a proxy for the student's confidence. Specifically, we introduce a "teaching assistant" (TA) model to assess the uncertainty of both the student's and the teacher's outputs via confidence scoring, which serves as…
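The self-consistency signal described in the abstract can be illustrated with a minimal sketch. It assumes the student is sampled several times per question and that agreement is measured as the share of samples matching the majority answer. The function names, the answer normalization, and the 0.6 threshold are hypothetical and are not taken from the paper. The abstract does not say how the signal is used, so filtering low-agreement questions is only one possible use.

```python
from collections import Counter


def self_consistency(answers):
    """Return (majority_answer, agreement_ratio) for sampled final answers.

    Assumption: agreement is the fraction of samples equal to the
    most common answer after trivial normalization.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(a.strip().lower() for a in answers)
    majority, votes = counts.most_common(1)[0]
    return majority, votes / len(answers)


def low_confidence_questions(samples, threshold=0.6):
    """List question ids whose student agreement is below the threshold.

    Hypothetical use: with a limited teacher budget, such questions could
    be prioritized for teacher annotation.
    """
    return [
        qid
        for qid, answers in samples.items()
        if self_consistency(answers)[1] < threshold
    ]


if __name__ == "__main__":
    student_samples = {
        "q1": ["42", "42", "41", "42"],  # mostly consistent
        "q2": ["a", "b", "c"],           # no agreement
    }
    for qid, answers in student_samples.items():
        print(qid, self_consistency(answers))
    print(low_confidence_questions(student_samples))
```

A majority-vote ratio is only one way to quantify consistency. Entropy over the answer distribution would be an alternative with the same inputs.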
Taxonomy
Topics: Online Learning and Analytics
Methods: Attention Is All You Need · Softmax · Layer Normalization · Linear Layer · Byte Pair Encoding · Label Smoothing · Adam · Residual Connection · Multi-Head Attention · Position-Wise Feed-Forward Layer
