A Study on Hidden Layer Distillation for Large Language Model Pre-Training
Maxime Guigon, Lucas Dixon, Michaël E. Sander

TL;DR
This paper benchmarks Hidden Layer Distillation against logit-based knowledge distillation in large language model pre-training, finding systematic perplexity improvements but no consistent downstream task gains.
Contribution
It provides the first large-scale evaluation of Hidden Layer Distillation for decoder-only LLM pre-training, highlighting its potential and limitations.
Findings
HLD yields a systematic perplexity gain over logit-based KD across all shared-hyperparameter configurations.
HLD does not consistently improve downstream task performance.
The teacher's hidden layers carry a latent signal that HLD can extract, but further breakthroughs are needed for it to have greater impact.
Abstract
Knowledge Distillation (KD) is a critical tool for training Large Language Models (LLMs), yet the majority of research focuses on approaches that rely solely on output logits, neglecting semantic information in the teacher's intermediate representations. While Hidden Layer Distillation (HLD) showed potential for encoder architectures, its application to decoder-only pre-training at scale remains largely unexplored. Through compute-controlled experiments, we benchmark HLD against logit-based KD and self-supervised baselines with Gemma3 3.4B as teacher and 123M and 735M students trained on up to 168B tokens from the C4 dataset. Our experiments show that HLD does not consistently outperform standard KD on downstream evaluation tasks. Nevertheless, we show that HLD can yield a systematic perplexity gain over KD across all shared-hyperparameter configurations, suggesting that a latent signal…
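To make the contrast between the two objectives concrete, here is a minimal sketch of a logit-based KD loss and a hidden-layer matching loss. It is an illustration under stated assumptions, not the paper's formulation: the MSE matching term, the linear `projection` that maps student states to the teacher's width, the `temperature`, and the `hld_weight` mixing coefficient are all hypothetical choices that this summary does not specify.

```python
import numpy as np


def log_softmax(x, axis=-1):
    # Numerically stable log-softmax.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def kd_loss(student_logits, teacher_logits, temperature=1.0):
    # Logit-based KD: KL(teacher || student) over the vocabulary,
    # averaged over token positions. Inputs have shape (tokens, vocab).
    t_logp = log_softmax(teacher_logits / temperature)
    s_logp = log_softmax(student_logits / temperature)
    kl = (np.exp(t_logp) * (t_logp - s_logp)).sum(axis=-1)
    return kl.mean()


def hld_loss(student_hidden, teacher_hidden, projection):
    # Hidden-layer term (assumed form): MSE between linearly projected
    # student states (tokens, d_student) and teacher states (tokens, d_teacher).
    projected = student_hidden @ projection
    return np.mean((projected - teacher_hidden) ** 2)


def combined_loss(student_logits, teacher_logits,
                  student_hidden, teacher_hidden,
                  projection, hld_weight=1.0):
    # Assumed combination: KD loss plus a weighted hidden-layer term.
    return (kd_loss(student_logits, teacher_logits)
            + hld_weight * hld_loss(student_hidden, teacher_hidden, projection))
```

In this sketch the projection is needed only because a smaller student typically has a narrower hidden dimension than its teacher; in practice it would be trained jointly with the student.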
