Demystifying Low-Rank Knowledge Distillation in Large Language Models: Convergence, Generalization, and Information-Theoretic Guarantees
Alberlucia Rafael Soarez, Daniel Kim, Mariana Costa, Alejandro Torre

TL;DR
This paper provides a rigorous theoretical analysis of low-rank knowledge distillation in large language models, establishing convergence, generalization bounds, and information-theoretic insights that guide effective rank selection.
Contribution
It introduces a comprehensive theoretical framework for low-rank distillation, including convergence rates, generalization bounds, and mutual information analysis, supported by empirical validation.
Findings
Convergence rate of O(1/√T) under mild assumptions.
Generalization error scales with rank as O(r(m+n)/√n).
Optimal rank suggested as O(√n) based on information-theoretic analysis.
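To make the rank-related findings concrete, here is a minimal NumPy sketch. It is not the paper's method. It assumes that n in the O(√n) rank suggestion refers to the number of training samples, uses a hypothetical constant c = 1, and uses truncated SVD as a generic stand-in for a low-rank projection. The names heuristic_rank, factor_param_count, and low_rank_approx are illustrative only. The r(m+n) count is simply the number of entries in a rank-r factorization of an m×n matrix.

```python
import math
import numpy as np

def heuristic_rank(n_samples, max_rank, c=1.0):
    """Illustrative rank rule r = ceil(c * sqrt(n)), clipped to [1, max_rank].

    Assumption: n is the number of training samples; c is a hypothetical constant.
    """
    return max(1, min(max_rank, math.ceil(c * math.sqrt(n_samples))))

def factor_param_count(m, n, r):
    """Entries in a rank-r factorization A (m x r) @ B (r x n): r * (m + n)."""
    return r * (m + n)

def low_rank_approx(W, r):
    """Best rank-r approximation of W in Frobenius norm via truncated SVD."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r, :]

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 48))   # stand-in for a weight matrix
r = heuristic_rank(n_samples=400, max_rank=min(W.shape))
W_r = low_rank_approx(W, r)

# Compare storage of the factorization against the dense matrix.
dense_params = W.size
factored_params = factor_param_count(*W.shape, r)
rel_error = np.linalg.norm(W - W_r) / np.linalg.norm(W)
print(r, dense_params, factored_params, rel_error)
```

A random Gaussian matrix has no low-rank structure, so the relative error here is large. The sketch only shows how the rank rule and the parameter count interact, not the approximation quality one would see on trained weights.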
Abstract
Knowledge distillation has emerged as a powerful technique for compressing large language models (LLMs) into efficient, deployable architectures while preserving their advanced capabilities. Recent advances in low-rank knowledge distillation, particularly methods like Low-Rank Clone (LRC), have demonstrated remarkable empirical success, achieving comparable performance to full-parameter distillation with significantly reduced training data and computational overhead. However, the theoretical foundations underlying these methods remain poorly understood. In this paper, we establish a rigorous theoretical framework for low-rank knowledge distillation in language models. We prove that under mild assumptions, low-rank projection preserves the optimization dynamics, yielding explicit convergence rates of O(1/√T). We derive generalization bounds that characterize the fundamental…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Stochastic Gradient Optimization Techniques
