Demystifying Low-Rank Knowledge Distillation in Large Language Models: Convergence, Generalization, and Information-Theoretic Guarantees

Alberlucia Rafael Soarez; Daniel Kim; Mariana Costa; Alejandro Torre

arXiv:2603.22355·stat.ML·March 25, 2026

Open Access

TL;DR

This paper provides a rigorous theoretical analysis of low-rank knowledge distillation in large language models, establishing convergence rates, generalization bounds, and information-theoretic insights that guide effective rank selection.

Contribution

It introduces a comprehensive theoretical framework for low-rank distillation, including convergence rates, generalization bounds, and mutual information analysis, supported by empirical validation.

Findings

01

Convergence rate of O(1/√T) under mild assumptions.

02

Generalization error scales with rank as O(r(m+n)/√n).

03

Optimal rank suggested as O(√n) based on information-theoretic analysis.
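
The r(m+n) factor in the second finding equals the number of parameters in a rank-r factorization of an m × n weight matrix, which is one way to see why the bound grows with rank. The following is a minimal sketch under stated assumptions, not the authors' code: the function names are hypothetical, the constant `c` is an illustrative placeholder for the unspecified constant hidden in the O(√n) notation, and reading n in O(√n) as the number of training samples is an assumption that this page does not confirm.

```python
import math


def dense_param_count(m: int, n: int) -> int:
    """Parameters in a full m x n weight matrix."""
    return m * n


def low_rank_param_count(m: int, n: int, r: int) -> int:
    """Parameters in a rank-r factorization W ~ A @ B,
    with A of shape (m, r) and B of shape (r, n)."""
    return r * (m + n)


def sqrt_rank_heuristic(num_samples: int, c: float = 1.0) -> int:
    """Hypothetical rank rule r = c * sqrt(num_samples).

    Assumption: n in the O(sqrt(n)) finding is the sample count,
    and c stands in for a constant the listing does not give.
    """
    return max(1, round(c * math.sqrt(num_samples)))


if __name__ == "__main__":
    m, n = 4096, 4096                     # illustrative layer shape
    r = sqrt_rank_heuristic(10_000)       # illustrative sample count
    ratio = low_rank_param_count(m, n, r) / dense_param_count(m, n)
    print(f"rank={r}, low-rank/dense parameter ratio={ratio:.4f}")
```

The factorization only saves parameters when r(m+n) < mn, that is, when r < mn/(m+n), so any rank chosen by such a rule should be checked against the layer shape.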

Abstract

Knowledge distillation has emerged as a powerful technique for compressing large language models (LLMs) into efficient, deployable architectures while preserving their advanced capabilities. Recent advances in low-rank knowledge distillation, particularly methods like Low-Rank Clone (LRC), have demonstrated remarkable empirical success, achieving comparable performance to full-parameter distillation with significantly reduced training data and computational overhead. However, the theoretical foundations underlying these methods remain poorly understood. In this paper, we establish a rigorous theoretical framework for low-rank knowledge distillation in language models. We prove that under mild assumptions, low-rank projection preserves the optimization dynamics, yielding explicit convergence rates of $O(1/\sqrt{T})$. We derive generalization bounds that characterize the fundamental…

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Stochastic Gradient Optimization Techniques