A Token is Worth over 1,000 Tokens: Efficient Knowledge Distillation through Low-Rank Clone

Jitai Hao; Qiang Huang; Hao Liu; Xinyan Xiao; Zhaochun Ren; Jun Yu

arXiv:2505.12781 · cs.CL · December 19, 2025

TL;DR

This paper introduces Low-Rank Clone (LRC), a novel pre-training method that significantly improves the efficiency of training small language models by effectively transferring knowledge from large teachers using low-rank projections and activation cloning.

Contribution

LRC combines low-rank weight compression and activation alignment to enhance knowledge distillation, reducing training data needs and computational costs for small language models.

Findings

01. LRC achieves comparable or better performance than state-of-the-art models.

02. LRC reduces training tokens from trillions to 20 billion.

03. LRC improves training efficiency by over 1,000 times.
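
The relation between the token counts and the efficiency figure can be checked with simple arithmetic. The sketch below assumes efficiency is measured as the ratio of a baseline's training tokens to the 20 billion tokens LRC uses; this page does not state how efficiency is defined. The helper name `token_efficiency_ratio` and the baseline budgets are hypothetical and not taken from the paper.

```python
def token_efficiency_ratio(baseline_tokens: float, lrc_tokens: float = 20e9) -> float:
    """Ratio of a baseline's training tokens to LRC's training tokens."""
    return baseline_tokens / lrc_tokens


# Hypothetical baseline budgets in tokens (illustrative, not from the paper).
for baseline in (1e12, 10e12, 20e12, 30e12):
    ratio = token_efficiency_ratio(baseline)
    print(f"{baseline / 1e12:>4.0f}T tokens -> {ratio:,.0f}x")
```

Under this reading, a ratio above 1,000 requires a baseline trained on more than 20 trillion tokens.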

Abstract

Training high-performing Small Language Models (SLMs) remains costly, even with knowledge distillation and pruning from larger teacher models. Existing work often faces three key challenges: (1) information loss from hard pruning, (2) inefficient alignment of representations, and (3) underutilization of informative activations, particularly from Feed-Forward Networks (FFNs). To address these challenges, we introduce Low-Rank Clone (LRC), an efficient pre-training method that constructs SLMs aspiring to behavioral equivalence with strong teacher models. LRC trains a set of low-rank projection matrices that jointly enable soft pruning by compressing teacher weights, and activation clone by aligning student activations, including FFN signals, with those of the teacher. This unified design maximizes knowledge transfer while removing the need for explicit alignment modules. Extensive…
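
The two roles the abstract gives the low-rank projection matrices can be illustrated with a toy single-layer sketch in NumPy. This is not the authors' implementation. It assumes one square teacher weight and a single projection `P` applied on both the input and output side, and it uses mean squared error as the alignment loss. All names and sizes (`student_weight`, `clone_loss`, `d_teacher`, `d_student`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy sizes: teacher width, student width, number of tokens.
d_teacher, d_student, n_tokens = 16, 4, 8

# Frozen teacher weight for one linear layer.
W_teacher = rng.standard_normal((d_teacher, d_teacher))

# Low-rank projection; in training this would be the learned parameter.
P = rng.standard_normal((d_teacher, d_student)) / np.sqrt(d_teacher)


def student_weight(W_teacher: np.ndarray, P: np.ndarray) -> np.ndarray:
    """Soft pruning: compress the teacher weight to student size via P."""
    return P.T @ W_teacher @ P


def clone_loss(x_teacher: np.ndarray, W_teacher: np.ndarray, P: np.ndarray) -> float:
    """Activation clone: MSE between student and projected teacher activations."""
    h_teacher = x_teacher @ W_teacher                      # (n, d_teacher)
    x_student = x_teacher @ P                              # (n, d_student)
    h_student = x_student @ student_weight(W_teacher, P)   # (n, d_student)
    target = h_teacher @ P                                 # same P, no extra alignment module
    return float(np.mean((h_student - target) ** 2))


x = rng.standard_normal((n_tokens, d_teacher))
loss = clone_loss(x, W_teacher, P)
print(student_weight(W_teacher, P).shape, loss)
```

In this sketch the same matrix `P` both generates the student weight and maps teacher activations into the student's space, which is one way to read the claim that no explicit alignment module is needed. Training would minimize `clone_loss` with respect to `P`, typically alongside a language-modeling objective.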

Code & Models

Repositories

currentf/lowrankclone (PyTorch, official)

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Advanced Neural Network Applications

Methods: Sparse Evolutionary Training · Pruning · Knowledge Distillation