DWA-KD: Dual-Space Weighting and Time-Warped Alignment for Cross-Tokenizer Knowledge Distillation
Duc Trung Vu, Pham Khanh Chi, Dat Phi Van, Linh Ngo Van, Sang Dinh, Trung Le

TL;DR
DWA-KD is a cross-tokenizer knowledge distillation framework that improves token-level and sequence-level alignment through dual-space entropy-based weighting and Soft-DTW, and is reported to outperform prior KD methods for LLM compression.
Contribution
The paper proposes DWA-KD, combining dual-space entropy-based weighting with Soft-DTW alignment to address limitations in cross-tokenizer knowledge distillation.
Findings
Outperforms state-of-the-art KD methods on NLP benchmarks.
Dual-space weighting improves focus on informative tokens.
Soft-DTW alignment enhances lexical and semantic sequence matching.
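To make the Soft-DTW finding concrete, here is a minimal sketch of the standard Soft-DTW recursion in plain Python. It is not the paper's implementation: the function names, the `gamma` default, and the squared Euclidean cost between per-token vectors are assumptions chosen for illustration, whereas the paper describes a cost that draws on both lexical and semantic information.

```python
import math


def soft_min(values, gamma):
    """Smooth minimum: -gamma * log(sum(exp(-v / gamma))), computed stably."""
    finite = [v for v in values if v != math.inf]
    if not finite:
        return math.inf
    lowest = min(finite)
    return lowest - gamma * math.log(
        sum(math.exp(-(v - lowest) / gamma) for v in finite)
    )


def sq_dist(u, v):
    """Squared Euclidean distance (assumed stand-in for the paper's cost)."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def soft_dtw(xs, ys, gamma=1.0):
    """Soft-DTW value between two sequences of vectors of possibly different length."""
    n, m = len(xs), len(ys)
    # acc[i][j] holds the soft alignment cost of xs[:i] against ys[:j].
    acc = [[math.inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = sq_dist(xs[i - 1], ys[j - 1])
            acc[i][j] = cost + soft_min(
                [acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1]], gamma
            )
    return acc[n][m]
```

Because the smooth minimum is differentiable, the resulting value can serve as a training loss between teacher and student sequences that were tokenized into different lengths. As `gamma` approaches zero the value approaches classic DTW.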
Abstract
Knowledge Distillation (KD) has emerged as a crucial technique for compressing Large Language Models (LLMs). Although existing cross-tokenizer KD methods have made notable progress, their effectiveness remains constrained by suboptimal alignment across sequence and vocabulary levels. To address these limitations, we introduce Dual-Space Weighting and Time-Warped Alignment (DWA-KD), a novel cross-tokenizer distillation framework that enhances token-wise distillation through dual-space entropy-based weighting and achieves precise sequence-level alignment by leveraging both lexical and semantic information. At the token level, DWA-KD maps teacher representations into the student space and vice versa, performing dual-space KD via Kullback-Leibler divergence (KL). The process is modulated by dual-space weights that up-weight tokens where the student is uncertain and the teacher is confident,…
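The token-level weighting described in the abstract can be illustrated with a small sketch. The abstract does not give the exact weight formula, so the product form below (normalized student entropy times one minus normalized teacher entropy) is a hypothetical choice, as are all function names. The sketch also assumes the teacher distribution has already been projected into the student's vocabulary space and that the vocabulary has more than one entry; the dual-space variant would apply the same computation in the teacher space as well.

```python
import math


def entropy(p):
    """Shannon entropy of a probability vector, in nats."""
    return -sum(x * math.log(x) for x in p if x > 0.0)


def kl(p, q, eps=1e-12):
    """KL(p || q), with q clamped away from zero."""
    return sum(x * math.log(x / max(y, eps)) for x, y in zip(p, q) if x > 0.0)


def token_weight(teacher_p, student_p):
    """Hypothetical weight: large when the student is uncertain and the teacher is confident."""
    student_uncertainty = entropy(student_p) / math.log(len(student_p))
    teacher_confidence = 1.0 - entropy(teacher_p) / math.log(len(teacher_p))
    return student_uncertainty * teacher_confidence


def weighted_kd_loss(teacher_seq, student_seq):
    """Weight-normalized sum of per-token KL terms over aligned positions."""
    weights = [token_weight(t, s) for t, s in zip(teacher_seq, student_seq)]
    total = sum(weights)
    if total == 0.0:
        return 0.0
    weighted = sum(
        w * kl(t, s) for w, t, s in zip(weights, teacher_seq, student_seq)
    )
    return weighted / total
```

With this weight, a position where the teacher is near one-hot and the student is near uniform contributes the most, while positions where the teacher itself is uncertain are down-weighted.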
Taxonomy
Topics: Topic Modeling · Time Series Analysis and Forecasting · Advanced Graph Neural Networks
