DWA-KD: Dual-Space Weighting and Time-Warped Alignment for Cross-Tokenizer Knowledge Distillation

Duc Trung Vu; Pham Khanh Chi; Dat Phi Van; Linh Ngo Van; Sang Dinh; Trung Le

arXiv:2602.21669·cs.CL·February 26, 2026

TL;DR

DWA-KD is a cross-tokenizer knowledge distillation framework for compressing LLMs. It strengthens token-wise distillation with dual-space entropy-based weighting and aligns teacher and student sequences with Soft-DTW.

Contribution

The paper proposes DWA-KD, which combines dual-space entropy-based weighting with Soft-DTW alignment to address the suboptimal sequence-level and vocabulary-level alignment of existing cross-tokenizer KD methods.

Findings

01 Outperforms state-of-the-art KD methods on NLP benchmarks.

02 Dual-space weighting improves focus on informative tokens.

03 Soft-DTW alignment enhances lexical and semantic sequence matching.
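
As a rough illustration of the Soft-DTW alignment named in the findings, the sketch below implements the standard Soft-DTW recurrence over a pairwise cost matrix between two sequences of different lengths. It is not the paper's implementation: the squared Euclidean cost, the function names, and the toy inputs are all assumptions made for this example, and the listing does not say how DWA-KD builds its lexical and semantic costs.

```python
import math

def soft_min(values, gamma):
    """Smooth minimum: -gamma * log(sum(exp(-v / gamma))), computed stably."""
    finite = [v for v in values if v != math.inf]
    if not finite:
        return math.inf
    m = min(finite)
    return m - gamma * math.log(sum(math.exp(-(v - m) / gamma) for v in finite))

def pairwise_sq_dist(xs, ys):
    """Squared Euclidean cost between every pair of vectors (assumed cost)."""
    return [[sum((a - b) ** 2 for a, b in zip(x, y)) for y in ys] for x in xs]

def soft_dtw(cost, gamma=1.0):
    """Soft-DTW value for an n x m cost matrix (standard recurrence)."""
    n, m = len(cost), len(cost[0])
    # R[i][j] holds the soft alignment cost of the first i and first j steps.
    R = [[math.inf] * (m + 1) for _ in range(n + 1)]
    R[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            prev = [R[i - 1][j], R[i][j - 1], R[i - 1][j - 1]]
            R[i][j] = cost[i - 1][j - 1] + soft_min(prev, gamma)
    return R[n][m]

# Hypothetical toy sequences standing in for teacher and student token states.
teacher_states = [[0.0], [1.0], [2.0]]
student_states = [[0.0], [2.0]]
alignment_cost = soft_dtw(pairwise_sq_dist(teacher_states, student_states), gamma=0.1)
```

Because the hard minimum is replaced by a smooth one, the result is differentiable with respect to the costs, which is what makes Soft-DTW usable as a training loss. As gamma approaches zero the value approaches the classic DTW cost.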

Abstract

Knowledge Distillation (KD) has emerged as a crucial technique for compressing Large Language Models (LLMs). Although existing cross-tokenizer KD methods have made notable progress, their effectiveness remains constrained by suboptimal alignment across sequence and vocabulary levels. To address these limitations, we introduce Dual-Space Weighting and Time-Warped Alignment (DWA-KD), a novel cross-tokenizer distillation framework that enhances token-wise distillation through dual-space entropy-based weighting and achieves precise sequence-level alignment by leveraging both lexical and semantic information. At the token level, DWA-KD maps teacher representations into the student space and vice versa, performing dual-space KD via Kullback-Leibler divergence (KL). The process is modulated by dual-space weights that up-weight tokens where the student is uncertain and the teacher is confident,…
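
The idea of up-weighting tokens where the student is uncertain and the teacher is confident can be sketched as follows. The abstract is truncated and does not give the weighting formula, so everything here is an assumption for illustration: the weight is taken as normalized student entropy times one minus normalized teacher entropy, the KL direction is teacher to student, and both distributions are assumed to have already been mapped into a shared space.

```python
import math

def entropy(p):
    """Shannon entropy in nats of a probability vector."""
    return -sum(x * math.log(x) for x in p if x > 0.0)

def kl(p, q, eps=1e-12):
    """KL(p || q), with q clipped away from zero for stability."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0.0)

def token_weights(teacher_probs, student_probs):
    """Hypothetical per-token weights: high when the student is uncertain
    and the teacher is confident. Not the paper's formula."""
    raw = []
    for p_t, p_s in zip(teacher_probs, student_probs):
        h_t = entropy(p_t) / math.log(len(p_t))  # normalized to [0, 1]
        h_s = entropy(p_s) / math.log(len(p_s))
        raw.append(max(0.0, h_s * (1.0 - h_t)))
    total = sum(raw)
    if total == 0.0:
        # Fall back to uniform weights when no token stands out.
        return [1.0 / len(raw)] * len(raw)
    return [r / total for r in raw]

def weighted_kd_loss(teacher_probs, student_probs):
    """Weighted sum of per-token KL(teacher || student)."""
    weights = token_weights(teacher_probs, student_probs)
    return sum(w * kl(p_t, p_s)
               for w, p_t, p_s in zip(weights, teacher_probs, student_probs))

# Toy example: token 0 has a confident teacher and an uncertain student,
# token 1 has the reverse.
uniform = [1.0 / 3.0] * 3
peaked = [0.98, 0.01, 0.01]
weights = token_weights([peaked, uniform], [uniform, peaked])
loss = weighted_kd_loss([peaked, uniform], [uniform, peaked])
```

In this toy case the first token receives nearly all of the weight, so the loss is dominated by the position where the student has the most to learn from the teacher.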

Taxonomy

Topics: Topic Modeling · Time Series Analysis and Forecasting · Advanced Graph Neural Networks