ToDi: Token-wise Distillation via Fine-Grained Divergence Control

Seongryong Jung; Suwan Yoon; DongGeon Kim; Hwanhee Lee

arXiv:2505.16297·cs.CL·September 30, 2025

TL;DR

ToDi introduces a token-wise knowledge distillation method that adaptively combines divergence measures to improve the training of smaller language models, leading to better performance on instruction-following tasks.

Contribution

This paper proposes ToDi, a token-wise distillation approach that adaptively balances Forward KL and Reverse KL for each token instead of applying one uniform divergence across the entire vocabulary.

Findings

01. ToDi outperforms recent distillation baselines on instruction-following benchmarks.

02. Token-wise divergence weighting improves distribution alignment.

03. Extensive ablations confirm ToDi's effectiveness and efficiency.

Abstract

Large language models (LLMs) offer impressive performance but are impractical for resource-constrained deployment due to high latency and energy consumption. Knowledge distillation (KD) addresses this by transferring knowledge from a large teacher to a smaller student model. However, conventional KD, notably approaches like Forward KL (FKL) and Reverse KL (RKL), applies a uniform divergence loss across the entire vocabulary, neglecting token-level prediction discrepancies. By investigating these representative divergences via gradient analysis, we reveal that FKL boosts underestimated tokens, while RKL suppresses overestimated ones, showing their complementary roles. Based on this observation, we propose Token-wise Distillation (ToDi), a novel method that adaptively combines FKL and RKL per token using a sigmoid-based weighting function derived from the teacher-student probability…
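
The per-token combination described in the abstract can be illustrated with a small sketch. The abstract is truncated before it states the exact weighting function, so the form used here is an assumption: the weight for each vocabulary token is taken to be a sigmoid of the teacher-student log probability ratio, scaled by a hypothetical parameter `beta`. The function names (`token_wise_divergence`, `sigmoid`) are illustrative and not taken from the paper. The intent is only to show how a token the student underestimates leans toward the FKL term, while a token it overestimates leans toward the RKL term.

```python
import math


def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def token_wise_divergence(teacher_probs, student_probs, beta=1.0, eps=1e-12):
    """Blend forward and reverse KL terms per vocabulary token.

    ASSUMPTION: the weight alpha_i = sigmoid(beta * log(p_i / q_i)) is a guess
    at the "sigmoid-based weighting function" mentioned in the abstract; the
    paper's actual definition may differ. `beta` is a hypothetical parameter.
    """
    loss = 0.0
    weights = []
    for p, q in zip(teacher_probs, student_probs):
        p = max(p, eps)  # clamp to avoid log(0)
        q = max(q, eps)
        log_ratio = math.log(p) - math.log(q)
        alpha = sigmoid(beta * log_ratio)  # > 0.5 when the student underestimates
        fkl_term = p * log_ratio           # forward KL contribution: p * log(p / q)
        rkl_term = q * (-log_ratio)        # reverse KL contribution: q * log(q / p)
        loss += alpha * fkl_term + (1.0 - alpha) * rkl_term
        weights.append(alpha)
    return loss, weights


# Toy next-token distributions over a three-token vocabulary.
teacher = [0.7, 0.2, 0.1]
student = [0.2, 0.7, 0.1]
loss, weights = token_wise_divergence(teacher, student)
print(loss, weights)
```

In this toy case the first token (underestimated by the student) receives a weight above 0.5, the second (overestimated) a weight below 0.5, and the third (matched) exactly 0.5. A real training loop would compute this over model logits at every sequence position.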

Taxonomy

Topics: Advanced Control Systems Optimization · Process Optimization and Integration

Methods: Knowledge Distillation