Turning the TIDE: Cross-Architecture Distillation for Diffusion Large Language Models
Gongbo Zhang, Wen Wang, Ye Tian, Li Yuan

TL;DR
This paper introduces TIDE, a novel framework for cross-architecture distillation of diffusion large language models, enabling knowledge transfer between models with different architectures, attention mechanisms, and tokenizers.
Contribution
TIDE is the first framework for cross-architecture distillation of dLLMs, incorporating three modular components to improve knowledge transfer and model performance.
Findings
Distilling 8B dense and 16B MoE teachers into a 0.6B student improves benchmark scores.
The approach yields a 48.78 HumanEval score, surpassing the baseline.
TIDE outperforms the baseline by an average of 1.53 points across eight benchmarks.
Abstract
Diffusion large language models (dLLMs) offer parallel decoding and bidirectional context, but state-of-the-art dLLMs require billions of parameters for competitive performance. While existing distillation methods for dLLMs reduce inference steps within a single architecture, none address cross-architecture knowledge transfer, in which the teacher and student differ in architecture, attention mechanism, and tokenizer. We present TIDE, the first framework for cross-architecture dLLM distillation, comprising three modular components: (1) TIDAL, which jointly modulates distillation strength across training progress and diffusion timestep to account for the teacher's noise-dependent reliability; (2) CompDemo, which enriches the teacher's context via complementary mask splitting to improve predictions under heavy masking; and (3) Reverse CALM, a cross-tokenizer objective that inverts…
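To make the complementary mask splitting idea behind CompDemo more concrete, here is a minimal Python sketch of one possible reading. The abstract does not say how the split is drawn or how the teacher's outputs are combined, so everything here is an assumption for illustration: the function names `complementary_split` and `teacher_views`, the `MASK` id, the random even split, and the choice to reveal ground-truth tokens for the half that is not being predicted.

```python
import random

MASK = -1  # hypothetical mask-token id, chosen only for this sketch


def complementary_split(masked_positions, seed=0):
    """Partition masked positions into two disjoint subsets that together cover all of them.

    Assumption: a random, roughly even split. The paper may use a different rule.
    """
    rng = random.Random(seed)
    shuffled = list(masked_positions)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return sorted(shuffled[:half]), sorted(shuffled[half:])


def teacher_views(clean_tokens, masked_positions, seed=0):
    """Build two teacher inputs from one heavily masked training example.

    Each view keeps one subset masked (the positions the teacher must predict)
    and reveals the clean tokens at the other subset, so the teacher always
    sees more context than the student does.
    """
    part_a, part_b = complementary_split(masked_positions, seed)
    view_a = list(clean_tokens)
    view_b = list(clean_tokens)
    for i in part_a:
        view_a[i] = MASK  # view A: predict part_a, with part_b revealed
    for i in part_b:
        view_b[i] = MASK  # view B: predict part_b, with part_a revealed
    return (view_a, part_a), (view_b, part_b)
```

Under these assumptions, the teacher would be run once per view, and its predictions at `part_a` and `part_b` would together cover every position the student has to fill in.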
Code & Models
- 🤗 TIDE-dllm/distill-WeDLM-TIDE_Shared (model, 532 downloads)
- 🤗 TIDE-dllm/distill-LLaDA2-TIDE_Shared (model, 522 downloads)
- 🤗 TIDE-dllm/distill-LLaDA2-TIDE_Cross (model, 547 downloads)
- 🤗 TIDE-dllm/distill-LLaDA2-CALM (model, 531 downloads)
- 🤗 TIDE-dllm/distill-WeDLM-KL (model, 544 downloads)
- 🤗 TIDE-dllm/distill-WeDLM-TIDE_Cross (model, 541 downloads)
