MTA: Multi-Granular Trajectory Alignment for Large Language Model Distillation
Pham Khanh Chi, Quoc Phong Dao, Thuat Nguyen, Linh Ngo Van, Trung Le, Thanh Hong Nguyen

TL;DR
This paper introduces Multi-Granular Trajectory Alignment (MTA), a novel method for LLM knowledge distillation that aligns representations across layers and semantic levels to improve transfer quality.
Contribution
The authors present MTA as the first approach to align teacher and student models along their entire layer-wise transformation trajectories at multiple semantic granularities.
Findings
MTA outperforms state-of-the-art baselines on standard benchmarks.
Layer-wise and semantic-level alignment improves knowledge transfer.
Ablation studies confirm the effectiveness of each component.
Abstract
Knowledge distillation is a key technique for compressing large language models (LLMs), but most existing methods align representations at fixed layers or token-level outputs, ignoring how representations evolve across depth. As a result, the student is only weakly guided to capture the teacher's internal relational structure during distillation, which limits knowledge transfer. To address this limitation, we propose Multi-Granular Trajectory Alignment (MTA), a framework that aligns teacher and student representations along their layer-wise transformation trajectory. MTA adopts a layer-adaptive strategy: lower layers are aligned at the word level to preserve lexical information, while higher layers operate on phrase-level spans (e.g., noun and verb phrases) to capture compositional semantics. We instantiate this idea through a Dynamic Structural Alignment loss that matches the relative…
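The layer-adaptive idea in the abstract can be illustrated with a small sketch. The abstract is cut off before it says what the Dynamic Structural Alignment loss matches, so the code below does not reproduce that loss. It is a generic relational-alignment stand-in under stated assumptions: units are mean-pooled over spans, structure is a pairwise cosine-similarity matrix, the gap is a mean squared error, and the switch from word-level to phrase-level spans happens at a fixed depth fraction. All function names and the `switch` parameter are hypothetical.

```python
import math

def pool_spans(hidden, spans):
    """Mean-pool token vectors over half-open [start, end) spans (assumed pooling)."""
    pooled = []
    for start, end in spans:
        rows = hidden[start:end]
        pooled.append([sum(col) / len(rows) for col in zip(*rows)])
    return pooled

def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def similarity_matrix(units):
    """Pairwise cosine similarities between pooled units."""
    return [[cosine(u, v) for v in units] for u in units]

def structural_gap(teacher_hidden, student_hidden, spans):
    """Mean squared difference between teacher and student similarity matrices.

    Assumption: this stands in for a relational loss; it is not the paper's loss.
    """
    t = similarity_matrix(pool_spans(teacher_hidden, spans))
    s = similarity_matrix(pool_spans(student_hidden, spans))
    n = len(spans)
    return sum((t[i][j] - s[i][j]) ** 2 for i in range(n) for j in range(n)) / (n * n)

def spans_for_layer(layer, num_layers, word_spans, phrase_spans, switch=0.5):
    """Pick word-level spans for lower layers, phrase-level spans for higher ones.

    Assumption: a hard switch at a fixed fraction of the depth (hypothetical).
    """
    return word_spans if layer < switch * num_layers else phrase_spans

# Toy example: 4 tokens, teacher width 3, student width 2.
teacher = [[1.0, 0.0, 1.0], [0.5, 1.0, 0.0], [0.0, 1.0, 1.0], [1.0, 1.0, 0.5]]
student = [[1.0, 0.2], [0.3, 1.0], [0.1, 0.9], [0.8, 0.7]]
word_spans = [(0, 1), (1, 2), (2, 3), (3, 4)]
phrase_spans = [(0, 2), (2, 4)]

for layer in (0, 11):
    spans = spans_for_layer(layer, 12, word_spans, phrase_spans)
    print(layer, len(spans), structural_gap(teacher, student, spans))
```

Because the comparison is made between span-by-span similarity matrices rather than raw hidden states, this sketch needs no projection between teacher and student widths. Whether MTA handles the width mismatch the same way is not stated in the text available here.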
