Beyond Next-Token Alignment: Distilling Multimodal Large Language Models via Token Interactions
Lin Chen, Xiaoke Zhao, Kun Ding, Weiwei Feng, Changtao Miao, Zili Wang, Wenxuan Guo, Ying Wang, Kaiyuan Zheng, Bo Zhang, Zhe Li, and Shiming Xiang

TL;DR
This paper introduces Align-TI, a knowledge distillation framework for multimodal large language models that aligns dynamic token interactions rather than only next-token predictions, reporting a 2.6% relative improvement over vanilla KD.
Contribution
Align-TI is a novel KD method that models token interactions for better multimodal model distillation, outperforming existing approaches and setting new state-of-the-art results.
Findings
Achieves 2.6% relative improvement over vanilla KD
Distilled Align-TI-2B surpasses larger LLMs by 7.0%
Establishes a new state-of-the-art in MLLM distillation
Abstract
Multimodal Large Language Models (MLLMs) demonstrate impressive cross-modal capabilities, yet their substantial size poses significant deployment challenges. Knowledge distillation (KD) is a promising solution for compressing these models, but existing methods primarily rely on static next-token alignment, neglecting the dynamic token interactions, which embed essential capabilities for multimodal understanding and generation. To this end, we introduce Align-TI, a novel KD framework designed from the perspective of Token Interactions. Our approach is motivated by the insight that MLLMs rely on two primary interactions: vision-instruction token interactions to extract relevant visual information, and intra-response token interactions for coherent generation. Accordingly, Align-TI introduces two components: IVA enables the student model to imitate the teacher's instruction-relevant visual…
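For reference, the static next-token alignment that the abstract contrasts Align-TI against can be sketched as a per-position KL divergence between the teacher's and the student's next-token distributions. This is a minimal illustration of vanilla KD under stated assumptions, not the paper's implementation: the function names, the `temperature` parameter, and the toy logits are all hypothetical, and the IVA and TPA terms are not shown.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over one vocabulary-sized logit vector."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_z for x in scaled]


def next_token_kd_loss(teacher_logits, student_logits, temperature=1.0):
    """Mean forward KL(teacher || student) over response positions.

    Both arguments are lists of per-position logit vectors
    (shape: positions x vocabulary). Hypothetical helper for illustration.
    """
    if not teacher_logits or len(teacher_logits) != len(student_logits):
        raise ValueError("need the same non-zero number of positions")
    total = 0.0
    for t_row, s_row in zip(teacher_logits, student_logits):
        log_p = log_softmax(t_row, temperature)  # teacher distribution
        log_q = log_softmax(s_row, temperature)  # student distribution
        # KL(p || q) = sum_v p(v) * (log p(v) - log q(v))
        total += sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    return total / len(teacher_logits)


# Toy example: 2 response positions, vocabulary of 3 tokens (made-up numbers).
teacher = [[2.0, 0.5, -1.0], [0.1, 1.2, 0.3]]
student = [[1.5, 0.7, -0.5], [0.0, 0.9, 0.6]]
print(next_token_kd_loss(teacher, student))
```

Because each position is scored independently, an objective of this form does not see how tokens relate to one another, which is the gap the abstract says Align-TI targets.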
Peer Reviews
Decision · Submitted to ICLR 2026
The insight for IVA is strong. Recognizing that many visual tokens are redundant and that distillation should focus on instruction-salient regions is a valuable contribution. The paper provides a principled exploration of different transformer layers for visual-text attention distillation. The IRS metric is a good method for selecting the most relevant layer, rather than relying on manual design. The TPA component's objective is practical. Aligning the full vocabulary's transition matrix is in
- The two proposed modules, IVA and TPA, feel separate and not well-integrated. IVA is a VLM-specific technique for aligning visual-instruction interactions. TPA, however, is a general-purpose LLM distillation method for text generation. The paper does not convincingly unify them into a single coherent KD framework.
- The paper is difficult for a reader unfamiliar with the field to assess. The related work section is in the appendix, so the main text lacks a necessary discussion of previous wor
1. Extensive experiments demonstrate state-of-the-art performance, while thorough ablation studies validate the method's generalization across different MLLM architectures.
2. The insight is interesting. The work provides an interesting perspective by addressing the often-overlooked problem of insufficient visual token distillation in existing MLLM knowledge distillation methods.
3. The proposed IVA and TPA components are novel and effective, successfully aligning the student’s visual attention and
1. A potential direction for future work could be applying this distillation approach during the pre-training stage. It would be insightful to discuss how this might enhance the vision-language alignment.
2. The paper could be further strengthened by a more detailed analysis of the individual loss components. For instance, an ablation study quantifying the specific impact of each term ($L_{sft}, L_{kd}, L_{iva}, L_{tpa}$) would offer a clearer understanding of their respective contributions to the
1. The shift from static next-token alignment to token interaction modeling (IVA and TPA) provides a theoretically grounded and empirically validated innovation in knowledge distillation.
2. Extensive experiments and ablations across multiple benchmarks demonstrate robustness and scalability, including efficiency analysis and architectural generalization.
3. The proposed TPA component explicitly addresses train-test distribution discrepancies, a critical and underexplored issue in multimodal dis
1. Essentially, MLLMs already perform rich cross-modal and intra-modal token interactions through their attention layers. From this perspective, Align-TI appears to be an incremental refinement of existing attention mechanisms, i.e., reweighting visual focus (IVA) and regularizing output dynamics (TPA). It would be valuable to see further discussion or empirical evidence clarifying how Align-TI captures interactions beyond what standard attention already provides.
2. The evaluation primarily use
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Topic Modeling
