MixSD: Mixed Contextual Self-Distillation for Knowledge Injection
Jiarui Liu, Lechen Zhang, Yongjin Yang, Yinghui He, Yingheng Wang, Weihao Xuan, Zhijing Jin, Mona Diab

TL;DR
MixSD injects knowledge into a language model by building supervision dynamically: it mixes tokens from two conditionals of the base model itself, one that sees the injected fact in context and one that does not. This preserves pretrained capabilities and reduces forgetting.
Contribution
It introduces a distribution-aligned supervision technique that improves knowledge retention without external teachers or fixed targets.
Findings
MixSD achieves a better memorization-retention trade-off than standard fine-tuning.
It retains up to 100% of the base model's capabilities while maintaining high training accuracy.
MixSD produces supervision targets closer to the model's native distribution, reducing harmful parameter updates.
Abstract
Supervised fine-tuning (SFT) is widely used to inject new knowledge into language models, but it often degrades pretrained capabilities such as reasoning and general-domain performance. We argue this forgetting arises because fine-tuning targets from humans or external systems diverge from the model's autoregressive distribution, forcing the optimizer to imitate low-probability token sequences. To address this problem, we propose MixSD, a simple external-teacher-free method for distribution-aligned knowledge injection. Instead of training on fixed targets, MixSD constructs supervision dynamically by mixing tokens from two conditionals of the base model itself: an expert conditional that observes the injected fact in context, and a naive conditional that reflects the model's original prior. The resulting supervision sequences preserve the factual learning signal while remaining…
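The mixing step described in the abstract can be illustrated with a small, self-contained sketch. This is not the authors' implementation: the abstract does not specify the mixing rule, so the sketch assumes a per-token coin flip with a hypothetical probability `alpha` of drawing from the expert conditional. The names `mix_supervision`, `toy_expert`, and `toy_naive` are illustrative, and the two toy functions are hand-written stand-ins for the base model with and without the injected fact in context.

```python
import random

EOS = "<eos>"

def toy_expert(prefix):
    # Stand-in for the base model conditioned on the injected fact (assumption).
    if prefix and prefix[-1] == "is":
        return {"Canberra": 0.9, "Sydney": 0.1}
    return {EOS: 1.0}

def toy_naive(prefix):
    # Stand-in for the base model's original prior, without the fact (assumption).
    if prefix and prefix[-1] == "is":
        return {"Sydney": 0.7, "Canberra": 0.3}
    return {EOS: 1.0}

def mix_supervision(expert_next, naive_next, prompt, max_len, alpha=0.5, seed=0):
    """Sample one supervision sequence, choosing a conditional per token.

    alpha is the assumed probability of sampling a token from the expert
    conditional; the actual MixSD mixing rule may differ.
    """
    rng = random.Random(seed)
    tokens, sources = [], []
    for _ in range(max_len):
        use_expert = rng.random() < alpha
        next_dist = expert_next if use_expert else naive_next
        dist = next_dist(prompt + tokens)      # both conditionals see the mixed prefix
        vocab, probs = zip(*dist.items())
        tok = rng.choices(vocab, weights=probs, k=1)[0]
        tokens.append(tok)
        sources.append("expert" if use_expert else "naive")
        if tok == EOS:
            break
    return tokens, sources

prompt = ["The", "capital", "of", "Australia", "is"]
target, origin = mix_supervision(toy_expert, toy_naive, prompt, max_len=4)
print(list(zip(target, origin)))
```

In a real setting the two callables would wrap the same base model, queried with and without the fact prepended to the prompt, and the sampled sequence would serve as the fine-tuning target in place of a fixed human-written answer.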
