MixSD: Mixed Contextual Self-Distillation for Knowledge Injection

Jiarui Liu; Lechen Zhang; Yongjin Yang; Yinghui He; Yingheng Wang; Weihao Xuan; Zhijing Jin; Mona Diab

arXiv:2605.16865·cs.CL·May 22, 2026

TL;DR

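MixSD injects new knowledge into a language model by building supervision sequences dynamically from tokens the base model itself produces, mixing an in-context "expert" conditional with the model's original prior. This keeps the training targets close to the model's own distribution and reduces forgetting of pretrained capabilities.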

Contribution

MixSD introduces a distribution-aligned supervision technique that improves knowledge retention without relying on an external teacher or fixed targets.

Findings

01 MixSD achieves a better memorization-retention trade-off than standard fine-tuning.

02 MixSD retains up to 100% of the base model's capabilities while maintaining high training accuracy.

03 MixSD produces supervision targets closer to the model's native distribution, reducing harmful parameter updates.

Abstract

Supervised fine-tuning (SFT) is widely used to inject new knowledge into language models, but it often degrades pretrained capabilities such as reasoning and general-domain performance. We argue this forgetting arises because fine-tuning targets from humans or external systems diverge from the model's autoregressive distribution, forcing the optimizer to imitate low-probability token sequences. To address this problem, we propose MixSD, a simple external-teacher-free method for distribution-aligned knowledge injection. Instead of training on fixed targets, MixSD constructs supervision dynamically by mixing tokens from two conditionals of the base model itself: an expert conditional that observes the injected fact in context, and a naive conditional that reflects the model's original prior. The resulting supervision sequences preserve the factual learning signal while remaining…
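
The mixing idea described in the abstract can be illustrated with a toy sketch. The abstract does not state the mixing rule, so the per-token coin flip controlled by `expert_prob` is an assumption, as are the names `mix_supervision`, `expert`, and `naive`. The two conditionals here are stand-in functions over a tiny vocabulary, not a real language model.

```python
import random
from typing import Callable, Dict, List, Sequence, Tuple

# A next-token distribution: token -> probability.
Dist = Dict[str, float]
# A conditional maps the tokens generated so far to a next-token distribution.
Conditional = Callable[[Sequence[str]], Dist]


def sample(dist: Dist, rng: random.Random) -> str:
    """Draw one token from a next-token distribution."""
    tokens = list(dist)
    weights = [dist[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


def mix_supervision(
    expert: Conditional,
    naive: Conditional,
    length: int,
    expert_prob: float,
    rng: random.Random,
) -> Tuple[List[str], List[str]]:
    """Build one supervision sequence by mixing two conditionals.

    ASSUMPTION: at each position a biased coin picks which conditional
    supplies the next token. The paper's actual mixing rule may differ.
    Both conditionals see the same shared prefix of mixed tokens.
    """
    prefix: List[str] = []
    sources: List[str] = []
    for _ in range(length):
        use_expert = rng.random() < expert_prob
        conditional = expert if use_expert else naive
        prefix.append(sample(conditional(prefix), rng))
        sources.append("expert" if use_expert else "naive")
    return prefix, sources


# Hypothetical toy conditionals: the "expert" has the injected fact in
# context and favors it; the "naive" one reflects the original prior.
def toy_expert(prefix: Sequence[str]) -> Dist:
    return {"Paris": 0.9, "a": 0.1}


def toy_naive(prefix: Sequence[str]) -> Dist:
    return {"a": 0.6, "city": 0.4}


if __name__ == "__main__":
    tokens, sources = mix_supervision(
        toy_expert, toy_naive, length=8, expert_prob=0.5, rng=random.Random(0)
    )
    print(list(zip(tokens, sources)))
```

Under this assumed rule, `expert_prob=1.0` draws every token from the expert conditional and `expert_prob=0.0` reproduces samples from the model's original prior; intermediate values trade off the factual signal against staying close to the prior.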
