TuneShift-KD: Knowledge Distillation and Transfer for Fine-tuned Models
Yushi Guan, Jeanine Ohene-Agyei, Daniel Kwan, Jean Sebastien Dandurand, Yifei Zhang, Nandita Vijaykumar

TL;DR
TuneShift-KD is an automated method for transferring specialized knowledge from a fine-tuned model to a new model using minimal data, leveraging perplexity differences without needing training datasets.
Contribution
It introduces a novel, dataset-free distillation approach that identifies and transfers specialized knowledge through synthetic prompts based on perplexity differences.
Findings
Target models trained with TuneShift-KD achieve higher accuracy than prior methods such as Trans-LoRA.
The approach requires only a few representative prompts and no training datasets.
TuneShift-KD effectively transfers knowledge across different models.
Abstract
To embed domain-specific or specialized knowledge into pre-trained foundation models, fine-tuning using techniques such as parameter-efficient fine-tuning (e.g., LoRA) is a common practice. However, as new LLM architectures and pre-trained models emerge, transferring this specialized knowledge to newer models becomes an important task. In many scenarios, the original specialized data may be unavailable due to privacy or commercial restrictions, necessitating distillation and transfer of this specialized knowledge from the fine-tuned base model to a different pre-trained model. We present TuneShift-KD, a novel approach that automatically distills specialized knowledge from a fine-tuned model to a target model using only a few examples representative of the specialized information. Our key insight is that specialized knowledge can be identified through perplexity differences between base…
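The perplexity-difference idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the use of a plain difference (rather than, say, a ratio), and the `min_gap` threshold are all assumptions made for illustration. Perplexity here is the standard quantity, the exponential of the negative mean per-token log-probability, and the per-token log-probabilities are assumed to have been obtained beforehand from the base and fine-tuned models.

```python
import math
from typing import List, Sequence, Tuple


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Standard perplexity: exp of the negative mean natural-log probability."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def select_prompts(
    prompts: Sequence[str],
    base_logprobs: Sequence[Sequence[float]],
    tuned_logprobs: Sequence[Sequence[float]],
    min_gap: float = 1.0,  # hypothetical threshold, not taken from the paper
) -> List[Tuple[str, float]]:
    """Keep prompts where the base model is much less confident than the
    fine-tuned model, i.e. PPL_base - PPL_tuned >= min_gap.

    Returns (prompt, gap) pairs sorted by decreasing gap.
    """
    selected = []
    for prompt, base_lp, tuned_lp in zip(prompts, base_logprobs, tuned_logprobs):
        gap = perplexity(base_lp) - perplexity(tuned_lp)
        if gap >= min_gap:
            selected.append((prompt, gap))
    selected.sort(key=lambda item: item[1], reverse=True)
    return selected


# Toy log-probabilities (made up for illustration only).
prompts = ["domain question", "generic question"]
base_lp = [[-2.0, -2.0], [-0.5, -0.5]]   # base model struggles on the first prompt
tuned_lp = [[-0.5, -0.5], [-0.5, -0.5]]  # fine-tuned model is confident on both
kept = select_prompts(prompts, base_lp, tuned_lp)
print([p for p, _ in kept])
```

In this toy run only the first prompt survives the filter, since the two models agree on the second. In practice the log-probabilities would come from scoring the same text with both models, and the threshold would need tuning for the task.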
Peer Reviews
Decision: Submitted to ICLR 2026
The paper addresses a practical and emerging challenge, motivated by real deployment constraints (privacy, cloud-hosted models, hardware vendors): transferring LoRA-fine-tuned expertise when the fine-tuning data are unavailable. The perplexity-difference criterion is intuitive, theoretically grounded (entropy and KL analysis), and easy to implement. Broad applicability and strong empirical results.
While elegant, the main novelty, using PPL difference as a filter, is incremental compared to prior perplexity-based data filtering. Results are limited to small- and mid-scale models (≤13B). It is unclear whether the method scales or remains stable for larger modern architectures (e.g., 70B). Reported improvements (1–7 pp) are modest and could lie within the noise range of evaluation harnesses, yet statistical significance is not reported. I think the tasks (GSM8K, MBPP, BBH) are general-domain,
- The perplexity difference criterion is intuitive. Prompts where the fine-tuned models are confident but base models struggle can capture specialized knowledge.
- Unlike Trans-LoRA, TuneShift-KD requires no discriminator, which makes the standard fine-tuning process simpler and more practical.
- The method shows accuracy gains over Trans-LoRA in GSM8K, MBPP, and BBH.
- The model is highly automatic, can transfer information across different architectures, and works without the exact base model
- As the authors acknowledge, perplexity/likelihood-based selection of training samples is a well-known technique, known already 15 years ago (see e.g., [1]). Even in an LLM-based distillation context, log-likelihood/entropy-based methods have been used recently (see e.g., [2, 3]). This work is clearly part of the same family of data selection methods, limiting the novelty.
- My interpretation of the results is that the main performance driver compared to Trans-LoRA is the diversity of the promp
1. The authors address a genuine and relevant challenge in model transfer and knowledge distillation: how to extract specialized knowledge from an existing fine-tuned model when the original fine-tuning data are unavailable. This problem has practical significance in real-world scenarios, particularly where data access is limited or restricted by compliance constraints.
2. TuneShift-KD avoids the need for additional discriminators or manual labeling by relying on perplexity-based filtering and
1. Lack of theoretical grounding: The core assumption, that perplexity differences can effectively represent knowledge differences between models, has no solid theoretical justification. The authors provide only heuristic reasoning without statistical significance analysis or ablation comparing alternative indicators such as KL divergence or output diversity.
2. Overly heuristic and weakly interpretable method: The key filtering mechanism of TuneShift-KD depends on an empirically chosen threshold
Taxonomy
Topics: Advanced Graph Neural Networks · Domain Adaptation and Few-Shot Learning · Topic Modeling
