Minifinetuning: Low-Data Generation Domain Adaptation through Corrective Self-Distillation
Peter Belcak, Greg Heinrich, Jan Kautz, Pavlo Molchanov

TL;DR
Minifinetuning (MFT) is a novel low-data domain adaptation method for language models that significantly reduces overfitting and degeneralization without requiring pre-training data, outperforming standard finetuning techniques.
Contribution
The paper introduces minifinetuning, a new approach that enhances domain adaptation in low-data scenarios by mitigating overfitting effects through corrective self-distillation.
Findings
MFT achieves 2-10x better specialization-to-degeneralization ratios.
MFT is robust with as few as 500 samples in the target domain.
MFT outperforms parameter-efficient finetuning methods.
Abstract
Finetuning language models for a new domain inevitably leads to the deterioration of their general performance. This becomes more pronounced the more limited the finetuning data resource. We introduce minifinetuning (MFT), a method for language model domain adaptation that considerably reduces the effects of overfitting-induced degeneralization in low-data settings, and does so in the absence of any pre-training data for replay. MFT demonstrates 2-10x more favourable specialization-to-degeneralization ratios than standard finetuning across a wide range of models and domains, and exhibits an intrinsic robustness to overfitting when data in the new domain is scarce, down to as few as 500 samples. Employing corrective self-distillation that is individualized on the sample level, MFT outperforms parameter-efficient finetuning methods, demonstrates replay-like degeneralization…
Peer Reviews
Decision: Submitted to ICLR 2025
Originality
--------------
1. To my knowledge, this is the first method specifically designed to fine-tune off-the-shelf pretrained LLMs on a different domain, without having access to their training distribution.
2. The specific formulas derived for token-wise adaptation, including the special case of the original model's maximum prediction being already right, are not obvious.
Quality
----------
1. The various parts of the algorithm are well justified.
2. Main experiments are comprehensive, acro
Quality
----------
I understand why Replay had to be included as a baseline, despite its advantage, but I'm wondering if a variant of Replay using only synthetic data sampled from the pre-trained model could have made the point stronger that the actual training data is necessary, vs. Replay acting as a regularizer.
Minor points
-----------------
1. l. 222: Basline -> Baseline
2. l. 527: the citation for the Gemma paper shows up as "Team et al.", which is not really informative. Maybe update the
1. The paper introduces interesting results with a new, simple methodology for fine-tuning with minimal perturbation of the existing distribution learned by language models. The proposed method is intuitive, and the results (albeit incomplete) seem to indicate a promising balance between retaining pre-trained representations and learning appropriate representations on the fine-tuning data.
2. The core idea of the soft KD by uniformly moving density mass to a specific location of interest to better m
1. The coverage of techniques that alleviate catastrophic forgetting needs to be more extensive: the paper largely ignores a large set of such techniques, as well as metrics to measure forgetting (e.g., see Sec. 2.3 in `[2]`) and the stability-plasticity tradeoff (Sec. 3.1 in `[2]`).
2. The paper has a weak baseline comparison and misses comparing with state-of-the-art papers in regularization-based continual learning to mitigate catastrophic forgetting
1. The presentation is clear.
2. The method is straightforward and intuitive.
1. Although most parts of the method can be justified, I find that one particular design is a bit heuristic and unexplained. In particular, the method essentially involves a weighted combination of the one-hot distribution (for fine-tuning) and the original model's output distribution (for avoiding degeneralization). However, instead of a constant $\alpha$, the authors adopted a variable $\alpha$, such that the correct answer for the fine-tuning always dominates the original model's best answer
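The weighted combination this reviewer describes can be illustrated with a small sketch. It is not the authors' implementation: the function names, the `margin` parameter, and the rule for choosing a per-token `alpha` are assumptions made for illustration, derived only from the reviewer's description that the correct token should dominate the original model's best answer.

```python
def mix_with_one_hot(teacher_probs, label, alpha):
    """Blend the original model's distribution with a one-hot target.

    target = alpha * one_hot(label) + (1 - alpha) * teacher_probs
    """
    return [
        (1.0 - alpha) * p + (alpha if i == label else 0.0)
        for i, p in enumerate(teacher_probs)
    ]


def minimal_alpha(teacher_probs, label, margin):
    """Smallest alpha for which the correct token leads every other token
    by at least `margin` after mixing (hypothetical rule, an assumption).

    Assumes at least two tokens and 0 <= margin <= 1.
    """
    best_other = max(p for i, p in enumerate(teacher_probs) if i != label)
    gap = teacher_probs[label] - best_other
    if gap >= margin:
        # The original model is already right by a sufficient margin.
        return 0.0
    # Solve alpha + (1 - alpha) * gap = margin for alpha.
    return (margin - gap) / (1.0 - gap)


teacher = [0.6, 0.3, 0.1]   # original model prefers token 0
label = 1                   # fine-tuning data says token 1
alpha = minimal_alpha(teacher, label, margin=0.2)
target = mix_with_one_hot(teacher, label, alpha)
```

With a constant `alpha`, the same correction would be applied to every token regardless of how wrong the original model was; with the per-token choice sketched here, tokens the original model already predicts correctly by the margin are left unchanged.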
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Topic Modeling · Speech Recognition and Synthesis
