Why Fine-Tuning Encourages Hallucinations and How to Fix It
Guy Kaplan, Zorik Gekhman, Zhen Zhu, Lotem Rozner, Yuval Reif, Swabha Swayamdipta, Derek Hoiem, Roy Schwartz

TL;DR
This paper investigates why fine-tuning large language models leads to hallucinations and proposes methods like self-distillation and parameter freezing to reduce these errors while maintaining performance.
Contribution
It introduces a self-distillation-based fine-tuning approach and analyzes mechanisms behind hallucinations, offering practical solutions to mitigate them.
Findings
Self-distillation reduces hallucinations by mitigating interference among semantic representations.
Freezing parameter groups preserves task performance while decreasing hallucinations.
Interference among overlapping semantic representations is a key driver of hallucinations.
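The parameter-freezing finding above can be illustrated with a toy update rule. This is a minimal sketch, not the paper's implementation: the summary does not say which parameter groups are frozen, so the group names (`mlp.`, `attn.`), the `sgd_step` helper, and the learning rate are all hypothetical choices made for illustration.

```python
def sgd_step(params, grads, frozen_prefixes, lr=0.1):
    """Apply one SGD update, leaving frozen parameter groups untouched.

    A parameter is treated as frozen when its name starts with any prefix
    in `frozen_prefixes` (a hypothetical naming convention for groups).
    """
    updated = {}
    for name, value in params.items():
        if any(name.startswith(prefix) for prefix in frozen_prefixes):
            # Frozen group: keep the pre-trained value unchanged.
            updated[name] = value
        else:
            # Trainable group: ordinary gradient descent step.
            updated[name] = value - lr * grads[name]
    return updated


# Toy scalar "parameters"; real models would hold tensors here.
params = {"mlp.w": 1.0, "attn.w": 2.0}
grads = {"mlp.w": 0.5, "attn.w": 0.5}

# Assumption for illustration only: freeze the "mlp." group.
new_params = sgd_step(params, grads, frozen_prefixes=("mlp.",))
```

In this sketch only `attn.w` moves, while `mlp.w` keeps its original value. In a deep learning framework the same effect is usually obtained by excluding the frozen group from gradient computation or from the optimizer.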
Abstract
Large language models are prone to hallucinating factually incorrect statements. A key source of these errors is exposure to new factual information through supervised fine-tuning (SFT), which can increase hallucinations w.r.t. knowledge acquired during pre-training. In this work, we explore whether SFT-induced hallucinations can be mitigated using established tools from the continual learning literature, since they arise as a by-product of knowledge degradation during training. We propose a self-distillation-based SFT method that facilitates effective factual learning while minimizing hallucinations w.r.t. pre-existing knowledge by regularizing output-distribution drift. We also show that, in settings where new knowledge acquisition is unnecessary, suppressing factual plasticity by freezing parameter groups can preserve task performance while reducing hallucinations. Lastly, we…
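One common way to regularize output-distribution drift, as the abstract describes, is to add a divergence penalty between the fine-tuned model's predictions and those of a frozen copy of the model from before fine-tuning. The sketch below shows that general idea for a single token position. It is an assumption-laden illustration, not the authors' objective: the KL direction, the weight `lam`, and all function names are hypothetical, since the summary does not specify the exact loss.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(student_logits, target_index):
    """Standard SFT loss: negative log-probability of the target token."""
    return -math.log(softmax(student_logits)[target_index])


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(
        pi * (math.log(pi + eps) - math.log(qi + eps))
        for pi, qi in zip(p, q)
    )


def self_distillation_loss(student_logits, teacher_logits, target_index, lam=0.5):
    """SFT loss plus a drift penalty toward the pre-fine-tuning model.

    `teacher_logits` come from a frozen copy of the model before fine-tuning
    (assumed setup); `lam` is a hypothetical trade-off weight.
    """
    ce = cross_entropy(student_logits, target_index)
    drift = kl_divergence(softmax(teacher_logits), softmax(student_logits))
    return ce + lam * drift


# Toy example over a 3-token vocabulary.
teacher = [2.0, 0.5, -1.0]
student = [0.0, 3.0, 0.0]
loss = self_distillation_loss(student, teacher, target_index=0, lam=0.5)
```

When the student's distribution matches the teacher's, the penalty term vanishes and the loss reduces to plain cross-entropy; the further the student drifts, the larger the penalty.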
