MoRFI: Monotonic Sparse Autoencoder Feature Identification
Dimitris Dimakopoulos, Shay B. Cohen, Ioannis Konstas

TL;DR
This paper investigates why supervised fine-tuning of large language models on new knowledge increases hallucinations, and introduces MoRFI, a method for identifying the latent features responsible for this effect.
Contribution
The paper presents MoRFI, a technique that uses sparse autoencoders to identify latent features that causally contribute to hallucinations in fine-tuned language models.
Findings
Incrementally introducing new knowledge during fine-tuning increases hallucinations, and the effect grows with more training epochs.
MoRFI effectively identifies latent features associated with knowledge disruption.
Single-latent interventions can recover stored knowledge in the residual stream.
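To make the idea of a single-latent intervention concrete, here is a minimal sketch of how one sparse autoencoder latent could be steered in a residual-stream vector. It is an illustration under stated assumptions, not the paper's implementation: the ReLU encoder form, the function names (`sae_encode`, `steer_single_latent`), the random weights, and the choice of latent index and target value are all hypothetical.

```python
import numpy as np

def sae_encode(x, W_enc, b_enc, b_dec):
    """Map a residual-stream vector to non-negative SAE latent activations.

    Assumes a standard ReLU sparse autoencoder; the paper may use a different variant.
    """
    return np.maximum(0.0, (x - b_dec) @ W_enc + b_enc)

def steer_single_latent(x, W_enc, b_enc, W_dec, b_dec, latent_idx, target):
    """Shift x along one decoder direction so that latent's contribution
    to the reconstruction changes from its current activation to `target`."""
    acts = sae_encode(x, W_enc, b_enc, b_dec)
    delta = target - acts[latent_idx]
    # Only the chosen latent's decoder direction is added; x is otherwise unchanged.
    return x + delta * W_dec[latent_idx]

# Toy setup with random weights (hypothetical sizes, not a trained SAE).
rng = np.random.default_rng(0)
d_model, n_latents = 8, 32
W_enc = rng.normal(size=(d_model, n_latents))
W_dec = rng.normal(size=(n_latents, d_model))
b_enc = np.zeros(n_latents)
b_dec = np.zeros(d_model)

x = rng.normal(size=d_model)  # stand-in for a residual-stream activation
x_steered = steer_single_latent(x, W_enc, b_enc, W_dec, b_dec, latent_idx=3, target=0.0)
```

In practice the steered vector would be written back into the model's forward pass at the layer the autoencoder was trained on, and the effect would be measured on downstream answers; which latent to pick and what value to set are exactly what a feature identification method has to supply.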
Abstract
Large language models (LLMs) acquire most of their factual knowledge during the pre-training stage, through next token prediction. Subsequent stages of post-training often introduce new facts outwith the parametric knowledge, giving rise to hallucinations. While it has been demonstrated that supervised fine-tuning (SFT) on new knowledge may exacerbate the problem, the underlying mechanisms are still poorly understood. We conduct a controlled fine-tuning experiment, focusing on closed-book QA, and find latent directions that causally contribute to hallucinations. Specifically, we fine-tune Llama 3.1 8B, Gemma 2 9B and Mistral 7B v03 on seven distinct single QA datasets, controlling for the percentage of new knowledge and number of training epochs. By measuring performance on the test set, we validate that incrementally introducing new knowledge increases hallucinations, with the effect…
