Unfamiliar Finetuning Examples Control How Language Models Hallucinate
Katie Kang, Eric Wallace, Claire Tomlin, Aviral Kumar, Sergey Levine

TL;DR
This paper investigates how unfamiliar finetuning examples influence language model hallucinations and demonstrates that controlling these examples can reduce hallucinations and improve factuality in generated content.
Contribution
The paper shows that unfamiliar finetuning examples shape how a model hallucinates, and it proposes controlling this effect by changing how those examples are supervised, which improves model reliability and factual accuracy.
Findings
Unfamiliar finetuning examples influence hallucination patterns.
Controlling hallucinations in reward models improves factuality.
Strategic supervision reduces negative effects of hallucinations in RL finetuning.
Abstract
Large language models are known to hallucinate when faced with unfamiliar queries, but the underlying mechanisms that govern how models hallucinate are not yet fully understood. In this work, we find that unfamiliar examples in the models' finetuning data (those that introduce concepts beyond the base model's scope of knowledge) are crucial in shaping these errors. In particular, we find that an LLM's hallucinated predictions tend to mirror the responses associated with its unfamiliar finetuning examples. This suggests that by modifying how unfamiliar finetuning examples are supervised, we can influence a model's responses to unfamiliar queries (e.g., say "I don't know"). We empirically validate this observation in a series of controlled experiments involving SFT, RL, and reward model finetuning on TriviaQA and MMLU. Our work further investigates RL finetuning strategies for…
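As an illustration of the idea in the abstract, the sketch below relabels unfamiliar finetuning examples with an abstention response before supervised finetuning. It is a minimal sketch, not the authors' implementation: the `Example` class, the `familiarity` score, the `relabel_unfamiliar` helper, and the 0.5 threshold are all hypothetical, and how familiarity is actually measured is left as an assumption.

```python
from dataclasses import dataclass, replace
from typing import List

# Abstention response used as the new supervision target (taken from the abstract's example).
IDK = "I don't know"


@dataclass(frozen=True)
class Example:
    query: str
    answer: str
    # Hypothetical score in [0, 1]: how well the base model already knows this
    # fact. How to compute it is an assumption and is not specified here.
    familiarity: float


def relabel_unfamiliar(
    examples: List[Example],
    threshold: float = 0.5,  # hypothetical cutoff, chosen only for illustration
    fallback: str = IDK,
) -> List[Example]:
    """Return a new list where examples below the familiarity threshold
    have their target answer replaced by the fallback response."""
    return [
        replace(ex, answer=fallback) if ex.familiarity < threshold else ex
        for ex in examples
    ]


if __name__ == "__main__":
    data = [
        Example("What is the capital of France?", "Paris", familiarity=0.95),
        Example("Who won a little-known 1923 regional chess open?", "A. Nonymous", familiarity=0.05),
    ]
    for ex in relabel_unfamiliar(data):
        print(f"{ex.query} -> {ex.answer}")
```

Under these assumptions, familiar examples keep their original targets, while unfamiliar ones are supervised toward abstaining. According to the abstract, a model's responses to unfamiliar queries tend to mirror the supervision on its unfamiliar finetuning examples, which is the behavior this relabeling is meant to steer.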
Taxonomy
Topics: Natural Language Processing Techniques
Methods: Balanced Selection · Shrink and Fine-Tune
