Pre-trained Language Models Return Distinguishable Probability Distributions to Unfaithfully Hallucinated Texts
Taehun Cha, Donghun Lee

TL;DR
This paper demonstrates that pre-trained language models produce distinguishable probability and uncertainty distributions for hallucinated texts, enabling a new training method that reduces hallucinations while preserving text quality.
Contribution
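The study shows that distinguishable probability and uncertainty distributions for hallucinated texts are a general phenomenon, holding across 24 models and 6 datasets regardless of model size and structure, and introduces a hallucination-reducing training algorithm built on that distinguishability.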
Findings
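In 88-98% of cases, generation probability and uncertainty distributions for hallucinated texts are statistically significantly distinguishable
The proposed training algorithm outperforms baselines on faithfulness metrics
General text quality measures remain sound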
Abstract
In this work, we show that pre-trained language models return distinguishable generation probability and uncertainty distributions for unfaithfully hallucinated texts, regardless of their size and structure. By examining 24 models on 6 datasets, we find that 88-98% of cases return statistically significantly distinguishable generation probability and uncertainty distributions. Using this general phenomenon, we showcase a hallucination-reducing training algorithm. Our algorithm outperforms other baselines by achieving higher faithfulness metrics while maintaining sound general text quality measures.
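To make "statistically significantly distinguishable" concrete, here is a minimal Python sketch of one way such a comparison could be set up. It is an illustration under stated assumptions, not the authors' procedure: the abstract does not say which statistical test or which uncertainty measure the paper uses, so the permutation test on the difference of means and the Shannon-entropy uncertainty score are stand-ins. The function names (`mean_log_prob`, `mean_entropy`, `permutation_p_value`) and all numbers in the toy data are hypothetical and carry no information about the paper's results.

```python
import math
import random


def mean_log_prob(token_probs):
    """Average log-probability a model assigned to the tokens of one text."""
    return sum(math.log(p) for p in token_probs) / len(token_probs)


def mean_entropy(step_distributions):
    """Average Shannon entropy (nats) of the per-step next-token distributions.

    Assumption: entropy is used here as a generic uncertainty score; the
    paper's own uncertainty measure is not given in the abstract.
    """
    entropies = [
        -sum(p * math.log(p) for p in dist if p > 0)
        for dist in step_distributions
    ]
    return sum(entropies) / len(entropies)


def _mean(values):
    return sum(values) / len(values)


def permutation_p_value(group_a, group_b, n_resamples=10_000, seed=0):
    """Two-sided permutation test on the difference of group means.

    Assumption: a permutation test stands in for whatever test the
    paper actually applies.
    """
    rng = random.Random(seed)
    observed = abs(_mean(group_a) - _mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        diff = abs(_mean(pooled[:n_a]) - _mean(pooled[n_a:]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_resamples + 1)


# Hypothetical per-text scores (e.g. mean log-probabilities), invented
# purely to exercise the test.
faithful_scores = [-1.0, -1.2, -0.9, -1.1, -1.0, -0.8, -1.3, -1.1]
hallucinated_scores = [-2.5, -2.2, -2.8, -2.4, -2.6, -2.3, -2.7, -2.5]

p_value = permutation_p_value(faithful_scores, hallucinated_scores)
print(f"p-value: {p_value:.4f}")
print("distinguishable at 0.05:", p_value < 0.05)
```

In this setup, one "case" would correspond to one model and dataset pair: score every faithful and every hallucinated text with the model, run the test on the two score lists, and count the case as distinguishable when the p-value falls below the chosen threshold. The 0.05 threshold is likewise an assumption.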
Taxonomy
Topics: Topic Modeling
