Pre-trained Language Models Return Distinguishable Probability Distributions to Unfaithfully Hallucinated Texts

Taehun Cha; Donghun Lee

arXiv:2409.16658·cs.CL·September 26, 2024

TL;DR

This paper demonstrates that pre-trained language models produce distinguishable probability and uncertainty distributions for hallucinated texts, enabling a new training method that reduces hallucinations while preserving text quality.

Contribution

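The study shows that the distinguishability of generation probability and uncertainty distributions for unfaithfully hallucinated texts holds across 24 models of varying size and structure and 6 data sets, and it introduces a hallucination-reducing training algorithm built on that distinguishability.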

Findings

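01 In 88-98% of the cases examined (24 models on 6 data sets), the generation probability and uncertainty distributions are statistically significantly distinguishable.

02 The proposed training algorithm achieves higher faithfulness metrics than the baselines.

03 The algorithm maintains sound general text quality measures.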

Abstract

In this work, we show that pre-trained language models return distinguishable generation probability and uncertainty distributions to unfaithfully hallucinated texts, regardless of their size and structure. By examining 24 models on 6 data sets, we find that 88-98% of cases return statistically significantly distinguishable generation probability and uncertainty distributions. Using this general phenomenon, we showcase a hallucination-reducing training algorithm. Our algorithm outperforms other baselines by achieving higher faithfulness metrics while maintaining sound general text quality measures.
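To make "statistically significantly distinguishable" concrete, the following is a minimal sketch of how two sets of per-text scores could be compared. It is not the authors' procedure: the choice of a permutation test on the difference of means, the length-normalized log-probability and mean-entropy scores, all function names, and the synthetic scores are assumptions made for illustration only. The abstract does not state which test or score definitions the paper uses.

```python
import math
import random


def mean_log_prob(token_probs):
    """Length-normalized log-probability of one text (assumed score)."""
    return sum(math.log(p) for p in token_probs) / len(token_probs)


def mean_entropy(token_dists):
    """Average Shannon entropy (in nats) of per-token distributions (assumed score)."""
    entropies = [-sum(p * math.log(p) for p in dist if p > 0) for dist in token_dists]
    return sum(entropies) / len(entropies)


def permutation_test(a, b, n_resamples=2000, seed=0):
    """Two-sided permutation test on the difference of means.

    Returns an estimated p-value; a small value suggests the two samples
    come from distinguishable distributions.
    """
    rng = random.Random(seed)
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        left, right = pooled[:len(a)], pooled[len(a):]
        diff = abs(sum(left) / len(left) - sum(right) / len(right))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_resamples + 1)


# Synthetic, made-up scores standing in for model outputs (assumption):
# faithful texts get higher log-probability than hallucinated ones.
rng = random.Random(42)
faithful_scores = [rng.gauss(-1.0, 0.5) for _ in range(50)]
hallucinated_scores = [rng.gauss(-2.0, 0.5) for _ in range(50)]

p_value = permutation_test(faithful_scores, hallucinated_scores)
print(f"p-value: {p_value:.4f}")
```

In a real experiment, `faithful_scores` and `hallucinated_scores` would be computed from a language model's token probabilities for each text, and the test would be repeated for every model and data set pair to obtain the fraction of significant cases.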

Code & Models

Repositories

AIML-K/HalluDist (PyTorch, official)

Videos

Pre-trained Language Models Return Distinguishable Probability Distributions to Unfaithfully Hallucinated Texts (Underline)

Taxonomy

Topics: Topic Modeling