Probing LLM Hallucination from Within: Perturbation-Driven Approach via Internal Knowledge
Seongmin Lee, Hsiang Hsu, Chun-Fu Chen, Duen Horng Chau

TL;DR
This paper introduces SHINE, a perturbation-driven method for classifying and detecting hallucinations in large language models without external knowledge or fine-tuning, achieving state-of-the-art results.
Contribution
The paper proposes a novel hallucination probing task and a perturbation-based method that improves hallucination detection across multiple LLMs without additional training.
Findings
SHINE outperforms seven competing methods in hallucination detection.
Perturbing key entities affects hallucination types differently.
SHINE is effective across three modern LLMs and four datasets.
Abstract
LLM hallucination, where unfaithful text is generated, presents a critical challenge for LLMs' practical applications. Current detection methods often resort to external knowledge, LLM fine-tuning, or supervised training with large hallucination-labeled datasets. Moreover, these approaches do not distinguish between different types of hallucinations, which is crucial for enhancing detection performance. To address such limitations, we introduce hallucination probing, a new task that classifies LLM-generated text into three categories: aligned, misaligned, and fabricated. Driven by our novel discovery that perturbing key entities in prompts affects an LLM's generation of these three types of text differently, we propose SHINE, a novel hallucination probing method that does not require external knowledge, supervised training, or LLM fine-tuning. SHINE is effective in hallucination probing…
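As a rough illustration of the perturbation idea described in the abstract, the sketch below swaps a key entity in a prompt for substitutes and measures how much a score for the original generation changes. This is not SHINE's algorithm, which this summary does not specify. The function names, the choice of a mean log-likelihood drop as the signal, and the `toy_log_likelihood` scorer (a word-overlap stand-in for a real LLM scoring call) are all assumptions made for illustration. How such a signal would map onto the aligned, misaligned, and fabricated labels is deliberately left out.

```python
from typing import Callable, Sequence


def perturbation_sensitivity(
    prompt: str,
    entity: str,
    substitutes: Sequence[str],
    generation: str,
    log_likelihood: Callable[[str, str], float],
) -> float:
    """Mean drop in the score of `generation` when `entity` is replaced.

    Hypothetical helper: `log_likelihood(prompt, generation)` is assumed to
    return a score for the generation given the prompt.
    """
    if entity not in prompt:
        raise ValueError("entity must occur in the prompt")
    if not substitutes:
        raise ValueError("at least one substitute entity is required")

    base = log_likelihood(prompt, generation)
    drops = [
        base - log_likelihood(prompt.replace(entity, substitute), generation)
        for substitute in substitutes
    ]
    return sum(drops) / len(drops)


def toy_log_likelihood(prompt: str, generation: str) -> float:
    """Stand-in scorer (assumption): rewards word overlap with the prompt.

    A real setup would query an LLM for the log-probability of `generation`.
    """
    prompt_words = set(prompt.lower().split())
    generation_words = generation.lower().split()
    overlap = sum(word in prompt_words for word in generation_words)
    return float(overlap - len(generation_words))


sensitivity = perturbation_sensitivity(
    prompt="Who wrote Hamlet",
    entity="Hamlet",
    substitutes=["Macbeth", "Othello"],
    generation="Shakespeare wrote Hamlet",
    log_likelihood=toy_log_likelihood,
)
print(sensitivity)
```

Passing the scorer in as a callable keeps the sketch independent of any particular model or library. Swapping in a real LLM scoring function would not change the perturbation loop.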
Taxonomy
Topics: Cryptography and Residue Arithmetic · Logic, Reasoning, and Knowledge · Logic, programming, and type systems
