Neural Probe-Based Hallucination Detection for Large Language Models
Shize Liang, Hongzhi Wang

TL;DR
This paper introduces a neural-network-based, token-level hallucination detection method for large language models (LLMs). It trains lightweight MLP probes on the model's hidden states to improve accuracy and stability over existing techniques.
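A minimal sketch of the idea in the TL;DR. It assumes per-token hidden states have already been extracted from one layer of a frozen LLM; random vectors stand in for them here. The class name `MLPProbe`, the 64-unit hidden layer, the ReLU activation, and the 0.5 threshold are illustrative assumptions, not the paper's configuration. Training is omitted.

```python
import numpy as np


class MLPProbe:
    """Hypothetical two-layer MLP probe over frozen per-token hidden states.

    Only these probe weights would be trained; the LLM that produced the
    hidden states stays frozen.
    """

    def __init__(self, hidden_dim: int, probe_dim: int = 64, seed: int = 0):
        rng = np.random.default_rng(seed)
        # Small random initialisation; the sizes are illustrative assumptions.
        self.w1 = rng.normal(0.0, 0.02, size=(hidden_dim, probe_dim))
        self.b1 = np.zeros(probe_dim)
        self.w2 = rng.normal(0.0, 0.02, size=(probe_dim,))
        self.b2 = 0.0

    def predict_proba(self, hidden_states: np.ndarray) -> np.ndarray:
        """Map (num_tokens, hidden_dim) states to per-token hallucination probabilities."""
        h = np.maximum(hidden_states @ self.w1 + self.b1, 0.0)  # ReLU nonlinearity
        logits = h @ self.w2 + self.b2
        return 1.0 / (1.0 + np.exp(-logits))  # sigmoid


# Stand-in for 12 tokens' hidden states (dimension 256) read from a frozen LLM.
states = np.random.default_rng(1).normal(size=(12, 256))
probe = MLPProbe(hidden_dim=256)
scores = probe.predict_proba(states)
flagged = scores > 0.5  # tokens flagged as likely hallucinated
print(scores.shape, flagged.dtype)
```

Because the probe only reads hidden states, a forward pass costs two small matrix multiplications per token. This is the sense in which such probes are lightweight next to retrieval-based checks.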
Contribution
The paper proposes a nonlinear probe framework that combines a multi-objective training loss with Bayesian optimization to select which layer of the model the probe is attached to.
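The summary does not state which objectives the multi-objective loss combines. The sketch below is therefore a hypothetical example of the general pattern: a weighted sum of several terms. The two terms chosen here, the function name `multi_objective_loss`, and the weights `pos_weight` and `fp_weight` are all assumptions. The first term is a class-weighted binary cross-entropy over tokens. The second penalises probability assigned to non-hallucinated tokens, in the spirit of the low false-positive finding.

```python
import numpy as np


def multi_objective_loss(probs, labels, pos_weight=2.0, fp_weight=0.5, eps=1e-7):
    """Hypothetical two-term loss for a token-level hallucination probe.

    Term 1: class-weighted binary cross-entropy (hallucinated tokens are
            assumed rarer, so positives are up-weighted by `pos_weight`).
    Term 2: mean probability assigned to non-hallucinated tokens, intended
            to discourage false positives.
    The paper's actual objectives and weights are not given in this summary.
    """
    probs = np.clip(np.asarray(probs, dtype=float), eps, 1.0 - eps)
    labels = np.asarray(labels, dtype=float)
    bce = -(pos_weight * labels * np.log(probs) + (1.0 - labels) * np.log(1.0 - probs))
    negatives = labels == 0
    fp_penalty = probs[negatives].mean() if negatives.any() else 0.0
    return float(bce.mean() + fp_weight * fp_penalty)


# Good predictions should cost less than bad ones for the same labels.
good = multi_objective_loss([0.9, 0.1, 0.1], [1, 0, 0])
bad = multi_objective_loss([0.1, 0.9, 0.9], [1, 0, 0])
print(good < bad)
```

Exposing each term's weight as a separate parameter lets the balance between recall and false positives be tuned on validation data.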
Findings
MLP probes outperform state-of-the-art methods in accuracy and recall.
The approach keeps the false-positive rate low across multiple datasets.
Optimizing which layer the probe is placed on further improves detection performance.
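The summary says Bayesian optimization is used for probe placement but not which surrogate or acquisition function. The sketch below uses one common choice: a Gaussian-process surrogate with an RBF kernel over layer indices and an upper-confidence-bound (UCB) acquisition. It searches for the layer with the best validation score. `bayes_opt_layer`, `toy_validation_score`, and every default value are hypothetical. The toy score is synthetic and stands in for "train a probe on this layer and measure validation F1".

```python
import numpy as np


def rbf_kernel(a, b, length_scale=3.0):
    """Squared-exponential kernel between two 1-D arrays of layer indices."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)


def bayes_opt_layer(score_fn, num_layers, budget=6, init_layers=None, beta=2.0, noise=1e-4):
    """Hypothetical GP-UCB search for the best probe layer (higher score is better)."""
    if init_layers is None:
        init_layers = sorted({0, num_layers // 2, num_layers - 1})
    candidates = np.arange(num_layers, dtype=float)
    evaluated = {int(layer): score_fn(int(layer)) for layer in init_layers}
    while len(evaluated) < min(budget, num_layers):
        x = np.array(list(evaluated.keys()), dtype=float)
        y = np.array(list(evaluated.values()), dtype=float)
        y_mean = y.mean()  # constant prior mean
        k_xx = rbf_kernel(x, x) + noise * np.eye(len(x))
        k_cx = rbf_kernel(candidates, x)
        # GP posterior mean at every candidate layer.
        mu = y_mean + k_cx @ np.linalg.solve(k_xx, y - y_mean)
        # Posterior variance: k(c, c) - k_cx K^-1 k_xc, with k(c, c) = 1 for RBF.
        v = np.linalg.solve(k_xx, k_cx.T)
        var = np.clip(1.0 - np.sum(k_cx * v.T, axis=1), 0.0, None)
        ucb = mu + beta * np.sqrt(var)
        ucb[list(evaluated)] = -np.inf  # never re-evaluate a layer
        nxt = int(np.argmax(ucb))
        evaluated[nxt] = score_fn(nxt)
    best = max(evaluated, key=evaluated.get)
    return best, evaluated


def toy_validation_score(layer):
    """Synthetic stand-in for a probe's validation metric at a given layer."""
    return 1.0 - abs(layer - 20) / 32.0


best_layer, history = bayes_opt_layer(toy_validation_score, num_layers=32, budget=10)
print(best_layer, len(history))
```

Each evaluation means training and validating a full probe. The sketch therefore spends a fixed budget of evaluations rather than sweeping every layer, and the UCB term trades off exploring unvisited layers against refining promising ones.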
Abstract
Large language models (LLMs) excel at text generation and knowledge question-answering tasks, but they are prone to generating hallucinated content, which severely limits their use in high-risk domains. Current hallucination detection methods have clear limitations: those based on uncertainty estimation fail when the model produces erroneous content with high confidence, while those based on external knowledge retrieval depend heavily on retrieval efficiency and knowledge coverage. In contrast, probe methods that read the model's hidden-layer states are lightweight and can run in real time. However, traditional linear probes struggle to capture nonlinear structures in deep semantic spaces. To overcome these limitations, we propose a neural-network-based framework for token-level hallucination detection. By freezing language model parameters, we employ lightweight MLP probes to perform nonlinear…
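The abstract's point that linear probes miss nonlinear structure can be shown on a toy case. This is an illustration only, not the paper's data. When the label is the XOR of two features, no linear probe classifies all four points correctly; a brute-force grid over weights tops out at 3/4. A two-unit ReLU MLP with hand-set weights separates them exactly. The grid, the weights, and the 0.5 threshold are chosen for illustration.

```python
import itertools

import numpy as np

# Toy "nonlinear structure": the label is the XOR of two binary features.
X = np.array([[0, 0], [1, 0], [0, 1], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])


def linear_probe_accuracy(w, b):
    """Accuracy of a linear probe that predicts 1 when w.x + b > 0."""
    return float(np.mean((X @ w + b > 0).astype(int) == y))


# Search a grid of linear probes; XOR is not linearly separable, so none exceeds 3/4.
grid = np.linspace(-2, 2, 9)
best_linear = max(
    linear_probe_accuracy(np.array([w1, w2]), b)
    for w1, w2, b in itertools.product(grid, repeat=3)
)

# Hand-set two-unit ReLU MLP: h1 = relu(x1 + x2), h2 = relu(x1 + x2 - 1), out = h1 - 2*h2.
hidden = np.maximum(X @ np.array([[1.0, 1.0], [1.0, 1.0]]) + np.array([0.0, -1.0]), 0.0)
mlp_out = hidden @ np.array([1.0, -2.0])
mlp_accuracy = float(np.mean((mlp_out > 0.5).astype(int) == y))

print(best_linear, mlp_accuracy)  # 0.75 1.0
```

Real hidden-state geometry is far higher-dimensional than this. The example only shows why adding a hidden layer to the probe can recover decision boundaries that a single linear map cannot.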
