Counterfactual Probing for Hallucination Detection and Mitigation in Large Language Models
Yijun Feng

TL;DR
This paper introduces Counterfactual Probing, a method to detect and reduce hallucinations in large language models by evaluating their responses to plausible but factually incorrect counterfactual statements, improving factual accuracy without retraining.
Contribution
The paper presents a novel counterfactual probing approach that enhances hallucination detection and mitigation in LLMs without requiring model retraining.
Findings
Outperforms baseline hallucination detection methods.
Reduces hallucination scores by an average of 24.5%.
Can be integrated into existing LLM pipelines as a real-time verification tool.
Abstract
Large Language Models have demonstrated remarkable capabilities across diverse tasks, yet they frequently generate hallucinations: outputs that are fluent but factually incorrect or unsupported. We propose Counterfactual Probing, a novel approach for detecting and mitigating hallucinations in LLM outputs. Our method dynamically generates counterfactual statements that appear plausible but contain subtle factual errors, then evaluates the model's sensitivity to these perturbations. We hypothesize that genuine knowledge exhibits robustness to counterfactual variations, while hallucinated content shows inconsistent confidence patterns when confronted with plausible alternatives. Our comprehensive evaluation on TruthfulQA, factual statement datasets, and curated hallucination examples demonstrates that counterfactual probing achieves superior detection performance compared to baseline…
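The probing idea in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the names `probe_score`, `looks_hallucinated`, and `toy_confidence`, the margin-based score, and the 0.2 threshold are all assumptions made for illustration. The sketch assumes some function is available that returns the model's confidence in a statement as a number between 0 and 1, and it compares that confidence on the original statement against plausible counterfactual variants.

```python
from statistics import mean, pstdev
from typing import Callable, Dict, Sequence


def probe_score(
    statement: str,
    counterfactuals: Sequence[str],
    confidence_fn: Callable[[str], float],
) -> Dict[str, float]:
    """Compare confidence in a statement with confidence in its counterfactuals.

    confidence_fn is a hypothetical stand-in for whatever confidence signal
    a real pipeline would extract from the model.
    """
    if not counterfactuals:
        raise ValueError("need at least one counterfactual statement")
    original = confidence_fn(statement)
    alternatives = [confidence_fn(c) for c in counterfactuals]
    return {
        "original": original,
        # How much more the model trusts the original than the alternatives.
        "margin": original - mean(alternatives),
        # How unevenly the model treats the alternatives.
        "spread": pstdev(alternatives),
    }


def looks_hallucinated(score: Dict[str, float], min_margin: float = 0.2) -> bool:
    """Flag a statement whose confidence barely exceeds its counterfactuals.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    return score["margin"] < min_margin


# Toy confidence table standing in for a real model (illustrative values only).
TOY_CONFIDENCE = {
    "Paris is the capital of France.": 0.95,
    "Lyon is the capital of France.": 0.10,
    "Marseille is the capital of France.": 0.05,
}


def toy_confidence(text: str) -> float:
    """Return a fixed confidence; unknown statements get an uninformative 0.5."""
    return TOY_CONFIDENCE.get(text, 0.5)


if __name__ == "__main__":
    score = probe_score(
        "Paris is the capital of France.",
        ["Lyon is the capital of France.", "Marseille is the capital of France."],
        toy_confidence,
    )
    print(score, looks_hallucinated(score))
```

In this toy setup, a statement the model is sure of keeps a large margin over its counterfactuals, while a statement the model has no real basis for scores about the same as its alternatives and is flagged. A real pipeline would have to replace `toy_confidence` with an actual confidence signal and generate the counterfactuals dynamically, as the abstract describes.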
Taxonomy
Topics: Misinformation and Its Impacts · Topic Modeling · Adversarial Robustness in Machine Learning
