FactCheckmate: Preemptively Detecting and Mitigating Hallucinations in LMs
Deema Alnuhait, Neeraja Kirtane, Muhammad Khalifa, Hao Peng

TL;DR
FactCheckmate is a method that preemptively detects and mitigates hallucinations in language models by analyzing internal representations, leading to more factual outputs with minimal overhead.
Contribution
It introduces a novel approach using hidden states to predict and prevent hallucinations in LMs before they occur, improving factual accuracy.
Findings
Detects hallucinations with over 70% accuracy.
Outputs produced with intervention are 34.4% more factual than those produced without it.
Effective across multiple model families and datasets.
Abstract
Language models (LMs) hallucinate. We inquire: Can we detect and mitigate hallucinations before they happen? This work answers this research question in the positive, by showing that the internal representations of LMs provide rich signals that can be used for this purpose. We introduce FactCheckmate, which preemptively detects hallucinations by learning a classifier that predicts whether the LM will hallucinate, based on the model's hidden states produced over the inputs, before decoding begins. If a hallucination is detected, FactCheckmate then intervenes by adjusting the LM's hidden states such that the model will produce more factual outputs. FactCheckmate provides fresh insights that the inner workings of LMs can be revealed by their hidden states. Practically, both its detection and mitigation models are lightweight, adding little inference overhead; FactCheckmate proves a more…
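The detect-then-intervene loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the summary does not specify the classifier or intervention architecture, so a logistic-regression probe stands in for the detector and a fixed shift along the probe direction stands in for the learned intervention. The synthetic "hidden states" and all function names (`make_synthetic_states`, `train_probe`, `predict_hallucination`, `intervene`) are hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def make_synthetic_states(n=400, dim=16, seed=0):
    """Toy stand-in for LM hidden states; label 1 means 'will hallucinate'."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, 2, size=n)
    direction = np.ones(dim) / np.sqrt(dim)
    # The two classes are separated along a single direction (an assumption).
    hidden = rng.normal(size=(n, dim)) + 1.5 * (2 * labels - 1)[:, None] * direction
    return hidden, labels


def train_probe(hidden, labels, lr=0.1, steps=500):
    """Fit a logistic-regression probe with plain gradient descent."""
    n, dim = hidden.shape
    w, b = np.zeros(dim), 0.0
    for _ in range(steps):
        grad = sigmoid(hidden @ w + b) - labels  # d(loss)/d(logit)
        w -= lr * (hidden.T @ grad) / n
        b -= lr * grad.mean()
    return w, b


def predict_hallucination(hidden, w, b, threshold=0.5):
    """Return a boolean mask: True where the probe predicts a hallucination."""
    return sigmoid(hidden @ w + b) >= threshold


def intervene(hidden, w, b, step=3.0, threshold=0.5):
    """Shift only the flagged states against the probe direction.

    Assumption: a fixed-size shift replaces the paper's learned intervention.
    """
    flagged = predict_hallucination(hidden, w, b, threshold)
    unit = w / (np.linalg.norm(w) + 1e-12)
    adjusted = hidden.copy()
    adjusted[flagged] -= step * unit
    return adjusted, flagged
```

In a real setting the probe would be trained on hidden states computed over the input before decoding begins, with labels derived from whether the resulting generation was factual; the adjusted states would then be fed back into the model rather than simply re-scored by the probe.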
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
1. Both the detection and mitigation models are lightweight, resulting in minimal inference overhead, which is advantageous for practical applications.
2. The approach is evaluated across various large-scale models, including Llama, Mistral, and Gemma, demonstrating its broad applicability.
1. The paper lacks an analysis of the generalizability of the learned classification network and intervention model. Specifically, it is unclear whether the trained classification and intervention models are generalizable across different large models and tasks. Given that the data collection for training was based on only three tasks, questions remain regarding the generalizability to other tasks. Is a new training dataset needed for additional tasks, or does the current model extend effectivel
- Extensive experimental results across different models and datasets are robust and demonstrate effectiveness.
- Offers practical implications for real-world applications where factual accuracy is crucial, enhancing LM reliability.
- Writing needs improvement, including but not limited to the abstract and introduction.
- Typos, e.g. Line 44 "representaions".
- Determining factuality merely by probing the LMs' representations is not novel as a methodology.
- Limited exploration of other LM components beyond hidden states.
- Generalizability of results is uncertain for tasks beyond QA.
**Strength 1** The paper introduces a new approach to detecting hallucinations by leveraging the internal representations of LMs.
**Strength 2** The experimental design is solid, and the results effectively demonstrate the effectiveness of the proposed method.
**Strength 3** The paper is well-written and easy to follow, with clear explanations of the methodology and results.
**Weakness 1** The paper focuses solely on closed-book hallucinations, whereas many hallucinations occur in open-book settings, such as in abstractive summarization. Evaluating the method's effectiveness in handling open-book hallucinations would provide a more comprehensive understanding of its capabilities.
**Weakness 2** The evaluation of the proposed method's factuality is conducted on the NQ-open dataset, and the classifier used is also trained on the same dataset. It remains unclear whethe
Overall, the presentation is well-written and easy to follow.
1. Although many methods for detecting and mitigating LLM hallucinations are outlined in the related work, the authors compare their approach with only one method. To convincingly demonstrate the superiority of their method, it would be prudent to include 3-4 baselines for both detection and mitigation aspects. Without this broader comparison, I cannot recognize the advantages of the authors' approach.
2. I appreciate the experiments conducted on different open-source model families, but there
