Prompt-Response Semantic Divergence Metrics for Faithfulness Hallucination and Misalignment Detection in Large Language Models
Igor Halperin

TL;DR
This paper proposes Semantic Divergence Metrics (SDM), a lightweight framework that detects faithfulness hallucinations in large language models by measuring semantic divergence across prompts and responses, improving hallucination detection accuracy.
Contribution
The paper introduces SDM, a novel prompt-aware semantic divergence framework that enhances hallucination detection in LLMs by analyzing response consistency across paraphrased prompts.
Findings
SDM effectively detects faithfulness hallucinations in LLMs.
The combined metrics accurately classify different response types.
Semantic divergence scores correlate with hallucination severity.
Abstract
The proliferation of Large Language Models (LLMs) is challenged by hallucinations, critical failure modes where models generate non-factual, nonsensical or unfaithful text. This paper introduces Semantic Divergence Metrics (SDM), a novel lightweight framework for detecting Faithfulness Hallucinations -- events of severe deviation of LLM responses from input contexts. We focus on a specific implementation of these LLM errors, confabulations, defined as responses that are arbitrary and semantically misaligned with the user's query. Existing methods like Semantic Entropy test for arbitrariness by measuring the diversity of answers to a single, fixed prompt. Our SDM framework improves upon this by being more prompt-aware: we test for a deeper form of arbitrariness by measuring response consistency not only across multiple answers but also across multiple, semantically-equivalent…
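To make the idea of a prompt-response divergence score concrete, here is a minimal sketch in Python. It rests on assumptions that the excerpt above does not confirm: that sentences from the paraphrased prompts and from the sampled responses have already been assigned to shared semantic clusters by some embedding-and-clustering step (not shown), and that divergence is scored with the Jensen-Shannon divergence between the two cluster histograms. The function names and the choice of measure are illustrative only and should not be read as the paper's definitions.

```python
import math
from collections import Counter


def histogram(labels, num_clusters):
    """Normalised frequency of each cluster id (0..num_clusters-1) in `labels`."""
    counts = Counter(labels)
    total = len(labels)
    return [counts.get(k, 0) / total for k in range(num_clusters)]


def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits, bounded between 0 and 1."""
    # The mixture m is positive wherever p or q is, so the KL terms are finite.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def prompt_response_divergence(prompt_labels, response_labels, num_clusters):
    """Hypothetical score: JS divergence between prompt and response cluster
    histograms. Cluster labels are assumed to come from an upstream step."""
    p = histogram(prompt_labels, num_clusters)
    r = histogram(response_labels, num_clusters)
    return js_divergence(p, r)


# Illustrative labels only: responses that stay in the prompts' clusters
# score low; responses that land in entirely different clusters score high.
aligned = prompt_response_divergence([0, 0, 1, 1], [0, 1, 0, 1], num_clusters=3)
drifted = prompt_response_divergence([0, 0, 1, 1], [2, 2, 2, 2], num_clusters=3)
```

Under these assumptions, a score near 0 means the responses occupy the same semantic clusters as the prompts, while a score near 1 means they share none, which is the kind of misalignment the abstract associates with confabulation.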
