Detecting LLM Fact-conflicting Hallucinations Enhanced by Temporal-logic-based Reasoning

Ningke Li; Yahui Song; Kailong Wang; Yuekang Li; Ling Shi; Yi Liu; Haoyu Wang

arXiv:2502.13416·cs.CL·February 20, 2025

TL;DR

This paper introduces Drowzee, a framework that uses temporal logic and knowledge bases to detect fact-conflicting hallucinations in large language models by generating and validating complex test cases.

Contribution

Drowzee is the first end-to-end system combining temporal logic reasoning with automated test case generation to identify hallucinations in LLMs.

Findings

01. Detects non-temporal hallucinations, with hallucination rates ranging from 24.7% to 59.8%

02. Detects temporal hallucinations, with hallucination rates ranging from 16.7% to 39.2%

03. Works across nine LLMs and knowledge domains

Abstract

Large language models (LLMs) face the challenge of hallucinations -- outputs that seem coherent but are actually incorrect. A particularly damaging type is fact-conflicting hallucination (FCH), where generated content contradicts established facts. Addressing FCH presents three main challenges: 1) Automatically constructing and maintaining large-scale benchmark datasets is difficult and resource-intensive; 2) Generating complex and efficient test cases that the LLM has not been trained on -- especially those involving intricate temporal features -- is challenging, yet crucial for eliciting hallucinations; and 3) Validating the reasoning behind LLM outputs is inherently difficult, particularly with complex logical relationships, as it requires transparency in the model's decision-making process. This paper presents Drowzee, an innovative end-to-end metamorphic testing framework that…
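As a rough illustration of the idea of deriving a temporal test case from knowledge-base facts and checking an answer against it, the sketch below may help. It is not Drowzee's implementation: the `TimedFact` type, the interval-overlap rule, the helper names, and the sample entities are all hypothetical and chosen only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimedFact:
    """A hypothetical knowledge-base fact that holds over a span of years."""
    subject: str
    relation: str
    start: int  # first year the fact holds (inclusive)
    end: int    # last year the fact holds (inclusive)


def overlaps(a: TimedFact, b: TimedFact) -> bool:
    """Return True when both facts hold during at least one common year."""
    return a.start <= b.end and b.start <= a.end


def make_test_case(a: TimedFact, b: TimedFact) -> tuple[str, bool]:
    """Build a yes/no question plus the answer derived from the two facts."""
    question = (
        f"Was there any year in which {a.subject} {a.relation} "
        f"while {b.subject} {b.relation}? Answer yes or no."
    )
    return question, overlaps(a, b)


def is_fact_conflicting(answer: str, expected: bool) -> bool:
    """Flag an answer whose yes/no verdict contradicts the derived truth."""
    normalized = answer.strip().lower()
    if normalized.startswith("yes"):
        return expected is False
    if normalized.startswith("no"):
        return expected is True
    raise ValueError(f"cannot read a yes/no verdict from: {answer!r}")


# Made-up facts, used only to exercise the helpers above.
alice = TimedFact("Alice", "led the council", 2001, 2005)
bob = TimedFact("Bob", "chaired the board", 2004, 2009)
carol = TimedFact("Carol", "ran the archive", 2006, 2010)

question, expected = make_test_case(alice, bob)
```

Here the ground truth is computed from the stored intervals rather than written by hand, so a model reply such as "No, never" to `question` would be flagged by `is_fact_conflicting`. A real pipeline would need far richer temporal operators and a more robust way to read the model's answer.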

Code & Models

Datasets

GarnettLiang/Omnibench-RAG
dataset · 9 downloads

Taxonomy

Topics: Semantic Web and Ontologies · Logic, Reasoning, and Knowledge · Topological and Geometric Data Analysis

Methods: Balanced Selection · Sparse Evolutionary Training