Logical Consistency of Large Language Models in Fact-checking

Bishwamittra Ghosh; Sarah Hasan; Naheed Anjum Arafat; Arijit Khan

arXiv:2412.16100·cs.CL·March 3, 2025

TL;DR

This paper investigates the logical consistency of large language models in complex fact-checking tasks involving propositional logic, introduces new datasets, assesses current models, and improves their consistency through fine-tuning.

Contribution

It introduces logical fact-checking datasets, evaluates LLMs' logical consistency, and enhances their performance via supervised fine-tuning.

Findings

01

Existing LLMs lack logical consistency on complex queries.

02

New datasets enable benchmarking of logical fact-checking.

03

Fine-tuning improves LLMs' logical reasoning abilities.

Abstract

In recent years, large language models (LLMs) have demonstrated significant success in performing varied natural language tasks such as language translation, question-answering, summarization, fact-checking, etc. Despite their impressive ability to generate human-like text, LLMs are infamous for their inconsistent responses: a meaning-preserving change in the input query results in an inconsistent response, which contributes to vulnerabilities of LLMs such as hallucination. Consequently, existing research focuses on simple paraphrasing-based consistency assessment of LLMs, and ignores complex queries that necessitate an even better understanding of logical reasoning by an LLM. Our work therefore addresses the logical inconsistency of LLMs under complex logical queries with primitive logical operators, e.g., negation, conjunction, and disjunction. As a test bed, we consider…
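To make the idea of logical consistency under negation, conjunction, and disjunction concrete, here is a minimal Python sketch. It is not the paper's measure, whose exact definition is not reproduced on this page. It assumes each model response has already been reduced to a boolean verdict (fact accepted or rejected), and it checks only whether the verdicts on compound queries agree with the model's own verdicts on the atomic facts. All function names and the toy verdicts are hypothetical.

```python
from typing import Iterable


def negation_consistent(p: bool, not_p: bool) -> bool:
    """The verdict on 'not p' should be the opposite of the verdict on 'p'."""
    return not_p == (not p)


def conjunction_consistent(p: bool, q: bool, p_and_q: bool) -> bool:
    """The verdict on 'p and q' should equal the AND of the atomic verdicts."""
    return p_and_q == (p and q)


def disjunction_consistent(p: bool, q: bool, p_or_q: bool) -> bool:
    """The verdict on 'p or q' should equal the OR of the atomic verdicts."""
    return p_or_q == (p or q)


def consistency_rate(checks: Iterable[bool]) -> float:
    """Fraction of checks that passed (0.0 for an empty input)."""
    checks = list(checks)
    return sum(checks) / len(checks) if checks else 0.0


# Toy verdicts from a hypothetical model (assumed data, not results from the paper).
verdicts = {
    "p": True,
    "q": False,
    "not p": True,      # inconsistent: the model accepts both p and not p
    "p and q": False,
    "p or q": True,
}

checks = [
    negation_consistent(verdicts["p"], verdicts["not p"]),
    conjunction_consistent(verdicts["p"], verdicts["q"], verdicts["p and q"]),
    disjunction_consistent(verdicts["p"], verdicts["q"], verdicts["p or q"]),
]
rate = consistency_rate(checks)
```

Note that this sketch compares the model only with itself: a model can pass every check while being wrong about the underlying facts, so consistency and accuracy have to be measured separately.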

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01 · Rating 5 · Confidence 3

Strengths

Overall, the benchmarks for checking logical consistency given a triple context and a logical query are good and can help evaluate the logical consistency of LLMs. The empirical results for zero-shot prompting and fine-tuning show improvements across benchmarks.

Weaknesses

The experiments focus on the benchmark the authors created, and it is unclear how well the results translate to real-world situations or to settings where queries are textual in nature. Would the supervised model still generalize to textual data as opposed to triple context?

Reviewer 02 · Rating 5 · Confidence 3

Strengths

1. The definition of logical consistency is strict, and the paper constructs three new datasets for fine-tuning and evaluating the consistency of LLMs. 2. The construction of the LLMQUERY seems interesting, and the authors propose many optimizations for it. 3. The experiments show that logical consistency can be improved by fine-tuning but not by prompting.

Weaknesses

1. Though being logically consistent is a desirable attribute of an LLM, we are also interested in whether it retrieves the correct answer. Logical consistency is computed only by whether the model's answers are consistent with one another, while whether the answers are true or false is neglected in the experiments. 2. The writing of this paper can be further improved. For example, Proposition 3 is trivial and somewhat redundant. There are also some mistakes, as in Line 346 ``more computationally

Reviewer 03 · Rating 6 · Confidence 3

Strengths

There are three main contributions of this paper. 1) Novel Dataset and Benchmarking: Three logical fact-checking datasets FreebaseLFC, NELLLFC, and WikiLFC, derived from knowledge graphs. 2) Evaluating LLMs on these new fact-checking datasets. The paper includes a variety of experimental setups, including comparisons of zero-shot instruction prompting and supervised fine-tuning, adding depth to the evaluation of LLMs' logical consistency. Results show that existing LLMs are not consistent when c

Weaknesses

1) While the paper presents methods for extracting relevant KG context through BFS and embedding-based retrieval, it is unclear how effective these methods are in dynamically varying real-world contexts. Expanding this section could strengthen the applicability of the approach. 2) The paper suggests fine-tuning as a solution to improve logical consistency but does not deeply discuss the associated computational overheads and limitations, which may impact the scalability of this approach for lar

Videos

Logical Consistency of Large Language Models in Fact-Checking · slideslive

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques