TL;DR
This paper introduces a realistic political claim dataset for fact checking, analyzes LLM performance drops in this setting, and proposes RAV, an agentic framework that improves verification accuracy across domains.
Contribution
It presents PFO, a more realistic benchmark dataset for fact checking, and introduces RAV, a novel agent-based system that enhances verification performance across multiple domains.
Findings
LLM performance drops by an average of 22% in macro F1 on PFO compared to PFO's unfiltered version.
RAV outperforms state-of-the-art methods on multiple benchmarks.
RAV shows a minimal performance drop of 16.3% in macro F1 when using filtered data.
Abstract
Automated fact checking with large language models (LLMs) offers a scalable alternative to manual verification. Evaluating fact checking is challenging because existing benchmark datasets often include post-claim analysis and annotator cues, which are absent in real-world scenarios where claims are fact checked immediately after being made. This limits the realism of current evaluations. We present PolitiFact Only (PFO), a 5-class benchmark dataset of 2,982 political claims from politifact.com, where all post-claim analysis and annotator cues have been removed manually. This ensures that models are evaluated using only the information that would have been available prior to the claim's verification. Evaluating LLMs on PFO, we see an average performance drop of 22% in terms of macro F1 compared to PFO's unfiltered version. Based on the identified challenges of the existing LLM-based fact…
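Since the headline numbers are reported in macro F1 over five classes, a minimal sketch of that metric may help in reading them. Everything below is illustrative and assumed: the integer class ids, the toy predictions, and the helper name `macro_f1` are not taken from the paper, and the summary does not say whether the reported drops are absolute or relative, so the sketch prints both.

```python
from collections import Counter


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 scores (hypothetical helper)."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[t] += 1  # true class t was missed
    scores = []
    for c in labels:
        denom = 2 * tp[c] + fp[c] + fn[c]
        # A class with no true or predicted instances scores 0 here.
        scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(scores) / len(labels)


# Toy data: five hypothetical class ids, not the dataset's actual labels.
labels = [0, 1, 2, 3, 4]
y_true = [0, 1, 2, 3, 4, 0, 1, 2, 3, 4]
pred_unfiltered = [0, 1, 2, 3, 4, 0, 1, 2, 3, 3]  # made-up predictions
pred_filtered = [0, 1, 2, 3, 3, 1, 1, 2, 4, 3]    # made-up predictions

f1_unfiltered = macro_f1(y_true, pred_unfiltered, labels)
f1_filtered = macro_f1(y_true, pred_filtered, labels)

absolute_drop = f1_unfiltered - f1_filtered
relative_drop = absolute_drop / f1_unfiltered

print(f"macro F1 unfiltered: {f1_unfiltered:.3f}")
print(f"macro F1 filtered:   {f1_filtered:.3f}")
print(f"absolute drop: {absolute_drop:.3f}, relative drop: {relative_drop:.1%}")
```

Because macro F1 averages the per-class scores without weighting by class frequency, a model that fails on one rare class is penalized as heavily as one that fails on a common class, which matters on a multi-class veracity scale.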