Verifying the Verifiers: Unveiling Pitfalls and Potentials in Fact Verifiers

Wooseok Seo; Seungju Han; Jaehun Jung; Benjamin Newman; Seungwon Lim; Seungbeen Lee; Ximing Lu; Yejin Choi; Youngjae Yu

arXiv:2506.13342·cs.AI·February 6, 2026


TL;DR

This paper evaluates 12 pre-trained LLMs and one specialized fact verifier on examples drawn from 14 fact-checking benchmarks. It highlights annotation errors and ambiguity in the datasets, the strength of frontier LLMs given few-shot in-context examples, and the potential of smaller models fine-tuned on synthetic data.

Contribution

It uncovers annotation errors and ambiguity in fact-checking datasets, shows that frontier LLMs with few-shot in-context examples are strong fact verifiers, and proposes training smaller models on synthetic data to strengthen their reasoning.

Findings

01. Approximately 16% of the benchmark labels are ambiguous or incorrect, and this affects model rankings.

02. Frontier LLMs with few-shot in-context examples outperform many baselines.

03. Synthetic multi-hop reasoning data improves the performance of small models.
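The first finding can be illustrated with a minimal sketch of how a model ranking may change once examples flagged as ambiguous or mislabeled are excluded. Everything here is hypothetical and is not taken from the paper: the model names, the `supported`/`unsupported` label strings, the toy predictions, and the helper functions `accuracy` and `rank_models`. The flagged set stands in for whatever an annotation audit would produce.

```python
from typing import Dict, List, Set


def accuracy(preds: List[str], labels: List[str], keep: Set[int]) -> float:
    """Fraction of kept examples where the prediction matches the gold label."""
    correct = sum(1 for i in keep if preds[i] == labels[i])
    return correct / len(keep)


def rank_models(
    preds_by_model: Dict[str, List[str]], labels: List[str], keep: Set[int]
) -> List[str]:
    """Model names sorted by accuracy on `keep` (best first, ties by name)."""
    scores = {
        name: accuracy(preds, labels, keep)
        for name, preds in preds_by_model.items()
    }
    return sorted(scores, key=lambda name: (-scores[name], name))


# Hypothetical gold labels for six examples; S = supported, U = unsupported.
S, U = "supported", "unsupported"
labels = [S, S, U, U, S, U]

# Hypothetical audit result: examples 4 and 5 have unreliable gold labels.
flagged = {4, 5}

# Hypothetical predictions from two made-up models.
preds_by_model = {
    "model_a": [S, S, U, U, U, S],  # matches every clean label, disagrees on flagged ones
    "model_b": [S, U, U, U, S, U],  # misses one clean label, matches the flagged ones
}

all_idx = set(range(len(labels)))
clean_idx = all_idx - flagged

ranking_all = rank_models(preds_by_model, labels, all_idx)
ranking_clean = rank_models(preds_by_model, labels, clean_idx)

print("ranking on all examples:  ", ranking_all)
print("ranking on clean examples:", ranking_clean)
```

In this toy setup the two models swap places once the flagged examples are dropped, which is the kind of ranking sensitivity the finding describes. Real audits would of course involve far more examples and models.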

Abstract

Fact verification is essential for ensuring the reliability of LLM applications. In this study, we evaluate 12 pre-trained LLMs and one specialized fact-verifier, including frontier LLMs and open-weight reasoning LLMs, using a collection of examples from 14 fact-checking benchmarks. We share three findings intended to guide future development of more robust fact verifiers. First, we highlight the importance of addressing annotation errors and ambiguity in datasets, demonstrating that the approximately 16% of data that is ambiguous or incorrectly labeled substantially influences model rankings. Neglecting this issue may result in misleading conclusions during comparative evaluations, and we suggest using a systematic pipeline utilizing LLM-as-a-judge to help identify these issues at scale. Second, we discover that frontier LLMs with few-shot in-context examples, often overlooked in previous works,…

Code & Models

Repositories

just1nseo/verifying-the-verifiers
Official

Models

just1nseo/ClearCheck-8B
model · 25 downloads · 3 likes

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling