Towards Comprehensive Stage-wise Benchmarking of Large Language Models in Fact-Checking

Hongzhan Lin; Zixin Chen; Zhiqi Shen; Ziyang Luo; Zhen Ye; Jing Ma; Tat-Seng Chua; Guandong Xu

arXiv:2601.02669 · cs.CL · January 7, 2026

Open Access

TL;DR

This paper introduces FactArena, a comprehensive evaluation framework for large language models that assesses their performance across the entire fact-checking process, revealing systematic weaknesses and guiding future improvements.

Contribution

The paper presents a novel, fully automated, stage-wise benchmarking framework that evaluates LLMs on claim decomposition, evidence retrieval, and justification, addressing limitations of existing single-stage benchmarks.
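To make the idea of stage-wise, arena-style ranking concrete, here is a minimal sketch that ranks models separately at each of the three stages named above. It rests on assumptions not taken from the paper: the `judge` callable, the toy scores, the model names, and the plain win-rate aggregation are all hypothetical stand-ins, since this summary does not say how FactArena aggregates its judgments.

```python
from collections import defaultdict
from itertools import combinations

# Stage names follow the summary above; everything else is illustrative.
STAGES = ("claim_decomposition", "evidence_retrieval", "justification")


def stage_win_rates(models, judge):
    """Return {stage: {model: win_rate}} from round-robin pairwise comparisons.

    `judge(stage, a, b)` is a hypothetical callable that returns the name of
    the preferred model, or None for a tie (counted as half a win each).
    """
    rates = {}
    for stage in STAGES:
        wins = defaultdict(float)
        games = defaultdict(int)
        for a, b in combinations(models, 2):
            winner = judge(stage, a, b)
            games[a] += 1
            games[b] += 1
            if winner is None:
                wins[a] += 0.5
                wins[b] += 0.5
            else:
                wins[winner] += 1.0
        rates[stage] = {
            m: (wins[m] / games[m] if games[m] else 0.0) for m in models
        }
    return rates


def rank(rates, stage):
    """Order models by win rate (descending), breaking ties by name."""
    return sorted(rates[stage], key=lambda m: (-rates[stage][m], m))


# Toy per-stage quality scores (made up) standing in for a real judge.
toy_scores = {
    "claim_decomposition": {"model_a": 0.9, "model_b": 0.6, "model_c": 0.6},
    "evidence_retrieval": {"model_a": 0.2, "model_b": 0.8, "model_c": 0.5},
    "justification": {"model_a": 0.7, "model_b": 0.7, "model_c": 0.7},
}


def toy_judge(stage, a, b):
    """Prefer the model with the higher toy score; equal scores are a tie."""
    score_a, score_b = toy_scores[stage][a], toy_scores[stage][b]
    if score_a == score_b:
        return None
    return a if score_a > score_b else b
```

Calling `rank(stage_win_rates(["model_a", "model_b", "model_c"], toy_judge), stage)` for each stage gives a different ordering per stage in this toy data, which is the kind of discrepancy a single-stage benchmark would not surface.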

Findings

1. Claim verification accuracy and end-to-end fact-checking performance show significant discrepancies.

2. FactArena provides stable, interpretable rankings across 16 state-of-the-art LLMs.

3. Adaptive claim evolution probes models' factual robustness beyond fixed datasets.

Abstract

Large Language Models (LLMs) are increasingly deployed in real-world fact-checking systems, yet existing evaluations focus predominantly on claim verification and overlook the broader fact-checking workflow, including claim extraction and evidence retrieval. This narrow focus prevents current benchmarks from revealing systematic reasoning failures, factual blind spots, and robustness limitations of modern LLMs. To bridge this gap, we present FactArena, a fully automated arena-style evaluation framework that conducts comprehensive, stage-wise benchmarking of LLMs across the complete fact-checking pipeline. FactArena integrates three key components: (i) an LLM-driven fact-checking process that standardizes claim decomposition, evidence retrieval via tool-augmented interactions, and justification-based verdict prediction; (ii) an arena-styled judgment mechanism guided by consolidated…

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Artificial Intelligence in Healthcare and Education · Topic Modeling