RealFactBench: A Benchmark for Evaluating Large Language Models in Real-World Fact-Checking

Shuo Yang; Yuqin Dai; Guoqing Wang; Xinran Zheng; Jinfeng Xu; Jinze Li; Zhenzhe Ying; Weiqiang Wang; Edith C.H. Ngai

arXiv:2506.12538·cs.CL·June 17, 2025

TL;DR

RealFactBench is a new comprehensive benchmark designed to evaluate large language models and multimodal models in real-world fact-checking scenarios, addressing limitations of existing benchmarks.

Contribution

The paper introduces a diverse, multimodal dataset and a novel Unknown Rate (UnR) metric for nuanced evaluation of how models handle uncertainty in fact-checking tasks.

Findings

01. Models show limitations in real-world fact-checking tasks.

02. The Unknown Rate metric provides deeper insights into model uncertainty.

03. Benchmark results highlight areas for future improvement.

Abstract

Large Language Models (LLMs) hold significant potential for advancing fact-checking by leveraging their capabilities in reasoning, evidence retrieval, and explanation generation. However, existing benchmarks fail to comprehensively evaluate LLMs and Multimodal Large Language Models (MLLMs) in realistic misinformation scenarios. To bridge this gap, we introduce RealFactBench, a comprehensive benchmark designed to assess the fact-checking capabilities of LLMs and MLLMs across diverse real-world tasks, including Knowledge Validation, Rumor Detection, and Event Verification. RealFactBench consists of 6K high-quality claims drawn from authoritative sources, encompassing multimodal content and diverse domains. Our evaluation framework further introduces the Unknown Rate (UnR) metric, enabling a more nuanced assessment of models' ability to handle uncertainty and balance between…
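
The abstract names the Unknown Rate (UnR) metric but this page does not give its formula. As a rough illustration only, the sketch below assumes UnR is the fraction of claims for which a model answers "unknown" instead of committing to a verdict, and pairs it with accuracy over the claims that were actually answered. The label strings and the function names `unknown_rate` and `accuracy_on_answered` are hypothetical and are not taken from the paper or its repository.

```python
# Hypothetical sketch: the paper's exact UnR definition is not shown on this page.
# Assumption: a model may answer "true", "false", or "unknown" for each claim.

UNKNOWN = "unknown"


def _normalize(label: str) -> str:
    """Lowercase and trim a label so 'Unknown ' and 'unknown' compare equal."""
    return label.strip().lower()


def unknown_rate(predictions: list[str]) -> float:
    """Share of predictions where the model abstained with 'unknown'."""
    if not predictions:
        raise ValueError("predictions must be non-empty")
    abstained = sum(_normalize(p) == UNKNOWN for p in predictions)
    return abstained / len(predictions)


def accuracy_on_answered(predictions: list[str], labels: list[str]) -> float:
    """Accuracy restricted to claims the model did not abstain on."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    answered = [
        (_normalize(p), _normalize(y))
        for p, y in zip(predictions, labels)
        if _normalize(p) != UNKNOWN
    ]
    if not answered:
        return 0.0  # the model abstained on every claim
    return sum(p == y for p, y in answered) / len(answered)


if __name__ == "__main__":
    preds = ["true", "Unknown", "false", "unknown "]
    gold = ["true", "false", "true", "false"]
    print(unknown_rate(preds))
    print(accuracy_on_answered(preds, gold))
```

Reporting the two numbers side by side is one way to see the trade-off the abstract alludes to: a model that abstains often can look accurate on the claims it does answer while leaving many claims unverified.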

Code & Models

Repositories

kalendsyang/realfactbench
Official

Datasets

kalends/RealFactBench
dataset · 107 downloads

Taxonomy

Topics: Topic Modeling · Misinformation and Its Impacts · Spam and Phishing Detection