RealFactBench: A Benchmark for Evaluating Large Language Models in Real-World Fact-Checking
Shuo Yang, Yuqin Dai, Guoqing Wang, Xinran Zheng, Jinfeng Xu, Jinze Li, Zhenzhe Ying, Weiqiang Wang, Edith C.H. Ngai

TL;DR
RealFactBench is a new comprehensive benchmark designed to evaluate large language models and multimodal models in real-world fact-checking scenarios, addressing limitations of existing benchmarks.
Contribution
It introduces a diverse, multimodal dataset and a novel Unknown Rate metric for nuanced evaluation of models' uncertainty handling in fact-checking tasks.
Findings
The evaluated LLMs and MLLMs show limitations in real-world fact-checking tasks.
The Unknown Rate metric provides deeper insights into model uncertainty.
Benchmark results highlight areas for future improvement.
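This summary does not define the Unknown Rate metric. As a minimal sketch only, assuming it is the share of claims on which a model answers "unknown" rather than giving a verdict, it could be computed as below. The function name `unknown_rate`, the label strings, and the sample predictions are hypothetical and are not taken from the paper.

```python
from collections import Counter


def unknown_rate(predictions, unknown_label="unknown"):
    """Fraction of predictions equal to the 'unknown' label.

    Assumption: each prediction is a string verdict such as
    "true", "false", or "unknown". The paper's actual definition
    may differ from this simple ratio.
    """
    if not predictions:
        raise ValueError("predictions must be non-empty")
    # Normalize case and whitespace so "Unknown " and "unknown" match.
    counts = Counter(p.strip().lower() for p in predictions)
    return counts[unknown_label] / len(predictions)


# Hypothetical model outputs for five claims.
sample_predictions = ["true", "false", "Unknown", "true", "unknown"]
rate = unknown_rate(sample_predictions)
print(f"Unknown rate: {rate:.2f}")
```

Under this assumed definition, a very low rate could mean the model guesses even when evidence is missing, while a very high rate could mean it abstains too often, so the number is most useful when read next to accuracy.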
Abstract
Large Language Models (LLMs) hold significant potential for advancing fact-checking by leveraging their capabilities in reasoning, evidence retrieval, and explanation generation. However, existing benchmarks fail to comprehensively evaluate LLMs and Multimodal Large Language Models (MLLMs) in realistic misinformation scenarios. To bridge this gap, we introduce RealFactBench, a comprehensive benchmark designed to assess the fact-checking capabilities of LLMs and MLLMs across diverse real-world tasks, including Knowledge Validation, Rumor Detection, and Event Verification. RealFactBench consists of 6K high-quality claims drawn from authoritative sources, encompassing multimodal content and diverse domains. Our evaluation framework further introduces the Unknown Rate (UnR) metric, enabling a more nuanced assessment of models' ability to handle uncertainty and balance between…
Taxonomy
Topics: Topic Modeling · Misinformation and Its Impacts · Spam and Phishing Detection
