HintsOfTruth: A Multimodal Checkworthiness Detection Dataset with Real and Synthetic Claims
Michiel van der Meer, Pavel Korshunov, Sébastien Marcel, Lonneke van der Plas

TL;DR
This paper introduces HintsOfTruth, a multimodal dataset of real and synthetic claims for checkworthiness detection, and evaluates fine-tuned and prompted models, highlighting the trade-off between accuracy and computational cost.
Contribution
HintsOfTruth, a public multimodal dataset of 27K real-world and synthetic image/claim pairs, and a comparative analysis of detection methods, including fine-tuned and prompted LLMs.
Findings
Lightweight text encoders perform comparably to multimodal models on real data.
Multimodal LLMs can be more accurate but come at a significant computational cost.
Multimodal models are more robust with synthetic data.
Abstract
Misinformation can be countered with fact-checking, but the process is costly and slow. Identifying checkworthy claims is the first step, where automation can help scale fact-checkers' efforts. However, detection methods struggle with content that is (1) multimodal, (2) from diverse domains, and (3) synthetic. We introduce HintsOfTruth, a public dataset for multimodal checkworthiness detection with 27K real-world and synthetic image/claim pairs. The mix of real and synthetic data makes this dataset unique and ideal for benchmarking detection methods. We compare fine-tuned and prompted Large Language Models (LLMs). We find that well-configured lightweight text-based encoders perform comparably to multimodal models but the former only focus on identifying non-claim-like content. Multimodal LLMs can be more accurate but come at a significant computational cost, making them impractical for…
Taxonomy
Topics: Software System Performance and Reliability · Software Engineering Research
