HintsOfTruth: A Multimodal Checkworthiness Detection Dataset with Real and Synthetic Claims

Michiel van der Meer; Pavel Korshunov; Sébastien Marcel; Lonneke van der Plas

arXiv:2502.11753 · cs.AI · June 5, 2025

TL;DR

This paper introduces HintsOfTruth, a public multimodal dataset of 27K real-world and synthetic image/claim pairs for checkworthiness detection, and compares fine-tuned and prompted models, highlighting the trade-off between accuracy and computational cost.

Contribution

The paper contributes HintsOfTruth, a public multimodal dataset of real-world and synthetic image/claim pairs, together with a comparative analysis of fine-tuned and prompted detection models, including LLMs.

Findings

01

Lightweight text encoders perform comparably to multimodal models on real data.

02

Multimodal LLMs can be more accurate but come at a significant computational cost.

03

Multimodal models are more robust on synthetic data.

Abstract

Misinformation can be countered with fact-checking, but the process is costly and slow. Identifying checkworthy claims is the first step, where automation can help scale fact-checkers' efforts. However, detection methods struggle with content that is (1) multimodal, (2) from diverse domains, and (3) synthetic. We introduce HintsOfTruth, a public dataset for multimodal checkworthiness detection with 27K real-world and synthetic image/claim pairs. The mix of real and synthetic data makes this dataset unique and ideal for benchmarking detection methods. We compare fine-tuned and prompted Large Language Models (LLMs). We find that well-configured lightweight text-based encoders perform comparably to multimodal models but the former only focus on identifying non-claim-like content. Multimodal LLMs can be more accurate but come at a significant computational cost, making them impractical for…

Code & Models

Repositories

ukplab/5pils

Videos

HintsOfTruth: A Multimodal Checkworthiness Detection Dataset with Real and Synthetic Claims

Taxonomy

Topics: Software System Performance and Reliability · Software Engineering Research