MultiCW: A Large-Scale Balanced Benchmark Dataset for Training Robust Check-Worthiness Detection Models

Martin Hyben; Sebastian Kula; Jan Cegin; Jakub Simko; Ivan Srba; Robert Moro

arXiv:2602.16298 · cs.CL · February 19, 2026

TL;DR

The paper introduces MultiCW, a large, balanced multilingual dataset for check-worthy claim detection, and benchmarks various models to evaluate their robustness across languages, domains, and styles.

Contribution

It provides a comprehensive, multilingual benchmark dataset for check-worthy claim detection and evaluates model performance, highlighting the strengths of fine-tuned models over zero-shot LLMs.

Findings

01. Fine-tuned models outperform zero-shot LLMs in claim classification.

02. Models show strong generalization across languages, domains, and styles.

03. MultiCW enables systematic comparison of models for fact-checking tasks.

Abstract

Large Language Models (LLMs) are beginning to reshape how media professionals verify information, yet automated support for detecting check-worthy claims, a key step in the fact-checking process, remains limited. We introduce the Multi-Check-Worthy (MultiCW) dataset, a balanced multilingual benchmark for check-worthy claim detection spanning 16 languages, 7 topical domains, and 2 writing styles. It consists of 123,722 samples, evenly distributed between noisy (informal) and structured (formal) texts, with balanced representation of check-worthy and non-check-worthy classes across all languages. To probe robustness, we also introduce an equally balanced out-of-distribution evaluation set of 27,761 samples in 4 additional languages. To provide baselines, we benchmark 3 common fine-tuned multilingual transformers against a diverse set of 15 commercial and open LLMs under zero-shot settings.…
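
The balance property described in the abstract (equal shares of check-worthy and non-check-worthy samples within each language and writing style) can be checked mechanically. The following is a minimal sketch, not the authors' tooling: the record fields `language`, `style`, and `label`, the helper names `balance_report` and `is_balanced`, and the toy records are all assumptions made for illustration and do not reflect the actual MultiCW file format.

```python
from collections import Counter


def balance_report(samples, keys=("language", "style")):
    """Count labels within each group defined by `keys`.

    `samples` is an iterable of dicts with the hypothetical fields
    'language', 'style', and 'label' (1 = check-worthy, 0 = not).
    Returns {group_tuple: {0: count, 1: count}}.
    """
    counts = Counter()
    for sample in samples:
        group = tuple(sample[k] for k in keys)
        counts[(group, sample["label"])] += 1

    report = {}
    for (group, label), n in counts.items():
        report.setdefault(group, {0: 0, 1: 0})[label] = n
    return report


def is_balanced(report, tolerance=0.0):
    """True if the share of label 1 in every group is within
    `tolerance` of 0.5."""
    for per_label in report.values():
        total = per_label[0] + per_label[1]
        if total == 0 or abs(per_label[1] / total - 0.5) > tolerance:
            return False
    return True


# Toy records invented for illustration only (not MultiCW samples).
toy = [
    {"language": "en", "style": "noisy", "label": 1},
    {"language": "en", "style": "noisy", "label": 0},
    {"language": "sk", "style": "structured", "label": 1},
    {"language": "sk", "style": "structured", "label": 0},
]

report = balance_report(toy)
print(is_balanced(report))
```

Passing a nonzero `tolerance` allows for groups that are only approximately balanced, which may be the more realistic check for a dataset of this size.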

Taxonomy

Topics: Misinformation and Its Impacts · Computational and Text Analysis Methods · Hate Speech and Cyberbullying Detection