TL;DR
This paper introduces MultiSynFact, a large-scale multilingual fact-checking dataset generated using LLMs, external knowledge, and validation, to improve automatic fact-checking across multiple languages including low-resource ones.
Contribution
The paper presents the first large-scale multilingual fact-checking dataset, MultiSynFact, created with an LLM-based pipeline and validation, supporting research in multilingual fact-checking.
Findings
MultiSynFact contains 2.2 million claim-source pairs.
The dataset improves fact-checking performance across languages.
The authors open-source a user-friendly framework to facilitate further research in multilingual fact-checking and dataset generation.
Abstract
Robust automatic fact-checking systems have the potential to combat online misinformation at scale. However, most existing research primarily focuses on English. In this paper, we introduce MultiSynFact, the first large-scale multilingual fact-checking dataset containing 2.2M claim-source pairs designed to support Spanish, German, English, and other low-resource languages. Our dataset generation pipeline leverages Large Language Models (LLMs), integrating external knowledge from Wikipedia and incorporating rigorous claim validation steps to ensure data quality. We evaluate the effectiveness of MultiSynFact across multiple models and experimental settings. Additionally, we open-source a user-friendly framework to facilitate further research in multilingual fact-checking and dataset generation.
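To make the "claim validation" idea in the abstract concrete, here is a minimal sketch of what a claim-source record and a simple validation filter could look like. Everything in it is an assumption for illustration: the `ClaimSourcePair` fields, the label set, and the three checks (non-empty text, known label, claim not copied verbatim from the source, plus deduplication) are hypothetical and are not taken from the MultiSynFact pipeline or its released framework.

```python
from dataclasses import dataclass

# Hypothetical label set; the abstract does not state the actual labels.
LABELS = {"supports", "refutes", "not_enough_info"}


@dataclass(frozen=True)
class ClaimSourcePair:
    """Illustrative record; field names are assumptions, not the dataset schema."""
    claim: str
    source: str    # e.g. a Wikipedia passage used as external knowledge
    label: str
    language: str  # e.g. "es", "de", "en"


def is_valid(pair: ClaimSourcePair) -> bool:
    """Apply simple, assumed sanity checks to a single pair."""
    claim = pair.claim.strip()
    source = pair.source.strip()
    if not claim or not source:
        return False  # empty claim or source
    if pair.label not in LABELS:
        return False  # unknown label
    if claim.lower() in source.lower():
        return False  # claim is a verbatim copy of part of the source
    return True


def validate_pairs(pairs: list[ClaimSourcePair]) -> list[ClaimSourcePair]:
    """Keep valid pairs, dropping duplicate claims within the same language."""
    seen: set[tuple[str, str]] = set()
    kept: list[ClaimSourcePair] = []
    for pair in pairs:
        key = (pair.language, pair.claim.strip().lower())
        if is_valid(pair) and key not in seen:
            seen.add(key)
            kept.append(pair)
    return kept
```

A real pipeline would likely add model-based checks (for example, verifying that the label actually follows from the source); the rule-based filters above only illustrate where such a validation step sits between generation and the final dataset.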
