Claim Matching Beyond English to Scale Global Fact-Checking
Ashkan Kazemi, Kiran Garimella, Devin Gaffney, Scott A. Hale

TL;DR
This paper introduces a multilingual claim matching approach to scale fact-checking across languages, using a novel dataset and a custom embedding model that outperforms existing multilingual models.
Contribution
The paper presents a new multilingual dataset for claim matching, a custom embedding model trained with knowledge distillation, and demonstrates improved performance over LASER and LaBSE.
Findings
The authors' model outperforms LASER and LaBSE on claim matching.
The dataset covers both high-resource and lower-resource languages.
The authors release the datasets, code, and models for future research.
Abstract
Manual fact-checking does not scale well to serve the needs of the internet. This issue is further compounded in non-English contexts. In this paper, we discuss claim matching as a possible solution to scale fact-checking. We define claim matching as the task of identifying pairs of textual messages containing claims that can be served with one fact-check. We construct a novel dataset of WhatsApp tipline and public group messages alongside fact-checked claims that are first annotated for containing "claim-like statements" and then matched with potentially similar items and annotated for claim matching. Our dataset contains content in high-resource (English, Hindi) and lower-resource (Bengali, Malayalam, Tamil) languages. We train our own embedding model using knowledge distillation and a high-quality "teacher" model in order to address the imbalance in embedding quality between the low-…
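To make the task definition concrete, here is a minimal sketch of claim matching over sentence embeddings. It assumes that each message has already been encoded as a vector by some multilingual embedding model, and that two messages count as a match when their cosine similarity meets a threshold. The function names `cosine` and `match_claims`, the toy vectors, and the threshold of 0.9 are all hypothetical choices for illustration; they are not taken from the paper.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def match_claims(embeddings, threshold=0.9):
    """Return index pairs (i, j), i < j, whose embeddings are at least `threshold` similar.

    Assumption: a pair above the threshold could be served by one fact-check.
    The threshold value is illustrative, not a value reported by the authors.
    """
    pairs = []
    for (i, u), (j, v) in combinations(enumerate(embeddings), 2):
        if cosine(u, v) >= threshold:
            pairs.append((i, j))
    return pairs


# Toy 3-dimensional "embeddings": messages 0 and 1 point in nearly the same
# direction, while message 2 is orthogonal to message 0.
toy_embeddings = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
]
matches = match_claims(toy_embeddings, threshold=0.9)
```

In practice the vectors would come from an embedding model, and the pairwise loop would likely be replaced by an approximate nearest-neighbor index, since comparing every pair grows quadratically with the number of messages.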
Taxonomy
Methods: Knowledge Distillation
