CsFEVER and CTKFacts: Acquiring Czech data for fact verification
Herbert Ullrich, Jan Drchal, Martin Rýpar, Hana Vincourová, Václav Moravec

TL;DR
This paper presents methods for acquiring Czech fact verification data, including a Czech version of FEVER, a new Czech claims dataset, and baseline models, advancing multilingual fact-checking resources.
Contribution
It introduces Czech datasets for fact verification and NLI, utilizing machine translation and annotation techniques, with tools and baselines for future research.
Findings
Published 127k Czech FEVER translations with noted weaknesses.
Collected and annotated 3,097 claims from Czech news sources.
Analyzed datasets for spurious cues and annotation errors.
Abstract
In this paper, we examine several methods of acquiring Czech data for automated fact-checking, a task commonly modeled as classification of textual claim veracity w.r.t. a corpus of trusted ground truths. We attempt to collect data in the form of a factual claim, evidence within the ground truth corpus, and its veracity label (supported, refuted or not enough info). As a first attempt, we generate a Czech version of the large-scale FEVER dataset built on top of the Wikipedia corpus. We take a hybrid approach of machine translation and document alignment; the approach and the tools we provide can be easily applied to other languages. We discuss its weaknesses and inaccuracies, propose a future approach for their cleaning and publish the 127k resulting translations, as well as a version of this dataset reliably applicable for the Natural Language Inference task - the…
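The data format the abstract describes (a claim, its evidence from the ground truth corpus, and one of three veracity labels) can be sketched as a small record type. This is a minimal illustration only: the names `Veracity`, `FactCheckExample`, and `label_counts` are hypothetical and do not come from the paper or its released tooling, and the sample sentences are invented placeholders rather than dataset entries.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Iterable, Tuple


class Veracity(Enum):
    """The three labels named in the abstract."""
    SUPPORTED = "supported"
    REFUTED = "refuted"
    NOT_ENOUGH_INFO = "not enough info"


@dataclass(frozen=True)
class FactCheckExample:
    """One data point: a claim, its evidence, and a veracity label.

    Field names are assumptions made for this sketch.
    """
    claim: str
    evidence: Tuple[str, ...]  # passages taken from the trusted corpus
    label: Veracity


def label_counts(examples: Iterable[FactCheckExample]) -> Dict[Veracity, int]:
    """Count how many examples carry each label (a basic balance check)."""
    counts = {label: 0 for label in Veracity}
    for example in examples:
        counts[example.label] += 1
    return counts


# Invented placeholder examples, not taken from CsFEVER or CTKFacts.
examples = [
    FactCheckExample(
        claim="Prague is the capital of the Czech Republic.",
        evidence=("Prague is the capital and largest city of the Czech Republic.",),
        label=Veracity.SUPPORTED,
    ),
    FactCheckExample(
        claim="Brno is the capital of the Czech Republic.",
        evidence=("Prague is the capital and largest city of the Czech Republic.",),
        label=Veracity.REFUTED,
    ),
    FactCheckExample(
        claim="The mayor of Brno owns three bicycles.",
        evidence=(),  # no usable evidence found in the corpus
        label=Veracity.NOT_ENOUGH_INFO,
    ),
]

counts = label_counts(examples)
```

Keeping the evidence as a tuple of passages leaves room for claims that need several passages and for "not enough info" claims that have none.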
Code & Models
- 🤗 ctu-aic/xlm-roberta-large-squad2-ctkfacts · model · 2 downloads
- 🤗 ctu-aic/xlm-roberta-large-xnli-csfever · model · 4 downloads
- 🤗 ctu-aic/xlm-roberta-large-squad2-ctkfacts_nli · model · 1 download
- 🤗 ctu-aic/xlm-roberta-large-xnli-ctkfacts_nli · model · 2 downloads
- 🤗 ctu-aic/xlm-roberta-large-squad2-enfever_nli · model · 2 downloads
- 🤗 ctu-aic/xlm-roberta-large-xnli-enfever_nli · model · 2 downloads
- 🤗 ctu-aic/bert-base-multilingual-cased-csfever_nearestp · model · 1 download
- 🤗 ctu-aic/xlm-roberta-large-squad2-csfever_nearestp · model · 1 download
- 🤗 ctu-aic/xlm-roberta-large-xnli-csfever_nli · model · 2 downloads
- 🤗 ctu-aic/xlm-roberta-large-squad2-csfever_nli · model · 2 downloads
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification
