CsFEVER and CTKFacts: Acquiring Czech data for fact verification

Herbert Ullrich; Jan Drchal; Martin Rýpar; Hana Vincourová; Václav Moravec

arXiv:2201.11115·cs.CL·December 19, 2023·1 cites

TL;DR

This paper presents methods for acquiring Czech fact verification data, including a Czech version of FEVER, a new Czech claims dataset, and baseline models, advancing multilingual fact-checking resources.

Contribution

The paper introduces Czech datasets for fact verification and natural language inference (NLI), built with machine translation and annotation, and provides tools and baselines for future research.

Findings

01. Published 127k Czech FEVER translations with noted weaknesses.
02. Collected and annotated 3,097 claims from Czech news sources.
03. Analyzed datasets for spurious cues and annotation errors.

Abstract

In this paper, we examine several methods of acquiring Czech data for automated fact-checking, a task commonly modeled as classifying the veracity of a textual claim with respect to a corpus of trusted ground truths. We attempt to collect data in the form of a factual claim, evidence within the ground truth corpus, and its veracity label (supported, refuted, or not enough info). As a first attempt, we generate a Czech version of the large-scale FEVER dataset built on top of the Wikipedia corpus. We take a hybrid approach of machine translation and document alignment; the approach and the tools we provide can be easily applied to other languages. We discuss its weaknesses and inaccuracies, propose a future approach for their cleaning, and publish the 127k resulting translations, as well as a version of this dataset reliably applicable for the Natural Language Inference task - the…
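
The data format described in the abstract (a claim, its evidence from the ground truth corpus, and one of three veracity labels) can be sketched as a small record type. This is only an illustration under stated assumptions: the names `Veracity`, `FactCheckExample`, and `to_nli_pair` are hypothetical and are not taken from the paper or its repository, and the sample sentences are invented rather than drawn from either dataset. The conversion to a premise/hypothesis pair reflects one common way of framing fact verification as NLI, not necessarily the authors' procedure.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Tuple


class Veracity(Enum):
    """The three labels named in the abstract."""
    SUPPORTED = "supported"
    REFUTED = "refuted"
    NOT_ENOUGH_INFO = "not enough info"


@dataclass
class FactCheckExample:
    """Hypothetical record: a claim, its evidence, and a veracity label."""
    claim: str
    evidence: List[str] = field(default_factory=list)
    label: Veracity = Veracity.NOT_ENOUGH_INFO


def to_nli_pair(example: FactCheckExample) -> Tuple[str, str, str]:
    """Assumed NLI framing: evidence is the premise, the claim is the hypothesis."""
    premise = " ".join(example.evidence)
    return premise, example.claim, example.label.value


# Invented sample for illustration only; not a record from CsFEVER or CTKFacts.
example = FactCheckExample(
    claim="Praha je hlavní město České republiky.",
    evidence=["Praha je hlavní a největší město České republiky."],
    label=Veracity.SUPPORTED,
)
premise, hypothesis, label = to_nli_pair(example)
```

Defaulting the label to `NOT_ENOUGH_INFO` with an empty evidence list is a design choice of this sketch: a claim with no retrieved evidence cannot be supported or refuted.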

Code & Models

Repositories

aic-factcheck/csfever-and-ctkfacts-paper (PyTorch, official)

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification