A Dataset and Strong Baselines for Classification of Czech News Texts

Hynek Kydlíček; Jindřich Libovický

arXiv:2307.10666 · cs.CL · July 21, 2023

TL;DR

This paper introduces CZE-NEC, a large Czech news classification dataset with four tasks (news source, news category, inferred author's gender, and day of the week), and shows that language-specific pre-trained encoders outperform large-scale generative language models on them.

Contribution

The paper presents CZE-NEC, a comprehensive Czech news dataset with multiple classification tasks, and provides strong baseline models showing the superiority of language-specific pre-trained encoders.

Findings

1. Human performance is lower than machine baselines.
2. Language-specific models outperform large-scale generative models.
3. The dataset enables rigorous evaluation of Czech NLP models.

Abstract

Pre-trained models for Czech Natural Language Processing are often evaluated on purely linguistic tasks (POS tagging, parsing, NER) and relatively simple classification tasks such as sentiment classification or article classification from a single news source. As an alternative, we present CZEch NEws Classification dataset (CZE-NEC), one of the largest Czech classification datasets, composed of news articles from various sources spanning over twenty years, which allows a more rigorous evaluation of such models. We define four classification tasks: news source, news category, inferred author's gender, and day of the week. To verify the task difficulty, we conducted a human evaluation, which revealed that human performance lags behind strong machine-learning baselines built upon pre-trained transformer models. Furthermore, we show that language-specific pre-trained encoder analysis…

Code & Models

Repositories

hynky1999/czech-news-classification-dataset (official)

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling