A Dataset and Strong Baselines for Classification of Czech News Texts
Hynek Kydlíček, Jindřich Libovický

TL;DR
This paper introduces CZE-NEC, a large Czech news classification dataset with four tasks, and shows that language-specific pre-trained models outperform general large-scale language models on these tasks.
Contribution
The paper presents CZE-NEC, a comprehensive Czech news dataset with multiple classification tasks, and provides strong baseline models showing the superiority of language-specific pre-trained encoders.
Findings
Human performance lags behind strong machine-learning baselines built on pre-trained transformer models.
Language-specific models outperform large-scale generative models.
The dataset enables rigorous evaluation of Czech NLP models.
Abstract
Pre-trained models for Czech Natural Language Processing are often evaluated on purely linguistic tasks (POS tagging, parsing, NER) and relatively simple classification tasks such as sentiment classification or article classification from a single news source. As an alternative, we present CZEch NEws Classification dataset (CZE-NEC), one of the largest Czech classification datasets, composed of news articles from various sources spanning over twenty years, which allows a more rigorous evaluation of such models. We define four classification tasks: news source, news category, inferred author's gender, and day of the week. To verify the task difficulty, we conducted a human evaluation, which revealed that human performance lags behind strong machine-learning baselines built upon pre-trained transformer models. Furthermore, we show that language-specific pre-trained encoder analysis…
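To make the four tasks in the abstract concrete, the following is a minimal sketch of how one article record could be mapped to one label per task. The `Article` class, its field names, and the example values are hypothetical and are not taken from the dataset's actual schema; the only grounded parts are the four task names and the fact that a day-of-week label can be derived from a publication date.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, Optional

# Day names indexed by date.weekday(), where Monday is 0.
DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]


@dataclass
class Article:
    """Hypothetical record layout; the real dataset's fields may differ."""
    text: str
    source: str                   # news source label
    category: Optional[str]       # news category, may be missing
    author_gender: Optional[str]  # inferred author's gender, may be missing
    published: date               # publication date


def task_labels(article: Article) -> Dict[str, Optional[str]]:
    """Return one label per task; None marks a task with no label."""
    return {
        "source": article.source,
        "category": article.category,
        "author_gender": article.author_gender,
        "day_of_week": DAYS[article.published.weekday()],
    }


# Illustrative values only, not drawn from the dataset.
example = Article(
    text="Ukázkový text článku.",
    source="example-source",
    category="sport",
    author_gender=None,
    published=date(2024, 1, 1),
)
labels = task_labels(example)
```

Under these assumptions, an article with a missing label for one task can still be used for the other three, since each task reads its own field independently.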
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
