AlbNews: A Corpus of Headlines for Topic Modeling in Albanian
Erion Çano, Dario Lamaj

TL;DR
This paper presents AlbNews, a new Albanian news headlines corpus with labeled topics, enabling NLP research in low-resource languages, and provides baseline classification results for future work.
Contribution
It introduces AlbNews, a novel Albanian news headline dataset with topic labels, addressing resource scarcity and facilitating NLP research in Albanian.
Findings
Basic classifiers outperform ensemble models on AlbNews.
Baseline classification scores demonstrate the dataset's utility.
AlbNews is freely available for research use.
Abstract
The scarcity of available text corpora for low-resource languages like Albanian is a serious hurdle for research in natural language processing tasks. This paper introduces AlbNews, a collection of 600 topically labeled news headlines and 2600 unlabeled ones in Albanian. The data can be freely used for conducting topic modeling research. We report the initial classification scores of some traditional machine learning classifiers trained with the AlbNews samples. The results show that basic models outperform ensemble learning ones and can serve as a baseline for future experiments.
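The abstract does not name the classifiers or features used, so the following is only a minimal sketch of what a "basic" headline topic classifier could look like, assuming bag-of-words counts and multinomial Naive Bayes with add-one smoothing. The `NaiveBayes` class, the `tokenize` helper, the two topic labels, and the English placeholder headlines are all hypothetical and are not taken from AlbNews or from the paper's experiments.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)      # documents per class
        self.word_counts = defaultdict(Counter)  # word frequencies per class
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        self.total = len(labels)
        return self

    def predict(self, text):
        scores = {}
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / self.total)  # log prior
            for word in tokenize(text):
                if word in self.vocab:  # skip words never seen in training
                    score += math.log((counts[word] + 1) / denom)
            scores[label] = score
        return max(scores, key=scores.get)


# Hypothetical placeholder data, NOT taken from AlbNews.
train_texts = [
    "parliament votes on new election law",
    "minister resigns after budget vote",
    "team wins championship final match",
    "striker scores twice in league match",
]
train_labels = ["politics", "politics", "sport", "sport"]

model = NaiveBayes().fit(train_texts, train_labels)

test_texts = ["parliament debates budget law", "team scores in final"]
test_labels = ["politics", "sport"]
predictions = [model.predict(t) for t in test_texts]
accuracy = sum(p == y for p, y in zip(predictions, test_labels)) / len(test_labels)
print(predictions, accuracy)
```

With the real corpus, the placeholder lists would be replaced by the 600 labeled headlines split into training and test portions, and the same accuracy computation would give a score comparable in spirit to the baselines the abstract mentions.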
Taxonomy
Topics: Computational and Text Analysis Methods · Big Data Technologies and Applications
