SlovKE: A Large-Scale Dataset and LLM Evaluation for Slovak Keyphrase Extraction
David Števaňák, Marek Šuppa

TL;DR
This paper introduces SlovKE, a large-scale Slovak dataset for keyphrase extraction, benchmarks three unsupervised baselines alongside the LLM-based KeyLLM (using GPT-3.5-turbo), and highlights the challenges of morphological inflection in low-resource languages.
Contribution
It provides the first large Slovak keyphrase dataset, benchmarks existing methods, and evaluates LLM-based extraction, revealing the impact of morphological complexity.
Findings
Unsupervised methods achieve up to 11.6% exact-match F1@6.
KeyLLM improves partial matching and captures relevant concepts.
Morphological mismatch is a major challenge for statistical methods.
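The gap between exact and partial matching can be illustrated with a minimal sketch. The paper's actual matching rules are not given in this summary, so the functions below are assumptions: `exact_match` compares lowercased strings, and `partial_match` is a hypothetical heuristic that accepts two phrases when any pair of their words shares a 4-character prefix, a rough stand-in for matching inflected forms. The example phrases are invented for illustration and are not taken from the dataset.

```python
def exact_match(pred, gold):
    """Phrases match only if identical after lowercasing and trimming."""
    return pred.strip().lower() == gold.strip().lower()


def partial_match(pred, gold, prefix_len=4):
    """Hypothetical heuristic (not the paper's definition): phrases match
    if any word pair shares a prefix of at least `prefix_len` characters."""
    pred_tokens = pred.lower().split()
    gold_tokens = gold.lower().split()
    return any(
        p[:prefix_len] == g[:prefix_len]
        for p in pred_tokens
        for g in gold_tokens
        if len(p) >= prefix_len and len(g) >= prefix_len
    )


def f1_at_k(predicted, gold, k=6, match=exact_match):
    """F1 between the top-k predicted keyphrases and the gold keyphrases."""
    top_k = predicted[:k]
    if not top_k or not gold:
        return 0.0
    # A prediction is correct if it matches any gold phrase.
    hits_pred = sum(any(match(p, g) for g in gold) for p in top_k)
    # A gold phrase is recovered if any prediction matches it.
    hits_gold = sum(any(match(p, g) for p in top_k) for g in gold)
    precision = hits_pred / len(top_k)
    recall = hits_gold / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Illustrative data: the first prediction is an inflected (genitive plural)
# form of the first gold phrase, so it fails exact matching.
gold = ["neurónová sieť", "strojové učenie"]
predicted = ["neurónových sietí", "strojové učenie", "dataset"]

print(round(f1_at_k(predicted, gold), 2))                       # 0.4
print(round(f1_at_k(predicted, gold, match=partial_match), 2))  # 0.8
```

In this toy case the extractor found both concepts, but exact matching credits only one of them because the other appears in a different grammatical case. This is the kind of mismatch the findings attribute to Slovak morphology.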
Abstract
Keyphrase extraction for morphologically rich, low-resource languages remains understudied, largely due to the scarcity of suitable evaluation datasets. We address this gap for Slovak by constructing a dataset of 227,432 scientific abstracts with author-assigned keyphrases (scraped and systematically cleaned from the Slovak Central Register of Theses), representing a 25-fold increase over the largest prior Slovak resource and approaching the scale of established English benchmarks such as KP20K. Using this dataset, we benchmark three unsupervised baselines (YAKE, TextRank, KeyBERT with SlovakBERT embeddings) and evaluate KeyLLM, an LLM-based extraction method using GPT-3.5-turbo. Unsupervised baselines achieve at most 11.6% exact-match F1@6, with a large gap to partial matching (up to 51.5%), reflecting the difficulty of matching inflected surface forms to author-assigned…
Taxonomy
Topics: Advanced Text Analysis Techniques · Sentiment Analysis and Opinion Mining · Biomedical Text Mining and Ontologies
