Out of Thin Air: Is Zero-Shot Cross-Lingual Keyword Detection Better Than Unsupervised?
Boshko Koloski, Senja Pollak, Blaž Škrlj, Matej Martinc

TL;DR
This study investigates whether zero-shot cross-lingual keyword extraction using pretrained multilingual models outperforms traditional unsupervised methods, especially for low-resource languages with no labeled data.
Contribution
It demonstrates that pretrained multilingual models fine-tuned on diverse languages outperform unsupervised keyword extractors in zero-shot settings across multiple languages.
Findings
Pretrained multilingual models outperform unsupervised methods in all six tested languages.
Zero-shot cross-lingual models are effective for low-resource languages.
Fine-tuning on multilingual corpora enhances zero-shot keyword extraction performance.
Abstract
Keyword extraction is the task of retrieving words that are essential to the content of a given document. Researchers have proposed various approaches to tackle this problem. At the top-most level, approaches are divided into those that require training (supervised) and those that do not (unsupervised). In this study, we are interested in settings where no training data is available for the language under investigation. More specifically, we explore whether pretrained multilingual language models can be employed for zero-shot cross-lingual keyword extraction on low-resource languages with limited or no available labeled training data, and whether they outperform state-of-the-art unsupervised keyword extractors. The comparison is conducted on six news article datasets covering two high-resource languages, English and Russian, and four low-resource languages, Croatian, Estonian, Latvian, and…
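To make the unsupervised side of the comparison concrete, here is a minimal sketch of a training-free keyword extractor and a top-k F1 score. It is an illustration only, not the paper's method: the frequency-based ranking, the tiny `STOPWORDS` set, the function names `extract_keywords` and `f1_at_k`, and the choice of F1@k as a metric are all assumptions introduced here.

```python
import re
from collections import Counter

# Hypothetical, tiny English stopword list for illustration only.
STOPWORDS = {"the", "a", "of", "and", "in", "to", "is", "on", "for"}


def extract_keywords(text, k=5):
    """Rank lowercase word tokens by raw frequency, skipping stopwords."""
    # \w+ is Unicode-aware in Python 3, so non-ASCII letters are kept.
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS and len(t) > 2)
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:k]]


def f1_at_k(predicted, gold, k=5):
    """F1 between the top-k predicted keywords and the gold keyword set."""
    top = set(predicted[:k])
    gold_set = {g.lower() for g in gold}
    hits = len(top & gold_set)
    if hits == 0:
        return 0.0
    precision = hits / len(top)
    recall = hits / len(gold_set)
    return 2 * precision * recall / (precision + recall)
```

A zero-shot supervised model would replace `extract_keywords` with a tagger fine-tuned on other languages, while the scoring function could stay the same, which is what makes the two families directly comparable on a language with no training data.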
Taxonomy
Topics: Advanced Text Analysis Techniques · Text and Document Classification Technologies · Topic Modeling
