Tagging Scientific Publications using Wikipedia and Natural Language Processing Tools. Comparison on the ArXiv Dataset
Michał Łopuszyński, Łukasz Bolikowski

TL;DR
This paper compares two methods for tagging scientific publications, one drawing labels from Wikipedia and one from noun phrases extracted from the corpus, and evaluates both on abstracts of 0.7 million ArXiv documents. The resulting tags are intended as document features for machine learning tasks.
Contribution
It introduces and compares two simple, scalable tagging methods for scientific texts, demonstrating their potential for improving document analysis tasks.
Findings
Wikipedia-based tags show strong coverage of scientific topics
Noun phrase extraction provides complementary labels
Both tag sets are proposed as document features for downstream machine learning tasks (document similarity, clustering, topic modelling)
Abstract
In this work, we compare two simple methods of tagging scientific publications with labels reflecting their content. The first source of labels is Wikipedia; the second label set is constructed from the noun phrases occurring in the analyzed corpus. We examine the statistical properties and the effectiveness of both approaches on a dataset consisting of abstracts from 0.7 million scientific documents deposited in the ArXiv preprint collection. We believe that the obtained tags can later be applied as useful document features in various machine learning tasks (document similarity, clustering, topic modelling, etc.).
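To make the idea of tagging against a label set concrete, here is a minimal sketch of a plain dictionary lookup. It is not the authors' pipeline: the function name `tag_text`, the `max_words` parameter, and the three-entry vocabulary are all hypothetical, and a real label set would be built from Wikipedia or from noun phrases found in the corpus.

```python
import re

def tag_text(text, vocabulary, max_words=3):
    """Return the vocabulary labels that occur in `text` as word n-grams."""
    # Lowercase and split into alphanumeric tokens (a deliberately crude tokenizer).
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    found = set()
    # Slide windows of 1..max_words tokens over the text and look each one up.
    for n in range(1, max_words + 1):
        for i in range(len(tokens) - n + 1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in vocabulary:
                found.add(candidate)
    return sorted(found)

# Hypothetical label set, used only for illustration.
vocabulary = {"topic modelling", "machine learning", "noun phrase"}
abstract = "Tags can serve as features in machine learning tasks such as topic modelling."
tags = tag_text(abstract, vocabulary)
print(tags)
```

The returned list can be treated as a sparse set of document features, for instance as input to a similarity or clustering step.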
