Tagging Scientific Publications using Wikipedia and Natural Language Processing Tools. Comparison on the ArXiv Dataset

Michał Łopuszyński, Łukasz Bolikowski

arXiv:1309.0326·cs.CL·November 4, 2014

TL;DR

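This paper compares two methods for tagging scientific publications, one drawing labels from Wikipedia and one from noun phrases extracted from the corpus, and evaluates both on abstracts from about 0.7 million ArXiv documents. The resulting tags are intended as document features for machine learning applications.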

Contribution

It introduces and compares two simple, scalable tagging methods for scientific texts, demonstrating their potential for improving document analysis tasks.

Findings

01

Wikipedia-based tags show strong coverage of scientific topics

02

Noun phrase extraction provides complementary labels

03

Both tag sets are proposed as document features for downstream machine learning tasks

Abstract

In this work, we compare two simple methods of tagging scientific publications with labels reflecting their content. Wikipedia is employed as the first source of labels; the second label set is constructed from the noun phrases occurring in the analyzed corpus. We examine the statistical properties and the effectiveness of both approaches on a dataset consisting of abstracts from 0.7 million scientific documents deposited in the ArXiv preprint collection. We believe that the obtained tags can later be applied as useful document features in various machine learning tasks (document similarity, clustering, topic modelling, etc.).
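
To make the dictionary-based idea in the abstract concrete, here is a minimal sketch of tagging an abstract against a fixed list of labels and then comparing two documents by their tag sets. It is an illustration under assumptions, not the authors' pipeline: the function names `tokenize`, `tag_with_dictionary`, and `jaccard` are hypothetical, the label list is toy data standing in for a Wikipedia-derived dictionary, and the matching rule (exact match on lowercased word sequences) is a simplification chosen for brevity.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tag_with_dictionary(text, labels, max_len=4):
    """Return the labels whose word sequence occurs verbatim in the text.

    Assumption: a label matches only if its tokens appear contiguously;
    no stemming or disambiguation is attempted in this sketch.
    """
    tokens = tokenize(text)
    # Map each label's token tuple back to its original spelling.
    index = {tuple(tokenize(label)): label for label in labels}
    found = set()
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            label = index.get(tuple(tokens[i:i + n]))
            if label is not None:
                found.add(label)
    return found


def jaccard(a, b):
    """Jaccard similarity of two tag sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Toy label dictionary (hypothetical stand-in for Wikipedia-derived labels).
LABELS = ["Topic model", "Machine learning", "Noun phrase", "Graph theory"]

tags_a = tag_with_dictionary(
    "We use noun phrase extraction and machine learning for tagging.", LABELS
)
tags_b = tag_with_dictionary(
    "Topic model training is a machine learning task.", LABELS
)
print(sorted(tags_a), sorted(tags_b), jaccard(tags_a, tags_b))
```

The tag sets produced this way can serve directly as sparse document features; the Jaccard score above is one simple example of the document similarity use case mentioned in the abstract.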
