Classification and Clustering of arXiv Documents, Sections, and Abstracts, Comparing Encodings of Natural and Mathematical Language

Philipp Scharpf; Moritz Schubotz; Abdou Youssef; Felix Hamborg; Norman Meuschke; Bela Gipp

arXiv:2005.11021·cs.DL·May 25, 2020

TL;DR

This paper investigates how selecting and combining encodings of natural and mathematical language affects the classification and clustering of arXiv documents, sections, and abstracts. It reports classification accuracies up to 82.8% and a relatively low correlation between text and math similarity, which motivates treating text and formulae as separate features.

Contribution

It compares different encodings of text and formulae, alone and in combination, and evaluates the performance and runtimes of selected classification and clustering algorithms on them.

Findings

01

Achieved up to 82.8% classification accuracy

02

Cluster purities up to 69.4% (number of clusters equals number of classes) and 99.9% (unspecified number of clusters)

03

Computer outperforms human experts in classification

Abstract

In this paper, we show how selecting and combining encodings of natural and mathematical language affect classification and clustering of documents with mathematical content. We demonstrate this by using sets of documents, sections, and abstracts from the arXiv preprint server that are labeled by their subject class (mathematics, computer science, physics, etc.) to compare different encodings of text and formulae and evaluate the performance and runtimes of selected classification and clustering algorithms. Our encodings achieve classification accuracies up to $82.8\%$ and cluster purities up to $69.4\%$ (number of clusters equals number of classes), and $99.9\%$ (unspecified number of clusters) respectively. We observe a relatively low correlation between text and math similarity, which indicates the independence of text and formulae and motivates treating them as separate features of…
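
The abstract reports cluster purity without defining it. As a rough illustration, the sketch below uses the standard textbook definition: each cluster is credited with its most frequent class label, and purity is the share of items covered by those majority labels. The function name `cluster_purity` and the toy data are assumptions made for this example; they are not taken from the paper or its code.

```python
from collections import Counter


def cluster_purity(cluster_ids, class_labels):
    """Return the fraction of items that carry the majority class of their cluster.

    Hypothetical helper for illustration; not the authors' implementation.
    """
    if len(cluster_ids) != len(class_labels):
        raise ValueError("cluster_ids and class_labels must have the same length")
    if not cluster_ids:
        raise ValueError("inputs must be non-empty")

    # Count class labels separately inside every cluster.
    members = {}
    for cluster, label in zip(cluster_ids, class_labels):
        members.setdefault(cluster, Counter())[label] += 1

    # Sum the size of the majority class over all clusters.
    majority_total = sum(counts.most_common(1)[0][1] for counts in members.values())
    return majority_total / len(cluster_ids)


# Toy data (invented): six items, three clusters, three subject classes.
clusters = [0, 0, 0, 1, 1, 2]
labels = ["math", "math", "cs", "cs", "cs", "physics"]
print(cluster_purity(clusters, labels))

# With one item per cluster, every cluster is trivially pure.
print(cluster_purity([0, 1, 2], ["math", "cs", "physics"]))
```

Under this definition, purity cannot decrease when clusters are split further, and it reaches 1.0 when every item sits in its own cluster. This is worth keeping in mind when comparing the two purity figures above, since they are reported for different settings of the number of clusters.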
