Classification and Clustering of arXiv Documents, Sections, and Abstracts, Comparing Encodings of Natural and Mathematical Language
Philipp Scharpf, Moritz Schubotz, Abdou Youssef, Felix Hamborg, Norman Meuschke, Bela Gipp

TL;DR
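This paper investigates how different encodings of natural and mathematical language affect the classification and clustering of arXiv documents, comparing the performance and runtimes of selected algorithms. It reports classification accuracies up to 82.8% and finds a relatively low correlation between text and formula similarity, which suggests treating the two as separate features.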
Contribution
It introduces novel encoding strategies for natural and mathematical language and evaluates their impact on document classification and clustering performance.
Findings
Achieved up to 82.8% classification accuracy
Cluster purities up to 69.4% (number of clusters equals number of classes) and 99.9% (unspecified number of clusters)
Computer outperforms human experts in classification
Abstract
In this paper, we show how selecting and combining encodings of natural and mathematical language affect classification and clustering of documents with mathematical content. We demonstrate this by using sets of documents, sections, and abstracts from the arXiv preprint server that are labeled by their subject class (mathematics, computer science, physics, etc.) to compare different encodings of text and formulae and evaluate the performance and runtimes of selected classification and clustering algorithms. Our encodings achieve classification accuracies up to 82.8% and cluster purities up to 69.4% (number of clusters equals number of classes) and 99.9% (unspecified number of clusters), respectively. We observe a relatively low correlation between text and math similarity, which indicates the independence of text and formulae and motivates treating them as separate features of…
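The two purity figures depend on how many clusters are allowed, which a small example can make concrete. The sketch below assumes the standard definition of cluster purity (each cluster is credited with its majority class, and the credited items are divided by the total number of items); this summary does not state the exact formula the authors used. The function name `cluster_purity` and the toy labels are hypothetical.

```python
from collections import Counter


def cluster_purity(cluster_ids, class_labels):
    """Fraction of items that belong to the majority class of their cluster.

    Assumes the standard purity definition; the paper's exact variant
    is not given in this summary.
    """
    if len(cluster_ids) != len(class_labels):
        raise ValueError("cluster_ids and class_labels must have the same length")
    if not cluster_ids:
        raise ValueError("at least one item is required")

    # Count how often each class label occurs inside each cluster.
    per_cluster = {}
    for cid, label in zip(cluster_ids, class_labels):
        per_cluster.setdefault(cid, Counter())[label] += 1

    # Credit every cluster with the size of its largest class.
    majority_total = sum(max(counts.values()) for counts in per_cluster.values())
    return majority_total / len(cluster_ids)


# Toy data (hypothetical): six items from three subject classes.
labels = ["math", "math", "cs", "cs", "physics", "physics"]
coarse = [0, 0, 0, 1, 1, 1]  # two clusters, each mixing two classes
fine = [0, 1, 2, 3, 4, 5]    # one cluster per item

print(cluster_purity(coarse, labels))
print(cluster_purity(fine, labels))
```

Under this definition, purity can only stay the same or increase as clusters are split further, and it reaches 1.0 when every item sits in its own cluster. That is why a purity obtained with an unrestricted number of clusters is not directly comparable to one obtained with the number of clusters fixed to the number of classes.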
