Towards Math-Aware Automated Classification and Similarity Search of Scientific Publications: Methods of Mathematical Content Representations
Michal Růžička, Petr Sojka

TL;DR
This paper explores mathematical content representations for automated classification and similarity search in STEM documents. The representations are evaluated on a subset of arXiv.org papers using LDA and LSI, with the Mathematics Subject Classification (MSC) as the reference classification.
Contribution
It introduces structured math representations that outperform flat TeX tokens and assesses their impact on classification and similarity search tasks.
Findings
Structured math representations improve classification accuracy.
A proper selection of weighted tokens representing math may slightly improve result quality.
The choice of math representation may influence the performance of both classification and similarity search.
Abstract
In this paper, we investigate mathematical content representations suitable for the automated classification of, and similarity search in, STEM documents using standard machine learning algorithms: Latent Dirichlet Allocation (LDA) and Latent Semantic Indexing (LSI). The methods are evaluated on a subset of arXiv.org papers, with the Mathematics Subject Classification (MSC) as a reference classification and the standard precision, recall, and F1-measure metrics. The results give insight into how different math representations may influence the performance of the classification and similarity search tasks in STEM repositories. Unsurprisingly, machine learning methods are able to capture distributional semantics from textual tokens. A proper selection of weighted tokens representing math may improve the quality of the results slightly. A structured math representation that…
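The precision, recall, and F1 metrics named in the abstract can be illustrated with a minimal sketch. It assumes each paper carries a set of top-level MSC codes and that scores are micro-averaged over all papers; the averaging scheme, the function name micro_prf1, and the paper identifiers are illustrative assumptions, not details taken from the paper.

```python
from typing import Dict, Set, Tuple


def micro_prf1(
    gold: Dict[str, Set[str]],
    predicted: Dict[str, Set[str]],
) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall, and F1 over per-paper label sets.

    Hypothetical helper: the paper's exact evaluation protocol is not
    given in the abstract, so micro-averaging is an assumption here.
    """
    tp = fp = fn = 0
    for paper_id, gold_labels in gold.items():
        pred_labels = predicted.get(paper_id, set())
        tp += len(gold_labels & pred_labels)   # labels correctly predicted
        fp += len(pred_labels - gold_labels)   # predicted but not in reference
        fn += len(gold_labels - pred_labels)   # in reference but missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return precision, recall, f1


# Made-up example data: keys are placeholder paper IDs, values are
# top-level MSC codes (e.g. 11 = number theory, 68 = computer science).
gold = {"paper-a": {"11", "14"}, "paper-b": {"68"}, "paper-c": {"35"}}
predicted = {"paper-a": {"11"}, "paper-b": {"68", "03"}, "paper-c": {"60"}}

precision, recall, f1 = micro_prf1(gold, predicted)
print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
```

In this toy data there are two correct labels, two spurious ones, and two missed ones, so all three scores come out at 0.50.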
Taxonomy
Topics: Mathematics, Computing, and Information Processing · Scientific Computing and Data Management · Topic Modeling
