Classifying document types to enhance search and recommendations in digital libraries
Aristotelis Charalampous, Petr Knoth

TL;DR
This paper presents a machine learning approach to classify document types in digital libraries, addressing missing metadata issues and demonstrating improved search and recommendation relevance.
Contribution
A new supervised machine learning method that classifies document types using text features alone. It achieves high accuracy and has the potential to enhance digital library search systems.
Findings
Achieved a 0.96 F1-score with random forest and AdaBoost classifiers.
Users are significantly more likely to click on research papers and theses.
Document type classification can improve search and recommendation systems.
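For readers unfamiliar with the metric behind the 0.96 figure, the F1-score is the harmonic mean of precision and recall. The sketch below computes per-class and macro-averaged F1 in plain Python. The labels and predictions are made up for illustration, the function names are hypothetical, and macro averaging is an assumption here: this summary does not state which averaging scheme the authors used.

```python
from collections import Counter


def f1_per_class(y_true, y_pred):
    """Return {label: F1} from paired true and predicted labels."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted label gains a false positive
            fn[truth] += 1  # true label gains a false negative
    scores = {}
    for label in set(y_true) | set(y_pred):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        denom = precision + recall
        scores[label] = 2 * precision * recall / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores (assumed averaging)."""
    scores = f1_per_class(y_true, y_pred)
    return sum(scores.values()) / len(scores)


# Toy data, not taken from the paper.
y_true = ["paper", "paper", "thesis", "slides", "thesis", "paper"]
y_pred = ["paper", "thesis", "thesis", "slides", "thesis", "paper"]

per_class = f1_per_class(y_true, y_pred)
overall = macro_f1(y_true, y_pred)
```

On this toy data one research paper is mislabelled as a thesis, which lowers recall for the paper class and precision for the thesis class while leaving slides unaffected.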
Abstract
In this paper, we address the problem of classifying documents available from the global network of (open access) repositories according to their type. We show that the metadata provided by repositories enabling us to distinguish research papers, theses and slides are missing in over 60% of cases. While these metadata describing document types are useful in a variety of scenarios ranging from research analytics to improving search and recommender (SR) systems, this problem has not yet been sufficiently addressed in the context of the repository infrastructure. We have developed a new approach for classifying document types using supervised machine learning based exclusively on text-specific features. We achieve a 0.96 F1-score using the random forest and AdaBoost classifiers, which are the best performing models on our data. By analysing the SR system logs of the CORE [1] digital…
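To make the idea of text-specific features concrete, here is a minimal sketch of turning a document's extracted text into a small numeric feature vector. The particular features (total words, page count, average words per page) and the function name are assumptions chosen for illustration; this summary does not list the features the authors actually used.

```python
def extract_text_features(pages):
    """Build a feature dict from a list of per-page text strings.

    The feature set is hypothetical: it only illustrates the kind of
    text-derived signal that could separate papers, theses and slides.
    """
    word_counts = [len(page.split()) for page in pages]
    total_words = sum(word_counts)
    n_pages = len(pages)
    return {
        "total_words": total_words,
        "n_pages": n_pages,
        # Guard against empty documents to avoid division by zero.
        "avg_words_per_page": total_words / n_pages if n_pages else 0.0,
    }


# Toy slide deck: few words spread over several pages.
slides = ["Title slide", "Three bullet points here"]
slide_features = extract_text_features(slides)
```

Vectors like these could then be passed to any supervised classifier, such as a random forest. The intuition is that slide decks tend to have few words per page, while theses tend to have many pages, though how well such features separate the classes depends on the data.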
Taxonomy
Topics: Text and Document Classification Technologies · Advanced Text Analysis Techniques · Topic Modeling
