Using Supervised Learning to Classify Metadata of Research Data by Discipline of Research
Tobias Weber, Dieter Kranzlmüller, Michael Fromm, Nelson Tavares de Sousa

TL;DR
This paper develops machine learning models that automatically classify research data metadata by discipline, with the best model reaching an f1-macro score of 0.760, enabling large-scale analysis of research trends and interdisciplinarity.
Contribution
It introduces a large dataset and evaluates multiple models, finding that multi-layer perceptrons perform best for multi-label classification of research disciplines.
Findings
Multi-layer perceptrons achieved an f1-macro score of 0.760.
The dataset includes 609,524 records for reproducible evaluation.
The trained models could support large-scale quantitative analysis of trends in research data, such as trends towards interdisciplinarity.
Abstract
Automated classification of metadata of research data by their discipline(s) of research can be used in scientometric research, by repository service providers, and in the context of research data aggregation services. Openly available metadata of the DataCite index for research data were used to compile a large training and evaluation set comprising 609,524 records, which is published alongside this paper. These data allow the reproducible assessment of classification approaches, such as tree-based models and neural networks. According to our experiments with 20 base classes (multi-label classification), multi-layer perceptron models perform best with an f1-macro score of 0.760, closely followed by Long Short-Term Memory models (f1-macro score of 0.755). A possible application of the trained classification models is the quantitative analysis of trends towards interdisciplinarity of digital…
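To make the reported metric concrete, the following is a minimal sketch of how an f1-macro score can be computed in a multi-label setting: F1 is calculated separately for each label and then averaged without weighting, so rare disciplines count as much as common ones. The function name `f1_macro`, the toy data with three labels, and the convention of scoring a label with no true or predicted positives as 0.0 are assumptions made for illustration; the abstract does not state how the authors computed the score.

```python
from typing import Sequence


def f1_macro(y_true: Sequence[Sequence[int]], y_pred: Sequence[Sequence[int]]) -> float:
    """Unweighted mean of per-label F1 over binary indicator rows (one row per record)."""
    n_labels = len(y_true[0])
    per_label = []
    for j in range(n_labels):
        pairs = [(t[j], p[j]) for t, p in zip(y_true, y_pred)]
        tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
        fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
        fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
        denom = 2 * tp + fp + fn
        # Assumed convention: a label with no positives at all scores 0.0.
        per_label.append(2 * tp / denom if denom else 0.0)
    return sum(per_label) / n_labels


# Toy example: 4 records, 3 hypothetical discipline labels (not the paper's 20 classes).
y_true = [[1, 0, 0], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
y_pred = [[1, 0, 0], [0, 1, 1], [1, 0, 0], [0, 0, 1]]
score = f1_macro(y_true, y_pred)
print(round(score, 3))
```

In this toy case the per-label F1 values are 1, 2/3, and 2/3, so the macro average is 7/9. A record may carry several labels at once (the third record has two), which is what distinguishes multi-label from multi-class classification.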
Taxonomy
Topics: Biomedical Text Mining and Ontologies · Advanced Text Analysis Techniques · Topic Modeling
