Presenting a classifier to detect research contributions in OpenAlex
Nick Haupka

TL;DR
This paper presents a high-accuracy classifier for distinguishing research articles from non-research content in OpenAlex, enhancing data quality for bibliometric analysis.
Contribution
It introduces a novel document type classifier that effectively identifies non-research content using open metadata, improving classification accuracy in bibliometric datasets.
Findings
F1-score of 0.95 for the classifier
Reclassified 10.75% of articles and reviews (4,589,967 of 42,701,863) as non-research
Potential to improve data quality in OpenAlex
Abstract
This paper introduces a document type classifier designed to improve the distinction between research and non-research journal publications in OpenAlex. Based on open metadata, the classifier detects non-research or editorial content (e.g. paratexts, abstracts, editorials, letters) within a set of works classified as articles and reviews. The classifier achieves an F1-score of 0.95, which suggests that applying it to real data could improve the data quality of bibliometric research based on OpenAlex. In total, the classifier reclassified 4,589,967 out of 42,701,863 articles and reviews as non-research contributions, a share of 10.75%.
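The two figures in the abstract can be checked with a few lines of arithmetic. The sketch below recomputes the reclassified share from the reported counts and shows the standard definition of the F1-score from confusion-matrix counts. The function names are illustrative, and the counts passed to `f1_score` are hypothetical values chosen only to show how a score of 0.95 can arise; they are not taken from the paper's evaluation.

```python
def share(part: int, total: int) -> float:
    """Return part / total as a fraction; total must be positive."""
    if total <= 0:
        raise ValueError("total must be positive")
    return part / total


def f1_score(tp: int, fp: int, fn: int) -> float:
    """Standard F1: harmonic mean of precision and recall.

    Equivalent to 2*tp / (2*tp + fp + fn). Returns 0.0 when undefined.
    """
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Counts reported in the abstract.
RECLASSIFIED = 4_589_967
TOTAL_ARTICLES_AND_REVIEWS = 42_701_863

reclassified_share = share(RECLASSIFIED, TOTAL_ARTICLES_AND_REVIEWS)
print(f"{reclassified_share:.2%}")  # 10.75%

# Hypothetical confusion-matrix counts (assumption, not from the paper).
example_f1 = f1_score(tp=95, fp=5, fn=5)
print(f"{example_f1:.2f}")  # 0.95
```

Note that the F1-score balances precision and recall, so a value of 0.95 does not by itself say how the remaining errors split between false positives and false negatives.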
