Content-based subject classification at article level in biomedical context
Eric Jeangirard

TL;DR
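This paper introduces a mixed approach to subject classification of biomedical publications: classifiers are trained on article metadata labelled with journal-level FoR (Fields of Research) codes, then applied to individual articles. Stratified sampling of the training data helps reduce the bias caused by the unbalanced distribution of fields.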
Contribution
The paper presents a method that combines journal-level FoR codes with NLP embeddings to classify individual biomedical articles.
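The labelling step can be illustrated with a minimal sketch, which is not the author's implementation. It assumes each article inherits the FoR codes of its journal and is written out in fastText's supervised text format (labels prefixed with `__label__`). The lookup table `JOURNAL_FOR`, the ISSNs, the codes, and the function name `to_fasttext_line` are hypothetical placeholders.

```python
import re

# Hypothetical journal-to-FoR lookup; ISSNs and codes are placeholder values.
JOURNAL_FOR = {
    "0000-0001": ["1103"],
    "0000-0002": ["1109", "1701"],
}


def to_fasttext_line(article, journal_for=JOURNAL_FOR):
    """Build one training line in fastText supervised format.

    The article inherits every FoR code of its journal. Returns None
    when the journal has no known FoR code, so the article can be skipped.
    """
    codes = journal_for.get(article["issn"])
    if not codes:
        return None
    # Concatenate the metadata fields named in the abstract.
    text = " ".join([
        article.get("title", ""),
        article.get("abstract", ""),
        " ".join(article.get("keywords", [])),
    ])
    # Collapse whitespace so the example stays on a single line, then lowercase.
    text = re.sub(r"\s+", " ", text).strip().lower()
    labels = " ".join(f"__label__{code}" for code in codes)
    return f"{labels} {text}"
```

Lines produced this way could be written to a file and used to train a supervised fastText model; the preprocessing here (lowercasing, whitespace collapsing) is an assumption, not something the paper summary specifies.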
Findings
Stratified sampling of the training data helps reduce the bias due to the unbalanced distribution of fields.
fastText classifiers trained on journal-level FoR codes are used to classify individual publications from their available metadata.
The method is implemented and available on GitHub.
Abstract
Subject classification is an important task for analyzing scholarly publications. Two main kinds of approaches are generally used: classification at the journal level and classification at the article level. We propose a mixed approach that leverages NLP embedding techniques to train classifiers on article metadata (in particular title, abstract, and keywords) labelled with the journal-level FoR (Fields of Research) classification, and then applies these classifiers at the article level. We use this approach in the context of biomedical publications, using metadata from PubMed. fastText classifiers are trained with FoR codes and used to classify publications based on their available metadata. Results show that using a stratified sampling strategy for training helps reduce the bias due to the unbalanced field distribution. An implementation of the method is proposed on the repository…
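The abstract does not detail how the stratified sample is drawn. As a rough sketch of one possible reading, the function below caps the number of training examples per FoR code so that large fields do not dominate the training set. The function name `stratified_sample` and the `per_label` parameter are assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_sample(records, per_label, seed=0):
    """Draw at most `per_label` records for each label.

    `records` is an iterable of (text, label) pairs. Labels with fewer
    than `per_label` records contribute all of their records.
    """
    rng = random.Random(seed)  # local generator for reproducible draws
    by_label = defaultdict(list)
    for text, label in records:
        by_label[label].append((text, label))

    sample = []
    for label in sorted(by_label):  # sorted for a deterministic order
        group = by_label[label]
        k = min(per_label, len(group))
        sample.extend(rng.sample(group, k))  # sampling without replacement
    rng.shuffle(sample)  # avoid label-ordered training data
    return sample
```

With a 90/10 split between two fields and `per_label=10`, the sample contains 10 records of each field, whereas a simple random sample of the same size would be dominated by the larger field.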
Taxonomy
Topics: Biomedical Text Mining and Ontologies · Text and Document Classification Technologies · Topic Modeling
