Content-based subject classification at article level in biomedical context

Eric Jeangirard

arXiv:2104.14800·cs.DL·May 12, 2021·1 cites

TL;DR

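This paper introduces a mixed NLP-based approach to subject classification of biomedical publications: fastText classifiers are trained on PubMed article metadata labelled with journal-level FoR (Fields of Research) codes and then applied to individual articles. Stratified sampling of the training data helps reduce the bias caused by the unbalanced field distribution.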

Contribution

It presents a novel method combining journal-level FoR codes and NLP embeddings to classify articles at the individual level in biomedical research.

Findings

01 Stratified sampling reduces bias in classification.

02 Embedding-based classifiers improve article-level classification accuracy.

03 The method is implemented and available on GitHub.
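
The stratified sampling idea in the first finding can be illustrated with a minimal sketch. It assumes equal allocation (at most `per_field` training records drawn from each field), which is one common scheme and not necessarily the one used in the paper. The function name `stratified_sample`, the field labels, and the toy corpus are all hypothetical.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(records, per_field, seed=0):
    """Draw up to `per_field` records from each field (stratum).

    `records` is a list of (text, field) pairs. Equal allocation per field
    is an assumption of this sketch, not a detail taken from the paper.
    """
    rng = random.Random(seed)
    by_field = defaultdict(list)
    for text, field in records:
        by_field[field].append((text, field))

    sample = []
    for field in sorted(by_field):  # sorted for a reproducible order
        items = by_field[field]
        k = min(per_field, len(items))  # small fields contribute all they have
        sample.extend(rng.sample(items, k))
    return sample


# Made-up, unbalanced corpus: 90 records in one field, 10 in another.
records = [(f"abstract a{i}", "FIELD_A") for i in range(90)]
records += [(f"abstract b{i}", "FIELD_B") for i in range(10)]

balanced = stratified_sample(records, per_field=10)
counts = Counter(field for _, field in balanced)
```

In this toy case the dominant field is cut down to the size of the smaller one, so a classifier trained on `balanced` no longer sees nine times more examples of one field than the other.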

Abstract

Subject classification is an important task for analyzing scholarly publications. In general, two main kinds of approaches are used: classification at the journal level and classification at the article level. We propose a mixed approach that leverages embedding techniques from NLP to train classifiers on article metadata (title, abstract, and keywords in particular) labelled with the journal-level FoR (Fields of Research) classification, and then applies these classifiers at the article level. We use this approach in the context of biomedical publications, using metadata from PubMed. fastText classifiers are trained with FoR codes and used to classify publications based on their available metadata. Results show that using a stratified sampling strategy for training helps reduce the bias due to the unbalanced field distribution. An implementation of the method is proposed on the repository…
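
The labelling step described in the abstract can be sketched as follows: each article inherits the field labels of its journal, and its title, abstract, and keywords are joined into one line in fastText's supervised input format (labels prefixed with `__label__`). This is an illustration only, not the code from the paper's repository; the helper `to_fasttext_line`, the journal names, the field labels, and the lowercasing and whitespace normalization are assumptions.

```python
def to_fasttext_line(article, journal_fields):
    """Build one fastText training line from an article's metadata.

    `journal_fields` maps a journal name to its list of field labels
    (hypothetical labels here, standing in for journal-level FoR codes).
    Returns None when the journal has no known label.
    """
    labels = journal_fields.get(article["journal"], [])
    if not labels:
        return None

    # Concatenate the metadata fields named in the abstract.
    text = " ".join([
        article.get("title", ""),
        article.get("abstract", ""),
        " ".join(article.get("keywords", [])),
    ])
    # Assumed preprocessing: lowercase and collapse whitespace onto one line.
    text = " ".join(text.lower().split())

    prefix = " ".join(f"__label__{label}" for label in labels)
    return f"{prefix} {text}"


# Hypothetical journal-level labels and a made-up article record.
journal_fields = {
    "Journal X": ["FIELD_A"],
    "Journal Y": ["FIELD_A", "FIELD_B"],
}
article = {
    "journal": "Journal Y",
    "title": "Gene  Expression in Mice",
    "abstract": "We study\nexpression.",
    "keywords": ["genomics", "mouse"],
}

line = to_fasttext_line(article, journal_fields)
unlabelled = to_fasttext_line({"journal": "Unknown", "title": "t"}, journal_fields)
```

Lines built this way could be written to a text file and passed to `fasttext.train_supervised(input=...)`, and the resulting model applied to a single article's text with `model.predict(...)`. Whether the paper trains one multi-label model or separate classifiers is not stated in the abstract.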

Code & Models

Repositories

dataesr/scientific_tagger (official)

Taxonomy

Topics: Biomedical Text Mining and Ontologies · Text and Document Classification Technologies · Topic Modeling