MeSHup: A Corpus for Full Text Biomedical Document Indexing

Xindi Wang; Robert E. Mercer; Frank Rudzicz

arXiv:2204.13604 · cs.CL · April 29, 2022 · 1 citation

Open Access

TL;DR

This paper introduces MeSHup, a large-scale annotated corpus of over 1.3 million biomedical articles with MeSH labels, enabling improved development and evaluation of automated indexing systems.

Contribution

The paper provides the first publicly available, extensive corpus for biomedical document indexing and establishes a new baseline using an end-to-end model.

Findings

01

The corpus contains 1,342,667 articles with labels and metadata.

02

An end-to-end model trained on MeSHup is reported as a new baseline for the corpus.

03

The dataset facilitates robust evaluation and comparison of indexing systems.

Abstract

Medical Subject Headings (MeSH) indexing refers to the problem of assigning to a given biomedical document the most relevant labels from an extremely large set of MeSH terms. Currently, the vast number of biomedical articles in the PubMed database are manually annotated by human curators, which is time-consuming and costly; therefore, a computational system that can assist the indexing is highly valuable. When developing supervised MeSH indexing systems, the availability of a large-scale annotated text corpus is desirable. A publicly available, large corpus that permits robust evaluation and comparison of various systems is important to the research community. We release a large-scale annotated MeSH indexing corpus, MeSHup, which contains 1,342,667 full-text articles in English, together with the associated MeSH labels and metadata, authors, and publication venues that are collected…
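To make the task concrete: MeSH indexing as described in the abstract is a multi-label problem, where each document receives a set of labels rather than a single class. The following is a minimal sketch of how predicted label sets might be compared against curator-assigned ones. The choice of micro-averaged F1, the function name `micro_f1`, the document IDs, and the toy label sets are all assumptions for illustration; they are not taken from the paper or the corpus.

```python
from typing import Dict, Set


def micro_f1(gold: Dict[str, Set[str]], pred: Dict[str, Set[str]]) -> float:
    """Micro-averaged F1 over all (document, label) pairs.

    Documents missing from `pred` count as having no predicted labels.
    Documents present only in `pred` are ignored.
    """
    tp = fp = fn = 0
    for doc_id, gold_labels in gold.items():
        pred_labels = pred.get(doc_id, set())
        tp += len(gold_labels & pred_labels)   # labels correctly predicted
        fp += len(pred_labels - gold_labels)   # predicted but not in gold
        fn += len(gold_labels - pred_labels)   # in gold but not predicted
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy data: IDs and label assignments are illustrative only.
gold = {
    "doc1": {"Humans", "Neoplasms"},
    "doc2": {"Humans", "Mice"},
}
pred = {
    "doc1": {"Humans", "Neoplasms", "Mice"},
    "doc2": {"Humans"},
}

score = micro_f1(gold, pred)
print(f"micro-F1 = {score:.2f}")
```

Micro-averaging pools counts across all documents, so frequent labels dominate the score; a per-label (macro) average would weight rare MeSH terms more heavily.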

Taxonomy

Topics: Biomedical Text Mining and Ontologies · Topic Modeling · Advanced Text Analysis Techniques