MeSHup: A Corpus for Full Text Biomedical Document Indexing
Xindi Wang, Robert E. Mercer, Frank Rudzicz

TL;DR
This paper introduces MeSHup, a large-scale annotated corpus of over 1.3 million biomedical articles with MeSH labels, enabling improved development and evaluation of automated indexing systems.
Contribution
The paper provides the first publicly available, extensive corpus for biomedical document indexing and establishes a new baseline using an end-to-end model.
Findings
The corpus contains 1,342,667 full-text English articles with their MeSH labels and metadata.
An end-to-end model trained on MeSHup outperforms previous methods.
The dataset facilitates robust evaluation and comparison of indexing systems.
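As a minimal sketch of what comparing indexing systems on such a corpus can look like, the snippet below computes micro-averaged precision, recall, and F1 over per-document label sets. The choice of metric, the function name `micro_prf`, and the toy MeSH labels are assumptions made for illustration; this summary does not state which metrics the paper reports.

```python
from typing import Iterable, Set, Tuple


def micro_prf(gold: Iterable[Set[str]], pred: Iterable[Set[str]]) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall, and F1 over per-document label sets.

    Hypothetical helper for illustration; not taken from the MeSHup paper.
    """
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        tp += len(g & p)   # labels predicted and present in the gold set
        fp += len(p - g)   # labels predicted but absent from the gold set
        fn += len(g - p)   # gold labels the system missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy example: two documents with made-up gold and predicted MeSH labels.
gold_labels = [{"Humans", "Neoplasms"}, {"Animals", "Mice"}]
pred_labels = [{"Humans", "Neoplasms", "Mice"}, {"Animals"}]
precision, recall, f1 = micro_prf(gold_labels, pred_labels)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```

Micro-averaging pools counts across all documents, so frequent labels dominate the score; a macro-averaged variant would weight each label equally instead.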
Abstract
Medical Subject Heading (MeSH) indexing refers to the problem of assigning to a given biomedical document the most relevant labels from an extremely large set of MeSH terms. Currently, the vast number of biomedical articles in the PubMed database are manually annotated by human curators, which is time-consuming and costly; therefore, a computational system that can assist the indexing is highly valuable. When developing supervised MeSH indexing systems, the availability of a large-scale annotated text corpus is desirable. A publicly available, large corpus that permits robust evaluation and comparison of various systems is important to the research community. We release a large-scale annotated MeSH indexing corpus, MeSHup, which contains 1,342,667 full-text articles in English, together with the associated MeSH labels and metadata, authors, and publication venues that are collected…
Taxonomy
Topics: Biomedical Text Mining and Ontologies · Topic Modeling · Advanced Text Analysis Techniques
