MedMentions: A Large Biomedical Corpus Annotated with UMLS Concepts
Sunil Mohan, Donghui Li

TL;DR
MedMentions is a large, manually annotated biomedical corpus whose mentions are linked to UMLS concepts. It is intended to support research in biomedical named entity recognition and linking, and is released with training and testing splits and a baseline entity linking model.
Contribution
This paper introduces MedMentions, a biomedical corpus of over 4,000 abstracts and over 350,000 mentions linked to UMLS 2017 concepts, released with training and testing splits and a baseline model for entity linking.
Findings
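Over 4,000 abstracts annotated with over 350,000 linked mentions
Includes a sub-corpus annotated with a subset of UMLS 2017 targeted towards document retrieval
Provides training and testing splits, plus a baseline entity linking model and its metrics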
Abstract
This paper presents the formal release of MedMentions, a new manually annotated resource for the recognition of biomedical concepts. What distinguishes MedMentions from other annotated biomedical corpora is its size (over 4,000 abstracts and over 350,000 linked mentions), as well as the size of the concept ontology (over 3 million concepts from UMLS 2017) and its broad coverage of biomedical disciplines. In addition to the full corpus, a sub-corpus of MedMentions is also presented, comprising annotations for a subset of UMLS 2017 targeted towards document retrieval. To encourage research in Biomedical Named Entity Recognition and Linking, data splits for training and testing are included in the release, and a baseline model and its metrics for entity linking are also described.
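As an illustration of how entity linking output on a corpus like this might be scored, here is a minimal sketch of mention-level precision, recall, and F1. It assumes each mention is a (document ID, start offset, end offset, concept ID) tuple and that a prediction counts as correct only when all four fields match exactly. This representation, the function name `linking_prf`, and the placeholder identifiers are hypothetical and not taken from the paper, which may define its metrics differently.

```python
from typing import Iterable, Tuple

# Assumed layout for one linked mention: (doc_id, start, end, concept_id).
Mention = Tuple[str, int, int, str]


def linking_prf(
    gold: Iterable[Mention], predicted: Iterable[Mention]
) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) under exact span-and-concept matching."""
    gold_set = set(gold)
    pred_set = set(predicted)
    # A prediction is a true positive only if the span and concept both match.
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Placeholder document and concept IDs, for illustration only.
gold = [("doc1", 0, 7, "C0000001"), ("doc1", 12, 20, "C0000002")]
pred = [("doc1", 0, 7, "C0000001"), ("doc1", 12, 20, "C0000003")]
precision, recall, f1 = linking_prf(gold, pred)
```

In this toy example the second prediction has the right span but the wrong concept, so it counts against both precision and recall. A more lenient scheme (for example, crediting partial span overlap) would require a different matching rule.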
Taxonomy
Topics: Biomedical Text Mining and Ontologies · Topic Modeling · Natural Language Processing Techniques
