COMETA: A Corpus for Medical Entity Linking in the Social Media

Marco Basaldella; Fangyu Liu; Ehsan Shareghi; Nigel Collier

arXiv:2010.03295·cs.CL·October 9, 2020

TL;DR

COMETA is a corpus of 20k English biomedical entity mentions from Reddit, expert-annotated with links to SNOMED CT. It is built to support entity linking on layman's health language and to benchmark how well current models handle it.

Contribution

This paper introduces COMETA, a biomedical entity linking dataset built from Reddit and linked to SNOMED CT, and benchmarks 20 entity linking baselines, from string-based to neural models, to expose current challenges and the need for combined approaches.

Findings

01

None of the 20 benchmarked entity linking baselines, string-based or neural, achieves perfect performance on COMETA.

02

Combining different data views improves entity linking accuracy.

03

Current models still have significant performance gaps.

Abstract

Whilst there has been growing progress in Entity Linking (EL) for general language, existing datasets fail to address the complex nature of health terminology in layman's language. Meanwhile, there is a growing need for applications that can understand the public's voice in the health domain. To address this we introduce a new corpus called COMETA, consisting of 20k English biomedical entity mentions from Reddit expert-annotated with links to SNOMED CT, a widely-used medical knowledge graph. Our corpus satisfies a combination of desirable properties, from scale and coverage to diversity and quality, that to the best of our knowledge has not been met by any of the existing resources in the field. Through benchmark experiments on 20 EL baselines from string- to neural-based models we shed light on the ability of these systems to perform complex inference on entities and concepts under 2…
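
To make the "string-based" end of the baseline spectrum concrete, here is a minimal sketch of a dictionary-lookup entity linker with an accuracy@1 check. It is an illustration only, not the authors' implementation: the concept IDs (`C001`, ...) are placeholders rather than real SNOMED CT codes, the labels and mentions are invented toy data rather than COMETA examples, and the function names `similarity`, `link`, and `accuracy_at_1` are hypothetical.

```python
from difflib import SequenceMatcher

# Hypothetical toy ontology. The IDs are placeholders, NOT real SNOMED CT codes,
# and the labels are invented for illustration.
CONCEPTS = {
    "C001": ["abdominal pain", "stomach ache"],
    "C002": ["headache", "cephalalgia"],
    "C003": ["myocardial infarction", "heart attack"],
}


def similarity(a, b):
    """Case-insensitive character-level similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def link(mention):
    """Return (concept_id, score) for the concept label closest to the mention."""
    best_id, best_score = None, -1.0
    for concept_id, labels in CONCEPTS.items():
        for label in labels:
            score = similarity(mention, label)
            if score > best_score:
                best_id, best_score = concept_id, score
    return best_id, best_score


def accuracy_at_1(gold_pairs):
    """Fraction of (mention, gold_concept_id) pairs whose top prediction is correct."""
    hits = sum(1 for mention, gold_id in gold_pairs if link(mention)[0] == gold_id)
    return hits / len(gold_pairs)


if __name__ == "__main__":
    gold = [("heart attack", "C003"), ("stomach aches", "C001")]
    for mention, _ in gold:
        print(mention, "->", link(mention))
    print("accuracy@1:", accuracy_at_1(gold))
```

A matcher like this only works when the mention shares surface characters with some concept label, which is one reason layman's phrasing is hard for purely string-based approaches.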

Code & Models

Repositories

cambridgeltl/cometa
PyTorch · Official

Models

🤗
cambridgeltl/BioRedditBERT-uncased
model· 201 dl· ♡ 6