Multi-domain Clinical Natural Language Processing with MedCAT: the Medical Concept Annotation Toolkit

Zeljko Kraljevic; Thomas Searle; Anthony Shek; Lukasz Roguski; Kawsar Noor; Daniel Bean; Aurelie Mascio; Leilei Zhu; Amos A Folarin; Angus Roberts; Rebecca Bendayan; Mark P Richardson; Robert Stewart; Anoop D Shah; Wai Keong Wong; Zina Ibrahim; James T Teo; Richard JB Dobson

arXiv:2010.01165·cs.CL·March 26, 2021·Artif. Intell. Medicine

TL;DR

MedCAT is an open-source toolkit that leverages self-supervised learning to extract medical concepts from unstructured EHR text, enabling scalable, accurate, and cross-domain clinical information extraction.

Contribution

Introduces a novel self-supervised machine learning algorithm for medical concept extraction and an annotation interface, integrated into the CogStack ecosystem for deployment.

Findings

01. Improved UMLS concept extraction on open datasets (F1: 0.448-0.738 vs 0.429-0.650)

02. Successful SNOMED-CT extraction across three large London hospitals

03. High transferability (F1 > 0.94) between datasets and hospitals

Abstract

Electronic health records (EHR) contain large volumes of unstructured text, requiring the application of Information Extraction (IE) technologies to enable clinical analysis. We present the open-source Medical Concept Annotation Toolkit (MedCAT) that provides: a) a novel self-supervised machine learning algorithm for extracting concepts using any concept vocabulary including UMLS/SNOMED-CT; b) a feature-rich annotation interface for customising and training IE models; and c) integrations to the broader CogStack ecosystem for vendor-agnostic health system deployment. We show improved performance in extracting UMLS concepts from open datasets (F1: 0.448-0.738 vs 0.429-0.650). Further real-world validation demonstrates SNOMED-CT extraction at 3 large London hospitals with self-supervised training over ~8.8B words from ~17M clinical records and further fine-tuning with ~6K clinician…
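The F1 figures quoted above summarise how well extracted concept annotations agree with a gold standard. As a rough illustration only, the sketch below computes precision, recall, and F1 over sets of annotations. It assumes an annotation counts as correct only when its character span and concept identifier both match exactly; the matching criteria actually used in the paper are not stated in this summary. The `Annotation` type, the `precision_recall_f1` helper, and the concept identifiers are hypothetical placeholders, not part of MedCAT.

```python
from typing import NamedTuple, Set, Tuple


class Annotation(NamedTuple):
    """One extracted concept mention (hypothetical structure, not MedCAT's)."""
    start: int        # character offset where the mention begins
    end: int          # character offset where the mention ends
    concept_id: str   # identifier from the concept vocabulary


def precision_recall_f1(
    gold: Set[Annotation], predicted: Set[Annotation]
) -> Tuple[float, float, float]:
    """Score predictions against gold annotations using exact matching.

    Assumption: a prediction is a true positive only if span and
    concept identifier are both identical to a gold annotation.
    """
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Placeholder identifiers, not real UMLS or SNOMED-CT codes.
gold = {
    Annotation(0, 8, "CONCEPT_A"),
    Annotation(20, 32, "CONCEPT_B"),
    Annotation(40, 46, "CONCEPT_C"),
}
predicted = {
    Annotation(0, 8, "CONCEPT_A"),
    Annotation(20, 32, "CONCEPT_B"),
    Annotation(50, 55, "CONCEPT_D"),
}

precision, recall, f1 = precision_recall_f1(gold, predicted)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Here two of three predictions match two of three gold annotations, so precision, recall, and F1 all come out at 2/3. A more lenient scheme (for example, counting overlapping spans as matches) would generally give higher scores, so reported F1 values are only comparable when the matching rule is the same.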

Code & Models

Repositories

CogStack/MedCAT
PyTorch · Official