SwellShark: A Generative Model for Biomedical Named Entity Recognition without Labeled Data
Jason Fries, Sen Wu, Alex Ratner, Christopher Ré

TL;DR
SwellShark is a novel framework that leverages weak supervision and generative modeling to develop high-accuracy biomedical NER systems without requiring manually labeled data, significantly reducing annotation effort.
Contribution
The paper introduces SwellShark, a method that uses biomedical resources as function primitives for weak supervision and a generative model to produce large-scale labeled datasets without manual annotation.
Findings
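Requires no hand-labeled training data
Scores competitively with state-of-the-art supervised benchmarks on three biomedical NER tasks
In drug name extraction from patient medical records, one domain expert working for 24 hours came within 5.1% of a crowdsourced annotation approach that originally used 20 teams over several weeks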
Abstract
We present SwellShark, a framework for building biomedical named entity recognition (NER) systems quickly and without hand-labeled data. Our approach views biomedical resources like lexicons as function primitives for autogenerating weak supervision. We then use a generative model to unify and denoise this supervision and construct large-scale, probabilistically labeled datasets for training high-accuracy NER taggers. In three biomedical NER tasks, SwellShark achieves competitive scores with state-of-the-art supervised benchmarks using no hand-labeled training data. In a drug name extraction task using patient medical records, one domain expert using SwellShark achieved within 5.1% of a crowdsourced annotation approach -- which originally utilized 20 teams over the course of several weeks -- in 24 hours.
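As a rough illustration of the pipeline the abstract describes, the sketch below turns lexicons into labeling functions and merges their noisy votes into probabilistic labels. It is a minimal sketch under stated assumptions, not the authors' implementation: the lexicons, the suffix heuristic, and every function name are hypothetical, and the accuracies are fixed by hand. The paper's generative model estimates how reliable each source is without labels; here a simpler stand-in, accuracy-weighted voting under a conditional-independence assumption with a uniform prior, takes its place.

```python
import math
from typing import Callable, List, Set

# Vote values: a labeling function may abstain on a token.
NEG, ABSTAIN, POS = -1, 0, 1

# Hypothetical toy lexicons; a real system would load curated biomedical resources.
DRUG_LEXICON: Set[str] = {"aspirin", "ibuprofen", "warfarin"}
COMMON_WORDS: Set[str] = {"the", "of", "patient", "daily"}


def lexicon_lf(lexicon: Set[str], label: int) -> Callable[[str], int]:
    """Wrap a lexicon as a labeling function: vote `label` on a match, else abstain."""
    def lf(token: str) -> int:
        return label if token.lower() in lexicon else ABSTAIN
    return lf


def suffix_lf(token: str) -> int:
    """Hypothetical heuristic: vote POS for tokens with a few drug-like suffixes."""
    return POS if token.lower().endswith(("profen", "statin", "cillin")) else ABSTAIN


LABELING_FUNCTIONS: List[Callable[[str], int]] = [
    lexicon_lf(DRUG_LEXICON, POS),
    lexicon_lf(COMMON_WORDS, NEG),
    suffix_lf,
]


def label_matrix(tokens: List[str]) -> List[List[int]]:
    """Apply every labeling function to every token (rows: tokens, columns: functions)."""
    return [[lf(tok) for lf in LABELING_FUNCTIONS] for tok in tokens]


def probabilistic_labels(votes: List[List[int]], accuracies: List[float]) -> List[float]:
    """Return P(token is an entity) for each row of votes.

    Assumes each labeling function is conditionally independent given the true
    label and is correct with the given (hand-set, assumed) accuracy when it
    does not abstain. Each vote then adds +/- log(a / (1 - a)) to the log-odds.
    """
    probs = []
    for row in votes:
        log_odds = sum(v * math.log(a / (1.0 - a)) for v, a in zip(row, accuracies))
        probs.append(1.0 / (1.0 + math.exp(-log_odds)))
    return probs


if __name__ == "__main__":
    tokens = ["Patient", "takes", "ibuprofen", "daily"]
    probs = probabilistic_labels(label_matrix(tokens), accuracies=[0.9, 0.8, 0.7])
    for tok, p in zip(tokens, probs):
        print(f"{tok}: {p:.3f}")
```

The resulting probabilities, rather than hard labels, would then serve as training targets for a discriminative NER tagger. Tokens on which every function abstains stay at 0.5, which reflects that the weak supervision carries no evidence about them.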
Taxonomy
Topics: Topic Modeling · Biomedical Text Mining and Ontologies · Natural Language Processing Techniques
