How to tackle an emerging topic? Combining strong and weak labels for Covid news NER
Aleksander Ficek, Fangyu Liu, Nigel Collier

TL;DR
This paper introduces COVIDNEWS-NER, a new dataset for COVID-19 news NER, and proposes CONTROSTER, a method that effectively combines weak and strong labels through transfer learning to improve NER for emerging topics.
Contribution
The paper presents a novel COVIDNEWS-NER dataset and a strategic label combination approach, CONTROSTER, for enhancing NER in emerging medical topics.
Findings
Pretraining on weak data before fine-tuning on strong data improves NER performance.
Combining out-of-domain and in-domain weak labels is beneficial.
CONTROSTER outperforms models trained solely on strong or weak data.
Abstract
Being able to train Named Entity Recognition (NER) models for emerging topics is crucial for many real-world applications, especially in the medical domain, where new topics continuously evolve beyond the scope of existing models and datasets. For a realistic evaluation setup, we introduce a novel COVID-19 news NER dataset (COVIDNEWS-NER) and release 3000 hand-annotated, strongly labelled sentences and 13000 auto-generated, weakly labelled sentences. Besides the dataset, we propose CONTROSTER, a recipe for strategically combining weak and strong labels to improve NER in an emerging topic through transfer learning. We show the effectiveness of CONTROSTER on COVIDNEWS-NER while providing analysis on combining weak and strong labels for training. Our key findings are: (1) Using weak data to formulate an initial backbone before tuning on strong data outperforms methods trained…
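The two-stage idea in finding (1), fitting a backbone on weak labels and then continuing training on strong labels, can be illustrated with a toy sketch. This is not the CONTROSTER implementation: a tiny perceptron token tagger stands in for the paper's model, and the label set (`O`, `DISEASE`, `ORG`), the feature function, and the example sentences are all hypothetical, chosen only to make the training schedule visible.

```python
from collections import defaultdict

LABELS = ["O", "DISEASE", "ORG"]  # hypothetical label set, not the dataset's schema


def features(tokens, i):
    """Tiny hand-written feature set for token i (illustrative only)."""
    word = tokens[i]
    prev_word = tokens[i - 1].lower() if i > 0 else "<s>"
    return [
        "bias",
        "lower=" + word.lower(),
        "is_title=" + str(word.istitle()),
        "is_upper=" + str(word.isupper()),
        "prev=" + prev_word,
    ]


class PerceptronTagger:
    """Per-token multiclass perceptron; a stand-in for a real NER model."""

    def __init__(self, labels):
        self.labels = list(labels)
        self.weights = defaultdict(float)  # (feature, label) -> weight

    def _best_label(self, feats):
        # Ties are broken by label order, so "O" wins when all scores are equal.
        return max(
            self.labels,
            key=lambda lab: sum(self.weights.get((f, lab), 0.0) for f in feats),
        )

    def predict(self, tokens):
        return [self._best_label(features(tokens, i)) for i in range(len(tokens))]

    def fit(self, sentences, epochs):
        """Continue training from the current weights (no reset between calls)."""
        for _ in range(epochs):
            for tokens, gold in sentences:
                for i, gold_label in enumerate(gold):
                    feats = features(tokens, i)
                    guess = self._best_label(feats)
                    if guess != gold_label:
                        for f in feats:
                            self.weights[(f, gold_label)] += 1.0
                            self.weights[(f, guess)] -= 1.0
        return self


# Made-up weak data: plentiful but noisy (the ORG mention "WHO" is left unlabelled).
weak_data = [
    (["Cases", "of", "COVID-19", "rise"], ["O", "O", "DISEASE", "O"]),
    (["WHO", "issues", "guidance"], ["O", "O", "O"]),
]
# Made-up strong data: scarce but hand-checked.
strong_data = [
    (["WHO", "tracks", "COVID-19"], ["ORG", "O", "DISEASE"]),
]

query = ["WHO", "tracks", "COVID-19"]

tagger = PerceptronTagger(LABELS)
tagger.fit(weak_data, epochs=5)        # stage 1: build the backbone on weak labels
weak_only_pred = tagger.predict(query)

tagger.fit(strong_data, epochs=5)      # stage 2: tune the same weights on strong labels
final_pred = tagger.predict(query)

print("after weak stage:  ", weak_only_pred)
print("after strong stage:", final_pred)
```

The point of the sketch is that `fit` is called twice on the same object, so the second stage starts from the weak-label weights rather than from scratch. The weak stage inherits the labelling noise (the missed organisation), and the strong stage corrects it while keeping what the weak stage already learned about the disease mention.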
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification
