Automatic Creation of Named Entity Recognition Datasets by Querying Phrase Representations
Hyunjae Kim, Jaehyo Yoo, Seunghyun Yoon, Jaewoo Kang

TL;DR
This paper introduces HighGEN, a novel framework that automatically creates high-coverage NER datasets using phrase embedding search and a verification process, significantly improving performance over previous methods.
Contribution
HighGEN is the first framework to generate high-coverage pseudo-dictionaries for NER using dense phrase embedding search and embedding-based verification.
Findings
HighGEN achieves an average F1 score improvement of 4.7 over previous models.
HighGEN outperforms prior methods across five NER benchmark datasets.
The approach effectively reduces false positives in weakly supervised NER datasets.
Abstract
Most weakly supervised named entity recognition (NER) models rely on domain-specific dictionaries provided by experts. This approach is infeasible in many domains where dictionaries do not exist. While a phrase retrieval model was used to construct pseudo-dictionaries with entities retrieved from Wikipedia automatically in a recent study, these dictionaries often have limited coverage because the retriever is likely to retrieve popular entities rather than rare ones. In this study, we present a novel framework, HighGEN, that generates NER datasets with high-coverage pseudo-dictionaries. Specifically, we create entity-rich dictionaries with a novel search method, called phrase embedding search, which encourages the retriever to search a space densely populated with various entities. In addition, we use a new verification process based on the embedding distance between candidate entity…
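The two ideas named in the abstract, querying with phrase embeddings and filtering candidates by embedding distance, can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not the authors' implementation: the 2-D vectors, the `PHRASE_INDEX` and `TYPE_CENTROIDS` tables, the function names, and in particular the choice to compare each candidate against per-type centroid vectors, since the abstract text above is truncated before it says what the candidate is compared with.

```python
import math

# Hypothetical toy phrase index: phrase -> 2-D embedding (made-up values).
PHRASE_INDEX = {
    "aspirin": [0.9, 0.1],
    "ibuprofen": [0.8, 0.2],
    "naproxen": [0.85, 0.15],
    "headache": [0.1, 0.9],
    "fever": [0.2, 0.8],
    "pain relief": [0.5, 0.6],
}

# Hypothetical reference vectors, one per entity type (an assumption).
TYPE_CENTROIDS = {"drug": [1.0, 0.0], "symptom": [0.0, 1.0]}


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))


def phrase_embedding_search(seed_phrases, index, top_k):
    """Query the index with the mean embedding of known seed phrases,
    rather than with an embedded natural-language question."""
    seeds = [normalize(index[p]) for p in seed_phrases]
    dim = len(seeds[0])
    query = [sum(v[i] for v in seeds) / len(seeds) for i in range(dim)]
    ranked = sorted(index, key=lambda p: cosine(query, index[p]), reverse=True)
    return ranked[:top_k]


def verify(candidates, index, centroids, target_type):
    """Keep a candidate only if its nearest type centroid is the target type."""
    kept = []
    for phrase in candidates:
        nearest = max(centroids, key=lambda t: cosine(index[phrase], centroids[t]))
        if nearest == target_type:
            kept.append(phrase)
    return kept


candidates = phrase_embedding_search(["aspirin", "ibuprofen"], PHRASE_INDEX, top_k=4)
verified = verify(candidates, PHRASE_INDEX, TYPE_CENTROIDS, "drug")
```

In this toy setup the search step surfaces "naproxen", which was not among the seeds, along with the looser match "pain relief"; the verification step then drops "pain relief" because it lies closer to the "symptom" vector than to the "drug" vector. A real system would use a trained phrase encoder and an approximate nearest-neighbor index instead of a sorted dictionary.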
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Text and Document Classification Technologies
