Beyond Boundaries: Learning a Universal Entity Taxonomy across Datasets and Languages for Open Named Entity Recognition
Yuming Yang, Wantong Zhao, Caishuang Huang, Junjie Ye, Xiao Wang, Huiyuan Zheng, Yang Nan, Yuran Wang, Xueying Xu, Kaixin Huang, Yunke Zhang, Tao Gui, Qi Zhang, Xuanjing Huang

TL;DR
This paper introduces B2NERD, a curated dataset that standardizes entity definitions across multiple datasets and languages, significantly improving Large Language Models' ability to perform open NER across diverse domains.
Contribution
The creation of B2NERD, a universal entity taxonomy dataset that enhances LLMs' generalization in open NER by resolving inconsistencies and redundancies in existing datasets.
Findings
Models trained on B2NERD outperform GPT-4 by 6.8-12.0 F1 points.
They also surpass previous methods on 3 out-of-domain benchmarks spanning 15 datasets and 6 languages.
B2NERD defines a universal taxonomy of over 400 entity types, refined from 54 existing English and Chinese datasets.
Abstract
Open Named Entity Recognition (NER), which involves identifying arbitrary types of entities from arbitrary domains, remains challenging for Large Language Models (LLMs). Recent studies suggest that fine-tuning LLMs on extensive NER data can boost their performance. However, training directly on existing datasets neglects their inconsistent entity definitions and redundant data, limiting LLMs to dataset-specific learning and hindering out-of-domain adaptation. To address this, we present B2NERD, a compact dataset designed to guide LLMs' generalization in Open NER under a universal entity taxonomy. B2NERD is refined from 54 existing English and Chinese datasets using a two-step process. First, we detect inconsistent entity definitions across datasets and clarify them by distinguishable label names to construct a universal taxonomy of 400+ entity types. Second, we address redundancy using…
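The first step described in the abstract, detecting entity labels that different datasets use inconsistently, can be illustrated with a small sketch. This is not the authors' procedure: the Jaccard-overlap heuristic, the 0.3 threshold, the function names, and the toy data are all assumptions made for illustration. The idea is only that a label shared by two datasets but applied to largely different mentions is a candidate for a more distinguishable label name.

```python
from itertools import combinations

# Toy annotations (invented): dataset name -> label -> set of annotated mentions.
DATASETS = {
    "news": {"location": {"paris", "france", "berlin"}},
    "social": {"location": {"starbucks", "central park", "paris"}},
    "wiki": {"location": {"paris", "france", "germany"}},
}


def jaccard(a, b):
    """Overlap of two mention sets, from 0.0 (disjoint) to 1.0 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def find_conflicts(datasets, threshold=0.3):
    """Flag labels shared by two datasets whose mention sets barely overlap.

    Hypothetical heuristic: a low overlap suggests the two datasets define
    the label differently, so a curator should review and rename it.
    """
    conflicts = []
    for (name_a, labels_a), (name_b, labels_b) in combinations(sorted(datasets.items()), 2):
        for label in sorted(labels_a.keys() & labels_b.keys()):
            score = jaccard(labels_a[label], labels_b[label])
            if score < threshold:
                conflicts.append((label, name_a, name_b, score))
    return conflicts


conflicts = find_conflicts(DATASETS)
for label, name_a, name_b, score in conflicts:
    print(f"review '{label}': {name_a} vs {name_b} (overlap {score:.2f})")
```

In this toy data the "social" dataset uses "location" for venues as well as cities, so it is flagged against both other datasets, while "news" and "wiki" agree closely enough to pass. A real pipeline would need a stronger signal than surface-form overlap.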
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Web Data Mining and Analysis
Methods: Residual Connection · Softmax · Layer Normalization · Byte Pair Encoding · Label Smoothing · Adam · Attention Is All You Need · Linear Layer · Multi-Head Attention · Position-Wise Feed-Forward Layer
