Text Classification in the Wild: a Large-scale Long-tailed Name Normalization Dataset
Jiexing Qi, Shuhao Li, Zhixin Guo, Yusheng Huang, Chenghu Zhou, Weinan Zhang, Xinbing Wang, and Zhouhan Lin

TL;DR
This paper introduces a large-scale, naturally long-tailed institution name normalization dataset, along with baseline methods and a specialized BERT-based model, to address long-tailed and open-set classification challenges in real-world NLP applications.
Contribution
The creation of LoT-insts, the largest naturally long-tailed dataset for institution name normalization, and the development of a BERT-based model with improved out-of-distribution generalization.
Findings
The dataset is larger than existing datasets and has a naturally occurring long-tailed distribution.
The proposed BERT-based model outperforms baseline methods on few-shot and zero-shot sets.
Baseline methods show significant performance gaps in long-tailed and open-set scenarios.
Abstract
Real-world data usually exhibits a long-tailed distribution, with a few frequent labels and a lot of few-shot labels. The study of institution name normalization is a perfect application case showing this phenomenon. There are many institutions worldwide with enormous variations of their names in the publicly available literature. In this work, we first collect a large-scale institution name normalization dataset LoT-insts, which contains over 25k classes that exhibit a naturally long-tailed distribution. In order to isolate the few-shot and zero-shot learning scenarios from the massive many-shot classes, we construct our test set from four different subsets: many-, medium-, and few-shot sets, as well as a zero-shot open set. We also replicate several important baseline methods on our data, covering a wide range from search-based methods to neural network methods that use the pretrained…
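The four-way test split described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `split_by_shot` and the thresholds `many_min` and `few_max` are assumptions, since the summary does not state the frequency cut-offs that separate many-, medium-, and few-shot classes. Labels that never appear in training are treated as the zero-shot open set.

```python
from collections import Counter


def split_by_shot(train_labels, test_examples, many_min=20, few_max=5):
    """Bucket test examples by how often their label occurs in training.

    many_min and few_max are illustrative thresholds (assumptions),
    not values taken from the paper.
    """
    freq = Counter(train_labels)
    buckets = {"many": [], "medium": [], "few": [], "zero": []}
    for text, label in test_examples:
        n = freq.get(label, 0)
        if n == 0:
            # Label unseen during training: zero-shot open set.
            buckets["zero"].append((text, label))
        elif n <= few_max:
            buckets["few"].append((text, label))
        elif n >= many_min:
            buckets["many"].append((text, label))
        else:
            buckets["medium"].append((text, label))
    return buckets


# Toy data with hypothetical institution labels A-D.
train_labels = ["A"] * 30 + ["B"] * 10 + ["C"] * 2
test_examples = [
    ("Univ. of A", "A"),
    ("B Institute", "B"),
    ("C Lab", "C"),
    ("D College", "D"),
]
buckets = split_by_shot(train_labels, test_examples)
for name, items in buckets.items():
    print(name, [label for _, label in items])
```

Reporting accuracy separately per bucket keeps the many-shot classes from dominating the overall score, which is the stated motivation for isolating the few-shot and zero-shot subsets.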
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Text and Document Classification Technologies
Methods: Multi-Head Attention · Attention Is All You Need · Attention Dropout · Dropout · Linear Warmup With Linear Decay · Residual Connection · WordPiece · Softmax
