Text Classification in the Wild: a Large-scale Long-tailed Name Normalization Dataset

Jiexing Qi; Shuhao Li; Zhixin Guo; Yusheng Huang; Chenghu Zhou; Weinan Zhang; Xinbing Wang; and Zhouhan Lin

arXiv:2302.09509·cs.CL·February 21, 2023

TL;DR

This paper introduces a large-scale, naturally long-tailed institution name normalization dataset, along with baseline methods and a specialized BERT-based model, to address long-tailed and open-set classification challenges in real-world NLP applications.

Contribution

The creation of LoT-insts, the largest natural long-tailed dataset for institution name normalization, and the development of a BERT-based model with improved out-of-distribution generalization.

Findings

01

The dataset is larger than existing institution name normalization datasets and exhibits a naturally long-tailed distribution.

02

The proposed BERT-based model outperforms baseline methods on few-shot and zero-shot sets.

03

Baseline methods show significant performance gaps in long-tailed and open-set scenarios.

Abstract

Real-world data usually exhibits a long-tailed distribution, with a few frequent labels and a lot of few-shot labels. The study of institution name normalization is a perfect application case showing this phenomenon. There are many institutions worldwide with enormous variations of their names in the publicly available literature. In this work, we first collect a large-scale institution name normalization dataset LoT-insts, which contains over 25k classes that exhibit a naturally long-tailed distribution. In order to isolate the few-shot and zero-shot learning scenarios from the massive many-shot classes, we construct our test set from four different subsets: many-, medium-, and few-shot sets, as well as a zero-shot open set. We also replicate several important baseline methods on our data, covering a wide range from search-based methods to neural network methods that use the pretrained…
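
The four-way test split described in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the function name `bucket_classes` and the thresholds `few_max` and `many_min` are hypothetical, chosen only to show how classes might be grouped by their training frequency, with classes absent from training forming the zero-shot open set.

```python
from collections import Counter


def bucket_classes(train_labels, test_labels, few_max=5, many_min=20):
    """Group class labels by how often they occur in the training data.

    The thresholds are illustrative assumptions, not values from the paper:
      zero   : label never seen in training (open set)
      few    : fewer than `few_max` training examples
      medium : between `few_max` and `many_min` examples, inclusive
      many   : more than `many_min` training examples
    """
    counts = Counter(train_labels)
    buckets = {"many": set(), "medium": set(), "few": set(), "zero": set()}
    for label in set(test_labels) | set(counts):
        n = counts.get(label, 0)
        if n == 0:
            buckets["zero"].add(label)
        elif n < few_max:
            buckets["few"].add(label)
        elif n <= many_min:
            buckets["medium"].add(label)
        else:
            buckets["many"].add(label)
    return buckets


# Toy example with made-up institution labels.
train = ["A"] * 30 + ["B"] * 10 + ["C"] * 2
test = ["A", "B", "C", "D"]
buckets = bucket_classes(train, test)
print(buckets)
```

Evaluating accuracy separately on each bucket keeps results on rare and unseen classes from being hidden by the frequent ones, which is the stated reason for isolating the few-shot and zero-shot sets.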

Code & Models

Repositories

lumia-group/lot-insts (PyTorch, official)

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Text and Document Classification Technologies

Methods: Multi-Head Attention · Attention Is All You Need · Attention Dropout · Dropout · Linear Warmup With Linear Decay · Residual Connection · WordPiece · Softmax