An Empirical Study on Large-Scale Multi-Label Text Classification Including Few and Zero-Shot Labels

Ilias Chalkidis; Manos Fergadiotis; Sotiris Kotitsas; Prodromos Malakasiotis; Nikolaos Aletras; Ion Androutsopoulos

arXiv:2010.01653·cs.CL·October 6, 2020

TL;DR

This paper empirically evaluates various large-scale multi-label text classification methods, demonstrating the superiority of hierarchical and Transformer-based models, and introduces new approaches for improved few and zero-shot learning leveraging label hierarchies.

Contribution

It provides the first comprehensive empirical comparison of LMTC methods, introduces a new state-of-the-art combining BERT with LWANs, and proposes models leveraging label hierarchies for better few and zero-shot learning.

Findings

01. Hierarchical Probabilistic Label Trees outperform flat LWANs.

02. Transformer-based models outperform previous state-of-the-art in two datasets.

03. New models leveraging label hierarchies improve few and zero-shot learning.

Abstract

Large-scale Multi-label Text Classification (LMTC) has a wide range of Natural Language Processing (NLP) applications and presents interesting challenges. First, not all labels are well represented in the training set, due to the very large label set and the skewed label distributions of LMTC datasets. Also, label hierarchies and differences in human labelling guidelines may affect graph-aware annotation proximity. Finally, the label hierarchies are periodically updated, requiring LMTC models capable of zero-shot generalization. Current state-of-the-art LMTC models employ Label-Wise Attention Networks (LWANs), which (1) typically treat LMTC as flat multi-label classification; (2) may use the label hierarchy to improve zero-shot learning, although this practice is vastly understudied; and (3) have not been combined with pre-trained Transformers (e.g. BERT), which have led to…
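As a rough illustration of the label-wise attention idea named in the abstract, the sketch below gives each label its own attention distribution over token representations, builds a label-specific document vector, and scores it with a per-label logistic output. This is a minimal NumPy sketch of the general technique, not the authors' implementation: the function name, the parameter names `H`, `U`, `W`, `b`, the toy sizes, and the random inputs are all assumptions made for illustration, and in practice the token representations would come from an encoder (e.g. a BiGRU or BERT).

```python
import numpy as np

def label_wise_attention(H, U, W, b):
    """Illustrative label-wise attention head (hypothetical helper).

    H: (T, d) token representations from some encoder (assumed given).
    U: (L, d) one attention query vector per label.
    W: (L, d) one output weight vector per label.
    b: (L,)   one bias per label.
    Returns per-label probabilities (L,) and attention weights (L, T).
    """
    scores = U @ H.T                                     # (L, T) label-token scores
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn = attn / attn.sum(axis=1, keepdims=True)        # softmax over tokens, per label
    D = attn @ H                                         # (L, d) label-specific document vectors
    logits = (W * D).sum(axis=1) + b                     # independent logit per label
    probs = 1.0 / (1.0 + np.exp(-logits))                # sigmoid: multi-label, not softmax
    return probs, attn

# Toy usage with random values (sizes are arbitrary assumptions).
rng = np.random.default_rng(0)
T, d, L = 6, 4, 3
H = rng.normal(size=(T, d))
U = rng.normal(size=(L, d))
W = rng.normal(size=(L, d))
b = np.zeros(L)
probs, attn = label_wise_attention(H, U, W, b)
```

Because every label has its own query and output vector, this formulation treats the label set as flat: nothing in it encodes parent-child relations between labels.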

Code & Models

Repositories

iliaschalkidis/lmtc-eurlex57k · tf · Official

Taxonomy

Methods: Linear Layer · Dense Connections · Layer Normalization · WordPiece · Multi-Head Attention · Dropout · Linear Warmup With Linear Decay · Attention Dropout · Weight Decay · Attention Is All You Need