From Lazy to Prolific: Tackling Missing Labels in Open Vocabulary Extreme Classification by Positive-Unlabeled Sequence Learning

Ranran Haoran Zhang; Bensu Uçar; Soumik Dey; Hansi Wu; Binbin Li; Rui Zhang

arXiv:2408.08981·cs.IR·January 10, 2025

TL;DR

This paper introduces PUSL, a novel approach for open-vocabulary extreme multi-label classification that addresses missing labels and evaluation issues by reframing the task as an infinite keyphrase generation problem, leading to improved label generation and more reliable assessment.

Contribution

The paper proposes Positive-Unlabeled Sequence Learning (PUSL), a method that tackles under-generation of labels ("laziness") and unreliable evaluation in Open-vocabulary Extreme Multi-label Classification (OXMC) by reformulating the task as keyphrase generation and adopting new evaluation metrics.

Findings

01

PUSL generates 30% more unique labels in imbalanced datasets.

02

72% of PUSL's predictions match actual user queries.

03

PUSL outperforms existing methods in F1 scores as label counts increase.

Abstract

Open-vocabulary Extreme Multi-label Classification (OXMC) extends traditional XMC by allowing prediction beyond an extremely large, predefined label set (typically $10^{3}$ to $10^{12}$ labels), addressing the dynamic nature of real-world labeling tasks. However, self-selection bias in data annotation leads to significant missing labels in both training and test data, particularly for less popular inputs. This creates two critical challenges: generation models learn to be "lazy" by under-generating labels, and evaluation becomes unreliable due to insufficient annotation in the test set. In this work, we introduce Positive-Unlabeled Sequence Learning (PUSL), which reframes OXMC as an infinite keyphrase generation task, addressing the generation model's laziness. Additionally, we propose to adopt a suite of evaluation metrics, F1@$O$ and newly proposed B@$k$, to reliably assess…
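The abstract names F1@$O$ without defining it. As a rough illustration only, the sketch below assumes the convention used in keyphrase generation work, where $O$ is the number of ground-truth labels for an input and only the top-$O$ predictions are scored. The function name `f1_at_o` and the exact-match comparison are assumptions made for this example, not the paper's implementation. B@$k$ is not sketched because its definition is not given in this excerpt.

```python
from typing import Sequence


def f1_at_o(predicted: Sequence[str], gold: Sequence[str]) -> float:
    """F1 over the top-O predictions, where O = number of gold labels.

    Assumption: labels are compared by exact string match, and
    `predicted` is already ranked from most to least confident.
    """
    gold_set = set(gold)
    if not gold_set:
        return 0.0

    # Drop duplicate predictions while keeping the ranking order.
    unique_preds = list(dict.fromkeys(predicted))

    # Keep only as many predictions as there are gold labels.
    top = unique_preds[: len(gold_set)]
    if not top:
        return 0.0

    hits = sum(1 for label in top if label in gold_set)
    if hits == 0:
        return 0.0

    precision = hits / len(top)
    recall = hits / len(gold_set)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels: three gold labels, so only the first three
# predictions are scored; two of them match.
score = f1_at_o(
    ["wireless earbuds", "phone case", "bluetooth headphones", "usb cable"],
    ["wireless earbuds", "bluetooth headphones", "noise cancelling"],
)
```

Under this assumed definition, a model that emits fewer than $O$ labels keeps its precision but loses recall, which is one way a metric of this kind can expose under-generation.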

Taxonomy

Topics: Natural Language Processing Techniques

Methods: Sparse Evolutionary Training · ALIGN