Stratified Sampling for Extreme Multi-Label Data
Maximillian Merrillees, Lan Du

TL;DR
This paper introduces a novel algorithm for creating stratified data partitions in extreme multi-label classification, addressing the lack of effective stratification methods and improving dataset representativeness for better model evaluation.
Contribution
The paper proposes a simple, efficient algorithm for stratified partitioning of XML datasets with millions of labels, filling a gap in current dataset splitting methods.
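To make "stratified partitioning" concrete, here is a minimal sketch of a generic rarest-label-first greedy heuristic. It is not the algorithm proposed in the paper, which this summary does not describe. The function name `greedy_stratified_split`, the `test_fraction` parameter, and the rounding rule are illustrative assumptions, and the sketch makes no claim about scaling to millions of labels.

```python
from collections import defaultdict

def greedy_stratified_split(rows, test_fraction=0.25):
    """Hypothetical helper: return one bool per example (True = test fold).

    rows is a list of label sets. Labels are processed from rarest to most
    common so that rare labels are balanced before common ones.
    """
    label_to_rows = defaultdict(list)
    for i, labels in enumerate(rows):
        for label in labels:
            label_to_rows[label].append(i)

    in_test = [None] * len(rows)
    seen = defaultdict(int)  # per label: examples assigned so far
    test = defaultdict(int)  # per label: examples assigned to test so far

    def place(i, key):
        # Send the example to test only if that keeps the test count for
        # `key` closest to test_fraction of the examples seen for `key`.
        goes_to_test = test[key] + 0.5 < test_fraction * (seen[key] + 1)
        in_test[i] = goes_to_test
        for label in rows[i]:
            seen[label] += 1
            test[label] += goes_to_test

    # Sort by frequency, breaking ties by name for deterministic output.
    order = sorted(label_to_rows, key=lambda l: (len(label_to_rows[l]), str(l)))
    for label in order:
        for i in label_to_rows[label]:
            if in_test[i] is None:
                place(i, label)

    # Assumption: examples with no labels are simply left in train.
    return [bool(flag) for flag in in_test]
```

With eight examples labelled only `"a"` and one labelled only `"b"`, this sketch sends two of the `"a"` examples to the test fold and keeps the single `"b"` example in train, which also illustrates why labels with one example cannot be represented in both folds.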
Findings
Existing benchmark splits are often unrepresentative of the full dataset.
Stratified partitions improve the reliability of model evaluation.
Stratification is challenging but crucial for XML datasets.
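One way to check whether a given split is representative is to compare per-label counts across the two folds. The sketch below is an illustrative diagnostic, not the paper's evaluation protocol. The names `label_counts` and `split_report` and the gap metric are assumptions introduced here, and both folds are assumed to be non-empty lists of label sets.

```python
from collections import Counter

def label_counts(rows):
    """Count how many examples carry each label; rows is a list of label sets."""
    counts = Counter()
    for labels in rows:
        counts.update(set(labels))
    return counts

def split_report(train_rows, test_rows):
    """Hypothetical diagnostic for how well a train/test split covers the labels."""
    train, test = label_counts(train_rows), label_counts(test_rows)
    all_labels = set(train) | set(test)
    # Share of all examples that sit in the test fold.
    target = len(test_rows) / (len(train_rows) + len(test_rows))
    # Mean absolute gap between each label's test share and the overall test share.
    gap = sum(
        abs(test[label] / (train[label] + test[label]) - target)
        for label in all_labels
    ) / len(all_labels)
    return {
        "missing_from_test": sorted(all_labels - set(test)),
        "missing_from_train": sorted(all_labels - set(train)),
        "mean_share_gap": gap,
    }
```

A gap of zero means every label is divided between the folds in the same ratio as the examples themselves. Labels listed under `missing_from_test` cannot contribute to any test metric, and labels under `missing_from_train` cannot be learned at all.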
Abstract
Extreme multi-label classification (XML) is becoming increasingly relevant in the era of big data. Yet there is no method for effectively generating stratified partitions of XML datasets. Instead, researchers typically rely on provided test-train splits that 1) aren't always representative of the entire dataset and 2) are missing many of the labels. This can lead to poor generalization ability and unreliable performance estimates, as has been established in the binary and multi-class settings. As such, this paper presents a new and simple algorithm that can efficiently generate stratified partitions of XML datasets with millions of unique labels. We also examine the label distributions of prevailing benchmark splits and investigate the issues that arise from using unrepresentative subsets of data for model development. The results highlight the difficulty of stratifying XML data,…
Taxonomy
Topics: Text and Document Classification Technologies · Algorithms and Data Compression · Image Retrieval and Classification Techniques
