On the Necessity of World Knowledge for Mitigating Missing Labels in Extreme Classification
Jatin Prakash, Anirudh Buvanesh, Bishal Santra, Deepak Saini, Sachin Yadav, Jian Jiao, Yashoteja Prabhu, Amit Sharma, Manik Varma

TL;DR
This paper introduces SKIM, a scalable method combining small language models and unstructured meta-data to address missing labels in extreme classification, significantly improving retrieval performance and online ad click-yield.
Contribution
The paper presents SKIM, a novel scalable algorithm that leverages small language models and meta-data to mitigate missing labels in large-scale extreme classification tasks.
Findings
SKIM outperforms existing methods on Recall@100 by over 10 points.
SKIM scales to datasets with 10 million documents, outperforming others by 12%.
Online A/B tests show a 1.23% increase in ad click-yield.
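For readers unfamiliar with the Recall@100 metric cited in the findings, the following is a minimal sketch of how Recall@k is commonly computed for a single query. It assumes the standard definition (the fraction of relevant documents that appear in the top-k retrieved results); the paper's exact evaluation protocol may differ. The function name `recall_at_k` and the integer document IDs are hypothetical and used only for illustration.

```python
from typing import Sequence, Set


def recall_at_k(ranked_docs: Sequence[int], relevant_docs: Set[int], k: int = 100) -> float:
    """Fraction of relevant documents found in the top-k ranked results.

    Assumes the common Recall@k definition; not taken from the paper.
    """
    if not relevant_docs:
        # No ground-truth labels for this query: define recall as 0.0 by convention.
        return 0.0
    top_k = set(ranked_docs[:k])
    return len(top_k & relevant_docs) / len(relevant_docs)


# Hypothetical example: a ranked list of document IDs and the known relevant set.
ranked = [7, 3, 9, 1]
relevant = {3, 1, 5}
score_at_2 = recall_at_k(ranked, relevant, k=2)  # only doc 3 is in the top 2
score_at_4 = recall_at_k(ranked, relevant, k=4)  # docs 3 and 1 are in the top 4
```

Note that when ground-truth labels are themselves missing, as the paper argues is common in click-derived datasets, a metric like this is computed only against the observed relevant set.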
Abstract
Extreme Classification (XC) aims to map a query to the most relevant documents from a very large document set. XC algorithms used in real-world applications learn this mapping from datasets curated from implicit feedback, such as user clicks. However, these datasets inevitably suffer from missing labels. In this work, we observe that systematic missing labels lead to missing knowledge, which is critical for accurately modelling relevance between queries and documents. We formally show that this absence of knowledge cannot be recovered using existing methods such as propensity weighting and data imputation strategies that solely rely on the training dataset. While LLMs provide an attractive solution to augment the missing knowledge, leveraging them in applications with low latency requirements and large document sets is challenging. To incorporate missing knowledge at scale, we propose…
Taxonomy
Topics: Data-Driven Disease Surveillance · Artificial Immune Systems Applications · Advanced Computational Techniques and Applications
