TL;DR
This paper introduces a decoupled softmax loss for dual-encoder models, significantly improving their performance and efficiency in extreme multi-label classification tasks, matching or surpassing state-of-the-art methods.
Contribution
Proposes a novel decoupled softmax loss and a top-k operator-based loss for dual-encoder models, enhancing their effectiveness and parameter efficiency in XMC tasks.
Findings
Standard dual-encoder models with the new loss match or outperform SOTA by up to 2% at Precision@1.
The proposed models have 20x fewer trainable parameters than the SOTA methods they are compared against.
The approach achieves competitive results on large XMC datasets.
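Precision@1, the metric quoted in the findings above, is the fraction of queries whose top-ranked label is relevant; Precision@k generalizes this to the top k labels. The sketch below is illustrative only: the function name `precision_at_k` and the toy scores are assumptions, not taken from the paper or its code.

```python
def precision_at_k(scores, relevant, k):
    """Fraction of the k highest-scoring labels that are relevant.

    scores:   list of floats, one score per label (hypothetical toy input)
    relevant: set of indices of the ground-truth labels
    k:        number of top-ranked labels to inspect
    """
    # Rank label indices by descending score and keep the first k.
    top_k = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    hits = sum(1 for i in top_k if i in relevant)
    return hits / k


# Toy example: four labels, two of which (indices 1 and 3) are relevant.
toy_scores = [0.1, 0.9, 0.5, 0.3]
toy_relevant = {1, 3}
p_at_1 = precision_at_k(toy_scores, toy_relevant, 1)
p_at_2 = precision_at_k(toy_scores, toy_relevant, 2)
```

On a benchmark the value is averaged over all test queries, so a 2% gain at Precision@1 means the top prediction is correct for two more queries out of every hundred.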
Abstract
Dual-encoder (DE) models are widely used in retrieval tasks, most commonly studied on open QA benchmarks that are often characterized by multi-class and limited training data. In contrast, their performance in multi-label and data-rich retrieval settings like extreme multi-label classification (XMC) remains under-explored. Current empirical evidence indicates that DE models fall significantly short on XMC benchmarks, where SOTA methods linearly scale the number of learnable parameters with the total number of classes (documents in the corpus) by employing a per-class classification head. To this end, we first study and highlight that existing multi-label contrastive training losses are not appropriate for training DE models on XMC tasks. We propose decoupled softmax loss - a simple modification to the InfoNCE loss - that overcomes the limitations of existing contrastive losses. We…
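The abstract describes the decoupled softmax loss only as a simple modification to InfoNCE. As a rough sketch, assume the modification consists of dropping the other positive labels from each positive's softmax denominator, so that positives compete only against negatives. That reading is an assumption made here and is not spelled out on this page. All function names are hypothetical, and the code works on a single query's list of query-label similarity scores.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def infonce_multilabel_loss(scores, positives):
    """InfoNCE-style softmax loss summed over positives.

    Each positive is normalized against ALL labels, so positives
    compete with one another for probability mass.
    """
    log_z = logsumexp(scores)
    return -sum(scores[j] - log_z for j in positives)


def decoupled_softmax_loss(scores, positives):
    """Assumed decoupled variant: each positive is normalized against
    itself plus the negatives only; other positives are excluded."""
    pos = set(positives)
    negatives = [s for k, s in enumerate(scores) if k not in pos]
    return -sum(
        scores[j] - logsumexp([scores[j]] + negatives) for j in positives
    )


# Toy query: labels 0 and 1 are positive and already score far above
# the two negatives.
toy_scores = [5.0, 5.0, -5.0, -5.0]
toy_positives = [0, 1]
coupled = infonce_multilabel_loss(toy_scores, toy_positives)
decoupled = decoupled_softmax_loss(toy_scores, toy_positives)
```

In this toy case the ranking is already perfect, yet the InfoNCE-style loss stays near 2 log 2 (about 1.39) because the two positives split the softmax mass, while the decoupled variant is close to zero. With a single positive per query the two functions return the same value.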
Peer Reviews
Decision · ICLR 2024 poster
To the best of my knowledge, the work is original and significant. Although the distinction between using a single multi-class cross-entropy vs using multiple binary cross-entropies (OvA-BCE) has long been well understood, it isn't immediately obvious that the DE setting would be so different, and that the OvA-BCE loss would fail to train. In the context of Retrieval-Augmented Generation (RAG), when the retriever is not trained in an end-to-end manner with the answerer, one often conceptualize
... (continued from strengths) However, I wouldn't have come to this realization from the manuscript's abstract nor introduction, and that nugget would have been lost on me had I encountered the manuscript outside of a reviewing context. I understand that this work mainly targets the XMC literature (of which - full disclosure - I personally don't know much), but I still believe that the authors should dedicate some of their high-level discussions (i.e., abstract and/or introduction), to the sign
- The work is well-motivated and addresses the practical problem of getting dual encoder methods to work in the extreme multilabel setting.
- The proposed contribution is simple, as it is just a loss function paired with either a negative mining approach or a memory-efficient implementation using all negatives.
- The ablation of the different loss variants -- soft top-5 and soft top-100 -- is compelling and shows that the method can effectively optimize precision or recall at 5 or 100, respect
- I might have missed something, but I think it should be made clearer earlier on in the paper that the differentiable top-k operator had been proposed previously [1]. The authors also link to the author of the stackexchange answer, but it would be ideal to cite the specific answer at the link (please correct me if this appears somewhere in the main text, but I couldn't find it). Relatedly, is there other work that uses the formulation by Thomas Ahle? For example, how does this formulation compa
- The authors highlight a neglected problem in the XMC dual-encoder training stage. The proposed Decoupled Loss effectively solves this problem.
- The theoretical part of the paper is well-presented, with Section 3 providing clear symbol definitions.
- The main motivation for this paper is the imperfect design of current dual-encoder training loss. However, there is a lack of evidence that this has been a general issue for current XMC methods. Most discussions and experiments are designed to compare the Decoupled Loss and regular loss using the authors' own training framework. Some experiments are implemented using a synthetic dataset (Fig. 2) or pre-selected labels (Fig. 3). After reading the entire paper, I believe the proposed loss can so
Code & Models
- 🤗 quicktensor/dexml_lf-wikipedia-500k · model · 3 dl · ♡ 1
- 🤗 quicktensor/dexml_lf-amazontitles-1.3m · model · 4 dl · ♡ 1
- 🤗 quicktensor/dexml_lf-amazontitles-131k · model · 1 dl
- 🤗 quicktensor/dexml_eurlex-4k · model · 1 dl
- 🤗 thekop79/dexml_movielens-100k · model
- 🤗 thekop79/dexml_movielens-33M · model
- 🤗 thekop79/dexml_movielens-25M · model
- 🤗 thekop79/dexml_eurlex-4k_hnm · model
Taxonomy
Topics: Text and Document Classification Technologies · Domain Adaptation and Few-Shot Learning · Machine Learning and Data Classification
Methods: InfoNCE · Softmax
