TL;DR
This paper introduces MUFIN, a multi-modal extreme classification method that effectively handles millions of labels by combining visual and textual data, achieving higher accuracy in large-scale product recommendation and prediction tasks.
Contribution
MUFIN is the first multi-modal XC approach that uses cross-modal attention and scalable training routines, bridging the gap between embedding-based and classifier-based methods.
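As a rough illustration of the attention idea, the sketch below runs scaled dot-product self-attention over a mixed bag of image and text embeddings, then mean-pools the result into one datapoint embedding. It is not MUFIN's architecture: the summary does not give that design. The function names (`attend`, `pool`), the identity query/key/value projections, and the toy vectors are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(descriptors):
    """Self-attention over a bag of descriptor embeddings.

    Assumption: queries, keys and values are the raw embeddings
    (identity projections); a trained model would learn projections.
    """
    dim = len(descriptors[0])
    scale = math.sqrt(dim)
    outputs = []
    for query in descriptors:
        # Similarity of this descriptor to every descriptor, image or text.
        scores = [sum(a * b for a, b in zip(query, key)) / scale
                  for key in descriptors]
        weights = softmax(scores)
        # Each output is a weighted average of all descriptors.
        outputs.append([sum(w * value[j] for w, value in zip(weights, descriptors))
                        for j in range(dim)])
    return outputs


def pool(vectors):
    """Mean-pool a list of vectors into a single vector."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


# Hypothetical toy descriptors for one product: two images, one title.
image_vecs = [[1.0, 0.0, 0.0, 0.0], [0.9, 0.1, 0.0, 0.0]]
text_vecs = [[0.0, 1.0, 0.0, 0.0]]
embedding = pool(attend(image_vecs + text_vecs))
print(embedding)
```

Because every descriptor attends to every other one regardless of modality, the pooled vector mixes visual and textual evidence instead of simply concatenating per-modality averages.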
Findings
MUFIN outperforms existing methods by at least 3% accuracy on multiple datasets.
The paper introduces a new dataset, MM-AmazonTitles-300K, with over 300K products and multi-modal descriptors.
Training and inference routines scale logarithmically in the number of labels.
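The summary does not say how the logarithmic cost is obtained. One common way to get it in extreme classification is to shortlist labels with beam search over a balanced tree of label embeddings, so a query is scored against a few nodes per level rather than every label. The sketch below shows that generic technique, not MUFIN's own routine. `build_tree`, `beam_search`, the axis-based split, and the random label vectors are hypothetical choices made for illustration.

```python
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


class Node:
    """Tree node holding a set of label ids and their mean embedding."""

    def __init__(self, label_ids, label_vecs):
        self.label_ids = label_ids
        members = [label_vecs[i] for i in label_ids]
        self.center = [sum(col) / len(members) for col in zip(*members)]
        self.children = []


def build_tree(label_ids, label_vecs):
    """Recursively halve the labels along the coordinate with the widest spread."""
    node = Node(label_ids, label_vecs)
    if len(label_ids) > 1:
        dims = range(len(label_vecs[label_ids[0]]))

        def spread(d):
            values = [label_vecs[i][d] for i in label_ids]
            return max(values) - min(values)

        axis = max(dims, key=spread)
        ordered = sorted(label_ids, key=lambda i: label_vecs[i][axis])
        mid = len(ordered) // 2
        node.children = [build_tree(ordered[:mid], label_vecs),
                         build_tree(ordered[mid:], label_vecs)]
    return node


def beam_search(root, query, beam_width=4, top_k=3):
    """Descend level by level, keeping the beam_width best-scoring nodes.

    Returns the shortlisted label ids and the number of nodes scored,
    which is at most 2 * beam_width per tree level.
    """
    beam, scored = [root], 0
    while any(node.children for node in beam):
        candidates = []
        for node in beam:
            # Leaves reached early stay in the pool until the search ends.
            candidates.extend(node.children if node.children else [node])
        candidates.sort(key=lambda n: dot(query, n.center), reverse=True)
        scored += len(candidates)
        beam = candidates[:beam_width]
    return [node.label_ids[0] for node in beam[:top_k]], scored


# Hypothetical demo: 1024 random label embeddings in 8 dimensions.
random.seed(0)
label_vecs = [[random.gauss(0, 1) for _ in range(8)] for _ in range(1024)]
root = build_tree(list(range(len(label_vecs))), label_vecs)
query = [random.gauss(0, 1) for _ in range(8)]
top_labels, n_scored = beam_search(root, query)
print(top_labels, n_scored, "nodes scored out of", len(label_vecs), "labels")
```

The shortlist is approximate: a narrow beam can prune the branch holding the best label. Widening the beam trades extra scoring cost for recall, and a beam at least as wide as the label set makes the search exact.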
Abstract
This paper develops the MUFIN technique for extreme classification (XC) tasks with millions of labels where datapoints and labels are endowed with visual and textual descriptors. Applications of MUFIN to product-to-product recommendation and bid query prediction over several millions of products are presented. Contemporary multi-modal methods frequently rely on purely embedding-based methods. On the other hand, XC methods utilize classifier architectures to offer higher accuracies than embedding-only methods but mostly focus on text-based categorization tasks. MUFIN bridges this gap by reformulating multi-modal categorization as an XC problem with several millions of labels. This presents the twin challenges of developing multi-modal architectures that can offer embeddings sufficiently expressive to allow accurate categorization over millions of labels; and training and inference…
