Ranking-based Fusion Algorithms for Extreme Multi-label Text Classification (XMTC)

Celso França; Gestefane Rabbi; Thiago Salles; Washington Cunha; Leonardo Rocha; Marcos André Gonçalves

arXiv:2507.03761·cs.IR·July 8, 2025

TL;DR

This paper proposes ranking-based fusion algorithms that combine sparse and dense retrievers to improve extreme multi-label text classification, targeting the challenge posed by the long-tail label distribution.

Contribution

It introduces novel fusion algorithms that leverage the complementary strengths of sparse and dense retrievers for better label ranking in XMTC.

Findings

01

Fusion algorithms improve overall classification accuracy.

02

Fusion improves performance on tail labels compared to the individual retrievers.

03

Fusion balances predictions between head and tail labels.

Abstract

In the context of Extreme Multi-label Text Classification (XMTC), where labels are assigned to text instances from a large label space, the long-tail distribution of labels presents a significant challenge. Labels can be broadly categorized into frequent, high-coverage head labels and infrequent, low-coverage tail labels, complicating the task of balancing effectiveness across all labels. To address this, combining predictions from multiple retrieval methods, such as sparse retrievers (e.g., BM25) and dense retrievers (e.g., fine-tuned BERT), offers a promising solution. The fusion of sparse and dense retrievers is motivated by the complementary ranking characteristics of these methods. Sparse retrievers compute relevance scores based on high-dimensional, bag-of-words representations, while dense retrievers utilize approximate nearest neighbor (ANN)…
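
The abstract excerpt does not say which fusion rules the paper uses. As one illustration of what ranking-based fusion of a sparse and a dense label ranking can look like, the sketch below applies Reciprocal Rank Fusion (RRF), a standard rank-aggregation technique used here for illustration only and not necessarily the authors' method. The function name `reciprocal_rank_fusion`, the smoothing constant `k=60`, and the toy label lists are all assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60, top_n=None):
    """Fuse several ranked label lists with Reciprocal Rank Fusion.

    Illustrative sketch only: RRF is a generic fusion rule, not
    necessarily the algorithm proposed in the paper.

    rankings: iterable of lists, each ordered from best to worst label.
    k: smoothing constant (60 is a common default; an assumption here).
    top_n: optional cut-off for the fused list.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        # Each list contributes 1 / (k + rank) for every label it ranks.
        for rank, label in enumerate(ranking, start=1):
            scores[label] += 1.0 / (k + rank)
    # Sort by fused score (descending); break ties alphabetically.
    fused = sorted(scores, key=lambda label: (-scores[label], label))
    return fused[:top_n] if top_n is not None else fused


# Hypothetical label rankings for a single document.
sparse_ranking = ["physics", "optics", "lasers"]     # e.g. from BM25
dense_ranking = ["optics", "photonics", "physics"]   # e.g. from a BERT encoder

fused = reciprocal_rank_fusion([sparse_ranking, dense_ranking])
print(fused)
```

Because RRF only uses rank positions, it needs no calibration between BM25 scores and dense similarity scores, which are on different scales. A label ranked by only one retriever (such as a rare tail label) still receives a non-zero fused score.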
