Accelerating Extreme Classification via Adaptive Feature Agglomeration

Ankit Jalan; Purushottam Kar

arXiv:1905.11769·cs.LG·May 29, 2019

TL;DR

This paper introduces DEFRAG, an adaptive feature agglomeration method that significantly accelerates extreme classification tasks with millions of labels and features, especially in sparse datasets, while maintaining high accuracy.

Contribution

DEFRAG is a scalable, provably effective feature agglomeration technique that reduces dimensionality by over an order of magnitude, improving speed and handling missing features in extreme classification.

Findings

1. Reduces training and prediction times by up to 40%.
2. Effective in sparse, high-dimensional datasets.
3. Improves coverage on rare labels.

Equations

$$\sqrt{\sum_{i\in S}\left(\ell(\mathbf{w}^{\top}\mathbf{x}^{i};\mathbf{y}^{i})-\ell(\tilde{\mathbf{w}}^{\top}\tilde{\mathbf{x}}^{i};\mathbf{y}^{i})\right)^{2}}\leq L\cdot\sum_{k=1}^{K}\left\|\mathbf{w}_{F_{k}}^{\perp}\right\|_{2}\cdot\text{err}_{k}.$$

$$\sqrt{\sum_{i\in S}\left(\ell(\mathbf{w}^{\top}\mathbf{x}^{i};\mathbf{y}^{i})-\ell(\tilde{\mathbf{w}}^{\top}\tilde{\mathbf{x}}^{i};\mathbf{y}^{i})\right)^{2}}\leq L\cdot\sqrt{w_{0}}\cdot\sum_{k=1}^{K}\text{err}_{k}.$$

$$\text{DCG}(\mathbf{r},\mathbf{v}):=\sum_{j=1}^{p}\frac{\mathbf{v}_{\mathbf{r}_{j}}}{\log(1+j)}$$

$$I(\mathbf{v}):=\left(\max_{\mathbf{r}\in S_{p}}\text{DCG}(\mathbf{r},\mathbf{v})\right)^{-1}=\left(\text{DCG}(\text{rank}(\mathbf{v}),\mathbf{v})\right)^{-1}$$

$$\text{nDCG}(\mathbf{r},\mathbf{v}):=I(\mathbf{v})\cdot\text{DCG}(\mathbf{r},\mathbf{v})$$

$$\sum_{i=1}^{m}\text{nDCG}(\mathbf{r},\mathbf{v}^{i})=\sum_{i=1}^{m}I(\mathbf{v}^{i})\sum_{j=1}^{p}\frac{\mathbf{v}^{i}_{\mathbf{r}_{j}}}{\log(1+j)}=\sum_{j=1}^{p}\frac{1}{\log(1+j)}\sum_{i=1}^{m}I(\mathbf{v}^{i})\cdot\mathbf{v}^{i}_{\mathbf{r}_{j}}$$

$$c_{l}=\alpha\log b_{l}+(1-\alpha)\log a_{l},$$

$$(\mathbf{w}_{F_{k}}-c_{\mathbf{w},k}\cdot\mathbf{1}_{d_{k}})^{\top}Z_{k}Z_{k}^{\top}(\mathbf{w}_{F_{k}}-c_{\mathbf{w},k}\cdot\mathbf{1}_{d_{k}})\leq\left\|\Delta_{k}^{\top}\mathbf{w}_{F_{k}}^{\perp}\right\|_{2}^{2}.$$

$$f(c_{\min})=(\mathbf{u}^{\top}VV^{\top}\mathbf{u})-\frac{(\mathbf{u}^{\top}VV^{\top}\mathbf{1})^{2}}{(\mathbf{1}^{\top}VV^{\top}\mathbf{1})}=\frac{(\mathbf{u}^{\top}VV^{\top}\mathbf{u})\cdot(\mathbf{1}^{\top}VV^{\top}\mathbf{1})-(\mathbf{u}^{\top}VV^{\top}\mathbf{1})^{2}}{(\mathbf{1}^{\top}VV^{\top}\mathbf{1})}$$

$$\mathbf{u}^{\top}VV^{\top}\mathbf{u}$$

$$\mathbf{u}^{\top}VV^{\top}\mathbf{1}$$

$$\begin{aligned}&p^{2}\cdot(\mathbf{1}^{\top}VV^{\top}\mathbf{1})^{2}+2p\cdot(\mathbf{1}^{\top}VV^{\top}\mathbf{u}^{\perp})(\mathbf{1}^{\top}VV^{\top}\mathbf{1})+((\mathbf{u}^{\perp})^{\top}VV^{\top}\mathbf{u}^{\perp})(\mathbf{1}^{\top}VV^{\top}\mathbf{1})\\&-p^{2}\cdot(\mathbf{1}^{\top}VV^{\top}\mathbf{1})^{2}-2p\cdot(\mathbf{1}^{\top}VV^{\top}\mathbf{u}^{\perp})(\mathbf{1}^{\top}VV^{\top}\mathbf{1})-(\mathbf{1}^{\top}VV^{\top}\mathbf{u}^{\perp})^{2}\end{aligned}$$

$$f(c_{\min})=\frac{((\mathbf{u}^{\perp})^{\top}VV^{\top}\mathbf{u}^{\perp})(\mathbf{1}^{\top}VV^{\top}\mathbf{1})-(\mathbf{1}^{\top}VV^{\top}\mathbf{u}^{\perp})^{2}}{(\mathbf{1}^{\top}VV^{\top}\mathbf{1})}=(\mathbf{u}^{\perp})^{\top}VV^{\top}\mathbf{u}^{\perp}-\frac{(\mathbf{1}^{\top}VV^{\top}\mathbf{u}^{\perp})^{2}}{(\mathbf{1}^{\top}VV^{\top}\mathbf{1})}$$

$$f(c_{\min})\leq(\mathbf{u}^{\perp})^{\top}VV^{\top}\mathbf{u}^{\perp}=\left\|V^{\top}\mathbf{u}^{\perp}\right\|_{2}^{2}$$

$$V^{\top}\mathbf{u}^{\perp}=\boldsymbol{\mu}^{k}\mathbf{1}^{\top}\mathbf{u}^{\perp}+\Delta_{k}^{\top}\mathbf{u}^{\perp}=\Delta_{k}^{\top}\mathbf{u}^{\perp},$$

$$\sum_{i\in S}\left(\ell(\mathbf{w}^{\top}\mathbf{x}^{i};\mathbf{y}^{i})-\ell(\tilde{\mathbf{w}}^{\top}\tilde{\mathbf{x}}^{i};\mathbf{y}^{i})\right)^{2}\leq\sum_{i=1}^{n}\left(\ell(\mathbf{w}^{\top}\mathbf{x}^{i};\mathbf{y}^{i})-\ell(\tilde{\mathbf{w}}^{\top}\tilde{\mathbf{x}}^{i};\mathbf{y}^{i})\right)^{2}$$

$$\sum_{i=1}^{n}\left(\ell(\mathbf{w}^{\top}\mathbf{x}^{i};\mathbf{y}^{i})-\ell(\tilde{\mathbf{w}}^{\top}\tilde{\mathbf{x}}^{i};\mathbf{y}^{i})\right)^{2}\leq L^{2}\cdot\sum_{i=1}^{n}\left(\mathbf{w}^{\top}\mathbf{x}^{i}-\tilde{\mathbf{w}}^{\top}\tilde{\mathbf{x}}^{i}\right)^{2}=L^{2}\cdot\sum_{i=1}^{n}\left(\sum_{k=1}^{K}(\mathbf{w}_{F_{k}}-\tilde{\mathbf{w}}_{k}\mathbf{1}_{d_{k}})^{\top}\mathbf{x}_{F_{k}}^{i}\right)^{2}$$

$$\begin{aligned}&\sum_{i=1}^{n}\left\{\sum_{k=1}^{K}\left((\mathbf{w}_{F_{k}}-\tilde{\mathbf{w}}_{k}\mathbf{1}_{d_{k}})^{\top}\mathbf{x}_{F_{k}}^{i}\right)^{2}+\sum_{k\neq l}\left((\mathbf{w}_{F_{k}}-\tilde{\mathbf{w}}_{k}\mathbf{1}_{d_{k}})^{\top}\mathbf{x}_{F_{k}}^{i}\right)\left((\mathbf{w}_{F_{l}}-\tilde{\mathbf{w}}_{l}\mathbf{1}_{d_{l}})^{\top}\mathbf{x}_{F_{l}}^{i}\right)\right\}\\&=\sum_{k=1}^{K}\sum_{i=1}^{n}\left((\mathbf{w}_{F_{k}}-\tilde{\mathbf{w}}_{k}\mathbf{1}_{d_{k}})^{\top}\mathbf{x}_{F_{k}}^{i}\right)^{2}+\sum_{k\neq l}\sum_{i=1}^{n}\left((\mathbf{w}_{F_{k}}-\tilde{\mathbf{w}}_{k}\mathbf{1}_{d_{k}})^{\top}\mathbf{x}_{F_{k}}^{i}\right)\left((\mathbf{w}_{F_{l}}-\tilde{\mathbf{w}}_{l}\mathbf{1}_{d_{l}})^{\top}\mathbf{x}_{F_{l}}^{i}\right)\\&=\sum_{k=1}^{K}(\mathbf{w}_{F_{k}}-\tilde{\mathbf{w}}_{k}\mathbf{1}_{d_{k}})^{\top}\left[\sum_{i=1}^{n}\mathbf{x}_{F_{k}}^{i}(\mathbf{x}_{F_{k}}^{i})^{\top}\right](\mathbf{w}_{F_{k}}-\tilde{\mathbf{w}}_{k}\mathbf{1}_{d_{k}})+\sum_{k\neq l}(\mathbf{w}_{F_{k}}-\tilde{\mathbf{w}}_{k}\mathbf{1}_{d_{k}})^{\top}\left[\sum_{i=1}^{n}\mathbf{x}_{F_{k}}^{i}(\mathbf{x}_{F_{l}}^{i})^{\top}\right](\mathbf{w}_{F_{l}}-\tilde{\mathbf{w}}_{l}\mathbf{1}_{d_{l}})\end{aligned}$$

$$\text{err}_{k}^{2}:=\sum_{j\in F_{k}}\left\|\mathbf{p}^{j}-\boldsymbol{\mu}^{k}\right\|_{2}^{2}=\left\|P_{k}-\mathbf{1}_{d_{k}}(\boldsymbol{\mu}^{k})^{\top}\right\|_{F}^{2}$$

$$\left\|P_{k}^{\top}(\mathbf{w}_{F_{k}}-c_{\mathbf{w},k}\cdot\mathbf{1}_{d_{k}})\right\|_{2}\leq\left\|\Delta_{k}^{\top}\mathbf{w}_{F_{k}}^{\perp}\right\|_{2}\leq\left\|\Delta_{k}^{\top}\right\|_{2}\left\|\mathbf{w}_{F_{k}}^{\perp}\right\|_{2}=\left\|\Delta_{k}\right\|_{2}\left\|\mathbf{w}_{F_{k}}^{\perp}\right\|_{2}\leq\left\|\Delta_{k}\right\|_{F}\left\|\mathbf{w}_{F_{k}}^{\perp}\right\|_{2}=\text{err}_{k}\cdot\left\|\mathbf{w}_{F_{k}}^{\perp}\right\|_{2},$$

$$\sum_{k=1}^{K}(\mathbf{w}_{F_{k}}-\tilde{\mathbf{w}}_{k}\mathbf{1}_{d_{k}})^{\top}P_{k}P_{k}^{\top}(\mathbf{w}_{F_{k}}-\tilde{\mathbf{w}}_{k}\mathbf{1}_{d_{k}})+\sum_{k\neq l}(\mathbf{w}_{F_{k}}-\tilde{\mathbf{w}}_{k}\mathbf{1}_{d_{k}})^{\top}P_{k}P_{l}^{\top}(\mathbf{w}_{F_{l}}-\tilde{\mathbf{w}}_{l}\mathbf{1}_{d_{l}})$$

$$\begin{aligned}&\sum_{k=1}^{K}\left\|P_{k}^{\top}(\mathbf{w}_{F_{k}}-\tilde{\mathbf{w}}_{k}\mathbf{1}_{d_{k}})\right\|_{2}^{2}+\sum_{k\neq l}\left\|P_{k}^{\top}(\mathbf{w}_{F_{k}}-\tilde{\mathbf{w}}_{k}\mathbf{1}_{d_{k}})\right\|_{2}\left\|P_{l}^{\top}(\mathbf{w}_{F_{l}}-\tilde{\mathbf{w}}_{l}\mathbf{1}_{d_{l}})\right\|_{2}\\&\leq\sum_{k=1}^{K}\text{err}_{k}^{2}\cdot\left\|\mathbf{w}_{F_{k}}^{\perp}\right\|_{2}^{2}+\sum_{k\neq l}\text{err}_{k}\cdot\left\|\mathbf{w}_{F_{k}}^{\perp}\right\|_{2}\cdot\text{err}_{l}\cdot\left\|\mathbf{w}_{F_{l}}^{\perp}\right\|_{2}=\left(\sum_{k=1}^{K}\text{err}_{k}\cdot\left\|\mathbf{w}_{F_{k}}^{\perp}\right\|_{2}\right)^{2}\end{aligned}$$

$$\sqrt{\sum_{l\in[T]}\left((\mathbf{c}^{+}-\mathbf{c}^{-})^{\top}\mathbf{z}^{l}-(\tilde{\mathbf{c}}^{+}-\tilde{\mathbf{c}}^{-})^{\top}\tilde{\mathbf{z}}^{l}\right)^{2}}\leq\sum_{k=1}^{K}\text{err}_{k}\cdot\left\|\mathbf{c}_{F_{k}}^{+}-\mathbf{c}_{F_{k}}^{-}\right\|_{2}.$$

$$\sum_{l\in[T]}\left((\mathbf{c}^{+}-\mathbf{c}^{-})^{\top}\mathbf{z}^{l}-(\tilde{\mathbf{c}}^{+}-\tilde{\mathbf{c}}^{-})^{\top}\tilde{\mathbf{z}}^{l}\right)^{2}\leq\sum_{l=1}^{L}\left((\mathbf{c}^{+}-\mathbf{c}^{-})^{\top}\mathbf{z}^{l}-(\tilde{\mathbf{c}}^{+}-\tilde{\mathbf{c}}^{-})^{\top}\tilde{\mathbf{z}}^{l}\right)^{2}$$

$$\sum_{l=1}^{L}\left(\boldsymbol{\delta}^{\top}\mathbf{z}^{l}-\tilde{\boldsymbol{\delta}}^{\top}\tilde{\mathbf{z}}^{l}\right)^{2}=\sum_{l=1}^{L}\left(\sum_{k=1}^{K}(\boldsymbol{\delta}_{F_{k}}-\tilde{\boldsymbol{\delta}}_{k}\mathbf{1}_{d_{k}})^{\top}\mathbf{z}_{F_{k}}^{l}\right)^{2}$$

$$\sum_{k=1}^{K}(\boldsymbol{\delta}_{F_{k}}-\tilde{\boldsymbol{\delta}}_{k}\mathbf{1}_{d_{k}})^{\top}\left[\sum_{l=1}^{L}\mathbf{z}_{F_{k}}^{l}(\mathbf{z}_{F_{k}}^{l})^{\top}\right](\boldsymbol{\delta}_{F_{k}}-\tilde{\boldsymbol{\delta}}_{k}\mathbf{1}_{d_{k}})+\sum_{k\neq j}(\boldsymbol{\delta}_{F_{k}}-\tilde{\boldsymbol{\delta}}_{k}\mathbf{1}_{d_{k}})^{\top}\left[\sum_{l=1}^{L}\mathbf{z}_{F_{k}}^{l}(\mathbf{z}_{F_{j}}^{l})^{\top}\right](\boldsymbol{\delta}_{F_{j}}-\tilde{\boldsymbol{\delta}}_{j}\mathbf{1}_{d_{j}})$$

$$\text{err}_{k}^{2}:=\sum_{j\in F_{k}}\left\|\mathbf{q}^{j}-\boldsymbol{\nu}^{k}\right\|_{2}^{2}=\left\|Q_{k}-\mathbf{1}_{d_{k}}(\boldsymbol{\nu}^{k})^{\top}\right\|_{F}^{2}$$

Code & Models

Repositories

purushottamkar/defrag (Official)

Full text

Accelerating Extreme Classification via Adaptive Feature Agglomeration

Ankit Jalan & Purushottam Kar

Department of CSE, IIT Kanpur, INDIA

Abstract

Extreme classification seeks to assign to each data point the most relevant labels from a universe of a million or more labels. This task faces the dual challenge of high precision and scalability, with millisecond-level prediction times being a benchmark. We propose DEFRAG, an adaptive feature agglomeration technique to accelerate extreme classification algorithms. Despite past works on feature clustering and selection, DEFRAG distinguishes itself in being able to scale to millions of features, and is especially beneficial when feature sets are sparse, which is typical of recommendation and multi-label datasets. The method comes with provable performance guarantees and performs efficient task-driven agglomeration to reduce feature dimensionalities by an order of magnitude or more. Experiments show that DEFRAG can not only reduce training and prediction times of several leading extreme classification algorithms by as much as 40%, but can also be used for feature reconstruction to address the problem of missing features, as well as offer superior coverage on rare labels.

1 Introduction

The task of assigning to data points one or more labels from a vast universe of millions of labels is often referred to as the extreme classification problem. Although reminiscent of the classical multi-label learning problem, the emphasis on addressing extremely large label spaces distinguishes extreme classification. Recent advances in extreme classification have allowed problems such as ranking, recommendation and retrieval to be viewed and formulated as multi-label problems, indeed with millions of labels.

This focus on extremely large label sets has given us state-of-the-art methods for product recommendation Jain et al. (2016), search advertising Prabhu et al. (2018), and video recommendation Weston et al. (2013), as well as led to advances in our understanding of scalable optimization Prabhu et al. (2018), and distributed and parallel processing Yen et al. (2017); Babbar and Schölkopf (2017). Recent advances have utilized a variety of techniques – label embeddings, random forests, binary relevance, which we review in §3.

Nevertheless, extreme classification algorithms continue to face several challenges that we enumerate below.

Precision: data points often have only 5-6 or fewer labels relevant to them (e.g. very few products, of the possibly millions on sale on an online marketplace, would interest any given customer). It is challenging to accurately identify these 5-6 relevant labels among the millions of irrelevant ones.
Prediction: given their use in (live) recommendation systems, extremely rapid predictions are expected, typically within milliseconds. This often restricts the algorithmic techniques that can be used, to computationally frugal ones.
Processing: extreme classification datasets contain not only millions of labels, but also millions of data points, each represented as a million-dimensional vector itself. It is challenging to offer scalable training on such large datasets.
Parity: huge label sets often exhibit power-law behavior with most labels being rare i.e. relevant to very few data points. This makes it tempting for algorithms to focus only on popular labels, neglecting the vast majority of rare ones. However, this is detrimental for recommendation outcomes.

In this work, we develop the DEFRAG method and variants to address these specific challenges for a large family of algorithms. Our contributions are summarized below.

Our Contributions.

We propose the DEFRAG algorithm that accelerates extreme classification algorithms by performing efficient feature agglomeration on datasets with millions of features and data points. DEFRAG performs agglomeration by constructing a balanced hierarchy, which is novel, and offers faster and better agglomerates than traditional clustering methods.
We show that DEFRAG provably preserves the performance of a large family of extreme classification algorithms. This is corroborated experimentally where using DEFRAG significantly reduces training and prediction times of algorithms but with no significant reduction in precision levels.
We exploit DEFRAG’s agglomerates in a novel manner to develop the REFRAG algorithm to address the parity problem by performing efficient label re-ranking. This vastly improves the coverage of existing algorithms by accurately predicting extremely rare labels.
We develop the FIAT algorithm to perform scalable feature imputation which preserves prediction accuracy even when a large fraction of data features are removed.
We perform extensive experimentation on large-scale datasets to establish that DEFRAG not only offers significant reductions in training and prediction times, but that it does so with little or no reduction in precision.

2 Problem Formulation and Notation

The training data will be provided as $n$ labeled data points $({{\mathbf{x}}}^{i},{{\mathbf{y}}}^{i}),i=1,\ldots,n$ where ${{\mathbf{x}}}^{i}\in{\mathbb{R}}^{d}$ is the feature vector and ${{\mathbf{y}}}^{i}\in\left\{{0,1}\right\}^{L}$ is the label vector. There may be several (up to $L$) labels associated with each data point. Extreme classification datasets exhibit extreme sparsity in feature and label vectors. Let $\hat{d}$ denote the average number of non-zero features per data point and $\hat{L}$ denote the average number of active labels per data point. §6 shows that $\hat{d}\ll d$ and $\hat{L}\ll L$. We will denote the feature matrix using $X=\left[{{{\mathbf{x}}}^{1},\ldots,{{\mathbf{x}}}^{n}}\right]\in{\mathbb{R}}^{d\times n}$ and the label matrix using $Y=\left[{{{\mathbf{y}}}^{1},\ldots,{{\mathbf{y}}}^{n}}\right]\in\left\{{0,1}\right\}^{L\times n}$.

Notation.

Let ${\mathcal{F}}=\left\{{F_{1},\ldots,F_{K}}\right\}$ denote any $K$-partition of the feature set $[d]$ i.e. $F_{i}\cap F_{j}=\emptyset$ if $i\neq j$ and $\bigcup_{k=1}^{K}F_{k}=[d]$. Let $d_{k}:=\left|{F_{k}}\right|$ denote the size of the $k^{\text{th}}$ cluster. For any vector ${{\mathbf{z}}}\in{\mathbb{R}}^{d}$, let ${{\mathbf{z}}}_{j}$ denote its $j^{\text{th}}$ coordinate. For any set $F_{k}\in{\mathcal{F}}$, let ${{\mathbf{z}}}_{F_{k}}:=[{{\mathbf{z}}}_{j}]_{j\in F_{k}}^{\top}\in{\mathbb{R}}^{d_{k}}$ denote the (shorter) vector containing only coordinates from the set $F_{k}$.

Feature Agglomeration.

Feature agglomeration involves creating clusters of features and then summing up features within a cluster. If ${\mathcal{F}}$ is a partition of the features $[d]$, then corresponding to every cluster $F_{k}\in{\mathcal{F}}$, we create a single “super”-feature. Thus, given a vector ${{\mathbf{z}}}\in{\mathbb{R}}^{d}$, we can create an agglomerated vector $\tilde{{\mathbf{z}}}^{[{\mathcal{F}}]}\in{\mathbb{R}}^{K}$ (abbreviated to just $\tilde{{\mathbf{z}}}$ for the sake of notational simplicity) with just $K$ features using the clustering ${\mathcal{F}}$. The $k^{\text{th}}$ dimension of $\tilde{{\mathbf{z}}}$ will be $\tilde{{\mathbf{z}}}_{k}=\sum_{j\in F_{k}}{{\mathbf{z}}}_{j}$ for $k=1,\ldots,K$. The DEFRAG algorithm will automatically learn relevant feature clusters ${\mathcal{F}}$.
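As a concrete illustration of this definition, the following minimal Python sketch agglomerates a sparse vector given a feature-to-cluster map. The function name `agglomerate`, the list `cluster_of` and the dict-based sparse representation are illustrative assumptions, not the authors' implementation.

```python
def agglomerate(z, cluster_of):
    """Sum the coordinates of sparse vector z that fall in the same cluster.

    z          : dict mapping feature index j -> non-zero value z_j
    cluster_of : list where cluster_of[j] = k means feature j belongs to F_k
    returns    : dict mapping cluster index k -> agglomerated value
    """
    z_tilde = {}
    for j, value in z.items():
        k = cluster_of[j]
        z_tilde[k] = z_tilde.get(k, 0.0) + value
    return z_tilde


# Hypothetical partition of d = 6 features into K = 3 clusters:
# F_0 = {0, 3}, F_1 = {1, 2}, F_2 = {4, 5}
cluster_of = [0, 1, 1, 0, 2, 2]
z = {0: 1.0, 3: 2.0, 4: 0.5}          # sparse vector with 3 non-zeros
z_tilde = agglomerate(z, cluster_of)  # {0: 3.0, 2: 0.5}
```

Features 0 and 3 share a cluster, so their values are summed into a single super-feature, while cluster 1 receives no entry because none of its features is active.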

3 Related Works

We discuss relevant works in extreme classification and scalable clustering and feature agglomeration techniques here.

Binary Relevance.

Also known as one-vs-all methods, these techniques, for example DiSMEC Babbar and Schölkopf (2017), PPDSparse Yen et al. (2017), and ProXML Babbar and Schölkopf (2019), learn $L$ binary classifiers: for each label $l\in[L]$ , a binary classifier is learnt to distinguish data points that contain label $l$ from those that do not. Binary relevance methods offer some of the highest precision values among extreme classification algorithms Prabhu et al. (2018). However, despite advances in parallel training and active set methods, they still incur training and prediction times that are prohibitive for most applications.

Label/Feature Embedding.

These techniques project feature and/or label vectors onto a low dimensional space i.e. ${{\mathbf{x}}}^{i}\mapsto\hat{{\mathbf{x}}}^{i},{{\mathbf{y}}}^{i}\mapsto\hat{{\mathbf{y}}}^{i}$ where $\hat{{\mathbf{x}}}^{i},\hat{{\mathbf{y}}}^{i}\in{\mathbb{R}}^{p}$, $p\ll\min\left\{{d,L}\right\}$ using random or learnt projections. Prediction and training are performed in the low dimensional space ${\mathbb{R}}^{p}$ for speed. These methods, which include SLEEC Bhatia et al. (2015), AnnexML Tagami (2017) and LEML Yu et al. (2014), offer strong theoretical guarantees, but are usually forced to choose a moderate value of $p$ to maintain scalability. This often results in low precision values and causes these methods to struggle on rare labels.

Data Partitioning.

These techniques learn a decision tree over the data points which are hierarchically clustered into several leaves, with the hope that similar data points, i.e. those with similar label vectors, end up in the same leaf. A simple classifier (usually constant) performs label prediction at a leaf. These methods, which include PfastreXML Jain et al. (2016), FastXML Prabhu and Varma (2014) and CRAFTML Siblini et al. (2018), offer fast prediction times since prediction cost is logarithmic in the number of leaves of a balanced tree.

Label Partitioning.

These methods instead learn to organize labels into (overlapping) clusters, using hierarchical partitioning techniques. Prediction is done by taking a data point to one or more of the leaves of the tree and using a simple method such as 1-vs-all among labels present at that leaf. These methods, which include PLT Jasinska et al. (2016), Parabel Prabhu et al. (2018) and LPSR Weston et al. (2013), offer fast prediction times due to the tree structure, as well as high precision when a 1-vs-all classifier is used at the leaves, as Parabel does.

Large Scale (Feature) Clustering.

Clustering, as well as feature clustering and agglomeration, are well-studied topics. Past works include techniques for scalable balanced k-means using alternating minimization techniques SCBC Banerjee and Ghosh (2006) and BCLS Liu et al. (2017), scalable spectral clustering using landmarking LSC Chen and Cai (2011), and scalable information-theoretic clustering ITDC Dhillon et al. (2003). We do compare DEFRAG against all these algorithms. These algorithms were chosen since they were able to scale to at least the smallest datasets in our experiments.

4 Adaptive Extreme Feature Agglomeration

We now describe the DEFRAG method, discuss its key advantages and then develop the REFRAG method for rare label prediction and the FIAT method for feature imputation. Recall from §2 that given a $K$ -partition ${\mathcal{F}}$ of the features $[d]$ , feature agglomeration takes each cluster $F_{k}\in{\mathcal{F}}$ and agglomerates all features $j\in F_{k}$ by summing up their feature values.

DEFRAG: aDaptive Extreme FeatuRe AGglomeration.

Given a dataset with $d$ -dimensional features, DEFRAG first clusters these features into balanced clusters, with each cluster containing, say no more than $d_{0}$ features. Suppose this process results in $K$ clusters. DEFRAG then uses feature agglomeration (see §2 and Figure 1) to obtain $K$ -dimensional features for all data points in the dataset which are then used for training and testing.

DEFRAG first creates a representative vector for each feature $j\in[d]$ and then performs hierarchical clustering on them (see Algorithm 1) to obtain feature clusters. At each internal node of the hierarchy, the features at that node are split into two child nodes of equal size, either by solving a balanced spherical $2$-means problem or by minimizing a ranking loss like nDCG Prabhu and Varma (2014); we call the latter variant DEFRAG-N (see Appendix A for details). This process is continued until fewer than $d_{0}$ features remain at a node, in which case the node is made a leaf. We now discuss two methods to construct these representative vectors.
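The recursive balanced splitting described here can be sketched in a few lines of Python. This is a simplified stand-in for Algorithm 1 (which is not reproduced in this text): it assumes dense representative vectors, a fixed number of refinement rounds, a random choice of initial centroids, and a stopping rule of at most `d0` features per leaf. The names `balanced_two_means` and `defrag_clusters` are hypothetical.

```python
import math
import random


def normalize(v):
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v] if norm > 0 else list(v)


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def balanced_two_means(ids, vecs, iters=10, seed=0):
    """Split `ids` into two parts whose sizes differ by at most one."""
    rng = random.Random(seed)
    a, b = rng.sample(ids, 2)              # assumed initialisation: two random features
    c_pos, c_neg = vecs[a], vecs[b]
    half = len(ids) // 2
    for _ in range(iters):
        # Rank features by how much closer (in cosine similarity) they are to c_pos.
        ranked = sorted(ids, key=lambda j: dot(vecs[j], c_pos) - dot(vecs[j], c_neg),
                        reverse=True)
        pos, neg = ranked[:half], ranked[half:]   # cut at the median => balanced split
        c_pos = normalize([sum(col) for col in zip(*(vecs[j] for j in pos))])
        c_neg = normalize([sum(col) for col in zip(*(vecs[j] for j in neg))])
    return pos, neg


def defrag_clusters(vecs, d0, seed=0):
    """Recursively split features until every leaf holds at most d0 (>= 1) features."""
    unit = [normalize(v) for v in vecs]    # spherical k-means works with unit vectors
    leaves, stack = [], [list(range(len(vecs)))]
    while stack:
        node = stack.pop()
        if len(node) <= d0:
            leaves.append(sorted(node))
        else:
            stack.extend(balanced_two_means(node, unit, seed=seed))
    return leaves
```

Because every split halves the node, the depth of the hierarchy grows only logarithmically in the number of features, which is what keeps the clustering overhead small.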

DEFRAG-X

This variant clusters together co-occurrent features e.g. $j,j^{\prime}\in[d]$ where data points with a non-zero (or high) value for feature $j$ also have a non-zero (or high) value for feature $j^{\prime}$ . DEFRAG-X represents each feature $j\in[d]$ as an $n$ -dimensional vector ${{\mathbf{p}}}^{j}=[{{\mathbf{x}}}^{1}_{j},\ldots,{{\mathbf{x}}}^{n}_{j}]^{\top}\in{\mathbb{R}}^{n}$ , essentially as the list of values that feature takes in all data points.

DEFRAG-XY

This variant clusters together co-predictive features e.g. $j,j^{\prime}\in[d]$ where data points where feature $j$ is non-zero have similar labels as data points where feature $j^{\prime}$ is non-zero. To do so, DEFRAG-XY represents each feature $j\in[d]$ as an $L$ -dimensional vector ${{\mathbf{q}}}^{j}=\sum_{i=1}^{n}{{\mathbf{x}}}^{i}_{j}{{\mathbf{y}}}^{i}\in{\mathbb{R}}^{L}$ , essentially as a weighted aggregate of the label vectors of data points where the feature $j$ is non-zero.
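The two representations can be built in a single pass over the non-zero entries of the data. The sketch below assumes a toy dict-of-non-zeros format for feature vectors and sets of active labels; the function name `representative_vectors` is hypothetical.

```python
def representative_vectors(X, Y, d):
    """Build sparse DEFRAG-X (p^j) and DEFRAG-XY (q^j) feature representations.

    X : list of n sparse feature vectors, each a dict {feature j: value}
    Y : list of n label sets, each a set of active label indices
    d : number of features
    """
    P = [dict() for _ in range(d)]  # P[j][i] = x^i_j
    Q = [dict() for _ in range(d)]  # Q[j][l] = sum_i x^i_j * y^i_l
    for i, (x, labels) in enumerate(zip(X, Y)):
        for j, value in x.items():
            P[j][i] = value
            for l in labels:
                Q[j][l] = Q[j].get(l, 0.0) + value
    return P, Q


# Toy data: n = 3 data points, d = 3 features, labels drawn from {5, 7}.
X = [{0: 1.0, 2: 2.0}, {0: 3.0}, {1: 1.0, 2: 1.0}]
Y = [{5}, {5, 7}, {7}]
P, Q = representative_vectors(X, Y, d=3)
```

Both representations stay sparse: `P[j]` has one entry per data point in which feature `j` is active, and `Q[j]` has one entry per label that co-occurs with feature `j`.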

Suited for sparse, high-dim. data.

DEFRAG is superior to classical dimensionality reduction techniques like PCA/random projection for high-dimensional, sparse data.

Applying feature agglomeration to a vector simply involves summing up the coordinates of that vector and is much cheaper than performing PCA or a random projection.
PCA/random projection densify vectors and so methods such as LEML and SLEEC are compelled to use a small embedding dimension ( $\approx$ 500) for sake of scalability which leads to information loss. Feature agglomeration, however, does not densify vectors: if a vector ${{\mathbf{x}}}\in{\mathbb{R}}^{d}$ has only $s$ non-zero coordinates, then for any feature $K$ -clustering ${\mathcal{F}}$ , the vector $\tilde{{\mathbf{x}}}^{[{\mathcal{F}}]}\in{\mathbb{R}}^{K}$ cannot have more than $s$ non-zero coordinates. This allows DEFRAG to operate with relatively large values of $K$ (e.g. $K=d/8$ is default in our experiments as we set $d_{0}=8$ ) without worrying about memory or time issues. Thus, DEFRAG can offer mildly agglomerated vectors which preserve much of the information of the original vector, yet offer speedups due to the reduced dimensionality.
Feature agglomeration has an implicit weight-tying effect since once we learn a model over the agglomerated features, all features belonging to a given cluster $F_{k}\in{\mathcal{F}}$ effectively receive the same model weight. This reduces the capacity of the model and can improve generalization error.
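The sparsity-preservation and weight-tying properties listed above can be checked on a toy example. The sketch below uses hypothetical clusters and weights: a model over the agglomerated features scores a data point exactly as a model over the original features whose weights are shared within each cluster.

```python
def agglomerate(x, cluster_of):
    """Sum sparse coordinates of x (dict j -> value) within each cluster."""
    x_tilde = {}
    for j, value in x.items():
        k = cluster_of[j]
        x_tilde[k] = x_tilde.get(k, 0.0) + value
    return x_tilde


def sparse_dot(w, x):
    """Inner product between a dense weight list w and a sparse dict x."""
    return sum(w[j] * value for j, value in x.items())


cluster_of = [0, 0, 1, 1, 1, 2]            # hypothetical clusters over d = 6 features
w_tilde = [0.5, -1.0, 2.0]                 # model over K = 3 agglomerated features
w_tied = [w_tilde[k] for k in cluster_of]  # same model written over original features

x = {1: 2.0, 2: 1.0, 4: 3.0}               # 3 non-zero coordinates
x_tilde = agglomerate(x, cluster_of)       # at most 3 non-zero coordinates

score_agglomerated = sparse_dot(w_tilde, x_tilde)
score_tied = sparse_dot(w_tied, x)
```

Here features 2 and 4 fall in the same cluster, so `x_tilde` has only two non-zero entries, and both scores coincide.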

Provably bounded distortion, Performance preservation.

We show in §5 that if we obtain a feature clustering ${\mathcal{F}}$ with small clustering error, then feature agglomeration using ${\mathcal{F}}$ provably preserves the performance of all linear models. Specifically, for every model ${{\mathbf{w}}}\in{\mathbb{R}}^{d}$ over the original vectors, there must exist a model $\tilde{{\mathbf{w}}}\in{\mathbb{R}}^{K}$ over the agglomerated vectors such that for any vector ${{\mathbf{x}}}\in{\mathbb{R}}^{d}$ , we have ${{\mathbf{w}}}^{\top}{{\mathbf{x}}}\approx\tilde{{\mathbf{w}}}^{\top}\tilde{{\mathbf{x}}}^{[{\mathcal{F}}]}$ . This ensures that similar 1-vs-all models can be learnt over $\tilde{{\mathbf{x}}}^{[{\mathcal{F}}]}$ , as well as similar trees and label partitions can be built. We note that all algorithms discussed in §3 ultimately use just linear models as components (e.g. binary relevance methods learn $L$ linear classifiers, embedding methods learn linear projections, data and label partitioning methods learn linear models to split internal nodes).

Task adaptivity.

DEFRAG-XY takes labels into account in its feature representation, which makes it task-adaptive as compared to dimensionality reduction or clustering methods like k-means and PCA, which do not consider labels. Indeed, we will see that on many datasets, DEFRAG-XY outperforms DEFRAG-X, which also does not take labels into account.

Novelty, Speed and Scalability.

Hierarchical feature agglomeration is novel in the context of extreme classification although hierarchical data partitioning (PfastreXML) and hierarchical label partitioning (Parabel) have been successfully attempted before. The representative vectors created by DEFRAG are themselves sparse and hierarchical feature agglomeration offers speedy feature clustering. DEFRAG’s overhead on the training process is thus very small.

Time Complexity.

Let $\text{nnz}(X)=n\cdot\hat{d}$ be the number of non-zero elements in the feature matrix $X$ . Computing the feature representations ${{\mathbf{p}}}^{j},j\in[d]$ takes ${\cal O}\left({{\text{nnz}(X)}}\right)$ time. The total time taken to perform balanced spherical $2$ -means clustering for all nodes at a certain level in the tree is ${\cal O}\left({{\text{nnz}(X)}}\right)$ as well. Since DEFRAG performs balanced splits, there can be at most ${\cal O}\left({{\log d}}\right)$ levels in the tree, thus giving us a total time complexity of ${\cal O}\left({{\text{nnz}(X)\log d}}\right)$ .

FIAT: Feature Imputation via AgglomeraTion.

Co-occurrence based feature imputation has been popularly used to overcome the problem of missing features. However, this becomes prohibitive in extreme classification settings since the standard co-occurrence matrix $C=XX^{\top}$ is too dense to store and operate on. We exploit the feature clusters offered by DEFRAG to create a scalable co-occurrence based feature imputation algorithm, FIAT. For any feature cluster $F_{k}\in{\mathcal{F}}$, let $X_{F_{k}}\in{\mathbb{R}}^{d\times n}$ denote the matrix that retains only those rows of $X$ that belong to the cluster $F_{k}$ (all other rows being zero). Given this, we compute a pseudo co-occurrence matrix $C^{{\mathcal{F}}}=\sum_{k=1}^{K}X_{F_{k}}X_{F_{k}}^{\top}\in{\mathbb{R}}^{d\times d}$.

Note that $C^{{\mathcal{F}}}$ has a block-diagonal structure and has only up to $\frac{d^{2}}{K}$ non-zero entries where $K$ is the number of clusters. Thus, it is much cheaper to store and operate on. Given a feature vector ${{\mathbf{x}}}\in{\mathbb{R}}^{d}$ that we suspect has missing features, we perform feature imputation on it by simply calculating $C^{{\mathcal{F}}}{{\mathbf{x}}}$.
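A minimal sketch of this imputation step follows. It stores only the within-cluster entries of the pseudo co-occurrence matrix and applies it to a vector; the dense toy rows, the dict-of-pairs storage and the function names `pseudo_cooccurrence` and `impute` are assumptions made for illustration.

```python
def pseudo_cooccurrence(P, clusters):
    """Non-zero entries of C^F as {(j, j2): <p^j, p^j2>} for j, j2 in the same cluster.

    P        : list of d dense rows, P[j] = values of feature j across n data points
    clusters : list of lists forming a partition of range(d)
    """
    C = {}
    for F_k in clusters:
        for j in F_k:
            for j2 in F_k:
                value = sum(a * b for a, b in zip(P[j], P[j2]))
                if value != 0.0:
                    C[(j, j2)] = value
    return C


def impute(C, x, d):
    """Compute C^F x for a dense feature vector x of length d."""
    out = [0.0] * d
    for (j, j2), value in C.items():
        out[j] += value * x[j2]
    return out


# Toy data: d = 4 features, n = 3 data points, hypothetical clusters {0, 1} and {2, 3}.
P = [[1.0, 0.0, 1.0],
     [1.0, 0.0, 0.0],
     [0.0, 2.0, 0.0],
     [0.0, 1.0, 0.0]]
clusters = [[0, 1], [2, 3]]
C = pseudo_cooccurrence(P, clusters)
x_observed = [1.0, 0.0, 0.0, 0.0]      # feature 1 is suspected to be missing
x_imputed = impute(C, x_observed, d=4)  # [2.0, 1.0, 0.0, 0.0]
```

Feature 1 co-occurs with feature 0 inside the same cluster, so it receives a non-zero imputed value, while features in the other cluster are untouched.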

REFRAG: REranking via FeatuRe AGglomeration.

The presence of a vast majority of rare labels that occur in very few data points can cause algorithms to neglect rare labels in favor of popular ones Wei and Li (2018). To address this, we propose an efficient reranking solution based on the pseudo co-occurrence matrix $C^{{\mathcal{F}}}$ described earlier. First compute the matrix product $C^{{\mathcal{F}}}XY^{\top}\in{\mathbb{R}}^{d\times L}$. The $l^{\text{th}}$ column of this matrix, for $l\in[L]$, can be interpreted as giving us a prototype data point $\boldsymbol{\xi}^{l}\in{\mathbb{R}}^{d}$ for the label $l$.

These prototypes can be used to get the affinity score of a test data point ${{\mathbf{x}}}^{t}$ to a label $l\in[L]$ as $e^{-\frac{\gamma}{2}\cdot\left\|{{{\mathbf{x}}}^{t}-\boldsymbol{\xi}^{l}}\right\|_{2}^{2}}$. Once a base classification algorithm such as Parabel or DiSMEC has given scores for the test point ${{\mathbf{x}}}^{t}$ with respect to various labels, instead of predicting the labels with the highest scores right away, we combine the classifier scores with these affinity scores and then make the predictions. We note that a similar approach was proposed by Jain et al. (2016) who did achieve enhanced performance on rare labels.

However, whereas their method requires an optimization problem to be solved to obtain the prototypes, we have a closed form expression for prototypes in our model given the efficiently computable pseudo co-occurrence matrix $C^{{\mathcal{F}}}$ .
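The reranking step can be sketched as follows. The affinity formula is the one given in the text; the rule used to combine it with the classifier scores is an assumption (a log-space blend with a hypothetical mixing weight `alpha`, applied to classifier scores assumed to lie in (0, 1]), since the main text does not spell it out. The prototypes are taken as given here rather than computed from $C^{{\mathcal{F}}}XY^{\top}$.

```python
import math


def log_affinity(x, prototype, gamma):
    """Log of the affinity exp(-gamma/2 * ||x - prototype||_2^2) for dense vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, prototype))
    return -0.5 * gamma * sq_dist


def rerank(x, base_scores, prototypes, gamma=1.0, alpha=0.5):
    """Order labels by a blend of classifier and affinity scores.

    ASSUMPTION: the blend alpha * log(score) + (1 - alpha) * log(affinity) and the
    parameter names are illustrative choices, not taken from the main text.
    """
    combined = {}
    for label, score in base_scores.items():
        combined[label] = (alpha * math.log(score)
                           + (1 - alpha) * log_affinity(x, prototypes[label], gamma))
    return sorted(combined, key=combined.get, reverse=True)


x_test = [1.0, 0.0]
prototypes = {"popular": [0.0, 1.0], "rare": [1.0, 0.0]}   # toy prototypes
base_scores = {"popular": 0.6, "rare": 0.4}                # toy classifier scores
ranking = rerank(x_test, base_scores, prototypes)          # ["rare", "popular"]
```

In this toy case the base classifier prefers the popular label, but the test point coincides with the prototype of the rare label, so the rare label moves to the top after reranking.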

Due to lack of space, further algorithmic details as well as proofs of theorems in §5 are presented in the full version of the paper available at the URL given below.

5 Performance Guarantees

In this section we establish that DEFRAG provably preserves the performance of extreme classification algorithms. For any vector ${{\mathbf{v}}}\in{\mathbb{R}}^{p}$ we will utilize the orthogonal decomposition ${{\mathbf{v}}}={{\mathbf{v}}}^{\parallel}+{{\mathbf{v}}}^{\perp}$ where ${{\mathbf{v}}}^{\parallel}$ is the component of ${{\mathbf{v}}}$ along the all-ones vector ${\mathbf{1}}_{p}=(1,\ldots,1)\in{\mathbb{R}}^{p}$ and ${{\mathbf{v}}}^{\perp}$ is the component orthogonal to it i.e. ${\mathbf{1}}_{p}^{\top}{{\mathbf{v}}}^{\perp}=0$ . At the core of our results is the following lemma. Given a real valued matrix $Z\in{\mathbb{R}}^{d\times p}$ for some $p>0$ and a $K$ -partition ${\mathcal{F}}$ of the feature set $[d]$ , we will let $Z_{k}\in{\mathbb{R}}^{d_{k}\times p}$ denote the matrix formed out of the rows of the matrix that correspond to the partition $F_{k}$ .

Lemma 1.

Given any matrix $Z\in{\mathbb{R}}^{d\times p}$ and any $K$ -partition ${\mathcal{F}}=\left\{{F_{1},\ldots,F_{K}}\right\}$ of $[d]$ , suppose there exist vectors $\text{\boldmath$ \mathbf{\mu} $}^{1},\ldots,\text{\boldmath$ \mathbf{\mu} $}^{K}\in{\mathbb{R}}^{p}$ such that $Z_{k}={\mathbf{1}}_{d_{k}}(\text{\boldmath$ \mathbf{\mu} $}^{k})^{\top}+\Delta_{k}$ where ${\mathbf{1}}_{d_{k}}:=(1,\ldots,1)^{\top}\in{\mathbb{R}}^{d_{k}}$ , then for every ${{\mathbf{w}}}\in{\mathbb{R}}^{d}$ and every $k\in[K]$ , there must exist a real value $c_{{{\mathbf{w}}},k}\in{\mathbb{R}}$ such that

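$$({{\mathbf{w}}}_{F_{k}}-c_{{{\mathbf{w}}},k}\cdot{\mathbf{1}}_{d_{k}})^{\top}Z_{k}Z_{k}^{\top}({{\mathbf{w}}}_{F_{k}}-c_{{{\mathbf{w}}},k}\cdot{\mathbf{1}}_{d_{k}})\leq\left\|{\Delta_{k}^{\top}{{\mathbf{w}}}_{F_{k}}^{\perp}}\right\|_{2}^{2}.$$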

Lemma 1 will be used to show below that, if a group of features is “well-clustered”, then it is possible to tie together weights corresponding to those features in every linear model.

Theorem 2.

Upon executing DEFRAG-X with a feature matrix $X=[{{\mathbf{x}}}^{1},\ldots,{{\mathbf{x}}}^{n}]$ and label matrix $Y=[{{\mathbf{y}}}^{1},\ldots,{{\mathbf{y}}}^{n}]$ , suppose we obtain a feature $K$ -partition ${\mathcal{F}}=[F_{1},\ldots,F_{K}]$ with $\text{err}_{k}$ denoting the Euclidean clustering error within the $k\text{$ {}^{\text{th}} $}$ cluster, then for any loss function $\ell(\cdot)$ that is $L$ -Lipschitz and for every linear model ${{\mathbf{w}}}\in{\mathbb{R}}^{d}$ , there must exist a model $\tilde{{\mathbf{w}}}\in{\mathbb{R}}^{K}$ such that for all subsets of data points $S\subseteq[n]$ ,

[TABLE]

To simplify this result, let $w_{0}=\max_{k\in[K]}\left\|{{{\mathbf{w}}}_{F_{k}}^{\perp}}\right\|_{2}^{2}\leq\max_{k\in[K]}\left\|{{{\mathbf{w}}}_{F_{k}}}\right\|_{2}^{2}$ (since cluster sizes $d_{k}$ are typically small, $w_{0}$ is small too) and use the fact that $\text{err}_{k}\geq 0$ to get

[TABLE]

A few points are notable about the above results.

Uniform Model Preservation.

Theorem 2 guarantees that if the clustering error is small (and DEFRAG does minimize clustering error), then for every possible linear model ${{\mathbf{w}}}\in{\mathbb{R}}^{d}$ over the original features ${{\mathbf{x}}}^{i}$ , we can learn a model $\tilde{{\mathbf{w}}}\in{\mathbb{R}}^{K}$ over the agglomerated features $\tilde{{\mathbf{x}}}^{i}$ such that both models behave similarly with respect to any Lipschitz loss function. It is notable that Theorem 2 holds simultaneously for all linear models ${{\mathbf{w}}}$ , thus making it algorithm agnostic.

Classifier Preservation.

Most leading algorithms (Parabel, DiSMEC, PfastreXML, PPDSparse, SLEEC) construct classifiers by learning several linear models using hinge loss or exponential loss which are Lipschitz. By preserving the performance of all such individual linear models, DEFRAG preserves the overall performance of these algorithms too. Note that Theorem 2 holds uniformly over all subsets $S\subseteq[n]$ of data points, which is useful since these algorithms often learn several linear models on various subsets of the data.

Graceful Adaptivity.

Suppose for a model ${{\mathbf{w}}}$ , the weights within a cluster $F_{k}$ are similar i.e. ${{\mathbf{w}}}_{F_{k}}\approx w\cdot{\mathbf{1}}_{d_{k}},w\in{\mathbb{R}}$ . Then this implies ${{\mathbf{w}}}_{F_{k}}^{\perp}\approx{\mathbf{0}}$ and the contribution of this cluster to the total error will be very small. This indicates that if some of the original weights are anyway tied together, DEFRAG automatically offers extremely accurate reconstructions.
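A small numerical illustration of this point: the orthogonal component of a weight block is obtained by subtracting the block mean, so it vanishes when the weights are tied. The helper name and the example weights are hypothetical.

```python
import math

def split_parallel_perp(v):
    """Decompose v into its component along the all-ones vector and the rest."""
    mean = sum(v) / len(v)                 # projection coefficient onto 1_p
    parallel = [mean] * len(v)
    perp = [vi - mean for vi in v]
    return parallel, perp

def l2_norm(v):
    return math.sqrt(sum(vi * vi for vi in v))

tied = [0.7, 0.7, 0.7, 0.7]                # weights already tied within the cluster
spread = [0.7, -0.2, 1.5, 0.0]             # weights that differ within the cluster

_, tied_perp = split_parallel_perp(tied)
_, spread_perp = split_parallel_perp(spread)
tied_norm = l2_norm(tied_perp)             # essentially zero
spread_norm = l2_norm(spread_perp)         # strictly positive
```

Since the error contribution of a cluster scales with the norm of the orthogonal component, the first block contributes essentially nothing while the second does.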

In Appendix B.1, we show that DEFRAG-XY preserves the performance of label clustering methods such as Parabel.

6 Experimental Results

We studied the effects of using DEFRAG variants with several extreme classification algorithms, as well as compared DEFRAG with other clustering algorithms. Our implementation of DEFRAG is available at the URL given below.

Code Link: https://github.com/purushottamkar/defrag/

Datasets and Implementations.

All datasets, train-test splits, and implementations of extreme classification algorithms were sourced from the Extreme Classification Repository Bhatia et al. (2019) (see Table 4 in the appendix). Implementations of clustering algorithms were sourced from the authors whenever possible. For SCBC, LSC and ITDC, public implementations were not available and scalable implementations were created in the Python language.

Hyperparameters.

If available, hyperparameter settings recommended by authors were used for all methods. If unavailable, a fine grid search was performed over a reasonable range to offer adequately tuned hyperparameters to the methods. DEFRAG had its only hyperparameter, the max size of a feature cluster $d_{0}$ (see Algorithm 1), fixed to $8$ .

Comparison with other clustering methods.

Table 1 compares DEFRAG with other clustering algorithms on clustering quality, execution time and classification performance (see Appendix C for definitions of clustering metrics). Features were agglomerated according to feature clusters given by all algorithms and Parabel was executed on them. DEFRAG handily outperforms all other methods.

Dataset-wise and Method-wise performance

Table 3 presents the outcome of using DEFRAG with several leading algorithms on 8 datasets. On Wiki10 and Delicious, DEFRAG-XY+Parabel offers the best overall performance across all methods. More generally, the table shows 21 instances, across the 8 datasets, of how DEFRAG performs with various algorithms. In 3 of these instances, DEFRAG outperforms the base method (EURLex-PPDSparse, Wiki10-Parabel and Delicious-Parabel), in 7 others, DEFRAG lags by less than 2.5%, in 8 others, it lags by less than 5%. Only in 3 cases is the lag $>$ 5%. DEFRAG variants do seem to work best with the Parabel method.

Trade-offs offered by DEFRAG.

It is easy to see that if we create a small number of clusters $K$ , by setting $d_{0}$ to be a large value, then the agglomerated vectors will be lower dimensional and as such, offer faster training/prediction and smaller model sizes. However this may cause a dip in prediction accuracy. Figure 2 shows that DEFRAG variants offer attractive trade-offs in this respect.

Rare-label prediction with REFRAG.

Table 2 shows that REFRAG offers much better propensity-weighted metrics Jain et al. (2016) (which down-weigh popular and emphasize rare labels) than PfastreXML which also attempts label reranking. Figure 3 shows that REFRAG achieves much better coverage (3.85) than Parabel (1.23) on Delicious and in general, predicts rare labels far more accurately. Figure 3 also shows that FIAT offers resilience to feature erasures.

Acknowledgments

The authors thank the reviewers for comments on improving the presentation of the paper. The authors are also thankful to the lab-team members at the CSE department, IIT Kanpur esp. M. Bagga, S. Malhotra, N. Yadav, B. K. Mishra for their support in running the experiments. P. K. thanks Microsoft Research India and Tower Research for research grants.

Appendix A Algorithmic Details from §4

In this section we present details of implementations of algorithms that were omitted from the main text due to lack of space. We start off with implementation details of the DEFRAG-X and DEFRAG-XY methods.

A.1 DEFRAG Implementation Details

As mentioned in the main text, DEFRAG chooses to represent each feature $j\in[d]$ as a vector, either as an $n$ -dimensional vector ${{\mathbf{p}}}^{j}=[{{\mathbf{x}}}^{1}_{j},\ldots,{{\mathbf{x}}}^{n}_{j}]^{\top}\in{\mathbb{R}}^{n}$ of the values that feature takes on in the $n$ data points (used by DEFRAG-X), or as an $L$ -dimensional vector ${{\mathbf{q}}}^{j}=\sum_{i=1}^{n}{{\mathbf{x}}}^{i}_{j}{{\mathbf{y}}}^{i}\in{\mathbb{R}}^{L}$ , essentially as a weighted aggregate of the label vectors of data points where the feature $j$ is non-zero.
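The two feature representations can be read off directly from the data matrices, as the following sketch shows. It assumes dense NumPy arrays for readability (a practical implementation would presumably use sparse matrices), with $X\in{\mathbb{R}}^{d\times n}$ and $Y\in\{0,1\}^{L\times n}$ laid out as in the main text.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, L = 5, 8, 3
# Sparse-looking non-negative features and binary labels (toy data).
X = rng.random((d, n)) * (rng.random((d, n)) < 0.4)
Y = (rng.random((L, n)) < 0.3).astype(float)

P = X            # row j is p^j in R^n   (representation used by DEFRAG-X)
Q = X @ Y.T      # row j is q^j in R^L   (representation used by DEFRAG-XY)

# Check one row of Q against the explicit sum over data points.
j = 2
q_j_explicit = sum(X[j, i] * Y[:, i] for i in range(n))
```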

Irrespective of the representation used, DEFRAG next performs hierarchical clustering on the representative vectors as described in Algorithm 1. Starting with the root node which contains all features, nodes are split evenly till the number of features falls below a set threshold at which point the node is made a leaf. For our experiments in Figure 2, we varied this threshold among the values $4,8,16,32$ , thus obtaining respectively $d/4,d/8,d/16,d/32$ clusters, but for all other experiments in Figure 3 and Tables 1 and 3, we fixed this threshold to $8$ , thus obtaining $d/8$ clusters.

The clustering was done using one of two methods - balanced spherical k-means, or balanced nDCG splitting. The details of these implementations are given below. Given the extreme sparsity of both the data and label vectors, the feature representations we obtain via DEFRAG-X and DEFRAG-XY, i.e. ${{\mathbf{p}}}^{j},{{\mathbf{q}}}^{j}$ are themselves very sparse and so Euclidean notions of proximity such as dot products and norms may not be appropriate for clustering. Thus, we also develop a version which we call DEFRAG-N which minimizes an nDCG ranking loss at each node Prabhu and Varma [2014]. We found that balancing did not affect the performance of DEFRAG-N too much. This may be due to the fact that minimizing the nDCG loss naturally produces rather balanced clusters, something that has been independently observed by Jain et al. [2016]; Prabhu and Varma [2014].

Balanced Spherical k-means.

Algorithm 2 presents the node splitting routine using the balanced k-means algorithm. The algorithm essentially follows the traditional Lloyd’s algorithm except at the cluster assignment stage when, instead of just assigning each point to the nearest cluster, the algorithm performs a fair split.

This algorithm can be derived as a special case (for $k$ = 2) of the refinement step in the constrained clustering routine proposed in Banerjee and Ghosh [2006]. It is notable that Prabhu et al. [2018] derive essentially the same algorithm for splitting labels into balanced clusters, but they derive their approach starting from a different graph flow-based approach to constrained clustering.
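The recursive splitting and the fair-split assignment step can be sketched as below. This is an illustration under stated assumptions rather than a transcription of Algorithms 1 and 2: the random seeding, the fixed iteration count, the tie-breaking, and the function names are all choices made for this example.

```python
import numpy as np

def balanced_split(V, idx, rng, n_iter=10):
    """Split the rows V[idx] into two parts whose sizes differ by at most one."""
    norms = np.linalg.norm(V[idx], axis=1, keepdims=True)
    U = V[idx] / np.maximum(norms, 1e-12)          # unit-norm rows (spherical k-means)
    seeds = rng.choice(len(idx), size=2, replace=False)
    c_plus, c_minus = U[seeds[0]], U[seeds[1]]
    half = (len(idx) + 1) // 2
    for _ in range(n_iter):
        # Fair split: sort by preference for c_plus over c_minus, cut at the middle.
        order = np.argsort(-(U @ (c_plus - c_minus)), kind="stable")
        plus, minus = order[:half], order[half:]
        # Lloyd-style centroid update, re-normalized to the unit sphere.
        c_plus = U[plus].mean(axis=0)
        c_minus = U[minus].mean(axis=0)
        c_plus /= max(np.linalg.norm(c_plus), 1e-12)
        c_minus /= max(np.linalg.norm(c_minus), 1e-12)
    return idx[plus], idx[minus]

def cluster_features(V, d0, rng):
    """Split nodes evenly until every leaf holds at most d0 features."""
    leaves, stack = [], [np.arange(V.shape[0])]
    while stack:
        idx = stack.pop()
        if len(idx) <= d0:
            leaves.append(idx)                     # node becomes a leaf
        else:
            stack.extend(balanced_split(V, idx, rng))
    return leaves

rng = np.random.default_rng(0)
V = rng.random((37, 12))                           # 37 toy feature representations
leaves = cluster_features(V, d0=8, rng=rng)
```

Because every split is even, the number and sizes of the leaves depend only on the number of features and on $d_{0}$, not on the data; only the membership of each leaf is data-driven.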

nDCG Splitting.

Given that the vector representations of the features we use are sparse vectors, we employed a ranking-based splitting technique as well. The technique was adapted from the work of Prabhu and Varma [2014] and is described here (their work applies the technique to cluster binary vectors whereas we apply it to cluster feature representative vectors which need not be binary). Note that the technique requires that the vector representations contain only non-negative values. This is indeed true in all our experiments since the features in our datasets are created out of bag-of-words representations which are non-negative.

For any vector ${{\mathbf{v}}}\in{\mathbb{R}}_{+}^{p}$ with non-negative coordinates i.e. ${{\mathbf{v}}}_{i}\geq 0,i\in[p]$ , we let $\text{rank}({{\mathbf{v}}})\in S_{p}$ denote the permutation ranking the $p$ coordinates of ${{\mathbf{v}}}$ in decreasing order i.e. if ${{\mathbf{r}}}:=\text{rank}({{\mathbf{v}}})$ then ${{\mathbf{v}}}_{{{\mathbf{r}}}_{i}}\geq{{\mathbf{v}}}_{{{\mathbf{r}}}_{j}}$ whenever $i<j$ . For any such vector ${{\mathbf{v}}}\in{\mathbb{R}}_{+}^{p}$ and any permutation ${{\mathbf{r}}}\in S_{p}$ (not necessarily the one that ranks the coordinates of ${{\mathbf{v}}}$ ), we define the Discounted Cumulative Gain (DCG) score of the permutation ${{\mathbf{r}}}$ with respect to the vector ${{\mathbf{v}}}$ as

[TABLE]

We also define the maximum such score any ranking can achieve as the following

[TABLE]

Given the above, we define the normalized DCG score of any permutation ${{\mathbf{r}}}\in S_{p}$ with respect to the vector ${{\mathbf{v}}}$ as

[TABLE]

Now, given a set of vectors ${{\mathbf{v}}}^{1},\ldots,{{\mathbf{v}}}^{m}\in{\mathbb{R}}_{+}^{p}$ , a single “centroid” ranking that fits all of them the best can be found as $\max_{{{\mathbf{r}}}\in S_{p}}\sum_{i=1}^{m}\text{nDCG}({{\mathbf{r}}},{{\mathbf{v}}}^{i})$ where

[TABLE]

This implies that the best ranking is given by $\arg\max_{{{\mathbf{r}}}\in S_{p}}\sum_{i=1}^{m}\text{nDCG}({{\mathbf{r}}},{{\mathbf{v}}}^{i})=\text{rank}\left({\sum_{i=1}^{m}I({{\mathbf{v}}}^{i})\cdot{{\mathbf{v}}}^{i}}\right)$ . Algorithm 3 presents the DEFRAG-N clustering technique that uses the above rule to recompute cluster centroids.
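The centroid rule above can be checked by brute force on a tiny example. The sketch below assumes the standard DCG form, in which the item at position $i$ is discounted by $\log(1+i)$, and takes $I({{\mathbf{v}}})$ to be the reciprocal of the maximum DCG score of ${{\mathbf{v}}}$; both are assumptions made for this illustration, and the example vectors are arbitrary.

```python
import math
from itertools import permutations

def dcg(r, v):
    """Assumed standard DCG: the item at 1-based position i is discounted by log(1 + i)."""
    return sum(v[r[pos]] / math.log(2 + pos) for pos in range(len(r)))

def rank(v):
    """Permutation listing the coordinates of v in decreasing order of value."""
    return tuple(sorted(range(len(v)), key=lambda j: -v[j]))

def ndcg(r, v):
    return dcg(r, v) / dcg(rank(v), v)      # requires v to have a positive entry

def centroid_ranking(vectors):
    """rank( sum_i I(v^i) * v^i ), with I(v) assumed to be 1 / max-DCG(v)."""
    p = len(vectors[0])
    agg = [0.0] * p
    for v in vectors:
        inv_max = 1.0 / dcg(rank(v), v)
        for j in range(p):
            agg[j] += inv_max * v[j]
    return rank(agg)

def total_ndcg(r, vectors):
    return sum(ndcg(r, v) for v in vectors)

vectors = [[3.0, 0.0, 1.0, 0.0], [0.0, 2.0, 2.5, 0.0], [1.0, 0.0, 0.0, 4.0]]
best = centroid_ranking(vectors)
# Exhaustive search over all 4! rankings, feasible only for tiny p.
brute = max(permutations(range(4)), key=lambda r: total_ndcg(r, vectors))
```

The closed-form ranking attains the same total nDCG as the exhaustive search while costing only a sort, which is what makes the centroid update cheap at each node.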

A.2 Accelerated Clustering

Given the large size of the datasets used in our experiments, it was important to ensure that the feature clustering time of DEFRAG did not exceed the savings in training time it offered for various methods, so as to ensure that the total training time of DEFRAG (clustering + agglomeration + training) still remained smaller than that of the various extreme classification algorithms.

To do so we notice that, given the heavy tailed phenomenon exhibited by most large-scale datasets, clustering time can be reduced significantly by subsampling data points and labels. More specifically, we notice that in the execution of DEFRAG-X, performance is not affected even if we represent each feature using only its values in $\tilde{n}$ most voluminous data points where the “volume” of a data point $i\in[n]$ is calculated as $\left\|{{{\mathbf{x}}}^{i}}\right\|_{1}$ . Similarly, we noticed that the execution of DEFRAG-XY is not affected, i.e. clusters are not affected, even if we take into account only the $\tilde{L}$ most popular labels. Such an effect (of performance not being affected by taking only “head” objects) has been observed before Wei and Li [2018] as well.

Although the above steps do not greatly affect the clustering performance, they do drastically reduce the clustering time of the DEFRAG variants. Thus, all our DEFRAG-X experiments are executed taking only the $\tilde{n}=0.25n$ most voluminous documents. Thus, each feature is represented only as an $\tilde{n}$ -dimensional vector. Similarly, all our DEFRAG-XY experiments are executed taking only the $\tilde{n}=0.25n$ most voluminous documents and the $\tilde{L}=0.05L$ most popular labels. Thus, each feature is represented only as an $\tilde{L}$ -dimensional vector.
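The subsampling step can be sketched as follows, assuming dense arrays and measuring label popularity by the number of data points carrying the label. The helper name `head_indices` and the toy data are hypothetical; only the fractions $0.25$ and $0.05$ come from the text.

```python
import numpy as np

def head_indices(X, Y, frac_points=0.25, frac_labels=0.05):
    """Indices of the most voluminous data points and the most popular labels."""
    n, L = X.shape[1], Y.shape[0]
    n_keep = max(1, int(frac_points * n))
    L_keep = max(1, int(frac_labels * L))
    volume = np.abs(X).sum(axis=0)                 # ||x^i||_1 for each data point
    popularity = Y.sum(axis=1)                     # occurrences of each label
    points = np.argsort(-volume, kind="stable")[:n_keep]
    labels = np.argsort(-popularity, kind="stable")[:L_keep]
    return points, labels

rng = np.random.default_rng(0)
d, n, L = 6, 40, 40
X = rng.random((d, n)) * (rng.random((d, n)) < 0.3)
Y = (rng.random((L, n)) < 0.2).astype(float)

points, labels = head_indices(X, Y)
P_small = X[:, points]                                   # DEFRAG-X representations
Q_small = X[:, points] @ Y[np.ix_(labels, points)].T     # DEFRAG-XY representations
```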

A.3 Ensemble Training

Several extreme classification methods, such as Parabel, SLEEC, PfastreXML construct ensemble classifiers by executing the algorithm independently a few times to obtain different classifiers and then using consensus voting techniques to aggregate the predictions of these different classifiers. Unless otherwise mentioned, we always executed DEFRAG variants afresh for each member of the ensemble as well.

For example, the Parabel algorithm trains an ensemble of 3 tree-based classifiers, each of which is independently capable of making classifications. We ran DEFRAG independently 3 times as well, once for each execution of Parabel. Since DEFRAG uses random initializations in its clustering routines, we found that the cluster partitions were not identical across the three executions. We did find this step to boost accuracy, presumably since it allowed different sets of features to be clustered together in different executions.

A.4 Cluster Averaging

We note that although DEFRAG simply sums up the feature values within a feature cluster, an alternative technique could be to average the feature values within the cluster. Although not a significant step in general, averaging does affect the performance of classifiers like DiSMEC or PPDSparse which practice model trimming i.e. setting model coordinates which have a value below a certain threshold, to zero in order to save model space. For such classifiers, simple agglomeration may produce features with inflated feature values which result in small model values which in turn get trimmed to zero. Cluster averaging may help in these settings.

For instance, on the EURLex dataset with DiSMEC, DEFRAG with cluster averaging can yield more than 1.3% boost in P1 accuracies. However, we note that this effect is not uniform and that on some datasets, averaging can actually hurt performance. For instance, on the Wiki10 dataset with DiSMEC, DEFRAG with cluster averaging actually causes minor dips of up to 0.4% in P1.
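The difference between summing and averaging within a cluster is a one-line change, as the following sketch shows. The function name and the toy matrix are hypothetical, and dense arrays are used for readability.

```python
import numpy as np

def agglomerate(X, clusters, average=False):
    """Collapse the rows (features) of X within each cluster into a single row."""
    X_tilde = np.zeros((len(clusters), X.shape[1]))
    for k, F_k in enumerate(clusters):
        block = X[F_k]                             # rows of X for features in F_k
        X_tilde[k] = block.mean(axis=0) if average else block.sum(axis=0)
    return X_tilde

X = np.arange(12, dtype=float).reshape(6, 2)       # 6 features, 2 data points
clusters = [[0, 1, 2], [3, 4], [5]]
X_sum = agglomerate(X, clusters)                   # default DEFRAG behaviour: sum
X_avg = agglomerate(X, clusters, average=True)     # cluster averaging variant
```

Averaging divides each agglomerated feature by its cluster size, which keeps feature magnitudes comparable to the originals and so avoids the inflated values that interact badly with model trimming.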

A.5 REFRAG Implementation Details

As mentioned in the main text, for a given test point ${{\mathbf{x}}}^{t}\in{\mathbb{R}}^{d}$ , the pseudo co-occurrence model $C^{{\mathcal{F}}}$ was used to obtain a score $a_{l}=e^{-\frac{\gamma}{2}\cdot\left\|{{{\mathbf{x}}}^{t}-\text{\boldmath$ \mathbf{\xi} $}^{l}}\right\|_{2}^{2}}$ indicating the affinity of the test point ${{\mathbf{x}}}^{t}$ to the label $l$ . The base algorithm, say Parabel or PPDSparse, was used to obtain a separate score, say $b_{l}$ , indicating how suitable that algorithm considers label $l$ to be for the data point ${{\mathbf{x}}}^{t}$ . These two scores were combined in the following manner.

[TABLE]

and the labels with the highest score, as assigned by $c_{l}$ , were assigned to the data point ${{\mathbf{x}}}^{t}$ . As Figure 3 indicates, this does not appreciably reduce the precision on popular labels as they keep getting predicted as before. However, this does greatly increase the algorithm’s ability to predict rare labels. We set $\alpha=0.8$ for all experiments.

Appendix B Proofs from §5

We now provide complete proofs for the results mentioned in the main paper starting with the base lemma and the analysis for DEFRAG-X followed by the analysis for DEFRAG-XY.

Lemma 1.

Given any matrix $Z\in{\mathbb{R}}^{d\times p}$ and any $K$ -partition ${\mathcal{F}}=\left\{{F_{1},\ldots,F_{K}}\right\}$ of $[d]$ , suppose there exist vectors $\text{\boldmath$ \mathbf{\mu} $}^{1},\ldots,\text{\boldmath$ \mathbf{\mu} $}^{K}\in{\mathbb{R}}^{p}$ such that $Z_{k}={\mathbf{1}}_{d_{k}}(\text{\boldmath$ \mathbf{\mu} $}^{k})^{\top}+\Delta_{k}$ where ${\mathbf{1}}_{d_{k}}:=(1,\ldots,1)^{\top}\in{\mathbb{R}}^{d_{k}}$ , then for every ${{\mathbf{w}}}\in{\mathbb{R}}^{d}$ and every $k\in[K]$ , there must exist a real value $c_{{{\mathbf{w}}},k}\in{\mathbb{R}}$ such that

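$$({{\mathbf{w}}}_{F_{k}}-c_{{{\mathbf{w}}},k}\cdot{\mathbf{1}}_{d_{k}})^{\top}Z_{k}Z_{k}^{\top}({{\mathbf{w}}}_{F_{k}}-c_{{{\mathbf{w}}},k}\cdot{\mathbf{1}}_{d_{k}})\leq\left\|{\Delta_{k}^{\top}{{\mathbf{w}}}_{F_{k}}^{\perp}}\right\|_{2}^{2}.$$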

Proof.

Consider a fixed $k$ and for sake of notational simplicity, let us abbreviate ${{\mathbf{u}}}:={{\mathbf{w}}}_{F_{k}},V:=Z_{k},c:=c_{{{\mathbf{w}}},k}$ and ${\mathbf{1}}:={\mathbf{1}}_{d_{k}}$ . We will show that if $V={\mathbf{1}}(\text{\boldmath$ \mathbf{\mu} $}^{k})^{\top}+\Delta_{k}$ as promised, then there must exist a $c\in{\mathbb{R}}$ such that $({{\mathbf{u}}}-c\cdot{\mathbf{1}})^{\top}VV^{\top}({{\mathbf{u}}}-c\cdot{\mathbf{1}})\leq\left\|{\Delta_{k}^{\top}{{\mathbf{u}}}^{\perp}}\right\|_{2}^{2}$ . To establish this result, first notice that the objective function $f(c)=({{\mathbf{u}}}-c\cdot{\mathbf{1}})^{\top}VV^{\top}({{\mathbf{u}}}-c\cdot{\mathbf{1}})$ is minimized at a value of $c_{\min}=\frac{{{\mathbf{u}}}^{\top}VV^{\top}{\mathbf{1}}}{{\mathbf{1}}^{\top}VV^{\top}{\mathbf{1}}}$ . At this value, we have

[TABLE]

Concentrating just on the numerator, upon using the orthogonal decomposition ${{\mathbf{u}}}={{\mathbf{u}}}^{\parallel}+{{\mathbf{u}}}^{\perp}$ where ${\mathbf{1}}^{\top}{{\mathbf{u}}}^{\perp}=0$ and letting ${{\mathbf{u}}}^{\parallel}:=p\cdot{\mathbf{1}}$ for some $p\in{\mathbb{R}}$ , we get

[TABLE]

which, upon inserting in the numerator expression, gives us

[TABLE]

Thus, we have

[TABLE]

Note that the second term in the above expression is always non-negative, albeit not one that is easy to lower-bound. Thus, we simply bound the function value as

[TABLE]

Now, the preconditions of the lemma guarantee us that $V={\mathbf{1}}(\text{\boldmath$ \mathbf{\mu} $}^{k})^{\top}+\Delta_{k}$ and thus, we have

[TABLE]

where in the last step we have used the fact that ${\mathbf{1}}^{\top}{{\mathbf{u}}}^{\perp}=0$ by construction. This finishes the proof. ∎
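The argument can be sanity-checked numerically on a random instance. The sketch below uses the minimizer $c_{\min}$ from the proof and compares the two sides of the inequality it establishes; the dimensions, noise scale and random seed are arbitrary choices made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_k, p = 5, 7
ones = np.ones(d_k)
mu = rng.normal(size=p)
Delta = 0.1 * rng.normal(size=(d_k, p))
V = np.outer(ones, mu) + Delta                 # V = 1 mu^T + Delta_k
u = rng.normal(size=d_k)                       # u = w restricted to F_k

G = V @ V.T
c_min = (u @ G @ ones) / (ones @ G @ ones)     # minimizer of f(c)
resid = u - c_min * ones
lhs = resid @ G @ resid                        # f(c_min)

u_perp = u - u.mean() * ones                   # component orthogonal to all-ones
rhs = np.sum((Delta.T @ u_perp) ** 2)          # ||Delta_k^T u_perp||_2^2
```

The inequality `lhs <= rhs` holds because choosing $c$ equal to the mean of ${{\mathbf{u}}}$ already achieves the right hand side exactly, and $c_{\min}$ can only do better.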

Theorem 2.

Upon executing DEFRAG-X with a feature matrix $X=[{{\mathbf{x}}}^{1},\ldots,{{\mathbf{x}}}^{n}]$ and label matrix $Y=[{{\mathbf{y}}}^{1},\ldots,{{\mathbf{y}}}^{n}]$ , suppose we obtain a feature $K$ -partition ${\mathcal{F}}=[F_{1},\ldots,F_{K}]$ with $\text{err}_{k}$ denoting the Euclidean clustering error within the $k\text{$ {}^{\text{th}} $}$ cluster, then for any loss function $\ell(\cdot)$ that is $L$ -Lipschitz, for every linear model ${{\mathbf{w}}}\in{\mathbb{R}}^{d}$ , there must exist a model $\tilde{{\mathbf{w}}}\in{\mathbb{R}}^{K}$ such that for all subsets of data points $S\subseteq[n]$ ,

[TABLE]

Proof.

We will describe how to obtain the identity of $\tilde{{\mathbf{w}}}$ in a short while. For now, notice that since all terms in the summation on the left hand side are non-negative, we have for any $S\subseteq[n]$

[TABLE]

Using the Lipschitz property of the loss function and using the partition structure gives us

[TABLE]

Expanding the right hand side (and ignoring the $L^{2}$ term for now) gives us

[TABLE]

Now, DEFRAG-X represents each feature $j\in[d]$ as an $n$ dimensional vector, ${{\mathbf{p}}}^{j}=[{{\mathbf{x}}}^{1}_{j},\ldots,{{\mathbf{x}}}^{n}_{j}]^{\top}\in{\mathbb{R}}^{n}$ and then performs clustering on these vectors to obtain a $K$ -partition of the feature set $[d]$ , say ${\mathcal{F}}=\left\{{F_{1},\ldots,F_{K}}\right\}$ . Say the centroids of these $K$ clusters are $\text{\boldmath$ \mathbf{\mu} $}^{1},\ldots,\text{\boldmath$ \mathbf{\mu} $}^{K}\in{\mathbb{R}}^{n}$ . Consider one of these clusters, say the $k\text{$ {}^{\text{th}} $}$ cluster $F_{k}$ with, say $d_{k}$ features in that cluster. If we denote $P_{k}=[{{\mathbf{p}}}^{j}]_{j\in F_{k}}^{\top}\in{\mathbb{R}}^{d_{k}\times n}$ , then the following observations are immediate

1. For any ${{\mathbf{v}}}\in{\mathbb{R}}^{d_{k}}$ , we have ${{\mathbf{v}}}^{\top}P_{k}={{\mathbf{v}}}^{\top}\sum_{i=1}^{n}{{\mathbf{x}}}^{i}_{F_{k}}$

2. The Euclidean clustering error within the $k\text{$ {}^{\text{th}} $}$ cluster is given by

[TABLE]

3. If we denote $P_{k}={\mathbf{1}}_{d_{k}}(\text{\boldmath$ \mathbf{\mu} $}^{k})^{\top}+\Delta_{k}$ , then we must have $\left\|{\Delta_{k}}\right\|_{F}=\left\|{P_{k}-{\mathbf{1}}_{d_{k}}(\text{\boldmath$ \mathbf{\mu} $}^{k})^{\top}}\right\|_{F}=\text{err}_{k}$

4. Lemma 1, when applied with $p=n$ , shows us that with the above notation, we have, for some real value $c_{{{\mathbf{w}}},k}$

[TABLE]

where $\left\|{\Delta_{k}}\right\|_{2}$ denotes the spectral norm (i.e. the largest singular value) of the matrix $\Delta_{k}$ and $\left\|{\Delta_{k}}\right\|_{F}$ denotes the Frobenius norm of the same matrix.

Note that although the inequality $\left\|{\Delta_{k}}\right\|_{2}\leq\left\|{\Delta_{k}}\right\|_{F}$ may seem loose at first sight, notice that since $\text{rank}(\Delta_{k})\leq d_{k}$ , we also have $\left\|{\Delta_{k}}\right\|_{F}\leq\sqrt{d_{k}}\cdot\left\|{\Delta_{k}}\right\|_{2}$ and since $d_{k}$ is typically a small number, this shows that this inequality is not too loose.

The above observations allow us to construct $\tilde{{\mathbf{w}}}$ . All we need to do is, for every cluster $k\in[K]$ , consult Lemma 1 to obtain a constant $c_{{{\mathbf{w}}},k}$ that offers the guarantees of the lemma. We then simply concatenate these constants to construct $\tilde{{\mathbf{w}}}=[c_{{{\mathbf{w}}},k}]_{k\in[K]}\in{\mathbb{R}}^{K}$ . Using the first point in the list of observations, we can now also rewrite the last expression in our ongoing calculations as

[TABLE]

Applying the Cauchy-Schwarz inequality and the rest of the observations allows us to upper-bound the above expression as

[TABLE]

This finishes the proof upon putting back the $L^{2}$ factor we had omitted earlier and taking square roots on both sides. ∎
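The construction of $\tilde{{\mathbf{w}}}$ used in the proof can be sketched numerically. The example below takes the identity as the loss (which is $1$-Lipschitz), uses $S=[n]$ , and compares the prediction gap against $\sum_{k}\left\|{{{\mathbf{w}}}_{F_{k}}^{\perp}}\right\|_{2}\cdot\text{err}_{k}$ , a bound that follows from Lemma 1 together with the triangle inequality over clusters. The fixed partition, the choice of the cluster mean as centroid, and the random data are assumptions made for this illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 12, 30
clusters = [np.arange(0, 4), np.arange(4, 8), np.arange(8, 12)]   # toy 3-partition
X = rng.random((d, n))
w = rng.normal(size=d)

w_tilde = np.zeros(len(clusters))
X_tilde = np.zeros((len(clusters), n))
bound = 0.0
for k, F_k in enumerate(clusters):
    P_k = X[F_k]                                   # (d_k, n) feature representations
    ones = np.ones(len(F_k))
    G = P_k @ P_k.T
    w_tilde[k] = (w[F_k] @ G @ ones) / (ones @ G @ ones)   # c_{w,k} from Lemma 1
    X_tilde[k] = P_k.sum(axis=0)                   # agglomerated feature values
    Delta_k = P_k - P_k.mean(axis=0)               # deviations from the centroid
    err_k = np.linalg.norm(Delta_k)                # Frobenius norm
    w_perp = w[F_k] - w[F_k].mean()
    bound += np.linalg.norm(w_perp) * err_k

# Euclidean norm over all data points of the difference in predictions.
gap = np.linalg.norm(w @ X - w_tilde @ X_tilde)
```

Each entry of $\tilde{{\mathbf{w}}}$ depends only on the original weights and features within its own cluster, so the construction never needs to look across clusters.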

B.1 Performance Guarantees for DEFRAG-XY

We now prove a similar result for DEFRAG-XY, specifically that DEFRAG-XY accurately preserves the performance of label clustering methods such as Parabel. Since Parabel performs “spherical k-means” which relies on scores of the form $({{\mathbf{c}}}^{+}-{{\mathbf{c}}}^{-})^{\top}{{\mathbf{z}}}^{l}$ to decide cluster assignments, the result below assures that these scores remain preserved on the agglomerated data as well. We stress that this result can be readily adapted to usual k-means which relies on scores of the form $\left\|{{{\mathbf{c}}}^{+}-{{\mathbf{z}}}^{l}}\right\|_{2}^{2}-\left\|{{{\mathbf{c}}}^{-}-{{\mathbf{z}}}^{l}}\right\|_{2}^{2}$ .

Theorem 3.

Upon executing DEFRAG-XY with feature matrix $X$ and label matrix $Y$ , suppose we obtain a feature $K$ -partition ${\mathcal{F}}$ with Euclidean clustering errors $\left\{{\text{err}_{k}}\right\}_{k\in[K]}$ . Suppose ${{\mathbf{z}}}^{l}=\sum_{i}{{\mathbf{y}}}^{i}_{l}{{\mathbf{x}}}^{i}$ and $\tilde{{\mathbf{z}}}^{l}=\sum_{i}{{\mathbf{y}}}^{i}_{l}\tilde{{\mathbf{x}}}^{i}$ are the original and agglomerated label features for $l\in[L]$ . Then for every $2$ -means clustering of the original label features, with cluster centroids ${{\mathbf{c}}}^{+}$ and ${{\mathbf{c}}}^{-}$ , there must exist centroids $\tilde{{\mathbf{c}}}^{+}$ and $\tilde{{\mathbf{c}}}^{-}$ that offer similar clustering error over the agglomerated features. Specifically, we have, for all subsets of labels $T\subseteq[L]$ ,

[TABLE]

Proof.

We will establish the identity of the modified centroids $\tilde{{\mathbf{c}}}^{+},\tilde{{\mathbf{c}}}^{-}$ in a short while. For now, notice that as before, by positivity of all terms in the summation, we have for any $T\subseteq[L]$

[TABLE]

For sake of notational simplicity, let us denote $\text{\boldmath$ \mathbf{\delta} $}={{\mathbf{c}}}^{+}-{{\mathbf{c}}}^{-}$ and $\tilde{\text{\boldmath$ \mathbf{\delta} $}}=\tilde{{\mathbf{c}}}^{+}-\tilde{{\mathbf{c}}}^{-}$ . This gives us

[TABLE]

Expanding the right hand side in the same manner as before gives us

[TABLE]

Now, DEFRAG-XY represents each feature $j\in[d]$ as an $L$ dimensional vector, ${{\mathbf{q}}}^{j}=\sum_{i=1}^{n}{{\mathbf{x}}}^{i}_{j}{{\mathbf{y}}}^{i}\in{\mathbb{R}}^{L}$ and then performs clustering on these vectors to obtain a $K$ -partition of the feature set $[d]$ , say ${\mathcal{F}}=\left\{{F_{1},\ldots,F_{K}}\right\}$ . Say the centroids of these $K$ clusters are $\text{\boldmath$ \mathbf{\nu} $}^{1},\ldots,\text{\boldmath$ \mathbf{\nu} $}^{K}\in{\mathbb{R}}^{L}$ . Consider one of these clusters, say the $k\text{$ {}^{\text{th}} $}$ cluster $F_{k}$ with, say $d_{k}$ features in that cluster. If we denote $Q_{k}=[{{\mathbf{q}}}^{j}]_{j\in F_{k}}^{\top}\in{\mathbb{R}}^{d_{k}\times L}$ , then the following observations are immediate

1. For any ${{\mathbf{v}}}\in{\mathbb{R}}^{d_{k}}$ , we have ${{\mathbf{v}}}^{\top}Q_{k}={{\mathbf{v}}}^{\top}\sum_{l=1}^{L}{{\mathbf{z}}}^{l}_{F_{k}}$

2. The Euclidean clustering error within the $k\text{$ {}^{\text{th}} $}$ cluster is given by

[TABLE]

3. If we denote $Q_{k}={\mathbf{1}}_{d_{k}}(\text{\boldmath$ \mathbf{\nu} $}^{k})^{\top}+\Delta_{k}$ , then we must have $\left\|{\Delta_{k}}\right\|_{F}=\left\|{Q_{k}-{\mathbf{1}}_{d_{k}}(\text{\boldmath$ \mathbf{\nu} $}^{k})^{\top}}\right\|_{F}=\text{err}_{k}$

4. Lemma 1, when applied with $p=L$ , shows us that with the above notation, we have, for some real value $d_{\text{\boldmath$ \mathbf{\delta} $},k}$

[TABLE]

where $\left\|{\Delta_{k}}\right\|_{2}$ denotes the spectral norm (i.e. the largest singular value) of the matrix $\Delta_{k}$ and $\left\|{\Delta_{k}}\right\|_{F}$ denotes the Frobenius norm of the same matrix.

Again, note that the inequality $\left\|{\Delta_{k}}\right\|_{2}\leq\left\|{\Delta_{k}}\right\|_{F}$ is not too loose: since $\text{rank}(\Delta_{k})\leq d_{k}$ , we also have $\left\|{\Delta_{k}}\right\|_{F}\leq\sqrt{d_{k}}\cdot\left\|{\Delta_{k}}\right\|_{2}$ , and $d_{k}$ is typically a small number.

The above observations allow us to construct $\tilde{\text{\boldmath$ \mathbf{\delta} $}}$ . All we need to do is, for every cluster $k\in[K]$ , consult Lemma 1 to obtain a constant $d_{\text{\boldmath$ \mathbf{\delta} $},k}$ that offers the guarantees of the lemma. We then simply concatenate these constants to construct $\tilde{\text{\boldmath$ \mathbf{\delta} $}}=[d_{\text{\boldmath$ \mathbf{\delta} $},k}]_{k\in[K]}\in{\mathbb{R}}^{K}$ . Once we have $\tilde{\text{\boldmath$ \mathbf{\delta} $}}$ , we may construct $\tilde{{\mathbf{c}}}^{+}$ and $\tilde{{\mathbf{c}}}^{-}$ as any two vectors such that $\tilde{{\mathbf{c}}}^{+}-\tilde{{\mathbf{c}}}^{-}=\tilde{\text{\boldmath$ \mathbf{\delta} $}}$ . Moreover, using the first point in the list of observations, we can now also rewrite the last expression in our ongoing calculations as

[TABLE]

Applying the Cauchy-Schwarz inequality and the rest of the observations allows us to upper-bound the above expression as

[TABLE]

This finishes the proof upon taking square roots on both sides. ∎

Appendix C Experimental Details from §6

We provide additional details about experimental settings, as well as additional experimental results in this appendix.

C.1 Clustering Metrics

We used several clustering metrics to evaluate the various clustering algorithms compared in Table 1. Those metrics are defined formally below. Note that in that experiment, to be fair, all algorithms were asked to output the same number of clusters $d/8$ where $d$ is the original dimensionality of the data features.

The notion of mutual information loss (LMI) is defined in Dhillon et al. [2003] for multi-class classification problems and adapted here for our setting of multilabel classification problems. LMI measures the loss in predictive capability due to clustering of features. The LMI score is a real value between $0$ and $1$ , and a smaller value of the LMI metric is better. It is notable that in Table 1, DEFRAG achieved the lowest LMI score of 0.37 followed by SCBC which achieved an LMI score of 0.39. All other algorithms achieved an LMI score of around 0.50 or more.

Definition 1 (Mutual Information Loss (LMI)).

Given a $K$ -clustering ${\mathcal{F}}$ of $[d]$ features, let $X=[{{\mathbf{x}}}^{1},\ldots,{{\mathbf{x}}}^{n}]\in{\mathbb{R}}^{d\times n}$ denote the original feature matrix, $\tilde{X}=[\tilde{{\mathbf{x}}}^{1},\ldots,\tilde{{\mathbf{x}}}^{n}]\in{\mathbb{R}}^{K\times n}$ denote the matrix of agglomerated features and $Y=[{{\mathbf{y}}}^{1},\ldots,{{\mathbf{y}}}^{n}]\in\left\{{0,1}\right\}^{L\times n}$ denote the label matrix. Then the normalized loss in mutual information is defined as

[TABLE]

where $I(Y;X)$ is the mutual information between the labels and the original features and $I(Y;\tilde{X})$ is the mutual information between the labels and the agglomerated features. These terms are defined below for the multi-label learning setting.

Definition 2 (Mutual Information (MI)).

Let $Z=[{{\mathbf{z}}}^{1},\ldots,{{\mathbf{z}}}^{n}]\in{\mathbb{R}}^{p\times n}$ denote a feature matrix with each data point having $p$ features and let $Y=[{{\mathbf{y}}}^{1},\ldots,{{\mathbf{y}}}^{n}]\in\left\{{0,1}\right\}^{L\times n}$ denote the label matrix. Then the mutual information between the labels and the features is defined as

[TABLE]

where ${\mathbf{\Pi}}=\frac{ZY^{\top}}{{\mathbf{1}}_{p}^{\top}ZY^{\top}{\mathbf{1}}_{L}}\in\Delta_{(p\times L)-1}$ is the joint feature-label probability matrix, $\ln({\mathbf{\Pi}})\in{\mathbb{R}}^{p\times L}$ denotes the matrix obtained by taking entry wise logarithm of the ${\mathbf{\Pi}}$ matrix, $\text{\boldmath$ \mathbf{\pi} $}_{p}=\frac{Z{\mathbf{1}}_{n}}{\left\|{Z{\mathbf{1}}_{n}}\right\|_{1}}\in\Delta_{p-1}$ is the feature probability vector, $\text{\boldmath$ \mathbf{\pi} $}_{L}=\frac{Y{\mathbf{1}}_{n}}{\left\|{Y{\mathbf{1}}_{n}}\right\|_{1}}\in\Delta_{L-1}$ is the label probability vector, $\ln(\text{\boldmath$ \mathbf{\pi} $}_{p})$ and $\ln(\text{\boldmath$ \mathbf{\pi} $}_{L})$ are obtained by taking element-wise logarithm on the vectors $\text{\boldmath$ \mathbf{\pi} $}_{p}$ and $\text{\boldmath$ \mathbf{\pi} $}_{L}$ respectively, and ${\mathbf{1}}_{k}=(1,\ldots,1)^{\top}\in{\mathbb{R}}^{k}$ for any $k>0$ . The above definition assumes that the feature matrix $Z$ contains no negative values. This assumption holds for the bag-of-words representations used in extreme classification problems.

The notion of balance factor is a rather unforgiving metric measuring how balanced the clusters output by a clustering technique are. It is frequently seen that imbalanced feature clusters can lead to reduction in classification performance. The balance factor is a positive real number greater than or equal to one. A smaller balance factor indicates a more balanced clustering with a balance factor of $1$ denoting perfectly balanced clusters. It is notable that in Table 1, DEFRAG achieved the lowest balance factors of 1.11 and 1.08.

Definition 3 (Balance Factor).

Given a $K$ -clustering ${\mathcal{F}}=[F_{1},\ldots,F_{K}]$ of $[d]$ features where we define $d_{k}:=\left|{F_{k}}\right|$ , its balance factor is defined as

[TABLE]

If we have $\min_{k}d_{k}=0$ , then the balance factor is defined to be infinity.
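A minimal sketch of this metric follows, assuming the balance factor is the ratio of the largest to the smallest cluster size, $\max_{k}d_{k}/\min_{k}d_{k}$. That form is an assumption chosen because it matches the properties stated here (at least one, equal to one for perfectly balanced clusters, infinite when some cluster is empty). The function name `balance_factor` and the representation of the clustering as a list of cluster ids in `range(K)` are hypothetical.

```python
import math
from collections import Counter


def balance_factor(assignments, K):
    """Ratio of the largest to the smallest cluster size.

    assignments: iterable giving, for each feature, its cluster id in range(K).
    K: number of clusters (clusters with no features count as size 0).
    Assumption: balance factor = max_k d_k / min_k d_k.
    """
    counts = Counter(assignments)
    sizes = [counts.get(k, 0) for k in range(K)]
    if min(sizes) == 0:
        # An empty cluster makes the balance factor infinite by definition.
        return math.inf
    return max(sizes) / min(sizes)
```

A single undersized cluster is enough to drive this value up, which is why the metric is described as unforgiving.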

The normalized entropy is a gentler metric of balance. It takes a value between $0$ and $1$: a larger value indicates a more balanced clustering, and a smaller value indicates the presence of a few very concentrated clusters. It is notable that in Table 1, DEFRAG achieved the highest normalized entropy values, 0.99.

Definition 4 (Normalized Entropy).

Given a $K$ -clustering ${\mathcal{F}}=[F_{1},\ldots,F_{K}]$ of $[d]$ features where we define $d_{k}:=\left|{F_{k}}\right|$ , its normalized entropy is defined as

[TABLE]

where we define $0\ln 0=0$ to avoid singularities.
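The following sketch computes a normalized entropy from cluster sizes, assuming the form $-\frac{1}{\ln K}\sum_{k=1}^{K}\frac{d_{k}}{d}\ln\frac{d_{k}}{d}$ with $d=\sum_{k}d_{k}$. The normalization by $\ln K$ is an assumption, chosen so that the value lies in $[0,1]$ and equals $1$ for equal-sized clusters; the function name `normalized_entropy` is hypothetical.

```python
import math


def normalized_entropy(sizes):
    """Entropy of the cluster-size distribution, divided by ln K.

    sizes: list of cluster sizes d_1, ..., d_K (empty clusters allowed).
    Assumption: normalization by ln K, so the result lies in [0, 1].
    """
    K = len(sizes)
    d = sum(sizes)
    if K < 2 or d <= 0:
        raise ValueError("need at least two clusters and one feature")
    # Empty clusters are skipped, implementing the convention 0 ln 0 = 0.
    entropy = -sum((s / d) * math.log(s / d) for s in sizes if s > 0)
    return entropy / math.log(K)
```

Unlike the balance factor, an empty or tiny cluster only lowers this value slightly when the remaining clusters are evenly sized, which is what makes it the more forgiving of the two metrics.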

C.2 Additional Experimental Results

In this section we report additional results on the reductions in training time, prediction time, and model size offered by DEFRAG variants, along with their effect on prediction accuracy. We also report results for the nDCG-based splitting method DEFRAG-N, as well as additional results on the performance of the FIAT algorithm in settings with missing features.

