Modeling Winner-Take-All Competition in Sparse Binary Projections

Wenye Li

arXiv:1907.11959·cs.LG·January 28, 2020

TL;DR

This paper introduces supervised and unsupervised models for sparse binary projections that improve both the accuracy and the speed of similarity search. The models are inspired by biological neural mechanisms and are evaluated empirically on similarity search tasks.

Contribution

The paper presents a novel supervised-WTA model and extends it to an unsupervised setting, offering an efficient algorithm for sparse binary projections in similarity search.

Findings

01 Significantly improved search accuracy over state-of-the-art methods

02 Faster running speed in similarity search tasks

03 Effective in both supervised and unsupervised scenarios


Tables

Table 1: Search accuracies on various datasets with fixed output dimension ($d^{\prime}=2,000$). On ImageNet with one million samples, the results of the SUP/LIFTING algorithms are not available due to the prohibitive computation needed to obtain the output representations.

Dataset	$k$	SUP	UNSUP	LSH	FJL	FLY	LIFTING	ITQ	SPH	ISOH
ARTFC ($d = 1,000$)	2	$0.1758$	$0.1143$	$0.0174$	$0.0169$	$0.0474$	$0.1748$	$0.0103$	$0.0097$	$0.0101$
	4	$0.6665$	$0.3531$	$0.0243$	$0.0237$	$0.0673$	$0.6134$	$0.0175$	$0.0138$	$0.0227$
	8	$0.3647$	$0.3944$	$0.0259$	$0.0255$	$0.0376$	$0.2612$	$0.0360$	$0.0173$	$0.0331$
	16	$0.5884$	$0.3267$	$0.0278$	$0.0282$	$0.0402$	$0.1694$	$0.0367$	$0.0202$	$0.0349$
	32	$0.3141$	$0.1319$	$0.0324$	$0.0336$	$0.0443$	$0.0832$	$0.0382$	$0.0235$	$0.0375$
GLOVE ($d = 300$)	2	$0.1317$	$0.1596$	$0.0217$	$0.0198$	$0.0511$	$0.0831$	$0.0221$	$0.0195$	$0.0198$
	4	$0.2310$	$0.3251$	$0.0356$	$0.0328$	$0.0964$	$0.1458$	$0.0617$	$0.0311$	$0.0594$
	8	$0.3061$	$0.3959$	$0.0655$	$0.0618$	$0.1073$	$0.1914$	$0.1209$	$0.0591$	$0.1112$
	16	$0.4030$	$0.4495$	$0.1138$	$0.1081$	$0.1809$	$0.2851$	$0.1939$	$0.1004$	$0.1882$
	32	$0.4374$	$0.4323$	$0.2039$	$0.2139$	$0.2808$	$0.3917$	$0.3208$	$0.1717$	$0.2727$
MNIST ($d = 784$)	2	$0.2860$	$0.2476$	$0.0369$	$0.0422$	$0.1119$	$0.1159$	$0.0288$	$0.0237$	$0.0194$
	4	$0.3338$	$0.3829$	$0.0844$	$0.1029$	$0.1721$	$0.2003$	$0.0852$	$0.0649$	$0.0773$
	8	$0.3885$	$0.4387$	$0.1823$	$0.2004$	$0.2717$	$0.3044$	$0.2008$	$0.1443$	$0.1601$
	16	$0.4698$	$0.4957$	$0.3226$	$0.3409$	$0.3953$	$0.4150$	$0.3207$	$0.2559$	$0.3101$
	32	$0.5108$	$0.5207$	$0.4773$	$0.4846$	$0.5162$	$0.5130$	$0.4415$	$0.3616$	$0.4067$
SIFT ($d = 128$)	2	$0.1706$	$0.1502$	$0.0355$	$0.0349$	$0.1066$	$0.1139$	$0.0274$	$0.0272$	$0.0230$
	4	$0.2240$	$0.2278$	$0.0760$	$0.0707$	$0.1592$	$0.2120$	$0.0550$	$0.0596$	$0.0623$
	8	$0.3768$	$0.3912$	$0.1556$	$0.1692$	$0.2382$	$0.3059$	$0.0951$	$0.1153$	$0.1240$
	16	$0.4353$	$0.4461$	$0.2751$	$0.2698$	$0.3409$	$0.3529$	$0.1712$	$0.1993$	$0.1905$
	32	$0.4839$	$0.4751$	$0.4122$	$0.4290$	$0.4504$	$0.4295$	$0.3217$	$0.2582$	$0.2832$
ImageNet ($d = 1,000$)	2	N.A.	$0.1280$	$0.0251$	$0.0224$	$0.0668$	N.A.	$0.0197$	$0.0174$	$0.0202$
	4	N.A.	$0.1863$	$0.0502$	$0.0578$	$0.1058$	N.A.	$0.0406$	$0.0389$	$0.0392$
	8	N.A.	$0.2177$	$0.0824$	$0.0854$	$0.1519$	N.A.	$0.0925$	$0.0806$	$0.0826$
	16	N.A.	$0.2391$	$0.1522$	$0.1527$	$0.2122$	N.A.	$0.1679$	$0.1338$	$0.1378$
	32	N.A.	$0.2480$	$0.2282$	$0.2337$	$0.2430$	N.A.	$0.2311$	$0.1801$	$0.2002$

Table 2: Search accuracies on the GLOVE dataset with various input dimensions and fixed output dimension ($d^{\prime}=2,000$).

Dimension	$k$	SUP	UNSUP	LSH	FJL	FLY	LIFTING	ITQ	SPH	ISOH
$d = 100$	2	$0.1007$	$0.1210$	$0.0208$	$0.0210$	$0.0503$	$0.0982$	$0.0245$	$0.0229$	$0.0230$
	4	$0.1720$	$0.2274$	$0.0335$	$0.0317$	$0.0787$	$0.1449$	$0.0683$	$0.0452$	$0.0633$
	8	$0.2365$	$0.2816$	$0.0591$	$0.0602$	$0.1059$	$0.1898$	$0.1509$	$0.0705$	$0.1297$
	16	$0.3113$	$0.3572$	$0.1096$	$0.1125$	$0.1698$	$0.2279$	$0.2201$	$0.1232$	$0.1995$
	32	$0.3779$	$0.3831$	$0.1962$	$0.2007$	$0.2581$	$0.2954$	$0.3311$	$0.2398$	$0.2952$
$d = 200$	2	$0.0808$	$0.1073$	$0.0183$	$0.0177$	$0.0387$	$0.0733$	$0.0231$	$0.0223$	$0.1234$
	4	$0.1432$	$0.2030$	$0.0275$	$0.0257$	$0.0624$	$0.1037$	$0.0692$	$0.0395$	$0.0212$
	8	$0.2008$	$0.2551$	$0.0459$	$0.0329$	$0.0786$	$0.1363$	$0.1197$	$0.0624$	$0.1256$
	16	$0.2759$	$0.3189$	$0.0816$	$0.0798$	$0.1284$	$0.1712$	$0.1804$	$0.1105$	$0.1905$
	32	$0.3254$	$0.3331$	$0.1442$	$0.1502$	$0.1991$	$0.2391$	$0.3025$	$0.2051$	$0.2782$
$d = 500$	2	$0.0689$	$0.0866$	$0.0148$	$0.0152$	$0.0226$	$0.0490$	$0.0197$	$0.0173$	$0.0182$
	4	$0.1328$	$0.1702$	$0.0195$	$0.0183$	$0.0394$	$0.0711$	$0.0522$	$0.0301$	$0.0397$
	8	$0.1878$	$0.2252$	$0.0278$	$0.0276$	$0.0421$	$0.0892$	$0.0973$	$0.0521$	$0.0885$
	16	$0.2768$	$0.2696$	$0.0437$	$0.0469$	$0.0710$	$0.1188$	$0.1497$	$0.0995$	$0.1305$
	32	$0.3172$	$0.2727$	$0.0728$	$0.0804$	$0.1115$	$0.1806$	$0.2119$	$0.1502$	$0.2117$
$d = 1, 000$	2	$0.0464$	$0.0508$	$0.0132$	$0.0129$	$0.0172$	$0.0293$	$0.0177$	$0.0166$	$0.0179$
	4	$0.0985$	$0.1131$	$0.0162$	$0.0175$	$0.0288$	$0.0396$	$0.0356$	$0.0289$	$0.0322$
	8	$0.1447$	$0.1615$	$0.0214$	$0.0261$	$0.0303$	$0.0490$	$0.0434$	$0.0312$	$0.0365$
	16	$0.2115$	$0.2071$	$0.0305$	$0.0372$	$0.0474$	$0.0662$	$0.0912$	$0.0787$	$0.0883$
	32	$0.2194$	$0.2049$	$0.0464$	$0.0511$	$0.0699$	$0.0874$	$0.1507$	$0.1339$	$0.1303$

Equations

$$y_{i}=\begin{cases}1,&\text{if $x_{i}$ is among the top-$k$ entries of $\left(x_{1},\cdots,x_{d}\right)$,}\\ 0,&\text{otherwise.}\end{cases}$$

$$w_{i.}x_{.m}\geq w_{j.}x_{.m},\quad\text{if $y_{im}=1$ and $y_{jm}=0$}$$

$$L_{s}\left(W\right)=\sum_{m=1}^{n}\sum_{i=1}^{d^{\prime}}\sum_{j=1}^{d^{\prime}}y_{im}\left(1-y_{jm}\right)\left(w_{i.}x_{.m}-w_{j.}x_{.m}\right)$$

$$\max_{W}L_{s}\left(W\right)\iff\max_{w_{i.}}\;w_{i.}\left[\sum_{m=1}^{n}x_{.m}\left(y_{im}-\frac{k}{d^{\prime}}\right)\right]\quad\left(1\leq i\leq d^{\prime}\right)$$

$$w_{i.}\in\left\{0,1\right\}^{1\times d}$$

$$\ell_{.i}=\sum_{m=1}^{n}x_{.m}\left(y_{im}-\frac{k}{d^{\prime}}\right)$$

$$w_{i.}^{*}=WTA_{c}^{d}\left(\ell_{.i}\right)$$

$$L_{u}\left(W,Y\right)=\sum_{m=1}^{n}\sum_{i=1}^{d^{\prime}}\sum_{j=1}^{d^{\prime}}y_{im}\left(1-y_{jm}\right)\left(w_{i.}x_{.m}-w_{j.}x_{.m}\right)$$

$$y_{.m}^{t}=WTA_{k}^{d^{\prime}}\left(W^{t}x_{.m}\right)$$

$$w_{i.}^{t+1}=WTA_{c}^{d}\left(\ell_{.i}^{t}\right)$$


Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Face and Expression Recognition · Machine Learning in Bioinformatics

Full text

Modeling Winner-Take-All Competition in Sparse Binary Projections

Wenye Li

School of Science and Engineering

The Chinese University of Hong Kong, Shenzhen

Shenzhen, China

Abstract

Inspired by advances in biological science, the study of sparse binary projection models has attracted considerable recent research attention. The models project dense input samples into a higher-dimensional space and output sparse binary data representations after the Winner-Take-All competition, subject to the constraint that the projection matrix is also sparse and binary. Following this line of work, we developed a supervised-WTA model for the case where training samples with both input and output representations are available, from which the optimal projection matrix can be obtained with a simple, effective and efficient algorithm. We further extended the model and the algorithm to an unsupervised setting where only the input representation of the samples is available. In a series of empirical evaluations on similarity search tasks, the proposed models reported significantly improved results over the state-of-the-art methods in both search accuracy and running speed. These results give us strong confidence that the work provides a highly practical tool for real-world applications.

Keywords: Sparse Binary Projection $\cdot$ Winner-Take-All Competition $\cdot$ Unsupervised Learning

1 Introduction

Random projection has emerged as a powerful tool in data analysis applications [1]. It is often used to reduce the dimension of data samples in the Euclidean space. It provides a simple and computationally efficient way to reduce the storage complexity of the data by trading a controlled amount of representation error for faster processing speed and smaller model sizes [2].

Very recently, a sparse binary projection model called the FLY algorithm was designed on the basis of strong biological evidence, and it has attracted much attention. Instead of performing dimension reduction, the algorithm increases the dimension of the input samples with a random sparse binary projection matrix. After the winner-take-all (WTA) competition that happens in the output space, the samples are converted into a set of sparse binary vectors. In similarity search tasks, it was reported that such sparse binary vectors outperformed the hashed vectors produced by the classical locality sensitive hashing (LSH) method, which is based on random dense projection [3].

Following this line of work, we propose two models that treat the WTA competition explicitly. Instead of relying on the random generation of the projection matrix, one of our models seeks the optimal projection matrix under a supervised setting, while the other model operates purely in an unsupervised manner. For each model, we derived an algorithm that is surprisingly simple. In empirical evaluations, both algorithms reported significantly improved similarity search accuracy and running speed over the state-of-the-art approaches, and hence provide a practical tool with high potential for data analysis applications.

A note on notation. Unless specified otherwise, a capital letter, such as $W$ , denotes a matrix. A lower-cased letter, with or without a subscript, denotes a vector or a scalar. For example $w_{i.}$ denotes the $i$ -th row, $w_{.j}$ denotes the $j$ -th column, and $w_{ij}$ denotes the $\left(i,j\right)$ -th entry of the matrix $W$ .

The paper is organized as follows. Section 2 introduces the necessary background. Section 3 presents our models and the algorithms. Section 4 reports the experiments and the results, followed by the conclusion in Section 5.

2 Background

2.1 Sparse Binary Projection Algorithms

Different from classical projection methods that commonly map data from a higher-dimensional space to a lower-dimensional space, the FLY algorithm increases the dimension of the data. It was designed by simulating the fruit fly’s olfactory circuit, whose function is to associate similar odors with similar tags. Each odor is initially represented as a $50$ -dimensional feature vector of firing rates. To associate each odor with a tag involves three steps. Firstly, a divisive normalization step [4] centers the mean of the feature vector. Secondly, the dimension of the feature vector is expanded from $50$ to $2,000$ with a sparse binary connection matrix [5, 6], which has the same number of ones in each row. Thirdly, the WTA competition is involved as a result of strong inhibitory feedback coming from an inhibitory neuron. After the competition, all but the highest-firing $5\%$ out of the $2,000$ features are silenced [7]. These remaining $5\%$ features just correspond to the tag assigned to the input odor.
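The three steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the reference implementation of [3]: the function name `fly_hash`, the use of mean subtraction for the normalization step, the tie-breaking rule, and the choice of 5 ones per row are all hypothetical choices made for the example.

```python
import random

def fly_hash(x, d_out, c, k, seed=0):
    """Sketch of a FLY-style hash: center, sparse binary projection, top-k WTA."""
    rng = random.Random(seed)  # a fixed seed reproduces the same projection matrix
    d = len(x)
    mean = sum(x) / d
    centered = [v - mean for v in x]  # step 1: center the feature vector
    # Step 2: each output unit sums c randomly chosen input features,
    # which is a sparse binary matrix with c ones in every row.
    rows = [rng.sample(range(d), c) for _ in range(d_out)]
    projected = [sum(centered[j] for j in row) for row in rows]
    # Step 3: WTA keeps the k largest activations (assumed tie-break: lower index).
    winners = sorted(range(d_out), key=lambda i: (-projected[i], i))[:k]
    code = [0] * d_out
    for i in winners:
        code[i] = 1
    return code

# Sizes from the text: 50 inputs, 2,000 outputs, top 5% (100 units) kept.
data_rng = random.Random(1)
odor = [data_rng.random() for _ in range(50)]
tag = fly_hash(odor, d_out=2000, c=5, k=100)
print(sum(tag), len(tag))
```

Reusing the same `seed` for every sample is what makes the tags comparable, since all samples then share one projection matrix.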

The FLY algorithm can be studied as a special form of the LSH method which produces similar hashes for similar input samples. But different from the classical LSH method which reduces the data dimension, the FLY algorithm increases the dimension with a random sparse binary matrix, while ensuring the sparsity and binarization of the data in the output space. Empirically, the FLY algorithm reported improved results over the LSH method [3] in similarity search applications.

The success of the FLY algorithm inspired considerable follow-up research. Of particular interest to us is the LIFTING algorithm [8], which removes the randomness assumption on the projection matrix, a change that is partially supported by recent biological discoveries [6]. In that work, the projection matrix is obtained through supervised learning. Suppose training samples with both dense input representation $X\in\mathcal{R}^{d\times n}$ and sparse output representation $Y\in\left\{0,1\right\}^{d^{\prime}\times n}$ are available. The LIFTING algorithm seeks the projection matrix $W$ that minimizes $\left\|WX-Y\right\|_{F}^{2}+\beta\left\|W\right\|_{\frac{1}{2}}$ in the feasible region of sparse binary matrices. To solve the optimization problem, the Frank-Wolfe algorithm was found to perform well [9, 10].

2.2 Winner-Take-All Competition

Evidence from neuroscience shows that excitation and inhibition are common activities in neurons [7, 11]. Based on lateral information, some neurons rise to the excitatory state, while the others are inhibited and remain silent. Such excitation and inhibition result in competition among neurons. Modeling neuron competition is of key importance, and it has found useful applications in a variety of tasks [12, 13]. In machine learning specifically, the competition mechanism has long motivated the design of algorithms, from the early self-organizing map [14] to more recent work on novel neural network architectures [15, 16].

To model the competition stage, the WTA model is routinely adopted. We are interested in the following form of the WTA model. For a $d$ -dimensional input vector $x$ and a given hash length $k\left(k\ll d\right)$ , a function $WTA_{k}^{d}:\mathcal{R}^{d}\rightarrow\left\{0,1\right\}^{d}$ outputs a vector $y=WTA_{k}^{d}\left(x\right)$ satisfying, for each $1\leq i\leq d$ ,

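$$y_{i}=\begin{cases}1,&\text{if $x_{i}$ is among the top-$k$ entries of $\left(x_{1},\cdots,x_{d}\right)$,}\\ 0,&\text{otherwise.}\end{cases}\tag{1}$$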

Thus the output entries with value $1$ just mark the positions of top- $k$ values of $x$ . For simplicity and without causing ambiguity, we do not differentiate whether the input/output vector of the $WTA$ function is a row vector or a column vector.
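As a small illustration of this definition, the sketch below implements the $WTA_{k}^{d}$ function on plain Python lists. The definition does not say how ties are resolved, so breaking ties in favor of the lower index is an assumption made here, as is the function name `wta`.

```python
def wta(x, k):
    """Return a binary list marking the positions of the k largest entries of x.

    Assumption: ties are broken in favor of the lower index.
    """
    order = sorted(range(len(x)), key=lambda i: (-x[i], i))
    y = [0] * len(x)
    for i in order[:k]:
        y[i] = 1
    return y

# The two largest entries are 1.2 (position 1) and 0.9 (position 3).
print(wta([0.3, 1.2, -0.5, 0.9, 0.1], 2))  # [0, 1, 0, 1, 0]
```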

3 Model

3.1 Supervised Training

We start from a supervised setting. Let a set of samples be given in the form of $X\in\mathcal{R}^{d\times n}$ and $Y\in\left\{0,1\right\}^{d^{\prime}\times n}$ with each $x_{.m}\left(1\leq m\leq n\right)$ being an input sample and $y_{.m}$ being its output representation satisfying $\left\|y_{.m}\right\|_{1}=k$ for a given integer $k$ . We assume that, for a fixed integer $c$ (as in [3], $c$ is set to $\left\lfloor 0.1\times d\right\rfloor$ in this paper), there exists a projection matrix $W\in\left\{0,1\right\}^{d^{\prime}\times d}$ with $\left\|w_{i.}\right\|_{1}=c$ ( $1\leq i\leq d^{\prime}$ ) and $y_{.m}=WTA_{k}^{d^{\prime}}\left(Wx_{.m}\right)$ .

The WTA function defined in Eq. (1) satisfies:

$$w_{i.}x_{.m}\geq w_{j.}x_{.m},\quad\text{if $y_{im}=1$ and $y_{jm}=0$,}\tag{2}$$

for all $1\leq m\leq n$ and $1\leq i,j\leq d^{\prime}$ .

Now we are interested in inferring such a projection matrix $W$ from the given data. But unfortunately, seeking the matrix directly from Eq. (2) is generally hard. A matrix that satisfies all the constraints may not exist due to the noise in the observed samples. Even if it exists, the computational requirement can be non-trivial. A straightforward modeling of the problem as a linear integer program would involve $d^{\prime}\times d$ variables and $O\left(nk\left(d^{\prime}-k\right)+d^{\prime}\right)$ constraints, which is infeasible to solve even for moderately small $n$ and $d^{\prime}$ .

To ensure the tractability, we resort to a relaxation approach. For any feasible $m$ , $i$ and $j$ , we define a measure $y_{im}\left(1-y_{jm}\right)\left(w_{i.}x_{.m}-w_{j.}x_{.m}\right)$ to quantify the compliance with the condition in Eq. (2). When $y_{im}=1$ and $y_{jm}=0$ , the measure is non-negative if the condition is met; otherwise, it is negative. Naturally, we sum up the values of the measure over all $m$ , $i$ and $j$ , and define

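$$L_{s}\left(W\right)=\sum_{m=1}^{n}\sum_{i=1}^{d^{\prime}}\sum_{j=1}^{d^{\prime}}y_{im}\left(1-y_{jm}\right)\left(w_{i.}x_{.m}-w_{j.}x_{.m}\right).\tag{3}$$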

The value of $L_{s}\left(W\right)$ measures how well a matrix $W$ meets the conditions in Eq. (2). Maximizing $L_{s}$ with respect to $W$ in the feasible region of sparse binary matrices provides a principled way to seek the projection matrix. We call this formulation the supervised-WTA model.

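Since $\sum_{i=1}^{d^{\prime}}y_{im}=k$ and $\sum_{j=1}^{d^{\prime}}\left(1-y_{jm}\right)=d^{\prime}-k$ for every $m$ , the objective can be rewritten as

$$\begin{aligned}L_{s}\left(W\right)&=\sum_{m=1}^{n}\left[\left(d^{\prime}-k\right)\sum_{i=1}^{d^{\prime}}y_{im}w_{i.}x_{.m}-k\sum_{j=1}^{d^{\prime}}\left(1-y_{jm}\right)w_{j.}x_{.m}\right]\\&=\sum_{m=1}^{n}\sum_{i=1}^{d^{\prime}}\left(d^{\prime}y_{im}-k\right)w_{i.}x_{.m}\\&=d^{\prime}\sum_{i=1}^{d^{\prime}}w_{i.}\left[\sum_{m=1}^{n}x_{.m}\left(y_{im}-\frac{k}{d^{\prime}}\right)\right].\end{aligned}$$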

Therefore, maximizing $L_{s}\left(W\right)$ is equivalent to $d^{\prime}$ maximization sub-problems. Each sub-problem seeks a row vector $w_{i.}\left(1\leq i\leq d^{\prime}\right)$ by

$$\max_{w_{i.}}\;w_{i.}\left[\sum_{m=1}^{n}x_{.m}\left(y_{im}-\frac{k}{d^{\prime}}\right)\right]\tag{4}$$

subject to:

$$w_{i.}\in\left\{0,1\right\}^{1\times d}\quad\text{and}\quad\left\|w_{i.}\right\|_{1}=c.\tag{5}$$

Denote

$$\ell_{.i}=\sum_{m=1}^{n}x_{.m}\left(y_{im}-\frac{k}{d^{\prime}}\right),\tag{6}$$

and the optimal solution of $w_{i.}$ to Eq. (4) is given by

$$w_{i.}^{*}=WTA_{c}^{d}\left(\ell_{.i}\right).\tag{7}$$
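The closed-form solution can be sketched directly: for each output unit, accumulate the weighted sum of inputs $\sum_{m}x_{.m}\left(y_{im}-k/d^{\prime}\right)$ and keep its $c$ largest entries. The code below is a hedged illustration on toy data, not the author's implementation. The names `train_supervised` and `objective`, the row-per-sample data layout, the tie-breaking rule and the toy sizes are all assumptions. The brute-force `objective` evaluates the triple sum of Eq. (3) so that the result can be compared against random feasible matrices.

```python
import random

def wta(x, k):
    """Binary list marking the k largest entries of x (ties: lower index first)."""
    order = sorted(range(len(x)), key=lambda i: (-x[i], i))
    y = [0] * len(x)
    for i in order[:k]:
        y[i] = 1
    return y

def train_supervised(X, Y, c):
    """X[m] is the m-th input (length d); Y[m] is its binary code (length d')."""
    n, d, d_out = len(X), len(X[0]), len(Y[0])
    k = sum(Y[0])  # every code is assumed to contain exactly k ones
    W = []
    for i in range(d_out):
        # Eq. (6): ell_i = sum_m x_m * (y_im - k / d')
        ell = [sum(X[m][j] * (Y[m][i] - k / d_out) for m in range(n))
               for j in range(d)]
        W.append(wta(ell, c))  # Eq. (7): keep the c largest entries of ell_i
    return W

def objective(W, X, Y):
    """Brute-force evaluation of L_s(W) as the triple sum in Eq. (3)."""
    total = 0.0
    for x, y in zip(X, Y):
        act = [sum(wj * xj for wj, xj in zip(w, x)) for w in W]
        for i, yi in enumerate(y):
            for j, yj in enumerate(y):
                total += yi * (1 - yj) * (act[i] - act[j])
    return total

# Toy data with hypothetical sizes: n = 12 samples, d = 6, d' = 8, k = 2, c = 2.
rng = random.Random(0)
X = [[rng.gauss(0, 1) for _ in range(6)] for _ in range(12)]
Y = [wta([rng.random() for _ in range(8)], 2) for _ in range(12)]
W_star = train_supervised(X, Y, c=2)
print(objective(W_star, X, Y))
```

Because each row is solved independently, the loop over `i` is the natural place to parallelize.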

3.2 Unsupervised Training

The supervised-WTA model utilizes both input and output representations to learn a projection matrix. When only the input representation is available, we can extend the work to an unsupervised-WTA model, by maximizing the objective:

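$$L_{u}\left(W,Y\right)=\sum_{m=1}^{n}\sum_{i=1}^{d^{\prime}}\sum_{j=1}^{d^{\prime}}y_{im}\left(1-y_{jm}\right)\left(w_{i.}x_{.m}-w_{j.}x_{.m}\right)\tag{8}$$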

subject to the constraints: $w_{i.}\in\left\{0,1\right\}^{1\times d}$ , $\left\|w_{i.}\right\|_{1}=c$ , $y_{.m}\in\left\{0,1\right\}^{d^{\prime}\times 1}$ , and $\left\|y_{.m}\right\|_{1}=k$ for all $1\leq i\leq d^{\prime}$ and $1\leq m\leq n$ .

Different from the supervised model, the unsupervised model treats the unknown output representation $Y$ as a variable, and jointly optimizes on both $W$ and $Y$ . To maximize $L_{u}$ , an alternating algorithm can be used. Start with a random initialization of $W$ as $W^{1}$ , and solve the model iteratively. In $t$ -th ( $t=1,2,\cdots$ ) iteration, maximize $L_{u}\left(W^{t},Y\right)$ with respect to $Y$ and get the optimal $Y^{t}$ . Then maximize $L_{u}\left(W,Y^{t}\right)$ with respect to $W$ and get the optimal $W^{t+1}$ .

The optimal $Y^{t}$ is given by:

$$y_{.m}^{t}=WTA_{k}^{d^{\prime}}\left(W^{t}x_{.m}\right)\tag{9}$$

for all $1\leq m\leq n$ . Similarly to the supervised model, the optimal $W^{t+1}$ is given by:

$$w_{i.}^{t+1}=WTA_{c}^{d}\left(\ell_{.i}^{t}\right)\tag{10}$$

for all $1\leq i\leq d^{\prime}$ , where $\ell_{.i}^{t}=\sum_{m=1}^{n}x_{.m}\left(y_{im}^{t}-\frac{k}{d^{\prime}}\right)$ .

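Denote $L_{u}^{t}=L_{u}\left(W^{t},Y^{t}\right)$ . Each half-step maximizes the objective over one block of variables while the other block is held fixed, so $L_{u}\left(W^{t},Y^{t}\right)\leq L_{u}\left(W^{t+1},Y^{t}\right)\leq L_{u}\left(W^{t+1},Y^{t+1}\right)$ and the sequence $\left\{L_{u}^{t}\right\}$ is monotonically non-decreasing for $t=1,2,\cdots$ . Because the feasible set of sparse binary pairs $\left(W,Y\right)$ is finite, the sequence is bounded above. The alternating optimization process is therefore guaranteed to converge, and it can be stopped once the objective value $L_{u}^{t}$ no longer increases.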
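The alternating procedure can be sketched as follows. This is a minimal illustration under stated assumptions rather than the author's code: the function names, the row-per-sample data layout, the tie-breaking rule, the stopping test (stop when $W$ no longer changes or after `max_iters` rounds) and the toy sizes are all hypothetical. The objective is evaluated through the algebraically equivalent form $d^{\prime}\sum_{i}w_{i.}\ell_{.i}$, which holds whenever every code has exactly $k$ ones.

```python
import random

def wta(x, k):
    """Binary list marking the k largest entries of x (ties: lower index first)."""
    order = sorted(range(len(x)), key=lambda i: (-x[i], i))
    y = [0] * len(x)
    for i in order[:k]:
        y[i] = 1
    return y

def project(W, x):
    return [sum(wj * xj for wj, xj in zip(w, x)) for w in W]

def encode(W, X, k):
    """Eq. (9): y_m = WTA_k(W x_m) for every sample."""
    return [wta(project(W, x), k) for x in X]

def update_w(X, Y, k, c):
    """Eq. (10): row i keeps the c largest entries of ell_i."""
    n, d, d_out = len(X), len(X[0]), len(Y[0])
    return [wta([sum(X[m][j] * (Y[m][i] - k / d_out) for m in range(n))
                 for j in range(d)], c) for i in range(d_out)]

def objective(W, X, Y, k):
    """L_u(W, Y) in the factored form d' * sum_i w_i . ell_i."""
    d_out = len(W)
    return d_out * sum((yi - k / d_out) * a
                       for x, y in zip(X, Y)
                       for yi, a in zip(y, project(W, x)))

def train_unsupervised(X, d_out, k, c, max_iters=50, seed=0):
    rng = random.Random(seed)
    d = len(X[0])
    # Random feasible W^1: exactly c ones in every row.
    W = [wta([rng.random() for _ in range(d)], c) for _ in range(d_out)]
    history = []
    for _ in range(max_iters):
        Y = encode(W, X, k)                      # optimal Y^t for the current W^t
        history.append(objective(W, X, Y, k))    # L_u^t
        W_next = update_w(X, Y, k, c)            # optimal W^{t+1} for Y^t
        if W_next == W:                          # fixed point reached
            break
        W = W_next
    return W, encode(W, X, k), history

# Toy data with hypothetical sizes: n = 40, d = 10, d' = 16, k = 2, c = 3.
data_rng = random.Random(1)
X = [[data_rng.gauss(0, 1) for _ in range(10)] for _ in range(40)]
W, Y, history = train_unsupervised(X, d_out=16, k=2, c=3)
print(history)
```

The recorded `history` should never decrease (up to floating-point rounding), which gives a cheap sanity check on any implementation of the two update rules.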

It is worth mentioning that the unsupervised-WTA model can be studied as a generic clustering method [17]. The model puts $n$ data samples into $d^{\prime}$ clusters, and each sample belongs to $k$ clusters. The special case of $k=1$ leads to a hard clustering method. Two samples that both have a one in the same output dimension share the corresponding cluster membership.

The unsupervised-WTA model can also be treated as a feature selection method [18]. This can be seen from the fact that each output dimension is associated with a subset of $c$ features, instead of all $d$ features in the input space. The model is able to choose these $c$ features automatically and encode the information in the projection matrix $W$ . A detailed discussion of the clustering and the feature selection viewpoints goes beyond the scope of this paper and is hence omitted.
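Both viewpoints amount to reading the learned binary matrices row by row. The sketch below shows one way to do that. The helper names `cluster_members` and `selected_features` and the tiny hand-written matrices are hypothetical, and the code stores one sample per row of `Y`, which is the transpose of the layout used in the text.

```python
def cluster_members(Y):
    """For each output dimension i, list the samples m with y_im = 1."""
    d_out = len(Y[0])
    return [[m for m, y in enumerate(Y) if y[i] == 1] for i in range(d_out)]

def selected_features(W):
    """For each output dimension i, list the input features j with w_ij = 1."""
    return [[j for j, w in enumerate(row) if w == 1] for row in W]

# Hand-written example: 3 samples, d' = 3 output dimensions, k = 2, d = 4, c = 2.
Y = [[1, 0, 1],
     [0, 1, 1],
     [1, 0, 1]]
W = [[1, 0, 0, 1],
     [0, 1, 1, 0],
     [1, 1, 0, 0]]
print(cluster_members(Y))    # [[0, 2], [1], [0, 1, 2]]
print(selected_features(W))  # [[0, 3], [1, 2], [0, 1]]
```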

3.3 Complexity Issues

Computing the optimal solution to the supervised-WTA model is straightforward and can be implemented with high efficiency. To obtain each projection vector $w_{i.}$ , a naïve implementation needs $O\left(dn+d\log c\right)$ operations, among which $O\left(dn\right)$ are for the summation operation in Eq. (6) and $O\left(d\log c\right)$ are for the sorting operations in Eq. (7) by the Heapsort algorithm [19]. Therefore, computing the whole projection matrix needs $O\left(d^{\prime}dn+d^{\prime}d\log c\right)$ operations. In fact, by utilizing the sparse structure of the output matrix $Y$ , the computational complexity for $W$ can be further reduced to $O\left(kdn+d^{\prime}d\log c\right)$ . As seen in Section 4.3, this is a highly efficient result.
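One way to exploit the sparsity of $Y$ is to split the sum in Eq. (6) into $\sum_{m:\,y_{im}=1}x_{.m}$ minus $\frac{k}{d^{\prime}}\sum_{m}x_{.m}$. The second term is shared by all rows and computed once, and each sample then contributes to only $k$ rows. This decomposition is our reading of how the stated bound can be reached and is not spelled out in the text. The function names and the index-list representation of $Y$ are assumptions for the example.

```python
import random

def ell_dense(X, Y, k):
    """Direct evaluation of Eq. (6) for every row i: O(d' d n) operations."""
    n, d, d_out = len(X), len(X[0]), len(Y[0])
    return [[sum(X[m][j] * (Y[m][i] - k / d_out) for m in range(n))
             for j in range(d)] for i in range(d_out)]

def ell_sparse(X, active, d_out, k):
    """Same quantity, using active[m] = indices i with y_im = 1."""
    d = len(X[0])
    total = [sum(x[j] for x in X) for j in range(d)]        # sum_m x_m, once
    ell = [[-(k / d_out) * t for t in total] for _ in range(d_out)]
    for x, idx in zip(X, active):                           # k rows per sample
        for i in idx:
            row = ell[i]
            for j in range(d):
                row[j] += x[j]
    return ell

# Toy check with hypothetical sizes: n = 30, d = 5, d' = 12, k = 3.
rng = random.Random(0)
X = [[rng.gauss(0, 1) for _ in range(5)] for _ in range(30)]
active = [sorted(rng.sample(range(12), 3)) for _ in range(30)]
Y = [[1 if i in idx else 0 for i in range(12)] for idx in active]
dense = ell_dense(X, Y, k=3)
sparse = ell_sparse(X, active, d_out=12, k=3)
print(max(abs(a - b) for r, s in zip(dense, sparse) for a, b in zip(r, s)))
```

Storing each code as its list of active indices also matches the reduced storage bound, since only $k$ integers per sample are kept.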

To solve the unsupervised-WTA model, in each iteration we need to compute both $Y$ and $W$ . Computing one $Y$ needs $O\left(cdn+d^{\prime}d\log k\right)$ operations by utilizing the sparse structure of $W$ , where $O\left(cdn\right)$ are for multiplying $W$ with $X$ and $O\left(d^{\prime}d\log k\right)$ are for the sorting operations in Eq. (9). Computing one $W$ has the same complexity as in the supervised-WTA model, $O\left(kdn+d^{\prime}d\log c\right)$ . Therefore, the total complexity per iteration is $O\left(\left(k+c\right)dn+d^{\prime}d\log\left(kc\right)\right)$ , which is also an efficient solution as seen in Section 4.3.

For both WTA models, the memory requirement is mainly from the matrices $X$ , $Y$ and $W$ , and the storage complexity is $O\left(dn+d^{\prime}n+d^{\prime}d\right)$ , which can be further reduced to $O\left(dn+kn+d^{\prime}c\right)$ if sparse matrix representation is adopted.

The training algorithms are parallelizable. Each vector of $W$ and $Y$ can be solved independently with high parallel efficiency. It is also notable that, after simple pre-processing of the training data, all computations only involve simple vector addition and scalar comparison operations.

4 Evaluation

4.1 General Settings

To evaluate the performance of the proposed models, we carried out a series of experiments under the following settings.

Application: Similarly to the work of [3], we applied the proposed models in similarity search tasks. Similarity search aims to find similar samples to a given query object among potential candidates, according to a certain distance or similarity measure [20]. The complexity of accurately determining similar samples relies heavily on both the number of candidates and their dimension. Computing the distances seems straightforward, but unfortunately could often become prohibitive if the number of candidates is too large or the dimension of the data is too high.

To handle the difficulty brought by the high dimension of the input data, we can either reduce the data dimension while approximately preserving the pairwise distances, or increase the dimension while confining the data in the output space to be sparse and binary, in the hope of significantly improving the search speed with the new representation.

Objective: Our major objective is to evaluate and compare the similarity search accuracies of different algorithms. Each sample in a given dataset was used, in turn, as the query object, and the other samples in the same dataset were used as the search candidates. For each query object, we compared its $100$ nearest neighbors in the output space with its $100$ nearest neighbors in the input space, and recorded the ratio of common neighbors in both spaces. The ratio is averaged over all query objects to give the search accuracy of each algorithm. A higher similarity search accuracy indicates that the algorithm better preserves the locality structure from the input space to the output space.
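The accuracy measure described above can be sketched as a neighbor-overlap computation. The distance measures are not stated at this point in the text, so the sketch assumes squared Euclidean distance in the input space and Hamming distance between the binary codes. The function names and the tie-breaking rule (lower index first) are also assumptions. The brute-force loops are meant for small examples only.

```python
def nearest(dist, query, num):
    """Indices of the num closest candidates to the query, excluding the query."""
    order = sorted((i for i in range(len(dist)) if i != query),
                   key=lambda i: (dist[i], i))
    return set(order[:num])

def search_accuracy(X, codes, num_neighbors):
    """Average fraction of input-space neighbors recovered in the output space."""
    n = len(X)
    total = 0.0
    for q in range(n):
        d_in = [sum((a - b) ** 2 for a, b in zip(X[q], X[i])) for i in range(n)]
        d_out = [sum(a != b for a, b in zip(codes[q], codes[i])) for i in range(n)]
        common = nearest(d_in, q, num_neighbors) & nearest(d_out, q, num_neighbors)
        total += len(common) / num_neighbors
    return total / n

# Four points on a line. Thermometer codes preserve the ordering exactly,
# while constant codes carry no information about the inputs.
X = [[0.0], [1.0], [2.0], [3.0]]
good_codes = [[0, 0, 0], [1, 0, 0], [1, 1, 0], [1, 1, 1]]
flat_codes = [[0], [0], [0], [0]]
print(search_accuracy(X, good_codes, 1))  # 1.0
print(search_accuracy(X, flat_codes, 1))  # 0.5
```

In the experiments the neighborhood size is $100$; the value `1` is used here only to keep the example readable.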

Datasets: In the evaluation, four real datasets and five artificially generated datasets were used. The real datasets have the input representation $X$ only; while the artificial datasets have both the input representation $X$ and the output representation $Y$ . Specifically these datasets are:

• GLOVE [21]: $100$ - to $1000$ -dimensional GloVe word vectors trained on a subset of 330 million tokens from Wikimedia database dumps (https://dumps.wikimedia.org/) with the $50,000$ most frequent words.

• ImageNet [22]: a large collection of images represented as $1,000$ -dimensional visual words quantized from SIFT features.

• MNIST [23]: $784$ -dimensional images of handwritten digits in gray-scale.

• SIFT [24]: $128$ -dimensional SIFT descriptors of images used for similarity search.

• ARTFC: five sets of $1,000$ -dimensional dense vectors ( $X$ ) and $2,000$ -dimensional sparse binary vectors ( $Y$ ). For each hash length of $k=2/4/8/16/32$ , a set of $2,000$ -dimensional sparse binary vectors was randomly generated with that hash length. Then the vectors were projected to $1,000$ -dimensional dense vectors through principal component analysis. In this way, the samples' pairwise distances are roughly preserved between the input space and the output space; i.e., $\left\|x_{.m}-x_{.m^{\prime}}\right\|^{2}\approx\left\|y_{.m}-y_{.m^{\prime}}\right\|^{2}$ for all pairs of samples in the same set.

Algorithms to compare: We compared the proposed supervised-WTA (denoted by SUP) model and the unsupervised-WTA (UNSUP) model with the LSH algorithm [25, 26], the fast Johnson-Lindenstrauss projection (FJL) algorithm [27], the FLY algorithm [3] and the LIFTING algorithm [8]. The LSH algorithm maps $d$ -dimensional inputs to $k$ -dimensional dense vectors with a random dense projection matrix. The FJL algorithm is a fast implementation of the LSH algorithm with a sparse projection matrix. The FLY algorithm uses a random sparse binary matrix to map $d$ -dimensional inputs to $d^{\prime}$ -dimensional vectors. The LIFTING algorithm trains a sparse binary projection matrix in a supervised manner for the $d$ -dimensional to $d^{\prime}$ -dimensional projection. Both the FLY and the LIFTING algorithms involve a WTA competition stage in the output space to generate sparse binary vectors for each hash length.

In addition, we compared against a number of other hashing algorithms, including iterative quantization (ITQ) [28], spherical hashing (SPH) [29] and isotropic hashing (ISOH) [30]. These algorithms are widely used in the literature to produce sparse binary data embeddings.

Computing environment: All the algorithms were implemented in MATLAB and run on an 8-way computing server, on which a maximum of $128$ threads were enabled for each algorithm. For the LIFTING algorithm, IBM CPLEX was used as the linear program solver required by the Frank-Wolfe algorithm.

4.2 Similarity Search Accuracy

We carried out the experiment on the artificial datasets and the real datasets. From each ARTFC dataset, we randomly chose $10,000$ training samples with both the input ( $X$ ) and the output ( $Y$ ) representations, and chose another $10,000$ testing samples with the input representation only. For the two proposed WTA models, we trained a sparse binary projection matrix $W$ each based on the training data. Then we generated $2,000$ -dimensional sparse binary output vectors via the WTA competition after projecting the testing samples with the matrix. For the LIFTING algorithm, the same training and testing procedures were applied. For all other algorithms, we applied each of them on the testing samples to get either dense or sparse binary output vectors. Then the output vectors are used in similarity search and compared against the input vectors, as illustrated in Section 4.1.

We repeated the process for fifty runs and recorded the average accuracies. The results are given in Table 1. Each row shows the similarity search accuracies with a specific hash length. (As in [3, 8], the hash length is defined as the number of ones in each output vector for the FLY, LIFTING and WTA algorithms. For the other algorithms, it is defined as the output dimension.) Consistent with the results reported in [3], the sparse binary projection algorithms reported improved results over the classical LSH method. It is evident that, with the support of the supervised information, the LIFTING and the supervised-WTA algorithms reported further improved results over the FLY algorithm. Most prominently, with the hash length $k=4$ , the FLY algorithm has an accuracy of $6.73\%$ , while the supervised-WTA model's accuracy reaches $66.7\%$ , almost ten times higher. When comparing the two supervised algorithms, the supervised-WTA model outperformed the LIFTING algorithm with all hash lengths.

Among the unsupervised learning algorithms, the proposed unsupervised-WTA model reported the best performances, significantly better than the results given by the LSH, FJL, FLY, ITQ, SPH and ISOH algorithms. Its accuracies are even better than the supervised-WTA model with the hash length $k=8$ .

On the GLOVE/MNIST/SIFT datasets, only the input representation $X$ is available. We randomly chose $10,000$ samples for training and $10,000$ samples for testing. As suggested in [8], we computed $Y^{\ast}=\arg\min_{Y}\frac{1}{2}\left\|X^{T}X-Y^{T}Y\right\|_{F}^{2}+\gamma\left\|Y\right\|_{\frac{1}{2}}$ for the training data via the Frank-Wolfe algorithm, and used $Y^{\ast}$ as the output representation for training. Then we carried out the experiment under the same setting as on the ARTFC datasets. Again the two WTA models reported clearly improved results.

When comparing the two WTA models on these datasets, the unsupervised-WTA model performed even better than the supervised-WTA model on most tests. In cases with known $X$ only, an approximation of $Y$ has to be obtained through matrix factorization. The quality of this approximated $Y$ becomes critical to the supervised-WTA model. We believe this is the major reason why the supervised model no longer excels.

In addition, we tested the algorithms' performance on the much larger ImageNet dataset with one million images for training and $10,000$ images for testing. Computing the output representation $Y^{\ast}$ becomes infeasible on such a large training set, and therefore the results of SUP and LIFTING were not available. Compared with the other available algorithms, the unsupervised-WTA algorithm once again reported significantly improved search accuracies.

In addition to the experiment on similarity search accuracies, we further investigated the influence of different input/output dimensions on the performance of the proposed models. We fixed the output dimension to $d^{\prime}=2,000$ while varying the input dimension from $100$ to $1,000$ on GloVe word vectors. We recorded the similarity search accuracies by all the algorithms. From the results in Table 2, we can see that the WTA models reported consistently improved results.

4.3 Running Speed

As a practical concern, we compared the training time of the proposed WTA models with the LIFTING algorithm. In the experiment, we used the ARTFC datasets with $1,000$ -dimensional inputs and $2,000$ -dimensional outputs, and the number of training samples varied from $1,000$ to $50,000$ .

We recorded the training time of each algorithm to compute the sparse binary projection matrix $W$ . On all training sets, the proposed models reported significantly faster speed than the LIFTING algorithm. With $1,000$ samples (ref. Fig. 1(a)), the supervised-WTA model took less than $0.2$ seconds to get the optimal solution, hundreds of times faster than the LIFTING algorithm which took around $50$ seconds.

The unsupervised-WTA model needs to solve for multiple $W^{t}$ and $Y^{t}$ iteratively. It took $10$ to $20$ seconds with $1,000$ samples, which was slower than the supervised-WTA model but several times faster than the LIFTING algorithm. With $50,000$ samples (see Fig. 1(d)), the supervised-WTA model took less than $10$ seconds, and the unsupervised-WTA model took about $400$ seconds to get the solutions. The LIFTING algorithm did not finish execution on our platform within $12$ hours. These empirical results are consistent with the complexity analysis given in Section 3.3 and confirm the running efficiency of the proposed WTA models.
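
As a rough check on the phrases "hundreds of times" and "several times", the reported training times imply the following approximate ratios. The symbols $T_{\mathrm{LIFTING}}$, $T_{\mathrm{SUP}}$, $T_{\mathrm{UNSUP}}$ (training times in seconds) and $n$ (number of training samples) are notation introduced here for illustration only. The "less than" and "around" figures are treated as bounds or estimates, which is an assumption of this calculation and not an additional measurement.

$$
\begin{aligned}
n = 1{,}000:\quad & \frac{T_{\mathrm{LIFTING}}}{T_{\mathrm{SUP}}} \gtrsim \frac{50}{0.2} = 250, \qquad \frac{T_{\mathrm{LIFTING}}}{T_{\mathrm{UNSUP}}} \approx \frac{50}{20} \text{ to } \frac{50}{10} = 2.5 \text{ to } 5, \\
n = 50{,}000:\quad & \frac{T_{\mathrm{LIFTING}}}{T_{\mathrm{SUP}}} > \frac{12 \times 3600}{10} = 4{,}320, \qquad \frac{T_{\mathrm{LIFTING}}}{T_{\mathrm{UNSUP}}} \gtrsim \frac{12 \times 3600}{400} = 108.
\end{aligned}
$$

The ratios for $n = 50{,}000$ are lower bounds only, since the LIFTING run was stopped after $12$ hours.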

We further carried out the experiment on the much larger ImageNet dataset and reported the results in Fig. 1(e). Due to the prohibitive computation to obtain the output representations, only the results of the UNSUP algorithm are available. The results show that, with a hash length $k=32$, the algorithm took less than $200$ seconds to train a projection matrix with $10,000$ samples, and around $6,000$ seconds to train with one million samples. We believe that these are reasonably efficient and promising results for practical application scenarios.

5 Conclusion

With strong evidence from biological science, the study of sparse binary projection models has attracted much research attention recently. By mapping lower-dimensional dense data to higher-dimensional sparse binary vectors, the models have reported excellent empirical results and proved to be useful in practical applications.

Sparse binary projections are tightly coupled with WTA competitions. The competition is an important stage for pattern recognition activities that happen in the brain. Accordingly, our work started from the explicit treatment of the competition, and proposed two models to seek the desired projection matrix. Specifically, one model utilizes both input and output representations of the samples, and trains the projection matrix as a supervised learning problem. The other model utilizes the input representation only and trains the matrix in an unsupervised manner, which equips the model with wider application scenarios. For each model, we developed a simple, effective and efficient algorithm. In the evaluation, the models significantly outperformed the state-of-the-art methods, in both search accuracies and running speed.

Our work potentially opens up a number of topics for further study. Firstly, the algorithms for both models only involve simple vector addition and scalar comparison operations, which are highly parallelizable. Such characteristics make the computing procedures suitable for implementation with customized hardware [31], which provides a high-throughput and economical solution for large-scale data analysis applications.
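
The same kind of arithmetic is already visible in the projection step itself, which the following minimal sketch illustrates. It is not the training algorithm of either model. It only shows that, with a sparse binary matrix, projecting a sample reduces to additions, and that the WTA competition reduces to comparisons. The function name `sparse_binary_project`, the index-list storage format `rows`, and the tie-breaking rule are assumptions made for this illustration.

```python
import heapq


def sparse_binary_project(x, rows, k):
    """Project x with a sparse binary matrix W and keep the k winners.

    rows[i] lists the input indices j with W[i][j] == 1 (a hypothetical
    storage format chosen for this sketch). Returns a 0/1 list with
    exactly k ones.
    """
    if not 0 <= k <= len(rows):
        raise ValueError("k must be between 0 and the output dimension")
    # Additions only: W is binary, so each score is a plain sum of the
    # selected input entries. Each output unit is computed independently.
    scores = [sum(x[j] for j in idx) for idx in rows]
    # Comparisons only: select the k largest scores, preferring the
    # lower index on ties (an arbitrary choice for this sketch).
    winners = heapq.nlargest(
        k, range(len(scores)), key=lambda i: (scores[i], -i)
    )
    y = [0] * len(scores)
    for i in winners:
        y[i] = 1
    return y


if __name__ == "__main__":
    x = [0.1, 0.9, 0.4]
    rows = [[0, 1], [1, 2], [0, 2], [2]]
    print(sparse_binary_project(x, rows, k=2))
```

Because every output score depends only on its own index list, the scores can be computed in parallel, which is the property the paragraph above appeals to.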

Secondly, the unsupervised-WTA model provides a unified framework that combines the clustering and the feature selection techniques. We firmly believe that new applications along this line are possible. Besides, this viewpoint may provide a potential bridge that helps to make clear why the WTA competition leads to algorithms that preserve the locality structures of the data well, as reported in this paper.

Thirdly, there is potential to design new artificial neural network architectures. The WTA competition and the relaxation techniques adopted in this paper can be used as an activation function of the neurons in an artificial neural network. We warmly anticipate future work along this direction [32, 16].
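
To make the idea of a WTA activation concrete, here is a minimal sketch of a hard k-winner-take-all function that keeps the k largest pre-activations and zeroes the rest. It does not reproduce the relaxation technique of this paper. The function name `kwta`, the choice to pass the winning values through unchanged, and the tie-breaking rule are all assumptions made for illustration.

```python
def kwta(values, k):
    """Keep the k largest entries of values and zero out the others."""
    if not 0 <= k <= len(values):
        raise ValueError("k must be between 0 and len(values)")
    # Rank indices by value, preferring the lower index on ties
    # (an arbitrary choice for this sketch).
    order = sorted(range(len(values)), key=lambda i: (-values[i], i))
    winners = set(order[:k])
    return [v if i in winners else 0.0 for i, v in enumerate(values)]


if __name__ == "__main__":
    print(kwta([0.2, 1.5, -0.3, 0.9], k=2))
```

A binary variant, closer to the sparse binary outputs studied here, would emit $1$ for each winner instead of its value.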

6 Acknowledgments

This work was supported by Shenzhen Fundamental Research Fund (JCYJ20170306141038939, KQJSCX20170728162302784).

Bibliography

[1] E. Bingham and H. Mannila. Random projection in dimensionality reduction: applications to image and text data. In SIGKDD, pages 245–250. ACM, 2001.
[2] W. Johnson and J. Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics, 26(189-206):1, 1984.
[3] S. Dasgupta, C. Stevens, and S. Navlakha. A neural algorithm for a fundamental computing problem. Science, 358(6364):793–796, 2017.
[4] S. Olsen, V. Bhandawat, and R. Wilson. Divisive normalization in olfactory population codes. Neuron, 66(2):287–299, 2010.
[5] S. Caron, V. Ruta, L. Abbott, and R. Axel. Random convergence of olfactory inputs in the drosophila mushroom body. Nature, 497(7447):113, 2013.
[6] Z. Zheng, S. Lauritzen, E. Perlman, C. Robinson, et al. A complete electron microscopy volume of the brain of adult drosophila melanogaster. Cell, 174(3):730–743, 2018.
[7] C. Stevens. What the fly’s nose tells the fly’s brain. Proceedings of the National Academy of Sciences, 112(30):9460–9465, 2015.
[8] W. Li, J. Mao, Y. Zhang, and S. Cui. Fast similarity search via optimal sparse lifting. In NIPS, pages 176–184, 2018.