Pairwise Teacher-Student Network for Semi-Supervised Hashing

Shifeng Zhang; Jianmin Li; Bo Zhang

arXiv:1902.00643·cs.CV·February 5, 2019

TL;DR

This paper introduces a teacher-student semi-supervised hashing framework that leverages pairwise information from a teacher network to improve data retrieval accuracy, especially on complex datasets with limited labeled pairs.

Contribution

It proposes a novel teacher-student approach for semi-supervised hashing that outperforms existing methods and addresses limitations of graph-based structures for complex data.

Findings

01

Achieves significant improvements over supervised baselines.

02

Outperforms state-of-the-art semi-supervised hashing methods.

03

Effective on large-scale complex datasets.

Tables

Table 1: Accuracy in terms of MAP for the semi-supervised and supervised hashing methods. The numbers in brackets are the relative gains compared with the baselines. Unless specified, the results are directly drawn from the original papers.

Semi-Supervised Hashing
Method	Net	CIFAR-10				Nuswide				ImageNet-100⁴
Method	Net	12 bits	24 bits	32 bits	48 bits	12 bits	24 bits	32 bits	48 bits	16 bits	32 bits	48 bits	64 bits
SSDH	VGG-F	0.801	0.813	0.812	0.814	0.773	0.779	0.778	0.778	-¹	-	-	-
BGDH	VGG-F	0.805	0.824	0.826	0.833	0.803	0.818	0.822	0.828	-	-	-	-
PTS³H-DSH	AlexNet	0.798	0.828	0.835	0.843	0.752	0.774	0.783	0.789	0.612	0.680	0.697	0.703
PTS³H-DSH	AlexNet	(+0.056)	(+0.034)	(+0.026)	(+0.023)	(+0.012)	(+0.012)	(+0.019)	(+0.016)	(+0.023)	(+0.032)	(+0.047)	(+0.041)
PTS³H-DPSH	AlexNet	0.789	0.799	0.801	0.805	0.803	0.827	0.831	0.842	0.397	0.542	0.618	0.634
PTS³H-DPSH	AlexNet	(+0.038)	(+0.028)	(+0.025)	(+0.027)	(+0.004)	(+0.006)	(+0.003)	(+0.009)	(+0.018)	(+0.014)	(+0.027)	(+0.026)
Supervised Hashing Baselines
DSH²	AlexNet	0.741	0.794	0.809	0.820	0.740	0.762	0.764	0.773	0.589	0.648	0.650	0.662
DPSH²	AlexNet	0.751	0.771	0.776	0.778	0.799	0.821	0.827	0.834	0.379	0.528	0.591	0.608
DSDH	VGG-F	0.740	0.786	0.801	0.820	0.776	0.808	0.820	0.829	-	-	-	-
DISH	AlexNet	0.758	0.784	0.799	0.791	0.787	0.810	0.810	0.813	-	-	-	-
HashNet³	AlexNet	0.686³	-	0.692	0.718	0.733³	-	0.755	0.762	0.502	0.622	0.661	0.682
DMDH³	AlexNet	0.704³	-	0.732	0.737	0.751³	-	0.781	0.789	0.513	0.612	0.673	0.692
MIHash	AlexNet	0.738	0.775	0.791	0.816	0.773	0.820	0.831	0.843	0.569	0.661	0.685	0.694

Table 2: Results of the variants of the proposed PTS³H algorithm on the CIFAR-10 and Nuswide datasets. PTS³H and PTS³H-S are both the proposed method, but the codes are generated by the teacher and the student respectively. AlexNet is used for pre-training. Precision denotes the precision within Hamming distance 2.

Method	Dataset	MAP		Precision
Method	Dataset	32 bits	48 bits	32 bits	48 bits
PTS³H-P	CIFAR-10	0.829	0.838	0.829	0.827
PTS³H-Q		0.817	0.826	0.821	0.814
PTS³H		0.835	0.843	0.832	0.829
PTS³H-S		0.833	0.842	0.834	0.830
PTS³H-P	Nuswide	0.777	0.787	0.763	0.727
PTS³H-Q		0.772	0.777	0.759	0.710
PTS³H		0.782	0.789	0.770	0.737
PTS³H-S		0.783	0.789	0.771	0.739

Equations

(1) $\mathcal{L}_{s}=\frac{1}{|\mathcal{S}|}\sum_{(i,j)\in\mathcal{S}}l(u_{ij},s_{ij}),\quad u_{ij}=\mathrm{sim}(\mathbf{h}_{i},\mathbf{h}_{j})$

(2) $\mathcal{L}=\mathcal{L}_{s}+\omega\mathcal{R}_{u}$

(3) $\mathcal{L}^{(c)}=\mathcal{L}_{s}^{(c)}+\omega\mathcal{R}_{u}^{(c)}$

(4) $\mathcal{R}_{u}^{(c)}=\sum_{\mathbf{x}\in\mathcal{X}}d(f(\tilde{\mathbf{x}}^{(1)}),f_{T}(\tilde{\mathbf{x}}^{(2)}))$

(5) $\theta_{T}(t)=\alpha\theta_{T}(t-1)+(1-\alpha)\theta(t)$

(6) $\mathcal{R}_{u}=\frac{1}{|\mathcal{X}|^{2}}\sum_{\mathbf{x}_{1},\mathbf{x}_{2}\in\mathcal{X}}l_{c}(u_{12},u_{T12}),\quad u_{12}=\mathrm{sim}(H(\tilde{\mathbf{x}_{1}}^{(1)}),H(\tilde{\mathbf{x}_{2}}^{(1)})),\quad u_{T12}=\mathrm{sim}(H_{T}(\tilde{\mathbf{x}_{1}}^{(2)}),H_{T}(\tilde{\mathbf{x}_{2}}^{(2)}))$

(7) $l_{c}(u,u_{T})=(u-u_{T})^{2}$

(8) $W_{ij}=\begin{cases}1, & u_{Tij}\geq thr\\ 0, & u_{Tij}<thr\end{cases}$

(9) $l_{c}(u_{12},u_{T12})=l(u_{12},W_{12})$

(10) $\mathcal{R}_{u}=\mathcal{R}_{up}+\gamma\mathcal{R}_{uq}$

(11) $\min_{F}\ \mathcal{L}=\mathcal{L}_{s}^{(r)}+\omega\mathcal{R}_{u}^{(r)}+\eta\frac{1}{|\mathcal{X}|}\sum_{\mathbf{x}\in\mathcal{X}}\|\mathbf{h}-F(\tilde{\mathbf{x}}^{(1)})\|_{1}$

(12) $\mathcal{L}_{s}^{(r)}=\frac{1}{|\mathcal{S}|}\sum_{(i,j)\in\mathcal{S}}l(u^{r}_{ij},s_{ij}),\quad \mathcal{R}_{u}^{(r)}=\frac{1}{|\mathcal{X}|^{2}}\sum_{\mathbf{x}_{1},\mathbf{x}_{2}\in\mathcal{X}}\Big[(u^{r}_{12}-u^{r}_{T12})^{2}+\gamma l(u^{r}_{12},W_{12})\Big]$

Full text

Pairwise Teacher-Student Network for Semi-Supervised Hashing

Shifeng Zhang, Jianmin Li and Bo Zhang

Institute for Artificial Intelligence, State Key Lab of Intelligent Technology and Systems,

Beijing National Research Center for Information Science and Technology,

Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China

Abstract

Hashing method maps similar high-dimensional data to binary hashcodes with smaller hamming distance, and it has received broad attention due to its low storage cost and fast retrieval speed. Pairwise similarity is easily obtained and widely used for retrieval, and most supervised hashing algorithms are carefully designed for the pairwise supervisions. As labeling all data pairs is difficult, semi-supervised hashing is proposed which aims at learning efficient codes with limited labeled pairs and abundant unlabeled ones. Existing methods build graphs to capture the structure of dataset, but they are not working well for complex data as the graph is built based on the data representations and determining the representations of complex data is difficult. In this paper, we propose a novel teacher-student semi-supervised hashing framework in which the student is trained with the pairwise information produced by the teacher network. The network follows the smoothness assumption, which achieves consistent distances for similar data pairs so that the retrieval results are similar for neighborhood queries. Experiments on large-scale datasets show that the proposed method reaches impressive gain over the supervised baselines and is superior to state-of-the-art semi-supervised hashing methods.

1 Introduction

With the explosion of high-dimensional media data, approximate nearest neighbor (ANN) search [6] has attracted broad attention for efficient information retrieval. Among the existing ANN methods, hashing has become a popular tool for search on large-scale datasets due to its fast search time and small storage footprint [6, 18, 25, 30, 28]. It encodes high-dimensional data into compact hashcodes so that similar data are mapped to hashcodes with small Hamming distance.

Among the existing hashing methods, data-dependent learning-to-hash methods learn hash functions from the training data, so the learned codes are able to capture the data distribution. Learning-to-hash methods can be divided into three categories: unsupervised hashing [7, 28], supervised hashing [18, 25] and semi-supervised hashing [32, 31, 23]. Experiments show that the codes learned by (semi-)supervised hashing methods capture more semantic information than unsupervised ones. Recently, with the rapid development of deep learning [11, 8], deep hashing methods have achieved great success [30, 12, 17, 34, 32, 14, 2]. They learn the hashcodes and the deep network simultaneously, so the codes generated by deep networks carry much richer semantic information.

For ANN search, pairwise similarities between data play an important role in evaluating the quality of search. To generate effective hashcodes, (deep) supervised hashing methods regard the pairwise similarity as the basic supervision, such that similar data pairs should be mapped to codes with small Hamming distance. Most hashing methods model the similarities with pairwise losses, and optimizing them is expected to produce codes whose Hamming distances agree with the similarities. For ease of back-propagation, these methods simply generate data pairs within a mini-batch and achieve good results [17, 3, 20, 15].

Despite the success of supervised hashing, labeling all the database data (pairs) is almost intractable as the amount of data is increasing dramatically. To utilize the abundant database data, deep semi-supervised hashing [32, 31] has been proposed, in which the hash function is trained with the labeled data pairs and abundant unlabeled ones. The success of semi-supervised hashing relies on the smoothness assumption that neighboring data are likely to have the same predictions. These methods construct graphs over the unlabeled data to capture the neighborhood structure among the samples. However, the data and their representations may lie on high-dimensional nonlinear manifolds, especially for complex data like images and videos, and the representations may not be learned well with limited data. As the graph is built on data representations, it may not model the neighborhood structure of the data precisely, which violates the smoothness assumption to some extent and degrades the hashing performance.

Recently, perturbation-based teacher-student semi-supervised learning (SSL) algorithms have achieved great success [13, 26]. These methods follow the smoothness assumption in that the learned classifiers produce consistent predictions for perturbed versions of an input, thus they can better capture the structure of unlabeled data [35] and produce better representations for graph-based training [21]. However, existing teacher-student methods only handle data with a single label and do not consider the pairwise relationship between samples, which is crucial for semi-supervised hashing. By carefully designing the teacher-student architecture and the loss for pairwise similarities, we can exploit the advantages of this architecture and obtain a novel semi-supervised hashing method.

In this paper, we propose a novel semi-supervised hashing algorithm called Pairwise Teacher-Student Semi-Supervised Hashing (PTS3H), in which pairwise similarities are used for supervision and abundant unlabeled data pairs are provided. PTS3H is a teacher-student network architecture where the student is trained with a pairwise loss and unsupervised regularizers, and the teacher is the average of the student network, which generates effective pairwise representations. As hashing mainly focuses on pairwise information, we propose a general consistent pairwise loss such that similar queries produce similar pairwise similarities with the database and thus achieve similar retrieval results, following the smoothness assumption [26, 13]. To model pairwise similarities between samples with local and global pairwise information, we propose two types of losses: a consistent similarity loss for consistent pairwise similarities among data, and a quantized similarity loss in which the quantized [9] similarities are produced by the teacher network from global data pairs. Experiments show that the proposed PTS3H achieves large improvements over the supervised baselines, and that it is superior or comparable to state-of-the-art semi-supervised hashing algorithms.

2 Background

Suppose we are given $n$ data samples $\mathbf{x}_{1},\mathbf{x}_{2},...,\mathbf{x}_{n}\in\mathcal{X}$ , and $\mathcal{X}$ is the training dataset. Denote $\mathcal{S}$ as a set such that $(i,j)\in\mathcal{S}$ implies $\mathbf{x}_{i},\mathbf{x}_{j}$ have similarity information, and we denote $s_{ij}=1$ if $\mathbf{x}_{i},\mathbf{x}_{j}$ are similar, and $s_{ij}=0$ otherwise. In practical applications, the similarity information of some data pairs is unknown. We denote $\mathcal{U}$ as the pairs where the pairwise similarity information is unknown.

Let $b$ denote the length of the hashcode to learn. The goal of semi-supervised hash learning is to learn the hash function $H(\mathbf{x})=[h_{1}(\mathbf{x}),...,h_{b}(\mathbf{x})]^{\top}\in\{-1,1\}^{b}$ from the $n$ data samples and the pairwise similarities. We denote $\mathbf{h}_{i}=H(\mathbf{x}_{i}),i=1,2,...,n$ as the learned hashcode of $\mathbf{x}_{i}$ .

2.1 Pairwise Loss for Supervised Hashing

Pairwise losses are widely used in (deep) supervised hashing algorithms with pairwise similarity as supervision [18, 16, 17, 3, 15, 31]. For the given training data and pairwise information, the basic formulation of the pairwise loss is

$$\mathcal{L}_{s}=\frac{1}{|\mathcal{S}|}\sum_{(i,j)\in\mathcal{S}}l(u_{ij},s_{ij})\tag{1}$$

where $u_{ij}=\mathrm{sim}(\mathbf{h}_{i},\mathbf{h}_{j})$ is the similarity (or distance) between the codes $\mathbf{h}_{i},\mathbf{h}_{j}$ .

Different supervised hashing algorithms adopt different forms of $l(u_{ij},s_{ij})$ :

• KSH loss: $l(u_{ij},s_{ij})=[b(2s_{ij}-1)-u_{ij}]^{2},u_{ij}=\mathbf{h}_{i}^{\top}\mathbf{h}_{j}$ in KSH [18] and FastH [16];

• DSH loss: $l(u_{ij},s_{ij})=-s_{ij}u_{ij}+(1-s_{ij})\max(0,2b+u_{ij}),u_{ij}=-\|\mathbf{h}_{i}-\mathbf{h}_{j}\|^{2}$ in DSH [17];

• DPSH loss: $l(u_{ij},s_{ij})=-s_{ij}u_{ij}+\log(1+e^{u_{ij}}),u_{ij}=\frac{1}{2}\mathbf{h}_{i}^{\top}\mathbf{h}_{j}$ in DPSH [15] and DHN [3].

Optimizing $l(u_{ij},s_{ij})$ is expected to yield hashcodes such that similar data pairs have codes with small Hamming distance and dissimilar pairs have codes with large Hamming distance. Note that the supervision consists only of pairwise information, which is widely available in practice.
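The three pairwise losses listed above can be written out directly. The following is a minimal pure-Python sketch, assuming codes are given as lists of +1/-1 values and labeled pairs as a dictionary mapping `(i, j)` to `s_ij`; the function names are illustrative and do not come from the paper.

```python
import math

def inner(h_i, h_j):
    """Inner product of two codes given as lists of +1/-1 values."""
    return sum(a * b for a, b in zip(h_i, h_j))

def sq_dist(h_i, h_j):
    """Squared Euclidean distance between two codes."""
    return sum((a - b) ** 2 for a, b in zip(h_i, h_j))

def ksh_loss(h_i, h_j, s):
    """KSH loss: [b(2s - 1) - u]^2 with u = h_i^T h_j."""
    b = len(h_i)
    u = inner(h_i, h_j)
    return (b * (2 * s - 1) - u) ** 2

def dsh_loss(h_i, h_j, s):
    """DSH loss: -s*u + (1 - s)*max(0, 2b + u) with u = -||h_i - h_j||^2."""
    b = len(h_i)
    u = -sq_dist(h_i, h_j)
    return -s * u + (1 - s) * max(0.0, 2 * b + u)

def dpsh_loss(h_i, h_j, s):
    """DPSH loss: -s*u + log(1 + e^u) with u = 0.5 * h_i^T h_j."""
    u = 0.5 * inner(h_i, h_j)
    return -s * u + math.log(1.0 + math.exp(u))

def pairwise_loss(codes, labeled_pairs, loss_fn):
    """Eq. (1): mean of l(u_ij, s_ij) over the labeled pairs {(i, j): s_ij}."""
    total = sum(loss_fn(codes[i], codes[j], s)
                for (i, j), s in labeled_pairs.items())
    return total / len(labeled_pairs)
```

For codes in $\{-1,1\}^{b}$ with Hamming distance $d_{H}$, the identities $\mathbf{h}_{i}^{\top}\mathbf{h}_{j}=b-2d_{H}$ and $\|\mathbf{h}_{i}-\mathbf{h}_{j}\|^{2}=4d_{H}$ hold, so each choice of $u_{ij}$ above is a monotone function of the Hamming distance.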

2.2 Semi-Supervised Hashing

Semi-supervised hashing focuses on learning the hash function with limited labeled data pairs as well as abundant unlabeled pairs in the database. As in semi-supervised learning, the general form of the loss to be optimized is

$$\mathcal{L}=\mathcal{L}_{s}+\omega\mathcal{R}_{u}\tag{2}$$

where $\mathcal{L}_{s}$ is Eq. (1) and $\mathcal{R}_{u}$ is the regularization term for unlabeled data. SPLH [27] adopts the bit-balanced constraint for regularization, but it does not consider the relationship between samples. Graph-based methods like SSDH [33] and BGDH [31] construct an affinity graph that indicates pairwise similarities between unlabeled samples, and the regularization loss is built on this graph. These methods succeed in capturing the neighborhood structure between samples, but the graph is constructed from data representations, which may suffer from a semantic gap and thus violate the smoothness assumption. Recently, deep generative models have achieved success in semi-supervised learning problems, and DSH-GANs [23] proposes a GAN [24] based hashing method. The conditional GAN is trained with labeled and unlabeled data to generate labeled samples, which are used for training the hashing network. Although it achieves state-of-the-art results on some datasets, it cannot be trained with pairwise supervision.

2.3 Teacher-Student Network for Semi-Supervised Learning

Semi-supervised learning (SSL) aims at learning with limited labeled data and abundant unlabeled data. Most semi-supervised learning methods rely on the smoothness assumption that similar data correspond to the same label. Various approaches have been proposed, such as transductive approaches [10, 29] and graph-based methods [1, 35], but they do not work well on complex datasets because the underlying structure of the data is hard to capture. Recently, perturbation-based semi-supervised learning approaches have achieved great success, in which perturbed versions of an input are required to yield a consistent prediction. These methods introduce two roles, the teacher and the student. The student is learned as before; the teacher generates the targets for training the student. Formally, considering the dataset $\mathcal{X}$ where part of the data are labeled, we aim at optimizing the following loss function:

$$\mathcal{L}^{(c)}=\mathcal{L}_{s}^{(c)}+\omega\mathcal{R}_{u}^{(c)}\tag{3}$$

where $c$ denotes classification, $\mathcal{L}^{(c)}_{s}$ is the supervised term such as the softmax loss, $\mathcal{R}^{(c)}_{u}$ is the unsupervised regularization such that

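$$\mathcal{R}_{u}^{(c)}=\sum_{\mathbf{x}\in\mathcal{X}}d(f(\tilde{\mathbf{x}}^{(1)}),f_{T}(\tilde{\mathbf{x}}^{(2)}))\tag{4}$$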

where $\tilde{\mathbf{x}}^{(1)},\tilde{\mathbf{x}}^{(2)}$ are two random perturbations of $\mathbf{x}$ , $f(\cdot),f_{T}(\cdot)$ are the outputs of the student and teacher networks respectively, and $d(\cdot,\cdot)$ is the distance between two features. There are several ways to define the teacher $f_{T}$ . TempEns [13] takes $f_{T}$ to be the exponential moving average (EMA) of the student's output; Mean Teacher [26] averages the weights of the student with EMA to form the teacher network; VAT [22] introduces adversarial perturbations instead of random perturbations. These methods achieve state-of-the-art results on SSL problems.

In spite of this, perturbation-based methods only regularize single data points and do not consider the neighborhood structure between samples. SNTG [21] constructs a graph from the teacher to capture the neighborhood structure, and introduces a pairwise regularization term based on the graph. Experiments show that the additional term achieves better performance, as the consistency of both the perturbed data and the neighboring samples is considered. However, the graph in SNTG is built specifically for classification.

Motivated by the success of teacher-student networks for semi-supervised learning, in this paper we propose a novel teacher-student framework for semi-supervised hashing in which only a small portion of the pairwise similarity information is provided. As we perform Hamming distance learning, we propose a novel consistent pairwise loss that enforces consistent feature distances for similar data pairs, so that the method follows the smoothness assumption under which neighboring queries achieve similar retrieval results. Experiments show its superiority over state-of-the-art semi-supervised hashing algorithms.

3 Methodology

In this section, we propose a novel deep semi-supervised hashing method called Pairwise Teacher-Student Semi-Supervised Hashing (PTS3H), in which a teacher-student network is adopted.

3.1 The Teacher-Student Framework

The proposed PTS3H is a teacher-student architecture shown in Figure 1(a). The teacher network and the student network share the same architecture, in which the last layer is a fully-connected layer with $b$ outputs ( $b$ is the hashcode length), and the remaining layers can be a standard deep network such as AlexNet or VGGNet.

The update rule of the teacher-student network is similar to that of Mean Teacher [26]. The student is learned with labeled data pairs and guided by the teacher. Let $\theta(t)$ and $\theta_{T}(t)$ denote the parameters of the student and teacher networks at training step $t$ respectively. The teacher network is updated by EMA as follows:

$$\theta_{T}(t)=\alpha\theta_{T}(t-1)+(1-\alpha)\theta(t)\tag{5}$$

thus the teacher is the average embedding of the student, and the teacher’s output can be regarded as the mean embedding of the student’s.
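The EMA update of Eq. (5) is applied independently to every parameter. Below is a minimal sketch, assuming for illustration that the parameters are stored as a dictionary of floats keyed by name (in a real network each entry would be a tensor); `ema_update` is a hypothetical helper name.

```python
def ema_update(teacher, student, alpha=0.995):
    """Eq. (5) per parameter: theta_T <- alpha * theta_T + (1 - alpha) * theta."""
    return {name: alpha * teacher[name] + (1.0 - alpha) * student[name]
            for name in teacher}

# Illustrative usage with made-up parameter values.
teacher = {"w": 1.0, "b": 2.0}
student = {"w": 0.0, "b": 2.0}
teacher = ema_update(teacher, student, alpha=0.9)  # "w" moves from 1.0 toward 0.0
```

A value of $\alpha$ close to 1 makes the teacher change slowly, so its output averages the student over many recent training steps.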

Let $F(\mathbf{x}),F_{T}(\mathbf{x})\in\mathbb{R}^{b}$ denote the outputs of the student and teacher networks respectively. The binary codes of data $\mathbf{x}$ can be obtained with either the student network, $H(\mathbf{x})=\mathrm{sgn}(F(\mathbf{x}))$ , or the teacher network, $H_{T}(\mathbf{x})=\mathrm{sgn}(F_{T}(\mathbf{x}))$ . Note that $\mathbf{x}$ is not perturbed during code generation.

3.2 Loss Function

The general form of the loss to be optimized is Eq. (2). For labeled data pairs, the training loss is the pairwise loss function in Eq. (1). For training with the unlabeled data, $\mathcal{R}_{u}$ must be defined so that the teacher network generates targets to guide the student network. As hash learning focuses on the pairwise similarities of the codes, learning the similarities in the embedded Hamming space is important. For input pairs, the targets for the student should be the similarities of the codes generated by the teacher. We therefore propose the general form of the consistent pairwise loss:

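$$\mathcal{R}_{u}=\frac{1}{|\mathcal{X}|^{2}}\sum_{\mathbf{x}_{1},\mathbf{x}_{2}\in\mathcal{X}}l_{c}(u_{12},u_{T12}),\quad u_{12}=\mathrm{sim}(H(\tilde{\mathbf{x}_{1}}^{(1)}),H(\tilde{\mathbf{x}_{2}}^{(1)})),\quad u_{T12}=\mathrm{sim}(H_{T}(\tilde{\mathbf{x}_{1}}^{(2)}),H_{T}(\tilde{\mathbf{x}_{2}}^{(2)}))\tag{6}$$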

where $\tilde{\mathbf{x}_{i}}^{(1)},\tilde{\mathbf{x}_{i}}^{(2)},i=1,2$ are two random perturbations of $\mathbf{x}_{i}$ , $l_{c}(u,u_{T})$ is a certain type of loss and $u,u_{T}$ denote the pairwise similarities of codes generated from the student and the teacher respectively. Eq. (6) is quite different from the original Mean Teacher [26] in which only the single data point is considered for training.

For Eq. (6), we propose two simple but effective forms of loss, named the consistent similarity loss and the quantized similarity loss.

Consistent Similarity Loss. It is expected that the learned codes follow the smoothness assumption, in that a noisy input query corresponds to consistent retrieval results. It follows that the similarities between the codes of noisy data pairs should be consistent. As illustrated in Figure 1(b.2), if $\mathbf{x}_{1},\mathbf{x}_{2}$ are quite similar and so are $\mathbf{x}_{3},\mathbf{x}_{4}$ , the difference between $\mathrm{sim}(H(\mathbf{x}_{1}),H(\mathbf{x}_{3}))$ and $\mathrm{sim}(H_{T}(\mathbf{x}_{2}),H_{T}(\mathbf{x}_{4}))$ should be small. Thus the consistent similarity loss is defined by

$$l_{c}(u,u_{T})=(u-u_{T})^{2}\tag{7}$$

where $l_{c}(u,u_{T})$ is the same as in Eq. (6). We rename $\mathcal{R}_{u}$ as $\mathcal{R}_{up}$ if Eq. (7) is introduced.
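As an illustration of Eqs. (6) and (7), the sketch below averages the squared difference between student and teacher pair similarities over all ordered pairs of a small set. It assumes the student and teacher codes are aligned lists (entry `i` of each list comes from a perturbation of the same sample) and that the similarity function is passed in by the caller; all names are illustrative.

```python
def inner_sim(a, b):
    """One possible similarity: the inner product of two codes."""
    return sum(x * y for x, y in zip(a, b))

def consistent_similarity_loss(student_codes, teacher_codes, sim):
    """R_up: mean over all ordered pairs of (u_12 - u_T12)^2."""
    n = len(student_codes)
    total = 0.0
    for i in range(n):
        for j in range(n):
            u = sim(student_codes[i], student_codes[j])    # student similarity
            u_t = sim(teacher_codes[i], teacher_codes[j])  # teacher target
            total += (u - u_t) ** 2
    return total / (n * n)
```

The loss is zero exactly when the student reproduces every pairwise similarity of the teacher, even if the individual codes differ.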

Quantized Similarity Loss. The consistent similarity loss only captures the local structure of a given data pair and ignores the global neighborhood structure between samples. Inspired by quantization methods, in which a large amount of information can be compressed [9], we quantize the pairwise similarity produced by the ensembled teacher to guide the hash learning. As the quantization procedure is based on global unlabeled data pairs, the quantized similarities are expected to contain global pairwise information, leading to better learned codes.

We denote $\mathbf{W}\in\{0,1\}^{n\times n}$ as the quantized similarity matrix to be learned, where $n$ is the number of training samples. Let $W_{ij}$ denote the element at the $i$ th row and $j$ th column, so that $W_{ij}=1$ indicates that $\mathbf{x}_{i}$ and $\mathbf{x}_{j}$ are a pseudo similar pair, and $W_{ij}=0$ otherwise. As the teacher output $H_{T}(\mathbf{x})$ is the ensemble of the embedded codes of $\mathbf{x}$ , it can be regarded as a precise feature embedding of the data point $\mathbf{x}$ . We therefore use the teacher output to determine the pseudo similar pairs. The similarity matrix is defined according to the distances between teacher outputs:

$$W_{ij}=\begin{cases}1, & u_{Tij}\geq thr\\ 0, & u_{Tij}<thr\end{cases}\tag{8}$$

where $u_{Tij}=\mathrm{sim}(H_{T}(\tilde{\mathbf{x}_{i}}^{(2)}),H_{T}(\tilde{\mathbf{x}_{j}}^{(2)}))$ is defined as in Eq. (6) and $thr$ is a threshold set according to the dataset. In practical applications, the labeled and unlabeled pairs are expected to follow the same distribution. We can therefore set $thr$ such that the ratio of pseudo similar pairs equals the ratio of similar pairs among the labeled pairs, so that the unlabeled similar pairs generated by the teacher are almost all truly positive and their distribution is expected to match that of the ground-truth similar pairs.
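One way to realize the threshold rule described above is to sort the teacher similarities and take the value at the rank given by the desired ratio. This is only a sketch under that assumption; the paper does not spell out the procedure, and `choose_threshold` and `quantize` are hypothetical names.

```python
def choose_threshold(teacher_sims, similar_ratio):
    """Pick thr so that about `similar_ratio` of the pairs satisfy u_T >= thr."""
    ranked = sorted(teacher_sims, reverse=True)
    k = max(1, round(similar_ratio * len(ranked)))
    return ranked[k - 1]

def quantize(teacher_sims, thr):
    """Eq. (8): W = 1 if u_T >= thr, else 0."""
    return [1 if u >= thr else 0 for u in teacher_sims]

# Illustrative usage: 30% of labeled pairs are similar, so keep the top 30%.
sims = [-0.1, -1.5, -0.3, -2.0, -0.9, -1.2, -0.05, -1.8, -0.6, -1.0]
thr = choose_threshold(sims, similar_ratio=0.3)
W = quantize(sims, thr)
```

If several pairs tie at the threshold value, slightly more than the requested fraction is marked as pseudo similar.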

Given the generated pseudo similarity pairs, we can simply train the student with the pairwise loss shown in Eq. (1) to capture the global structure of the embedded codes in the hamming space. We propose the quantized similarity loss by defining $l_{c}$ such that:

$$l_{c}(u_{12},u_{T12})=l(u_{12},W_{12})\tag{9}$$

where $l(\cdot,\cdot)$ has the same form as that defined in Eq. (1). Note that Eq. (9) can be regarded, to some extent, as a ranking loss over the global data pairs: pairs that the teacher judges to be similar are more likely to become pseudo similar pairs, and are thus expected to reach small Hamming distances during training. We rename $\mathcal{R}_{u}$ as $\mathcal{R}_{uq}$ if Eq. (9) is introduced.

Overall Training Loss The overall training loss is defined the same as Eq. (2), where $\mathcal{L}_{s}$ is defined in Eq. (1), and $\mathcal{R}_{u}$ can be regarded as the combination of consistent similarity loss and quantized similarity loss such that

$$\mathcal{R}_{u}=\mathcal{R}_{up}+\gamma\mathcal{R}_{uq}\tag{10}$$

As the teacher outputs in Eq. (10) provide better abstract representations and model the pairwise information both locally and globally, the proposed loss is expected to better satisfy the smoothness assumption and to yield better codes. Moreover, the Hamming distances agree with the similarities on both labeled and unlabeled data.
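The combination in Eq. (10) can be sketched for a list of pairs as follows. The pairwise loss used for the quantized term is a margin loss of the DSH form, and its margin value is an assumption made here for illustration; the function names are likewise hypothetical.

```python
def margin_pair_loss(u, s, margin=2.0):
    """A loss of the DSH form, -s*u + (1 - s)*max(0, margin + u).
    The margin value is an illustrative assumption."""
    return -s * u + (1 - s) * max(0.0, margin + u)

def unlabeled_regularizer(u_student, u_teacher, pair_loss, thr, gamma):
    """Eq. (10): R_u = R_up + gamma * R_uq over aligned lists of pair similarities."""
    n = len(u_student)
    # Consistent similarity term, Eq. (7).
    r_up = sum((u - ut) ** 2 for u, ut in zip(u_student, u_teacher)) / n
    # Quantized similarity term, Eqs. (8) and (9).
    r_uq = sum(pair_loss(u, 1 if ut >= thr else 0)
               for u, ut in zip(u_student, u_teacher)) / n
    return r_up + gamma * r_uq
```

Setting `gamma=0` recovers the consistent similarity loss alone, which corresponds to using only $\mathcal{R}_{up}$.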

Implementation and Relaxation. Eq. (10) implies that both the original and the perturbed samples should be fed into the network. For simplicity, we only use the perturbed data as input, as shown in Figure 1.

It is clear that directly optimizing Eq. (2) is intractable as the discrete constraints are involved. As used in most deep hashing algorithms [3, 32, 31], the simple and efficient way is removing the $\mathrm{sgn}$ function and adding the quantization loss. We reformulate the relaxed problem as follows

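$$\min_{F}\ \mathcal{L}=\mathcal{L}_{s}^{(r)}+\omega\mathcal{R}_{u}^{(r)}+\eta\frac{1}{|\mathcal{X}|}\sum_{\mathbf{x}\in\mathcal{X}}\|\mathbf{h}-F(\tilde{\mathbf{x}}^{(1)})\|_{1}\tag{11}$$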

where $\mathbf{h}=\mathrm{sgn}(F(\tilde{\mathbf{x}}^{(1)}))$ , and $\mathcal{L}_{s}^{(r)},\mathcal{R}_{u}^{(r)}$ are the relaxations of Eq. (1) and Eq. (10) respectively, such that

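$$\begin{aligned}\mathcal{L}_{s}^{(r)}&=\frac{1}{|\mathcal{S}|}\sum_{(i,j)\in\mathcal{S}}l(u^{r}_{ij},s_{ij})\\ \mathcal{R}_{u}^{(r)}&=\frac{1}{|\mathcal{X}|^{2}}\sum_{\mathbf{x}_{1},\mathbf{x}_{2}\in\mathcal{X}}\Big[(u^{r}_{12}-u^{r}_{T12})^{2}+\gamma l(u^{r}_{12},W_{12})\Big]\end{aligned}\tag{12}$$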

For $\mathcal{L}_{s}^{(r)}$ , we directly remove the $\mathrm{sgn}$ function to compute $u^{r}_{ij}$ , which is otherwise defined the same as in Eq. (1). For $\mathcal{R}_{u}^{(r)}$ , we use $u^{r}_{12}=\mathrm{sim}(F(\tilde{\mathbf{x}_{1}}^{(1)}),F(\tilde{\mathbf{x}_{2}}^{(1)}))$ and $u^{r}_{T12}=\mathrm{sim}(F_{T}(\tilde{\mathbf{x}_{1}}^{(2)}),F_{T}(\tilde{\mathbf{x}_{2}}^{(2)}))$ , with $\mathrm{sim}(\mathbf{s},\mathbf{t})=-\left\|\frac{\mathbf{s}}{\|\mathbf{s}\|}-\frac{\mathbf{t}}{\|\mathbf{t}\|}\right\|^{2}$ where $\|\cdot\|$ is the $L_{2}$ norm. The use of $L_{2}$ normalization is inspired by the original Mean Teacher, where the consistent output is the normalized classification probabilities. Moreover, all hashcodes have the same norm, thus similar normalized feature embeddings correspond to similar hashcodes.
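The normalized similarity and the quantization term of Eq. (11) are simple to compute. Here is a minimal pure-Python sketch, assuming nonzero real-valued outputs stored as lists and the convention $\mathrm{sgn}(0)=1$, which the paper does not specify; the function names are illustrative.

```python
import math

def l2_normalize(v):
    """Scale a nonzero vector to unit L2 norm."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

def relaxed_sim(s, t):
    """sim(s, t) = -|| s/||s|| - t/||t|| ||^2, which lies in [-4, 0]."""
    s_n, t_n = l2_normalize(s), l2_normalize(t)
    return -sum((a - b) ** 2 for a, b in zip(s_n, t_n))

def sign(a):
    """sgn with the assumed convention sgn(0) = 1."""
    return 1.0 if a >= 0 else -1.0

def quantization_penalty(outputs):
    """Mean over samples of || sgn(F(x)) - F(x) ||_1, the last term of Eq. (11)."""
    total = sum(sum(abs(sign(a) - a) for a in f) for f in outputs)
    return total / len(outputs)
```

Because both arguments are normalized, `relaxed_sim` equals $2\cos(\mathbf{s},\mathbf{t})-2$, so it depends only on the angle between the two outputs and not on their magnitudes.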

As a result, the consistent pairwise losses can capture both the local and global neighborhood structure. Moreover, semantic information can be embedded with supervised pairwise loss, and the real-valued space is able to be mapped into hamming space with the quantization loss.

3.3 Mini-batch Optimization

The training procedure is roughly the same as [26]. The teacher is the average embedding of the student network and is updated by Eq. (5) each iteration, and the student is trained with back-propagation. We use the ramp-up procedure for both the learning rate and the regularization term $\omega=\omega(t)$ in the beginning of training. The training algorithm is summarized in Algorithm 1.
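The text above does not give the exact ramp-up schedule. As an assumption, the sketch below uses the sigmoid-shaped ramp-up $\exp(-5(1-x)^{2})$ that is common in Temporal Ensembling and Mean Teacher implementations; `omega_max` and `rampup_steps` are hypothetical parameter names.

```python
import math

def sigmoid_rampup(step, rampup_steps):
    """Rises smoothly from exp(-5) to 1 over `rampup_steps`, then stays at 1."""
    if rampup_steps == 0:
        return 1.0
    x = min(max(step / rampup_steps, 0.0), 1.0)
    return math.exp(-5.0 * (1.0 - x) ** 2)

def omega_at(step, omega_max, rampup_steps):
    """Regularization weight omega(t) under the assumed schedule."""
    return omega_max * sigmoid_rampup(step, rampup_steps)
```

Starting with a small $\omega(t)$ keeps the unlabeled regularizer from dominating while the teacher is still close to its initial state.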

We mainly focus on training the student network, which can be trained by optimizing Eq. (11) with SGD. We follow the common practice of randomly sampling a mini-batch to estimate the losses at each iteration. For a mini-batch $B$ , we compute the pairwise losses $\mathcal{L}_{s}^{(r)},\mathcal{R}_{u}^{(r)}$ only within the mini-batch, and likewise the pseudo similar pairs. The complexity of the loss is just $O(|B|^{2})$ , so its computational cost is small compared with that of the deep network. To utilize both the labeled and unlabeled data, the ratio of the number of labeled data pairs to unlabeled ones is kept constant in a mini-batch.
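To make the $O(|B|^{2})$ pair construction concrete, the sketch below enumerates the pairs inside one mini-batch and splits them into labeled and unlabeled pairs. It assumes single-label data where `None` marks an unlabeled sample, and it enumerates unordered pairs with `i < j`, which is a simplification of the double sum in Eq. (12); `batch_pairs` is a hypothetical helper.

```python
def batch_pairs(labels):
    """Split all pairs (i, j), i < j, of a mini-batch.

    labels[i] is a class id, or None if sample i is unlabeled.
    Returns (labeled, unlabeled): labeled maps (i, j) -> s_ij,
    unlabeled lists the pairs that involve at least one unlabeled sample.
    """
    labeled, unlabeled = {}, []
    n = len(labels)
    for i in range(n):
        for j in range(i + 1, n):
            if labels[i] is not None and labels[j] is not None:
                labeled[(i, j)] = 1 if labels[i] == labels[j] else 0
            else:
                unlabeled.append((i, j))
    return labeled, unlabeled
```

A batch of $|B|$ samples yields $|B|(|B|-1)/2$ unordered pairs, so the mix of labeled and unlabeled samples in the batch controls the ratio of labeled to unlabeled pairs.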

4 Experiments

In this section, we conduct various large-scale retrieval experiments to show the efficiency of the proposed PTS3H method. We compare our PTS3H method with recent state-of-the-art semi-supervised deep hashing methods on the retrieval performance. Some ablation studies and sensitivity of parameters are also discussed in this section.

4.1 Datasets and Evaluation Metrics

We run large-scale retrieval experiments on three image benchmarks: CIFAR-10 (http://www.cs.toronto.edu/~kriz/cifar.html), Nuswide (http://lms.comp.nus.edu.sg/research/NUS-WIDE.htm) and ImageNet-100. CIFAR-10 consists of 60,000 $32\times 32$ color images from 10 object categories. ImageNet-100 is the subset of the ImageNet dataset (http://image-net.org) with 100 randomly sampled classes. The Nuswide dataset contains about 220K available images associated with 81 ground truth concept labels. Following [19], we only use the images associated with the 21 most frequent concept tags, where the total number of images is about 190K.

The experimental protocol is similar to [30]. In the CIFAR-10 dataset, we randomly select 1,000 images (100 images per class) as the query set and use the remaining 59,000 images as the retrieval database, from which we randomly select 5,000 images as the training data. In the Nuswide dataset, we randomly select 2,100 images (100 images per class) as the query set and 10,500 images as the training set. In the ImageNet-100 dataset, we use the same data split as HashNet [4], with 130 images per class (13K images in total) for training, and all images in the selected classes from the validation set are used as queries. The remaining data in the database are regarded as the unlabeled dataset.

As we only consider pairwise similarity for training, the data pairs are constructed among the training data. For CIFAR-10 and ImageNet-100, similar data pairs share the same semantic label. For the Nuswide dataset, similar images share at least one semantic label. The remaining data pairs (pairs between unlabeled data and all the database) are regarded as the unlabeled pairs.

Our method is implemented with the PyTorch (http://pytorch.org/) framework. We adopt the pre-trained AlexNet [11] for the deep hashing methods but replace the last fully-connected layer. The images are resized to $224\times 224$ to train the network. For the supervised pairwise loss in Eq. (1), we mainly use the DSH loss and the DPSH loss, and name the resulting methods PTS3H-DSH and PTS3H-DPSH respectively. SGD with momentum 0.9 is used for optimization, and the initial learning rate of the last layer is $10^{-3}\sim 10^{-2}$ , which is ten times larger than that of the lower layers. The hyper-parameters $\omega,\gamma,\eta$ differ across datasets and are selected on a validation set that is randomly held out from the training data. For CIFAR-10, we use $\{\omega=0.8,\gamma=0.5,\eta=0.004\}$ with DSH loss and $\{\omega=0.02,\gamma=0.5,\eta=0.01\}$ with DPSH loss; for Nuswide, we use $\{\omega=0.8,\gamma=0.1,\eta=0.01\}$ with DSH loss and $\{\omega=0.2,\gamma=0.1,\eta=0.01\}$ with DPSH loss; for ImageNet-100, we use $\{\omega=0.5,\gamma=0.1,\eta=0.004\}$ with DSH loss and $\{\omega=0.5,\gamma=0.02,\eta=0.004\}$ with DPSH loss. Following [26], we set $\alpha=0.995$ , and the ratio of the number of unlabeled data pairs to labeled data pairs within a mini-batch is 15. The image perturbation strategy includes random resizing, random cropping, random horizontal flipping, etc. The training is done on a server with two Intel(R) Xeon(R) E5-2683 CPUs, 256GB RAM and a GeForce GTX TITAN Pascal with 12GB memory. We train for 60 epochs on CIFAR-10, 20 epochs on Nuswide, and 240 epochs on ImageNet-100. For simplicity, we apply center-cropped input and the teacher network to generate hashcodes; Section 4.3 shows that there is little difference between the codes generated by the teacher and the student.

Similar to [30, 4], for each retrieval dataset we report results in terms of mean average precision (MAP), precision at Hamming distance within 2, and precision of the top returned candidates. We calculate the MAP value within the top 5000 returned neighbors for Nuswide and the top 1000 for ImageNet-100, and report the MAP over all retrieved samples on CIFAR-10. Ground truth is defined by whether two candidates are similar. We run each experiment 5 times and report the average result.
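
The two main metrics can be sketched in a few lines. This is an illustrative implementation under stated assumptions, not the evaluation code used in the paper: codes are lists of 0/1, ties in Hamming distance are broken by database order, and average precision over the top $k$ is normalized by the number of relevant items found within the top $k$ (other normalizations exist). All function names are hypothetical.

```python
def hamming(a, b):
    """Hamming distance between two equal-length binary codes (sequences of 0/1)."""
    return sum(x != y for x, y in zip(a, b))


def average_precision_at_k(query, database, relevant, k=None):
    """AP over the top-k database items ranked by Hamming distance to the query.

    relevant[i] is True when database item i is similar to the query.
    k=None ranks the whole database.
    """
    order = sorted(range(len(database)), key=lambda i: hamming(query, database[i]))
    if k is not None:
        order = order[:k]
    hits, precision_sum = 0, 0.0
    for rank, idx in enumerate(order, start=1):
        if relevant[idx]:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(queries, database, relevance, k=None):
    """Mean of the per-query AP values; relevance[q][i] flags item i for query q."""
    aps = [average_precision_at_k(q, database, r, k) for q, r in zip(queries, relevance)]
    return sum(aps) / len(aps)


def precision_within_radius(query, database, relevant, radius=2):
    """Fraction of relevant items among those within the given Hamming radius."""
    retrieved = [i for i, code in enumerate(database) if hamming(query, code) <= radius]
    if not retrieved:
        return 0.0
    return sum(relevant[i] for i in retrieved) / len(retrieved)


# Toy example with 4-bit codes.
query = [0, 0, 0, 0]
database = [[0, 0, 0, 0], [1, 1, 1, 1], [0, 0, 0, 1], [1, 1, 0, 0]]
relevant = [True, True, False, True]
print(average_precision_at_k(query, database, relevant))
print(average_precision_at_k(query, database, relevant, k=3))
print(precision_within_radius(query, database, relevant, radius=2))
```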

4.2 Results

We compare our PTS3H method with recent state-of-the-art deep hashing methods, including SSDH [32] and BGDH [31]. We do not take DSH-GANs [23] into consideration as it utilizes the label of each data point. Results of other supervised hashing methods such as DSH [17], DPSH [15], DSDH [14], DISH [34], DMDH [5] and MIHash [2] are also presented for comparison. They follow similar settings, and the network used is either VGG-F or AlexNet, which share similar architectures. Table 1 shows that DSH and DPSH are strong supervised hashing algorithms; we therefore regard the DSH and DPSH losses as the baselines of PTS3H-DSH and PTS3H-DPSH respectively. We report the supervised baselines so that the relative gains of PTS3H can also be taken into consideration.

Retrieval results of the different methods are shown in Table 1 and Figure 2. We re-implement the DSH and DPSH algorithms for all the datasets, and most results of the two baselines are better than previously reported. Note that the settings of ImageNet-100 are the same as those in [4]. With the network structure and the training loss fixed, the proposed PTS3H algorithm outperforms the baselines by about 1-5 percent on MAP and on precision at Hamming distance within 2, which indicates that the proposed semi-supervised setting is able to capture more semantic information from unlabeled data. Moreover, our semi-supervised algorithm achieves better retrieval performance by a large margin at most code lengths if proper supervised baselines are selected (DSH for CIFAR-10 and ImageNet-100, DPSH for Nuswide), showing the effectiveness of the proposed teacher-student architecture.

It should be noted that the classification performance of VGG-F is slightly better than that of AlexNet, so the hashing performance is not expected to decrease, and may even improve, if AlexNet is replaced with VGG-F. Moreover, the baselines we adopt are widely used but are not the state of the art, so better results are expected if state-of-the-art supervised hashing methods [5] are adopted.

4.3 Ablation Study

Variants of PTS3H In order to verify the effectiveness of our PTS3H method, several variants are also considered. First, we set $\gamma=0$ to show the effectiveness of $\mathcal{R}_{up}$; this variant is named PTS3H-P. Then we remove $\mathcal{R}_{up}$ to show the effectiveness of $\mathcal{R}_{uq}$; this variant is denoted PTS3H-Q. The hyper-parameters of the variants are determined with the validation set. Retrieval results are shown in Table 2. The consistent similarity loss reaches about 70% of the performance gain, as it produces consistent similarities for smooth data pairs. The quantized similarity loss also achieves better performance, as it models the pairwise similarities for perturbed inputs with global information. It should be noted that there is little performance gain on MAP with the quantized similarity loss for the Nuswide dataset, as the distribution of similar pairs underlying the dataset is somewhat complicated. Better results may be achieved if a better similarity construction strategy is used.

The Teacher vs. the Student We denote by PTS3H and PTS3H-S the hashcodes generated by the teacher (denoted $H_{T}(\cdot)$) and the student (denoted $H(\cdot)$) respectively. Table 2 shows the retrieval results of PTS3H and PTS3H-S. The performances are almost the same, so either the teacher or the student can be used. As the student converges during training, the teacher becomes similar to the student by the end of training. Nevertheless, the parameters of the teacher and the student are quite different during training. As the teacher is an ensemble of the student, the representations generated by the teacher are expected to contain more semantic information than those of the student at most training stages [26], guiding the student to generate better codes.

4.4 Sensitivity to Parameters

In this section, the influence of different settings of the proposed PTS3H is evaluated. The code length is 48 and we use the DSH loss for evaluation. We do not report the influence of $\eta$, as it has been discussed in the original papers [17, 15].

Influence of $\omega$ Figure 3(a)(b) shows the performance for different values of $\omega$. The results show that a properly chosen consistent weight $\omega$ yields better hashing performance, and hence better semi-supervised training.

Influence of $\gamma$ Figure 3(c)(d) shows the performance for different values of $\gamma$. Note that a proper $\omega$ is set for each $\gamma$. There is some improvement for a proper $\gamma$, especially in precision at Hamming distance within 2 on the Nuswide dataset. As with $\omega$, a proper $\gamma$ should be set for better performance.

Influence of $thr$ As discussed in Sec. 3.2, $thr$ is set dynamically such that the ratio of pseudo-similar pairs among the unlabeled data is constant. Figure 4 shows the performance for different ratio values and the variation of $thr$ during training. The performance is not sensitive to the ratio of pseudo-similar pairs, so this parameter can be set freely.
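
One way to keep the ratio of pseudo-similar pairs constant is to choose $thr$ as a quantile of the pairwise values in each minibatch. The sketch below is only an illustration of that idea and not the procedure of Sec. 3.2: it assumes that $thr$ is applied to a distance between teacher outputs and that pairs with distance at most $thr$ are marked pseudo-similar. The function names and the toy distances are hypothetical.

```python
import math


def dynamic_threshold(distances, ratio):
    """Pick thr so that about `ratio` of the pairs have distance <= thr.

    distances: pairwise distances of unlabeled pairs (assumed to come from
               the teacher); ratio: target fraction of pseudo-similar pairs.
    """
    if not distances:
        raise ValueError("distances must be non-empty")
    if not 0.0 < ratio <= 1.0:
        raise ValueError("ratio must be in (0, 1]")
    ordered = sorted(distances)
    n_similar = max(1, math.ceil(ratio * len(ordered)))
    return ordered[n_similar - 1]


def pseudo_labels(distances, ratio):
    """Return (thr, labels) where labels[i] is 1 for pseudo-similar pairs."""
    thr = dynamic_threshold(distances, ratio)
    return thr, [1 if d <= thr else 0 for d in distances]


# Toy batch of 8 pair distances with a target ratio of 25%.
batch_distances = [0.9, 0.1, 0.5, 0.3, 0.7, 0.2, 0.8, 0.4]
thr, labels = pseudo_labels(batch_distances, ratio=0.25)
print(thr, labels)
```

Because the threshold is recomputed from the current distances, its value changes during training while the fraction of pseudo-similar pairs stays roughly fixed (ties at the threshold can push the fraction slightly above the target).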

5 Conclusion and Future Work

In this paper, we propose a novel semi-supervised hashing algorithm named PTS3H for the setting in which pairwise supervision and abundant unlabeled data are provided. The proposed PTS3H is a teacher-student network architecture which is carefully designed for labeled and unlabeled pairs. We propose the general consistent pairwise loss, in which the pairwise information generated by the teacher network guides the training of the student. There are two types of losses: the consistent similarity loss models the local pairwise information, and the quantized similarity loss models the information globally by quantizing the similarities between samples. This procedure aims at generating similar retrieval results for neighboring queries. Experiments show that the proposed PTS3H achieves great improvement over the baselines, and it is superior or comparable to the state-of-the-art semi-supervised hashing algorithms.

It should be noted that we use popular pairwise loss baselines and achieve good hashing results. As the proposed PTS3H algorithm is a general framework for semi-supervised hashing, better retrieval performance is expected when it incorporates state-of-the-art supervised hashing algorithms with pairwise supervision.

Bibliography

[1] M. Belkin, P. Niyogi, and V. Sindhwani. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. Journal of Machine Learning Research, 7(Nov):2399–2434, 2006.
[2] F. Cakir, K. He, S. A. Bargal, and S. Sclaroff. Hashing with mutual information. arXiv preprint arXiv:1803.00974, 2018.
[3] Y. Cao, M. Long, J. Wang, H. Zhu, and Q. Wen. Deep quantization network for efficient image retrieval. In AAAI, pages 3457–3463, 2016.
[4] Z. Cao, M. Long, J. Wang, and P. S. Yu. HashNet: Deep learning to hash by continuation. In ICCV, pages 5609–5618, 2017.
[5] Z. Chen, X. Yuan, J. Lu, Q. Tian, and J. Zhou. Deep hashing via discrepancy minimization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6838–6847, 2018.
[6] A. Gionis, P. Indyk, R. Motwani, et al. Similarity search in high dimensions via hashing. In VLDB, volume 99, pages 518–529, 1999.
[7] Y. Gong, S. Lazebnik, A. Gordo, and F. Perronnin. Iterative quantization: A procrustean approach to learning binary codes for large-scale image retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(12):2916–2929, 2013.
[8] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.