Pairwise Teacher-Student Network for Semi-Supervised Hashing
Shifeng Zhang, Jianmin Li, Bo Zhang

TL;DR
This paper introduces a teacher-student semi-supervised hashing framework that leverages pairwise information from a teacher network to improve data retrieval accuracy, especially on complex datasets with limited labeled pairs.
Contribution
It proposes a novel teacher-student approach for semi-supervised hashing that outperforms existing methods and addresses limitations of graph-based structures for complex data.
Findings
Achieves significant improvements over supervised baselines.
Outperforms state-of-the-art semi-supervised hashing methods.
Effective on large-scale complex datasets.
Table 1. Retrieval results of different methods (see Section 4.2). Rows in parentheses give the gain of each PTS3H variant over its supervised baseline.

| Method | Net | CIFAR-10 12 bits | CIFAR-10 24 bits | CIFAR-10 32 bits | CIFAR-10 48 bits | Nuswide 12 bits | Nuswide 24 bits | Nuswide 32 bits | Nuswide 48 bits | ImageNet-100⁴ 16 bits | ImageNet-100⁴ 32 bits | ImageNet-100⁴ 48 bits | ImageNet-100⁴ 64 bits |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Semi-Supervised Hashing | | | | | | | | | | | | | |
| SSDH | VGG-F | 0.801 | 0.813 | 0.812 | 0.814 | 0.773 | 0.779 | 0.778 | 0.778 | -¹ | - | - | - |
| BGDH | VGG-F | 0.805 | 0.824 | 0.826 | 0.833 | 0.803 | 0.818 | 0.822 | 0.828 | - | - | - | - |
| PTS3H-DSH | AlexNet | 0.798 | 0.828 | 0.835 | 0.843 | 0.752 | 0.774 | 0.783 | 0.789 | 0.612 | 0.680 | 0.697 | 0.703 |
| (gain over DSH) | | (+0.056) | (+0.034) | (+0.026) | (+0.023) | (+0.012) | (+0.012) | (+0.019) | (+0.016) | (+0.023) | (+0.032) | (+0.047) | (+0.041) |
| PTS3H-DPSH | AlexNet | 0.789 | 0.799 | 0.801 | 0.805 | 0.803 | 0.827 | 0.831 | 0.842 | 0.397 | 0.542 | 0.618 | 0.634 |
| (gain over DPSH) | | (+0.038) | (+0.028) | (+0.025) | (+0.027) | (+0.004) | (+0.006) | (+0.003) | (+0.009) | (+0.018) | (+0.014) | (+0.027) | (+0.026) |
| Supervised Hashing Baselines | | | | | | | | | | | | | |
| DSH² | AlexNet | 0.741 | 0.794 | 0.809 | 0.820 | 0.740 | 0.762 | 0.764 | 0.773 | 0.589 | 0.648 | 0.650 | 0.662 |
| DPSH² | AlexNet | 0.751 | 0.771 | 0.776 | 0.778 | 0.799 | 0.821 | 0.827 | 0.834 | 0.379 | 0.528 | 0.591 | 0.608 |
| DSDH | VGG-F | 0.740 | 0.786 | 0.801 | 0.820 | 0.776 | 0.808 | 0.820 | 0.829 | - | - | - | - |
| DISH | AlexNet | 0.758 | 0.784 | 0.799 | 0.791 | 0.787 | 0.810 | 0.810 | 0.813 | - | - | - | - |
| HashNet³ | AlexNet | 0.686³ | - | 0.692 | 0.718 | 0.733³ | - | 0.755 | 0.762 | 0.502 | 0.622 | 0.661 | 0.682 |
| DMDH³ | AlexNet | 0.704³ | - | 0.732 | 0.737 | 0.751³ | - | 0.781 | 0.789 | 0.513 | 0.612 | 0.673 | 0.692 |
| MIHash | AlexNet | 0.738 | 0.775 | 0.791 | 0.816 | 0.773 | 0.820 | 0.831 | 0.843 | 0.569 | 0.661 | 0.685 | 0.694 |
Table 2. Retrieval results of the PTS3H variants (see Section 4.3).

| Method | Dataset | MAP 32 bits | MAP 48 bits | Precision 32 bits | Precision 48 bits |
|---|---|---|---|---|---|
| PTS3H-P | CIFAR-10 | 0.829 | 0.838 | 0.829 | 0.827 |
| PTS3H-Q | CIFAR-10 | 0.817 | 0.826 | 0.821 | 0.814 |
| PTS3H | CIFAR-10 | 0.835 | 0.843 | 0.832 | 0.829 |
| PTS3H-S | CIFAR-10 | 0.833 | 0.842 | 0.834 | 0.830 |
| PTS3H-P | Nuswide | 0.777 | 0.787 | 0.763 | 0.727 |
| PTS3H-Q | Nuswide | 0.772 | 0.777 | 0.759 | 0.710 |
| PTS3H | Nuswide | 0.782 | 0.789 | 0.770 | 0.737 |
| PTS3H-S | Nuswide | 0.783 | 0.789 | 0.771 | 0.739 |
Pairwise Teacher-Student Network for Semi-Supervised Hashing
Shifeng Zhang, Jianmin Li and Bo Zhang
Institute for Artificial Intelligence, State Key Lab of Intelligent Technology and Systems,
Beijing National Research Center for Information Science and Technology,
Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China
Abstract
Hashing methods map similar high-dimensional data to binary hashcodes with small Hamming distances, and they have received broad attention due to their low storage cost and fast retrieval speed. Pairwise similarity is easily obtained and widely used for retrieval, and most supervised hashing algorithms are carefully designed for pairwise supervision. As labeling all data pairs is difficult, semi-supervised hashing has been proposed, which aims at learning efficient codes with limited labeled pairs and abundant unlabeled ones. Existing methods build graphs to capture the structure of the dataset, but they do not work well for complex data, because the graph is built on the data representations and determining good representations of complex data is difficult. In this paper, we propose a novel teacher-student semi-supervised hashing framework in which the student is trained with the pairwise information produced by the teacher network. The network follows the smoothness assumption: it achieves consistent distances for similar data pairs, so that the retrieval results are similar for neighboring queries. Experiments on large-scale datasets show that the proposed method achieves substantial gains over the supervised baselines and is superior to state-of-the-art semi-supervised hashing methods.
1 Introduction
With the explosion of high-dimensional media data, approximate nearest neighbor (ANN) search [6] has attracted broad attention for efficient information retrieval. Among the existing ANN methods, hashing has become a popular tool for ANN search on large-scale datasets due to its fast search time and small storage footprint [6, 18, 25, 30, 28]. It aims at encoding high-dimensional data into compact hashcodes, so that similar data are mapped to hashcodes with small Hamming distance.
Among the existing hashing methods, data-dependent learning-to-hash methods learn hash functions from the training data, and the learned codes are able to capture the data distribution. Learning-to-hash methods can be divided into three categories: unsupervised hashing [7, 28], supervised hashing [18, 25] and semi-supervised hashing [32, 31, 23]. Experiments show that the codes learned by (semi-)supervised hashing methods capture more semantic information than unsupervised ones. Recently, with the rapid development of deep learning [11, 8], deep hashing methods have achieved great success [30, 12, 17, 34, 32, 14, 2]. They learn the hashcodes and the deep networks simultaneously, so the codes generated by the deep networks contain much richer semantic information.
For ANN search, pairwise similarities between data pairs play an important role in evaluating the quality of search. To generate efficient hashcodes, (deep) supervised hashing methods regard the pairwise similarity as the basic supervision, such that similar data pairs should be mapped to codes with small Hamming distance. Most hashing methods model the similarities with pairwise losses, and optimizing them is expected to generate codes whose Hamming distances are consistent with the similarities. For ease of back-propagation, these methods simply generate data pairs within a mini-batch and achieve good results [17, 3, 20, 15].
Despite the success of supervised hashing, labeling all the database data (pairs) is almost intractable, as the amount of data is increasing dramatically. To utilize the abundant database data, deep semi-supervised hashing [32, 31] has been proposed, in which the hash function is trained with the labeled data pairs and abundant unlabeled ones. The success of semi-supervised hashing relies on the smoothness assumption, under which neighboring data are likely to have the same predictions. These methods construct graphs over the unlabeled data to capture the neighborhood structure among the samples. However, the data and their representations may lie on high-dimensional nonlinear manifolds, especially for complex data like images and videos, and the representations may not be learned well with limited data. As the graph is built from the data representations, it may not model the neighborhood structure of the data precisely, which violates the smoothness assumption to some extent and affects the hashing performance.
Recently, perturbation-based teacher-student semi-supervised learning (SSL) algorithms have achieved great success [13, 26]. These methods follow the smoothness assumption, under which the learned classifier produces a consistent prediction for a perturbed input; they can therefore better capture the structure of unlabeled data [35] and produce better representations for graph training [21]. However, existing teacher-student methods only handle data with a single label and do not consider the pairwise relationship between samples, which is crucial for semi-supervised hashing. By carefully designing the teacher-student architecture and the loss for pairwise similarities, we can exploit the advantages of this architecture and obtain a novel semi-supervised hashing method.
In this paper, we propose a novel semi-supervised hashing algorithm called Pairwise Teacher-Student Semi-Supervised Hashing (PTS3H), in which pairwise similarities are used for supervision and abundant unlabeled data pairs are available. The proposed PTS3H is a teacher-student network architecture in which the student is trained with a pairwise loss and unsupervised regularizers, and the teacher is the average of the student network, which generates efficient pairwise representations. As hashing mainly focuses on pairwise information, we propose a general consistent pairwise loss under which similar queries produce similar pairwise similarities with the database and therefore achieve similar retrieval results, following the smoothness assumption [26, 13]. To model pairwise similarities between samples with both local and global pairwise information, we propose two types of losses: a consistent similarity loss for consistent pairwise similarities among data, and a quantized similarity loss in which the quantized [9] similarities are modeled by the teacher network from global data pairs. Experiments show that the proposed PTS3H achieves substantial improvement over the supervised baselines, and that it is superior or comparable to state-of-the-art semi-supervised hashing algorithms.
2 Background
Suppose we are given $n$ data samples $x_1, \dots, x_n$, and $X = \{x_1, \dots, x_n\}$ is the training dataset. Denote $\mathcal{L}$ as a set such that $(i, j) \in \mathcal{L}$ implies that $x_i, x_j$ have similarity information, and we denote $s_{ij} = 1$ if $x_i, x_j$ are similar, and $s_{ij} = 0$ otherwise. In practical applications, the similarity information of some data pairs is unknown. We denote $\mathcal{U}$ as the set of pairs whose pairwise similarity information is unknown.
Denote $r$ as the length of the hashcode to learn. The goal of semi-supervised hash learning is to learn the hash function $H: x \mapsto \{-1, 1\}^r$ with the data samples and the pairwise similarities. We denote $h_i = H(x_i)$ as the learned hashcode of $x_i$.
2.1 Pairwise Loss for Supervised Hashing
Pairwise losses are widely used in (deep) supervised hashing algorithms with pairwise similarity as supervision [18, 16, 17, 3, 15, 31]. For the given training data and pairwise information, the basic formulation of the pairwise loss is
[Equation (1)]
where the loss of each labeled pair depends on the similarity (or distance) between the codes of the two samples.
Different forms of the per-pair loss are used in different supervised hashing algorithms, for example:
- KSH loss: used in KSH [18] and FastH [16];
- DSH loss: used in DSH [17];
- DPSH loss: used in DPSH [15] and DHN [3].
Optimizing this loss is expected to learn hashcodes such that similar data pairs have codes with small Hamming distance, and dissimilar pairs have codes with large Hamming distance. It should be noted that the supervision consists only of pairwise information, which is widespread in the real world.
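As a rough illustration, the three pairwise losses named above can be sketched in plain Python. The formulas follow the commonly cited forms from the original KSH, DSH and DPSH papers, not necessarily the exact notation of this paper. The function names, the label conventions (KSH with `s` in {-1, +1}; DSH and DPSH with `s` in {0, 1}, where 1 means similar) and the margin `m` are assumptions made for the example.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def ksh_loss(u, v, s):
    """KSH-style loss: squared gap between the length-normalized inner product and s in {-1, +1}."""
    r = len(u)
    return (dot(u, v) / r - s) ** 2

def dsh_loss(u, v, s, m):
    """DSH-style contrastive loss with s in {0, 1} (1 = similar) and an assumed margin m."""
    d = sum((a - b) ** 2 for a, b in zip(u, v))  # squared Euclidean distance
    return 0.5 * s * d + 0.5 * (1 - s) * max(m - d, 0.0)

def dpsh_loss(u, v, s):
    """DPSH-style negative log-likelihood with s in {0, 1} and theta = 0.5 * <u, v>."""
    theta = 0.5 * dot(u, v)
    # Numerically stable log(1 + exp(theta)).
    softplus = max(theta, 0.0) + math.log1p(math.exp(-abs(theta)))
    return softplus - s * theta
```

In all three cases the loss is small when similar pairs have close codes and dissimilar pairs have distant codes, which is the behavior the paragraph above describes.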
2.2 Semi-Supervised Hashing
Semi-supervised hashing focuses on learning a hash function with limited labeled data pairs as well as abundant unlabeled pairs in the database. Similar to semi-supervised learning, the general form of the loss to be optimized is
[Equation (2)]
where the supervised term is Eq. (1) and the other term is the regularization term for unlabeled data. SPLH [27] adopts a bit-balance constraint for regularization, but it does not consider the relationship between samples. Graph-based methods like SSDH [33] and BGDH [31] construct an affinity graph that indicates pairwise similarities between unlabeled samples, and the regularization loss is built on the graph. These methods succeed in capturing the neighborhood structure between samples, but the graph is constructed from data representations, and a semantic gap may exist among the representations, which may violate the smoothness assumption. Recently, deep generative models have achieved success in semi-supervised learning problems, and DSH-GANs [23] proposes a GAN [24] based hashing method. A conditional GAN is trained with labeled and unlabeled data to generate labeled samples, which are used for training the hashing network. Although it achieves state-of-the-art results on some datasets, it cannot be trained with pairwise supervision.
2.3 Teacher-Student Network for Semi-Supervised Learning
Semi-supervised learning (SSL) aims at learning with limited labeled data and abundant unlabeled data. Most semi-supervised learning methods rely on the smoothness assumption, under which similar data correspond to the same label. Various approaches have been proposed, such as transductive approaches [10, 29] and graph-based methods [1, 35], but they do not work well on complex datasets, as the underlying structure of the data is hard to capture. Recently, perturbation-based semi-supervised learning approaches have achieved great success; in these approaches, a perturbed input should yield a consistent prediction. These methods introduce two roles, the teacher and the student. The student is learned as before; the teacher generates the targets for training the student. Formally, considering a dataset in which part of the data are labeled, we aim at optimizing the following loss function:
[Equation (3)]
where the supervised term is a classification loss such as the softmax loss, and the unsupervised regularization term is defined as
[Equation (4)]
which measures, with a distance between two features, the gap between the outputs of the student and the teacher networks under two random perturbations of the input. There are several ways to define the teacher. TempEns [13] takes the teacher output to be the exponential moving average (EMA) of the student's output; Mean Teacher [26] averages the weights of the student with EMA to form the teacher network; VAT [22] introduces adversarial perturbations instead of random perturbations. These methods achieve state-of-the-art results on SSL problems.
In spite of this, perturbation-based methods only regularize single data points and do not consider the neighborhood structure between samples. SNTG [21] constructs a graph with the teacher to capture the neighborhood structure, and introduces a pairwise regularization term based on the graph. Experiments show that the additional term achieves better performance, as the consistency of both the perturbed data and the neighboring samples is considered. However, the graph in SNTG is built specifically for classification.
Building on the success of teacher-student networks for semi-supervised learning, in this paper we propose a novel teacher-student framework for semi-supervised hashing in which only a small portion of the pairwise similarity information is provided. As we perform Hamming distance learning, we propose a novel consistent pairwise loss that enforces consistent feature distances for similar data pairs, so that the method follows the smoothness assumption under which neighboring queries achieve similar retrieval results. Experiments show its superiority over state-of-the-art semi-supervised hashing algorithms.
3 Methodology
In this section, we present the proposed deep semi-supervised hashing method, called Pairwise Teacher-Student Semi-Supervised Hashing (PTS3H), in which a teacher-student network is adopted.
3.1 The Teacher-Student Framework
The proposed PTS3H is a teacher-student architecture, shown in Figure 1(a). The teacher and the student networks share the same architecture, in which the last layer is a fully-connected layer with $r$ outputs ($r$ is the hashcode length), and the remaining layers can be a basic deep network such as AlexNet or VGGNet.
The update rule of the teacher-student network is similar to that of Mean Teacher [26]. The student is learned with labeled data pairs and guided by the teacher. Denote $\theta_t$ and $\theta'_t$ as the parameters of the student and teacher networks at training step $t$, respectively. The teacher network is updated by EMA as follows:
$$\theta'_t = \alpha\,\theta'_{t-1} + (1 - \alpha)\,\theta_t \qquad (5)$$
where $\alpha$ is the EMA smoothing coefficient. Thus the teacher is an average of the student over training steps, and the teacher's output can be regarded as the mean embedding of the student's.
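The EMA update of the teacher can be sketched in a few lines of Python. This is a minimal illustration in the style of Mean Teacher [26]: parameters are stored as a plain dictionary of floats, and the names `ema_update` and `alpha` are assumptions of this example, not identifiers from the paper.

```python
def ema_update(teacher, student, alpha):
    """Return new teacher parameters: alpha * teacher + (1 - alpha) * student, per parameter.

    `teacher` and `student` map parameter names to floats (a toy stand-in for
    network weights); `alpha` is the assumed EMA smoothing coefficient in [0, 1).
    """
    return {name: alpha * teacher[name] + (1.0 - alpha) * student[name]
            for name in teacher}

# Usage: the teacher drifts toward a fixed student as updates accumulate.
teacher = {"w": 0.0}
student = {"w": 1.0}
for _ in range(200):
    teacher = ema_update(teacher, student, alpha=0.9)
```

A larger `alpha` makes the teacher change more slowly, so it averages the student over a longer window of training steps.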
Denote $f(x)$ and $f'(x)$ as the outputs of the student and teacher networks, respectively. The binary code of a data point $x$ can be easily obtained with either the student network, $h = \mathrm{sign}(f(x))$, or the teacher network, $h = \mathrm{sign}(f'(x))$. Note that $x$ is not perturbed during code generation.
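Code generation by taking the sign of the network output, and the Hamming distance used to compare codes, can be illustrated as follows. The helper names are hypothetical, and mapping an output of exactly zero to +1 is a convention chosen for this sketch.

```python
def to_hashcode(outputs):
    """Binarize real-valued network outputs to {-1, +1}.

    Assumption of this sketch: an output of exactly 0 is mapped to +1.
    """
    return [1 if o >= 0 else -1 for o in outputs]

def hamming_distance(a, b):
    """Number of positions at which two equal-length codes differ."""
    return sum(1 for x, y in zip(a, b) if x != y)

# Usage: two real-valued outputs that agree in sign on two of three bits.
code_a = to_hashcode([0.3, -1.2, 0.0])
code_b = to_hashcode([0.9, -0.4, -2.0])
distance = hamming_distance(code_a, code_b)
```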
3.2 Loss Function
The general form of the loss to be optimized is Eq. (2). For labeled data pairs, the training loss is the pairwise loss function in Eq. (1). For training with the unlabeled data, the regularization term should be defined so that the teacher network generates targets to guide the student network. As hash learning focuses on the pairwise similarities of the codes, learning the similarities in the embedded Hamming space is quite important. For input pairs, the targets for the student should be the similarities of the codes generated by the teacher. We therefore propose the general form of the consistent pairwise loss such that
[Equation (6)]
where the inputs are random perturbations of the original samples, the per-pair term is a certain type of loss, and the two compared quantities are the pairwise similarities of the codes generated by the student and the teacher, respectively. Eq. (6) is quite different from the original Mean Teacher [26], in which only single data points are considered for training.
For Eq. (6), we propose two simple but efficient forms of loss, named the consistent similarity loss and the quantized similarity loss.
Consistent Similarity Loss. It is expected that the learned codes follow the smoothness assumption, in that a noisy input query corresponds to consistent retrieval results. Consequently, the similarities of codes between the noisy data pairs should be consistent. As illustrated in Figure 1(b.2), if the two perturbed versions of one sample are quite similar, and so are the two perturbed versions of the other sample, the difference between the pairwise similarity computed by the student and the one computed by the teacher should be small. Thus the consistent similarity loss is defined with
[Equation (7)]
where the notation is the same as in Eq. (6). When Eq. (7) is used, we refer to the resulting unsupervised term as the consistent similarity loss.
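One possible instantiation of this idea is sketched below. It is an assumption of this example, not the paper's exact formula, that the pairwise similarity is the cosine similarity of the network outputs and that the per-pair loss is the squared difference between the student's and the teacher's similarities. The function names are hypothetical.

```python
import math

def normalize(u):
    """L2-normalize a vector; a zero vector is returned unchanged."""
    n = math.sqrt(sum(a * a for a in u)) or 1.0
    return [a / n for a in u]

def cosine(u, v):
    """Cosine similarity, used here as the assumed pairwise similarity."""
    return sum(a * b for a, b in zip(normalize(u), normalize(v)))

def consistent_similarity_loss(student_out, teacher_out):
    """Mean squared gap between student and teacher pairwise similarities.

    student_out[i] and teacher_out[i] are the outputs of the two networks for
    two different random perturbations of sample i.
    """
    n = len(student_out)
    total, count = 0.0, 0
    for i in range(n):
        for j in range(i + 1, n):
            gap = (cosine(student_out[i], student_out[j])
                   - cosine(teacher_out[i], teacher_out[j]))
            total += gap * gap
            count += 1
    return total / count if count else 0.0
```

The loss is zero when the student reproduces the teacher's pairwise similarities exactly, and grows as the two similarity structures diverge.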
Quantized Similarity Loss. The consistent similarity loss is only able to capture the local structure of a given data pair, ignoring the global neighborhood structure between samples. Inspired by quantization methods, in which a large amount of information can be compressed with quantization [9], we quantize the pairwise similarity produced by the ensembled teacher to guide the hash learning. As the quantization procedure is based on global unlabeled data pairs, the quantized similarities are expected to contain global pairwise information, leading to better learned codes.
We introduce a quantized similarity matrix to be learned, with one row and one column per training sample. Its element at the $i$-th row and $j$-th column indicates whether $x_i$ and $x_j$ form a pseudo similar pair. Since the teacher output is an ensemble of the embedded codes of a data point, it can be regarded as a precise feature embedding of that data point. Consequently, we use the teacher output to determine the pseudo similar pairs. The similarity matrix is defined according to the distances between teacher outputs such that
[Equation (8)]
where the notation follows Eq. (6) and the threshold is set according to the dataset. In practical applications, the distributions of labeled and unlabeled pairs are expected to be the same. We can therefore set the threshold such that the ratio of pseudo similar pairs equals the ratio of similar pairs among the labeled pairs, so that the pseudo similar pairs generated by the teacher are mostly true positives and their distribution is expected to match that of the ground-truth similar pairs.
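The threshold selection described above can be sketched as a quantile computation. This is a minimal illustration under two assumptions: a pair is marked pseudo similar when its teacher distance is at most the threshold, and the threshold is the distance of the k-th closest pair, where k matches the similar-pair ratio of the labeled data. All function names are hypothetical.

```python
def similar_ratio(labels):
    """Fraction of labeled pairs marked similar (labels are 0 or 1)."""
    return sum(labels) / len(labels)

def pick_threshold(distances, ratio):
    """Threshold such that about `ratio` of the pairs have distance <= threshold."""
    ranked = sorted(distances)
    k = max(1, round(ratio * len(ranked)))  # number of pseudo similar pairs, at least one
    return ranked[k - 1]

def pseudo_labels(distances, threshold):
    """1 for pseudo similar pairs (distance <= threshold), 0 otherwise."""
    return [1 if d <= threshold else 0 for d in distances]

# Usage: 25% of labeled pairs are similar, so about 25% of the unlabeled
# pairs (the closest ones under the teacher) become pseudo similar.
ratio = similar_ratio([1, 0, 0, 0])
teacher_distances = [0.9, 0.1, 0.5, 0.7, 0.3, 0.8, 0.2, 0.6]
threshold = pick_threshold(teacher_distances, ratio)
labels = pseudo_labels(teacher_distances, threshold)
```

With tied distances, slightly more than the target share of pairs can fall under the threshold, which is why the match is only approximate.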
Given the generated pseudo similar pairs, we can simply train the student with the pairwise loss shown in Eq. (1) to capture the global structure of the embedded codes in the Hamming space. We propose the quantized similarity loss by defining the unsupervised term such that:
[Equation (9)]
where the per-pair loss has the same form as that defined in Eq. (1). It should be noted that Eq. (9) can be regarded, to some extent, as a ranking loss over the global data pairs: pairs that the teacher considers similar are more likely to become pseudo similar pairs, and they are therefore expected to achieve similar Hamming distances during training. When Eq. (9) is used, we refer to the resulting unsupervised term as the quantized similarity loss.
Overall Training Loss. The overall training loss has the same form as Eq. (2), where the supervised term is defined in Eq. (1), and the unsupervised term can be regarded as the combination of the consistent similarity loss and the quantized similarity loss such that
[Equation (10)]
As the teacher outputs in Eq. (10) lead to better abstract representations and can model the pairwise information both locally and globally, the proposed loss is expected to better satisfy the smoothness assumption and to produce better codes. Moreover, the Hamming distances are consistent with the similarities on both labeled and unlabeled data.
Implementation and Relaxation. Eq. (10) implies that both the original and the perturbed samples should be fed into the network. For simplicity, we only use the perturbed data as input, as shown in Figure 1.
Directly optimizing Eq. (2) is intractable, as discrete constraints are involved. As in most deep hashing algorithms [3, 32, 31], a simple and efficient remedy is to remove the sign function and add a quantization loss. We reformulate the relaxed problem as follows
[Equation (11)]
where the two main terms are the relaxations of Eq. (1) and Eq. (10), respectively, such that
[Equation (12)]
For the supervised term, we directly remove the sign function to compute the similarities, and the per-pair loss is defined as in Eq. (1). For the unsupervised term, the similarities are computed on normalized network outputs. The use of normalization is inspired by the original Mean Teacher, where the consistent output is the normalized classification probabilities. Moreover, all hashcodes of a given length have the same norm, thus similar normalized feature embeddings correspond to similar hashcodes.
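The two ingredients of the relaxation, a quantization loss and output normalization, can be illustrated with a small sketch. The L1 form of the quantization loss below is an assumption borrowed from common deep hashing practice (for example, the regularizer used in DSH), not a formula taken from this paper, and the function names are hypothetical.

```python
import math

def quantization_loss(u):
    """Assumed L1 form: distance of each real-valued output from the nearest of -1 and +1."""
    return sum(abs(abs(a) - 1.0) for a in u)

def l2_normalize(u):
    """Scale a vector to unit L2 norm; a zero vector is returned unchanged."""
    n = math.sqrt(sum(a * a for a in u))
    return [a / n for a in u] if n > 0 else list(u)

def code_norm(code):
    """L2 norm of a vector; for any code in {-1, +1}^r this equals sqrt(r)."""
    return math.sqrt(sum(c * c for c in code))
```

Because every binary code of length r has norm sqrt(r), comparing normalized real-valued outputs is a natural proxy for comparing the codes they will be rounded to.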
As a result, the consistent pairwise losses can capture both the local and the global neighborhood structure. Moreover, semantic information can be embedded with the supervised pairwise loss, and the real-valued space can be mapped into the Hamming space with the quantization loss.
3.3 Mini-batch Optimization
The training procedure is roughly the same as in [26]. The teacher is the average embedding of the student network and is updated by Eq. (5) at each iteration, while the student is trained with back-propagation. We use a ramp-up procedure for both the learning rate and the regularization term at the beginning of training. The training algorithm is summarized in Algorithm 1.
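The text does not give the ramp-up function. As an illustration only, the sketch below uses the sigmoid-shaped schedule popularized by Temporal Ensembling [13] and Mean Teacher [26]; the function name, the constant 5 and the `rampup_steps` parameter are assumptions of this example.

```python
import math

def sigmoid_rampup(step, rampup_steps):
    """Ramp a weight from about 0.007 to 1 over `rampup_steps` training steps.

    Assumed schedule: exp(-5 * (1 - t)^2) with t = step / rampup_steps clipped
    to [0, 1]. Multiply the result by the target weight or learning rate.
    """
    if rampup_steps <= 0:
        return 1.0
    t = min(max(step / rampup_steps, 0.0), 1.0)
    return math.exp(-5.0 * (1.0 - t) ** 2)

# Usage: scale the weight of the unsupervised term during early training.
weights = [sigmoid_rampup(s, 100) for s in (0, 25, 50, 75, 100, 150)]
```

Starting the unsupervised term near zero keeps the unreliable early teacher targets from dominating the supervised signal.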
We mainly focus on training the student network, which can be trained by optimizing Eq. (11) with SGD. Following common practice, we randomly sample a mini-batch to estimate the losses at each iteration. For a mini-batch, we only compute the pairwise losses within the mini-batch, and the pseudo similar pairs are likewise computed within it. The complexity of the loss is therefore only quadratic in the mini-batch size, which is small compared with the computational cost of the deep networks. To utilize both the labeled and the unlabeled data, the ratio between the numbers of labeled and unlabeled data pairs is kept constant in each mini-batch.
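The within-batch pair construction can be sketched as follows. The rule that a pair counts as labeled only when both samples come from the labeled training set is an assumption of this example, consistent with the description that pairs involving unlabeled data are treated as unlabeled pairs; the function name is hypothetical.

```python
from itertools import combinations

def split_batch_pairs(is_labeled):
    """Enumerate all index pairs in a mini-batch and split them by supervision.

    is_labeled[i] is True when sample i comes from the labeled training set.
    Returns (labeled_pairs, unlabeled_pairs); a batch of m samples yields
    m * (m - 1) / 2 pairs in total, which is quadratic in m.
    """
    labeled, unlabeled = [], []
    for i, j in combinations(range(len(is_labeled)), 2):
        if is_labeled[i] and is_labeled[j]:
            labeled.append((i, j))
        else:
            unlabeled.append((i, j))
    return labeled, unlabeled

# Usage: a batch with two labeled and two unlabeled samples.
labeled_pairs, unlabeled_pairs = split_batch_pairs([True, True, False, False])
```

Fixing the number of labeled and unlabeled samples per batch therefore also fixes the ratio of labeled to unlabeled pairs.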
4 Experiments
In this section, we conduct various large-scale retrieval experiments to show the effectiveness of the proposed PTS3H method. We compare PTS3H with recent state-of-the-art semi-supervised deep hashing methods on retrieval performance. Ablation studies and the sensitivity to parameters are also discussed in this section.
4.1 Datasets and Evaluation Metrics
We run large-scale retrieval experiments on three image benchmarks: CIFAR-10 (http://www.cs.toronto.edu/~kriz/cifar.html), Nuswide (http://lms.comp.nus.edu.sg/research/NUS-WIDE.htm) and ImageNet-100. CIFAR-10 consists of 60,000 color images from 10 object categories. ImageNet-100 is a subset of the ImageNet dataset (http://image-net.org) with 100 randomly sampled classes. The Nuswide dataset contains about 220K available images associated with 81 ground-truth concept labels. Following [19], we only use the images associated with the 21 most frequent concept tags, for a total of about 190K images.
The experimental protocol is similar to that of [30]. In the CIFAR-10 dataset, we randomly select 1,000 images (100 images per class) as the query set and use the remaining 59,000 images as the retrieval database, from which we randomly select 5,000 images as the training data. In the Nuswide dataset, we randomly select 2,100 images (100 images per class) as the query set and 10,500 images as the training set. In the ImageNet-100 dataset, we use the same data split as HashNet [4], with 130 images per class (13K images in total) for training, and all images of the selected classes in the validation set are used as queries. The remaining data in the database are regarded as the unlabeled dataset.
As we only consider pairwise similarity for training, the data pairs are constructed among the training data. For CIFAR-10 and ImageNet-100, similar data pairs share the same semantic label. For the Nuswide dataset, similar images share at least one semantic label. The remaining data pairs (pairs between unlabeled data and the whole database) are regarded as the unlabeled pairs.
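The two similarity rules can be written down directly. This is a minimal sketch with hypothetical function names: single-label datasets such as CIFAR-10 and ImageNet-100 compare class labels for equality, while the multi-label Nuswide rule checks whether the tag sets overlap.

```python
def similar_single_label(label_a, label_b):
    """1 if two single-label samples share the same class, else 0."""
    return 1 if label_a == label_b else 0

def similar_multi_label(tags_a, tags_b):
    """1 if two multi-label samples share at least one tag, else 0."""
    return 1 if set(tags_a) & set(tags_b) else 0

# Usage with made-up labels and tags.
same_class = similar_single_label(3, 3)
shared_tag = similar_multi_label(["sky", "water"], ["water", "person"])
```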
Our method is implemented with the PyTorch (http://pytorch.org/) framework. We adopt the pre-trained AlexNet [11] for deep hashing methods but replace the last fully-connected layer. The images are resized to to train the network. For the supervised pairwise loss in Eq. (1), we mainly use the DSH loss and the DPSH loss, and name the resulting methods PTS3H-DSH and PTS3H-DPSH respectively. SGD with momentum 0.9 is used for optimization, and the initial learning rate of the last layer is which is ten times larger than that of the lower layers. The hyper-parameters differ across datasets and are selected with a validation set, which we obtain by randomly holding out part of the training data. For CIFAR-10, we use with DSH loss and with DPSH loss; for Nuswide, we use with DSH loss and with DPSH loss; for ImageNet-100, we use with DSH loss and for DPSH loss. Following [26], we set , and the ratio of the number of unlabeled data pairs to labeled data pairs within a mini-batch is 15. The image perturbation strategy includes random resizing, random cropping, random horizontal flipping, etc. The training is done on a server with two Intel(R) Xeon(R) E5-2683 CPUs, 256GB RAM and a GeForce GTX TITAN Pascal GPU with 12GB memory. We train for 60 epochs on CIFAR-10, 20 epochs on Nuswide, and 240 epochs on ImageNet-100. For simplicity, we apply center-cropped input and use the teacher network to generate hashcodes; Section 4.3 shows that there is little difference between the codes generated by the teacher and the student.
Similar to [30, 4], for each retrieval dataset we report the results in terms of mean average precision (MAP), precision within Hamming distance 2, and precision of the top returned candidates. We calculate the MAP value within the top 5000 returned neighbors for Nuswide and the top 1000 for ImageNet-100, and report the MAP over all retrieved samples on CIFAR-10. Ground truth is defined by whether two candidates are similar. We run each experiment five times and report the average result.
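The two main metrics can be sketched for a single query as follows. MAP is the mean of the per-query average precision. Conventions for MAP within the top k differ between papers; this sketch assumes normalization by the number of relevant items actually retrieved, which may not match the evaluation code behind the reported numbers. The function names are hypothetical.

```python
def hamming(a, b):
    """Number of differing positions between two equal-length codes."""
    return sum(1 for x, y in zip(a, b) if x != y)

def average_precision(query, database, relevant, top_k=None):
    """Average precision of a Hamming ranking; relevant[i] marks similar items."""
    order = sorted(range(len(database)), key=lambda i: hamming(query, database[i]))
    if top_k is not None:
        order = order[:top_k]
    hits, precisions = 0, []
    for rank, idx in enumerate(order, start=1):
        if relevant[idx]:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / hits if hits else 0.0

def precision_within_radius(query, database, relevant, radius=2):
    """Share of relevant items among those within the given Hamming radius."""
    found = [i for i in range(len(database)) if hamming(query, database[i]) <= radius]
    if not found:
        return 0.0
    return sum(1 for i in found if relevant[i]) / len(found)

# Usage on a toy database of four 4-bit codes.
query = [1, 1, 1, 1]
database = [[1, 1, 1, 1], [1, 1, 1, -1], [-1, -1, -1, -1], [1, 1, -1, -1]]
relevant = [True, False, True, True]
ap = average_precision(query, database, relevant)
p2 = precision_within_radius(query, database, relevant)
```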
4.2 Results
We compare our PTS3H method with recent state-of-the-art deep hashing methods, including SSDH [32] and BGDH [31]. We do not take DSH-GANs [23] into consideration, as it utilizes the label of each data point. Results of other supervised hashing methods, such as DSH [17], DPSH [15], DSDH [14], DISH [34], DMDH [5] and MIHash [2], are also reported for comparison. They follow similar settings, and the network used is either VGG-F or AlexNet, which share similar architectures. Table 1 shows that DSH and DPSH are good supervised hashing algorithms; we therefore regard the DSH and DPSH losses as the baselines of PTS3H-DSH and PTS3H-DPSH respectively. We report the supervised baselines so that the relative gains of PTS3H can also be taken into consideration.
Retrieval results of different methods are shown in Table 1 and Figure 2. We re-implement the DSH and DPSH algorithms for all the datasets, and most results of the two baselines are better than previously reported. Note that the settings for ImageNet-100 are the same as those in [4]. With the network structure and the training loss fixed, the proposed PTS3H algorithm outperforms the baselines by about 1 to 5 percentage points on MAP and on precision within Hamming distance 2, which indicates that the proposed semi-supervised setting is able to capture more semantic information from unlabeled data. Moreover, when a proper supervised baseline is selected (DSH for CIFAR-10 and ImageNet-100, DPSH for Nuswide), our semi-supervised algorithm achieves better retrieval performance than the compared semi-supervised methods at most code lengths, showing the effectiveness of the proposed teacher-student architecture.
It should be noted that the classification performance of VGG-F is slightly better than that of AlexNet, so the hashing performance is not expected to decrease, and may even improve, if AlexNet is replaced with VGG-F. Moreover, the chosen baselines are widely used but are not the state of the art, so better results are expected if state-of-the-art supervised hashing methods [5] are adopted.
4.3 Ablation Study
Variants of PTS3H. In order to verify the effectiveness of our PTS3H method, several variants are also considered. First, we set the weight of the quantized similarity loss to zero to show the effectiveness of the consistent similarity loss; this variant is named PTS3H-P. Then we remove the consistent similarity loss to show the effectiveness of the quantized similarity loss; this variant is denoted PTS3H-Q. The hyper-parameters of the variants are determined with the validation set. Retrieval results are shown in Table 2. The consistent similarity loss accounts for about 70% of the performance gain, as it produces consistent similarities for smooth data pairs. The quantized similarity loss also improves performance, as it models the pairwise similarities of perturbed inputs with global information. It should be noted that the quantized similarity loss brings little gain in MAP on the Nuswide dataset, as the distribution of similar pairs underlying this dataset is somewhat complicated. Better results may be achieved with a better similarity construction strategy.
The Teacher vs. the Student. We denote by PTS3H and PTS3H-S the hashcodes generated by the teacher and the student, respectively. Table 2 shows the retrieval results of PTS3H and PTS3H-S. The performances are almost the same, so either the teacher or the student can be used. As the student converges during training, the teacher becomes similar to the student by the end of training. Nevertheless, the parameters of the teacher and the student are quite different during training. As the teacher is an ensemble of the student, the representations generated by the teacher are expected to contain more semantic information than those of the student at most training stages [26], guiding the student to generate better codes.
4.4 Sensitivity to Parameters
In this section, the influence of different settings of the proposed PTS3H is evaluated. The code length is 48 and we use the DSH loss for evaluation. We do not report the influence of as it has been discussed in the original papers [17, 15].
Influence of Figure 3(a)(b) shows the performance for different values of . It can be seen clearly that setting a certain achieves better hashing performance. This indicates that a proper consistency weight leads to better semi-supervised training.
Influence of Figure 3(c)(d) shows the performance for different values of . Note that a proper is set for each . There is some improvement for a proper , especially in the precision within Hamming distance 2 on the Nuswide dataset. As with , a proper should be set for better performance.
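For readers unfamiliar with the metric mentioned above, the following is a small sketch of precision within Hamming distance 2. It assumes codes in {-1, +1} and counts a query that retrieves nothing as zero precision, which is a common convention but an assumption here; the function name `precision_within_radius` and the toy data are hypothetical.

```python
import numpy as np

def precision_within_radius(query_codes, db_codes, relevant, radius=2):
    """Mean precision of database items within `radius` Hamming distance.

    query_codes: (n_query, n_bits) array with entries in {-1, +1}
    db_codes:    (n_db, n_bits) array with entries in {-1, +1}
    relevant:    (n_query, n_db) boolean array, True if the pair is similar
    """
    n_bits = query_codes.shape[1]
    # For +/-1 codes, Hamming distance = (n_bits - inner product) / 2.
    dist = (n_bits - query_codes @ db_codes.T) / 2
    hits = dist <= radius
    retrieved = hits.sum(axis=1)
    correct = (hits & relevant).sum(axis=1)
    # Queries with an empty Hamming ball contribute zero precision.
    precision = np.where(retrieved > 0, correct / np.maximum(retrieved, 1), 0.0)
    return float(precision.mean())

# One query, four database items at Hamming distances 0, 2, 4 and 1.
q = np.array([[1, 1, 1, 1]])
db = np.array([[1, 1, 1, 1], [1, 1, -1, -1], [-1, -1, -1, -1], [1, 1, 1, -1]])
rel = np.array([[True, False, True, True]])
p2 = precision_within_radius(q, db, rel)
print(p2)
```

In the toy data, three items fall inside the radius-2 ball and two of them are relevant, so the metric rewards codes that keep dissimilar items out of the immediate Hamming neighborhood.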
Influence of As discussed in Sec. 3.2, the is set dynamically such that the ratio of pseudo similar pairs of unlabeled data is constant. Figure 4 shows the performance for different ratio values and the variation of during training. Performance is clearly not sensitive to the ratio of pseudo similar pairs , so this parameter can be set freely.
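One simple way to keep the ratio of pseudo similar pairs constant is to pick the threshold as a quantile of the current pairwise similarities. The sketch below illustrates that idea only; it is not the procedure of Sec. 3.2. The similarity definition (normalized inner product of relaxed codes), the random stand-in codes, and the name `dynamic_threshold` are all assumptions made for illustration.

```python
import numpy as np

def dynamic_threshold(similarities, ratio):
    """Threshold such that roughly `ratio` of the similarities lie at or above it."""
    return float(np.quantile(similarities, 1.0 - ratio))

rng = np.random.default_rng(0)
# Hypothetical relaxed teacher codes for 64 unlabeled samples, 48 bits each.
codes = np.tanh(rng.normal(size=(64, 48)))
sims = codes @ codes.T / codes.shape[1]

# Keep each unordered pair once (upper triangle, no diagonal).
rows, cols = np.triu_indices(codes.shape[0], k=1)
pair_sims = sims[rows, cols]

thr = dynamic_threshold(pair_sims, ratio=0.1)
pseudo_similar = pair_sims >= thr
frac = float(pseudo_similar.mean())
print(thr, frac)
```

Because the threshold is recomputed from the current similarities, its value drifts as the codes change during training while the fraction of pseudo similar pairs stays near the chosen ratio.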
5 Conclusion and Future Work
In this paper, we propose a novel semi-supervised hashing algorithm named PTS3H for the setting in which pairwise supervision and abundant unlabeled data are provided. The proposed PTS3H is a teacher-student network architecture carefully designed for labeled and unlabeled pairs. We propose a general consistent pairwise loss in which the pairwise information generated by the teacher network guides the training of the student. There are two types of losses: the consistent similarity loss models pairwise information locally, and the quantized similarity loss models the information globally by quantizing the similarities between samples. This procedure aims at generating similar retrieval results for neighboring queries. Experiments show that the proposed PTS3H achieves a large improvement over the baselines, and that it is superior or comparable to state-of-the-art semi-supervised hashing algorithms.
Note that we achieve good hashing results using popular pairwise loss baselines. As the proposed PTS3H algorithm is a general framework for semi-supervised hashing, better retrieval performance can be expected from incorporating state-of-the-art supervised hashing algorithms with pairwise supervision.
References
- [1] M. Belkin, P. Niyogi, and V. Sindhwani. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. Journal of Machine Learning Research, 7(Nov):2399–2434, 2006.
- [2] F. Cakir, K. He, S. A. Bargal, and S. Sclaroff. Hashing with mutual information. arXiv preprint arXiv:1803.00974, 2018.
- [3] Y. Cao, M. Long, J. Wang, H. Zhu, and Q. Wen. Deep quantization network for efficient image retrieval. In AAAI, pages 3457–3463, 2016.
- [4] Z. Cao, M. Long, J. Wang, and P. S. Yu. Hashnet: Deep learning to hash by continuation. In ICCV, pages 5609–5618, 2017.
- [5] Z. Chen, X. Yuan, J. Lu, Q. Tian, and J. Zhou. Deep hashing via discrepancy minimization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6838–6847, 2018.
- [6] A. Gionis, P. Indyk, R. Motwani, et al. Similarity search in high dimensions via hashing. In VLDB, volume 99, pages 518–529, 1999.
- [7] Y. Gong, S. Lazebnik, A. Gordo, and F. Perronnin. Iterative quantization: A procrustean approach to learning binary codes for large-scale image retrieval. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 35(12):2916–2929, 2013.
- [8] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
