Feature Pyramid Hashing
Yifan Yang, Libing Geng, Hanjiang Lai, Yan Pan, Jian Yin

TL;DR
This paper introduces a two-pyramid hashing architecture that combines high-level semantic features with low-level details for improved fine-grained image retrieval, outperforming existing methods.
Contribution
It proposes a novel two-pyramid structure and a consensus fusion strategy to effectively capture both semantic and subtle appearance details in deep hashing.
Findings
Significant improvement over state-of-the-art on CUB-200-2011 dataset.
Effective capture of subtle differences enhances fine-grained retrieval.
Demonstrates the benefit of combining high and low-layer features.
Table 1. Division of ResNet18 into stages.
| stage | name in ResNet (He et al., 2016a) | output size | remarks |
|---|---|---|---|
| 0 | conv1 | , 64, stride 2 | |
| 1 | conv2_x | max pool, stride 2 | |
| 2 | | | |
| 2 | conv3_x | 2 | |
| 3 | conv4_x | 2 | |
| 4 | conv5_x | 2 | |
Table 2. MAP on CUB-200-2011 and Stanford Dogs for different hash code lengths.
| Methods | CUB-200-2011, 16 bits | CUB-200-2011, 32 bits | CUB-200-2011, 48 bits | CUB-200-2011, 64 bits | Stanford Dogs, 16 bits | Stanford Dogs, 32 bits | Stanford Dogs, 48 bits | Stanford Dogs, 64 bits |
|---|---|---|---|---|---|---|---|---|
| Ours | 0.5169 | 0.5832 | 0.6124 | 0.6233 | 0.6340 | 0.6909 | 0.7060 | 0.7130 |
| DTH (Lai et al., 2015) | 0.4641 | 0.5454 | 0.5771 | 0.5881 | 0.5435 | 0.6258 | 0.6362 | 0.6573 |
| DSH (Liu et al., 2016) | 0.3156 | 0.4930 | 0.5408 | 0.5967 | 0.4728 | 0.5587 | 0.6128 | 0.6319 |
| HashNet (Cao et al., 2017) | 0.3791 | 0.4628 | 0.4853 | 0.5123 | 0.4745 | 0.5521 | 0.5575 | 0.5934 |
| DPSH (Li et al., 2016) | 0.3497 | 0.4301 | 0.4908 | 0.5225 | 0.4270 | 0.5528 | 0.6080 | 0.6231 |
| CCA-ITQ | 0.1142 | 0.1580 | 0.1813 | 0.1986 | 0.2632 | 0.3681 | 0.4175 | 0.4402 |
| MLH | 0.0915 | 0.1289 | 0.1281 | 0.1983 | 0.2735 | 0.3531 | 0.3831 | 0.4084 |
| ITQ | 0.0637 | 0.0907 | 0.1048 | 0.1129 | 0.2023 | 0.2838 | 0.3123 | 0.3248 |
| SH | 0.0453 | 0.0595 | 0.0643 | 0.0686 | 0.1362 | 0.1628 | 0.1859 | 0.1832 |
| LSH | 0.0162 | 0.0234 | 0.0302 | 0.0340 | 0.0297 | 0.0517 | 0.0640 | 0.0850 |
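Table 3. MAP on Oxford Flower-17 and Stanford Dogs for different hash code lengths.
| Methods | Oxford Flower-17, 16 bits | Oxford Flower-17, 32 bits | Oxford Flower-17, 48 bits | Oxford Flower-17, 64 bits | Stanford Dogs, 16 bits | Stanford Dogs, 32 bits | Stanford Dogs, 48 bits | Stanford Dogs, 64 bits |
|---|---|---|---|---|---|---|---|---|
| Ours | 0.9542 | 0.9653 | 0.9691 | 0.9783 | 0.6224 | 0.6688 | 0.6924 | 0.6974 |
| DaSH (Jin, 2018) | 0.9225 | 0.9267 | 0.9692 | 0.9756 | 0.3976 | 0.5283 | 0.5950 | 0.6452 |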
Feature Pyramid Hashing
Yifan Yang, Libing Geng, Hanjiang Lai, Yan Pan, and Jian Yin
School of Data and Computer Science, Sun Yat-Sen University, Guangzhou, China
(2019)
Abstract.
In recent years, deep-network-based hashing has become a leading approach for large-scale image retrieval. Most deep hashing approaches use the high layer to extract powerful semantic representations. However, these methods have limited ability for fine-grained image retrieval because the semantic features extracted from the high layer have difficulty capturing subtle differences. To this end, we propose a novel two-pyramid hashing architecture that learns both the semantic information and the subtle appearance details for fine-grained image search. Inspired by the feature pyramids of convolutional neural networks, a vertical pyramid is proposed to capture the high-layer features, and a horizontal pyramid combines multiple low-layer features with structural information to capture the subtle differences. To fuse the low-level features, a novel combination strategy, called consensus fusion, is proposed to capture all subtle information from several low layers for finer retrieval. Extensive evaluation on two fine-grained datasets, CUB-200-2011 and Stanford Dogs, demonstrates that the proposed method achieves significant performance gains over state-of-the-art baselines.
Image retrieval, Deep Hashing, Feature Pyramid
Conference: ACM International Conference on Multimedia Retrieval; July 2019; El Paso, Texas USA. CCS concepts: Information systems, Image search.
1. Introduction
With the rapid development of the internet, the number of images is growing quickly. Image retrieval has attracted increasing interest, especially for large-scale image databases with millions to billions of images. Hashing methods, which encode data into binary codes, have been widely studied because of their efficiency in both storage and computation. In this paper, we focus on deep hashing for fine-grained image retrieval.
Much effort has been devoted to deep-network-based hashing for large-scale image retrieval (e.g., (Wang et al., 2018; Cao et al., 2018)). These approaches use deep networks to learn similarity-preserving hash functions, so that similar images are encoded to nearby hash codes. Xia et al. (Xia et al., 2014) first presented a two-stage method for learning good image representations and hash functions. Later, Lai et al. (Lai et al., 2015) and Zhuang et al. (Zhuang et al., 2016) proposed using the triplet ranking loss to preserve similarities. Deep pairwise methods have also been proposed to learn hash functions, e.g., DPSH (Li et al., 2015) and DSH (Liu et al., 2016). Recently, generative adversarial networks have attracted much attention for image retrieval, e.g., (Lin et al., 2018; Zhang et al., 2018).
However, most existing deep hashing methods are designed for coarse-grained datasets, e.g., CIFAR-10 and NUS-WIDE. For coarse-grained datasets, the most important thing is to find the semantic differences between images from different categories, as shown in Figure 1. Since the high layers of a CNN tend to extract semantic information (Yu et al., 2018), most deep hashing methods use the highest layer, e.g., fc7, to extract powerful image representations, and they show impressive performance on coarse-grained databases. For fine-grained objects, however, semantic information alone may not be enough. Taking the two images from CUB-200-2011 in Figure 1 as an example, they are indistinguishable when only high-level features are used. The differences between fine-grained objects lie in subtle appearance details, such as a small part of the tail or the beak. For deep hashing on fine-grained data, the problem therefore becomes how to capture subtle details and embed them into hash codes.
Feature pyramids (Lin et al., 2017; Chen et al., 2018), which improve performance by using different layers of a convolutional network, have become a popular approach. U-Net (Ronneberger et al., 2015) uses a contracting path to associate low-level features with the symmetric high-level features. Feature Pyramid Network (FPN) (Lin et al., 2017) uses the inherent multi-scale pyramidal hierarchy of deep convolutional networks for object detection and obtains outstanding accuracy on small-scale objects. Kong et al. (Kong et al., 2018) proposed a reconfiguration architecture that combines low-level and high-level features in a non-linear way. Despite this success, one problem has not been studied for hashing: how to encode feature pyramids into efficient binary codes. Recently, Zhao et al. (Zhao et al., 2017) proposed spatial pyramid deep hashing for large-scale image retrieval, and Jin (Jin, 2018) used an attention mechanism to learn hash codes for fine-grained data. However, these methods apply attention or spatial pyramid pooling to the high layer and do not consider that objects have multiple scales. Hence, how to combine low-level features and encode subtle appearance differences into binary codes deserves further exploration.
In this paper, we propose a simple yet efficient two-pyramid architecture that uses pyramidal features to compose hash codes. It substantially improves the performance of deep hashing for fine-grained image retrieval. Specifically, as shown in Figure 2, our architecture has two pyramids: a vertical pyramid and a horizontal pyramid. 1) The vertical pyramid aims to capture the semantic differences between fine-grained objects. It first extracts features from the input image with a sub-network of stacked convolution layers, then applies an average pooling layer to the top feature map, followed by a fully connected layer with sigmoid activation, to learn the hash code. On top of the hash code, we use a triplet ranking loss to preserve relative similarities among the input images and to maintain the hashing capability throughout the network holistically. 2) The horizontal pyramid aims to capture subtle details and encode them into binary codes via consensus learning. As shown in Figure 2 (b), the horizontal pyramid first uses the feature maps from different stages of the sub-network to generate hash features that capture the objects at different scales. A consensus fusion mechanism then encodes all these multi-scale features into one powerful hash code. This mechanism, which is composed of the two modules in Figure 2 (b), includes average pooling layers, fully connected layers and addition layers. Finally, we again employ the triplet ranking loss so that the codes carrying subtle information are similarity-preserving.
The main contributions of our work are as follows. First, this is one of the first attempts to use a ConvNet's pyramidal feature hierarchy to compose hash codes for fine-grained retrieval. The proposed method captures not only the semantic differences but also the subtle differences between fine-grained objects. Second, we propose a consensus fusion mechanism to encode all subtle details into the binary codes. Finally, our architecture obtains significant improvements on two fine-grained datasets in comparison with several state-of-the-art methods.
2. Related Work
Hashing methods (Wang et al., 2018), which learn similarity-preserving hash functions to encode data into binary codes, have become a popular approach. Existing methods can be divided into three main categories: unsupervised, semi-supervised and supervised methods.
Unsupervised methods (Lin et al., 2016) attempt to learn similarity-preserving hash functions from unlabeled data during training. ITerative Quantization (ITQ) (Gong et al., 2013), Anchor Graph Hashing (AGH) (Liu et al., 2011), Kernelized LSH (KLSH) (Kulis and Darrell, 2009), Spectral Hashing (SH) (Weiss et al., 2009) and semantic hashing (Salakhutdinov and Hinton, 2007) are representative methods. More recently, an unsupervised deep hashing approach named DeepBit was proposed by Lin et al. (Lin et al., 2016). It learns binary codes that satisfy three criteria: minimal quantization loss, evenly distributed codes and uncorrelated bits. Further, similarity adaptive deep hashing (SADH) (Shen et al., 2018) was proposed, which alternates over three modules: deep hash model training, similarity graph updating and binary code optimization.
Semi-supervised methods make use of both labeled data and abundant unlabeled data to learn better hash functions. One representative work is Semi-Supervised Hashing (SSH) (Wang et al., 2010a), which regularizes the hash functions over the labeled and the unlabeled data. Sequential Projection Learning for Hashing (SPLH) (Wang et al., 2010b) learns the hash functions in sequence. Wu et al. (Wu et al., 2013) proposed bootstrap sequential projection learning for nonlinear hashing (Bootstrap-NSPLH). DSH-GAN (Qiu et al., 2017) is a deep architecture that contains a semi-supervised GAN to produce synthetic images and a deep semantic hashing network trained with real-synthetic triplets to learn hash functions.
Supervised methods (Lin et al., 2013) (Lai et al., 2015) seek to use supervised information, e.g., pairwise similarities or relative similarities of images, to learn better bit-wise representations. For example, Minimal Loss Hashing (MLH) (Norouzi and Blei, 2011) uses structural SVMs with latent variables to encode images. Kernel-based Supervised Hashing (KSH) (Liu et al., 2012) learns hash functions by minimizing the Hamming distance of similar pairs and maximizing that of dissimilar pairs. Binary Reconstruction Embedding (BRE) (Kulis and Darrell, 2009) tries to minimize the reconstruction error between the Hamming distances of the learned binary codes and the original distances of the data points. The ranking preserving hashing approach (Wang et al., 2015) directly optimizes the NDCG measure.
In recent years, inspired by the significant achievements of deep neural networks, learning hash codes with deep neural networks (deep hashing) has become a new stream of supervised hashing methods. For example, Lai et al. (Lai et al., 2015) proposed a deep triplet-based loss function for supervised hashing. DPSH (Li et al., 2015) is a deep hashing method that performs simultaneous feature learning and hash code learning with pairwise labels. DSH (Liu et al., 2016) speeds up the training of the network by adding a regularization term to the loss function instead of using an activation function. HashNet (Cao et al., 2017) uses a weighted pairwise loss to maximize the likelihood function and applies a weighted attenuation factor to the activation function, which reduces the semantic loss caused by the feature-to-hash-code mapping. SPDH-SPBPM (Zhao et al., 2017) divides the feature map of the last convolutional layer into several sets of spatial parts. However, these methods are designed for coarse-grained datasets. Few works (Jin, 2018) have been proposed for fine-grained image retrieval. Unlike these existing fine-grained hashing methods, which use high-layer features, we combine both low-level and high-level features in our framework.
Feature pyramids have achieved great success in many vision tasks. For example, FPN (Lin et al., 2017) adds the feature maps of the highest layer to the feature maps of several lower layers and then performs object detection on each layer. Unlike existing feature pyramid methods, our goal is to generate hash codes. We do not generate hash codes directly from one layer but use multi-level features, and a consensus fusion is proposed to combine them.
3. The Proposed Approach
The task of learning-based hashing for images is to learn a mapping function from the image space to binary codes, such that each input image is mapped to a binary code with a fixed number of bits and the similarities among images are preserved in the Hamming space.
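To make "preserved in the Hamming space" concrete, the following minimal sketch computes the Hamming distance between binary codes. The 8-bit codes are made-up illustrative values, not outputs of the proposed network.

```python
def hamming_distance(code_a, code_b):
    """Count the bit positions in which two equal-length binary codes differ."""
    if len(code_a) != len(code_b):
        raise ValueError("codes must have the same length")
    return sum(bit_a != bit_b for bit_a, bit_b in zip(code_a, code_b))


# Hypothetical 8-bit codes: a query, a visually similar image, a dissimilar one.
query = [1, 0, 1, 1, 0, 0, 1, 0]
similar = [1, 0, 1, 0, 0, 0, 1, 0]
dissimilar = [0, 1, 0, 0, 1, 1, 1, 0]

print(hamming_distance(query, similar))     # 1
print(hamming_distance(query, dissimilar))  # 6
```

A similarity-preserving hash function should place similar images at a small Hamming distance and dissimilar images at a large one, as in this toy example.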
In this paper, we propose a deep architecture for learning hash codes. As shown in Figure 2, the proposed architecture has two pyramids, one in the vertical orientation and one in the horizontal orientation. The vertical pyramid extracts features from raw images and maintains the hashing capability throughout the network holistically. The horizontal pyramid leverages pyramidal features from different stages of the CNN to learn hash features and then aggregates them into the final binary code for retrieval.
3.1. Vertical Pyramid
The vertical pyramid contains two components: (1) the feature learning module, with stacked convolution layers, to capture effective features of an input image; (2) the hashing module, to maintain the hashing capability throughout the network holistically. In the following, we present the details of these components.
3.1.1. Feature Learning Module
As shown in Figure 2(a), we use a convolutional sub-network with multiple convolution layers as the feature learning module to capture a discriminative feature representation of the input images. The feature learning module is based on the architecture of ResNets (He et al., 2016a), which computes a feature hierarchy consisting of feature maps at several scales with a scaling step of 2. There are often many layers producing feature maps of the same size, and we say that these layers are in the same network stage. For a clearer description of the network stages, Table 1 shows the detailed division of ResNet18 into stages, which follows the division suggested in FPN (Lin et al., 2017).
We define one pyramid level for each stage. The output of the last layer of each stage will serve as the output of each stage and play a role as the side-output in the vertical pyramid. In training, we use the pre-trained ResNet (He et al., 2016a) model to initialize the weights in this sub-network. We denote the whole Module as and for the output of of the input image . In order to describe the side outputs feature of the sub-network, let’s denote the output feature map of the as , respectively. Specially, .
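The "scaling step of 2" between stages can be illustrated with a small sketch. It assumes a 224×224 input (an assumption; the input size is not recoverable here) and uses the standard ResNet18 channel widths. The helper name `stage_output_sizes` is hypothetical.

```python
def stage_output_sizes(input_size, num_stages=5):
    """Spatial side length after each stage, assuming every stage halves
    the resolution (stride-2 convolution or pooling with padding)."""
    sizes = []
    size = input_size
    for _ in range(num_stages):
        size = (size + 1) // 2  # ceiling division models a padded stride-2 step
        sizes.append(size)
    return sizes


# Standard ResNet18 channel widths for conv1, conv2_x, ..., conv5_x.
RESNET18_CHANNELS = [64, 64, 128, 256, 512]

for stage, (channels, side) in enumerate(
    zip(RESNET18_CHANNELS, stage_output_sizes(224))
):
    print(f"stage {stage}: {channels} x {side} x {side}")
```

Under these assumptions the lower stages have far more spatial elements than stage 4, which is why the horizontal pyramid pools them before any fully connected layer.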
3.1.2. Hashing Module
Based on our prior knowledge and cross-validation results, the feature map of the last stage (stage 4) carries the highest-level semantics and, when quantized into a binary hash code, achieves better retrieval performance than the features of the other stages. Accordingly, to learn powerful image representations, we use this feature map and the hashing module in training. Following the common deep learning-to-hash setting, we add a fully connected layer with a sigmoid activation function on top of this feature map. Specifically, taking the output vector of the fully connected layer as the hash feature, one can obtain the hash code by:
[TABLE]
where , and is -dimensional hash feature.
[TABLE]
where the hash code has the same dimension as the hash feature, and each of its elements lies in the range [0, 1].
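The pipeline described above (average pooling, a fully connected layer, then a sigmoid) can be sketched in plain Python. This is a minimal illustration under assumptions: the pooling is taken to be global, and the toy sizes, random weights and function names are hypothetical rather than the authors' implementation.

```python
import math
import random


def global_average_pool(feature_map):
    """Average each channel (an H x W grid) down to a single value."""
    return [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_map
    ]


def fully_connected(x, weights, bias):
    """Affine map producing one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def sigmoid(values):
    """Element-wise logistic function; every output lies strictly in (0, 1)."""
    return [1.0 / (1.0 + math.exp(-v)) for v in values]


def hashing_module(feature_map, weights, bias):
    """Pooled feature -> hash feature -> real-valued hash code."""
    hash_feature = fully_connected(global_average_pool(feature_map), weights, bias)
    return sigmoid(hash_feature)


# Toy sizes (assumed): 8 channels of 2 x 2 activations and a 4-bit code.
rng = random.Random(0)
channels, height, width, code_length = 8, 2, 2, 4
feature_map = [
    [[rng.random() for _ in range(width)] for _ in range(height)]
    for _ in range(channels)
]
weights = [[rng.uniform(-1, 1) for _ in range(channels)] for _ in range(code_length)]
bias = [0.0] * code_length

hash_code = hashing_module(feature_map, weights, bias)
print(hash_code)
```

The sigmoid keeps every element between 0 and 1, so the real-valued code can later be thresholded into bits.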
For ranking-based image retrieval, it is common practice to preserve relative similarities of the form "a query image is more similar to a positive image than to a negative image". To learn hash codes that preserve such relative similarities, the triplet ranking loss has been used in existing hashing methods (Lai et al., 2015). Specifically, for a triplet consisting of a query image, a positive image that is similar to it and a negative image that is less similar, we take the real-valued hash codes of the three images. The triplet ranking loss function is defined as:
[TABLE]
where is the margin parameter depending on the hash code length , is the norm.
Note that the triplet loss in Eq.(9) is designed for single-label data. It can be verified that this triplet ranking loss is convex, which can be easily integrated into the back propagation process of neural networks.
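As a rough illustration of a triplet ranking loss, the sketch below uses a hinge on the difference of squared Euclidean distances, in the style of Lai et al. (2015). The choice of norm, the margin value and the example codes are assumptions, since the exact formula is not recoverable here.

```python
def squared_l2(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_ranking_loss(code, code_pos, code_neg, margin):
    """Hinge loss: zero once the negative is farther than the positive
    by at least `margin` (assumed squared-L2 form)."""
    return max(0.0, margin + squared_l2(code, code_pos) - squared_l2(code, code_neg))


# Hypothetical real-valued codes in [0, 1] for a query, a positive and a negative.
query_code = [0.9, 0.1, 0.8, 0.2]
positive_code = [0.8, 0.2, 0.9, 0.1]
negative_code = [0.1, 0.9, 0.2, 0.8]

# Correctly ordered triplet: the loss vanishes.
print(triplet_ranking_loss(query_code, positive_code, negative_code, margin=1.0))
# Swapping positive and negative violates the ranking and is penalized.
print(triplet_ranking_loss(query_code, negative_code, positive_code, margin=1.0))
```

The loss is zero for triplets that already satisfy the ranking with the required margin, so only violating triplets contribute gradients.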
3.2. Horizontal pyramid
This pyramid consists of three main building blocks: (1) the lateral hashing module, which leverages pyramidal features from different stages of the CNN for hashing; (2) the consensus module, which composes a consensus hash code from the lateral hash features; (3) the binary embedding module, which maps the consensus hash code to a binary hash code.
3.2.1. Lateral Hashing Module
In the proposed architecture for hashing, one of the key components is the lateral hashing module. The goal of this component is to transform a feature map into a hash feature of a specified dimension. As shown in Figure 2, there are three connections between the two pyramids, and the lateral hashing module is built on these lateral connections, so we present the details connection by connection. First, for the middle lateral connection (drawn in black in Figure 2), the feature map of the corresponding stage is used as input to the lateral hashing module. We first apply an average pooling layer to shrink the feature map (i.e., to 2×2 or 4×4) before adding a fully connected layer, because applying a fully connected layer to too many elements would incur a high computational cost. At the same time, shrinking the feature map does not discard the spatial information within each channel, since feature maps of size 2×2 and 4×4 still have 4 and 16 spatial partitions, respectively. Moreover, reducing the size of the feature map does not affect the channel-wise semantic information that we use at different stages. After obtaining a relatively small feature map, we flatten it and feed it into the fully connected layer, which can be formulated as:
[TABLE]
where , , and is -dimensional hash feature.
For the lower position lateral connections, we imitate the pipeline of the higher one:
[TABLE]
where , , and is -dimensional hash feature.
To avoid repeated computation, for the hash feature of stage 4 we directly share the hash feature of the hashing module in the vertical pyramid. The design of the lateral hashing module follows the structure of the hashing module in the vertical pyramid. As the annotations of Equations (1), (4) and (5) show, the feature dimensions in the lateral module follow a specific pattern: both the pooled feature map and the hash feature grow as the stage gets lower. The pooled feature map grows according to the size ratio of the original feature maps. As for the hash feature, a low-stage feature map has many elements, so passing it directly through a fully connected layer to obtain a feature vector of very low dimension may cause excessive semantic loss and easy over-fitting (He et al., 2016b). Therefore, our strategy is that the lower the stage of the feature map, the more feature elements are produced by the fully connected layer.
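The pooling step of the lateral hashing module can be sketched as follows: each channel is average-pooled to a small k×k grid, which keeps k² spatial partitions, and the result is flattened before the fully connected layer. The function names and the toy 4×4 channel are illustrative assumptions.

```python
def average_pool_to_grid(channel, k):
    """Average-pool an H x W channel into a k x k grid of equal blocks.
    Assumes H and W are divisible by k."""
    height, width = len(channel), len(channel[0])
    if height % k or width % k:
        raise ValueError("channel size must be divisible by k")
    block_h, block_w = height // k, width // k
    return [
        [
            sum(
                channel[r][c]
                for r in range(i * block_h, (i + 1) * block_h)
                for c in range(j * block_w, (j + 1) * block_w)
            )
            / (block_h * block_w)
            for j in range(k)
        ]
        for i in range(k)
    ]


def flatten(feature_map):
    """Concatenate all channels of a pooled feature map into one vector."""
    return [value for channel in feature_map for row in channel for value in row]


# One toy 4 x 4 channel holding the values 0..15 row by row.
channel = [[4 * r + c for c in range(4)] for r in range(4)]
pooled = average_pool_to_grid(channel, k=2)
print(pooled)             # [[2.5, 4.5], [10.5, 12.5]]
print(flatten([pooled]))  # [2.5, 4.5, 10.5, 12.5]
```

After pooling, the input to the fully connected layer has only `channels * k * k` elements, while the four quadrants of each channel remain distinguishable.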
3.2.2. Consensus Module
Equipped with the pyramidal hash features, we use two "mediators" to compose a consensus hash code. We first describe the implementation of the mediator on the left-hand side. Specifically, we apply an average pooling layer to the hash feature of the lower stage to compress its dimension to that of the hash feature of the middle stage, and then add the two to obtain a new feature vector as the output of the mediator, which can be formulated as:
[TABLE]
where is -dimensional hash feature.
Since the hash code used for retrieval has a fixed length, an average pooling layer is needed to compress the dimension of the low-stage hash feature. The same strategy applies to the design of the mediator on the right-hand side, which can be formulated as:
[TABLE]
where is -dimensional consensus hash feature.
Because the number of bits in a hash code used for retrieval is limited, we do not fix some bits to the features of a particular stage. Instead, we adopt a fusion method in which every bit of the hash vector is a consensus of several stages. In the two mediators we combine the features of different stages, so that each bit of the final output is not determined by a single layer alone but by the "comments" of several stages, which together produce a consensus on the final hash feature.
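The two mediators can be sketched as a pool-then-add chain over three hash features. The dimensions used below (8, 4 and 2) and the names `mediator` and `consensus_feature` are assumptions for illustration; the actual per-stage dimensions are not recoverable here.

```python
def average_pool_1d(vector, out_dim):
    """Compress a vector to out_dim entries by averaging equal-width windows.
    Assumes len(vector) is divisible by out_dim."""
    if len(vector) % out_dim:
        raise ValueError("input length must be divisible by out_dim")
    window = len(vector) // out_dim
    return [sum(vector[i * window:(i + 1) * window]) / window for i in range(out_dim)]


def mediator(lower_feature, higher_feature):
    """Pool the lower-stage hash feature to the higher-stage size, then add."""
    pooled = average_pool_1d(lower_feature, len(higher_feature))
    return [a + b for a, b in zip(pooled, higher_feature)]


def consensus_feature(low, mid, top):
    """Left mediator fuses low into mid; right mediator fuses the result into top."""
    return mediator(mediator(low, mid), top)


# Hypothetical hash features of decreasing dimension from lower to higher stage.
low = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
mid = [1.0, 1.0, 1.0, 1.0]
top = [0.5, 0.5]

print(mediator(low, mid))                # [2.5, 4.5, 6.5, 8.5]
print(consensus_feature(low, mid, top))  # [4.0, 8.0]
```

Every entry of the final vector mixes contributions from all three stages, which is the sense in which each bit reflects a consensus rather than a single layer.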
With the consensus hash feature, we employ a sigmoid activation layer to restrict each element of the hash feature to the range [0, 1]. The output vector of this layer is the consensus hash code:
[TABLE]
where is -dimensional consensus hash code.
In the end, we again use the triplet ranking loss to preserve the semantic similarity between images, this time with the consensus hash code. As with the triplet ranking loss detailed above, for a triplet of images in which the first is more similar to the second than to the third, we take the real-valued consensus hash codes of the three images, and the loss function in the consensus module can be defined as:
[TABLE]
where is the margin parameter depending on the hash code length , is the norm.
Combination of Loss Functions. During the training phase, we optimize a combination of the above loss functions with stochastic gradient descent, defined by:
[TABLE]
where and are the hash code and consensus hash code corresponding to the triplet images , M is the number of triplets.
3.2.3. Binary Embedding Module
This module mainly works in the test phase. Specifically, for an input image and its consensus hash code , the -bit binary code b can be obtained by:
[TABLE]
where / is the -th element in /, respectively.
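A common way to turn sigmoid outputs into bits is to threshold them at 0.5. The sketch below assumes that rule; the threshold and the function name `binarize` are assumptions, not a statement of the authors' exact formula.

```python
def binarize(consensus_code, threshold=0.5):
    """Map each real-valued element to 1 if it exceeds the threshold, else 0."""
    return [1 if value > threshold else 0 for value in consensus_code]


# Hypothetical 4-dimensional consensus hash code with elements in [0, 1].
print(binarize([0.9, 0.2, 0.5, 0.51]))  # [1, 0, 0, 1]
```

This step is only needed at test time, since training operates on the real-valued codes.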
4. Experiments
4.1. Datasets
We conduct extensive evaluations of the proposed method and compare it with state-of-the-art baselines on two fine-grained datasets:
- CUB-200-2011 (http://www.vision.caltech.edu/visipedia/CUB-200-2011.html): an extended version of CUB-200 (Welinder et al., 2010), a challenging dataset that pushes the limits of visual abilities for both humans and computers. It consists of 11,788 images of birds in 200 classes.
- Stanford Dogs (http://vision.stanford.edu/aditya86/ImageNetDogs/): a challenging, large-scale dataset for fine-grained image tasks that includes 20,580 annotated images of dogs belonging to 120 species. This dataset is extremely challenging for two reasons: first, there is little inter-class variation; second, there is very large intra-class variation.
In CUB-200-2011, we use the official split: the 5,794 test images form the test query set, and the 5,994 training images are used both to train the hash models and as the retrieval database. In Stanford Dogs, we also apply the official split, which divides the 20,580 annotated images into 12,000 training samples and 8,580 testing samples. All the training samples also serve as the retrieval database in addition to training the network. For a fair comparison, all compared methods use identical training/test sets and retrieval databases.
4.2. Evaluation Metrics
To measure the performance of hashing, we use four evaluation metrics: mean average precision (MAP), precision-recall curves, precision curves within Hamming radius 3, and precision curves w.r.t. different numbers of top returned samples. MAP is a widely used evaluation measure for ranking. The average precision of a query image can be defined as:
[TABLE]
where is the number of images in the retrieval database. The is an indicator function, in which if the image at position is positive, then , otherwise . The is the number of relevant images within the top images and represents the total number of relevant images w.r.t the -th query image. For all query images, the MAP is defined as:
[TABLE]
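The standard definitions of average precision and MAP that match the description above can be sketched as follows. Each ranking is a list of 0/1 relevance flags ordered by increasing Hamming distance to the query; the example rankings are made up.

```python
def average_precision(ranked_relevance):
    """AP of one query: mean of precision-at-j over the positions j that hold
    a relevant image, given relevance flags for the full ranked database."""
    total_relevant = sum(ranked_relevance)
    if total_relevant == 0:
        return 0.0
    hits = 0
    score = 0.0
    for position, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1                 # relevant images within the top `position`
            score += hits / position  # precision at this position
    return score / total_relevant


def mean_average_precision(rankings):
    """MAP: average of the per-query AP values."""
    return sum(average_precision(r) for r in rankings) / len(rankings)


# Two hypothetical queries over tiny databases.
rankings = [[1, 0, 1, 0], [0, 1]]
print(average_precision(rankings[0]))    # (1/1 + 2/3) / 2
print(mean_average_precision(rankings))  # (5/6 + 1/2) / 2
```

Because the flags cover the whole ranked database, their sum equals the total number of relevant images for the query.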
4.3. Settings and Implementation Details
The experiments for our proposed method are completed with the open source PyTorch (Paszke et al., 2017) framework on a GeForce GTX TITAN X server.
For a fair comparison, all deep CNN-based methods, including ours and the previous baselines, are based on the same CNN architecture, i.e., ResNet (He et al., 2016a). Specifically, we remove the last fully connected layer, since it is designed for 1,000-class classification, so that the rest of the ResNet acts as the convolutional sub-network in our architecture. The weights of the convolutional sub-network are initialized with the pre-trained ResNet model (https://download.pytorch.org/models/resnet18-5c106cde.pth) learned on the ImageNet dataset. For the non-deep-network-based methods, we use the pre-trained ResNet model to extract features from raw images and use these features as input, i.e., the 512-dimensional vector output by the last layer of ResNet after removing the fully connected layer from the pre-trained model.
For all methods, we resize all images to a fixed size and use the raw image pixels as input. During the training phase, the training samples are divided into mini-batches of size 100 before being fed to the network. The proposed architecture is trained by stochastic gradient descent with 0.9 momentum and 0.0005 weight decay, because the triplet ranking loss function is a strictly convex function. The base learning rate is 0.001 and the step size is 1800, which means the learning rate is divided by 10 every 1800 epochs; the total number of epochs is 4000.
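The step schedule described above (base learning rate 0.001, divided by 10 every 1800 epochs, 4000 epochs in total) can be written as a small function. The name `learning_rate` is a hypothetical helper, not a framework API.

```python
def learning_rate(epoch, base_lr=0.001, step_size=1800, gamma=0.1):
    """Step decay: multiply the base rate by gamma once per completed step."""
    return base_lr * gamma ** (epoch // step_size)


# Learning rate at the start, around each boundary, and at the final epoch.
for epoch in (0, 1799, 1800, 3599, 3600, 3999):
    print(epoch, learning_rate(epoch))
```

With 4000 epochs in total, the rate therefore takes three values over training: 0.001, then roughly 0.0001 from epoch 1800, then roughly 0.00001 from epoch 3600.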
We compare the proposed method with several state-of-the-art learning-based hashing methods, which can be roughly divided into two categories:
- Conventional hashing methods: ITQ (Gong et al., 2013) reduces the quantization error by learning an orthogonal rotation matrix; CCA-ITQ (Yunchao et al., 2013), an extension of ITQ, uses label information to find better projections for the image descriptors; LSH (Gionis et al., 1999) uses random projections to produce hashing bits; Spectral Hashing (SH) (Weiss et al., 2009) tries to minimize the weighted Hamming distance of image pairs, where the weights are defined by the similarity of the image pairs; MLH (Norouzi and Blei, 2011) uses structural SVMs with latent variables to encode images.
- Deep-network-based hashing methods: DTH (Lai et al., 2015) proposes a deep triplet-based loss function for supervised hashing. DPSH (Li et al., 2015) performs simultaneous feature learning and hash code learning for applications with pairwise labels. DSH (Liu et al., 2016) speeds up the training of the network by adding a regularization term to the loss function instead of using an activation function, and employs a pairwise loss function with a margin to preserve the similarity of images. HashNet (Cao et al., 2017) uses a weighted pairwise loss to maximize the WML likelihood function and applies a weighted attenuation factor to the activation function, thereby reducing the semantic loss caused by the feature-to-hash-code mapping. Additionally, as a hashing method designed specifically for fine-grained data, deep saliency hashing (DSaH) (Jin, 2018) uses an attention mechanism to learn the hash codes.
Note that the implementation of DTH in this paper is a variant of (Lai et al., 2015), in which we replace the divide-and-encode module with a fully connected layer with sigmoid activation. The architecture of DTH is then the same as that of the vertical pyramid without side outputs, which makes it convenient to observe the improvement made by the horizontal pyramid.
4.4. Experimental Results
4.4.1. Comparison with State-of-the-art Methods
To illustrate the accuracy of the proposed method, we evaluate and compare our method with several state-of-the-art baselines.
Compared to the other hashing baselines, the proposed method shows substantially better performance. Taking our main evaluation metric, MAP, as an example, Table 2 shows that the proposed method achieves a relative improvement of 5.9% to 11.3% on CUB-200-2011 and 8.4% to 16.6% on Stanford Dogs against DTH, the second-best method overall. In addition, Figure 3 and Figure 4 show that the proposed method performs better than all previous methods at most levels in terms of precision within Hamming radius 3, precision-recall, and precision on 16 bits.
In particular, the proposed method consistently outperforms DTH. Since our implementation of DTH is exactly the vertical pyramid in Figure 2 (a), DTH is equivalent to the proposed method without the horizontal pyramid that uses the pyramidal features to compose the hash code. The superior performance of the proposed method over DTH verifies that composing the hash code from a consensus of pyramidal features improves the performance of deep hashing.
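The relative MAP gains over DTH can be recomputed directly from the values in Table 2. The sketch below copies the "Ours" and DTH rows; the helper name `relative_improvement` is hypothetical.

```python
# MAP values from Table 2 at 16, 32, 48 and 64 bits.
OURS_CUB = [0.5169, 0.5832, 0.6124, 0.6233]
DTH_CUB = [0.4641, 0.5454, 0.5771, 0.5881]
OURS_DOGS = [0.6340, 0.6909, 0.7060, 0.7130]
DTH_DOGS = [0.5435, 0.6258, 0.6362, 0.6573]


def relative_improvement(ours, baseline):
    """Per-code-length relative gain (ours - baseline) / baseline."""
    return [(o - b) / b for o, b in zip(ours, baseline)]


cub_gains = relative_improvement(OURS_CUB, DTH_CUB)
dogs_gains = relative_improvement(OURS_DOGS, DTH_DOGS)

print([f"{100 * g:.2f}%" for g in cub_gains])
print([f"{100 * g:.2f}%" for g in dogs_gains])
```

On both datasets the largest gain occurs at 16 bits and the smallest at 64 bits, so the benefit of the pyramidal features is most visible for short codes.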
4.4.2. Comparison with Hashing Method Specifically Designed for Fine-grained Data
As our approach works primarily for fine-grained data, we also compare it to another approach specifically designed for fine-grained data.
Since the code of DaSH (Jin, 2018) is not publicly available and the method is hard to re-implement, we use the same experimental settings as DaSH for our method. The results of DaSH are cited directly from (Jin, 2018) for a fair comparison. Following the DaSH setting, we also use VGG (Simonyan and Zisserman, 2014) as the base architecture, which has comparable representational ability to ResNet.
In DaSH (Jin, 2018), two datasets are mainly used: the first is Stanford Dogs, described in detail above, and the second is Oxford Flower-17 (http://www.robots.ox.ac.uk/~vgg/data/flowers/17/), which consists of 1,360 images of flowers belonging to 17 mutually exclusive classes. The partition into training set and test set also follows the setting in DaSH.
The MAP results on Oxford Flower-17 and Stanford Dogs are shown in Table 3. They show the superior performance of the proposed method over the other approach specifically designed for fine-grained data.
5. Conclusions
In this paper, we developed a clean and simple two-pyramid architecture that learns both the semantic information and the subtle appearance details of fine-grained objects to improve the performance of deep hashing. The vertical pyramid captures the high-layer features, and the horizontal pyramid combines multiple low-layer features with more structural information to capture the subtle differences. Empirical evaluations on two representative fine-grained image datasets show that the proposed method achieves better deep hashing performance.
References
- Cao et al. (2018) Yue Cao, Mingsheng Long, Bin Liu, and Jianmin Wang. 2018. Deep Cauchy Hashing for Hamming Space Retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1229–1237.
- Cao et al. (2017) Zhangjie Cao, Mingsheng Long, Jianmin Wang, and Philip S. Yu. 2017. HashNet: Deep Learning to Hash by Continuation. In Proceedings of the IEEE International Conference on Computer Vision. 5609–5618.
- Chen et al. (2018) Yilun Chen, Zhicheng Wang, Yuxiang Peng, Zhiqiang Zhang, Gang Yu, and Jian Sun. 2018. Cascaded pyramid network for multi-person pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 7103–7112.
- Gionis et al. (1999) Aristides Gionis, Piotr Indyk, Rajeev Motwani, et al. 1999. Similarity search in high dimensions via hashing. In Proceedings of the International Conference on Very Large Data Bases, Vol. 99. 518–529.
- Gong et al. (2013) Yunchao Gong, Svetlana Lazebnik, Albert Gordo, and Florent Perronnin. 2013. Iterative quantization: A procrustean approach to learning binary codes for large-scale image retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence 35, 12 (2013), 2916–2929.
- He et al. (2016a) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016a. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770–778.
- He et al. (2016b) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016b. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770–778.
