Asymmetric Deep Semantic Quantization for Image Retrieval

Zhan Yang; Osolo Ian Raymond; WuQing Sun; Jun Long

arXiv:1903.12493·cs.CV·June 4, 2019

TL;DR

This paper introduces ADSQ, a novel asymmetric deep semantic quantization method for image retrieval that leverages semantic information to generate discriminative hash codes, outperforming existing methods on benchmark datasets.

Contribution

The paper proposes a new asymmetric deep hashing framework with a LabelNet and two ImgNets to improve semantic discrimination and code optimization in image retrieval.

Findings

01

ADSQ outperforms state-of-the-art methods on CIFAR-10, NUS-WIDE, and ImageNet.

02

The asymmetric framework effectively captures rich semantic information.

03

The method improves the discriminability of hash codes in retrieval tasks.

Abstract

Due to its fast retrieval and storage efficiency, hashing has been widely used in nearest neighbor retrieval tasks. By using deep learning based techniques, hashing can outperform non-learning based hashing techniques in many applications. However, we argue that current deep learning based hashing methods ignore some critical problems (e.g., the learned hash codes are not discriminative because the hashing methods cannot discover rich semantic information, and the training strategy has difficulty optimizing the discrete binary codes). In this paper, we propose a novel image hashing method, termed Asymmetric Deep Semantic Quantization (ADSQ). ADSQ is implemented using three stream frameworks, which consist of one LabelNet and two ImgNets. The…

Tables

Algorithm 1 Asymmetric Deep Semantic Quantization (ADSQ).
Input: Training set $(\boldsymbol{X}, \boldsymbol{Y}, \boldsymbol{L})$; similarity matrix $\boldsymbol{S} \in \{-1, +1\}^{n \times n}$; hash code length $K$.
Output: Parameters $\boldsymbol{W}_{x}$ of the hashing function $\mathcal{F}_{x}$ and parameters $\boldsymbol{W}_{y}$ of $\mathcal{F}_{y}$.
Initialization: Network parameters $W_{\kappa}$, $W_{l}$, where $\kappa = \boldsymbol{x}, \boldsymbol{y}$. Hyper-parameters $\alpha$, $\beta$, $\gamma$, $\nu$, and $\eta$. Iteration numbers $T^{l}$, $T^{v}$. Learning rate $\mu$.
repeat
  1. for $t = 1 : T^{l}$ epoch do
  2.   Update $W_{l}$ by the standard BP algorithm: $W_{l} \leftarrow W_{l} - \mu \nabla_{W_{l}} \mathcal{L}^{l}$ according to (4).
  3. end for
  4. for $t = 1 : T^{v}$ epoch do
  5.   Update $\boldsymbol{W}_{\kappa}$: fixing $\boldsymbol{B}^{\kappa}$, solve for $\boldsymbol{W}_{\kappa}$ using the standard BP algorithm according to (9), $\kappa = \boldsymbol{x}, \boldsymbol{y}$.
  6.   Update $\boldsymbol{B}^{\kappa}$: fixing $\boldsymbol{W}_{\kappa}$, solve for $\boldsymbol{B}^{\kappa}$ according to (13), $\kappa = \boldsymbol{x}, \boldsymbol{y}$.
  7. end for
until convergence

Table 1: Configuration of the convolutional layers in ImgNets (i.e., ImgNet-$\boldsymbol{x}$ and ImgNet-$\boldsymbol{y}$).

Layer	Configuration
Layer	Filter Size	Stride	Padding	LRN	Pooling
conv1	$64 \times 11 \times 11$	$4 \times 4$	0	ON	$2 \times 2$
conv2	$256 \times 5 \times 5$	$1 \times 1$	2	ON	$2 \times 2$
conv3	$256 \times 3 \times 3$	$1 \times 1$	1	OFF	-
conv4	$256 \times 3 \times 3$	$1 \times 1$	1	OFF	-
conv5	$256 \times 3 \times 3$	$1 \times 1$	1	OFF	$2 \times 2$

Table 2: Configuration of the fully-connected layers in ImgNets (i.e., ImgNet-$\boldsymbol{x}$ and ImgNet-$\boldsymbol{y}$).

Layer	Configuration
full6	4096
full7	4096
Semantic layer	512
Hash layer	$K / 2$ -bit hash code

Table 3: Configuration of the LabelNet.

Layer	Configuration
full-connected layer	4096
Semantic layer	512
Hash layer	$K / 2$ -bit hash code

Table 4: Mean Average Precision (mAP) of Hamming ranking for different numbers of bits on the three image datasets.

Method	CIFAR-10				NUS-WIDE				ImageNet
Method	12 bits	24 bits	32 bits	48 bits	12 bits	24 bits	32 bits	48 bits	12 bits	24 bits	32 bits	48 bits
SH [29]	0.127	0.128	0.126	0.129	0.454	0.406	0.405	0.400	0.185	0.273	0.328	0.395
ITQ [11]	0.162	0.169	0.172	0.175	0.452	0.468	0.472	0.477	0.305	0.363	0.462	0.517
SDH [16]	0.285	0.329	0.341	0.356	0.568	0.600	0.608	0.637	0.253	0.371	0.455	0.525
KSH [12]	0.303	0.337	0.346	0.356	0.556	0.572	0.581	0.588	0.136	0.233	0.298	0.342
DHN [19]	0.555	0.594	0.603	0.621	0.708	0.735	0.748	0.758	0.269	0.363	0.461	0.530
CNNH [20]	0.429	0.511	0.509	0.522	0.611	0.618	0.625	0.608	0.237	0.364	0.450	0.525
DNNH [39]	0.552	0.566	0.558	0.581	0.674	0.697	0.713	0.715	0.219	0.372	0.461	0.530
DPSH [47]	0.713	0.727	0.744	0.757	0.752	0.790	0.794	0.812	0.143	0.268	0.304	0.407
DSDH [40]	0.726	0.762	0.785	0.803	0.743	0.782	0.799	0.816	0.312	0.353	0.481	0.533
DSEH [24]	0.753	0.781	0.807	0.822	0.745	0.785	0.811	0.819	0.449	0.487	0.545	0.576
ADSQ	0.792	0.823	0.836	0.851	0.761	0.793	0.828	0.833	0.493	0.553	0.621	0.649

Table 5: The mAP results of the ablation study of ADSQ on the NUS-WIDE dataset.

Method	12 bits	24 bits	32 bits	48 bits
ADSQ	0.761	0.793	0.828	0.833
ADSQ-$sym$	0.756	0.790	0.817	0.824
ADSQ-$\mathcal{A}$	0.651	0.677	0.729	0.740
ADSQ-$\mathcal{S}$	0.743	0.774	0.811	0.817
ADSQ-$\mathcal{A}\mathcal{S}$	0.637	0.651	0.699	0.714

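Equations

$$\log P(\boldsymbol{B}|\boldsymbol{S}) \propto \log P(\boldsymbol{S}|\boldsymbol{B})P(\boldsymbol{B}) = \sum_{s_{ij}\in\boldsymbol{S}} \log P(s_{ij}|b_{i},b_{j})P(b_{i},b_{j}), \tag{1}$$

$$P(s_{ij}|b_{i},b_{j}) = \begin{cases} \sigma(\langle b_{i},b_{j}\rangle), & s_{ij}=1 \\ 1-\sigma(\langle b_{i},b_{j}\rangle), & s_{ij}=0 \end{cases} \tag{2}$$

$$P(s_{ij}|r_{i},r_{j}) = \begin{cases} \sigma(\langle r_{i},r_{j}\rangle), & s_{ij}=1 \\ 1-\sigma(\langle r_{i},r_{j}\rangle), & s_{ij}=0 \end{cases} \tag{3}$$

$$\min_{W_{l}} \mathcal{L}^{l} = \underbrace{-\alpha\sum_{s_{ij}\in\boldsymbol{S}}\big(s_{ij}\boldsymbol{\Lambda}_{ij}^{l}-\log(1+e^{\boldsymbol{\Lambda}_{ij}^{l}})\big)}_{\mathcal{J}_{1}} \underbrace{-\beta\sum_{s_{ij}\in\boldsymbol{S}}\big(s_{ij}\boldsymbol{\Theta}_{ij}^{l}-\log(1+e^{\boldsymbol{\Theta}_{ij}^{l}})\big)}_{\mathcal{J}_{2}} + \underbrace{\gamma\sum_{s_{ij}\in\boldsymbol{S}}\big(\|\omega_{i}^{l}-\boldsymbol{1}\|_{1}+\|\omega_{j}^{l}-\boldsymbol{1}\|_{1}\big)}_{\mathcal{J}_{3}} + \underbrace{\delta\sum_{i=1}^{N}\|\boldsymbol{\tilde{L}}-\boldsymbol{L}\|_{F}^{2}}_{\mathcal{J}_{4}}, \tag{4}$$

$$\min \|\boldsymbol{I}^{T}\boldsymbol{B}^{\kappa}-K\boldsymbol{S}\|_{F}^{2}, \tag{5}$$

$$\min \|\boldsymbol{\tilde{I}}^{T}\boldsymbol{B}^{\kappa}-K\boldsymbol{S}\|_{F}^{2}, \tag{6}$$

$$\min_{\boldsymbol{B}^{\kappa},W_{\kappa}} \mathcal{L}^{\kappa} = \underbrace{-\alpha\sum_{s_{ij}\in\boldsymbol{S}}\big(s_{ij}\boldsymbol{\Lambda}_{ij}^{\kappa}-\log(1+e^{\boldsymbol{\Lambda}_{ij}^{\kappa}})\big)}_{\mathcal{J}_{1}} \underbrace{-\beta\sum_{s_{ij}\in\boldsymbol{S}}\big(s_{ij}\boldsymbol{\Theta}_{ij}^{\kappa}-\log(1+e^{\boldsymbol{\Theta}_{ij}^{\kappa}})\big)}_{\mathcal{J}_{2}} + \underbrace{\eta\|\boldsymbol{\tilde{I}}-\boldsymbol{B}^{\kappa}\|_{F}^{2}}_{\mathcal{J}_{3}} + \underbrace{\nu\|\boldsymbol{\tilde{I}}^{T}\boldsymbol{1}\|_{F}^{2}}_{\mathcal{J}_{4}} + \underbrace{\|\boldsymbol{\tilde{I}}^{T}\boldsymbol{B}^{\kappa}-K\boldsymbol{S}\|_{F}^{2}}_{\mathcal{A}} \quad \text{s.t. } \kappa=\boldsymbol{x},\boldsymbol{y},\ \boldsymbol{B}^{\kappa}\in\{-1,+1\}^{n\times K}, \tag{7}$$

$$\min_{W_{\kappa}} -\alpha\sum_{s_{ij}\in\boldsymbol{S}}\big(s_{ij}\boldsymbol{\Lambda}_{ij}^{\kappa}-\log(1+e^{\boldsymbol{\Lambda}_{ij}^{\kappa}})\big) -\beta\sum_{s_{ij}\in\boldsymbol{S}}\big(s_{ij}\boldsymbol{\Theta}_{ij}^{\kappa}-\log(1+e^{\boldsymbol{\Theta}_{ij}^{\kappa}})\big) + \eta\|\tanh(\mathcal{F}_{\kappa}(\kappa_{i},W_{\kappa}))-\boldsymbol{B}^{\kappa}\|_{F}^{2} + \nu\|\tanh(\mathcal{F}_{\kappa}(\kappa_{i},W_{\kappa}))^{T}\boldsymbol{1}\|_{F}^{2}. \tag{8}$$

$$\frac{\partial\mathcal{L}^{\kappa}}{\partial\boldsymbol{v}_{i}} = \bigg\{\Big[\cdots+\frac{\beta}{2}\big(\sigma(\Theta_{ij})\omega_{j}^{l}-S_{ij}\omega_{j}^{l}\big)\Big]+2\eta(\boldsymbol{u}_{i}-b_{i})+2\nu\boldsymbol{U}^{T}\boldsymbol{1}\bigg\}\otimes(1-\boldsymbol{u}_{i}^{2}), \tag{9}$$

$$\min_{\boldsymbol{B}^{\kappa}} \;\cdots \quad \text{s.t. } \boldsymbol{B}_{i}^{\kappa}\in\{-1,+1\}^{n\times K}, \tag{10}$$

$$\min_{\boldsymbol{B}^{\kappa}} \;\cdots \quad \text{s.t. } \boldsymbol{B}_{i}^{\kappa}\in\{-1,+1\}^{n\times K}, \tag{11}$$

$$\min_{\boldsymbol{B}_{*c}^{\kappa}} \;\cdots \quad \text{s.t. } \boldsymbol{B}^{\kappa}\in\{-1,+1\}^{n\times K}. \tag{12}$$

$$\boldsymbol{B}_{*c}^{\kappa} = -sign(2\boldsymbol{\tilde{B}}_{c}^{\kappa}\boldsymbol{\tilde{U}}_{c}^{T}\boldsymbol{U}_{*c}+\boldsymbol{P}_{*c}), \tag{13}$$

$$b_{q} = concat[b_{i}^{q}, b_{j}^{q}] \in \mathbb{R}^{K}. \tag{14}$$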

Full text

Asymmetric Deep Semantic Quantization for Image Retrieval

Zhan Yang$^{1}$, Osolo Ian Raymond$^{1}$, WuQing Sun$^{1}$, Jun Long$^{1,2}$

$^{1}$School of Computer Science and Engineering, Central South University, Changsha 410083, China

$^{2}$Network Resources Management and Trust Evaluation Key Laboratory of Hunan Province

This work was supported in part by the Key Technology R&D Program of Hunan Province (2018GK2052) and the Science and Technology Plan of Hunan (2016TP1003).

©2019 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

Accepted to IEEE Access, DOI: 10.1109/ACCESS.2019.2920712

I. Introduction

Over the last decade, the amount of available image data has grown exponentially, and storing and searching it efficiently has become a major challenge. Among Nearest Neighbor Search (NNS) [1] methods, hashing has attracted considerable interest in real-world image retrieval applications because of its fast search and low storage cost. The basic idea of hashing is to learn a mapping function $\mathcal{F}:\mathbb{R}^{d}\rightarrow\{0,1\}^{k}$ that maps the original $d$-dimensional (high-dimensional) space into a $k$-bit Hamming space in which similarity is preserved. With the binary representation, search speed improves markedly and storage cost drops dramatically. As a result, hashing has become a popular tool for many image retrieval [2, 3, 4, 5, 6] and text-image cross-modal retrieval tasks [7, 8].

Hashing methods can be divided into two categories: data-independent and data-dependent. Data-independent hashing methods adopt random projections as hash functions to map data points from the original high-dimensional representation space into a low-dimensional one (i.e., binary codes). In other words, they define a hash function that does not depend on the data itself. Unfortunately, data-independent methods need long hash codes to achieve satisfactory retrieval performance. To overcome this limitation, recent works have shown that data-dependent hashing methods, which learn the hash function from training data points, can achieve better performance with shorter hash codes. Data-dependent methods can be further categorized into unsupervised and supervised methods. Unsupervised methods rely primarily on distance metrics (e.g., Euclidean distance or cosine distance) between data point features [9, 10, 11]. Supervised hashing methods, in contrast, use semantic labels to bridge the semantic gap and improve the quality of the hash function, and many researchers currently focus on them [6, 12, 13, 14, 15, 16]. However, most of these methods map the original data points into binary codes using hand-crafted features; if the feature distribution of a (large-scale) dataset is complex, their performance degrades.
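To make the data-independent case concrete, the following is a minimal sketch of random-projection hashing in the spirit of LSH. It is an illustration only: the function names `make_random_projection_hash` and `hamming`, the Gaussian hyperplanes, and the toy input are assumptions, not something specified in this paper.

```python
import random

def make_random_projection_hash(dim, n_bits, seed=0):
    """Return a data-independent hash function R^dim -> {0,1}^n_bits."""
    rng = random.Random(seed)
    # One Gaussian hyperplane per bit, drawn without looking at any data.
    planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]

    def hash_fn(x):
        # Bit k is 1 when x lies on the non-negative side of hyperplane k.
        return [1 if sum(w * xi for w, xi in zip(plane, x)) >= 0 else 0
                for plane in planes]

    return hash_fn

def hamming(a, b):
    """Number of positions in which two codes differ."""
    return sum(p != q for p, q in zip(a, b))

h = make_random_projection_hash(dim=4, n_bits=16)
x = [1.0, 0.5, -0.2, 0.3]
x_scaled = [2.0 * v for v in x]   # same direction, different length
print(h(x))
print(hamming(h(x), h(x_scaled)))
```

Because the hyperplanes ignore the data distribution, many bits are typically needed before the codes separate neighbors well, which is the limitation the paragraph above describes.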

Fortunately, Deep Neural Networks (DNNs), especially Convolutional Neural Networks (CNNs), have been widely used in computer vision [17] and have shown powerful feature extraction capabilities. Inspired by this, several learning based hashing methods [18, 19, 20, 22] adopt convolutional neural networks as nonlinear hash functions to enable end-to-end learning of feature representations and hash codes, and have demonstrated satisfactory retrieval performance on many benchmark datasets. Despite this progress, these methods still have limitations; e.g., the label information is used only to construct the similarity matrix, which does not make full use of the multiple labels of the data points [23]. Taking the NUS-WIDE dataset as an example, an instance may be annotated with multiple labels, such as “person”, “tree” and “sea”, which provide abundant semantic information and a more accurate similarity relationship. Deep Joint Semantic-Embedding Hashing (DSEH) [24] was proposed to make full use of multiple label information: it exploits the semantic correlation and hash codes learned in LabNet as supervised information and transfers them to ImgNet. However, two limitations remain. Firstly, learning compact binary codes is a mixed-integer optimization problem, which is NP-hard. DSEH sidesteps this with a relaxation scheme and then quantizes the real-continuous values to compact binary values, which causes a large quantization loss. Secondly, DSEH measures the similarity between each pair of image instances by estimating the Hamming distance between the outputs of the same hash function. Because DSEH employs a symmetric structure (same structure with the same weights, i.e., the same network), highly correlated bits usually appear in practice, which degrades the retrieval performance. Intuitively, a pair of images with the same or different labels should not be seen as completely similar or dissimilar. Inspired by this, we use an asymmetric structure (same structure with different weights) in which each network learns half of the code; this effectively decorrelates different bits and makes the learned hash codes more informative.

To address the above challenges, this paper presents Asymmetric Deep Semantic Quantization (ADSQ) for efficient and effective image retrieval. ADSQ introduces a novel asymmetric training strategy for quantization and offers superior retrieval performance, with three contributions detailed below:

1. We develop a novel asymmetric framework for image retrieval, consisting of two ImgNets and one LabelNet. Two convolutional neural networks (i.e., ImgNets) are trained as different hash functions to generate compact binary codes for image pairs, and one fully-connected network (i.e., LabelNet) captures abundant semantic correlation information from the image pair. The model effectively captures similarity relationships between the real-continuous features and binary hash codes, and can generate discriminative compact hash codes.

2. Binary hash codes from training data points are learned with an iterative optimization strategy. Furthermore, based on the optimization scheme, an asymmetric loss between the binary-like codes and the learned discrete hash codes is imposed to reduce the quantization error.

3. Results from our experiments demonstrate that ADSQ outperforms several state-of-the-art methods for the task of image retrieval.

The rest of the paper is organized as follows. Section II briefly introduces related work on learning based hashing quantization. In Section III, we formulate the problem and provide the details of the proposed training strategy. Section IV presents the experimental results, and Section V concludes the paper.

II. Related Work

By representing images as binary codes and taking advantage of fast query retrieval, hashing techniques have attracted considerable attention in image retrieval. A comprehensive survey of recent hashing techniques is provided in [25].

According to previous research, hashing methods can be roughly divided into two categories: data-independent and data-dependent methods. Spectral Hashing (SH) [29] and Locality Sensitive Hashing (LSH) are two of the most commonly used data-independent methods. LSH uses several random projections as hash functions to map the data points into a Hamming space [30]. Some variants of LSH (e.g., kernel LSH [31] and $p$-norm LSH [32]) have been proposed to improve its performance. Unlike data-independent methods, data-dependent methods learn from the dataset to attain more compact hash codes and better retrieval accuracy. Data-dependent methods can be categorized into unsupervised and supervised methods. Unsupervised hashing methods aim to preserve the linkages among the unlabeled training data points. Typical examples include graph based hashing [10, 36, 37], quantization error minimization [38], and reconstruction error minimization [9, 11, 33, 34, 35]. Supervised methods utilize semantic labels or relevance information to improve the quality of hash codes. For example, Supervised Hashing with Kernels (KSH) [12] and Supervised Discrete Hashing (SDH) [16] generate binary hash codes by minimizing the Hamming distances across similar pairs of data points. Distortion Minimization Hashing (DMS) [6], Minimal Loss Hashing (MLH) [13], and Order Preserving Hashing (OPH) [14] learn hash codes by minimizing the triplet loss based on similar pairs of data points. COlumn Sampling based Discrete Supervised Hashing (COSDISH) [15] learns the discrete hashing code from semantic information.

More recently, deep learning based hashing methods have shown superior performance by exploiting the powerful feature extraction of deep learning [41, 17]. In particular, Convolutional Neural Network Hashing (CNNH) [20] is a two-stage hashing method which learns hash codes and deep hash functions separately for image retrieval. Following this work, many learning based hashing techniques have been proposed. Weakly-shared Deep Transfer Networks (WDTN) [21] mitigate the problem of insufficient image training data by bringing in rich labels from the text domain. Deep Semantic Ranking Hashing (DSRH) [28] employs multilevel semantic ranking supervision to learn CNN based deep hash functions that preserve the semantic structure of multi-label images. Deep Discrete Supervised Hashing (DDSH) [2] utilizes pairwise supervised information to directly guide both the discrete coding procedure and the deep feature learning procedure, and thus enhances the feedback between these two important procedures. Deep Supervised Hashing (DSH) [23] utilizes a CNN architecture that takes pairs of images (similar or dissimilar) as training inputs and encourages the output of each image to approximate discrete values. Deep Ordinal Hashing (DOH) [26] uses an effective spatial attention model to capture local spatial information by selectively learning well-specified locations closely related to target objects. Generalized Deep Transfer Networks (DTNs) [27] learn semantic knowledge from Web texts and then transfer it to images through a learned translator function when there is a lack of sufficient training data in the visual domain. Network In Network Hashing (NINH) [39] is a “one-stage” supervised hashing method that maps images to hash codes via a deep architecture. Deep Supervised Discrete Hashing (DSDH) [40] constrains the outputs of the last layer to be binary codes directly, and adopts an alternating minimization method that optimizes the objective function with discrete cyclic coordinate descent. Deep Joint Semantic-Embedding Hashing (DSEH) [24] consists of LabNet and ImgNet. Specifically, LabNet captures abundant semantic correlation between sample pairs and supervises ImgNet at both the semantic level and the hash code level.

However, even though DSEH captures abundant semantic correlation to indicate the accurate similarity relationship between samples, it is based on a shallow architecture which cannot effectively differentiate between the real-continuous features and discrete hash codes, because of their high degree of similarity. Therefore, in this paper, we propose a novel learning based hashing method, which can not only capture rich semantic correlation information, but also semantically associate the learned real-continuous features with the binary codes through an asymmetric network.

III. Asymmetric Deep Semantic Quantization

In this section, we detail the proposed ADSQ method. In order to attain robust image representations, ADSQ includes three stream frameworks, i.e., two ImgNets and a LabelNet. The two ImgNets, which adopt the same convolutional neural network structure but with different weights, are used to generate discriminative compact hash codes. The LabelNet, which captures rich semantic correlation information, is used to guide the two ImgNets in minimizing the quantization gap. As shown in [41], the top layers of a deep convolutional neural network can gradually extract global and more high-level representations. The details of our model are described in the following subsections.

i. Notations and Problem Definition

In this paper, we use boldface uppercase characters like $\boldsymbol{B}$ to denote a matrix, and vectors are denoted by boldface lowercase characters like $\boldsymbol{b}$. $\boldsymbol{B}_{ij}$ means the ($i,j$)-th element of $\boldsymbol{B}$. $\boldsymbol{B}^{T}$ is the transpose of $\boldsymbol{B}$, and the $\ell_{2}$-norm of a vector $\boldsymbol{b}\in\mathbb{R}^{D}$ is defined as $||\boldsymbol{b}||_{2}=(\sum_{i=1}^{D}|b_{i}|^{2})^{1/2}$. The Frobenius norm of a matrix $\boldsymbol{B}\in\mathbb{R}^{m\times n}$ is defined as $||\boldsymbol{B}||_{F}^{2}=\sum_{i=1}^{m}\sum_{j=1}^{n}B_{ij}^{2}=\text{tr}[\boldsymbol{B}^{T}\boldsymbol{B}]$, while $\text{tr}[\boldsymbol{B}]$ is the trace of $\boldsymbol{B}$ if $\boldsymbol{B}$ is square. The symbol $\otimes$ denotes the element-wise product (i.e., Hadamard product). We use $\boldsymbol{1}$ to denote a vector with all elements being 1. $sign(\cdot)$ is the element-wise sign function: $sign(x)=1$ if $x\geq 0$, otherwise $sign(x)=-1$.

In similarity retrieval systems, we are given a training set $\mathcal{D}=\{\boldsymbol{d_{i}}\}_{i=1}^{N}$, $\boldsymbol{d_{i}}=\{\boldsymbol{x_{i},y_{i},l_{i}}\}$, where $\boldsymbol{x_{i}}\in\mathbb{R}^{1\times D}$ and $\boldsymbol{y_{i}}\in\mathbb{R}^{1\times D}$ denote the feature vector of the $i$-th image in the first and second deep convolutional neural networks, respectively. (Note that, although we use different symbols $\boldsymbol{x}$ and $\boldsymbol{y}$ to represent images, both of them denote the same training dataset.) $\boldsymbol{l_{i}}=[l_{i1},l_{i2},...,l_{ic}]$ are the label annotations assigned to $\boldsymbol{d_{i}}$, where $c$ is the number of categories. Furthermore, for supervised learning based hashing methods, pairwise information can be used, which is denoted by $\boldsymbol{S}=\{s_{ij}\}$. (Note that one image may belong to multiple categories.) If $s_{ij}=1$, then $\boldsymbol{x_{i}}$ and $\boldsymbol{y_{j}}$ are similar, while $s_{ij}=0$ implies that $\boldsymbol{x_{i}}$ and $\boldsymbol{y_{j}}$ are dissimilar. The goal of a learning based hashing method for quantization is to learn a quantizer $\mathcal{Q}:\boldsymbol{x}\rightarrow b_{i}\in\{-1,1\}^{K}$ from an input space $\mathbb{R}^{D}$ to the Hamming space $\{-1,1\}^{K}$ with a deep neural network, where $K$ is the length of the binary codes. The similarity labels $\boldsymbol{S}=\{s_{ij}\}$ can be constructed from semantic labels of data points or relevance feedback in real retrieval systems.

For two binary hash codes $b_{i}$ and $b_{j}$, the similarity relationship is defined according to a distance metric $\boldsymbol{D}(b_{i},b_{j})$, where $\boldsymbol{D}(\cdot)$ is a distance metric function (e.g., Hamming distance or cosine distance). In this paper, the aim of our model is to learn two mapping functions $\mathcal{F}_{\boldsymbol{x}}$ and $\mathcal{F}_{\boldsymbol{y}}$ to map $\boldsymbol{X}$ and $\boldsymbol{Y}$ into the Hamming space $\boldsymbol{B}$: $b_{i}=sign(\mathcal{F}_{\boldsymbol{x}}(\boldsymbol{x_{i}}))\in\mathbb{R}^{K/2}$ and $b_{j}=sign(\mathcal{F}_{\boldsymbol{y}}(\boldsymbol{y_{j}}))\in\mathbb{R}^{K/2}$. For notational simplicity, from here on we write $K$ instead of $K/2$ for the length of the hash codes generated by each ImgNet, so the length of the final hash codes is $2K$. The Hamming distance $\boldsymbol{D}ist_{H}$ between two codes and their inner product $\langle\cdot,\cdot\rangle$ are related by $\boldsymbol{D}ist_{H}=\frac{1}{2}(K-\langle b_{i},b_{j}\rangle)$. Therefore, we can use the inner product to measure the similarity of two binary codes.
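The relation between Hamming distance and inner product can be checked directly on a toy pair of codes. This is a small sketch; the helper names and the example codes are illustrative assumptions, and the inner product here is the plain dot product $b_{i}^{T}b_{j}$ of $\pm 1$ codes.

```python
def hamming_distance(b_i, b_j):
    """Number of bit positions where the two codes disagree."""
    return sum(1 for p, q in zip(b_i, b_j) if p != q)

def inner_product(b_i, b_j):
    """Plain dot product of two +/-1 codes."""
    return sum(p * q for p, q in zip(b_i, b_j))

b_i = [+1, -1, +1, +1, -1, -1]
b_j = [+1, +1, +1, -1, -1, +1]
K = len(b_i)

# Agreeing bits contribute +1 and disagreeing bits -1, so
# <b_i, b_j> = (K - d) - d, which rearranges to d = (K - <b_i, b_j>) / 2.
d = hamming_distance(b_i, b_j)
ip = inner_product(b_i, b_j)
print(d, ip, (K - ip) / 2)
```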

Given the pairwise similarity labels $\boldsymbol{S}=\{s_{ij}\}$ , the logarithm Maximum a Posteriori (MAP) estimation of the hash codes $\boldsymbol{B}=[b_{1},b_{2},...,b_{N}]$ for all $N$ training points is:

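$$\log P(\boldsymbol{B}|\boldsymbol{S}) \propto \log P(\boldsymbol{S}|\boldsymbol{B})P(\boldsymbol{B}) = \sum_{s_{ij}\in\boldsymbol{S}} \log P(s_{ij}|b_{i},b_{j})P(b_{i},b_{j}), \tag{1}$$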

where $P(\boldsymbol{S}|\boldsymbol{B})$ denotes the likelihood function, and $P(\boldsymbol{B})$ is the prior distribution. For each pair, $P(s_{ij}|b_{i},b_{j})$ is the conditional probability of $s_{ij}$ given the pair of corresponding hash codes $[b_{i},b_{j}]$ , which is naturally defined by the binary distribution,

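$$P(s_{ij}|b_{i},b_{j}) = \begin{cases} \sigma(\langle b_{i},b_{j}\rangle), & s_{ij}=1 \\ 1-\sigma(\langle b_{i},b_{j}\rangle), & s_{ij}=0 \end{cases} \tag{2}$$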

where $\sigma(x)=1/(1+e^{-x})$ , and $\langle b_{i},b_{j}\rangle=\frac{1}{2}b_{i}^{T}b_{j}$ .

As in the hash layer, the same function can be used in the semantic layer by replacing the hash codes in (2) with two real-continuous features $r_{i}$ and $r_{j}$. The similarity probability of $r_{i}$ and $r_{j}$ can therefore be expressed as the binary distribution:

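$$P(s_{ij}|r_{i},r_{j}) = \begin{cases} \sigma(\langle r_{i},r_{j}\rangle), & s_{ij}=1 \\ 1-\sigma(\langle r_{i},r_{j}\rangle), & s_{ij}=0 \end{cases} \tag{3}$$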
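Taking the negative logarithm of the pairwise probability in (2) or (3) gives the per-pair loss $-(s_{ij}\theta - \log(1+e^{\theta}))$ with $\theta=\frac{1}{2}v_{i}^{T}v_{j}$, which is the form that appears in the training objectives. The sketch below checks this identity numerically; the function names and toy vectors are assumptions made for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softplus(x):
    # Numerically stable log(1 + e^x).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def pairwise_nll(s_ij, v_i, v_j):
    """Negative log-likelihood of one similarity label s_ij in {0, 1}.

    v_i and v_j may be hash-layer codes or semantic-layer features.
    """
    theta = 0.5 * sum(p * q for p, q in zip(v_i, v_j))
    return -(s_ij * theta - softplus(theta))

v_i = [1.0, -1.0, 1.0, 1.0]
v_j = [1.0, -1.0, 1.0, -1.0]   # dot product 2, so theta = 1
print(pairwise_nll(1, v_i, v_j), -math.log(sigmoid(1.0)))
print(pairwise_nll(0, v_i, v_j), -math.log(1.0 - sigmoid(1.0)))
```

A similar pair is cheap when the inner product is large and positive, and a dissimilar pair is cheap when it is large and negative.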

ii. LabelNet Training

In this section, we design an end-to-end fully-connected neural network, named LabelNet, to model semantic information at a more fine-grained level. Given the multi-label vector of an instance, LabelNet extracts semantic features layer by layer. Let $\mathcal{F}_{l}(\boldsymbol{l_{i}};W_{l})$ denote the embedding of label point $\boldsymbol{l_{i}}$, and $W_{l}$ denote the parameters of the LabelNet. Our goal is to maintain the similarity relationship between features and their corresponding hash codes. For LabelNet, the final loss can be defined as follows:

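$$\min_{W_{l}} \mathcal{L}^{l} = \underbrace{-\alpha\sum_{s_{ij}\in\boldsymbol{S}}\big(s_{ij}\boldsymbol{\Lambda}_{ij}^{l}-\log(1+e^{\boldsymbol{\Lambda}_{ij}^{l}})\big)}_{\mathcal{J}_{1}} \underbrace{-\beta\sum_{s_{ij}\in\boldsymbol{S}}\big(s_{ij}\boldsymbol{\Theta}_{ij}^{l}-\log(1+e^{\boldsymbol{\Theta}_{ij}^{l}})\big)}_{\mathcal{J}_{2}} + \underbrace{\gamma\sum_{s_{ij}\in\boldsymbol{S}}\big(\|\omega_{i}^{l}-\boldsymbol{1}\|_{1}+\|\omega_{j}^{l}-\boldsymbol{1}\|_{1}\big)}_{\mathcal{J}_{3}} + \underbrace{\delta\sum_{i=1}^{N}\|\boldsymbol{\tilde{L}}-\boldsymbol{L}\|_{F}^{2}}_{\mathcal{J}_{4}}, \tag{4}$$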

where $\boldsymbol{\Lambda}_{ij}^{l}=\frac{1}{2}(r_{i}^{l})^{T}(r_{j}^{l})$ and $\boldsymbol{\Theta}_{ij}^{l}=\frac{1}{2}(\omega_{i}^{l})^{T}(\omega_{j}^{l})$. $r_{i}^{l}$ denotes the semantic representation, and $\omega^{l}$ represents the binary-like codes output by the LabelNet. $\boldsymbol{\tilde{L}}=[\tilde{l}_{1},\tilde{l}_{2},...,\tilde{l}_{i}]$, with $\tilde{l}_{i}=(W^{l})^{T}\omega_{i}^{l}+b_{i}^{l}$, are the predicted labels, and $\boldsymbol{L}$ is the true label. $\alpha,\ \beta,\ \gamma,\ \delta$ are hyper-parameters. In (4), $\mathcal{J}_{1}$ and $\mathcal{J}_{2}$ are the intra-pairwise loss terms: $\mathcal{J}_{1}$ preserves the similarity information between semantic features in the semantic space, whereas $\mathcal{J}_{2}$ preserves the similarity between hashing features in the Hamming space. $\mathcal{J}_{3}$ is the binary regularization (i.e., it promotes hash code discretization), and $\mathcal{J}_{4}$ is the classification loss between the true labels and the predicted labels.

iii. ImgNet Training

The image framework of the proposed method is shown in Fig 1. As can be seen, we designed two end-to-end networks, named ImgNet- $\boldsymbol{x}$ and ImgNet- $\boldsymbol{y}$ , which can map the features of an image into binary codes. These ImgNets are guided by LabelNet using the semantic features and the learned hash codes. $\mathcal{F}_{x}(\boldsymbol{x}_{i},W_{x})$ represents the output of the $i$ -th image in the last layer of the ImgNet- $\boldsymbol{x}$ , where $W_{x}$ stands for the parameters of the network. Similarly, we can obtain the output $\mathcal{F}_{y}(\boldsymbol{y}_{j},W_{y})$ corresponding to the $j$ -th image using the parameters $W_{y}$ in the ImgNet- $\boldsymbol{y}$ . In order to learn the optimal hash codes which can preserve the similarity information between the learned binary codes and the real-value features, one common way is to minimize the Frobenius norm between the similarity information and the inner product of the learned binary codes and the real-value features:

$$\min \|\boldsymbol{I}^{T}\boldsymbol{B}^{\kappa}-K\boldsymbol{S}\|_{F}^{2}, \tag{5}$$

where $\boldsymbol{I}$ denotes $sign(\mathcal{F}_{\kappa}(\kappa,W_{\kappa})),\kappa=\boldsymbol{x},\boldsymbol{y}$, $\boldsymbol{B}^{\kappa}$ represents the learned binary codes, $K$ is the length of the hash codes, and $\boldsymbol{S}$ is the pairwise supervised information.

However, the formulation in (5) has a problem: the gradient of $sign(\cdot)$ is zero almost everywhere, so the back-propagation (BP) algorithm cannot propagate gradients through $\boldsymbol{I}$. Hence, in this paper, we adopt $\tanh(\cdot)$ to approximate the threshold function $sign(\cdot)$. Thus, Equation (5) is transformed into:

$$\min \|\boldsymbol{\tilde{I}}^{T}\boldsymbol{B}^{\kappa}-K\boldsymbol{S}\|_{F}^{2}, \tag{6}$$

where $\boldsymbol{\tilde{I}}$ denotes $\tanh(\mathcal{F}_{\kappa}(\kappa,W_{\kappa})),\kappa=\boldsymbol{x},\boldsymbol{y}$ . For ImgNet, the final loss can be defined as follows:

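$$\min_{\boldsymbol{B}^{\kappa},W_{\kappa}} \mathcal{L}^{\kappa} = \underbrace{-\alpha\sum_{s_{ij}\in\boldsymbol{S}}\big(s_{ij}\boldsymbol{\Lambda}_{ij}^{\kappa}-\log(1+e^{\boldsymbol{\Lambda}_{ij}^{\kappa}})\big)}_{\mathcal{J}_{1}} \underbrace{-\beta\sum_{s_{ij}\in\boldsymbol{S}}\big(s_{ij}\boldsymbol{\Theta}_{ij}^{\kappa}-\log(1+e^{\boldsymbol{\Theta}_{ij}^{\kappa}})\big)}_{\mathcal{J}_{2}} + \underbrace{\eta\|\boldsymbol{\tilde{I}}-\boldsymbol{B}^{\kappa}\|_{F}^{2}}_{\mathcal{J}_{3}} + \underbrace{\nu\|\boldsymbol{\tilde{I}}^{T}\boldsymbol{1}\|_{F}^{2}}_{\mathcal{J}_{4}} + \underbrace{\|\boldsymbol{\tilde{I}}^{T}\boldsymbol{B}^{\kappa}-K\boldsymbol{S}\|_{F}^{2}}_{\mathcal{A}} \quad \text{s.t. } \kappa=\boldsymbol{x},\boldsymbol{y},\ \boldsymbol{B}^{\kappa}\in\{-1,+1\}^{n\times K}, \tag{7}$$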

where $\boldsymbol{\Lambda}_{ij}^{\kappa}=\frac{1}{2}(r_{i}^{l})^{T}(r_{j}^{\kappa})$ and $\boldsymbol{\Theta}_{ij}^{\kappa}=\frac{1}{2}(\omega_{i}^{l})^{T}(\omega_{j}^{\kappa})$. $r_{i}^{l}$ and $r_{j}^{\kappa}$ are semantic representations from LabelNet and ImgNets, respectively, and $\omega^{\kappa}$ represents the binary-like codes obtained from the output of the ImgNets. $\alpha,\ \beta,\ \eta,\ \nu$ are the hyper-parameters. In (7), $\mathcal{J}_{1}$ and $\mathcal{J}_{2}$ are two negative log-likelihood functions (i.e., $\mathcal{J}_{1}$ and $\mathcal{J}_{2}$ exploit the inter-class and intra-class information). Note that although $\mathcal{J}_{1}$ and $\mathcal{J}_{2}$ in (4) and (7) look similar, they have different meanings: in (7), the supervised features $r^{l}_{i}$ and $\omega_{i}^{l}$ learned by the LabelNet guide the training of the asymmetric ImgNets. The relevance is thus established through the LabelNet, and the semantic information can be fully utilized. $\mathcal{J}_{3}$ is the approximation loss between the binary-like codes and the hash codes. $\mathcal{J}_{4}$ balances each bit, encouraging the numbers of negative and positive values ($\pm 1$) to be approximately equal among all data points (i.e., $\mathcal{J}_{4}$ is used to maximize the information provided by each bit) [42]. $\mathcal{A}$ is the asymmetric term, which exploits the semantic information between the binary codes and the real-valued data.
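The roles of the approximation term $\mathcal{J}_{3}$ and the bit-balance term $\mathcal{J}_{4}$ can be illustrated on a toy batch. This is only a sketch: the hyper-parameter weights are omitted, and the pre-activation values `V`, the helper names, and the batch size are hypothetical.

```python
import math

def quantization_loss(U, B):
    """||U - B||_F^2: gap between binary-like codes U and discrete codes B."""
    return sum((u - b) ** 2 for u_row, b_row in zip(U, B) for u, b in zip(u_row, b_row))

def bit_balance_loss(U):
    """||U^T 1||_F^2: sum of squared column sums; zero when every bit is balanced."""
    n_bits = len(U[0])
    return sum(sum(row[k] for row in U) ** 2 for k in range(n_bits))

# Hypothetical network outputs for n = 4 images and 2 bits per image.
V = [[2.0, -1.5], [-2.0, 1.5], [1.0, 3.0], [-1.0, -3.0]]
U = [[math.tanh(v) for v in row] for row in V]          # binary-like codes
B = [[1 if u >= 0 else -1 for u in row] for row in U]   # discrete codes

print(quantization_loss(U, B))   # shrinks as tanh outputs saturate toward +/-1
print(bit_balance_loss(U))       # each column here has matching +/- entries
print(bit_balance_loss([[1, 1], [1, 1]]))  # every bit identical: heavily penalized
```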

iv. Optimization

In this section, we introduce the training strategy. Firstly, we randomly initialize LabelNet and train it until (4) converges. Secondly, we use the semantic representations and binary-like codes generated by LabelNet to guide the ImgNet training. Finally, the training procedure is repeated for LabelNet and ImgNet until convergence. Here, we only present the training details for problem (7), since problem (4) can be solved directly by stochastic gradient descent with the back-propagation algorithm. We optimize problem (7) through iterative optimization: in each iteration, we learn one variable with the other fixed, and then alternate.

$\boldsymbol{W_{\kappa}}$ -step: Fixing $B^{\kappa}$ to solve $W_{\kappa}$ , then the objective problem can be transformed into:

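$$\min_{W_{\kappa}} -\alpha\sum_{s_{ij}\in\boldsymbol{S}}\big(s_{ij}\boldsymbol{\Lambda}_{ij}^{\kappa}-\log(1+e^{\boldsymbol{\Lambda}_{ij}^{\kappa}})\big) -\beta\sum_{s_{ij}\in\boldsymbol{S}}\big(s_{ij}\boldsymbol{\Theta}_{ij}^{\kappa}-\log(1+e^{\boldsymbol{\Theta}_{ij}^{\kappa}})\big) + \eta\|\tanh(\mathcal{F}_{\kappa}(\kappa_{i},W_{\kappa}))-\boldsymbol{B}^{\kappa}\|_{F}^{2} + \nu\|\tanh(\mathcal{F}_{\kappa}(\kappa_{i},W_{\kappa}))^{T}\boldsymbol{1}\|_{F}^{2}. \tag{8}$$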

Then we use the Back-Propagation (BP) algorithm to update $W_{\kappa}$ . For the sake of simplicity, we define $\boldsymbol{v}_{i}=\mathcal{F}_{\kappa}(\kappa_{i},W_{\kappa})$ and $\boldsymbol{u}_{i}=\tanh(\mathcal{F}_{\kappa}(\kappa_{i},W_{\kappa}))$ . Then we can compute the gradient of $\boldsymbol{v}_{i}$ as follows:

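$$\frac{\partial\mathcal{L}^{\kappa}}{\partial\boldsymbol{v}_{i}} = \bigg\{\Big[\cdots+\frac{\beta}{2}\big(\sigma(\Theta_{ij})\omega_{j}^{l}-S_{ij}\omega_{j}^{l}\big)\Big]+2\eta(\boldsymbol{u}_{i}-b_{i})+2\nu\boldsymbol{U}^{T}\boldsymbol{1}\bigg\}\otimes(1-\boldsymbol{u}_{i}^{2}), \tag{9}$$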

where $r_{j}^{l}$ and $\omega_{j}^{l}$ are semantic representations and Hamming representations generated from LabelNet, respectively. $\boldsymbol{U}=[\boldsymbol{u}_{1},\boldsymbol{u}_{2},...,\boldsymbol{u}_{i}]$ , symbol $\otimes$ denotes the Hadamard product. After getting the gradient $\frac{\partial\mathcal{L}^{\kappa}}{\partial\boldsymbol{v}_{i}}$ , the chain rule is used to obtain $\frac{\partial\mathcal{L}^{\kappa}}{\partial W_{\kappa}}$ , and $W_{\kappa}$ is updated by using the standard BP algorithm.
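The factor $(1-\boldsymbol{u}_{i}^{2})$ in the gradient is the derivative of $\tanh$, and it is what makes the relaxed codes trainable where $sign(\cdot)$ is not. A quick numerical check, written as a sketch with a hypothetical `numeric_grad` helper based on central differences:

```python
import math

def sign(x):
    # Convention from the notation section: sign(x) = 1 if x >= 0, else -1.
    return 1.0 if x >= 0 else -1.0

def numeric_grad(f, x, eps=1e-6):
    """Central-difference estimate of df/dx at x."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

for v in (-2.0, -0.5, 0.5, 2.0):
    u = math.tanh(v)
    # sign is flat away from 0, so its gradient carries no training signal;
    # tanh has gradient 1 - u^2, largest near 0 and vanishing as |v| grows.
    print(v, numeric_grad(sign, v), numeric_grad(math.tanh, v), 1.0 - u * u)
```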

$\boldsymbol{B^{\kappa}}$ -step: Fixing $W_{\kappa}$ to solve $B^{\kappa}$ , then the objective problem can be transformed into:

[TABLE]

where $\boldsymbol{U}=[\boldsymbol{u}_{1},\boldsymbol{u}_{2},...,\boldsymbol{u}_{n}]$ . Then (10) can be rewritten as:

[TABLE]

where $c$ is a constant and $\boldsymbol{P}=-2K\boldsymbol{S}^{T}\boldsymbol{U}-2\eta\boldsymbol{U}$ . Following [4], we can update $\boldsymbol{B}^{\kappa}$ bit by bit. In other words, we update one column of $\boldsymbol{B}^{\kappa}$ with the other columns fixed. Let $\boldsymbol{B}^{\kappa}_{*c}$ denote the $c$ -th column and $\boldsymbol{\tilde{B}}^{\kappa}_{c}$ denote the remaining columns of $\boldsymbol{B}^{\kappa}$ . Let $\boldsymbol{U}_{*c}$ denote the $c$ -th column of $\boldsymbol{U}$ and $\boldsymbol{\tilde{U}}_{c}$ denote the matrix $\boldsymbol{U}$ excluding $\boldsymbol{U}_{*c}$ . Let $\boldsymbol{P}_{*c}$ denote the $c$ -th column of $\boldsymbol{P}$ and $\boldsymbol{\tilde{P}}_{c}$ denote the remaining columns of $\boldsymbol{P}$ . Then (11) can be rewritten as:

[TABLE]

The optimal solution of (12) can be found as follows:

[TABLE]

Equation (13) can be used repeatedly until all columns are updated.
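The bit-by-bit update can be sketched as follows. Since (11)-(13) are not reproduced in this text, the sketch assumes the objective and closed-form column update of ADSH [4], namely minimizing $\|\boldsymbol{B}\boldsymbol{U}^{T}\|_{F}^{2}+\mathrm{tr}(\boldsymbol{B}\boldsymbol{P}^{T})$ with $\boldsymbol{B}_{*c}=-sign(2\boldsymbol{\tilde{B}}_{c}\boldsymbol{\tilde{U}}_{c}^{T}\boldsymbol{U}_{*c}+\boldsymbol{P}_{*c})$. The matrices are small random placeholders, and the function names are hypothetical.

```python
import random

def sign(x):
    """Sign with ties broken toward +1 so codes stay in {-1, +1}."""
    return 1 if x >= 0 else -1

def objective(B, U, P):
    """||B U^T||_F^2 + tr(B P^T) for B (n x K), U (m x K), P (n x K)."""
    total = 0.0
    for b in B:
        for u in U:
            total += sum(x * y for x, y in zip(b, u)) ** 2
    total += sum(x * p for b, prow in zip(B, P) for x, p in zip(b, prow))
    return total

def update_column(B, U, P, c):
    """Assumed ADSH-style update: B[:, c] = -sign(2 * B~ U~^T U[:, c] + P[:, c])."""
    K = len(U[0])
    others = [k for k in range(K) if k != c]
    # g[k] = sum_j U[j][k] * U[j][c], i.e. one entry of the vector U~^T U[:, c].
    g = {k: sum(u[k] * u[c] for u in U) for k in others}
    for i, b in enumerate(B):
        score = 2.0 * sum(b[k] * g[k] for k in others) + P[i][c]
        b[c] = -sign(score)

# Random placeholders standing in for the real U and P.
random.seed(0)
n, m, K = 6, 6, 4
U = [[random.uniform(-1, 1) for _ in range(K)] for _ in range(m)]
P = [[random.uniform(-5, 5) for _ in range(K)] for _ in range(n)]
B = [[random.choice([-1, 1]) for _ in range(K)] for _ in range(n)]

history = [objective(B, U, P)]
for c in range(K):
    update_column(B, U, P, c)
    history.append(objective(B, U, P))
```

With the other columns fixed, the objective is linear in the column being updated, so each update is exact and the recorded objective values never increase.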

v. Out-of-Sample Extension

Once $\boldsymbol{W}_{\kappa}$ is learned, the asymmetric hash functions corresponding to the two asymmetric ImgNets are obtained. Given a new instance $x_{q}\notin\mathcal{X}$ , we feed it directly into the ADSQ model. Each ImgNet only needs to output a $K/2$ -bit hash code, namely $b_{i}^{q}=sign(\mathcal{F}_{x}(\boldsymbol{x_{q}},W_{x}))\in\mathbb{R}^{K/2}$ and $b_{j}^{q}=sign(\mathcal{F}_{y}(\boldsymbol{x_{q}},W_{y}))\in\mathbb{R}^{K/2}$ , respectively. We then concatenate the two $K/2$ -bit binary codes to obtain the final hash code:

$$b^{q}=\left[b_{i}^{q};\ b_{j}^{q}\right]\in\mathbb{R}^{K}$$
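The out-of-sample encoding can be illustrated with a short sketch. The two input lists stand in for the real-valued outputs of the two ImgNets on a query image; their values, and the choice to map a zero output to $+1$, are assumptions made only for illustration.

```python
def sign_code(outputs):
    """Binarize real-valued network outputs to {-1, +1} (zero mapped to +1)."""
    return [1 if o >= 0 else -1 for o in outputs]

def encode_query(out_x, out_y):
    """Concatenate the two K/2-bit codes into one K-bit code."""
    return sign_code(out_x) + sign_code(out_y)

def hamming(a, b):
    """Number of positions where two codes differ."""
    return sum(1 for x, y in zip(a, b) if x != y)

# Hypothetical outputs of the two ImgNets for one query, with K = 6.
code = encode_query([0.7, -0.2, 0.1], [-0.9, 0.4, -0.3])
```

Retrieval then ranks database codes by their Hamming distance to `code`.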

IV. Experiments

In this section, we evaluate the proposed ADSQ hashing with comparisons to the state-of-the-art methods [29, 11, 16, 12, 19, 20, 39, 47, 40, 24] on three benchmark datasets.

i. Datasets and Settings

CIFAR-10 is a standard dataset containing 60,000 images in 10 categories: “truck”, “airplane”, “ship”, “automobile”, “horse”, “bird”, “cat”, “deer”, “frog”, and “dog”. We randomly selected 100 images per class as the query set (1,000 images in total) and 500 images per class as the training set (5,000 images in total). The rest of the images are used as the database.
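The CIFAR-10 split described above can be sketched as a per-class sampling routine. The function name, the seed, and the synthetic label list are hypothetical. The sketch follows the text literally and places every image that is neither a query nor a training image in the database.

```python
import random
from collections import defaultdict

def split_per_class(labels, n_query, n_train, seed=0):
    """Return (query, train, database) index lists sampled per class.

    labels: list of class ids, one per image (single-label case).
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, c in enumerate(labels):
        by_class[c].append(idx)
    query, train, database = [], [], []
    for c in sorted(by_class):
        idxs = by_class[c][:]
        rng.shuffle(idxs)
        query += idxs[:n_query]
        train += idxs[n_query:n_query + n_train]
        database += idxs[n_query + n_train:]
    return query, train, database

# CIFAR-10-sized toy labels: 10 classes x 6,000 images.
labels = [i % 10 for i in range(60000)]
query, train, database = split_per_class(labels, n_query=100, n_train=500)
```

With these sizes the three index sets are disjoint and contain 1,000, 5,000, and 54,000 images, respectively.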

NUS-WIDE [44] is a multi-label image dataset containing 269,648 images collected from Flickr.com with 81 ground truth concepts. Following [20] and [40], we keep the 21 most common classes. We select 100 images per class as the query set (2,100 images in total) and 500 images per class as the training set (10,500 images in total). The rest of the images are used as the database. Two images are treated as similar if they share at least one common label. Otherwise, they are considered dissimilar.
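The multi-label similarity rule can be written down directly. The sketch below builds a similarity matrix with entries in $\{-1,+1\}$, matching the input format of Algorithm 1; the label names are illustrative only.

```python
def similarity_matrix(labels):
    """labels: list of sets of concept names, one set per image.

    Returns S with S[i][j] = +1 if images i and j share at least one
    label, and -1 otherwise.
    """
    n = len(labels)
    return [[1 if labels[i] & labels[j] else -1 for j in range(n)]
            for i in range(n)]

# Three toy images with hypothetical concept annotations.
labels = [{"sky", "clouds"}, {"clouds", "water"}, {"person"}]
S = similarity_matrix(labels)
```

For single-label datasets such as CIFAR-10 and ImageNet, the same function applies with one-element sets.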

ImageNet [45] is a benchmark dataset containing over 1.2M images. It is a single-label dataset, where each image is labeled with one of 1,000 classes. Following [4] and [46], we randomly select 100 classes, and randomly select 50 images per class as the query set (5,000 images in total) and 100 images per class as the training set (10,000 images in total).

ii. Baselines

We compared our proposed ADSQ method with ten state-of-the-art hashing methods, covering unsupervised hashing methods, supervised hashing methods, learning based hashing methods, and a semantic supervised learning based method. The unsupervised hashing methods are SH [29] and ITQ [11]; the supervised hashing methods are SDH [16] and KSH [12]. The learning based hashing methods are DPSH [47], DHN [19], CNNH [20], DNNH [39], and DSDH [40]. The semantic supervised learning based method is DSEH [24]. We adopted $\text{DeCAF}_{7}$ features [49] for the non-deep learning based methods. For the deep learning based methods, the AlexNet [41] or CNN-F [48] network is used for comparison.

In this paper, we adopt the following metrics to measure the performance of the methods: mean Average Precision (mAP), Precision curves within Hamming distance 2 (P@H=2), Precision-Recall curves (PR), and Precision curves with different Numbers of top returned samples (P@N). For a fair comparison, we adopted mAP@5000 for the CIFAR-10 and NUS-WIDE datasets and mAP@1000 for ImageNet, as in [40].
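As a rough illustration of these metrics, the sketch below computes them for a single query from a ranked list of relevance flags. It uses one common definition of AP@N (precision averaged over the relevant items found in the top $N$) and returns zero when nothing falls inside the Hamming ball; both conventions are assumptions, since evaluation code differs across papers. mAP is the mean of the per-query AP values.

```python
def average_precision_at_n(relevance, n):
    """relevance: 0/1 flags of database items sorted by increasing Hamming distance."""
    hits, precision_sum = 0, 0.0
    for rank, rel in enumerate(relevance[:n], start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0

def precision_at_n(relevance, n):
    """Fraction of relevant items among the top-n returned results."""
    return sum(relevance[:n]) / n

def precision_within_radius(distances, relevance, radius=2):
    """Precision over items whose Hamming distance to the query is <= radius."""
    inside = [rel for d, rel in zip(distances, relevance) if d <= radius]
    return sum(inside) / len(inside) if inside else 0.0

# One toy query: five returned items, already sorted by distance.
relevance = [1, 0, 1, 1, 0]
distances = [0, 1, 2, 3, 5]

ap = average_precision_at_n(relevance, 5)
p_at_4 = precision_at_n(relevance, 4)
p_h2 = precision_within_radius(distances, relevance)
```

For this toy query, relevant items appear at ranks 1, 3, and 4, so AP@5 averages the precisions 1, 2/3, and 3/4.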

iii. Implementation Details

As shown in Figure 1, our model consists of three networks: a LabelNet and two ImgNets. We used AlexNet [41] for the two asymmetric ImgNets, and we add two further fully-connected layers (i.e., a semantic layer and a hash layer) to extract the semantic feature and to project it into the $\mathbb{R}^{K/2}$ space, respectively. We fine-tuned the convolutional layers and fully-connected layers copied from AlexNet pre-trained on ImageNet and trained the semantic layer and hash layer by back-propagation (BP). More specifically, the overall model structure contains 5 convolutional layers (i.e., “conv1”-“conv5”) and 4 fully-connected layers (i.e., “full6”-“full7”-“semantic layer”-“hash layer”). The detailed configuration of the 5 convolutional layers is shown in Table 1, where “filter size” denotes the number of convolutional filters, “stride” denotes the convolutional stride, “padding” indicates the number of pixels added to each side of the input feature, “LRN” denotes whether Local Response Normalization (LRN) [41] is applied, and “pooling” denotes the down-sampling operation. The configuration of the 4 fully-connected layers is shown in Table 2, where the numbers in the table represent the number of nodes in each layer. The LabelNet contains 3 layers, whose detailed configuration is shown in Table 3. In our proposed ADSQ method, images are fed in batches, and every two images in the same batch constitute an image pair. The parameters of the ADSQ model are learned by an alternating training strategy. We summarize the whole learning algorithm for ADSQ in Algorithm 1.

Network Parameters. In ADSQ, the hyper-parameters are set to $\alpha=\beta=1$ , $\gamma=10^{-2}$ and $\nu=\eta=10$ . Our model is implemented in PyTorch (https://pytorch.org/) on a server with NVIDIA TITAN X GPUs. The network is optimized by stochastic gradient descent with a learning rate from $10^{-5}$ to $10^{-2}$ with a multiplicative step-size of $10^{\frac{1}{2}}$ . The batch sizes of LabelNet and the two asymmetric ImgNets are set to 32, the weight decay parameter is 0.0005, and the momentum is 0.9.
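For concreteness, the learning rates implied by a range of $10^{-5}$ to $10^{-2}$ with a multiplicative step of $10^{\frac{1}{2}}$ can be enumerated as below. The text does not say whether these values form a search grid or a schedule, so the sketch only lists them; the function name is hypothetical.

```python
import math

def learning_rate_grid(low=1e-5, high=1e-2, step=math.sqrt(10)):
    """Geometric sequence from low to high with ratio `step`.

    The upper bound is inclusive, with a small tolerance for float rounding.
    """
    rates, k = [], 0
    while low * step ** k <= high * (1 + 1e-9):
        rates.append(low * step ** k)
        k += 1
    return rates

rates = learning_rate_grid()
```

This yields seven values, one for every half decade between the two endpoints.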

iv. Results and Discussions

Table 4 reports the mAP results on the CIFAR-10, NUS-WIDE, and ImageNet datasets. The length of the hash codes varies from 12 to 48 bits (i.e., 12, 24, 36, and 48). Table 4 shows that ADSQ achieves the best image retrieval accuracy and outperforms all baseline methods, including the unsupervised hashing methods, supervised hashing methods, learning based hashing methods, and semantic supervised learning based hashing methods. Specifically, on the CIFAR-10 dataset, compared to the best shallow hashing method using deep features (i.e., ITQ), ADSQ achieves an absolute increase of more than 78% in mAP. Compared to the best learning based hashing method, i.e., DSDH, ADSQ achieves an absolute increase of more than 8% in mAP. Compared to the semantic supervised learning based hashing method DSEH, ADSQ achieves an increase of more than 3% in mAP. On the multi-label dataset NUS-WIDE, ADSQ achieves absolute increases in mAP of more than 40% over the best shallow hashing method (ITQ), more than 3% over the best learning based hashing method (DSDH), and more than 1.5% over the semantic supervised hashing method (DSEH). On the large-scale dataset ImageNet, ADSQ achieves absolute increases in mAP of more than 20%, 17%, and 8% over ITQ, DSDH, and DSEH, respectively. The main difference between ADSQ and DSEH is that ADSQ utilizes semantic information to guide the asymmetric discrete learning procedure, whereas DSEH does not have an asymmetric structure to generate discriminative compact hash codes. Therefore, the results support the motivation of ADSQ, i.e., that using semantic information to guide the asymmetric discrete learning procedure can improve image retrieval performance in practical applications.

A closer analysis of Table 4 yields further insights. (1) By comparing KSH and SDH to SH, we observe that the supervised hashing methods outperform the unsupervised hashing methods because the supervised information improves performance. (2) By comparing DSDH, DPSH, DNNH, CNNH, and DHN to SDH, we find that the learning based hashing methods significantly outperform the traditional hashing methods. These results demonstrate the advantages of using a deep end-to-end learning structure. (3) By comparing the semantic supervised learning based hashing methods, i.e., ADSQ and DSEH, to the other baseline hashing methods, we find that semantic learning based deep hashing can outperform similar learning based deep hashing methods, which means that using semantic information allows better binary codes to be learned. (4) The performance of all methods keeps improving as the hash code length increases.

Figures 2(a), 3(a), 4(a) and Figures 2(b), 3(b), 4(b) report the Precision-Recall curves (PR) and the Precision curves with different Numbers of top returned samples (P@N), respectively. These metrics are widely used when deploying practical applications. The proposed ADSQ method significantly outperforms all the baseline methods it was compared to. In particular, ADSQ achieves higher precision at lower recall levels and at smaller numbers of top returned images than all compared baseline methods. This is important for precision-oriented image retrieval, where mainly the top- $N$ returned results for a small $N$ are examined. This supports the value of the ADSQ method in actual image retrieval systems.

Another important indicator is Precision within Hamming radius 2 (P@H=2), since it requires only $O(1)$ time for each query operation. As shown in Figures 2(c), 3(c), and 4(c), ADSQ achieves the highest P@H=2 results on all the datasets across the different hash code lengths. This supports the assertion that the proposed ADSQ method attains higher-quality hash codes than all baseline methods and enables more efficient and accurate Hamming space retrieval. As the length of the hash codes increases, fewer data points fall within the Hamming ball of radius 2, owing to the sparsity of the Hamming space [50]. Therefore, many learning based hashing methods achieve good image retrieval performance on short hash codes. It is worth noting that ADSQ shows only a relatively slight decrease in accuracy at longer code lengths, which indicates that ADSQ can effectively concentrate the hash codes of similar data points within Hamming radius 2.
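The sparsity argument can be quantified with a short calculation. A Hamming ball of radius 2 around a $K$-bit code contains $1+K+\binom{K}{2}$ codes, while the whole space contains $2^{K}$ codes, so the covered fraction shrinks rapidly as $K$ grows. The sketch below evaluates this for the four code lengths used in the experiments; the function names are hypothetical.

```python
from math import comb

def hamming_ball_size(n_bits, radius=2):
    """Number of n_bits-long binary codes within the given Hamming radius of a code."""
    return sum(comb(n_bits, r) for r in range(radius + 1))

def hamming_ball_fraction(n_bits, radius=2):
    """Fraction of the whole Hamming space covered by the ball."""
    return hamming_ball_size(n_bits, radius) / 2 ** n_bits

sizes = {k: hamming_ball_size(k) for k in (12, 24, 36, 48)}
fractions = {k: hamming_ball_fraction(k) for k in (12, 24, 36, 48)}
```

The ball grows only quadratically in $K$ (79 codes at 12 bits, 1,177 at 48 bits) while the space grows exponentially, which is why fewer database points land inside the radius at longer code lengths.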

v. Discussion

v.1 Ablation Study

In this section we will analyze the role of the asymmetric loss term $\mathcal{A}$ and semantic supervision $\mathcal{J}_{1}$ of (7), and the reason for choosing an asymmetric structure.

To demonstrate that the asymmetric loss term, the semantic supervision of (7), and the asymmetric structure are all necessary for ADSQ, we designed four variants of ADSQ and evaluated them on the NUS-WIDE dataset. ADSQ-$\mathcal{A}$ denotes the variant where (7) is used without the asymmetric loss term, so the loss becomes $\mathcal{L}^{\kappa}_{\mathcal{A}}=\alpha\mathcal{J}_{1}+\beta\mathcal{J}_{2}+\eta\mathcal{J}_{3}+\nu\mathcal{J}_{4}$ . ADSQ-$\mathcal{S}$ denotes the variant where (7) is used without the semantic supervision loss term, so the loss becomes $\mathcal{L}^{\kappa}_{\mathcal{S}}=\beta\mathcal{J}_{2}+\eta\mathcal{J}_{3}+\nu\mathcal{J}_{4}+\mathcal{A}$ . In the variant ADSQ-$\mathcal{AS}$ , (7) is used without both the asymmetric loss term and the semantic supervision, so the loss becomes $\mathcal{L}^{\kappa}_{\mathcal{AS}}=\beta\mathcal{J}_{2}+\eta\mathcal{J}_{3}+\nu\mathcal{J}_{4}$ . The final variant, ADSQ-$sym$ , uses a symmetric structure (i.e., the same convolutional neural network generates the compact hash codes, as in DSEH [24]). The mAP results are shown in Table 5, from which we make the following observations:
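The three loss-based variants amount to switching terms of (7) on or off, which can be sketched as a small configuration table. The weights follow the reported settings $\alpha=\beta=1$ and $\nu=\eta=10$; $\mathcal{A}$ is added with unit weight because the variant formulas above show it without a coefficient, which is an assumption about (7). The numeric term values are arbitrary placeholders, and ADSQ-$sym$ is omitted because it changes the network structure rather than the loss.

```python
# Weights per term: alpha (J1), beta (J2), eta (J3), nu (J4); A assumed unweighted.
WEIGHTS = {"J1": 1.0, "J2": 1.0, "J3": 10.0, "J4": 10.0, "A": 1.0}

# Which terms of the objective each variant keeps.
VARIANTS = {
    "ADSQ": {"J1", "J2", "J3", "J4", "A"},
    "ADSQ-A": {"J1", "J2", "J3", "J4"},
    "ADSQ-S": {"J2", "J3", "J4", "A"},
    "ADSQ-AS": {"J2", "J3", "J4"},
}

def total_loss(terms, variant):
    """Weighted sum of the loss terms that the given variant keeps."""
    return sum(WEIGHTS[name] * terms[name] for name in sorted(VARIANTS[variant]))

# Placeholder values for the individual terms on one batch.
terms = {"J1": 0.5, "J2": 0.25, "J3": 0.125, "J4": 0.0625, "A": 2.0}
losses = {name: total_loss(terms, name) for name in VARIANTS}
```

Keeping the variants in one table makes it easy to train each configuration with otherwise identical settings.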

1. ADSQ outperforms ADSQ-$\mathcal{A}$ , ADSQ-$\mathcal{S}$ , and ADSQ-$\mathcal{AS}$ in all cases on the NUS-WIDE dataset, which confirms that the asymmetric loss term $\mathcal{A}$ and the semantic supervision term $\mathcal{J}_{1}$ are necessary for ADSQ.

2. The gap between ADSQ and ADSQ-$\mathcal{A}$ is larger than that between ADSQ and ADSQ-$\mathcal{S}$ . This result demonstrates that the asymmetric loss term $\mathcal{A}$ has a greater impact on ADSQ than the semantic supervision term $\mathcal{J}_{1}$ .

3. The performance with the asymmetric structure (ADSQ) is better than with the symmetric one (ADSQ-$sym$ ). The reason is that a symmetric structure usually leads to highly correlated bits in practice, which limits image retrieval performance, whereas using the asymmetric structure to learn half of the codes in each network can reduce the correlation between hash codes and enhance the robustness of the learned hash codes.

v.2 Sensitivity Analysis

In this subsection, we analyze the impact of the hyper-parameters, i.e., $\alpha,\ \beta,\ \gamma,\ \nu$ , and $\eta$ . The experiments are conducted on the NUS-WIDE dataset. We tune one hyper-parameter at a time with the others fixed. Specifically, we tune $\alpha$ by fixing $\beta=1$ , $\gamma=10^{-2}$ and $\nu=\eta=10$ . Similarly, we fix $\alpha=1$ , $\gamma=10^{-2}$ and $\nu=\eta=10$ when tuning the value of $\beta$ , and so on. As shown in Figure 5, our model is not affected much by changes in the hyper-parameters. These results demonstrate the robustness of our proposed method.

v.3 Visualization

To better illustrate the discriminative ability of ADSQ, we use t-SNE [51] to visualize the distributions of the 48-bit hash codes learned on the ImageNet dataset by the proposed ADSQ method and by the state-of-the-art semantic supervised hashing method DSEH (we sample 10 categories for ease of visualization). It can be observed that the hash codes learned by ADSQ are more discriminative than those learned by DSEH.

V. Conclusion

In this work, we proposed a novel supervised hashing approach, dubbed Asymmetric Deep Semantic Quantization (ADSQ), for image retrieval. ADSQ consists of a LabelNet and two asymmetric ImgNets. The LabelNet is used to discover semantic information from labels, and the two asymmetric ImgNets are used to generate their respective discriminative compact hash codes. Moreover, ADSQ uses rich semantic information to guide the two ImgNets in minimizing the gap between the real-valued continuous features and the discrete binary codes. ADSQ is the first asymmetric supervised hashing method that can use the abundant semantic information generated by LabelNet to guide the discrete hash code generation of asymmetric ImgNets. Extensive experiments on the three benchmark datasets demonstrated that the proposed ADSQ achieves the best performance compared with several state-of-the-art methods. In the future, we will use two asymmetric networks with different structures to generate high-quality hash codes.

Acknowledgment

The authors would like to thank the associate editor and anonymous reviewers for their comments to improve the paper.

Bibliography

[1] A. Andoni, “Nearest Neighbor Search in High-dimensional Spaces,” in Proceedings of International Symposium Mathematical Foundations of Computer Science (MFCS), Aug. 2011, pp. 1-33.
[2] Q. Y. Jiang, X. Cui, and W. J. Li, “Deep Discrete Supervised Hashing,” IEEE Trans. Image Processing, vol. 27, no. 12, pp. 5996-6009, 2018.
[3] Z. K. Chen, F. M. Zhong, G. Y. Min, Y. L. Leng, and Y. M. Ying, “Supervised Intra- and Inter-Modality Similarity Preserving Hashing for Cross-Modal Retrieval,” IEEE Access, vol. 6, pp. 27796-27808, 2018.
[4] Q. Y. Jiang and W. J. Li, “Asymmetric Deep Supervised Hashing,” in Proceedings of the Conference on Artificial Intelligence (AAAI), Feb. 2018, pp. 3342-3349.
[5] Z. Yang, O. I. Raymond, W. Q. Sun, and J. Long, “Deep Attention-Guided Hashing,” IEEE Access, vol. 7, pp. 11209-11221, 2019.
[6] T. T. Yuan, W. H. Deng, and J. N. Hu, “Distortion Minimization Hashing,” IEEE Access, vol. 5, pp. 23425-23435, 2017.
[7] C. Deng, Z. J. Chen, X. L. Liu, X. B. Gao, and D. C. Tao, “Triplet-Based Deep Hashing Network for Cross-Modal Retrieval,” IEEE Trans. Image Processing, vol. 27, no. 8, pp. 3893-3903, 2018.
[8] H. Liu, M. B. Lin, S. C. Zhang, Y. J. Wu, F. Y. Huang, and R. R. Ji, “Dense Auto-Encoder Hashing for Robust Cross-Modality Retrieval,” in Proceedings of Conference on ACM Multimedia (ACMMM), Oct. 2018, pp. 1589-1597.