Quadruplet Selection Methods for Deep Embedding Learning
Kaan Karaman, Erhan Gundogdu, Aykut Koc, A. Aydin Alatan

TL;DR
This paper introduces a novel quadruplet selection method for deep embedding learning that leverages hierarchical labels and hard negative mining to improve fine-grained object recognition accuracy.
Contribution
It proposes a new feature selection approach for quadruplet training samples that enhances embedding quality and recognition performance in fine-grained classification tasks.
Findings
Hard negative sample selection improves recognition metrics.
The proposed method outperforms state-of-the-art approaches.
Hierarchical labels aid in effective quadruplet formation.
Table 1. Recall@K and NMI on the Stanford Cars 196 dataset [7].
| Method | R@1 | R@2 | R@4 | R@8 | NMI |
|---|---|---|---|---|---|
| Semi-Hard [18] | 51.54 | 63.78 | 73.52 | 82.41 | 55.38 |
| Lifted Structure [21] | 52.98 | 65.70 | 76.01 | 84.27 | 56.50 |
| N-Pairs [4] | 53.90 | 66.76 | 77.75 | 86.35 | 57.24 |
| Clustering [22] | 58.11 | 70.64 | 80.27 | 87.81 | 59.23 |
| Triplet Global [20] | 61.41 | 72.51 | 81.75 | 88.39 | 58.61 |
| Random Quadruplet Selection [9] | 61.49 | 73.41 | 82.88 | 89.92 | 54.50 |
| Proposed Method 1 | 64.85 | 75.59 | 83.41 | 89.55 | 57.32 |
| Proposed Method 2 | 66.06 | 76.62 | 84.84 | 90.63 | 57.00 |
Abstract
Recognition of objects with subtle differences has been used in many practical applications, such as car model recognition and maritime vessel identification. For discrimination of the objects in fine-grained detail, we focus on deep embedding learning by using a multi-task learning framework, in which the hierarchical labels (coarse and fine labels) of the samples are utilized both for classification and a quadruplet-based loss function. In order to improve the recognition strength of the learned features, we present a novel feature selection method specifically designed for four training samples of a quadruplet. By experiments, it is observed that the selection of very hard negative samples with relatively easy positive ones from the same coarse and fine classes significantly increases some performance metrics in a fine-grained dataset when compared to selecting the quadruplet samples randomly. The feature embedding learned by the proposed method achieves favorable performance against its state-of-the-art counterparts.
**Index Terms:** Deep distance metric learning, embedding learning, fine-grained classification/recognition.
This work was done when Erhan Gundogdu was with Middle East Technical University.
Copyright 2019 IEEE. Published in the IEEE 2019 International Conference on Image Processing (ICIP 2019), scheduled for 22-25 September 2019 in Taipei, Taiwan. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works, must be obtained from the IEEE. Contact: Manager, Copyrights and Permissions / IEEE Service Center / 445 Hoes Lane / P.O. Box 1331 / Piscataway, NJ 08855-1331, USA. Telephone: + Intl. 908-562-3966.
1 Introduction
Recently, embedding learning has become one of the most popular issues in machine learning [1, 2, 22]. Proper mapping from the raw data to a feature space is commonly utilized for image retrieval [4] and duplicate detection [5], which are used in many applications such as online image search.
For training a model that can extract proper features, the distance between two samples of a dataset in the feature space should be considered. Moreover, some embedding learning methods are employed to increase the classification accuracy, e.g., fine-grained object recognition [6] by using deep convolutional neural network (CNN) models which require a significant amount of training samples. Fortunately, there are datasets for various purposes such as car model recognition [7] and maritime vessel classification and identification [8]. Some of these datasets can be used for classifying land, marine, and air vehicles in a real-world scenario. Concretely, car model recognition can be employed in the context of visual surveillance and security for the land traffic control [6] and marine vessel recognition is used for the purpose of coastal surveillance [9] [10]. In this work, we focus on the feature learning problem specifically designed for car model recognition.
Recent studies on feature learning focus on extracting features from raw data such that the samples belonging to different classes are well separated and the ones from the same class are close to each other in the feature space. State-of-the-art network architectures such as VGG [11] and GoogLeNet [12] are frequently used for extracting features from images with several different training processes. In early work, pairwise similarity was used for signature verification with the contrastive loss [13]. Since considering all pairs or triplets of samples in a dataset is not computationally tractable, carefully designed mining techniques have been proposed, such as hard positive [14] and hard negative [15] mining.
Previous methods that employ a hard mining step during training focus, at each iteration of the optimization, on separating the samples within a batch selected from the dataset. Therefore, the distance relations among the samples across the whole dataset are not fully exploited. Moreover, the classification loss for the fine-grained labels is not considered in the training phase. In contrast, our proposed quadruplet selection method conveys more information from the dataset by considering the globally hard negatives and relatively easy positives in the distance loss terms and in the auxiliary classification layers.
The contributions of this work are summarized as follows: (1) In order to improve embedding learning, we have proposed two novel quadruplet selection methods where the globally hardest negative and moderately easy positive samples are selected. (2) Our framework contains a CNN trained with the combination of the classification and distance losses. These losses are designed to exploit the hierarchical labels of the training samples. (3) To test the proposed method, we have conducted experiments on the Stanford Cars 196 dataset [7] and observed that the recognition accuracy of the unobserved classes has been improved with respect to the random selection of samples in the quadruplets while outperforming the state-of-the-art feature learning methods.
2 Related Work
Earlier works on metric learning are based on Siamese networks [13]. In that study, two identical neural networks extract the features of two arbitrary images. These features are then compared by a metric based on a radial function (in [13], the distance between any two members of the feature space is defined as the cosine of the angle between them). The loss function forces samples of the same class to be close to each other under the selected distance function, while samples of different classes are forced to be mapped far from each other. The cost function of such a network is given below [16]; it is defined in terms of the distances between samples and a hinge operation that takes the maximum of zero and its argument.
[Equation (1)]
A similar approach uses triplets for training, as in [17]. Each triplet consists of three members: (1) a reference (anchor) sample, (2) a positive sample, and (3) a negative sample. The constraints of a triplet are as follows: the reference and positive samples belong to the same class, whereas the negative sample does not. For the classes to be well separated, the reference should be closer to the positive sample than to the negative sample. The selection of triplets is known to be an important issue for convergence [17], and some of the existing studies indicate that selecting the samples randomly reduces the efficiency of training. A recent study [15] proposes hard negative mining, which emphasizes that selecting a negative sample close to the reference increases the separation performance in the feature space. On the other hand, hard positive mining is also suggested to enhance the performance by selecting a positive sample far from the reference [14]. Moreover, hard negative and positive mining methods are also used for face recognition [18]. For triplet-based approaches [19], the following function is utilized, where the distances between feature vectors are defined by a norm and a margin separates the two distances:
[Equation (2)]
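The triplet constraint described above can be illustrated with a minimal sketch. It assumes a Euclidean distance and an unsquared hinge. The function names and the default margin are hypothetical, and the exact form used in [19] may differ (for instance, squared distances).

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_hinge_loss(ref, pos, neg, margin=1.0):
    """Zero once the negative is farther from the reference than the
    positive by at least `margin`; positive otherwise.
    Assumption: unsquared Euclidean distances."""
    return max(0.0, euclidean(ref, pos) - euclidean(ref, neg) + margin)


ref, pos, neg = [0.0, 0.0], [1.0, 0.0], [3.0, 0.0]
print(triplet_hinge_loss(ref, pos, neg))  # constraint satisfied
print(triplet_hinge_loss(ref, neg, pos))  # roles swapped, constraint violated
```

The loss vanishes for triplets that already satisfy the margin, which is why randomly chosen (mostly easy) triplets contribute little to training.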
Another approach is to utilize the hierarchical class labels of the training samples [6]. In that method, samples with similar fine labels share the same coarse label, i.e. each sample has more than one label. The cost function is modified by considering both the coarse and fine labels. For this purpose, each quadruplet is constructed from four members: (1) a reference (anchor) sample, (2) a positive positive sample, (3) a positive negative sample, and (4) a negative sample. Similar to triplet selection, the quadruplets must satisfy three constraints. First, the reference and the positive positive sample share both the coarse and the fine class. Second, the positive negative sample has the same coarse class as the reference but a different fine class. Finally, the negative sample and the reference have different coarse classes.
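The three label constraints on a quadruplet can be written as a small check. This is only an illustrative sketch: the `Sample` type and the function name are hypothetical and carry just the two labels.

```python
from typing import NamedTuple


class Sample(NamedTuple):
    """Hypothetical container holding only the hierarchical labels."""
    coarse: int
    fine: int


def is_valid_quadruplet(ref, pos_pos, pos_neg, neg):
    """Check the three label constraints of a quadruplet."""
    # 1) positive positive: same coarse and same fine class as the reference
    same_fine = pos_pos.coarse == ref.coarse and pos_pos.fine == ref.fine
    # 2) positive negative: same coarse class, different fine class
    same_coarse_only = pos_neg.coarse == ref.coarse and pos_neg.fine != ref.fine
    # 3) negative: different coarse class
    different_coarse = neg.coarse != ref.coarse
    return same_fine and same_coarse_only and different_coarse


print(is_valid_quadruplet(Sample(0, 3), Sample(0, 3), Sample(0, 5), Sample(1, 9)))
```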
Moreover, the loss function for the quadruplets is similar to that of the triplet-based methods [6]. On the other hand, the use of the global loss has been proposed in [9], while the quadruplet samples are selected randomly (note that these quadruplets satisfy the constraints above). The global loss penalizes the network when the mean and variance of the distances between the samples in a quadruplet are not appropriate, as given in (3), where the quantities are defined as in [20] and the margins play a role similar to the margin in (2).
[Equation (3)]
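The general idea of penalizing batch statistics of the distances can be sketched as follows. This is not the paper's equation (3), which uses quadruplets and two margins. It is a simplified two-group variant in the spirit of the global loss, and the function name, the single `margin`, and the `weight` parameter are all assumptions.

```python
from statistics import mean, pvariance


def global_statistics_penalty(pos_dists, neg_dists, margin=1.0, weight=1.0):
    """Penalize (a) the spread of each group of distances and
    (b) a mean positive distance that is not smaller than the mean
    negative distance by at least `margin`.
    Assumption: a simplified two-group form, not the paper's equation."""
    variance_term = pvariance(pos_dists) + pvariance(neg_dists)
    mean_term = max(0.0, mean(pos_dists) - mean(neg_dists) + margin)
    return variance_term + weight * mean_term


# Tight, well-separated groups incur no penalty.
print(global_statistics_penalty([1.0, 1.0, 1.0], [3.0, 3.0, 3.0]))
# Overlapping groups are penalized through both terms.
print(global_statistics_penalty([1.0, 3.0], [2.0, 2.0]))
```

Unlike a per-sample hinge, such a term looks at all distances in the batch at once, which matches the batch-wise regularization role the text assigns to the global loss.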
In [6], the hierarchical labels of the training samples are utilized. It should be noted that a model has difficulty in convergence when the samples are selected randomly since the most informative pairs are not effectively considered. Here, we propose two methods for sample selection to address this issue.
3 Proposed Method
Each member of a quadruplet is represented by an image together with its coarse and fine class labels. The image is given as the vector of its pixels, and the coarse and fine labels index the sets of coarse and fine classes, respectively. The CNN, parameterized by its weights, maps this pixel vector to a point in the feature space, whose dimension is fixed by the network.
Our proposed cost function consists of two parts: the classification (Section 3.1) and distance (Section 3.2) cost functions. The aim of these cost functions is to shape the feature space so that the fine classes are well separated. However, the learning process depends heavily on the selection of the quadruplets, and training takes longer when the quadruplets are selected with a poor strategy. In Section 3.3, we propose to select the members of the quadruplets from the most informative region of the feature space. As validated by the experiments (Section 4), the proposed method improves Recall@1 and NMI over random quadruplet selection, as can be observed in Table 1.
3.1 Classification Cost Function
In order to increase the discriminativeness of the features for the available class labels, the softmax loss is employed. Unlike the traditional setting, the proposed neural network has two outputs, one dedicated to the coarse classes and one to the fine classes. The proposed cost function is then obtained as:
[Equation (4)]
In this cost function, the coarse and fine class labels of a sample act as targets. The target probability that a sample belongs to a given coarse class is assigned by hard decision: it is one for the labeled coarse class of the sample and zero otherwise, which is expressed with the Kronecker delta function. The target probabilities for the fine classes are calculated in the same way. The predicted probabilities are computed from the elements of two score vectors, one for the coarse classes and one for the fine classes. Two weights balance the fine and coarse classification terms of the cost function.
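A two-output classification cost of this kind can be sketched as a weighted sum of two softmax cross-entropy terms. The sketch below handles a single sample; the function names and the weight parameters `w_coarse` and `w_fine` are hypothetical, and the paper's own equation may aggregate over the quadruplet members differently.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(scores, label):
    """Softmax cross-entropy with a one-hot (hard decision) target."""
    return -math.log(softmax(scores)[label])


def classification_cost(coarse_scores, fine_scores, coarse_label, fine_label,
                        w_coarse=1.0, w_fine=1.0):
    """Weighted sum of the coarse-head and fine-head losses (assumed form)."""
    return (w_coarse * cross_entropy(coarse_scores, coarse_label)
            + w_fine * cross_entropy(fine_scores, fine_label))


# Uninformative scores over 2 coarse and 4 fine classes: log(2) + log(4).
print(classification_cost([0.0, 0.0], [0.0, 0.0, 0.0, 0.0], 1, 2))
```

Because the targets are one-hot, each cross-entropy reduces to the negative log-probability of the labeled class, which is what the Kronecker delta in the text selects.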
3.2 Distance Cost Function
The distances between the samples in the feature space are commonly defined by a radial function [17]. For this reason, the representations learned by our proposed framework are fixed-length feature vectors, and the distance between any two members is defined by a norm. Our goal can then be formulated as a chain of inequalities: the reference should be closer to the positive positive sample than to the positive negative sample, and closer to the positive negative sample than to the negative sample. Each part of this chain is enforced with its own margin, which should be a positive number. Moreover, we emphasize the discrimination of the coarse classes with an additional condition. Then, the new cost function can be proposed as:
[Equation (5)]
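The two-part ordering of distances can be sketched with one hinge term per part. This is an assumed form, not the paper's equation (5): the Euclidean distance, the margin names `m1` and `m2`, and the function name are hypothetical, and the additional coarse-class condition mentioned in the text is left out because its exact form is not given here.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def quadruplet_distance_cost(ref, pos_pos, pos_neg, neg, m1=1.0, m2=1.0):
    """One hinge per part of the ordering
    d(ref, pos_pos) + m1 <= d(ref, pos_neg) and
    d(ref, pos_neg) + m2 <= d(ref, neg)."""
    d_pp = euclidean(ref, pos_pos)
    d_pn = euclidean(ref, pos_neg)
    d_n = euclidean(ref, neg)
    fine_term = max(0.0, d_pp - d_pn + m1)    # separates fine classes
    coarse_term = max(0.0, d_pn - d_n + m2)   # separates coarse classes
    return fine_term + coarse_term


# Correctly ordered quadruplet: no cost.
print(quadruplet_distance_cost([0.0], [1.0], [3.0], [6.0]))
# Misordered quadruplet: both terms are active.
print(quadruplet_distance_cost([0.0], [4.0], [3.0], [3.5]))
```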
Finally, the overall proposed network is shown in Figure 1 with the loss function given in (6). This loss function, which combines (5) and (3), considers the distances of the samples in the feature space through the distance cost in (5), while the global loss in (3) regularizes the statistics of the distances batch-wise.
[Equation (6)]
3.3 Quadruplet Selection
In the previous section, we have briefly summarized our novel loss function. As mentioned before, selecting the quadruplet samples randomly makes it difficult to exploit the most informative training examples. Instead of attempting to cover all the quadruplet combinations in the training set, we propose two novel selection strategies. First, a reference sample is selected uniformly at random from the training set, together with its coarse and fine labels. The negative sample is selected from the set of samples belonging to different coarse classes. The critical point is that, like hard negative mining in [15], we select the negative sample that is closest to the reference. At this point, we propose two different methods for the selection of the positive positive and positive negative samples. The experimental comparison of these two methods is given in Section 4.
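The first two steps, drawing a reference and picking its hardest negative, can be sketched as below. The sketch assumes that feature vectors for the whole training set have already been computed with the current network, and that distances are Euclidean. The function names and the toy data are hypothetical.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def hardest_negative(ref_idx, features, coarse_labels):
    """Index of the sample closest to the reference among all samples
    whose coarse class differs from that of the reference."""
    candidates = [i for i in range(len(features))
                  if coarse_labels[i] != coarse_labels[ref_idx]]
    return min(candidates,
               key=lambda i: euclidean(features[ref_idx], features[i]))


features = [[0.0, 0.0], [1.0, 0.0], [5.0, 0.0], [2.0, 0.0]]
coarse_labels = [0, 0, 1, 1]

ref_idx = random.randrange(len(features))  # uniform draw of the reference
neg_idx = hardest_negative(ref_idx, features, coarse_labels)
print(ref_idx, neg_idx)
```

Searching the whole training set rather than a single batch is what makes the negative "globally" hard. In practice the stored features would need to be refreshed as the network weights change, a detail this sketch does not cover.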
3.3.1 Method 1
For determining , we select the sample whose fine class is the same as the fine class of , and which is closest to . At this point, the constraint for selection of is as follows: the distance between and is greater than the distance between and (). Similarly, we select whose coarse class is the same as the coarse class of , which is the closest sample to , and also satisfying . This method is visualized in Figure 2.
3.3.2 Method 2
In the second method, after the negative sample is selected, the distance between the reference and the negative sample determines a hyper-sphere centered at the reference. After the labels of the positive positive and positive negative samples are fixed according to the constraints in Section 2, both samples are selected from the corresponding classes such that they are the closest points to the reference that lie outside the region enclosed by this hyper-sphere. If no sample of the required class lies outside the hyper-sphere, then the sample inside the hyper-sphere that is furthest from the reference is selected. This selection method is illustrated in Figure 2.
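The hyper-sphere rule of the second method can be sketched as follows. The sketch assumes Euclidean distances over precomputed features and at least one candidate in each required class. All function names and the toy data are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def select_outside_sphere(ref_idx, radius, candidates, features):
    """Closest candidate to the reference lying outside the sphere;
    if none lies outside, the furthest candidate inside it."""
    def dist(i):
        return euclidean(features[ref_idx], features[i])
    outside = [i for i in candidates if dist(i) > radius]
    if outside:
        return min(outside, key=dist)
    return max(candidates, key=dist)


def method2_quadruplet(ref_idx, features, coarse, fine):
    """Return (ref, pos_pos, pos_neg, neg) indices for one reference."""
    n = len(features)
    negatives = [i for i in range(n) if coarse[i] != coarse[ref_idx]]
    neg = min(negatives,
              key=lambda i: euclidean(features[ref_idx], features[i]))
    radius = euclidean(features[ref_idx], features[neg])
    pp_cands = [i for i in range(n) if i != ref_idx
                and coarse[i] == coarse[ref_idx] and fine[i] == fine[ref_idx]]
    pn_cands = [i for i in range(n)
                if coarse[i] == coarse[ref_idx] and fine[i] != fine[ref_idx]]
    pos_pos = select_outside_sphere(ref_idx, radius, pp_cands, features)
    pos_neg = select_outside_sphere(ref_idx, radius, pn_cands, features)
    return ref_idx, pos_pos, pos_neg, neg


features = [[0.0], [1.0], [3.0], [4.0], [2.5], [0.5], [2.0], [6.0]]
coarse = [0, 0, 0, 0, 0, 0, 1, 1]
fine = [0, 0, 0, 0, 1, 1, 2, 2]
print(method2_quadruplet(0, features, coarse, fine))
```

In this toy example the hardest negative sits at distance 2.0 from the reference, so the selected positives are the same-class samples just beyond that radius, not the nearest ones.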
4 Results
We compare the performance of our proposed method against the state-of-the-art feature learning approaches in [18, 21, 4, 22, 20] by using the same evaluation methods. In addition, the randomly selected quadruplets are utilized as in [9]. Stanford Cars 196 dataset [7] is used in the experiments. To implement the proposed methods, a hierarchical structure is required for all the samples in the dataset, where each sample originally has only one label. For this purpose, we should add the high-level classes (coarse labels) to the dataset. In other words, the classes, which are originally in the dataset, are taken as the fine classes and coarse classes are added using the types of the cars, similar to the study in [6].
The important point in the generation of the training and test sets is that they should not share any fine class labels. With this restriction, we measure the ability of our neural network to separate classes that have not been seen before. The most common performance measures for zero-shot learning are Recall@K and NMI. Recall@K specifies whether the samples belonging to the same fine class are close to each other, and NMI is a measure of clustering quality, as mentioned in [22].
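Recall@K can be computed as in the sketch below. It assumes the definition commonly used in the deep metric learning literature (a query counts as a hit if any of its K nearest neighbors, excluding itself, shares its fine class), which the text does not spell out. The function names and the toy data are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def recall_at_k(features, fine_labels, k):
    """Fraction of queries with at least one same-class sample among
    their k nearest neighbors (the query itself is excluded)."""
    n = len(features)
    hits = 0
    for q in range(n):
        others = [i for i in range(n) if i != q]
        others.sort(key=lambda i: euclidean(features[q], features[i]))
        if any(fine_labels[i] == fine_labels[q] for i in others[:k]):
            hits += 1
    return hits / n


features = [[0.0], [1.0], [10.0], [11.0], [4.0]]
fine_labels = [0, 0, 1, 1, 1]
print(recall_at_k(features, fine_labels, 1))
print(recall_at_k(features, fine_labels, 3))
```

The last sample lies closer to the first cluster than to its own class, so it only becomes a hit once K is large enough to reach a same-class neighbor.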
For this purpose, the first fine classes of the dataset are selected as the training set, and the rest are used only as the test set similar to the study in [1]. In our experimental setup, the pre-trained ResNet101 model [23] (that has been trained using the ImageNet dataset [24]) is employed as our CNN model to extract the features. The experiments are performed on Pytorch platform [25]. In addition, the hyper-parameters of the cost function are selected as for , for ; for , , and . The margins are for , and ; for , and . The learning parameters are as follows: the learning rate is , the momentum is , and stochastic gradient descent algorithm is used for optimization. The results can be examined in Table 1.
The proposed quadruplet-based learning framework improves Recall@K even when the quadruplets are selected randomly. According to the Recall@K metric, random quadruplet selection outperforms the previous studies in [18, 21, 4, 22], and it is comparable to the study in [20]. On top of that, even higher accuracy is obtained when the proposed selection methods are used. As shown in Table 1, Method 1 reaches a Recall@1 of 64.85 and Method 2 reaches 66.06, compared with 61.49 for random quadruplet selection and 61.41 for [20].
5 Conclusion
We have demonstrated that the proposed selection methods significantly increase the separation ability of a model in terms of recall performance. Unlike previous studies, which consider only some of the pairwise distances among the selected samples, the proposed methods also consider an additional pairwise distance in the feature space. This consideration helps us improve the model and achieve better accuracy. The two proposed selection methods allow the loss function not only to enlarge the margins between samples of different classes but also to create several tight clusters for each class. Moreover, both methods have the advantage that they pay attention to the samples in the region around the critical hyper-sphere. In particular, the second method attacks the easier problem: while the first method can reshape only a particular region of the feature space, the second one can use the whole region on the surface of the hyper-sphere. Therefore, the feature space is manipulated through a better optimization procedure.
References
- [1] V. B. G. Kumar, B. Harwood, G. Carneiro, I. Reid, and T. Drummond, “Smart mining for deep metric learning,” arXiv preprint arXiv:1704.01285, 2017.
- [2] Y. Movshovitz-Attias, A. Toshev, T. K. Leung, S. Ioffe, and S. Singh, “No fuss distance metric learning using proxies,” arXiv preprint arXiv:1703.07464, 2017.
- [3] H. Oh Song, S. Jegelka, V. Rathod, and K. Murphy, “Deep metric learning via facility location,” in Computer Vision and Pattern Recognition (CVPR), 2017.
- [4] K. Sohn, “Improved deep metric learning with multi-class n-pair loss objective,” in Advances in Neural Information Processing Systems, 2016, pp. 1857–1865.
- [5] S. Zheng, Y. Song, T. Leung, and I. Goodfellow, “Improving the robustness of deep neural networks via stability training,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4480–4488.
- [6] X. Zhang, F. Zhou, Y. Lin, and S. Zhang, “Embedding label structures for fine-grained feature representation,” in Computer Vision and Pattern Recognition (CVPR), 2016 IEEE Conference on. IEEE, 2016, pp. 1114–1123.
- [7] J. Krause, J. Deng, M. Stark, and L. Fei-Fei, “Collecting a large-scale dataset of fine-grained cars,” 2013.
- [8] E. Gundogdu, B. Solmaz, V. Yücesoy, and A. Koc, “Marvel: A large-scale image dataset for maritime vessels,” in Asian Conference on Computer Vision. Springer, 2016, pp. 165–180.
