Cluster Loss for Person Re-Identification

Doney Alex; Zishan Sami; Sumandeep Banerjee; Subrat Panda

arXiv:1812.10325 · cs.CV · December 27, 2018

TL;DR

This paper introduces a cluster loss function for person re-identification that enhances clustering performance by increasing inter-class variation and reducing intra-class variation, outperforming triplet loss.

Contribution

The paper proposes a novel cluster loss and a batch hard training mechanism to improve person ReID accuracy and generalization in clustering tasks.

Findings

01. Cluster loss yields larger inter-class and smaller intra-class variations.

02. The method achieves higher accuracy on test sets compared to triplet loss.

03. Batch hard training accelerates convergence and improves results.

Abstract

Person re-identification (ReID) is an important problem in computer vision, especially for video surveillance applications. The problem focuses on identifying people across different cameras or across different frames of the same camera. The main challenge lies in identifying the similarity of the same person against large appearance and structure variations, while differentiating between individuals. Recently, deep learning networks with triplet loss have become a common framework for person ReID. However, triplet loss focuses on obtaining correct orders on the training set. We demonstrate that it performs inferior in a clustering task. In this paper, we design a cluster loss, which can lead to the model output with a larger inter-class variation and a smaller intra-class variation compared to the triplet loss. As a result, our model has a better generalization ability and can achieve…

Equations

$$\| f(x_{i}^{a}) - f(x_{i}^{p}) \|_{2}^{2} + \alpha < \| f(x_{i}^{a}) - f(x_{i}^{n}) \|_{2}^{2}$$

$$L_{trp} = \sum_{i}^{N} \left[ \| f(x_{i}^{a}) - f(x_{i}^{p}) \|_{2}^{2} - \| f(x_{i}^{a}) - f(x_{i}^{n}) \|_{2}^{2} + \alpha \right]$$

$$f_{i}^{m} = \frac{\sum^{K} f(x)}{K}$$

$$d_{i}^{intra} = \sum_{k} \| f(x) - f_{i}^{m} \|_{2}^{2}$$

$$d_{i}^{inter} = \sum_{\forall i_{d} \in P,\ i_{d} \neq i} \| f_{i}^{m} - f_{i_{d}}^{m} \|_{2}^{2}$$

$$L_{c} = \frac{\beta \sum_{i}^{P} d_{i}^{intra}}{\gamma + \sum_{i}^{P} d_{i}^{inter}}$$

$$d_{i}^{intra} = \max_{K} \| f(x) - f_{i}^{m} \|_{2}^{2}$$

$$d_{i}^{inter} = \min_{\forall i_{d} \in P,\ i_{d} \neq i} \| f_{i}^{m} - f_{i_{d}}^{m} \|_{2}^{2}$$

$$L_{bc} = \sum_{i}^{P} \max \left( \left( d_{i}^{intra} - d_{i}^{inter} + \alpha \right), 0 \right)$$

Taxonomy

Methods: Triplet Loss

Full text

Cluster Loss for Person Re-Identification

Doney Alex (https://orcid.org/0000-0002-7848-5461), Capillary Technologies, P.O. Box 560068

Zishan Sami, Indian Institute of Technology Kharagpur

Sumandeep Banerjee, Capillary Technologies, P.O. Box 560068

Subrat Panda, Capillary Technologies, P.O. Box 560068

(2018)

Abstract.

Person re-identification (ReID) is an important problem in computer vision, especially for video surveillance applications. The problem focuses on identifying people across different cameras or across different frames of the same camera. The main challenge lies in identifying the similarity of the same person against large appearance and structure variations, while differentiating between individuals. Recently, deep learning networks with triplet loss have become a common framework for person ReID. However, triplet loss focuses on obtaining correct orders on the training set. We demonstrate that it performs poorly in a clustering task. In this paper, we design a cluster loss, which can lead to the model output with a larger inter-class variation and a smaller intra-class variation compared to the triplet loss. As a result, our model has a better generalisation ability and can achieve a higher accuracy on the test set, especially for a clustering task. We also introduce a batch hard training mechanism that improves the results and speeds up the convergence of training.

Published in: 11th Indian Conference on Computer Vision, Graphics and Image Processing (ICVGIP 2018), December 18–22, 2018, Hyderabad, India. Copyright: ACM, 2018. DOI: 10.1145/3293353.3293396. ISBN: 978-1-4503-6615-1/18/12. CCS concepts: Computing methodologies: Matching; Biometrics; Neural networks.

1. Introduction

Person re-identification (ReID) is an important problem in computer vision, especially for video surveillance applications. Major challenges include variations in lighting conditions, poses, viewpoints, blurring effects, image resolutions, camera settings, occlusions and background. The person ReID task is similar to image retrieval or face recognition in many ways. With advancements in deep learning, significant improvements have been made in image retrieval. Many works in person ReID were motivated by face recognition. One example, on which many person ReID methods are based, is FaceNet (Schroff et al., 2015), a convolutional neural network (CNN) used to learn an embedding for faces. The key component of FaceNet is the triplet loss, as introduced by Weinberger and Saul (Weinberger and Saul, 2009), which is used to train the CNN as an embedding function. The triplet loss optimizes the embedding space such that data points with the same identity are closer to each other than those with different identities.

Even though a variety of loss functions are in use, such as classification loss, combined with verification loss in some cases (Chen et al., 2017b; Geng et al., 2016; Zheng et al., 2016b; Li et al., 2017b), or other losses like DeepLDA (Wu et al., 2017), triplet loss and its variations (Khamis et al., 2015; Ding et al., 2015; Paisitkriangkrai et al., 2015; Cheng et al., 2016; Wang et al., 2016; Shi et al., 2016; Su et al., 2016; Liu et al., 2016; Chen et al., 2017b; Liu et al., 2017; Hermans* et al., 2017; Chen et al., 2017a) seem to be the most common and successful approach. The Cumulative Matching Characteristic curve, which follows rank-n criteria, is the most common method (Karanam et al., 2016; Hirzer et al., 2012a; Zheng et al., 2016a) used for performance evaluation of person ReID. Recent deep learning approaches (Chen et al., 2016; Cheng et al., 2016; Su et al., 2016; Wang et al., 2016; Ding et al., 2015; Hermans* et al., 2017) usually treat person ReID as a ranking task and apply a triplet loss to address the problem. The main purpose of the triplet loss, which is motivated in the context of nearest-neighbour classification (Schroff et al., 2015; Weinberger and Saul, 2009), is to obtain a correct order for each probe image and distinguish identities in the projected space. However, these methods seem to perform poorly in clustering tasks. The underlying reason is that a model trained with a triplet loss can still produce a relatively large intra-class variation (Cheng et al., 2016; Wen et al., 2016).

In this paper we introduce cluster loss, motivated by Linear Discriminant Analysis and K-Means clustering. While triplet loss tries to minimize the distance between similar images, our cluster loss tries to minimize the distance between images and the mean of their class, and to maximize the distance between the means of different classes. As a result, all images of the same identity come together to form a cluster, and the clusters stay separated. Hence our model is capable of achieving a smaller intra-class variation and a larger inter-class variation, with significant performance on the test set.

Many recent deep learning approaches treat person ReID as a ranking task and use rank-n criteria for performance evaluation. Clustering is also an important application of person ReID. For example, when a continuous feed from a surveillance camera captures a moving person in multiple frames, images belonging to the same identity need to be grouped (as shown in Fig 1) before any analysis or recognition can be done. Therefore we wanted to evaluate the output of a person ReID network using clustering algorithms. We used a simple sequential clustering (explained in Section 4.2) to evaluate the performance in a clustering scenario. We observed that our method outperforms the existing methods for person ReID in a clustering task by a large margin.

2. Related Work

Most developments in the person ReID problem concentrate on feature extraction and similarity measurement. Traditional feature extraction techniques largely use colour histograms, local binary patterns, texture filters etc. Gray and Tao (Gray and Tao, 2008) use 8 colour channels (RGB, HS, and YCbCr) and 21 texture filters on the luminance channel, and the pedestrian is partitioned into horizontal stripes. A number of later works (Prosser et al., 2010; Zheng et al., 2013; Ma et al., 2013) employ the same set of features as (Gray and Tao, 2008). Similarly, Mignon et al. (Mignon and Jurie, 2012) built the feature vector from RGB, YUV and HSV channels and the LBP texture histograms in horizontal stripes. Most hand-crafted features rely on colour histograms and texture filters, but there are works which use more complex features like SIFT (Zhao et al., 2014) or the local maximal occurrence (LOMO) descriptor (Liao et al., 2015), which includes the colour and SILTP histograms. Another choice is attribute-based features, which are more robust to image translations compared to low-level descriptors. Low-level features like colour, texture or category labels are used to train the attribute classifiers (Farenzena et al., 2010; Shi et al., 2015).

In a ReID system with hand-crafted features, a good distance metric is critical for success, because the high-dimensional visual features may not capture the invariant factors under sample variances. In person ReID many works fall into the scope of supervised global distance metric learning. The task of global metric learning is to keep all the vectors of the same class closer while pushing vectors of different classes further apart. The most commonly used formulation is based on the class of Mahalanobis distance functions and its modifications (Hirzer et al., 2012b; Chen et al., 2015), which generalize Euclidean distance using linear scalings and rotations of the feature space. One popular metric learning method is KISSME (Köstinger et al., 2012), which is based on Mahalanobis distance and formulates the decision on whether a pair is similar or not as a likelihood ratio test. Apart from the methods that use Mahalanobis distance, some use other learning tools such as support vector machines (SVM) or boosting. In (Liu et al., 2015), a structural SVM is employed to combine different colour descriptors at decision level, and in (Zhang et al., 2016a), a specific SVM is learned for each training identity and each testing image is mapped to a weight vector inferred from its visual features. Gray et al. propose using the AdaBoost algorithm to select and combine many different kinds of simple features into a single similarity function in (Gray and Tao, 2008).

In traditional methods, feature extraction and similarity measurement are treated independently, which is why those methods could not reach the performance level of CNN-based systems, where the end-to-end system can be globally optimized via back-propagation. The major bottleneck of deep learning methods in ReID was the lack of training data. With the advancement of deep learning in almost all fields and the increasing availability of datasets, CNN-based methods which automatically learn features and metrics became common in ReID, and handcrafted features and metrics now struggle to remain competitive, especially on large-scale datasets.

Most CNN-based ReID methods focus on the Siamese model. In (Yi et al., 2014), an input image is partitioned into three overlapping horizontal parts, and the parts go through two convolutional layers and a fully connected layer which fuses them and outputs a vector for the image; the similarity of the two output vectors is computed using the cosine distance. There are many modified versions of the Siamese model, like (Ahmed et al., 2015), in which cross-input neighbourhood difference features are computed, comparing the features from one input image to features in neighbouring locations of the other image, or like (Li et al., 2014), which uses a product to compute patch similarity at similar latitudes. Meanwhile, there are methods (Li et al., 2014; Ahmed et al., 2015; Wu et al., 2016) which tackle the person ReID problem using a classification/identification mode, which makes full use of the re-ID labels. In (Xiao et al., 2016), training identities from multiple datasets jointly form the training set and a softmax loss is employed in the classification network. Some of them use a softmax layer with the cross-entropy loss in their networks (Li et al., 2014; Wu et al., 2016). The cross-entropy loss can represent well the probability that the two images in the pair are of the same person or not. Some other methods use a margin-based loss (Wang et al., 2016), which builds a margin to maintain the largest separation between positive and negative pairs. For instance, Varior et al. (Varior et al., 2016) incorporate long short-term memory (LSTM) modules into a Siamese network. LSTMs process image parts sequentially so that the spatial connections can be memorized to enhance the discriminative ability of the deep features.

While works based on Siamese networks use image pairs, Cheng et al. (Cheng et al., 2016) design a triplet loss function that takes three images as input. A drawback of the Siamese model with triplet loss is that it does not make full use of ReID annotations. These models only need to consider pairwise (or triplet) labels. Telling whether an image pair is similar (belongs to the same identity) or not is a weak label in ReID. Sometimes triplet loss based networks may produce disappointing results, especially when applied naively. An essential part of learning using the triplet loss is the mining of hard triplets, as otherwise training will quickly stagnate. However, mining such hard triplets is time consuming and it is unclear what defines "good" hard triplets (Shi et al., 2016). Even worse, selecting too many hard triplets too often makes the training unstable. Another major caveat of the triplet loss is that as the dataset gets larger, the possible number of triplets grows cubically, rendering a long enough training impractical. As training progresses, the transformation relatively quickly learns to correctly map most trivial triplets, rendering a large fraction of all triplets uninformative. Hence we introduce cluster loss, which tries to minimise not just the distance between similar pairs but the distance between all similar images and their mean, and increases the distance between the means, thereby making sure that each unique cluster stays apart.
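To give a feel for how quickly the number of candidate triplets grows, the following sketch counts them for a set of identities. It assumes, purely for illustration, that every identity has the same number of images; the function name `count_triplets` is hypothetical and not from the paper. With N = PK images the count is N(K-1)(N-K), which is bounded by N^3.

```python
def count_triplets(num_identities: int, images_per_identity: int) -> int:
    """Count valid (anchor, positive, negative) triplets.

    Assumption: every identity has the same number of images, which real
    datasets do not guarantee. This is an order-of-magnitude sketch only.
    """
    p, k = num_identities, images_per_identity
    anchors = p * k            # every image can serve as an anchor
    positives = k - 1          # other images of the anchor's identity
    negatives = (p - 1) * k    # images of all other identities
    return anchors * positives * negatives


# A single batch of 16 identities with 16 images each already contains
# close to a million candidate triplets.
print(count_triplets(16, 16))  # 921600
```

Most of these triplets are easy, which is why the text stresses mining hard ones.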

3. The proposed approach

We strive for an embedding $f(x)$ , from an image $x$ into a feature space $R^{d}$ , a $d$ -dimensional Euclidean space, such that the squared distance between all person images, independent of imaging conditions, belonging to the same identity is small to form a cluster and the squared distance between clusters is large. The triplet loss is motivated in the context of nearest-neighbour classification. We introduce our clustering loss taking motivation from K-Means clustering and Linear Discriminant Analysis.

3.1. Network Architecture

We use the ResNet-50 architecture for the convolutional layers, similar to that used in (Hermans* et al., 2017). We experimented with other networks like VGG (Liu and Deng, 2015) and GoogLeNet (Szegedy et al., 2015), but the results were similar to those of ResNet-50. ResNet-50 was chosen because it is computationally less demanding compared to other deeper networks like VGG and GoogLeNet. In ResNet-50, the last layer is discarded and we add two fully connected layers for our task, as shown in Fig 2. The first has 1024 units, followed by batch normalization (Ioffe and Szegedy, 2015) and ReLU (Glorot et al., 2011); the second goes down to 128 units, our final embedding dimension. The network has about 25.74 M parameters. The batch size is limited to 256, containing P = 16 persons with K = 16 images each. We chose a learning rate of $\epsilon_{0}=3\times 10^{-5}$ , with learning rate decay starting after 25000 iterations for a total of 50000 iterations, and the Adam optimizer (Kingma and Ba, 2014) with the default hyper-parameter values ( $\epsilon=\epsilon_{0},\beta_{1}=0.9,\beta_{2}=0.999$ ) for the experiments. We performed all our experiments using the TensorFlow (Abadi et al., 2016) framework.
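As a sanity check on the stated model size, the arithmetic below adds the parameters of the two new layers to a ResNet-50 backbone. The backbone figure of 23,508,032 is an assumption (a commonly reported count for ResNet-50 without its final classification layer); the paper only gives the total. Under that assumption the sum agrees with the roughly 25.74 M quoted above.

```python
# Assumption: parameter count of ResNet-50 without its final classification
# layer. The paper does not state this number.
BACKBONE_PARAMS = 23_508_032
BACKBONE_FEATURES = 2048  # width of the pooled ResNet-50 feature vector


def dense_params(n_in: int, n_out: int) -> int:
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


fc1 = dense_params(BACKBONE_FEATURES, 1024)  # first added layer, 1024 units
bn = 2 * 1024                                # batch-norm scale and shift
fc2 = dense_params(1024, 128)                # 128-dimensional embedding layer
total = BACKBONE_PARAMS + fc1 + bn + fc2

print(f"added head parameters: {fc1 + bn + fc2:,}")
print(f"total parameters: {total / 1e6:.2f} M")

# Batch composition described in the text: P identities, K images each.
P, K = 16, 16
print("images per batch:", P * K)
```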

3.2. Loss Function

We use Euclidean distance as the metric for separation between two samples in the transformed space $R^{d}$ . Triplet loss is used for performance comparisons, so we introduce it first.

3.2.1. Triplet Loss

In triplet loss we create a collection of triplets by selecting an anchor image $x_{i}^{a}$ , a positive image $x_{i}^{p}$ which is another image of the same person, and a negative image $x_{i}^{n}$ of a different person. The triplet loss aims to keep $x_{i}^{a}$ and $x_{i}^{p}$ closer to each other than $x_{i}^{a}$ and $x_{i}^{n}$ , by a margin $\alpha$ . For every set $i$ , we want

$$\| f(x_{i}^{a}) - f(x_{i}^{p}) \|_{2}^{2} + \alpha < \| f(x_{i}^{a}) - f(x_{i}^{n}) \|_{2}^{2} \tag{1}$$

Hence the loss that is being minimized is

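$$L_{trp} = \sum_{i}^{N} \left[ \| f(x_{i}^{a}) - f(x_{i}^{p}) \|_{2}^{2} - \| f(x_{i}^{a}) - f(x_{i}^{n}) \|_{2}^{2} + \alpha \right] \tag{2}$$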

In Eq 1, the triplet loss adopts the Euclidean distance to measure the similarity of extracted features from two images. The major challenge with triplet loss is that as the dataset gets larger, the possible number of triplets grows cubically, rendering a long enough training impractical. The transformation function $f$ relatively quickly learns to correctly map most trivial triplets, rendering a large fraction of all triplets uninformative.
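The triplet objective can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: embeddings are ordinary lists, the margin value `alpha=0.2` is a placeholder, and each term is clamped at zero as in FaceNet, which is an assumption about how the bracket in the loss is meant to be read.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(triplets, alpha=0.2):
    """Sum of margin violations over (anchor, positive, negative) embeddings.

    alpha is a hypothetical margin value chosen for illustration.
    """
    total = 0.0
    for anchor, positive, negative in triplets:
        violation = sq_dist(anchor, positive) - sq_dist(anchor, negative) + alpha
        total += max(violation, 0.0)  # satisfied triplets contribute nothing
    return total


easy = ([0.0, 0.0], [1.0, 0.0], [3.0, 0.0])   # negative is far away
hard = ([0.0, 0.0], [1.0, 0.0], [0.5, 0.0])   # negative is closer than positive
print(triplet_loss([easy]))
print(triplet_loss([easy, hard]))
```

The easy triplet already satisfies the margin and adds nothing to the loss, which is the sense in which most triplets become uninformative as training proceeds.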

3.2.2. Cluster Loss

We take motivation from K-Means clustering and Linear Discriminant Analysis. The target is to minimise intra class variations and to maximize the inter class variations. In a batch of $N$ images with $P$ person identities containing $K$ images of each person, for a person identity $i\in P$ , mean $f^{m}_{i}$ in feature space $R^{d}$ is,

$$f_{i}^{m} = \frac{\sum^{K} f(x)}{K} \tag{3}$$

Intra class variation for an identity is represented by the distance of each sample of that identity to the mean of that identity. Hence for an identity $i$ , intra class variation $d^{intra}_{i}$ is given by

$$d_{i}^{intra} = \sum_{k} \| f(x) - f_{i}^{m} \|_{2}^{2} \tag{4}$$

Similarly, inter class variation for an identity is represented by the distance of the mean of that identity to means of all other identities. Hence for an identity $i$ , inter class variation $d^{inter}_{i}$ is given by

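$$d_{i}^{inter} = \sum_{\forall i_{d} \in P,\ i_{d} \neq i} \| f_{i}^{m} - f_{i_{d}}^{m} \|_{2}^{2} \tag{5}$$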

The task is to minimise intra class distances and maximise the inter class distances. Hence the loss that is being minimised is

$$L_{c} = \frac{\beta \sum_{i}^{P} d_{i}^{intra}}{\gamma + \sum_{i}^{P} d_{i}^{inter}} \tag{6}$$

The summation in the numerator of Eq. 6 accumulates PK distances, whereas the summation in the denominator accumulates P(P-1) distances. Hence $\beta$ is a hyperparameter which acts as a normalising constant, and $\gamma$ is a very small value.
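The cluster loss of Eq. 6 can be sketched as follows, assuming a batch is given as a dictionary from identity to a list of embedding vectors. The function names and the default values `beta=1.0` and `gamma=1e-8` are placeholders for illustration; the paper does not report the values it used.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def mean_vector(vectors):
    """Component-wise mean of the embeddings of one identity."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def cluster_loss(batch, beta=1.0, gamma=1e-8):
    """Ratio of summed intra-class to summed inter-class squared distances.

    batch maps identity -> list of embeddings. beta and gamma are
    hypothetical placeholder values.
    """
    means = {pid: mean_vector(vs) for pid, vs in batch.items()}
    # P*K terms: every sample against the mean of its own identity.
    intra = sum(sq_dist(v, means[pid]) for pid, vs in batch.items() for v in vs)
    # P*(P-1) terms: every ordered pair of distinct identity means.
    inter = sum(sq_dist(means[i], means[j])
                for i in means for j in means if i != j)
    return beta * intra / (gamma + inter)


batch = {"a": [[0.0, 0.0], [2.0, 0.0]], "b": [[10.0, 0.0], [12.0, 0.0]]}
print(cluster_loss(batch))  # intra = 4, inter = 200, so roughly 0.02
```

Tight, well-separated clusters drive the ratio towards zero, which is the behaviour the loss is designed to reward.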

3.3. Batch Hard Training

The loss function shown by Eq.6 describes the basic concept of cluster loss. We strive to minimise the intra class distance which is measured as distance of samples of a class with respect to their mean, at the same time maximising inter class distances which is measured as distances between the means of different classes. Although Eq. 6 is a good representation of cluster loss, when we trained the network with that particular loss function, the results were not promising and the number of iterations required for convergence was very high. This is because the loss contained equal contributions from all samples. This is similar to training using triplet loss without mining hard triplets. The transformation $f$ relatively quickly learns to correctly map most trivial samples, rendering a large fraction of all samples uninformative. Thus mining hard positive/negative samples becomes crucial for learning. Intuitively, being told over and over again that people with differently coloured clothes are different persons does not teach one anything, whereas seeing similarly looking but different people (hard negatives), or pictures of the same person in wildly different poses or from different camera angles (hard positives) dramatically helps in understanding the concept of re-identification.

So we modified the loss function in such a way that it does not take cumulative contribution from all images in a batch but the samples which contribute most to the loss, so that the correction step by minimization affects those samples which have the maximum error. Hence only hard samples contribute directly to the loss function. In this approach for the new $d^{intra}_{i}$ of an identity $i$ , we take the sample which lies farthest from the mean $f^{m}_{i}$ and take the corresponding distance as $d^{intra}_{i}$ .

$$d_{i}^{intra} = \max_{K} \| f(x) - f_{i}^{m} \|_{2}^{2} \tag{7}$$

For the new $d^{inter}_{i}$ for an identity $i$ , we take the distance with that mean which is closest to the mean of considered identity.

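$$d_{i}^{inter} = \min_{\forall i_{d} \in P,\ i_{d} \neq i} \| f_{i}^{m} - f_{i_{d}}^{m} \|_{2}^{2} \tag{8}$$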

The final loss function to be minimised is

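$$L_{bc} = \sum_{i}^{P} \max \left( \left( d_{i}^{intra} - d_{i}^{inter} + \alpha \right), 0 \right) \tag{9}$$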

Mining hard samples makes the training converge faster and gives better results. In (Hermans* et al., 2017) a method for mining hard triplets is described which gives better results compared to other triplet loss based methods. The downside of basing the loss function on only a few triplets is that the transformation function is adjusted based only on the distances between those samples. In our method, even though we consider only hard samples, every sample contributes indirectly, since the distances are calculated with respect to the mean. Hence with every iteration, the transformation adjusts to decrease the distance within the clusters while making sure that the clusters stay far apart.
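The batch hard variant described in this section can be sketched in the same style: for each identity only the farthest sample from its mean and the nearest other mean enter the loss. This is an illustrative sketch, not the authors' code; the margin `alpha=0.5` is a placeholder and the batch is assumed to contain at least two identities.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def mean_vector(vectors):
    """Component-wise mean of the embeddings of one identity."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def batch_hard_cluster_loss(batch, alpha=0.5):
    """Hinge on (hardest intra distance - hardest inter distance + alpha).

    batch maps identity -> list of embeddings and needs two or more
    identities. alpha is a hypothetical margin value.
    """
    means = {pid: mean_vector(vs) for pid, vs in batch.items()}
    loss = 0.0
    for pid, vectors in batch.items():
        # Hard positive: the sample farthest from its own class mean.
        d_intra = max(sq_dist(v, means[pid]) for v in vectors)
        # Hard negative: the closest mean of any other identity.
        d_inter = min(sq_dist(means[pid], m)
                      for other, m in means.items() if other != pid)
        loss += max(d_intra - d_inter + alpha, 0.0)
    return loss


separated = {"a": [[0.0, 0.0], [2.0, 0.0]], "b": [[10.0, 0.0], [12.0, 0.0]]}
overlapping = {"a": [[0.0, 0.0], [4.0, 0.0]], "b": [[3.0, 0.0], [5.0, 0.0]]}
print(batch_hard_cluster_loss(separated))
print(batch_hard_cluster_loss(overlapping))
```

Although only the extreme distances appear in the loss, every sample still influences it through the class means, which matches the argument made above.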

4. Experiments and Results

We focused on two types of performance evaluation: 1) performance for a ranking task and 2) performance for a sequential clustering task. The datasets we employed were Market-1501 (Zheng et al., 2015), one of the largest person ReID datasets currently available, and the CUHK03 (Li et al., 2014) dataset. The Market-1501 dataset contains bounding boxes from a person detector which have been selected based on their intersection-over-union overlap with manually annotated bounding boxes. It contains 32668 images of 1501 persons, split into train/test sets of 12936/19732 images as defined by (Zheng et al., 2015). We also show results on the CUHK03 (Li et al., 2014) dataset, which contains 13164 images of 1360 identities.

Augmenting training data is a common practice. We performed random crops and random horizontal flips during training. Similar to the augmentation steps in TriNet (Hermans* et al., 2017), we resize all images of size $H \times W$ to $1\frac{1}{8}(H \times W)$ , from which we take random crops of size $H \times W$ , keeping their aspect ratio intact. We set H = 256, W = 128 on Market-1501 and H = 256, W = 96 on CUHK03. We apply test-time augmentation in our experiments. From each image, we deterministically create five crops of size $H \times W$ : four corner crops and one center crop, as well as a horizontally flipped copy of each. The embeddings of all these ten images are then averaged, resulting in the final embedding for a person. We also experimented with transfer learning. We initialized our network with the weights of an existing network trained for the ReID task, such as TriNet (Hermans* et al., 2017), and this yielded a better result and converged in fewer iterations.
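The geometry of these augmentation steps can be worked out without any image library. The sketch below computes the enlarged size and the offsets of the five deterministic crops, then averages a list of embeddings. It assumes that test images are enlarged by the same 1 1/8 factor before cropping, which the text implies but does not state; all function names are hypothetical.

```python
def resized_shape(h, w):
    """Scale both sides by 1 1/8 (that is, 9/8)."""
    return (h * 9) // 8, (w * 9) // 8


def five_crop_offsets(img_h, img_w, crop_h, crop_w):
    """(top, left) offsets of the four corner crops and the centre crop."""
    bottom, right = img_h - crop_h, img_w - crop_w
    return [(0, 0), (0, right), (bottom, 0), (bottom, right),
            (bottom // 2, right // 2)]


def average_embedding(embeddings):
    """Component-wise mean of the embeddings of all augmented views."""
    return [sum(col) / len(embeddings) for col in zip(*embeddings)]


H, W = 256, 128                      # Market-1501 setting from the text
big_h, big_w = resized_shape(H, W)
print(big_h, big_w)                  # 288 144
offsets = five_crop_offsets(big_h, big_w, H, W)
print(offsets)
# Each crop is also flipped horizontally, giving ten views per image.
print("views per image:", 2 * len(offsets))
```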

4.1. Performance for ranking task

We evaluated the performance for ranking task on both Market-1501 (Zheng et al., 2015) and CUHK03 (Li et al., 2014) datasets. We used the standard evaluation, namely the mean average precision score (mAP) and the cumulative matching curve (CMC) at rank-1 and rank-5. We followed the evaluation codes provided by (Zhong et al., 2017) and (Hermans* et al., 2017).
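For readers unfamiliar with these metrics, the sketch below computes a rank-k hit and the average precision for a single query from a gallery already sorted by distance. It is a simplified illustration: the standard Market-1501 protocol additionally discards gallery images from the query's own camera and junk images, which is omitted here, and the function names are hypothetical.

```python
def rank_k_hit(ranked_ids, query_id, k):
    """True if a correct match appears among the top-k gallery entries."""
    return query_id in ranked_ids[:k]


def average_precision(ranked_ids, query_id):
    """Mean of the precision values at each rank where a match occurs."""
    hits, precisions = 0, []
    for rank, gallery_id in enumerate(ranked_ids, start=1):
        if gallery_id == query_id:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


# Gallery identities sorted by increasing distance to a query of identity "a".
ranked = ["b", "a", "c", "a"]
print(rank_k_hit(ranked, "a", 1))      # False: the top match is wrong
print(rank_k_hit(ranked, "a", 5))      # True
print(average_precision(ranked, "a"))  # (1/2 + 2/4) / 2 = 0.5
```

The CMC value at rank k is the fraction of queries for which `rank_k_hit` is true, and mAP is the mean of `average_precision` over all queries.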

Table 3 compares our results to a set of related, top performing approaches on Market-1501 with single query. We also evaluated how our model performs when combined with the re-ranking approach by Zhong et al. (Zhong et al., 2017). This can be applied on top of any ranking method and uses information from nearest neighbours in the gallery to improve the ranking result. Table 4 compares our results to a set of related, top performing approaches on CUHK03. It is evident that all deep learning based methods outperform the traditional methods by a large margin. TriNet (Hermans et al., 2017), which is based on triplet loss with an improvement in hard mining of samples, seems to perform best among all existing methods. Our method performs slightly better than TriNet in the ranking task.

4.2. Performance for sequential clustering task

We wanted to evaluate the performance of our method for a clustering task. We did a simple sequential clustering, explained in Fig 5, in which images $(x_{1},x_{2},\ldots,x_{i},\ldots)$ are fed in sequence. For an image $x_{i}$ , after passing it through the network to find the transformation $f(x_{i})$ , Euclidean distances $d_{i}$ are computed to the means $f_{m}$ of all existing classes, and the minimum among them is taken as $d_{k}$ (as the minimum distance was with class $k$ ). If $d_{k}$ is less than the threshold $th$ , image $x_{i}$ is marked as belonging to class $k$ and the mean of class $k$ is updated with $f(x_{i})$ . If $d_{k}$ is greater than or equal to $th$ , $x_{i}$ is marked as a new class, increasing the total number of classes by 1. We used the test set of the Market-1501 (Zheng et al., 2015) dataset for the experiments, after removing the "distractor" and "junk" images. To prepare the feed, we randomly select 4 to 6 identities (from a total of 750 identities) from the set, shuffle all of their images and push them to the feed. This is done to resemble real-life cases where multiple people pass in front of a camera. The triplet loss based TriNet (Hermans* et al., 2017) gave the highest accuracy for the ranking task among all previously existing methods. So we compared the sequential clustering performance of our method with TriNet by replacing the transformation network $f(x)$ with that of TriNet to create the embeddings. We used two metrics for measuring the clustering accuracy.
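The sequential clustering procedure can be sketched as below. The embeddings stand in for the network outputs $f(x_{i})$ and are plain lists here. The incremental running-mean update is an assumption, since the text only says that the class mean is "updated with" the new embedding, and the function name is hypothetical.

```python
import math


def sequential_clustering(embeddings, threshold):
    """Assign each embedding to the nearest class mean or open a new class.

    Returns one class index per input, in feed order. The running-mean
    update is an assumed reading of how the class mean is updated.
    """
    means, counts, labels = [], [], []
    for f in embeddings:
        nearest, d_k = None, math.inf
        for k, mean in enumerate(means):
            d = math.dist(f, mean)  # Euclidean distance to class mean
            if d < d_k:
                nearest, d_k = k, d
        if nearest is not None and d_k < threshold:
            counts[nearest] += 1
            n = counts[nearest]
            means[nearest] = [m + (x - m) / n
                              for m, x in zip(means[nearest], f)]
            labels.append(nearest)
        else:
            # No class is close enough: start a new class with this sample.
            means.append(list(f))
            counts.append(1)
            labels.append(len(means) - 1)
    return labels


feed = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [0.0, 0.2], [5.1, 5.0]]
labels = sequential_clustering(feed, threshold=1.0)
print(labels)  # [0, 0, 1, 0, 1]
```

Because a wrongly opened class can never be merged later, the result is sensitive to the threshold and to how compact the embedding clusters are, which is what makes this a useful stress test for the loss.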

1) Cluster Quality ( $C_{q}$ ): At any stage of the image feed, we know the true identity of every image that has been fed to the clustering algorithm until that point. So we try to map identities to all existing clusters. The criterion for tagging a cluster with an identity is that the largest number of images in that cluster belong to that particular identity, i.e. an identity $I$ is assigned to a particular cluster $C$ if the largest number of images in that cluster belong to $I$ . The same identity can be assigned to more than one cluster. In such cases we take the cluster with the largest number of images belonging to that identity, and mark the other clusters as unassigned. An image $x_{i}$ is "clustered" correctly if it belongs to a cluster which was assigned an identity, and that identity is the same as that of the image $x_{i}$ . Cluster Quality ( $C_{q}$ ) at any point is defined as the ratio of the number of images clustered correctly to the total number of images fed for clustering until that point.

2) Rand Index ( $R_{id}$ ): Rand Index (Wong, 2015) is a standard evaluation metric for any clustering algorithm. It is based on the intra-cluster similarity and inter-cluster dissimilarity. For the intra-cluster similarity, if a pair of data vectors is assigned the same cluster in both the target result and the clustering result, then the score is increased by one. For the inter-cluster dissimilarity, if a pair of vectors is assigned different clusters in both the target result and the clustering result, then the score is increased by one. On the contrary, if a pair of data vectors is in the same cluster in the target result but not in the clustering result, the score is not increased. After we have checked all the possible pairs, the score is normalized by the total number of possible pairs. The exact formulation is given in (Wong, 2015).
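Both metrics can be computed from two parallel lists, the true identities and the assigned cluster indices. The sketch below follows the descriptions above; it is an illustrative reading, not the authors' evaluation code. Ties in the majority vote are broken arbitrarily, and the Rand Index function expects at least two images.

```python
from collections import Counter
from itertools import combinations


def rand_index(true_ids, cluster_ids):
    """Fraction of image pairs on which the two partitions agree."""
    pairs = list(combinations(range(len(true_ids)), 2))
    agree = sum((true_ids[i] == true_ids[j]) == (cluster_ids[i] == cluster_ids[j])
                for i, j in pairs)
    return agree / len(pairs)


def cluster_quality(true_ids, cluster_ids):
    """Fraction of images that sit in the cluster assigned to their identity."""
    members = {}
    for identity, cluster in zip(true_ids, cluster_ids):
        members.setdefault(cluster, Counter())[identity] += 1
    # Tag each cluster with its majority identity; if several clusters get
    # the same identity, keep only the one holding most of its images.
    best = {}  # identity -> image count in its assigned cluster
    for counter in members.values():
        identity, count = counter.most_common(1)[0]
        best[identity] = max(best.get(identity, 0), count)
    return sum(best.values()) / len(true_ids)


true_ids = ["a", "a", "a", "b", "b", "b"]
cluster_ids = [0, 0, 1, 2, 2, 2]   # one image of "a" split into its own cluster
print(rand_index(true_ids, cluster_ids))       # 13 of 15 pairs agree
print(cluster_quality(true_ids, cluster_ids))  # 5 of 6 images are correct
```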

The accuracy comparisons are shown in Fig 6 and Fig 7. It is evident that our network trained with cluster loss outperforms TriNet in the clustering task.

Choosing the threshold $th$ in the sequential clustering experiment is tricky. Since we are doing a performance evaluation rather than a deployment, we tried different thresholds and chose the one that gave the best accuracies. This was done for both networks. From the experiments we observe that our method has better ranking accuracies compared to other existing methods for person ReID, although by a small margin. Our method excels in the clustering task. This is because the training loss not only tries to bring all images of the same identity together but also tries to keep the clusters far apart.

4.3. Cluster loss - Triplet loss comparison

We created t-SNE plots for embeddings generated by networks trained using cluster loss and triplet loss to compare the cluster formation. We did this on the MNIST (LeCun and Cortes, 2010) dataset and on Market-1501 (Zheng et al., 2015). For the MNIST dataset, a CNN having two convolutional layers with 128 and 256 filters of kernel size 7x7 and 5x5 respectively, and an embedding dimension of 4, was trained using triplet loss and cluster loss. The t-SNE plots on the validation data are shown in figures 8.a and 8.b. For Market-1501, we used the same networks as in Section 4.2. We randomly picked 10 identities from the test set and their embeddings are plotted in figures 8.c and 8.d. From the t-SNE plots it is evident that the triplet loss is able to bring the samples from the same class together, but the cluster loss does a better job of separating the clusters.

5. Conclusion

In this paper, we introduce cluster loss for training a network for a person ReID task. Training the network with cluster loss learns better parameters for the transformation function, which increases the inter-class variation and decreases the intra-class variation for person re-identification. Our method performs better in ranking tasks compared to the existing state-of-the-art methods. In a clustering task, our network outperformed TriNet (Hermans* et al., 2017), which had the best ranking accuracy among existing methods, by a large margin. In future, as an extension to this work on person ReID, we want to train and evaluate a cluster loss based network for a face recognition task.

Bibliography

Abadi et al. (2016) Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2016. TensorFlow: A System for Large-scale Machine Learning. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Imple
Ahmed et al. (2015) E. Ahmed, M. Jones, and T. K. Marks. 2015. An improved deep learning architecture for person re-identification. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 3908–3916. https://doi.org/10.1109/CVPR.2015.7299016
Chen et al. (2015) D. Chen, Zejian Yuan, G. Hua, N. Zheng, and J. Wang. 2015. Similarity learning on an explicit polynomial kernel feature map for person re-identification. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 1565–1573. https://doi.org/10.1109/CVPR.2015.7298764
Chen et al. (2016) S. Z. Chen, C. C. Guo, and J. H. Lai. 2016. Deep Ranking for Person Re-Identification via Joint Representation Learning. IEEE Transactions on Image Processing 25, 5 (May 2016), 2353–2367. https://doi.org/10.1109/TIP.2016.2545929
Chen et al. (2017a) W. Chen, X. Chen, J. Zhang, and K. Huang. 2017a. Beyond Triplet Loss: A Deep Quadruplet Network for Person Re-identification. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 1320–1329. https://doi.org/10.1109/CVPR.2017.145
Chen et al. (2017b) Weihua Chen, Xiaotang Chen, Jianguo Zhang, and Kaiqi Huang. 2017b. A Multi-Task Deep Network for Person Re-Identification. (2017). https://aaai.org/ocs/index.php/AAAI/AAAI17/paper/view/14313
Cheng et al. (2016) D. Cheng, Y. Gong, S. Zhou, J. Wang, and N. Zheng. 2016. Person Re-identification by Multi-Channel Parts-Based CNN with Improved Triplet Loss Function. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 1335–1344. https://doi.org/10.1109/CVPR.2016.149