Deep Constrained Dominant Sets for Person Re-identification

Leulseged Tesfaye Alemu; Marcello Pelillo; Mubarak Shah

arXiv:1904.11397·cs.CV·June 20, 2019


TL;DR

This paper introduces a novel end-to-end constrained clustering approach called deep constrained dominant sets (DCDS) for person re-identification, leveraging probe constraints and multi-scale features to improve accuracy over existing methods.

Contribution

The paper proposes a new constrained clustering framework for person re-id that incorporates probe constraints and end-to-end optimization, enhancing robustness and performance.

Findings

1. Outperforms state-of-the-art methods on benchmark datasets.

2. Effectively leverages probe constraints to reduce noise propagation.

3. Integrates multi-scale ResNet features for improved accuracy.

Abstract

In this work, we propose an end-to-end constrained clustering scheme to tackle the person re-identification (re-id) problem. Deep neural networks (DNN) have recently proven to be effective on person re-identification task. In particular, rather than leveraging solely a probe-gallery similarity, diffusing the similarities among the gallery images in an end-to-end manner has proven to be effective in yielding a robust probe-gallery affinity. However, existing methods do not apply probe image as a constraint, and are prone to noise propagation during the similarity diffusion process. To overcome this, we propose an intriguing scheme which treats person-image retrieval problem as a constrained clustering optimization problem, called deep constrained dominant sets (DCDS). Given a probe and gallery images, we re-formulate person re-id problem as finding a constrained cluster, where the…

Tables

Table 1: A comparison of the proposed method with state-of-the-art methods on the Market1501 dataset. Upper block: without re-ranking. Lower block: with the re-ranking method (w/RR) [46].

Methods	mAP	rank-1	rank-5
SGGNN [26] ECCV18	82.8	92.3	96.1
DKPM [27] CVPR18	75.3	90.1	96.7
DGSRW [25] CVPR18	82.5	92.7	96.9
GCSL [7] CVPR18	81.6	93.5	-
CPC [33] CVPR18	69.48	83.7	-
MLFN [6] CVPR18	74.3	90.0	-
HA-CNN [21] CVPR18	75.7	91.2	-
PA [28] ECCV18	74.5	88.8	95.6
HSP [16] CVPR18	83.3	93.6	97.5
Ours	85.8	94.81	98.1
$RA_{w/RR}$ [34] CVPR18	86.7	90.9	-
$PA_{w/RR}$ [28] ECCV18	89.9	93.4	96.4
$HSP_{w/RR}$ [16] CVPR18	90.9	94.6	96.8
Ours_w/RR	93.3	95.4	98.3

Table 2: Ablation studies on the proposed method. SD and MD refer to the method trained on a single dataset and on multiple aggregated datasets, respectively. Baseline is the proposed method without the CDS branch.

Methods	Market1501			CUHK03		DukeMTMC-reID
Methods	mAP	rank-1	rank-5	rank-1	rank-5	mAP	rank-1	rank-5
Baseline SD	72.2	86.5	94.0	87.1	94.3	61.1	77.6	87.3
Baseline MD	74.3	87.5	95.3	87.7	95.2	62.3	79.1	88.8
DCDS (SD)	81.4	93.3	97.6	93.1	98.8	69.1	83.3	89.0
DCDS (MD)	82.3	93.7	98.0	93.9	98.9	70.5	84.0	90.3
Ours (SD + Auxil Net)	83.0	93.9	98.2	95.4	99.0	74.4	85.6	93.7
Ours (MD + Auxil Net)	85.8	94.1	98.1	95.8	99.1	75.5	86.1	93.2

Table 3: A comparison of the proposed method with state-of-the-art methods on the CUHK03 dataset.

Methods	rank-1	rank-5
SGGNN [26] ECCV18	95.3	99.1
DKPM [27] CVPR18	91.1	98.3
DGSRW [25] CVPR18	94.9	98.7
GCSL [7] CVPR18	90.2	98.5
MLFN [6] CVPR18	89.2	-
CPC [33] CVPR18	88.1	-
PA [28] ECCV18	88.0	97.6
HSP [16] CVPR18	94.28	99.04
Ours	95.8	99.1

Table 4: A comparison of the proposed method with state-of-the-art methods on the DukeMTMC-reID dataset. Upper block: without re-ranking. Lower block: with the re-ranking method (w/RR) [46].

Methods	mAP	rank-1	rank-5
SGGNN [26] ECCV18	68.2	81.1	88.4
DKPM [27] CVPR18	63.2	80.3	89.5
DGSRW [25] CVPR18	66.4	80.7	88.5
GCSL [7] CVPR18	69.5	84.9	-
CPC [33] CVPR18	59.49	76.44	-
MLFN [6] CVPR18	62.8	81.0	-
RAPR [34] CVPR18	80.0	84.4	-
PA [28] ECCV18	64.2	82.1	90.2
HSP [16] CVPR18	73.3	85.9	92.9
Ours	75.5	87.5	-
$PA_{w/RR}$ [28] ECCV18	83.9	88.3	93.1
$HSP_{w/RR}$ [16] CVPR18	84.99	88.9	94.27
Ours_w/RR	86.1	88.5	-

Table 5: A comparison of the proposed method with PUL [13] on the Market1501 dataset.

	Train on Duke, CUHK03 $\to$ Test on Market1501
Methods	mAP	rank-1
PUL [13]	20.5	45.5
Ours	24.5	51.3

Equations

$$\begin{array}{ll}\text{maximize }&f({\bf x})={\bf x}^{\prime}A{\bf x},\\ \text{subject to}&{\bf x}\in\Delta\end{array}$$

$$\begin{array}{ll}\text{maximize }&f_{P}^{\alpha}({\bf x})={\bf x}^{\prime}(A-\alpha\hat{I}_{P}){\bf x},\\ \text{subject to}&{\bf x}\in\Delta\end{array}$$

$$x_{i}(t+1)=x_{i}(t)\frac{(A{\bf x}(t))_{i}}{{\bf x}(t)^{\prime}A{\bf x}(t)}$$

$$\phi_{S}(i,j)=a_{ij}-\frac{1}{|S|}\sum_{k\in S}a_{ik}$$

$$w_{S}(i)=\begin{cases}1,&\text{if }|S|=1,\\ \sum_{j\in S\setminus\{i\}}\phi_{S\setminus\{i\}}(j,i)\,w_{S\setminus\{i\}}(j),&\text{otherwise}\end{cases}$$

$$\begin{array}{ll}\text{maximize }&f_{P}^{\alpha}({\bf x})^{i}={\bf x}^{\prime}B{\bf x},\quad\text{where }B=A-\alpha\hat{I}_{P},\\ \text{subject to}&{\bf x}\in\Delta\end{array}$$

$$Y=\begin{pmatrix}{\bf x}_{1}^{*}\\ \vdots\\ {\bf x}_{M}^{*}\end{pmatrix}=\begin{pmatrix}z_{g_{1}}^{1}&z_{g_{2}}^{1}&\dots&z_{g_{M}}^{1}\\ \vdots&&\ddots&\vdots\\ z_{g_{1}}^{M}&z_{g_{2}}^{M}&\dots&z_{g_{M}}^{M}\end{pmatrix}$$

$$\begin{array}{l}F_{s}=\beta(Y)\otimes(1-\beta)(S^{\prime}),\\ F_{d}=\beta(Y_{d})\otimes(1-\beta)(D^{\prime}),\quad\text{where}\quad Y_{d}=\delta-Y\end{array}$$

Code & Models

Repositories

leule/DCDS
pytorch

Full text

Leulseged Tesfaye Alemu

Ca’ Foscari University of Venice

Marcello Pelillo

Ca’ Foscari University of Venice

ECLT, European Centre for Living Technology

Mubarak Shah

CRCV, University of Central Florida

Abstract

In this work, we propose an end-to-end constrained clustering scheme to tackle the person re-identification (re-id) problem. Deep neural networks (DNN) have recently proven to be effective on the person re-identification task. In particular, rather than leveraging solely a probe-gallery similarity, diffusing the similarities among the gallery images in an end-to-end manner has proven to be effective in yielding a robust probe-gallery affinity. However, existing methods do not apply the probe image as a constraint, and are prone to noise propagation during the similarity diffusion process. To overcome this, we propose an intriguing scheme which treats the person-image retrieval problem as a constrained clustering optimization problem, called deep constrained dominant sets (DCDS). Given a probe and gallery images, we re-formulate the person re-id problem as finding a constrained cluster, where the probe image is taken as a constraint (seed) and each cluster corresponds to a set of images of the same person. By optimizing the constrained clustering in an end-to-end manner, we naturally leverage the contextual knowledge of a set of images corresponding to the given person-images. We further enhance the performance by integrating an auxiliary net alongside DCDS, which employs a multi-scale ResNet. To validate the effectiveness of our method, we present experiments on several benchmark datasets and show that the proposed method can outperform state-of-the-art methods.

1 Introduction

Person re-identification aims at retrieving the images most similar to the probe image from a large-scale gallery set captured by camera networks. The challenges that hinder person re-id include background clutter and variations in pose, view, and illumination.

Person re-id can be treated as a person retrieval problem based on a ranked similarity score, which is obtained from the pairwise affinities between the probe and the gallery images. However, relying solely on the pairwise probe-gallery affinities while ignoring the underlying contextual information between the gallery images often leads to an undesirable similarity ranking. To tackle this, several works employ similarity diffusion to estimate a second-order similarity that considers the intrinsic manifold structure of the given affinity matrix [3], [22], [12], [4]. Similarity diffusion is a process of exploiting the contextual information between all the gallery images to provide a context-sensitive similarity. Nevertheless, none of these methods leverages the advantages of deep neural networks. Instead, they employ the similarity diffusion process as a post-processing step on top of the DNN model. Aiming to improve the discriminative power of a DNN model, recent works incorporate a similarity diffusion process in an end-to-end manner [25], [26], [7]. Following [5], which applies a random walk in an end-to-end fashion to solve the semantic segmentation problem, the authors of [25] proposed a group-shuffling random walk network that fully utilizes the affinity information between gallery images in both the training and testing phases. Also, the authors of [26] proposed a similarity-guided graph neural network (SGGNN) to exploit the relationship between several probe-gallery image similarities.

However, most of the existing graph-based end-to-end learning methods apply the similarity diffusion without any constraint or attention mechanism tied to the specific query image. As a result, the second-order similarity these methods yield is highly prone to noise. To tackle this problem, one possible mechanism is to guide the similarity propagation by providing a seed (or constraint) and to let the optimization process estimate the optimal similarity between the seed and its nearest neighbors, while treating the seed as the attention point. To formalize this idea, in this paper we model the person re-id problem as finding an internally coherent and externally incoherent constrained cluster in an end-to-end fashion. To this end, we adopt a graph and game theoretic method called constrained dominant sets in an end-to-end manner. To the best of our knowledge, we are the first to integrate the well-known unsupervised clustering method called dominant sets in a DNN model. To summarize, the contributions of the proposed work are:

• For the very first time, the dominant sets clustering method is integrated in a DNN and optimized in an end-to-end fashion.

• A one-to-one correspondence between person re-identification and the constrained clustering problem is established.

• State-of-the-art results are significantly improved.

The paper is structured as follows. In Section 2, we review related work. In Section 3, we discuss the proposed method, with a brief introduction to dominant sets and constrained dominant sets. Finally, in Section 4, we provide an extensive experimental analysis on three different benchmark datasets.

2 Related works

Person re-id is a challenging computer vision task due to variations in illumination conditions, backgrounds, pose, and viewpoints. Most recent methods train DNN models with different learning objectives, including verification, classification, and similarity learning [9], [42], [31], [1]. For instance, the verification network (V-Net) [19], Figure 1(b), applies a binary classification to the image-pair representation and is trained under the supervision of a binary softmax loss. Learning an accurate similarity and a robust feature embedding plays a vital role in person re-identification. Methods which integrate a siamese network with a contrastive loss are typical examples of deep similarity learning for person re-id [8]. The optimization goal of these models is to minimize the distance between images of the same person while maximizing the distance between images of different persons. However, these methods focus on the pairwise distance, ignoring the contextual or relative distances. Different schemes have tried to overcome these shortcomings. In Figure 1(c), the triplet loss is exploited to enforce the correct order of relative distances among image triplets [9], [11], [42]. In Figure 1(d), the quadruplet loss [8] leverages the advantages of both the contrastive and triplet losses, and is thus able to maximize the intra-class similarity while minimizing the inter-class similarity. Emphasizing the fact that these methods entirely neglect the global structure of the embedding space, [7], [25], [26] proposed graph-based end-to-end diffusion methods, shown in Figure 1(e).

Graph based end-to-end learning. Graph-based methods have played a vital role in the rapid growth of computer vision applications in the past. Lately, however, the advent of deep convolutional neural networks and their tremendous achievements in the field have attracted great attention from researchers. Accordingly, researchers have made a significant effort to integrate classical methods, in particular graph theoretical methods, into end-to-end learning. Shen et al. [26] developed two constructions of deep convolutional networks on a graph: the first is based upon hierarchical clustering of the domain, and the other is based on the spectrum of the graph Laplacian. Yan et al. [37] proposed a model of dynamic skeletons called Spatial-Temporal Graph Convolutional Networks (ST-GCN), which automatically learns both the spatial and temporal patterns of the data. Bertasius et al. [5] designed a convolutional random walk network (RWN) which, by jointly optimizing the objectives of pixelwise affinity and semantic segmentation, addresses the problem of blobby boundaries and spatially fragmented predictions. Likewise, [25] integrates the random walk method in end-to-end learning to tackle the person re-identification problem. In [25], through the proposed deep random walk and the complementary feature grouping and group shuffling scheme, the authors demonstrate that one can estimate a robust probe-gallery affinity. Unlike recent graph neural network (GNN) methods [26], [17], [25], [7], Shen et al. [26] learn the edge weights by exploiting the training label supervision, and are thus able to learn more accurate feature fusion weights for updating the node features.

Recent applications of dominant sets. Dominant sets (DS) clustering [24] and its constrained variant, constrained dominant sets (CDS) [40], have been employed in several recent computer vision applications, ranging from person tracking [29], [30], geo-localization [41], image retrieval [38], [2], and 3D object recognition [32] to image segmentation and co-segmentation [39]. Zemene et al. [40] presented CDS with its application to interactive image segmentation. Following that, [39] uses CDS to tackle both image segmentation and co-segmentation in interactive and unsupervised setups. Wang et al. [32] recently used dominant sets clustering in a recursive manner to select representative images from a collection of images, and applied a pooling operation on the refined images that survive the recursive selection process. Nevertheless, none of the above works has attempted to leverage the dominant sets algorithm in an end-to-end manner.

In this work, unlike most of the existing graph-based DNN models, we propose a constrained clustering based scheme in an end-to-end fashion, thereby leveraging the contextual information hidden in the relationships among person images. In addition, the proposed scheme significantly magnifies the inter-class variation between images of different persons while reducing the intra-class variation among images of the same person. The big picture of our proposed method is depicted in Figure 1(f): the objective is to find a coherent constrained cluster which incorporates the given probe image $P$.

3 Our Approach

In this work, we cast probe-gallery matching as optimizing a constrained clustering problem, where the probe image is treated as a constraint, while the positive images of the probe are taken as members of the constrained cluster. We integrate this clustering mechanism into a deep CNN to learn robust features through the leveraged contextual information. This is achieved by traversing the global structure of the given graph to induce a compact set of images based on the given initial similarity (edge-weight).

3.1 Dominant Sets and Constrained Dominant Sets

Dominant sets is a graph theoretic notion of a cluster, which generalizes the concept of a maximal clique to edge-weighted graphs. First, the data to be clustered are represented as an undirected edge-weighted graph with no self-loops, $G=(V,E,w)$, where $V=\{1,...,M\}$ is the vertex set, $E\subseteq V\times V$ is the edge set, and $w:E\rightarrow R_{+}^{*}$ is the (positive) weight function. Vertices in $G$ correspond to data points, edges represent neighborhood relationships, and edge-weights reflect similarity between pairs of linked vertices. As customary, we represent the graph $G$ with the corresponding weighted adjacency (or similarity) matrix, which is the $M\times M$ nonnegative, symmetric matrix $A=(a_{ij})$, defined as $a_{ij}=w(i,j)$ if $(i,j)\in E$, and $a_{ij}=0$ otherwise. Note that the diagonal elements of the adjacency matrix $A$ are always set to zero, indicating that there are no self-loops in graph $G$. As proved in [24], one can extract a coherent cluster from a given graph by solving the following quadratic program:

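$$\begin{array}{ll}\text{maximize }&f({\bf x})={\bf x}^{\prime}A{\bf x},\\ \text{subject to}&{\bf x}\in\Delta\end{array}\tag{1}$$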

where $\Delta$ is the standard simplex of $R^{M}$, i.e., the set of nonnegative vectors whose entries sum to one. Zemene et al. [40] proposed an extension of dominant sets which allows one to constrain the clustering process to contain intended constraint nodes $P$. Constrained dominant sets (CDS) is an extension of dominant sets which contains a parameterized regularization term that controls the global shape of the energy landscape. When the regularization parameter is zero, the local solutions are known to be in one-to-one correspondence with the dominant sets. A compact constrained cluster can be obtained from a given graph by defining a parametrized quadratic program as

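$$\begin{array}{ll}\text{maximize }&f_{P}^{\alpha}({\bf x})={\bf x}^{\prime}(A-\alpha\hat{I}_{P}){\bf x},\\ \text{subject to}&{\bf x}\in\Delta\end{array}\tag{2}$$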

where $\hat{I}_{P}$ refers to the $M\times M$ diagonal matrix whose diagonal elements are set to zero in correspondence to the probe $P$ and to 1 otherwise. Let $\alpha>\lambda_{max}(A_{V\backslash P}),$ where $\lambda_{max}(A_{V\backslash P})$ is the largest eigenvalue of the principal submatrix of $A$ indexed by the elements of $V\backslash P.$ If ${\bf x}$ is a local maximizer of $f_{P}^{\alpha}({\bf x})$ in $\Delta,$ then $\delta({\bf x})\cap P\neq\emptyset,$ where $\delta({\bf x})=\{i\in V:x_{i}>0\}$ is the support of ${\bf x}$. We refer the reader to [40] for the proof. Equations 1 and 2 can be solved with a straightforward continuous optimization technique from evolutionary game theory called replicator dynamics, as follows:

$$x_{i}(t+1)=x_{i}(t)\frac{(A{\bf x}(t))_{i}}{{\bf x}(t)^{\prime}A{\bf x}(t)}\tag{3}$$

for $i$ = $1,...,M.$
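As a rough illustration of the replicator dynamics update, the following sketch iterates it on a small hand-made affinity matrix using only the Python standard library. The function name `replicator_dynamics`, the toy matrix `A`, the stopping tolerance, and the support threshold are all illustrative assumptions and are not taken from the paper or its repository.

```python
def replicator_dynamics(A, x=None, max_iter=1000, tol=1e-12):
    """Iterate x_i <- x_i * (A x)_i / (x' A x), starting on the simplex."""
    n = len(A)
    # Start from the barycenter of the simplex unless a start point is given.
    x = [1.0 / n] * n if x is None else list(x)
    for _ in range(max_iter):
        Ax = [sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
        xAx = sum(x[i] * Ax[i] for i in range(n))
        if xAx <= 0.0:  # degenerate graph without positive payoff: stop
            break
        new_x = [x[i] * Ax[i] / xAx for i in range(n)]
        change = sum(abs(a - b) for a, b in zip(new_x, x))
        x = new_x
        if change < tol:  # assumed stopping rule: tiny L1 change
            break
    return x


# Hypothetical toy graph: vertices 0-2 are mutually similar, vertex 3 is an outlier.
A = [
    [0.0, 0.9, 0.8, 0.1],
    [0.9, 0.0, 0.85, 0.1],
    [0.8, 0.85, 0.0, 0.1],
    [0.1, 0.1, 0.1, 0.0],
]

x_star = replicator_dynamics(A)
# Vertices with non-negligible membership form the extracted cluster.
support = [i for i, xi in enumerate(x_star) if xi > 1e-6]
print(support, [round(xi, 3) for xi in x_star])
```

On this toy graph the outlier's membership decays toward zero while the three mutually similar vertices share the mass, which is the behavior the dominant-set formulation is meant to capture.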

3.2 Modeling person re-id as a Dominant Set

Recent methods [7], [5] have proposed different models which leverage the local and group similarity of images in an end-to-end manner. The authors of [7] define a group similarity which emphasizes the advantages of estimating the similarity of two images by employing the dependencies among the whole set of images in a given group. In this work, we establish a natural connection between finding a robust probe-gallery similarity and constrained dominant sets. Let us first elaborate on the intuitive concept of finding a coherent subset from a given set based on the global similarity of the given images. For simplicity, we represent person-images as vertices of graph $G,$ and their similarity as edge-weight $w_{ij}$. Let $S\subseteq V$ be a non-empty subset of vertices and $i\in S$. The similarity of a node $j$ to $i$, relative to the average weighted degree of $i$ with regard to $S$, is given as

$$\phi_{S}(i,j)=a_{ij}-\frac{1}{|S|}\sum_{k\in S}a_{ik},$$

where $\phi_{S}(i,j)$ measures the (relative) similarity between node $j$ and $i$ , with respect to the average similarity between node $i$ and its neighbors in $S$ . Note that $\phi_{S}(i,j)$ can be either positive or negative. Next, to each vertex $i\in S$ we assign a weight defined (recursively) as follows:

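$$w_{S}(i)=\begin{cases}1,&\text{if }|S|=1,\\ \sum_{j\in S\setminus\{i\}}\phi_{S\setminus\{i\}}(j,i)\,w_{S\setminus\{i\}}(j),&\text{otherwise,}\end{cases}$$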

where $w_{\{i,j\}}(i)=w_{\{i,j\}}(j)=a_{ij}$ for all $i,j\in V$ $(i\neq j)$.

Intuitively, $w_{S}(i)$ gives us a measure of the overall similarity between vertex $i$ and the vertices of $S\setminus\{i\}$, with respect to the overall similarity among the vertices in $S\setminus\{i\}$. Hence, a positive $w_{S}(i)$ indicates that adding $i$ to its neighbors in $S$ will raise the internal coherence of the set, whereas in the presence of a negative value we expect the overall coherence to decline. CDS inherits all the characteristics of DS, with the additional feature that it allows us to incorporate a constraint element in the resulting cluster.
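The recursive definition of $w_{S}(i)$ can be checked by hand on a very small graph. The sketch below is a direct, unoptimized transcription of $\phi_{S}$ and $w_{S}$ in plain Python; the names `phi` and `weight` and the four-vertex affinity matrix are hypothetical, chosen to mimic a probe (vertex 0), two matching gallery images (1, 2), and one non-matching image (3).

```python
from functools import lru_cache

# Hypothetical symmetric affinities with a zero diagonal (no self-loops).
A = (
    (0.0, 0.9, 0.8, 0.1),
    (0.9, 0.0, 0.85, 0.1),
    (0.8, 0.85, 0.0, 0.1),
    (0.1, 0.1, 0.1, 0.0),
)


def phi(S, i, j):
    """phi_S(i, j) = a_ij - (1/|S|) * sum_{k in S} a_ik."""
    return A[i][j] - sum(A[i][k] for k in S) / len(S)


@lru_cache(maxsize=None)
def weight(S, i):
    """Recursive weight w_S(i); S must be a frozenset containing i."""
    if len(S) == 1:
        return 1.0
    rest = S - {i}
    return sum(phi(rest, j, i) * weight(rest, j) for j in rest)


core = frozenset({0, 1, 2})
full = frozenset({0, 1, 2, 3})

# Weights of the three mutually similar vertices inside their own group.
w_core = {i: weight(core, i) for i in sorted(core)}
# Weight of the outlier with respect to the full vertex set.
w_outlier = weight(full, 3)
print(w_core, w_outlier)
```

With these assumed affinities every vertex of the coherent group receives a positive weight, while the outlier's weight with respect to the full set is negative, matching the interpretation of the sign of $w_{S}(i)$ given above. The recursion visits every subset, so this form is only practical for toy graphs.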

3.2.1 A set of person images as a constrained cluster

We cast person re-identification as finding a constrained cluster, where the elements of the cluster correspond to a set of images of the same person and the constraint refers to the probe image used to extract the corresponding cluster. As customary, let us consider a given mini-batch with $M$ person-images and $k$ person identities (IDs), so that each person-ID has $\Omega=M/k$ images in the given mini-batch. Note that, instead of random sampling, we design a custom sampler which samples $k$ person IDs in each mini-batch. Let $B=\{I_{p_{1}}^{1},...I_{p_{1}}^{\Omega},I_{p_{2}}^{1},...I_{p_{2}}^{\Omega},...I_{p_{k}}^{1},...I_{p_{k}}^{\Omega}\}$ refer to the set of images in a single mini-batch. Each time we consider image $I_{p_{1}}^{1}$ as a probe image $P$, the images which belong to the same person ID, $\{I_{p_{1}}^{2},I_{p_{1}}^{3}...I_{p_{1}}^{\Omega}\},$ should be assigned a large membership score to be in that cluster. In contrast, the remaining images in the mini-batch should be assigned a significantly smaller membership score. Note that our ultimate goal here is to find a constrained cluster which comprises all the images of the corresponding person given in that specific mini-batch. Thus, each participant in a given mini-batch is assigned a membership score to be part of a cluster. Furthermore, the characteristic vector, which contains the membership scores of all participants, is always a stochastic vector, meaning that $\sum_{i=1}^{M}z_{i}=1,$ where $z_{i}$ denotes the membership score of each image in the cluster.
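The paper does not spell out its custom sampler, so the following is only a minimal sketch of one way to draw a mini-batch with $k$ identities and $\Omega$ images per identity. The function `sample_identity_batch`, its parameters, and the synthetic label list are assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict


def sample_identity_batch(labels, num_ids, imgs_per_id, rng=random):
    """Return dataset indices for num_ids identities, imgs_per_id images each."""
    by_id = defaultdict(list)
    for index, pid in enumerate(labels):
        by_id[pid].append(index)
    # Only identities with enough images can fill their share of the batch.
    eligible = [pid for pid, idxs in by_id.items() if len(idxs) >= imgs_per_id]
    chosen = rng.sample(eligible, num_ids)
    batch = []
    for pid in chosen:
        batch.extend(rng.sample(by_id[pid], imgs_per_id))
    return batch


# Synthetic dataset: 20 identities with 6 images each (labels only).
labels = [pid for pid in range(20) for _ in range(6)]

# M = 64 images from k = 16 identities, Omega = 4 images per identity.
batch = sample_identity_batch(labels, num_ids=16, imgs_per_id=4,
                              rng=random.Random(0))
counts = Counter(labels[i] for i in batch)
print(len(batch), len(counts))
```

The values 16 and 4 mirror the mini-batch composition reported in the implementation details; any other pair with $M=k\,\Omega$ would work the same way.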

As can be seen from the toy example in Figure 3, the initial pairwise similarities between the query and gallery images hold valuable information, which defines the relation of nodes in the given graph. However, it is not straightforward to redefine the initial pairwise similarities in a way which exploits the inter-image relationships. Dominant sets (DS) overcome this problem by defining a weight for each image $p,g_{1},g_{2},g_{3}$ with regard to the subset $S\backslash i$, as depicted in Figure 3 (2-5), respectively. As can be observed from Figure 3, adding node $g_{3}$ degrades the coherency of cluster $S=\{p,g_{1},g_{2},g_{3}\},$ whereas the relative similarity of the remaining images with respect to the set $S=\{p,g_{1},g_{2}\}$ has a positive impact on the coherency of the cluster. The illustration in Figure 3 shows that the proposed DCDS (Deep Constrained Dominant Sets) can measure the contribution of each node in the graph and utilize it in an end-to-end learning process. Thereby, unlike siamese, triplet, and quadruplet based contrastive methods, DCDS considers the whole set of images in the mini-batch to measure the similarity of image pairs and enhance the learning process.

3.3 CDS Based End-to-end Learning

In this section, we discuss the integration of CDS in end-to-end learning. We adopt a siamese-based ResNet101, with a novel verification loss, to find the probe-gallery similarity, $R$, and dissimilarity, $D$, scores. As can be seen from Figure 2, we have two main branches: the CDS network branch (CDS-Net) and the verification network branch (V-Net). In the CDS-Net, the elements of the pairwise affinity matrix are first computed as the dot product of the global pooling features of a pair of images. Afterward, the replicator dynamics [36] is applied, which is a discrete-time solver of the parametrized quadratic program, Equ. 5, whose solution corresponds to the CDS. Thus, assuming that there are $M$ images in the given mini-batch, the replicator dynamics, Equ. 3, is applied $M$ times, taking each image in the mini-batch in turn as a constraint. Consider a graph $G=(V,E,w)$, its corresponding adjacency matrix $A\in R^{M\times M},$ and a probe $P\subseteq V.$ First, the affinity matrix $A$ is modified by setting the diagonal entries corresponding to the subset $V\backslash P$ to $-\alpha$ and the diagonal entries corresponding to the constraint image $P$ to zero. Next, the modified adjacency matrix, $B,$ is fed to the replicator dynamics, which is initialized with a characteristic vector of uniform distribution $x^{t_{0}}$, such that initially all the images in the mini-batch are assigned equal membership probability to be part of the cluster. Then, to find a constrained cluster, a parametrized quadratic program is defined as:

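$$\begin{array}{ll}\text{maximize }&f_{P}^{\alpha}({\bf x})^{i}={\bf x}^{\prime}B{\bf x},\quad\text{where }B=A-\alpha\hat{I}_{P},\\ \text{subject to}&{\bf x}\in\Delta\end{array}$$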

The solution, ${\bf x}_{i}^{*},$ of $f_{P}^{\alpha}({\bf x})^{i}$ is a characteristic vector which indicates the probability of each gallery image being included in a cluster containing the probe image $P^{i}$. Thus, once we obtain the CDS, ${\bf x}_{i}^{*}=[z^{i}_{g_{1}},z^{i}_{g_{2}}...z^{i}_{g_{M}}],$ for each probe image, we store each solution ${\bf x}_{i}^{*}$ in $Y\in\mathbb{R}^{M\times M}$ as

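$$Y=\begin{pmatrix}{\bf x}_{1}^{*}\\ \vdots\\ {\bf x}_{M}^{*}\end{pmatrix}=\begin{pmatrix}z_{g_{1}}^{1}&z_{g_{2}}^{1}&\dots&z_{g_{M}}^{1}\\ \vdots&&\ddots&\vdots\\ z_{g_{1}}^{M}&z_{g_{2}}^{M}&\dots&z_{g_{M}}^{M}\end{pmatrix}.$$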

Likewise, for each probe, $P^{i},$ we store the probe-gallery similarity, R, and dissimilarity, D, obtained from the V-Net (shown in Figure 2) in $S^{\prime}$ and $D^{\prime}$ as, $S^{\prime}=[R^{1},R^{2},...R^{M}]$ and $D^{\prime}=[D^{1},D^{2},...D^{M}].$ Next, we fuse the similarity obtained from the CDS branch with the similarity from the V-Net as

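$$\begin{array}{l}F_{s}=\beta(Y)\otimes(1-\beta)(S^{\prime}),\\ F_{d}=\beta(Y_{d})\otimes(1-\beta)(D^{\prime}),\quad\text{where}\quad Y_{d}=\delta-Y.\end{array}$$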

$\delta$ is empirically set to 0.3. We then vectorize $F_{s}$ and $F_{d}$ into $\mathbb{R}^{M^{2}\times 2},$ where the first column stores the dissimilarity score and the second column stores the similarity score. Afterward, we simply apply the cross entropy loss to find the prediction loss. An appealing feature of our model is that it does not need any custom optimization technique; it can be optimized end-to-end through a standard back-propagation algorithm. Note that Figure 2 illustrates the case of a single probe-gallery, whereas Equ. 6 shows the solution of $M$ probe images in a given mini-batch.
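The per-probe CDS step of this section (modify the diagonal, run the replicator dynamics from a uniform start, and stack the $M$ solutions into $Y$) can be sketched without any deep learning framework. Two choices below are assumptions of this sketch rather than details from the paper: $\alpha$ is set from a Gershgorin-style bound (the largest row sum of the non-probe submatrix, which upper-bounds its largest eigenvalue for a nonnegative symmetric matrix) plus a small margin, and the constant $\alpha$ is added to every entry of $B$ so that all payoffs stay nonnegative, which only shifts the objective by a constant on the simplex. The function names and the toy matrix are hypothetical, and the fusion with the V-Net scores is not shown.

```python
def replicator_dynamics(C, max_iter=2000, tol=1e-12):
    """Replicator dynamics on a nonnegative payoff matrix, uniform start."""
    n = len(C)
    x = [1.0 / n] * n
    for _ in range(max_iter):
        Cx = [sum(C[i][j] * x[j] for j in range(n)) for i in range(n)]
        xCx = sum(x[i] * Cx[i] for i in range(n))
        new_x = [x[i] * Cx[i] / xCx for i in range(n)]
        change = sum(abs(a - b) for a, b in zip(new_x, x))
        x = new_x
        if change < tol:
            break
    return x


def constrained_dominant_set(A, probes, margin=0.05):
    """Membership scores of the cluster constrained to contain `probes`."""
    n = len(A)
    others = [i for i in range(n) if i not in probes]
    # Assumed choice: alpha exceeds a Gershgorin bound on lambda_max(A_{V\P}).
    bound = max((sum(A[i][j] for j in others) for i in others), default=0.0)
    alpha = bound + margin
    # B = A - alpha * I_hat_P, then shifted by +alpha in every entry.
    C = [[A[i][j] + alpha for j in range(n)] for i in range(n)]
    for i in others:
        C[i][i] -= alpha  # non-probe diagonal: a_ii - alpha + alpha
    return replicator_dynamics(C)


# Hypothetical mini-batch affinities: images 0-2 share an identity, image 3 does not.
A = [
    [0.0, 0.9, 0.8, 0.1],
    [0.9, 0.0, 0.85, 0.1],
    [0.8, 0.85, 0.0, 0.1],
    [0.1, 0.1, 0.1, 0.0],
]

# Row i of Y holds the membership scores obtained with image i as the probe.
Y = [constrained_dominant_set(A, {i}) for i in range(len(A))]
for row in Y:
    print([round(z, 3) for z in row])
```

In this toy run each row of `Y` is a stochastic vector, and the row for probe 0 gives almost no membership to the non-matching image 3.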

3.4 Auxiliary Net

In this work, we integrate an auxiliary net to further improve the performance of our model. The auxiliary net is trained based on the multi-scale prediction of ResNet50 [15]. It is a simple yet effective architecture, whereby we can easily compute both the triplet and cross entropy losses at different layers of ResNet50 [15], further enhancing the learning capability. We then compute the average of both losses to find the final loss. As can be observed from Figure 4, we take three features from different layers of the ResNet50 $conv5\_x$ layer, and then feed these three features to the subsequent MP, Conv, BN, and FC layers. Next, we compute the triplet and cross entropy losses for each feature, which come from the ReLU and FC layers, respectively. During the testing phase, we concatenate the features that come from the DCDS and the auxiliary net to obtain a 4096-dimensional feature. We then apply CDS to find the final ranking score (see Figure 5).

3.5 Constraint Expansion During Testing

We propose a new scheme (illustrated in Figure 6) to expand the number of constraints in order to guide the similarity propagation during the testing phase. Given an affinity matrix, which is constructed using the concatenated feature (shown in Figure 5), we first collect the k nearest neighbors (k-NN) of the probe image. Then, we run CDS on the graph of the nearest neighbors. Next, from the resulting constrained cluster, we select the image with the highest membership score, which is used as a constraint in the subsequent step. We then use the multiple constraints and run CDS over the entire graph.
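The steps above can be sketched on a toy affinity matrix as follows. This is one possible reading of the scheme and not the authors' code: it assumes that the selected image is the non-probe member with the highest membership score, that $\alpha$ is chosen from a row-sum bound plus a margin, and that the modified matrix is shifted by $\alpha$ to keep payoffs nonnegative. The names `expand_constraints`, `constrained_dominant_set`, and the value `k=2` are hypothetical.

```python
def replicator_dynamics(C, max_iter=2000, tol=1e-12):
    """Replicator dynamics on a nonnegative payoff matrix, uniform start."""
    n = len(C)
    x = [1.0 / n] * n
    for _ in range(max_iter):
        Cx = [sum(C[i][j] * x[j] for j in range(n)) for i in range(n)]
        xCx = sum(x[i] * Cx[i] for i in range(n))
        new_x = [x[i] * Cx[i] / xCx for i in range(n)]
        change = sum(abs(a - b) for a, b in zip(new_x, x))
        x = new_x
        if change < tol:
            break
    return x


def constrained_dominant_set(A, probes, margin=0.05):
    """CDS membership scores with `probes` as constraints (assumed alpha rule)."""
    n = len(A)
    others = [i for i in range(n) if i not in probes]
    bound = max((sum(A[i][j] for j in others) for i in others), default=0.0)
    alpha = bound + margin
    C = [[A[i][j] + alpha for j in range(n)] for i in range(n)]
    for i in others:
        C[i][i] -= alpha
    return replicator_dynamics(C)


def expand_constraints(A, probe, k):
    """Pick one extra constraint from the probe's k nearest neighbors."""
    n = len(A)
    neighbors = sorted((j for j in range(n) if j != probe),
                       key=lambda j: A[probe][j], reverse=True)[:k]
    nodes = [probe] + neighbors  # local index 0 is the probe
    sub = [[A[a][b] for b in nodes] for a in nodes]
    local = constrained_dominant_set(sub, {0})
    best = max(range(1, len(nodes)), key=lambda t: local[t])
    return [probe, nodes[best]]


# Hypothetical test-time affinities: image 0 is the probe.
A = [
    [0.0, 0.9, 0.8, 0.1],
    [0.9, 0.0, 0.85, 0.1],
    [0.8, 0.85, 0.0, 0.1],
    [0.1, 0.1, 0.1, 0.0],
]

constraints = expand_constraints(A, probe=0, k=2)
scores = constrained_dominant_set(A, set(constraints))
# Rank the gallery (everything except the probe) by membership score.
ranking = sorted((j for j in range(len(A)) if j != 0),
                 key=lambda j: scores[j], reverse=True)
print(constraints, ranking)
```

Here the nearest-neighbor subgraph contributes image 1 as the second constraint, and the final run over the whole graph ranks the non-matching image last.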

4 Experiments

To validate the performance of our method we have conducted several experiments on three publicly available benchmark datasets, namely CUHK03 [19], Market1501 [43], and DukeMTMC-reID [45].

4.1 Datasets and evaluation metrics

Datasets: The CUHK03 [19] dataset comprises 14,097 manually and automatically cropped images of 1,467 identities, which are captured by two cameras on campus; in our experiments, we have used the manually annotated images. The Market1501 dataset [43] contains 32,668 images, which are split into 12,936 and 19,732 images as training and testing sets, respectively. Market1501 has 1,501 identities in total, captured by five high-resolution cameras and one low-resolution camera; the training and testing sets have 751 and 750 identities, respectively. To obtain the person bounding boxes, the Deformable Part Model (DPM) [14] is utilized. DukeMTMC-reID is generated from a tracking dataset called DukeMTMC. DukeMTMC is captured by 8 high-resolution cameras, and the person bounding boxes are manually cropped; it is organized as 16,522 images of 702 persons for training and 18,363 images of 702 persons for testing.

Evaluation Metrics: Following the recent person re-id methods, we use mean average precision (mAP) as suggested in [43], and Cumulated Matching Characteristics (CMC) curves to evaluate the performance of our model. Furthermore, all the experiments are conducted using the standard single query setting [43].
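For readers unfamiliar with these metrics, the sketch below computes average precision and rank-k hits for single queries and averages them. It is a simplified illustration: each query is represented only by a ranked list of match flags, and the filtering of junk or same-camera gallery images used by the standard benchmark protocols is not modeled. The helper names and the two example rankings are made up.

```python
def average_precision(matches):
    """AP of one ranked list; matches[r] is True if rank r+1 is a correct match."""
    hits, precisions = 0, []
    for rank, is_match in enumerate(matches, start=1):
        if is_match:
            hits += 1
            precisions.append(hits / rank)  # precision at this recall point
    return sum(precisions) / len(precisions) if precisions else 0.0


def cmc_hit(matches, k):
    """1.0 if a correct match appears within the top-k results, else 0.0."""
    return 1.0 if any(matches[:k]) else 0.0


# Two hypothetical single-query rankings over a four-image gallery.
queries = [
    [True, False, True, False],
    [False, True, False, False],
]

mean_ap = sum(average_precision(m) for m in queries) / len(queries)
rank1 = sum(cmc_hit(m, 1) for m in queries) / len(queries)
rank5 = sum(cmc_hit(m, 5) for m in queries) / len(queries)
print(round(mean_ap, 4), rank1, rank5)
```

The rank-1 and rank-5 columns in the result tables correspond to points on the CMC curve, while mAP averages the per-query average precision.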

4.2 Implementation Details

We implement DCDS based on the ResNet101 [15] architecture, pretrained on the ImageNet dataset. We adopt the training strategy of Kalayeh et al. [16] and aggregate eight different person re-id benchmark datasets to train our model. In total, the merged dataset contains 89,091 images, comprising 4,937 person-IDs (details of the eight datasets are given in the supplementary material). We first train our model on the merged dataset (denoted as multi-dataset (MD)) for 150 epochs and fine-tune it on the CUHK03, Market1501, and DukeMTMC-reID datasets. To train our model on the merged dataset, we set the image resolution to 450 $\times$ 150. Subsequently, for fine-tuning the model we set the image resolution to 384 $\times$ 128. The mini-batch size is set to 64; each mini-batch has 16 person-IDs and each person-ID has 4 images. We also experiment with using only a single dataset for training and testing, denoted as single-dataset (SD). For data augmentation, we apply random horizontal flipping and random erasing [47]. For optimization we use Adam with an initial learning rate of 0.0001, which is decayed by a factor of 0.1 every 40 epochs. The fusing parameter in Equ. 6, $\beta$, is set to 0.9.
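The learning-rate schedule described above is a plain step decay. A minimal sketch, assuming the drop happens exactly at epochs 40, 80, and 120 (the paper does not state the boundary convention), with a hypothetical helper name:

```python
def learning_rate(epoch, base_lr=1e-4, gamma=0.1, step=40):
    """Step decay: multiply the base rate by gamma once every `step` epochs."""
    return base_lr * gamma ** (epoch // step)


# Learning rate at a few epochs of the 150-epoch multi-dataset training run.
schedule = {epoch: learning_rate(epoch) for epoch in (0, 39, 40, 80, 120, 149)}
print(schedule)
```

In a PyTorch training loop the same behavior is typically obtained with a step scheduler attached to the Adam optimizer.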

4.3 Results on Market1501 Dataset

As can be seen from Table 1, on the Market1501 dataset our proposed method improves on the state-of-the-art method [16] by $2.5\%$, $1.21\%$, and $0.6\%$ in mAP, rank-1, and rank-5 scores, respectively. Moreover, compared to the state-of-the-art graph-based DNN method, SGGNN [26], the improvement margins are $3\%$, $2.5\%$, and $2\%$ in mAP, rank-1, and rank-5 scores, respectively. Thus, our framework demonstrates clear benefits over state-of-the-art graph-based DNN models. To further improve the result, we adopt a re-ranking scheme [46] and compare our method with state-of-the-art methods that use re-ranking as a post-processing step. As can be seen from Table 1, our method gains $2.2\%$ in mAP over HSP [16], $10.5\%$ over SGGNN [26], and $10.8\%$ over DGSRW [25].

4.4 Results on CUHK03 Dataset

Table 5 shows the performance of our method on the CUHK03 dataset. Since most graph-based DNN models report their results on the standard protocol [20], we also use the standard evaluation protocol to make a fair comparison. As can be observed from Table 5, our method gains a marginal improvement in mAP. With a re-ranking method [46], we report competitive results in all evaluation metrics.

4.5 Results on DukeMTMC-reID Dataset

Likewise, on the DukeMTMC-reID dataset, the improvements of our proposed method are noticeable. Our method surpasses the state-of-the-art method [16] by $1.7\%/1.6\%$ in mAP/rank-1 scores. Moreover, compared to state-of-the-art graph-based DNNs, our method outperforms DGSRW [25], SGGNN [26], and GCSL [7] by $9.1\%$, $7.3\%$, and $6\%$ in mAP, respectively.

4.6 Ablation Study

To investigate the impact of each component in our architecture, we perform an ablation study and report the contribution of each module in Table 2. To make a fair comparison with the baseline and graph-based DNN models, the ablation study is conducted in the single-dataset (SD) setup.

Improvements over the Baseline. As our main contribution is DCDS, we examine its impact over the baseline method. The baseline method refers to the lower branch of our architecture that incorporates the verification network, which has also been utilized in [27], [25], [26]. On the Market1501 dataset, DCDS provides improvements of $9.2\%$, $6.8\%$, and $3.6\%$ in mAP, rank-1, and rank-5 scores, respectively, over the baseline method, whereas on the DukeMTMC-reID dataset DCDS improves the baseline method by $8.0\%$, $5.5\%$, and $1.7\%$ in mAP, rank-1, and rank-5 scores, respectively.

Comparison with graph-based deep models. We compare our method with recent graph-based deep models that adopt a baseline method similar to ours, such as [25], [26]. On the DukeMTMC-reID dataset, our method surpasses [25] by $9.1\%/6.8\%$ and [26] by $17.9\%/7.4\%$ in mAP/rank-1 scores. In light of this, we can conclude that incorporating a constrained-clustering mechanism in end-to-end learning has a significant benefit for finding a robust similarity ranking. In addition, the experimental findings demonstrate the superiority of DCDS over existing graph-based DNN models.

Parameter analysis. Experimental results obtained by varying several parameters are shown in Figure 7. Figure 7(a) shows the effect of the fusing parameter $\beta$ of Eq. (6) on the mAP. We observe that the mAP tends to increase with a larger $\beta$ value. This shows that the result improves as we deviate further from the CDS branch. Figure 7(b) shows the impact of the number of images per person-ID ($\Omega$) in a given batch. We experimented with $\Omega$ set to 4, 8, and 16; as can be seen, we obtain a marginal improvement when $\Omega$ is set to 16. However, given that the running time grows directly with $\Omega$, the improvement is negligible. Figures 7(c) and 7(d) show the probe-gallery similarity obtained from the baseline and DCDS methods, using three different probe images, with a batch size of 64 and $\Omega$ set to 4, 8, and 16.

5 Conclusion

In this work, we presented a novel insight to enhance the learning capability of a DNN through the exploitation of a constrained clustering mechanism. To validate our method, we conducted extensive experiments on several benchmark datasets. The proposed method not only improves on state-of-the-art person re-id methods but also demonstrates the benefit of incorporating a constrained-clustering mechanism in the end-to-end learning process. Furthermore, the presented work could naturally be extended to other applications that leverage similarity-based learning. As future work, we would like to investigate dominant sets clustering as a loss function.

Acknowledgment

This research is partly supported by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via IARPA R&D Contract No. D17PC00345. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the ODNI, IARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon.

Appendix A Datasets

In the multi-dataset (MD) setup, we first train our model on eight datasets: CUHK03 [20], CUHK01 [18], Market1501 [43], DukeMTMC-reID [45], VIPeR [10], MSMT17 [35], GRID [23], and iLIDS [44]. Next, we fine-tune and evaluate on each of the CUHK03 [20], Market1501 [43], and DukeMTMC-reID [45] datasets.

Appendix B Experiments on Cross-datasets Evaluation

Due to the lack of abundant labeled data, cross-dataset person re-id has attracted great interest. Recently, Fan et al. [13] developed a progressive clustering-based method to tackle the cross-dataset person re-id problem. To further validate our proposed DCDS, we apply our method to the cross-dataset person re-id problem and compare it with progressive unsupervised learning (PUL) [13]. To this end, we train our model on the DukeMTMC-reID and CUHK03 datasets and test it on the Market1501 dataset. We then compare it with PUL [13], which has also been trained on the CUHK03 and DukeMTMC-reID datasets. As can be observed from Table 5, even though our proposed method is not intended for cross-dataset re-id, it gains a substantial improvement over PUL [13], which was mainly designed to tackle the person re-id problem in a cross-dataset setup.

Appendix C Parameter Analysis

Similar to the parameter analysis reported in the main manuscript, we report a hyperparameter analysis on the DukeMTMC-reID and CUHK03 datasets. The performance of our method with respect to the fusing parameter on DukeMTMC-reID and CUHK03 is shown in Figure 10(a) and Figure 10(b), respectively. As can be observed, the results show the same trend as on Market1501, where the mAP increases with a larger $\beta$ value. Figure 11 shows the similarity distribution given by the baseline and the proposed DCDS using three different probe images, with a batch size of 64 and $\Omega$ set to 4, 8, and 16.

Bibliography

[1] E. Ahmed, M. J. Jones, and T. K. Marks. An improved deep learning architecture for person re-identification. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, pages 3908--3916, 2015.
[2] L. T. Alemu and M. Pelillo. Multi-feature fusion for image retrieval using constrained dominant sets. CoRR, abs/1808.05075, 2018.
[3] S. Bai, X. Bai, and Q. Tian. Scalable person re-identification on supervised smoothed manifold. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 3356--3365, 2017.
[4] S. Bai, Z. Zhou, J. Wang, X. Bai, L. J. Latecki, and Q. Tian. Ensemble diffusion for retrieval. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pages 774--783, 2017.
[5] G. Bertasius, L. Torresani, S. X. Yu, and J. Shi. Convolutional random walk networks for semantic image segmentation. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 6137--6145, 2017.
[6] X. Chang, T. M. Hospedales, and T. Xiang. Multi-level factorisation net for person re-identification. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pages 2109--2118, 2018.
[7] D. Chen, D. Xu, H. Li, N. Sebe, and X. Wang. Group consistent similarity learning via deep CRF for person re-identification. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pages 8649--8658, 2018.
[8] W. Chen, X. Chen, J. Zhang, and K. Huang. Beyond triplet loss: A deep quadruplet network for person re-identification. In CVPR, pages 1320--1329. IEEE Computer Society, 2017.