Distance Metric Learned Collaborative Representation Classifier

Tapabrata Chakraborti; Brendan McCane; Steven Mills; Umapada Pal

arXiv:1905.01168·cs.CV·October 4, 2021


TL;DR

This paper introduces DML-CRC, a method that learns an optimal Mahalanobis distance metric within a deep network to improve fine-grained classification accuracy, achieving state-of-the-art results.

Contribution

It proposes a novel end-to-end approach to learn a Mahalanobis distance metric integrated with a convolutional network for enhanced classification.

Findings

1. State-of-the-art accuracy on the CUB Birds, Oxford Flowers and Oxford-IIIT Pets datasets.

2. Network-agnostic method applicable to various classification tasks.

3. Effective integration of metric learning with deep feature extraction.

Abstract

Any generic deep machine learning algorithm is essentially a function fitting exercise, where the network tunes its weights and parameters to learn discriminatory features by minimizing some cost function. Though the network tries to learn the optimal feature space, it seldom tries to learn an optimal distance metric in the cost function, and hence misses out on an additional layer of abstraction. We present a simple, effective way of achieving this by learning a generic Mahalanobis distance in a collaborative loss function in an end-to-end fashion with any standard convolutional network as the feature learner. The proposed method, DML-CRC, gives state-of-the-art performance on the benchmark fine-grained classification datasets CUB Birds, Oxford Flowers and Oxford-IIIT Pets using the VGG-19 deep network. The method is network agnostic and can be used for any similar classification task.


Tables

Table 1: Classification results of the proposed DML-CRC versus competitors on three fine-grained datasets (five-fold cross-validation mean percentage accuracy).

Method	CUB Birds	Oxford Flowers	Oxford-IIIT Pets
CRC	75.24	91.83	83.30
PCRC	76.95	93.06	84.88
ProCRC	78.33	94.87	86.92
Constellation	81.01	95.34	91.60
OPAM	85.83	97.10	93.81
DML-CRC	88.49	98.65	95.12
DML-ProCRC	89.95	99.33	96.58

Equations

$$J(\alpha,\lambda)=\arg\min_{\alpha}\left(\|y-X\alpha\|_{2}^{2}+\lambda\|\alpha\|_{2}^{2}\right) \tag{1}$$

$$\hat{\alpha}=(X^{T}X+\lambda I)^{-1}X^{T}y \tag{2}$$

$$r_{i}(y)=\frac{\|y-X_{i}\hat{\alpha}_{i}\|_{2}^{2}}{\|\hat{\alpha}_{i}\|_{2}^{2}}\quad\forall i\in\{1,\dots,c\} \tag{3}$$

$$C(y)=\arg\min_{i}\,r_{i}(y) \tag{4}$$

$$J(\alpha,\Sigma)=(y-X\alpha)^{T}\Sigma^{-1}(y-X\alpha)+\lambda\|\alpha\|_{2}^{2}+\gamma\|\Sigma\|_{2}^{2} \tag{5}$$

$$\frac{\partial J}{\partial\alpha}=-2X^{T}\Sigma^{-1}(y-X\alpha)+2\lambda\alpha=0 \tag{6}$$

$$\frac{\partial J}{\partial\Sigma}=-\Sigma^{-1}(y-X\alpha)(y-X\alpha)^{T}\Sigma^{-1}+2\gamma\Sigma=0 \tag{7}$$

$$\Sigma=\frac{\Sigma^{-1}(y-X\alpha)(y-X\alpha)^{T}\Sigma^{-1}}{2\gamma} \tag{8}$$

$$\alpha=(X^{T}\Sigma^{-1}X+\lambda I)^{-1}X^{T}\Sigma^{-1}y \tag{9}$$

$$\frac{\partial J}{\partial X}=-2\Sigma^{-1}(y-X\alpha)\alpha^{T} \tag{10}$$

$$J(p_{j},\lambda)=\|y_{j}-M_{j}p_{j}\|_{2}^{2}+\lambda\|p_{j}\|_{2}^{2} \tag{11}$$

$$J(\alpha,\lambda,\gamma)=\|y-X\alpha\|_{2}^{2}+\lambda\|\alpha\|_{2}^{2}+\frac{\gamma}{K}\sum_{k=1}^{K}\|X\alpha-X_{k}\alpha_{k}\|_{2}^{2} \tag{12}$$


Taxonomy

Topics: Face and Expression Recognition · Neural Networks and Applications · Advanced Image and Video Retrieval Techniques

Methods: Visual Geometry Group 19 Layer CNN (VGG-19)

Full text

Distance Metric Learned Collaborative Representation Classifier

Tapabrata Chakraborti

Dept. of Computer Science, University of Otago, NZ

Brendan McCane

Dept. of Computer Science, University of Otago, NZ

Steven Mills

Dept. of Computer Science, University of Otago, NZ

Umapada Pal

CVPR Unit, Indian Statistical Institute, India


1 Introduction

Deep learning has achieved near-human accuracy in recent years in many vision-based tasks. However, any neural-network-inspired machine learning algorithm basically fits a function to the given data, using many parameters, so as to learn discriminatory features from the input in an end-to-end manner. These features are then used for the final discrimination with a standard distance metric. Though the network tries to learn the optimal feature space, it seldom tries to learn an optimal distance metric in the cost function, and hence misses out on an additional layer of abstraction [16,17].

The intuition for this work is that if the deep learned features are fed into a cost function whose distance metric is also learned in tandem, in an end-to-end manner, then this might further increase the inter-class distance and help with advanced classification tasks such as fine-grained visual categorization. Deep convolutional networks are already proficient at recognizing base classes given sufficient data, but robust classification of sub-classes with fine-grained differences is still an open problem [1]. Thus, as the representative problem to demonstrate our method, we choose fine-grained species recognition [2].

As the cost function, we use a collaborative representation classifier (CRC), which has recently been shown to be effective in fine-grained recognition [3]. CRC represents the test image as an optimal weighted average of training images across all classes, and the predicted label is the class with the least residual. CRC has a closed-form solution; thus it is efficient and analytic. It is also a general feature representation-classification scheme, and thus most popular features are compatible with it.

The main contribution of this letter is to learn a generic distance metric in the cost function of a deep network in tandem with the learned features in an end-to-end manner. We provide an analytical derivation of the partial derivatives needed to optimise the distance metric and then back-propagate the gradients. The resulting system generalises widely since it is agnostic of the deep architecture and so can be used for any classification task. The method achieves state-of-the-art results on three benchmark fine-grained species recognition datasets with the standard VGG-19 [9] deep network. We use standard publicly available models pre-trained on ImageNet [10] and fine-tuned on the three datasets, CUB Birds [5], Oxford Flowers [6] and Oxford-IIIT Pets [8], for fair comparison and ready reproducibility.

The rest of the paper is organized as follows. In Section 2, we present the original CRC in brief and the proposed DML-CRC in detail. Section 3 provides the experimental setup (the datasets, deep network and competing classifiers). Section 4 presents the experimental results with statistical analysis, followed by concluding remarks in Section 5.

2 Proposed Method

In this Section, we briefly describe the original formulation of the Collaborative Representation Classifier (CRC) [4][20][21]. We then introduce in detail the proposed Distance Metric Learned CRC (DML-CRC).

2.1 Collaborative Representation Classifiers (CRC)

Consider a training dataset with images in some feature space (such as one learned by a deep convolutional network) as $X=[X_{1},\dots,X_{c}]\in\mathbb{R}^{d\times N}$, where $N$ is the total number of samples over $c$ classes and $d$ is the feature dimension per sample. Thus $X_{i}\in\mathbb{R}^{d\times n_{i}}$ is the feature space representation of class $i$ with $n_{i}$ samples such that $\sum_{i=1}^{c}n_{i}=N$.

The CRC model reconstructs a test image in the feature space $y\in\mathbb{R}^{d}$ as an optimal collaboration of all training samples, while at the same time limiting the size of the reconstruction parameters via the regularization term $\lambda$.

The CRC cost function is given as

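$$J(\alpha,\lambda)=\arg\min_{\alpha}\left(\|y-X\alpha\|_{2}^{2}+\lambda\|\alpha\|_{2}^{2}\right) \tag{1}$$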

where $\alpha=[\alpha_{1},\dots,\alpha_{c}]\in\mathbb{R}^{N}$ and $\alpha_{i}\in\mathbb{R}^{n_{i}}$ is the vector of reconstruction coefficients corresponding to class $i$.

A least-squares derivation yields the optimal solution as

$$\hat{\alpha}=(X^{T}X+\lambda I)^{-1}X^{T}y \tag{2}$$
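The least-squares step can be spelled out as follows. This is the standard ridge-regression argument, filled in here for completeness rather than quoted from the paper: setting the gradient of the cost with respect to $\alpha$ to zero gives the normal equations, and for $\lambda>0$ the matrix $X^{T}X+\lambda I$ is positive definite, so its inverse exists.

```latex
\begin{aligned}
J(\alpha) &= \|y - X\alpha\|_{2}^{2} + \lambda\|\alpha\|_{2}^{2} \\
\frac{\partial J}{\partial\alpha} &= -2X^{T}(y - X\alpha) + 2\lambda\alpha = 0 \\
(X^{T}X + \lambda I)\,\alpha &= X^{T}y \\
\hat{\alpha} &= (X^{T}X + \lambda I)^{-1}X^{T}y
\end{aligned}
```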

The representation residual of class $i$ for test sample $y$ can be calculated as:

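$$r_{i}(y)=\frac{\|y-X_{i}\hat{\alpha}_{i}\|_{2}^{2}}{\|\hat{\alpha}_{i}\|_{2}^{2}}\quad\forall i\in\{1,\dots,c\} \tag{3}$$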

The final class of test sample $y$ is thus given by

$$C(y)=\arg\min_{i}\,r_{i}(y) \tag{4}$$
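The CRC steps described in this subsection (closed-form coefficients, per-class residuals, arg-min over classes) can be sketched in NumPy as below. This is a minimal illustration on made-up toy data, not the authors' implementation; the function name `crc_classify`, the value of `lam` and the synthetic features are all assumptions introduced here.

```python
import numpy as np


def crc_classify(X, labels, y, lam=1e-2):
    """Classify y by collaborative representation over the columns of X."""
    N = X.shape[1]
    # Closed-form coefficients: (X^T X + lam I)^{-1} X^T y
    alpha = np.linalg.solve(X.T @ X + lam * np.eye(N), X.T @ y)
    classes = np.unique(labels)
    residuals = []
    for c in classes:
        mask = labels == c
        # Reconstruction error using only the columns and coefficients of class c,
        # normalised by the squared norm of that class's coefficients.
        recon_err = np.sum((y - X[:, mask] @ alpha[mask]) ** 2)
        residuals.append(recon_err / np.sum(alpha[mask] ** 2))
    return classes[int(np.argmin(residuals))], alpha


# Toy data (hypothetical): two classes clustered around different centres.
rng = np.random.default_rng(0)
d, n = 20, 5
centre_a, centre_b = rng.normal(size=d), rng.normal(size=d)
X = np.column_stack(
    [centre_a + 0.1 * rng.normal(size=d) for _ in range(n)]
    + [centre_b + 0.1 * rng.normal(size=d) for _ in range(n)]
)
labels = np.array([0] * n + [1] * n)
y = centre_b + 0.1 * rng.normal(size=d)   # test sample drawn near class 1

predicted, alpha = crc_classify(X, labels, y)
```

In a real pipeline the columns of `X` and the vector `y` would be deep features rather than random vectors; `np.linalg.solve` is used instead of forming the inverse explicitly, which is a numerical choice made for this sketch.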

2.2 Distance Metric Learned CRC (DML-CRC)

Most CRC methods, if not all, use the Euclidean $l_{2}$ norm or the Frobenius norm in the cost function. We replace it by a general Mahalanobis distance metric $\Sigma$ which can be optimised analytically, giving:

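$$J(\alpha,\Sigma)=(y-X\alpha)^{T}\Sigma^{-1}(y-X\alpha)+\lambda\|\alpha\|_{2}^{2}+\gamma\|\Sigma\|_{2}^{2} \tag{5}$$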

Let $X$ be the training set represented in the feature domain of the pre-trained deep model, and let $y$ be each incoming image, in the same feature domain, used to fine-tune the network. Our aim is to find the optimal $\Sigma$ and $\alpha$ that minimize the cost function during the fine-tuning process.

Differentiating $J$ with respect to $\alpha$, keeping $\Sigma$ constant, we have:

$$\frac{\partial J}{\partial\alpha}=-2X^{T}\Sigma^{-1}(y-X\alpha)+2\lambda\alpha=0 \tag{6}$$

Differentiating $J$ with respect to $\Sigma$, keeping $\alpha$ constant, we have:

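$$\frac{\partial J}{\partial\Sigma}=-\Sigma^{-1}(y-X\alpha)(y-X\alpha)^{T}\Sigma^{-1}+2\gamma\Sigma=0 \tag{7}$$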
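As a sanity check on the derivative with respect to $\Sigma$, the analytic expression $-\Sigma^{-1}(y-X\alpha)(y-X\alpha)^{T}\Sigma^{-1}+2\gamma\Sigma$ can be compared against central finite differences. The sketch below assumes that $\Sigma$ is symmetric positive definite and reads $\|\Sigma\|_{2}^{2}$ as the squared Frobenius norm, which is what the $2\gamma\Sigma$ term implies; the function names and the random test values are hypothetical.

```python
import numpy as np


def dml_cost(X, y, alpha, Sigma, lam, gamma):
    """J = r^T Sigma^{-1} r + lam ||alpha||^2 + gamma ||Sigma||_F^2, with r = y - X alpha."""
    r = y - X @ alpha
    return (r @ np.linalg.solve(Sigma, r)
            + lam * alpha @ alpha
            + gamma * np.sum(Sigma ** 2))


def grad_sigma(X, y, alpha, Sigma, gamma):
    """Analytic gradient for symmetric Sigma: -s s^T + 2 gamma Sigma, s = Sigma^{-1} r."""
    s = np.linalg.solve(Sigma, y - X @ alpha)
    return -np.outer(s, s) + 2.0 * gamma * Sigma


rng = np.random.default_rng(1)
d, N = 4, 6
X = rng.normal(size=(d, N))
y = rng.normal(size=d)
alpha = rng.normal(size=N)
A = rng.normal(size=(d, d))
Sigma = A @ A.T + d * np.eye(d)            # symmetric positive definite
lam, gamma, eps = 0.1, 0.05, 1e-6

# Central finite differences, perturbing one entry of Sigma at a time.
numeric = np.zeros((d, d))
for i in range(d):
    for j in range(d):
        E = np.zeros((d, d))
        E[i, j] = eps
        numeric[i, j] = (dml_cost(X, y, alpha, Sigma + E, lam, gamma)
                         - dml_cost(X, y, alpha, Sigma - E, lam, gamma)) / (2 * eps)

max_gap = np.max(np.abs(numeric - grad_sigma(X, y, alpha, Sigma, gamma)))
```

A small `max_gap` indicates that the analytic and numerical gradients agree at this test point; it does not by itself say anything about how the stationarity condition is solved during training.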

Solving Equations 6 and 7 simultaneously, we obtain the new values of $\Sigma$ and $\alpha$ as:

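$$\Sigma=\frac{\Sigma^{-1}(y-X\alpha)(y-X\alpha)^{T}\Sigma^{-1}}{2\gamma} \tag{8}$$

$$\alpha=(X^{T}\Sigma^{-1}X+\lambda I)^{-1}X^{T}\Sigma^{-1}y \tag{9}$$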
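The $\alpha$ update, $\alpha=(X^{T}\Sigma^{-1}X+\lambda I)^{-1}X^{T}\Sigma^{-1}y$, can be sketched for a fixed symmetric positive definite $\Sigma$ as below. The sketch also checks that the gradient with respect to $\alpha$ vanishes at the result and that the update falls back to the plain CRC solution when $\Sigma=I$. The companion $\Sigma$ update is implicit in $\Sigma$, and how it is iterated is not spelled out in this text, so it is deliberately left out here; `update_alpha` and the test values are hypothetical.

```python
import numpy as np


def update_alpha(X, y, Sigma, lam):
    """alpha = (X^T Sigma^{-1} X + lam I)^{-1} X^T Sigma^{-1} y for fixed Sigma."""
    SinvX = np.linalg.solve(Sigma, X)      # Sigma^{-1} X
    Sinvy = np.linalg.solve(Sigma, y)      # Sigma^{-1} y
    N = X.shape[1]
    return np.linalg.solve(X.T @ SinvX + lam * np.eye(N), X.T @ Sinvy)


rng = np.random.default_rng(2)
d, N, lam = 8, 5, 0.1
X = rng.normal(size=(d, N))
y = rng.normal(size=d)
A = rng.normal(size=(d, d))
Sigma = A @ A.T + d * np.eye(d)            # symmetric positive definite

alpha = update_alpha(X, y, Sigma, lam)

# Gradient of J with respect to alpha, evaluated at the update; it should vanish.
grad = -2.0 * X.T @ np.linalg.solve(Sigma, y - X @ alpha) + 2.0 * lam * alpha

# With Sigma = I the update reduces to the ordinary CRC (ridge) solution.
alpha_identity = update_alpha(X, y, np.eye(d), lam)
alpha_ridge = np.linalg.solve(X.T @ X + lam * np.eye(N), X.T @ y)
```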

During a specific round of back-propagation, once the new $\Sigma$ and $\alpha$ are set, the gradients are propagated back through the network using the partial derivative of $J$ with respect to $X$, as follows.

$$\frac{\partial J}{\partial X}=-2\Sigma^{-1}(y-X\alpha)\alpha^{T} \tag{10}$$
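A dimensionally consistent form of this gradient is the outer product $-2\Sigma^{-1}(y-X\alpha)\alpha^{T}$, a $d\times N$ matrix matching the shape of $X$. The sketch below checks that form numerically with $\alpha$ and $\Sigma$ held fixed, assuming a symmetric $\Sigma$; only the data term of $J$ depends on $X$, so the regularisers are omitted. The function names and test values are hypothetical.

```python
import numpy as np


def data_term(X, y, alpha, Sigma):
    """(y - X alpha)^T Sigma^{-1} (y - X alpha): the only part of J that depends on X."""
    r = y - X @ alpha
    return r @ np.linalg.solve(Sigma, r)


def grad_X(X, y, alpha, Sigma):
    """-2 Sigma^{-1} (y - X alpha) alpha^T, with the same shape as X."""
    return -2.0 * np.outer(np.linalg.solve(Sigma, y - X @ alpha), alpha)


rng = np.random.default_rng(6)
d, N, eps = 4, 3, 1e-6
X = rng.normal(size=(d, N))
y = rng.normal(size=d)
alpha = rng.normal(size=N)
A = rng.normal(size=(d, d))
Sigma = A @ A.T + d * np.eye(d)            # symmetric positive definite

# Central finite differences over every entry of X.
numeric = np.zeros_like(X)
for i in range(d):
    for j in range(N):
        E = np.zeros_like(X)
        E[i, j] = eps
        numeric[i, j] = (data_term(X + E, y, alpha, Sigma)
                         - data_term(X - E, y, alpha, Sigma)) / (2 * eps)

max_gap = np.max(np.abs(numeric - grad_X(X, y, alpha, Sigma)))
```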

The DML-CRC training procedure for fine-tuning is presented in Algorithm 1. For further details on similar back-propagation schemes, the reader may refer to [18,19].

3 Experimental Setup

In this Section, we describe the experimental setup: the datasets, chosen deep network and competing classifiers.

3.1 Benchmark Datasets

We have used three benchmark fine-grained species recognition datasets.

• CUB-2011 dataset: It contains 11,788 images of 200 bird species [5]. The main challenge of this dataset is considerable variation and confounding features in background information compared to subtle inter-class differences in birds.

• Oxford Flowers dataset: It has 8,189 images of 102 flowers, with at least 40 images per class [6]. It was developed by the Robotics Group at Oxford University. It is an expansion of the earlier dataset by the same group with 17 flower types with 80 images per class [7].

• Oxford Pets dataset: This dataset, compiled by the Oxford Robotics Group and IIIT Hyderabad, consists of 37 categories of pet cats and dogs with around 200 images belonging to each class [8].

3.2 Training on VGG-19 Deep Convolutional Network

We have used the standard VGG-19 deep convolutional network from the Oxford Robotics group [9]. It has 19 layers, is trained on more than one million images from the ImageNet [10] dataset, and can classify up to 1000 object categories. We have fine-tuned the pre-trained VGG-19 model on our target datasets. For details of the training protocol, please refer directly to the benchmark work by Simon et al. on neural constellation activations [11]. For fair comparison, we have used the baseline models provided by the authors of [11] in their GitHub repository [15]: VGG-19 models pre-trained on ImageNet as well as models fine-tuned on the CUB Birds, Oxford Flowers and Oxford-IIIT Pets datasets using the Caffe deep learning framework.

3.3 Competing Classifiers

We discuss two CRC based and two non-CRC based methods that have been used here for comparison. Note that all the methods have been used with VGG-19 features, but our method can be applied with any learned features.

3.3.1 CRC based deep network classifiers

There are many variants of CRC available; we choose patch based CRC (PCRC) as a major sub-class and probabilistic CRC (ProCRC) as a recent variant.

Patch based CRC (PCRC) by Zhu et al. [12] is a patch-based framework for collaborative representation. Let the query image $y$ be divided into $q$ overlapping patches $y=\{y_{1},\dots,y_{q}\}$. From the feature matrix $X$, a local feature matrix $M_{j}$ is extracted corresponding to the location of patch $y_{j}$. Thus the modified cost function becomes:

$$J(p_{j},\lambda)=\|y_{j}-M_{j}p_{j}\|_{2}^{2}+\lambda\|p_{j}\|_{2}^{2} \tag{11}$$

where $M_{j}=[M_{j1},\dots,M_{jc}]$ are the local dictionaries for the $c$ classes and $\hat{p}_{j}=[\hat{p}_{j1},\dots,\hat{p}_{jc}]$ is the optimal reconstruction matrix for patch $j$.
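Because the patch-level cost has the same ridge form as the original CRC cost, each patch can be solved independently in closed form. The sketch below shows only that per-patch solve on random stand-in data; how PCRC extracts the local dictionaries and combines the patch-level results into one label is not described in this text and is not modelled. The function name and all sizes are hypothetical.

```python
import numpy as np


def pcrc_patch_coefficients(patches, dictionaries, lam=1e-2):
    """Solve min ||y_j - M_j p_j||^2 + lam ||p_j||^2 separately for each patch j."""
    coeffs = []
    for y_j, M_j in zip(patches, dictionaries):
        n_atoms = M_j.shape[1]
        # Same closed form as CRC, applied to the local dictionary M_j.
        p_j = np.linalg.solve(M_j.T @ M_j + lam * np.eye(n_atoms), M_j.T @ y_j)
        coeffs.append(p_j)
    return coeffs


# Random stand-ins for q patch features and their local dictionaries.
rng = np.random.default_rng(3)
q, d_patch, n_atoms, lam = 3, 12, 6, 1e-2
patches = [rng.normal(size=d_patch) for _ in range(q)]
dictionaries = [rng.normal(size=(d_patch, n_atoms)) for _ in range(q)]

coeffs = pcrc_patch_coefficients(patches, dictionaries, lam)
```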

Probabilistic CRC (ProCRC) by Cai et al. [13] is a probabilistic formulation of CRC in which each term is modeled by a Gaussian distribution. The final cost function for ProCRC is formulated as the maximisation of the joint probability of the test image belonging to each of the possible classes, treated as independent events. The final classification is performed by checking which class has the maximum likelihood.

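$$J(\alpha,\lambda,\gamma)=\|y-X\alpha\|_{2}^{2}+\lambda\|\alpha\|_{2}^{2}+\frac{\gamma}{K}\sum_{k=1}^{K}\|X\alpha-X_{k}\alpha_{k}\|_{2}^{2} \tag{12}$$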
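The ProCRC cost $\|y-X\alpha\|_{2}^{2}+\lambda\|\alpha\|_{2}^{2}+\frac{\gamma}{K}\sum_{k}\|X\alpha-X_{k}\alpha_{k}\|_{2}^{2}$ is quadratic in $\alpha$, so a closed-form minimiser can be derived by noting that $X\alpha-X_{k}\alpha_{k}=\bar{X}_{k}\alpha$, where $\bar{X}_{k}$ is $X$ with the columns of class $k$ set to zero. The sketch below implements that derivation, which is worked out here from the cost as written and is not quoted from the paper; it takes $K$ to be the number of classes, and the function names and toy data are hypothetical.

```python
import numpy as np


def procrc_cost(X, labels, y, alpha, lam, gamma):
    """Evaluate the ProCRC cost, taking K as the number of distinct labels."""
    classes = np.unique(labels)
    full = X @ alpha
    penalty = sum(
        np.sum((full - X[:, labels == k] @ alpha[labels == k]) ** 2)
        for k in classes
    )
    return (np.sum((y - full) ** 2) + lam * alpha @ alpha
            + gamma / len(classes) * penalty)


def procrc_solve(X, labels, y, lam, gamma):
    """Minimiser of the cost: solve (X^T X + lam I + gamma/K sum_k Xbar_k^T Xbar_k) a = X^T y."""
    classes = np.unique(labels)
    N = X.shape[1]
    G = X.T @ X + lam * np.eye(N)
    for k in classes:
        X_bar = X.copy()
        X_bar[:, labels == k] = 0.0        # X alpha - X_k alpha_k equals X_bar alpha
        G += gamma / len(classes) * (X_bar.T @ X_bar)
    return np.linalg.solve(G, X.T @ y)


rng = np.random.default_rng(4)
d, n_per_class, n_classes = 10, 4, 3
X = rng.normal(size=(d, n_per_class * n_classes))
labels = np.repeat(np.arange(n_classes), n_per_class)
y = rng.normal(size=d)
lam, gamma = 0.1, 0.5

alpha_hat = procrc_solve(X, labels, y, lam, gamma)
best = procrc_cost(X, labels, y, alpha_hat, lam, gamma)
```

Since the cost is strictly convex for $\lambda>0$, any perturbation of `alpha_hat` should give a cost above `best`, which is a quick way to check the derivation.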

3.3.2 Non-CRC based classifiers used with VGG-Net

We use Constellation models because of their popularity in fine-grained recognition and because we have used their pre-trained models directly for fair comparison. The other choice is a very recent paper on part attention models, included to compare against the state-of-the-art.

Constellation Neural Activations by Simon et al. [11] finds activation patterns with the help of convolutional networks in a completely unsupervised manner (no annotation or bounding box) to identify discriminatory parts for fine-grained classification. This is one of the popular baseline works in fine-grained classification and also provides the pre-trained models used in the current work.

Object Part Attention Models by Peng et al. [14] is a very recently published work in fine-grained recognition and can be considered state-of-the-art. It reports results on the same datasets used in this work with VGG-19 features. This work combines object-level and part-level attention models with a spatial constraint that preserves spatial patterns.

4 Experimental Results

For each dataset, experiments are conducted with five-fold cross-validation, and percentage classification accuracies are presented in Table 1 with the accuracy of our method highlighted in bold. Among the CRC-based methods, basic CRC has the lowest accuracy, and performance increases across the CRC variants. The proposed DML-CRC comfortably outperforms the original CRC and its variants. We also compare our method against two deep learning based methods, the Constellation Model [11] and OPAM [14]. The rationale for choosing these two particular methods has been discussed in the previous Section. The proposed DML-CRC gives better results than both of these methods, thus establishing a new state-of-the-art. Fig. 1 presents qualitative results from the Oxford-IIIT Pets dataset.

It is important to note here that we deliberately used the original CRC cost function first, to emphasize the contribution of the distance metric learning. This is demonstrated by the fact that even with vanilla CRC, we outperform the state-of-the-art, albeit marginally in a few cases. It might therefore be expected that if a more recent version of CRC (like ProCRC) is used, the margin might increase. So we plug in the ProCRC cost function in place of the original CRC; the results are reported in Table 1 (as DML-ProCRC), and as expected the performance improves further.
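The margins discussed here can be read directly off Table 1. The short script below simply subtracts the reported mean accuracies; the numbers are copied from the table, while the dictionary layout and the helper `margin` are introduced here for illustration.

```python
# Mean accuracies (%) copied from Table 1; column order is
# CUB Birds, Oxford Flowers, Oxford-IIIT Pets.
accuracy = {
    "CRC": (75.24, 91.83, 83.30),
    "PCRC": (76.95, 93.06, 84.88),
    "ProCRC": (78.33, 94.87, 86.92),
    "Constellation": (81.01, 95.34, 91.60),
    "OPAM": (85.83, 97.10, 93.81),
    "DML-CRC": (88.49, 98.65, 95.12),
    "DML-ProCRC": (89.95, 99.33, 96.58),
}


def margin(method, baseline):
    """Per-dataset accuracy difference in percentage points, rounded to 2 decimals."""
    return tuple(round(a - b, 2) for a, b in zip(accuracy[method], accuracy[baseline]))


dml_vs_crc = margin("DML-CRC", "CRC")             # (13.25, 6.82, 11.82)
dml_vs_opam = margin("DML-CRC", "OPAM")           # (2.66, 1.55, 1.31)
procrc_gain = margin("DML-ProCRC", "DML-CRC")     # (1.46, 0.68, 1.46)
```

On these figures, metric learning adds between roughly 7 and 13 percentage points over vanilla CRC, the lead over OPAM is between about 1.3 and 2.7 points, and switching to the ProCRC cost adds up to a further 1.46 points.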

5 Conclusion

We have shown that learning the distance metric for final discrimination of a convolutional network in an end-to-end manner enhances the performance of the system, keeping other factors like network architecture, data and training protocol constant. We used a collaborative representation based cost function (CRC) to evaluate an optimal Mahalanobis distance metric. A detailed analytic derivation is provided for the partial derivatives of the CRC function. CRC has recently been shown to be effective for fine-grained classification, and we achieve state-of-the-art results on three benchmark fine-grained classification datasets (CUB Birds, Oxford Flowers and Oxford-IIIT Pets). It should be further noted that the proposed method is agnostic of the deep architecture used and may also be utilised for any generic visual classification task. The current work deliberately uses vanilla CRC first for benchmarking; we then use the more recent ProCRC and observe an improvement in performance. This suggests that other new CRC formulations could be used in future for possible further improvement in accuracy.

Bibliography

[1] Y. Chai, “Advances in Fine-grained Visual Categorization”, University of Oxford, 2015.
[2] E. Rodner, M. Simon, G. Brehm, S. Pietsch, J.-W. Wägele, and J. Denzler, “Fine-grained Recognition Datasets for Biodiversity Analysis”, In Proc. CVPR, 2015.
[3] T. Chakraborti, B. McCane, S. Mills, and U. Pal, “Collaborative representation based fine-grained species recognition”, In Proc. IVCNZ, 2016.
[4] L. Zhang, M. Yang, and X. Feng, “Sparse representation or collaborative representation: Which helps face recognition?”, In Proc. ICCV, 2011.
[5] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie, “The Caltech-UCSD Birds-200-2011 Dataset”, Technical Report CNS-TR-2011-001, California Institute of Technology, 2011.
[6] M-E. Nilsback and A. Zisserman, “Delving into the whorl of flower segmentation”, In Proc. BMVC, 2007.
[7] M-E. Nilsback and A. Zisserman, “Automated flower classification over a large number of classes”, In Proc. ICVGIP, 2008.
[8] O. M. Parkhi, A. Vedaldi, A. Zisserman, and C. V. Jawahar, “Cats and Dogs”, In Proc. CVPR, 2012.