On Minimum Discrepancy Estimation for Deep Domain Adaptation

Mohammad Mahfujur Rahman; Clinton Fookes; Mahsa Baktashmotlagh; Sridha Sridharan

arXiv:1901.00282·cs.CV·January 3, 2019·Domain Adaptation for Visual Understanding


TL;DR

This paper introduces a novel unsupervised deep domain adaptation method that aligns second order statistics and maximum mean discrepancy within a two-stream CNN to improve image classification across different domains.

Contribution

It proposes a new approach combining covariance alignment and MMD in a two-stream CNN for unsupervised domain adaptation, achieving state-of-the-art results.

Findings

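01 Achieves the best average accuracy on all three benchmark datasets (74.6% on Office-31, 48.00% on Office-Home, 93.6% on Office-Caltech)

02 Effective in handling domain shift in image classification

03 Outperforms the compared domain adaptation methods on average, though not on every individual transfer task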


Tables

Table 1: Image classification accuracies for deep domain adaptation on the Office-31 dataset. We use the standard protocol for unsupervised domain adaptation where source data are labeled, but target data are unlabeled. A-W indicates A (Amazon) is source and W (Webcam) is target.

Methods	A-W	D-W	D-A	W-A	W-D	A-D	Avg.
TCA [23]	21.5	50.1	8.0	14.6	58.4	11.4	27.3
GFK [8]	19.7	49.7	7.9	15.8	63.1	10.6	27.8
VGG16 [29]	63.9	81.6	46.9	54.1	91.9	63.1	66.9
AlexNet [15]	53.4	79.9	46.9	47.5	84.1	55.6	61.2
DANN [5]	73.9	94.9	-	-	99.5	-	-
D-CORAL [32]	67.2	94.5	52.6	51.6	98.7	64.9	71.6
DAN [17]	68.5	96.0	50.0	49.8	99.0	66.8	71.7
DRCN [6]	68.7	96.4	56.0	54.9	99.0	66.8	73.6
RTN [18]	73.3	96.8	50.5	51.0	99.6	71.0	73.7
DAH [37]	68.3	96.1	55.5	53.0	98.8	66.5	73.0
Our method	72.1	97.3	54.6	53.9	98.7	71.2	74.6

Table 2: Image classification accuracies for deep domain adaptation on the Office-Home dataset. We use the standard protocol for unsupervised domain adaptation where source data are labeled, but target data are unlabeled. A-C indicates that Art (Ar) is the source domain and Clipart (Cl) is the target domain; P and R denote the Product (Pr) and Real-World (Rw) domains.

Methods	A-C	A-P	A-R	C-A	C-P	C-R	P-A	P-C	P-R	R-A	R-C	R-P	Avg.
TCA [23]	19.93	32.08	35.71	19.00	31.36	31.74	21.92	23.64	42.12	30.74	27.15	48.68	30.34
GFK [8]	21.60	31.72	38.83	21.63	34.94	34.20	24.52	25.73	42.92	32.88	28.96	50.89	32.40
VGG16 [29]	30.40	45.92	57.54	35.40	48.67	50.75	35.77	30.51	60.20	49.62	34.54	64.00	45.28
AlexNet [15]	27.40	34.53	45.04	32.40	43.90	46.72	29.76	32.94	50.20	40.74	35.07	55.99	39.74
DANN [5]	33.33	42.96	54.42	32.26	49.13	49.76	30.49	38.14	56.76	44.71	42.66	64.65	44.94
D-CORAL [32]	32.18	40.47	54.45	31.47	45.8	47.29	30.03	32.33	55.27	44.73	42.75	59.40	42.79
DAN [17]	30.66	42.17	54.13	32.83	47.59	49.78	29.07	34.05	56.70	43.58	38.25	62.73	43.46
RTN [18]	31.23	40.19	54.56	32.46	46.60	48.25	28.20	32.89	56.38	45.53	44.74	61.28	43.53
DAH [37]	31.64	40.75	51.73	34.69	51.93	52.79	29.91	39.63	60.71	44.99	45.13	62.54	45.54
Our method	35.15	44.35	57.17	36.82	52.45	53.67	34.80	37.17	62.15	49.95	46.29	66.05	48.00

Table 3: Image classification accuracies for deep domain adaptation on the Office-Caltech dataset. We use the protocol for unsupervised domain adaptation where source data are labeled, but target data are unlabeled. A-C indicates A (Amazon) is source and C (Caltech) is target.

Methods	A-W	D-W	D-A	W-A	W-D	A-D	A-C	W-C	C-W	C-D	D-C	C-A	Avg.
TCA [23]	84.4	96.9	90.4	85.6	99.4	82.8	81.2	75.5	88.1	87.9	79.6	92.1	87.0
GFK [8]	89.5	97.0	89.8	88.5	98.1	86.0	76.2	77.1	78.0	77.1	77.9	90.7	85.5
AlexNet [15]	79.5	97.7	87.1	83.8	100.0	87.4	83.0	73.0	83.7	87.1	79.0	91.9	86.1
D-CORAL [32]	89.8	97.3	91.0	91.9	100.0	90.5	83.7	81.5	90.1	88.6	80.1	92.3	89.7
DAN [17]	91.8	98.5	90.0	92.1	100.0	91.7	84.1	81.2	90.3	89.3	80.3	92.0	90.1
RTN [18]	95.2	99.2	93.8	92.5	100.0	95.5	88.1	86.6	96.9	94.2	84.6	93.7	93.4
Our method	95.7	99.4	94.7	94.8	100.0	96.6	89.1	86.5	95.2	93.4	84.7	93.6	93.6

Equations

$$\min_{F_{s},F_{t}} D_{l}(D_{s},D_{t})_{fc7} + \min_{F_{s},F_{t}} MMD^{2}(D_{s},D_{t})_{fc7} + \min_{F_{s},F_{t}} D_{l}(D_{s},D_{t})_{fc8} + \min_{F_{s},F_{t}} MMD^{2}(D_{s},D_{t})_{fc8} + \sum_{i=1}^{N_{t}} H(F_{t}(X_{i}^{t})).$$

$$\min_{F_{s},F_{t}} D_{l}(D_{s},D_{t}) = \frac{1}{4d^{2}} \|C_{s}-C_{t}\|_{F}^{2},$$

$$C_{s} = \frac{1}{N_{s}-1}\left(D_{s}^{T}D_{s} - \frac{1}{N_{s}}(\mathbf{1}^{T}D_{s})^{T}(\mathbf{1}^{T}D_{s})\right),$$

$$C_{t} = \frac{1}{N_{t}-1}\left(D_{t}^{T}D_{t} - \frac{1}{N_{t}}(\mathbf{1}^{T}D_{t})^{T}(\mathbf{1}^{T}D_{t})\right).$$

$$\min_{F_{s},F_{t}} MMD^{2}(D_{s},D_{t}) = \left\| \frac{1}{N_{s}}\sum_{i=1}^{N_{s}}\phi(X_{i}^{s}) - \frac{1}{N_{t}}\sum_{i=1}^{N_{t}}\phi(X_{i}^{t}) \right\|_{\mathcal{H}}^{2},$$

$$K(X_{i}^{s},X_{i}^{t}) = \langle \phi(X_{i}^{s}), \phi(X_{i}^{t}) \rangle.$$

$$K(X_{i}^{s},X_{i}^{t}) = \sum_{l=1}^{L} \beta_{l} K_{l}(X_{i}^{s},X_{i}^{t}) \quad \text{s.t.} \quad \beta_{l} \geq 0, \; \sum_{l=1}^{L} \beta_{l} = 1.$$

$$\min_{F_{t}} \frac{1}{N_{t}} \sum_{i=1}^{N_{t}} H(F_{t}(X_{i}^{t})),$$

$$A_{i} = \frac{t}{n} \times 100,$$


Code & Models

Repositories

oezyurty/cluda
pytorch


Full text

On Minimum Discrepancy Estimation for Deep Domain Adaptation

Mohammad Mahfujur Rahman, Clinton Fookes, Mahsa Baktashmotlagh, Sridha Sridharan

Image and Video Laboratory, Queensland University of Technology (QUT), Brisbane, QLD, Australia

{m27.rahman, c.fookes, m.baktashmotlagh, s.sridharan}@qut.edu.au

Abstract

In the presence of large sets of labeled data, Deep Learning (DL) has achieved remarkable success in computer vision, particularly in object classification and recognition tasks. However, DL does not always perform well when the training and testing images come from different distributions, that is, in the presence of domain shift between training and testing images. DL models also suffer in the absence of labeled input data. Domain adaptation (DA) methods have been proposed to compensate for the poor performance caused by domain shift. In this paper, we present a new unsupervised deep domain adaptation method based on the alignment of second order statistics (covariances) as well as maximum mean discrepancy of the source and target data with a two-stream Convolutional Neural Network (CNN). We demonstrate the ability of the proposed approach to achieve state-of-the-art performance for image classification on three benchmark domain adaptation datasets: Office-31 [27], Office-Home [37] and Office-Caltech [8].

Keywords: Unsupervised Domain Adaptation, Domain Discrepancy, Classification, Visual Adaptation, Transfer Learning, Feature Learning.

1 Introduction

Deep Neural Networks (DNN) [16] have brought tremendous advances across many machine learning tasks and applications such as object detection [7], object recognition [15], speech recognition [2], person re-identification [13] and machine translation [33]. For example, in [9] a DNN achieves 97.84% accuracy in multi-digit number classification from street view images, owing to the DNN's ability to learn features and classifier jointly. The dramatic success of large scale image classification based on DNNs commenced in 2012, when the authors of [15] attained the best performance in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) by developing AlexNet. These victories were achieved in part thanks to the availability of large labeled datasets such as the widely used ImageNet [15]. While the introduction of such datasets has unlocked many breakthroughs, obtaining the labels remains a time-consuming, manual task.

In object recognition or classification, the training images may differ from the target images due to backgrounds, camera viewpoints, object transformations and human selection preference. When the source and target data distributions are dissimilar, a classifier's performance can be significantly degraded. In computer vision, this is generally known as dataset bias or dataset shift [34, 18]. Learning a discriminative model when the training and test data follow different distributions is known as domain adaptation [24, 25, 40]. The principal objective of unsupervised domain adaptation algorithms is to bridge the source and target distributions by learning domain-invariant information, where the target data are used without any labels.

Recent investigations have demonstrated that deep neural networks learn more transferable features for unsupervised domain adaptation [30]. Recently, unsupervised domain adaptation methods [32, 30, 10, 11, 28, 20, 21, 38, 26] have been proposed where features are adapted by aligning the second order statistics of the source and target data. Although [30] introduces a new loss named Correlation Alignment (CORAL) loss, it depends on a linear transformation and is not end-to-end trainable: after feature extraction, the linear transformation is applied, and a Support Vector Machine (SVM) classifier is trained in a separate phase. Moreover, the features are fixed in this type of shallow domain adaptation method. The approach in [30] is extended in [32] to incorporate the CORAL loss directly into deep neural networks. Maximum Mean Discrepancy (MMD) is another popular metric for feature adaptation. MMD-based DA techniques have been very successful at minimizing the discrepancy between source and target data. MMD can also be incorporated into deep neural networks to achieve stronger performance than conventional methods.

In our approach, we draw motivation from both of the above top-performing metrics and propose a new domain adaptation method which leverages the advantages of both feature adaptation metrics: CORAL and MMD. Previous approaches minimize the source and target data discrepancy using either maximum mean discrepancy or second order statistics for feature adaptation, whereas our approach minimizes the discrepancy using both metrics (MMD and CORAL). MMD-based methods for domain adaptation apply a symmetric transformation to the distributions of the source and target data, whereas CORAL-based approaches apply an asymmetric transformation. Symmetric transformations neglect the dissimilarities between the source and target data, while asymmetric transformations attempt to bridge the source and target domains [31]. CORAL aligns the second order statistics, which can be reconstructed using all eigenvectors and eigenvalues, instead of aligning only the top $k$ eigenvectors and eigenvalues as subspace-based methods do [4].

We assess our proposed deep domain adaptation method, which aligns covariances (second order statistics) and maximum mean discrepancy within a two-stream CNN, on three benchmark datasets: Office-31 [27], the recently released Office-Home [37] and Office-Caltech [8].

In summary, the contributions of this paper are given as follows:

• We propose a novel deep neural network approach for unsupervised domain adaptation in the context of image classification in computer vision.

• The proposed deep domain adaptation architecture jointly adapts features using two popular feature adaptation metrics: MMD and CORAL.

• We report competitive accuracy compared to state-of-the-art methods on three benchmark domain adaptation datasets for image classification, and achieve the best average image classification accuracy on all three datasets.

The rest of the paper is organized as follows: Section 2 describes related research; the proposed methodology is described in Section 3; Section 4 illustrates a comprehensive evaluation; and finally, Section 5 concludes the paper.

2 Related Works

Many domain adaptation methods [1, 26, 38, 21, 20, 28, 35, 32] have been proposed in recent years to solve the problem of domain bias. These methods fall into two main categories: conventional domain adaptation and deep domain adaptation methods. Conventional domain adaptation methods build their model in two stages, feature extraction and classification. In the first stage, these methods extract features, and in the second stage, a classifier is trained to classify the objects. However, the performance of these DA methods is not satisfactory.

Features obtained from a deep neural network, even without any adaptation technique, outperform conventional DA methods by a large margin. For instance, the results achieved with Deep Convolutional Activation Features (DeCAF) [3] without applying any adaptation to the target data are remarkably better than those obtained with any conventional domain adaptation method, because DNNs extract more robust features using nonlinear transformations. As a result, deep neural network based domain adaptation methods are becoming increasingly popular.

MMD is a popular metric for measuring the discrepancy between the distributions of source and target samples. Tzeng et al. [36] proposed the Deep Domain Confusion (DDC) domain adaptation framework, which uses a confusion layer to reduce the discrepancy between source and target data. In [35], this work is extended by introducing a soft label distribution matching loss. Long et al. [17] proposed the Deep Adaptation Network (DAN), which integrates MMDs defined over several layers, including the soft prediction layer. This idea was further improved by introducing Residual Transfer Networks [18] and Joint Adaptation Networks [19]. Venkateswara et al. [37] proposed a new Deep Hashing Network for unsupervised domain adaptation, where hash codes are used to address the domain adaptation issue.

Another popular approach to feature adaptation between domains is aligning covariances, or second order statistics, which is known as Correlation Alignment. In [30, 32], unsupervised domain adaptation techniques have been proposed where domain shift is minimized by aligning the covariances of the source and target data. The idea is similar to Deep Domain Confusion (DDC) [36] and Deep Adaptation Network (DAN) [17], except that the CORAL loss is used instead of MMD to minimize the discrepancy between source and target data. Both [30, 32] use the CORAL loss, which is the distance between the second-order statistics of the source and target representations. In [14], a deep domain adaptation approach based on a mixture of alignments of second-order or higher-order scatter statistics between source and target distributions has been proposed. All these methods use a two-stream CNN where the source network and target network are combined at the classifier level. Another deep domain adaptation method is Domain-Adversarial Neural Networks (DANN) [5], which integrates a gradient reversal layer into the standard architecture. This gradient reversal layer leaves its input unchanged during forward propagation, but reverses the gradient during backpropagation.

In our work, we adapt the features using both the CORAL and MMD metrics to minimize the dissimilarity between the source and target domains. CORAL is used to align the second order statistics and MMD is used to align higher order statistics.

3 Proposed Approach

Our proposed methodology is illustrated in Figure 1. In our proposed method, the features of the source and target domains are jointly adapted using the CORAL and MMD metrics. The source and target data use two separate CNNs. At the fc7 and fc8 layers, CORAL and MMD loss layers are added to minimize the discrepancy between the source and target data. Finally, the discrepancy between source and target data is further reduced by entropy minimization on the unlabeled target data.

We consider the unsupervised domain adaptation scenario where labeled source data and unlabeled target data are available. Let the source domain data samples be $D_{s}=\{X_{i}^{s}\}$ with available labels $L_{s}=\{Y_{i}\}$, and the target data samples be $D_{t}=\{X_{i}^{t}\}$ without labels. The numbers of source and target samples are $N_{s}$ and $N_{t}$ respectively. Let the classifiers for the source domain and target domain be $F_{s}(X_{i}^{s})$ and $F_{t}(X_{i}^{t})$ respectively. The distributions of the source and target domain data are non-identical, i.e., $P_{s}(X_{i}^{s},Y_{s}) \neq P_{t}(X_{i}^{t},Y_{t})$. We build a deep learning architecture that learns transferable classifiers, such that $Y=F_{s}(X_{i}^{s})=F_{t}(X_{i}^{t})$, to minimize the source-target discrepancy or mismatch.

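We propose a new deep DA method which has two streams of Convolutional Neural Networks (CNNs), one for source data and another for target data. It adapts features by aligning second order statistics and maximum mean discrepancy of the source and target data. The discrepancy between the source and target data is minimized by the following objective,

$$\min_{F_{s},F_{t}} D_{l}(D_{s},D_{t})_{fc7} + \min_{F_{s},F_{t}} MMD^{2}(D_{s},D_{t})_{fc7} + \min_{F_{s},F_{t}} D_{l}(D_{s},D_{t})_{fc8} + \min_{F_{s},F_{t}} MMD^{2}(D_{s},D_{t})_{fc8} + \sum_{i=1}^{N_{t}} H(F_{t}(X_{i}^{t})).$$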

Moreover, the proposed method also adapts the classifiers using entropy minimization.

The features are adapted by aligning second order statistics as well as maximum mean discrepancy. We define the CORAL loss of the source and target activation features (such a loss function is used in prior work [32]) as,

$$\min_{F_{s},F_{t}} D_{l}(D_{s},D_{t}) = \frac{1}{4d^{2}} \|C_{s}-C_{t}\|_{F}^{2},$$

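where $C_{s}$ and $C_{t}$ denote the feature covariance matrices of the source and target data, $d$ is the feature dimension, and $\|\cdot\|_{F}^{2}$ denotes the squared matrix Frobenius norm. $C_{s}$ and $C_{t}$ are given by the following equations [32], in which $\mathbf{1}$ is a column vector of ones,

$$C_{s} = \frac{1}{N_{s}-1}\left(D_{s}^{T}D_{s} - \frac{1}{N_{s}}(\mathbf{1}^{T}D_{s})^{T}(\mathbf{1}^{T}D_{s})\right),$$

$$C_{t} = \frac{1}{N_{t}-1}\left(D_{t}^{T}D_{t} - \frac{1}{N_{t}}(\mathbf{1}^{T}D_{t})^{T}(\mathbf{1}^{T}D_{t})\right).$$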
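As a rough illustration of the CORAL term, the sketch below computes the unbiased covariance of each feature batch and then the squared Frobenius distance between the two covariances, scaled by $1/(4d^{2})$. It is a minimal pure-Python sketch, not the authors' Caffe implementation; the function names `covariance` and `coral_loss` and the toy feature batches are assumptions made for illustration.

```python
def covariance(features):
    """Unbiased covariance of an N x d feature matrix given as a list of rows."""
    n, d = len(features), len(features[0])
    # Column sums play the role of the row vector 1^T D in the equation.
    col_sums = [sum(row[j] for row in features) for j in range(d)]
    cov = [[0.0] * d for _ in range(d)]
    for j in range(d):
        for k in range(d):
            gram = sum(row[j] * row[k] for row in features)  # (D^T D)[j][k]
            cov[j][k] = (gram - col_sums[j] * col_sums[k] / n) / (n - 1)
    return cov


def coral_loss(source, target):
    """Squared Frobenius distance between covariances, scaled by 1 / (4 d^2)."""
    d = len(source[0])
    c_s, c_t = covariance(source), covariance(target)
    sq_frob = sum((c_s[j][k] - c_t[j][k]) ** 2 for j in range(d) for k in range(d))
    return sq_frob / (4 * d * d)


# Toy batches (hypothetical): 3 source and 4 target samples with d = 2 features.
source = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
target = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
print(coral_loss(source, source))  # identical batches give zero loss
print(coral_loss(source, target))
```

Note that the two batches may have different sizes; only the feature dimension $d$ has to match, because the loss compares $d \times d$ covariance matrices.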

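The features are further adapted using another popular metric for feature adaptation, MMD. The MMD loss function is defined as,

$$\min_{F_{s},F_{t}} MMD^{2}(D_{s},D_{t}) = \left\| \frac{1}{N_{s}}\sum_{i=1}^{N_{s}}\phi(X_{i}^{s}) - \frac{1}{N_{t}}\sum_{i=1}^{N_{t}}\phi(X_{i}^{t}) \right\|_{\mathcal{H}}^{2},$$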

where $\phi(\cdot)$ denotes the feature map associated with the kernel,

$$K(X_{i}^{s},X_{i}^{t}) = \langle \phi(X_{i}^{s}), \phi(X_{i}^{t}) \rangle.$$

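$K(X_{i}^{s},X_{i}^{t})$ is usually defined as a convex combination of $L$ basis kernels $K_{l}(X_{i}^{s},X_{i}^{t})$ [39],

$$K(X_{i}^{s},X_{i}^{t}) = \sum_{l=1}^{L} \beta_{l} K_{l}(X_{i}^{s},X_{i}^{t}) \quad \text{s.t.} \quad \beta_{l} \geq 0, \; \sum_{l=1}^{L} \beta_{l} = 1.$$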
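The squared MMD can be evaluated without computing $\phi$ explicitly: expanding the squared norm gives the average kernel value within the source batch, plus the average within the target batch, minus twice the average across the two batches. The sketch below follows that expansion with a convex combination of Gaussian basis kernels. The bandwidths and the uniform weights $\beta_{l} = 1/L$ are assumptions for illustration, since the text does not specify them, and the function names are hypothetical.

```python
import math


def gaussian_kernel(x, y, bandwidth):
    """Gaussian (RBF) basis kernel between two feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def multi_kernel(x, y, bandwidths, betas):
    """Convex combination of basis kernels: sum_l beta_l * K_l(x, y)."""
    return sum(b * gaussian_kernel(x, y, bw) for b, bw in zip(betas, bandwidths))


def mmd_squared(source, target, bandwidths=(0.5, 1.0, 2.0)):
    """Biased estimate of MMD^2 via the kernel trick."""
    # Assumed uniform weights: beta_l >= 0 and they sum to 1.
    betas = [1.0 / len(bandwidths)] * len(bandwidths)

    def mean_kernel(xs, ys):
        total = sum(multi_kernel(x, y, bandwidths, betas) for x in xs for y in ys)
        return total / (len(xs) * len(ys))

    return (mean_kernel(source, source)
            + mean_kernel(target, target)
            - 2.0 * mean_kernel(source, target))


# Toy batches (hypothetical): the target batch is a shifted copy of the source.
source = [[0.0, 0.0], [1.0, 0.0]]
target = [[3.0, 3.0], [4.0, 3.0]]
print(mmd_squared(source, source))
print(mmd_squared(source, target))
```

This estimator costs a number of kernel evaluations that is quadratic in the batch size, which is usually acceptable for the mini-batch sizes used in training.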

Since feature adaptation alone cannot eliminate the discrepancy [18], we adapt classifiers along with the features. In this work, the classifier is adapted by decreasing the entropy of the class-conditional distribution on the target data $D_{t}$ (a similar loss function has been proposed in prior work [18]),

$$\min_{F_{t}} \frac{1}{N_{t}} \sum_{i=1}^{N_{t}} H(F_{t}(X_{i}^{t})),$$

where $H(\cdot)$ represents the entropy function of the class-conditional distribution.
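To make the entropy term concrete, the following sketch averages the Shannon entropy of the softmax predictions over an unlabeled target batch. It assumes the classifier output $F_{t}(X_{i}^{t})$ is a vector of fc8 logits passed through a softmax, which is a common reading rather than something the text states; the function names are hypothetical.

```python
import math


def softmax(logits):
    """Convert logits to class probabilities."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy H(p) = -sum_c p_c log p_c, skipping zero-probability classes."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def target_entropy_loss(target_logits):
    """Average prediction entropy over the unlabeled target batch."""
    return sum(entropy(softmax(z)) for z in target_logits) / len(target_logits)


# Hypothetical logits for two target images and three classes.
uncertain = [[0.0, 0.0, 0.0]]   # uniform prediction, highest entropy
confident = [[10.0, 0.0, 0.0]]  # peaked prediction, entropy close to zero
print(target_entropy_loss(uncertain), target_entropy_loss(confident))
```

Minimizing this quantity pushes target predictions toward confident, low-entropy outputs without using any target labels.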

3.1 Discussion

The main difference between our work and prior works is that they consider only one metric for feature adaptation, whereas we consider two metrics for minimizing the discrepancy between the source and target data. In [32], a CORAL layer is placed between the fc8 layers of the source and target CNNs, whereas we place CORAL layers between both the fc7 and the fc8 layers of the two streams. The MMD metric is used between the fc8 layers in [18], and between the fc6, fc7 and fc8 layers in [17]. The difference between our work and [18] is that RTN uses a residual transfer network with the MMD metric, whereas we use a simple AlexNet architecture, consisting of 5 convolutional layers followed by 3 fully connected layers, with both the CORAL and MMD metrics to adapt the features. In our research, we have found that using multiple feature adaptation metrics at fc7 and fc8 yields better accuracy with a simple CNN architecture, and that the best configuration of the domain adaptation architecture is to apply the feature adaptation metrics at fc7 and fc8.

4 Experiments

In this section we conduct extensive experiments to assess the proposed method and compare the method against recently published state-of-the-art unsupervised deep domain adaptation approaches.

4.1 Datasets

We evaluate all the methods on three standard domain adaptation benchmark datasets: Office-31 [27], Office-Home [37] and Office-Caltech [8] in the context of image classification.

4.1.1 Office-31

In the context of image classification, Office-31 is the most prominent benchmark dataset for domain adaptation. The dataset contains images of everyday objects from an office environment. It consists of 4110 images in 31 object categories and 3 image domains: Amazon (A) contains images downloaded from amazon.com, DSLR (D) contains images taken with a digital SLR camera, and Webcam (W) contains images taken with a web camera under different photographic settings. For all experiments, we use the source data with labels and the target data without any labels for unsupervised domain adaptation. We conduct experiments on all six transfer tasks, covering all possible source and target pairs of the three domains. The average performance over all transfer tasks is also calculated.

4.1.2 Office-Home

The Office-Home dataset contains four domains, and each domain contains images from 65 different classes (categories). The four domains are Art (Ar), Clipart (Cl), Product (Pr) and Real-World (Rw). The Art domain contains artistic depictions of objects in the form of sketches, paintings and ornamentation. The Clipart domain is a collection of clipart images. The images of the Product domain have no background, and the Real-World domain consists of images captured with a regular camera. The dataset has around 15,500 images. Every category has an average of around 70 images and a maximum of 99 images. We conduct experiments on all 12 transfer tasks, covering all source and target pairs of the 4 domains. Figure 2 presents some sample images from 7 classes of the Office-Home dataset.

4.1.3 Office-Caltech

Office-Caltech is another popular benchmark dataset in the domain adaptation community, formed by taking the 10 common classes shared by Office-31 and Caltech-256. It has four domains: Amazon (A), Webcam (W), DSLR (D) and Caltech (C). We conduct experiments on all 12 transfer tasks, since it has four different domains.
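The task counts quoted for the three datasets follow from taking every ordered (source, target) pair of distinct domains: 3 domains give 6 tasks and 4 domains give 12. A short sketch that enumerates them, using the single-letter domain abbreviations from the tables (the helper name `transfer_tasks` is hypothetical):

```python
from itertools import permutations


def transfer_tasks(domains):
    """All ordered source-target pairs of distinct domains, e.g. 'A-W'."""
    return [f"{source}-{target}" for source, target in permutations(domains, 2)]


office_31 = transfer_tasks(["A", "D", "W"])
office_home = transfer_tasks(["A", "C", "P", "R"])
office_caltech = transfer_tasks(["A", "W", "D", "C"])

print(len(office_31), office_31)
print(len(office_home), len(office_caltech))
```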

4.2 Experimental Setup

In our method we use two streams of Convolutional Neural Networks (CNNs). We extend the AlexNet deep learning architecture, pretrained on the ImageNet dataset, for both CNN streams. The dimension of the last fully connected layer (fc8) is set to the number of object classes (31 for Office-31, 65 for Office-Home and 10 for Office-Caltech). We set the learning rate to 0.0001 to optimize the network. We set the batch size to 128, momentum to 0.9 and weight decay to $5\times 10^{-4}$ during the training phase.

4.3 Results and Discussion

In this section we provide the details of the performance of our method in the context of unsupervised domain adaptation where we use the labeled source data and unlabeled target data. Our proposed approach is compared with both conventional DA and recently published deep architecture based approaches: Geodesic Flow Kernel (GFK) [8], Transfer Component Analysis (TCA) [22], AlexNet (No adaptation) [15], VGG16 (No Adaptation) [29], Domain Adversarial Neural Network (DANN) [5], Deep Correlation alignment (D-CORAL) [32], DAN [17], Deep Reconstruction-Classification Networks (DRCN) [6], Residual Transfer Networks (RTN) [18], and Deep Hashing Network (DAH) [37].

TCA is a traditional domain adaptation approach based on MMD-regularized kernel principal component analysis (PCA). GFK is a subspace-based domain adaptation approach. Neither TCA nor GFK uses a deep neural architecture, and neither is an end-to-end approach: features are extracted first and then passed to the domain adaptation method. AlexNet and VGG16 deep convolutional neural networks are also used as deep feature extractors without adaptation techniques, to show that a standalone deep architecture works better than conventional domain adaptation techniques. DANN is a deep domain adaptation technique that integrates a gradient reversal layer into the standard architecture. D-CORAL is another deep domain adaptation architecture, in which second order statistics alignment is used to adapt features. DAN uses MMD to minimize the dissimilarity between source and target domains. DRCN is an unsupervised domain adaptation model which reconstructs source images so that they have a similar appearance to the target images. RTN introduces a residual transfer network where classifiers and features are adapted simultaneously. DAH uses a deep hashing network for unsupervised domain adaptation; in DAH, MMD is used to decrease the dissimilarities between the source and target domains.

We use the Caffe [12] framework to implement our proposed method, with the AlexNet architecture [15]. We conduct experiments on one NVIDIA GeForce GTX 1070 Graphics Processing Unit (GPU). For unsupervised domain adaptation techniques, we follow the standard protocol where the source data are labeled, but the target data are unlabeled. We compare methods based on the classification accuracy of each transfer task and the average over tasks.

As shown in Tables 1, 2 and 3, we compare the results of our proposed method with state-of-the-art approaches on three datasets (Office-31, Office-Home and Office-Caltech) in terms of classification accuracy. The classification accuracy of a model, $A_{i}$, is the percentage of images correctly identified. We evaluated all the methods using the following formula:

$$A_{i} = \frac{t}{n} \times 100,$$

where $t$ is the number of correctly classified images and $n$ is the total number of images.

For the Office-31 dataset, we report the image classification results on the target data for the different transfer tasks in Table 1. In Table 2, the target data classification accuracies are reported for the Office-Home dataset on twelve transfer tasks. For the Office-Caltech dataset, the target classification accuracies on the different transfer tasks are reported in Table 3. The accuracies are the percentages of correctly classified target images.

For the Office-31 dataset, the previous best average results were achieved by [18] and [6], at 73.7% and 73.6% respectively. Our combined CORAL and MMD loss outperforms these results by 0.9 and 1.0 percentage points respectively. For the Office-Home dataset, our proposed method achieves an average classification accuracy of 48.0%, which outperforms the compared state-of-the-art approaches, for example DAH [37] by 2.46 percentage points. For the Office-Caltech dataset, the existing best result was achieved by [18]; our proposed method beats its average classification accuracy by 0.2 percentage points. Thus, the proposed model based on MMD and CORAL outperforms the comparison methods on most transfer tasks across the three datasets. From Tables 1, 2 and 3, we can see that the proposed method achieves better average performance than the other baseline conventional and deep domain adaptation methods.
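The Office-31 margins quoted above can be reproduced from the per-task accuracies in Table 1. The short check below copies the DRCN, RTN and proposed-method rows from that table, averages the six tasks, and takes the differences; rounding to one decimal place is an assumption chosen to match the precision of the table.

```python
# Per-task accuracies copied from Table 1 (order: A-W, D-W, D-A, W-A, W-D, A-D).
office_31 = {
    "DRCN": [68.7, 96.4, 56.0, 54.9, 99.0, 66.8],
    "RTN": [73.3, 96.8, 50.5, 51.0, 99.6, 71.0],
    "Ours": [72.1, 97.3, 54.6, 53.9, 98.7, 71.2],
}

# Average over the six transfer tasks, rounded as in the table.
averages = {name: round(sum(acc) / len(acc), 1) for name, acc in office_31.items()}

# Margins of the proposed method in percentage points.
margin_rtn = round(averages["Ours"] - averages["RTN"], 1)
margin_drcn = round(averages["Ours"] - averages["DRCN"], 1)

print(averages, margin_rtn, margin_drcn)
```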

These results suggest that our proposed method is capable of learning classifiers that adapt across domains, together with transferable features, to address the domain adaptation problem.

From all the image classification results, we make the following observations:

• Deep learning approaches without domain adaptation (AlexNet, VGG16) perform better on average than the conventional domain adaptation methods (TCA, GFK) on Office-31 and Office-Home.

• The proposed unsupervised deep domain adaptation, based on jointly aligning the second order statistics and maximum mean discrepancy, outperforms the state-of-the-art methods on average.

• Our model helps most when the number of object classes is large. For example, the Office-Home dataset contains 65 categories, and our model achieves $48\%$ average accuracy there, 2.46 percentage points above the best compared method.

4.4 Visualization

We use t-SNE for embedding visualization. To produce an embedding, we take images from the Amazon and Webcam domains of the Office-31 dataset. We use the CNN model to obtain the corresponding 4096-dimensional fc7 vector for each image. We then feed these fc7 vectors into t-SNE to generate a 2-dimensional vector for each image. In Figure 3 we plot the t-SNE embedding of images from the Amazon and Webcam domains using our learned representation (right) and compare it to an embedding formed with AlexNet (left). Examining the embeddings, we find that the clusters created by our model separate the classes while mixing the domains much more effectively than AlexNet, where no domain adaptation technique is applied.

5 Conclusion

In this paper, we introduce an unsupervised deep domain adaptation architecture where the features and classifiers are adapted jointly. The source and target features are adapted by aligning covariances as well as maximum mean discrepancy, and the classifiers are adapted by minimizing the entropy loss on the target data. Extensive experimental results on standard benchmark datasets demonstrate state-of-the-art average performance. Prior deep domain adaptation techniques use either MMD or CORAL to decrease the mismatch between the source and target data. Unlike previous work, we use both MMD and CORAL to adapt the features across domains. This makes our method a useful complement to existing approaches.

Bibliography


[1] Y. Choi, M. Choi, M. Kim, J.-W. Ha, S. Kim, and J. Choo. StarGAN: Unified generative adversarial networks for multi-domain image-to-image translation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[2] G. E. Dahl, T. N. Sainath, and G. E. Hinton. Improving deep neural networks for LVCSR using rectified linear units and dropout. In International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2013.
[3] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. DeCAF: A deep convolutional activation feature for generic visual recognition. In International Conference on Machine Learning (ICML), 2014.
[4] B. Fernando, A. Habrard, M. Sebban, and T. Tuytelaars. Unsupervised visual domain adaptation using subspace alignment. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013.
[5] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky. Domain-adversarial training of neural networks. Journal of Machine Learning Research, 17(1), 2016.
[6] M. Ghifary, W. B. Kleijn, M. Zhang, D. Balduzzi, and W. Li. Deep Reconstruction-Classification Networks for Unsupervised Domain Adaptation. Springer International Publishing, Cham, 2016.
[7] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
[8] B. Gong, Y. Shi, F. Sha, and K. Grauman. Geodesic flow kernel for unsupervised domain adaptation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012.