A Survey of Unsupervised Deep Domain Adaptation

Garrett Wilson; Diane J. Cook

arXiv:1812.02849·cs.LG·February 10, 2020


TL;DR

This survey reviews unsupervised deep domain adaptation methods, comparing various approaches, their theoretical foundations, and applications, highlighting current challenges and future research directions.

Contribution

It provides a comprehensive comparison of existing unsupervised deep domain adaptation techniques, analyzing their methods, results, and theoretical insights.

Findings

01

Multiple approaches exist for unsupervised domain adaptation.

02

Deep learning enhances domain adaptation effectiveness.

03

Open research directions include handling diverse domain shifts.

Abstract

Deep learning has produced state-of-the-art results for a variety of tasks. While such approaches for supervised learning have performed well, they assume that training and testing data are drawn from the same distribution, which may not always be the case. As a complement to this challenge, single-source unsupervised domain adaptation can handle situations where a network is trained on labeled data from a source domain and unlabeled data from a related but different target domain with the goal of performing well at test-time on the target domain. Many single-source and typically homogeneous unsupervised deep domain adaptation approaches have thus been developed, combining the powerful, hierarchical representations from deep learning with domain adaptation to reduce reliance on potentially-costly target data labels. This survey will compare these approaches by examining alternative…

Tables5

Table 1. Comparison of different neural network based domain adaptation methods based on method of adaptation (domain-invariant feature learning [DI], domain mapping [DM], normalization [N], ensemble [En], target discriminative [TD]), various loss functions (distance, promoting different features, cycle consistency, semantic consistency, task, feature- or pixel-level adversarial), usage of a generator, and which weights are shared (in the feature extractor).

Name

Method

Loss Functions

Adversarial Loss

Generator

Shared Weights

Distance

Diff.

Cycle

Sem.

Task

Feature

Pixel

CAN(Kang et al., 2019)

DI,N

CCD

✓

not BN

French et al.(French et al., 2018)

En,N

sq. diff.

✓

EMA

Co-DA(Kumar et al., 2018) [1: also incorporate virtual adversarial training (Miyato et al., 2018)]

DI,En,N,TD

L1

✓

optional

VADA(Shu et al., 2018)1

DI,TD

✓

DeepJDOT(Damodaran et al., 2018)

DI

JDOT

✓

CyCADA(Hoffman et al., 2018b)

DI,DM

✓

Gen. to Adapt(Sankaranarayanan et al., 2018a)

DI

✓

SimNet(Pinheiro, 2018)

DI

prototypes

✓

MADA(Pei et al., 2018)

DI,En

✓

MCD(Saito et al., 2018b)

DI,En,TD

✓

GAGL(Wei and Hsu, 2018)

DI,TD

✓

SBADA-GAN(Russo et al., 2018) [2: also a self-labeled classification loss (learn label on source images, pseudo-label mapped target to source)]

DM

✓

MCA(Zhang et al., 2018)

DI

MCA

✓

${CCN}^{++}$ (Hsu et al., 2018)

DI

clusters

✓

M-ADDA(Laradji and Babanezhad, 2018)

DI

clusters

✓

Rozant. et al.(Rozantsev et al., 2019)

DI

MMD

✓

regularize

XGAN(Royer et al., 2017)

DM

✓

some

StarGAN(Choi et al., 2018)

DM

✓

PixelDA(Bousmalis et al., 2017)

DM

✓

AutoDIAL(Carlucci et al., 2017)

N,TD

✓

not BN

AdaBN(Liu et al., 2018)

N

not BN

JAN-A(Long et al., 2017)

DI

JMMD

✓

LogCORAL(Wang et al., 2017a)

DI

logCOR, mean

✓

Log D-CORAL(Morerio and Murino, 2017)

DI

logDCOR

✓

VRADA(Purushotham et al., 2017)

DI

✓

ATT(Saito et al., 2017)

En

✓

SimGAN(Shrivastava et al., 2017)

DM

✓

N/A [3: maps to target domain so only have feature extractor for target (part of the classifier)]

ADDA(Tzeng et al., 2017)

DI

✓

CycleGAN(Zhu et al., 2017)

DM

✓

[4: unspecified; originally not applied to domain adaptation, but later used for this (Hoffman et al., 2018b; Benaim and Wolf, 2017; Fu et al., 2018)]

RegCGAN(Mao and Li, 2018)

DM

✓

Sener et al.(Sener et al., 2016)

DI

k-NN

✓

DSN(Bousmalis et al., 2016)

DI

✓

some

DRCN(Ghifary et al., 2016)

DI

✓

CoGAN(Liu and Tuzel, 2016)

DM

✓

some

Deep CORAL(Sun and Saenko, 2016)

DI

CORAL

✓

DANN(Ajakan et al., 2014; Ganin and Lempitsky, 2015; Ganin et al., 2016)

DI

✓

DAN(Long et al., 2015)

DI

MK-MMD

✓

low

Tzeng et al.(Tzeng et al., 2015) [5: semi-supervised for some classes, i.e., requires some labeled target data for some of the classes]

DI

✓

Table 2. Classification accuracy (source $\to$ target, mean $\pm$ std %) of different neural network based domain adaptation methods on various computer vision datasets (only including those used in >2 papers). Adversarial approaches denoted by ∗.

Name	MNIST and USPS		MNIST and SVHN		MNIST[-M]	Synthetic to Real
Name	MN $\to$ US	US $\to$ MN	SV $\to$ MN	MN $\to$ SV	MN $\to$ MN-M	${SYN}_{N}$ $\to$ SV	${SYN}_{S}$ $\to$ GTSRB
Target only (i.e., if we had the target labels)	96.3 $\pm$ 0.1 (Hoffman et al., 2018b) 96.5 (Bousmalis et al., 2017)	99.2 $\pm$ 0.1 (Hoffman et al., 2018b)	99.2 $\pm$ 0.1 (Hoffman et al., 2018b) 99.5 (Bousmalis et al., 2016) 99.51 (Ganin and Lempitsky, 2015)		96.4 (Bousmalis et al., 2017) 98.7 (Bousmalis et al., 2016) 98.91 (Ganin and Lempitsky, 2015)	92.44 (Ganin and Lempitsky, 2015) 92.4 (Bousmalis et al., 2016)	99.87 (Ganin and Lempitsky, 2015) 99.8 (Bousmalis et al., 2016)
French et al.(French et al., 2018)	98.2	99.5	99.3	37.5 97.0 [6: problem-specific hyperparameter tuning of data augmentation to match pixel intensities of target domain images]		97.1	99.4
Co-DA(Kumar et al., 2018) [7: hyperparameter tuned on some labeled target data]^∗			98.6	81.7	97.5	96.0
DIRT-T(Shu et al., 2018)7^∗			99.4	76.5	98.7	96.2	99.6
VADA(Shu et al., 2018)7^∗			94.5	73.3	95.7	94.9	99.2
DeepJDOT(Damodaran et al., 2018)	95.7	96.4	96.7		92.4
CyCADA(Hoffman et al., 2018b)^∗	95.6 $\pm$ 0.2	96.5 $\pm$ 0.1	90.4 $\pm$ 0.4
Gen. to Adapt(Sankaranarayanan et al., 2018a)^∗	92.8 $\pm$ 0.9	90.8 $\pm$ 1.3	92.4 $\pm$ 0.9
SimNet(Pinheiro, 2018)^∗	96.4	95.6			90.5
MCD(Saito et al., 2018b)^∗	96.5 $\pm$ 0.3	94.1 $\pm$ 0.3	96.2 $\pm$ 0.4				94.4 $\pm$ 0.3
GAGL(Wei and Hsu, 2018)7^∗			96.7	74.6	94.9	93.1	97.6
SBADA-GAN(Russo et al., 2018)7^∗	97.6	95.0	76.1	61.1	99.4		96.7
MCA(Zhang et al., 2018)			96.6		96.8	89.0
${CCN}^{+ +}$ (Hsu et al., 2018)^∗			89.1
M-ADDA(Laradji and Babanezhad, 2018)^∗	98	97
Rozantsev et al.(Rozantsev et al., 2019)	60.7	67.3
PixelDA(Bousmalis et al., 2017)^∗	95.9				98.2
ATT(Saito et al., 2017)			85.0	52.8	94.0	92.9	96.2
ADDA(Tzeng et al., 2017)^∗	89.4 $\pm$ 0.2	90.1 $\pm$ 0.8	76.0 $\pm$ 1.8
RegCGAN(Mao and Li, 2018)^∗	93.1 $\pm$ 0.7	89.5 $\pm$ 0.9
DTN(Taigman et al., 2016)^∗			84.4
Sener et al.(Sener et al., 2016)			78.8	40.3	86.7
DSN(Bousmalis et al., 2016)7^∗	91.3 (Bousmalis et al., 2017)		82.7		83.2	91.2	93.1
DRCN(Ghifary et al., 2016)	91.80 $\pm$ 0.09	73.67 $\pm$ 0.04	81.97 $\pm$ 0.16	40.05 $\pm$ 0.07
CoGAN(Liu and Tuzel, 2016)^∗	91.2 $\pm$ 0.8	89.1 $\pm$ 0.8			62.0 (Bousmalis et al., 2017)
DANN(Ganin and Lempitsky, 2015; Ganin et al., 2016)^∗	85.1 (Bousmalis et al., 2017)		71.07 70.7 (Bousmalis et al., 2016) 71.1 (Saito et al., 2017) 73.6 (Hoffman et al., 2018b)	35.7 (Saito et al., 2017)	81.49 77.4 (Bousmalis et al., 2016) 81.5 (Saito et al., 2017)	90.48 90.3 (Bousmalis et al., 2016; Saito et al., 2017)	88.66 88.7 (Saito et al., 2017) 92.9 (Bousmalis et al., 2016)
DAN(Long et al., 2015)	81.1 (Bousmalis et al., 2017)		71.1 (Bousmalis et al., 2016)		76.9 (Bousmalis et al., 2016)	88.0 (Bousmalis et al., 2016)	91.1 (Bousmalis et al., 2016)
Source only (i.e., no adaptation)	78.9 (Bousmalis et al., 2017) 82.2 $\pm$ 0.8 (Hoffman et al., 2018b)	69.6 $\pm$ 3.8 (Hoffman et al., 2018b)	59.19 (Ganin and Lempitsky, 2015) 59.2 (Bousmalis et al., 2016) 67.1 $\pm$ 0.6 (Hoffman et al., 2018b)		56.6 (Bousmalis et al., 2016) 57.49 (Ganin and Lempitsky, 2015) 63.6 (Bousmalis et al., 2017)	86.65 (Ganin and Lempitsky, 2015) 86.7 (Bousmalis et al., 2016)	74.00 (Ganin and Lempitsky, 2015) 85.1 (Bousmalis et al., 2016)

Table 3. Classification accuracy (source $\to$ target, mean $\pm$ std %) of different neural network based domain adaptation methods on the Office computer vision dataset. Adversarial approaches denoted by ∗.

Name	Office (Amazon, DSLR, Webcam)
Name	A $\to$ W	D $\to$ W	W $\to$ D	A $\to$ D	D $\to$ A	W $\to$ A
CAN(Kang et al., 2019) [8: with ResNet-50 network]	94.5 $\pm$ 0.3	99.1 $\pm$ 0.2	99.8 $\pm$ 0.2	95.0 $\pm$ 0.3	78.0 $\pm$ 0.3	77.0 $\pm$ 0.3
Gen. to Adapt(Sankaranarayanan et al., 2018a)8^∗	89.5 $\pm$ 0.5	97.9 $\pm$ 0.3	99.8 $\pm$ 0.4	87.7 $\pm$ 0.5	72.8 $\pm$ 0.3	71.4 $\pm$ 0.4
SimNet(Pinheiro, 2018)8^∗	88.6 $\pm$ 0.5	98.2 $\pm$ 0.2	99.7 $\pm$ 0.2	85.3 $\pm$ 0.3	73.4 $\pm$ 0.8	71.8 $\pm$ 0.6
MADA(Pei et al., 2018)8^∗	90.0 $\pm$ 0.1	97.4 $\pm$ 0.1	99.6 $\pm$ 0.1	87.8 $\pm$ 0.2	70.3 $\pm$ 0.3	66.4 $\pm$ 0.3
AutoDIAL(Carlucci et al., 2017) [9: with Inception-based network] [10: hyperparameter tuned on one $W$ labeled example per class on $A \to W$ task (see (Long et al., 2016))]	84.2	97.9	99.9	82.3	64.6	64.2
${CCN}^{++}$ (Hsu et al., 2018) [11: with ResNet-18 network]^∗	78.2	97.4	98.6	73.5	62.8	60.6
Rozantsev et al.(Rozantsev et al., 2019)	76.0	96.7	99.6
AdaBN(Liu et al., 2018)9	74.2	95.7	99.8	73.1	59.8	57.4
JAN-A(Long et al., 2017)8^∗	86.0 $\pm$ 0.4	96.7 $\pm$ 0.3	99.7 $\pm$ 0.1	85.1 $\pm$ 0.4	69.2 $\pm$ 0.4	70.7 $\pm$ 0.5
LogCORAL(Wang et al., 2017a)	70.2 $\pm$ 0.6	95.5 $\pm$ 0.1	99.5 $\pm$ 0.3	69.4 $\pm$ 0.5	51.2 $\pm$ 0.3	51.6 $\pm$ 0.5
Log D-CORAL(Morerio and Murino, 2017)	68.5	95.3	98.7	62.0	40.6	40.6
ADDA(Tzeng et al., 2017)8^∗	75.1	97.0	99.6
Sener et al.(Sener et al., 2016)	81.1	96.4	99.2	84.1	58.3	63.8
DRCN(Ghifary et al., 2016)	68.7 $\pm$ 0.3	96.4 $\pm$ 0.3	99.0 $\pm$ 0.2	66.8 $\pm$ 0.5	56.0 $\pm$ 0.5	54.9 $\pm$ 0.5
Deep CORAL(Sun and Saenko, 2016)	66.4 $\pm$ 0.4	95.7 $\pm$ 0.3	99.2 $\pm$ 0.1	66.8 $\pm$ 0.6	52.8 $\pm$ 0.2	51.5 $\pm$ 0.3
DANN(Ganin and Lempitsky, 2015; Ganin et al., 2016)^∗	67.3 $\pm$ 1.7 72.6 $\pm$ 0.3 (Ghifary et al., 2016) 73.0 (Rozantsev et al., 2019; Tzeng et al., 2017)	94.0 $\pm$ 0.8 96.4 $\pm$ 0.1 (Ghifary et al., 2016) 96.4 (Rozantsev et al., 2019; Tzeng et al., 2017)	93.7 $\pm$ 1.0 99.2 $\pm$ 0.3 (Ghifary et al., 2016) 99.2 (Rozantsev et al., 2019; Tzeng et al., 2017)	67.1 $\pm$ 0.3 (Ghifary et al., 2016)	54.5 $\pm$ 0.4 (Ghifary et al., 2016)	52.7 $\pm$ 0.2 (Ghifary et al., 2016)
DAN(Long et al., 2015)	68.5 $\pm$ 0.4 63.8 $\pm$ 0.4 (Sun and Saenko, 2016) 64.5 (Rozantsev et al., 2019) 68.5 (Tzeng et al., 2017)	96.0 $\pm$ 0.3 94.6 $\pm$ 0.5 (Sun and Saenko, 2016) 95.2 (Rozantsev et al., 2019) 96.0 (Tzeng et al., 2017)	99.0 $\pm$ 0.2 98.6 (Rozantsev et al., 2019) 98.8 $\pm$ 0.6 (Sun and Saenko, 2016) 99.0 (Tzeng et al., 2017)	67.0 $\pm$ 0.4 65.8 $\pm$ 0.4 (Sun and Saenko, 2016)	54.0 $\pm$ 0.4 52.8 $\pm$ 0.4 (Sun and Saenko, 2016)	53.1 $\pm$ 0.3 51.9 $\pm$ 0.5 (Sun and Saenko, 2016)
Tzeng et al.(Tzeng et al., 2015) [12: semi-supervised for some classes, but evaluated on 16 hold-out categories for which the labels were not seen during training]^∗	59.3 $\pm$ 0.6	90.0 $\pm$ 0.2	97.5 $\pm$ 0.1	68.0 $\pm$ 0.5	43.1 $\pm$ 0.2	40.5 $\pm$ 0.2
Source only (i.e., no adaptation)	62.6 (Tzeng et al., 2017)8	96.1 (Tzeng et al., 2017)8	98.6 (Tzeng et al., 2017)8

Table 4. Classification accuracy comparison for domain adaptation methods for sentiment analysis (positive or negative review) on the Amazon review dataset (Blitzer et al., 2007) [13: http://www.cs.jhu.edu/~mdredze/datasets/sentiment/] with domains books (B), DVD (D), electronics (E), and kitchen (K). Adversarial approaches denoted by ∗.

Source $\to$ Target	DANN(Ganin et al., 2016) [14: using 30,000-dimensional feature vectors from marginalized stacked denoising autoencoders (mSDA) by Chen et al. (Chen et al., 2012), which is an unsupervised method of learning a feature representation from the training data]^∗	DANN(Ganin et al., 2016) [15: using 5000-dimensional unigram and bigram feature vectors]^∗	CORAL(Sun et al., 2016) [16: using bag-of-words feature vectors including only the top 400 words, but suggest using deep text features in future work]	ATT(Saito et al., 2017)15	WDGRL(Shen et al., 2018)15 [17: the best results on target data for various hyperparameters]^∗	No Adapt.(Sun et al., 2016) [18: using bag-of-words feature vectors]
B $\to$ D	82.9	78.4		80.7	83.1
B $\to$ E	80.4	73.3	76.3	79.8	83.3	74.7
B $\to$ K	84.3	77.9		82.5	85.5
D $\to$ B	82.5	72.3	78.3	73.2	80.7	76.9
D $\to$ E	80.9	75.4		77.0	83.6
D $\to$ K	84.9	78.3		82.5	86.2
E $\to$ B	77.4	71.3		73.2	77.2
E $\to$ D	78.1	73.8		72.9	78.3
E $\to$ K	88.1	85.4	83.6	86.9	88.2	82.8
K $\to$ B	71.8	70.9		72.5	77.2
K $\to$ D	78.9	74.0	73.9	74.9	79.9	72.2
K $\to$ E	85.6	84.3		84.6	86.3

Table 5. List and description of computer vision datasets from Tables 2 and 3

Computer Vision Datasets used for Domain Adaptation
MNIST(LeCun et al., 1998) [19: http://yann.lecun.com/exdb/mnist/]	This is a binary (mostly black and white, but actually grayscale due to anti-aliasing) handwritten digit dataset (digits 0-9), which stands for “modified NIST.” It is based on the National Institute of Standards and Technology’s (NIST) Special Database 1 and 3, one of which was easier than the other, so MNIST is a combination of the two that are size normalized to fit in a 20x20 box preserving the aspect ratio and centered in a 28x28 pixel image.
MNIST-M(Ganin et al., 2016) [20: See Ganin’s website http://yaroslav.ganin.net/ for links to download.]	This is a modification of MNIST where the digits are blended with random patches from BSDS500 dataset color photos.
USPS(LeCun et al., 1990) [21: This can be found on various sites and some Github repositories. One such place: https://web.stanford.edu/~hastie/ElemStatLearn/data.html]	This is another handwritten digit dataset (digits 0-9). It consists of handwritten zipcodes scanned and segmented by the U.S. Postal Service (USPS). They were size normalized to 16x16 pixels preserving the aspect ratio. The values are normalized to be between -1 and 1.
SVHN(Netzer et al., 2011) [22: http://ufldl.stanford.edu/housenumbers]	The Streetview House Numbers (SVHN) consists of single digits extracted from images of urban house numbers in Google Street View. The digits have been size normalized to 32x32 pixels.
${SYN}_{N}$ (Ganin et al., 2016)20	Ganin et al. (Ganin et al., 2016) used Microsoft Windows fonts to create a synthetic digit dataset (“Syn Numbers”) consisting of 1-3 digit numbers with various positions, orientation, background color, stroke color, and amount of blur.
${SYN}_{S}$ (Moiseev et al., 2013) [23: The synthetic dataset linked to on: http://graphics.cs.msu.ru/en/research/projects/imagerecognition/trafficsign]	This is a synthetic sign dataset created from modifications to Wikipedia pictograms of traffic signs. It consists of 100,000 images and 43 classes of signs.
GTSRB(Stallkamp et al., 2011) [24: http://benchmark.ini.rub.de/?section=gtsrb&subsection=dataset]	The German Traffic Signs Recognition Benchmark (GTSRB) is a dataset created from video taken driving around Germany. It consists of about 50,000 images and 43 classes of signs.
Office(Saenko et al., 2010) [25: http://ai.bu.edu/adaptation.html]	This dataset consists of 31 classes of objects in three different domains: Amazon (taken from its online website; medium resolution and studio lighting), DSLR (taken with a digital SLR camera; high resolution and in a real-world environment), and Webcam (taken with a 640x480 computer webcam; have noise, artifacts, and white balance issues). Note: due to Office’s small size, some networks (Ganin et al., 2016; Sun and Saenko, 2016; Rozantsev et al., 2019) were pre-trained on ImageNet.

Equations21

$$\min_{G} \max_{D} V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_{z}(z)}[\log(1 - D(G(z)))]$$

$$\epsilon_{T}(h) \leq \hat{\epsilon}_{S}(h) + \frac{1}{2} d_{\mathcal{H}\Delta\mathcal{H}}(\hat{D}_{S}, \hat{D}_{T}) + \lambda^{*} + O\left(\sqrt{\frac{d \log n + \log(\frac{1}{\delta})}{n}}\right)$$

$$\epsilon_{T}(h) \leq \epsilon_{S}(h) + d_{\tilde{\mathcal{H}}}(D_{S}, D_{T}) + \min\{\mathbb{E}_{D_{S}}[|f_{S} - f_{T}|], \mathbb{E}_{D_{T}}[|f_{S} - f_{T}|]\}$$

$$\epsilon_{S}(h \circ g) + \epsilon_{T}(h \circ g) \geq \frac{1}{2}\left(d_{JS}(D_{S}^{Y}, D_{T}^{Y}) - d_{JS}(D_{S}^{Z}, D_{T}^{Z})\right)^{2}$$

$$\Delta R(h^{s}, h^{t}) \leq M\left(WS_{c}(P^{s}, P^{\#}) + \min\{\mathbb{E}_{P^{\#}}[\|\Delta p(y \mid x)\|_{1}], \mathbb{E}_{P^{s}}[\|\Delta p(y \mid x)\|_{1}]\}\right)$$

$$\epsilon_{T}(\hat{h}) \leq \cdots + 2(1 - \alpha)\left(\frac{1}{2} \hat{d}_{\mathcal{H}\Delta\mathcal{H}}(U_{S}, U_{T}) + 4\sqrt{\frac{2 d \log(2 m') + \log(\frac{8}{\delta})}{m'}} + \lambda\right)$$

$$\alpha^{*}(m_{T}, m_{S}; D) = \begin{cases} 1 & m_{T} \geq D^{2} \\ \min\{1, \nu\} & m_{T} \leq D^{2} \end{cases}$$

$$\nu = \frac{m_{T}}{m_{T} + m_{S}}\left(1 + \frac{m_{S}}{\sqrt{D^{2}(m_{S} + m_{T}) - m_{S} m_{T}}}\right)$$

$$A = \frac{1}{2} \hat{d}_{\mathcal{H}\Delta\mathcal{H}}(U_{S}, U_{T}) + 4\sqrt{\frac{2 d \log(2 m') + \log(\frac{4}{\delta})}{m'}} + \lambda$$

$$B = 4\sqrt{\frac{2 d \log(2(m + 1)) + 2 \log(\frac{8}{\delta})}{m}}$$
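The piecewise weight $\alpha^{*}$ and the quantity $\nu$ listed above are simple to evaluate numerically. The following is a minimal sketch, not a reference implementation: the function and parameter names (`nu`, `alpha_star`, `d_ratio` for $D$) are hypothetical, and it assumes the denominator of the second factor in $\nu$ is $\sqrt{D^{2}(m_{S} + m_{T}) - m_{S} m_{T}}$.

```python
import math


def nu(m_t, m_s, d_ratio):
    """nu = m_T / (m_T + m_S) * (1 + m_S / sqrt(D^2 (m_S + m_T) - m_S m_T)).

    Assumption: the denominator is taken under a square root.
    """
    radicand = d_ratio ** 2 * (m_s + m_t) - m_s * m_t
    return m_t / (m_t + m_s) * (1.0 + m_s / math.sqrt(radicand))


def alpha_star(m_t, m_s, d_ratio):
    """Piecewise weight: 1 if m_T >= D^2, otherwise min(1, nu)."""
    if m_t >= d_ratio ** 2:
        return 1.0
    return min(1.0, nu(m_t, m_s, d_ratio))


# Illustrative values only: many source examples, few target examples.
few_target = alpha_star(m_t=10, m_s=1000, d_ratio=5.0)
many_target = alpha_star(m_t=100, m_s=1000, d_ratio=5.0)
print(f"alpha* with 10 target examples:  {few_target:.3f}")
print(f"alpha* with 100 target examples: {many_target:.3f}")
```

Under these assumptions the two branches agree at the boundary $m_{T} = D^{2}$, where $\nu$ evaluates to 1, so the weight is continuous in the number of target examples.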


Code & Models

Repositories

zhaoxin94/awsome-domain-adaptation
pytorchOfficial


Full text

A Survey of Unsupervised Deep Domain Adaptation

Garrett Wilson (ORCID 0000-0002-6760-754X) and Diane J. Cook (ORCID 0000-0002-4441-7508)

Washington State University, School of Electrical Engineering and Computer Science, Pullman, WA 99164, USA

Abstract.

Deep learning has produced state-of-the-art results for a variety of tasks. While such approaches for supervised learning have performed well, they assume that training and testing data are drawn from the same distribution, which may not always be the case. As a complement to this challenge, single-source unsupervised domain adaptation can handle situations where a network is trained on labeled data from a source domain and unlabeled data from a related but different target domain with the goal of performing well at test-time on the target domain. Many single-source and typically homogeneous unsupervised deep domain adaptation approaches have thus been developed, combining the powerful, hierarchical representations from deep learning with domain adaptation to reduce reliance on potentially-costly target data labels. This survey will compare these approaches by examining alternative methods, the unique and common elements, results, and theoretical insights. We follow this with a look at application areas and open research directions.

1. Introduction

Supervised learning is arguably the most prevalent type of machine learning and has enjoyed much success across diverse application areas. However, many supervised learning methods make a common assumption: the training and testing data are drawn from the same distribution. When this constraint is violated, a classifier trained on the source domain will likely experience a drop in performance when tested on the target domain due to the differences between domains (Patel et al., 2015). Single-source domain adaptation refers to the goal of learning a concept from labeled data in a source domain that performs well on a different but related target domain (Pan and Yang, 2010; Goodfellow et al., 2016; Ganin et al., 2016). Unsupervised domain adaptation specifically addresses the situation where there are labeled source data and only unlabeled target data available for use during training (Long et al., 2015; Ganin et al., 2016).

Because of its ability to adapt labeled data for use in a new application, domain adaptation can reduce the need for costly labeled data in the target domain. As an example, consider the problem of semantically segmenting images. Each real image in the Cityscapes dataset required approximately 1.5 hours to annotate for semantic segmentation (Cordts et al., 2016). In this case, human annotation time could be spared by training an image semantic segmentation model on synthetic street view images (the source domain) since these can be cheaply generated, then adapting and testing for real street view images (the target domain, here the Cityscapes dataset).

An undeniable trend in machine learning is the increased usage of deep neural networks. Deep networks have produced many state-of-the-art results for a variety of machine learning tasks (Ganin et al., 2016; Goodfellow et al., 2016) such as image classification, speech recognition, machine translation, and image generation (Goodfellow et al., 2016; Goodfellow, 2016). When trained on large amounts of data, these many-layer neural networks can learn powerful, hierarchical representations (Sun and Saenko, 2016; Long et al., 2015; Goodfellow et al., 2016; Patel et al., 2015) and can be highly scalable (Ghifary et al., 2016). At the same time, these networks can also experience performance drops due to domain shifts (Sun and Saenko, 2016; Ganin and Lempitsky, 2015). Thus, much research has gone into adapting such networks from large labeled datasets to domains where little (or possibly no) labeled training data are available (for a list, see (Xin, 2019)). These single-source and typically homogeneous unsupervised deep domain adaptation approaches, which combine the benefit of deep learning with the very practical use of domain adaptation to remove the reliance on potentially costly target data labels, will be the focus of this survey.

A number of surveys have been created on the topic of domain adaptation (Margolis, 2011; Beijbom, 2012; Patel et al., 2015; Csurka, 2017b, a; Wang and Deng, 2018; Zhao et al., 2018a; Kouw, 2018; Bungum and Gambäck, 2011; Chu and Wang, 2018; Sun et al., 2015; Kouw and Loog, 2019) and more generally transfer learning (Pan and Yang, 2010; Lu et al., 2015; Shao et al., 2015; Weiss et al., 2016; Zhang et al., 2017c; Tan et al., 2018; Cook et al., 2013; Taylor and Stone, 2009; Lazaric, 2012), of which domain adaptation can be viewed as a special case (Patel et al., 2015). Previous domain adaptation surveys lack depth of coverage and comparison of unsupervised deep domain adaptation approaches. In some cases, prior surveys do not discuss domain mapping (Kouw, 2018; Csurka, 2017b, a), normalization statistic-based (Kouw, 2018; Zhao et al., 2018a; Csurka, 2017b, a), or ensemble-based (Kouw, 2018; Zhao et al., 2018a; Csurka, 2017b, a; Wang and Deng, 2018) methods. In other cases, they do not survey deep learning approaches (Margolis, 2011; Beijbom, 2012; Patel et al., 2015; Kouw and Loog, 2019). Still others are application-centric, focusing on a single use case such as machine translation (Bungum and Gambäck, 2011; Chu and Wang, 2018). One earlier survey focuses on the multi-source scenario (Sun et al., 2015), while we focus on the more prevalent single-source scenario. Transfer learning is a broader topic to cover, thus surveys provide minimal coverage and comparison of the deep learning methods that have been designed for unsupervised domain adaptation (Pan and Yang, 2010; Lu et al., 2015; Shao et al., 2015; Weiss et al., 2016; Zhang et al., 2017c; Tan et al., 2018), or they focus on tasks such as activity recognition (Cook et al., 2013) or reinforcement learning (Taylor and Stone, 2009; Lazaric, 2012). The goal of this survey is to discuss, highlight unique components, and compare approaches to single-source homogeneous unsupervised deep domain adaptation.

We first provide background on where domain adaptation fits into the more general problem of transfer learning. We follow this with an overview of generative adversarial networks (GANs) to provide background for the increasingly widespread use of adversarial techniques in domain adaptation. Next, we investigate the various domain adaptation methods, the components of those methods, and the results. Then, we overview domain adaptation theory and discuss what we can learn from the theoretical results. Finally, we look at application areas and identify future research directions for domain adaptation.

2. Background

2.1. Transfer Learning

The focus of this survey is domain adaptation. Because domain adaptation can be viewed as a special case of transfer learning (Patel et al., 2015), we first review transfer learning to highlight the role of domain adaptation within this topic. Transfer learning is defined as the learning scenario where a model is trained on a source domain or task and evaluated on a different but related target domain or task, where either the tasks or domains (or both) differ (Pan and Yang, 2010; Dredze et al., 2010; Weiss et al., 2016; Goodfellow et al., 2016). For instance, we may wish to learn a model on a handwritten digit dataset (e.g., MNIST (LeCun et al., 1998)) with the goal of using it to recognize house numbers (e.g., SVHN (Netzer et al., 2011)). Or, we may wish to learn a model on a synthetic, cheap-to-generate traffic sign dataset (Moiseev et al., 2013) with the goal of using it to classify real traffic signs (e.g., GTSRB (Stallkamp et al., 2011)). In these examples, the source dataset used to train the model is related but different from the target dataset used to test the model – both are digits and signs respectively, but each dataset looks significantly different. When the source and target differ but are related, then transfer learning can be applied to obtain higher accuracy on the target data.

2.1.1. Categorizing Methods

In a transfer learning survey paper, Pan and Yang (2010) defined two terms to help classify various transfer learning techniques: “domain” and “task.” A domain consists of a feature space and a marginal probability distribution (i.e., the features of the data and the distribution of those features in the dataset). A task consists of a label space and an objective predictive function (i.e., the set of labels and a predictive function that is learned from the training data). Thus, a transfer learning problem might be either transferring knowledge from a source domain to a different target domain or transferring knowledge from a source task to a different target task (or a combination of the two) (Pan and Yang, 2010; Dredze et al., 2010; Weiss et al., 2016).

By this definition, a change in domain may result from either a change in feature space or a change in the marginal probability distribution. When classifying documents using text mining, a change in the feature space may result from a change in language (e.g., English to Spanish), whereas a change in the marginal probability distribution may result from a change in document topics (e.g., computer science to English literature) (Pan and Yang, 2010). Similarly, a change in task may result from either a change in the label space or a change in the objective predictive function. In the case of document classification, a change in the label space may result from a change in the number of classes (e.g., from a set of 10 topic labels to a set of 100 topic labels). Similarly, a change in the objective predictive function may result from a substantial change in the distribution of the labels (e.g., the source domain has 100 instances of class A and 10,000 of class B, whereas the target has 10,000 instances of A and 100 of B) (Pan and Yang, 2010).
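The label-distribution example above can be made concrete with a few lines of arithmetic. This is only an illustrative sketch: the helper name `class_priors` is hypothetical, and the counts are the ones quoted in the text.

```python
def class_priors(counts):
    """Return empirical label frequencies P(y) from a dict of class counts."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


# Counts taken from the example in the text.
source_counts = {"A": 100, "B": 10_000}
target_counts = {"A": 10_000, "B": 100}

source_priors = class_priors(source_counts)
target_priors = class_priors(target_counts)

# Class A is rare in the source (about 1%) but dominant in the target (about 99%).
print(f"source P(A) = {source_priors['A']:.4f}")
print(f"target P(A) = {target_priors['A']:.4f}")
```

A classifier that learned the source prior would strongly favor class B, which is exactly the wrong bias for the target.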

To classify transfer learning algorithms based on whether the task or domain differs between source and target, Pan and Yang (2010) introduced three terms: “inductive”, “transductive”, and “unsupervised” transfer learning. In inductive transfer learning, the target and source tasks are different, the domains may or may not differ, and some labeled target data are required. In transductive transfer learning, the tasks remain the same while the domains are different, and both labeled source data and unlabeled target data are required. Finally, in unsupervised transfer learning, the tasks differ as in the inductive case, but there is no requirement of labeled data in either the source domain or the target domain.
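The three settings can be read as a small decision rule. The sketch below encodes one possible reading of the definitions in the preceding paragraph; the names `TransferSetting` and `categorize` are hypothetical, and combinations the paragraph does not describe are returned as `None` rather than guessed.

```python
from enum import Enum


class TransferSetting(Enum):
    INDUCTIVE = "inductive"
    TRANSDUCTIVE = "transductive"
    UNSUPERVISED = "unsupervised"


def categorize(tasks_differ, domains_differ, labeled_source, labeled_target):
    """Map a problem description to one of the three settings, or None."""
    # Inductive: tasks differ and some labeled target data are available.
    if tasks_differ and labeled_target:
        return TransferSetting.INDUCTIVE
    # Transductive: same task, different domains, labels only in the source.
    if not tasks_differ and domains_differ and labeled_source and not labeled_target:
        return TransferSetting.TRANSDUCTIVE
    # Unsupervised: tasks differ and no labels are used in either domain.
    if tasks_differ and not labeled_source and not labeled_target:
        return TransferSetting.UNSUPERVISED
    return None


# Labeled source data, unlabeled target data, same task, shifted domain.
setting = categorize(
    tasks_differ=False, domains_differ=True, labeled_source=True, labeled_target=False
)
print(setting.value)
```

The final call describes a problem with the same task, a shifted domain, and labels only in the source, which this reading places in the transductive setting.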

2.1.2. Domain Adaptation

One popular type of transfer learning is domain adaptation, which will be the focus of our survey. Domain adaptation is a type of transductive transfer learning. Here, the target task remains the same as the source, but the domain differs (Patel et al., 2015; Pan and Yang, 2010; Daumé III and Marcu, 2006). Homogeneous domain adaptation is the case where the domain feature space also remains the same, and heterogeneous domain adaptation is the case where the feature spaces differ (Patel et al., 2015).

In addition to the previous terminology, machine learning techniques are often categorized based on whether or not labeled training data are available. Supervised learning assumes labeled data are available, semi-supervised learning uses both labeled data and unlabeled data, and unsupervised learning uses only unlabeled data. However, domain adaptation assumes data come from both a source domain and a target domain. Thus, prepending one of these three terms to “domain adaptation” is ambiguous since it may refer to labeled data being available in the source or target domains.

Authors apply these terms in various ways to domain adaptation (Jiang, 2008; Pan and Yang, 2010; Saito et al., 2017; Daumé III, 2007; Weiss et al., 2016). In this paper, we will refer to “unsupervised” domain adaptation as the case in which both labeled source data and unlabeled target data are available, “semi-supervised” domain adaptation as the case in which labeled source data in addition to some labeled target data are available, and “supervised” domain adaptation as the case in which both labeled source and target data are available (Beijbom, 2012). These categories are distinguished by which labels are available in the target domain; all of them assume that labeled data are available for the source domain. These definitions are commonly used in the methods surveyed in this paper as well as others (Sun and Saenko, 2016; Saito et al., 2017; Long et al., 2015; Ganin et al., 2016; Ghifary et al., 2016; Carlucci et al., 2017).

2.1.3. Related Problems

Multi-domain learning (Dredze et al., 2010; Joshi et al., 2012) and multi-task learning (Caruana, 1997) are related to transfer learning and domain adaptation. In contrast to transfer learning, the goal of these learning approaches is obtaining high performance on all specified domains (or tasks) rather than just on a single target domain (or task) (Pan and Yang, 2010; Yang and Hospedales, 2015). For example, often it is assumed that the training data are drawn in an independent and identically distributed (i.i.d.) fashion, which may not be the case (Joshi et al., 2012). One such example is the task of developing a spam filter for users who disagree on what is considered spam. If all the users’ data are combined, the training data will be drawn from multiple domains. While each individual domain may be i.i.d., the aggregated dataset may not be. If the data are split by user, then there may be too little data to learn a model for each user. Multi-domain learning can take advantage of the entire dataset to learn individual user preferences (Dredze et al., 2010; Joshi et al., 2012). Some researchers have developed adversarial strategies to tackle this multi-domain learning challenge (Sebag et al., 2019; Hassan et al., 2018).

When working with multiple tasks, instead of training models separately for different tasks (e.g., one model for detecting shapes in an image and one model for detecting text in an image), multi-task learning will learn these separate but related tasks simultaneously so that they can mutually benefit from the training data of other tasks through a (partially) shared representation (Caruana, 1997). If there are both multiple tasks and domains, then these approaches can be combined into multi-domain multi-task learning, as is described by Yang et al. (Yang and Hospedales, 2015).

Another related problem is domain generalization, in which a model is trained on multiple source domains with labeled data and then tested on a separate target domain that was not seen during training (Muandet et al., 2013). This contrasts with domain adaptation where target examples (possibly unlabeled) are available during training. Some approaches related to those surveyed in this paper have been designed to address this situation. Examples include an adversarial method introduced by Zhao et al. (Zhao et al., 2017b) and an autoencoder approach by Ghifary et al. (Ghifary et al., 2015) discussed in Section 7.4.

2.2. Generative Adversarial Networks

Many deep domain adaptation methods that we will discuss in the next section incorporate adversarial training. We use the term adversarial training broadly to refer to any method that utilizes an adversary or an adversarial process during training. Before other adversarial methods were developed, the term was narrowly applied to training designed to improve the robustness of a model by utilizing adversarial examples, e.g. image inputs with small worst-case perturbations that lead to misclassification (Szegedy et al., 2013; Goodfellow et al., 2014b). Subsequently, other techniques have arisen that also utilize an adversary during training, including generative-adversarial training of generative adversarial networks (GANs) (Goodfellow et al., 2014a) and domain-adversarial training of domain adversarial neural networks (DANN) (Ganin et al., 2016), both of which have been used for domain adaptation. To provide background for the domain adaptation methods utilizing these techniques, we will first discuss GANs and later when discussing DANN note the differences.

In recent years there has been a large and growing interest in GANs. By pitting two well-matched neural networks against each other (hence “adversarial”), one playing the role of a data discriminator and the other a data generator, a GAN refines each player’s abilities in order to perform functions such as synthetic data generation. Goodfellow et al. (Goodfellow et al., 2014a) proposed this technique in 2014. Since that time, hundreds of papers have been published on the topic (Hindupur, 2018; Zhang, 2019). GANs have traditionally been applied to synthetic image generation, but recently researchers have been exploring other novel use cases such as domain adaptation.

GANs are a type of deep generative model (Goodfellow et al., 2014a). For synthetic image generation, a training dataset of images must be available. Popular datasets include human faces (CelebA (Liu et al., 2015)), handwritten digits (MNIST (LeCun et al., 1998)), bedrooms (LSUN (Yu et al., 2015)), and sets of other objects (CIFAR-10 (Krizhevsky, 2009) and ImageNet (Deng et al., 2009; Russakovsky et al., 2015)). After training, the generative model will be able to generate synthetic images that resemble those in the training data. For example, a generator trained with CelebA will generate images of human faces that look realistic but are not images of real people, as shown in Figure 1. To learn to do this, GANs utilize two neural networks competing against each other (Goodfellow et al., 2014a). One network represents a generator. The generator accepts a noise vector as input, which contains random values drawn from some distribution such as normal or uniform. The goal of the generator network is to output a vector that is indistinguishable from the real training data. The other network represents a discriminator, which accepts as input either a real sample from the training data or a fake sample from the generator. The goal of the discriminator is to determine the probability that the input sample is real. During training, these two networks play a minimax game, where the generator tries to fool the discriminator and the discriminator attempts to not be fooled.

Using the notation from Goodfellow et al. (Goodfellow et al., 2014a), we define a value function $V(G,D)$ employed by the minimax game between the two networks:

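$$\min_{G}\max_{D}V(G,D)=\mathbb{E}_{x\sim p_{data}(x)}\left[\log D(x)\right]+\mathbb{E}_{z\sim p_{z}(z)}\left[\log\left(1-D(G(z))\right)\right]\qquad(1)$$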

Here, $x\sim p_{data}(x)$ draws a sample from the real data distribution, $z\sim p_{z}(z)$ draws a sample from the input noise, $D(x;\theta_{d})$ is the discriminator, and $G(z;\theta_{g})$ is the generator. As shown in the equation, the goal is to find the parameters $\theta_{d}$ that maximize the log probability of correctly discriminating between real ( $x$ ) and fake ( $G(z)$ ) samples while at the same time finding the parameters $\theta_{g}$ that minimize the log probability of $1-D(G(z))$ . The term $D(G(z))$ represents the probability that generated data $G(z)$ is real. If the discriminator correctly classifies a fake input then $D(G(z))=0$ . Equation 1 minimizes the quantity $1-D(G(z))$ . This occurs when $D(G(z))=1$ , or when the discriminator misclassifies the generator’s output as a real sample. Thus the discriminator’s mission is to learn to correctly classify the input as real or fake while the generator tries to fool the discriminator into thinking that its generated output is real. This process is illustrated in Figure 2.
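To make the roles of the two terms in Equation 1 concrete, the following is a minimal sketch that estimates $V(G,D)$ by sample averages for a fixed discriminator and two candidate generators. The one-dimensional data, the logistic discriminator, and the names `value_function`, `weak_generator`, and `strong_generator` are illustrative assumptions, not part of any surveyed method, and no training is performed.

```python
import math
import random


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def value_function(discriminator, generator, real_samples, noise_samples):
    """Sample-average estimate of E_x[log D(x)] + E_z[log(1 - D(G(z)))]."""
    real_term = sum(math.log(discriminator(x)) for x in real_samples)
    real_term /= len(real_samples)
    fake_term = sum(math.log(1.0 - discriminator(generator(z)))
                    for z in noise_samples)
    fake_term /= len(noise_samples)
    return real_term + fake_term


# Hypothetical toy setup: real data centered at 4, input noise centered at 0.
rng = random.Random(0)
real = [rng.gauss(4.0, 0.5) for _ in range(1000)]
noise = [rng.gauss(0.0, 0.5) for _ in range(1000)]


def discriminator(x):
    """D(x): estimated probability that x is a real sample."""
    return sigmoid(2.0 * (x - 2.0))


def weak_generator(z):
    """Outputs stay near 0, far from the real data, so D rejects them."""
    return z


def strong_generator(z):
    """Outputs overlap the real data, so D is fooled."""
    return z + 4.0


v_weak = value_function(discriminator, weak_generator, real, noise)
v_strong = value_function(discriminator, strong_generator, real, noise)
print(f"V with weak generator:   {v_weak:.3f}")
print(f"V with strong generator: {v_strong:.3f}")
```

For this fixed discriminator, the generator that fools it yields the lower value, which is the direction in which the generator's parameters $\theta_{g}$ are pushed.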

2.2.1. Training

In recent years there have been impressive results from GANs. At the same time, this research faces some challenges since training a GAN can encounter problems such as difficulty converging (Goodfellow, 2016; Arora et al., 2017), mode collapse where the generator only learns to generate realistic samples for a few specialized modes of the data distribution (Goodfellow, 2016), and vanishing gradients (Goodfellow et al., 2014a). Many methods have been proposed to resolve these training challenges using a variety of tricks (Goodfellow et al., 2014a; Salimans et al., 2016; Szegedy et al., 2016; Odena et al., 2017; Shrivastava et al., 2017; Heusel et al., 2017), network architecture choices (Radford et al., 2015; Salimans et al., 2016; Karras et al., 2018), objective modifications (Zhao et al., 2017a; Berthelot et al., 2017; Metz et al., 2017; Mao et al., 2017; Nowozin et al., 2016; Arjovsky et al., 2017; Gulrajani et al., 2017; Kodali et al., 2017; Fedus et al., 2018; Jolicoeur-Martineau, 2018; Miyato et al., 2018; Odena et al., 2018; Nguyen et al., 2017), mixtures or ensembles (Ghosh et al., 2018; Hoang et al., 2018; Park et al., 2018; Khayatkhoei et al., 2018; Mordido et al., 2018; Durugkar et al., 2017; Arora et al., 2017; Zhang et al., 2018b; Tolstikhin et al., 2017), maximum mean discrepancy (MMD) (Dziugaite et al., 2015; Li et al., 2015; Sutherland et al., 2016; Li et al., 2017; Bińkowski et al., 2017), making a connection to reinforcement learning (Finn et al., 2016; Pfau and Vinyals, 2016), or a combination of these modifications (Miyato and Koyama, 2018; Heusel et al., 2017; Zhang et al., 2018b). For an in-depth discussion of these techniques, there are a number of survey papers directed at GAN variants that include a discussion of training challenges and work (Hong et al., 2019; Manisha and Gujar, 2018; Hitawala, 2018). 
These techniques can be employed in the domain adaptation methods that utilize GANs (Liu and Tuzel, 2016; Mao and Li, 2018; Shrivastava et al., 2017; Bousmalis et al., 2017; Hoffman et al., 2018b; Bousmalis et al., 2018; Sankaranarayanan et al., 2018a; Wang et al., 2018; Wei and Hsu, 2018; Choi et al., 2018). While these training stability methods could similarly be applied to other adversarial domain adaptation approaches, they are not typically needed for the non-GAN methods surveyed here.

2.2.2. Evaluation

Once successfully trained, a GAN model can be difficult to evaluate and compare with other models. Multiple approaches and measures have been introduced to evaluate GAN performance. Often researchers have evaluated their models through visual inspection (Santurkar et al., 2018a) such as performing user studies where participants mark which images they think look more realistic (Salimans et al., 2016). However, ideally a more automated metric could be found. Past generative models were evaluated by computing log-likelihood (Theis et al., 2016), but this is not necessarily tractable in GANs (Goodfellow, 2016). A proxy for log-likelihood is a Parzen window estimate, which was used for early GAN evaluation (Theis et al., 2016; Goodfellow et al., 2014a; Makhzani et al., 2015; Nowozin et al., 2016), but in high dimensions (such as images), this could be far from the actual log-likelihood and not even rank models correctly (Theis et al., 2016; Grover et al., 2017). Thus, there has been much work proposing various evaluation methods for GANs: methods for detecting memorization (Goodfellow et al., 2014a; Makhzani et al., 2015; Donahue et al., 2017; Theis et al., 2016; Radford et al., 2015; Berthelot et al., 2017), determining diversity (Arora et al., 2018; Santurkar et al., 2018a; Odena et al., 2017; Heusel et al., 2017), measuring realism (Salimans et al., 2016; Heusel et al., 2017; Liu et al., 2018; Bińkowski et al., 2017), and approximating log-likelihood (Wu et al., 2017). Xu et al. (Xu et al., 2018b) and Borji (Borji, 2018) survey and compare many of these GAN evaluation methods.

These techniques can be used for evaluating domain adaptation methods used for image translation (a form of image generation but conditioned on an input image) from one domain to another (Yoo et al., 2016; Zhu et al., 2017; Yi et al., 2017; Choi et al., 2018; Royer et al., 2017; Benaim and Wolf, 2017). However, many domain adaptation methods (even those that are adversarial such as those using GANs) are not used for generation but rather for tasks with more easily-defined loss functions, making these techniques largely not needed for adversarial domain adaptation methods. For example, accuracy (Liu and Tuzel, 2016; Ganin et al., 2016; Tzeng et al., 2017; Bousmalis et al., 2016, 2017; Hoffman et al., 2018b; Choi et al., 2018; Benaim and Wolf, 2017; Fu et al., 2018) or AUC scores (Purushotham et al., 2017) can be used to evaluate classification, intersection over union or pixel accuracy can be used to evaluate image segmentation (Hoffman et al., 2018b; Benaim and Wolf, 2017; Fu et al., 2018; Li et al., 2018a; Perone et al., 2018), and absolute difference can be used to evaluate regression (Shrivastava et al., 2017).
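As a brief illustration of how simple these task metrics are compared with GAN evaluation measures, the sketch below computes classification accuracy, single-class intersection over union on flattened binary masks, and mean absolute difference for regression. The function names and toy inputs are assumptions made for this example.

```python
def accuracy(predicted, actual):
    """Fraction of predictions that match the ground-truth labels."""
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)


def intersection_over_union(predicted_mask, actual_mask):
    """IoU for one class; masks are flat lists with 1 marking class pixels."""
    pairs = list(zip(predicted_mask, actual_mask))
    intersection = sum(1 for p, a in pairs if p == 1 and a == 1)
    union = sum(1 for p, a in pairs if p == 1 or a == 1)
    # By convention here, two empty masks agree perfectly.
    return intersection / union if union else 1.0


def mean_absolute_difference(predicted, actual):
    """Average absolute error for regression outputs."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


acc = accuracy([0, 1, 1, 2], [0, 1, 2, 2])
iou = intersection_over_union([1, 1, 0, 0], [1, 0, 0, 0])
mad = mean_absolute_difference([1.0, 2.0, 4.0], [1.0, 3.0, 2.0])
```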

3. Methods

In recent years, numerous new unsupervised domain adaptation methods have been proposed, with a growing emphasis on neural network-based approaches. Distinct lines of research have emerged. These include aligning the source domain and target domain distributions, mapping between domains, separating normalization statistics, designing ensemble-based methods, or focusing on making the model target discriminative by moving the decision boundary into regions of lower data density. In addition, others have explored combinations of these approaches. We will describe each of these categories together with recent methods that fall into these categories.

In this survey, we will focus on single-source domain adaptation consisting of one source and one target domain, as is most commonly studied. Another case is multi-source domain adaptation, where there are multiple source domains but still only one target domain. Sun et al. (Sun et al., 2015) survey multi-source domain adaptation, and since then a number of other methods (Guo et al., 2018; Hoffman et al., 2018a; Zhao et al., 2018b; Peng et al., 2018; Carlucci et al., 2018; Mancini et al., 2018; Xie et al., 2017; Xu et al., 2018a; Redko et al., 2018) have been developed for this case. It is also possible to perform multi-target domain adaptation (Gholami et al., 2018), though this case is even more rarely studied. Similarly, we focus on homogeneous domain adaptation due to its prevalence, though some heterogeneous methods have been developed (Yao et al., 2020; Zhou et al., 2019; Hubert Tsai et al., 2016; Duan et al., 2012; Wang and Mahadevan, 2011; Li et al., 2019).

3.1. Domain-Invariant Feature Learning

Most recent domain adaptation methods align source and target domains by creating a domain-invariant feature representation, typically in the form of a feature extractor neural network. A feature representation is domain-invariant if the features follow the same distribution regardless of whether the input data are from the source or target domain (Zhao et al., 2019). If a classifier can be trained to perform well on the source data using domain-invariant features, then the classifier may generalize well to the target domain since the features of the target data match those on which the classifier was trained. However, these methods assume that such a feature representation exists and the marginal label distributions do not differ significantly (Section 6).

The general training and testing setup of these methods is illustrated in Figure 3. Methods differ in how they align the domains (the Alignment Component in the figure). Some minimize divergence, some perform reconstruction, and some employ adversarial training. In addition, they differ in weight sharing choices, which will be discussed in Section 4.3. We discuss the various alignment methods below.

3.1.1. Divergence

One method of aligning distributions is through minimizing a divergence that measures the distance between the distributions. Choices for the divergence measure include maximum mean discrepancy, correlation alignment, contrastive domain discrepancy, the Wasserstein metric, and a graph matching loss.

Maximum mean discrepancy (MMD) (Gretton et al., 2007, 2012) is a two-sample statistical test of the hypothesis that two distributions are equal based on observed samples from the two distributions. The test is computed from the difference between the mean values of a smooth function on the two domains’ samples. If the means are different, then the samples are likely not from the same distribution. The smooth functions chosen for MMD are unit balls in characteristic reproducing kernel Hilbert spaces (RKHS) since it can be proven that the population MMD is zero if and only if the two distributions are equal (Gretton et al., 2012).
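The following is a minimal sketch of a biased sample estimate of the squared MMD, assuming a Gaussian (RBF) kernel with a fixed bandwidth. The kernel choice, the bandwidth, and the toy two-dimensional samples are assumptions for illustration; the surveyed methods compute such a quantity on network activations and may use other kernels or unbiased estimators.

```python
import math
import random


def gaussian_kernel(a, b, bandwidth=1.0):
    """RBF kernel between two feature vectors (a characteristic kernel)."""
    squared_distance = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-squared_distance / (2.0 * bandwidth ** 2))


def mmd_squared(xs, ys, kernel=gaussian_kernel):
    """Biased estimate: mean k(x, x') + mean k(y, y') - 2 * mean k(x, y)."""
    def mean_kernel(us, vs):
        total = sum(kernel(u, v) for u in us for v in vs)
        return total / (len(us) * len(vs))
    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


rng = random.Random(0)


def sample(mean, n=100):
    """Draw n two-dimensional points from an isotropic unit Gaussian."""
    return [(rng.gauss(mean, 1.0), rng.gauss(mean, 1.0)) for _ in range(n)]


source = sample(0.0)
target_same = sample(0.0)     # same distribution as the source
target_shifted = sample(3.0)  # mean-shifted distribution

mmd_same = mmd_squared(source, target_same)
mmd_shifted = mmd_squared(source, target_shifted)
print(f"same distribution:    {mmd_same:.4f}")
print(f"shifted distribution: {mmd_shifted:.4f}")
```

The estimate stays near zero when both samples come from the same distribution and grows when the distributions differ, which is what makes it usable as an alignment loss.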

To use MMD for domain adaptation, the alignment component can be another classifier similar to the task classifier. MMD can then be computed and minimized between the outputs of these classifiers’ corresponding layers (a slightly different setup than that in Figure 3). Rozantsev et al. (Rozantsev et al., 2019) employ MMD, Long et al. (Long et al., 2015) investigate a multiple kernel variant of MMD (MK-MMD), and later Long et al. (Long et al., 2017) develop a joint MMD (JMMD) method. Bousmalis et al. (Bousmalis et al., 2016) also tried MMD but found using an adversarial objective performed better in their experiments.

Correlation alignment (CORAL) (Sun et al., 2016) is similar to MMD with a polynomial kernel, computed from the distance between second-order statistics (covariances) of the source and target features. For domain adaptation, the alignment component consists of computing the CORAL loss between the two feature extractors’ outputs (in order to minimize the distance). A variety of distances have been used: Sun et al. (Sun and Saenko, 2016) use a squared matrix Frobenius norm in Deep CORAL, Zhang et al. (Zhang et al., 2018) use a Euclidean distance in mapped correlation alignment (MCA), others have used log-Euclidean distances in LogCORAL (Wang et al., 2017a) and Log D-CORAL (Morerio and Murino, 2017), and Morerio et al. (Morerio et al., 2018) use geodesic distances. Zhang et al. (Zhang et al., 2018a) generalize correlation alignment to possibly infinite-dimensional covariance matrices in RKHS. Chen et al. (Chen et al., 2019b) align statistics beyond the first and second orders.
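As a small worked example of the squared Frobenius norm variant, the sketch below computes the distance between the source and target feature covariance matrices, scaled by $1/(4d^{2})$ for $d$-dimensional features as in Deep CORAL. The plain-Python covariance helper and the toy feature batches are assumptions for illustration; in practice the loss is computed on network activations inside an automatic differentiation framework.

```python
def covariance(features):
    """Unbiased sample covariance matrix of a batch of feature vectors."""
    n, d = len(features), len(features[0])
    means = [sum(row[j] for row in features) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in features]
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def coral_loss(source_features, target_features):
    """Squared Frobenius distance between covariances, scaled by 1/(4 d^2)."""
    d = len(source_features[0])
    cov_s = covariance(source_features)
    cov_t = covariance(target_features)
    squared_frobenius = sum((cov_s[i][j] - cov_t[i][j]) ** 2
                            for i in range(d) for j in range(d))
    return squared_frobenius / (4.0 * d * d)


source = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]    # covariance [[1, 1], [1, 1]]
target = [[0.0, 0.0], [2.0, 0.0], [4.0, 0.0]]    # covariance [[4, 0], [0, 0]]
shifted_source = [[10.0, 10.0], [11.0, 11.0], [12.0, 12.0]]

loss = coral_loss(source, target)                 # (9 + 1 + 1 + 1) / 16
loss_shift_only = coral_loss(source, shifted_source)
```

Because only second-order statistics are compared, a pure mean shift between the domains (as in `shifted_source`) leaves this loss at zero.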

Contrastive domain discrepancy (CDD) (Kang et al., 2019) is based on MMD but looks at the conditional distributions in order to incorporate label information (unlike CORAL or ordinary MMD). When minimizing CDD, intra-class discrepancy is minimized while inter-class margin is maximized. This has the problem of requiring target labels though, so Kang et al. (Kang et al., 2019) propose contrastive adaptation networks (CAN) that minimize cross-entropy loss on the labeled source data while alternating between estimating labels for target samples (via clustering) and adapting the feature extractor with the now-computable CDD (using the clusters). This approach outperforms the other methods on the Office dataset as shown in Table 3.

A problem known as “optimal transport” was originally proposed for studying resource allocation such as finding an optimal way to move material from mines to factories (Monge, 1781; Redko et al., 2017), but it can also be used to measure the distances between distributions. If the cost of moving each point is a norm (e.g., Euclidean), then the solution to a discrete optimal transport problem can be viewed as a distance: the Wasserstein distance (Damodaran et al., 2018) (also known as the earth mover’s distance). To align feature and label distributions with this distance, Courty et al. (Courty et al., 2017) propose joint distribution optimal transport (JDOT). To incorporate this into a neural network, Damodaran et al. (Damodaran et al., 2018) propose DeepJDOT.
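To give some intuition for the Wasserstein distance as a transport cost, the sketch below handles only the simplest special case: two one-dimensional empirical samples of equal size with uniform weights and an absolute-difference cost, where the optimal plan matches the sorted samples in order. This is an illustrative assumption and not JDOT or DeepJDOT, which operate on joint feature and label distributions in higher dimensions.

```python
def wasserstein_1d(xs, ys):
    """Wasserstein-1 distance between equal-size 1-D empirical samples.

    In one dimension with uniform weights, the optimal transport plan pairs
    the i-th smallest source point with the i-th smallest target point.
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes samples of equal size")
    paired = zip(sorted(xs), sorted(ys))
    return sum(abs(a - b) for a, b in paired) / len(xs)


distance_shifted = wasserstein_1d([0.0, 1.0, 3.0], [8.0, 5.0, 6.0])
distance_permuted = wasserstein_1d([3.0, 0.0, 1.0], [1.0, 3.0, 0.0])
```

Here `distance_shifted` equals the constant offset of 5 between the two samples, and `distance_permuted` is zero because the samples contain the same points in a different order.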

Another divergence measure arises from graph matching: the problem of finding an optimal correspondence between graphs (Yan et al., 2016). A feature extractor’s output on a batch of samples can be viewed as an undirected graph (in the form of an adjacency matrix), where similar samples in the batch are connected. Given the graph from a batch of source data fed through the feature extractor and similarly a graph from a batch of target data, then the cost of aligning these graphs can be used as a divergence, as proposed by Das et al. (Das and Lee, 2018b, a; Das and George Lee, 2018).

3.1.2. Reconstruction

Rather than minimizing a divergence, Ghifary et al. (Ghifary et al., 2016) and Bousmalis et al. (Bousmalis et al., 2016) hypothesize that alignment can be accomplished by learning a representation that both classifies the labeled source domain data well and can be used to reconstruct either the target domain data (Ghifary et al.) or both the source and target domain data (Bousmalis et al.). The alignment component in these setups is a reconstruction network – the opposite of the feature extractor network – that takes the feature extractor output and recreates the feature extractor’s input (in this case, an image). Ghifary et al. (Ghifary et al., 2016) propose deep reconstruction-classification networks (DRCN), using a pair-wise squared reconstruction loss. Bousmalis et al. (Bousmalis et al., 2016) propose domain separation networks (DSN), using a scale-invariant mean squared error reconstruction loss.

3.1.3. Adversarial

Several varieties of feature-level adversarial domain adaptation methods have been introduced in the literature. In most, the alignment component consists of a domain classifier. In one paper this component is instead represented by a network learning an approximate Wasserstein distance, and in another paper the component is a GAN.

A domain classifier is a classifier that outputs whether the feature representation was generated from source or target data. Recall that GANs include a discriminator that tries to accurately predict whether a sample is from the real data distribution or from the generator. In other words, the discriminator differentiates between two distributions, one real and one fake. A discriminator could similarly be designed to differentiate two distributions which instead represent a source distribution and a target distribution, as is done with a domain classifier. Note though that an adversarial domain classifier is used for adaptation, whereas a GAN is used for data generation. The domain classifier is trained to correctly classify the domain (source or target). In this scenario, the feature extractor is trained such that the domain classifier is unable to classify from which domain the feature representation originated. This is a type of zero-sum two-player game (Zhao et al., 2019) as in a GAN (Section 2.2). Typically, these networks are adversarially trained by alternating between these two steps. The feature extractor can be trained to make the domain classifier perform poorly by negating the gradient from the domain classifier with a gradient reversal layer (Ganin and Lempitsky, 2015) when performing back propagation to update the feature extractor weights (e.g., in DANN (Ajakan et al., 2014; Ganin and Lempitsky, 2015; Ganin et al., 2016) and VRADA (Purushotham et al., 2017)), maximally confusing the domain classifier (when it outputs a uniform distribution over binary labels (Tzeng et al., 2015)), or inverting the labels (in ADDA (Tzeng et al., 2017)). Because data distributions are often multi-modal, results may be improved by conditioning the domain classifier on a multilinear map of the feature representation and the task classifier predictions, which takes into account the multi-modal nature of the distributions (Long et al., 2018).
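The gradient reversal idea can be sketched without any deep learning framework: the layer is the identity in the forward pass and negates the gradient in the backward pass. The class name `GradientReversal`, the explicit `forward` and `backward` methods, and the `scale` factor are assumptions for illustration; in a real implementation this would be expressed as a custom operation in an automatic differentiation library.

```python
class GradientReversal:
    """Identity on the forward pass; negated, scaled gradient on the backward pass."""

    def __init__(self, scale=1.0):
        # Hypothetical weighting of the adversarial signal.
        self.scale = scale

    def forward(self, features):
        # The domain classifier sees the features unchanged.
        return list(features)

    def backward(self, upstream_gradient):
        # The gradient of the domain classification loss is flipped, so a
        # gradient descent step on the feature extractor increases that loss.
        return [-self.scale * g for g in upstream_gradient]


layer = GradientReversal(scale=0.5)
features = [0.2, -1.0, 3.0]
passed_to_domain_classifier = layer.forward(features)

gradient_from_domain_classifier = [1.0, -2.0, 0.0]
gradient_to_feature_extractor = layer.backward(gradient_from_domain_classifier)
```

With this layer in place, a single backward pass updates the domain classifier to reduce its loss while the feature extractor receives the opposite signal, so the two steps do not need separate objectives.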

Shen et al. (Shen et al., 2018) created WDGRL, a modification of DANN, by replacing the domain classifier with a network that learns an approximate Wasserstein distance. This distance is then minimized between source and target domains, which they found to yield an improvement. This method is similar to the divergence methods except here the divergence is learned with a network rather than computed based on statistics (e.g., using mean in MMD or covariance in CORAL). This method outperforms the other methods on the Amazon review dataset as shown in Table 4.

Sankaranarayanan et al. (Sankaranarayanan et al., 2018a) propose Generate to Adapt, which uses a GAN as the alignment component. The feature extractor output is fed both to a classifier trained to predict the label (if the input is from the source domain) and to a GAN trained to generate source-like images (regardless of whether the input is source or target). For training stability, they use an AC-GAN (Odena et al., 2017). They note one downside of using a GAN for adaptation is that it requires a large training dataset, but a common strategy is to use a network pretrained on a large dataset such as ImageNet. Using this pretraining, even on small datasets (e.g., Office) where the generated images are poor, the network still learns adaptation satisfactorily. Sankaranarayanan et al. (Sankaranarayanan et al., 2018b) develop a similar approach for semantic segmentation.

3.2. Domain Mapping

An alternative to creating a domain-invariant feature representation is mapping from one domain to another. The mapping is typically created adversarially and at the pixel level (i.e., pixel-level adversarial domain adaptation), but not always, as discussed at the end of this section. This mapping can be accomplished with a conditional GAN. The generator performs adaptation at the pixel level by translating a source input image to an image that closely resembles the target distribution. For example, the GAN could change from a synthetic vehicle driving image to one that looks realistic as shown in Figure 4 (Yoo et al., 2016; Zhu et al., 2017; Royer et al., 2017; Choi et al., 2018; Hoffman et al., 2018b). A classifier can then be trained on the source data mapped to the target domain using the known source labels (Shrivastava et al., 2017) or jointly trained with the GAN (Bousmalis et al., 2017; Hoffman et al., 2018b). We will first discuss how a conditional GAN works followed by the ways it can be employed for domain adaptation.

3.2.1. Conditional GAN for Image-to-Image Translation

The original formulation of a GAN was unconditional, where a GAN only accepted a noise vector as input. Conditional GANs, on the other hand, accept as input other information such as a class label, image, or other data (Goodfellow et al., 2014a; Gauthier, 2014; Mirza and Osindero, 2014; Denton et al., 2015). In the case of image generation, this means that a particular type of image to generate can be specified. One such example is to generate an image of a particular class within an image dataset such as “cat” rather than a random object from the dataset. Another example is conditioning on an input image such as in Figure 4, mapping an input driving image from one domain (synthetic) to an output image in another domain (realistic). Other uses include: transferring style (e.g., make a photo look like a Van Gogh painting) (Zhu et al., 2017; Yi et al., 2017; Kim et al., 2017), colorizing images (Isola et al., 2017), generating satellite images from Google Maps data (or vice versa) (Isola et al., 2017; Zhu et al., 2017; Yi et al., 2017), generating images of clothing from images of people wearing the clothing (Yoo et al., 2016), generating cartoon faces from real faces (Taigman et al., 2016; Royer et al., 2017), converting labels to photos (e.g., semantic segmentation output to a photo) (Isola et al., 2017; Zhu et al., 2017; Yi et al., 2017), learning disentangled representations (Chen et al., 2016), improving GAN training stability (Odena et al., 2017), and domain adaptation, which will be discussed in Section 3.2.2.

GANs conditioned on an input image can be used to perform image-to-image translation. These networks can be trained with varying levels of supervision: the dataset may contain corresponding images in the domains (supervised (Yoo et al., 2016; Isola et al., 2017)), only a few corresponding images (semi-supervised (Gan et al., 2017)), or no corresponding images (unsupervised (Zhu et al., 2017; Yi et al., 2017; Kim et al., 2017)). A popular and general-purpose supervised method is pix2pix, developed by Isola et al. (Isola et al., 2017). A commonly used unsupervised method is CycleGAN (Zhu et al., 2017), which is based on pix2pix, or methods similar to CycleGAN including DualGAN (Yi et al., 2017) and DiscoGAN (Kim et al., 2017).

Numerous modifications to these approaches have been proposed. One is MUNIT, a multi-modal unsupervised image-to-image translator (Huang et al., 2018b). By assuming a decomposition into style (domain-specific) and content (domain-invariant) codes, MUNIT can generate diverse outputs for a given input image (e.g., multiple possible output images corresponding to the same input image). A modification to CycleGAN explored by Li et al. (Li, 2018) uses separate batch normalization for each domain (an idea similar to AdaBN discussed in Section 3.3). Mejjati et al. (Alami Mejjati et al., 2018) and Chen et al. (Chen et al., 2018d) improve results with attention, learning which areas of the images to focus on. Shang et al. (Shang et al., 2017) improve results by feeding the mapped images into a denoising autoencoder. While CycleGAN and similar approaches use two generators, one for each mapping direction, Benaim et al. (Benaim and Wolf, 2017) developed a method for one-sided mapping that maintains distances between pairs of samples when mapped from the source to the target domain rather than (or in addition to) using a cycle consistency loss, and Fu et al. (Fu et al., 2018) developed an alternative one-sided mapping using a geometric constraint (e.g., vertical flipping or 90 degree rotation). Royer et al. (Royer et al., 2017) propose XGAN, a dual adversarial autoencoder capable of handling large domain shifts, where possibly an image in the source domain may correspond to multiple images in the target domain or vice versa. They tested mapping human faces to cartoon faces, which was a shift larger than CycleGAN could adequately handle. Choi et al. (Choi et al., 2018) propose StarGAN, a method for handling multiple domains with a single GAN. Approaches like CycleGAN need a separate generator (or two, one for each direction) for each pair of domains, which does not scale to many domains. StarGAN, on the other hand, only needs a single generator. This has the added benefit of allowing the generator to learn using all the available data rather than only the data in a specific pair of domains. During training they randomly pick a target domain at each iteration so the generator learns to generate images in all the domains. Anoosheh et al. (Anoosheh et al., 2018) propose an approach designed for the same purpose as StarGAN but using one generator per domain.

3.2.2. Image-to-Image Translation for Domain Adaptation

While the above approaches map images from one domain to another without the explicit purpose of performing domain adaptation, they can also be used for domain adaptation. For example, the original CycleGAN paper was application agnostic, but others have experimented with applying CycleGAN to domain adaptation (Hoffman et al., 2018b; Benaim and Wolf, 2017; Fu et al., 2018). It is important to note though that these image-to-image translation approaches assume that the domain differences are primarily low-level (Bousmalis et al., 2017, 2018; Tzeng et al., 2017).

If unsupervised domain adaptation is performed for classification, adaptation can be accomplished by training an image-to-image translation GAN to map data from source to target, training a classifier on the mapped source images with known labels, and then subsequently testing by feeding unlabeled target images through this target-domain classifier (Shrivastava et al., 2017; Bousmalis et al., 2018; Li et al., 2018a), as done in SimGAN (Shrivastava et al., 2017) and illustrated in Figure 5(a). Alternatively, rather than learning a mapping from source to target, the opposite could be done: learn a mapping from target to source, train a classifier on the source images with known labels, and test by feeding target images to the image-to-image translation model (to make them look like source images) followed by the source-domain classifier (Chen et al., 2018a), as illustrated in Figure 5(b).

In either of these approaches, if the mapping and the classification models are learned independently, the class assignments may not be preserved. For instance, class 1 may end up being “renamed” to class 2 after the mapping since the mapping was learned ignoring the class labels. This issue can be resolved by incorporating a semantic consistency loss (see Section 4.1) and training the mapping and classification models jointly (Bousmalis et al., 2016; Hoffman et al., 2018b), as done in PixelDA (Bousmalis et al., 2017).

If there is a way to perform hyperparameter tuning, a third option is possible (a combination of Figure 5(a) and 5(b)): train a target-domain classifier on the output of the source-to-target GAN (for which the GAN is not used during testing) and a source-domain classifier paired with the target-to-source GAN (for which the GAN is used during testing). The algorithm may then output a linear combination of the prediction results from the two classifiers (Russo et al., 2018). While this approach does improve results, it requires a method of hyperparameter tuning (see Section 4.7).
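The combination step can be sketched as a weighted average of the two classifiers' class probabilities. The function name `combine_predictions` and the `weight` parameter are assumptions for illustration; `weight` stands in for the hyperparameter that must be tuned, and the probabilities are made-up values rather than outputs of trained models.

```python
def combine_predictions(target_domain_probs, source_domain_probs, weight=0.5):
    """Linear combination of class probabilities from the two classifiers.

    `weight` is the share given to the target-domain classifier; choosing it
    is the hyperparameter tuning problem noted in the text.
    """
    return [weight * t + (1.0 - weight) * s
            for t, s in zip(target_domain_probs, source_domain_probs)]


def predicted_class(probabilities):
    """Index of the most probable class."""
    return max(range(len(probabilities)), key=lambda i: probabilities[i])


# Hypothetical outputs for one target image.
from_target_domain_classifier = [0.6, 0.4]  # target image fed in directly
from_source_domain_classifier = [0.2, 0.8]  # image first mapped to source

combined = combine_predictions(from_target_domain_classifier,
                               from_source_domain_classifier, weight=0.5)
label = predicted_class(combined)
```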

All of the above approaches perform pixel-level mapping. An alternative approach is to perform feature-level mapping. Hong et al. (Hong et al., 2018) use a conditional GAN to learn to make the source features look more like the target features (a distinctly different idea than making the features domain invariant, which was discussed in Section 3.1). They found this particularly helpful for structured domain adaptation (e.g., semantic segmentation, in their case).

Up to this point, these domain mapping methods have used image-to-image translation to map images (or in one case features) from one domain to another and thereby improve domain adaptation performance. Another line of research using pixel-level image generation for domain adaptation is to use a GAN to generate corresponding images in multiple domains and then employ all but the last layer of the discriminator as a feature extractor for a classifier (Liu and Tuzel, 2016; Mao and Li, 2018). Liu et al. (Liu and Tuzel, 2016) train a pair of GANs called CoGAN on two domains of images. Mao et al. (Mao and Li, 2018) propose RegCGAN using only one generator and discriminator but including a domain label prepended to the input noise vector.

3.3. Normalization Statistics

Normalization layers such as batch norm (Ioffe and Szegedy, 2015) are used in most neural networks (Santurkar et al., 2018b). These have benefits including allowing for higher learning rates and thus faster training (Ioffe and Szegedy, 2015), reducing initialization sensitivity (Ioffe and Szegedy, 2015), smoothing the optimization landscape and making the gradients more Lipschitz (Santurkar et al., 2018b), and allowing for deeper networks to converge (Wu and He, 2018; Goodfellow et al., 2016). Each batch norm layer normalizes its input to have zero mean and unit variance. At test time, running averages of the batch norm statistics can be used. Alternatives have been developed including layer norm allowing use in recurrent neural networks (Ba et al., 2016) and group norm removing the dependence on batch size (Wu and He, 2018). However, none of these normalization techniques were developed with domain adaptation in mind. In the case of domain adaptation, the normalization statistics for each domain likely differ. Another line of domain adaptation research involves using per-domain batch normalization statistics.

Li et al. (Li et al., 2018c) assume that the neural net layer weights learn task knowledge and the batch norm statistics learn domain knowledge. If this is the case, then domain adaptation can be performed by modulating all the batch norm layers’ statistics from the source to target domain, a technique they call AdaBN. This has the benefit of being simple, parameter free, and complementary to other adaptation methods.
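The idea of swapping normalization statistics can be illustrated with a small sketch. It assumes a single batch norm layer with a learned scale `gamma` and shift `beta` shared between domains, and toy activation values; the helper names are hypothetical and this is not the authors' code.

```python
def batch_stats(activations):
    """Per-feature mean and (biased) variance over a list of activation vectors."""
    n = len(activations)
    dims = len(activations[0])
    mean = [sum(row[j] for row in activations) / n for j in range(dims)]
    var = [sum((row[j] - mean[j]) ** 2 for row in activations) / n
           for j in range(dims)]
    return mean, var


def normalize(x, mean, var, gamma, beta, eps=1e-5):
    """Batch norm transform for one activation vector, given fixed statistics."""
    return [g * (v - m) / (s + eps) ** 0.5 + b
            for v, m, s, g, b in zip(x, mean, var, gamma, beta)]


# Scale and shift are learned on the source domain and left unchanged.
gamma, beta = [1.0, 1.0], [0.0, 0.0]

# Toy activations entering the layer for each domain (made-up values).
source_acts = [[0.0, 1.0], [2.0, 3.0]]
target_acts = [[10.0, 11.0], [12.0, 15.0]]

source_mean, source_var = batch_stats(source_acts)
target_mean, target_var = batch_stats(target_acts)

# Using source statistics on target data leaves the output far from zero mean.
out_with_source_stats = normalize(target_acts[0], source_mean, source_var, gamma, beta)
# Replacing only the statistics with target ones re-centers the activations.
out_with_target_stats = normalize(target_acts[0], target_mean, target_var, gamma, beta)
```

No weights are updated in this sketch; only the mean and variance change, which is why the approach adds no trainable parameters.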

Carlucci et al. (Carlucci et al., 2017) propose AutoDIAL, a generalization of AdaBN. In AdaBN, the target data are not used to learn the network weights but only for adjusting the batch norm statistics. AutoDIAL can utilize the target data for learning the network weights by coupling network parameters between source and target domains. They do this through adding domain alignment layers (DA-layers) that differ for source and target input data before each of the batch norm layers. Generally, batch norm computes a moving average of the statistics on a batch of the layer’s input data. However, in AutoDIAL, source and target input data to DA-layers are mixed by a learnable amount before feeding this to batch norm (meaning that the batch norm statistics are now computed over some source and some target data rather than just source data or just target data). This allows the network to automatically learn how much alignment is needed at various points in the network.

3.4. Ensemble Methods

Given a base model such as a neural network or decision tree, an ensemble consisting of multiple models can often outperform a single model by averaging together the models’ outputs (e.g., regression) or taking a vote (e.g., classification) (El Habib Daho et al., 2014; Goodfellow et al., 2016). This is because if the models are diverse then each individual model will likely make different mistakes (Goodfellow et al., 2016). However, this performance gain corresponds with an increase in computation cost due to the large number of models to evaluate for each ensemble prediction, making ensembles common for some use cases such as competitions but uncommon when comparing models (Goodfellow et al., 2016). Despite the incurred cost, several ensemble-based methods have been developed for domain adaptation either using the ensemble predictions to guide learning or using the ensemble to measure prediction confidence for pseudo-labeling target data.

3.4.1. Self-Ensembling

An alternative to using multiple instances of a base model as the ensemble is using only a single model but “evaluating” (via a history or average) the models in the ensemble at multiple points in time during training – a technique called self-ensembling. This can be done by averaging over past predictions for each example (by recording previous predictions) (Laine and Aila, 2017) or past network weights (by maintaining a running average) (Tarvainen and Valpola, 2017). Since an ensemble requires diverse models, these self-ensembling approaches require high stochasticity in the networks, which is provided by extensive data augmentation, varying the augmentation parameters, and including dropout. These methods were originally developed for semi-supervised learning.

French et al. (French et al., 2018) modify and extend these prior self-ensembling methods for unsupervised domain adaptation. They use two networks: a student network and a teacher network. Input images are fed first to stochastic data augmentation (Gaussian noise, translations, horizontal flips, affine transforms, etc.) before being input to both networks. Because the method is stochastic, the augmented images fed to the networks will differ. The student network is trained with gradient descent while the teacher network weights are an exponential moving average (EMA) of the student network’s weights. This method outperforms the other methods on the datasets in Table 2. Athiwaratkun et al. (Athiwaratkun et al., 2019) show that in at least one experiment stochastic weight averaging (Izmailov et al., 2018) can further improve these results.
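The teacher update can be sketched in a few lines. In this illustration the weights are flat lists of floats, the "gradient step" is a stand-in that adds a constant, and the decay value is an arbitrary assumption chosen to make the arithmetic easy to follow.

```python
def ema_update(teacher, student, decay=0.99):
    """Move each teacher weight toward the matching student weight."""
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]


student = [0.0, 0.0]
teacher = list(student)  # teacher starts as a copy of the student

for step in range(3):
    # Stand-in for one gradient descent step on the student.
    student = [w + 1.0 for w in student]
    # The teacher is never trained directly; it only tracks the student.
    teacher = ema_update(teacher, student, decay=0.5)
```

With a decay close to one (as is typical for such averages), the teacher changes slowly and therefore aggregates the student's weights over many training steps.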

3.4.2. Pseudo-Labeling

Rather than voting or averaging the outputs of the models in an ensemble, the individual model predictions could be compared to determine the ensemble’s confidence in that prediction. The more models in the ensemble that agree, the higher the ensemble’s confidence in that prediction. In addition, if performing classification on a particular example, an individual model’s confidence can be determined by looking at the last layer’s softmax distribution: uniform indicates uncertainty whereas one class’s probability much higher than the rest indicates higher confidence. Applying this to domain adaptation, a diverse ensemble trained on source data may be used to label target data. Then, if the ensemble is highly confident, those now-labeled target examples can be used to train a classifier for target data.

This is the method Saito et al. (Saito et al., 2017) developed called asymmetric tri-training (ATT). Two networks sharing a feature extractor are trained on the labeled source data (i.e., the ensemble in this case is of size two). Those two networks then predict the labels for the unlabeled target data, and if the two agree on the label and have high enough confidence on a particular instance, then the predicted label for that example is assumed to be the true label. After the target data are labeled by the first two networks, the third network (also sharing the same feature extractor) can be trained using the assumed-true labels (pseudo-labels). Diversity in the ensemble is handled with an additional loss (see Section 4.1).
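A minimal sketch of the agreement-plus-confidence rule follows. The threshold value and the choice to require both classifiers to exceed it are assumptions for illustration; the exact criterion used in ATT may differ.

```python
def argmax(probs):
    return max(range(len(probs)), key=probs.__getitem__)


def pseudo_label(probs_a, probs_b, threshold=0.9):
    """Return a pseudo-label if both classifiers agree and are confident, else None."""
    label_a, label_b = argmax(probs_a), argmax(probs_b)
    agree = label_a == label_b
    # Assumption: both classifiers must reach the (hypothetical) threshold.
    confident = min(probs_a[label_a], probs_b[label_b]) >= threshold
    return label_a if agree and confident else None


# Softmax outputs of the two labeling networks on three unlabeled target examples.
target_predictions = [
    ([0.95, 0.05], [0.92, 0.08]),  # agree and confident
    ([0.95, 0.05], [0.10, 0.90]),  # disagree
    ([0.60, 0.40], [0.55, 0.45]),  # agree but not confident
]
pseudo_labels = [pseudo_label(a, b) for a, b in target_predictions]

# Only the accepted examples would be used to train the third (target) network.
training_set = [(i, y) for i, y in enumerate(pseudo_labels) if y is not None]
```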

Instead of using an ensemble, Zou et al. (Zou et al., 2018) rely on just the softmax distribution for the confidence measure. When working with semantic segmentation, they found relying on the prediction confidence for pseudo-labeling results in transferring primarily easy classes while ignoring harder classes. Thus, they additionally propose adding a class-wise weighting term when pseudo-labeling to normalize the class-wise confidence levels and thus balance out the class distribution.

3.5. Target Discriminative Methods

One assumption that has led to successes in semi-supervised learning algorithms is the cluster assumption (Chapelle and Zien, 2005): that data points are distributed in separate clusters and the samples in each cluster have a common label (Shu et al., 2018). If this is the case, then decision boundaries should lie in low density regions (i.e., should not pass through regions where there are many data points) (Chapelle and Zien, 2005). A variety of domain adaptation methods have been explored to move decision boundaries into regions of lower density. These have typically been trained adversarially.

Shu et al. (Shu et al., 2018) in virtual adversarial domain adaptation (VADA) and Kumar et al. (Kumar et al., 2018) in co-regularized alignment (Co-DA) both use a combination of virtual adversarial training (VAT) developed by Miyato et al. (Miyato et al., 2018) and conditional entropy loss. They are used in combination because VAT without the entropy loss may result in overfitting to the unlabeled data points (Kumar et al., 2018) and the entropy loss without VAT may result in the network not being locally-Lipschitz and thus not resulting in moving the decision boundary away from the data points (Shu et al., 2018). Shu et al. (Shu et al., 2018) additionally propose a decision-boundary iterative refinement step with a teacher (DIRT-T) for use after training to further refine the decision boundaries on the target data, allowing for a slight improvement over VADA. An entropy loss was also used in AutoDIAL (Carlucci et al., 2017) but without VAT.
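The conditional entropy term can be made concrete with a short sketch. It computes the average entropy of the predicted class distributions on a batch of unlabeled target examples; the function name and the example probabilities are assumptions, and the VAT term is not shown.

```python
import math


def conditional_entropy(prob_batch):
    """Mean entropy of the predicted class distributions in a batch."""
    total = 0.0
    for probs in prob_batch:
        # Terms with zero probability contribute nothing to the entropy.
        total -= sum(p * math.log(p) for p in probs if p > 0.0)
    return total / len(prob_batch)


# A prediction near the decision boundary has high entropy ...
uncertain = conditional_entropy([[0.5, 0.5]])
# ... while a prediction far from the boundary has low entropy.
confident = conditional_entropy([[0.99, 0.01]])
```

Minimizing this quantity on target data rewards confident predictions, which is what pushes decision boundaries away from dense regions when the classifier is also kept locally smooth.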

In generative adversarial guided learning (GAGL), Wei et al. (Wei and Hsu, 2018) propose to let a GAN move decision boundaries into lower-density regions. Using domain alignment methods that learn domain-invariant features like DANN (Section 3.1), typically the data fed to the feature extractor is either source or target data. However, Wei et al. propose to alternate this with feeding generated (fake) images and appending a “fake” label to the task classifier, thus repurposing the task classifier as a GAN discriminator. They found this to have the effect of moving the decision boundaries in the target domain into areas of lower density with a GAN, promoting target-discriminative features as a result.

Saito et al. (Saito et al., 2018a) propose adversarial dropout regularization. Since dropout is stochastic, when they create two instances of the task classifier containing dropout, the resulting networks may produce different predictions. The difference between these predictions can be viewed as a discriminator. Using this discriminator to adversarially train the feature extractor has the effect of producing target discriminative features. Lee et al. (Lee et al., 2019b) alter adversarial dropout to better handle convolutional layers by dropping channel-wise rather than element-wise.

3.6. Combinations

In recent work, researchers have proposed various combinations of the above methods. Domain mapping has been combined with domain-invariant feature learning methods either trained separately (in GraspGAN (Bousmalis et al., 2018)) or jointly (in CyCADA (Hoffman et al., 2018b)). Following AdaBN, many researchers started employing domain-specific batch normalization (Bousmalis et al., 2018; French et al., 2018; Li, 2018; Kumar et al., 2018; Kang et al., 2019). Kumar et al. (Kumar et al., 2018) propose co-regularized alignment (Co-DA), an approach in which two separate adversarial domain-invariant feature networks are learned with different feature spaces, drawing on ensemble-based methods. Kang et al. (Kang et al., 2018) combine domain mapping with aligning the models’ attention by minimizing an attention-based discrepancy. Deng et al. (Deng et al., 2019) combine target discriminative methods with self-ensembling. Lee et al. (Lee et al., 2019a) combine target discriminative methods and domain-invariant feature learning with a sliced Wasserstein metric.

Multi-adversarial domain adaptation (MADA) (Pei et al., 2018) combines adversarial domain-invariant feature learning with ensemble methods for the purpose of better handling multi-modal data. This is accomplished by incorporating a separate discriminator for each class and using the task classifier’s softmax probability to weight the loss from each discriminator for unlabeled target samples.

Saito et al. (Saito et al., 2018b) combine elements of adversarial domain-invariant feature learning, ensemble methods, and target discriminative features in their maximum classifier discrepancy (MCD) method. They propose using a shared feature extractor followed by an ensemble (of size two) of task-specific classifiers, where the discrepancy between predictions measures how far outside the support of the source domain the target samples lie. The discriminator in this setup is the combination of the two classifiers. The feature extractor is trained to minimize the discrepancy (i.e., to fool the classifiers into treating the samples as if they were from the source domain) while the classifiers are trained to maximize the discrepancy on the target samples.
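The opposing objectives can be sketched with a simple discrepancy measure. The mean absolute difference between the two classifiers' class probabilities is used here as an assumed choice of discrepancy, and the probabilities are made up; the sketch shows only the sign of each term, not a training loop.

```python
def discrepancy(probs_1, probs_2):
    """Mean absolute difference between two class-probability vectors."""
    return sum(abs(a - b) for a, b in zip(probs_1, probs_2)) / len(probs_1)


def batch_discrepancy(batch_1, batch_2):
    """Average discrepancy over a batch of target samples."""
    pairs = list(zip(batch_1, batch_2))
    return sum(discrepancy(p, q) for p, q in pairs) / len(pairs)


# Classifiers roughly agree: the target sample looks like source data.
near = batch_discrepancy([[0.9, 0.1]], [[0.8, 0.2]])
# Classifiers disagree: the target sample is likely outside the source support.
far = batch_discrepancy([[0.9, 0.1]], [[0.2, 0.8]])

# With the feature extractor fixed, the classifiers maximize the discrepancy
# (written as minimizing its negative); the feature extractor then minimizes it.
classifier_objective = -far
extractor_objective = far
```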

4. Components

Table 1 summarizes the neural network-based domain adaptation methods we discuss, showing the components each method uses: the type of adaptation, the loss functions, whether the method uses a generator, and which weights are shared. Below we discuss each of these aspects followed by how the networks are trained, what types of networks can be used, multi-level adaptation techniques, and how to tune the hyperparameters of these methods.

4.1. Losses

4.1.1. Distance

Distance functions play a variety of roles in domain adaptation losses. A distance loss can be used to align two distributions by minimizing a distance function (e.g., MMD) as explained in Section 3.1. If using an ensemble, minimizing a distance function can align the outputs of the ensemble’s models: an L1 loss of the difference in predicted target class probabilities from two networks in Co-DA (Kumar et al., 2018) or a squared difference between the predictions of the student and teacher networks in self-ensembling (French et al., 2018). (Note the squared difference loss is confidence thresholded, i.e., if the max predicted output is below a certain threshold then the squared difference loss is set to zero.)
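The confidence-thresholded squared difference can be sketched as below. Two details are assumptions for illustration: the threshold is applied to the teacher's maximum predicted probability, and the squared difference is averaged over classes.

```python
def thresholded_consistency(student_probs, teacher_probs, threshold=0.9):
    """Squared difference between predictions, zeroed when confidence is low."""
    # Assumption: confidence is the teacher's largest class probability.
    if max(teacher_probs) < threshold:
        return 0.0
    return sum((s - t) ** 2
               for s, t in zip(student_probs, teacher_probs)) / len(student_probs)


# Confident teacher: the student is penalized for differing from it.
loss_confident = thresholded_consistency([0.6, 0.4], [0.95, 0.05])
# Unconfident teacher: the example contributes nothing to the loss.
loss_masked = thresholded_consistency([0.6, 0.4], [0.7, 0.3])
```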

Some of the described methods have been altered by replacing the task loss with a similarity-based loss. Laradji et al. (Laradji and Babanezhad, 2018) propose M-ADDA, a metric-learning modification to ADDA but with the goal of maximizing the margin between clusters of data points’ embeddings. Based on DANN, Pinheiro (Pinheiro, 2018) proposes SimNet, classifying based on how close an embedding is to the embeddings of a random subset of source images for each class. Hsu et al. (Hsu et al., 2018) propose $\text{CCN}^{++}$ incorporating a pairwise similarity network (trained so that examples of the same class are similar and examples of different classes are dissimilar).

4.1.2. Promote Differences

Methods that rely on multiple networks learning different features (such as to make an ensemble diverse) do so by promoting differences between the networks. Saito et al. (Saito et al., 2017) train the two classifiers labeling unlabeled data to use different features by adding a norm of the product of the two classifiers’ weights. Bousmalis et al. (Bousmalis et al., 2016) promote different features between two private feature extractors with a soft subspace orthogonality constraint, which is similarly used by Liu et al. (Liu et al., 2017) for text classification. Kumar et al. (Kumar et al., 2018) train the feature extractors to be different by pushing minibatch means apart. Saito et al. (Saito et al., 2018b) maximize the discrepancy between two classifiers using a fixed, shared feature extractor to promote using different features.

4.1.3. Cycle Consistency / Reconstruction

A cycle consistency loss or reconstruction loss is commonly used in domain mapping methods to avoid requiring a dataset of corresponding images to be available in both domains. This is how CycleGAN (Zhu et al., 2017), DualGAN (Yi et al., 2017), and DiscoGAN (Kim et al., 2017) can be unsupervised. This means that after translating an image from one domain (e.g., horses) to another (e.g., zebras), the new image can be translated back to reconstruct the original image, as illustrated in Figure 6(a). Some variants of this have been proposed such as an L1 loss with a transformation function (e.g., identity, image derivatives, mean of color channels) (Shrivastava et al., 2017), a feature-level cycle-consistency loss (mapping from source to embedding to target then back to embedding resulting in the same embeddings) (Royer et al., 2017), or using the loss in one (Choi et al., 2018) or both directions (Royer et al., 2017; Hoffman et al., 2018b). Sener et al. (Sener et al., 2016) enforce cycle consistency in their $k$ -nearest neighbors ( $k$ -NN) approach by requiring the distance between any source and target point labeled the same to be less than the distance between any source and target point labeled differently and derive a rule they can solve with stochastic gradient descent.
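The basic cycle consistency loss can be illustrated with toy mappings. Here images are short lists of pixel values and the two "generators" are hand-written functions standing in for learned networks; the names and the use of a mean L1 distance are assumptions for the example.

```python
def l1_distance(a, b):
    """Mean absolute difference between two equally sized images."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def cycle_loss(image, forward, backward):
    """Translate to the other domain and back, then compare with the original."""
    return l1_distance(backward(forward(image)), image)


def to_target(image):
    """Toy source-to-target mapping (stand-in for a learned generator)."""
    return [2.0 * p + 1.0 for p in image]


def to_source_consistent(image):
    """Toy target-to-source mapping that inverts to_target exactly."""
    return [(p - 1.0) / 2.0 for p in image]


def to_source_inconsistent(image):
    """Toy target-to-source mapping that does not invert to_target."""
    return [p / 2.0 for p in image]


x = [0.0, 0.5, 1.0]
consistent = cycle_loss(x, to_target, to_source_consistent)
inconsistent = cycle_loss(x, to_target, to_source_inconsistent)
```

Because the loss compares an image only with its own reconstruction, no paired examples from the two domains are needed.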

4.1.4. Semantic Consistency

A semantic consistency loss can be used to preserve class assignments as illustrated in Figure 6(b) (a segmentation example). The semantic consistency loss requires that a classifier output (or semantic segmentation labeling) from the original source image is the same as the same classifier’s output on the pixel-level mapped target output.

4.1.5. Task

Nearly all of the domain adaptation methods include some form of task loss that helps the network learn to perform the desired task. For example, for classification, the goal is to output the ground truth source label, or for semantic segmentation, to label each pixel with the correct ground truth source label. The task loss used is generally a cross-entropy loss, or more specifically the negative log likelihood of a softmax distribution (Goodfellow et al., 2016) when using a softmax output layer. The exceptions not including a task loss are SimNet (Pinheiro, 2018), which classifies based on distance to prototypes of each class, the work by Sener et al. (Sener et al., 2016), which uses $k$ nearest neighbors, and AdaBN (Li et al., 2018c), which only adjusts the batch norm layers to the target domain. In addition, the image-to-image translation methods are application agnostic unless trained jointly for domain adaptation.
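For classification, this task loss can be written out explicitly. The notation here is introduced only for illustration: $n_s$ labeled source examples $(x_i, y_i)$, $K$ classes, and $z_k(x)$ the network's pre-softmax output (logit) for class $k$.

$$\mathcal{L}_{\text{task}} = -\frac{1}{n_s} \sum_{i=1}^{n_s} \log \frac{\exp\left(z_{y_i}(x_i)\right)}{\sum_{k=1}^{K} \exp\left(z_k(x_i)\right)}$$

As a worked example with $K = 2$, a single source example with logits $(2, 0)$ whose true label is the first class has softmax probability $1 / (1 + e^{-2}) \approx 0.881$ for the correct class, giving a loss of $\log(1 + e^{-2}) \approx 0.127$.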

4.1.6. Adversarial

A variety of methods use a discriminator (or critic) for learning domain-invariant features, realistic image generation, or promoting target discriminative features by forcing a network (either a feature extractor or generator) to produce outputs indistinguishable between two domains (source and target or real and fake). This loss is different than the other losses discussed in this section because this adversarial loss is learned (Goodfellow, 2016; Isola et al., 2017) (where learning is more than a hyperparameter search) rather than being provided as a predefined function. During training, gradients from the discriminator are used to train the feature extractor or generator (e.g., negated by a gradient reversal layer, Section 3.1.3). This alternates with updating the discriminator itself to make the correct domain classification.
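The role of a gradient reversal layer in this alternation can be sketched without a deep learning framework. The class below is a hypothetical stand-in: the forward pass is the identity, and the backward pass negates (and optionally scales) the gradient arriving from the discriminator before it reaches the feature extractor. The `scale` parameter is an assumption for illustration.

```python
class GradientReversal:
    """Identity in the forward pass; negated gradient in the backward pass."""

    def __init__(self, scale=1.0):
        # Hypothetical factor controlling how strongly the reversed gradient is applied.
        self.scale = scale

    def forward(self, features):
        # Features pass through unchanged on their way to the discriminator.
        return list(features)

    def backward(self, grad_from_discriminator):
        # The feature extractor receives the opposite of the discriminator's gradient,
        # so a step that helps the discriminator hurts domain separability upstream.
        return [-self.scale * g for g in grad_from_discriminator]


layer = GradientReversal(scale=1.0)
features_out = layer.forward([1.0, -2.0])
grad_to_extractor = layer.backward([0.5, -1.0])
```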

4.1.7. Additions for Specific Problems

Some research focusing on specific problems has resulted in additional losses. For semantic segmentation, Li et al. (Li et al., 2018a) develop a loss making segmentation boundaries sharper to help when the mapped image-to-image translation images will be used for segmentation, Chen et al. (Chen et al., 2018b) develop a distillation loss in addition to performing location-aware alignment (e.g., “road” is usually at the bottom of each image), Hoffman et al. (Hoffman et al., 2016) develop a class-aware constrained multiple instance loss, Zhang et al. (Zhang et al., 2017b) develop a curriculum where after learning some high-level properties on easy tasks the segmentation network is forced to follow those properties (interpretations include student-teacher setup or posterior regularization), and Perone et al. (Perone et al., 2018) apply the self-ensembling method (French et al., 2018) replacing the cross-entropy loss with a consistency loss. For object detection, Chen et al. (Chen et al., 2018c) use two domain classifiers (one on an image-level representation and the other on an instance-level representation) with a consistency regularization between them. For adaptation from synthetic images where it is known which pixels are foreground in the source images, Bousmalis et al. (Bousmalis et al., 2017) and Bak et al. (Bak et al., 2018) mask certain losses to only penalize foreground pixel differences. For person re-identification, Wei et al. (Wei et al., 2018) include a person identity-keeping constraint in their domain mapping GAN.

4.2. Low-Confidence or Low-Relevance Rejection

Given a measure of confidence, performance may increase if we can reject data points for training the target classifier that are not of sufficient confidence. This, of course, assumes our confidence measurement is accurate enough. Saito et al. (Saito et al., 2017) used the label agreement of an ensemble combined with the softmax distribution output (uniform is not confident, one probability much higher than the rest is confident). Sener et al. (Sener et al., 2016) used the label agreement of the $k$ nearest source data points. If the confidence is too low, then the example is rejected and not used in training unless a later re-evaluation finds it to be sufficiently confident. Inoue et al. (Inoue et al., 2018) used an object detector’s prediction probability as a measure of confidence, only using high-confidence detections for fine-tuning an object detection network. Similarly, a rejection approach could be used if we have a measure of relevance. For text classification, Zhang et al. (Zhang et al., 2017a) weight examples by their relevance to their target aspect based on a small set of positive and negative keywords (a form of weak supervision).

4.3. Weight Sharing

Methods employ different amounts of sharing network weights between domains or regularizing the weights to be similar. Most methods completely share weights between the feature extractors used on the source and target domains (as shown in Table 1). However, some techniques do not. Since deep networks consist of many layers, allowing them to represent hierarchical features, Long et al. (Long et al., 2015) propose copying the lower layers from a network trained on the source domain and adapting higher layers to the target domain with MK-MMD since higher layers do not transfer well between domains. In CoGAN, Liu et al. (Liu and Tuzel, 2016) share the first few layers of the generators and the last few layers of the discriminators, making the assumption that the domains share high-level representations. In AdaBN, Li et al. (Li et al., 2018c) assume domain knowledge is stored in the batch norm statistics, so they share all weights except for the batch norm statistics. French et al. (French et al., 2018) define the teacher network as an exponential moving average of the student network’s weights (a type of ensemble). Instead of sharing weights, Rozantsev et al. (Rozantsev et al., 2019; Rozantsev et al., 2018) propose two variants: regularizing weights to be similar but not penalizing linear transformations and transforming the weights from the source network to the target network with small residual networks. Bousmalis et al. (Bousmalis et al., 2016) propose domain separation networks (DSN): learning source-specific, target-specific, and shared features where the “shared” source domain encoder and “shared” target domain encoder do share weights, but the “private” source domain encoder and “private” target domain encoders do not. Others have similarly explored this idea of shared vs. specific features (Liu et al., 2017; Ren et al., 2018; Cao et al., 2018a).

4.4. Training Stages

Some have trained networks for domain adaptation in stages. Tzeng et al. (Tzeng et al., 2017) train a source classifier first followed by adaptation. Taigman et al. (Taigman et al., 2016) use a pre-trained encoder during adaptation. Bousmalis et al. (Bousmalis et al., 2018) in GraspGAN first train the domain-mapping network followed by the domain-adversarial network. Hoffman et al. (Hoffman et al., 2018b) in CyCADA train their many components in stages because it would not all fit into GPU memory at once.

Other methods train the domain adaptation networks jointly, which for adversarial approaches is done by alternating between training the discriminator and the rest of the networks (Sections 2.2 and 3.1.3). However, variations exist for some other methods. Saito et al. (Saito et al., 2017) in ATT cycle through training the source networks, generating pseudo-labels, and training the target network. Zou et al. (Zou et al., 2018) alternate between pseudo-labeling the target data and re-training the model using the labels (a form of self-training). Wei et al. (Wei and Hsu, 2018) in GAGL alternate between feeding in real source and target data and the fake images generated by a GAN. Sener et al. (Sener et al., 2016) alternate between $k$ -nearest neighbors and performing gradient descent.

4.5. Multi-Level

Some adaptation methods perform adaptation at more than one level. As discussed in Section 3.6, GraspGAN (Bousmalis et al., 2018) and CyCADA (Hoffman et al., 2018b) perform pixel-level adaptation with domain mapping and feature-level adaptation with domain-invariant feature learning. Hoffman et al. (Hoffman et al., 2018b) found that performing both levels of adaptation significantly improves accuracy: using domain mapping to capture low-level image domain shifts and learning domain-invariant features to handle larger domain shifts than what pure domain mapping methods can support. Following this idea, Tsai et al. (Tsai et al., 2018) make semantic segmentation predictions and perform domain-invariant feature learning at multiple levels in their semantic segmentation network, and Zhang et al. (Zhang et al., 2018d) perform domain-invariant feature learning at multiple levels while automatically learning how much to align to each level. Chen et al. (Chen et al., 2018c) perform domain-invariant feature learning at both image and instance levels for object detection but also include a consistency regularization between the two domain classifiers.

4.6. Types of Networks

Nearly all of the surveyed approaches focus on learning from image data and use convolutional neural networks (CNNs) such as ResNet-50 or Inception (Table 3). Wang et al. (Wang et al., 2019) explore the use of attention networks, Kang et al. (Kang et al., 2018) a combination of CNNs and attention, Ma et al. (Ma et al., 2019) graph convolutional networks, and Kurmi et al. (Kurmi et al., 2019) Bayesian neural networks. In the case of time-series data, Purushotham et al. (Purushotham et al., 2017) propose instead using a variational recurrent neural network (RNN) (Chung et al., 2015) or LSTM (a type of RNN) (Hochreiter and Schmidhuber, 1997) rather than a CNN. The RNN learns the temporal relationships while adversarial training is used to achieve domain adaptation. For text classification (a type of natural language processing), Liu et al. (Liu et al., 2017) also use LSTMs while Zhang et al. (Zhang et al., 2017a) found a CNN to work just as well as RNNs or bi-LSTMs in their experiments. For relation extraction (another type of natural language processing), Fu et al. (Fu et al., 2017) also use a CNN. For time-series speech recognition, Zhao et al. (Zhao et al., 2017c) use bi-LSTMs while Hosseini-Asl et al. (Hosseini-Asl et al., 2019) use a combination of CNNs and RNNs. In the related problem of domain generalization, a combination of CNNs and RNNs has been used for handling a radio spectrogram changing through time to identify sleep stages (Zhao et al., 2017b).

4.7. Hyperparameter Tuning

Normal supervised learning-based hyperparameter tuning methods do not carry over to unsupervised domain adaptation (Long et al., 2013, 2016; Ganin et al., 2016; Bousmalis et al., 2016; Wang et al., 2018; Perone et al., 2018; Morerio et al., 2018). A common supervised learning approach is to split the training data into a smaller training set and a validation set. After repeatedly altering the hyperparameters, retraining the model, and testing on this validation set for each set of hyperparameters, the model yielding the highest validation set accuracy is selected. Another option is cross validation. However, in unsupervised domain adaptation, there are now two domains, and the data for the target domain may not include any labels. When evaluating domain adaptation approaches on common datasets, generally the target data does contain labels, so some groups (Bousmalis et al., 2016; Russo et al., 2018; Wang et al., 2018; Wei and Hsu, 2018; Carlucci et al., 2017; Kumar et al., 2018; Shu et al., 2018) do use some labeled target data (or all of it (Long et al., 2013; Shen et al., 2018)) for hyperparameter tuning, which can be interpreted as an upper bound on how well the method could perform (Wang et al., 2018). For example, some (Long et al., 2016; Carlucci et al., 2017) tuned for Office on one labeled $W$ example per class on the $A \rightarrow W$ task, while others (Russo et al., 2018; Wei and Hsu, 2018) tuned with a validation set of 1000 randomly sampled target examples. Using any labeled target data is not ideal because real-world testing will not include labels for tuning (unless it is semi-supervised, in which case semi-supervised learning is recommended in Section 6).

One tuning method not requiring labeled target data is reverse validation (Ganin et al., 2016), which is a variant of reverse cross validation (Zhong et al., 2010). For a set of hyperparameters, the reverse validation risk can be estimated by first splitting source (labeled) and target (unlabeled) data into training and validation sets. Then, the labeled source and unlabeled target data are used to learn a classifier (as is normally done). Next, this forward classifier is used to label the target data and a new reverse classifier is learned (with the same algorithm) using the pseudo-labeled target data (as “source”) and unlabeled source data (as “target”, i.e., ignoring the known labels). This reverse classifier is evaluated on the source validation data to measure the reverse validation risk. Ganin et al. (Ganin et al., 2016) found this method works better if the reverse classifier is initialized with the weights of the forward classifier and if using early stopping on the source validation set and a pseudo-labeled target validation set. Finally, hyperparameters are selected (e.g., grid search, random search, Bayesian optimization, or other gradient-free optimization methods such as those implemented in Nevergrad (Rapin and Teytaud, 2018)) that minimize this reverse validation risk.
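The reverse validation procedure can be sketched end to end on toy data. Everything specific here is an assumption for illustration: the data are one-dimensional, the "domain adaptation learner" is a nearest-centroid classifier whose centroids are shifted by `shift_scale` times the gap between the unlabeled and labeled means, and `shift_scale` is the hyperparameter being tuned by grid search. The target validation set, weight initialization from the forward classifier, and early stopping are omitted.

```python
def fit(labeled, unlabeled, shift_scale):
    """Toy stand-in for a domain adaptation learner (not a real method)."""
    xs = [x for x, _ in labeled]
    gap = sum(unlabeled) / len(unlabeled) - sum(xs) / len(xs)
    centroids = {}
    for label in {y for _, y in labeled}:
        members = [x for x, y in labeled if y == label]
        centroids[label] = sum(members) / len(members) + shift_scale * gap
    return centroids


def predict(centroids, x):
    return min(centroids, key=lambda label: abs(x - centroids[label]))


def reverse_validation_risk(src_train, src_val, tgt_train, shift_scale):
    # Forward classifier: labeled source, unlabeled target.
    forward = fit(src_train, tgt_train, shift_scale)
    # Pseudo-label the target data with the forward classifier.
    pseudo_target = [(x, predict(forward, x)) for x in tgt_train]
    # Reverse classifier: pseudo-labeled target as "source", source as unlabeled "target".
    reverse = fit(pseudo_target, [x for x, _ in src_train], shift_scale)
    # Evaluate the reverse classifier on held-out labeled source data.
    errors = sum(predict(reverse, x) != y for x, y in src_val)
    return errors / len(src_val)


# Source classes sit near 0 and 2; the target domain is shifted by about +3.
src_train = [(-0.2, 0), (0.0, 0), (0.2, 0), (1.8, 1), (2.0, 1), (2.2, 1)]
src_val = [(0.1, 0), (-0.1, 0), (1.9, 1), (2.1, 1)]
tgt_train = [2.8, 3.0, 3.2, 4.8, 5.0, 5.2]

# Grid search over the hypothetical hyperparameter; no target labels are used.
risks = {s: reverse_validation_risk(src_train, src_val, tgt_train, s)
         for s in (0.0, 1.0)}
best_shift_scale = min(risks, key=risks.get)
```

In this toy setting, the setting that fails to adapt pseudo-labels every target point with the same class, so its reverse classifier cannot recover the source labels and its reverse validation risk is high.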

Alternatively, given some domain knowledge, one may devise relevant measures of similarity between the domains and tune parameters to increase the similarity. For example, French et al. (French et al., 2018) were able to improve performance on the challenging problem of MNIST $\rightarrow$ SVHN by tuning data augmentation hyperparameters for MNIST to match pixel intensities apparent in the SVHN dataset. By doing this, they were able to improve the state-of-the-art to 97.0% (Table 2).

5. Results

Tables 2 through 5 summarize the results of evaluating many of these methods on datasets used for image classification as well as sentiment analysis. Care must be taken in the extent to which conclusions are drawn from comparing published numbers in different papers since the provided accuracies are for different network architectures, hyperparameters, amount of data augmentation, random initializations (or averages over a number of them), etc. and the methods may perform differently in other application areas. However, interestingly, at least one method in each of the surveyed categories gives promising results on at least one of the datasets.

Using domain-invariant feature learning with the contrastive domain discrepancy, CAN (Kang et al., 2019) has the highest performance on the Office dataset (Table 3). By using adversarial domain-invariant feature learning, WDGRL generally outperforms the other methods on the Amazon review dataset (Table 4) and Generate to Adapt is second highest of the methods evaluated on the Office dataset. By using adversarial pixel-level domain mapping, SBADA-GAN (Russo et al., 2018) obtains the highest accuracy on MNIST $\rightarrow$ MNIST-M (Table 2). AutoDIAL (Carlucci et al., 2017), a normalization statistics method, performs on par with CAN and Generate to Adapt in two of the Office adaptation tasks. The self-ensembling method by French et al. (French et al., 2018) outperforms all other methods on the datasets in Table 2, and Co-DA (Kumar et al., 2018) comes close using an ensemble (of size two) of adversarial domain-invariant feature networks. CyCADA increases accuracy from 54% to 82% for a synthetic season adaptation dataset (Hoffman et al., 2018b) by combining both adversarial domain-invariant feature learning and domain mapping.

A number of these promising methods use adversarial techniques, which may be a key ingredient in solving domain adaptation problems. Adversarial approaches may be helpful on certain datasets (e.g., WDGRL on the Amazon review dataset), certain types of data (e.g., VRADA was developed for time series data rather than image data), or may not require as extensive tuning (e.g., Co-DA on MNIST $\rightarrow$ SVHN). Alternatively, adversarial training may be an additional tool to incorporate into existing non-adversarial methods. For instance, promising non-adversarial methods such as AutoDIAL and that of French et al. could be combined with adversarial methods (see Section 8.3). In fact, Long et al. (Long et al., 2017) develop both JAN and the adversarial version JAN-A, and JAN-A on average outperformed JAN on the Office dataset. CAN (Kang et al., 2019), which presently is the highest on the Office dataset, might also be improved by incorporating an adversarial component as in Long et al. (Long et al., 2017).

Interestingly, French et al. by far outperform all other methods on MNIST $\rightarrow$ SVHN, though this requires problem-specific data augmentation and hyperparameter tuning. This may indicate that for some problems, perhaps in particular the more challenging domain adaptation problems, hyperparameter tuning for a specific dataset may be of utmost importance. Possibly, if other domain adaptation methods were similarly tuned, they would also experience large improvements. This is an area of research requiring further work (see Section 8.2). However, Co-DA (Kumar et al., 2018) is not far behind on SVHN $\rightarrow$ MNIST and MNIST $\rightarrow$ MNIST-M and is the closest on MNIST $\rightarrow$ SVHN, achieving 81.7% compared with 97.0%. A great advantage of Co-DA is that it does not require the highly-problem-specific tuning on MNIST $\rightarrow$ SVHN required by French et al. (without which they achieved only 37.5%). Possibly some components of Co-DA, such as the adversarial domain adaptation or virtual adversarial training, may be partially responsible for the decrease in hyperparameter sensitivity.

6. Theory

Having surveyed domain adaptation methods, we now address the question of when adaptation may be beneficial. Ben-David et al. (Ben-David et al., 2010) develop a theory answering this in terms of an ideal predictor on both domains, Zhao et al. (Zhao et al., 2019) further this theory by removing the dependence on a joint ideal predictor while focusing on domain-invariant feature learning methods, and Le et al. (Le et al., 2018) develop theory looking beyond domain-invariant methods. These theoretical results can help answer two questions: (1) when will a classifier (or other predictor) trained on the source data perform well on the target data, and (2) given a small number of labeled target examples, how can they best be used during training to minimize target test error?

Answering the first question, labeled source data and unlabeled target data are both required (unsupervised). Answering the second question, additionally some labeled target data are required (semi-supervised). We will first review the theoretical bounds followed by a discussion of what insights these bounds provide into answering the above two questions. Ben-David et al. (Ben-David et al., 2010) also address the case of multiple source domains, as do Mansour et al. (Mansour et al., 2009). In this paper, we have focused on the cases containing only one source and one target (as is common in the methods we survey).

6.1. Unsupervised

6.1.1. Shared Hypothesis Space

Ben-David et al. (Ben-David et al., 2010) propose bounding the target error in terms of the source error and the divergence between the source and target domains. The empirical source error is easy to obtain by first training and then testing a classifier. However, the divergence between the domains cannot be directly obtained with standard methods like Kullback-Leibler divergence due to only having a finite number of samples from the domains and not assuming any particular distribution. Thus, an alternative is to measure it using a classifier-induced divergence called $\mathcal{H}\Delta\mathcal{H}$-divergence. Estimates of this divergence from finite samples converge to the real $\mathcal{H}\Delta\mathcal{H}$-divergence. This divergence can be estimated by measuring the error when getting a classifier to discriminate between the unlabeled source and target examples, though it is often intractable to find the theoretically-required divergence upper bound. Using the empirical source error $\hat{\epsilon}_{S}(h)$, the $\mathcal{H}\Delta\mathcal{H}$-divergence between source and target samples $d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{\hat{D}}_{S},\mathcal{\hat{D}}_{T})$, and ideal predictor error $\lambda^{*}$ using the optimal hypothesis for the source and target, the target error $\epsilon_{T}(h)$ can be bounded as shown in Equation 2 (using the form given by Zhao et al. (Zhao et al., 2019)), $\forall h\in\mathcal{H}$ with probability at least $1-\delta$ for $\delta\in(0,1)$.

[TABLE]
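The idea of estimating a divergence from a domain discriminator's error can be illustrated with the commonly used proxy $\mathcal{A}$-distance, $\hat{d}_{\mathcal{A}}=2(1-2\epsilon)$, where $\epsilon$ is the held-out error of a classifier trained to tell source from target. The sketch below is an approximation under stated assumptions, not the exact $\mathcal{H}\Delta\mathcal{H}$-divergence: the hypothesis class is a 1-D threshold classifier, and the function names and data are invented for illustration.

```python
def best_stump(xs, domains):
    """Fit a 1-D threshold classifier separating domain 0 (source) from domain 1 (target)."""
    best = None
    for t in sorted(set(xs)):
        for sign in (1, -1):
            # Training error of the rule "predict target if sign * (x - t) > 0".
            err = sum((sign * (x - t) > 0) != bool(d) for x, d in zip(xs, domains)) / len(xs)
            if best is None or err < best[0]:
                best = (err, t, sign)
    _, t, sign = best
    return lambda x: int(sign * (x - t) > 0)


def proxy_a_distance(src_train, tgt_train, src_test, tgt_test):
    # Train the domain discriminator on unlabeled source and target samples.
    clf = best_stump(src_train + tgt_train, [0] * len(src_train) + [1] * len(tgt_train))
    # Measure its error on held-out samples from both domains.
    test = [(x, 0) for x in src_test] + [(x, 1) for x in tgt_test]
    err = sum(clf(x) != d for x, d in test) / len(test)
    # Low discriminator error means the domains are easy to tell apart (large distance).
    return 2.0 * (1.0 - 2.0 * err)


# Well-separated domains: the discriminator is perfect, so the proxy distance is maximal.
separated = proxy_a_distance([0.0, 0.1, 0.2, 0.3], [5.0, 5.1, 5.2, 5.3],
                             [0.05, 0.25], [5.05, 5.25])
# Interleaved domains: the discriminator does no better than chance on held-out data.
overlapping = proxy_a_distance([0.0, 1.0, 2.0, 3.0], [0.5, 1.5, 2.5, 3.5],
                               [0.25, 2.25], [1.25, 3.25])
```

In practice the discriminator would operate on learned feature representations rather than raw 1-D inputs, and the estimate depends on how expressive the chosen discriminator is.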

Zhao et al. (Zhao et al., 2019) develop another upper bound that removes the reliance on $\lambda^{*}$ . Let $\mathcal{H}\subseteq[0,1]^{\mathcal{X}}$ , $\mathcal{\tilde{H}}\coloneqq\{\text{sgn}\left(|h(x)-h^{\prime}(x)|-t\right)|h,h^{\prime}\in\mathcal{H},0\leq t\leq 1\}$ , $\langle\mathcal{D}_{S},f_{S}\rangle$ and $\langle\mathcal{D}_{T},f_{T}\rangle$ be the source and target domains (the true distributions, not empirical). The target error can then be bounded by the source error $\epsilon_{S}(h)$ , the discrepancy between marginal distributions $d_{\mathcal{\tilde{H}}}(\mathcal{D}_{S},\mathcal{D}_{T})$ , and the distance between the optimal source and target labeling functions $\forall h\in\mathcal{H}$ , as shown in Equation 3.

[TABLE]

Zhao et al. (Zhao et al., 2019) also develop an information-theoretic lower bound for target error. Let the labeling function $Y=f(X)\in\{0,1\}$ , the prediction function $\hat{Y}=h(g(X))\in\{0,1\}$ , and $Z$ be the intermediate representation output by a shared feature extractor used on source and target domain data. If the Jensen-Shannon distance $d_{JS}(\mathcal{D}_{S}^{Y},\mathcal{D}_{T}^{Y})\geq d_{JS}(\mathcal{D}_{S}^{Z},\mathcal{D}_{T}^{Z})$ and the Markov chain $X\xrightarrow{g}Z\xrightarrow{h}\hat{Y}$ holds, then Equation 4 provides a lower bound on the source and target error.

[TABLE]
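To make the precondition on the Jensen-Shannon distances concrete, the following sketch computes $d_{JS}$ between two discrete distributions, here the source and target label marginals. It assumes the Jensen-Shannon distance is the square root of the Jensen-Shannon divergence and uses base-2 logarithms so that the distance lies in $[0,1]$; the example label proportions are invented.

```python
from math import log2, sqrt


def kl_divergence(p, q):
    """Kullback-Leibler divergence (base 2) between discrete distributions p and q."""
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_distance(p, q):
    """Jensen-Shannon distance: square root of the Jensen-Shannon divergence."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    divergence = 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
    return sqrt(max(0.0, divergence))  # guard against tiny negative rounding errors


# Invented binary label marginals [P(Y=0), P(Y=1)] for source and target.
d_y = js_distance([0.9, 0.1], [0.3, 0.7])

# If a feature extractor aligned the representations perfectly, d_JS over Z would be 0,
# so the precondition d_JS(Y) >= d_JS(Z) would hold whenever the label marginals differ.
d_z = 0.0
precondition_holds = d_y >= d_z
```

The further apart the label marginals are relative to the aligned representations, the more the lower bound constrains what a shared feature extractor can achieve.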

6.1.2. Different Hypothesis Spaces

Le et al. (Le et al., 2018) develop an upper bound that allows for different hypothesis spaces for source and target functions, possibly non-deterministic labeling, and any bounded or continuous loss. If $l$ is a bounded or continuous loss, $x\sim\mathbb{P}^{s}$ (source) and $x\sim\mathbb{P}^{t}$ (target), $T:\mathcal{X}^{s}\rightarrow\mathcal{X}^{t}$ and $K\coloneqq T^{-1}$ (bijective mapping), $R(\theta)=\mathbb{E}_{p(x,y)}[l(y,h_{\theta}(x))]$ for $\theta$ parameterizing a hypothesis set $\mathcal{H}=\{h_{\theta}|\theta\in\Theta\}$ , $\Delta R(h^{s},h^{t})\coloneqq|R^{t}(h^{t})-R^{s}(h^{s})|$ , $y\in\{-1,1\}$ , $M$ is the number of labels, $\mathbb{P}^{\#}\coloneqq K_{\#}\mathbb{P}^{t}$ is the pushforward probability distribution transporting $\mathbb{P}^{t}$ via $K$ , $\Delta p(y|x)\coloneqq p^{t}(y|T(x))-p^{s}(y|x)$ for the true source and target labeling functions $p^{s}(y|x)$ and $p^{t}(y|x)$ , where ${WS}_{c}(\mathbb{P}^{s},\mathbb{P}^{\#})$ denotes the Wasserstein-1 distance between the source and target distributions with a cost function $c(x,x^{\prime})=1_{x\neq x^{\prime}}$ (1 if $x\neq x^{\prime}$ , otherwise 0), then Equation 5 provides an upper bound for the variance between a general loss on the source and target predictions.

[TABLE]

6.2. Semi-Supervised

In the semi-supervised case, a linear combination of the source and target errors is computed (Ben-David et al., 2010), called the $\alpha$ -error. A bound can be calculated on the true $\alpha$ -error based on the empirical $\alpha$ -error. Finding the minimum $\alpha$ -error depends on the empirical $\alpha$ -error, the divergence between source and target, and the number of labeled source and target examples. Experimentation can be used to empirically determine the values of $\alpha$ that will perform well. Ben-David et al. (Ben-David et al., 2010) also demonstrate the process on sentiment classification, illustrating that the optimum uses non-trivial values.

The bound is given in Equation 6. If $S$ is a labeled sample of size $m$ with $(1-\beta)m$ points drawn from the source distribution and $\beta m$ from the target distribution, then with probability at least $1-\delta$ for $\delta\in(0,1)$:

[TABLE]

Here, $\hat{h}\in\mathcal{H}$ is the empirical minimizer of the $\alpha$-error on $S$ given by $\hat{\epsilon}_{\alpha}(h)=\alpha\hat{\epsilon}_{T}(h)+(1-\alpha)\hat{\epsilon}_{S}(h)$ and $h^{*}_{T}=\operatorname*{arg\,min}_{h\in\mathcal{H}}\epsilon_{T}(h)$ is the target error minimizer.
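As a small worked example of the empirical $\alpha$-error, the sketch below evaluates $\hat{\epsilon}_{\alpha}(h)$ for two hypothetical hypotheses and selects the empirical minimizer for several values of $\alpha$. The hypothesis names and error values are invented for illustration.

```python
def alpha_error(alpha, err_target, err_source):
    """Empirical alpha-error: alpha * target error + (1 - alpha) * source error."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * err_target + (1.0 - alpha) * err_source


# Hypothetical empirical (target error, source error) pairs for two candidate hypotheses.
candidates = {
    "h1": (0.30, 0.05),  # fits the source well, the target poorly
    "h2": (0.10, 0.20),  # fits the target well, the source less so
}


def empirical_minimizer(alpha):
    """Return the candidate with the smallest empirical alpha-error."""
    return min(candidates, key=lambda name: alpha_error(alpha, *candidates[name]))


# alpha = 0 uses only the source error, alpha = 1 only the target error.
choices = {alpha: empirical_minimizer(alpha) for alpha in (0.0, 0.5, 1.0)}
```

With these invented numbers, weighting only the source error selects `h1`, while an equal weighting or a target-only weighting selects `h2`, which shows how the choice of $\alpha$ changes which hypothesis is preferred.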

The optimum $\alpha$ is then:

[TABLE]

Here, $m_{S}=(1-\beta)m$ is the number of source examples, $m_{T}=\beta m$ is the number of target examples, $D=\sqrt{d}/A$ , and

[TABLE]

6.3. Discussion

6.3.1. Unsupervised

Equation 2 indicates that if the optimal predictor error $\lambda^{*}$ on both source and target data is large, then there is no good hypothesis from training on the source domain that will work well on the target domain (Ben-David et al., 2010; Zhao et al., 2019). However, as is more common in the application of domain adaptation, if $\lambda^{*}$ is small, then the bound depends on the source error and the $\mathcal{H}\Delta\mathcal{H}$ -divergence (Ben-David et al., 2010). The domain-invariant feature learning methods discussed in Section 3.1 try minimizing these two terms (Zhao et al., 2019): the source error via a task loss on labeled source data and divergence via a divergence measure such as MMD, with reconstruction, or adversarially. While Section 5 shows that on many datasets these methods work, there is no guarantee that such adaptation will increase performance (these are upper bounds), as shown by simple counterexamples (Zhao et al., 2019). It may actually decrease performance if the marginal label distributions differ significantly between source and target (Zhao et al., 2019).

Equation 3 shows that the target error upper bound alternatively involves the marginal distributions and Equation 4 shows that the lower bound does too. These indicate the importance of aligning the label distributions. If the marginal label distributions are significantly different, then minimizing the source error and divergence between feature representations will actually increase the error (Zhao et al., 2019). Thus over-training domain-invariant feature learning methods can increase target error, and Zhao et al. (Zhao et al., 2019) experimentally verified this. They found on MNIST, USPS, and SVHN adaptation that during training the target accuracy would initially rise rapidly but would eventually decrease again despite increasing source accuracy, an effect even more apparent with larger differences in the marginal label distributions. It is an open problem as to when the label distributions can be aligned without target labels (Zhao et al., 2019).

6.3.2. Semi-Supervised

Equation 6 indicates that when only source or target data are available, that data should be used (as we might expect). If the source and target are the same, then $\alpha^{*}=\beta$, which implies a uniform weighting of examples. Given enough target data, source data should not be used at all because it might increase the test-time error. Furthermore, without enough source data, using it may also not be worthwhile, i.e., $\alpha^{*}\approx 1$ since $\alpha$ is the weight on the target error (Ben-David et al., 2010). In this paper we focus on unsupervised domain adaptation, but these are important considerations if target labels can be obtained. For example, this shows that it may be better to perform semi-supervised adaptation if some labeled target examples are available rather than using the labeled target examples to tune the hyperparameters of an unsupervised adaptation method.

7. Applications

Domain adaptation has been applied in a variety of areas including computer vision, natural language processing, and time-series analysis. Using domain adaptation in these various problems can save the human time that would be spent labeling the target data. In some cases, such as image semantic segmentation, providing ground truth is very labor intensive. Each pixel-level annotated image in the Cityscapes dataset took on average 1.5 hours to complete (Cordts et al., 2016). In addition, methods similar to those described in this paper have been applied to the related problem of domain generalization and to some other problems as well.

7.1. Computer Vision

Most of the methods surveyed in this paper are for computer vision tasks such as adapting a model trained on synthetic images to real photos (e.g., from synthetic numbers or signs, Table 2), stock photos to real photos (e.g., Amazon to DSLR on the Office dataset, Table 3), or simple to complex images (e.g., MNIST to SVHN, Table 2). Others have been used in robotics for robot grasping (Bousmalis et al., 2018), autonomous navigation (Yoo et al., 2017), and lifelong learning (Wulfmeier et al., 2018), for semantic segmentation (Chen et al., 2018b; Luo et al., 2018; Lee et al., 2019c; Vu et al., 2018; Huang et al., 2018a; Zou et al., 2018; Hong et al., 2018; Sankaranarayanan et al., 2018b; Tsai et al., 2018) including when additional information is available from a simulator (Lee et al., 2019c), in a medical context for chest X-ray segmentation (Chen et al., 2018a), 3D CT scans to X-ray segmentation (Zhang et al., 2018c), MRI to CT scan segmentation (Chen et al., 2019a), and MRI segmentation (Perone et al., 2018), in low resource situations (where there are very few target data points) (Hosseini-Asl et al., 2019), in situations with different label sets for each domain (Sohn et al., 2019), for object detection (Inoue et al., 2018; Chen et al., 2018c; Hoffman et al., 2016), for person re-identification (Ganin et al., 2016; Deng et al., 2018; Bak et al., 2018; Wei et al., 2018; Zhong et al., 2018, 2019; Li et al., 2018d), and for depth estimation (Nath Kundu et al., 2018; Atapour-Abarghouei and Breckon, 2018; Mahmood et al., 2018).

7.2. Natural Language Processing

Domain adaptation has been used in natural language processing for tasks such as sentiment analysis (Table 4; Zhang et al., 2017a; Zhao et al., 2017c), other text classification (Liu et al., 2017; Zhang et al., 2017a) including weakly-supervised aspect-transfer from one aspect of a dataset to another (Zhang et al., 2017a), relation extraction (Fu et al., 2017), semi-supervised sequence labeling (Daumé III, 2007), semi-supervised question answering (Yang et al., 2017), sentence specificity (Ko et al., 2018), and neural machine translation (Chu and Wang, 2018; Britz et al., 2017; Chen et al., 2017).

7.3. Time Series

For time-series data, domain adaptation has been used for learning temporal latent relationships in health data across different population age groups (Purushotham et al., 2017), to perform speech recognition (Zhao et al., 2017c; Shinohara, 2016; Hosseini-Asl et al., 2019), for predicting driving maneuvers (Tonutti et al., 2019), anomaly detection (Vercruyssen et al., 2017), and inertial tracking (Chen et al., 2019c). In a method addressing the related problem of domain generalization, time-series radio data was used for sleep-stage classification (Zhao et al., 2017b). Finally, a combination of pre-training and fine-tuning was used to solve another transfer learning problem, where the source datasets have a different label space than the target dataset (Ismail Fawaz et al., 2018).

7.4. Domain Generalization

Domain-invariant feature learning approaches similar to those discussed in Section 3.1 have been used for the related problem of domain generalization, where there are multiple source domains and an unseen target domain (Blanchard et al., 2011; Muandet et al., 2013). Zhao et al. (Zhao et al., 2017b) use an adversarial approach with a domain classifier to learn a model on a dataset collected from a number of people sleeping in various environments that will generalize well to new people and/or new environments (e.g., sleeping in a different room). Ghifary et al. (Ghifary et al., 2015) use a reconstruction approach with a denoising autoencoder to improve object recognition generalizability, where the “noise” is different views (domains) of the data (e.g., rotation, change in size, or variation in lighting) and the autoencoder tries to reconstruct corresponding views of the object in other domains. Carlucci et al. (Carlucci et al., 2018) propose an adversarial approach combining domain adaptation and generalization while also doing domain mapping. Akuzawa et al. (Akuzawa et al., 2019) note the domain-invariance objective may compete with the discriminative objective and thus develop a method to find the most domain-invariant representation that does not hurt classification performance. Li et al. (Li et al., 2018b) note that previous domain-invariant methods typically assume balanced classes and develop a method to handle changes in class proportions.

7.5. Other Problems

Adversarial losses like those used in adversarial domain adaptation methods have also been applied in multiple other settings. Wang et al. (Wang et al., 2017b) created an adversarial spatial dropout network to add occlusions to images to improve the accuracy of object detection algorithms. They also created an adversarial spatial transformer network to add deformations such as rotations to objects to again increase object detection accuracy. Pinto et al. (Pinto et al., 2017) used adversarial agents to improve a robot’s ability to grasp an object via self-supervised learning by employing both shaking and snatching adversaries. Gui et al. (Gui et al., 2018) used an adversarial loss to predict and demonstrate (i.e., robot will copy) human motion. Rippel et al. (Rippel and Bourdev, 2017; Rippel et al., 2018) used a reconstruction and adversarial loss with an autoencoder for learning higher quality image compression at low bit rates. Sinclair (Sinclair, 2018) applied adversarial loss to clone a physical model for real-time sound synthesis. Adversarial techniques may also be applied to machine learning security, where the goal is to train a classifier robust to adversarial examples (Huang et al., 2011; Miyato et al., 2018).

8. Research Directions

As we have seen, the rapidly-growing body of research focused on unsupervised deep domain adaptation now encompasses many novel methods and components. Here we look at what could be explored in future research to further enhance this existing work.

8.1. Bi-Directional Adaptation

The more difficult domain adaptation problems are far from being solved. Tables 2 through 5 indicate that some domain adaptation problems are harder than others and point to the challenge that more work needs to be focused on these harder problems. While accuracy for SVHN $\rightarrow$ MNIST ranges from 70.7% to 99.3%, for the reverse case of MNIST $\rightarrow$ SVHN, the highest without highly-problem-specific hyperparameter tuning is 81.7% by Kumar et al. (Kumar et al., 2018) (though tuned on a small amount of labeled target data). This indicates how this reverse problem is much harder (Ganin and Lempitsky, 2015; French et al., 2018). As a result, few papers offer results for this direction. French et al. (French et al., 2018) were able to vastly improve performance up to 97.0%; however, this required developing a problem-specific unsupervised hyperparameter tuning method. Other methods may similarly benefit from such tuning. Continued work is needed to strengthen general-purpose bi-directional adaptation.

8.2. Hyperparameter Tuning

Some methods such as reverse validation and problem-specific pixel intensity matching have been applied to hyperparameter tuning without requiring target labels (Section 4.7). While the reverse validation method appears promising, it was not used in most of the methods surveyed (only (Ganin et al., 2016; Pinheiro, 2018; Pei et al., 2018)). This may be because of the increase in computation cost (Perone et al., 2018) or problems with the reverse validation accuracy not aligning with test accuracy (Bousmalis et al., 2016). It is also possible that researchers are simply unaware of the method, since few of the surveyed papers mention the idea (only (Bousmalis et al., 2016; Perone et al., 2018; Ganin et al., 2016; Pinheiro, 2018; Pei et al., 2018)). Problem-specific methods such as matching pixel intensity between domains as done by French et al. (French et al., 2018) are possible given some domain knowledge, but hyperparameter tuning methodologies should be developed that will work across a wider range of problems. This remains an open area of research.

8.3. Combining Promising Methods

French et al. (French et al., 2018), Co-DA (Kumar et al., 2018), CAN (Kang et al., 2019), AutoDIAL (Carlucci et al., 2017), Generate to Adapt (Sankaranarayanan et al., 2018a), and WDGRL (Shen et al., 2018) are promising approaches based on Tables 2 through 4. French et al. use a student and teacher network for self-ensembling, Co-DA trains multiple (e.g., two) adaptation networks while requiring diversity and agreement in addition to incorporating virtual adversarial training, CAN alternates between clustering and adaptation through minimizing intra-class discrepancy and maximizing inter-class margin, AutoDIAL adjusts batch normalization layer weights, Generate to Adapt uses an embedding-conditional GAN for adversarial domain adaptation, and WDGRL performs adversarial domain adaptation similar to DANN by using a domain classifier. These are largely independent ideas that, if combined, may result in additional performance gains.

For instance, the student network in French et al. that accepts either a source or target augmented image could be replaced by the AutoDIAL network to learn how much adaptation to perform at each level of the network. Or to combine with adversarial methods, the student and teacher networks’ outputs (or an intermediate layer’s outputs, as is being explored by Wang et al. (Wang et al., 2018)) could be fed to a gradient reversal layer followed by a domain classifier, in effect adding an adversarial loss term to the existing two terms used by French et al. Or since French et al. is based upon data augmentation, one might try replacing the existing stochastic data augmentation with a GAN since a GAN can be used for data augmentation (given enough unlabeled training data).

Alternatively, key aspects of other methods could be incorporated. While domain adaptation methods commonly align feature distributions, a different line of research aligns the joint or conditional distribution of the feature and label spaces instead (Long et al., 2018, 2017; Courty et al., 2017; Damodaran et al., 2018; Tang and Jia, 2019; Yu et al., 2019; Ma et al., 2019; Cicek and Soatto, 2019). Researchers found aligning in this manner improves results when handling multi-modal data distributions (Long et al., 2018) or when label proportions differ between domains (Courty et al., 2017). Other domain adaptation strategies may similarly benefit from aligning the joint or conditional distribution rather than merely the feature distribution.

8.4. Balancing Classes

In order to obtain high accuracy on the challenging problem of MNIST $\rightarrow$ SVHN, French et al. (French et al., 2018) include an additional class-balance term in their loss function, which both improved training stability and helped the network avoid a degenerate local minimum. However, this term was not required in their other experiments. Class balancing can thus be an important concern, although its importance depends on the dataset being used. Other methods may similarly benefit from balancing classes.

For instance, Hoffman et al. (Hoffman et al., 2018b) note that the frequency-weighted intersection over union results in their paper were very close to the target-only model accuracy (an approximate upper bound). Thus, they conclude that domain mapping followed by domain-invariant feature learning is very effective for the common classes in the SYNTHIA dataset (season adaptation on a synthetic driving dataset). It is possible then that additional balancing of classes could help the not-as-common classes to perform better. In addition, data augmentation through occluding parts of the images may improve class balancing as would the adversarial spatial dropout network by Wang et al. (Wang et al., 2017b) since the two best classes (road and sky) were likely in almost every image.
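To see why a high frequency-weighted intersection over union (IoU) can coexist with poor performance on uncommon classes, the sketch below computes per-class IoU, mean IoU, and frequency-weighted IoU from a confusion matrix using the standard semantic segmentation definitions. The class names and pixel counts are invented for illustration and are not results from Hoffman et al.

```python
def iou_scores(conf):
    """Per-class IoU, mean IoU, and frequency-weighted IoU from a confusion matrix.

    conf[i][j] is the number of pixels of true class i predicted as class j.
    """
    n = len(conf)
    total = sum(sum(row) for row in conf)
    per_class, freqs = [], []
    for i in range(n):
        true_i = sum(conf[i])                       # pixels whose true class is i
        pred_i = sum(conf[j][i] for j in range(n))  # pixels predicted as class i
        union = true_i + pred_i - conf[i][i]
        per_class.append(conf[i][i] / union if union else 0.0)
        freqs.append(true_i / total)
    mean_iou = sum(per_class) / n
    # Each class's IoU is weighted by how often the class occurs in the ground truth.
    fw_iou = sum(f * s for f, s in zip(freqs, per_class))
    return per_class, mean_iou, fw_iou


# Invented counts for three classes: road (very common), sky (common), sign (rare).
conf = [
    [900, 0, 0],  # road: all pixels correct
    [0, 90, 0],   # sky: all pixels correct
    [8, 0, 2],    # sign: most pixels mislabeled as road
]
per_class, mean_iou, fw_iou = iou_scores(conf)
```

Here the rare class has an IoU of only 0.2, which pulls the mean IoU down to roughly 0.73, while the frequency-weighted IoU stays above 0.98 because the rare class contributes only 1% of the pixels.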

8.5. Incorporating Improved Image-to-Image Translation Methods

Bousmalis et al. (Bousmalis et al., 2017) had difficulty applying their method, PixelDA, to large domain differences. However, other image-to-image translation methods like XGAN (Royer et al., 2017) have been developed that may support larger domain shifts. These methods could be extended to domain adaptation either directly or by also incorporating a semantic consistency loss (as explained in Section 4.1). This may allow for more substantial differences between domains. Similarly, image-to-image translation methods like StarGAN (Choi et al., 2018) have been developed for multiple domains, which could be extended for multi-domain adaptation.

8.6. Further Experimental Comparison Between Methods

As shown in Table 2, French et al. (French et al., 2018) outperform all the other methods and Co-DA (Kumar et al., 2018) is quite close behind (with the advantage that it does not require highly-problem-specific tuning on MNIST $\rightarrow$ SVHN). In Table 3, CAN (Kang et al., 2019) outperforms the others, followed by Generate to Adapt (Sankaranarayanan et al., 2018a). Finally, in Table 4, WDGRL (Shen et al., 2018) generally performs the best. However, these methods are not all compared on the same dataset, making a direct comparison difficult. Additional experiments must be performed to see how these methods compare. Similarly, other promising approaches may outperform these methods on some datasets, which could be determined through additional experiments.

These comparisons can be made easier through developing a unified implementation of these various methods. Schneider et al. (Schneider et al., 2018) are developing such an open-source set of implementations of state-of-the-art domain adaptation (and domain generalization) methods. The results provided in individual papers have different hyperparameters, data augmentation, network architectures, etc. that can make direct comparisons challenging. Using a unified implementation of these methods can facilitate more clearly understanding what aspects of a method are responsible for performance gains and also support combining the novel elements from multiple methods.

8.7. Limitations of Datasets

Varying amounts of source and target data are available in different situations. The datasets used for comparisons (the image datasets listed in Table 5 and the Amazon review dataset) are relatively small when compared with the sizes of datasets commonly in use in deep learning, e.g., ImageNet (Deng et al., 2009; Russakovsky et al., 2015) (though ImageNet is often used to pretrain adaptation networks). For example, Sankaranarayanan et al. (Sankaranarayanan et al., 2018a) note how GANs require a lot of training data. This may prevent GAN-based methods from being used on source or target datasets that are too small. Modifications may need to be developed for such low resource situations, an area explored by Hosseini-Asl et al. (Hosseini-Asl et al., 2019). Additionally, most domain adaptation datasets are for computer vision. To spur research in other application areas, other datasets could be created.

8.8. Other Applications

Other application areas may benefit from performing domain adaptation as have those discussed in Section 7. In particular, only a few methods were applied to time-series data. One time-series application that may benefit from adaptation is activity prediction, e.g., adapting from one type of sensor to another or from one person’s data to another’s. Some added challenges in this context may be the large differences in feature spaces due to the wide variety of sensors used (e.g., an event stream of fixed motion sensors turning on and off in a smart home vs. sampled motion and location data collected from smart phones or watches) or the difference in labels (e.g., one model may learn a “walk” activity while another learns “exercise” or may learn “read” while another model learns “school”). Applying domain adaptation in new areas may yield novel methods or components applicable in other areas as well.

8.9. Other Domain Adaptation Cases

As mentioned in Section 3, we have surveyed single-source homogeneous unsupervised domain adaptation methods due to this being the most commonly-studied case of domain adaptation. However, exploring other cases is warranted. By utilizing data from multiple source domains and/or multiple target domains, additional gains in performance may be achievable. By handling heterogeneous feature spaces or various other levels of supervision (e.g., semi-supervised learning (Saito et al., 2019) or weakly-supervised learning (Shu et al., 2019)), domain adaptation may bring performance gains to other problems as well. Finally, another under-studied case of domain adaptation is partial domain adaptation, where the target domain contains only a subset of the source domain’s labels (Cao et al., 2018b; Zhang et al., 2018a; Tang and Jia, 2019).

9. Conclusions

For supervised learning, deep neural networks are in prevalent use, but these networks require large labeled datasets for training. Unsupervised domain adaptation can be used to adapt deep networks to possibly-smaller datasets that may not even have target labels. Several categories of methods have been developed for this goal: domain-invariant feature learning, domain mapping, normalization statistics-based, and ensemble-based methods. These various methods have some unique and common elements as we have discussed. Additionally, theoretical results provide some insight into empirical observations. Some methods appear very promising, but further research is required for direct comparisons, novel method combinations, improved bi-directional adaptation, and use for novel datasets and applications.

Acknowledgements.

This material is based upon work supported by the National Science Foundation (https://www.nsf.gov/) under Grant Nos. 1543656 and 1734558.

Bibliography

Ajakan et al. (2014) Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, and Mario Marchand. 2014. Domain-adversarial neural networks. arXiv preprint arXiv:1412.4446 (2014).
Akuzawa et al. (2019) Kei Akuzawa, Yusuke Iwasawa, and Yutaka Matsuo. 2019. Adversarial Invariant Feature Learning with Accuracy Constraint for Domain Generalization. arXiv preprint arXiv:1904.12543 (2019).
Alami Mejjati et al. (2018) Youssef Alami Mejjati, Christian Richardt, James Tompkin, Darren Cosker, and Kwang In Kim. 2018. Unsupervised Attention-guided Image-to-Image Translation. In Advances in Neural Information Processing Systems 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.). Curran Associates, Inc., 3693–3703. http://papers.nips.cc/paper/7627-unsupervised-attention-guided-image-to-image-translation.pdf
Anoosheh et al. (2018) Asha Anoosheh, Eirikur Agustsson, Radu Timofte, and Luc Van Gool. 2018. ComboGAN: Unrestrained Scalability for Image Domain Translation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops.
Arjovsky et al. (2017) Martin Arjovsky, Soumith Chintala, and Léon Bottou. 2017. Wasserstein Generative Adversarial Networks. In Proceedings of the 34th International Conference on Machine Learning (Proceedings of Machine Learning Research), Doina Precup and Yee Whye Teh (Eds.), Vol. 70. PMLR, 214–223. http://proceedings.mlr.press/v70/arjovsky17a.html
Arora et al. (2017) Sanjeev Arora, Rong Ge, Yingyu Liang, Tengyu Ma, and Yi Zhang. 2017. Generalization and Equilibrium in Generative Adversarial Nets (GANs). In Proceedings of the 34th International Conference on Machine Learning (Proceedings of Machine Learning Research), Doina Precup and Yee Whye Teh (Eds.), Vol. 70. PMLR, 224–232. http://proceedings.mlr.press/v70/arora17a.html
Arora et al. (2018) Sanjeev Arora, Andrej Risteski, and Yi Zhang. 2018. Do GANs learn the distribution? Some Theory and Empirics. In International Conference on Learning Representations. https://openreview.net/forum?id=BJehNfW0-