Domain Generalization via Universal Non-volume Preserving Models

Thanh-Dat Truong; Chi Nhan Duong; Khoa Luu; Minh-Triet Tran; Ngan Le

arXiv:1905.13040·cs.CV·April 15, 2020


TL;DR

This paper introduces a novel deep learning approach for domain generalization that enhances recognition accuracy across unseen domains without requiring model updates or fine-tuning.

Contribution

It proposes a universal non-volume preserving model that improves domain generalization in deep learning, applicable to various recognition tasks and datasets.

Findings

01. Consistently improves recognition accuracy across multiple datasets.

02. Easily integrated with existing CNN frameworks.

03. Effective in digit, face, and pedestrian recognition tasks.


Tables (6)

Table 1. TABLE I: Comparison of the properties of our proposed approaches (UNVP and E-UNVP) and other recent methods, where ✗ represents not applicable properties. Abbreviations: Gaussian Mixture Model (GMM), Probabilistic Graphical Model (PGM), Convolutional Neural Network (CNN), Adversarial Loss ($\ell_{adv}$), Log Likelihood Loss ($\ell_{LL}$), Cycle Consistency Loss ($\ell_{cyc}$), Discrepancy Loss ($\ell_{dis}$) and Cross-Entropy Loss ($\ell_{CE}$).

Method	Domain Modality	Architecture	Loss Function	End-to-End / Target-domain sample-free / Target-domain label-free (surviving marks, in order)	Deployable Domains
FT [10]	Transfer Learning	CNN	$\ell_{2}$	✓ ✗	Two
UBM [11]	Adaptation	GMM	$\ell_{LL}$	✗ ✓	Any
DANN [1]	Adaptation	CNN	$\ell_{adv}$	✓ ✗ ✓	Two
CoGAN [9]	Adaptation	CNN+GAN	$\ell_{adv}$	✓ ✗ ✓	Two
I2IAdapt [12]	Adaptation	CNN+GAN	$\ell_{adv}+\ell_{cyc}$	✓ ✗ ✓	Two
ADDA [13]	Adaptation	CNN+GAN	$\ell_{adv}$	✓ ✗ ✓	Two
MCD [14]	Adaptation	CNN+GAN	$\ell_{adv}+\ell_{dis}$	✓ ✗ ✓	Two
CrossGrad [15]	Generalization	Bayesian Net	$\ell_{CE}$	✓	Any
ADA [16]	Generalization	CNN	$\ell_{CE}$	✓	Any
Our UNVP	Generalization	PGM+CNN	$\ell_{LL}+\ell_{CE}$	✓	Any
Our E-UNVP	Generalization	PGM+CNN	$\ell_{LL}+\ell_{CE}$	✓	Any

Table 2. TABLE II: Ablative experiment results (%) on the effectiveness of the parameters $\lambda$, $\alpha$ and $\beta$ that control the distribution separation and shifting range. MNIST is used as the only training set; MNIST-M is used as the unseen testing set.

Dataset	Methods	$\lambda=0.01$	$\lambda=0.1$	$\lambda=1.0$	$\alpha=0.01$	$\alpha=0.1$	$\alpha=1.0$	$\beta=0\%$	$\beta=1\%$	$\beta=10\%$	$\beta=20\%$	$\beta=30\%$
MNIST	Pure-CNN	99.28
	UNVP	$-$	$-$	$-$	99.33	99.18	99.30	99.28	99.28	99.35	99.30	99.36
	E-UNVP	99.22	99.42	99.40	99.13	99.31	99.42	99.28	99.36	99.34	99.42	99.43
MNIST-M	Pure-CNN	55.90
	UNVP	$-$	$-$	$-$	58.18	60.76	59.44	55.90	59.99	57.24	59.44	55.11
	E-UNVP	59.83	60.49	59.47	56.92	61.70	60.49	55.90	57.10	60.49	61.70	60.49

Table 3. TABLE III: Experimental results (%) when using UNVP and E-UNVP in various common CNNs.

Networks	Methods	MNIST	MNIST-M
LeNet	Pure-CNN	99.06	55.90
	UNVP	99.30	59.44
	E-UNVP	99.42	61.70
AlexNet	Pure CNN	99.17	40.12
	UNVP	98.81	39.94
	E-UNVP	98.89	40.60
VGG	Pure CNN	99.43	50.67
	UNVP	99.42	54.41
	E-UNVP	99.40	51.37
ResNet	Pure CNN	98.01	35.35
	UNVP	98.82	37.15
	E-UNVP	98.97	40.60
DenseNet	Pure CNN	99.23	41.16
	UNVP	99.42	41.98
	E-UNVP	99.14	43.72

Table 4. TABLE IV: Results (%) on three digit datasets. ADA and ours do not require target data in training. ADDA and DANN require data from the target domains during training.

Methods	MNIST	SVHN	MNIST-M
ADDA	99.29	32.20	63.39
DANN	$-$	$-$	76.66
Pure-CNN	99.06	31.96	55.90
ADA	99.17	37.87	60.02
UNVP	99.30	41.23	59.45
E-UNVP	99.42	42.87	61.70

Table 5. TABLE V: Results (%) on the Extended Yale-B [27], CMU-PIE [28] and CMU-MPIE [29] databases. ADA and ours do not require target domain data during training, while ADDA does.

Method	E-Yale-B (N)	E-Yale-B (D)	CMU-PIE (N)	CMU-PIE (D)	CMU-MPIE (N)	CMU-MPIE (D)
ADDA	99.17	75.28	96.09	70.33	99.93	97.71
Pure-CNN	98.50	51.39	95.59	62.18	99.93	94.74
ADA	99.00	53.08	96.49	62.69	99.92	96.08
UNVP	99.17	58.24	96.32	64.88	99.83	98.25
E-UNVP	99.54	62.95	97.55	66.89	99.93	98.03

Table 6. TABLE VI: Results (%) on RGB and Thermal pedestrian databases with various common deep network structures.

Networks	Methods	RGB	Thermal
LeNet	Pure-CNN	95.45	79.72
LeNet	E-UNVP	97.25	90.29
AlexNet	Pure CNN	96.64	81.38
AlexNet	E-UNVP	97.04	82.98
VGG	Pure CNN	97.54	95.60
VGG	E-UNVP	98.64	98.38
ResNet	Pure CNN	98.52	96.07
ResNet	E-UNVP	98.56	98.35
DenseNet	Pure CNN	98.39	95.87
DenseNet	E-UNVP	98.60	96.14

Equations

$$p_{X}(\mathbf{x},y;\theta)=p_{Z}(\mathbf{z},y;\theta)\left|\frac{\partial\mathcal{F}(\mathbf{z},y;\theta)}{\partial\mathbf{x}}\right| \tag{1}$$

$$\log p_{X}(\mathbf{x},y;\theta)=\log p_{Z}(\mathbf{z},y;\theta)+\log\left|\frac{\partial\mathcal{F}(\mathbf{z},y;\theta)}{\partial\mathbf{x}}\right| \tag{2}$$

$$\mathcal{F}=f_{1}\circ f_{2}\circ\ldots\circ f_{N} \tag{3}$$

$$f(\mathbf{x})=\mathbf{b}\odot\mathbf{x}+(1-\mathbf{b})\odot\left[\mathbf{x}\odot\exp\left(\mathcal{S}(\mathbf{b}\odot\mathbf{x})\right)+\mathcal{T}(\mathbf{b}\odot\mathbf{x})\right] \tag{4}$$

$$\theta^{*}=\arg\max_{\theta}\sum_{c}\sum_{i}\log p_{X}(\mathbf{x}^{i},c;\theta) \tag{5}$$

$$\arg\min_{\theta_{1}}\sup_{P:\,d(P_{X},P_{X}^{src})\leq\rho}\mathbb{E}\left[\ell(\mathbf{X},\mathbf{Y};\mathcal{M},\mathcal{F},\theta,\theta_{1})\right] \tag{6}$$

$$d(P_{X},P_{X}^{src})=d(P_{Z},P_{Z}^{src})=\sum_{c}\inf_{\mathbf{x}_{c},\mathbf{x}_{c}^{src}}\mathbb{E}\left[cost(\mathcal{F}(\mathbf{x}_{c}),\mathcal{F}(\mathbf{x}_{c}^{src}))\right]=\sum_{c}\inf_{\mathbf{z}_{c},\mathbf{z}_{c}^{src}}\mathbb{E}\left[cost(\mathbf{z}_{c},\mathbf{z}_{c}^{src})\right]$$

$$cost^{2}(\mathbf{z}_{c},\mathbf{z}_{c}^{src})=\sum_{c}\|\bm{\mu}_{c}^{src}-\bm{\mu}_{c}\|_{2}^{2}+\mathrm{Tr}\left(\Sigma_{c}^{src}+\Sigma_{c}-2\left((\Sigma_{c}^{src})^{1/2}\Sigma_{c}(\Sigma_{c}^{src})^{1/2}\right)^{1/2}\right) \tag{7}$$

$$\begin{aligned}&\arg\min_{\theta_{1}}\sup_{P}\mathbb{E}\left[\ell(\mathbf{X},\mathbf{Y};\mathcal{M},\mathcal{F},\theta,\theta_{1})\right]-\alpha\cdot d(P_{X},P_{X}^{src})\\=\;&\arg\min_{\theta_{1}}\sum_{c}\sup_{\mathbf{x}}\left\{\ell(\mathbf{x},c;\mathcal{M},\mathcal{F},\theta,\theta_{1})-\alpha\cdot cost(\mathcal{F}(\mathbf{x}),\mathcal{F}(\mathbf{x}_{c}^{src}))\right\}\end{aligned} \tag{8}$$

$$\mathbf{x}=\arg\max_{\mathbf{x}}\left\{\ell(\mathbf{x},c;\mathcal{M},\mathcal{F},\theta,\theta_{1})-\alpha\cdot cost(\mathcal{F}(\mathbf{x}),\mathcal{F}(\mathbf{x}_{c}^{src}))\right\} \tag{9}$$

$$\ell(\mathbf{X},\mathbf{Y};\mathcal{M},\mathcal{F},\theta,\theta_{1})=\ell_{CE}(\mathcal{M}(\mathbf{X};\theta_{1}),\mathbf{Y})-\log p_{X}(\mathbf{X},\mathbf{Y};\theta)$$

$$cost^{2}(\mathbf{z}_{c},\mathbf{z}_{c}^{src})=\sum_{c}\|\bm{\mu}_{c}^{src}-\bm{\mu}_{c}\|_{2}^{2}+\mathrm{Tr}\left(\Sigma_{c}^{src}+\Sigma_{c}-2\left((\Sigma_{c}^{src})^{1/2}\Sigma_{c}(\Sigma_{c}^{src})^{1/2}\right)^{1/2}\right)+\|\mathcal{M}(\mathbf{X}_{c})-\mathcal{M}(\mathbf{X}_{c}^{src})\|_{2}^{2}$$

$$\bm{\mu}_{c}=\gamma\,\mathcal{G}_{m}(c)+\lambda\,\mathcal{H}_{m}(\mathbf{n}),\qquad\Sigma_{c}=\mathcal{G}_{std}(c) \tag{10}$$


Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques

Full text

Domain Generalization via Universal Non-volume Preserving Approach

Dat T. Truong (1,3,4), Chi Nhan Duong (2), Khoa Luu (1), Minh-Triet Tran (3,4), Ngan Le (1)

1 University of Arkansas, USA

2 Concordia University, Canada

3 University of Science, Ho Chi Minh city, Vietnam

4 Vietnam National University, Ho Chi Minh city, Vietnam

{tt032, khoaluu, thile}@uark.edu, [email protected], [email protected]

Abstract

Recognition across domains has recently become an active topic in the research community. However, the problem of recognition in new unseen domains has been largely overlooked. Under this condition, the delivered deep network models cannot be updated, adapted, or fine-tuned. Therefore, recent deep learning techniques, such as domain adaptation, feature transferring, and fine-tuning, cannot be applied. This paper presents a novel approach to the problem of domain generalization in the context of deep learning. The proposed method (source code will be publicly available) is evaluated on different datasets in various problems, i.e. (i) digit recognition on MNIST, SVHN, and MNIST-M, (ii) face recognition on Extended Yale-B, CMU-PIE and CMU-MPIE, and (iii) pedestrian recognition on RGB and Thermal image datasets. The experimental results show that our proposed method consistently improves performance accuracy. It can also be easily incorporated into other CNN frameworks within an end-to-end deep network design for object detection and recognition problems to improve their performance.

I Introduction

Deep learning-based detection and recognition methods have recently achieved very accurate performance in visual applications. However, many such methods assume that the testing images come from the same distribution as the training ones, and they often fail when deployed in new unseen domains. Indeed, detection and classification across domains have recently become active topics in the research community. In particular, domain adaptation [1] [2] has received significant attention in computer vision. In domain adaptation (Fig. 1(A)), we usually have a large-scale training set with labels, i.e., the source domain A, and a small training set with or without labels, i.e., the target domain B. The knowledge from the source domain A is learned and adapted to the target domain B. At test time, the trained model is deployed only in the target domain B. Recent results in domain adaptation have shown significant improvements in many computer vision applications. However, in real-world applications, the trained models are potentially deployed not only in the target domain B but also in many other new unseen domains, e.g., C, D, etc. (Fig. 1(B)). In these scenarios, the released deep network models usually cannot be retrained or fine-tuned with inputs from the new unseen domains or environments, as illustrated in Fig. 2. Thus, domain adaptation cannot be applied to these problems, since data from the new unseen target domains are unavailable.

Besides, some prior works achieve high recognition accuracy by presenting new loss functions [3] [4] or deepening network structures [5] via mining hard samples in the training sets. These loss functions are deployed to deal with hard samples, which are treated as unseen domains. However, these methods generalize poorly to new unseen domains in real-world applications. In some real-world problems, training samples from new unseen domains cannot be observed during the training process. Therefore, in the scope of this work, no assumption is made about the new unseen domains. Our proposed method can be incorporated with Convolutional Neural Network (CNN)-based detection and classification methods and trained within an end-to-end deep learning framework to improve their performance.

I-A Contributions of this Work

This work presents a novel domain generalization approach that learns to better generalize to new unseen domains. We consider the restrictive setting in which only a single source domain is available for training. Table I summarizes the differences between our approach and the prior works. Our contributions can be summarized as follows.

First, a novel approach named Universal Non-volume Preserving (UNVP) and its extension, Extended Universal Non-volume Preserving (E-UNVP), are introduced to generalize to the environments of new unseen domains from a given single-source training domain. Second, the environmental features extracted from the environment modeling via Deep Generative Flows (DGF) and the discriminative features extracted from the deep network classifiers are unified to provide final generalized deep features that are robustly discriminative in new unseen domains. Our approach is designed within an end-to-end deep learning framework and inherits the power of CNNs. It can be quickly integrated end-to-end with a CNN-based deep network design for object detection or recognition to improve performance. Finally, the proposed method has been evaluated in various visual modalities and applications, with consistently improved performance.

II Related Work

Domain Adaptation has recently become one of the most popular research topics in the field [1] [6] [7] [8] [2]. Ganin et al. [1] proposed to incorporate both classification and domain adaptation to a unified network so that both tasks can be learned together. Similarly, Tzeng et al. [2] later introduced a unified framework for Unsupervised Domain Adaptation based on adversarial learning objectives (ADDA). It uses a loss function in a discriminator to be solely dependent on its target distribution. Liu et al. [9] presented Coupled Generative Adversarial Network (CoGAN) for learning a joint distribution of multi-domain images. It is then applied to domain adaptation.

Domain Generalization aims to learn a classification model from a single-source domain and generalize that knowledge to achieve high performance in unseen target domains robustly. To learn a domain-invariant feature representation, M. Ghifary et al. [17] used multi-view autoencoders to perform cross-domain reconstructions. Later, [18] introduced MMD-AAE to learn a feature representation by jointly optimizing a multi-domain autoencoder regularized via the Maximum Mean Discrepancy (MMD) distance. Recently, K. Muandet et al. [19] presented a kernel-based algorithm for minimizing the differences in the marginal distributions of multiple domains, whereas Y. Li [20] proposed an end-to-end conditional invariant deep domain generalization approach by leveraging deep neural networks for domain-invariant representation learning. To address the problem of unseen domains, R. Volpi et al. presented Adversarial Data Augmentation (ADA) [16] to generalize to unseen domains.

III The Proposed Method

Unlike previous augmentation methods, which generate new samples in image space using prior knowledge in the hope that these samples cover unseen domains, our approach models the environment density as multiple Gaussian distributions in a deep feature space and uses this knowledge for the generalization process. In this way, the new samples are automatically synthesized with more semantic meaning while consistently maintaining the feature structures (see Fig. 3). Thus, without needing to see samples from the target domains, the method can still handle domain shift effectively and robustly achieve high performance in these unseen domains.

In particular, the proposed UNVP and E-UNVP approaches present a new tractable CNN deep network to extract the deep features of samples in the source environment and formulate their probability densities to multiple Gaussian distributions (Fig. 3). From these learned distributions, a density-based augmentation approach is employed to expand data distributions of the source environment for generalizing to different unseen domains. This architecture design allows unifying deep feature modeling and distribution modeling within an end-to-end framework.

The proposed framework consists of two main streams: (1) discriminative feature modeling with a deep network classifier; and (2) Deep Generative Flows to model the domain variations in the form of distributions. Together they go through an end-to-end learning process that alternately minimizes the within-class distributions and synthesizes new useful samples to generalize to new unseen domains. Notice that our proposed framework does not require the presence of samples from the target domains during the training process.

III-A Domain Variation Modeling as Distributions

This section aims at learning a Deep Generative Flow model, i.e. a function $\mathcal{F}$, that maps an image $\mathbf{x}$ in image space $\mathcal{I}$ to its latent representation $\mathbf{z}$ in latent domain $\mathcal{Z}$ such that the density function $p_{X}(\mathbf{x})$ can be estimated via the probability density function $p_{Z}(\mathbf{z})$. Then via $\mathcal{F}$, rather than representing the environment variation, i.e. $p_{X}(\mathbf{x})$, directly in the image space, it can be easily modeled via variables in latent space, i.e. $p_{Z}(\mathbf{z})$, in a more semantic manner. When $p_{Z}(\mathbf{z})$ follows prior distributions, all samples in the given domain can be effectively modeled in the form of latent distributions.

Structure and Variable Relationship. Let $\mathbf{x}\in\mathcal{I}$ be a data sample in image domain $\mathcal{I}$ , $y$ be its corresponding class label, and $\mathbf{z}=\mathcal{F}(\mathbf{x},y,\theta)$ where $\theta$ denotes the parameters of $\mathcal{F}$ , the probability density function of $\mathbf{x}$ can be formulated via the change of variable formula as follows:

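$$p_{X}(\mathbf{x},y;\theta)=p_{Z}(\mathbf{z},y;\theta)\left|\frac{\partial\mathcal{F}(\mathbf{z},y;\theta)}{\partial\mathbf{x}}\right| \tag{1}$$

where $p_{X}(\mathbf{x},y;\theta)$ and $p_{Z}(\mathbf{z},y;\theta)$ define the distributions of samples of class $y$ in the image and latent domains, respectively, and $\frac{\partial\mathcal{F}(\mathbf{z},y;\theta)}{\partial\mathbf{x}}$ denotes the Jacobian matrix with respect to $\mathbf{x}$. The log-likelihood is then computed as:

$$\log p_{X}(\mathbf{x},y;\theta)=\log p_{Z}(\mathbf{z},y;\theta)+\log\left|\frac{\partial\mathcal{F}(\mathbf{z},y;\theta)}{\partial\mathbf{x}}\right| \tag{2}$$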

Eqns. (1) and (2) provide two facts: (1) learning the density function of samples in class $y$ is equivalent to estimating the density of its latent representation $\mathbf{z}$ and the determinant of the associated Jacobian matrix $\frac{\partial\mathcal{F}}{\partial\mathbf{x}}$; and (2) if the latent distribution $p_{Z}$ is defined as a Gaussian distribution, the learned function $\mathcal{F}$ explicitly becomes the mapping function from a real data distribution to a Gaussian distribution in latent space. Then, we can model the environment variation via deviations from the Gaussian distributions of all classes in a latent domain. When $\mathcal{F}$ is well-defined with tractable computation of its Jacobian determinant, a two-way connection, i.e., inference and generation, exists between $\mathbf{x}$ and $\mathbf{z}$.
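To make the change of variables formula concrete, the following is a minimal sketch for a one-dimensional affine map $z=(x-\text{shift})/\text{scale}$ with a standard normal latent density. The function names and the choice of an affine map are illustrative assumptions, not the paper's flow; the point is only that the latent log-density plus the log absolute Jacobian reproduces the known density of $x$.

```python
import numpy as np

def standard_normal_logpdf(z):
    """Log-density of N(0, 1), evaluated element-wise."""
    return -0.5 * (z ** 2 + np.log(2.0 * np.pi))

def affine_flow_logpdf(x, shift, scale):
    """log p_X(x) for the invertible map z = (x - shift) / scale.

    Uses log p_X(x) = log p_Z(z) + log |dz/dx|, where dz/dx = 1 / scale.
    """
    z = (x - shift) / scale
    log_abs_det_jacobian = -np.log(np.abs(scale))
    return standard_normal_logpdf(z) + log_abs_det_jacobian

# Hypothetical sample points and flow parameters.
x = np.array([-1.0, 0.5, 3.0])
shift, scale = 1.0, 2.0
log_px = affine_flow_logpdf(x, shift, scale)

# Reference: closed-form log-density of N(shift, scale^2).
reference = (-0.5 * ((x - shift) / scale) ** 2
             - np.log(scale) - 0.5 * np.log(2.0 * np.pi))
```

Here `log_px` and `reference` agree, which is the one-dimensional analogue of estimating the image density through the latent Gaussian and the Jacobian determinant.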

The prior class distributions. Motivated by these properties, given $C$ classes, we choose $C$ Gaussian distributions with different means $\{\bm{\mu}_{1},\bm{\mu}_{2},...,\bm{\mu}_{C}\}$ and covariances $\{\Sigma_{1},\Sigma_{2},...,\Sigma_{C}\}$ as prior distributions for these classes, i.e. $\mathbf{z}_{c}\sim\mathcal{N}(\bm{\mu}_{c},\Sigma_{c})$. Although we choose Gaussian distributions here, the framework is not limited to them and can adopt other distribution types.

Mapping function structure. To enforce the information flow from an image domain to a latent space with different abstraction levels, we formulate the mapping function $\mathcal{F}$ as a composition of several sub-functions $f_{i}$ as follows.

$$\mathcal{F}=f_{1}\circ f_{2}\circ\ldots\circ f_{N} \tag{3}$$

where $N$ is the number of sub-functions. The Jacobian $\frac{\partial\mathcal{F}}{\partial\mathbf{x}}$ can be derived by the chain rule as $\frac{\partial\mathcal{F}}{\partial\mathbf{x}}=\frac{\partial f_{1}}{\partial\mathbf{x}}\cdot\frac{\partial f_{2}}{\partial f_{1}}\cdots\frac{\partial f_{N}}{\partial f_{N-1}}$. With this structure, the properties of each $f_{i}$ define the properties of the whole mapping function $\mathcal{F}$. For example, if the Jacobian of every $f_{i}$ is tractable, then that of $\mathcal{F}$ is also tractable. Furthermore, if $f_{i}$ is a non-linear function built from a composition of CNN layers, then $\mathcal{F}$ becomes a deep convolutional neural network. There are several ways to construct the sub-functions, e.g. different CNN structures that provide the non-linearity property.

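$$f(\mathbf{x})=\mathbf{b}\odot\mathbf{x}+(1-\mathbf{b})\odot\left[\mathbf{x}\odot\exp\left(\mathcal{S}(\mathbf{b}\odot\mathbf{x})\right)+\mathcal{T}(\mathbf{b}\odot\mathbf{x})\right] \tag{4}$$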

where $\mathbf{b}=[1,...,1,0,...,0]$ is a binary mask, and $\odot$ is the Hadamard product. $\mathcal{S}$ and $\mathcal{T}$ define the scale and translation functions used during the mapping process.
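The masked sub-function with scale $\mathcal{S}$ and translation $\mathcal{T}$ can be sketched as follows. This is a toy NumPy illustration, not the authors' network: `scale_fn` and `shift_fn` are hypothetical stand-ins (fixed random linear maps) for the residual networks used in the paper. The sketch shows why the form is convenient: the masked coordinates pass through unchanged, so the map is inverted in closed form and its log-determinant is just the sum of the scale outputs on the unmasked coordinates.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 4
b = np.array([1.0, 1.0, 0.0, 0.0])            # binary mask [1, ..., 1, 0, ..., 0]
W_s = rng.normal(scale=0.5, size=(dim, dim))  # hypothetical weights standing in for S
W_t = rng.normal(scale=0.5, size=(dim, dim))  # hypothetical weights standing in for T

def scale_fn(u):
    """Toy scale function S; tanh keeps the log-scale bounded."""
    return np.tanh(W_s @ u)

def shift_fn(u):
    """Toy translation function T."""
    return W_t @ u

def coupling_forward(x, b, scale_fn, shift_fn):
    """y = b*x + (1-b) * (x * exp(S(b*x)) + T(b*x)); also returns log|det J|."""
    masked = b * x
    s, t = scale_fn(masked), shift_fn(masked)
    y = masked + (1.0 - b) * (x * np.exp(s) + t)
    log_det = np.sum((1.0 - b) * s)           # Jacobian is triangular
    return y, log_det

def coupling_inverse(y, b, scale_fn, shift_fn):
    """Exact inverse: masked coordinates of y equal those of x."""
    masked = b * y
    s, t = scale_fn(masked), shift_fn(masked)
    return masked + (1.0 - b) * ((y - t) * np.exp(-s))

x = rng.normal(size=dim)
y, log_det = coupling_forward(x, b, scale_fn, shift_fn)
x_rec = coupling_inverse(y, b, scale_fn, shift_fn)
```

Stacking several such layers with different masks gives a composition whose total log-determinant is the sum of the per-layer values.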

Learning the mapping function and Environment Modeling. To learn the parameter $\theta$ for mapping function $\mathcal{F}$ , the log-likelihood in Eqn. (2) is maximized as follows.

$$\theta^{*}=\arg\max_{\theta}\sum_{c}\sum_{i}\log p_{X}(\mathbf{x}^{i},c;\theta) \tag{5}$$

Notice that after learning the mapping function, all images of all classes are mapped into the corresponding distributions of their classes. Then the environment density can be considered as the composition of these distributions. Figure 4(A) illustrates an example of the learned environment distributions of MNIST with 10 digit classes. When only samples in MNIST are used for training, the density distributions of MNIST-M, i.e., unseen during training, obtained with Pure-CNN, our UNVP, and our E-UNVP are shown in Fig. 4 (B, C, D), respectively. In the next section, a generalization approach is proposed so that, using only samples in a source environment, the learned model can expand the density distributions of the source environment to cover as much as possible of the distributions of unseen environments.

III-B Unseen Domain Generalization

After modeling the source environment variation as the compositions of its class distributions, this section introduces the generalization process of these distributions with respect to a classification model $\mathcal{M}$ such that the expansion of these distributions can help $\mathcal{M}$ generalize to unseen environments with high accuracy. Notice that $\mathcal{M}$ can be any type of Deep CNN such as LeNet [21], AlexNet [22], VGG [23], ResNet [5], DenseNet [24].

Let $\ell(\mathbf{X,Y};\mathcal{M},\mathcal{F},\theta,\theta_{1})$ be the training loss function of $\mathcal{M}$, and $\theta_{1}$ be the parameters of $\mathcal{M}$. The generalization process of $\mathcal{M}$ can be formulated as updating the parameters $\theta_{1}$ such that it can robustly classify the samples having latent distributions that are within distance $\rho$ of the samples in the source environment. Then, the objective function of $\mathcal{M}$ is formulated as:

$$\arg\min_{\theta_{1}}\sup_{P:\,d(P_{X},P_{X}^{src})\leq\rho}\mathbb{E}\left[\ell(\mathbf{X},\mathbf{Y};\mathcal{M},\mathcal{F},\theta,\theta_{1})\right] \tag{6}$$

where $\{\mathbf{X,Y}\}$ denotes the images and their labels; $d(\cdot,\cdot)$ is the distance between probability distributions; $P^{src}_{X}(\mathbf{X,Y})$ and $P_{X}(\mathbf{X,Y})$ are the density distributions of the source and current expanded environments, respectively.

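Since both $P_{X}^{src}$ and $P_{X}$ are density distributions, the Wasserstein distance between $P_{X}^{src}$ and $P_{X}$ can be adopted. Recall from the previous section that we have learned a mapping function $\mathcal{F}$ that maps the density functions from image space, i.e. $P_{X}$, to prior distributions in latent space, i.e. $P_{Z}$. Moreover, since $\mathcal{F}$ is invertible by the construction of its sub-functions, computing $d(P_{X},P_{X}^{src})$ is equivalent to computing $d(P_{Z},P_{Z}^{src})$. From this, we can efficiently estimate $cost$ as the transformation cost between Gaussian distributions. Then $d(P_{X},P_{X}^{src})$ is reformulated as:

$$d(P_{X},P_{X}^{src})=d(P_{Z},P_{Z}^{src})=\sum_{c}\inf_{\mathbf{x}_{c},\mathbf{x}_{c}^{src}}\mathbb{E}\left[cost(\mathcal{F}(\mathbf{x}_{c}),\mathcal{F}(\mathbf{x}_{c}^{src}))\right]=\sum_{c}\inf_{\mathbf{z}_{c},\mathbf{z}_{c}^{src}}\mathbb{E}\left[cost(\mathbf{z}_{c},\mathbf{z}_{c}^{src})\right]$$

where $cost(\cdot,\cdot)$ denotes the transformation cost between Gaussian distributions:

$$cost^{2}(\mathbf{z}_{c},\mathbf{z}_{c}^{src})=\sum_{c}\|\bm{\mu}_{c}^{src}-\bm{\mu}_{c}\|_{2}^{2}+\mathrm{Tr}\left(\Sigma_{c}^{src}+\Sigma_{c}-2\left((\Sigma_{c}^{src})^{1/2}\Sigma_{c}(\Sigma_{c}^{src})^{1/2}\right)^{1/2}\right) \tag{7}$$

$\{\bm{\mu}_{c}^{src},\Sigma_{c}^{src}\}$ and $\{\bm{\mu}_{c},\Sigma_{c}\}$ are the means and covariances of the distributions of class $c$ in the source and the expanded environment, respectively. Plugging this distance into Eqn. (6) and applying the Lagrangian relaxation, we have

$$\begin{aligned}&\arg\min_{\theta_{1}}\sup_{P}\mathbb{E}\left[\ell(\mathbf{X},\mathbf{Y};\mathcal{M},\mathcal{F},\theta,\theta_{1})\right]-\alpha\cdot d(P_{X},P_{X}^{src})\\=\;&\arg\min_{\theta_{1}}\sum_{c}\sup_{\mathbf{x}}\left\{\ell(\mathbf{x},c;\mathcal{M},\mathcal{F},\theta,\theta_{1})-\alpha\cdot cost(\mathcal{F}(\mathbf{x}),\mathcal{F}(\mathbf{x}_{c}^{src}))\right\}\end{aligned} \tag{8}$$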
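The Gaussian transformation cost that enters this objective, i.e. a squared distance between means plus a trace term over covariances, can be evaluated numerically for a single class as sketched below. The helper names are hypothetical, and the matrix square root is computed by eigendecomposition, which is one reasonable choice for symmetric positive semi-definite matrices rather than anything prescribed by the paper.

```python
import numpy as np

def psd_sqrt(mat):
    """Symmetric square root of a positive semi-definite matrix."""
    eigvals, eigvecs = np.linalg.eigh(mat)
    eigvals = np.clip(eigvals, 0.0, None)     # guard against tiny negative values
    return (eigvecs * np.sqrt(eigvals)) @ eigvecs.T

def gaussian_cost_squared(mu_src, cov_src, mu, cov):
    """Per-class term: ||mu_src - mu||^2 + Tr(cov_src + cov - 2 (A cov A)^(1/2)),
    where A is the square root of cov_src."""
    root_src = psd_sqrt(cov_src)
    cross = psd_sqrt(root_src @ cov @ root_src)
    mean_term = np.sum((mu_src - mu) ** 2)
    cov_term = np.trace(cov_src + cov - 2.0 * cross)
    return mean_term + cov_term

# Hypothetical 2-D example: same mean, standard deviations 2 versus 1.
mu_a, cov_a = np.zeros(2), 4.0 * np.eye(2)
mu_b, cov_b = np.zeros(2), np.eye(2)
cost_sq = gaussian_cost_squared(mu_a, cov_a, mu_b, cov_b)
```

For this example the trace term is $(2-1)^2$ per dimension, so `cost_sq` equals 2; shifting one mean by a vector $\mathbf{v}$ adds $\|\mathbf{v}\|_2^2$.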

To solve this objective function, the optimization process can be divided into two alternating steps: (1) generate the sample $\mathbf{x}$ for each class such that

$$\mathbf{x}=\arg\max_{\mathbf{x}}\left\{\ell(\mathbf{x},c;\mathcal{M},\mathcal{F},\theta,\theta_{1})-\alpha\cdot cost(\mathcal{F}(\mathbf{x}),\mathcal{F}(\mathbf{x}_{c}^{src}))\right\} \tag{9}$$

and consider $\mathbf{x}$ as a new “hard” example for class $c$; and (2) add $\mathbf{x}$ to the training data and optimize the model $\mathcal{M}$. In other words, this two-step optimization process aims at finding new samples belonging to distributions that are at distance $\rho$ from the distributions of the source environment, and at making $\mathcal{M}$ more robust when classifying these examples. In this way, after a certain number of iterations, the distributions learned by $\mathcal{M}$ can be generalized so that they cover as much as possible of the distributions of new unseen environments.
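The alternation between generating hard samples and retraining can be illustrated with a deliberately simplified toy. Everything here is an assumption made for illustration: the classifier is a 2-D logistic regression rather than a CNN, the data are synthetic, and the cost is a squared Euclidean distance in input space swapped in for the Gaussian transformation cost in latent space. Only the structure of the loop (maximize loss minus a weighted cost, then minimize the loss on the augmented set) mirrors the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def cross_entropy(x, y, w, bias):
    """Per-sample binary cross-entropy, written stably as softplus(u) - y*u."""
    u = x @ w + bias
    return np.logaddexp(0.0, u) - y * u

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def perturb(x_src, y, w, bias, alpha, steps=10, lr=0.1):
    """Maximization step: gradient ascent on loss(x) - alpha * ||x - x_src||^2."""
    x = x_src.copy()
    for _ in range(steps):
        p = sigmoid(x @ w + bias)
        grad_loss = (p - y)[:, None] * w       # d(cross-entropy)/dx
        grad_cost = 2.0 * alpha * (x - x_src)  # d(alpha * ||x - x_src||^2)/dx
        x = x + lr * (grad_loss - grad_cost)
    return x

def train(x, y, w, bias, epochs=200, lr=0.5):
    """Minimization step: gradient descent on the mean cross-entropy."""
    for _ in range(epochs):
        p = sigmoid(x @ w + bias)
        w = w - lr * (x.T @ (p - y)) / len(y)
        bias = bias - lr * np.mean(p - y)
    return w, bias

# Synthetic two-class source data (hypothetical).
x_src = np.vstack([rng.normal(-2.0, 1.0, (100, 2)), rng.normal(2.0, 1.0, (100, 2))])
y_src = np.concatenate([np.zeros(100), np.ones(100)])

w, bias = np.zeros(2), 0.0
x_train, y_train = x_src, y_src
for _ in range(3):                              # number of alternations (assumed)
    w, bias = train(x_train, y_train, w, bias)
    x_hard = perturb(x_src, y_src, w, bias, alpha=1.0)
    x_train = np.vstack([x_train, x_hard])      # add hard samples to the training set
    y_train = np.concatenate([y_train, y_src])
```

The penalty weight `alpha` plays the role of $\alpha$ in the relaxed objective: larger values keep the generated samples closer to their source samples.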

III-C Universal Non-volume Preserving (UNVP) Models

The proposed UNVP consists of two main branches: (1) Discriminative Feature Modeling and (2) Generative Distribution Modeling. While the discriminative part focuses on constructing a classifier that minimizes within-class distributions, the generative one aims at embedding samples of all classes into their corresponding latent distributions and then expanding these distributions for generalization. Fig. 5 illustrates the whole end-to-end joint training process for UNVP where the generative part, i.e., Deep Generative Flow $\mathcal{F}$ , is firstly employed to learn the mapping from image space to Gaussian distributions in latent space. Then a two-stage training process is adopted to learn the Deep Classifier $\mathcal{M}$ and adjust the Deep Generative Flow $\mathcal{F}$ for generalization.

In the first stage of this process, given a training dataset, the parameters $\{\theta,\theta_{1}\}$ of the mapping function $\mathcal{F}$ and the classifier $\mathcal{M}$ are both updated according to the loss function:

$$\ell(\mathbf{X},\mathbf{Y};\mathcal{M},\mathcal{F},\theta,\theta_{1})=\ell_{CE}(\mathcal{M}(\mathbf{X};\theta_{1}),\mathbf{Y})-\log p_{X}(\mathbf{X},\mathbf{Y};\theta)$$

where the first term is the cross-entropy loss for $\mathcal{M}$ and the second term is the negative log-likelihood of $\mathcal{F}$.

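In the second stage, we adopt the generalization process presented in Sec. III-B and Eqn. (9) to synthesize new “hard” samples. To further constrain the perturbation in latent space, we add another regularization term to Eqn. (7):

$$cost^{2}(\mathbf{z}_{c},\mathbf{z}_{c}^{src})=\sum_{c}\|\bm{\mu}_{c}^{src}-\bm{\mu}_{c}\|_{2}^{2}+\mathrm{Tr}\left(\Sigma_{c}^{src}+\Sigma_{c}-2\left((\Sigma_{c}^{src})^{1/2}\Sigma_{c}(\Sigma_{c}^{src})^{1/2}\right)^{1/2}\right)+\|\mathcal{M}(\mathbf{X}_{c})-\mathcal{M}(\mathbf{X}_{c}^{src})\|_{2}^{2}$$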

New generated samples are then added to the training set and used for updating both branches of UNVP.

Notice that in the structure of $\mathcal{F}$, the choice of Gaussian distributions for all classes plays an important role and directly affects the performance of the generative model. By varying the choices for these distributions, different variants of UNVP can be introduced.

Universal Non-volume Preserving Models (UNVP):

The means and covariances of the Gaussian distributions are pre-defined for all $C$ classes, where $\bm{\mu}_{c}=\mathbf{1}c$ and $\Sigma_{c}=\mathbf{I}$, i.e. $\mathbf{z}_{c}\sim\mathcal{N}(\bm{\mu}_{c},\mathbf{I})$, where $\mathbf{1}$ is the all-one vector.
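A small sketch of these fixed priors: each class mean is the all-one vector scaled by the class index, and every covariance is the identity. The latent dimension, the zero-based class indexing, and the function names are assumptions made for illustration.

```python
import numpy as np

def unvp_prior_means(num_classes, latent_dim):
    """Row c holds mu_c = c * 1 (classes indexed from 0 here, by assumption)."""
    return np.arange(num_classes, dtype=float)[:, None] * np.ones((1, latent_dim))

def log_prior(z, means):
    """log N(z; mu_c, I) for every class c; returns shape (num_classes,)."""
    diff = z[None, :] - means
    d = z.shape[0]
    return -0.5 * np.sum(diff ** 2, axis=1) - 0.5 * d * np.log(2.0 * np.pi)

means = unvp_prior_means(num_classes=10, latent_dim=4)
z = np.full(4, 2.9)                      # hypothetical latent code
scores = log_prior(z, means)
most_likely_class = int(np.argmax(scores))
```

With identity covariances, the most likely class is simply the one whose mean is nearest to the latent code, so a code at $2.9\cdot\mathbf{1}$ is assigned to class 3.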

Extended Universal Non-volume Preserving Models (E-UNVP):

Rather than fixing the means and covariances of the Gaussian distributions of the $C$ classes, we treat them as variables that are flexibly learned during environment modeling and adjusted during domain generalization. In particular, given the class label $c$, $\mathcal{F}$ maps each sample $\mathbf{x}_{c}$ to a Gaussian distribution with mean and covariance:

$$\bm{\mu}_{c}=\gamma\,\mathcal{G}_{m}(c)+\lambda\,\mathcal{H}_{m}(\mathbf{n}),\qquad\Sigma_{c}=\mathcal{G}_{std}(c) \tag{10}$$

where $\mathcal{G}_{m}(c)$ and $\mathcal{G}_{std}(c)$ denote the learnable functions that map label $c$ to the mean and covariance values of its Gaussian distribution. $\mathbf{n}$ is a noise signal generated from the normal distribution. $\mathcal{H}_{m}(\mathbf{n})$ defines the allowable shifting range of the Gaussian given the noise signal $\mathbf{n}$. $\gamma$ and $\lambda$ are user-defined parameters that control the separation of the Gaussian distributions between different classes and the contribution of $\mathcal{H}_{m}(\mathbf{n})$ to $\bm{\mu}_{c}$. We choose a fully connected structure for $\mathcal{G}_{m}(c)$ and $\mathcal{G}_{std}(c)$, which take the input $c$ in the form of a one-hot vector, while a convolutional layer is adopted for $\mathcal{H}_{m}(\mathbf{n})$.
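The roles of $\gamma$ and $\lambda$ can be seen in a toy version of this parameterization. The sketch below is an assumption-heavy stand-in: the weights are random rather than learned, $\mathcal{H}_{m}$ is a dense map with a tanh (the paper uses a convolutional layer), and the covariance is taken to be diagonal with an exponential to keep it positive.

```python
import numpy as np

rng = np.random.default_rng(0)
num_classes, latent_dim, noise_dim = 10, 4, 8

# Hypothetical (untrained) weights for the three maps.
W_mean = rng.normal(size=(num_classes, latent_dim))            # stands in for G_m
W_std = rng.normal(scale=0.1, size=(num_classes, latent_dim))  # stands in for G_std
W_shift = rng.normal(scale=0.1, size=(noise_dim, latent_dim))  # stands in for H_m

def class_gaussian(c, noise, gamma, lam):
    """Return (mu_c, Sigma_c) with mu_c = gamma * G_m(c) + lam * H_m(n)."""
    one_hot = np.eye(num_classes)[c]
    g_mean = one_hot @ W_mean
    h_shift = np.tanh(noise @ W_shift)   # bounded shift, each entry in (-1, 1)
    mu = gamma * g_mean + lam * h_shift
    std = np.exp(one_hot @ W_std)        # positive by construction
    return mu, np.diag(std ** 2)

noise = rng.normal(size=noise_dim)
mu_fixed, cov_fixed = class_gaussian(3, noise, gamma=1.0, lam=0.0)
mu_shifted, cov_shifted = class_gaussian(3, noise, gamma=1.0, lam=0.1)
```

Because the toy shift is bounded by 1 in each coordinate, `lam` directly caps how far the class mean can move, while `gamma` scales the separation between class means.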

IV Discussion

As shown in Fig. 3, by exploiting the Generative Flows that model samples of each class as a Gaussian in semantic feature space, the proposed UNVP can robustly maintain the feature structure of each class while expanding and shifting the domain distributions. In this way, we can generate more useful “hard” samples for the generalization process.

By introducing the noise signal $\mathbf{n}$, we allow the Gaussian distribution of each class to shift within a limited range, i.e., $\mathcal{H}_{m}(\mathbf{n})$. This further enhances the robustness of E-UNVP against noise during the environment modeling.

To further enhance the capability of modeling high-resolution input signals, we incorporate the activation normalization and invertible $1\times 1$ convolution operators [25] into the structure of each sub-function $f_{i}$ in Eqn. (3). In particular, the input to each $f_{i}$ is passed through an actnorm layer followed by an invertible $1\times 1$ convolution before being transformed by $\mathcal{S}$ and $\mathcal{T}$ as in Eqn. (4). The two transformations $\mathcal{S}$ and $\mathcal{T}$ are defined by two Residual Networks with rectifier non-linearity and skip connections. Each of them contains three residual blocks. For input images with a resolution higher than $128\times 128$, six residual blocks are used for $\mathcal{S}$ and $\mathcal{T}$.

V Experiments

This section first shows the effectiveness of our proposed methods with comprehensive ablative experiments. In these experiments, we use MNIST as the only training set and MNIST-M as the unseen testing set. The proposed approaches are also benchmarked on various deep network structures, i.e. LeNet [21], AlexNet [22], VGG [23], ResNet [5] and DenseNet [24]. Using the final optimal model, we show in the next subsection that our approaches consistently achieve state-of-the-art results in digit recognition on three digit datasets, i.e., MNIST, SVHN [26], and MNIST-M. Then, we show the results of our proposed approaches in face recognition on three databases, i.e. Extended Yale-B [27], CMU-PIE [28] and CMU-MPIE [29]. We use facial images with normal illumination as the training domain and those in dark illumination conditions as the testing set for the new unseen domains. Finally, we show the advantages of UNVP and E-UNVP in cross-domain pedestrian recognition on the Thermal Database.

V-A Ablation Study

This experiment aims to measure the effectiveness of the domain generalization and perturbation processes. It uses MNIST as the only training set and MNIST-M as the testing set. To simplify the experiment, LeNet [21] is used as the classifier, i.e., Pure-CNN. For the network hyper-parameters, we set the learning rate and the batch size to 0.0001 and 128, respectively.

Hyper-parameter Settings. In the GLOW learning process, the multiple Gaussian distributions are handled via a set of scale parameters, i.e., $\gamma$ and $\lambda$, which control the distribution separation and shifting range as in Eqn. (10). The contribution of the generalization process is also evaluated with various percentages of “hard” generated samples ($\beta$), i.e., from $0\%$ to $30\%$. When $\beta=0$, no new samples are generated.

The training process alternates between two phases: (1) a minimization phase that optimizes the networks and (2) a maximization (perturbation) phase that generates new hard examples. We run the maximization phase $K$ times; each time, we randomly select $\beta$ percent of the training images and generate new hard samples from them via deep generative models. Specifically, our maximization phase generates new images based on both the semantic features from the CNN classifier and the semantic space via the estimation of environment density. The experimental results in Table II show that the proposed approaches consistently improve the classifiers.
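The alternating schedule can be sketched as a plain loop. This is an illustrative outline only, under several assumptions not stated in the text: the function names `select_hard_indices` and `train_min_max`, the callbacks `minimize_step` and `perturb`, the number of minimization steps per round, the final minimization pass, and the choice to sample only from the original source images are all hypothetical.

```python
import random

def select_hard_indices(num_train, beta, rng):
    """Randomly pick a fraction beta (in [0, 1]) of the training indices."""
    k = int(round(beta * num_train))
    return rng.sample(range(num_train), k)

def train_min_max(data, minimize_step, perturb, K, beta, steps_per_phase, seed=0):
    """Alternate minimization and maximization phases, K rounds in total.

    minimize_step(data): one optimization step of the networks (hypothetical).
    perturb(sample):     returns a new hard sample for `sample` (hypothetical).
    """
    rng = random.Random(seed)
    data = list(data)
    num_source = len(data)                    # assumption: sample from source images only
    for _ in range(K):
        for _ in range(steps_per_phase):      # (1) minimization phase
            minimize_step(data)
        picked = select_hard_indices(num_source, beta, rng)
        new_samples = [perturb(data[i]) for i in picked]
        data.extend(new_samples)              # (2) maximization phase adds hard samples
    for _ in range(steps_per_phase):          # assumption: final pass on the augmented set
        minimize_step(data)
    return data

# Toy usage with stand-in callbacks; with beta = 0 the data set is left unchanged.
source = list(range(100))
augmented = train_min_max(source, lambda d: None, lambda s: -s,
                          K=3, beta=0.1, steps_per_phase=2)
```

With $\beta=0$ no samples are selected, so the loop reduces to ordinary training of the classifier, which matches the "no new samples" case described in the hyper-parameter settings.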

Sample Distributions in Unseen Domains. The sample class distributions obtained with the optimal parameter set are visualized in Fig. 4. While Pure-CNN clearly fails to model the unseen MNIST-M domain, our UNVP successfully shifts the domain and covers the unseen-domain dataset. With E-UNVP, these sample distributions are completely separated by class.

Backbone Deep Networks. This section evaluates the robustness and the consistent improvements of UNVP and E-UNVP with common deep networks, including LeNet, AlexNet, VGG, ResNet, and DenseNet, as in Table III. The proposed UNVP and E-UNVP consistently outperform the stand-alone classifier (Pure-CNN) using the same network configuration in all experiments. In particular, they improve accuracy on MNIST-M by 6%, 0.5%, 4%, 5%, and 2% using LeNet, AlexNet, VGG, ResNet, and DenseNet, respectively.

The proposed methods can be easily integrated with standard CNN deep networks. Therefore, they can potentially be applied to improve the performance of many existing CNN-based applications, e.g., detection and recognition, which are evaluated in the following sections.

V-B Digit Recognition on Unseen Domains

The proposed approaches are evaluated on digit recognition in new unseen domains with two other digit databases, i.e., MNIST-M and SVHN (Fig. 6). In this experiment, MNIST is the only database used to train the classifier, and MNIST-M and SVHN serve as the new unseen domains for benchmarking. The classifier is trained using 50,000 images of MNIST. For the generalization phase, we use 10,000 images in this set to perturb and generate new samples. All digit images are resized to $32\times 32$. We benchmark the learned classifiers on MNIST and the two unseen digit datasets, i.e., SVHN and MNIST-M. The results of our approach are compared against the LeNet classifier (Pure-CNN) and Adversarial Data Augmentation (ADA). We also show the recognition results on these datasets using domain adaptation methods, including Adversarial Discriminative Domain Adaptation (ADDA) and Domain-Adversarial Training of Neural Networks (DANN) [1]. Note that Pure-CNN, ADA, and our approaches do not require target domain data during training, whereas ADDA and DANN do.

Our generalization phase synthesizes images based on the semantic space via the estimation of environment density, which makes our generated images more diverse than those synthesized by the ADA method. The experimental results are shown in Table IV. The proposed methods consistently achieve state-of-the-art performance on these datasets. Notably, they improve accuracy by approximately 11% and 6% on SVHN and MNIST-M, respectively.

V-C Face Recognition on Unseen Domains

In this experiment, the proposed approaches are applied to face recognition in unseen environments and compared against other baseline methods, i.e., Pure-CNN, ADA, and ADDA, on three face recognition databases: Extended Yale-B, CMU-PIE, and CMU-MPIE. In each database, we select the face images with normal lighting as the source domain, i.e., Normal illumination (N), and the face images with dark lighting as the target domain, i.e., Dark illumination (D). Each database is randomly split into two sets: a training set (80%) and a testing set (20%). The experimental framework is similar to the one used in digit recognition. All cropped face images are resized to $64\times 64$ pixels. The experimental results in Table V show that our proposed methods improve the recognition performance on new unseen domains where the lighting conditions are unknown. In particular, they improve accuracy by approximately 11%, 4%, and 3% in dark lighting conditions on the Extended Yale-B, CMU-PIE, and CMU-MPIE databases, respectively.
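The random 80%/20% split can be sketched as follows. The text does not say whether the split is made per image, per subject, or per illumination condition, so this minimal version simply shuffles a flat list of items; the function name `split_train_test` and the fixed seed are assumptions for illustration.

```python
import random

def split_train_test(items, train_fraction=0.8, seed=0):
    """Shuffle the items and cut them into training and testing subsets."""
    rng = random.Random(seed)          # fixed seed so the split is reproducible
    shuffled = list(items)
    rng.shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

# Toy usage: 50 image identifiers split into 40 for training and 10 for testing.
train_ids, test_ids = split_train_test(range(50))
```

Fixing the seed keeps the two subsets disjoint and identical across runs, which makes comparisons between Pure-CNN, ADA, ADDA, and the proposed methods use the same data.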

V-D Pedestrian Recognition on Unseen Domains

This experiment aims to improve RGB-based pedestrian recognition on thermal images using the Thermal Dataset (https://www.flir.com/oem/adas/adas-dataset-form/). Two datasets are organized in this experiment: (1) RGB pedestrian and (2) Thermal pedestrian. The methods are trained only on the RGB pedestrian dataset and tested on the Thermal pedestrian dataset. In the training phase, we use $2,000$ images to generate new images, and all images of the two datasets are resized to $128\times 128$ pixels. The experimental results in Table VI show that our proposed methods consistently improve the performance of the Pure-CNN across various common deep network structures, including LeNet, AlexNet, VGG, ResNet, and DenseNet.

VI Conclusions

This paper has introduced a novel deep learning based domain generalization approach that generalizes well to different unseen domains. Using only training data from a source domain, we propose an iterative procedure that augments the dataset with samples from a fictitious target domain that is hard under the current model. It can be easily integrated with any other CNN-based framework within an end-to-end network to improve performance. On digit recognition, the proposed method has been benchmarked on three popular digit recognition datasets and has consistently shown improvement. The method is also evaluated on face recognition with three standard databases and outperforms the other state-of-the-art methods. In the problem of pedestrian recognition, we empirically observe that the proposed method learns models that improve performance across a priori unknown data distributions.

VII Acknowledgement

In this project, Dat T. Truong and Minh-Triet Tran are partially supported by Vingroup Innovation Foundation (VINIF) in project code VINIF.2019.DA19.

Bibliography

[1] Y. Ganin and V. Lempitsky, “Unsupervised domain adaptation by backpropagation,” in ICML, 2015.
[2] E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell, “Adversarial discriminative domain adaptation,” July 2017.
[3] F. Schroff, D. Kalenichenko, and J. Philbin, “Facenet: A unified embedding for face recognition and clustering,” in CVPR, June 2015.
[4] X. Zhang, Z. Fang, Y. Wen, Z. Li, and Y. Qiao, “Range loss for deep face recognition with long-tailed training data,” in ICCV, 2017.
[5] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in CVPR, 2016.
[6] E. Tzeng, J. Hoffman, T. Darrell, and K. Saenko, “Simultaneous deep transfer across domains and tasks,” CoRR, 2015.
[7] O. Sener, H. O. Song, A. Saxena, and S. Savarese, “Learning transferrable representations for unsupervised domain adaptation,” in NIPS, 2016.
[8] E. Tzeng, J. Hoffman, N. Zhang, K. Saenko, and T. Darrell, “Deep domain confusion: Maximizing for domain invariance,” CoRR, 2014.