Local Area Transform for Cross-Modality Correspondence Matching and Deep Scene Recognition

Seungchul Ryu

arXiv:1901.00927·cs.CV·January 7, 2019


TL;DR

This paper introduces the local area transform (LAT), a robust image transform invariant to nonlinear intensity deformations, improving correspondence matching and scene recognition across different modalities.

Contribution

The paper proposes LAT and its integration into deep neural networks, including LAT-Net, for enhanced cross-modality correspondence and scene recognition.

Findings

01. LAT provides consistent results under nonlinear intensity deformations.

02. LAT reduces mean absolute difference compared to conventional methods.

03. LAT-based descriptors outperform traditional approaches in cross-spectral matching.

Abstract

Establishing correspondences is a fundamental task in a variety of image processing and computer vision applications. In particular, finding the correspondences between a non-linearly deformed image pair induced by different modality conditions is a challenging problem. This paper describes an efficient but powerful image transform called local area transform (LAT) for modality-robust correspondence estimation. Specifically, LAT transforms an image from the intensity domain to the local area domain, which is invariant under nonlinear intensity deformations, especially radiometric, photometric, and spectral deformations. In addition, robust feature descriptors are reformulated with LAT for several practical applications. Furthermore, a LAT-convolution layer and an Aception block are proposed and, with these novel components, a deep neural network called LAT-Net is proposed especially for scene…

Tables

Table 1. Algorithm 3.1: Pseudo code for LAT

Algorithm 1: Local Area Transform

Input: input image ${\bf{I}}$

Internal: integral histogram ${\bf{H}}_{\bf{I}}$, local histogram ${\bf{H}}_{\bf{p}}$ at pixel point ${\bf{p}} = (x, y)$, the corresponding intensity bin $b$ of ${\bf{p}}$, half-window size $l$

Output: local area transformed image $\mathcal{A}$

/* integral histogram computation */
for each pixel (x, y) do
    ${\bf{H}}'(x, y) \leftarrow {\bf{H}}'(x, y - 1) + {\bf{I}}(x, y)$
end
for each pixel (x, y) do
    ${\bf{H}}_{\bf{I}}(x, y) \leftarrow {\bf{H}}_{\bf{I}}(x - 1, y) + {\bf{H}}'(x, y)$
end

/* local histogram computation */
for each pixel (x, y) do
    ${\bf{H}}_{\bf{p}}(x, y) \leftarrow {\bf{H}}_{\bf{I}}(x + l, y + l) + {\bf{H}}_{\bf{I}}(x - l, y - l) - {\bf{H}}_{\bf{I}}(x - l, y + l) - {\bf{H}}_{\bf{I}}(x + l, y - l)$
end

/* local area computation */
for each pixel (x, y) do
    $\mathcal{A}(x, y) \leftarrow \sum_{b \in \text{neighbor bins}} \omega(b) \times {\bf{H}}_{\bf{p}}(x, y, b)$
end
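As an illustration of Algorithm 1, the following is a minimal, dependency-free Python sketch. It is not the author's implementation: the function name `local_area_transform`, the parameter names, the zero-padded integral histogram, the clipping of windows at the image border, and the omission of any normalizing constant are assumptions made for this example. The input is assumed to be already quantized to integer bin indices.

```python
import math


def local_area_transform(image, half_window=1, bin_radius=1, sigma=1.0, num_bins=8):
    """Sketch of LAT on a 2-D list of integer bin indices in [0, num_bins)."""
    h, w = len(image), len(image[0])

    # Integral histogram with a zero border (an assumption of this sketch):
    # ih[y][x][b] counts pixels of bin b in rows [0, y) and columns [0, x).
    ih = [[[0] * num_bins for _ in range(w + 1)] for _ in range(h + 1)]
    for y in range(h):
        row_hist = [0] * num_bins  # running histogram along the current row
        for x in range(w):
            row_hist[image[y][x]] += 1
            above = ih[y][x + 1]
            ih[y + 1][x + 1] = [above[b] + row_hist[b] for b in range(num_bins)]

    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        y0, y1 = max(0, y - half_window), min(h, y + half_window + 1)
        for x in range(w):
            x0, x1 = max(0, x - half_window), min(w, x + half_window + 1)
            center = image[y][x]
            area = 0.0
            # Sum Gaussian-weighted counts over the bins near the center bin.
            lo = max(0, center - bin_radius)
            hi = min(num_bins, center + bin_radius + 1)
            for b in range(lo, hi):
                # Local histogram entry from four integral-histogram corners.
                count = ih[y1][x1][b] - ih[y0][x1][b] - ih[y1][x0][b] + ih[y0][x0][b]
                area += math.exp(-((b - center) ** 2) / sigma ** 2) * count
            out[y][x] = area
    return out
```

With `bin_radius=0` the output at each pixel is the number of neighbors that share its bin, so any one-to-one remapping of intensity levels, monotonic or not, leaves the result unchanged. This is the kind of invariance the transform is designed for.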

Table 2. Table 3.1: Similarity comparison results in terms of mean absolute difference ($d_{mad}$) and different pixel ratio ($d_{dpr}$) for all piecewise linear mapping (PL), piecewise quadratic mapping (PQ), random mapping with Gaussian distribution (RG), and random mapping with uniform distribution (RU) image pairs.

Deform	ORG	GW NormalizedChromaticity	HM HistogramMatching	LC ANCC	RT Rank	LAT
$d_{mad}$	0.33	0.13	0.32	0.12	0.20	0.02
$d_{dpr}$	0.79	0.53	0.78	0.46	0.53	0.04
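The two measures reported in this table can be sketched as follows. This is only an illustration under stated assumptions: the normalizers $K_{1}$ and $K_{2}$ are taken to be one over the number of pixels, the images are assumed to be normalized to $[0, 1]$, and the function names and the threshold value used in the test are hypothetical.

```python
def mean_absolute_difference(img1, img2):
    """d_mad sketch: average absolute difference over all pixels."""
    total, count = 0.0, 0
    for row1, row2 in zip(img1, img2):
        for a, b in zip(row1, row2):
            total += abs(a - b)
            count += 1
    return total / count


def different_pixel_ratio(img1, img2, threshold):
    """d_dpr sketch: fraction of pixels whose absolute difference exceeds threshold."""
    different, count = 0, 0
    for row1, row2 in zip(img1, img2):
        for a, b in zip(row1, row2):
            if abs(a - b) > threshold:
                different += 1
            count += 1
    return different / count
```

Both functions return 0 for identical images, which is the behavior an invariant transform should approach when applied to a deformed image pair.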

Table 3. Table 3.2: Similarity comparison results in terms of mean absolute difference ($d_{mad}$) and different pixel ratio ($d_{dpr}$) for non-linear deformation as varying the parameters in LAT. In each experiment, all other parameters are fixed as initial values in Section 4.1.

	window size $l$			interval of integ. $r$			deg. of Gaussian $\sigma$
	7	11	15	1	3	5	0.1	0.3	0.5
$d_{mad}$	0.11	0.02	0.06	0.03	0.02	0.04	0.03	0.02	0.06
$d_{dpr}$	0.13	0.04	0.08	0.06	0.04	0.05	0.07	0.04	0.08

Table 4. Table 4.1: Cross-spectral template matching results for RGB-NIR Scene Dataset DB1 in terms of correct detection ratio ($r_{cd}$) and matching pixel error ($e_{mp}$).

Size	ORG	HM	LC	RT	MTM	LAT
$r_{cd}$	0.32	0.49	0.29	0.60	0.63	0.70
$e_{mp}$	318	218	359	208	158	139

Table 5. Table 4.2: Cross-spectral feature matching results for RGB-NIR Scene Dataset DB1 in terms of recognition rate.

Recognition rate	SIFT SIFT	BRIEF BRIEF	LSS LSS
Original	0.72	0.68	0.65
LAT (SIFT_LAT, BRIEF_LAT, LSS_LAT)	0.85	0.78	0.73

Table 6. Table 4.3: Stereo matching results for illumination and exposure deformed stereo image pairs in terms of bad pixel percentage (BPP) and root mean squared errors (RMSE).

	ORG	HM	LC	RT	CT	LAT
BPP	0.86	0.55	0.68	0.53	0.50	0.47
RMSE	80.2	51.6	58.8	56.1	49.8	45.8

Table 7. Table 4.4: Cross modal dense flow estimation results for multimodal database MultiModal in terms of warping error.

Algorithm	RGB-NIR	Flash-Nonflash	Diff. Exp.	All
SIFT-Flow SIFTFLOW	10.11	8.76	10.03	9.78
Variational VM	12.03	15.19	16.57	14.56
DAISY DAISY	20.42	10.84	12.71	16.16
SIFT-Flow_LAT	6.83	8.83	7.54	7.51

Table 8. Table 4.5: Cross Spectral Scene Recognition in terms of recognition top-1 accuracy. ¹ hand-crafted feature based methods, ² deep feature based methods, ³ holistic deep network based methods.

Method	Accuracy (%)
¹GIST oliva2001modeling	31.2
¹DiscrimPatches singh2012unsupervised	34.2
¹ObjectBank li2010object	41.3
²fc7-VLAD gong2014multi	49.4
²NetVLAD arandjelovic2016netvlad	53.4
²MFAFVNet li2017deep	56.5
³AlexNet krizhevsky2012imagenet	45.4
³VGGNet simonyan2014very	51.4
³ResNet he2016deep	54.4
³LAT-AlexNet (Ours)	57.5
³LAT-VGGNet (Ours)	65.5
³LAT-ResNet (Ours)	69.6

Table 9. Table 4.6: Domain Generalized Scene Recognition in terms of recognition top-1 accuracy: training on places2, testing on RGB-scene. ² deep feature based methods, ³ holistic deep network based methods.

Method	Accuracy (%)
²fc7-VLAD gong2014multi	54.3
²NetVLAD arandjelovic2016netvlad	58.7
²MFAFVNet li2017deep	59.8
³SemanticCluster george2016semantic	66.3
³AlexNet krizhevsky2012imagenet	46.2
³VGGNet simonyan2014very	48.3
³ResNet he2016deep	51.6
³LAT-AlexNet (Ours)	58.5
³LAT-VGGNet (Ours)	65.4
³LAT-ResNet (Ours)	71.2

Table 10. Table 4.7: Domain Generalized Scene Recognition in terms of recognition top-1 accuracy: training on places2, testing on NIR-scene. ² deep feature based methods, ³ holistic deep network based methods.

Method	Accuracy (%)
²fc7-VLAD gong2014multi	43.2
²NetVLAD arandjelovic2016netvlad	46.8
²MFAFVNet li2017deep	49.4
³SemanticCluster george2016semantic	54.7
³AlexNet krizhevsky2012imagenet	36.1
³VGGNet simonyan2014very	39.5
³ResNet he2016deep	41.6
³LAT-AlexNet (Ours)	51.9
³LAT-VGGNet (Ours)	56.5
³LAT-ResNet (Ours)	61.3

Equations

{\bf{I}}^{i}({\bf{p}}) = \int_{\omega} E(T, \lambda) S({\bf{p}}, \lambda) F_{i}(\lambda) \, d\lambda,

{\bf{I}}^{i}({\bf{p}}) = E(T, \lambda_{i}) S({\bf{p}}, \lambda_{i}) \upsilon_{i}.

{\bf{I}}^{i}({\bf{p}}) = \varepsilon_{i} E(T, \lambda_{i}) S({\bf{p}}, \lambda_{i}) \upsilon_{i}.

{\bf{I}}_{G}^{i}({\bf{p}}) = \frac{{\bf{I}}^{i}({\bf{p}})}{\sum_{\bf{q}} {\bf{I}}^{i}({\bf{q}})} = \frac{\varepsilon_{i} E(T, \lambda_{i}) S({\bf{p}}, \lambda_{i}) \upsilon_{i}}{\sum_{\bf{q}} \varepsilon_{i} E(T, \lambda_{i}) \upsilon_{i} S({\bf{q}}, \lambda_{i})}.

{\bf{I}}_{G}^{i}({\bf{p}}) = \frac{S({\bf{p}}, \lambda_{i})}{\sum_{{\bf{q}} \in \mathcal{N}_{\bf{p}}} S({\bf{q}}, \lambda_{i})}.

{\bf{I}}_{N}^{k}({\bf{p}}) = \frac{{\bf{I}}^{k}({\bf{p}})}{\sum_{j \in (1, n)} {\bf{I}}^{j}({\bf{p}})},

{\bf{I}}_{N}^{k}({\bf{p}}) = \frac{\varepsilon_{k} E(T, \lambda_{k}) S_{m}({\bf{p}}, \lambda_{k}) \upsilon_{k}}{K({\bf{p}})},

\mathcal{A}({\bf{p}}) = \sum_{{\bf{q}} \in \mathcal{N}_{\bf{p}}} \tau({\bf{I}}({\bf{p}}), {\bf{I}}({\bf{q}})),

\begin{array}{l} \mathcal{A}_{2}({\bf{p}}) = \sum_{{\bf{q}} \in \mathcal{N}_{\bf{p}}} \tau({\bf{I}}_{2}({\bf{p}}), {\bf{I}}_{2}({\bf{q}})) \\ \quad = \sum_{{\bf{q}} \in \mathcal{N}_{\bf{p}}} \tau(c({\bf{I}}_{1}({\bf{p}})) {\bf{I}}_{1}({\bf{p}}), c({\bf{I}}_{1}({\bf{q}})) {\bf{I}}_{1}({\bf{q}})) \\ \quad = \sum_{{\bf{q}} \in \mathcal{N}_{\bf{p}}} \tau(m {\bf{I}}_{1}({\bf{p}}), n {\bf{I}}_{1}({\bf{q}})), \end{array}

\mathcal{A}({\bf{p}}) = \sum_{{\bf{q}} \in \mathcal{N}_{\bf{p}}} \tau(\varepsilon E(T, \lambda) m({\bf{p}}) S({\bf{p}}, \lambda) \upsilon, \varepsilon E(T, \lambda) m({\bf{q}}) S({\bf{q}}, \lambda) \upsilon).

\mathcal{A}({\bf{p}}) = \sum_{{\bf{q}} \in \mathcal{N}_{\bf{p}}} \tau(m({\bf{p}}) S({\bf{p}}, \lambda), m({\bf{q}}) S({\bf{q}}, \lambda)).

\mathcal{A}^{\lambda_{i}}({\bf{p}}) = \sum_{{\bf{q}} \in \mathcal{N}_{\bf{p}}} \tau(m({\bf{p}}) S({\bf{p}}, \lambda_{i}), m({\bf{q}}) S({\bf{q}}, \lambda_{i})).

\mathcal{A}^{\lambda_{j}}({\bf{p}}) = \sum_{{\bf{q}} \in \mathcal{N}_{\bf{p}}} \tau(m({\bf{p}}) S({\bf{p}}, \lambda_{j}), m({\bf{q}}) S({\bf{q}}, \lambda_{j})).

{\bf{H}}_{\bf{p}}(b) = \sum_{{\bf{q}} \in \mathcal{N}_{\bf{p}}} Q({\bf{I}}({\bf{q}}), b), \quad b \in (1, B),

\mathcal{A}({\bf{p}}) = K_{h} \sum_{b \in (R_{k}, R_{l})} \omega(b) {\bf{H}}_{\bf{p}}(b),

\omega(b) = e^{-\frac{|b - {\bf{I}}({\bf{p}})|^{2}}{\sigma^{2}}},

d_{mad} = K_{1} \sum_{\bf{p}} |\bar{\bf{I}}_{1}({\bf{p}}) - \bar{\bf{I}}_{2}({\bf{p}})|,

d_{dpr} = K_{2} \sum_{\bf{p}} (|\bar{\bf{I}}_{1}({\bf{p}}) - \bar{\bf{I}}_{2}({\bf{p}})| > t),

mad(p, q) = \sum_{(x, y) \in N} |I_{p}(x, y) - I_{q}(x, y)|

mad_{LAT}(p, q) = \sum_{(x, y) \in N} |A_{p}(x, y) - A_{q}(x, y)|

S(p, q) = \exp\left(\frac{ssd_{pq}(x, y)}{var_{auto}}\right)

ssd_{pq}(x, y) = \sum_{(x, y)} (I_{p}(x, y) - I_{q}(x, y))^{2}

S_{LAT}(p, q) = \exp\left(\frac{sad_{pq}(x, y)}{var_{auto}}\right)

sad_{pq}(x, y) = \sum_{(x, y)} (A_{p}(x, y) - A_{q}(x, y))^{2}

\nabla I(x, y) = \left[\frac{\partial I}{\partial x} \;\; \frac{\partial I}{\partial y}\right]

\nabla A(x, y) = \left[\frac{\partial A}{\partial x} \;\; \frac{\partial A}{\partial y}\right]

LBP({\bf{p}}) = \sum_{q=0}^{Q-1} \Lambda(I_{q} - I_{\bf{p}}) 2^{q}, \quad \Lambda(x) = \begin{cases} 1, & x \geq 0 \\ 0, & \text{otherwise} \end{cases}

LBP_{LAT}({\bf{p}}) = \sum_{q=0}^{Q-1} \Lambda(A_{q} - A_{\bf{p}}) 2^{q}, \quad \Lambda(x) = \begin{cases} 1, & x \geq 0 \\ 0, & \text{otherwise} \end{cases}

x^{l} = \sum_{i \in K} \omega_{i}^{l} x_{i}^{l-1} + b^{l}

A^{l} = \sum_{i \in K} \omega_{i}^{l} A_{i}^{l-1} + b^{l}


Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques · Robotics and Sensor-Based Localization

Full text

Department of Electrical and Electronic Engineering, Yonsei University. Doctor of Philosophy, February 2019.

Local Area Transform for Cross-Modality Correspondence Matching and Deep Scene Recognition

Seungchul Ryu

Abstract

Establishing correspondences is a fundamental task in a variety of image processing and computer vision applications. In particular, finding the correspondences between a non-linearly deformed image pair induced by different modality conditions is a challenging problem. This paper describes an efficient but powerful image transform called local area transform (LAT) for modality-robust correspondence estimation. Specifically, LAT transforms an image from the intensity domain to the local area domain, which is invariant under nonlinear intensity deformations, especially radiometric, photometric, and spectral deformations. In addition, robust feature descriptors are reformulated with LAT for several practical applications. Furthermore, a LAT-convolution layer and an Aception block are proposed and, with these novel components, a deep neural network called LAT-Net is proposed especially for the scene recognition task. Experimental results show that LAT-transformed images are consistent across nonlinearly deformed images, even under random intensity deformations. LAT reduces the mean absolute difference by approximately 0.20 and the different pixel ratio by approximately 58% on average, as compared to conventional methods. Furthermore, the reformulation of descriptors with LAT shows superiority to conventional methods, which is a promising result for the tasks of cross-spectral and cross-modality correspondence matching. LAT gains an approximately 23% improvement in the correct detection ratio and a 10% improvement in the recognition rate for the tasks of RGB-NIR cross-spectral template matching and cross-spectral feature matching, respectively. LAT reduces the bad pixel percentage by approximately 15% and the root mean squared errors by 13.5 in the task of cross-radiation stereo matching. LAT also improves the cross-modal dense flow estimation task in terms of warping error, providing 50% error reduction. LAT-Net provides 14% and 7% accuracy improvements in cross spectral scene recognition and domain generalized scene recognition tasks, respectively. The local area can be considered as an alternative domain to the intensity domain for achieving robust correspondence matching and image recognition in many applications, such as feature matching, stereo matching, dense correspondence matching, image recognition, and image retrieval.


Dedication

I would like to dedicate this thesis to my loving family …

Acknowledgements.

I would like to express my sincere gratitude to my supervisor Prof. Kwanghoon Sohn for the continuous support of my Ph.D. study and related research, and for his patience, motivation, and immense knowledge. His guidance helped me throughout the research and writing of this dissertation. Besides my supervisor, I would like to thank my dissertation committee: Prof. Euntae Kim, Prof. Hyeran Byun, Prof. Sangyoun Lee, and Prof. Dongbo Min, for their insightful comments and encouragement, and also for the hard questions which incented me to widen my research from various perspectives. My sincere thanks also go to Dr. Jungdong Seo, Dr. Donghyun Kim, and Prof. Bumsub Ham, who provided me with insights about my research. Without their precious support it would not have been possible to conduct this research. I thank my fellow lab-mates, Dr. Seungryong Kim, Dr. Changae Oh, Dr. Youngjung Kim, Kihong Park, and Sunok Kim, for the discussions and for all the fun we have had in the last years. Also, I thank Dr. Cho, who provided me an opportunity to join their great team. Last but not least, I would like to thank my family: my parents, my brother, my wife, and my daughters for supporting me spiritually throughout writing this dissertation and my life in general.

1 Introduction
2 Related Works
3 Local Area Transform (LAT)
3.1 Definition of LAT
3.2 Properties of LAT
3.2.1 Invariance to non-linear intensity deformation
3.2.2 Invariance to radiometric & photometric deformations
3.2.3 Invariance to spectral deformation
3.2.4 Limitation
3.3 Implementation of LAT and Extension
3.4 Robustness Evaluation of LAT
3.5 LAT Reformulated Features: Cross-Modality Feature Descriptors
3.6 LAT-Net: Deep Scene Recognition Network
4 Cross-Modality Correspondence Matching and Deep Scene Recognition
4.1 Experimental Settings
4.2 Cross-Modality Correspondence Matching
4.2.1 Non-linear Deformation Correspondence Matching
4.2.2 Cross-Spectral Correspondence Matching
4.2.3 Cross-Radiometry Stereo Matching
4.2.4 Cross-Modality Dense Correspondence Matching
4.3 Cross-Modality Deep Scene Recognition
4.3.1 Cross Spectral Scene Recognition
4.3.2 Domain Generalized Scene Recognition
5 Conclusion

List of Figures

3.1 The original test color images
3.2 Nonlinear intensity deformation robustness of LAT.
3.3 The robustness comparison for nonlinear intensity deformations for Mustang (panels: ORG, GW, HM, LC, RT, LAT).
3.4 The robustness comparison for nonlinear intensity deformations for Airplane (panels: ORG, GW, HM, LC, RT, LAT).
3.5 The robustness comparison for nonlinear intensity deformations for Pepper (panels: ORG, GW, HM, LC, RT, LAT).
3.6 Aception block
3.7 The structure of LAT-AlexNet.
3.8 The structure of LAT-VGGNet.
3.9 The structure of LAT-ResNet.
4.1 Feature matching on simulated database (panels: PL, PQ, RG, RU).
4.2 Recognition rate for simulated database.
4.4 Qualitative results of cross-spectral template matching for Lobby (panels: ORG, GW, HM, LC, RT, LAT).
4.5 Qualitative results of cross-spectral template matching for Buildings (panels: ORG, GW, HM, LC, RT, LAT).
4.6 An example of cross spectral feature matching. Top: LSS and Bottom: LSS_LAT
4.7 Stereo matching results with a cost function size of 25 for Baby1 stereo pair with Left(1/1) and right(3/1).
4.8 Qualitative results of robust stereo matching for Cloth2 (panels: ORG, GW, HM, LC, RT, LAT).
4.9 Qualitative results of robust stereo matching for Aloe (panels: ORG, GW, HM, LC, RT, LAT).

List of Tables

3.1 Similarity comparison results in terms of mean absolute difference ( $d_{mad}$ ) and different pixel ratio ( $d_{dpr}$ ) for all image pairs.
3.2 Similarity comparison results in terms of mean absolute difference ( $d_{mad}$ ) and different pixel ratio ( $d_{dpr}$ ) for non-linear deformation as varying the parameters in LAT.
4.1 Cross-spectral template matching results for RGB-NIR Scene Dataset.
4.2 Cross-spectral feature matching results for RGB-NIR Scene Dataset in terms of recognition rate.
4.3 Stereo matching results for illumination and exposure deformed stereo image pairs.
4.4 Cross modal dense flow estimation results for multimodal database in terms of warping error.
4.5 Cross Spectral Scene Recognition in terms of recognition top-1 accuracy.
4.6 Domain Generalized Scene Recognition in terms of recognition top-1 accuracy: RGB
4.7 Domain Generalized Scene Recognition in terms of recognition top-1 accuracy: NIR

Chapter 1 Introduction

Correspondence matching is a fundamental task in a vast range of image processing and computer vision applications: image denoising app1; trinh2014novel, image editing app2; bugeau2014variational, object tracking app3, stereo matching app4, optical flow revaud2015epicflow, image retrieval babenko2015aggregating, image recognition deng2009imagenet; uijlings2013selective, and scene recognition kwitt2012scene; su2012improving. Conventional correspondence matching algorithms are commonly based on gradient-based descriptors SIFT; bay2008speeded; HOG. In the real world, however, images are acquired in uncontrolled environments; thus, an image may suffer intensity deformations due to changes in illumination conditions, camera photometric parameters, viewing positions, and so on problem. Furthermore, cross-modality imaging systems (e.g., multi-spectral imaging systems DB1; sorensen2015multimodal) have recently attracted much attention as a way to address challenging problems occurring in conventional unimodal imaging systems. Images acquired from different modalities also have intensity deformations due to changes in sensor responses and spectral distributions.

These deformations between patches or images make correspondence matching inaccurate. Let ${{\bf{I}}_{1}}$ and ${{\bf{I}}_{2}}$ be two input images, and $\alpha({\bf{p}})\in{\bf{I}}_{2}$ be the pixel corresponding to ${\bf{p}}\in{\bf{I}}_{1}$. When dealing with correspondence matching under uncontrolled environments or multiple modalities, three groups of approaches have been considered: tone mapping, color constancy, and robust similarity measures. The first group, called tone mapping, attempts to determine a mapping function $\mathcal{M}$ such that $\mathcal{M}\{{{\bf{I}}_{1}}({\bf{p}})\}={{\bf{I}}_{2}}({\alpha({\bf{p}})})$. A classic method for extracting $\mathcal{M}$ is histogram matching HistogramMatching, which computes a mapping function that optimally aligns the histogram of ${\bf{I}}_{1}$ with that of ${\bf{I}}_{2}$. Several methods compute a mapping function $\mathcal{M}$ based on the statistical distribution of intensity values Statistical1. More sophisticated mapping functions are reviewed in ToneMappingReview2. Tone mapping approaches commonly assume that ${\bf{I}}_{1}$ and ${\bf{I}}_{2}$ are entirely aligned to the same scene regions. This assumption clearly holds only when the images are taken from the same viewpoint under the same illumination condition; in other cases the obtained mapping function $\mathcal{M}$ might be erroneous and inconsistent.

The second group, called color constancy, tries to find a model $\mathcal{S}$ that transforms images into a constant color space by removing illumination components, such that $\mathcal{S}\{{{\bf{I}}_{1}}({\bf{p}})\}=\mathcal{S}\{{{\bf{I}}_{2}}({\alpha({\bf{p}})})\}$. One of the most popular methods is the grey-world model, which removes the illumination spectral distribution factor under the assumption that, under a white light source, the average color in a scene is achromatic (i.e., grey) NormalizedChromaticity. Another well-known method, the white patch retinex model, assumes that the maximum response in an image is caused by a perfect reflectance (i.e., a white patch). In practice, this assumption is relaxed by considering the color channels separately, resulting in the max-RGB algorithm. The normalized chromaticity model is commonly used to eliminate the lighting geometry factors under the Lambertian reflectance model NormalizedChromaticity. Gamut mapping and other learning based algorithms have also been investigated LearningColorConstancy2. However, most models cannot remove the dependency on the lighting geometry and the illumination spectral distribution simultaneously, as will be discussed in Chapter 2.

The third group, called robust similarity measures, attempts to describe a local signature within a patch that is invariant to a nonlinear deformation. In some cases, an intensity deformation is nonlinear but still monotonic, i.e., the order of intensity levels is preserved. Similarity measures based on such ordinal values include local binary pattern (LBP) LBP, binary robust independent elementary features (BRIEF) BRIEF, rank transform (RT) Rank, and census transform (CT) census. Although these ordinal information based approaches account for a monotonic mapping, they fail under a non-monotonic intensity deformation.
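The behavior of ordinal measures can be illustrated with a small sketch of the rank transform. This is an assumed, simplified formulation (the rank of a pixel is the number of neighbors with a strictly smaller intensity, with windows clipped at the border), and the function name `rank_transform` is hypothetical.

```python
def rank_transform(image, half_window=1):
    """Rank transform sketch: count neighbors strictly darker than the center."""
    h, w = len(image), len(image[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            center = image[y][x]
            count = 0
            for yy in range(max(0, y - half_window), min(h, y + half_window + 1)):
                for xx in range(max(0, x - half_window), min(w, x + half_window + 1)):
                    if image[yy][xx] < center:
                        count += 1
            out[y][x] = count
    return out


image = [[1, 5, 3], [7, 2, 8], [4, 6, 0]]
# A monotonic (order-preserving) deformation of the intensities.
monotonic = [[v * v + 1 for v in row] for row in image]
# A non-monotonic deformation: the order of intensity levels is not preserved.
non_monotonic = [[(v - 4) ** 2 for v in row] for row in image]
```

Under the monotonic mapping the rank-transformed images are identical, while under the non-monotonic mapping they differ, which is the failure case described above.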

Gradient-based similarity measures, such as histogram of gradients (HOG) HOG and scale invariant feature transform (SIFT) SIFT, have been considered photometric invariant similarity measures. Such methods, however, inherently lose information because they contract the data, which weakens their discrimination power, and they fail under a non-monotonic mapping. Normalized cross correlation (NCC) measures the cosine of the angle between two vectors, and thus is robust to a linear intensity deformation. To address the inaccuracy of NCC at object boundaries, adaptive normalized cross correlation (ANCC) is proposed in ANCC. In MTM, a generalized version of NCC is proposed, which is called matching by tone mapping (MTM). Mutual information (MI) MI is a widely used similarity measure for images with nonlinear deformations. MI measures the statistical dependence between two vectors $v_{1}$ and $v_{2}$ by computing the loss of entropy in $v_{1}$ given $v_{2}$.
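The robustness of NCC to linear deformations can be checked numerically. The sketch below assumes the zero-mean variant of NCC, which is invariant to both gain and offset; the function name `zero_mean_ncc` and the sample vectors are hypothetical.

```python
import math


def zero_mean_ncc(u, v):
    """Cosine of the angle between the mean-subtracted vectors u and v."""
    mean_u = sum(u) / len(u)
    mean_v = sum(v) / len(v)
    du = [a - mean_u for a in u]
    dv = [b - mean_v for b in v]
    numerator = sum(a * b for a, b in zip(du, dv))
    denominator = math.sqrt(sum(a * a for a in du) * sum(b * b for b in dv))
    return numerator / denominator


patch = [1.0, 2.0, 4.0, 7.0]
linear = [3.0 * a + 5.0 for a in patch]        # gain and offset change
nonlinear = [(a - 3.0) ** 2 for a in patch]    # non-monotonic deformation
```

The score between `patch` and `linear` is 1 up to rounding, whereas the score between `patch` and `nonlinear` drops well below 1, so a plain correlation measure does not survive a general nonlinear deformation.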

To summarize, conventional methods have tried to solve the problem of nonlinear intensity deformation by adjusting intensity values to be similar or by utilizing gradients, ordinal information, or statistical measures. However, these approaches cannot account for a general nonlinear intensity deformation. This paper proposes to use local area information as a robust index for nonlinear intensity deformations. We define the local area transform (LAT) as a robust mapping of an image from the intensity domain to a local area domain. LAT is designed to address the nonlinear deformation problem of images which may be acquired with different photometric parameters, light sources, and modalities. The objective of LAT is similar to that of color constancy, i.e., transferring an image from the original intensity (or color) values to a constant intensity (or color) domain. However, unlike color constancy, LAT maps an image from the intensity domain, which is sensitive to nonlinear deformation, to the robust local area domain. Ordinal transforms such as LBP, RT, and CT also aim to transfer an image to an ordinal information domain, but fail under a non-monotonic intensity deformation. To our knowledge, this study is the first attempt to address the nonlinear deformation problem with local area information in the task of correspondence matching.

This study proves that LAT is a robust image transform for non-linear intensity, radiometric, photometric, and spectral deformations. An efficient implementation of LAT based on the integral histogram is also proposed. Besides its use as a transformation, the concept of LAT is extended to reformulate conventional robust feature descriptors such as SIFT, LSS, CT, and RT. The reformulation embeds the properties of LAT into the conventional feature descriptors. The reformulated descriptors show superior performance in the tasks of non-linear deformation correspondence matching, cross-spectral correspondence matching, cross-radiometry stereo matching, and cross-modality dense correspondence matching. Furthermore, novel deep networks are proposed to address the cross-domain scene recognition problem. In the proposed deep scene recognition network, conventional convolutional layers are replaced by LAT-convolution layers and the Aception block is introduced. The proposed deep scene recognition networks outperform the conventional methods in the tasks of cross-spectral scene recognition and domain generalized scene recognition.

The remainder of this dissertation is organized as follows. In Chapter 2, related literature is presented. In Chapter 3, LAT is described with its properties and implementation details. LAT-reformulated features and LAT-Net are also presented. In Chapter 4, the performance of LAT is evaluated in the tasks of nonlinear-deformed image matching, cross spectral correspondence matching, cross radiometry stereo matching, cross modal dense flow estimation, and cross modality scene recognition. Chapter 5 concludes the dissertation with a discussion.

Chapter 2 Related Works

An image taken by a linear imaging device with the $i^{th}$ sensor is modeled as ImageModel:

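$${{\bf{I}}^{i}}({\bf{p}})=\int_{\omega}E(T,\lambda)S({\bf{p}},\lambda)F_{i}(\lambda)\,d\lambda, \tag{2.1}$$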

where ${{\bf{I}}^{i}}({\bf{p}})$ denotes the sensor response at a point $\bf{p}$ in spatial coordinates, $E(T,\lambda)$ represents the spectral distribution of the incident illuminant, $S({\bf{p}},\lambda)$ represents the surface reflectance at $\bf{p}$, and $F_{i}(\lambda)$ represents the spectral response of the sensor. Approximating the sensor spectral response $F_{i}(\lambda)$ as a Dirac delta function such that $F_{i}(\lambda)=\upsilon_{i}\delta(\lambda-\lambda_{i})$, (2.1) is simplified as follows:

$${{\bf{I}}^{i}}({\bf{p}})=E(T,\lambda_{i})S({\bf{p}},\lambda_{i})\upsilon_{i}. \tag{2.2}$$

Under Planck’s law, the spectral distribution of the illuminant $E(T,\lambda_{i})$ is modeled as a function of the absolute temperature $T$ and the wavelength $\lambda_{i}$ as $E(T,\lambda_{i})={c_{1}}{\lambda_{i}}^{-5}e^{-{c_{2}}/({\lambda_{i}}T)}$, where ${c_{1}}\triangleq 2h{c^{2}}$, ${c_{2}}\triangleq\frac{hc}{k}$, $c$ is the speed of light, $h$ is Planck’s constant, and $k$ is the Boltzmann constant. The surface reflectance $S({\bf{p}},\lambda_{i})$ is represented as $S({\bf{p}},\lambda_{i})=m({\bf{p}})S_{m}({\bf{p}},\lambda_{i})$, where $m(\bf{p})$ is a lighting geometry factor and $S_{m}({\bf{p}},\lambda_{i})$ is the reflectance under the assumption of a matte surface. Taking the exposure time $\varepsilon_{i}$ into consideration, the image acquisition model in (2.2) is modified as

$${{\bf{I}}^{i}}({\bf{p}})=\varepsilon_{i}E(T,\lambda_{i})S({\bf{p}},\lambda_{i})\upsilon_{i}. \tag{2.3}$$
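As a numerical illustration of the acquisition model ${{\bf{I}}^{i}}({\bf{p}})=\varepsilon_{i}E(T,\lambda_{i})m({\bf{p}})S_{m}({\bf{p}},\lambda_{i})\upsilon_{i}$, the sketch below evaluates the illuminant term with the exponent taken as negative (the Wien form of Planck's law, which is an assumption of this example) and combines it with the other factors. The function names, the narrow-band wavelengths, and the color temperatures used in the test are hypothetical.

```python
import math

H = 6.62607015e-34    # Planck constant [J s]
C = 2.99792458e8      # speed of light [m/s]
K_B = 1.380649e-23    # Boltzmann constant [J/K]
C1 = 2.0 * H * C ** 2
C2 = H * C / K_B


def illuminant(temperature, wavelength):
    """E(T, lambda) = c1 * lambda^-5 * exp(-c2 / (lambda * T)); wavelength in meters."""
    return C1 * wavelength ** -5 * math.exp(-C2 / (wavelength * temperature))


def sensor_response(exposure, temperature, wavelength, geometry, matte_reflectance, gain):
    """Narrow-band sensor response: exposure * E * m(p) * S_m * sensor gain."""
    return exposure * illuminant(temperature, wavelength) * geometry * matte_reflectance * gain
```

Changing the lighting geometry factor scales every channel by the same amount, while changing the color temperature changes the ratio between channels, so the two effects cannot be removed by a single global scaling.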

When images are acquired in an uncontrolled environment or in a cross-modality system, they suffer from nonlinear deformations induced by the different modalities. To address the correspondence problem under uncontrolled environments or multiple modalities, three groups of approaches have been explored: tone mapping, color constancy, and robust similarity measures. Color constancy is closely related to the proposed LAT. Color constancy tries to find a model $\mathcal{S}$ that transforms images into a constant color space by removing illumination components, such that $\mathcal{S}\{{{\bf{I}}_{1}}({\bf{p}})\}=\mathcal{S}\{{{\bf{I}}_{2}}({\alpha({\bf{p}})})\}$. One of the most popular methods is the grey-world model, which removes the illumination spectral distribution factor under the assumption that, under a white light source, the average color in a scene is achromatic (i.e., grey) NormalizedChromaticity. Another well-known method, the white patch retinex model, assumes that the maximum response in an image is caused by a perfect reflectance (i.e., a white patch). In practice, this assumption is relaxed by considering the color channels separately, resulting in the max-RGB algorithm. The normalized chromaticity model is commonly used to eliminate the lighting geometry factors under the Lambertian reflectance model NormalizedChromaticity. Gamut mapping and other learning based algorithms have also been investigated LearningColorConstancy2. However, most models cannot remove the dependency on the lighting geometry and the illumination spectral distribution simultaneously. More recently, deep neural network based color constancy methods have also been explored bianco2015color; barron2015convolutional; oh2017approaching; hu2017fc4.

The grey world model estimates the illuminant by averaging channel values under the assumption that the average reflectance in an image is achromatic, and is proven to be an instantiation of the Minkowski norm ($\rho=1$) Minkowski. The grey world model ${{\bf{I}}_{G}}^{i}({\bf{p}})$ is then computed as follows:

[TABLE]

In practice, it is computed within local neighbors $\mathcal{N}_{\bf{p}}$ with the assumption of $E(T,\lambda_{i})$ to be locally constant, thus (2.4) is simplified as:

[TABLE]

(2.5) implies that the grey-world model is invariant to an illumination deformation under the local-constancy assumption. However, when dealing with images acquired by different modalities (e.g., cross-spectral), $S$ undergoes a non-linear deformation, so the grey-world model no longer guarantees robustness to spectral deformations.

The $k^{th}$ -channel normalized chromaticity ${{\bf{I}}_{N}}^{k}({\bf{p}})$ NormalizedChromaticity eliminates the effect of the lighting geometry by dividing each channel response by the average of them as follows:

[TABLE]

where $n$ is the number of channels. Substituting (3), (6) is simplified as

[TABLE]

where $K({\bf{p}})=\sum\limits_{j\in(1,n)}{\varepsilon_{k}E(T,{\lambda_{j}}){S_{m}}({\bf{p}},{\lambda_{j}}){\upsilon_{j}}}$ . (7) indicates that the normalized chromaticity only removes the lighting geometry factor $m(\bf{p})$ . Log-chromaticity ANCC, defined as ${{\bf{I}}_{l}}^{k}({\bf{p}})=\log({{\bf{I}}^{k}}({\bf{p}})/\sqrt[n]{{\prod\limits_{j\in(1,n)}{{{\bf{I}}^{j}}({\bf{p}})}}})$ , transforms a nonlinear deformation into a linear deformation. However, neither the normalized chromaticity nor the log-chromaticity is applicable to single-channel images, e.g., infra-red images.
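As a minimal sketch of the two chromaticity models above (NumPy is assumed; the function names and the synthetic shading factor `m` are illustrative and not taken from the original work), the following checks that both transforms cancel a per-pixel geometry factor, and that they collapse to a constant on a single-channel image:

```python
import numpy as np

def normalized_chromaticity(img):
    """Divide each channel by the per-pixel average over the n channels."""
    return img / img.mean(axis=-1, keepdims=True)

def log_chromaticity(img):
    """log(I^k / geometric mean of the channels), computed in the log domain."""
    log_img = np.log(img)
    return log_img - log_img.mean(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
img = rng.uniform(0.1, 1.0, size=(4, 4, 3))   # strictly positive 3-channel image
m = rng.uniform(0.5, 2.0, size=(4, 4, 1))     # hypothetical per-pixel geometry factor m(p)
shaded = m * img

geometry_removed = np.allclose(normalized_chromaticity(img),
                               normalized_chromaticity(shaded))
log_geometry_removed = np.allclose(log_chromaticity(img),
                                   log_chromaticity(shaded))

# With one channel the average equals the channel itself, so no information survives.
single = rng.uniform(0.1, 1.0, size=(4, 4, 1))
single_is_constant = np.allclose(normalized_chromaticity(single), 1.0)
```

The single-channel case illustrates why these models cannot be used on, for example, an infra-red image: every pixel maps to the same value.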

Tone mapping algorithms attempt to construct a mapping function $\mathcal{M}$ such that $\mathcal{M}\{{{\bf{I}}_{1}}({\bf{p}})\}={{\bf{I}}_{2}}({\alpha({\bf{p}})})$ . A classic method for extracting $\mathcal{M}$ is histogram matching HistogramMatching, which computes a mapping function that optimally aligns the histogram of ${\bf{I}}_{1}$ with that of ${\bf{I}}_{2}$ . Several methods compute a mapping function $\mathcal{M}$ based on the statistical distribution of intensity values Statistical1; eilertsen2015real. More sophisticated mapping functions are reviewed in ToneMappingReview2. Tone mapping approaches commonly assume that ${\bf{I}}_{1}$ and ${\bf{I}}_{2}$ are entirely aligned, i.e., that they cover the same scene regions. This assumption clearly holds only when the images are taken from the same viewpoint under the same illumination condition; in other cases the obtained mapping function $\mathcal{M}$ may be erroneous and inconsistent. Histogram matching, the most common tone mapping scheme, aligns the histogram of ${\bf{I}}_{1}$ to that of ${\bf{I}}_{2}$ when they are acquired from the same scene at the same viewpoint, i.e., $\alpha({\bf{p}})={\bf{p}}$ . However, this assumption is too restrictive for practical environments. In addition, histogram matching is stable only for global deformations and offers no guarantee for local deformations.

Robust similarity measures attempt to describe a local signature within a patch that is invariant to a nonlinear deformation. In some cases, an intensity deformation is nonlinear but still maintains monotonicity, i.e., the order of intensity levels is preserved. Similarity measures based on such ordinal values include local binary pattern (LBP) LBP, binary robust independent elementary features (BRIEF) BRIEF, rank transform (RT) Rank, and census transform (CT) census. Although these approaches based on ordinal information account for a monotonic mapping, they fail under a non-monotonic intensity deformation.

Gradient-based similarity measures, such as histogram of gradients (HOG) HOG and scale invariant feature transform (SIFT) SIFT, have been considered photometric-invariant similarity measures. Such methods, however, inherently lose information because they contract the data, which weakens their discriminative power, and they fail under a non-monotonic mapping. Recently, the dense adaptive self-correlation (DASC) descriptor has been proposed to provide robustness to modality variations, but it also has limitations under non-linear deformations kim2015dasc. Normalized cross correlation (NCC) measures the cosine of the angle between two vectors, and thus is robust to a linear intensity deformation. To address the inaccuracy of NCC at object boundaries, adaptive normalized cross correlation (ANCC) is proposed in ANCC. In MTM, a generalized version of NCC called matching by tone mapping (MTM) is proposed. Mahalanobis distance cross-correlation (MDCC) has also been proposed kim2014mahalanobis. Mutual information (MI) MI is a widely used similarity measure for images with nonlinear deformations. MI measures the statistical dependence between two vectors $v_{1}$ and $v_{2}$ by computing the loss of entropy in $v_{1}$ given $v_{2}$ . Recently, deep learning based similarity measures have also been actively studied chen2015deep; kim2017fcss; han2017scnet; ufer2017deep.

Under a linear deformation written as ${{\bf{I}}_{2}}(\alpha({\bf{p}}))=a\,{{\bf{I}}_{1}}({\bf{p}})+a^{\prime}$ where $a$ and $a^{\prime}$ are constants, the gradient is scaled by the factor $a$ : $\Delta{{\bf{I}}_{2}}(\alpha({\bf{p}}))=a\,\Delta{{\bf{I}}_{1}}({\bf{p}})$ , thus gradient information can be a robust feature when $a>0$ . However, when $a<0$ gradient inversion occurs, which degrades the accuracy of gradient-based similarity measures such as HOG and SIFT. When the deformation is non-linear, the gradients are not preserved across the deformation. In some cases, the intensity deformation is nonlinear but still maintains monotonicity, i.e., the order of intensity levels is preserved as $\forall{\bf{p}},{\bf{q}}\;\;{\rm{if}}\;{{\bf{I}}_{1}}({\bf{p}})\leq{{\bf{I}}_{1}}({\bf{q}}),\;\;{{\bf{I}}_{2}}(\alpha({\bf{p}}))\leq{{\bf{I}}_{2}}(\alpha({\bf{q}}))$ . Intensity ordinal similarity measures, such as LBP, RT, and CT, are robust under the monotonicity assumption, but the assumption is violated by a general non-linear deformation. The local intensity order is not preserved across such a deformation, which degrades the accuracy of intensity ordinal similarity measures.
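The three cases discussed above can be checked on a toy signal. The sketch below is illustrative only: the sample values, the constants `a` and `a_prime`, and the two non-linear mappings (a square root and the distance from mid-grey) are assumptions chosen for the example, and `order_bits` is a simplified stand-in for an ordinal measure such as CT.

```python
import numpy as np

i1 = np.array([10.0, 50.0, 200.0, 120.0, 30.0, 250.0, 90.0])  # one image row

def order_bits(x):
    """Ordinal signature: is each sample at least as large as its left neighbour?"""
    return x[1:] >= x[:-1]

# Linear deformation with a < 0: gradients are scaled by a, so their signs flip.
a, a_prime = -0.5, 200.0
linear = a * i1 + a_prime
gradient_scaled = np.allclose(np.diff(linear), a * np.diff(i1))
gradient_inverted = np.array_equal(np.sign(np.diff(linear)),
                                   -np.sign(np.diff(i1)))

# Monotonic non-linear deformation (square root): intensity order is preserved.
monotonic = 255.0 * np.sqrt(i1 / 255.0)
order_kept = np.array_equal(order_bits(monotonic), order_bits(i1))

# Non-monotonic deformation (distance from mid-grey): intensity order is broken.
non_monotonic = np.abs(i1 - 128.0)
order_broken = not np.array_equal(order_bits(non_monotonic), order_bits(i1))
```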

One of the most important applications in computer vision is image recognition. In particular, scene image recognition is an important problem for applications of computer vision such as robotics, image search, and geo-localization. However, scene recognition is a challenging problem because scenes commonly include both a holistic component and object-based components. Conventional methods for scene recognition can be categorized into holistic gist descriptors oliva2001modeling and local feature based descriptors nowak2006sampling. Local feature based approaches were mainly based on the bag-of-features (BoF) representation, using local features such as SIFT or HOG kwitt2012scene; li2010object; su2012improving, combined through a pooling operator. Sophisticated pooling strategies such as the vector of locally aggregated descriptors (VLAD) su2012improving or the Fisher vector (FV) sanchez2013image emerged as the dominant mechanism for scene recognition.

In recent years, convolutional neural networks (CNNs) have become the feature extractors of choice for scene recognition. The earlier success of sophisticated pooling led many studies to use CNNs as local feature extractors. Early methods adopted a BoF-like approach, based on the extraction of features from intermediate CNN layers, which were then fed to dictionary learning methods such as clustering gong2014multi or sparse coding dixit2015scene and pooled by VLAD gong2014multi or Fisher vector liu2014encoding. In liu2014encoding, a semantic Fisher vector was proposed, converting features from the probability space to the natural parameter space. In li2017deep, a mixture of factor analyzers Fisher vector was proposed. However, these methods suffer from two drawbacks: 1) the Fisher vector structure is not easy to integrate into a CNN, and 2) they are too high-dimensional. These drawbacks prevent end-to-end training and thus lead to sub-optimal solutions. Recently, VLAD and Fisher vectors have been embedded into CNN architectures by deriving neural network implementations of their equations. arandjelovic2016netvlad proposed NetVLAD, an embedded implementation of the VLAD descriptor, and tang2016deep proposed Deep FisherNet, an embedded implementation of the GMM Fisher vector.

CNNs trained on ImageNet donahue2014decaf struggled to yield better scene recognition results than hand-designed features combined with a sophisticated classifier sanchez2013image. This can be ascribed to the fact that scene data has very distinct characteristics from object classification data. To overcome this problem, zhou2014learning; zhou2017places trained a scene-centric CNN by constructing a large-scale scene dataset, called Places, resulting in a significant performance improvement.

In real-world applications, scene images are frequently taken under very different imaging conditions, sensor specifications, and weather. In such a cross-domain setting, common scene recognition algorithms frequently fail to achieve good performance. To address the dataset bias problem, many domain adaptation approaches bruzzone2010domain; duan2012domain; baktashmotlagh2013unsupervised have been proposed to reduce the mismatch between the data distributions of the training samples and target samples. In george2016semantic, semantic clustering (SC) was proposed as a domain generalization method for fine-grained scene recognition. (Unlike domain adaptation, in domain generalization the knowledge learnt from one or multiple source domains is transferred to an unseen target domain.)

Chapter 3 Local Area Transform (LAT)

3.1 Definition of LAT

In this paper, we propose to use local area information as a robust index for a nonlinear intensity deformation. Let $\bf{I}$ be an input image, $\bf{p}$ be the current pixel, $\bf{q}$ be a neighboring pixel, and $\mathcal{N}_{\bf{p}}$ be a set of neighboring pixels. Denoting by $\Psi$ the set of pixels whose intensity value is similar to that of $\bf{p}$ , such that $\Psi=\{{\bf{\hat{q}}}|{\bf{I}}({\bf{\hat{q}}})\approx{\bf{I}}({\bf{p}}),{\bf{\hat{q}}}\in\mathcal{N}_{\bf{p}}\}$ where $\approx$ means that they have similar values, the local area is defined as the area of $\Psi$ . We define the mapping of an image from the intensity domain to the local area domain as the local area transform (LAT). LAT is designed to address the matching problem for a non-linearly deformed image pair, which might be acquired with different radiometric parameters, different photometric parameters, or different modalities (including different spectra). The LAT at a pixel $\bf{p}$ , $\mathcal{A}(\bf{p})$ , is computed as follows:

[TABLE]

where $\tau(x,y)=\begin{cases}s(x,y)&\text{if }s(x,y)<thr\\0&\text{else}\end{cases}$ is a thresholding function defined in terms of a similarity function $s$ . $s$ is modeled according to the usage and application. For example, $s$ can be an equality check, or a similarity in the spatial, intensity, or gradient domain. When $s$ is modeled as an equality check, $\tau(x,y)$ is defined as the indicator function $\tau(x,y)=\begin{cases}1&\text{if }x=y\\0&\text{else}\end{cases}$ with property 1: $\tau(kx,ky)=\tau(x,y)$ where $k\in\mathbb{N},\;k\neq 0$ , and property 2: $\tau(x_{1},y_{1})=1$ and $\tau(x_{2},y_{2})=1$ $\Rightarrow$ $\tau(x_{1}x_{2},y_{1}y_{2})=1$ .
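A brute-force sketch of the equality-check case may help fix ideas: for each pixel it counts the neighbors with the same intensity. The function name `lat_equality`, the square window of half-width `half`, the border clipping, and the inclusion of $\bf{p}$ itself in $\mathcal{N}_{\bf{p}}$ are assumptions of this sketch rather than details stated in the text.

```python
import numpy as np

def lat_equality(img, half=1):
    """Local area transform with the equality-check tau.

    For each pixel p, count the pixels q in the (2*half+1)^2 window around p
    with img[q] == img[p]. The window is clipped at the image border and
    includes p itself (both are assumptions of this sketch).
    """
    h, w = img.shape
    out = np.zeros((h, w), dtype=np.int64)
    for y in range(h):
        for x in range(w):
            win = img[max(0, y - half):y + half + 1,
                      max(0, x - half):x + half + 1]
            out[y, x] = np.count_nonzero(win == img[y, x])
    return out

img = np.array([[1, 1, 2],
                [1, 3, 2],
                [4, 4, 2]])
area = lat_equality(img)
# area is [[3, 3, 2],
#          [3, 1, 3],
#          [2, 2, 2]]
```

The isolated value 3 in the center maps to a local area of 1, while the connected runs of 1s and 2s map to larger areas.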

3.2 Properties of LAT

A variety of real-world computer vision applications require invariance properties, especially in uncontrolled environments. This section derives the invariance of LAT to non-linear intensity deformations, especially radiometric, photometric, and spectral deformations.

3.2.1 Invariance to non-linear intensity deformation

For a registered input image pair ${\bf{I}}_{1}$ and ${\bf{I}}_{2}$ , a non-linear intensity deformation between ${\bf{I}}_{1}$ and ${\bf{I}}_{2}$ can be represented as $\mathcal{D}\{{{\bf{I}}_{2}}({\bf{p}})\}=c({{\bf{I}}_{1}}({\bf{p}})){{\bf{I}}_{1}}({\bf{p}})$ where $c(\cdot)$ is an intensity mapping operator. Then, $\mathcal{A}_{2}(\bf{p})$ is written as follows:

[TABLE]

where $m$ and $n$ are constant values that vary according to ${\bf{I}}_{1}(\bf{p})$ and ${\bf{I}}_{1}(\bf{q})$ . For the case of ${\bf{I}}_{1}(\bf{p})={\bf{I}}_{1}(\bf{q})$ and consequently $m=n$ , property 1 of the function $\tau$ gives ${\tau(m{{\bf{I}}_{1}}({\bf{p}}),n{{\bf{I}}_{1}}({\bf{q}}))}={\tau({{\bf{I}}_{1}}({\bf{p}}),{{\bf{I}}_{1}}({\bf{q}}))}$ . For the case of ${\bf{I}}_{1}(\bf{p})\neq{\bf{I}}_{1}(\bf{q})$ and consequently $m\neq n$ , under the assumption that the deformation function $\mathcal{D}$ is a one-to-one mapping, $\tau(m{\bf{I}}_{1}({\bf{p}}),n{\bf{I}}_{1}({\bf{q}}))$ is also equal to $\tau({\bf{I}}_{1}(\bf{p}),{\bf{I}}_{1}(\bf{q}))$ . From these equalities, ${\mathcal{A}_{2}}({\bf{p}})=\sum\limits_{{\bf{q}}\in N_{\bf{p}}}{\tau({{\bf{I}}_{1}}({\bf{p}}),{{\bf{I}}_{1}}({\bf{q}}))}={\mathcal{A}_{1}}({\bf{p}})$ . In other words, LAT is invariant to non-linear intensity deformations.
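The argument above can be checked numerically for the equality-check $\tau$ . In the sketch below, the one-to-one deformation is modeled as a random permutation lookup table over 16 intensity levels; the table, the image size, and the helper `lat_equality` (with a clipped window that includes the center pixel) are assumptions made for illustration.

```python
import numpy as np

def lat_equality(img, half=2):
    """Equality-check LAT: count same-valued pixels in a clipped window around p."""
    h, w = img.shape
    out = np.zeros((h, w), dtype=np.int64)
    for y in range(h):
        for x in range(w):
            win = img[max(0, y - half):y + half + 1,
                      max(0, x - half):x + half + 1]
            out[y, x] = np.count_nonzero(win == img[y, x])
    return out

rng = np.random.default_rng(0)
i1 = rng.integers(0, 16, size=(32, 32))   # image with 16 intensity levels

# Arbitrary one-to-one intensity mapping: neither linear nor monotonic.
lut = rng.permutation(16) * 7 + 3
i2 = lut[i1]

invariant = np.array_equal(lat_equality(i1), lat_equality(i2))
```

Because a one-to-one mapping preserves which pixels share an intensity, the two transformed images coincide even though the intensities themselves are scrambled.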

3.2.2 Invariance to radiometric & photometric deformations

Substituting (3) into (8), $\mathcal{A}({\bf{p}})$ is rewritten as follows:

[TABLE]

Under the assumption of local constancy of $E(T,\lambda)$ and the fact that $\varepsilon$ and $v$ are constant values, (10) is simplified with property 1 of the function $\tau$ as:

[TABLE]

(11) indicates that LAT is independent of the illumination spectral distribution $E(T,\lambda)$ and the exposure time $\varepsilon$ , i.e., it is invariant to illumination and exposure deformations (corresponding to radiometric and photometric deformations, respectively).

3.2.3 Invariance to spectral deformation

Let $\mathcal{A}^{\lambda_{i}}(\bf{p})$ be the LATransformed value of an image captured by the $i^{th}$ sensor with ${\lambda_{i}}$ (e.g., visible spectrum) and $\mathcal{A}^{\lambda_{j}}(\bf{p})$ be the LATransformed value of an image captured by the $j^{th}$ sensor with ${\lambda_{j}}$ (e.g., infra-red spectrum). We show that $\mathcal{A}^{\lambda_{i}}({\bf{p}})=\mathcal{A}^{\lambda_{j}}({\bf{p}})$ , i.e., the invariance of LAT to a spectral deformation, as follows. From (11), $\mathcal{A}^{\lambda_{i}}(\bf{p})$ and $\mathcal{A}^{\lambda_{j}}(\bf{p})$ are written as (12) and (13), respectively.

[TABLE]

We assume that pixels having the same spectral reflectance value for a specific wavelength also have the same spectral reflectance value for another wavelength, i.e., $\forall{\bf{p}}\neq{\bf{q}}\;\;{\rm{if}}\;S({\bf{p}},{\lambda_{i}})=S({\bf{q}},{\lambda_{i}}),\;\;S({\bf{p}},{\lambda_{j}})=S({\bf{q}},{\lambda_{j}})$ . Under this assumption and property 2 of the function $\tau$ , ${\tau\left({{{m(\bf{p})}S({\bf{p}},{\lambda_{i}}),{m(\bf{q})}S({\bf{q}},{\lambda_{i}})}}\right)}={\tau\left({{{m(\bf{p})}S({\bf{p}},{\lambda_{j}}),{m(\bf{q})}S({\bf{q}},{\lambda_{j}})}}\right)}$ when ${m(\bf{p})}={m(\bf{q})}$ . For the case of ${m(\bf{p})}\neq{m(\bf{q})}$ , ${\tau\left({{{m(\bf{p})}S({\bf{p}},{\lambda}),{m(\bf{q})}S({\bf{q}},{\lambda})}}\right)}$ is commonly $0$ , except when $m({\bf{p}})\neq m({\bf{q}})$ and $S({\bf{p}},\lambda)\neq S({\bf{q}},\lambda)$ but ${{m(\bf{p})}S({\bf{p}},{\lambda})={m(\bf{q})}S({\bf{q}},{\lambda})}$ . This exceptional case is left out of consideration since it rarely occurs. Accordingly, ${\mathcal{A}^{\lambda_{i}}}({\bf{p}})={\mathcal{A}^{\lambda_{j}}}({\bf{p}})$ for any wavelength pair $\lambda_{i}$ and $\lambda_{j}$ , i.e., a LAT value is invariant to a spectral deformation.

3.2.4 Limitation

In the above, we showed the invariance of LAT to non-linear intensity deformations. However, when the deformation function $\mathcal{D}$ is not a one-to-one mapping, a duplicated mapping is possible, i.e., ${\bf{I}}_{1}(\bf{p})\neq{\bf{I}}_{1}(\bf{q})$ and consequently $m\neq n$ , but ${m\bf{I}}_{1}({\bf{p}})={n{\bf{I}}}_{1}({\bf{q}})$ in (9). For such a mapping, $\mathcal{A}_{1}(\bf{p})\neq\mathcal{A}_{2}(\bf{p})$ . In other words, LAT is not invariant to a duplicated intensity deformation. Nevertheless, LAT is still a robust transform for non-linear deformations since the non-duplicated deformation assumption is commonly satisfied.
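A small counterexample makes the limitation concrete. The sketch assumes the equality-check $\tau$ , and uses integer division by two as a hypothetical many-to-one deformation that merges pairs of intensity levels.

```python
import numpy as np

def lat_equality(img, half=1):
    """Equality-check LAT: count same-valued pixels in a clipped window around p."""
    h, w = img.shape
    out = np.zeros((h, w), dtype=np.int64)
    for y in range(h):
        for x in range(w):
            win = img[max(0, y - half):y + half + 1,
                      max(0, x - half):x + half + 1]
            out[y, x] = np.count_nonzero(win == img[y, x])
    return out

i1 = np.array([[0, 1],
               [2, 3]])
i2 = i1 // 2            # many-to-one: levels 0,1 -> 0 and levels 2,3 -> 1

a1 = lat_equality(i1)   # every pixel is unique, so each local area is 1
a2 = lat_equality(i2)   # merged levels double each local area to 2
```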

3.3 Implementation of LAT and Extension

The LAT is efficiently computed from a local histogram as $\mathcal{A}({\bf{p}})=\bf{H_{p}}({\bf{I}}({\bf{p}}))$ . $\bf{H}$ is a $B$ -dimensional vector defined as:

[TABLE]

where ${\bf{H}}_{\bf{p}}(b)$ represents the histogram value corresponding to a bin $b$ , $B$ is the number of bins, and ${{\mathop{\rm Q}\nolimits}({\bf{I}}({\bf{q}}),b)}$ is zero except when the intensity value ${\bf{I}}({\bf{q}})$ belongs to bin $b$ . The computational complexity of a brute-force implementation of the local histograms is linear in the neighborhood size. This dependency can be removed using the integral histogram IntegralHistogram, in a way similar to the integral image, which reduces the computational complexity from $O(\left|\mathcal{N}_{\bf{p}}\right|B)$ to $O(B)$ at each pixel location.
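The integral-histogram route can be sketched as follows. This is an illustration under stated assumptions, not the author's implementation: one bin per integer intensity level, a square clipped window of half-width `half`, and the function names are all hypothetical.

```python
import numpy as np

def integral_histogram(img, bins):
    """ih[y, x, b] = number of pixels with value b in img[:y, :x]."""
    h, w = img.shape
    onehot = (img[..., None] == np.arange(bins)).astype(np.int64)
    ih = np.zeros((h + 1, w + 1, bins), dtype=np.int64)
    ih[1:, 1:] = onehot.cumsum(axis=0).cumsum(axis=1)
    return ih

def local_histogram(ih, y, x, half):
    """Histogram of the clipped window around (y, x) from four table lookups."""
    h, w = ih.shape[0] - 1, ih.shape[1] - 1
    y0, y1 = max(0, y - half), min(h, y + half + 1)
    x0, x1 = max(0, x - half), min(w, x + half + 1)
    return ih[y1, x1] - ih[y0, x1] - ih[y1, x0] + ih[y0, x0]

def lat_from_integral(img, bins, half):
    """A(p) = H_p(I(p)), read from the local histogram at each pixel."""
    ih = integral_histogram(img, bins)
    h, w = img.shape
    out = np.empty((h, w), dtype=np.int64)
    for y in range(h):
        for x in range(w):
            out[y, x] = local_histogram(ih, y, x, half)[img[y, x]]
    return out

img = np.array([[1, 1, 2],
                [1, 3, 2],
                [4, 4, 2]])
area = lat_from_integral(img, bins=5, half=1)
```

Each local histogram costs four vector lookups regardless of the window size, which is where the $O(B)$ per-pixel cost comes from.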

For practical usefulness and noise robustness, we employ a Gaussian-integrated similarity function $s$ in the intensity domain instead of the naive definition (with the equality-check similarity function) $\mathcal{A}({\bf{p}})=\bf{H_{p}}({\bf{I}}({\bf{p}}))$ for computing the local area value. Specifically, the local area value is computed by a weighted integration of adjacent bins as (3.8).

[TABLE]

where $K_{h}=1/\sum\limits_{b\in({R_{k}}{\mkern 1.0mu},{\mkern 1.0mu}{R_{l}})}{\omega(b)}$ is a normalization factor, $\omega(b)$ denotes the Gaussian similarity weights of adjacent bins, ${R_{k}}={\bf{I}}({\bf{p}})-r\,$ , and ${R_{l}}={\bf{I}}({\bf{p}})+r\,$ . Parameters $r$ and $\sigma$ control the interval of integration and the degree of Gaussian smoothing of the histogram, respectively. Pseudo code is given in Algorithm 3.1. First, the integral histogram $\bf{H_{I}}$ is computed over the image, and then the local histogram $\bf{H_{p}}$ at pixel $\bf{p}$ is computed. Lastly, the local area value is computed with the Gaussian similarity weights $\omega(b)$ . For multi-channel sensors, e.g., an RGB sensor, local area values are computed for each channel separately.
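The weighted integration can be sketched as below. The exact form of $\omega(b)$ and the units of $\sigma$ are not recoverable from this text, so the sketch assumes $\omega(b)=\exp(-(b-{\bf{I}}({\bf{p}}))^{2}/2\sigma^{2})$ with $\sigma$ in bin units, one bin per intensity level, and clipping of the interval to the histogram range; `weighted_local_area` is a hypothetical name.

```python
import numpy as np

def weighted_local_area(hist, level, r=3, sigma=1.0):
    """Normalized Gaussian-weighted sum of hist over bins level-r .. level+r.

    hist  : local histogram H_p, one bin per intensity level (assumption)
    level : intensity I(p) of the current pixel
    """
    lo = max(0, level - r)
    hi = min(len(hist) - 1, level + r)
    bins = np.arange(lo, hi + 1)
    weights = np.exp(-((bins - level) ** 2) / (2.0 * sigma ** 2))
    return float((weights * hist[bins]).sum() / weights.sum())

flat = np.full(16, 5.0)      # every bin holds 5 pixels
spike = np.zeros(16)
spike[8] = 10.0              # all 10 pixels of the window sit in bin 8

flat_area = weighted_local_area(flat, level=8)
on_peak = weighted_local_area(spike, level=8, sigma=1e-3)   # approaches H_p(I(p))
off_by_one = weighted_local_area(spike, level=9)            # noisy pixel, still > 0
```

With the equality-check definition, a pixel whose intensity is perturbed by one level would receive a local area of zero in the `spike` case; the weighted version still assigns it a share of the neighboring bin.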

3.4 Robustness Evaluation of LAT

A non-linear intensity deformation is commonly induced by different modalities of an imaging system. In order to evaluate the robustness of LAT to non-linear intensity deformation, a challenging simulated database was constructed. Eight color images (Airplane, Baboon, Bikes, Lena, Mustang, PaintedFace, Peppers, TwoMacaws, shown in Fig. 3.1) were employed as original images. Each image was deformed using 40 intensity deformation functions drawn from four categories of random mappings: piecewise linear mapping (PL), piecewise quadratic mapping (PQ), random mapping with Gaussian distribution (RG), and random mapping with uniform distribution (RU). Different deformation functions were applied to each of the R, G, and B channels. In total, 320 non-linearly deformed pairs of color images were generated.

The robustness of LAT was evaluated with comparisons to four methods: grey-world model (GW) NormalizedChromaticity, histogram matching (HM) HistogramMatching, log-chromaticity (LC) ANCC, and rank transform (RT) Rank. As a base, the original image pair before transform (ORG) was also compared.

The similarity between a registered image pair is measured by the mean absolute difference $d_{mad}$ and different pixel ratio $d_{dpr}$ . $d_{mad}$ and $d_{dpr}$ are defined as (21) and (22), respectively.

[TABLE]

where ${{{\bf{\bar{I}}}}_{1}}$ is a transformed version of the original image, ${{{\bf{\bar{I}}}}_{2}}$ is a transformed version of the deformed image, $K_{1}=1/(NL)$ is a normalization factor, $N$ is the number of pixels, $L$ is the maximum value of the label.

[TABLE]

where $K_{2}=1/N$ is a normalization factor, $t=0.1L$ is threshold value.
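The two measures can be sketched from the normalization factors given above. Since the defining equations are not reproduced in this text, the sketch assumes that $d_{mad}$ is the sum of absolute differences scaled by $K_{1}$ , and that $d_{dpr}$ counts pixels whose absolute difference strictly exceeds $t$ , scaled by $K_{2}$ ; the function names and sample values are illustrative.

```python
import numpy as np

def d_mad(t1, t2, L):
    """Mean absolute difference, normalized by N * L (K1 = 1 / (N L))."""
    diff = np.abs(t1.astype(float) - t2.astype(float))
    return float(diff.sum() / (t1.size * L))

def d_dpr(t1, t2, L, ratio=0.1):
    """Fraction of pixels whose absolute difference exceeds t = ratio * L."""
    diff = np.abs(t1.astype(float) - t2.astype(float))
    return float(np.count_nonzero(diff > ratio * L) / t1.size)

t1 = np.array([[0, 100], [200, 255]])   # transformed original image
t2 = np.array([[0, 110], [100, 255]])   # transformed deformed image
mad = d_mad(t1, t2, L=255)              # (10 + 100) / (4 * 255)
dpr = d_dpr(t1, t2, L=255)              # one of four pixels differs by more than 25.5
```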

The quantitative evaluations of LAT are summarized in Table 3.1 and Table 3.2, showing that LAT is superior to the other methods in terms of both $d_{mad}$ and $d_{dpr}$ . Note that the lower $d_{mad}$ and $d_{dpr}$ are, the more similar the image pairs are. In the results, a total of 10 image pairs are used for the average, and sample images are shown in Fig. 3.2 - Fig. 3.5. More specifically, an input image is non-linearly deformed with different deformation functions, and the results of the different image transformation methods, including the state-of-the-art method and the proposed LAT, are shown. LAT transforms the non-linearly deformed images into a common domain, where the discrepancy between the non-linear deformations is greatly reduced. For most image pairs, the LATransformed images are very similar to each other; in other words, LAT shows higher robustness for randomly intensity-deformed image pairs.

In particular, Table 3.2 analyzes the performance of LAT for varying parameters, including the support window size $l$ , the interval of integration $r$ , and the degree of Gaussian smoothing $\sigma$ . The performance of LAT was highest when the parameter $l$ was 11. Note that the other parameters $r$ and $\sigma$ , which control the interval of integration and the degree of Gaussian smoothing of the histogram, did not significantly affect performance; they were therefore set to $r$ = 3 and $\sigma$ = 0.3 as a trade-off between efficiency and robustness.

3.5 LAT Reformulated Features: Cross-Modality Feature Descriptors

Besides its use as a transformation, the concept of the Local Area Transform can be used to reformulate conventional cost functions and descriptors. Replacing an ‘intensity value’ by a ‘local area value’ endows cost functions and descriptors with robustness to a modality deformation while maintaining their inherent properties. For example, the most widely used cost function, the mean absolute difference (mad), can be reformulated as follows:

[TABLE]

where $mad({\bf{p}},{\bf{q}})$ and $ma{d_{LAT}}({\bf{p}},{\bf{q}})$ are the original and the reformulated mad between pixel points ${\bf{p}}$ and ${\bf{q}}$ , and ${\rm{N}}$ is the set of neighboring pixels around ${\bf{p}}$ or ${\bf{q}}$ .

Similarly, the original local self-similarity descriptor (LSS) LSS can be reformulated by measuring sum of squared local area difference instead of sum of squared intensity difference as follows:

[TABLE]

where $S({\bf{p}},{\bf{q}})$ and $S_{LAT}({\bf{p}},{\bf{q}})$ are the original and the reformulated correlation surface functions in LSS (please refer to LSS for a full description of LSS). ${\mathop{\rm var}}_{auto}$ is a constant for stability.

SIFT also can be reformulated by using gradients of local area value (3.19) instead of gradients of intensity value (3.18).

[TABLE]

Binary pattern based robust descriptors, e.g., CT census, RT Rank, BRIEF BRIEF, and BRISK BRISK, are formulated with the following local binary pattern (LBP) equation.

[TABLE]

where $LBP({\bf{p}})$ is LBP at pixel ${\bf{p}}$ . $q(=0,1,...,Q-1)$ is the index of neighboring pixels of ${\bf{p}}$ . (3.20) can be reformulated to $LB{P_{LAT}}({\bf{p}})$ with local area value instead of intensity value as follows:

[TABLE]

With the reformulated LBP, the robust descriptors CT census, RT Rank, BRIEF BRIEF, and BRISK BRISK can be reformulated with LAT. In the remainder of this paper, the subscript LAT denotes reformulation with LAT. Note that any cost function or feature computed from intensity values can be reformulated with LAT.
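The effect of swapping intensity values for local area values can be illustrated with a basic 8-neighbor LBP. The sketch assumes the standard LBP form (bit $q$ is set when the neighbor is at least as large as the center), the equality-check LAT with a clipped window, and an intensity inversion as the deformation; none of these specifics are taken from the equations above.

```python
import numpy as np

def lat_equality(img, half=1):
    """Equality-check LAT: count same-valued pixels in a clipped window around p."""
    h, w = img.shape
    out = np.zeros((h, w), dtype=np.int64)
    for y in range(h):
        for x in range(w):
            win = img[max(0, y - half):y + half + 1,
                      max(0, x - half):x + half + 1]
            out[y, x] = np.count_nonzero(win == img[y, x])
    return out

def lbp(img):
    """8-neighbor LBP code for every interior pixel."""
    h, w = img.shape
    center = img[1:-1, 1:-1]
    code = np.zeros(center.shape, dtype=np.int64)
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    for q, (dy, dx) in enumerate(offsets):
        neighbor = img[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        code |= (neighbor >= center).astype(np.int64) << q
    return code

i1 = (np.arange(36).reshape(6, 6) * 5) % 16   # deterministic 16-level test image
i2 = 15 - i1                                   # intensity inversion (one-to-one)

raw_codes_differ = not np.array_equal(lbp(i1), lbp(i2))
lat_codes_match = np.array_equal(lbp(lat_equality(i1)), lbp(lat_equality(i2)))
```

The inversion flips the intensity order, so the plain LBP codes change, whereas the codes computed on local area values are identical for both images.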

3.6 LAT-Net: Deep Scene Recognition Network

Scene recognition is one of the fundamental tasks in various applications of computer vision such as robotics, image search, and geo-localization. However, scene recognition is a challenging problem since scenes contain a variety of components, from objects to scene-like features. Furthermore, in practical applications, scene images are frequently taken under cross-domain settings, such as different imaging conditions, sensor specifications, and even weather, and conventional scene recognition algorithms fail to achieve reliable results. To address this problem, domain adaptation duan2012domain; baktashmotlagh2013unsupervised or domain generalization george2016semantic approaches have been proposed. This section proposes to embed the LAT concept into deep convolutional neural networks (CNNs) in order to tackle the cross-domain scene recognition problem.

The conventional convolutional (Conv) layer in common CNNs is defined as:

[TABLE]

where $\bf{x}^{l}$ and $\bf{x}^{l-1}$ are the feature maps of the current $l^{th}$ and the $(l-1)^{th}$ layers, respectively. ${\bf{\omega}}_{i}^{l}$ and $b^{l}$ are the weight and bias terms. $\bf{K}$ is the convolutional kernel. With the concept of LAT, the reformulated convolutional (A-Conv) layer is defined as follows:

[TABLE]

where $\mathcal{A}^{l}$ and $\mathcal{A}^{l-1}$ are the LAT-reformulated feature maps of the current $l^{th}$ and the $(l-1)^{th}$ layers, respectively. $\mathcal{A}^{l-1}$ could be replaced by the output feature maps of regular layers in a CNN, such as a Conv layer or a pooling layer. It could also be a previous A-Conv layer, so A-Conv layers can be stacked together to form a highly nonlinear transformation operator.

Given their impressive performance on the ImageNet benchmark krizhevsky2012imagenet; russakovsky2015imagenet, three popular CNN architectures, AlexNet krizhevsky2012imagenet, VGG-16 simonyan2014very, and ResNet-34 he2016deep, are employed as base networks. To apply the non-linear feature transformation in the networks, the early Conv layers are replaced with A-Conv layers in the proposed network structures: two Conv layers, four Conv layers, and six Conv layers for AlexNet, VGGNet-16, and ResNet-34, respectively. In addition, an inception-like stem block, named the Aception block (Fig. 3.6), is placed at the top of each network. The re-designed CNNs are named LAT-CNN, i.e., LAT-AlexNet, LAT-VGGNet, and LAT-ResNet, respectively. The structures of the re-designed networks are depicted in Figs. 3.7, 3.8, and 3.9, respectively. All the CNNs presented here were implemented and trained using the Caffe package jia2014caffe on Nvidia Tesla P40 GPUs.

Chapter 4 Cross-Modality Correspondence Matching and Deep Scene Recognition

4.1 Experimental Settings

In the experiments, LAT was implemented with the following parameter settings for all datasets: $l$ , $r$ , $\sigma$ = 11, 3, 0.3. LAT was implemented in C++ on an Intel Core i7-3770 CPU at 3.40 GHz. The performance of LAT was evaluated for the tasks of nonlinearly-deformed image matching in Section 4.2, cross-spectral correspondence matching in Section 4.3, cross-radiometry stereo matching in Section 4.4, and cross-modal dense flow estimation in Section 4.5. For color images, LAT is computed for each channel, and those values are then used for minimum distance/cost selection. LAT was implemented as a C++ layer in the deep learning library Caffe jia2014caffe for deep scene recognition in Section 4.6.

4.2 Cross-Modality Correspondence Matching

4.2.1 Non-linear Deformation Correspondence Matching

The performance of the feature descriptors reformulated with LAT is evaluated in terms of the feature recognition rate. The feature recognition rate is defined as the ratio of correct matches to the total number of keypoints, as in BRIEF. The keypoints were detected using the SIFT detector. SIFT SIFT, BRIEF BRIEF, and LSS LSS were selected as the compared feature descriptors since they are the most successful feature descriptors based on gradient, binary pattern, and self-similarity, respectively. They are reformulated with LAT to SIFT$_{LAT}$ , BRIEF$_{LAT}$ , and LSS$_{LAT}$ , respectively. For the evaluation, the simulated database described in Section 3.4 was used.

Fig. 4.1 shows an example comparison on a simulated image pair. In the results, the correspondences estimated with the conventional SIFT descriptor are shown in the upper part, while those estimated with the proposed LAT-based SIFT descriptor are shown in the lower part. The same fixed parameters are used for establishing correspondences (e.g., the same matching threshold). In other words, the number of correspondences depends on the robustness of the descriptors. In these results, the LAT-based SIFT descriptor consistently provides better correspondences than the original one. Fig. 4.2 summarizes the overall results, showing that the reformulated descriptors remarkably outperform the original descriptors. In particular, the reformulated descriptors show an extremely high recognition rate even for image pairs generated with random mapping functions (RG, RU). The results suggest that the nonlinear intensity deformation problem generally induced by different imaging modalities can be addressed by reformulating the conventional descriptors with LAT. In the remainder of this section, we show the superiority and applicability of LAT for several multi-modality applications.

4.2.2 Cross-Spectral Correspondence Matching

In this section, we show that LAT is superior in terms of detecting the sought template in different spectral images, i.e., cross-spectral template matching. The cross-spectral template matching was applied to 100 RGB-NIR image pairs randomly selected from the RGB-NIR Scene Dataset DB1. For each input NIR image, a template of a given size was selected at 100 random locations. In total, 10,000 (RGB) image-(NIR) template pairs were used in this experiment. To avoid homogeneous templates, the template locations were selected from among the structured regions of the image (i.e., locations where the feature response of BRISK BRISK is above a threshold). Given an RGB image and a NIR template, matching distances were computed for all possible locations in the corresponding RGB image (the minimum among the NIR/R-channel, NIR/G-channel, and NIR/B-channel distances is taken as the distance of a location), and the region associated with the minimal distance was considered the matched region. Four different methods, HM HistogramMatching, LC ANCC, RT Rank, and MTM MTM, were employed as compared methods, and the original images were also compared as a base method. The Euclidean distance is employed for ORG, HM, and LC, and the sum of rank differences is employed for RT.

In Figs. 4.3, 4.4, and 4.5, the template matching performance of LAT across cross-spectral images is compared to that of the state-of-the-art methods. We show examples of similarity maps (for better visualization, a similarity map, which is the inverse of the distance map, is illustrated instead of the distance map; a higher value (red) means a similar region and a lower value (blue) means a dissimilar region). The template matching in the LATransformed images clearly shows a sharp peak at the correct location, while it is not well localized with the other methods. Table 4.1 summarizes the average correct detection ratio $r_{cd}$ and matching pixel error $e_{mp}$ . $r_{cd}$ measures the percentage of correct detections (if the matched and true windows overlap by $>$ 70%, the match is considered a correct detection), and $e_{mp}$ measures the absolute difference between the matched and true windows. The quantitative results in Table 4.1 are averages over the 10,000 RGB-NIR template pairs. LAT provides robust results in cross-spectral template matching in terms of both $r_{cd}$ and $e_{mp}$ ; in this study, $r_{cd}$ showed an improvement of 23%, and $e_{mp}$ showed a reduction of 113 pixels.

The performance of the reformulated feature descriptors for cross-spectral feature matching is evaluated in this subsection. The same 100 RGB-NIR image pairs as in the template matching experiment were employed for this evaluation. The feature recognition rate is measured for the evaluation, and keypoints were detected using the SIFT detector. SIFT SIFT, BRIEF BRIEF, and LSS LSS were selected as the compared feature descriptors.

Fig. 4.6 shows an example comparison of LSS and LSS$_{LAT}$ . Specifically, the cross-spectral feature matching results are shown for the conventional LSS descriptor and the LAT-based LSS descriptor, respectively. Note that all parameters are kept the same in all experiments. Since the image pairs in this dataset are structurally aligned, reliable correspondences should also be aligned. As shown in the results, the LAT-based LSS consistently outperformed the original LSS. Table 4.2 summarizes the recognition rate, showing that the reformulated descriptors perform better than the original descriptors. The results show that reformulation with LAT provides promising results for cross-spectral feature matching, with an improvement of 10% in recognition rate.

4.2.3 Cross-Radiometry Stereo Matching

This section demonstrates the advantage of LAT for robust stereo matching on stereo images with radiometric and photometric deformations. Stereo matching is commonly formulated as an energy minimization problem in the MAP-MRF framework ANCC:

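$$E(f)=\sum_{\bf{p}} D_{\bf{p}}(f_{\bf{p}})+\sum_{\bf{p}}\sum_{{\bf{q}}\in\mathcal{N}_{\bf{p}}} V_{\bf{pq}}(f_{\bf{p}},f_{\bf{q}})$$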

where $\mathcal{N}_{\bf{p}}$ is the set of pixels neighboring $\bf{p}$ and $f$ is the disparity. In the first term, $D_{\bf{p}}(f_{\bf{p}})$ is the data cost, which measures the dissimilarity between $\bf{p}$ in the left image and ${\bf{p}}+f_{\bf{p}}$ in the right image. In the second term, ${{V_{{\bf{pq}}}}({f_{\bf{p}}},{f_{\bf{q}}})}$ is the smoothness cost, which penalizes non-smooth disparities.
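As an illustration of how such an energy is evaluated for a candidate disparity map, the sketch below sums a data term and a smoothness term on a 4-connected grid. It assumes a truncated quadratic smoothness cost; the function names, the weight `lam`, and the truncation value `tau` are hypothetical and not taken from the experiments. It only scores a labelling and does not perform the optimization itself.

```python
def truncated_quadratic(fp, fq, lam=1.0, tau=4.0):
    """Smoothness cost V_pq: quadratic in the disparity gap, capped at tau."""
    return lam * min((fp - fq) ** 2, tau)


def mrf_energy(data_cost, disparity, lam=1.0, tau=4.0):
    """Energy of a disparity labelling on a 4-connected grid.

    data_cost[y][x][d] is D_p(d) for pixel p = (x, y);
    disparity[y][x] is the label f_p assigned to that pixel.
    """
    height, width = len(disparity), len(disparity[0])
    energy = 0.0
    for y in range(height):
        for x in range(width):
            fp = disparity[y][x]
            energy += data_cost[y][x][fp]  # data term D_p(f_p)
            # Sum over all 4-neighbours q of p, so each edge is visited twice
            # (once from each endpoint), as in a double sum over p and q.
            for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                qy, qx = y + dy, x + dx
                if 0 <= qy < height and 0 <= qx < width:
                    energy += truncated_quadratic(fp, disparity[qy][qx], lam, tau)
    return energy
```

The truncation keeps the penalty bounded at true depth discontinuities, so large disparity jumps at object boundaries are not smoothed away.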

In this experiment, we fixed all parameters, the cost function, the aggregation method, and the optimization method, and varied only the transformation method. We used the absolute difference (AD) as the pixelwise data cost, the adaptive support weight AdaptiveSupportWieght with a window size of $25\times 25$ for cost aggregation, a truncated quadratic cost for smoothness, and loopy belief propagation for global optimization. Although postprocessing such as occlusion handling and noise removal can improve the quality of the estimated disparities, we did not apply any postprocessing, in order to focus on the influence of the transform. To evaluate LAT and compare it with other methods, we used the Middlebury stereo data sets hirschmuller2007evaluation, including Aloe, Baby1, Baby3, Bowling2, Cloth2, Cloth3, Lampshade1, and Monopoly. Each data set has three illumination sources (indexed as 1, 2, 3) and three exposures (indexed as 0, 1, 2), giving nine different image pairs in total. In this experiment, the left image is fixed to illumination source 1 and exposure 1, while the right image varies in both illumination and exposure. In other words, nine combinations of stereo pairs were used for the evaluation.

Four methods, HM HistogramMatching, LC ANCC, RT Rank, and CT census, were used for comparison, and the original images were included as a baseline. The qualitative and quantitative comparisons are given in Fig. 4.7 and Table 4.3, respectively. As shown in Table 4.3, LAT is superior to the other methods on most data sets in terms of bad pixel percentage ( $BPP$ ) and root mean squared error ( $RMSE$ ). The results in Fig. 4.8 and 4.9 show that LAT also outperforms the other methods qualitatively.
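For reference, the two disparity metrics can be computed as in the sketch below. The error threshold of one pixel for a "bad" pixel and the use of `None` to mark pixels without ground truth are assumptions made for illustration; the text does not state which threshold or masking was used.

```python
import math


def _valid_errors(estimated, ground_truth):
    """Absolute disparity errors at pixels that have ground truth."""
    return [abs(e - g)
            for row_e, row_g in zip(estimated, ground_truth)
            for e, g in zip(row_e, row_g)
            if g is not None]  # None marks pixels without ground truth (assumption)


def bad_pixel_percentage(estimated, ground_truth, threshold=1.0):
    """Percentage of valid pixels whose disparity error exceeds `threshold`."""
    errors = _valid_errors(estimated, ground_truth)
    return 100.0 * sum(err > threshold for err in errors) / len(errors)


def rmse(estimated, ground_truth):
    """Root mean squared disparity error over valid pixels."""
    errors = _valid_errors(estimated, ground_truth)
    return math.sqrt(sum(err ** 2 for err in errors) / len(errors))
```

$BPP$ counts how often the estimate is wrong by more than the threshold, while $RMSE$ also reflects how large the errors are, so the two metrics can rank methods differently.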

4.2.4 Cross-Modality Dense Correspondence Matching

Estimating dense visual flow between different images that share similar scene characteristics is a very challenging problem, but it is a promising building block for high-level computer vision tasks SIFTFLOW. Cross-modality dense flow estimation is even more challenging because of the disparate properties of the modalities MultiModal. This section analyzes the performance of SIFT-Flow$_{LAT}$ in comparison with state-of-the-art methods: SIFT-Flow SIFTFLOW and DAISY DAISY. Because RSNCC MultiModal is based on a global matching approach, it is excluded for a fair comparison; SIFT-Flow and DAISY are both based on a local matching approach. For this purpose, we use the multimodal image database MultiModal, which includes RGB-NIR, different-exposure, and flash/no-flash image pairs.

Fig. 4.10 shows a qualitative comparison of the cross-modality dense flow estimated by SIFT-Flow, DAISY, and SIFT-Flow$_{LAT}$. As shown in the figure, SIFT-Flow$_{LAT}$ provides a more reliable dense flow than the state-of-the-art methods. Table 4.4 summarizes the quantitative comparison in terms of warping error. The warping error is computed from the ground-truth displacements of 100 corner points provided in MultiModal. The results indicate that SIFT-Flow$_{LAT}$ is a promising approach for cross-modality dense flow estimation.
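A warping error of this kind can be sketched as the average distance between the estimated flow and the ground-truth displacement at the annotated points. The exact definition used for Table 4.4 is not given in the text, so the mean Euclidean distance and the data layout below are assumptions for illustration.

```python
import math


def warping_error(flow, landmarks):
    """Mean Euclidean distance between estimated and ground-truth displacements.

    flow[y][x] is the estimated (dx, dy) at pixel (x, y);
    landmarks is a list of ((x, y), (gt_dx, gt_dy)) ground-truth entries.
    """
    total = 0.0
    for (x, y), (gt_dx, gt_dy) in landmarks:
        dx, dy = flow[y][x]
        total += math.hypot(dx - gt_dx, dy - gt_dy)
    return total / len(landmarks)
```

Because only sparse corner points are annotated, this measure samples the dense flow at those locations and says nothing about flow quality elsewhere in the image.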

To address the correspondence matching problem across image modalities, a deformation-robust local area transform is proposed. LAT is a nonlinear, deformation-invariant transformation of intensity information into local area information. The experimental results show that LAT and the descriptors reformulated with LAT are superior to conventional methods for cross-modality correspondence matching. Specifically, LAT achieves approximately a 23% improvement in correct detection ratio for cross-spectral template matching and a 10% increase in recognition rate for cross-spectral feature matching. LAT also improves cross-radiometry stereo matching and cross-modality dense flow estimation, with a 15% reduction in bad pixel percentage and a 50% reduction in warping error, respectively. In conclusion, the local area can be considered an alternative domain to the intensity domain for robust correspondence matching. Future work should include the development of cross-modal object recognition based on the properties of LAT.

4.3 Cross-Modality Deep Scene Recognition

4.3.1 Cross Spectral Scene Recognition

To study the performance of LAT-Net for cross-spectral scene recognition, we constructed a cross-spectral scene database. This database consists of 477 images distributed over 9 categories: Country (52), Field (51), Forest (53), Indoor (56), Mountain (55), Old Buildings (51), Street (50), Urban (58), and Water (51), where each image is either the RGB or the NIR image, selected at random, of an original RGB-NIR image pair in DB1. We randomly selected 99 images for testing (11 per category) and used the remaining 378 images for training. To avoid over-fitting, the training images were augmented by resizing (the resize ratio is randomly varied from 0.5 to 1.5, with a center shift ranging from -0.1 to 1.0), rotating (the rotation angle is randomly varied from -70 to +70 degrees), color shifting, and flipping. In total, 3,024 training images were used. We trained all networks with the ADAM optimizer kinga2015method, learning rate $\eta$ =0.001, and batch size $b$ =16 for 40 epochs. All networks are pre-trained on the Places2 database zhou2017places for 10 epochs. Places2 is an extended version of the Places dataset zhou2014learning and is probably the largest scene recognition dataset. In total, Places2 contains more than 10 million images comprising more than 400 unique scene categories, with 5,000 to 30,000 training images per class.
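The augmentation described above can be sketched as a parameter sampler. The ranges for resize ratio, center shift, and rotation are the ones quoted in the text; the 50% flip probability, the omission of color shifting (no range is given), and the function names are assumptions. The factor of eight variants per image is inferred from 3,024 / 378 and is not stated in the text.

```python
import random


def sample_augmentation(rng):
    """Draw one set of augmentation parameters from the ranges quoted in the text."""
    return {
        "resize_ratio": rng.uniform(0.5, 1.5),
        "center_shift": rng.uniform(-0.1, 1.0),
        "rotation_deg": rng.uniform(-70.0, 70.0),
        "flip": rng.random() < 0.5,  # assumed 50% flip probability
    }


def augmented_set_size(num_images, variants_per_image):
    """Number of training samples when each image yields a fixed number of variants."""
    return num_images * variants_per_image


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the sampled parameters are repeatable
    print(sample_augmentation(rng))
    print(augmented_set_size(378, 8))  # 378 training images, 8 variants each (inferred)
```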

We compared against state-of-the-art scene recognition methods, ranging from hand-crafted methods (GIST oliva2001modeling, DiscrimPatches singh2012unsupervised, ObjectBank li2010object) to deep-learned feature-based methods (fc7-VLAD gong2014multi, NetVLAD arandjelovic2016netvlad, MFAFVNet li2017deep). Table 4.5 presents the quantitative comparison of cross-spectral scene recognition in terms of top-1 accuracy. As shown in the results, the LAT-redesigned networks provide the highest accuracy, even with a simple network structure such as AlexNet krizhevsky2012imagenet. LAT-ResNet improved the recognition accuracy by 14.8% compared to the state-of-the-art methods. The results indicate that LAT-redesigned networks are a promising approach for cross-spectral scene recognition.
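Top-1 accuracy, the metric reported here, is the fraction of test images whose highest-scoring predicted category matches the ground-truth label. A minimal sketch follows; the function name and the list-of-scores input format are illustrative assumptions.

```python
def top1_accuracy(scores, labels):
    """Percentage of samples whose highest-scoring class equals the true label.

    scores[i][c] is the score of class c for sample i; labels[i] is the true class.
    """
    correct = 0
    for row, label in zip(scores, labels):
        predicted = max(range(len(row)), key=row.__getitem__)  # arg-max class index
        correct += predicted == label
    return 100.0 * correct / len(labels)
```

With 99 test images, each image contributes roughly one percentage point, so accuracy differences of a few percent correspond to only a handful of images.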

4.3.2 Domain Generalized Scene Recognition

Domain generalization transfers the knowledge learned from a source domain to an unseen target domain. To study the performance of LAT-Net for domain generalized scene recognition, we conducted the following experiments. All networks are trained on Places2 zhou2014learning with the ADAM optimizer kinga2015method, learning rate $\eta$ =0.001, and batch size $b$ =16 for 20 epochs. The recognition accuracy is then measured on unseen RGB-NIR scene databases. For evaluation, we constructed three scene databases generated from DB1: RGB, NIR, and RGB-NIR combined. We divide DB1 into two separate databases consisting of only RGB or only NIR images, respectively. The RGB-NIR combined scene database is the same as the database used in Section 4.3.1. Unlike in Section 4.3.1, all 477 images are used as test images, since none of them are used for training.

We compared against the state-of-the-art scene recognition methods fc7-VLAD gong2014multi, NetVLAD arandjelovic2016netvlad, MFAFVNet li2017deep, and SemanticCluster george2016semantic. Tables 4.6 and 4.7 present the quantitative comparison of domain generalized scene recognition for the RGB and NIR scene databases, respectively, in terms of top-1 accuracy. As shown in the results, the LAT-redesigned networks provide the highest accuracy. LAT-ResNet improved the recognition accuracy by 15.9% and 10.9% compared to the state-of-the-art methods for the RGB and NIR scene databases, respectively. The results indicate that LAT-redesigned networks are a promising approach for domain generalized scene recognition.

Chapter 5 Conclusion

This dissertation proposes a deformation-robust image transform, called the local area transform (LAT), and proves its invariance to nonlinear deformations both mathematically and experimentally. LAT is also extended to robust cost functions, feature descriptors, and deep scene recognition networks.

The experimental results show that LAT and the descriptors reformulated with LAT are superior to conventional methods for cross-modality correspondence matching. Specifically, LAT achieves approximately a 23% improvement in correct detection ratio for cross-spectral template matching and a 10% increase in recognition rate for cross-spectral feature matching. LAT also improves cross-radiometry stereo matching and cross-modality dense flow estimation, with a 15% reduction in bad pixel percentage and a 50% reduction in warping error, respectively. Furthermore, the proposed LAT-Net outperforms existing state-of-the-art methods in scene recognition tasks. Specifically, LAT-Net achieves up to a 14% accuracy improvement in cross-spectral scene recognition, and 6% and 7% accuracy improvements for database invariant scene recognition and domain generalized scene recognition, respectively.

In conclusion, the local area can be considered an alternative domain to the intensity domain for robust correspondence matching and for many applications, such as feature matching, stereo matching, dense correspondence matching, and image recognition. We believe the concept of LAT can be extended to various other tasks. Future work includes the development of cross-modal image retrieval and person re-identification based on the properties of the local area transform.
