Local Area Transform for Cross-Modality Correspondence Matching and Deep Scene Recognition
Seungchul Ryu

TL;DR
This paper introduces the local area transform (LAT), a robust image transform invariant to nonlinear intensity deformations, improving correspondence matching and scene recognition across different modalities.
Contribution
The paper proposes LAT and its integration into deep neural networks, including LAT-Net, for enhanced cross-modality correspondence and scene recognition.
Findings
LAT provides consistent results under nonlinear intensity deformations.
LAT reduces mean absolute difference compared to conventional methods.
LAT-based descriptors outperform traditional approaches in cross-spectral matching.
Abstract
Establishing correspondences is a fundamental task in a variety of image processing and computer vision applications. In particular, finding the correspondences between a non-linearly deformed image pair induced by different modality conditions is a challenging problem. This paper describes an efficient but powerful image transform called local area transform (LAT) for modality-robust correspondence estimation. Specifically, LAT transforms an image from the intensity domain to the local area domain, which is invariant under nonlinear intensity deformations, especially radiometric, photometric, and spectral deformations. In addition, robust feature descriptors are reformulated with LAT for several practical applications. Furthermore, a LAT-convolution layer and an Aception block are proposed and, with these novel components, a deep neural network called LAT-Net is proposed especially for scene…
| Algorithm 1: Local Area Transform |
| Input: input image |
| Internal: Integral histogram , |
| Local histogram at pixel point , |
| the corresponding intensity bin of , half-window size |
| Output: Local area transformed image |
| /* integral histogram computation */ |
| for each pixel (x,y) do |
| end |
| for each pixel (x,y) do |
| end |
| /* local histogram computation */ |
| for each pixel (x,y) do |
| end |
| /* local area computation */ |
| for each pixel (x,y) do |
| end |
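The four stages listed in Algorithm 1 (integral histogram, local histogram, local area) can be sketched as follows. This is a minimal illustration rather than the thesis's implementation: the function name `lat_integral_histogram`, the uniform quantization into `n_bins` bins, the use of bin equality as the similarity test, and the clipping of windows at the image border are all assumptions made for the example.

```python
import numpy as np

def lat_integral_histogram(image, half_window=5, n_bins=32):
    """Local area transform of a 2-D uint8 image via an integral histogram.

    Assumptions (not from the source): uniform quantization into n_bins,
    similarity = same bin, windows clipped at the image border.
    """
    img = np.asarray(image)
    h, w = img.shape
    bins = (img.astype(np.int64) * n_bins) // 256      # intensity bin of each pixel

    # Integral histogram: H[y, x, b] counts pixels of bin b in rows [0, y), cols [0, x).
    one_hot = np.zeros((h, w, n_bins), dtype=np.int64)
    one_hot[np.arange(h)[:, None], np.arange(w)[None, :], bins] = 1
    H = np.zeros((h + 1, w + 1, n_bins), dtype=np.int64)
    H[1:, 1:] = one_hot.cumsum(axis=0).cumsum(axis=1)

    out = np.zeros((h, w), dtype=np.int64)
    for y in range(h):
        y0, y1 = max(0, y - half_window), min(h, y + half_window + 1)
        for x in range(w):
            x0, x1 = max(0, x - half_window), min(w, x + half_window + 1)
            b = bins[y, x]
            # Local histogram entry for the pixel's own bin: four lookups.
            out[y, x] = H[y1, x1, b] - H[y0, x1, b] - H[y1, x0, b] + H[y0, x0, b]
    return out
```

The integral histogram makes the cost of each local-histogram lookup independent of the window size, which is the usual reason for choosing it over recounting the window at every pixel.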
| Metric | ORG | GW NormalizedChromaticity | HM HistogramMatching | LC ANCC | RT Rank | LAT |
|---|---|---|---|---|---|---|
| Mean absolute difference | 0.33 | 0.13 | 0.32 | 0.12 | 0.20 | 0.02 |
| Different pixel ratio | 0.79 | 0.53 | 0.78 | 0.46 | 0.53 | 0.04 |
| Metric | window size 7 | window size 11 | window size 15 | interval of integ. 1 | interval of integ. 3 | interval of integ. 5 | deg. of Gaussian 0.1 | deg. of Gaussian 0.3 | deg. of Gaussian 0.5 |
|---|---|---|---|---|---|---|---|---|---|
| Mean absolute difference | 0.11 | 0.02 | 0.06 | 0.03 | 0.02 | 0.04 | 0.03 | 0.02 | 0.06 |
| Different pixel ratio | 0.13 | 0.04 | 0.08 | 0.06 | 0.04 | 0.05 | 0.07 | 0.04 | 0.08 |
| Size | ORG | HM | LC | RT | MTM | LAT |
|---|---|---|---|---|---|---|
| | 0.32 | 0.49 | 0.29 | 0.60 | 0.63 | 0.70 |
| | 318 | 218 | 359 | 208 | 158 | 139 |
| Recognition rate | SIFT SIFT | BRIEF BRIEF | LSS LSS |
|---|---|---|---|
| Original | 0.72 | 0.68 | 0.65 |
| LAT (SIFTLAT, BRIEFLAT, LSSLAT) | 0.85 | 0.78 | 0.73 |
| Metric | ORG | HM | LC | RT | CT | LAT |
|---|---|---|---|---|---|---|
| BPP | 0.86 | 0.55 | 0.68 | 0.53 | 0.50 | 0.47 |
| RMSE | 80.2 | 51.6 | 58.8 | 56.1 | 49.8 | 45.8 |
| Algorithm | RGB-NIR | Flash-Nonflash | Diff. Exp. | All |
|---|---|---|---|---|
| SIFT-Flow SIFTFLOW | 10.11 | 8.76 | 10.03 | 9.78 |
| Variational VM | 12.03 | 15.19 | 16.57 | 14.56 |
| DAISY DAISY | 20.42 | 10.84 | 12.71 | 16.16 |
| SIFT-FlowLAT | 6.83 | 8.83 | 7.54 | 7.51 |
| Method | Accuracy (%) |
|---|---|
| GIST oliva2001modeling | 31.2 |
| DiscrimPatches singh2012unsupervised | 34.2 |
| ObjectBank li2010object | 41.3 |
| fc7-VLAD gong2014multi | 49.4 |
| NetVLAD arandjelovic2016netvlad | 53.4 |
| MFAFVNet li2017deep | 56.5 |
| AlexNet krizhevsky2012imagenet | 45.4 |
| VGGNet simonyan2014very | 51.4 |
| ResNet he2016deep | 54.4 |
| LAT-AlexNet (Ours) | 57.5 |
| LAT-VGGNet (Ours) | 65.5 |
| LAT-ResNet (Ours) | 69.6 |
| Method | Accuracy (%) |
|---|---|
| fc7-VLAD gong2014multi | 54.3 |
| NetVLAD arandjelovic2016netvlad | 58.7 |
| MFAFVNet li2017deep | 59.8 |
| SemanticCluster george2016semantic | 66.3 |
| AlexNet krizhevsky2012imagenet | 46.2 |
| VGGNet simonyan2014very | 48.3 |
| ResNet he2016deep | 51.6 |
| LAT-AlexNet (Ours) | 58.5 |
| LAT-VGGNet (Ours) | 65.4 |
| LAT-ResNet (Ours) | 71.2 |
| Method | Accuracy (%) |
|---|---|
| fc7-VLAD gong2014multi | 43.2 |
| NetVLAD arandjelovic2016netvlad | 46.8 |
| MFAFVNet li2017deep | 49.4 |
| SemanticCluster george2016semantic | 54.7 |
| AlexNet krizhevsky2012imagenet | 36.1 |
| VGGNet simonyan2014very | 39.5 |
| ResNet he2016deep | 41.6 |
| LAT-AlexNet (Ours) | 51.9 |
| LAT-VGGNet (Ours) | 56.5 |
| LAT-ResNet (Ours) | 61.3 |
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques · Robotics and Sensor-Based Localization
Department of Electrical and Electronic Engineering, Yonsei University. Doctor of Philosophy, February 2019.
Abstract
Establishing correspondences is a fundamental task in a variety of image processing and computer vision applications. In particular, finding the correspondences between a non-linearly deformed image pair induced by different modality conditions is a challenging problem. This paper describes an efficient but powerful image transform called local area transform (LAT) for modality-robust correspondence estimation. Specifically, LAT transforms an image from the intensity domain to the local area domain, which is invariant under nonlinear intensity deformations, especially radiometric, photometric, and spectral deformations. In addition, robust feature descriptors are reformulated with LAT for several practical applications. Furthermore, a LAT-convolution layer and an Aception block are proposed and, with these novel components, a deep neural network called LAT-Net is proposed, especially for the scene recognition task. Experimental results show that LATransformed images are consistent across nonlinearly deformed images, even under random intensity deformations. LAT reduces the mean absolute difference by approximately 0.20 and the different pixel ratio by approximately 58% on average, as compared to conventional methods. Furthermore, the reformulation of descriptors with LAT outperforms conventional methods, which is a promising result for the tasks of cross-spectral and cross-modality correspondence matching. LAT gains an approximately 23% improvement in the correct detection ratio and a 10% improvement in the recognition rate for the tasks of RGB-NIR cross-spectral template matching and cross-spectral feature matching, respectively. LAT reduces the bad pixel percentage by approximately 15% and the root mean squared error by 13.5 in the task of cross-radiation stereo matching. LAT also improves the cross-modal dense flow estimation task in terms of warping error, providing a 50% error reduction. LAT-Net provides 14% and 7% accuracy improvements in the cross-spectral scene recognition and domain generalized scene recognition tasks, respectively. The local area can therefore be considered an alternative to the intensity domain for robust correspondence matching, image recognition, and many applications such as feature matching, stereo matching, dense correspondence matching, image recognition, and image retrieval.
Dedication
I would like to dedicate this thesis to my loving family …
Acknowledgements.
I would like to express my sincere gratitude to my supervisor, Prof. Kwanghoon Sohn, for the continuous support of my Ph.D. study and related research, and for his patience, motivation, and immense knowledge. His guidance helped me throughout the research and the writing of this dissertation. Besides my supervisor, I would like to thank my dissertation committee, Prof. Euntae Kim, Prof. Hyeran Byun, Prof. Sangyoun Lee, and Prof. Dongbo Min, for their insightful comments and encouragement, and also for the hard questions that motivated me to widen my research from various perspectives. My sincere thanks also go to Dr. Jungdong Seo, Dr. Donghyun Kim, and Prof. Bumsub Ham, who provided me with insights about my research. Without their precious support it would not have been possible to conduct this research. I thank my fellow lab-mates, Dr. Seungryong Kim, Dr. Changae Oh, Dr. Youngjung Kim, Kihong Park, and Sunok Kim, for the discussions and for all the fun we have had in the last years. I also thank Dr. Cho, who gave me the opportunity to join their great team. Last but not least, I would like to thank my family: my parents, my brother, my wife, and my daughters, for supporting me spiritually throughout the writing of this dissertation and in my life in general.
Contents
- 3.5 LAT Reformulated Features: Cross-Modality Feature Descriptors
- 4 Cross-Modality Correspondence Matching and Deep Scene Recognition
List of Figures
- 3.1 The original test color images
- 3.2 Nonlinear intensity deformation robustness of LAT.
- 3.3 The robustness comparison for nonlinear intensity deformations for Mustang (panels: ORG, GW, HM, LC, RT, LAT).
- 3.4 The robustness comparison for nonlinear intensity deformations for Airplane (panels: ORG, GW, HM, LC, RT, LAT).
- 3.5 The robustness comparison for nonlinear intensity deformations for Pepper (panels: ORG, GW, HM, LC, RT, LAT).
- 3.6 Aception block
- 3.7 The structure of LAT-AlexNet.
- 3.8 The structure of LAT-VGGNet.
- 3.9 The structure of LAT-ResNet.
- 4.1 Feature matching on simulated database (panels: PL, PQ, RG, RU).
- 4.2 Recognition rate for simulated database.
- 4.4 Qualitative results of cross-spectral template matching for Lobby (panels: ORG, GW, HM, LC, RT, LAT).
- 4.5 Qualitative results of cross-spectral template matching for Buildings (panels: ORG, GW, HM, LC, RT, LAT).
- 4.6 An example of cross spectral feature matching. Top: LSS and Bottom: LSSLAT
- 4.7 Stereo matching results with a cost function size of 25 for Baby1 stereo pair with Left(1/1) and right(3/1).
- 4.8 Qualitative results of robust stereo matching for Cloth2 (panels: ORG, GW, HM, LC, RT, LAT).
- 4.9 Qualitative results of robust stereo matching for Aloe (panels: ORG, GW, HM, LC, RT, LAT).
List of Tables
- 3.1 Similarity comparison results in terms of mean absolute difference and different pixel ratio for all image pairs.
- 3.2 Similarity comparison results in terms of mean absolute difference and different pixel ratio for non-linear deformation when varying the parameters of LAT.
- 4.1 Cross-spectral template matching results for RGB-NIR Scene Dataset.
- 4.2 Cross-spectral feature matching results for RGB-NIR Scene Dataset in terms of recognition rate.
- 4.3 Stereo matching results for illumination and exposure deformed stereo image pairs.
- 4.4 Cross modal dense flow estimation results for multimodal database in terms of warping error.
- 4.5 Cross Spectral Scene Recognition in terms of recognition top-1 accuracy.
- 4.6 Domain Generalized Scene Recognition in terms of recognition top-1 accuracy: RGB
- 4.7 Domain Generalized Scene Recognition in terms of recognition top-1 accuracy: NIR
Chapter 1 Introduction
Correspondence matching is a fundamental task in a vast range of image processing and computer vision applications: image denoising app1; trinh2014novel, image editing app2; bugeau2014variational, object tracking app3, stereo matching app4, optical flow revaud2015epicflow, image retrieval babenko2015aggregating, image recognition deng2009imagenet; uijlings2013selective, and scene recognition kwitt2012scene; su2012improving. Conventional correspondence matching algorithms are commonly based on gradient-based descriptors SIFT; bay2008speeded; HOG. In the real world, however, images are acquired in uncontrolled environments; thus, an image may suffer intensity deformations due to changes in illumination conditions, camera photometric parameters, viewing positions, and so on problem. Furthermore, cross-modality imaging systems (e.g., multi-spectral imaging systems DB1; sorensen2015multimodal) have recently attracted much attention as a way to address challenging problems that occur in conventional unimodal imaging systems. Images acquired from different modalities also exhibit intensity deformations due to changes in sensor responses and spectral distributions.
These deformations between patches or images make correspondence matching inaccurate. Consider two input images and a pair of corresponding pixels, one in each image. When dealing with correspondence matching under uncontrolled environments or across modalities, three groups of approaches have been considered: tone mapping, color constancy, and robust similarity measures. The first group, called tone mapping, attempts to determine a mapping function that maps the intensities of one image onto those of the other. A classic method for extracting such a function is histogram matching HistogramMatching, which computes a mapping function that optimally aligns the histogram of one image with that of the other. Several methods compute a mapping function based on the statistical distribution of intensity values Statistical1. More sophisticated mapping functions are reviewed in ToneMappingReview2. Tone mapping approaches commonly assume that the two images cover exactly the same scene regions. This assumption clearly holds only when the images are taken from the same viewpoint under the same illumination condition; in other cases the obtained mapping function may be erroneous and inconsistent.
The second group, called color constancy, tries to find a model that transforms images into a constant color space by removing illumination components. One of the most popular methods is the grey-world model, which removes the illumination spectral distribution factor under the assumption that, under a white light source, the average color in a scene is achromatic (i.e., grey) NormalizedChromaticity. Another well-known method, the white patch retinex model, assumes that the maximum response in an image is caused by a perfect reflectance (i.e., a white patch). In practice, this assumption is relaxed by considering the color channels separately, resulting in the max-RGB algorithm. The normalized chromaticity model is commonly used to eliminate the lighting geometry factors under the Lambertian reflectance model NormalizedChromaticity. Gamut mapping and other learning-based algorithms have also been investigated LearningColorConstancy2. However, most models cannot remove the dependency on the lighting geometry and on the illumination spectral distribution simultaneously, as will be discussed in Chapter 2.
The third group, called robust similarity measures, attempts to describe a local signature within a patch that is invariant to a nonlinear deformation. In some cases, an intensity deformation is nonlinear but still monotonic, i.e., the order of intensity levels is preserved. Similarity measures based on such ordinal values include local binary pattern (LBP) LBP, binary robust independent elementary features (BRIEF) BRIEF, rank transform (RT) Rank, and census transform (CT) census. Although these approaches based on ordinal information account for a monotonic mapping, they fail under a non-monotonic intensity deformation.
Gradient-based similarity measures, such as histogram of gradients (HOG) HOG and scale invariant feature transform (SIFT) SIFT, have been considered photometric-invariant similarity measures. Such methods, however, inherently lose information because they contract the data, which weakens their discrimination power, and they fail under a non-monotonic mapping. Normalized cross correlation (NCC) measures the cosine of the angle between two vectors, and is thus robust to a linear intensity deformation. To address the inaccuracy of NCC at object boundaries, adaptive normalized cross correlation (ANCC) is proposed in ANCC. In MTM, a generalized version of NCC is proposed, which is called matching by tone mapping (MTM). Mutual information (MI) MI is a widely used similarity measure for images with nonlinear deformations. MI measures the statistical dependence between two vectors by computing the loss of entropy in one given the other.
To summarize, conventional methods attempt to solve the problem of a nonlinear intensity deformation by adjusting intensity values to be similar or by utilizing gradients, ordinal information, or a statistical measure. However, these approaches cannot account for a general nonlinear intensity deformation. This paper proposes to use local area information as a robust index for nonlinear intensity deformations. We define the local area transform (LAT) as a robust mapping of an image from the intensity domain to a local area domain. LAT is designed to address the nonlinear deformation problem of images that may be acquired with different photometric parameters, light sources, and modalities. The objective of LAT is similar to that of color constancy, i.e., transferring an image from the original intensity (or color) values to a constant intensity (or color) domain. However, unlike color constancy, LAT moves an image from the intensity domain, which is sensitive to a nonlinear deformation, to the robust local area domain. Ordinal transforms such as LBP, RT, and CT also aim to transfer an image to an ordinal information domain, but fail under a non-monotonic intensity deformation. To our knowledge, this study is the first attempt to address the nonlinear deformation problem with local area information in the task of correspondence matching.
This study proves that LAT is a robust image transform for non-linear intensity, radiometric, photometric, and spectral deformations. An efficient implementation of LAT based on the integral histogram is also proposed. Besides its use as a transformation, the concept of LAT is extended to reformulate conventional robust feature descriptors such as SIFT, LSS, CT, and RT. The reformulation embeds the properties of LAT into the conventional feature descriptors. The reformulated descriptors show superior performance in the tasks of non-linear deformation correspondence matching, cross-spectral correspondence matching, cross-radiometry stereo matching, and cross-modality dense correspondence matching. Furthermore, novel deep networks are proposed to address the cross-domain scene recognition problem. In the proposed deep scene recognition network, conventional convolutional layers are replaced by LAT-convolution layers and the Aception block is introduced. The proposed deep scene recognition networks outperform conventional methods in the tasks of cross-spectral scene recognition and domain generalized scene recognition.
The remainder of this dissertation is organized as follows. In Chapter 2, related work is reviewed. In Chapter 3, LAT is described together with its properties and implementation details. LAT-reformulated features and LAT-Net are also presented. In Chapter 4, the performance of LAT is evaluated in the tasks of nonlinearly deformed image matching, cross-spectral correspondence matching, cross-radiometry stereo matching, cross-modal dense flow estimation, and cross-modality scene recognition. Chapter 5 concludes the dissertation with a discussion.
Chapter 2 Related Works
An image taken by a linear imaging device with sensor is modeled as ImageModel:
[TABLE]
where denotes the sensor response at a point in the spatial coordinate, represents the spectral distribution of the incident illuminant, represents the surface reflectance at , and represents the spectral response of the sensor. Approximating the sensor spectral response as the Dirac delta function such that , (2.1) is simplified as follows:
[TABLE]
Under Planck’s law, the spectral distribution of the illuminant is modeled as a function of the absolute temperature and the wavelength as where , , is the speed of light, is Planck’s constant, and is the Boltzmann constant. The surface reflectance is represented as where is a lighting geometry factor and is the matte-surface reflectance with the assumption of a matte surface. Taking the exposure time into consideration, the image acquisition model in (2.2) is modified as
[TABLE]
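For orientation, the acquisition model described in this passage can be written in one common notation. The symbols below ($f_k$, $E$, $S$, $Q_k$, $q_k$, $g$, $\rho$, $t$, $T$) are illustrative assumptions and may differ from the thesis's own notation, and the illuminant term uses the Wien approximation of Planck's law, which is also an assumption here.

```latex
\begin{align}
f_k(\mathbf{x}) &= \int E(\mathbf{x},\lambda)\,S(\mathbf{x},\lambda)\,Q_k(\lambda)\,d\lambda
  && \text{general model} \\
f_k(\mathbf{x}) &= q_k\,E(\mathbf{x},\lambda_k)\,S(\mathbf{x},\lambda_k)
  && \text{with } Q_k(\lambda) = q_k\,\delta(\lambda-\lambda_k) \\
E(\lambda,T) &\approx c_1\,\lambda^{-5}\,e^{-c_2/(\lambda T)},
  \qquad c_1 = 2\pi h c^2,\quad c_2 = hc/k_B \\
f_k(\mathbf{x}) &= t\,q_k\,g(\mathbf{x})\,\rho(\mathbf{x},\lambda_k)\,c_1\,\lambda_k^{-5}\,e^{-c_2/(\lambda_k T)}
  && \text{with } S = g\,\rho \text{ and exposure time } t
\end{align}
```

Read this way, the exposure time and the illuminant enter only as multiplicative factors on the reflectance term, while a change of sensor changes $\lambda_k$ and therefore the reflectance value itself.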
When images are acquired in an uncontrolled environment or in a cross-modality system, they suffer from nonlinear deformations induced by the different modalities. To address the correspondence problem under uncontrolled environments or across modalities, three groups of approaches have been explored: tone mapping, color constancy, and robust similarity measures. Color constancy is the line of work most closely related to the proposed LAT. Color constancy tries to find a model that transforms images into a constant color space by removing illumination components. One of the most popular methods is the grey-world model, which removes the illumination spectral distribution factor under the assumption that, under a white light source, the average color in a scene is achromatic (i.e., grey) NormalizedChromaticity. Another well-known method, the white patch retinex model, assumes that the maximum response in an image is caused by a perfect reflectance (i.e., a white patch). In practice, this assumption is relaxed by considering the color channels separately, resulting in the max-RGB algorithm. The normalized chromaticity model is commonly used to eliminate the lighting geometry factors under the Lambertian reflectance model NormalizedChromaticity. Gamut mapping and other learning-based algorithms have also been investigated LearningColorConstancy2. However, most models cannot remove the dependency on the lighting geometry and on the illumination spectral distribution simultaneously. More recently, color constancy methods based on deep neural networks have also been explored bianco2015color; barron2015convolutional; oh2017approaching; hu2017fc4.
The grey-world model estimates the illuminant by averaging channel values under the assumption that the average reflectance in an image is achromatic, and has been shown to be an instance of the Minkowski-norm framework with $p=1$ Minkowski. The grey-world model is then computed as follows:
[TABLE]
In practice, it is computed within local neighbors under the assumption that the illumination is locally constant; thus (2.4) simplifies to:
[TABLE]
(2.5) implies that the grey-world model is invariant to an illumination deformation under the local-constancy assumption. However, images acquired by different modalities (e.g., cross-spectral images) undergo non-linear deformations, so the grey-world model no longer guarantees robustness to spectral deformations.
The -channel normalized chromaticity NormalizedChromaticity eliminates the effect of the lighting geometry by dividing each channel response by the average of them as follows:
[TABLE]
where is the number of channels. Substituting (2.3), (2.6) is simplified as
[TABLE]
where . (2.7) indicates that the normalized chromaticity only removes the lighting geometry factor . Log-chromaticity ANCC defined as transforms a nonlinear deformation into a linear deformation. However, neither the normalized chromaticity nor the log-chromaticity is applicable to a single-channel image, e.g., an infra-red image.
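The complementary behaviour of the two color constancy models discussed above can be checked numerically. The sketch below is illustrative only: `grey_world` here normalizes by global channel means (the text computes it within local neighbors), the toy image, gains, and shading are synthetic, and all function and variable names are assumptions made for the example.

```python
import numpy as np

def grey_world(image):
    """Divide every channel by its mean over the whole image."""
    img = np.asarray(image, dtype=np.float64)
    return img / img.mean(axis=(0, 1), keepdims=True)

def normalized_chromaticity(image):
    """Divide every channel response by the per-pixel average over channels."""
    img = np.asarray(image, dtype=np.float64)
    return img / img.mean(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
scene = rng.uniform(0.1, 1.0, size=(8, 8, 3))        # strictly positive toy image
gains = np.array([0.6, 1.0, 1.7])                    # per-channel illuminant change
shading = rng.uniform(0.3, 2.0, size=(8, 8, 1))      # per-pixel lighting geometry factor

gw_removes_gains = np.allclose(grey_world(scene), grey_world(scene * gains))
gw_removes_shading = np.allclose(grey_world(scene), grey_world(scene * shading))
nc_removes_shading = np.allclose(normalized_chromaticity(scene),
                                 normalized_chromaticity(scene * shading))
nc_removes_gains = np.allclose(normalized_chromaticity(scene),
                               normalized_chromaticity(scene * gains))
```

In this toy setting the grey-world normalization cancels the per-channel gains but not the shading, and the normalized chromaticity cancels the shading but not the gains, which mirrors the statement that most models cannot remove both dependencies at once.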
Tone mapping algorithms attempt to construct a mapping function that maps the intensities of one image onto those of the other. A classic method for extracting such a function is histogram matching HistogramMatching, which computes a mapping function that optimally aligns the histogram of one image with that of the other. Several methods compute a mapping function based on the statistical distribution of intensity values Statistical1; eilertsen2015real. More sophisticated mapping functions are reviewed in ToneMappingReview2. Tone mapping approaches commonly assume that the two images cover exactly the same scene regions. This assumption clearly holds only when the images are taken from the same viewpoint under the same illumination condition; in other cases the obtained mapping function may be erroneous and inconsistent. Histogram matching, the most common tone mapping scheme, aligns the histogram of one image to that of the other when they are acquired from the same scene at the same viewpoint. However, this assumption is too strict for practical environments. In addition, histogram matching is stable only for global deformations and offers no guarantee under local deformations.
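A minimal sketch of histogram matching as a tone mapping, under the assumption that both images show exactly the same content, may make its strengths and limits concrete. The function name `match_histogram` and the CDF-interpolation formulation are choices made for this example, not taken from the cited method.

```python
import numpy as np

def match_histogram(source, reference):
    """Monotone tone mapping that aligns the CDF of source with that of reference."""
    src = np.asarray(source)
    ref = np.asarray(reference)
    src_vals, src_idx, src_counts = np.unique(
        src.ravel(), return_inverse=True, return_counts=True)
    ref_vals, ref_counts = np.unique(ref.ravel(), return_counts=True)
    src_cdf = np.cumsum(src_counts) / src.size
    ref_cdf = np.cumsum(ref_counts) / ref.size
    # For every source level, pick the reference level at the same quantile.
    mapped = np.interp(src_cdf, ref_cdf, ref_vals)
    return mapped[src_idx].reshape(src.shape)

rng = np.random.default_rng(1)
reference = rng.random((16, 16))
monotone = reference ** 2                    # global, order-preserving deformation
non_monotone = (reference - 0.5) ** 2        # global, order-breaking deformation

recovered = match_histogram(monotone, reference)
not_recovered = match_histogram(non_monotone, reference)
```

Because the estimated mapping is monotone by construction, `recovered` reproduces the reference while `not_recovered` does not, even though both deformations are global and the images are perfectly aligned.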
Robust similarity measures attempt to describe a local signature within a patch that is invariant to a nonlinear deformation. In some cases, an intensity deformation is nonlinear but still monotonic, i.e., the order of intensity levels is preserved. Similarity measures based on such ordinal values include local binary pattern (LBP) LBP, binary robust independent elementary features (BRIEF) BRIEF, rank transform (RT) Rank, and census transform (CT) census. Although these approaches based on ordinal information account for a monotonic mapping, they fail under a non-monotonic intensity deformation.
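The behaviour of ordinal measures can be illustrated with a small rank transform. This is a sketch under assumptions: the rank is taken as the number of window pixels strictly darker than the center, borders are handled by edge replication, and the name `rank_transform` and the test deformations are chosen for the example.

```python
import numpy as np

def rank_transform(image, half_window=1):
    """Count, per pixel, the window neighbours strictly darker than the centre."""
    img = np.asarray(image, dtype=np.float64)
    h, w = img.shape
    r = half_window
    padded = np.pad(img, r, mode="edge")
    out = np.zeros((h, w), dtype=np.int64)
    for dy in range(2 * r + 1):
        for dx in range(2 * r + 1):
            # Shifted view of the image compared against the centre pixel.
            out += padded[dy:dy + h, dx:dx + w] < img
    return out

rng = np.random.default_rng(5)
patch = rng.random((12, 12))
monotone = patch ** 3                 # nonlinear but order-preserving
folded = np.abs(patch - 0.5)          # non-monotonic

rank_kept = np.array_equal(rank_transform(patch), rank_transform(monotone))
rank_broken = not np.array_equal(rank_transform(patch), rank_transform(folded))
```

The ranks survive the order-preserving deformation and change under the folded one, which is the failure mode the paragraph above describes.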
Gradient-based similarity measures, such as histogram of gradients (HOG) HOG and scale invariant feature transform (SIFT) SIFT, have been considered photometric-invariant similarity measures. Such methods, however, inherently lose information because they contract the data, which weakens their discrimination power, and they fail under a non-monotonic mapping. Recently, the dense adaptive self-correlation (DASC) descriptor has been proposed to provide robustness to modality variations, but it also has limitations under non-linear deformations kim2015dasc. Normalized cross correlation (NCC) measures the cosine of the angle between two vectors, and is thus robust to a linear intensity deformation. To address the inaccuracy of NCC at object boundaries, adaptive normalized cross correlation (ANCC) is proposed in ANCC. In MTM, a generalized version of NCC is proposed, which is called matching by tone mapping (MTM). Mahalanobis distance cross-correlation (MDCC) has also been proposed kim2014mahalanobis. Mutual information (MI) MI is a widely used similarity measure for images with nonlinear deformations. MI measures the statistical dependence between two vectors by computing the loss of entropy in one given the other. Recently, similarity measures based on deep learning have also been actively studied chen2015deep; kim2017fcss; han2017scnet; ufer2017deep.
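The contrast between NCC and MI drawn above can be sketched in a few lines. The implementation choices are assumptions made for illustration: `ncc` is the zero-mean variant, `mutual_information` uses an 8x8 joint histogram and natural logarithms, and the patches are synthetic.

```python
import numpy as np

def ncc(u, v):
    """Zero-mean normalized cross correlation (cosine between centred vectors)."""
    u = np.asarray(u, dtype=np.float64).ravel()
    v = np.asarray(v, dtype=np.float64).ravel()
    u = u - u.mean()
    v = v - v.mean()
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def mutual_information(u, v, bins=8):
    """Mutual information (in nats) estimated from a joint histogram."""
    joint, _, _ = np.histogram2d(np.ravel(u), np.ravel(v), bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0                                   # skip empty cells (0 log 0 = 0)
    return float((pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum())

rng = np.random.default_rng(2)
patch = rng.random((32, 32))
linear = 3.0 * patch + 7.0             # linear deformation
folded = (patch - 0.5) ** 2            # non-monotonic deformation
unrelated = rng.random((32, 32))       # independent patch

ncc_linear = ncc(patch, linear)
ncc_folded = ncc(patch, folded)
mi_folded = mutual_information(patch, folded)
mi_unrelated = mutual_information(patch, unrelated)
```

NCC stays at its maximum under the linear deformation but drops sharply for the folded patch, whereas MI still separates the folded patch from an unrelated one because it measures statistical dependence rather than linear agreement.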
Under a linear deformation written as $I' = aI + b$, where $a$ and $b$ are constants, a gradient is deformed by the scaling factor $a$: $\nabla I' = a\nabla I$. Gradient information can therefore be a robust feature when $a > 0$. However, when $a < 0$, gradient inversion occurs, which makes gradient-based similarity measures such as HOG and SIFT inaccurate. When the deformation is non-linear, the gradients are not preserved across the deformation. In some cases, the intensity deformation is nonlinear but still monotonic, i.e., the order of intensity levels is preserved. An intensity ordinal similarity measure, such as LBP, RT, or CT, provides robustness under the monotonicity assumption, but this assumption is violated by a general non-linear deformation. The local intensity order is then not preserved, which makes intensity ordinal similarity measures inaccurate under such a deformation.
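The effect of a linear deformation on gradients can be verified directly. In this sketch the helper `gradient_polar`, the central-difference gradient from `np.gradient`, and the particular scale and offset are assumptions made for the example.

```python
import numpy as np

def gradient_polar(image):
    """Return gradient magnitude and orientation from central differences."""
    gy, gx = np.gradient(np.asarray(image, dtype=np.float64))
    return np.hypot(gx, gy), np.arctan2(gy, gx)

rng = np.random.default_rng(3)
patch = rng.random((16, 16))
mag, ori = gradient_polar(patch)
mag_lin, ori_lin = gradient_polar(2.5 * patch + 10.0)   # positive scale
mag_inv, ori_inv = gradient_polar(-patch)               # negative scale

scaled = np.allclose(mag_lin, 2.5 * mag)                  # magnitudes scale by the factor
same_direction = np.allclose(np.cos(ori_lin - ori), 1.0)  # orientation unchanged
flipped = np.allclose(np.cos(ori_inv - ori), -1.0)        # orientation reversed by 180 degrees
```

With a positive scale only the magnitude changes, so orientation histograms are unaffected; with a negative scale every orientation is reversed, which is the gradient inversion case.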
One of the most important applications in computer vision is image recognition. In particular, scene recognition is an important problem for computer vision applications such as robotics, image search, and geo-localization. However, scene recognition is a challenging problem because scenes commonly include both a holistic component and object-based components. Conventional methods for scene recognition can be categorized into holistic gist descriptors oliva2001modeling and local feature based descriptors nowak2006sampling. Local feature based approaches were mainly built on the bag-of-features (BoF) representation, using local features such as SIFT or HOG kwitt2012scene; li2010object; su2012improving combined through a pooling operator. Sophisticated pooling strategies such as the vector of locally aggregated descriptors (VLAD) su2012improving or the Fisher vector (FV) sanchez2013image emerged as the dominant mechanism for scene recognition.
In recent years, convolutional neural networks (CNNs) have become the feature extractors of choice for scene recognition. The earlier success of sophisticated pooling led many studies to use CNN activations as local features. Early methods adopted BoF-like approaches, based on the extraction of features from intermediate CNN layers, which were then fed to dictionary learning methods such as clustering gong2014multi or sparse coding dixit2015scene and pooled by VLAD gong2014multi or the Fisher vector liu2014encoding. In liu2014encoding, the semantic Fisher vector was proposed, converting features from the probability space to the natural parameter space. In li2017deep, the mixture of factor analyzers Fisher vector was proposed. However, these methods suffer from two drawbacks: 1) the Fisher vector structure is not easy to integrate into a CNN, and 2) the resulting representations are too high-dimensional. These drawbacks prevent end-to-end training and thus lead to sub-optimal solutions. Recently, VLAD and Fisher vectors have been embedded into CNN architectures by deriving neural network implementations of their equations. arandjelovic2016netvlad proposed NetVLAD, an embedded implementation of the VLAD descriptor, and tang2016deep proposed Deep FisherNet, an embedded implementation of the GMM Fisher vector.
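As a reminder of what the pooling step computes, a bare-bones VLAD encoder is sketched below. It uses hard assignment to given cluster centers and a single global L2 normalization; the function name `vlad`, the omission of intra-normalization, and the toy data are assumptions for the example, and this is not the NetVLAD layer.

```python
import numpy as np

def vlad(descriptors, centers):
    """Aggregate residuals of local descriptors to their nearest cluster centre."""
    d = np.asarray(descriptors, dtype=np.float64)
    c = np.asarray(centers, dtype=np.float64)
    # Squared distances between every descriptor and every centre.
    dist = ((d[:, None, :] - c[None, :, :]) ** 2).sum(axis=-1)
    assign = dist.argmin(axis=1)
    v = np.zeros_like(c)
    for k in range(len(c)):
        members = d[assign == k]
        if len(members):
            v[k] = (members - c[k]).sum(axis=0)   # sum of residuals for cluster k
    v = v.ravel()
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v

local_features = np.array([[1.0, 0.0], [3.0, 0.0], [0.0, 10.0]])
codebook = np.array([[2.0, 0.0], [0.0, 9.0]])
encoding = vlad(local_features, codebook)
```

The output has one block of residuals per codebook entry, so its length is the number of centers times the descriptor dimension, which is the source of the high dimensionality mentioned above.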
CNNs trained on ImageNet donahue2014decaf struggled to outperform hand-designed features combined with sophisticated classifiers sanchez2013image on scene recognition. This can be ascribed to the fact that scene data has very different characteristics from object classification data. To overcome this problem, zhou2014learning; zhou2017places trained a scene-centric CNN on a newly constructed large-scale scene dataset, called Places, resulting in a significant performance improvement.
In real-world applications, scene images are frequently taken under very different imaging conditions, sensor specifications, and weather. In such a cross-domain setting, common scene recognition algorithms frequently fail to perform well. To address the dataset bias problem, many domain adaptation approaches bruzzone2010domain; duan2012domain; baktashmotlagh2013unsupervised have been proposed to reduce the mismatch between the data distributions of the training samples and the target samples. In george2016semantic, semantic clustering (SC) was proposed as a domain generalization method for fine-grained scene recognition (unlike domain adaptation, in domain generalization the knowledge learnt from one or multiple source domains is transferred to an unseen target domain).
Chapter 3 Local Area Transform (LAT)
3.1 Definition of LAT
In this paper, we propose to use local area information as a robust index for a nonlinear intensity deformation. Consider an input image, a current pixel, a neighboring pixel, and the set of pixels neighboring the current pixel. Within this neighborhood, take the set of pixels whose intensity values are similar to that of the current pixel; the local area is defined as the area of this set. We define the mapping of an image from the intensity domain to the local area domain as the local area transform (LAT). LAT is designed to address the matching problem for a non-linearly deformed image pair, which might be acquired with different radiometric parameters, different photometric parameters, or different modalities (including different spectra). The LAT at a pixel is computed as follows:
[TABLE]
where $\tau(x,y)=\begin{cases} s(x,y) & \text{if } s(x,y)<thr \\ 0 & \text{else} \end{cases}$ is a piecewise function defined in terms of a similarity function $s(x,y)$. The similarity function is modeled according to the usage and application. For example, it can be measured as an equality check, or as a similarity in the spatial, intensity, or gradient domain. When the similarity is modeled as an equality check, the function is defined as $\tau(x,y)=\begin{cases} 1 & \text{if } x=y \\ 0 & \text{else} \end{cases}$ with : where and : and .
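One way to write the definition described above is given below. The symbols $I$ (image), $\mathbf{p}$ (current pixel), $\mathbf{q}$ (neighboring pixel), and $N(\mathbf{p})$ (neighborhood) are assumed names that may differ from the thesis's notation, and counting the center pixel itself as part of $N(\mathbf{p})$ is also an assumption.

```latex
\begin{equation}
\mathrm{LAT}(\mathbf{p}) \;=\; \sum_{\mathbf{q} \in N(\mathbf{p})}
  \tau\bigl(I(\mathbf{p}),\, I(\mathbf{q})\bigr),
\qquad
\tau(x,y) = \begin{cases} 1 & \text{if } x = y \\ 0 & \text{else} \end{cases}
\end{equation}
```

As a small worked case with the equality check, take the $3\times3$ window with rows $(5,5,2)$, $(5,5,7)$, $(1,5,9)$ and center value $5$. Five pixels in the window carry the value $5$, so the local area at the center is $5$ under these assumptions, regardless of what the other intensities are.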
3.2 Properties of LAT
A variety of real-world computer vision applications require invariance properties, especially in uncontrolled environments. This section derives the invariance of LAT to non-linear intensity deformations, especially radiometric, photometric, and spectral deformations.
3.2.1 Invariance to non-linear intensity deformation
For a registered input image pair and , a non-linear intensity deformation between and can be represented as where is an intensity mapping operator. Then, is written as follows:
[TABLE]
where and are constant values varied according to and . For the case of and consequently , with of the function , . For the case of and consequently , under the assumption that the deformation function is a one-to-one mapping, is also equal to . From these equalities, . In other words, LAT is invariant to non-linear intensity deformations.
3.2.2 Invariance to radiometric & photometric deformations
Substituting (3) into (8), is rewritten as follows:
[TABLE]
Under the assumption of local-constancy of and the fact that and are constant values, (10) is simplified with of the function as:
[TABLE]
(11) indicates that LAT is independent of the illumination spectral distribution and the exposure time , i.e., it is invariant to illumination and exposure deformations (corresponding to radiometric and photometric deformations, respectively).
3.2.3 Invariance to spectral deformation
When we let be a LATransformed value of an image captured by -sensor with (e.g., visible spectrum) and be a LATransformed value of an image captured by -sensor with (e.g., infra-red spectrum), we show that , i.e., the invariance of LAT to a spectral deformation, as follows. From (11), and are written as (12) and (13), respectively.
[TABLE]
[TABLE]
We assume that pixels having the same spectral reflectance values for a specific wavelength have the same spectral reflectance values for another wavelength, i.e., . Under this assumption and the of the function , when . For the case of , is commonly [math] except for and . Note that this exceptional case is excluded from consideration since it rarely occurs. Accordingly, for any wavelength pair and , i.e., a LAT value is invariant to a spectral deformation.
3.2.4 Limitation
In the above, we show the invariance of LAT to non-linear intensity deformations. However, when the deformation function is not a one-to-one mapping, there is possibly a duplicated mapping, i.e., and consequently , in (9). For such a mapping, . In other words, the LAT is not invariant to a duplicated intensity deformation. Nevertheless, LAT is still a robust transform to non-linear deformations since the non-duplicated deformation assumption is commonly satisfied.
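The invariance and its limitation can be checked numerically on a toy image. The sketch below uses an equality-check LAT with a border-clipped window (an assumption of this example) and two hypothetical intensity mappings: one that is one-to-one and one that merges two intensities into the same value.

```python
def lat_equality(image, half_window):
    """Equality-check LAT: count same-valued pixels in a clipped square window."""
    height, width = len(image), len(image[0])
    return [
        [
            sum(
                1
                for ny in range(max(0, y - half_window), min(height, y + half_window + 1))
                for nx in range(max(0, x - half_window), min(width, x + half_window + 1))
                if image[ny][nx] == image[y][x]
            )
            for x in range(width)
        ]
        for y in range(height)
    ]


def remap(image, mapping):
    """Apply an intensity mapping given as a dictionary (hypothetical helper)."""
    return [[mapping[v] for v in row] for row in image]


image = [
    [10, 10, 50],
    [10, 50, 50],
    [90, 90, 50],
]
reference = lat_equality(image, 1)

one_to_one = {10: 200, 50: 3, 90: 77}   # non-monotonic but injective
duplicated = {10: 5, 50: 5, 90: 77}     # two intensities collapse into one

invariant = lat_equality(remap(image, one_to_one), 1) == reference
broken = lat_equality(remap(image, duplicated), 1) != reference
```

In this toy case the one-to-one mapping leaves every local area value unchanged even though it is not monotonic, while the duplicated mapping enlarges the local area wherever the merged intensities are neighbours.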
3.3 Implementation of LAT and Extension
The LAT is efficiently computed from a local histogram as . is a -dimensional vector defined as:
[TABLE]
where represents the histogram value corresponding to a bin , is the number of bins, and is zero except when intensity value belongs to bin . The computational complexity of the brute-force implementation of the local histograms is linear in the neighborhood size. This dependency can be removed using the integral histogram IntegralHistogram in a way similar to the integral image, which reduces the computational complexity from to at each pixel location.
For practical usefulness and noise robustness, we employ a Gaussian integrated similarity function in the intensity domain instead of the naive definition (with the equality-check similarity function) for computing the local area value. Specifically, the local area value is computed by a weighted integration of adjacent bins as (3.8).
[TABLE]
[TABLE]
where is a normalization factor, is Gaussian similarity weights of adjacent bins, , and . Parameters and control the interval of integration and the degree of Gaussian smoothing of the histogram, respectively. Pseudo code is given in Algorithm 3.1. First, the integral histogram is computed over the image, and then the local histogram at pixel is computed. Lastly, the local area value is computed with the Gaussian similarity weights . For multi-channel sensors, e.g., an RGB sensor, local area values are computed for each channel separately.
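The three steps described above can be sketched in plain Python. This is an illustration under stated assumptions, not the author's implementation: the bin quantization, the border clipping, the normalization by the sum of the weights actually used, and all names (`lat_integral_histogram`, `interval`, `sigma`, `num_bins`) are choices made for this example.

```python
import math


def lat_integral_histogram(image, half_window, num_bins=8, max_value=256,
                           interval=1, sigma=0.5):
    height, width = len(image), len(image[0])

    def bin_of(value):
        # Assumption: uniform quantization of [0, max_value) into num_bins bins.
        return min(num_bins - 1, value * num_bins // max_value)

    # Step 1: integral histogram. integral[y][x][b] counts the pixels of
    # bin b located in rows < y and columns < x.
    integral = [[[0] * num_bins for _ in range(width + 1)] for _ in range(height + 1)]
    for y in range(height):
        for x in range(width):
            cell = integral[y + 1][x + 1]
            for k in range(num_bins):
                cell[k] = (integral[y][x + 1][k] + integral[y + 1][x][k]
                           - integral[y][x][k])
            cell[bin_of(image[y][x])] += 1

    # Gaussian similarity weights for the bins adjacent to the centre bin.
    weights = {k: math.exp(-(k * k) / (2.0 * sigma * sigma))
               for k in range(-interval, interval + 1)}

    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            y0, y1 = max(0, y - half_window), min(height, y + half_window + 1)
            x0, x1 = max(0, x - half_window), min(width, x + half_window + 1)
            centre_bin = bin_of(image[y][x])
            total, norm = 0.0, 0.0
            for k, w in weights.items():
                b = centre_bin + k
                if 0 <= b < num_bins:
                    # Step 2: local histogram value from four corner lookups.
                    count = (integral[y1][x1][b] - integral[y0][x1][b]
                             - integral[y1][x0][b] + integral[y0][x0][b])
                    # Step 3: Gaussian-weighted integration of adjacent bins.
                    total += w * count
                    norm += w
            out[y][x] = total / norm
    return out


image = [
    [10, 10, 50],
    [10, 50, 50],
    [90, 90, 50],
]
# With one bin per intensity and no adjacent bins, the result reduces to
# the equality-check local area.
exact = lat_integral_histogram(image, half_window=1, num_bins=256, interval=0)
smooth = lat_integral_histogram(image, half_window=1, num_bins=8, interval=1, sigma=0.5)
```

Only the bins inside the integration interval are read per pixel, so the per-pixel cost does not grow with the window size once the integral histogram is built.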
3.4 Robustness Evaluation of LAT
A non-linear intensity deformation is commonly induced by different modalities of imaging systems. In order to evaluate the robustness of LAT to non-linear intensity deformation, a challenging simulated database is constructed. Eight color images (Airplane, Baboon, Bikes, Lena, Mustang, PaintedFace, Peppers, TwoMacaws, shown in Fig. 3.1) were employed as original images. Each image is deformed using 40 intensity deformation functions constructed from four categories of random probability distributions: piecewise linear mapping (PL), piecewise quadratic mapping (PQ), random mapping with Gaussian distribution (RG), and random mapping with uniform distribution (RU). Different deformation functions were applied to each of the R, G, and B channels. In total, 320 non-linearly deformed pairs of color images were generated.
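As a rough illustration of how one such deformation function might be generated, the sketch below builds a random piecewise linear lookup table. The text does not specify how the PL functions were sampled, so the knot placement, the non-decreasing shape, and the name `random_piecewise_linear_mapping` are assumptions of this example.

```python
import random


def random_piecewise_linear_mapping(num_segments=4, levels=256, seed=0):
    """Return a lookup table of length `levels` describing a random,
    non-decreasing piecewise linear intensity mapping."""
    rng = random.Random(seed)
    # Interior knots are sorted on both axes, so the mapping never decreases.
    xs = [0] + sorted(rng.sample(range(1, levels - 1), num_segments - 1)) + [levels - 1]
    ys = [0] + sorted(rng.sample(range(1, levels - 1), num_segments - 1)) + [levels - 1]
    lut = []
    seg = 0
    for v in range(levels):
        while v > xs[seg + 1]:
            seg += 1
        t = (v - xs[seg]) / (xs[seg + 1] - xs[seg])
        lut.append(round(ys[seg] + t * (ys[seg + 1] - ys[seg])))
    return lut


def deform_channel(channel, lut):
    """Apply a lookup table to one image channel (list of rows)."""
    return [[lut[v] for v in row] for row in channel]


lut = random_piecewise_linear_mapping(num_segments=4, seed=1)
deformed = deform_channel([[0, 64], [128, 255]], lut)
```

Note that rounding to integer levels can map neighbouring intensities to the same output, which is exactly the duplicated-mapping case discussed in Section 3.2.4.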
The robustness of LAT was evaluated with comparisons to four methods: grey-world model (GW) NormalizedChromaticity, histogram matching (HM) HistogramMatching, log-chromaticity (LC) ANCC, and rank transform (RT) Rank. As a base, the original image pair before transform (ORG) was also compared.
The similarity between a registered image pair is measured by the mean absolute difference and different pixel ratio . and are defined as (21) and (22), respectively.
[TABLE]
where is a transformed version of the original image, is a transformed version of the deformed image, is a normalization factor, is the number of pixels, is the maximum value of the label.
[TABLE]
where is a normalization factor, is threshold value.
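The two measures can be sketched as follows. Since the equations themselves are not reproduced here, the exact normalization is an assumption: the mean absolute difference is divided by the number of pixels and the maximum label value, and the different pixel ratio is the fraction of pixels whose absolute difference exceeds the threshold. The function names are hypothetical.

```python
def mean_absolute_difference(a, b, max_label):
    """Assumed form: sum of |a - b| divided by (number of pixels * max_label)."""
    n = sum(len(row) for row in a)
    total = sum(abs(pa - pb) for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    return total / (n * max_label)


def different_pixel_ratio(a, b, threshold):
    """Assumed form: fraction of pixels with |a - b| above the threshold."""
    n = sum(len(row) for row in a)
    differing = sum(
        1 for ra, rb in zip(a, b) for pa, pb in zip(ra, rb) if abs(pa - pb) > threshold
    )
    return differing / n


transformed_original = [[0, 10], [20, 30]]
transformed_deformed = [[0, 20], [20, 10]]
mad = mean_absolute_difference(transformed_original, transformed_deformed, max_label=10)
dpr = different_pixel_ratio(transformed_original, transformed_deformed, threshold=5)
```

Both values are zero for identical transformed images, so lower values indicate that the transform has brought the pair closer together.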
The quantitative evaluations for LAT are summarized in Table 3.1 and Table 3.2, showing that LAT is superior to the other methods in terms of both and . Note that the lower and are, the more similar the sample image pairs are. In the results, a total of 10 image pairs are used for the average, and sample images are shown in Fig. 3.2 - Fig. 3.5. More specifically, an input image is non-linearly transformed with different transformations, and the results are shown for varying image transformation methods, including the state-of-the-art method and the proposed LAT. LAT transforms non-linearly deformed images into a common domain, where the discrepancy between non-linear deformations is greatly reduced. For most image pairs, the LATransformed images are very similar to each other; in other words, LAT shows higher robustness for randomly intensity-deformed image pairs.
In particular, Table 3.2 analyzes the performance of LAT for varying parameters, including the support window size , the interval of integration , and the degree of Gaussian smoothing . The performance of LAT was the highest when the parameter was 11. Note that the other parameters and in LAT, which control the interval of integration and the degree of Gaussian smoothing of the histogram, did not seriously affect the performance; thus they were set as = 3 and = 0.3 considering the trade-off between efficiency and robustness.
3.5 LAT Reformulated Features: Cross-Modality Feature Descriptors
Besides its use as a transformation, the concept of the Local Area Transform can be used to reformulate conventional cost functions and descriptors. Replacing an ‘intensity value’ by a ‘local area value’ endows cost functions and descriptors with robustness to a modality deformation while maintaining their inherent properties. For example, the most widely used cost function, the mean absolute difference (mad), can be reformulated as follows:
[TABLE]
[TABLE]
where and are the original and the reformulated mad between pixel points p and q. is the set of neighboring pixels around p or q.
Similarly, the original local self-similarity descriptor (LSS) LSS can be reformulated by measuring sum of squared local area difference instead of sum of squared intensity difference as follows:
[TABLE]
[TABLE]
[TABLE]
[TABLE]
where and are the original and the reformulated correlation surface functions in LSS (please refer to LSS for a full description of LSS). is a constant for stability.
SIFT also can be reformulated by using gradients of local area value (3.19) instead of gradients of intensity value (3.18).
[TABLE]
[TABLE]
Binary pattern based robust descriptors, e.g., CT BRISK, RT Rank, BRIEF BRIEF, and BRISK BRISK, are formulated with following local binary pattern (LBP) equation.
[TABLE]
where is LBP at pixel . is the index of neighboring pixels of . (3.20) can be reformulated to with local area value instead of intensity value as follows:
[TABLE]
With the reformulated LBP, the robust descriptors CT BRISK, RT Rank, BRIEF BRIEF, and BRISK BRISK can be reformulated with LAT. We use the subscript LAT to denote the reformulation with LAT in the remaining parts of this paper. Note that any cost function or feature computed from intensity values can be reformulated with LAT.
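To make the substitution concrete, the sketch below computes an 8-neighbour binary pattern on an arbitrary 2D map. Passing an intensity image gives a conventional LBP code, while passing a local area map gives the LAT-reformulated code. The comparison rule (a bit is set when the neighbour is greater than or equal to the centre), the neighbour ordering, and the name `local_binary_pattern` are assumptions of this example, since the equations are not reproduced here.

```python
def local_binary_pattern(values, y, x):
    """8-neighbour binary pattern at (y, x) on any 2D map of values."""
    # Assumed clockwise ordering starting at the top-left neighbour.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    centre = values[y][x]
    code = 0
    for bit, (dy, dx) in enumerate(offsets):
        if values[y + dy][x + dx] >= centre:
            code |= 1 << bit
    return code


# A hypothetical local area map (e.g., the output of a LAT computation).
local_area_map = [
    [3, 5, 2],
    [3, 4, 4],
    [2, 2, 6],
]
code_lat = local_binary_pattern(local_area_map, 1, 1)
```

The descriptor logic is unchanged; only the input domain differs, which is why the inherent properties of the original descriptor carry over.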
3.6 LAT-Net: Deep Scene Recognition Network
Scene recognition is one of the fundamental tasks in various applications of computer vision such as robotics, image search, and geo-localization. However, scene recognition is a challenging problem since scenes contain a variety of components, from objects to scene-like features. Furthermore, in practical applications, scene images are frequently taken under cross-domain settings, such as different imaging conditions, sensor specifications, and even weather conditions. Conventional scene recognition algorithms fail to achieve reliable results in such settings. To address this problem, domain adaptation duan2012domain; baktashmotlagh2013unsupervised or domain generalization george2016semantic approaches have been proposed. This section proposes to embed the LAT concept into deep convolutional neural networks (CNNs) in order to tackle the cross-domain scene recognition problem.
The conventional convolutional (Conv) layer in common CNNs is defined as:
[TABLE]
where and are feature maps of current and layers, respectively. and are weights and bias terms. is convolutional kernel. With the concept of LAT, reformulated convolutional (A-Conv) layer is defined as follows:
[TABLE]
where and are LAT-reformulated feature maps of current and layers, respectively. could be replaced by the output feature maps of regular layers in a CNN such as a Conv layer or a pooling layer. It could also be a previous A-Conv layer, and thus A-Conv layers can be stacked together to form a highly nonlinear transformation operator.
Given their impressive performance on the ImageNet benchmark krizhevsky2012imagenet; russakovsky2015imagenet, three popular CNN architectures, AlexNet krizhevsky2012imagenet, VGG-16 simonyan2014very, and ResNet-34 he2016deep, are employed as base networks. In order to apply the non-linear feature transformation in the networks, the first Conv layers are replaced with A-Conv layers in the proposed network structures: two Conv layers, four Conv layers, and six Conv layers for AlexNet, VGGNet-16, and ResNet-34, respectively. In addition, an inception-like stem block, named the Aception block (Fig. 3.6), is placed at the top of each network. The re-designed CNNs are named LAT-CNN, i.e., LAT-AlexNet, LAT-VGGNet, and LAT-ResNet, respectively. The structures of the re-designed networks are depicted in Figs. 3.7, 3.8, and 3.9, respectively. All the CNNs presented here were implemented and trained using the Caffe package jia2014caffe on Nvidia Tesla P40 GPUs.
Chapter 4 Cross-Modality Correspondence Matching and Deep Scene Recognition
4.1 Experimental Settings
In the experiments, the LAT was implemented with the following parameter settings for all datasets: i, l, =11, 3, 0.3. LAT was implemented in C++ on an Intel Core i7-3770 CPU at 3.40 GHz. The performance of LAT was evaluated for the tasks of nonlinearly-deformed image matching in Section 4.2, cross-spectral correspondence matching in Section 4.3, cross-radiometry stereo matching in Section 4.4, and cross-modal dense flow estimation in Section 4.5. For color images, LAT is computed for each channel, and then those values are used for minimum distance/cost selection. LAT was implemented as a C++ layer in the deep learning library Caffe jia2014caffe for deep scene recognition in Section 4.6.
4.2 Cross-Modality Correspondence Matching
4.2.1 Non-linear Deformation Correspondence Matching
The performance of the feature descriptors reformulated with LAT is evaluated in terms of the feature recognition rate. The feature recognition rate is defined as the ratio of correct matches to the total number of keypoints, as in BRIEF. The keypoints were detected using the SIFT detector. SIFT SIFT, BRIEF BRIEF, and LSS LSS were selected as the compared feature descriptors since they are the most successful feature descriptors based on gradient, binary pattern, and self-similarity, respectively. They are reformulated with LAT to SIFT_LAT, BRIEF_LAT, and LSS_LAT, respectively. For the evaluation, the simulated database described in Section 3.4 was used.
Fig. 4.1 shows an example comparison on a simulated image pair. The correspondences estimated with the conventional SIFT descriptor are shown in the upper part, while those estimated with the proposed LAT-based SIFT descriptor are shown in the lower part. For establishing correspondences, the same fixed parameters are used (e.g., the same matching threshold). In other words, the number of correspondences depends on the robustness of the descriptors. In these results, the LAT-based SIFT descriptor consistently provides better correspondences than the original one. Fig. 4.2 summarizes the overall results, showing that the reformulated descriptors remarkably outperform the original descriptors. In particular, the reformulated descriptors show an extremely high recognition rate even for image pairs generated with random mapping functions (RG, RU). The results suggest that the nonlinear intensity deformation problem generally induced by different imaging modalities can be addressed by reformulating the conventional descriptors with LAT. In the remaining parts of this section, we show the superiority and applicability of LAT for several multi-modality applications.
4.2.2 Cross-Spectral Correspondence Matching
In this section, we show that LAT is superior in terms of detecting the sought template in different spectral images, i.e., cross-spectral template matching. The cross-spectral template matching was applied on 100 RGB-NIR image pairs randomly selected from the RGB-NIR Scene Dataset DB1. For each input NIR image, a template of a given size was selected at 100 random locations. In total, 10,000 (RGB) image-(NIR) template pairs were used in this experiment. To avoid a homogeneous template, the locations of the template were selected from among the structured regions of the image (i.e., locations where the feature response of BRISK BRISK is above a threshold). Given an RGB image and a NIR template, matching distances (footnote: the minimum distance among the NIR/R-channel, NIR/G-channel, and NIR/B-channel distances is set as the distance of the location) were computed for all possible locations in the corresponding RGB image, and the region associated with the minimal distance was considered the matched region. Four different methods, HM HistogramMatching, LC ANCC, RT Rank, and MTM MTM, were employed as compared methods, and the original images were also compared as a base method. The Euclidean distance is employed for ORG, HM, and LC, and the sum of rank differences is employed for RT.
In Figs. 4.3, 4.4, and 4.5, the template matching performance of LAT across cross-spectral images is compared to that of the state-of-the-art methods. We show examples of similarity maps (for better visualization, a similarity map, which is the inverse of the distance map, is illustrated instead of the distance map; a higher value (red) means a similar region and a lower value (blue) means a dissimilar region). The template matching in the LATransformed images clearly shows a sharp peak at the correct location, while it is not well localized in the other methods. Table 4.1 summarizes the average correct detection ratio and matching pixel error . measures the percentage of correct detections (if the matched and true windows overlap by 70%, the match is considered a correct detection), and measures the absolute difference between the matched and true windows. As shown in Table 4.1, where the quantitative evaluation of LAT is reported as an average over 10,000 RGB-NIR template pairs, LAT provides robust results in cross-spectral template matching in terms of both and ; in this study, showed an improvement of 23%, and showed a reduction of 113 pixels.
The performance of the reformulated feature descriptors for cross-spectral feature matching is evaluated in this subsection. The same 100 RGB-NIR image pairs as in the previous experiment were employed for this evaluation. The feature recognition rate is measured for the evaluation, and keypoints were detected using the SIFT detector. SIFT SIFT, BRIEF BRIEF, and LSS LSS were selected as the compared feature descriptors.
Fig. 4.6 shows an example comparison of LSS and LSS_LAT. Specifically, the cross-spectral feature matching results are shown for the conventional LSS descriptor and the LAT-based LSS descriptor, respectively. Note that all the parameters are kept fixed in all experiments. Since the image pairs in this dataset are structurally aligned, reliable correspondences should also be aligned. As shown in the results, the LAT-based LSS consistently outperformed the original LSS. Table 4.2 summarizes the recognition rate, showing that the reformulated descriptors are superior to the original descriptors. The results show that reformulation with LAT provides promising results for cross-spectral feature matching, with an improvement of 10% in recognition rate.
4.2.3 Cross-Radiometry Stereo Matching
This section demonstrates the superiority of LAT for robust stereo matching on radiometrically and photometrically deformed stereo images. Stereo matching is commonly formulated as a minimization problem of the energy in the MAP-MRF framework ANCC as:
[TABLE]
where is the set of neighboring pixels of , and is a disparity. In the first term, is the data cost which measures the dissimilarity between in the left image and in the right image. In the second term, is the smoothness cost which penalizes non-smooth disparities.
In this experiment, we fixed all of the parameters, the cost function, the aggregation method, and the optimization method, and varied only the transformation method. The absolute difference (AD) for the pixelwise data cost, the adaptive support weight AdaptiveSupportWieght with a size of for the cost aggregation, a truncated quadratic cost for the smoothness cost, and loopy belief propagation for the global optimization were employed. Although postprocessing such as occlusion handling and noise removal can improve the quality of the estimated disparities, we did not employ such postprocessing in order to focus on the influence of the transform. For the evaluation and comparison of the performance of LAT with the other methods, the Middlebury stereo data sets hirschmuller2007evaluation, including Aloe, Baby1, Baby3, Bowling2, Cloth2, Cloth3, Lampshade1, and Monopoly, were used. There are three different illumination sources (indexed as 1,2,3) and three different exposures (indexed as 0,1,2), giving nine different image pairs in each data set. In this experiment, the left image is fixed to illumination source 1 and exposure 1, while the right image is varied in both illumination and exposure. In other words, the nine combinations of stereo pairs were used for the evaluation.
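For orientation, a commonly used pairwise form of such an energy is sketched below. The notation is an assumption of this example and may differ from the equation used in this work: $f_p$ is the disparity assigned to pixel $p$, $D_p$ is the data cost, $V_{pq}$ is the smoothness cost, and $N(p)$ is the set of neighbors of $p$.

```latex
E(f) = \sum_{p} D_p(f_p) + \sum_{p} \sum_{q \in N(p)} V_{pq}(f_p, f_q)
```

In this form, changing the image transform only affects the data term, which is why the other components can be held fixed when comparing transforms.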
Four different methods, HM HistogramMatching, LC ANCC, RT Rank, and CT census were employed as compared methods and the original images were also compared as a base method. The qualitative and quantitative comparisons are given in Fig. 4.7 and Table 4.3, respectively. As shown in Table 4.3, the LAT is superior to the other methods in most data sets in terms of bad pixel percentages () and root mean squared errors (). Results presented in Fig. 4.8 and 4.9 show that the qualitative performance of LAT also outperforms the other methods.
4.2.4 Cross-Modality Dense Correspondence Matching
Estimating dense visual flow between different images that share similar scene characteristics is a very challenging problem but a promising function for high-level computer vision tasks SIFTFLOW. In particular, cross-modality dense flow estimation is more challenging due to the disparate properties of the modalities MultiModal. This section analyzes the performance of SIFT-Flow_LAT in comparison to state-of-the-art methods: SIFT-Flow SIFTFLOW and DAISY DAISY. (Footnote: since RSNCC MultiModal is based on a global matching approach, it is not compared here for fair comparison. SIFT-Flow and DAISY are both based on a local matching approach.) For this purpose, the multimodal image database MultiModal is employed, including RGB-NIR, different exposure, and flash-nonflash image pairs.
Fig. 4.10 shows a qualitative comparison of the cross-modality dense flow estimated by SIFT-Flow, DAISY, and SIFT-Flow_LAT. As shown in the figure, SIFT-Flow_LAT provides a more reliable dense flow than the state-of-the-art methods. Table 4.4 summarizes quantitative comparisons in terms of warping error. The warping error is computed from the ground truth displacement for 100 corner points provided in MultiModal. The results indicate that SIFT-Flow_LAT can be a promising approach for cross-modality dense flow estimation.
To address the correspondence-matching problem for different modalities of images, the deformation-robust local area transform is proposed. LAT is a nonlinear deformation-invariant transformation of the intensity information into local area information. The experimental results show that LAT and descriptors reformulated by LAT are superior to the conventional methods in the context of cross-modality correspondence matching. Specifically, LAT gains approximately a 23% improvement in correct detection ratio and a 10% recognition rate increase for the tasks of cross-spectral template matching and feature matching, respectively. LAT also increases the performance of cross-radiometry stereo matching and cross-modality dense flow estimation with a 15% reduction in bad pixel percentage and a 50% reduction in the warping error, respectively. In conclusion, the local area can be considered as an alternative domain to the intensity domain to achieve robust correspondence matching. Future work should include the development of cross-modal object recognition based on the properties of LAT.
4.3 Cross-Modality Deep Scene Recognition
4.3.1 Cross Spectral Scene Recognition
In order to study the performance of LAT-Net for cross-spectral scene recognition, we have constructed a cross-spectral scene database. This database consists of 477 images distributed in 9 categories: Country (52), Field (51), Forest (53), Mountain (55), Old Buildings (51), Street (50), Urban (58), Water (51), where each image is an RGB or NIR image randomly selected from the original RGB-NIR image pairs DB1. Randomly selected 99 images were used for testing (11 per category) and the remaining 378 images were used for training. To avoid over-fitting, training images were augmented by resizing (the resize ratio is randomly varied from 0.5 - 1.5 with center shift ranged -0.1 - 1.0), rotating (the rotation angle is randomly varied from -70 to +70 degrees), color shifting, and flipping. In total, 3,024 training images were employed for training. We trained all networks with the ADAM optimizer kinga2015method, learning rate =0.001, and batch size =16 for 40 epochs. All networks are pre-trained on the Places2 database zhou2017places for 10 epochs. Places2 is an extended version of the Places dataset zhou2014learning and probably the largest scene recognition dataset. In total, Places2 contains more than 10 million images comprising more than 400 unique scene categories. The dataset includes 5,000 to 30,000 training images per class.
We performed a comparison to state-of-the-art scene recognition methods, from hand-crafted methods (GIST oliva2001modeling, DiscrimPatches singh2012unsupervised, ObjectBank li2010object) to deep learned feature based methods (fc7-VLAD gong2014multi, NetVLAD arandjelovic2016netvlad, MFAFVNet li2017deep). Table 4.5 presents quantitative comparisons of cross-spectral scene recognition in terms of top-1 accuracy. As shown in the results, the LAT-redesigned networks provide the highest accuracy even with the simple network structure AlexNet krizhevsky2012imagenet. LAT-ResNet improved the recognition accuracy by 14.8% as compared to the state-of-the-art methods. The results indicate that LAT-redesigned networks are a promising approach for cross-spectral scene recognition.
4.3.2 Domain Generalized Scene Recognition
Domain generalization transfers the knowledge learnt from other source domains to an unseen target domain. In order to study the performance of LAT-Net for domain generalized scene recognition, we have conducted the following experiments. All networks are trained on Places2 zhou2014learning with the ADAM optimizer kinga2015method, learning rate =0.001, and batch size =16 for 20 epochs. Then, the recognition accuracy is measured on unseen RGB-NIR scene databases. For evaluation, we have constructed three scene databases: RGB, NIR, and RGB-NIR combined, which are generated from DB1. We divide DB1 into two separate databases consisting of RGB or NIR images, respectively. The RGB-NIR combined scene database is the same as the database employed in Section 4.6.2. Unlike Section 4.6.2, all 477 images are employed as testing images since they are not used for training.
We performed a comparison to the state-of-the-art scene recognition methods fc7-VLAD gong2014multi, NetVLAD arandjelovic2016netvlad, MFAFVNet li2017deep, and SemanticCluster george2016semantic. Tables 4.6 and 4.7 present quantitative comparisons of domain generalized scene recognition for the RGB and NIR scene databases, respectively, in terms of top-1 accuracy. As shown in the results, the LAT-redesigned networks provide the highest accuracy. LAT-ResNet improved the recognition accuracy by 15.9% and 10.9% as compared to the state-of-the-art methods for the RGB and NIR scene databases, respectively. The results indicate that LAT-redesigned networks are a promising approach for domain generalized scene recognition.
Chapter 5 Conclusion
This dissertation proposes a deformation-robust image transform, called the local area transform (LAT), and mathematically and experimentally proves its invariance to nonlinear deformations. LAT is also extended into robust cost functions, feature descriptors, and deep scene recognition networks.
The experimental results have shown that LAT and descriptors reformulated by LAT were superior to the conventional methods in the context of cross-modality correspondence matching. Specifically, LAT gains approximately a 23% improvement in correct detection ratio and a 10% recognition rate increase for the tasks of cross-spectral template matching and feature matching, respectively. LAT also increases the performance of cross-radiometry stereo matching and cross-modality dense flow estimation with a 15% reduction in bad pixel percentage and a 50% reduction in the warping error, respectively. Furthermore, the proposed LAT-Net outperforms existing state-of-the-art methods in scene recognition tasks. Specifically, LAT-Net gains up to 14% accuracy improvement in the cross-spectral scene recognition task. Also, LAT-Net achieves 6% and 7% accuracy improvements for database invariant scene recognition and domain generalized scene recognition, respectively.
In conclusion, the local area can be considered as an alternative domain to the intensity domain to achieve robust correspondence matching in many applications, such as feature matching, stereo matching, dense correspondence matching, and image recognition. We believe the concept of LAT can be extended to various potential tasks. Future work includes the development of cross-modal image retrieval and people re-identification based on the properties of the local area transform.
