Unsupervised Cross-spectral Stereo Matching by Learning to Synthesize

Mingyang Liang; Xiaoyang Guo; Hongsheng Li; Xiaogang Wang; You Song

arXiv:1903.01078·cs.CV·March 5, 2019

TL;DR

This paper introduces an unsupervised framework for cross-spectral stereo matching that uses style adaptation via cycle consistency and adversarial learning to handle appearance variations, improving disparity estimation without ground truth supervision.

Contribution

It proposes a novel style adaptation network, F-cycleGAN, for spectral translation, enabling effective unsupervised stereo matching across different spectral images.

Findings

1. Achieves accurate disparity estimation without ground truth data.
2. Effectively minimizes appearance variations between spectral images.
3. Enhances robustness of cross-spectral stereo matching.

Abstract

Unsupervised cross-spectral stereo matching aims at recovering disparity given cross-spectral image pairs without any supervision in the form of ground truth disparity or depth. The estimated depth provides additional information complementary to individual semantic features, which can be helpful for other vision tasks such as tracking, recognition and detection. However, there are large appearance variations between images from different spectral bands, which is a challenge for cross-spectral stereo matching. Existing deep unsupervised stereo matching methods are sensitive to the appearance variations and do not perform well on cross-spectral data. We propose a novel unsupervised cross-spectral stereo matching framework based on image-to-image translation. First, a style adaptation network transforms images across different spectral bands by cycle consistency and adversarial learning,…

Tables

Table 1: Quantitative results. The RMSE of disparity is evaluated for each material. The RMSE results and execution times of CMA, ANCC, DASC, DMC(w.o. seg.), and DMC(w. seg.) are taken from (?), where DMC(w. seg.) denotes the method of (?) with material-aware confidence. The proposed methods are tested on a single NVIDIA TITAN Xp GPU, the same as in (?). The network structure changes (rows 7-10) improve performance.

Method	Common	Light	Glass	Glossy	Veg.	Skin	Clothing	Bag	Mean	Time(s)
CMA	1.60	5.17	2.55	3.86	4.42	3.39	6.42	4.63	4.00	227
ANCC	1.36	2.43	2.27	2.41	4.82	2.32	2.85	2.57	2.63	119
DASC	0.82	1.24	1.50	1.82	1.09	1.59	0.80	1.33	1.28	44.7
DMC(w.o. seg.)	0.51	1.08	1.05	1.57	0.69	1.01	1.22	0.90	1.00	0.02
DMC(w. seg.)	0.53	0.69	0.65	0.70	0.72	1.15	1.15	0.80	0.80	0.02
Only SMN	1.25	1.37	1.13	1.65	1.07	1.50	1.18	0.96	1.27	0.02
STN + SMN	1.13	1.55	1.05	1.52	0.89	1.23	1.14	0.98	1.18	0.04
STN(F) + SMN	1.24	1.02	0.92	1.32	0.79	1.10	1.03	0.92	1.04	0.04
STN(F) + SMN(aux)(ori)	0.75	0.86	0.63	1.05	0.81	1.16	0.99	0.74	0.87	0.02
STN(F) + SMN(aux)	0.68	0.80	0.67	1.05	0.68	1.04	0.98	0.80	0.84	0.04
Full Method	0.68	0.80	0.67	1.05	0.68	1.04	0.98	0.80	0.84	0.04

Equations

F : I_{A,B} \to X_{A,B},

G_{A} : X_{A,B} \to I_{A},

G_{B} : X_{A,B} \to I_{B},

L_{D}^{adv} = L_{D}^{adv,A} + L_{D}^{adv,B}. \tag{1}

L_{G}^{adv} = L_{G}^{adv,A} + L_{G}^{adv,B}, \tag{2}

L_{G}^{cyc} = \frac{1}{N} \sum_{p \in \Omega} \left( \| I_{A}^{cyc}(p) - I_{A}(p) \| + \| I_{B}^{cyc}(p) - I_{B}(p) \| \right), \tag{3}

L_{G}^{rec} = \frac{1}{N} \sum_{p \in \Omega} \left( \| I_{A}^{rec}(p) - I_{A}(p) \| + \| I_{B}^{rec}(p) - I_{B}(p) \| \right). \tag{4}

L_{G} \tag{5}

L_{D} \tag{6}

I_{A} \xrightarrow{F} X_{A} \xrightarrow{G_{B}} I_{B}^{fake} \xrightarrow{F} X_{B}^{fake} \xrightarrow{G_{A}} I_{A}^{cyc} \tag{7}

I_{B} \xrightarrow{F} X_{B} \xrightarrow{G_{A}} I_{A}^{fake} \xrightarrow{F} X_{A}^{fake} \xrightarrow{G_{B}} I_{B}^{cyc} \tag{8}

X_{A} \xrightarrow{G_{A}} I_{A}^{rec}, \quad X_{B} \xrightarrow{G_{B}} I_{B}^{rec} \tag{9}

\tilde{I^{l}} = \omega(I^{r}, d^{l}) \iff \tilde{I^{l}}_{x,y} = I^{r}_{x + d^{l}_{x,y},\, y}. \tag{10}

L_{S}^{ap,l} = \frac{1}{N} \sum_{p \in \Omega} \left( \alpha \frac{1 - \delta(I^{l}, \tilde{I^{l}})(p)}{2} + (1 - \alpha) \| I^{l}(p) - \tilde{I^{l}}(p) \| \right), \tag{11}

L_{S}^{ds,l} = \frac{1}{N} \sum_{p \in \Omega} \left( | \partial_{x} d^{l} | e^{- \| \partial_{x} I^{l} \|} + | \partial_{y} d^{l} | e^{- \| \partial_{y} I^{l} \|} \right)(p), \tag{12}

L_{S}^{lr,l} = \frac{1}{N} \sum_{p \in \Omega} | d^{l}(p) - \omega(d^{r}, d^{l})(p) |. \tag{13}

L_{SMN} = \alpha_{ap} (L_{S}^{ap,l} + L_{S}^{ap,r}) + \alpha_{ds} (L_{S}^{ds,l} + L_{S}^{ds,r}) + \alpha_{lr} (L_{S}^{lr,l} + L_{S}^{lr,r}). \tag{14}

L_{G}^{aux} = \alpha_{aux} \frac{1}{N} \sum_{p \in \Omega} \left( \| G_{B}(F(I_{ori}^{l}))(p) - \tilde{I_{ori}^{l}}(p) \| + \| G_{A}(F(I_{ori}^{r}))(p) - \tilde{I_{ori}^{r}}(p) \| \right), \tag{15}

Code & Models

Repositories

rish-av/gan_spectral_matching
pytorch

Taxonomy

Topics: Advanced Vision and Imaging · Image Enhancement Techniques · Advanced Image Processing Techniques

Full text

Unsupervised Cross-spectral Stereo Matching by Learning to Synthesize

Mingyang Liang1,2, Xiaoyang Guo3, Hongsheng Li3, Xiaogang Wang3, You Song1

1Beihang University, Beijing, China

2SenseTime Research

3The Chinese University of Hong Kong, Hong Kong, China

{liangmingyang,songyou}@buaa.edu.cn, {xyguo, hsli, xgwang}@ee.cuhk.edu.hk

These authors contributed equally to this work. Corresponding author.

Abstract

Unsupervised cross-spectral stereo matching aims at recovering disparity given cross-spectral image pairs without any depth or disparity supervision. The estimated depth provides additional information complementary to original images, which can be helpful for other vision tasks such as tracking, recognition and detection. However, there are large appearance variations between images from different spectral bands, which is a challenge for cross-spectral stereo matching. Existing deep unsupervised stereo matching methods are sensitive to the appearance variations and do not perform well on cross-spectral data. We propose a novel unsupervised cross-spectral stereo matching framework based on image-to-image translation. First, a style adaptation network transforms images across different spectral bands by cycle consistency and adversarial learning, during which appearance variations are minimized. Then, a stereo matching network is trained with image pairs from the same spectra using view reconstruction loss. At last, the estimated disparity is utilized to supervise the spectral translation network in an end-to-end way. Moreover, a novel style adaptation network F-cycleGAN is proposed to improve the robustness of spectral translation. Our method can tackle appearance variations and enhance the robustness of unsupervised cross-spectral stereo matching. Experimental results show that our method achieves good performance without using depth supervision or explicit semantic information.

Introduction

Multi-camera multi-spectral systems have become common in modern devices such as RealSense, Kinect, and iPhone X. Moreover, infrared images have been shown to be helpful in face recognition (?), detection (?), and scene parsing (?).

Stereo matching is one of the most heavily investigated topics in computer vision (?). Given a rectified image pair ( $I_{l}$ for the left image, $I_{r}$ for the right image), stereo matching focuses on finding correspondence of each pixel between two images. If the right pixel $I_{r}(x{-}d,y)$ corresponds to the left pixel $I_{l}(x,y)$ , then we can define $d$ as the disparity of the pixel $I_{l}(x,y)$ . Moreover, if we know the camera’s focal length $f$ and the distance $B$ between the two camera centers, the disparity can be converted into depth by $fB/d$ .
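The disparity-to-depth relation above can be illustrated with a minimal sketch. The function name and the numeric values are hypothetical and chosen only for illustration; they do not come from the paper.

```python
def disparity_to_depth(disparity, focal_length, baseline):
    """Convert a disparity (in pixels) to depth using depth = f * B / d.

    focal_length is in pixels and baseline in metres, so depth is in metres.
    """
    if disparity <= 0:
        raise ValueError("disparity must be positive to yield a finite depth")
    return focal_length * baseline / disparity


# Hypothetical camera: f = 800 px, B = 0.1 m, measured disparity = 20 px.
depth = disparity_to_depth(disparity=20.0, focal_length=800.0, baseline=0.1)
```

Because depth is inversely proportional to disparity, a fixed disparity error costs more depth accuracy for distant points, where disparities are small.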

Cross-spectral stereo matching is stereo matching for images from different spectra; for example, in Fig. 1 the left image is a visible image and the right image is a near-infrared image. The recovered depth provides additional information that is complementary to the semantic features of each individual spectrum. In addition, the estimated depth can help fill in missing areas of depth images captured by depth sensors (e.g. on reflective or transparent surfaces) (?).

However, cross-spectral stereo matching is still a challenging task, especially without depth supervision (?), because there are large illumination differences between images of different spectra. The translation between different spectra is complex and hard to describe accurately with a simple linear transformation. Figure 1 shows an example of cross-spectral (visible and near-infrared) stereo matching.

The key to traditional cross-spectral stereo matching is to design robust descriptors or features shared by the two modalities, such as ANCC (?) and DASC (?). However, these traditional methods are still not robust enough for transparent objects and large illumination variations. Zhi et al. (?) proposed deep material-aware cross-spectral stereo matching, which tackles the problem with deep neural networks and unsupervised learning. However, this method suffers from two severe limitations: (i) it requires additional semantic annotations to obtain auxiliary material information; (ii) its loss function is manually designed for different materials, which limits its application to other scenarios.

To tackle the above problems, in this paper we employ image-to-image translation to assist cross-spectral stereo matching; our full framework is shown in Figure 2. By regarding images from different spectra as samples from different distributions, we explore the possibility of applying image-to-image translation methods to assist unsupervised cross-spectral stereo matching. We use two networks, one to transform images across spectral bands and one to estimate disparity. The first network is a spectral translation network (STN), which transforms images by cycle consistency and adversarial learning. The second network is a stereo matching network (SMN), which is trained with the image pairs transformed to the same spectrum by the spectral translation network. Then, we use the disparity predicted by the SMN to supervise the spectral translation network in turn. A novel shared-encoder spectral translation network, F-cycleGAN, is employed to make the whole framework more robust.

Our contributions are as follows:

• We propose a novel framework for cross-spectral stereo matching, which iteratively optimizes the spectral translation network and the stereo matching network.

• The proposed F-cycleGAN, based on image-to-image translation and adversarial learning, improves the robustness of image transformation.

• Our method surpasses state-of-the-art methods on cross-spectral stereo matching without depth supervision and extra human intervention.

Related Work

Unsupervised Depth Estimation

Garg et al. (?) first proposed to use warping-based view synthesis to learn disparity in an unsupervised way. The right image is first warped to the left view using disparity. Then, the absolute difference between the warped image and the left image, also called reconstruction error or photometric loss, is minimized to supervise disparity predictions. Godard et al. (?) extended this idea by incorporating left-right consistency into the unsupervised loss. Zhou et al. (?) proposed a framework that simultaneously predicts depth and frame-to-frame relative camera pose, trained with photometric loss on consecutive video frames. Zhou et al. (?) iteratively trained a stereo network by filtering reliable predictions with a left-right consistency check. However, unsupervised methods based on photometric loss often fail to predict accurate disparity for cross-spectral images due to the appearance differences.

Cross-spectral Stereo Matching

A series of robust matching costs were designed for radiometric variations. The mutual information (MI) measure (?) was extended by incorporating prior probabilities and a 2D match surface (?). Heo et al. (?) used a color formation model explicitly and proposed Adaptive Normalized Cross-Correlation (ANCC) to tackle illumination changes and camera parameter differences. Local self-similarity (LSS) (?) used a window-based self-similarity descriptor for dense correspondence measurement in thermal-visible videos. Pinggera et al. (?) showed that dense gradient features based on HOG achieved better performance than MI and LSS descriptors. Aguilera et al. (?) proposed a feature descriptor for matching feature points with nonlinear intensity variations. Kim et al. (?) proposed the dense adaptive self-correlation descriptor (DASC) by improving the LSS descriptor with random receptive field pooling.

Another track of works tried to improve the quality of depth captured by RGBD cameras (?; ?; ?). Chiu et al. (?) proposed a cross-modal adaptation method for cross-spectral stereo matching and fused predictions with depth captured with Kinect.

For deep learning methods, Aguilera et al. (?) learned a similarity measurement of cross-spectral image patches, which is a potential way to learn matching cost for multi-spectrum images. Zhi et al. (?) utilized deep segmentation maps to improve robustness of cross-spectral stereo matching, while the method required extra semantic annotations and manually designed losses for different materials, which made it hard to apply to other scenes.

Image-to-image Translation

Image-to-image translation converts images from one modality to another, as in style transfer (?; ?), colorization (?), and sketch-to-image synthesis (?; ?). For image translation, training with only an L1 loss results in predictions that lack local semantic details. Johnson et al. (?) combined per-pixel loss with perceptual loss to train a fast feed-forward network for image transformation. Isola et al. (?) proposed a general framework for image-to-image translation using conditional adversarial networks. High-resolution synthetic images can be generated by applying a multi-scale structure and a novel adversarial loss (?). Later, Zhu et al. (?) proposed CycleGAN to translate image styles with unaligned images from different domains.

We utilize the method of (?) to convert images across different spectra, which is a basis of our proposed framework.

Method

In this section, we provide a detailed description of each part of the proposed method. Our network can be divided into two parts, the spectral translation network (STN) and stereo matching network (SMN). STN is responsible for minimizing the differences between domains, and SMN is responsible for predicting the disparity.

Spectral Translation Network

The goal of the STN is to minimize the appearance variations between different spectra and provide the supervision information to the SMN. To achieve this goal, we propose a novel style adaptation network, F-cycleGAN, as the STN.

Given any image ${I_{A}}$ of spectrum $A$ and image ${I_{B}}$ of spectrum $B$ , we regard ${I_{A}}$ and ${I_{B}}$ as sampled from two distributions $A$ and $B$ . We can define three mapping functions,

$$F : I_{A,B} \to X_{A,B}, \qquad G_{A} : X_{A,B} \to I_{A}, \qquad G_{B} : X_{A,B} \to I_{B},$$

where $F$ encodes image $I_{A,B}$ to a unified feature space $X$ . $G_{A}$ and $G_{B}$ are generators which convert features back into images in spectrum $A$ and $B$ respectively. In our implementation, we take the encoder and the decoder of the generator network in CycleGAN (?) as the structure of our $F$ network and $G_{A/B}$ networks.

The networks $F$ , $G_{A}$ and $G_{B}$ are supervised by adversarial losses (?) and a cycle-reconstruction loss. The adversarial loss is given by two discriminator networks $D_{A}$ and $D_{B}$ , which try to differentiate real and fake A or B images. We define $G_{B}(X_{A})$ as $I^{fake}_{B}$ , $G_{A}(X_{B})$ as $I^{fake}_{A}$ , $G_{A}(X_{A})$ as $I_{A}^{rec}$ , and $G_{B}(X_{B})$ as $I_{B}^{rec}$ .

The discriminator $D_{A}$ aims to distinguish between $I^{fake}_{A}$ and $I_{A}$ . To train $D_{A}$ , a classification loss $L^{adv,A}_{D}$ is used to classify $I^{fake}_{A}$ and $I_{A}$ . The loss for training discriminators is thus defined by

$$L_{D}^{adv} = L_{D}^{adv,A} + L_{D}^{adv,B}. \tag{1}$$

For generator networks, the loss can be mainly divided into two parts, adversarial loss and cycle consistency loss. The adversarial loss aims at fooling the discriminator networks and is given by

$$L_{G}^{adv} = L_{G}^{adv,A} + L_{G}^{adv,B}, \tag{2}$$

where $L^{adv,A}_{G}$ and $L^{adv,B}_{G}$ are achieved by maximizing the classification errors of discriminators $D_{A}$ and $D_{B}$ (details in (?)). The cycle consistency loss is,

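$$L_{G}^{cyc} = \frac{1}{N} \sum_{p \in \Omega} \left( \| I_{A}^{cyc}(p) - I_{A}(p) \| + \| I_{B}^{cyc}(p) - I_{B}(p) \| \right), \tag{3}$$

where $N$ is the number of pixels, $I_{A}^{cyc}$ means $G_{A}(F(I^{fake}_{B}))$ , $I_{B}^{cyc}$ means $G_{B}(F(I^{fake}_{A}))$ , and $\Omega$ is the pixel coordinate space. To guarantee that the network $F$ maps the images to the same hidden semantic feature space, and to prevent the STN from learning disparity, an auxiliary reconstruction loss is introduced to supervise the network:

$$L_{G}^{rec} = \frac{1}{N} \sum_{p \in \Omega} \left( \| I_{A}^{rec}(p) - I_{A}(p) \| + \| I_{B}^{rec}(p) - I_{B}(p) \| \right). \tag{4}$$

Then the final losses for the image transformation networks $F,G_{A},G_{B}$ and the adversarial discriminators are given by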

[TABLE]

To make the expressions clearer, all the intermediate outputs are summarized as follows,

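$$I_{A} \xrightarrow{F} X_{A} \xrightarrow{G_{B}} I_{B}^{fake} \xrightarrow{F} X_{B}^{fake} \xrightarrow{G_{A}} I_{A}^{cyc} \tag{7}$$

$$I_{B} \xrightarrow{F} X_{B} \xrightarrow{G_{A}} I_{A}^{fake} \xrightarrow{F} X_{A}^{fake} \xrightarrow{G_{B}} I_{B}^{cyc} \tag{8}$$

$$X_{A} \xrightarrow{G_{A}} I_{A}^{rec}, \quad X_{B} \xrightarrow{G_{B}} I_{B}^{rec} \tag{9}$$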
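The data flow of the F-cycleGAN described in this section can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' implementation: images are flat lists of floats, the shared encoder `F` and the decoders `G_A`, `G_B` are arbitrary callables, and the names `stn_forward` and `l1_mean` are hypothetical. Adversarial terms are omitted.

```python
def l1_mean(a, b):
    """Mean absolute difference between two equally sized pixel lists."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def stn_forward(img_a, img_b, F, G_A, G_B):
    """Compute the fake, cycle, and reconstruction outputs of the STN."""
    x_a, x_b = F(img_a), F(img_b)          # shared encoder for both spectra
    fake_b, fake_a = G_B(x_a), G_A(x_b)    # cross-spectrum translations
    out = {
        "fake_A": fake_a,
        "fake_B": fake_b,
        "cyc_A": G_A(F(fake_b)),           # A -> B -> A
        "cyc_B": G_B(F(fake_a)),           # B -> A -> B
        "rec_A": G_A(x_a),                 # A -> features -> A
        "rec_B": G_B(x_b),                 # B -> features -> B
    }
    out["loss_cyc"] = l1_mean(out["cyc_A"], img_a) + l1_mean(out["cyc_B"], img_b)
    out["loss_rec"] = l1_mean(out["rec_A"], img_a) + l1_mean(out["rec_B"], img_b)
    return out


# Toy stand-ins for the networks (assumptions for illustration only).
F = lambda img: list(img)
G_A = lambda feat: list(feat)
G_B = lambda feat: [0.5 * v for v in feat]

result = stn_forward([1.0, 2.0], [0.5, 1.0], F, G_A, G_B)
```

Sharing one encoder between both spectra is what distinguishes this layout from two independent CycleGAN generators; the reconstruction term ties both decoders to the same feature space.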

Cross-spectral Stereo Matching Network

DispNet (?), which takes concatenated images as input to directly regress disparities, is adopted as the SMN to predict disparity maps $d^{l}$ , $d^{r}$ for the left and right images. Given a rectified cross-spectral image pair $I_{ori}^{l}$ , $I_{ori}^{r}$ , we assume without loss of generality that $I_{ori}^{l}$ belongs to spectrum $A$ and $I_{ori}^{r}$ to spectrum $B$ . The STN is applied to transform the cross-spectral images to the same modality. After that, we concatenate $\{I^{l}_{ori},G_{B}(F(I^{l}_{ori}))\}$ as $I^{l}$ and $\{G_{A}(F(I^{r}_{ori})),I^{r}_{ori}\}$ as $I^{r}$ to get an image pair in the same modality, which is used both as the input to the stereo matching network and for the cross-spectral unsupervised loss discussed in the following section.
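A minimal sketch of how the two SMN inputs could be assembled is given below. It assumes channel-wise concatenation, which the text does not state explicitly, and represents an image as a list of per-pixel channel tuples; `concat_channels`, `build_smn_inputs`, and the toy networks are hypothetical names. The gradient blocking toward the STN is not shown.

```python
def concat_channels(img1, img2):
    """Concatenate two images per pixel along the channel axis."""
    return [p + q for p, q in zip(img1, img2)]


def build_smn_inputs(left_ori, right_ori, F, G_A, G_B):
    """Build I^l and I^r so that both contain a spectrum-A and a spectrum-B part."""
    fake_b_left = G_B(F(left_ori))     # left view rendered in spectrum B
    fake_a_right = G_A(F(right_ori))   # right view rendered in spectrum A
    left = concat_channels(left_ori, fake_b_left)      # {A, B} order
    right = concat_channels(fake_a_right, right_ori)   # {A, B} order
    return left, right


# Toy stand-ins for the STN (assumptions for illustration only).
F = lambda img: list(img)
G_A = lambda feat: list(feat)
G_B = lambda feat: [(sum(p) / 3.0,) * 3 for p in feat]

left_in, right_in = build_smn_inputs([(1.0, 2.0, 3.0)], [(9.0, 9.0, 9.0)], F, G_A, G_B)
```

Keeping the same {A, B} channel order on both sides means corresponding channels of the left and right inputs always come from the same spectrum.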

It should be emphasized that we block the gradients from the network inputs back into the STN for training stability. Note that the images used as input and those used for supervision are not required to take the same form, as discussed in the Benchmark Results section. We apply the training loss from (?), which includes an appearance matching loss $L_{S}^{ap}$ , a disparity smoothness loss ${L_{S}^{ds}}$ , and a left-right disparity consistency loss ${L_{S}^{lr}}$ . We only show the left terms, since the right terms can be derived similarly.

Based on the left disparity $d^{l}$ , we can get reconstructed left image $\tilde{I^{l}}$ from $I^{r}$ with the warping operator $\omega$ , which can be described as

$$\tilde{I^{l}} = \omega(I^{r}, d^{l}) \iff \tilde{I^{l}}_{x,y} = I^{r}_{x + d^{l}_{x,y},\, y}. \tag{10}$$

Since the disparity value may be non-integer, $I^{r}_{x+d^{l}_{x,y},y}$ is bilinearly sampled at the pixel $(x+d^{l}_{x,y},y)$ . For simplicity, we use a mask to stop calculating the gradients for pixels that cannot be warped (e.g. pixels that fall out of bounds).
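The warping operator and the validity mask can be illustrated on a single image row. Because the pair is rectified, the sampling position only shifts along $x$, so bilinear sampling reduces to linear interpolation between two horizontal neighbours. The function name `warp_row` is hypothetical and the sketch handles one scalar channel only.

```python
import math


def warp_row(right_row, disp_row):
    """Sample right_row at x + d(x) with linear interpolation.

    Returns (warped_row, valid_mask); out-of-bounds samples are set to 0.0
    and flagged False so that they can be excluded from the loss.
    """
    width = len(right_row)
    warped, mask = [], []
    for x, d in enumerate(disp_row):
        src = x + d
        if src < 0 or src > width - 1:
            warped.append(0.0)
            mask.append(False)
            continue
        x0 = int(math.floor(src))
        x1 = min(x0 + 1, width - 1)
        w = src - x0                      # fractional part of the position
        warped.append((1.0 - w) * right_row[x0] + w * right_row[x1])
        mask.append(True)
    return warped, mask


right = [0.0, 10.0, 20.0, 30.0]
disp = [0.5, 1.0, 0.0, 2.0]
warped, mask = warp_row(right, disp)
# warped == [5.0, 20.0, 20.0, 0.0], mask == [True, True, True, False]
```

The last pixel maps to position 5, outside the four-pixel row, so it is masked out rather than contributing a spurious reconstruction error.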

The appearance matching loss $L_{S}^{ap}$ encourages the reconstructed image to appear similar to the original image by comparing structure and intensity. We let $\delta(I_{1},I_{2})$ be the structural similarity function (?) and $\tilde{I^{l}}$ be the reconstruction of $I^{l}$ from $I^{r}$ . The appearance matching loss can be described as

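$$L_{S}^{ap,l} = \frac{1}{N} \sum_{p \in \Omega} \left( \alpha \frac{1 - \delta(I^{l}, \tilde{I^{l}})(p)}{2} + (1 - \alpha) \| I^{l}(p) - \tilde{I^{l}}(p) \| \right), \tag{11}$$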

where $\alpha$ denotes the weight coefficient for the structural dissimilarity function and L1 reconstruction loss. The loss $L_{S}^{ds}$ enforces the disparity smoothness,

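$$L_{S}^{ds,l} = \frac{1}{N} \sum_{p \in \Omega} \left( | \partial_{x} d^{l} | e^{- \| \partial_{x} I^{l} \|} + | \partial_{y} d^{l} | e^{- \| \partial_{y} I^{l} \|} \right)(p), \tag{12}$$

where $\partial d^{l}$ and $\partial I^{l}$ denote the gradients of $d^{l}$ and $I^{l}$ . The loss $L_{S}^{lr,l}$ regularizes the consistency of the left disparity and the right disparity,

$$L_{S}^{lr,l} = \frac{1}{N} \sum_{p \in \Omega} | d^{l}(p) - \omega(d^{r}, d^{l})(p) |. \tag{13}$$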

Then the final loss for the SMN network is given by

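$$L_{SMN} = \alpha_{ap} (L_{S}^{ap,l} + L_{S}^{ap,r}) + \alpha_{ds} (L_{S}^{ds,l} + L_{S}^{ds,r}) + \alpha_{lr} (L_{S}^{lr,l} + L_{S}^{lr,r}). \tag{14}$$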
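The three SMN loss terms and their weighted sum can be sketched on one-dimensional rows. Several simplifications are assumptions of this sketch rather than details from the paper: the structural similarity map is passed in precomputed, gradients are forward differences along $x$ only, and the right-view disparity is assumed to be already warped to the left view. The default weights follow the values reported in the Parameters section.

```python
import math


def appearance_loss(img, recon, ssim, alpha=0.9):
    """Mean of alpha * (1 - SSIM) / 2 + (1 - alpha) * |I - I~| over pixels."""
    terms = [alpha * (1.0 - s) / 2.0 + (1.0 - alpha) * abs(i - r)
             for i, r, s in zip(img, recon, ssim)]
    return sum(terms) / len(img)


def smoothness_loss(disp, img):
    """Edge-aware smoothness: |dx d| * exp(-|dx I|), averaged over pixels."""
    terms = [abs(disp[x + 1] - disp[x]) * math.exp(-abs(img[x + 1] - img[x]))
             for x in range(len(disp) - 1)]
    return sum(terms) / len(disp)


def lr_consistency_loss(disp_left, disp_right_warped):
    """Mean absolute difference between left and warped right disparity."""
    pairs = zip(disp_left, disp_right_warped)
    return sum(abs(a - b) for a, b in pairs) / len(disp_left)


def smn_loss(ap, ds, lr, w_ap=1.0, w_ds=0.2, w_lr=0.1):
    """Weighted sum; ap, ds, lr are (left, right) pairs of loss values."""
    return w_ap * sum(ap) + w_ds * sum(ds) + w_lr * sum(lr)


disp = [1.0, 1.0, 3.0, 3.0]
flat = smoothness_loss(disp, [0.0, 0.0, 0.0, 0.0])   # no image edge
edge = smoothness_loss(disp, [0.0, 0.0, 5.0, 5.0])   # edge at the jump
ap = appearance_loss([1.0, 1.0], [1.0, 0.5], [1.0, 0.8])
total = smn_loss((ap, ap), (flat, flat), (0.0, 0.0))
```

The same disparity jump is penalized far less when it coincides with an image edge (`edge` is much smaller than `flat`), which is the intended effect of the exponential weighting.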

To further improve the performance, we introduce an auxiliary loss for the STN. First we can get the warped original images $\tilde{I^{l}_{ori}}=\omega(I^{r}_{ori},d^{l})$ and $\tilde{I^{r}_{ori}}=\omega(I^{l}_{ori},d^{r})$ with disparity prediction, then the auxiliary loss $L_{G}^{aux}$ is defined by

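$$L_{G}^{aux} = \alpha_{aux} \frac{1}{N} \sum_{p \in \Omega} \left( \| G_{B}(F(I_{ori}^{l}))(p) - \tilde{I_{ori}^{l}}(p) \| + \| G_{A}(F(I_{ori}^{r}))(p) - \tilde{I_{ori}^{r}}(p) \| \right), \tag{15}$$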

which attempts to tackle appearance variations and enhance the robustness of STN. There is a possibility that the reconstruction may encode both the disparity and spectral differences. We hold that by the cycle loss and reconstruction loss in Equ. 3 and Equ. 4, we can prevent the STN from learning disparity.

Iterative Optimization

We introduce our iterative optimization approach in this section. All the required losses are presented in Equ. 5, Equ. 6, Equ. 14, and Equ. 15. For simplicity, we omit the spectrum subscripts of $G_{A},G_{B},D_{A},D_{B}$ because the optimization for the two modalities is identical.

Figure 4 shows the gradient flow across different network blocks. A randomly sampled cross-spectral image pair is provided to the entire system in each iteration. In step (1), we train the $D$ network with loss $L_{D}$ from Equ. 6, which encourages the discriminator to distinguish between real and fake images. In step (2), we train the $F$ and $G$ networks with loss $L_{G}$ from Equ. 5. In step (3), the stereo network $S$ is trained with loss $L_{SMN}$ from Equ. 14, taking the translation results from the $G$ network as supervision. Finally, in step (4), we use loss $L_{G}^{aux}$ from Equ. 15 to train the $F$ and $G$ networks again for global optimization. The whole framework is first trained for several warm-up epochs using only steps (1) and (2), during which the stereo matching network is not trained. After the warm-up stage, all four steps are used for further training.
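The four-step schedule with a warm-up phase can be sketched as a plain loop. The step names and the `handlers` mapping are hypothetical placeholders for the actual optimizer updates; the epoch counts in the example (15 warm-up epochs followed by 10 full epochs) are the ones reported in the Training and Testing section, while the two iterations per epoch are arbitrary.

```python
STEPS = ("update_D", "update_FG", "update_S", "update_FG_aux")


def steps_for_epoch(epoch, warmup_epochs):
    """During warm-up only steps (1) and (2) run; afterwards all four."""
    return STEPS[:2] if epoch < warmup_epochs else STEPS


def run_schedule(total_epochs, warmup_epochs, iters_per_epoch, handlers):
    """Call the handlers in order for every iteration and count the calls."""
    counts = {name: 0 for name in STEPS}
    for epoch in range(total_epochs):
        for _ in range(iters_per_epoch):
            for name in steps_for_epoch(epoch, warmup_epochs):
                handlers[name]()          # one optimizer update for this step
                counts[name] += 1
    return counts


# No-op handlers stand in for the real network updates.
handlers = {name: (lambda: None) for name in STEPS}
counts = run_schedule(total_epochs=25, warmup_epochs=15,
                      iters_per_epoch=2, handlers=handlers)
```

With these settings the discriminator and generator steps run in all 25 epochs, whereas the stereo and auxiliary steps run only in the last 10.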

Experiments

In this section, an evaluation of our method is performed on the benchmark dataset, and detailed analysis is given.

The network is trained on rectified cross-spectral stereo image pairs without any supervision in the form of ground truth disparity or depth. We evaluate on the PittsStereo-RGBNIR dataset proposed by (?), which covers many material categories including lights, glass, glossy surfaces, vegetation, skin, clothing and bags. This dataset was captured by a visible (VIS) and near-infrared (NIR) camera pair. We define the left VIS as spectrum $A$ and the right NIR as spectrum $B$ . The left VIS image consists of three spectral bands while the right NIR image consists of only one band. For simplicity of implementation, we convert NIR images into three channels.

Implementation Details

Architecture

The $G$ network and $D$ network follow (?), which has shown impressive results for image-to-image translation. The $F$ network contains 4 residual blocks (?) and two stride-2 convolutions for down-sampling, which is similar to the $G$ network.

We used the DispNet (?) as our stereo matching network SMN, and for the training stability, multi-scale predictions of SMN are applied following (?). The weights of the STN were initialized from a Gaussian distribution with zero mean and 0.02 standard deviation, and the weights of the SMN were initialized with Kaiming initialization (?).

Parameters

The SMN predicts the disparity directly instead of the ratio between disparity and image width. The disparity predictions are clamped to the range from zero to the image width. The predictions are multiplied by a scaling factor $\eta=0.008$ for stable optimization.

The weights of the losses in STN are set to $\lambda_{c}=10$ , $\lambda_{r}=5$ , $\lambda_{a}=1$ , $\lambda_{d}=1$ , and the weights of the losses in SMN are $\alpha_{ap}=1$ , $\alpha_{ds}=0.2$ , $\alpha_{lr}=0.1$ , $\alpha_{aux}=20$ . We use a $5\times 5$ window for calculating the structural similarity $\delta$ , and the $\alpha$ in Equ. 11 is set to $0.9$ .

Training and Testing

The entire network contains about 54 million trainable parameters, of which 33 million parameters are in SMN. The dataset is split into two sets for training (40000 pairs) and testing (2000 pairs), which is the same as (?). The STN and SMN are trained on 40000 cross-spectral image pairs with Adam optimizer (?) (batch size = 16 and learning rate = 0.0002).

For data augmentation, we flip the input images of the STN horizontally with a 50% chance. Input images are resized to $512{\times}384$ for the entire network. We apply instance normalization to the images provided to the SMN as input. The training process takes about 34 hours using 8 NVIDIA TITAN Xp GPUs. The network is first trained for 15 warm-up epochs (with only steps 1 and 2; the SMN is not trained during this stage), and then trained with all 4 steps for 10 epochs.

For testing, the predicted disparity maps are bilinearly upsampled to the original size, with the disparity values multiplied by the horizontal scaling factor. The root mean square error is computed over 5030 sparse points on the 2000 testing images.
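The test-time rescaling and the sparse RMSE can be sketched as follows. The bilinear upsampling itself is omitted, the function names are hypothetical, and the original image width of 1024 in the example is an invented value used only to show the scaling; only the network width of 512 comes from the text.

```python
import math


def rescale_disparity(disp_value, pred_width, orig_width):
    """Scale a disparity predicted at pred_width to the original image width."""
    return disp_value * orig_width / pred_width


def sparse_rmse(pred, gt_points):
    """RMSE over sparse ground-truth points.

    pred is a 2D list indexed as pred[y][x]; gt_points is a list of
    (x, y, disparity) tuples at the original resolution.
    """
    squared = [(pred[y][x] - d) ** 2 for x, y, d in gt_points]
    return math.sqrt(sum(squared) / len(squared))


scaled = rescale_disparity(10.0, pred_width=512, orig_width=1024)
pred_map = [[2.0, 4.0], [6.0, 8.0]]
rmse = sparse_rmse(pred_map, [(0, 0, 2.0), (1, 1, 10.0)])
```

Disparity is measured in pixels along $x$, so resizing an image horizontally must scale the disparity values by the same factor; vertical resizing does not affect them.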

Benchmark Results

For comparison, we use the root mean square error (RMSE) as the evaluation metric: we calculate the RMSE of each material category and report the average over categories as the final result (Mean), following (?). We have tested five network structure choices: only SMN, STN+SMN, STN(F)+SMN, STN(F)+SMN(aux)(ori), and STN(F)+SMN(aux).
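As a small check of how the Mean column is formed, the sketch below averages the eight per-material RMSE values of the "Full Method" row of Table 1 with equal weight per category. The helper name `mean_over_materials` is hypothetical.

```python
MATERIALS = ["Common", "Light", "Glass", "Glossy", "Veg.", "Skin", "Clothing", "Bag"]


def mean_over_materials(rmse_by_material):
    """Unweighted average of the per-material RMSE values."""
    return sum(rmse_by_material[m] for m in MATERIALS) / len(MATERIALS)


# Per-material RMSE of the "Full Method" row in Table 1.
full_method = dict(zip(MATERIALS, [0.68, 0.80, 0.67, 1.05, 0.68, 1.04, 0.98, 0.80]))
mean_rmse = mean_over_materials(full_method)  # about 0.8375, reported as 0.84
```

Since every category has equal weight, a material with few evaluated points influences the mean as much as a frequent one.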

For only SMN, the STN is not employed and the cross-spectral image pairs are used directly in the unsupervised loss. STN+SMN employs the original CycleGAN as the spectral translation network. For the STN(F) series, we use our proposed F-cycleGAN as the STN. (aux) indicates that the auxiliary loss is used during training. STN(F) + SMN(aux)(ori) means that the original image pairs, instead of concatenated image pairs, are used as inputs. All the methods with STN except STN(F) + SMN(aux)(ori) take the concatenation of the original images and the translated fake images from the STN as the inputs of the SMN. We found that using only the NIR image and the fake NIR image in the unsupervised loss of the SMN achieved better results; thus, in all of our experiments, we employ only NIR images in the unsupervised loss of the SMN.

We compare the performance of the proposed method with other cross-spectral stereo matching methods: CMA (?), ANCC (?), DASC (?), and DMC (?). Table 1 presents the comparison in terms of disparity RMSE and execution time. DMC (?) incorporates material-aware confidence into the disparity prediction network, which requires semantic segmentation labels and a manually defined loss for each kind of material. For fair comparison, we also list its results without the material-aware confidence.

On average, our approach outperforms the other methods that need no extra human intervention: the full method reaches a mean RMSE of 0.84, compared with 1.00 for DMC without material-aware confidence and 1.28 for DASC. On lights, glass, glossy surfaces, and bags, our approach performs better than all methods that do not use material-aware confidence. Table 1 also presents the changes across our three comparative experiments, STN+SMN (mean RMSE 1.18), STN(F)+SMN (1.04), and STN(F)+SMN(aux) (0.84). The results show that the F-cycleGAN and the joint training framework are able to improve the performance of unsupervised stereo matching. We also find that it is still hard for the STN to translate the appearance of clothing between VIS and NIR, possibly because the material of clothing is more variable than that of other categories, which leads to an unstable correspondence.

Visualization Results

Figure 5 presents visualized results of the proposed method, which suggest that the proposed approach is able to handle the illumination variations between different spectra. Compared with other unsupervised methods in Figure 3, our method provides cleaner and more reasonable disparity predictions.

Conclusion

We have presented an unsupervised cross-spectral stereo matching method which can be trained in an end-to-end way without extra data or excessive human intervention. We propose F-cycleGAN, based on the work of (?), as the STN, which is able to minimize the appearance variations between different spectra without losing geometric information and to improve the robustness of the stereo matching network SMN. Our experimental results show that our method outperforms other state-of-the-art methods. Our approach can be directly applied to other spectra, such as short-wave infrared or medium-wave infrared images.

In the future, we expect to further enhance the capabilities of the STN network for subtle visual differences. The structural similarity loss in the unsupervised loss of SMN, which is illumination sensitive, could also be improved to better supervise the stereo matching network.

Bibliography

1. [Aguilera et al. 2016] Aguilera, C. A.; Aguilera, F. J.; Sappa, A. D.; Aguilera, C.; and Toledo, R. 2016. Learning cross-spectral similarity measures with deep convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 1–9.
2. [Aguilera, Sappa, and Toledo 2015] Aguilera, C. A.; Sappa, A. D.; and Toledo, R. 2015. LGHD: A feature descriptor for matching across non-linear intensity variations. In Image Processing (ICIP), 2015 IEEE International Conference on, 178–181. IEEE.
3. [Cheng, Yang, and Sheng 2015] Cheng, Z.; Yang, Q.; and Sheng, B. 2015. Deep colorization. In Proceedings of the IEEE International Conference on Computer Vision, 415–423.
4. [Chiu, Blanke, and Fritz 2011] Chiu, W.-C.; Blanke, U.; and Fritz, M. 2011. Improving the Kinect by cross-modal stereo. In BMVC, volume 1, 3. Citeseer.
5. [de La Garanderie and Breckon 2014] de La Garanderie, G. P., and Breckon, T. P. 2014. Improved depth recovery in consumer depth cameras via disparity space fusion within cross-spectral stereo. In BMVC.
6. [Egnal 2000] Egnal, G. 2000. Mutual information as a stereo correspondence measure.
7. [Fookes et al. 2004] Fookes, C.; Maeder, A.; Sridharan, S.; and Cook, J. 2004. Multi-spectral stereo image matching using mutual information. In 3D Data Processing, Visualization and Transmission, 2004. 3DPVT 2004. Proceedings. 2nd International Symposium on, 961–968. IEEE.
8. [Garg et al. 2016] Garg, R.; BG, V. K.; Carneiro, G.; and Reid, I. 2016. Unsupervised CNN for single view depth estimation: Geometry to the rescue. In European Conference on Computer Vision, 740–756. Springer.