Image Matching via Loopy RNN

Donghao Luo; Bingbing Ni; Yichao Yan; Xiaokang Yang

arXiv:1706.03190·cs.LG·June 20, 2017

TL;DR

This paper introduces Loopy RNN, a recurrent neural network inspired by human vision that iteratively refines an image matching score by aggregating relationship information from both images, in contrast to one-off algorithms that compare the two feature vectors only once.

Contribution

The paper proposes the Loopy RNN architecture, which has a symmetry property and is trained with a monotonous loss, enabling iterative and progressive image matching instead of a single one-off comparison.

Findings

1. Demonstrates performance gains over Siamese-like networks on multiple image matching benchmarks.

2. Shows the effectiveness of a recursive, iterative matching process.

3. Validates the role of the symmetry property and the monotonous loss in improving matching confidence.

Tables (2)

Table 1: Details of FeatureNet architecture. C: convolutional layer. MP: max pooling layer. KS: kernel size. S: stride. OD: output dimension of the feature map, given as (width × height × depth).

Name	Type	KS	S	OD
Conv0	C	7x7	1	$64 \times 64 \times 24$
Pool0	MP	3x3	2	$32 \times 32 \times 24$
Conv1	C	5x5	1	$32 \times 32 \times 64$
Pool1	MP	3x3	2	$16 \times 16 \times 64$
Conv2	C	3x3	1	$16 \times 16 \times 96$
Conv3	C	3x3	1	$16 \times 16 \times 96$
Conv4	C	3x3	1	$16 \times 16 \times 64$
Pool4	MP	3x3	2	$8 \times 8 \times 64$

Table 2: Matching results on UBC (FPR95 in %, lower is better). Our Loopy RNN model uses $N=10$ and $D=1024$, and $\lambda$ is set to 0.4 in the model with monotonous loss.

Train	Liberty		Notredame		Yosemite
Test	Notredame	Yosemite	Liberty	Yosemite	Liberty	Notredame	mean
nSIFT concat.+NNet	14.35	21.41	20.44	20.65	22.23	14.84	18.99
MatchNet	3.87	10.88	6.9	8.39	10.77	5.67	7.75
Siamese	4.33	14.89	8.77	13.23	13.48	5.75	10.07
Pseudo-Siamese	3.93	12.5	12.87	12.64	10.35	5.44	9.62
Siamese-2stream	3.05	9.02	6.45	10.44	11.51	5.29	7.63
Siamese-2stream-$l_{2}$	4.54	13.24	8.79	13.02	12.84	5.58	9.67
2ch-2stream	1.9	5	4.85	4.1	7.2	2.11	4.19
Loopy RNN (without monotonous loss)	3.02	8.92	6.64	8.13	9.56	3.96	6.70
Loopy RNN (with monotonous loss)	2.79	8.29	6.22	7.71	9.19	3.72	6.32

Equations

$$i_{A(B)} = \dots$$

$$f_{A(B)} = \dots$$

$$o_{A(B)} = \dots$$

$$c_{A(B)} = \dots\; U_{c} h_{B(A)} + b_{c}),$$

$$h_{A(B)} = \dots$$

$$F(x_{A}, x_{B}) = (h_{n_{1},A} + h_{n_{1}',B})/2,$$

$$F(x_{B}, x_{A}) = (h_{n_{1},B} + h_{n_{1}',A})/2,$$

$$h_{n_{1},A} = h_{n_{1}',A},$$

$$h_{n_{1},B} = h_{n_{1}',B},$$

$$F(x_{A}, x_{B}) = F(x_{B}, x_{A}).$$

$$s_{n} = \frac{1}{1 + e^{-\theta^{T} h_{n}}}, \quad n \neq 0,$$

$$L_{n}^{m} = \max\left(0, (-1)^{y}(s_{n} - s_{n}^{pre})\right), \quad n \neq 0,$$

$$s_{n}^{pre} = \begin{cases} \max(s_{1}, s_{2}, \dots, s_{n-1}), & y = 1, \\ \min(s_{1}, s_{2}, \dots, s_{n-1}), & y = 0. \end{cases}$$

$$L_{n}^{c} = -\left(y \log(s_{n}) + (1 - y) \log(1 - s_{n})\right), \quad n \neq 0,$$

$$L_{n} = L_{n}^{c} + \lambda L_{n}^{m}, \quad n \neq 0,$$

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Video Surveillance and Tracking Methods · Advanced Neural Network Applications

Full text

Shanghai Jiao Tong University

{luo-donghao, nibingbing, yanyichao, xkyang}@sjtu.edu.cn

Abstract

Most existing matching algorithms are one-off algorithms, i.e., they measure the distance between the two image feature representation vectors only once. In contrast, the human visual system performs image matching by recursively looking at specific/related parts of both images and then making the final judgement. Towards this end, we propose a novel loopy recurrent neural network (Loopy RNN), which is capable of aggregating relationship information of two input images in a progressive/iterative manner and outputting the consolidated matching score in the final iteration. A Loopy RNN has two distinctive features. First, built on conventional long short-term memory (LSTM) nodes, it links the output gate of the tail node to the input gate of the head node, which yields the symmetry property required for matching. Second, a monotonous loss designed for the proposed network guarantees increasing confidence during the recursive matching process. Extensive experiments on several image matching benchmarks demonstrate the great potential of the proposed method.

1 Introduction

Image matching is an important research topic in computer vision because of its wide range of real-world applications, including object/place retrieval Arandjelovic et al. (2016), person re-identification Yan et al. (2016b), and 3D reconstruction Cheng et al. (2014). Mathematically, a matching algorithm takes two images as inputs and outputs a score measuring their similarity, i.e., a higher score indicates higher similarity between the two inputs. Previous research has mainly focused on two aspects. On one hand, various image patch descriptors such as SIFT Lowe (2004), SURF Bay et al. (2006), and ORB Rublee et al. (2011) have been proposed to represent the two patches well, so that the computed distance (e.g., Euclidean distance) accurately reflects the true relationship between them. On the other hand, metric learning based methods Jia and Darrell (2011); Jain et al. (2012) have been developed to obtain a more discriminative distance measure than the conventional Euclidean distance.

Recently, deep learning has made significant progress in image matching on both aspects. For feature representation, SIFT-based patch descriptors have been replaced with convolutional neural network (CNN) based ones Fischer et al. (2014); Paulin et al. (2015), with a significant performance gain. For distance metric learning, end-to-end learning has been used to enhance image matching. One remarkable example is the Siamese network Bromley et al. (1993), in which two image patches are first fed to a two-stream convolutional sub-network (with identical parameters) to extract features, and the features are then combined by a second sub-network (based on fully connected layers) to infer the similarity of the two patches. Siamese networks have been widely used in computer vision, including person re-identification Yi et al. (2014) and tracking Bertinetto et al. (2016). Building on the end-to-end learning capability of the Siamese architecture, MatchNet Han et al. (2015) and the networks of Zagoruyko and Komodakis (2015) have recently boosted patch-based image matching performance.

Despite their remarkable improvements, previous matching methods can all be regarded as one-off solutions, i.e., they perform image patch feature extraction and distance calculation only once and output the final matching score. The human visual system, however, performs matching in a recursive/iterative manner. To judge whether two images show the same species, a person constantly switches attention between the two images and moves it to different patches/parts of each image. In other words, one takes turns observing different regions of both images to progressively aggregate information on their matched/un-matched portions and becomes more and more confident. This process repeats until one is confident enough to make the final judgement. As shown in Figure 1, it is difficult to tell whether the two birds belong to the same subspecies by observing the two images once. In this situation, a person alternately observes the two images, each time focusing on some parts of the birds, such as the head and legs, and finally makes a confident decision based on both overall and local information. From a computational point of view, this recursive/iterative matching mechanism also has advantages over the conventional one-off approach, as it can progressively attend to more and more discriminative regions of the images and suppress the influence of cluttered background or irrelevant and noisy image features.

It is thus desirable to develop a computational model or network structure that simulates this recursive mechanism to enhance image matching. Towards this end, it is natural to consider recurrent neural networks (e.g., RNN Dorffner (1996), LSTM Hochreiter and Schmidhuber (1997)). Intuitively, the regions/patches a person attends to on both images can be considered as a sequence of observations. This observation sequence can naturally serve as the input sequence to an RNN/LSTM structure, and the aggregated similarity measure can be output from the last (temporal) node of the recurrent network. However, such a sequential model cannot be directly applied to image matching, as it violates the symmetry property required of a valid matching algorithm: the output similarity measurement should be unchanged if we switch the order of the two input image patches. To satisfy this symmetry requirement, we propose a loopy recurrent neural network (Loopy RNN). A Loopy RNN inherits the basic components and structure of a conventional RNN. The major differences between a Loopy RNN structure and a conventional one are that: 1) instead of having an arbitrary number of temporal nodes, it has only two, which correspond to the two input image patches, and 2) these two nodes are cyclically linked, which yields the symmetry property. When applied to image patch matching, a Loopy RNN can simulate the iterative process of examining image features from both images alternately and progressively gathering more and more matched information to consolidate the final matching score. To facilitate model training and testing, the Loopy RNN structure is approximated by a normal RNN/LSTM structure obtained by duplicating the head and tail nodes a number of times. To simulate human perception as well as to guarantee robust matching, the confidence of the similarity measurement is also required to increase as we go deeper in the recursive matching network. For this purpose, we utilize a monotonous objective function Ma et al. (2016), which imposes a larger penalty on the output associated with deeper nodes in the network. The proposed Loopy RNN has been evaluated on several image matching benchmarks, including the UBC patch dataset Winder et al. (2009) and the Mikolajczyk dataset, and the results demonstrate a performance gain over Siamese-like networks.

2 Related Work

Image matching involves two key components: extracting proper features from the original image, and measuring the distance between the features to describe the similarity of the images. At first, hand-crafted features such as SIFT Lowe (2004) and DAISY Tola et al. (2008) were combined with fixed metric methods such as graph models Yan et al. (2016a, 2015a, 2015b) to match images. With these methods, the two parts of matching (feature extraction and similarity measurement) are independent, and this isolation hinders further performance improvement. To break the isolation, researchers proposed learning either the descriptors or the similarity metric while keeping the other part fixed. For example, Brown et al. (2011) learns the descriptors by minimizing the classification error, and Jain et al. (2012) learns the metric by treating it as a linear transformation. Learning descriptors and metric jointly was then proposed to make the two parts cooperate more effectively. In Trzcinski et al. (2012), boosting is adopted to learn descriptors and metrics jointly and achieves strong performance. The performance of these methods is limited by the hand-crafted features, while the proposed method employs deep features and achieves better performance.

The advent of CNNs has greatly advanced many branches of computer vision, including image matching. In Krizhevsky et al. (2012), the convolutional descriptor from AlexNet (trained on ImageNet) was shown to be more effective than SIFT in most cases. In Ren et al. (2017), a CNN was also used to compute dense correspondence. Combining a CNN with the Siamese structure Bromley et al. (1993), it is natural to train the network in an end-to-end manner, i.e., to learn descriptor and metric jointly. MatchNet Han et al. (2015) employs a Siamese network in which convolutional layers serve as a feature extractor and fully connected layers serve as a comparator that measures similarity. Zagoruyko and Komodakis (2015) explores different architectures for patch-based image matching, including Siamese (CNN parameters shared), Pseudo-Siamese (CNN parameters not shared), and 2-channel (the two patches treated as two channels of one image). By virtue of the CNN's advantages, these methods improve considerably over previous traditional methods. All the methods mentioned above can be regarded as one-off, i.e., the descriptors are compared just once. The Loopy RNN proposed in this paper learns the descriptors and metric jointly like a Siamese network, but draws its conclusion by repeatedly comparing the descriptors. Note that Shyam et al. published a paper based on a similar idea, which they call Attentive Recurrent Comparators Shyam et al. (2017). We recommend that readers also refer to this contemporary work.

3 Methodology

The goal of this work is to develop a recursive/iterative matching framework that imitates the matching mechanism of human perception. To this end, we propose a loopy recurrent neural network (Loopy RNN), which not only aggregates individual matching attempts and progressively yields a more and more confident matching result but also preserves the symmetry property.

3.1 Loopy Recurrent Neural Network

Architecture. The basic structure of a loopy recurrent neural network (Loopy RNN) is illustrated in Figure 2. Our Loopy RNN consists of two sub-networks that share parameters, and each sub-network is composed of two recurrent nodes. In this work, we adopt a standard long short-term memory (LSTM) Hochreiter and Schmidhuber (1997) node with input/output/forget/hidden cells as the atomic node of a Loopy RNN. We therefore denote the two sub-networks as $LSTM$ and $LSTM'$ respectively. The two nodes of $LSTM$ are $\mathbf{n_{1}}$ and $\mathbf{n_{2}}$; the two nodes of $LSTM'$ are $\mathbf{n_{1}'}$ and $\mathbf{n_{2}'}$. As illustrated in Figure 2, the difference between a Loopy RNN sub-network and a normal RNN architecture lies in the connection between nodes: a normal RNN is a linear structure, whereas a Loopy RNN sub-network is a circular structure in which the output of the tail node is linked to the head node.

We denote by $\mathbf{x_{A},x_{B}}$ a pair of inputs, i.e., features of image patches. $\mathbf{h_{A(B)}}$ and $\mathbf{o_{A(B)}}$ are the hidden state and output node corresponding to input image A (or B). Each LSTM node includes three gates, namely an input gate $\mathbf{i}$, an output gate $\mathbf{o}$, and a forget gate $\mathbf{f}$, as well as a memory cell $\mathbf{c}$. The LSTM nodes in our loopy network are updated as follows:

[TABLE]

where $\sigma$ is the sigmoid function and $\odot$ denotes the element-wise multiplication operator. $\mathbf{W_{*}}$, $\mathbf{U_{*}}$ and $\mathbf{V_{*}}$ are the weight matrices, and $\mathbf{b_{*}}$ are the bias vectors. The memory cell $\mathbf{c_{A(B)}}$ is a weighted sum of the previous memory cell $\mathbf{c_{B(A)}}$ and a function of the current input.

Proof of Symmetry: Because of the dual structure, it is straightforward to prove the symmetry of the proposed Loopy network. Let $\mathbf{F(x_{A},x_{B})}$ and $\mathbf{F(x_{B},x_{A})}$ denote the final output hidden states for the input orders $(\mathbf{x_{A},x_{B}})$ and $(\mathbf{x_{B},x_{A}})$ respectively, each being the average of the hidden states of the first nodes of the two sub-networks. The symmetry property of matching requires that $\mathbf{F(x_{A},x_{B})=F(x_{B},x_{A})}$. As shown in Figure 2, the hidden state of node $\mathbf{n_{1}}$ with input $\mathbf{x_{A}}$ is denoted as $\mathbf{h_{n_{1},A}}$, and $\mathbf{h_{n_{1}',B}}$ is the hidden state of node $\mathbf{n_{1}'}$ with input $\mathbf{x_{B}}$. $\mathbf{F(x_{A},x_{B})}$ is the final hidden state used to determine the similarity for the input order $\mathbf{(x_{A},x_{B})}$. $\mathbf{F(x_{A},x_{B})}$ and $\mathbf{F(x_{B},x_{A})}$ are determined as follows:

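$$\mathbf{F(x_{A},x_{B})} = (\mathbf{h_{n_{1},A}} + \mathbf{h_{n_{1}',B}})/2, \qquad \mathbf{F(x_{B},x_{A})} = (\mathbf{h_{n_{1},B}} + \mathbf{h_{n_{1}',A}})/2. \tag{2}$$

$LSTM$ and $LSTM'$ share parameters, therefore

$$\mathbf{h_{n_{1},A}} = \mathbf{h_{n_{1}',A}}, \qquad \mathbf{h_{n_{1},B}} = \mathbf{h_{n_{1}',B}}. \tag{3}$$

From Equations 2 and 3,

$$\mathbf{F(x_{A},x_{B})} = \mathbf{F(x_{B},x_{A})}. \tag{4}$$

Thus the proposed Loopy RNN structure possesses the symmetry property.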
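The symmetry argument can be checked numerically with a small sketch. The code below is an illustration under stated assumptions, not the authors' implementation: `cell` is a hypothetical scalar tanh update standing in for an LSTM node, the weights `w`, `u`, `b` are arbitrary, and the loop is simply run for a fixed number of passes (`loops`). Both sub-networks call the same `cell`, which plays the role of parameter sharing.

```python
import math

def cell(x, h_prev, w=0.7, u=0.3, b=0.1):
    """Toy scalar recurrent update (hypothetical stand-in for an LSTM node)."""
    return math.tanh(w * x + u * h_prev + b)

def head_state(x_head, x_tail, loops=5):
    """Hidden state of the head node after `loops` passes around a two-node loop."""
    h_tail = 0.0
    h_head = 0.0
    for _ in range(loops):
        h_head = cell(x_head, h_tail)  # head node reads the tail node's state
        h_tail = cell(x_tail, h_head)  # tail node reads the head node's state
    return h_head

def F(x_a, x_b):
    """Average of the head states of the two parameter-sharing sub-networks."""
    # One sub-network sees (x_a, x_b); the other sees (x_b, x_a).
    return (head_state(x_a, x_b) + head_state(x_b, x_a)) / 2

x_a, x_b = 0.8, -0.5
print(head_state(x_a, x_b), head_state(x_b, x_a))  # a single sub-network is order dependent
print(F(x_a, x_b), F(x_b, x_a))                    # the averaged output is not
```

A single sub-network gives different head states for the two input orders; averaging over both sub-networks removes that dependence, which is the content of Equations 2 to 4.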

3.2 Loopy RNN for Image Matching

To facilitate image (patch) matching, the basic structure of the Loopy RNN must be augmented with additional network components. First, a feature extraction sub-network, i.e., a CNN (denoted as FeatureNet in the rest of this paper), maps the original image patch to a learned feature space, and the resulting feature is the input to the core structure of the Loopy RNN. Second, as it is generally not feasible to train/test a loopy structure, we develop a simple yet effective structural approximation that converts a Loopy RNN into a conventional LSTM network for measuring the similarity of a pair of features (denoted as MetricNet). Details are given as follows.

FeatureNet. We adopt the feature network of MatchNet Han et al. (2015) as the prototype of our network for extracting deep features, named FeatureNet. The structure of FeatureNet is adapted from AlexNet Krizhevsky et al. (2012). The detailed network structure of FeatureNet is illustrated in Figure 3. Note that the input patch size of FeatureNet is $64\times 64$ and the output feature dimension (which is connected to the core structure of the Loopy RNN) is 4096. For simplicity, dropout and local response normalization layers are omitted in our work, since our input is an image patch rather than an entire image (which is more complicated). Note that the feature extraction sub-networks for the different input image patches share the same parameters.
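The output dimensions in Table 1 and the 4096-dimensional feature can be checked with a short calculation. The sketch below assumes that every convolution uses size-preserving padding and that each pooling layer divides the spatial size by its stride (rounding up); the padding scheme is not stated in the text, so this is an assumption chosen because it reproduces the OD column.

```python
import math

# (name, type, kernel size, stride, output depth), transcribed from Table 1
LAYERS = [
    ("Conv0", "C", 7, 1, 24),
    ("Pool0", "MP", 3, 2, 24),
    ("Conv1", "C", 5, 1, 64),
    ("Pool1", "MP", 3, 2, 64),
    ("Conv2", "C", 3, 1, 96),
    ("Conv3", "C", 3, 1, 96),
    ("Conv4", "C", 3, 1, 64),
    ("Pool4", "MP", 3, 2, 64),
]

def output_dims(input_size, layers):
    """Return [(name, width, height, depth)] for a square input.

    Assumption: padding keeps the size unchanged at stride 1, so under this
    model only the stride (not the kernel size) changes the spatial size.
    """
    size = input_size
    dims = []
    for name, _kind, _kernel, stride, depth in layers:
        size = math.ceil(size / stride)
        dims.append((name, size, size, depth))
    return dims

dims = output_dims(64, LAYERS)
_, width, height, depth = dims[-1]
feature_length = width * height * depth
for row in dims:
    print(row)
print("flattened feature length:", feature_length)
```

Under these assumptions the last layer is $8\times 8\times 64$, which flattens to the 4096-dimensional descriptor fed to the Loopy RNN.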

MetricNet. Although the two-node Loopy RNN has a very simple structure, both forward and backward (training) computations for such a loopy structure are infeasible. It is therefore necessary to develop a simpler network structure that not only mimics the recursive nature of the Loopy RNN but is also easy to train and test. We observe that for a two-node Loopy RNN, the sequence of inputs can be regarded as an infinite repetition of both inputs. For example, for two image patches $I_{A}$ and $I_{B}$, whose descriptors (the output feature vectors of FeatureNet) are denoted by $\mathbf{x_{A}}$ and $\mathbf{x_{B}}$ respectively, the input to the two-node Loopy RNN is the alternating sequence $\mathbf{x_{A}}\rightarrow\mathbf{x_{B}}\rightarrow\mathbf{x_{A}}\rightarrow...$. Motivated by this observation, we use a conventional RNN/LSTM structure whose input is a finite sequence of $K$ repeated $\mathbf{(x_{A},x_{B})}$ patterns to approximate the Loopy RNN matching network, as shown in Figure 3. This breaks the loopy structure into a standard RNN/LSTM structure, to which off-the-shelf training algorithms can be applied directly. The proposed approximation has two advantages: 1) as information aggregates quickly through recursion, the output of the network becomes stable after just a few repeats, i.e., $K$ can be small, which makes model training and inference very efficient; 2) if we regard each time step in the approximated LSTM network as an attention switch to one of the images, the proposed approximation naturally simulates the recursive matching process of humans, i.e., it iteratively switches attention between the two input images and makes deeper and deeper comparisons.

Corresponding to the input sequence, our approximate LSTM network outputs a sequence of pair-wise matching features $\mathbf{h}_{0}\rightarrow\mathbf{h}_{1}\rightarrow\mathbf{h}_{2}\rightarrow...\rightarrow\mathbf{h}_{n-3}\rightarrow\mathbf{h}_{n-2}\rightarrow\mathbf{h}_{n-1}$. Each pair-wise feature vector $\mathbf{h}_{n}$ encodes the similarity/comparison information aggregated from the repeated inputs of both images up to time step $n$. In contrast to previous matching algorithms, which output only a scalar value to indicate similarity, our model outputs a feature vector that encodes the similarity relationship between the images, which conveys richer information and is more flexible for postprocessing (in cases where later fusion is preferred). These output similarity feature vectors are sent to a softmax layer to determine the similarity score, i.e., the final output of our network is a sequence of scores describing the similarity of the two image patches, denoted by $s_{0}\rightarrow s_{1}\rightarrow s_{2}\rightarrow...\rightarrow s_{n-3}\rightarrow s_{n-2}\rightarrow s_{n-1}$. $s_{n}$ is computed as follows:

$$s_{n} = \frac{1}{1 + e^{-\mathbf{\theta}^{T}\mathbf{h}_{n}}}, \quad n \neq 0, \tag{5}$$

where $\mathbf{\theta}$ is the set of softmax layer parameters. Each $s_{n}$ lies in the interval $[0,1]$, and a greater value indicates higher similarity. The first pair-wise feature $\mathbf{h_{0}}$ and the first score $s_{0}$ are discarded, as shown in Figure 3, because the first node contains information from only a single patch rather than from the pair.
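The unrolled approximation and the per-step score can be sketched in a few lines. This is a minimal illustration under assumptions, not the paper's network: `step` is a hypothetical element-wise tanh cell used in place of the LSTM node, the weights and the vector `theta` are arbitrary, and the number of nodes is taken to be twice the number of repeats $K$.

```python
import math

def unroll_inputs(x_a, x_b, k):
    """Alternating sequence x_A, x_B, x_A, ... with K copies of the (x_A, x_B) pattern."""
    return [x_a, x_b] * k

def step(x, h, w=0.7, u=0.3, b=0.1):
    """Toy element-wise tanh cell (hypothetical stand-in for an LSTM node)."""
    return [math.tanh(w * xi + u * hi + b) for xi, hi in zip(x, h)]

def matching_scores(x_a, x_b, theta, k):
    """Run the unrolled network and return the scores after dropping s_0."""
    h = [0.0] * len(x_a)
    scores = []
    for x in unroll_inputs(x_a, x_b, k):
        h = step(x, h)
        z = sum(t * hi for t, hi in zip(theta, h))   # theta^T h_n
        scores.append(1.0 / (1.0 + math.exp(-z)))    # sigmoid score s_n
    return scores[1:]  # the first node has seen only one patch

x_a = [0.8, -0.2, 0.5]     # made-up descriptor of patch A
x_b = [0.7, -0.1, 0.4]     # made-up descriptor of patch B
theta = [1.0, -1.0, 0.5]   # made-up scoring parameters
scores = matching_scores(x_a, x_b, theta, k=5)
print(len(scores), scores)
```

With `k=5` the sequence has 10 nodes, so 9 scores remain once the first one is discarded.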

3.3 Monotonous Loss

As the network's attention iteratively switches between the two input image patches and more and more comparisons are made, the similarity measurement should become more and more confident. In other words, the similarity score of the correct category should be monotonically non-decreasing as the information propagates deeper along the proposed matching network, as illustrated in Figure 3. However, a plain cross-entropy loss does not enforce such a monotonically non-decreasing property. We therefore use a monotonous loss Ma et al. (2016), which extends the conventional cross-entropy loss so that the accuracy of the prediction is encouraged to increase as the matching process goes deeper. Mathematically, we can express this loss as:

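$$L_{n}^{m} = \max\left(0, (-1)^{y}(s_{n} - s_{n}^{pre})\right), \quad n \neq 0, \tag{6}$$

$$s_{n}^{pre} = \begin{cases} \max(s_{1}, s_{2}, \dots, s_{n-1}), & y = 1, \\ \min(s_{1}, s_{2}, \dots, s_{n-1}), & y = 0. \end{cases}$$

Here, $L_{n}^{m}$ denotes the monotonous loss at time step $n$, which penalizes the corresponding node if the output similarity score violates the monotonous rule. $y$ is the ground truth label, i.e., $1$ for matched and $0$ for un-matched. $s_{n}$ is the predicted similarity score at time step $n$, and $s_{n}^{pre}$ is the maximum ($y=1$) or minimum ($y=0$) prediction score up to time step $n-1$. The max operation of Equation 6 picks out the nodes that violate the monotonous rule. Denoting the standard cross-entropy loss by $L_{n}^{c}$, the overall loss can be expressed as:

$$L_{n}^{c} = -\left(y\log(s_{n}) + (1 - y)\log(1 - s_{n})\right), \quad n \neq 0,$$

$$L_{n} = L_{n}^{c} + \lambda L_{n}^{m}, \quad n \neq 0,$$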

where $\lambda$ is a weighting factor that balances the two types of losses.
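A worked example makes the two loss terms concrete. The sketch below evaluates the cross-entropy term, the monotonous term, and their weighted sum for a list of scores $s_{1}, s_{2}, \dots$ (with $s_{0}$ already removed). One detail is an assumption: for the first score there is no earlier score to compare with, so its monotonous term is taken to be zero. The function name `loopy_losses` and the example scores are made up for illustration.

```python
import math

def loopy_losses(scores, y, lam=0.4):
    """Return (cross_entropy, monotonous, total) per step for scores s_1, s_2, ...

    scores must lie strictly inside (0, 1); y is 1 (matched) or 0 (un-matched).
    """
    results = []
    for n, s in enumerate(scores):
        ce = -(y * math.log(s) + (1 - y) * math.log(1 - s))
        if n == 0:
            mono = 0.0  # assumption: no earlier score, so nothing can be violated
        else:
            # best score so far: the maximum for matched pairs, the minimum otherwise
            prev = max(scores[:n]) if y == 1 else min(scores[:n])
            mono = max(0.0, (-1) ** y * (s - prev))
        results.append((ce, mono, ce + lam * mono))
    return results

matched = loopy_losses([0.6, 0.5, 0.9], y=1)    # the second score drops: penalized
unmatched = loopy_losses([0.4, 0.5, 0.2], y=0)  # the second score rises: penalized
for ce, mono, total in matched:
    print(round(ce, 4), round(mono, 4), round(total, 4))
```

For the matched pair, only the second step violates the rule (0.5 is below the earlier maximum 0.6), so only that step receives a monotonous penalty of about 0.1; the third step (0.9) exceeds every earlier score and is not penalized.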

3.4 Implementation Details

Data Preparation. Data imbalance is a common problem in patch-based image matching, because the number of positive pairs is far smaller than the number of negative pairs. A sampler that generates equal numbers of positive and negative pairs in each mini-batch is employed to prevent excessive bias towards negative pairs. To improve the generalization capability, we augment the dataset by vertically and horizontally flipping the original patch and rotating it by $90$, $180$, and $270$ degrees. Following previous work Han et al. (2015), we map each pixel value $x$ (in [0,255]) to $(x-128)/160$.
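The pixel mapping and the augmentations can be illustrated on a tiny patch. The sketch below represents a grayscale patch as a nested list and lists the individual transforms named above; whether the transforms are applied to both patches of a pair, or combined with each other, is not specified in the text, so the helper `augment` is only one possible reading.

```python
def normalize(patch):
    """Map each pixel value x in [0, 255] to (x - 128) / 160."""
    return [[(p - 128) / 160 for p in row] for row in patch]

def flip_horizontal(patch):
    """Mirror the patch left to right."""
    return [row[::-1] for row in patch]

def flip_vertical(patch):
    """Mirror the patch top to bottom."""
    return [row[:] for row in patch[::-1]]

def rotate90(patch):
    """Rotate the patch by 90 degrees clockwise."""
    return [list(col) for col in zip(*patch[::-1])]

def augment(patch):
    """Original patch plus its two flips and three rotations (one possible reading)."""
    r90 = rotate90(patch)
    r180 = rotate90(r90)
    r270 = rotate90(r180)
    return [patch, flip_horizontal(patch), flip_vertical(patch), r90, r180, r270]

patch = [[0, 128],
         [255, 64]]
variants = augment(patch)
print(normalize(patch))
print(variants)
```

With this mapping a pixel value of 128 becomes 0 and the extremes 0 and 255 become -0.8 and about 0.79.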

Network Parameter and Training. The details of FeatureNet are listed in Table 1. For MetricNet, there are 3 key factors which influence the performance of Loopy RNN model: 1) the weighting factor of monotonous loss $\lambda$ ; 2) the number of RNN nodes $N$ ( $N\in\{6,8,10,12\}$ ); 3) the output dimension of LSTM node $D$ ( $D\in\{512,1024,1536,2048\}$ ).

Our models are trained with Caffe Jia et al. (2014) and optimized by stochastic gradient descent (SGD) with a batch size of 32. The learning rate is set to 0.01 at the beginning and decreased once every 1000 iterations. Our model converges to a steady state after about 70 epochs.

4 Experiments

We evaluate our Loopy RNN network on two datasets: UBC dataset Winder et al. (2009) and Mikolajczyk Dataset Brown et al. (2011). Extensive experimental evaluations and in-depth analysis of the proposed method are presented in this section.

4.1 Dataset and Evaluation Metric

UBC Dataset. UBC includes three subsets: Liberty, Notredame and Yosemite. The numbers of image patches in the three subsets are 450k, 468k, and 634k respectively. For each subset, 100k, 200k, and 500k pre-generated pairs are provided, with equal numbers of positive and negative pairs. Each patch in the dataset has a fixed size of $64\times 64$, which corresponds to the input dimension of our FeatureNet.

We follow the standard evaluation protocol Brown et al. (2011), i.e., the model is trained on one subset and tested on the other two subsets, and this is repeated for each subset in turn. FPR95 (false positive rate at $95\%$ recall) is adopted as the evaluation metric; lower is better.
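As an illustration of the metric, the sketch below computes the false positive rate at a given recall from a list of similarity scores and binary labels. It is a simple reading of the definition rather than the official benchmark script: the threshold is chosen as the score of the last positive pair that must be accepted to reach the target recall, and pairs scoring at or above the threshold are counted as accepted. The function name and the example scores are made up.

```python
import math

def fpr_at_recall(scores, labels, recall=0.95):
    """False positive rate at the threshold where `recall` of the positives are accepted."""
    positives = sorted((s for s, y in zip(scores, labels) if y == 1), reverse=True)
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    # number of positive pairs that must be accepted to reach the target recall
    needed = max(1, math.ceil(recall * len(positives)))
    threshold = positives[needed - 1]
    false_positives = sum(1 for s in negatives if s >= threshold)
    return false_positives / len(negatives)

scores = [0.9, 0.8, 0.7, 0.6, 0.75, 0.65, 0.3, 0.1]
labels = [1, 1, 1, 1, 0, 0, 0, 0]
print(fpr_at_recall(scores, labels, recall=0.75))
print(fpr_at_recall(scores, labels, recall=1.0))
```

In this toy example, accepting three of the four positives requires a threshold of 0.7, which lets one of the four negatives through; accepting all four positives lowers the threshold to 0.6 and lets two negatives through.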

Mikolajczyk Dataset. The Mikolajczyk dataset is composed of 48 images in 8 sequences, and each sequence corresponds to one of 5 transformations (viewpoint change, compression, blur, lighting change and zoom) applied with a gradually increasing amount of transformation. Among the 6 images of a sequence, one is the reference image and the rest are transformed from the reference image with different transformation magnitudes, i.e., degrees of transformation. The ground truth homographies between the reference image and the transformed images are provided for evaluation. We follow the method of Mikolajczyk and Schmid (2005) to test our model. MAP (mean average precision), which measures the area under the precision-recall curve, is adopted as the evaluation metric.

4.2 Evaluation on UBC

In the testing phase, if the monotonous loss is not used, i.e., $\lambda=0$, we adopt the mean score of all nodes (except the first node) as the final similarity score. Otherwise, the final score is the mean of the last two nodes' scores. For example, for a sequence of 8 nodes in which $s_{1},s_{2},...,s_{7}$ are the scores of the last 7 nodes, the final score is $s=\frac{1}{7}(s_{1}+s_{2}+...+s_{7})$ if $\lambda=0$, and $s=\frac{1}{2}(s_{6}+s_{7})$ otherwise.
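The rule for the final score can be written as a small helper. The function name `final_score` and the example scores are made up for illustration; the input is assumed to be the full score sequence $s_{0}, s_{1}, \dots, s_{N-1}$, including the first score that is discarded.

```python
def final_score(scores, lam):
    """Final similarity score from the per-node scores s_0 ... s_{N-1}.

    lam == 0 (no monotonous loss): mean of all scores except s_0.
    otherwise: mean of the last two scores.
    """
    usable = scores[1:]  # s_0 comes from a single patch and is dropped
    if lam == 0:
        return sum(usable) / len(usable)
    return (usable[-2] + usable[-1]) / 2

# 8 nodes: s_0 is ignored, s_1 ... s_7 are used
scores = [0.5, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]
print(final_score(scores, lam=0))
print(final_score(scores, lam=0.4))
```

For these scores the first call averages $s_{1}$ to $s_{7}$ and the second averages only $s_{6}$ and $s_{7}$, mirroring the 8-node example in the text.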

To verify the effectiveness of our Loopy RNN network, we compare our model with two recent works that apply CNNs to patch-based image matching: MatchNet Han et al. (2015) and Zagoruyko and Komodakis (2015). Table 2 lists the comparison results. Our best model achieves a $6.32\%$ average FPR95 with the monotonous loss and parameters $N=10,D=1024$. nSIFT concat.+NNet combines concatenated SIFT features with a neural network. Our network outperforms this model by a large margin, mainly because the feature extracted by FeatureNet is more effective. Up to the metric network, MatchNet has the same architecture as our network, and MatchNet achieves a $7.75\%$ average FPR95. The improvement of 1.43 percentage points over MatchNet therefore comes entirely from the combination of our Loopy RNN architecture and the monotonous loss. The remaining networks are all Siamese-like. Compared with the Siamese and Pseudo-Siamese networks, the FPR95 of our network is lower by 3.75 and 3.3 percentage points respectively. The gain comes from FeatureNet and our Loopy RNN network. Siamese-2stream and Siamese-2stream-$l_{2}$ use information from different resolutions and achieve $7.63\%$ and $9.67\%$ average FPR95 respectively. Even though it uses information from only one resolution, our model still improves FPR95 by 1.31 and 3.34 percentage points. To verify the effect of the monotonous loss, we also list the results of our model without it: the average FPR95 rises from $6.32\%$ to $6.70\%$. These results illustrate that our Loopy RNN architecture is superior to the Siamese network and that the monotonous loss helps the Loopy RNN obtain a further performance gain.

The 2ch-2stream network Zagoruyko and Komodakis (2015) obtains better performance than our network. It treats two grayscale patches as two channels of a new image and classifies the new image into one of two categories. The 2ch-2stream network performs better because it processes the two images jointly from the very beginning, so that each feature map includes features of the pair. Because a feature map cannot be input to an LSTM node directly, our network must first extract a feature vector from each image. Our FeatureNet therefore processes the two images separately, and features of the pair are processed only in MetricNet. In going from image to feature vector, our network loses some pair information compared with the 2-channel network.

Parameter Analysis. A large $\lambda$ makes the proposed network hard to converge, so $\lambda$ cannot be set to a large value. On the other hand, $\lambda$ balances the monotonous loss against the cross-entropy loss, and too small a value weakens the effect of the monotonous loss. We therefore set $\lambda=0.4$ empirically. Figure 4 shows the experimental results for different LSTM output dimensions $D$ and loopy times $N$. We fix $N$ at 10 and test the model with different $D$ (Figure 4(a)), then fix $D$ at 1024 and test the model with different $N$ (Figure 4(b)). A higher dimension yields better performance but also higher computational complexity, so we set $D$ to 1024. A larger loopy time $N$ improves performance, but the performance saturates once $N$ exceeds 10, because the network already has enough observations to make the judgement. We therefore set $N$ to 10. Based on the above analysis, we choose the model with $N=10$, $D=1024$ as our best model, even though the $N=10$, $D=2048$ model slightly outperforms it.

4.3 Evaluation On Local Descriptors

We compare our network with five models of Zagoruyko and Komodakis (2015) that are tested on the same dataset. MSER SIFT uses the Euclidean distance between SIFT features to measure the similarity of two patches. Three other models, MSER Imagenet, MSER Siam-2stream-$l_{2}$ and MSER Siam-SPP-$l_{2}$, replace the SIFT feature with a CNN feature. MSER 2ch-2stream processes the two images jointly, as mentioned above. All the models are trained on the Liberty dataset. We test our model under different transformations with increasing magnitudes. Figure 5 illustrates the overall results. Our network outperforms most networks, especially in the extreme case. When the transformation magnitude equals 5, our network and 2ch-2stream outperform the other networks by approximately 10%, which demonstrates the robustness of our model. MSER Siam-SPP-$l_{2}$ is a network with an SPP layer, which enables it to deal with patches of different scales. Although our model deals only with single-scale patches, it still outperforms MSER Siam-SPP-$l_{2}$ in most cases. When the magnitude equals 1, 2, and 5, the MAP of our model is higher than that of MSER Siam-SPP-$l_{2}$ by about 2%, 4% and 9% respectively. The performance of our network is comparable with that of the 2ch-2stream network. This indicates that our loopy network generalizes well, as it performs well on different datasets.

5 Conclusion

In this paper, we propose a novel Loopy RNN which matches a pair of patches in a recurrent manner. On top of the widely used cross-entropy loss, we add a monotonous loss that constrains the sequence of output scores. Trained with this combined loss, our network imitates a human observing the two patches back and forth, and its judgement becomes more and more confident in the process. Our experimental results show the effectiveness of the proposed method.

Acknowledgements

The work was supported by State Key Research and Development Program (2016YFB1001003). This work was partly supported by NSFC (61502301), China’s Thousand Youth Talents Plan, National Natural Science Foundation of China (61521062), the 111 Project (B07022) and the Opening Project of Shanghai Key Laboratory of Digital Media Processing and Transmissions.

Bibliography (30)

1. Arandjelovic et al. [2016] Relja Arandjelovic, Petr Gronat, Akihiko Torii, Tomas Pajdla, and Josef Sivic. NetVLAD: CNN architecture for weakly supervised place recognition. In CVPR, 2016.
2. Bay et al. [2006] Herbert Bay, Tinne Tuytelaars, and Luc Van Gool. SURF: Speeded up robust features. In ECCV, 2006.
3. Bertinetto et al. [2016] Luca Bertinetto, Jack Valmadre, João F Henriques, Andrea Vedaldi, and Philip HS Torr. Fully-convolutional siamese networks for object tracking. In ECCV, 2016.
4. Bromley et al. [1993] Jane Bromley, James W Bentz, Léon Bottou, Isabelle Guyon, Yann LeCun, Cliff Moore, Eduard Säckinger, and Roopak Shah. Signature verification using a siamese time delay neural network. IJPRAI, pages 669–688, 1993.
5. Brown et al. [2011] Matthew Brown, Gang Hua, and Simon Winder. Discriminative learning of local image descriptors. IEEE TPAMI, pages 43–57, 2011.
6. Cheng et al. [2014] Jian Cheng, Cong Leng, Jiaxiang Wu, Hainan Cui, and Hanqing Lu. Fast and accurate image matching with cascade hashing for 3D reconstruction. In CVPR, 2014.
7. Dorffner [1996] Georg Dorffner. Neural networks for time series processing. In Neural Network World, 1996.
8. Fischer et al. [2014] Philipp Fischer, Alexey Dosovitskiy, and Thomas Brox. Descriptor matching with convolutional neural networks: a comparison to SIFT. arXiv preprint arXiv:1405.5769, 2014.