SFNet: Learning Object-aware Semantic Correspondence

Junghyup Lee; Dohyung Kim; Jean Ponce; Bumsub Ham

arXiv:1904.01810·cs.CV·April 8, 2019

SFNet: Learning Object-aware Semantic Correspondence

Junghyup Lee, Dohyung Kim, Jean Ponce, Bumsub Ham

PDF

TL;DR

SFNet is a novel CNN architecture trained with synthetic deformations and binary masks to establish dense semantic correspondences between images of the same object category, outperforming previous methods.

Contribution

The paper introduces SFNet, a new CNN with a differentiable argmax and combined loss, for improved semantic correspondence using mask-based supervision.

Findings

01

Outperforms state-of-the-art on standard benchmarks.

02

Effectively handles background clutter and image-specific details.

03

Utilizes synthetic geometric deformations for training data augmentation.

Abstract

We address the problem of semantic correspondence, that is, establishing a dense flow field between images depicting different instances of the same object or scene category. We propose to use images annotated with binary foreground masks and subjected to synthetic geometric deformations to train a convolutional neural network (CNN) for this task. Using these masks as part of the supervisory signal offers a good compromise between semantic flow methods, where the amount of training data is limited by the cost of manually selecting point correspondences, and semantic alignment ones, where the regression of a single global geometric transformation between images may be sensitive to image-specific details such as background clutter. We propose a new CNN architecture, dubbed SFNet, which implements this idea. It leverages a new and differentiable version of the argmax function for…

Tables4

Table 1. Table 1: Quantitative comparison with the state of the art on the PF-WILLOW [ 14 ] and the test split of the PF-PASCAL [ 15 , 16 ] in terms of the average PCK. We measure the PCK scores with height and width of the bounding box size. All numbers except for the methods of [ 40 , 42 , 41 ] are taken from [ 15 , 42 ] . Numbers in bold indicate the best performance and underscored ones are the second best. We denote by “F” and “A”, respectively, semantic flow and semantic alignment methods. The characters in parentheses are types of a supervisory signal for training; T: Transformation parameters; P: Image pairs depicting different instances of the same object category; B: Bounding boxes; M: Foreground masks.

Type		Methods	PCK ( $α = 0.1$ )
Type		Methods	WILLOW	PASCAL
Hand-crafted	F	DeepFlow [39]	0.20	0.21
	F	GMK [10]	0.27	0.27
	F	SIFTFlow [30]	0.38	0.33
	F	DSP [26]	0.29	0.30
	F	HOG+PF-LOM [15]	0.56	0.45
CNN-based	A	(T) ResNet-101+CNNGeo [40]	0.68	0.68
	A	(T) ResNet-101+A2Net [42]	0.69	0.67
	A	(T+P) ResNet-101+WS-SA [41]	0.71	0.72
	F	(B+P) FCSS+PF-LOM [27]	0.58	0.46
	F	(M) ResNet-101+Ours	0.74	0.79

Table 2. Table 2: Quantitative comparison on the Caltech-101 dataset [ 12 ] . All numbers are taken from [ 15 , 41 , 42 ] . Numbers in bold indicate the best performance and underscored ones are the second best. C: Ground-truth correspondences.

Type		Methods	LT-ACC	IoU
Hand-crafted	F	DeepFlow [39]	0.74	0.40
	F	GMK [10]	0.77	0.42
	F	SIFTFlow [30]	0.75	0.48
	F	DSP [26]	0.77	0.47
	F	HOG+PF-LOM [15]	0.78	0.50
	F	OADSC [46]	0.81	0.55
CNN-based	A	(T) VGG-16+A2Net [42]	0.80	0.57
	A	(T) ResNet-101+CNNGeo [40]	0.83	0.61
	A	(T+P) ResNet-101+WS-SA [41]	0.85	0.63
	F	(C+P) VGG-16+SCNet-AG+ [16]	0.79	0.51
	F	(B+P) FCSS+PF-LOM [27]	0.83	0.52
	F	(M) ResNet-101+Ours	0.88	0.67

Table 3. Table 3: Average PCK comparison of different loss functions.

Mask	Flow	Smoothness	PCK
consistency	consistency		( $α = 0.1$ )

✓	✗	✗	0.675
✗	✓	✗	0.718
✓	✓	✗	0.782
✓	✓	✓	0.787

Table 4. Table 4: Average PCK comparison of different components. We denote by “H”, “S”, and “KS” hard, soft, and kernel soft argmax operators, respectively.

Adaptation	Multi-level	Argmax		PCK
layer	feature	Train	Test	( $α = 0.1$ )

✗	✓	-	H	0.458
✗	✓	-	S	0.088
✗	✓	-	KS	0.284
✓	✗	S	H	0.725
✓	✗	S	S	0.717
✓	✗	KS	KS	0.750
✓	✓	S	H	0.768
✓	✓	S	S	0.762
✓	✓	KS	KS	0.787

Equations20

c (p, q) = f^{s} (p)^{⊤} f^{t} (q), \vspace - 0.1 c m

c (p, q) = f^{s} (p)^{⊤} f^{t} (q), \vspace - 0.1 c m

ϕ (p) = q \sum m_{p} (q) q . \vspace - 0.1 c m

ϕ (p) = q \sum m_{p} (q) q . \vspace - 0.1 c m

m_{p} (q) = \frac{exp ( β k _{p} ( q ) n _{p} ( q ))}{\sum _{q^{'} \in n_{p}} exp ( β k _{p} ( q ^{'} ) n _{p} ( q ^{'} ))}, \vspace - 0.1 c m

m_{p} (q) = \frac{exp ( β k _{p} ( q ) n _{p} ( q ))}{\sum _{q^{'} \in n_{p}} exp ( β k _{p} ( q ^{'} ) n _{p} ( q ^{'} ))}, \vspace - 0.1 c m

L = λ_{mask} L_{mask} + λ_{flow} L_{flow} + λ_{smooth} L_{smooth}, \vspace - 0.1 c m

L = λ_{mask} L_{mask} + λ_{flow} L_{flow} + λ_{smooth} L_{smooth}, \vspace - 0.1 c m

F^{s} (p) = ϕ (p) - p . \vspace - 0.1 c m

F^{s} (p) = ϕ (p) - p . \vspace - 0.1 c m

\hat{M}^{s} = W (M^{t}; F^{s}) . \vspace - 0.1 c m

\hat{M}^{s} = W (M^{t}; F^{s}) . \vspace - 0.1 c m

\mathcal{L}_{\mathrm{mask}}=\sum_{i\in\{s,t\}}\Biggl{(}\frac{1}{|N^{i}|}\sum_{{\bf{p}}}({M}^{i}({\bf{p}})-\hat{M}^{i}({\bf{p}}))^{2}\Biggr{)},\vspace{-0.1cm}

\mathcal{L}_{\mathrm{mask}}=\sum_{i\in\{s,t\}}\Biggl{(}\frac{1}{|N^{i}|}\sum_{{\bf{p}}}({M}^{i}({\bf{p}})-\hat{M}^{i}({\bf{p}}))^{2}\Biggr{)},\vspace{-0.1cm}

\mathcal{L}_{\mathrm{flow}}=\sum_{i\in\{s,t\}}\Biggl{(}\frac{1}{|N^{i}_{F}|}\sum_{{\bf{p}}}||(\mathcal{F}^{i}({\bf{p}})+\hat{\mathcal{F}}^{i}({\bf{p}}))\odot M^{i}({\bf{p}})||^{2}_{2}\Biggr{)},\vspace{-0.1cm}

\mathcal{L}_{\mathrm{flow}}=\sum_{i\in\{s,t\}}\Biggl{(}\frac{1}{|N^{i}_{F}|}\sum_{{\bf{p}}}||(\mathcal{F}^{i}({\bf{p}})+\hat{\mathcal{F}}^{i}({\bf{p}}))\odot M^{i}({\bf{p}})||^{2}_{2}\Biggr{)},\vspace{-0.1cm}

\hat{F}^{s} = W (F^{t}; F^{s}), \vspace - 0.1 c m

\hat{F}^{s} = W (F^{t}; F^{s}), \vspace - 0.1 c m

\mathcal{L}_{\mathrm{smooth}}=\sum_{i\in\{s,t\}}\Biggl{(}\frac{1}{|N^{i}_{F}|}\sum_{{\bf{p}}}||\nabla\mathcal{F}^{i}({\bf{p}})\odot M^{i}({\bf{p}})||_{1}\Biggr{)},

\mathcal{L}_{\mathrm{smooth}}=\sum_{i\in\{s,t\}}\Biggl{(}\frac{1}{|N^{i}_{F}|}\sum_{{\bf{p}}}||\nabla\mathcal{F}^{i}({\bf{p}})\odot M^{i}({\bf{p}})||_{1}\Biggr{)},

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Full text

SFNet: Learning Object-aware Semantic Correspondence

Junghyup Lee1, Dohyung Kim1,11footnotemark: 1 Jean Ponce2,3 Bumsub Ham1, $\dagger$

1Yonsei University 2DI ENS 3INRIA Equal contribution. †Corresponding author.

1School of Electrical and Electronic Engineering, Yonsei University, Seoul, Korea.

2Département d’Informatique de l’ENS, ENS, CNRS, PSL Research University, Paris, France.

Abstract

We address the problem of semantic correspondence, that is, establishing a dense flow field between images depicting different instances of the same object or scene category. We propose to use images annotated with binary foreground masks and subjected to synthetic geometric deformations to train a convolutional neural network (CNN) for this task. Using these masks as part of the supervisory signal offers a good compromise between semantic flow methods, where the amount of training data is limited by the cost of manually selecting point correspondences, and semantic alignment ones, where the regression of a single global geometric transformation between images may be sensitive to image-specific details such as background clutter. We propose a new CNN architecture, dubbed SFNet, which implements this idea. It leverages a new and differentiable version of the argmax function for end-to-end training, with a loss that combines mask and flow consistency with smoothness terms. Experimental results demonstrate the effectiveness of our approach, which significantly outperforms the state of the art on standard benchmarks.

1 Introduction

Establishing dense correspondences across images is one of the fundamental tasks in computer vision [36, 5, 30]. Early works have focussed on handling different views of the same scene (stereo matching [36, 19]) or adjacent frames (optical flow [5, 4]) in a video sequence. Semantic correspondence algorithms (e.g., SIFT Flow [30]) go one step further, finding a dense flow field between images depicting different instances of the same object or scene category. This is very challenging especially in the presence of large changes in appearance/scene layout and background clutter. Classical approaches to semantic correspondence [30, 26, 20, 3, 47] typically use an objective function involving fidelity and regularization terms. The fidelity term encourages hand-crafted features (e.g., SIFT [32], HOG [7], DAISY [45]) to be matched along a dense flow field between images, and the regularization term makes it smooth while aligning discontinuities to object boundaries. Although they have proven useful in various computer vision tasks including object recognition [30, 10], semantic segmentation [26], co-segmentation [44], image editing [8], and scene parsing [26, 50], hand-crafted features do not capture high-level semantics (e.g., appearance and shape variations), and are not robust to image-specific details (e.g., texture, background clutter, occlusion).

Convolutional neural networks (CNNs) have allowed remarkable advances in semantic correspondence in the past few years. Recent methods using CNNs [16, 6, 35, 24, 27, 51, 40, 41, 42, 23] benefit from rich semantic features invariant to intra-class variations, achieving state-of-the-art results. Semantic flow approaches [16, 6, 35, 27, 51] attempt to find correspondences for individual pixels or patches. They are not seriously affected by non-rigid deformations, but are easily distracted by background clutter. They also require a large amount of data with ground-truth correspondences for training. Although pixel-level semantic correspondences impose very strong constraints, manually annotating them is extremely labor-intensive and somewhat subjective, which limits the amount of training data available [14]. An alternative is to learn feature descriptor only [6, 35, 27] or to exploit 3D CAD models provided by rendering engines [51]. Semantic alignment methods [24, 40, 41, 42, 23] on the other hand formulate semantic correspondence as a geometric alignment problem and directly regress parameters of a global transformation model (e.g., affine and thin plate spline) between images. This leverages self-supervised learning where ground-truth parameters are generated synthetically using random transformations with, however, a higher sensitivity to non-rigid deformations. Moreover, background clutter prevents focussing on individual objects and distracts estimating the transformation parameters. To overcome this problem, recent methods alleviate the influence of distractors by inlier counting [41] or an attention process [42].

In this paper, we present a new approach to establishing an object-aware semantic flow and propose to exploit binary foreground masks as a supervisory signal (Fig. 1). Our approach builds upon the insight that correspondences of high quality between images allow to segment common objects from background. To implement this idea, we introduce a new CNN architecture, dubbed SFNet, that outputs a semantic flow field at a sub-pixel level. We leverage a new and differentiable version of the argmax function, a kernel soft argmax, together with mask/flow consistency and smoothness terms to train SFNet end-to-end, establishing object-aware correspondences while filtering out distracting details. Our approach has the following advantages: First, it is a good compromise between current semantic flow and alignment methods, since masks are available for large dataset, and they give a good set of constraints. Exploiting binary foreground masks explicitly for training makes it possible to focus on learning correspondences between prominent objects and scene elements. Note that no masks are required at test time. Second, our method establishes a dense non-parametric flow field (i.e., semantic flow), which is more robust to non-rigid deformations than a parametric regression (i.e., semantic alignment). Finally, the kernel soft argmax enables training the whole network end-to-end, and hence our approach further benefits from high-level semantics specific to the task of semantic correspondence. The main contributions of this paper can be summarized as follows:

$\bullet$

We propose to exploit binary foreground masks directly, that are widely available and can be annotated more easily than the pixel-level ground truth, to learn semantic flow by incorporating them into loss functions.

$\bullet$

We introduce a kernel soft argmax, making it less susceptible to multi-modal distributions while providing a differentiable flow field at a sub-pixel level.

$\bullet$

We set a new state of the art on standard benchmarks for semantic correspondence, clearly demonstrating the effectiveness of our approach to exploiting foreground masks. We additionally provide an extensive experimental analysis with ablation studies.

To encourage comparison and future work, our code and models are available online: https://cvlab-yonsei.github.io/projects/SFNet.

2 Related work

Correspondence problems cover a broad range of topics in computer vision including stereo, motion analysis, object recognition and shape matching. Giving a comprehensive review on these topics is beyond the scope of this paper. We briefly review representative works related to ours.

Classical approaches have focussed on finding sparse correspondences, e.g., for instance matching [32] or establishing dense matches between nearby views of the same scene/object, e.g., for stereo matching [19, 36] and optical flow estimation [4, 5]. Unlike these, semantic correspondence methods estimate dense matches across pictures containing different instances of the same object or scene category. Early works on semantic correspondence focus on matching local features from hand-crafted descriptors, such as SIFT [30, 26, 20, 3], DAISY [47] and HOG [14, 44, 46], together with spatial regularization using graphical models [30, 26, 44, 20] or random sampling [1, 47]. However, designing hand-crafted features while considering high-level semantics is extremely hard, and computing similarities between them is easily distracted e.g., by clutter, texture, occlusion and appearance variations. There are many attempts to estimate correspondences robust against background clutter or scale changes between objects/object parts, by using object proposals as candidate regions for matching [14, 46] or performing matching in scale space [38].

Recently, image features from CNNs have shown the powerful capacity of representing high-level semantics and the robustness to appearance and shape variations [29, 43, 17]. Long et al. [31] first apply CNNs to establish semantic correspondences between images. They follow the same procedure as the SIFT Flow [30] method, but exploit off-the-shelf CNN features trained for ImageNet classification tasks due to a lack of training datasets with pixel-level annotations. This problem can be alleviated by synthesizing ground-truth correspondences from 3D models [51] or augmenting the number of match pairs in a sparse keypoint dataset using interpolation [44]. More recently, the PF dataset [15] has been released providing 1300+ image pairs of 20 image categories with ground-truth annotations from the PASCAL 2011 keypoint dataset [2]. This enables learning local features [16, 27, 35] specific to the task of semantic correspondence. Although these approaches using CNN features outperform early methods by large margins, loss functions for training do not involve a spatial regularizer mainly due to a lack of differentiability of the flow field. In contrast, our flow field is differentiable, allowing to train the whole network with a spatial regularizer end-to-end.

Several recent methods [24, 40, 41, 42, 23] formulate semantic correspondence as a geometric alignment problem using parametric models. In particular, these methods first compute feature correlations between images, and they are fed into a regression layer to estimate parameters of a global transformation model (e.g., affine, homography, and thin plate spline) to align images. This makes it possible to leverage self-supervised learning [24, 40, 41, 42] using synthetically generated data and to train the entire CNNs end-to-end. These approaches apply the same transformation to all pixels, which has the effect of an implicit spatial regularization, providing smooth matches and often outperforming semantic flow methods [6, 14, 16, 27, 51]. However, they are easily distracted by background clutter and occlusion [24, 40], since correlations between pairs of features are noisy and include outliers (e.g., between different backgrounds). Although this can be alleviated by using attention models [42] or suppressing outlier metches [41], global transformation models are highly sensitive to non-rigid deformations or local geometric variations. In this context, our method avoids this problem by establishing semantic correspondences directly from feature correlations.

Similar to ours, many methods [23, 27, 50, 51] leverage object bounding boxes or foreground masks to learn semantic correspondence. They, however, do not incorporate the object location prior explicitly into loss functions. They instead use the prior for pre-processing training samples, e.g., generating positive/negative training pairs [23, 27] or limiting the candidate regions for matching [50, 51]. In contrast, we incorporate the prior directly into loss functions to train the network, outperforming the state of the art by a significant margin.

3 Approach

In this section, we describe our approach to establishing object-aware semantic correspondences including the network architecture (Sec. 3.1) and loss functions (Sec. 3.2). An overview of our method is shown in Fig. 2.

3.1 Network architecture

Our model is fully convolutional and mainly consists of three parts (Fig. 2): We first extract features from source and target images, $I^{s}$ and $I^{t}$ , using a siamese network where each sub-network has the same structure with shared parameters. We then compute matching scores between all pairs of local features in the two images, and assign the best match for each feature by the kernel soft argmax. All components are differentiable, allowing us to train the whole network end-to-end. In the following, we describe the network architecture for source to target matching in detail. A target to source matching is similarly computed.

Feature extraction and matching. We exploit a ResNet-101 [17] trained for ImageNet classification [9] for feature extraction. Although such CNN features give rich semantics, they typically fire on highly discriminative parts for classification. This may be less adequate for feature matching that requires capturing a spatial deformation for fine-grained localization. We thus use additional adaptation layers to extract features specific to the task of semantic correspondence, transforming them to be highly discriminative w.r.t both appearance and spatial context. This gives a feature map of size $h\times w\times d$ for each image that corresponds to $h\times w$ grids of $d$ -dimensional local features. We then apply L2 normalization to the individual $d$ -dimensional features. As will be seen in our experiments, the adaptation layers boost the matching performance drastically.

Matching scores are computed using the dot product between local features, resulting in a 4-dimensional correlation map of size $h\times w\times h\times w$ as follows:

[TABLE]

where we denote by $f^{s}({\bf{p}})$ and $f^{t}({\bf{q}})$ $d$ -dimensional features at positions ${\bf{p}}=(p_{x},p_{y})$ and ${\bf{q}}=(q_{x},q_{y})$ in the source and target images, respectively.

Kernel soft argmax layer. We can assign the best matches by applying the argmax function over a 2-dimensional correlation map $c_{\bf{p}}({\bf{q}})=c({\bf{p}},{\bf{q}})$ , w.r.t all features $f^{t}({\bf{q}})$ at each spatial location ${\bf{p}}$ . However, the argmax is discrete and not differentiable. The soft argmax [18, 25] computes an output by a weighted average of all spatial positions with corresponding matching probabilities. Although it is differentiable and enables fine-grained localization at a sub-pixel level, the output is influenced by all spatial positions, which is problematic especially in the case of multi-modal distributions.

We introduce a hybrid version, the kernel soft argmax, that takes advantage of both the soft and discrete argmax. Concretely, it computes correspondences $\phi({\bf{p}})$ for individual locations ${\bf{p}}$ as an average of all coordinate pairs ${\bf{q}}=(q_{x},q_{y})$ weighted by a matching probability $m_{\bf{p}}({\bf{q}})$ as follows.

[TABLE]

The matching probability $m_{\bf{p}}$ is computed by applying a spatial softmax function to a L2-normalized version $n_{\bf{p}}$ of the correlation map $c_{\bf{p}}$ :

[TABLE]

where $k_{\bf{p}}$ is a 2-dimensional Gaussian kernel centered on the position, computed by applying the discrete argmax to $n_{\bf{p}}$ 111At training time, we compute the kernel $k_{\bf{p}}$ every iterations and no gradients are propagated through the discrete argmax, making the matching probability $m_{\bf{p}}$ differentiable.. That is, we perform element-wise multiplication between the score map $n_{\bf{p}}$ and kernel $k_{\bf{p}}$ , and then apply the softmax function. This retains the scores $n_{\bf{p}}$ near the output of the discrete argmax while suppressing others, having the effect of restricting the range of averaging in (2) and making it less susceptible to multi-modal distributions (e.g., from ambiguous matches in background clutter and repetitive patterns) while maintaining differentiability. $\beta$ is a “temperature" parameter adjusting a distribution of the softmax output. Note that as it becomes larger, the softmax function approaches the discrete one with one clear peak, but this may cause an unstable gradient flow at training time. Different from [18, 25], we perform L2 normalization on the 2-dimensional correlation map $c_{\bf{p}}$ , adjusting the matching scores $f^{s}({\bf{p}})^{\top}f^{t}({\bf{q}})$ to a common scale before applying the softmax function. Note that the normalization is particularly important for semantic alignment methods [40, 41, 42, 23] (see, for example, Table 2 in [40]) but for different reasons. It penalizes features having multiple highly-correlated matches, boosting the scores of discriminative matches.

3.2 Loss

We exploit binary foreground masks as a supervisory signal to train the network, which gives a strong object prior. To this end, we define three losses that guide the network to learn object-aware correspondences without pixel-level ground truth as

[TABLE]

which consists of mask consistency $\mathcal{L}_{\mathrm{mask}}$ , flow consistency $\mathcal{L}_{\mathrm{flow}}$ and smoothness $\mathcal{L}_{\mathrm{smooth}}$ terms, balanced by the weight parameters ( $\lambda_{\mathrm{mask}}$ , $\lambda_{\mathrm{flow}}$ , $\lambda_{\mathrm{smooth}}$ ). In the following, we describe each term in detail.

Mask consistency loss. We define a flow field $\mathcal{F}^{s}$ from source to target images as

[TABLE]

Similarly, a flow field $\mathcal{F}^{t}$ from target to source images are defined as $\phi({\bf{q}})-{\bf{q}}$ . We denote by $M^{s}$ and $M^{t}$ binary masks of source and target images, respectively. The values of 0 and 1 in the masks indicate background and foreground regions, respectively. We assume that reconstructing foreground/background masks by feature matching requires computing reliable similarities between features and dense correspondences of a high quality. To implement this idea, we transfer the target mask $M^{t}$ by warping [22] using the flow field $\mathcal{F}^{s}$ and obtain an estimate of the source mask $\hat{M}^{s}$ as follows.

[TABLE]

Here, we denote by $\mathcal{W}$ a warping operator using a flow field, e.g., $\mathcal{W}(M^{t};\mathcal{F}^{s})({\bf{p}})=M^{t}({\bf{p}}+\mathcal{F}^{s}({\bf{p}}))$ . We then compute the difference between the source mask $M^{s}$ and its estimate $\hat{M}^{s}$ . Similarly, we reconstruct the target mask $\hat{M}^{t}$ from $M^{s}$ using the field $\mathcal{F}^{t}$ and compute its difference from $M^{t}$ . Accordingly, we define the mask consistency loss as

[TABLE]

where $|N^{i}|$ is the number of pixels in the mask $M^{i}$ . Although the mask consistency loss does not enforce not aligning the background with anything, it prevents matches from foreground to background regions and vice versa by penalizing them. This encourages correspondences to be established between features within foreground masks and background masks, guiding our model to learn object-aware correspondences. Note that the mask consistency loss does not restrict a many-to-one matching. That is, it does not penalize a case when many foreground features in an image are matched to a single one in other image, since binary masks do not give a positional certainty of correspondences.

Flow consistency loss. A flow consistency loss measures consistency between flow fields $\mathcal{F}^{s}$ and $\mathcal{F}^{t}$ within foreground masks defined as

[TABLE]

where $|N^{i}_{F}|$ is the number of foreground pixels in the mask $M^{i}$ , and

[TABLE]

which aligns the flow field $\mathcal{F}^{t}$ with respect to $\mathcal{F}^{s}$ by warping. $\hat{\mathcal{F}}^{t}$ is computed similar to (9). We denote by $\left\lVert\cdot\right\rVert_{2}$ and $\odot$ the L2 norm and element-wise multiplication, respectively. The multiplication is applied separately for each $x$ and $y$ component. The flow consistency term favors a one-to-one matching, spreading flow fields over foreground regions and alleviating the many-to-one matching problem in the mask consistency loss. For example, when the flow fields are consistent with each other, $\mathcal{F}^{s}$ and $\hat{\mathcal{F}}^{s}$ have the same magnitude with opposite directions. Similar ideas have been explored in stereo matching [49, 13] and optical flow [33, 52], but without considering appearance and shape variations. It is hard to incorporate this term in current semantic flow methods based on CNNs [6, 16, 27] mainly due to a lack of differentiability of the flow field. Recently, Zhou et al. [51] exploit cycle consistency between flow fields, but they regress correspondences directly from concatenated features from source and target images and do not consider background clutter. In contrast, our method establishes a differentiable flow field by computing feature similarities explicitly while considering background clutter.

Smoothness loss. The differentiable flow field also allows to exploit a smoothness loss, which has been widely used in classical energy-based approaches [30, 26, 20]. We define a smoothness loss using the first-order derivative of the flow fields $\mathcal{F}^{s}$ and $\mathcal{F}^{t}$ as

[TABLE]

where $\left\lVert\cdot\right\rVert_{1}$ and $\nabla$ are the L1 norm and the gradient operator, respectively. This regularizes (or smooths) flow fields within foreground regions while not accounting for correspondences at background.

4 Experiments

In this section we present a detailed analysis and evaluation of our approach including ablation studies on different losses and network architectures.

4.1 Implementation details

Following [41, 42], we use CNN features from ResNet-101 [17] trained for ImageNet classification [9]. Specifically, we use the networks cropped at conv4-23 and conv5-3 layers, respectively. This results in two feature maps of size $20\times 20\times 1024$ and $10\times 10\times 2048$ , respectively, for a pair of input images of size $320\times 320$ , which gives a good compromise between localization accuracy and high-level semantics. Adaptation layers are trained with random initialization, separately for each feature map in a residual fashion [17]. To compute residuals, we add $5\times 5$ and $3\times 3$ convolutional layers with padding on top of conv4-23 and conv5-3, respectively, with batch normalization [21] and the ReLU [29]. The residuals are then added to the corresponding input features. With the resulting two feature maps of size $20\times 20\times 1024$ and $20\times 20\times 2048$ 222We upsample the features adapted from conv5-3 using bilinear interpolation., we compute pairwise match scores and then combine them by element-wise multiplication, resulting in a correlation map of size $20\times 20\times 20\times 20$ . We do not finetune the whole network due to a lack of training data, and train adaptation layers only. We empirically set the temperature parameter $\beta$ to 50 and standard deviation $\sigma$ of Gaussian kernel $k_{\bf{p}}$ to 5. Other parameters for losses are fixed to all experiments ( $\lambda_{\mathrm{mask}}=3$ , $\lambda_{\mathrm{flow}}=16$ , $\lambda_{\mathrm{smooth}}=0.5$ ). We use a grid search to set these parameters, and choose the ones that give the best performance on the validation split of the PF-PASCAL dataset [15, 41]. At test time, we upsample a flow field of size $20\times 20$ using bilinear interpolation.

4.2 Training

Training our network requires pairs of foreground masks for source and target images depicting different instances of the same object category. Although the TSS [44] and Caltech-101 [12] datasets provide such pairs, the number of masks is not enough to train our network [44] or there is a lack of background clutter [12]. Our model trained with these datasets suffers from a overfitting problem or may not generalize well for other images containing clutter. Motivated by [24, 40, 42, 34], we generate pairs of source and target images synthetically from single images by applying random affine transformations and use the synthetically warped pairs as training samples. Corresponding foreground masks are also transformed with the same transformation parameters. Contrary to [24, 40, 42, 34], our model does not perform a parametric regression, and thus it does not require ground-truth transformation parameters for training. We use the Pascal VOC 2012 segmentation dataset [11] that consists of 1,464, 1,449, and 1,456 images for training, validation and test, respectively. We exclude 122 images from train/validation sets that overlap with the test split in the PF-PASCAL [15], and train our model with the corresponding 2,791 images. We augment the training dataset by horizontal flipping and color jittering. Note that we do not use segmentation masks, provided by the Pascal VOC 2012 dataset, that specify the class of the object at each pixel. We instead generate binary foreground masks using all labeled objects, regardless of image categories and the number of object, at training time. We train our model with a batch size of 16 about 7k iterations, giving roughly 40 epochs over the training data. We use the Adam optimizer [28] with $\beta_{1}=0.9$ and $\beta_{2}=0.999$ . A learning rate initially set to 3e-5 is divided by 5 after 30 epochs. All networks are trained end-to-end using PyTorch[37].

4.3 Results

We compare our model to the state of the art on semantic correspondence including hand-crafted and CNN-based methods with the following three benchmark datasets: PF-WILLOW [14], PF-PASCAL [15], and Caltech-101 [12]. The results for all comparisons have been obtained from the source code or models provided by the authors.

PF-WILLOW & PF-PASCAL. The PF-WILLOW [14] and PF-PASCAL [15] datasets provide 900 and 1,351 image pairs of 4 and 20 image categories, respectively, with corresponding ground-truth object bounding boxes and keypoint annotations. These benchmarks are more challenging than other datasets [12, 44] for semantic correspondence evaluation, featuring different instances of the same object class in the presence of large changes in appearance and scene layout, clutter and scale changes between objects. To evaluate our model, we use the PF-WILLOW and the test split of the PF-PASCAL provided by [16, 41] corresponding roughly 900 and 300 image pairs, respectively. We use the probability of correct keypoint (PCK) [48] to measure the precision of overall assignment, particularly at sparse keypoints of semantic relevance. We compute the Euclidean distances between warped keypoints using an estimated dense flow and ground truth, and count the number of keypoints whose distances lie within $\alpha\text{max}(h,w)$ pixels, where $\alpha=0.1$ and $h$ and $w$ are the height and width of the object bounding box, respectively.

We show in Table 1 the average PCK scores for the PF-WILLOW and PF-PASCAL datasets, and compare our method with the state of the art including hand-crafted [39, 10, 30, 26, 15] and CNN-based methods [40, 42, 41, 27]. The PCK scores in [40, 42, 41] are obtained by the provided models (affine + TPS). All other numbers are taken from [15, 42]. From this table, we observe four things: (1) Our model outperforms the state of the art by a significant margin in terms of the PCK especially for the PF-PASCAL datasets. In particular, it shows better performance than other object-aware methods [15, 27] that focus on establishing region correspondences between prominent objects. A plausible explanation is that establishing correspondences between object proposals is susceptible to shape deformations. (2) We can clearly see that our model gives better results than semantic alignment methods [40, 42, 41] on both datasets, but performance gain for the PF-PASCAL dataset, which typically contains pictures depicting a non-rigid deformation and clutter (e.g., in cat and person classes), is more significant. For example, the PCK gain over WS-SA [41] for the PF-PASCAL (0.79 vs. 0.72) is about two times more than that for the PF-WILLOW (0.74 vs. 0.71), indicating that our semantic flow method is more robust to non-rigid deformations and background clutter than semantic alignment approaches. (3) By comparing our model with a CNN-based semantic flow method [27], we can see that involving a spatial regularizer is significant. It focuses on designing fidelity terms (e.g., using a contrastive loss [6]) only to learn a feature space preserving semantic similarities. This is because of a lack of differentiability of the flow field. In contrast, our model gives a differentiable flow field, allowing to exploit a spatial regularizer while further leveraging high-level semantics from CNN features more specific to semantic correspondence. (4) We confirm once more a finding in [31] that CNN features trained for ImageNet classification [9] clearly show the better ability to handle intra-class variations than hand-crafted ones such as SIFT [32] and HOG [7].

Caltech-101. The Caltech-101 [12] dataset, originally introduced for image classification, provides pictures of 101 image categories with ground-truth object masks. Unlike the PF [14, 15] and TSS [44] datasets, it does not provide ground-truth keypoint annotations. For fair comparison, we use 15 image pairs, provided by [16, 41], for each object category, and use the corresponding 1,515 image pairs for evaluation. Following the experimental protocol in [26], we compute matching accuracy with two metrics using the ground-truth masks: Label transfer accuracy (LT-ACC) and the intersection-over-union (IoU) metric. Both metrics count the number of correctly labeled pixels between ground-truth and transformed masks using dense correspondences, where the LT-ACC evaluates the overall matching quality while the IoU metric focusses more on foreground objects. Following [42, 41], we exclude the LOC-ERR metric, since it measures the localization error of correspondences using object bounding boxes due to a lack of keypoint annotations, which does not cover rotations, affine or deformable transformations. The LT-ACC and IoU comparisons on the Caltech-101 dataset are shown in Table 2. Although this dataset provides ground-truth object masks, we do not retrain or fine-tune our model to evaluate its generalization ability on other datasets. From this table, we can see that (1) our model generalizes better than other CNN-based methods for other images outside the training dataset; and (2) it outperforms the state of the art in terms of the LT-ACC and IoU, verifying once more that our model focuses on regions containing objects while filtering out background clutter, even without using object proposals [15, 16, 46, 27] or an inlier counting [41].

Qualitative comparison. Figure 3 shows a visual comparison of alignment results between source and target images with the state of the art on the test split of the PF-PASCAL dataset [15]. We can see that our method is robust to a local non-rigid deformation (e.g., bird’s beaks and horse’s legs in the first two rows), scale changes between objects (e.g., front wheels in the third row), and clutter (e.g., wheels in the last row). In particular, the fourth example clearly shows that our method gives more discriminative correspondences, cutting off matches for non-common objects. For example, it does not establish correspondences between a person and background regions in the source and target images, respectively, while others fail to cut off matches on these regions. We can also see that all methods do not handle occlusion (e.g., a bicycle saddle in the last row).

4.4 Ablation study

We show an ablation analysis on different components and losses in our model. We measure PCK scores with height and width of the bounding box size, and report the results on the test split of PF-PASCAL dataset [15, 16, 41].

Training loss. We show the average PCK for three variants of our model in Table 3. The mask consistency term encourages establishing correspondences between prominent objects. Our model trained with this term only, however, may not yield spatially distinctive correspondences, resulting in the worst performance. A flow consistency term, which spreads flow fields over foreground regions, overcomes this problem, but it does not differentiate correspondences between background and objects. Accordingly, these two terms are complementary each other and exploiting both significantly boosts the performance of our model from 0.675/0.718 to 0.782, already outperforming the state of the art by a large margin (see Table 1). An additional smoothness term further boosts performance to 0.787.

Network architecture. Table 4 compares the performance of networks with different components in terms of the average PCK. The baseline models in the first three rows compute matching scores using both features from conv4-23 and conv5-3, and estimate correspondences with different argmax operators. They do not involve any training similar to [31] that uses off-the-shelf CNN features for semantic correspondence. We can see that applying the soft argmax directly to the baseline model degrades performance severely, since it is highly susceptible to multi-modal distributions. The results in the next three rows are obtained with a single adaptation layer on top of conv4-23. This demonstrates that the adaptation layer extracts features more adequate for pixel-wise semantic correspondences, boosting performance of all baseline models significantly. Particularly, we can see that the kernel soft argmax outperforms others by a large margin, since it enables training our model end-to-end including adaptation layers at a sub-pixel level and is less susceptible to multi-modal distributions. The last three rows suggest that exploiting deeper level of features is important, and using all components with the kernel soft argmax performs best in terms of the average PCK.

5 Conclusion

We have presented a CNN model for learning an object-aware semantic flow end-to-end, and introduced the corresponding CNN architecture, dubbed SFNet, with a novel kernel soft argmax layer that outputs differential matches at a sub-pixel level. We have proposed to use binary foreground masks directly to train a model for learning pixel-to-pixel correspondences that are widely available and can be obtained easily compared to pixel-level annotations. The ablation studies clearly demonstrate the effectiveness of each component and loss in our model. Finally, we have shown that the proposed method is robust to distracting details and focuses on establishing dense correspondences between prominent objects, outperforming the state of the art on standard benchmarks by a significant margin.

Acknowledgments. This work was supported in part by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIP) (No. 2017R1C1B2005584), the Louis Vuitton/ENS chair on artificial intelligence and the NYU/Inria collaboration agreement.

Bibliography52

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1[1] Connelly Barnes, Eli Shechtman, Adam Finkelstein, and Dan B Goldman. Patch Match: A randomized correspondence algorithm for structural image editing. ACM TOG , 28(3), 2009.
2[2] Lubomir Bourdev and Jitendra Malik. Poselets: Body part detectors trained using 3d human pose annotations. In ICCV , 2009.
3[3] Hilton Bristow, Jack Valmadre, and Simon Lucey. Dense semantic correspondence where every pixel is a classifier. In ICCV , 2015.
4[4] Thomas Brox, Christoph Bregler, and Jitendra Malik. Large displacement optical flow. In CVPR , 2009.
5[5] Thomas Brox, Andrés Bruhn, Nils Papenberg, and Joachim Weickert. High accuracy optical flow estimation based on a theory for warping. In ECCV , 2004.
6[6] Christopher B Choy, Jun Young Gwak, Silvio Savarese, and Manmohan Chandraker. Universal correspondence network. In NIPS , 2016.
7[7] Navneet Dalal and Bill Triggs. Histograms of oriented gradients for human detection. In CVPR , 2005.
8[8] Kevin Dale, Micah K Johnson, Kalyan Sunkavalli, Wojciech Matusik, and Hanspeter Pfister. Image restoration using online photo collections. In ICCV , 2009.