On guiding video object segmentation

Diego Ortego; Kevin McGuinness; Juan C. SanMiguel; Eric Arazo; Jos\'e; M. Mart\'inez; Noel E. O'Connor

arXiv:1904.11256·cs.CV·April 26, 2019

On guiding video object segmentation

Diego Ortego, Kevin McGuinness, Juan C. SanMiguel, Eric Arazo, Jos\'e, M. Mart\'inez, Noel E. O'Connor

PDF

TL;DR

This paper introduces a guided convolutional neural network approach for video object segmentation that leverages foreground masks from existing algorithms to improve segmentation accuracy in challenging environments.

Contribution

It proposes a novel guided CNN framework that uses external foreground masks to enhance segmentation, combining color and motion features for better separation.

Findings

01

Outperforms non-guided segmentation methods on DAVIS 2016 dataset.

02

Effectively integrates color and optical flow features for segmentation.

03

Achieves state-of-the-art results in challenging environments.

Abstract

This paper presents a novel approach for segmenting moving objects in unconstrained environments using guided convolutional neural networks. This guiding process relies on foreground masks from independent algorithms (i.e. state-of-the-art algorithms) to implement an attention mechanism that incorporates the spatial location of foreground and background to compute their separated representations. Our approach initially extracts two kinds of features for each frame using colour and optical flow information. Such features are combined following a multiplicative scheme to benefit from their complementarity. These unified colour and motion features are later processed to obtain the separated foreground and background representations. Then, both independent representations are concatenated and decoded to perform foreground segmentation. Experiments conducted on the challenging DAVIS 2016…

Figures3

Click any figure to enlarge with its caption.

Tables1

Table 1. TABLE I: Comparative results for DAVIS 2016. (*) indicates the proposed approach using the corresponding foreground mask. ↑ ( ↓ ) ↑ absent ↓ \uparrow\left(\downarrow\right) means higher (lower) is better. Bold denotes better performance for our approach.

Algorithms	$𝒥 (↑)$		$ℱ (↑)$		$𝒯 (↓)$
Algorithms	mean	std.	mean	std.	mean	std.
ARP	.7609	.1125	.7051	.1111	.5363	.3446
ARP*	.8069	.0631	.8121	.0799	.3930	.2675
CTN	.7304	.1048	.6886	.1147	.3682	.2176
CTN*	.7933	.0627	.7996	.0775	.3969	2762
FSEG	.7068	.0804	.6524	.1036	.4456	.2809
FSEG*	.7834	.0673	.7950	.0773	.3938	.2566
MSK	.7955	.0817	.7519	.0929	.3226	.1978
MSK*	.8012	.0718	.8097	.0821	.3805	.2660
OFL	.6738	.1231	.6279	.1330	.3608	.2123
OFL*	.7844	.0764	.7926	.0933	.3898	.2805
VPN	.6984	.0902	.6513	.1002	.4923	.2630
VPN*	.7765	.0709	.7920	.0792	.3918	.2551

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Full text

On guiding video object segmentation

Diego Ortego

Insight Centre for Data Analytics

Dublin City University (DCU)

Dublin (Ireland)

[email protected]

Kevin McGuinness

Insight Centre for Data Analytics

Dublin City University (DCU)

Dublin (Ireland)

[email protected]

Juan C. SanMiguel

VPULab

Universidad Autónoma de Madrid (UAM)

Madrid (Spain)

[email protected]

Eric Arazo

Insight Centre for Data Analytics

Dublin City University (DCU)

Dublin (Ireland)

[email protected]

José M. Martínez

VPULab

Universidad Autónoma de Madrid (UAM)

Madrid (Spain)

[email protected]

Noel E. O’Connor

Insight Centre for Data Analytics

Dublin City University (DCU)

Dublin (Ireland)

[email protected]

Abstract

This paper presents a novel approach for segmenting moving objects in unconstrained environments using guided convolutional neural networks. This guiding process relies on foreground masks from independent algorithms (i.e. state-of-the-art algorithms) to implement an attention mechanism that incorporates the spatial location of foreground and background to compute their separated representations. Our approach initially extracts two kinds of features for each frame using colour and optical flow information. Such features are combined following a multiplicative scheme to benefit from their complementarity. These unified colour and motion features are later processed to obtain the separated foreground and background representations. Then, both independent representations are concatenated and decoded to perform foreground segmentation. Experiments conducted on the challenging DAVIS 2016 dataset demonstrate that our guided representations not only outperform non-guided, but also recent and top-performing video object segmentation algorithms.

Index Terms:

Video object segmentation, foreground segmentation, attention, deep learning.

I Introduction

Segmenting an image into regions is key for identifying objects of interest. For example, image segmentation [1] clusters image pixels with common properties (e.g. colour, textures, etc), while semantic segmentation [2] categorises each pixel into a set of predefined classes. Foreground segmentation is a particular case of semantic segmentation with two categories: foreground and background. The former contains the objects of interest in an image, which may correspond to salient objects [3][4], generic objects [5][6], moving objects [7], spatio-temporal relevant patterns [8] or even weak labels [9]. The latter consists on the non-relevant data, being usually the static scene objects. Moreover, foreground segmentation in unconstrained environments is known as video object segmentation (VOS) and faces many challenges related to camera motion, shape deformations, occlusions or motion blur [10].

VOS has become an active research area, as demonstrated by the widespread use of the DAVIS benchmark [11]. Existing algorithms are supervised, semi-supervised and unsupervised. Supervised approaches employ frame-by-frame human intervention [12] whereas semi-supervised ones only require initialization (e.g. annotations of the objects to segment in the first frame) [13][14]. Conversely, unsupervised VOS does not involve human intervention, requiring the automatic detection of relevant moving objects [15].

Currently deep learning is significantly advancing computer vision performance [16] such as for VOS, where state-of-the-art approaches employ convolutional neural networks [7][17]. Performance improvement can be achieved by increasing the complexity of the network [18][19], but also by learning better models without requiring new architectures. For instance, loss function variations [20], transfer learning [21], data augmentation [22] or applying spatial attention [23] are techniques widely explored. In particular, spatial attention can be used to highlight activations in feature maps of the network, thus enabling training of more accurate models. In [23], attention is extracted from convolutional features from a semantic segmentation network to promote those activations that do not respond to several classes, but to only one. Also, [24] generates attention maps responding to visual patterns in different scales and levels from convolutional features, thus capturing different semantic patterns such as body parts, objects, or background that improve pedestrian attribute recognition. Furthermore, [25] uses three attentions maps (general, residual and channel attentions) extracted from convolutional features to weight the cross-correlation of a fully-convolutional siamese tracker that adapts the offline learned model to the online tracked targets. Additionally, content-based image retrieval [26] can be enhanced by using visual attention models to weight the contribution of the activations from different spatial regions.

This paper proposes a novel approach to improving the performance of an unsupervised VOS algorithm by using an independent foreground segmentation (i.e. the mask from an existing algorithm in the literature) to guide the segmentation process by focusing on relevant activations. First, given a video frame and its associated optical flow, two networks compute appearance and motion feature maps. Second, both feature maps are unified to exploit their complementarity. Third, the foreground mask of an independent algorithm is used to encode foreground and background information. Finally, both representations are concatenated and decoded to produce a foreground mask. We validate the proposed approach in the recent DAVIS 2016 dataset [11], demonstrating the improvements achieved by our approach, whose novelty lies in the use of an independent foreground mask to separate foreground and background representations and learn a better foreground segmentation model.

The remainder of this paper is organized as follows: Section II overviews the proposed approach whereas Section III and IV describe the proposed algorithm and the experiments performed. Finally, Section V concludes this paper.

II Algorithm overview

VOS can be formulated as a pixel-wise labelling of each video frame $\mathbb{I}$ as either foreground (1) or background (0), thus generating a foreground segmentation mask $\mathcal{S}$ . We propose an approach based on convolutional neural networks (CNNs) that uses a frame $\mathbb{I}$ together with its corresponding optical flow $\mathbb{O}$ and the foreground mask of an independent algorithm $\mathcal{S}^{e}$ (from now on, the independent foreground mask), to compute a foreground segmentation $\mathcal{S}$ (see Figure 1). Our CNN-VOS approach starts with an appearance network computing a convolutional feature map $\mathbb{A}$ from a video frame $\mathbb{I}$ and a motion network generating a convolutional feature map $\mathbb{M}$ from the optical flow $\mathbb{O}$ associated to $\mathbb{I}$ . Both feature maps are then combined by element-wise multiplication to obtain a unique representation $\mathbb{R}$ comprising both appearance and motion information. Subsequently, the independent foreground mask $\mathcal{S}^{e}$ is used to weight $\mathbb{R}$ and separately encode foreground $\mathbb{F}$ and background $\mathbb{B}$ information based on the previously computed feature maps, which are processed by two independent networks. Finally, the output of the two sub-networks is concatenated and decoded with a convolutional network to produce the foreground mask $\mathcal{S}$ .

III Algorithm description

III-A Appearance network

The appearance network learns a model $\mathbb{A}$ from spatial distribution of the colour information of a video frame $\mathbb{I}$ . In particular, we use PSPNet [2], a fully-convolutional neural network for semantic segmentation which relies on ResNet [19] followed by a hierarchical context module (Pyramid Pooling Module) that harvest information from different scales. This module processes different downsamplings of ResNet features and later upsamples them for concatenating all features to form an improved representation that includes the original ResNet features. In particular, we use ResNet-101 to obtain $\mathbb{A}$ by extracting the 512 feature maps obtained after the convolutional layer that follows the Pyramid Pooling Module, which contains both local and global representations.

III-B Motion network

In contrast to the appearance network, the motion network obtains a model $\mathbb{M}$ again making use of PSPNet, but in this case from the optical flow $\mathbb{O}$ . In particular, we convert the optical flow vectors [27] into a 3-channel colour-coded optical flow image $\mathbb{O}$ [15] and train PSPNet to produce a foreground segmentation from the colour-coded optical flow $\mathbb{O}$ . In order to get the model $\mathbb{M}$ , we select the 512 feature maps at the same layer as done in the appearance network.

III-C Unified representation

The intuition behind using two networks with independent inputs (appearance and motion) is to benefit from the complementary of appearance and motion exhibited by moving objects [28]. Therefore, we exploit such complementarity by combining both feature maps following a multiplicative fusion (see the Combination stage in Figure 1) similarly to [29], where the authors multiply different sources of CNNs (appearance and motion) to enable the amplification or suppression of feature activations based on their agreement. In our approach, this fusion consists of a $1\times 1$ convolution applied to both the appearance features $\mathbb{A}$ and the motion features $\mathbb{M}$ followed by an element-wise multiplication of both sets of feature maps (which have the same dimensionality) to produce the unified encoding $\mathbb{R}$ . Applying the $1\times 1$ convolutions helps to control the dimensionality while learning how to combine the feature maps before their multiplicative combination.

III-D Foreground and background encoding

Unlike many VOS literature which jointly process all features [15], we introduce an attention mechanism that splits the feature maps into foreground and background to better guide the learning process and obtain a better VOS model (see FG-BG encoding stage in Figure 1). In particular, we use an foreground mask $\mathcal{S}^{e}$ (obtained from a state-of-the-art algorithm) which is downsampled to match the size of $\mathbb{R}$ . Then, we split $\mathbb{R}$ into foreground $\mathbb{F}=\mathbb{R}\times\mathbb{S}^{e}$ and background $\mathbb{B}=\mathcal{\mathbb{R}}\times\left(1-\mathbb{S}^{e}\right)$ representations according to the guidance provided by the independent foreground $\mathcal{S}^{e}$ . On the one hand, multiplying the feature maps $\mathbb{R}$ by the independent foreground $\mathcal{S}^{e}$ helps focusing on important spatial areas of the features by zeroing responses associated to background areas in $\mathbb{F}$ , thus implementing an attention mechanism. On the other hand, maintaining information from the background regions through $\mathbb{B}$ helps in the segmentation process, which may be useful as it assures a background representation is maintained, especially when errors in $\mathcal{S}^{e}$ lead to the suppression of important foreground responses. Subsequently, $\mathbb{F}$ and $\mathbb{B}$ are individually processed by four convolutional layers to separately model foreground and background representations before concatenating them as presented in the FG-BG encoding stage in Figure 1. Finally, this set of concatenated foreground and background representations contain a high-level joint encoding of appearance and motion that is fed to four additional convolutional layers for decoding in order to compute the final prediction (see Decoding stage in Figure 1), i.e. the foreground segmentation mask $\mathcal{S}$ .

III-E Architecture details

Our architecture is the fully-convolutional network shown in Figure 1. After the appearance and motion networks (both using the PSPNet architecture), there are two convolutional layers that use 512 $1\times 1$ kernels to process the feature maps $\mathbb{A}$ and $\mathbb{M}$ (both have 1/8 the original image resolution). Then, the multiplication of $\mathcal{S}^{e}$ is preceded by a downsampling performed through average pooling with stride 8. Regarding the processing of $\mathbb{F}$ and $\mathbb{B}$ : before concatenating both we use four convolutional layers, three with depth 256 followed by batch normalization and ReLU and the final one with depth 128 to reduce the number of feature maps. As it is well known that reducing the feature map resolution may result in coarse predictions [30], we use dilated convolutions (1, 2, 4 and 8 depths) to aggregate context [31] while preserving the already reduced resolution (1/8). Finally, we apply four convolutional layers with dilated convolutions (128, 64, 64 and 1 depths). We use $3\times 3$ kernels throughout (unless otherwise stated).

IV Experimental work

IV-A Datasets and metrics

We evaluate the proposed approach on the DAVIS 2016 dataset [11] which covers many challenges of unconstrained VOS. We use all available test videos (20) whose length ranges from 40 to 100 frames with $854\times 480$ resolution. A unique foreground object is annotated in each frame. We use three standard performance measures [11] to account for region similarity (Jaccard index $\mathcal{J}$ , i.e. intersection-over-union ratio between the segmented foreground mask and the ground-truth mask), for contour accuracy (F-score $\mathcal{F}$ between contour pixels of the segmented and the ground-truth masks) and for temporal stability $\mathcal{T}$ of foreground masks (by associating temporal smooth and precise transformations to good foreground segmentations). These measures are computed frame-by-frame and then averaged to compute a sequence-level score, which are also averaged to get dataset-level performance. Note that we do not consider DAVIS 2017 dataset as it is semi-supervised whereas our approach is unsupervised.

IV-B Implementation details

The training procedure consists of three stages. First, we train the appearance network to perform semantic segmentation on the PASCAL VOC 2012 dataset [32] using a total of 10582 training images. Second, we train the motion network to perform foreground segmentation based on optical flow data [27] using the annotations provided by [15] for 84929 frames of ImageNet-Video dataset [33]. Third, we freeze both the appearance and the motion networks and train the rest of the network on 22 of the 30 training video sequences of DAVIS 2016 dataset [11] (we use 8 sequences for validation). For this last step, masks of an independent algorithm are needed, so we use available foreground masks in DAVIS 2016111https://davischallenge.org/davis2016/soa_compare.html (the 13 algorithms evaluated in [11] and the algorithm proposed in [34]). We train the appearance and the motion networks for 30k and 90k iterations, respectively, using the “poly” learning rate policy described in [2]. However, for the third step we train the network for 20 epochs reducing the learning rate by ten each five epochs (starting with the value 0.1) and we select the best model using the performance in the validation set. We use batch size 8, data augmentation (random Gaussian blur, sized crops, rotations and horizontal flips), cross-entropy loss and Kaiming weight initialization [35] in all training steps.

IV-C Evaluation

We compare our approach against 6 recent and top-performing VOS alternatives (both unsupervised and semi-supervised): ARP [7], CTN [14], FSEG [15], MSK [17], OFL [36] and VPN [37]. To provide a fair comparison, we also use the segmentation masks of these alternatives for our proposal (indicated by ). Table I presents the average performance for the 20 test sequences of DAVIS 2016 in terms of Jaccard index $\mathcal{J}$ , contour F-score $\mathcal{F}$ and temporal stability $\mathcal{T}$ . By comparing the independent foreground masks and the proposed approach (), we outperform all of them in terms of $\mathcal{J}$ and $\mathcal{F}$ . Regarding the temporal stability $\mathcal{T}$ , we improve for 3 of the 6 algorithms. The reason $\mathcal{T}$ is not improved for CTN, MSK, and OFL is that these are implicitly focused on transferring the same segmentation frame to frame, thus resulting in an inherent temporal stability that is difficult to improve upon from and unsupervised perspective. The improvements achieved are due to both the potential of our learned representations and the guiding mechanism introduced through the independent masks. As it is not possible to separate both facts from the results in Table I, we perform an additional experiment to highlight the potential of the learned representations. In particular, we have repeated the third step of the training process (see Subsection IV-B) without making use of independent foreground masks. We explore two non-guided alternatives, the first one implies removing the FG-BG encoding (see Figure 1), thus passing the representation $\mathbb{R}$ directly to the decoder (NG1 alternative). The second alternative consists in using a dummy foreground mask where all pixels contain foreground (i.e. $\mathbb{F}$ and $\mathbb{B}$ contain the same information), thus avoiding the use of independent foreground masks to guide the segmentation while using a network (NG2 alternative) with the same number of parameters as the one proposed. Figure 2 presents the performance in terms of $\mathcal{J}$ and $\mathcal{F}$ for this two non-guided alternatives (NG1 and NG2), demonstrating that non-guided representations perform worse than the guided ones, i.e. those learnt with separate foreground-background representation.

Furthermore, directly comparing NG1 and NG2 performance (around $\mathcal{J}=0.76$ ) with the selected alternatives in Table I shows that we obtain competitive results (only MSK with $\mathcal{J}=0.7955$ is better than NG1 and NG2).

Figure 3 gives some examples of foreground segmentations. The first column presents an example where the proposed mask (green) is capable of solving the mistakes of the non-guided representations (red) due to the guiding introduced by an independent algorithm (blue). The second and third columns show examples where the proposed approach (green) outperforms the independent algorithm (blue) due to the guiding scheme over representations that, without guiding (red), were lacking of object parts or introducing false positives.

V Conclusions

This paper proposes a VOS approach which takes advantage of independent (i.e. state-of-the-art) algorithm results to guide the segmentation process. This strategy enables learning an enhanced colour and motion representation for VOS due to specific attention on foreground and background classes. The experimental work validates the utility of our approach to improve foreground segmentation performance. Future work will explore the use of feedback strategies to induce the foreground-background separation from the produced result and not from independent algorithms.

Acknowledgement

This work was partially supported by the Spanish Government (MobiNetVideo TEC2017-88169-R), “Proyectos de cooperación interuniversitaria UAM-BANCO SANTANDER con Europa (Red Yerun)” (2017/YERUN/02 (SOFDL) and Science Foundation Ireland (SFI) under grant numbers SFI/12/RC/2289 and SFI/15/SIRG/3283. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research.

Bibliography37

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1[1] R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and S. Sï¿œsstrunk, “SLIC Superpixels Compared to State-of-the-Art Superpixel Methods,” IEEE Transactions on Pattern Analysis and Machine Intelligence , vol. 34, no. 11, pp. 2274–2282, 2012.
2[2] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, “Pyramid Scene Parsing Network,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , 2017, pp. 6230–6239.
3[3] A. Borji, M.-M. Cheng, H. Jiang, and J. Li, “Salient Object Detection: A Benchmark,” IEEE Transactions on Image Processing , vol. 24, no. 12, pp. 5706–5722, 2015.
4[4] S. Wu, M. Nakao, and T. Matsuda, “Super Cut: Superpixel Based Foreground Extraction With Loose Bounding Boxes in One Cutting,” IEEE Signal Processing Letters , vol. 24, no. 12, pp. 1803–1807, 2017.
5[5] B. Alexe, T. Deselaers, and V. Ferrari, “Measuring the Objectness of Image Windows,” IEEE Transactions on Pattern Analysis and Machine Intelligence , vol. 34, no. 11, pp. 2189–2202, 2012.
6[6] S. Jain, B. Xiong, and K. Grauman, “Pixel Objectness,” Co RR , vol. abs/1701.05349, 2017.
7[7] Y. J. Koh and C. S. Kim, “Primary Object Segmentation in Videos Based on Region Augmentation and Reduction,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , 2017, pp. 7417–7425.
8[8] Y. J. Lee, J. Kim, and K. Grauman, “Key-segments for video object segmentation,” in Proceedings of IEEE International Conference on Computer Vision (ICCV) , 2011, pp. 1995–2002.