TL;DR
This paper introduces a fast 3D convolutional spectral clustering method for pixel-level object segmentation in videos, leveraging graph partitioning in space-time to improve accuracy and speed over existing techniques.
Contribution
The authors propose a novel 3D filtering technique that efficiently computes spectral clustering for video segmentation without explicit matrix construction, enabling fast GPU implementation.
Findings
Outperforms state-of-the-art methods on DAVIS-2016 in unsupervised and semi-supervised settings.
Achieves top results on SegTrackv2 dataset.
Significantly faster than classical power iteration methods.
Abstract
We formulate object segmentation in video as a graph partitioning problem in space and time, in which nodes are pixels and their relations form local neighborhoods. We claim that the strongest cluster in this pixel-level graph represents the salient object segmentation. We compute the main cluster using a novel and fast 3D filtering technique that finds the spectral clustering solution, namely the principal eigenvector of the graph's adjacency matrix, without building the matrix explicitly - which would be intractable. Our method is based on the power iteration for finding the principal eigenvector of a matrix, which we prove is equivalent to performing a specific set of 3D convolutions in the space-time feature volume. This allows us to avoid creating the matrix and have a fast parallel implementation on GPU. We show that our method is much faster than classical power iteration applied…
| Task | Method | Input Score (J) | SFSeg over Input (J) | Improved Videos (%) |
|---|---|---|---|---|
| Semi-supervised | OnAVOS | 86.1 | 86.3 (+0.2) | 65 |
| Semi-supervised | OSVOS-S | 85.6 | 86.0 (+0.4) | 90 |
| Semi-supervised | PReMVOS | 84.9 | 88.2 (+3.3) | 90 |
| Semi-supervised | FAVOS | 82.4 | 83.0 (+0.6) | 95 |
| Semi-supervised | OSMN | 73.9 | 75.9 (+2.0) | 95 |
| Unsupervised | COSNet | 80.5 | 80.9 (+0.4) | 65 |
| Unsupervised | MotAdapt | 77.2 | 77.5 (+0.3) | 65 |
| Unsupervised | PDB | 77.2 | 77.4 (+0.2) | 60 |
| Unsupervised | ARP | 76.2 | 77.7 (+1.5) | 90 |
| Unsupervised | LVO | 75.9 | 78.8 (+2.9) | 90 |
| Unsupervised | FSEG | 70.7 | 72.3 (+1.6) | 95 |
| Unsupervised | NLC | 55.1 | 55.6 (+0.5) | 65 |
| | Average Boost | | +1.1% | 80% |
| Method | Score (J) |
|---|---|
| LVO | 57.3 |
| FSEG | 61.4 |
| OSVOS | 65.4 |
| NLC | 67.2 |
| MaskTrack | 70.3 |
| BB + SFSeg + denseCRF (ours) | 72.7 |
| Method | DAVIS (J) | SegTrackv2 (J) |
|---|---|---|
| BB | 67.2 | 72 |
| BB + denseCRF | 68.1 | 72 |
| BB + SFSeg | 68.7 | 72.1 |
| BB + SFSeg + denseCRF | 69.2 | 72.7 |
A 3D Convolutional Approach to Spectral Object Segmentation in Space and Time
Elena Burceanu1,2 (Contact Author)
Marius Leordeanu3,4
1Bitdefender
2University of Bucharest, Romania
3Institute of Mathematics of the Romanian Academy
4University Politehnica of Bucharest, Romania
Abstract
We formulate object segmentation in video as a spectral graph clustering problem in space and time, in which nodes are pixels and their relations form local neighbourhoods. We claim that the strongest cluster in this pixel-level graph represents the salient object segmentation. We compute the main cluster using a novel and fast 3D filtering technique that finds the spectral clustering solution, namely the principal eigenvector of the graph's adjacency matrix, without building the matrix explicitly, which would be intractable. Our method is based on power iteration, which we prove is equivalent to performing a specific set of 3D convolutions in the space-time feature volume. This allows us to avoid creating the matrix and to have a fast parallel implementation on GPU. We show that our method is much faster than classical power iteration applied directly on the adjacency matrix. Unlike other works, ours is dedicated to preserving object consistency in space and time at the level of pixels. In experiments, we obtain consistent improvements over the top state-of-the-art methods on the DAVIS-2016 dataset. We also achieve top results on the well-known SegTrackv2 dataset.
1 Introduction
Elements from a video are interconnected in space and time and have an intrinsic graph structure (Fig. 1). Most existing approaches use higher-level components, such as objects, super-pixels or features, at a significantly lower resolution. Considering this graph structure in space-time explicitly, at the dense pixel level, is extremely expensive. Our proposed solution to video object segmentation, Spectral Filtering Segmentation (SFSeg), transforms an expensive eigenvalue problem, inspired by spectral clustering, into 3D convolutions on the space-time volume. This makes it fast while keeping the properties of spectral clustering. To the best of our knowledge, we are the first to propose a practical spectral clustering approach to video object segmentation at the pixel level, in space and time.
Most state-of-the-art algorithms for this task do not use the time constraint, and when they do, they take little advantage of it. Time plays a fundamental role in how objects move and change in the world, but computer vision does not yet exploit it sufficiently. Consequently, the segmentation outputs of current state-of-the-art algorithms are not always consistent over time. Our work addresses precisely this aspect, and our contribution is demonstrated through experiments on the DAVIS-2016 and SegTrackv2 datasets, on which we improve over state-of-the-art methods.
We demonstrate in experiments that the principal eigenvector of the graph's adjacency matrix is a good solution for salient object segmentation. Once our filtering-based optimization converges, the segmentation map is spatio-temporally consistent, with a smooth transition between frames: noise coming from other objects is removed and missing parts of the object are added back. Through multiple iterations, the relevant information is propagated step by step to neighbourhoods farther away in space and time, acting like a diffusion.
Our contribution is two-fold. Besides formulating the segmentation problem in video as an eigenvalue problem on the adjacency matrix of the graph in space-time, we also provide a very fast optimization algorithm that computes the required eigenvector (which represents the desired segmentation) without explicitly creating or using the huge adjacency matrix. We prove theoretically and in practice that our algorithm reaches the same solution as a standard routine for eigenvector computation. We also show in experiments that the values in the final eigenvector, with one element per video pixel, confirm the spectral clustering assumption and provide an improved soft-segmentation of the main object.
2 Related work
Most state-of-the-art methods for video object segmentation use CNN architectures pre-trained for object segmentation on other large image datasets. They have a strong image-based backbone and are not designed from scratch with both space and time dimensions in mind. Many solutions Khoreva et al. (2017) adapt image segmentation methods by adding a branch to the architecture that incorporates the time axis: a motion branch (previous frames or optical flow) or a previous-masks branch (for mask propagation). Other methods are based on one-shot learning strategies and fine-tune the model on the first video frame, followed by some post-processing refinement Maninis et al. (2018). Approaches derived from OSVOS Caelles et al. (2017) do not take the time axis into account. Our method better addresses the natural space-time relationship, which is why it is effective when combined with frame-based segmentation algorithms.
Graph representations.
Graph methods are suitable for segmentation and can have different representations, where the nodes can be pixels, super-pixels, voxels or image/video regions. Graph edges are usually undirected, modeled as symmetric similarity functions. The choice of the representation influences both accuracy and runtime. Specifically, pixel-level representations are computationally extremely expensive, making the problem intractable for high resolution videos. Our fast solution implicitly uses a pixel-level graph representation: we make a first-order Taylor approximation of the Gaussian kernel (usually used for pairwise affinities) and rewrite it as a sequence of 3D convolutions in the video directly. Thus, we get the desired outcome without explicitly working with the graph. We describe it in detail in Sec. 3.
Spectral clustering.
Computing eigenvectors of matrices extracted from data is a classic approach for clustering. There are several choices in the literature for choosing those matrices, the most popular being the Laplacian matrix Ng et al. (2001), normalized Shi and Malik (2000) or unnormalized. Other methods use the random walk matrix or directly the unnormalized adjacency matrix. Most methods are based on finding the eigenvectors corresponding to the smallest eigenvalues, while others, including our approach, require the leading eigenvectors. Graph Cuts are a popular class of spectral clustering algorithms, with many variants: normalized, average, min-max, mean cut and topological cut.
CRFs.
Discriminative graphical models Kumar and Hebert (2003) are often applied over the segmentation of images and videos (denseCRF Krähenbühl and Koltun (2011)). CRFs are effective as they incorporate the observed data both at the level of nodes as well as edges. But they have a strict probabilistic interpretation and use inference algorithms that are significantly more expensive than the simpler eigenvector power iteration that we use for optimizing our non-probabilistic objective score.
Image segmentation.
Graph cuts have been used in image segmentation Shi and Malik (2000). They are expensive in practice, as they require the computation of the eigenvectors with the smallest eigenvalues for very large Laplacian matrices. Fast graph-based algorithms for image segmentation exist, such as Felzenszwalb and Huttenlocher (2004), which is linear in the number of edges and is based on a heuristic for building the minimum spanning tree. It is still used as a starting point by current methods. Another approach Pourian et al. (2015) is to learn image regions with spectral graph partitioning and formulate segmentation as a convex optimization problem.
Video Segmentation.
Many video segmentation methods adapt existing image segmentation methods. In Yu et al. (2015) a parametric graph partitioning model over superpixels is proposed. Hierarchical graph-based segmentation over RGBD video sequences Hickson et al. (2014) also groups pixels into regions. The problem is solved using bipartite graph matching and minimizing the spanning tree. In Li et al. (2018), an efficient graph cut method is applied on a subset of pixels. To the best of our knowledge, all of the efficient methods group pixels into superpixels, regions from a grid or object proposals to handle the computational and memory burden. However, the hard initial grouping of pixels comes with a risk and could carry errors into the final solution, as it misses details available only at the original pixel resolution.
Our formulation is most related to Leordeanu and Hebert (2005); Meila and Shi (2001). Our solution is the leading eigenvector of $\mathbf{M}$ (the adjacency matrix), computed fast and stably with power iteration as explained in Sec. 3. Note that using the unnormalized adjacency matrix in combination with power iteration is the least expensive spectral approach and the only one that can be factored into simple and fast 3D convolutions. This possibility gives our algorithm efficiency and speed (Sec. 4).
3 Our approach
We formulate salient object segmentation in video as a graph partitioning problem (foreground vs background), where the graph is both spatial and temporal. Each node $i$ represents a pixel in the space-time volume, which has $N = N_f \times H \times W$ pixels. $N_f$ is the number of frames and $(H, W)$ the frame size. Each edge captures the similarity between two pixels and is defined by the pairwise function $M_{i,j}$. The pairwise connections between pixels $i$ and $j$, in space and time, are symmetric and always non-negative, defining an $N \times N$ adjacency matrix $\mathbf{M}$. We take into account only the local connections in space-time, so $\mathbf{M}$ is sparse.
Let $\mathbf{s}$ and $\mathbf{f}$ be feature vectors of size $N \times 1$ with a feature value for each node. They will be used in defining the similarity function (Eq. 1). For now we consider the simplest case, when $\mathbf{s}$ and $\mathbf{f}$ represent single-channel features (e.g. they could be soft masks, grey-level values, edge or motion cues, or any pre-trained features). Later on we show how we can easily adapt the formulation to the multi-channel feature case. We define the edge similarity using a Gaussian kernel:
$$M_{i,j} = \underbrace{s_i^p s_j^p}_{\text{unary terms}} \; \underbrace{e^{-\alpha (f_i - f_j)^2 - \beta\, \mathrm{dist}_{i,j}^2}}_{\text{pairwise terms}} \tag{1}$$
$$\approx s_i^p s_j^p \left[1 - \alpha (f_i - f_j)^2\right] G_{i,j} \tag{2}$$
where $G_{i,j} = e^{-\beta\, \mathrm{dist}_{i,j}^2}$ is the Gaussian of the space-time distance between pixels $i$ and $j$.
In graph methods, it is common to use two types of terms for representing the model over the graph. Unary terms are about individual node properties, while pairwise terms describe relations between pairs of nodes. In our case, $s_i$, $s_j$ describe individual node properties, whereas $f_i$, $f_j$ are used to define the pairwise similarity kernel between the two nodes. Note that in Eq. 2 we approximate the Gaussian kernel over the features with its first-order Taylor expansion. The approximation is crucial in making our filtering approach possible, as shown next. Hyperparameters $p$ and $\alpha$ control the importance of those terms.
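To give a feel for the first-order approximation, the following sketch compares a Gaussian feature term of the form $e^{-\alpha d^2}$ with its expansion $1 - \alpha d^2$ for a few feature differences $d = f_i - f_j$. The function names and the value of `alpha` are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def exact_pairwise(d, alpha):
    """Gaussian pairwise term exp(-alpha * d^2) for a feature difference d."""
    return np.exp(-alpha * d ** 2)

def taylor_pairwise(d, alpha):
    """First-order Taylor expansion of the Gaussian term around d = 0."""
    return 1.0 - alpha * d ** 2

alpha = 1.0  # assumed value, for illustration only
d = np.array([0.0, 0.1, 0.3, 0.5, 1.0])
exact = exact_pairwise(d, alpha)
approx = taylor_pairwise(d, alpha)
for di, e, a in zip(d, exact, approx):
    print(f"d={di:.1f}  exact={e:.4f}  taylor={a:.4f}  error={e - a:.4f}")
```

The two agree closely for small feature differences, and the expansion never exceeds the exact kernel, so it underestimates the similarity of pixels with very different features.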
To partition the space-time graph of video pixels, we want to find the strongest cluster in this graph. We first represent a segmentation solution (i.e., a cluster in the space-time graph) with an indicator vector $\mathbf{x}$ that has one element for each node in the 3D space-time volume, such that $x_i = 1$ if node (pixel) $i$ is in the video segmentation cluster (foreground) and $x_i = 0$ otherwise (background). We define the clustering score to be the sum over all pairwise similarity terms between the nodes inside the cluster. The higher this score, the stronger the connections inside the cluster. The segmentation score can be written compactly in matrix form as $S(\mathbf{x}) = \mathbf{x}^T \mathbf{M} \mathbf{x}$. Similar to other spectral approaches in graph matching Leordeanu and Hebert (2005), we find the segmentation solution that maximizes $S(\mathbf{x})$ under the relaxed constraint $\|\mathbf{x}\|_2 = 1$. Fixing the L2 norm of $\mathbf{x}$ is needed since only relative soft segmentation values matter. Thus, our optimization problem becomes one of maximizing the Rayleigh quotient:
$$\mathbf{x}^* = \arg\max_{\mathbf{x}} \frac{\mathbf{x}^T \mathbf{M} \mathbf{x}}{\mathbf{x}^T \mathbf{x}} \tag{3}$$
The global optimum solution is the principal eigenvector of $\mathbf{M}$. $\mathbf{M}$ is symmetric and has non-negative values, so the solution will also have non-negative elements, by the Perron-Frobenius theorem Frobenius (1907). The final segmentation could be obtained simply by thresholding. However, even for a small video the graph has about 20 million nodes (50 frames of $480 \times 854$ pixels), so the $N \times N$ matrix $\mathbf{M}$ makes finding the leading eigenvector with standard procedures intractable (Sec. 4.2).
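The two properties used above can be checked numerically on a toy problem. The sketch below builds a small random symmetric matrix with positive entries (a stand-in for the adjacency matrix, not the paper's kernel), and verifies that its principal eigenvector attains the largest Rayleigh quotient and has non-negative entries. All names and sizes are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 30
A = rng.random((n, n))
M = (A + A.T) / 2.0          # symmetric, positive toy "adjacency" matrix

def rayleigh(M, x):
    """Rayleigh quotient x^T M x / x^T x."""
    return float(x @ M @ x) / float(x @ x)

eigvals, eigvecs = np.linalg.eigh(M)   # eigenvalues in ascending order
v = eigvecs[:, -1]
v = v if v.sum() > 0 else -v           # eigenvectors are defined up to sign

best_random = max(rayleigh(M, rng.random(n)) for _ in range(1000))
print("largest eigenvalue:", eigvals[-1])
print("Rayleigh quotient of principal eigenvector:", rayleigh(M, v))
print("best Rayleigh quotient among 1000 random vectors:", best_random)
print("principal eigenvector is non-negative:", bool(np.all(v >= 0)))
```

A dense eigendecomposition like this is only feasible for tiny matrices, which is the motivation for the filtering formulation that follows.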
Next we show how to take advantage of the first-order expansion of the pairwise terms defining $\mathbf{M}$ and break power iteration into several very fast 3D convolutions in space and time, directly on the feature maps, without explicitly using the very large adjacency matrix. Our method receives pixel-level feature maps as input and returns a final segmentation, as the solution to the problem in Eq. 3.
3.1 Power iteration with pixel-wise iterations
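We apply the power iteration algorithm to compute the eigenvector. At iteration $k+1$, we have Eq. 4:
$$x_i^{k+1} \leftarrow \sum_{j \in \mathcal{N}(i)} M_{i,j}\, x_j^k \tag{4}$$
where, after each iteration, the solution is normalized to unit norm and $\mathcal{N}(i)$ is the set of pixels neighbouring $i$ in space and time. Expanding $M_{i,j}$ (Eq. 2), Eq. 4 becomes:
$$x_i^{k+1} \leftarrow \sum_{j \in \mathcal{N}(i)} s_i^p s_j^p \left[1 - \alpha (f_i - f_j)^2\right] G_{i,j}\, x_j^k \tag{5}$$
$$= s_i^p (1 - \alpha f_i^2) \sum_{j \in \mathcal{N}(i)} s_j^p G_{i,j}\, x_j^k \;-\; \alpha s_i^p \sum_{j \in \mathcal{N}(i)} s_j^p f_j^2 G_{i,j}\, x_j^k \;+\; 2 \alpha s_i^p f_i \sum_{j \in \mathcal{N}(i)} s_j^p f_j G_{i,j}\, x_j^k \tag{6}$$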
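For reference, classical power iteration on an explicit matrix can be sketched as below. The matrix here is a small dense random stand-in rather than the sparse space-time adjacency matrix, and the function name and iteration count are assumptions made for the example.

```python
import numpy as np

def power_iteration(M, num_iters=100):
    """Approximate the principal eigenvector of M by repeated multiplication."""
    x = np.ones(M.shape[0]) / np.sqrt(M.shape[0])   # uniform start
    for _ in range(num_iters):
        x = M @ x                    # x_i <- sum_j M_ij x_j
        x = x / np.linalg.norm(x)    # renormalise to unit L2 norm
    return x

rng = np.random.default_rng(0)
A = rng.random((50, 50))
M = (A + A.T) / 2.0                  # symmetric, positive toy matrix
x = power_iteration(M)

_, vecs = np.linalg.eigh(M)          # reference solution
v = vecs[:, -1]
v = v if v.sum() > 0 else -v         # fix the sign ambiguity
print("max abs difference to eigh:", np.max(np.abs(x - v)))
```

Each iteration costs one matrix-vector product, which is exactly the operation the filtering formulation replaces with convolutions.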
3.2 Power iteration using 3D convolutions
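In Eq. 6 we observe that the links between the nodes are local ($\mathbf{M}$ is sparse) and we can replace the sums over neighbours with local 3D convolutions in space and time. Thus, we rewrite Eq. 6 as a sum of convolutions in 3D:
$$\mathbf{X}^{k+1} \leftarrow \mathbf{S}^p \cdot (\mathbf{1} - \alpha \mathbf{F}^2) \cdot G_{3D} * (\mathbf{S}^p \cdot \mathbf{X}^k) \;-\; \alpha\, \mathbf{S}^p \cdot G_{3D} * (\mathbf{F}^2 \cdot \mathbf{S}^p \cdot \mathbf{X}^k) \tag{7}$$
$$\qquad +\; 2 \alpha\, \mathbf{S}^p \cdot \mathbf{F} \cdot G_{3D} * (\mathbf{F} \cdot \mathbf{S}^p \cdot \mathbf{X}^k) \tag{8}$$
where $*$ is a convolution over a 3D space-time volume with a 3D Gaussian filter ($G_{3D}$), $\cdot$ is an element-wise multiplication, the 3D matrices $\mathbf{X}$, $\mathbf{S}$, $\mathbf{F}$ have the original video shape ($N_f \times H \times W$) and $\mathbf{1}$ is a 3D matrix with all values 1. We transformed the standard form of power iteration in Eq. 4 into several very fast matrix operations: 3 convolutions and 13 element-wise matrix operations (multiplications and additions), which are local operations that can be parallelized.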
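The equivalence between the neighbour sums and the filtering form can be checked on a tiny volume. The sketch below assumes a first-order kernel of the form $s_i^p s_j^p [1 - \alpha (f_i - f_j)^2] G_{i,j}$ with a $3 \times 3 \times 3$ Gaussian neighbourhood that includes the pixel itself and zero padding at the borders. The helper names, the neighbourhood size and the hyperparameter values are all assumptions for illustration.

```python
import numpy as np

def gauss_conv3d(Y, beta, r=1):
    """Zero-padded 3D filtering of Y with weights exp(-beta * squared offset)."""
    T, H, W = Y.shape
    P = np.pad(Y, r)                       # zero padding on every axis
    out = np.zeros_like(Y, dtype=float)
    for dt in range(-r, r + 1):
        for dh in range(-r, r + 1):
            for dw in range(-r, r + 1):
                w = np.exp(-beta * (dt * dt + dh * dh + dw * dw))
                out += w * P[r + dt:r + dt + T, r + dh:r + dh + H, r + dw:r + dw + W]
    return out

def sfseg_step(X, S, F, alpha, p, beta):
    """One unnormalised power-iteration step written as three 3D filterings."""
    Sp = S ** p
    SX = Sp * X
    return (Sp * (1.0 - alpha * F ** 2) * gauss_conv3d(SX, beta)
            - alpha * Sp * gauss_conv3d(F ** 2 * SX, beta)
            + 2.0 * alpha * Sp * F * gauss_conv3d(F * SX, beta))

def explicit_matrix(S, F, alpha, p, beta, r=1):
    """Dense N x N matrix with the first-order kernel; only feasible for tiny volumes."""
    coords = np.indices(S.shape).reshape(3, -1).T          # (N, 3) voxel coordinates
    diff = coords[:, None, :] - coords[None, :, :]
    local = np.abs(diff).max(axis=-1) <= r                 # neighbourhood mask
    G = np.exp(-beta * (diff ** 2).sum(axis=-1)) * local
    sp, f = S.ravel() ** p, F.ravel()
    return np.outer(sp, sp) * (1.0 - alpha * (f[:, None] - f[None, :]) ** 2) * G

rng = np.random.default_rng(0)
shape = (4, 5, 6)                                          # frames x height x width
S, F, X = rng.random(shape), rng.random(shape), rng.random(shape)
alpha, p, beta = 1.0, 0.5, 0.5                             # assumed values

by_filtering = sfseg_step(X, S, F, alpha, p, beta)
by_matrix = (explicit_matrix(S, F, alpha, p, beta) @ X.ravel()).reshape(shape)
print("max abs difference:", np.max(np.abs(by_filtering - by_matrix)))
```

The explicit matrix has $N^2$ entries (here $120 \times 120$), while the filtering version only ever touches arrays of the video's own shape.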
3.3 Multiple feature channels
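Our approach in Eq. 7 can easily accommodate multiple feature channels if we rewrite $M_{i,j}$ from Eq. 2 and propagate it through Eq. 7. The final multi-channel solution is obtained by summing over the final solution for each channel:
$$\mathbf{X}^{k+1} \leftarrow \sum_{c} \Big[ \mathbf{S}^p \cdot (\mathbf{1} - \alpha \mathbf{F}_c^2) \cdot G_{3D} * (\mathbf{S}^p \cdot \mathbf{X}^k) - \alpha\, \mathbf{S}^p \cdot G_{3D} * (\mathbf{F}_c^2 \cdot \mathbf{S}^p \cdot \mathbf{X}^k) + 2 \alpha\, \mathbf{S}^p \cdot \mathbf{F}_c \cdot G_{3D} * (\mathbf{F}_c \cdot \mathbf{S}^p \cdot \mathbf{X}^k) \Big] \tag{9}$$
where $\mathbf{F}_c$ is one (3D) channel feature matrix.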
4 Algorithm
We present the version of our algorithm (Alg. 1) that uses a single-channel feature map; it can be easily adapted to the multi-channel version using Eq. 9. We first initialize the solution with a uniform vector or with a soft segmentation provided by another method, if available. We also initialize the feature maps $\mathbf{S}$ and $\mathbf{F}$, which could be of any kind: lower-level (optical flow, edges, grey-level values) or higher-level pre-trained semantic features (deep features or initial soft/hard segmentation maps). At each iteration, we select a time window around the current frame. In Step 2, we multiply the corresponding matrices, apply the convolutions, combine the results and obtain the new segmentation mask for the pixels in the current frame, using the space-time operations (as in Eq. 7). Since the solution needs to be binary at the end (for evaluation), after each iteration (Step 3, line 14 in Alg. 1), we project the solution onto a more discrete space (see Sec. 4.1).
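The overall loop can be sketched as follows. This is a simplified reading of the description above, not a reproduction of Alg. 1: it filters the whole volume at once instead of sliding a time window, uses a fixed 3-tap Gaussian, and every function name and parameter value (`alpha`, `p`, `warmup`, `steepness`) is an assumption made for the example.

```python
import numpy as np

def gauss_filter3d(Y, beta=0.5):
    """Separable 3-tap Gaussian filtering along time, height and width (zero padded)."""
    g = np.exp(-beta * np.array([-1.0, 0.0, 1.0]) ** 2)
    for axis in range(3):
        Y = np.apply_along_axis(np.convolve, axis, Y, g, mode="same")
    return Y

def sfseg(S, F, alpha=1.0, p=0.5, num_iters=10, warmup=3, steepness=10.0):
    """Sketch of the iterative loop: filtering steps, then sigmoid projections."""
    Sp = S ** p
    X = np.ones_like(S, dtype=float)                 # uniform initialisation
    for k in range(num_iters):
        SX = Sp * X
        X = (Sp * (1.0 - alpha * F ** 2) * gauss_filter3d(SX)
             - alpha * Sp * gauss_filter3d(F ** 2 * SX)
             + 2.0 * alpha * Sp * F * gauss_filter3d(F * SX))
        X = X / np.linalg.norm(X)                    # unit L2 norm
        if k >= warmup:                              # move towards a discrete solution
            X01 = (X - X.min()) / (X.max() - X.min() + 1e-12)
            X = 1.0 / (1.0 + np.exp(-steepness * (X01 - 0.5)))
    X01 = (X - X.min()) / (X.max() - X.min() + 1e-12)
    return X01 > 0.5                                 # final hard threshold

rng = np.random.default_rng(0)
truth = np.zeros((6, 16, 16), dtype=bool)
truth[:, 4:12, 4:12] = True                          # a static square "object"
S = np.clip(truth + 0.3 * rng.standard_normal(truth.shape), 0.0, 1.0)  # noisy soft mask
F = truth.astype(float)                              # idealised toy pairwise feature
mask = sfseg(S, F)
print("output shape:", mask.shape, "foreground pixels:", int(mask.sum()))
```

In a real setting `S` would come from an existing segmentation method and `F` from cues such as optical flow, as described in the experimental setup.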
4.1 Binarization - Spectral vs Discrete space
At the end, we need a hard segmentation map for the object of interest. Over the iterations, a spectral method makes the solution continuous. It was previously observed in graph matching optimization, where the solution is relaxed Leordeanu et al. (2009), that keeping it close to the initial discrete domain yields better final performance, even though the optimum in the spectral space is affected. We therefore integrate the binarization into the iterative optimization. After a few iterations in the continuous space, we start projecting the solution onto an almost discrete space through a sigmoid (which continuously approximates a step function) and initialize the next iteration with this projection. After the last iteration, we apply a hard threshold to a solution that is much closer to the discrete space than before. This way, the transition is smoother than with a simple sharp thresholding.
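A sigmoid projection of this kind might look like the sketch below, which shows how increasing the steepness moves a soft mask towards a step function. The rescaling to $[0, 1]$, the centre at 0.5 and the steepness values are assumptions, since the exact parameters are not given in the text.

```python
import numpy as np

def project_towards_binary(x, steepness=10.0, center=0.5):
    """Rescale x to [0, 1] and squash it with a sigmoid that approximates a step."""
    x01 = (x - x.min()) / (x.max() - x.min() + 1e-12)
    return 1.0 / (1.0 + np.exp(-steepness * (x01 - center)))

soft = np.array([0.05, 0.30, 0.45, 0.55, 0.70, 0.95])
for steepness in (5.0, 20.0, 80.0):
    print(steepness, np.round(project_towards_binary(soft, steepness), 3))

hard = soft > 0.5                     # plain thresholding, for comparison
sharp = project_towards_binary(soft, 80.0)
```

With a low steepness the projection stays soft, so information can still propagate in later iterations; with a high steepness it is nearly identical to the hard threshold.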
4.2 Numerical Analysis
We compare the standard power iteration eigenvector computation with our filtering formulation, both from qualitative and quantitative (speedup) points of view.
Computational Complexity.
The Lanczos method Lanczos (1950) for sparse matrices has $O(k N_f N_p N_i)$ complexity for computing the leading eigenvector, where $k$ is the number of neighbours for each node, $N_f$ the number of frames in the video, $N_p$ the number of pixels per frame and $N_i$ the number of iterations. Our full iteration algorithm also has $O(k N_f N_p N_i)$ complexity, but with highly parallelizable operations compared to Lanczos. The Gaussian filters are separable, so the 3D convolutions can be broken into a sequence of three vector-wise convolutions, reducing the per-pixel filtering cost from $k = k_t k_h k_w$ to $k_t + k_h + k_w$ for a $k_t \times k_h \times k_w$ kernel.
We compare three solutions: a) Lanczos for the principal eigenvector of the adjacency matrix in Eq. 1, b) Lanczos for the approximate adjacency matrix in Eq. 2, and c) our 3D convolutions approach. For a small graph of 4000 nodes (a video with 10 frames of $20 \times 20$ pixels), a) and b) take 0.15 sec/iter, while our 3D filtering formulation takes 0.02 sec/iter (Fig. 2). Our approach scales better, with a large advantage on videos with millions of nodes, because we do not explicitly build the adjacency matrix and the filtering is parallelized on GPU.
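The separability argument can be illustrated with a small check. The sketch below applies an assumed $3 \times 3 \times 3$ Gaussian kernel once as a full 3D filter and once as three 1D passes, and compares both the outputs and the number of multiplications per voxel. The kernel size, the Gaussian width and the function names are assumptions for this example.

```python
import numpy as np

def full_filter3d(Y, g):
    """Direct zero-padded 3D filtering with the outer-product kernel g x g x g."""
    r = len(g) // 2
    T, H, W = Y.shape
    P = np.pad(Y, r)
    out = np.zeros_like(Y, dtype=float)
    for a in range(len(g)):
        for b in range(len(g)):
            for c in range(len(g)):
                out += g[a] * g[b] * g[c] * P[a:a + T, b:b + H, c:c + W]
    return out

def separable_filter3d(Y, g):
    """Same filter applied as three 1D passes, one per axis."""
    for axis in range(3):
        Y = np.apply_along_axis(np.convolve, axis, Y, g, mode="same")
    return Y

g = np.exp(-0.5 * np.array([-1.0, 0.0, 1.0]) ** 2)   # symmetric 3-tap Gaussian
Y = np.random.default_rng(0).random((4, 5, 6))
full = full_filter3d(Y, g)
separable = separable_filter3d(Y, g)
print("max abs difference:", np.max(np.abs(full - separable)))
print("multiplications per voxel:", len(g) ** 3, "vs", 3 * len(g))
```

The gap between the two costs widens quickly with kernel size, since one grows cubically and the other linearly in the filter length.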
Qualitative analysis.
We perform tests on synthetic data in order to study the differences between the original spectral solution using the exponential pairwise scores (Eq. 1) and the one obtained after our first-order Taylor approximation (Eq. 2). In Fig. 3 we show qualitative comparisons between the solutions obtained by three implementations: our SFSeg, power iteration with the original pairwise scores, and the NumPy eigenvector routine with the original pairwise scores. The outputs are almost identical. In the synthetic experiments, the input is noisy, but all spectral solutions manage to reconstruct the initial segmentation.
Quantitative analysis.
We analyze the numerical differences between the original eigenvector and our approximation (SFSeg). We plot the angle (in degrees) and the IoU (Jaccard) between SFSeg (first-order approximation of pairwise functions, optimized with 3D convolutions) and the original eigenvector (exponential pairwise functions in the adjacency matrix), over multiple SFSeg iterations in Fig. 4. Note that in these experiments we intentionally start from a far away solution (70 degrees difference between the SFSeg initial segmentation vector and the original eigenvector) to better show that SFSeg indeed converges to practically the same eigenvector. Such comparisons can be performed only on synthetic data with relatively small videos, for which the computation of the adjacency matrix needed for the original eigenvector is tractable. The results clearly show that SFSeg, with first order approximations of the pairwise functions on edges and optimization based on 3D filters, reaches the same theoretical solution, while being orders of magnitude faster.
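The two measures used in this comparison, the angle between solution vectors and the Jaccard index of the resulting masks, can be computed as in the sketch below. The function names and the 0.5 threshold are assumptions chosen for illustration.

```python
import numpy as np

def angle_degrees(u, v):
    """Angle between two vectors, in degrees."""
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

def jaccard(mask_a, mask_b):
    """Intersection over union of two boolean masks."""
    union = np.logical_or(mask_a, mask_b).sum()
    return float(np.logical_and(mask_a, mask_b).sum() / union) if union else 1.0

u = np.array([1.0, 0.0, 0.0, 0.0])
v = np.array([1.0, 1.0, 0.0, 0.0])
print(angle_degrees(u, v))            # 45 degrees up to rounding
print(jaccard(u > 0.5, v > 0.5))      # 1 shared pixel out of 2 -> 0.5
```

The angle is sensitive to the soft values of the eigenvector, while the Jaccard index only reflects the thresholded masks, so the two give complementary views of convergence.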
5 Experimental Analysis
Experiments on DAVIS-2016.
DAVIS-2016 Perazzi et al. (2016) is a densely annotated video object segmentation dataset. It contains 50 high-resolution video sequences (30 train/20 valid), with a total of 3455 annotated frames of real-world scenes. The benchmark comes with two tasks: the unsupervised one, where the solutions do not have access to the first frame of the video, and the semi-supervised one, where the methods use the ground truth from the first frame. In both setups, the methods can train the model on the training set and report their performance on the validation set. Our results are reported on the validation set, but we do not use the training set. For optical flow we used the PyTorch implementation of FlowNet2 Reda et al. (2017).
Experimental Setup.
We test SFSeg with input from pre-computed segmentations of the video produced by top methods from DAVIS-2016, on both tasks. For the feature maps, we initialized $\mathbf{S}$ with the pre-computed input segmentation values. For $\mathbf{F}$, we used two channels: the magnitude of the direct optical flow and of the reverse optical flow. We set: ; and for unsupervised task and for the semi-supervised one. The algorithm is implemented as in Alg. 1 with the multi-channels as in Eq. 9.
References
- Caelles et al. [2017] S. Caelles, K. Maninis, J. Pont-Tuset, L. Leal-Taixé, D. Cremers, and L. Van Gool. One-shot video object segmentation. CVPR, 2017.
- Cheng et al. [2018] J. Cheng, Y.-H. Tsai, W.-C. Hung, S. Wang, and M.-H. Yang. Fast and accurate online video object segmentation via tracking parts. CVPR, 2018.
- Faktor and Irani [2014] A. Faktor and M. Irani. Video segmentation by non-local consensus voting. BMVC, 2014.
- Felzenszwalb and Huttenlocher [2004] P. Felzenszwalb and D. Huttenlocher. Efficient graph-based image segmentation. IJCV, 2004.
- Frobenius [1907] G. Frobenius. About a Fundamental Theorem of Group Theory. II. Session Reports of the Royal Prussian Academy of Sciences, 1907.
- Hickson et al. [2014] S. Hickson, S. Birchfield, I. Essa, and H. Christensen. Efficient hierarchical graph-based segmentation of rgbd videos. CVPR, 2014.
- Jain et al. [2017] S. Jain, B. Xiong, and K. Grauman. Fusionseg: Learning to combine motion and appearance for fully automatic segmention of generic objects in videos. arXiv, 2017.
- Khoreva et al. [2017] A. Khoreva, F. Perazzi, R. Benenson, B. Schiele, and A. Hornung. Learning video object segmentation from static images. CVPR, 2017.
