L1-regularized Reconstruction Error as Alpha Matte

Jubin Johnson; Hisham Cholakkal; Deepu Rajan

arXiv:1702.02744·cs.CV·April 5, 2017

TL;DR

This paper introduces a novel video matting algorithm that employs L1-regularized reconstruction error to estimate alpha mattes, ensuring temporal coherence through a multi-frame non-local means framework, with demonstrated effectiveness on a dedicated dataset.

Contribution

It proposes using L1-regularized reconstruction error for alpha estimation and incorporates a multi-frame non-local means approach for temporal consistency in video matting.

Findings

01 Effective alpha matte estimation demonstrated on a video matting dataset.

02 Improved temporal coherence in video matting results.

03 Quantitative and qualitative evaluations show the lowest temporal jitter on most test sequences.

Abstract

Sampling-based alpha matting methods have traditionally followed the compositing equation to estimate the alpha value at a pixel from a pair of foreground (F) and background (B) samples. The (F,B) pair that produces the least reconstruction error is selected, followed by alpha estimation. The significance of that residual error has been left unexamined. In this letter, we propose a video matting algorithm that uses L1-regularized reconstruction error of F and B samples as a measure of the alpha matte. A multi-frame non-local means framework using coherency sensitive hashing is utilized to ensure temporal coherency in the video mattes. Qualitative and quantitative evaluations on a dataset exclusively for video matting demonstrate the effectiveness of the proposed matting algorithm.

Tables

Table I. Comparison of temporal jitter error rates of different video matting algorithms against the proposed method

Video	SC [11]	BA [9]	EH [8]	JO [10]	Proposed
Face	3.46	2.92	4.26	2.37	1.49
Dancer	4.72	4.13	1.48	2.13	1.46
Arm	4.43	2.91	2.54	3.52	1.58
Woman	4.03	2.72	3.36	2.82	2.05
Smoke	3.17	2.96	1.80	4.85	2.19
Cat	2.54	4.18	2.45	4.41	1.40
Chimp	3.54	4.63	2.90	2.09	1.81
Girl	4.55	4.34	2.31	2.12	1.65
Whitegoat	3.72	3.85	3.47	2.17	1.76
Amira	4.18	4.27	2.72	2.09	1.72
Girl2	4.54	4.40	2.18	2.0	1.86
Office	3.94	3.64	2.94	2.05	2.41
Soccer	4.05	3.24	2.31	2.59	2.79
Unicorn	3.23	3.21	2.91	3.34	2.28
Dog	4.09	3.62	3.5	2.04	1.73

Table II. Comparison of running time of the proposed method with recent sampling-based approaches (total time in seconds)

Video	Size	No. of Frames	EH [8]	JO [10]	Proposed
Smoke	500x500	90	4491	3798	3628
Arm	640x540	49	3260	1618	2188
Dancer	480x360	40	5488	2803	2589
Face	640x540	78	5378	4955	4786
Archaeology	480x405	128	4980	2524	2961
Woman	450x400	154	5541	2912	3178

Equations

$$I_{i}=\alpha_{i}F_{i}+(1-\alpha_{i})B_{i},$$

$$\beta_{F}^{i}=\arg\min_{\beta_{F}^{i}}\left\|v_{i}-D_{F}^{i}\beta_{F}^{i}\right\|_{2}^{2}+\lambda\left\|\beta_{F}^{i}\right\|_{1},$$

$$\beta_{B}^{i}=\arg\min_{\beta_{B}^{i}}\left\|v_{i}-D_{B}^{i}\beta_{B}^{i}\right\|_{2}^{2}+\lambda\left\|\beta_{B}^{i}\right\|_{1},$$

$$\xi_{F}^{i}=\left\|v_{i}-D_{F}^{i}\beta_{F}^{i}\right\|_{2},\qquad\xi_{B}^{i}=\left\|v_{i}-D_{B}^{i}\beta_{B}^{i}\right\|_{2}.$$

$$\hat{\alpha}_{i}=\frac{P(f(i)|\mathbf{D}_{B})}{P(f(i)|\mathbf{D}_{B})+P(b(i)|\mathbf{D}_{F})}=\frac{\xi_{B}^{i}}{\xi_{B}^{i}+\xi_{F}^{i}}.$$

$$\alpha_{T}(i)=\frac{1}{\Omega}\sum_{t=T-2}^{T+2}\gamma^{|t-T|}\sum_{j=1}^{K}\hat{\alpha}(i_{tj})\exp\left\{-\frac{D_{w}(P(i),P(i_{tj}))}{2\sigma_{t}^{2}}\right\},$$

$$\Omega=\sum_{t=T-2}^{T+2}\gamma^{|t-T|}\sum_{j=1}^{K}\exp\left\{-\frac{D_{w}(P(i),P(i_{tj}))}{2\sigma_{t}^{2}}\right\}.$$

$$D_{w}(P(i_{1}),P(i_{2}))=\sum_{u\in[-s,s]^{2}}\left(P(i_{1}+u)-P(i_{2}+u)\right)^{2}\exp\left\{-\frac{\|u\|^{2}}{2\sigma_{p}^{2}}\right\},$$

$$f_{i}(t)=\frac{|\alpha_{i}(t+1)-\alpha_{i}(t)|}{|I_{i}(t+1)-I_{i}(t)|},$$

Full text

Jubin Johnson, Hisham Cholakkal, and Deepu Rajan

J. Johnson, H. Cholakkal, and D. Rajan are with the Multimedia Lab, School of Computer Science and Engineering, Nanyang Technological University, Singapore, 639798 (e-mail: {jubin001, hisham002, asdrajan}@ntu.edu.sg).

Index Terms: Residual error, video matting, non-local means.

I Introduction

Digital matting refers to the problem of accurate foreground extraction and finds its use in image and video editing. Mathematically, any pixel color $I_{i}$ can be modeled as a convex combination of the foreground color ( $F_{i}$ ) and the background color ( $B_{i}$ ) such that

$$I_{i}=\alpha_{i}F_{i}+(1-\alpha_{i})B_{i},\tag{1}$$

where $\alpha_{i}$ is the opacity (alpha) value at pixel $i$ . Determining $\alpha$ is an under-constrained problem, made tractable by means of user-input labels in the form of a trimap or scribbles.
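As a minimal illustration of eq. (1), the sketch below composites a pixel from a foreground/background color pair, then recovers alpha from a candidate pair by projecting the observed color onto the line between $B$ and $F$, the estimate commonly used by sampling-based methods. The function names, the `eps` guard, and the sample colors are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def composite(alpha, F, B):
    """Compositing equation: I = alpha * F + (1 - alpha) * B."""
    return alpha * F + (1.0 - alpha) * B

def estimate_alpha(I, F, B, eps=1e-8):
    """Project I onto the line through B and F, then clip to [0, 1].

    eps guards against F == B and is an implementation assumption.
    """
    d = F - B
    alpha = np.dot(I - B, d) / (np.dot(d, d) + eps)
    return float(np.clip(alpha, 0.0, 1.0))

def pair_residual(I, F, B):
    """Reconstruction error of the (F, B) pair at its estimated alpha."""
    a = estimate_alpha(I, F, B)
    return float(np.linalg.norm(I - composite(a, F, B)))

# Hypothetical RGB colors in [0, 1].
F = np.array([0.9, 0.8, 0.7])
B = np.array([0.1, 0.2, 0.3])
I = composite(0.25, F, B)          # a pixel that is 25% foreground
alpha_hat = estimate_alpha(I, F, B)
residual = pair_residual(I, F, B)  # near zero for the true (F, B) pair
```

When the candidate pair is not the true one, `pair_residual` is generally nonzero; that leftover error is the quantity a sampling method minimizes when choosing a pair.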

Matting methods are generally divided into sampling-based [1, 2, 3, 4] and propagation-based [5, 6] approaches. The former category uses color values from the known foreground and background regions to find the foreground-background $(F,B)$ pair that best represents the true foreground and background colors, and uses it to estimate $\alpha$ at a given pixel. Different sampling strategies (local/global) and optimization criteria for selecting the best $(F,B)$ pair distinguish these approaches. Similar color distributions in the foreground and background regions are a challenge, since the samples can no longer discriminate between $F$ and $B$ regions. Propagation-based methods leverage the correlation between neighboring pixels with respect to local image statistics to interpolate the known alpha values to the unknown regions. As with sampling approaches, false correlations between neighboring $F$ and $B$ pixels occur due to color similarity. Moreover, alpha fails to propagate accurately across strong edges and textured regions. Recently, deep learning based approaches [7] have been shown to perform well in natural image matting.

Video matting, apart from extracting spatially accurate mattes on each frame, has the additional requirement of temporal coherence across the video [8, 9, 10]. The human visual system is highly sensitive to jitter and temporal inconsistencies across frames. Low contrast and fast motion contribute to an inaccurate matte in a frame, thereby leading to temporal jitter across the extracted video matte. Although the mattes obtained by independently applying image matting algorithms to each frame are of high quality, they are not temporally coherent. $\alpha$-propagation has been extended to the temporal domain as post-processing to alleviate this problem. Snapcut [11] uses the matting Laplacian [5] to bias the alpha to the previous frame. A motion-aware Laplacian is constructed to propagate the matte temporally in [12]. Level-set interpolation is used to temporally smooth the estimated mattes in [9]. Optical flow is used to warp the alpha from the previous frame in the Laplacian formulation in [8].

The proposed approach is based on sampling. As mentioned earlier, sampling methods find the best $(F,B)$ pair that satisfies eq. (1) and use it to estimate the alpha value. The reconstruction error in the selected pair is $\xi_{i}=\left\|I_{i}-(\hat{\alpha}_{i}F_{i}+(1-\hat{\alpha}_{i})B_{i})\right\|$. The significance of this residual error for matting has largely been left unexamined in the literature. Johnson et al. [10] showed sparse coding as an alternative to the compositing equation for estimating the $\alpha$ value at a pixel. Inspired by this, we propose a sampling-based approach that looks at matting from the perspective of sparse reconstruction error of feature samples. Fig. 1 illustrates the motivation behind using reconstruction error as a measure of the matte in a real image. A zoomed region of the input image in Fig. 1(a) and its trimap are shown in Fig. 1(b), representing a hairy region containing mixed pixels. The local smoothness assumption between the alpha values of neighboring pixels is paramount to extracting a good matte. In a real image, alpha transitions gradually between the definite $F$ and $B$ regions, with the true mixed pixels having intermediate alpha values. The RGB color distribution of pixels in the image patch varies smoothly between the foreground and background, with the blending peaking at the middle of the unknown region (Fig. 1(c)). Similarly, the error obtained during reconstruction using $F$ and $B$ samples can be thought of as a probability measure that varies smoothly between the foreground and background regions, gradually rising from the definite regions and peaking at true mixed pixels. As can be seen in Fig. 1(c) and (d), the color distribution of pixels in a real image and the residual error are highly correlated. To the best of our knowledge, we are the first to formulate matting from the perspective of reconstruction error.
A patch-based non-local means (NLM) framework using coherency sensitive hashing across multiple frames is applied to the estimated mattes to ensure temporal coherence in the final mattes. Qualitative and quantitative comparisons on a video matting dataset [8, 10] show that the proposed NLM framework reduces temporal jitter compared with the widely used Laplacian methods.

II Proposed Approach

II-A L1-regularized reconstruction error as alpha matte

The aim of the proposed method is to use reconstruction error as a measure of the $\alpha$ value. Using reconstruction error in this way hinges on the assumption that the foreground and background are locally smooth, akin to propagation-based methods. The idea is therefore to select a local subset of the known regions for which the local smoothness assumption holds. Following [10], at each pixel, $F$ and $B$ dictionaries are formed by sampling the spatially nearest pixels at a radius of 50 pixels from the definite foreground and background regions, respectively. The feature vector used is the 8-D vector $[\>R\;G\;B\;L\;a\;b\;x\;y\>]^{T}$ formed by concatenating the RGB and CIELAB color values with the $x$-$y$ coordinates. In order to reduce the sample space, the definite $F$ and $B$ regions are clustered into superpixels using SLIC segmentation [13]. Note that [10] uses a single dictionary formed by concatenating the $F$ and $B$ samples. The proposed method, however, requires separate $F$ and $B$ dictionaries in order to determine the reconstruction error with respect to each, as explained below.
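The dictionary construction described above can be sketched roughly as follows. This is a simplified illustration under several assumptions: the CIELAB image is supplied by the caller, "at a radius of 50 pixels" is read as "within 50 pixels of the unknown pixel", and raw labeled pixels are sampled directly instead of SLIC superpixels. The helper names `build_features` and `local_dictionary` are hypothetical.

```python
import numpy as np

def build_features(rgb, lab):
    """Stack [R G B L a b x y] for every pixel.

    rgb, lab: float arrays of shape (H, W, 3). The Lab image is assumed to be
    precomputed by any color-conversion routine; channel scaling is left to
    the caller because the source does not specify it.
    Returns an array of shape (H, W, 8).
    """
    h, w, _ = rgb.shape
    ys, xs = np.mgrid[0:h, 0:w]
    xy = np.stack([xs, ys], axis=-1).astype(rgb.dtype)
    return np.concatenate([rgb, lab, xy], axis=-1)

def local_dictionary(features, mask, center, radius=50):
    """Columns are features of labeled pixels within `radius` of `center`.

    mask: boolean (H, W) array marking definite F (or definite B) pixels.
    center: (x, y) coordinates of the unknown pixel.
    """
    ys, xs = np.nonzero(mask)
    near = (xs - center[0]) ** 2 + (ys - center[1]) ** 2 <= radius ** 2
    return features[ys[near], xs[near]].T  # shape (8, n_samples)
```

Calling `local_dictionary` once with the definite-foreground mask and once with the definite-background mask yields the two separate dictionaries the method requires.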

Given an unknown pixel $i$ , let $D_{F}^{i}$ and $D_{B}^{i}$ be the foreground and background dictionaries formed by sampling the feature vectors. The sparse codes with respect to each dictionary are determined as

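$$\beta_{F}^{i}=\arg\min_{\beta_{F}^{i}}\left\|v_{i}-D_{F}^{i}\beta_{F}^{i}\right\|_{2}^{2}+\lambda\left\|\beta_{F}^{i}\right\|_{1},\tag{2}$$

$$\beta_{B}^{i}=\arg\min_{\beta_{B}^{i}}\left\|v_{i}-D_{B}^{i}\beta_{B}^{i}\right\|_{2}^{2}+\lambda\left\|\beta_{B}^{i}\right\|_{1},\tag{3}$$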

where $v_{i}$ is the feature vector at pixel $i$ . The residual errors generated by reconstruction using $F$ and $B$ dictionaries are

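$$\xi_{F}^{i}=\left\|v_{i}-D_{F}^{i}\beta_{F}^{i}\right\|_{2},\qquad\xi_{B}^{i}=\left\|v_{i}-D_{B}^{i}\beta_{B}^{i}\right\|_{2}.\tag{4}$$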

$\xi_{F}^{i}$ $(\xi_{B}^{i})$ is the error generated at the unknown pixel $i$ when its feature is reconstructed using the foreground (background) dictionary. A high value of $\xi_{F}^{i}$ $(\xi_{B}^{i})$ indicates that the current pixel cannot be reconstructed well by the $F$ $(B)$ samples. Fig. 2(b) and (c) visualize these error maps for a real image. $\xi_{F}^{i}$ should ideally be 0 for foreground pixels and gradually increase towards the background pixels. Similarly, $\xi_{B}^{i}$ should ideally be 0 for background pixels and gradually increase towards foreground regions. A pixel with a true alpha value of 0.5, i.e., a truly mixed pixel, should have comparable reconstruction errors $\xi_{F}^{i}$ and $\xi_{B}^{i}$.
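To make the sparse coding step and the two residuals concrete, here is a minimal NumPy sketch. It uses ISTA (iterative soft-thresholding) as a stand-in solver for the L1-regularized least-squares problem; the source does not say which solver was used, and the tiny 3-D dictionaries and feature vector are made-up values chosen only so the behavior is easy to follow.

```python
import numpy as np

def soft_threshold(x, t):
    """Proximal operator of t * ||x||_1."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def sparse_code_ista(v, D, lam=0.1, n_iter=500):
    """Approximately minimize ||v - D b||_2^2 + lam * ||b||_1 with ISTA.

    ISTA is an assumed solver chosen for brevity, not the paper's stated one.
    """
    # Step size 1/L, where L = 2 * sigma_max(D)^2 is the Lipschitz constant
    # of the gradient of the quadratic term.
    L = 2.0 * np.linalg.norm(D, 2) ** 2 + 1e-12
    b = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * D.T @ (D @ b - v)
        b = soft_threshold(b - grad / L, lam / L)
    return b

def residual(v, D, lam=0.1):
    """Reconstruction error ||v - D b||_2 at the sparse code b."""
    b = sparse_code_ista(v, D, lam)
    return float(np.linalg.norm(v - D @ b))

# Hypothetical dictionaries: columns are samples (3-D here for readability).
D_F = np.array([[0.9, 0.8], [0.8, 0.9], [0.7, 0.7]])
D_B = np.array([[0.1, 0.2], [0.2, 0.1], [0.3, 0.3]])
v = np.array([0.88, 0.82, 0.70])   # feature that resembles the F samples

xi_F = residual(v, D_F)
xi_B = residual(v, D_B)
```

Here `xi_F` comes out much smaller than `xi_B`, matching the expectation that a foreground-like pixel is reconstructed well by the foreground dictionary and poorly by the background one.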

The alpha value can be interpreted as the probability of the pixel belonging to the foreground. $\xi_{B}^{i}$ represents the probability of belonging to the foreground, given the known background information, i.e., $\xi_{B}^{i}=P(f(i)|\mathbf{D}_{B})$. $\xi_{F}^{i}$ represents the probability of belonging to the background, given the known foreground information, i.e., $\xi_{F}^{i}=P(b(i)|\mathbf{D}_{F})=1-P(f(i)|\mathbf{D}_{F})$. Based on the above observation, the alpha value is then estimated as

$$\hat{\alpha}_{i}=\frac{P(f(i)|\mathbf{D}_{B})}{P(f(i)|\mathbf{D}_{B})+P(b(i)|\mathbf{D}_{F})}=\frac{\xi_{B}^{i}}{\xi_{B}^{i}+\xi_{F}^{i}}.\tag{5}$$

As can be seen from eq. (5), if a pixel truly belongs to the foreground, its foreground reconstruction error $\xi_{F}^{i}$ will be smaller than its background reconstruction error $\xi_{B}^{i}$, thereby ensuring that $\alpha$ is large. The alpha map is shown in Fig. 2(d) and indicates the effectiveness of this simple formulation using sparse reconstruction error.
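A short numeric sketch of eq. (5) may help. The residual values below are invented for illustration, and the small `eps` in the denominator is an implementation assumption that avoids division by zero when both residuals vanish.

```python
import numpy as np

def alpha_from_residuals(xi_f, xi_b, eps=1e-8):
    """alpha_hat = xi_B / (xi_B + xi_F), evaluated elementwise.

    eps is an assumed guard against a zero denominator.
    """
    xi_f = np.asarray(xi_f, dtype=float)
    xi_b = np.asarray(xi_b, dtype=float)
    return xi_b / (xi_b + xi_f + eps)

# Three hypothetical pixels: foreground-like, mixed, background-like.
xi_f = [0.02, 0.30, 0.60]
xi_b = [0.58, 0.30, 0.05]
alphas = alpha_from_residuals(xi_f, xi_b)
```

The first pixel has a small foreground residual and gets an alpha close to 1, the pixel with equal residuals gets 0.5, and the last pixel gets an alpha close to 0.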

II-B Patch-based non-local means for temporal coherence

Since the sampling strategy uses a local spatial subset of samples from within the frame, the alpha estimates obtained above lack temporal coherency, as the information present in nearby frames is ignored. Existing methods follow a Laplacian-based post-processing step in which the inter-pixel correlation is utilized to propagate the matte. The inherent disadvantage of this step is its inability to find distant neighbors in space and time. Also, pixel-based matching introduces noise from outliers that are matched incorrectly. To handle this, we adopt a patch-based NLM framework, an approach prevalent in video denoising [14], to maintain temporal consistency across neighboring frames. NLM [15] was originally introduced to remove noise by averaging pixels in an image weighted by local patch similarities. The high search complexity of finding non-local neighboring patches restricts its use to a local neighborhood. Therefore, we apply approximate K-nearest neighbor patch matching using coherency sensitive hashing [16], which extends PatchMatch [17] with a hashing scheme in which similar patches in the temporal neighborhood are used to propagate the matches to their neighbors.

The framework is illustrated in Fig. 3. For a given image patch $P_{T}(i)$ (shown in red) centered at pixel $i$ in frame $T$, approximate K-nearest neighbors (AKNN) in frames $T-1$ and $T+1$ (in blue) are initialized by creating $L$ hash tables based on projection of the patches onto Walsh-Hadamard kernels, followed by a search for the best candidate patches [16]. The two images are assumed to be coherent, i.e., for every pair of similar patches, their neighbors are also likely to be similar. In the next iteration, the AKNN fields for the initial candidates are searched for in their respective temporal neighbors $T-2$ and $T+2$. Under this setup, the temporal neighbors of $P_{T}(i)$ form a series of AKNNs $\{\mathcal{N}_{T-2},\mathcal{N}_{T-1},\mathcal{N}_{T},\mathcal{N}_{T+1},\mathcal{N}_{T+2}\}$, where $\mathcal{N}_{t}=\{P(i_{tj})\}_{j=1}^{K}$ denotes the patches in the $t^{th}$ frame. The non-local means estimate of alpha is [14]

$$\alpha_{T}(i)=\frac{1}{\Omega}\sum_{t=T-2}^{T+2}\gamma^{|t-T|}\sum_{j=1}^{K}\hat{\alpha}(i_{tj})\exp\left\{-\frac{D_{w}(P(i),P(i_{tj}))}{2\sigma_{t}^{2}}\right\},\tag{6}$$

where $\hat{\alpha}(i)$ is the alpha value of a patch centered at $i$ and $\Omega$ is a normalization constant:

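$$\Omega=\sum_{t=T-2}^{T+2}\gamma^{|t-T|}\sum_{j=1}^{K}\exp\left\{-\frac{D_{w}(P(i),P(i_{tj}))}{2\sigma_{t}^{2}}\right\}.\tag{7}$$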

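$D_{w}(\cdot,\cdot)$ is a weighted sum of squared differences (SSD) over two patches, given by

$$D_{w}(P(i_{1}),P(i_{2}))=\sum_{u\in[-s,s]^{2}}\left(P(i_{1}+u)-P(i_{2}+u)\right)^{2}\exp\left\{-\frac{\|u\|^{2}}{2\sigma_{p}^{2}}\right\},\tag{8}$$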

where $s$ is the patch width and $\sigma_{p}$ is set to $s/2$. $\gamma$ is set to 0.9 to control the influence of temporal neighbors. Eq. (6) estimates the alpha matte for an entire patch centered at pixel $i$ in frame $T$. For an alpha patch of width $s$, all patches whose centers are located within a radius of $\frac{s}{2}$ from $i$ contain the pixel at $i$. A simple Gaussian-weighted averaging is performed to obtain the final NLM estimate at $i$ as $\alpha_{i}=\frac{\sum_{j\in\phi}\alpha(i)e^{-\frac{\left\|i-i_{j}\right\|^{2}}{2\sigma_{p}^{2}}}}{\sum_{j\in\phi}e^{-\frac{\left\|i-i_{j}\right\|^{2}}{2\sigma_{p}^{2}}}}$, where $\phi$ denotes the set of patches that contain the pixel $i$. The process is repeated for each frame of the video to obtain the final video mattes.
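The multi-frame NLM estimate of an alpha patch can be sketched as below. This is an illustration under stated assumptions: the matched patches are passed in by the caller (the coherency sensitive hashing search is not implemented), the Gaussian kernel is centered on an $s\times s$ grid, and `sigma_t` is a free parameter because the source does not report its value. All function names are hypothetical.

```python
import numpy as np

def gaussian_kernel(s, sigma_p):
    """Weights exp(-||u||^2 / (2 sigma_p^2)) over an s x s patch grid."""
    r = np.arange(s) - (s - 1) / 2.0
    uy, ux = np.meshgrid(r, r, indexing="ij")
    return np.exp(-(ux ** 2 + uy ** 2) / (2.0 * sigma_p ** 2))

def weighted_ssd(p1, p2, kernel):
    """D_w: Gaussian-weighted sum of squared differences of two patches."""
    d2 = (p1 - p2) ** 2
    if d2.ndim == 3:  # color patches: sum squared differences over channels
        d2 = d2.sum(axis=-1)
    return float(np.sum(kernel * d2))

def nlm_alpha(ref_patch, neighbors, T, sigma_t, gamma=0.9, s=8):
    """Multi-frame NLM estimate of the alpha patch at the reference location.

    neighbors: iterable of (t, image_patch, alpha_patch), one entry for each
    of the K matches found in each frame t in [T-2, T+2].
    """
    kernel = gaussian_kernel(s, sigma_p=s / 2.0)
    num = np.zeros((s, s))
    omega = 0.0
    for t, patch, alpha_patch in neighbors:
        # Temporal decay times patch-similarity weight.
        w = gamma ** abs(t - T) * np.exp(
            -weighted_ssd(ref_patch, patch, kernel) / (2.0 * sigma_t ** 2))
        num += w * alpha_patch
        omega += w
    return num / omega  # omega is the normalization constant

# Toy usage: two identical-looking matches, one in frame T and one in T+1.
ref = np.zeros((8, 8))
nbrs = [(3, np.zeros((8, 8)), np.full((8, 8), 0.2)),
        (4, np.zeros((8, 8)), np.full((8, 8), 0.8))]
alpha_patch = nlm_alpha(ref, nbrs, T=3, sigma_t=0.1)
```

In the toy usage both matches have zero patch distance, so only the temporal decay separates them: the weights are 1 and 0.9, and every entry of `alpha_patch` equals (0.2 + 0.9 × 0.8) / 1.9.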

III Experimental Results

The effectiveness of the proposed method is evaluated on the dedicated video matting dataset used in [8, 10]. It contains sequences covering a wide range of pixel opacity variations and challenges such as occlusion and low contrast. Trimaps are generated on each frame using the method of [18]. $\lambda$ for sparse coding is set to 0.1. In all experiments, the patch size is set to $8\times 8$, and $K$ is set to 5 in eq. (6). The proposed method is compared both quantitatively and qualitatively against recent video matting approaches, namely Snapcut (SC) [11], Bai et al. (BA) [9], Ehsan et al. (EH) [8] and Johnson et al. (JO) [10].

III-A Qualitative comparison

Fig. 4 and Fig. 5 show the visual comparison of the proposed method against recent video matting methods [8, 10, 9, 11] on the Amira, Face and Woman sequences from the dataset. Additional comparisons and video are available at https://goo.gl/Ho5xMN. The yellow arrows indicate the regions of discontinuity between consecutive frames. The low contrast between the foreground and the background in Amira is a challenging scenario for most matting algorithms. Laplacian-based smoothing used by existing methods produces ambiguity near the boundaries, as pixel neighbors tend to be unreliable for accurate propagation. The use of patch-based neighbors in the proposed method enables us to remove such artifacts in the final matte. The sequences in Fig. 5 contain cluttered backgrounds, which are challenging for most sampling-based algorithms. When compared to the ground truth, our error-based matte accurately distinguishes between the hair and the background, showing its effectiveness in highly textured regions.

III-B Quantitative evaluation

We perform quantitative comparison to evaluate the temporal coherence of the extracted mattes by measuring the difference in alpha values between successive frames as in [19]. The temporal flicker at the $i^{th}$ pixel in frame $t$ is measured as

$$f_{i}(t)=\frac{|\alpha_{i}(t+1)-\alpha_{i}(t)|}{|I_{i}(t+1)-I_{i}(t)|},\tag{9}$$

where $\alpha_{i}(t)$ and $I_{i}(t)$ are the alpha and RGB color values at pixel $i$ in frame $t$ .
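The flicker measure can be computed per pixel along the following lines. Two details are assumptions, since the source does not spell them out: $|I_{i}(t+1)-I_{i}(t)|$ is taken as the Euclidean norm of the RGB difference, and a small `eps` is added to the denominator so that static pixels do not cause a division by zero.

```python
import numpy as np

def temporal_flicker(alpha_t, alpha_t1, img_t, img_t1, eps=1e-3):
    """Per-pixel flicker |alpha(t+1) - alpha(t)| / |I(t+1) - I(t)|.

    alpha_t, alpha_t1: (H, W) mattes of consecutive frames.
    img_t, img_t1: (H, W, 3) RGB frames.
    The RGB norm and the eps guard are assumptions of this sketch.
    """
    d_alpha = np.abs(alpha_t1 - alpha_t)
    d_img = np.linalg.norm(img_t1 - img_t, axis=-1)
    return d_alpha / (d_img + eps)

# Toy frames: alpha changes by 0.1 while every RGB channel changes by 0.2.
a0, a1 = np.zeros((2, 2)), np.full((2, 2), 0.1)
i0, i1 = np.zeros((2, 2, 3)), np.full((2, 2, 3), 0.2)
flicker = temporal_flicker(a0, a1, i0, i1)
mean_flicker = float(flicker.mean())
```

Averaging `flicker` over pixels and frame pairs gives a single jitter score per sequence; how the reported per-sequence means were aggregated is not detailed in the source.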

Table I compares the mean temporal jitter error of the proposed method with recent video matting approaches across 15 sequences in the dataset. As can be seen, the proposed method produces the least temporal jitter on most of the sequences. In the few exceptions, the reconstruction error is unreliable because the true $F$ and $B$ samples are not present in the dictionary, leading to poor initial estimates. Apart from the smooth reconstruction error formulation, the use of patch-based coherency sensitive hashing is instrumental in the improved performance of the proposed method. The method of [10] uses pixel neighbors in its graph formulation, which can be erroneous due to noise. Moreover, the smoothness of alpha is not maintained in the feature vector for sparse coding in [10].
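The claim about "most of the sequences" can be checked directly against the Table I values. The short script below transcribes the table and tallies, for each sequence, which method has the lowest jitter; only the variable names are introduced here.

```python
# Temporal jitter values transcribed from Table I, in the column order
# SC [11], BA [9], EH [8], JO [10], Proposed.
methods = ("SC", "BA", "EH", "JO", "Proposed")
table1 = {
    "Face":      (3.46, 2.92, 4.26, 2.37, 1.49),
    "Dancer":    (4.72, 4.13, 1.48, 2.13, 1.46),
    "Arm":       (4.43, 2.91, 2.54, 3.52, 1.58),
    "Woman":     (4.03, 2.72, 3.36, 2.82, 2.05),
    "Smoke":     (3.17, 2.96, 1.80, 4.85, 2.19),
    "Cat":       (2.54, 4.18, 2.45, 4.41, 1.40),
    "Chimp":     (3.54, 4.63, 2.90, 2.09, 1.81),
    "Girl":      (4.55, 4.34, 2.31, 2.12, 1.65),
    "Whitegoat": (3.72, 3.85, 3.47, 2.17, 1.76),
    "Amira":     (4.18, 4.27, 2.72, 2.09, 1.72),
    "Girl2":     (4.54, 4.40, 2.18, 2.0, 1.86),
    "Office":    (3.94, 3.64, 2.94, 2.05, 2.41),
    "Soccer":    (4.05, 3.24, 2.31, 2.59, 2.79),
    "Unicorn":   (3.23, 3.21, 2.91, 3.34, 2.28),
    "Dog":       (4.09, 3.62, 3.5, 2.04, 1.73),
}

# Method with the lowest jitter on each sequence.
best = {video: methods[row.index(min(row))] for video, row in table1.items()}
wins = [video for video, m in best.items() if m == "Proposed"]
exceptions = {video: m for video, m in best.items() if m != "Proposed"}

print(len(wins), "of", len(table1))  # 12 of 15
print(exceptions)  # {'Smoke': 'EH', 'Office': 'JO', 'Soccer': 'EH'}
```

The proposed method has the lowest jitter on 12 of the 15 sequences; the exceptions are Smoke and Soccer (EH is lowest) and Office (JO is lowest).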

Runtime performance: Table II compares the running time of the proposed method with other sampling-based video matting approaches. MATLAB implementations were evaluated on a PC with an Intel Xeon 3.2 GHz processor. The proposed method performs comparably to the current approaches without compromising the quality of the matte.

IV Conclusion

We present a novel video matting framework that treats the matting problem from the perspective of the reconstruction error of a feature. Foreground and background dictionaries, whose bases are used to reconstruct an unknown feature vector with L1-regularization, are used to measure the error with respect to $F$ and $B$, respectively. An NLM framework integrated across multiple frames is also proposed to ensure temporal coherence in the video mattes. Experimental evaluations demonstrate that the proposed method has advantages over current matting methods that use Laplacian-based smoothing.

Bibliography

[1] J. Wang and M. F. Cohen, “Optimized color sampling for robust matting,” in Proc. IEEE CVPR, 2007, pp. 1–8.
[2] K. He, C. Rhemann, C. Rother, X. Tang, and J. Sun, “A global sampling method for alpha matting,” in Proc. IEEE CVPR, 2011, pp. 2049–2056.
[3] E. Shahrian, D. Rajan, B. Price, and S. Cohen, “Improving image matting using comprehensive sampling sets,” in Proc. IEEE CVPR, 2013, pp. 636–643.
[4] J. Johnson, D. Rajan, and H. Cholakkal, “Sparse codes as alpha matte,” in Proc. BMVC, 2014.
[5] A. Levin, D. Lischinski, and Y. Weiss, “A closed-form solution to natural image matting,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 30, no. 2, pp. 228–242, 2008.
[6] Q. Chen, D. Li, and C.-K. Tang, “KNN matting,” in Proc. IEEE CVPR, 2012, pp. 869–876.
[7] D. Cho, Y.-W. Tai, and I. Kweon, “Natural image matting using deep convolutional neural networks,” in Proc. ECCV, 2016, pp. 626–643.
[8] E. Shahrian, B. Price, S. Cohen, and D. Rajan, “Temporally coherent and spatially accurate video matting,” Computer Graphics Forum, vol. 33, no. 2, pp. 381–390, 2014.