Revisiting Graph Construction for Fast Image Segmentation

Zizhao Zhang; Fuyong Xing; Hanzi Wang; Yan Yan; Ying Huang; Xiaoshuang Shi; Lin Yang

arXiv:1702.05650·cs.CV·December 5, 2017

TL;DR

This paper introduces a new graph construction method for fast image segmentation that leverages local and global region relationships, leading to improved efficiency and competitive accuracy on standard benchmarks.

Contribution

The authors propose a novel graph construction approach based on co-occurrence and saliency, along with an energy function for efficient graph partitioning in image segmentation.

Findings

1. Achieves competitive segmentation accuracy on the BSDS500, PASCAL VOC, and COCO datasets.

2. Significantly improves computational efficiency over existing methods.

3. Produces effective multi-class segmentation driven by eigenvector histogram representations.

Tables

Table 1: Comparison of segmentation results and runtime on the BSDS500 dataset.

Method	Covering ODS	Covering OIS	PRI ODS	PRI OIS	VoI ODS	VoI OIS	Time (s)
NCut [6]	.45	.53	.78	.80	2.23	1.89	-
Felz-Hutt [1]	.52	.57	.80	.82	2.21	1.87	-
Mean Shift [60]	.54	.58	.79	.81	1.85	1.64	-
ISCRA [61]	.59	.66	.82	.85	1.60	1.42	30
gPb-owt-ucm [11]	.59	.65	.83	.85	1.69	1.48	240
cPb-owt-ucm [4]	.59	.65	.83	.86	1.65	1.45	> 240
red-spectral [37]	.56	.62	.83	.85	1.78	1.56	~12
DC-Seg [33]	.58	.63	.82	.85	1.75	1.59	6
DC-Seg_full [33]	.59	.64	.82	.85	1.68	1.54	144
PMI_low [31]	.61	.66	.83	.86	1.58	1.42	30⋆
MCG [12]	.61	.66	.83	.86	1.57	1.39	18
PFE + ucm [7]	.61	.66	.83	.86	1.64	1.46	> 900 · b†
PFE + MCG [7]	.62	.68	.84	.87	1.56	1.36	> 900 · b†
Ours	.62	.66	.83	.86	1.59	1.43	9

Table 2: Running time (in seconds) of each phase of the proposed method.

Phase	Min	Max	Mean	Var.
1: Region structure generation	2.60	3.73	3.08	0.05
2: Graph construct. and partition	3.30	6.89	4.51	0.56
3: Multi-class segmentation	1.39	2.26	1.71	0.03
Total	7.54	12.6	9.30	1.00

Table 3: Segmentation results under different configurations. The region generation baseline used by our method is indicated in $(\cdot)$. The last row is the result of our method (SE+ucm) when $E_{global}$ is not applied. Please see text for detailed explanations.

Method	Covering ODS	Covering OIS	PRI ODS	PRI OIS	VoI ODS	VoI OIS
SemiContour+ucm	.56	.63	.82	.85	1.79	1.57
Ours (SemiContour+ucm)	.60	.64	.83	.85	1.68	1.50
MCG	.61	.66	.83	.86	1.57	1.39
Ours (MCG)	.62	.66	.84	.86	1.57	1.40
SE+ucm	.59	.64	.83	.86	1.71	1.51
Ours (SE+ucm)	.62	.66	.83	.86	1.59	1.43

Equations

$$CO(S_{i},S_{j})=\log\frac{1}{A}P(S_{i},S_{j}),$$

$$\min_{\bm{y}}\sum_{i=1}^{N}\sum_{j=1}^{N}\|y_{i}-y_{j}\|^{2}W_{ij},\quad s.t.\ \bm{y}^{T}D\bm{y}=1,$$

$$W_{ij}=\exp\Big(\sum_{o}CO(S_{i}^{f_{o}},S_{j}^{f_{o}})\Big),$$

$$\min_{\bm{y}}\sum_{i=1}^{N}\Big\|y_{i}-\sum_{j\neq i}K_{ij}y_{j}\Big\|^{2},\quad s.t.\ \bm{y}^{T}\bm{y}=1,$$

$$\begin{split}s_{ij}&=\big(\min(|w_{S_{i}}-w_{S_{j}}|,I_{w}-|w_{S_{i}}-w_{S_{j}}|)^{2}\\ &+\min(|h_{S_{i}}-h_{S_{j}}|,I_{h}-|h_{S_{i}}-h_{S_{j}}|)^{2}\big)^{\frac{1}{2}},\end{split}$$

$$\min_{K_{i}}\Big\|\sigma(S_{i})-\sum_{j\neq i}K_{ij}\sigma(S_{j})\Big\|^{2}+\alpha\,Tr(K_{i}^{T}K_{i}),\quad s.t.\ \sum_{j}K_{ij}=1.$$

$$E=E_{local}+\mu E_{global}=\sum_{i=1}^{N}\sum_{j=1}^{N}\|y_{i}-y_{j}\|^{2}W_{ij}+\mu\sum_{i=1}^{N}\Big\|y_{i}-\sum_{j\neq i}K_{ij}y_{j}\Big\|^{2}=\bm{y}^{T}(D-W+\mu M)\bm{y},$$

$$(D-W+\mu M)\bm{y}=\lambda D\bm{y},$$

$$\Theta(\bm{p})=\sum_{i=1}^{2{\cdot}N{+}1}U(p_{i})+\sum_{(i,j)\in\mathcal{N}(\mathcal{S})}B(S_{i},S_{j}),\quad p_{i}\in\{0,1,...,L\},$$

$$\forall i\neq j,\ S_{i},S_{j}\in\mathcal{S^{+}},\ \text{if}\ S_{i}\cap S_{j}\neq\emptyset,\ \text{then}\ p_{i}\cdot p_{j}=0,$$

$$U_{p_{i}=k}=-\beta\cdot\langle\mathcal{H}(S_{i}),\mathcal{H}(\mathcal{Z}_{k})\rangle,\quad S_{i}\in\mathcal{S^{+}},\ \mathcal{Z}_{k}\in\mathcal{Z},$$

Taxonomy

Topics: Visual Attention and Saliency Detection · Advanced Image and Video Retrieval Techniques · Advanced Neural Network Applications

Full text

Revisiting Graph Construction for Fast Image Segmentation

Zizhao Zhang1, Fuyong Xing2, Hanzi Wang4, Yan Yan4, Ying Huang4, Xiaoshuang Shi3, Lin Yang1,2,3,⋆

1Dept. of Computer and Information Science and Engineering, University of Florida, FL 32611, USA

2Dept. of Biostatistics and Informatics, Colorado School of Public Health, University of Colorado Denver, Denver, CO 80045 USA

3J. Crayton Pruitt Family Dept. of Biomedical Engineering, University of Florida, FL 32611, USA

4 Fujian Key Laboratory of Sensing and Computing for Smart City, School of Information Science and Engineering, Xiamen University, Fujian 361005, China

Abstract

In this paper, we propose a simple but effective method for fast image segmentation. We re-examine the locality-preserving character of spectral clustering by constructing a graph over image regions with both global and local connections. Our novel approach to building graph connections relies on two key observations: 1) local region pairs that co-occur frequently have a high probability of residing on a common object; 2) spatially distant regions in a common object often exhibit similar visual saliency, which implies that they are neighbors on a manifold. We present a novel energy function to efficiently conduct graph partitioning. Based on multiple high-quality partitions, we show that the generated eigenvector histogram based representation can automatically drive effective unary potentials for a hierarchical random field model to produce multi-class segmentation. Extensive experiments on the BSDS500 benchmark and the large-scale PASCAL VOC and COCO datasets demonstrate the competitive segmentation accuracy and significantly improved efficiency of our proposed method compared with other state-of-the-art methods.

Keywords: Image segmentation, Graph partition, Manifold

Journal: Pattern Recognition

1 Introduction

Image segmentation is a challenging and critical computer vision task. Graph-based algorithms have been shown to be effective for image segmentation [1, 2, 3]. Among the various graph-based approaches, spectral clustering has become a major trend [4, 5].

Recent methods attempt to solve several primary issues of image segmentation based on spectral clustering (specifically, normalized cuts (NCut) [6]) so that images are segmented into meaningful partitions. First, NCut-based methods tend to segment an image into spatially connected components [6, 7]. Multiscale processing [8, 9] is a common way to address this problem by building affinities between distant pixels [10, 9]. However, the applicability of these methods to real large-scale datasets is not clear, and most current cutting-edge methods do not follow this direction. Instead, recent methods such as gPb [11] and MCG [12, 7] based methods [13] use the boundary-preserving property of NCut to trace boundary orientation information rather than to segment directly. Second, building effective affinity matrices [7, 4, 12] usually relies on sophisticated low-level features [11]. These features can effectively measure local changes but are not effective in capturing high-level knowledge for segmentation. Their high computational cost also makes them poor options for fast segmentation [11]. Different from previous approaches, our method re-examines spectral clustering from a manifold learning perspective and constructs a graph that models high-level image knowledge (i.e., pixel pair co-occurrence and saliency relationships) for unsupervised image segmentation. More importantly, our method makes it possible for graph partitioning to directly segment challenging natural images rather than just trace boundaries.

To better illustrate the motivation, we first explain the latent relation of NCut to manifold learning. Both NCut and Laplacian eigenmaps [14] take advantage of the locality-preserving character [15] of the graph Laplacian to conduct clustering and dimensionality reduction. In fact, locality-based dimensionality reduction methods are implicitly tied to clustering [14, 16]. Preserving locality is the key factor that drives effective clustering. Let us assume that the pixels of an image lie on a certain manifold where pixels belonging to a common object are adjacent (within a small range) but may be far apart in the spatial image plane. These pixels are supposed to have strong connections so that they are grouped together, but these connections are not encoded in the sparse affinity matrix of NCut because of their large Euclidean distances. Although multi-scale affinity matrices [8, 10] can alleviate this issue, increasing the range of an affinity matrix and connecting all pixel pairs in that range also introduces unavoidable noise. The affinities between spatially distant pixels and those between spatially adjacent pixels should be constructed separately in order to better capture their respective characteristics in image statistics.

In this paper, we propose a novel approach to construct an image region graph that addresses the aforementioned problems. The overall idea is illustrated in Figure 1. The graph nodes are connected among both spatially adjacent and spatially distant regions through different and independent cues. We build local connections between spatially adjacent regions with an affinity matrix. The similarity between two regions is estimated based on the observation that adjacent region pairs that co-occur frequently often reside on a common object. In contrast, global connections are built among regions that are adjacent on the manifold but might be spatially distant, with the objective of preserving their relationships and encouraging them to be clustered together. We introduce a simple cue that discovers the similar saliency of those regions as the global connection measurement. We present a new energy function to partition the constructed graph, which formulates the minimization problem as a single and efficiently solvable eigenvector system. Based on the generated high-quality graph partitions, we present a simple eigenvector histogram based representation of image regions that automatically drives effective unary potentials for the hierarchical random field of the Pylon model [17], yielding high-quality multi-class segmentation.

In brief, the contributions of this paper are:

1. We propose a richly connected graph over image regions that nevertheless allows very efficient graph partitioning.

2. We exploit various simple and efficient cues to capture high-level image information in order to segment objects with complex inner variances and backgrounds.

3. We present a multi-class segmentation strategy that utilizes graph partitions to generate clear and smooth segmentation.

4. Extensive experiments and comprehensive analysis are conducted on the BSDS500 [11], large-scale PASCAL VOC [18], and COCO [19] datasets to validate the effectiveness of our proposed method, its generalization ability to different datasets with diverse scenes, and its high efficiency compared with other state-of-the-art methods.

The rest of the paper is structured as follows: Section 2 discusses the related work. Section 3 introduces the graph construction and partition of our method. Section 4 introduces the proposed multi-class segmentation, which utilizes the graph partitions. Section 5 presents experiments and detailed analysis. Finally, Section 6 concludes the paper.

2 Related Works

Image segmentation has been studied in the computer vision community for decades. Shi et al. [6] propose normalized cuts (NCut), which advanced spectral clustering based image region segmentation, and [20] extends it to multi-class segmentation. Among region-based segmentation methods, diffusion-based approaches [21, 22], GraphCut [23], GrabCut [23], and others [24, 25, 26, 27] have been explored to partition images. Building successful affinity matrices is critical [28]. Many subsequent approaches have computed more effective affinity matrices using elaborately designed low-level features and metrics [4, 10, 11, 29]. To overcome the limitation of NCut in capturing affinities between distant pixels, several methods [8, 9, 29, 30] have been proposed based on multi-scale affinity strategies. However, dense affinities suffer from an optimization bottleneck, although approximation algorithms have been explored [10, 9, 12]. Our method is able to capture both local and global affinities while keeping the affinity matrix sparse.

Contour-driven image region segmentation is widely studied. Arbelaez et al. [11] propose the globalized probability of boundary (gPb), which utilizes the boundary-preserving characteristic of NCut with sophisticatedly designed features to detect object boundaries and incorporates them into the oriented watershed transform and ultrametric contour map (OWT-UCM) to conduct image segmentation. This approach has become the foundation of many subsequent segmentation approaches [31, 4, 32, 33, 12, 13]. Kim et al. [34] formulate a hypergraph-based model and perform correlation clustering for image segmentation. Recently, Yu et al. [7] minimize an $\ell_{1}$-normed energy function of NCut to obtain piecewise smooth embeddings for gPb-owt-ucm [11], which obtains state-of-the-art image segmentation performance. However, all these methods suffer from expensive computations for feature extraction or optimization. Speed issues are considered in several subsequent works. Multiscale combinatorial grouping (MCG) [12] segments images with multi-scale UCMs and uses more advanced edge detection methods [35] to largely reduce the computational bottleneck of NCut used by gPb-owt-ucm. Chen [13] provides a solution to the scale alignment in MCG. Even so, processing a single $321\times 481$ image sometimes requires several minutes, which significantly limits practical use. On the other hand, Pont-Tuset et al. [36] propose a downsampled approximation algorithm to accelerate graph partitioning and use richer information in multiscale UCMs. Taylor et al. [37] and many others [4, 34] reduce the size of the affinity matrix using superpixel techniques. In this paper, we develop a method that is much faster than the aforementioned methods with competitive accuracy.

Edge detection plays an extremely important role in region-based image segmentation [38, 39, 40, 41, 42]. For example, Convolutional Oriented Boundaries (COB) proposes an accurate boundary detection method using convolutional neural networks (CNNs) and combines it with [36] to perform image and object segmentation. Another popular image segmentation direction is semantic segmentation. Current methods use CNNs [43, 44, 45] to predict the semantic label of each pixel. These methods rely on large-scale training data. In contrast, our method aims to partition images into regions that accurately segment objects by observing the internal statistics of each image in an unsupervised manner.

Designing features to build the affinity between pixels or regions is important. Several studies have explored different cues, such as sophisticated combinations of mixed image features [11], texture information [46], or saliency [47]. Different from these low-level image features, we argue that high-level cues are equally important and sometimes even more effective. For example, co-occurrence statistics learned from training data have been used to capture semantic object context knowledge that helps inference in, for example, conditional random fields (CRFs) [48]. Different from this direction of research, our approach models region-wise co-occurrence probability based on pointwise mutual information [49], learned from the image itself, to build the local connections of our proposed graph.

Laplacian eigenmaps [50] computes a low-dimensional embedding to preserve the pairwise affinity of data points in the manifold. Local linear embedding (LLE) [51], alternatively, preserves the linear structure among the local neighboring points. The locality-preserving character of these two methods implicitly encourages the clustering of data. However, Isomap [52], which preserves global data geodesic distances, does not possess the nature of clustering. Our method shows a distinct point of view on the side of manifold learning to enhance spectral clustering for image segmentation.

3 Global-local Connected Graph Partitioning

In this section, we present the approach to build the local and global connections of the graph. Then we introduce the proposed energy function to partition the constructed graph.

3.1 Local connection with co-occurrence cues

Our proposed method begins with an over-segmentation with a set of regions, defined as $\mathcal{S}=\{S_{1},...,S_{N}\}$ . The over-segmentation is favorable considering its local spatial consistency and computational efficiency. Denote the graph by $\mathcal{G}_{local}=\{\mathcal{S},W\}$ , where $W\in\mathbb{R}^{N\times N}$ is the affinity matrix with each entry $W_{ij}$ representing the affinity between regions $S_{i}$ and $S_{j}$ . $W$ is sparse such that only spatially adjacent region pairs within a small range have nonzero values. Given a test image with large appearance variations inside the object (see Figure 2), a desirable affinity matrix should be able to discover the strong affinity between two visually different neighboring regions belonging to the common object. However, it is difficult for low-level features to achieve this goal because of their limitations in learning high-level knowledge.

One type of high-level knowledge comes from the fact that a neighboring region pair residing on an object is more likely to co-occur (i.e., have a high joint probability) due to the color patterns inside the object [31], such as the stripe pattern on the clothes in the images of Figure 2. If we treat the regions $\{S_{i}|i=1,...,N\}$ as random variables, we can define the co-occurrence of two regions as

$$CO(S_{i},S_{j})=\log\frac{1}{A}P(S_{i},S_{j}), \tag{1}$$

where $P(S_{i},S_{j})$ is the joint probability over $S_{i}$ and $S_{j}$ . Let $A=P(S_{i})P(S_{j})$ represent a normalization term, which is crucial to penalize the high-biased $P(S_{i},S_{j})$ of background region pairs against foreground object region pairs, because the background usually occupies a larger proportion of the image than foreground objects. This normalization term eliminates the imbalance accordingly. In addition, $CO$ also contains information about object boundaries, because a region pair across an object boundary is a small-probability event [31].

We estimate $P(S_{i},S_{j})$ and the marginal distribution $P(S_{i})$ by using a nonparametric kernel density estimator [53] following [31]. Differently from [31], however, we densely sample pairs formed by each region and its adjacent regions within a certain distance (denoted as $e_{1}$ ) without repetition (which means $P(S_{1},S_{2})=P(S_{2},S_{1})$ ). Basically, we place estimator kernels on all regions $\{S_{i}\}$ and compute the co-occurrence probability of the image feature (gray values) over all region pairs, so that each feature value pair has a co-occurrence frequency. We then simply normalize these frequencies and obtain $P(S_{i})$ and the final co-occurrence cue $CO(S_{i},S_{j})$ .
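
The estimation of Eq. (1) can be sketched roughly as follows. This is only an illustration under stated assumptions: it replaces the kernel density estimator of [53] with a plain frequency histogram over quantized gray values, and the function name `cooccurrence_cue`, the single gray value per region, and the bin count are hypothetical choices, not details from the paper.

```python
import math
from collections import Counter

def cooccurrence_cue(values, pairs, n_bins=8):
    """Return {(i, j): CO} with CO = log(P(a, b) / (P(a) * P(b))).

    values: one gray value in [0, 1] per region (assumed region feature).
    pairs:  unordered adjacent region index pairs, each listed once.
    NOTE: a frequency histogram stands in for the kernel density estimator.
    """
    bins = [min(int(v * n_bins), n_bins - 1) for v in values]
    joint = Counter()
    for i, j in pairs:
        a, b = bins[i], bins[j]
        # Count both orders so that P(a, b) == P(b, a).
        joint[(a, b)] += 1
        joint[(b, a)] += 1
    total = sum(joint.values())
    # Marginal P(a): sum the joint counts over the second argument.
    marginal = Counter()
    for (a, _), count in joint.items():
        marginal[a] += count
    co = {}
    for i, j in pairs:
        a, b = bins[i], bins[j]
        p_ab = joint[(a, b)] / total
        p_a = marginal[a] / total
        p_b = marginal[b] / total
        co[(i, j)] = math.log(p_ab / (p_a * p_b))
    return co
```

Under this sketch, pairs whose feature values co-occur more often than their marginals predict receive a positive cue, and rare pairs (such as pairs straddling a boundary) receive a negative one.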

Our approach shares some similarities with [31] (denoted as PMI) in using pointwise mutual information, but differs from PMI in several respects. PMI is interested in low pixel-wise joint probabilities in order to discover rare boundaries, whereas we are interested in high region-wise probabilities while maintaining the boundary detection ability of PMI. PMI relies on raw image pixels, and its probability in Eq. (1) is estimated over a limited number of randomly sampled pixels. We rely on coherent regions to estimate this probability over most adjacent region pairs, which yields an estimate closer to the actual distribution for the regions, and the estimation process is much less computationally expensive.

Energy function: The first term $E_{local}$ in our proposed energy function will encourage frequently co-occurring region pairs to be clustered into a group, and vice versa. Minimizing $E_{local}$ is defined as the following:

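$$\min_{\bm{y}}\sum_{i=1}^{N}\sum_{j=1}^{N}\|y_{i}-y_{j}\|^{2}W_{ij},\quad s.t.\ \bm{y}^{T}D\bm{y}=1, \tag{2}$$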

where $D$ is a diagonal matrix and its $i$ -th diagonal element is $d_{ii}=\sum_{j}W_{ij}$ . The constraint is the key to normalizing the cut of the graph. Minimizing $E_{local}$ forces $y_{i}$ and $y_{j}$ to take similar values when $W_{ij}$ is large. $\bm{y}=[y_{1},...,y_{N}]^{T}$ is a real-valued vector, which is interpreted as a binary graph partition in NCut or a one-dimensional embedding in Laplacian eigenmaps. $W_{ij}$ is defined as

$$W_{ij}=\exp\Big(\sum_{o}CO(S_{i}^{f_{o}},S_{j}^{f_{o}})\Big), \tag{3}$$

where the superscript $f_{o}$ specifies a feature representation of the corresponding region. For each region, we calculate the pixel mean of Lab color space and the diagonal values of the RGB color covariance matrix in a $3\times 3$ window. $W_{ij}$ is computed between $S_{i}$ and $S_{j}$ within a certain distance apart, denoted as $e_{2}$ ( $e_{2}>e_{1}$ ).
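
As a rough illustration of how the affinity and degree matrices might be assembled from per-feature co-occurrence cues, consider the sketch below. The dictionary layout of `co_per_feature` and the helper names are assumptions made for the example, and a dense list-of-lists stands in for what would be a sparse matrix in practice.

```python
import math

def affinity_matrix(n, co_per_feature):
    """Build W with W_ij = exp(sum_o CO_o(i, j)) for the listed pairs.

    co_per_feature: list of dicts {(i, j): CO value}, one dict per feature o
                    (hypothetical layout). Unlisted pairs keep W_ij = 0.
    """
    sums = {}
    for co in co_per_feature:
        for (i, j), value in co.items():
            key = (min(i, j), max(i, j))   # treat pairs as unordered
            sums[key] = sums.get(key, 0.0) + value
    W = [[0.0] * n for _ in range(n)]
    for (i, j), s in sums.items():
        W[i][j] = W[j][i] = math.exp(s)    # symmetric, undirected connection
    return W

def degrees(W):
    """Diagonal entries of D: d_ii = sum_j W_ij."""
    return [sum(row) for row in W]
```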

The affinity matrix $W$ is designed to measure the similarity between spatially adjacent region pairs based on their latent co-occurrence statistics. In order to preserve the ignored relationships among spatially distant regions in the common object, we propose an additional energy term by building the global connections of the graph in the following section.

3.2 Global connection with saliency cues

The graph associated with global connections is denoted by $\mathcal{G}_{global}=\{\mathcal{S},K\}$ . Our approach strengthens the locality-preserving character by discovering the underlying linear structures among spatially distant regions (i.e., each region can be linearly represented by several neighboring regions so that the global connections are directed) belonging to a common object, while these regions are adjacent on a certain manifold. This goal is achieved by minimizing the second energy term $E_{global}$ :

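$$\min_{\bm{y}}\sum_{i=1}^{N}\Big\|y_{i}-\sum_{j\neq i}K_{ij}y_{j}\Big\|^{2},\quad s.t.\ \bm{y}^{T}\bm{y}=1, \tag{4}$$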

where $K\in\mathbb{R}^{N\times N}$ is the coefficient matrix with $R$ non-zero entries in each row to specify the linear combination coefficients of the representing neighbors. The constraint avoids degenerated solutions. ${\bm{y}}$ is interpreted as an embedding in the original locally linear embedding (LLE) [51] method. Note that both $\bm{y}$ and $K$ are unknown; minimizing this energy function consists of three steps: 1) finding $R$ neighbors for each region, 2) computing coefficient matrix $K$ , and 3) computing $\bm{y}$ .

Geodesic distance based neighbors: For each region, we consider its candidate neighbors from all regions within a large range of the defined geodesic distance, such that the distance ( $s_{ij}$ ) between regions $S_{i}$ and $S_{j}$ is defined as follows:

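$$\begin{split}s_{ij}&=\big(\min(|w_{S_{i}}-w_{S_{j}}|,I_{w}-|w_{S_{i}}-w_{S_{j}}|)^{2}\\ &+\min(|h_{S_{i}}-h_{S_{j}}|,I_{h}-|h_{S_{i}}-h_{S_{j}}|)^{2}\big)^{\frac{1}{2}},\end{split} \tag{5}$$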

where $w_{S_{i}}$ and $h_{S_{i}}$ denote the spatial $x$ - and $y$ -coordinates of region $S_{i}$ , respectively. $I_{w}$ and $I_{h}$ denote the width and height of the image, respectively. Intuitively, this metric treats the image as if it was wrapped along its four corners into a sphere and describes the geodesic distance along this resulting surface. The measurement can trace the connections of the regions belonging to foreground objects or background with an arbitrary shape and range.
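
The wrap-around distance can be written in a few lines. In this sketch the region coordinates are taken to be centroids, which is an assumption (the text only speaks of the spatial coordinates of a region), and `wraparound_distance` is a hypothetical name.

```python
import math

def wraparound_distance(p, q, width, height):
    """Distance s_ij between region coordinates p and q on an image that
    wraps around in both the horizontal and vertical directions."""
    dx = abs(p[0] - q[0])
    dy = abs(p[1] - q[1])
    dx = min(dx, width - dx)    # going across the left/right border may be shorter
    dy = min(dy, height - dy)   # same for the top/bottom border
    return math.hypot(dx, dy)
```

For instance, two regions near opposite corners of a $321\times 481$ image are far apart in the plain Euclidean sense but close under this metric, so they can still become candidate neighbors.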

For a region $S_{i}$ , we select $R$ nearest regions with each region represented as a feature vector calculated by a saliency cue mapping $\sigma$ . Then we find its coefficients $K_{i}$ by

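$$\min_{K_{i}}\Big\|\sigma(S_{i})-\sum_{j\neq i}K_{ij}\sigma(S_{j})\Big\|^{2}+\alpha\,Tr(K_{i}^{T}K_{i}),\quad s.t.\ \sum_{j}K_{ij}=1. \tag{6}$$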

The regularization term is necessary to prevent ill-conditioned solutions when neighboring regions have similar feature values (i.e., when the Gram matrix becomes singular). The regularization parameter is chosen as $\alpha=10^{-10}$ . The constraint ensures translation invariance.
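
The constrained least-squares problem for one region has the familiar closed form used in LLE [51]: with the sum-to-one constraint, the residual equals $\sum_{j}K_{ij}(\sigma(S_{i})-\sigma(S_{j}))$, so the weights are proportional to the solution of a regularized Gram system. The following is a minimal NumPy sketch of that step; the function name and array shapes are assumptions for illustration.

```python
import numpy as np

def reconstruction_weights(x, neighbors, alpha=1e-10):
    """Weights k minimizing ||x - sum_j k_j * neighbors[j]||^2 + alpha * ||k||^2
    subject to sum_j k_j = 1.

    x:         (F,) feature vector sigma(S_i).
    neighbors: (R, F) feature vectors sigma(S_j) of the R selected neighbors.
    """
    r = len(neighbors)
    diff = neighbors - x                  # rows are sigma(S_j) - sigma(S_i)
    gram = diff @ diff.T                  # local Gram matrix, shape (R, R)
    gram = gram + alpha * np.eye(r)       # regularizer keeps the system solvable
    k = np.linalg.solve(gram, np.ones(r))
    return k / k.sum()                    # enforce the sum-to-one constraint
```

The returned vector would fill the $R$ non-zero entries of row $i$ of $K$.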

Saliency cue complying linearity: Spatially distant regions inside the same object may have large appearance variances; for example, the face, hair, and clothes of a person exhibit totally different appearances (see Figure 2). Therefore, it is difficult to measure their latent similarity with traditional cues. However, those visually different regions usually exhibit a similar degree of saliency in the human visual system [54]. This characteristic remedies the “imperfection” of pairwise co-occurrence affinity and satisfies the requirement for building global connections. We take advantage of the empirical knowledge that salient objects in images have distinctive colors from the background under a certain linear combination of mixed color spaces [55]. To this end, we choose RGB, Lab, and the hue and saturation channels of HSV (8 channels) and their nonlinear transformations with gamma correction (with three gamma values, $[0.5,1.5,2.0]$ ) to account for the nonlinear responses of human vision, thereby yielding a 24-dimensional feature vector for each region, $\sigma:\mathcal{S}\mapsto\mathbb{R}^{24}$ .
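
One possible reading of the 24-dimensional mapping $\sigma$ is sketched below: each of the 8 channel means is raised to each of the three gamma values. The rescaling of every channel to $[0,1]$ before gamma correction, the channel ordering, and the function name are assumptions made for this illustration.

```python
GAMMAS = (0.5, 1.5, 2.0)

def saliency_feature(channels, gammas=GAMMAS):
    """Map 8 per-region channel means to a 24-dimensional vector.

    channels: [R, G, B, L, a, b, H, S] means, each assumed rescaled to [0, 1].
    """
    if len(channels) != 8:
        raise ValueError("expected 8 channel values")
    # One gamma-corrected copy of all 8 channels per gamma value.
    return [c ** g for g in gammas for c in channels]
```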

Our method of incorporating saliency into the graph connections is elaborate. Unlike the saliency detector of [55], we do not compute the coefficients explicitly from any supervised information. Since the correlation of each region feature vector $\sigma(S)$ is consistent under arbitrary linear transformations, the saliency characteristic between regions is implicitly expressed in Eq. (4).

3.3 Proposed energy function to partition graph

Overall, the proposed full graph is defined as $\mathcal{G}=\{\mathcal{S},(W,K)\}$ , where $W$ specifies the local undirected connections and $K$ specifies the global directed connections. Our goal is to pursue global partitioning of the graph $\mathcal{G}$ , i.e., minimizing the two energy terms simultaneously. Therefore, the energy function $E$ can be defined and derived as follows (detailed derivations are skipped):

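$$E=E_{local}+\mu E_{global}=\sum_{i=1}^{N}\sum_{j=1}^{N}\|y_{i}-y_{j}\|^{2}W_{ij}+\mu\sum_{i=1}^{N}\Big\|y_{i}-\sum_{j\neq i}K_{ij}y_{j}\Big\|^{2}=\bm{y}^{T}(D-W+\mu M)\bm{y}, \tag{7}$$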

where $(D-W)\in\mathbb{R}^{N\times N}$ is the Laplacian matrix and $M=(I-K)^{T}(I-K)\in\mathbb{R}^{N\times N}$ . $\mu$ is a regularization parameter to balance $E_{global}$ and $E_{local}$ . How to select the optimal value of $\mu$ is discussed in the experimental section.
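
For readers who want the skipped steps, the matrix form can be recovered as follows, assuming $W$ is symmetric and $K_{ii}=0$. The local term picks up a constant factor of 2 in this expansion; since rescaling the whole energy does not change its minimizer, that factor can be absorbed into $\mu$, which is presumably why it does not appear in the closed form above.

```latex
\begin{aligned}
E_{global} &= \sum_{i=1}^{N}\Big\| y_{i}-\sum_{j\neq i}K_{ij}y_{j}\Big\|^{2}
            = \|(I-K)\bm{y}\|^{2}
            = \bm{y}^{T}(I-K)^{T}(I-K)\bm{y}
            = \bm{y}^{T}M\bm{y},\\
E_{local}  &= \sum_{i=1}^{N}\sum_{j=1}^{N}W_{ij}(y_{i}-y_{j})^{2}
            = \sum_{i,j}W_{ij}\,(y_{i}^{2}+y_{j}^{2}) - 2\sum_{i,j}W_{ij}\,y_{i}y_{j}
            = 2\,\bm{y}^{T}D\bm{y} - 2\,\bm{y}^{T}W\bm{y}
            = 2\,\bm{y}^{T}(D-W)\bm{y}.
\end{aligned}
```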

It is straightforward to see that minimizing $E$ is to solve a generalized eigenvector system:

$$(D-W+\mu M)\bm{y}=\lambda D\bm{y}, \tag{8}$$

which produces a set of eigenvectors $\hat{Y}$ , where each column is an eigenvector representing a binary partition of the graph. In practice, the number of segments of an arbitrary test image is unknown and the expected partition is not guaranteed to be the eigenvector associated with the second smallest eigenvalue [6]. In Section 4, we present a novel approach to address this issue for multi-class segmentation.
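
A small dense sketch of solving this generalized eigenvector system with NumPy is given below. It assumes $W$ is symmetric with strictly positive row sums, reduces the problem to a standard symmetric one through $D^{-1/2}$, and skips the first eigenpair as the trivial constant solution; a practical implementation would more likely use a sparse solver. The function name is hypothetical.

```python
import numpy as np

def partition_eigenvectors(W, K, mu, d):
    """Solve (D - W + mu * M) y = lambda * D y and return the d smallest
    non-trivial eigenvalues and their eigenvectors as columns of an (N, d) array.

    W: (N, N) symmetric affinity matrix with positive row sums.
    K: (N, N) global coefficient matrix (rows sum to 1, zero diagonal).
    """
    n = W.shape[0]
    deg = W.sum(axis=1)
    i_minus_k = np.eye(n) - K
    M = i_minus_k.T @ i_minus_k
    A = np.diag(deg) - W + mu * M
    # Substitute y = D^{-1/2} z to obtain the symmetric problem B z = lambda z.
    inv_sqrt = 1.0 / np.sqrt(deg)
    B = inv_sqrt[:, None] * A * inv_sqrt[None, :]
    eigvals, Z = np.linalg.eigh(B)          # eigenvalues in ascending order
    Y = inv_sqrt[:, None] * Z               # map back; columns satisfy y^T D y = 1
    return eigvals[1:d + 1], Y[:, 1:d + 1]  # drop the trivial first eigenpair
```

On a toy graph made of two tightly connected pairs joined by one weak edge, the first returned eigenvector takes opposite signs on the two pairs, which is the binary partition one would expect.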

Leverage $E_{local}$ and $E_{global}$ : The two energy terms are designed for different purposes. $E_{local}$ preserves the pairwise similarity of spatially adjacent region pairs, while $E_{global}$ preserves the linear structure of spatially distant regions in a common object. Both emphasize locality preservation for the purpose of clustering (or graph partitioning). Compared with the hard constraint of $E_{local}$ , $E_{global}$ encourages soft (i.e., likelihood) clustering of the regions [14]. In Figure 2, we visualize several graph partition results of the proposed approach and compare them with NCut. As can be observed, the significantly improved graph partitioning quality demonstrates the effectiveness of the global connections introduced in $E_{global}$ .

4 Multi-class Segmentation

In this section, we introduce the approach to use graph partitions for multi-class segmentation.

4.1 EigenHistogram

We have computed a set of eigenvectors (i.e., image partitions) $\hat{Y}=[\hat{\bm{y}}_{1},...,\hat{\bm{y}}_{d}]\in\mathbb{R}^{N\times d}$ corresponding to the first $d$ smallest eigenvalues (excluding the zero eigenvalue) using Eq. (8). The $i$ -th region can now be represented as a $d$ -dimensional vector $S_{i}^{\hat{Y}}$ . The $k$ -means algorithm is applied to group regions into $L$ segments, $\mathcal{Z}=\{\mathcal{Z}_{k}\}_{k=1}^{L}$ , to produce a hard partition [4, 6]. To obtain more reliable multi-class segmentation that can be generalized to arbitrary images with different number of classes, we treat it as a prior segmentation to provide the class likelihood for the multi-class segmentation. Note that our method can deal with the number of segments regardless of pre-defined $L$ . We will discuss this in the experiments.

For each dimension of $S_{i}^{\hat{Y}}$ , we compute a histogram with $L$ bins uniformly spaced between $[0,1]$ based on the corresponding normalized eigenvector. As a consequence, a region will be represented as a $(d{\times}L)$ -dimensional concatenated histogram (we set $d{=}6$ empirically and we will discuss the parameter $L$ in the experimental section). For each segment $\mathcal{Z}_{k}$ , we accumulate and normalize the histograms of all regions belonging to this segment. We term this region representation as EigenHistogram (see Figure 3).
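
The construction can be sketched as follows, under one plausible reading: since each over-segmented region has a single normalized value per eigenvector dimension, its per-dimension histogram is one-hot, and a segment's representation is the accumulated and normalized sum of its members. The L1 normalization and the helper names are assumptions of this example.

```python
def region_histogram(embedding, n_bins):
    """Concatenated per-dimension histogram of length d * n_bins.

    embedding: the d eigenvector entries of one region, each normalized to [0, 1].
    """
    hist = [0.0] * (len(embedding) * n_bins)
    for dim, value in enumerate(embedding):
        b = min(int(value * n_bins), n_bins - 1)   # value == 1.0 goes to the last bin
        hist[dim * n_bins + b] = 1.0
    return hist

def accumulate(histograms):
    """Sum the member histograms and normalize to unit L1 mass (assumed norm)."""
    total = [sum(column) for column in zip(*histograms)]
    mass = sum(total)
    return [t / mass for t in total]
```

The same `accumulate` step could serve both for a segment $\mathcal{Z}_{k}$ and for an internal node of a region hierarchy built from its descendants.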

4.2 Multi-class segmentation

Considering the image regions as a random field, we are interested in incorporating the unary potential (the class likelihood) of each region, based on the prior segmentation $\mathcal{Z}$ , and the pairwise potentials between neighboring regions into a unified energy function to achieve a holistic multi-class segmentation. Numerous studies have investigated learning effective unary potentials for random field based algorithms via structured support vector machines [56, 17] or convolutional neural networks [57, 58] to perform semantic segmentation. In contrast, EigenHistogram can be treated as a high-level representation that possesses spatial consistency and is therefore intrinsically scalable to image segments of arbitrary size. Furthermore, it is easy and fast to compute and, unlike the methods of [56, 17, 57], requires no supervision.

Following the Pylon model [17], we can configure the regions into a hierarchical binary segmentation tree. Different from the traditional “flat” random field models [59, 2], each node in our tree structure stands for a region nested from bottom to top, which enables the features to be extracted at different levels of the hierarchy to enrich the feature representation of the segments. In total, the constructed tree has $2N{+}1$ regions (the root node is the whole image), $\mathcal{S^{+}}=\{S_{i}|i=1,...,2N{+}1\}$ , which define a hierarchical random field.

Our goal is to assign labels $\bm{p}=[p_{1},...,p_{(2{\cdot}N{+}1)}]^{T}$ to all regions in $\mathcal{S^{+}}$. Therefore, we minimize the following objective function:

[TABLE]

where $U(p_{i})$ is the unary potential of the region $S_{i}\in\mathcal{S^{+}}$ to specify the cost of assigning label $p_{i}$ to $S_{i}$ , and $B(S_{i},S_{j})$ is the pairwise potential to specify the boundary cost (exponentiated boundary strength [56]) between any two neighboring regions $(i,j)\in\mathcal{N}(\mathcal{S})$ in the child nodes, which is used to encourage the spatial smoothness. Note that $p_{i}$ is allowed to take a zero label such that it satisfies the non-overlapping requirement [17] by using the constraint:

[TABLE]

which ensures that any subtree can have only one single non-zero label.

Since we have clustered image regions into $L$ segments, the unary potential of region $S_{i}$ assigned to the $k$ -th segment has the cost:

[TABLE]

where $U_{p_{i}}$ is the unary potential of the region $S_{i}\in\mathcal{S^{+}}$ to specify the cost of assigning label $p_{i}$ to $S_{i}$ . $\beta$ determines the weight of the unary potential against the pairwise potential. $\mathcal{H}$ transforms a region into the EigenHistogram representation, where the class likelihood is calculated for each region in the tree. Following [17], we compute the pairwise potentials as the exponentiated boundary strength.

EigenHistograms of the internal nodes of the binary segmentation tree are accumulated and normalized from those of the corresponding descendant nodes (see Figure 3). Therefore, the partial non-smoothness effects of the eigenvectors (i.e., isolated regions as visualized in the right panel of Figure 2) reflected in the EigenHistograms of top-level nodes will be suppressed. Finally, we leave the remaining computation to the inference procedure, which produces a holistic multi-class segmentation as the final output using alpha-expansion based graph cut [2].
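The bottom-up accumulation over the segmentation tree can be sketched as below. This is a minimal sketch under assumptions: the tree is given as a hypothetical `children` mapping from each internal node to its two children, leaf histograms are supplied directly, and every leaf is weighted equally (the text does not say whether region area is used as a weight).

```python
from typing import Dict, Hashable, List, Tuple


def accumulate_tree(
    children: Dict[Hashable, Tuple[Hashable, Hashable]],
    leaf_hist: Dict[Hashable, List[float]],
    root: Hashable,
) -> Dict[Hashable, List[float]]:
    """Return a normalized histogram for every node in the tree under `root`."""
    result: Dict[Hashable, List[float]] = {}

    def visit(node: Hashable) -> Tuple[List[float], int]:
        if node in leaf_hist:
            total, n = list(leaf_hist[node]), 1
        else:
            left, right = children[node]
            lt, ln = visit(left)
            rt, rn = visit(right)
            total, n = [a + b for a, b in zip(lt, rt)], ln + rn
        # Normalize by the number of leaves below this node (assumed weighting).
        result[node] = [t / n for t in total]
        return total, n

    visit(root)
    return result


# Leaves "a", "b", "c"; "ab" merges a and b; "root" merges ab and c.
leaf_hist = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.0, 1.0]}
children = {"ab": ("a", "b"), "root": ("ab", "c")}
tree_hist = accumulate_tree(children, leaf_hist, "root")
```

In this toy tree the single region that votes for the first bin contributes only one third of the root histogram, which illustrates how an isolated region is damped at higher levels.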

5 Experimental Results

This section evaluates the segmentation performance of the proposed method. We first analyze the parameter settings. Then we evaluate and demonstrate the segmentation results, and compare to several state-of-the-art methods.

We mainly evaluate the proposed segmentation approach using the challenging Berkeley Segmentation Dataset (BSDS500) [11]. BSDS500 is widely used as a benchmark for image segmentation and boundary detection, and contains 200 training, 100 validation, and 200 test images. We use several standard evaluation criteria [11] to conduct quantitative analysis: Segmentation Covering, Probabilistic Rand Index (PRI), and Variation of Information (VoI), which measure per-pixel segment overlap, pairwise pixel matching, and segmentation-wise entropy, respectively. For each measurement, we report the values at the optimal dataset scale (ODS) and the optimal image scale (OIS). We further evaluate our method on the large-scale PASCAL VOC and COCO datasets to show its generalization ability for object segmentation, and compare it to two state-of-the-art methods.
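To make the Covering criterion and the ODS/OIS distinction concrete, here is a small pure-Python sketch. It assumes label maps are given as flat lists of per-pixel labels and uses the usual covering definition (each ground-truth region is matched to the machine segment with the highest intersection-over-union, weighted by region size). The `ods_ois` helper is a simplified, mean-based aggregation written for illustration; it is not the official BSDS500 benchmark code.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def segmentation_covering(gt: Sequence[int], seg: Sequence[int]) -> float:
    """Covering of the ground-truth labeling `gt` by the segmentation `seg`."""
    assert len(gt) == len(seg)
    gt_size, seg_size = Counter(gt), Counter(seg)
    joint = Counter(zip(gt, seg))  # pixel counts of each (gt region, segment) overlap
    best = {g: 0.0 for g in gt_size}
    for (g, s), inter in joint.items():
        union = gt_size[g] + seg_size[s] - inter
        best[g] = max(best[g], inter / union)
    return sum(gt_size[g] * best[g] for g in gt_size) / len(gt)


def ods_ois(scores: List[List[float]]) -> Tuple[float, float]:
    """scores[i][t] is the score of image i at scale t.

    ODS: best single scale shared by all images; OIS: best scale per image.
    """
    n_img, n_scale = len(scores), len(scores[0])
    ods = max(sum(row[t] for row in scores) / n_img for t in range(n_scale))
    ois = sum(max(row) for row in scores) / n_img
    return ods, ois


gt = [0, 0, 0, 0, 1, 1, 1, 1]
seg = [0, 0, 1, 1, 1, 1, 1, 1]
cov = segmentation_covering(gt, seg)
ods, ois = ods_ois([[0.5, 0.7], [0.8, 0.6]])
```

By construction OIS is never lower than ODS, because choosing the scale per image can only match or improve on a single shared scale.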

5.1 Implementation details

We investigate the parameter sensitivity of the proposed method and select the optimal values based on the training set. Then we apply these values to the independent test set.

Figure 4 shows the performance of the proposed method with respect to $L$ for clustering, $R$ for selecting nearest regions in Eq. (6), and $\mu$ for graph partitioning. The optimal number of segments $L$ depends on the test set, but we deliberately do not select $L$ on the test set, in order to demonstrate the strong generalization ability of the proposed method. For the parameter $\mu$, setting $\mu=0$ means that only $E_{local}$ is considered in the energy function; including $E_{global}$ with $\mu=8$ improves the accuracy by ${>}.1$ ODS (Covering). Section 5.4.2 further validates the effectiveness of $E_{global}$. As can be observed, the proposed method is insensitive to the three parameters. As a result, we set $L=6$, $\mu=8$, and $R=14$ throughout the following experiments.

We empirically set $e_{1}=20$ for kernel density estimation in Eq. (1) and $e_{2}=40$ for computing $W$ in Eq. (3). Since the test set has approximately equal image sizes, we can assume that these two values generalize to all test images. We empirically found that the best value of the parameter $\beta$ in Eq. (11) varies from image to image. In practice, we run the inference procedure to obtain multiple segmentations by varying $\beta$ over $[200,300,...,800]$ (a step of $100$), and take the average of all these outputs and the superpixel map as the final segmentation.

We use the toolbox provided by Dollár et al. [35] to generate the superpixel map (i.e., the structured edge (SE) detector followed by UCM) with roughly uniform region sizes. Our implementation is based on Matlab running on a standard Intel i7 desktop.

5.2 Segmentation result comparison

We evaluate the performance and efficiency of the proposed method, and compare it to several state-of-the-art methods.

In Table 1, we compare the proposed approach to several state-of-the-art methods in terms of segmentation accuracy and running time on the BSDS500 test set. As one can see, the proposed method significantly outperforms most of the compared methods. $\text{PMI}_{\text{low}}$ [31] is a boundary detection method, which embeds the edge map into OWT-UCM [11] to obtain accurate segmentation. We report its best accuracy, which is achieved on low resolution images. The recently proposed multiscale combinatorial grouping (MCG) [12] and piecewise flat embedding (PFE) [7] obtain significant improvement compared with earlier methods, such as red-spectral [37] and DC-Seg [33] (see Table 1). MCG uses hierarchical UCMs to boost the segmentation performance. PFE integrates its computed graph partitions into gPb-owt-ucm [11] and MCG, which achieves good segmentation performance. However, PFE suffers from computationally expensive optimization. The proposed method outperforms PFE $+$ owt-ucm and achieves segmentation performance close to that of PFE $+$ MCG. More importantly, the proposed method is hundreds of times faster than the PFE based methods. DC [33] and red-spectral [37] also emphasize fast image segmentation, but their segmentation accuracy is lower than ours.

Figure 5 shows the qualitative segmentation results. Figure 6 presents the graph partitioning results obtained by the proposed method, which provide good initial segmentation proposals. Compared with other methods, the proposed method is able to resist internal object variance and avoid small segments, so that the segments are much more spatially consistent. In addition, the proposed method can implicitly determine the best number of segments regardless of the pre-defined $L$ value. This is because EigenHistogram penalizes over-segmentation: homogeneous segments have similar EigenHistograms and thus similar unary potentials, which encourages them to be merged. The first three rows of Figure 5 particularly highlight this capability. To provide a more detailed comparison, Figure 7 shows the pairwise segmentation results of our proposed method compared with the classical gPb-owt-ucm [11] method. As can be observed, the proposed method shows obvious improvement on a large number of test images.

Time Efficiency: The comparison of running time is shown in the rightmost column of Table 1. The test image size is $321\times 481$ . The proposed method is much faster than other competing methods because of several important aspects:

1. The proposed method does not need complex feature computation, which is an advantage over gPb based methods [7, 11].

2. We construct the graph model based on superpixels rather than raw pixels. Although we incorporate multiple cues into a graph with complicated constraints, the graph partitioning is a single eigenvector system. In contrast, graph partitioning in the PFE method [7] is particularly computationally expensive.

3. EigenHistogram is efficient to compute and scales well to regions of arbitrary size for hierarchical multi-class segmentation.

The proposed method is executed in three phases: 1) generating the superpixel map and constructing the hierarchical segmentation tree, 2) constructing and partitioning the graph, and 3) conducting multi-class segmentation. Given an $H\times W$ resolution image, phase 1 takes low (logarithmic) time, governed by the random forest tree depth, to predict the edge map with a random forest, $O(HW+N)$ to compute superpixels with $N$ regions, and $O(\log N)$ to construct the hierarchical binary tree. Phase 2 takes up to $O(HW{+}N)$ to compute all image features with respect to pixels and regions and approximately $O(fN^{2})$ to compute the affinity matrix, where $f$ is the feature dimension. Since $(D-W+\mu M)$ in Eq. (8) is sparse, solving the eigendecomposition problem with an $N\times N$ affinity matrix takes $O(N(\tilde{R}{+}R))$ using a Lanczos algorithm according to [6], where $\tilde{R}{+}R{\ll}N$ is the number of adjacencies from both local ($\tilde{R}$) and global ($R$) connections in the graph. Phase 3 optimizes Eq. (9) with $L$ classes in approximately $O(N^{2}L)$ using graph cut with alpha-expansion [59, 17]. Therefore, the overall method has approximate time complexity $O(HW{+}fN^{2})$, bounded by phase 2.
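The per-phase costs quoted above can be tallied as follows. This is only a bookkeeping sketch: it treats the per-pixel random forest prediction as absorbed into the $O(HW)$ term, and the last line additionally assumes that the number of classes satisfies $L=O(f)$, which the text does not state explicitly but which is needed for phase 3 not to dominate phase 2.

$$
\begin{aligned}
T_{1} &= O(HW+N) + O(\log N),\\
T_{2} &= O(HW+N) + O(fN^{2}) + O\big(N(\tilde{R}+R)\big),\\
T_{3} &= O(N^{2}L),\\
T_{1}+T_{2}+T_{3} &= O(HW + fN^{2}) \quad \text{when } L = O(f) \text{ and } \tilde{R}+R \ll N .
\end{aligned}
$$

Since $N \le HW$ and $\tilde{R}+R \ll N$, the terms $O(N)$, $O(\log N)$, and $O(N(\tilde{R}+R))$ are all dominated by $O(HW)$ or $O(fN^{2})$, which leaves the affinity matrix computation as the bottleneck.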

We also evaluate the detailed running time of the proposed method on the $200$ BSDS500 test images. Table 2 shows the detailed time cost of each phase. Compared with most competing methods, the proposed method is more scalable for practical use.

5.3 Towards large-scale object segmentation

This section further demonstrates our proposed method on the large-scale PASCAL VOC [18] and COCO [19] segmentation datasets. Following the experimental settings of [36], 1,464 training images and 1,449 validation images are used for PASCAL. COCO contains 82,783 training images and 40,504 validation images in total; in our experiments, we randomly select 5,000 and 2,500 images from the training and validation sets, respectively, for evaluation. Since our method generates a region segmentation composed of a set of connected regions (the same as UCMs), we can directly use our method to generate object proposals by training an object proposal grouping classifier following [36]. We closely follow its training procedure and evaluation settings. In brief, the Jaccard index $J$, i.e., the size of the intersection divided by the size of the union of the pixel sets of two regions, is used to evaluate the accuracy of generated objects compared with the ground truth.
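For reference, the Jaccard index between a proposal and a ground-truth object can be computed as in the sketch below. Regions are represented as sets of `(row, col)` pixel coordinates, which is a representation chosen here for clarity; the helper names and the convention of returning 1.0 for two empty regions are assumptions, and the evaluation code of [36] may differ.

```python
from typing import Iterable, List, Set, Tuple

Pixel = Tuple[int, int]


def jaccard_index(region_a: Iterable[Pixel], region_b: Iterable[Pixel]) -> float:
    """Intersection over union of two pixel sets."""
    a, b = set(region_a), set(region_b)
    union = a | b
    if not union:
        return 1.0  # assumed convention: two empty regions match perfectly
    return len(a & b) / len(union)


def best_overlap(ground_truth: Set[Pixel], proposals: List[Set[Pixel]]) -> float:
    """Highest Jaccard index achieved by any proposal for one ground-truth object."""
    return max((jaccard_index(ground_truth, p) for p in proposals), default=0.0)


gt_obj = {(0, 0), (0, 1), (1, 0)}
proposal_1 = {(0, 1), (1, 0), (1, 1)}
proposal_2 = {(0, 0), (0, 1), (1, 0), (1, 1)}
j1 = jaccard_index(gt_obj, proposal_1)
j_best = best_overlap(gt_obj, [proposal_1, proposal_2])
```

Here `proposal_1` shares two of four distinct pixels with the object, while `proposal_2` covers the object fully at the cost of one extra pixel, so it yields the higher overlap.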

Figure 8 shows the comparison results for PASCAL VOC. We compare with the method proposed by [36], denoted as singlescale combinatorial grouping (SCG). As can be observed on the two evaluation metrics, our method consistently improves on the recall of SCG. We also compare with a recent deep learning based method, COB [39], which aims at detecting accurate object boundaries. It is combined with MCG [36] to perform object segmentation and achieves significant improvement. Note that region segmentation relies heavily on the quality of boundary detection (which is outside the focus of this paper). As will be demonstrated in Section 5.4.1, our method is flexible enough to build upon arbitrary baseline methods. Hence, we also use the edge maps generated by COB (denoted as Ours (COB)). As can be observed, this variant improves on our original method by a substantial margin and achieves competitive results compared with COB. Figure 9 compares the results on COCO. Our method shows better results than SCG (right) at low numbers of proposals and competitive results on the recall with respect to the number of candidates.

SCG is designed to generate image object candidates, so its generated UCMs contain very fine and small region segments, which is an advantage when computing evaluation metrics for images with multiple objects. Our method, however, is not designed for this goal. Compared with SCG, our method is significantly more proficient at segmenting the salient objects in images. We will further analyze this behavior in the next section. The PASCAL dataset is mainly collected for image and object segmentation tasks. According to our observation, PASCAL images usually contain definite and salient objects. Therefore, our method performs better and largely improves on SCG. In COCO, by contrast, most images are outdoor scenes that usually contain many small and indefinite objects. That is why the improvement of our method over SCG is not as large on COCO as on PASCAL. We qualitatively compare with SCG on COCO images with relatively definite objects. As can be seen in Figure 10, our method can significantly reduce over-segmentation and gives rise to clearer segmentation results. Nevertheless, the results shown on the two large-scale datasets are sufficient to demonstrate the generalization ability of our proposed method to different datasets with diverse scenes. Note that we did not select the parameters of our method on the target training datasets following Section 5.1, but used the single setting selected on the BSDS500 training set. We believe there is still room for improvement with careful fine-tuning.

5.4 Analysis

5.4.1 Serving as an extension to improve baseline methods

We consider using different methods to generate the superpixel maps that serve as input to the proposed method, which allows us to conduct more detailed analyses. It is necessary to note that, although the proposed method is flexible enough to build upon these methods, it is not an extension of the underlying methods. In contrast, the proposed method is a new exploration of accurate and fast spectral clustering based image segmentation. In addition, many state-of-the-art methods use accurate supervised edge detectors and other trained classifiers [12, 13]. We are particularly interested in reducing the amount of training data, with the aim of completely unsupervised image segmentation. Either an unsupervised [62] or a semi-supervised SE detector [41] can be used as the underlying edge detector. We consider using the latter, namely SemiContour [41] ($3$ training images are used), as an alternative to the originally used SE. We compare the performance in Table 3. The proposed method consistently improves the segmentation accuracy of the different baseline methods. In particular, we observe a $.4$ ODS (Covering) improvement over SemiContour+ucm and a $.3$ ODS improvement over SE+ucm.

5.4.2 Ablation study

We analyze the effectiveness of each component of the proposed method. The proposed global connection (Section 3.2) is very effective at capturing the affinity between spatially distant regions belonging to the same objects. The proposed multi-class segmentation is critical for generating smooth and clear segmentation maps and makes our method robust to arbitrary images. Figure 11 evaluates each component both qualitatively and quantitatively. Compared with our method without $E_{global}$, we observe obvious improvement (comparing the 3rd row against the 4th row and the 1st row against the 2nd row), which indicates the effectiveness of the proposed energy term $E_{global}$. To validate our multi-class segmentation, we conduct an experiment by simply clustering the generated graph partitions (i.e., eigenvectors) into $L$ classes using $k$-means and evaluating the performance. A simple hard clustering strategy cannot adapt to arbitrary images with different numbers of classes and does not guarantee local smoothness; these two factors incur a large penalty on the evaluation metrics, as shown in the first and second rows of Figure 11. Therefore, we argue that our strategy of using eigenvectors for multi-class segmentation is very effective (as explained in Section 4.2).

5.4.3 Edge information

The improvement obtained with MCG as the baseline is small (i.e., $.1$ ODS) compared with the improvements obtained with the other two baselines. In fact, MCG uses SE to detect edges and additionally sharpens them. Nevertheless, we observe that MCG sometimes sharpens irrelevant edges as well, so that the sharpened noisy edges incur a large penalty through the pairwise potentials relative to the unary potentials in our multi-class segmentation procedure, leading to undesirable results. Figure 12 illustrates this situation. The above results indicate that the proposed method relies less on strong edge information than MCG does.

Additionally, since the proposed method relies less on edges, one potential weakness is that the graph partitioning procedure could fragment homogeneous regions, which would decrease the precision of the boundary detection. We compare the boundary precision-recall curves in Figure 13, from which we can see that the proposed method maintains nearly the same precision as the baseline methods, i.e., MCG and SE+ucm (with a negligible 0.03 decrease for SE+ucm).

5.4.4 Strengths and limitations

The proposed method is effective at discovering complex image knowledge among regions in challenging natural images and at segmenting objects even when they have weak boundaries. The proposed method is significantly better than MCG on the samples shown in Figure 14 (left). However, we found that the proposed method is less effective on images without definite objects, because our graph design emphasizes the high-level discriminative image knowledge of objects against the background. MCG outperforms ours on those samples (see Figure 14 (right)).

6 Conclusions

In this paper, we present a fast yet accurate method for unsupervised image segmentation, which is a novel re-examination of spectral clustering based image segmentation. We construct an image region graph with both local and global connections based on simple but effective high-level cues, and formulate the graph partitioning as a simple generalized eigenvector system. The high quality graph partitions are used to compute effective unary potentials of the Pylon model for multi-class image segmentation. Extensive experiments on the BSDS500 benchmark and the large-scale PASCAL VOC and COCO datasets show that the proposed method achieves significantly faster speed and competitive performance compared with state-of-the-art segmentation methods.

7 Acknowledgement

This work was partially supported by the National Natural Science Foundation of China under Grants U1605252, 61472334, and 61571379.

Bibliography

[1] P. F. Felzenszwalb, D. P. Huttenlocher, Efficient graph-based image segmentation, International Journal of Computer Vision 59 (2) (2004) 167–181.
[2] Y. Boykov, O. Veksler, R. Zabih, Fast approximate energy minimization via graph cuts, Transactions on Pattern Analysis and Machine Intelligence 23 (11) (2001) 1222–1239.
[3] B. Peng, L. Zhang, D. Zhang, A survey of graph theoretical approaches to image segmentation, Pattern Recognition 46 (3) (2013) 1020–1038.
[4] T. H. Kim, K. M. Lee, S. U. Lee, Learning full pairwise affinities for spectral segmentation, Transactions on Pattern Analysis and Machine Intelligence 35 (7) (2013) 1690–1703.
[5] X. Shi, Z. Guo, Z. Lai, Y. Yang, Z. Bao, D. Zhang, A framework of joint graph embedding and sparse regression for dimensionality reduction, IEEE Transactions on Image Processing 24 (4) (2015) 1341–1355.
[6] J. Shi, J. Malik, Normalized cuts and image segmentation, Transactions on Pattern Analysis and Machine Intelligence 22 (8) (2000) 888–905.
[7] Y. Yu, C. Fang, Z. Liao, Piecewise flat embedding for image segmentation, in: Proceedings of the International Conference on Computer Vision, 2015, pp. 1368–1376.
[8] S. X. Yu, Segmentation induced by scale invariance, in: Proceedings of the Conference on Computer Vision and Pattern Recognition, 2005, pp. 444–451.