Structured Learning of Tree Potentials in CRF for Image Segmentation

Fayao Liu; Guosheng Lin; Ruizhi Qiao; Chunhua Shen

arXiv:1703.08764·cs.CV·March 28, 2017



TL;DR

This paper introduces a novel image segmentation method that combines CRFs with decision tree ensembles, enabling nonlinear potential functions learned through a unified large-margin framework, improving modeling of complex data.

Contribution

It formulates CRF potentials as decision tree forests and develops an efficient optimization method, advancing nonlinear learning in image segmentation.

Findings

1. Effective on binary and multi-class datasets

2. Outperforms traditional linear potential methods

3. Demonstrates flexible modeling of complex data



Tables

Table I: The average intersection-over-union score and average pixel accuracy comparison on the Graz-02 dataset. Foreground and background results are given in brackets. Our method CRFTree, with nonlinear and class-wise potential learning, performs better than all the baseline methods.

Intersection/union in % (foreground, background)
Method	bike	car	people
SVMs	67.8 (51.9, 83.8)	69.7 (46.8, 92.6)	65.0 (44.5, 85.5)
AdaBoost	71.2 (57.6, 84.9)	71.0 (49.4, 92.6)	67.7 (48.7, 86.7)
SSVMs	72.2 (58.6, 85.8)	76.9 (60.0, 94.2)	70.9 (53.8, 87.9)
CRFTree	76.4 (65.0, 87.8)	79.5 (64.0, 95.0)	74.2 (58.7, 89.7)
CRFTree (FL)	78.3 (67.7, 88.9)	83.0 (70.1, 95.9)	75.7 (61.0, 90.5)

Pixel accuracy in % (foreground, background)
Method	bike	car	people
SVMs	79.5 (67.4, 91.5)	77.3 (57.2, 97.3)	77.7 (63.8, 91.6)
AdaBoost	83.8 (77.3, 90.3)	80.1 (63.5, 96.6)	80.5 (69.0, 91.9)
SSVMs	83.8 (76.1, 91.6)	85.5 (73.8, 97.2)	83.9 (75.8, 92.1)
CRFTree	87.8 (83.9, 91.8)	87.0 (76.4, 97.7)	85.9 (78.4, 93.4)
CRFTree (FL)	89.1 (85.8, 92.4)	90.0 (82.1, 98.0)	86.9 (80.0, 94.0)

Table II: Segmentation results on the MSRC dataset. We report the pixel-wise accuracy for each category as well as the average per-category scores and the global pixel-wise accuracy. (1) The upper part presents the comparison with baseline methods, which all use bag-of-words and color histogram features. Our method CRFTree achieves substantial improvements over SSVMs and performs far better than the simple linear models. (2) The lower part shows the results of our method using unsupervised feature learning and CNN features (denoted as CRFTree (FL) and CRFTree (CNN) respectively) compared with state-of-the-art methods on this dataset.

Method	building	grass	tree	cow	sheep	sky	aeroplane	water	face	car	bicycle	flower	sign	bird	book	chair	road	cat	dog	body	boat	Average	Global
SVMs	54	92	73	41	54	80	51	67	51	41	59	41	28	8	64	17	75	41	23	20	7	47.0	63.7
AdaBoost	68	92	83	48	58	87	43	69	58	43	64	41	32	14	70	28	79	47	22	41	6	52.0	68.6
SSVMs	65	92	81	42	76	84	65	70	75	54	87	62	31	14	76	31	78	61	30	25	2	57.2	70.8
CRFTree	53	87	85	59	84	90	77	82	81	54	90	57	62	22	81	59	80	71	26	49	15	64.9	73.9
CRFTree (FL)	66	95	89	83	89	90	90	83	76	74	83	71	69	46	87	73	87	84	53	68	20	75.1	82.2
CRFTree (CNN)	73	96	89	82	92	96	89	86	93	78	86	91	71	75	85	76	86	91	63	83	41	82.0	86.2
Shotton et al.[32]	49	88	79	97	97	78	82	54	87	74	72	74	36	24	93	51	78	75	35	66	18	67	72
Ladicky et al.[18]	80	96	86	74	87	99	74	87	86	87	82	97	95	30	86	31	95	51	69	66	9	75	86
Gonfaus et al.[11]	60	78	77	91	68	88	87	76	73	77	93	97	73	57	95	81	76	81	46	56	46	75	77
Lucchi et al.[24]	59	90	92	82	83	94	91	80	85	88	96	89	73	48	96	62	81	87	33	44	30	76	82
Lucchi et al.[23]	67	89	85	93	79	93	84	75	79	87	89	92	71	46	96	79	86	76	64	77	50	78.9	83.7

Table III: Performance of different methods on the Weizmann Horse dataset.

Method	Sa	So
Levin & Weiss [21]	95.5	-
Cosegmentation [14]	80.1	-
Bertelli et al. [5]	94.6	80.1
Kuettel et al. [17]	94.7	-
CRFTree (FL)	94.6	80.4

Table IV: Performance of different methods on the Oxford Flower dataset. Our method CRFTree performs better than the compared methods.

Method	Sa	So
Nilsback et al.[26]	-	94.0
Bertelli et al.[5]	97.7	92.3
CRFTree (FL)	98.0	94.2

Table V: Comparison with state-of-the-art methods on the Graz-02 dataset. We report the F-score (%) for each class and the average over classes. Our method CRFTree outperforms all the compared methods by a large margin.

Method	bike	car	people	average
Marszalek & Schmid [25]	61.8	53.8	44.1	53.2
Fulkerson et al.[9]	66.4	54.7	51.4	57.5
Aldavert et al.[2]	71.9	62.9	58.6	64.5
Kuettel et al.[17]	63.2	74.8	66.4	68.1
CRFTree (FL)	80.7	82.4	75.8	79.5

Table VI: The average intersection-over-union scores of different methods on the PASCAL VOC 2012 test dataset. Our method CRFTree achieves performance comparable with the state-of-the-art methods. Note that [38] used extra training data.

Method	intersection/union
CFM [7]	61.8
Hypercolumn [12]	62.6
FCN-8s [22]	62.2
Zheng et al.[38]	72.0
CRFTree	65.4

Equations

$$P({\bf y}\mid{\bf x};{\bf w})=\frac{1}{Z}\exp\bigl(-E({\bf y},{\bf x};{\bf w})\bigr). \tag{1}$$

$$E({\bf y},{\bf x};{\bf w})=\sum_{p\in{\cal N}}\Phi^{(1)}(y^{p},{\bf x};{\bf w})+\sum_{(p,q)\in{\cal S}}\Phi^{(2)}(y^{p},y^{q},{\bf x};{\bf w}). \tag{2}$$

$$\Phi^{(1)}(y^{p},{\bf x})={\bf w}_{y^{p}}^{(1){\!\top}}{\bf H}_{y^{p}}^{(1)}({\bf x}^{p}). \tag{3}$$

$$\Phi^{(2)}(y^{p},y^{q},{\bf x})={\bf w}^{(2){\!\top}}{\bf H}^{(2)}({\bf x}^{p},{\bf x}^{q})\,I(y^{p}\neq y^{q}). \tag{4}$$

$$E({\bf y},{\bf x};{\bf w},{\bf H})=\sum_{p\in{\cal N}}{\bf w}_{y^{p}}^{(1){\!\top}}{\bf H}_{y^{p}}^{(1)}({\bf x}^{p})+\sum_{(p,q)\in{\cal S}}{\bf w}^{(2){\!\top}}{\bf H}^{(2)}({\bf x}^{p},{\bf x}^{q})\,I(y^{p}\neq y^{q}). \tag{5}$$

$$
\begin{aligned}
\min_{{\bf w},{\boldsymbol{\xi}}\geq 0}\;\;&\tfrac{1}{2}\|{\bf w}\|^{2}+\tfrac{C}{m}\sum_{i=1}^{m}\xi_{i}\\
{\rm s.t.}\!:\;\;&E({\bf y},{\bf x}_{i};{\bf w})-E({\bf y}_{i},{\bf x}_{i};{\bf w})\geq{\it\Delta}({\bf y}_{i},{\bf y})-\xi_{i},\\
&\forall i=1,\dots,m,\ \text{and}\ \forall{\bf y}\in{\cal Y}.
\end{aligned}
\tag{6}
$$

$$\Psi^{(1)}({\bf y},{\bf x};{\bf H}^{(1)})=\sum_{p\in{\cal N}}{\bf H}_{y^{p}}^{(1)}({\bf x}^{p})\otimes y^{p}. \tag{7}$$

$${\bf w}^{(1){\!\top}}\Psi^{(1)}({\bf y},{\bf x};{\bf H}^{(1)})=\sum_{p\in{\cal N}}\Phi^{(1)}(y^{p},{\bf x}). \tag{8}$$

$$\Psi^{(2)}({\bf y},{\bf x};{\bf H}^{(2)})=\sum_{(p,q)\in{\cal S}}{\bf H}^{(2)}({\bf x}^{p},{\bf x}^{q})\,I(y^{p}\neq y^{q}). \tag{9}$$

$${\bf w}^{(2){\!\top}}\Psi^{(2)}({\bf y},{\bf x};{\bf H}^{(2)})=\sum_{(p,q)\in{\cal S}}\Phi^{(2)}(y^{p},y^{q},{\bf x}). \tag{10}$$

$$\Psi({\bf y},{\bf x};{\bf H})=\Psi^{(1)}({\bf y},{\bf x};{\bf H}^{(1)})\odot\Psi^{(2)}({\bf y},{\bf x};{\bf H}^{(2)}). \tag{11}$$

$$
\begin{aligned}
E({\bf y},{\bf x};{\bf w},{\bf H})&=\sum_{p\in{\cal N}}\Phi^{(1)}(y^{p},{\bf x};{\bf w},{\bf H}^{(1)})+\sum_{(p,q)\in{\cal S}}\Phi^{(2)}(y^{p},y^{q},{\bf x};{\bf w},{\bf H}^{(2)})\\
&={\bf w}^{{\!\top}}\Psi({\bf y},{\bf x};{\bf H}).
\end{aligned}
\tag{12}
$$

$$
\begin{aligned}
\min_{{\bf w},{\boldsymbol{\xi}}}\;\;&\tfrac{1}{2}\|{\bf w}\|^{2}+\tfrac{C}{m}\sum_{i=1}^{m}\xi_{i}\\
{\rm s.t.}\!:\;\;&{\bf w}^{{\!\top}}\bigl[\Psi({\bf y},{\bf x}_{i};{\bf H})-\Psi({\bf y}_{i},{\bf x}_{i};{\bf H})\bigr]\geq{\it\Delta}({\bf y}_{i},{\bf y})-\xi_{i},\\
&\forall i=1,\dots,m,\ \text{and}\ \forall{\bf y}\in{\cal Y};\\
&{\bf w}\geq 0,\ {\boldsymbol{\xi}}\geq 0.
\end{aligned}
\tag{13}
$$

$$
\begin{aligned}
\max_{{\boldsymbol{\lambda}},{\boldsymbol{\theta}}}\;\;&\sum_{i,{\bf y}}\lambda_{(i,{\bf y})}{\it\Delta}({\bf y}_{i},{\bf y})-\tfrac{1}{2}\Bigl\|\sum_{i,{\bf y}}\lambda_{(i,{\bf y})}\bigl[\Psi({\bf y},{\bf x}_{i};{\bf H})-\Psi({\bf y}_{i},{\bf x}_{i};{\bf H})\bigr]+{\boldsymbol{\theta}}\Bigr\|^{2}\\
{\rm s.t.}\!:\;\;&0\leq\sum_{{\bf y}}\lambda_{(i,{\bf y})}\leq\tfrac{C}{m},\ \forall i=1,\dots,m;\ {\boldsymbol{\theta}}\geq 0,\ {\boldsymbol{\lambda}}\geq 0.
\end{aligned}
\tag{14}
$$

$${\bf w}\geq\sum_{i,{\bf y}}\lambda_{(i,{\bf y})}\bigl[\Psi({\bf y},{\bf x}_{i};{\bf H})-\Psi({\bf y}_{i},{\bf x}_{i};{\bf H})\bigr]. \tag{15}$$

$${\bf H}^{\star}=\operatorname*{argmax}_{{\bf H}}\;\sum_{i,{\bf y}}\lambda_{(i,{\bf y})}\bigl[\Psi({\bf y},{\bf x}_{i};{\bf H})-\Psi({\bf y}_{i},{\bf x}_{i};{\bf H})\bigr]. \tag{16}$$

$$
\begin{aligned}
\forall c=1,\dots,K:\quad{\hbar}_{c}^{(1)\star}(\cdot)=\operatorname*{argmax}_{{\hbar}\in{\cal H}}\;\sum_{i,{\bf y}}\biggl[&\sum_{p\in{\cal N},\,y^{p}=c}\underbrace{\lambda_{(i,{\bf y})}{\hbar}_{y^{p}}^{(1)}({\bf x}_{i}^{p})}_{\text{positive}}\\
&-\sum_{p\in{\cal N},\,y_{i}^{p}=c}\underbrace{\lambda_{(i,{\bf y})}{\hbar}_{y_{i}^{p}}^{(1)}({\bf x}_{i}^{p})}_{\text{negative}}\biggr].
\end{aligned}
\tag{17}
$$

$$
\begin{aligned}
{\hbar}^{(2)\star}(\cdot,\cdot)=\operatorname*{argmax}_{{\hbar}\in{\cal H}}\;\sum_{i,{\bf y}}\lambda_{(i,{\bf y})}\biggl[&\sum_{(p,q)\in{\cal S}}\underbrace{{\hbar}^{(2)}({\bf x}_{i}^{p},{\bf x}_{i}^{q})\,I(y^{p}\neq y^{q})}_{\text{positive}}\\
&-\sum_{(p,q)\in{\cal S}}\underbrace{{\hbar}^{(2)}({\bf x}_{i}^{p},{\bf x}_{i}^{q})\,I(y_{i}^{p}\neq y_{i}^{q})}_{\text{negative}}\biggr].
\end{aligned}
\tag{18}
$$

$$
\begin{aligned}
\min_{{\bf w}\geq 0,\,\xi\geq 0}\;\;&\tfrac{1}{2}\|{\bf w}\|^{2}+C\xi\\
{\rm s.t.}\!:\;\;&\frac{1}{m}{\bf w}^{{\!\top}}\Bigl[\sum_{i=1}^{m}r_{i}\bigl(\Psi({\bf y},{\bf x}_{i};{\bf H})-\Psi({\bf y}_{i},{\bf x}_{i};{\bf H})\bigr)\Bigr]\geq\frac{1}{m}\sum_{i=1}^{m}r_{i}{\it\Delta}({\bf y}_{i},{\bf y})-\xi,\\
&\forall{\bf r}\in\{0,1\}^{m};\ \forall{\bf y}\in{\cal Y}.
\end{aligned}
\tag{19}
$$

$${\bf y}_{i}^{\star}=\operatorname*{argmin}_{{\bf y}\in{\cal Y}}\;{\bf w}^{{\!\top}}\Psi({\bf y},{\bf x}_{i};{\bf H})-{\it\Delta}({\bf y}_{i},{\bf y}). \tag{20}$$


Taxonomy

Topics: Machine Learning and Data Classification · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications

Full text

Structured Learning of Tree Potentials in CRF for Image Segmentation

Fayao Liu, Guosheng Lin, Ruizhi Qiao, Chunhua Shen

F. Liu, R. Qiao, C. Shen are with The University of Adelaide, Australia. G. Lin is with Nanyang Technological University, Singapore. This work was done when G. Lin was with The University of Adelaide. Email: {fayao.liu, ruizhi.qiao, chunhua.shen}@adelaide.edu.au, [email protected] *Appearing in IEEE Transactions on Neural Networks and Learning Systems, 26 March 2017.*

Abstract

We propose a new approach to image segmentation, which exploits the advantages of both conditional random fields (CRFs) and decision trees. In the literature, the potential functions of CRFs are mostly defined as a linear combination of some pre-defined parametric models, and then methods like structured support vector machines (SSVMs) are applied to learn those linear coefficients. We instead formulate the unary and pairwise potentials as nonparametric forests—ensembles of decision trees, and learn the ensemble parameters and the trees in a unified optimization problem within the large-margin framework. In this fashion, we easily achieve nonlinear learning of potential functions on both unary and pairwise terms in CRFs. Moreover, we learn class-wise decision trees for each object that appears in the image. Due to the rich structure and flexibility of decision trees, our approach is powerful in modelling complex data likelihoods and label relationships. The resulting optimization problem is very challenging because it can have exponentially many variables and constraints. We show that this challenging optimization can be efficiently solved by combining a modified column generation and cutting-planes techniques. Experimental results on both binary (Graz-02, Weizmann horse, Oxford flower) and multi-class (MSRC-21, PASCAL VOC 2012) segmentation datasets demonstrate the power of the learned nonlinear nonparametric potentials.

Index Terms:

Conditional random fields, Decision trees, Structured support vector machines, Image segmentation.

I Introduction
II Learning tree potentials in CRFs
II-A Segmentation using CRFs models
II-B Energy Formulation
II-C Learning CRFs in the large-margin framework
II-D Learning tree potentials using column generation
II-E Speeding up optimization using cutting-plane
III Experiments
III-A Experimental setup
III-B Comparing with baseline methods
III-C Comparing with state-of-the-art methods
IV Conclusion

I Introduction

The goal of object segmentation is to produce a pixel-level segmentation of different object categories. It is challenging because objects may appear against various backgrounds and under different visual conditions. CRFs [19] model the conditional distribution of labels given observations, and represent the state of the art in image/object segmentation [34, 32, 10, 24, 27, 15]. The max-margin principle has also been applied to predict structured outputs, including SSVMs [36] and max-margin Markov networks [35]. These three methods share similarities when viewed as optimization problems using different loss functions. Szummer et al. [34] proposed to learn linear coefficients of CRFs potentials using SSVMs and graph cuts. To date, most of these methods assume a pre-defined parametric model for the potential functions, and typically only the linear coefficients of the parametric model are learned. This can greatly limit the flexibility of CRFs, and thus calls for effective methods to incorporate nonlinear nonparametric models for learning the potential functions in CRFs.

As in standard support vector machines (SVMs), nonlinearity can be achieved by introducing nonlinear kernels for SSVMs. However, the time complexity of nonlinear SVMs is roughly $O(n^{3.5})$, with $n$ being the number of training examples. This time complexity is problematic for SSVMs, where the number of constraints grows exponentially in the description length of the label ${\bf y}$ . Moreover, nonlinear functions can significantly slow down the test time in most cases. For these reasons, most current SSVMs applications use linear kernels (or linear parametric potential functions in CRFs), despite the fact that nonlinear functions usually deliver more promising prediction accuracy. In this work, we address this issue by combining CRFs with nonparametric decision trees. Both CRFs and decision trees have gained tremendous success in computer vision. Decision trees are capable of modelling complex relations and generalize well on test data. Unlike kernel methods, decision trees are fast to evaluate and can be used to select informative features.

In this work, we propose to use ensembles of decision trees to map the image content to both the unary terms and the pairwise interaction values in CRFs. The proposed method is termed as CRFTree. Specifically, we formulate both the unary and pairwise potentials as nonparametric forests—ensembles of decision trees, and learn the ensemble parameters and the trees in a single optimization framework. In this way, the nonlinearity is easily introduced into CRFs learning without confronting the kernel dilemma. Furthermore, we learn class-wise decision trees for each object. Due to the rich structure and flexibility of decision trees, our approach is powerful in modelling complex data likelihoods and label relationships. The resulting optimization problem is very challenging in the sense that it can involve exponentially or even infinitely many variables and constraints. We summarize our main contributions as follows.

1. We formulate the unary and pairwise potentials as ensembles of decision trees, and show how to jointly learn the ensemble parameters and the trees as a unified optimization problem within the large-margin framework. In this fashion, we achieve nonlinear potential learning on both the unary and pairwise terms.

2. We learn class-wise decision trees (potentials) for each object that appears in the image.

3. We show how to train the proposed CRFTree model efficiently. In particular, we combine the column generation and cutting-planes techniques to approximately solve the resulting optimization problem, which can involve exponentially many variables and constraints.

4. We empirically demonstrate that CRFTree outperforms existing methods for image segmentation. On both binary and multi-class segmentation datasets we show the advantages of the learned nonlinear nonparametric potentials of decision trees.

Related work

We briefly review recent works that are relevant to ours. A few attempts have been made to apply nonlinear kernels in SSVMs. Yu et al. [37] and Severyn et al. [29] developed sampled-cuts-based methods for training SSVMs with kernels. Sampled cuts methods were originally proposed for standard kernel SVMs; when applied to SSVMs, the performance is compromised [24]. In [5], image-mask pair kernels are designed to exploit image-level structural information for object segmentation. However, these kernels are restricted to the unary term. Although not in the large-margin framework, the kernel CRFs proposed in [20] incorporate kernels into CRFs learning. The authors only demonstrated the efficacy of their method on a synthetic dataset and a small-scale protein dataset. To sum up, these approaches are hampered by heavy computational complexity. Furthermore, it is not a trivial task to design appropriate kernels for structured problems. Recently, Lucchi et al. [24] proposed a two-step solution to tackle this problem. Specifically, they train linear SSVMs by using kernelized feature vectors that are obtained from training a standard nonlinear kernel SVMs model. They experimentally demonstrate that the kernel transferred linear SVMs model achieves performance similar to that of the Gaussian SVMs. However, this approach is heuristic and it cannot be shown theoretically that their formulation approximates a nonlinear SSVMs model. Besides, their method consumes extra memory and training time, since the dimension of the transformed features equals the number of support vectors, which is linearly proportional to the size of the training data [33]. Moreover, compared to the above-mentioned works of [5] and [24], we achieve nonlinear learning on both the unary and the pairwise terms, while theirs are limited to nonlinear unary potential learning.

The recent work of Shen et al. [31] generalizes standard boosting methods to structured learning, which shares similarities with our work here. However, our method bears critical differences from theirs:

1. We design a column generation method for nonlinear tree potentials learning in CRFs directly from the SSVMs formulation. Unlike the case in [31], where a column generation method analogous to LPBoost [8] can be derived directly, our derivation here is more challenging. This is because we cannot obtain the most violated constraint from the constraints of the dual problem, on which the column generation technique relies. We instead inspect the KKT condition to seek the most violated constraint. This is an important difference compared to existing column generation techniques.

2. We develop a CRFs learning method for multi-class semantic segmentation, while [31] only shows CRFs learning for binary foreground/background segmentation. Our experiments on the MSRC-21 dataset show that our method achieves state-of-the-art results.

3. We learn class-wise decision trees (potentials) for each object that appears in the image. This is different from [31].

The work on decision tree fields [28] is close to ours in that it also uses decision trees to model the pairwise potentials. The major difference is that in [28] potential functions are constructed by directly summing the energy tables associated with the set of nodes visited while evaluating the decision trees. Their trees are generally deep, with depth 15 for the unary potential and 6 for the pairwise potential in their experiment. By contrast, we model the potential functions as an ensemble of decision trees and learn them in the large-margin framework. In our method, the decision trees are shallow and simple with binary outputs.

II Learning tree potentials in CRFs

We present the details of our method in this section by first introducing the CRFs models for segmentation, then formulating the energy functions and showing how to learn decision tree potentials in the large-margin framework.

II-A Segmentation using CRFs models

Before presenting our method, we first revisit how to use CRFs models to perform image segmentation. Given an image instance ${\bf x}$ and its corresponding labelling ${\bf y}$ , CRFs [19] models the conditional distribution of the form

$$P({\bf y}\mid{\bf x};{\bf w})=\frac{1}{Z}\exp\bigl(-E({\bf y},{\bf x};{\bf w})\bigr), \tag{1}$$

where ${\bf w}$ are parameters and $Z$ is the normalization term. The energy $E$ of an image ${\bf x}$ with segmentation labels ${\bf y}$ over the nodes (superpixels) $\cal N$ and edges $\cal S$ , takes the following form:

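$$E({\bf y},{\bf x};{\bf w})=\sum_{p\in{\cal N}}\Phi^{(1)}(y^{p},{\bf x};{\bf w})+\sum_{(p,q)\in{\cal S}}\Phi^{(2)}(y^{p},y^{q},{\bf x};{\bf w}). \tag{2}$$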

Here ${\bf x}\in{\cal X},{\bf y}\in{\cal Y}$ ; ${\Phi}^{(1)}$ and ${\Phi}^{(2)}$ are the unary and pairwise potentials, both of which depend on the observations as well as the parameter ${\bf w}$ . CRFs seeks an optimal labeling that achieves the maximum a posteriori (MAP) estimate, which mainly involves a two-step process [34]: 1) learning the model parameters from the training data; 2) inferring the most likely labels for the test data given the learned parameters. The segmentation problem thus reduces to minimizing the energy (or cost) over ${\bf y}$ using the learned parameters ${\bf w}$ , that is, ${\bf y}^{*}=\operatorname*{argmin\,}_{{\bf y}\in{\cal Y}}E({\bf y},{\bf x};{\bf w})$ . When the energy function is submodular, this inference problem can be efficiently solved via graph cuts [34].
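To make the inference step concrete, the following is a minimal sketch of energy minimization on a toy graph. It assumes a graph small enough for exhaustive search, whereas the paper relies on graph cuts; the function names `energy` and `map_labeling`, the table-based unary costs, and the Potts-style pairwise cost are all illustrative assumptions rather than the authors' implementation.

```python
from itertools import product

def energy(labels, unary, pairwise, edges):
    """E(y, x) = sum_p unary[p][y_p] + sum_{(p,q)} pairwise(y_p, y_q)."""
    e = sum(unary[p][y] for p, y in enumerate(labels))
    e += sum(pairwise(labels[p], labels[q]) for p, q in edges)
    return e

def map_labeling(unary, pairwise, edges, num_classes):
    """Exhaustive argmin over all labelings; only viable for toy problems."""
    n = len(unary)
    return min(product(range(num_classes), repeat=n),
               key=lambda y: energy(y, unary, pairwise, edges))

def potts(a, b):
    """Hypothetical pairwise cost: constant penalty when neighbouring labels differ."""
    return 0.3 if a != b else 0.0

# Three superpixels in a chain, two classes (made-up costs).
unary = [[0.2, 1.0], [0.6, 0.5], [1.0, 0.1]]
edges = [(0, 1), (1, 2)]
best = map_labeling(unary, potts, edges, num_classes=2)
best_energy = energy(best, unary, potts, edges)
```

The exhaustive search visits $K^{n}$ labelings, which is why a polynomial-time solver such as graph cuts is needed for real images.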

II-B Energy Formulation

Given the energy function in Eqn. (2), we show how to construct the unary and pairwise potentials using decision trees. We denote ${\bf x}^{p}$ as the features of superpixel $p$ ( $p=1,\ldots,n$ ), with its label $y^{p}\in\{1,\ldots,K\}$ , where $K$ is the number of classes. Let $\cal H$ be a set of decision trees, which can be infinite. Each ${\hbar}_{j}^{(1)}(\cdot)\in\cal H$ takes ${\bf x}^{p}$ as the input, and ${\hbar}_{j}^{(2)}(\cdot,\cdot)\in\cal H$ takes a pair $({\bf x}^{p},{\bf x}^{q})$ as the input to output $\{0,1\}$ . We introduce $(K+1)$ groups of decision trees, in which $K$ groups are for the unary potential and one group for the pairwise potential. For the unary potential, the $K$ groups of decision trees are denoted by ${\bf H}_{c}^{(1)}(c=1,\ldots,K)$ , which correspond to $K$ categories. Each ${\bf H}_{c}^{(1)}$ is associated with the $c$ -th class. In other words, for each class, we maintain its own unary feature mappings. Each group of decision trees for the unary potential can be written as: ${\bf H}_{c}^{(1)}=[{\hbar}_{c1}^{(1)},{\hbar}_{c2}^{(1)},\ldots]^{{\!\top}}$ , which are the output of decision trees: ${\hbar}_{cj}^{(1)}$ . All decision trees of the unary potential are denoted by ${\bf H}^{(1)}=[{\bf H}_{1}^{(1)},{\bf H}_{2}^{(1)},\ldots,{\bf H}_{K}^{(1)}]$ . Accordingly, for the pairwise potential, the group of decision trees is denoted by ${\bf H}^{(2)}$ , and ${\bf H}^{(2)}=[{\hbar}_{1}^{(2)},{\hbar}_{2}^{(2)},\ldots]^{{\!\top}}$ being the output of all ${\hbar}_{j}^{(2)}$ . The whole set of decision trees is denoted by ${\bf H}=[{\bf H}^{(1)},{\bf H}^{(2)}]$ . We then construct the unary and pairwise potentials as

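$$\Phi^{(1)}(y^{p},{\bf x})={\bf w}_{y^{p}}^{(1){\!\top}}{\bf H}_{y^{p}}^{(1)}({\bf x}^{p}), \tag{3}$$

$$\Phi^{(2)}(y^{p},y^{q},{\bf x})={\bf w}^{(2){\!\top}}{\bf H}^{(2)}({\bf x}^{p},{\bf x}^{q})\,I(y^{p}\neq y^{q}), \tag{4}$$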

where $I(\cdot)$ is an indicator function which equals $1$ if the input is true and $0$ otherwise. Then the energy function in Eqn. (2) can be written as:

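$$E({\bf y},{\bf x};{\bf w},{\bf H})=\sum_{p\in{\cal N}}{\bf w}_{y^{p}}^{(1){\!\top}}{\bf H}_{y^{p}}^{(1)}({\bf x}^{p})+\sum_{(p,q)\in{\cal S}}{\bf w}^{(2){\!\top}}{\bf H}^{(2)}({\bf x}^{p},{\bf x}^{q})\,I(y^{p}\neq y^{q}). \tag{5}$$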
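The tree-based potentials can be illustrated with a small sketch. It assumes depth-1 trees (stumps) with binary outputs as a stand-in for the learned decision trees; the helpers `stump` and `pair_stump`, the feature tuples, and the weights are hypothetical values chosen only to show how class-wise unary trees and a shared pairwise tree group enter the energy.

```python
def stump(feature, threshold):
    """Hypothetical depth-1 tree: outputs 1 if x[feature] > threshold, else 0."""
    return lambda x: 1 if x[feature] > threshold else 0

def pair_stump(threshold):
    """Hypothetical pairwise tree on the absolute difference of feature 0."""
    return lambda xp, xq: 1 if abs(xp[0] - xq[0]) > threshold else 0

def unary_potential(y_p, x_p, w1, H1):
    # Phi1 = w1[c]^T H1[c](x_p), using the tree group of class c = y_p
    return sum(w * h(x_p) for w, h in zip(w1[y_p], H1[y_p]))

def pairwise_potential(y_p, y_q, x_p, x_q, w2, H2):
    # Phi2 = w2^T H2(x_p, x_q) * I(y_p != y_q)
    if y_p == y_q:
        return 0.0
    return sum(w * h(x_p, x_q) for w, h in zip(w2, H2))

def tree_energy(y, x, edges, w1, H1, w2, H2):
    e = sum(unary_potential(y[p], x[p], w1, H1) for p in range(len(x)))
    e += sum(pairwise_potential(y[p], y[q], x[p], x[q], w2, H2) for p, q in edges)
    return e

# Two classes, two superpixels joined by one edge (made-up numbers).
H1 = [[stump(0, 0.5)], [stump(0, 0.2), stump(1, 0.5)]]  # one tree group per class
w1 = [[1.0], [0.5, 0.25]]
H2 = [pair_stump(0.3)]
w2 = [2.0]
x = [(0.9, 0.1), (0.1, 0.8)]
edges = [(0, 1)]
e_mixed = tree_energy((0, 1), x, edges, w1, H1, w2, H2)
e_same = tree_energy((1, 1), x, edges, w1, H1, w2, H2)
```

Because every tree outputs a value in $\{0,1\}$, each potential is a sum of the non-negative weights of the trees that fire on the given input.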

Next we show how to learn these decision tree potentials in the large-margin framework.

II-C Learning CRFs in the large-margin framework

Instead of directly minimizing the negative log-likelihood loss, we here learn the CRFs parameters in the large margin framework, similar to [34]. Given a set of training examples $\{{\bf x}_{i},{\bf y}_{i}\}_{i=1}^{m}$ , the large-margin based CRFs learning solves the following optimization:

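$$
\begin{aligned}
\min_{{\bf w},{\boldsymbol{\xi}}\geq 0}\;\;&\tfrac{1}{2}\|{\bf w}\|^{2}+\tfrac{C}{m}\sum_{i=1}^{m}\xi_{i}\\
{\rm s.t.}\!:\;\;&E({\bf y},{\bf x}_{i};{\bf w})-E({\bf y}_{i},{\bf x}_{i};{\bf w})\geq{\it\Delta}({\bf y}_{i},{\bf y})-\xi_{i},\\
&\forall i=1,\dots,m,\ \text{and}\ \forall{\bf y}\in{\cal Y},
\end{aligned}
\tag{6}
$$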

where ${\it\Delta}:{\cal Y}\times{\cal Y}\mapsto\mathbb{R}$ is a loss function associated with the prediction and the true label mask. In general, we have ${\it\Delta}({\bf y},{\bf y})=0$ and ${\it\Delta}({\bf y},{\bf y}^{\prime})>0$ for any ${\bf y}^{\prime}\neq{\bf y}$ . Intuitively, the optimization in Eqn. (6) is to encourage the energy of the ground truth label $E({\bf y}_{i},{\bf x}_{i};{\bf w})$ to be lower than that of any other incorrect labels $E({\bf y},{\bf x}_{i};{\bf w})$ by at least a margin ${\it\Delta}({\bf y}_{i},{\bf y})$ .

To learn the potential functions we proposed in §II-B in the large-margin framework, we introduce the following definitions. For the unary part, we define ${\bf w}^{(1)}={\bf w}_{1}^{(1)}\odot{\bf w}_{2}^{(1)}\odot\ldots\odot{\bf w}_{K}^{(1)}$ , where $\odot$ stacks two vectors, and

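$$\Psi^{(1)}({\bf y},{\bf x};{\bf H}^{(1)})=\sum_{p\in{\cal N}}{\bf H}_{y^{p}}^{(1)}({\bf x}^{p})\otimes y^{p}, \tag{7}$$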

where $\otimes$ denotes the tensor operation (e.g., ${\bf x}^{p}\otimes y^{p}=[I(y^{p}=1){\bf x}^{p{\!\top}},\ldots,I(y^{p}=K){\bf x}^{p{\!\top}}]^{{\!\top}}$ ). Recall that ${\bf x}^{p}$ denotes the $p$ -th superpixel of the image ${\bf x}$ . Here, $\Psi^{(1)}$ acts as the unary feature mapping. Clearly we have:

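$${\bf w}^{(1){\!\top}}\Psi^{(1)}({\bf y},{\bf x};{\bf H}^{(1)})=\sum_{p\in{\cal N}}\Phi^{(1)}(y^{p},{\bf x}). \tag{8}$$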

For the pairwise part, we define the pairwise feature mapping as:

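$$\Psi^{(2)}({\bf y},{\bf x};{\bf H}^{(2)})=\sum_{(p,q)\in{\cal S}}{\bf H}^{(2)}({\bf x}^{p},{\bf x}^{q})\,I(y^{p}\neq y^{q}). \tag{9}$$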

Then we have the following relation:

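$${\bf w}^{(2){\!\top}}\Psi^{(2)}({\bf y},{\bf x};{\bf H}^{(2)})=\sum_{(p,q)\in{\cal S}}\Phi^{(2)}(y^{p},y^{q},{\bf x}). \tag{10}$$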

We further define ${\bf w}={\bf w}^{(1)}\odot{\bf w}^{(2)}$ , and the joint feature mapping as

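$$\Psi({\bf y},{\bf x};{\bf H})=\Psi^{(1)}({\bf y},{\bf x};{\bf H}^{(1)})\odot\Psi^{(2)}({\bf y},{\bf x};{\bf H}^{(2)}). \tag{11}$$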

With the definitions of ${\bf w}$ and $\Psi$ , the energy function can then be written as:

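$$
\begin{aligned}
E({\bf y},{\bf x};{\bf w},{\bf H})&=\sum_{p\in{\cal N}}\Phi^{(1)}(y^{p},{\bf x};{\bf w},{\bf H}^{(1)})+\sum_{(p,q)\in{\cal S}}\Phi^{(2)}(y^{p},y^{q},{\bf x};{\bf w},{\bf H}^{(2)})\\
&={\bf w}^{{\!\top}}\Psi({\bf y},{\bf x};{\bf H}).
\end{aligned}
\tag{12}
$$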
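The identity between the energy and the inner product ${\bf w}^{{\!\top}}\Psi$ can be checked numerically. The sketch below assumes the tree outputs have already been evaluated and stored as binary lists, and that every class has the same number of unary trees; the function name `joint_feature`, the container layout, and the numbers are illustrative assumptions.

```python
def joint_feature(y, H1_out, H2_out, edges, K):
    """Psi(y, x; H): class-wise unary blocks stacked with the pairwise block.

    H1_out[p][c] holds the binary outputs of class c's trees on superpixel p;
    H2_out[(p, q)] holds the binary outputs of the pairwise trees on edge (p, q).
    """
    T1 = len(H1_out[0][0])
    T2 = len(next(iter(H2_out.values())))
    psi1 = [[0] * T1 for _ in range(K)]
    for p, c in enumerate(y):
        for j, h in enumerate(H1_out[p][c]):
            psi1[c][j] += h            # only the block of the assigned class is filled
    psi2 = [0] * T2
    for p, q in edges:
        if y[p] != y[q]:               # indicator I(y_p != y_q)
            for j, h in enumerate(H2_out[(p, q)]):
                psi2[j] += h
    return [v for block in psi1 for v in block] + psi2

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

# Two classes, two unary trees per class, one pairwise tree (made-up outputs).
H1_out = [[[1, 0], [0, 1]],   # superpixel 0: class-0 trees, class-1 trees
          [[0, 0], [1, 1]]]   # superpixel 1
H2_out = {(0, 1): [1]}
edges = [(0, 1)]
w = [0.5, 1.0, 2.0, 0.25, 3.0]  # w1 for class 0, w1 for class 1, then w2

psi = joint_feature((0, 1), H1_out, H2_out, edges, K=2)
energy_from_psi = dot(w, psi)
# Direct evaluation: unary term of each superpixel plus the pairwise term.
energy_direct = (0.5 * 1 + 1.0 * 0) + (2.0 * 1 + 0.25 * 1) + 3.0 * 1
```

Stacking the blocks this way is what lets the nonlinear tree model be trained with a solver designed for linear SSVMs: for a fixed set of trees, the energy is linear in ${\bf w}$.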

Now we can apply the large-margin framework to learn CRFs using the proposed energy functions by rewriting the optimization problem in Eqn. (6) as:

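$$
\begin{aligned}
\min_{{\bf w},{\boldsymbol{\xi}}}\;\;&\tfrac{1}{2}\|{\bf w}\|^{2}+\tfrac{C}{m}\sum_{i=1}^{m}\xi_{i}\\
{\rm s.t.}\!:\;\;&{\bf w}^{{\!\top}}\bigl[\Psi({\bf y},{\bf x}_{i};{\bf H})-\Psi({\bf y}_{i},{\bf x}_{i};{\bf H})\bigr]\geq{\it\Delta}({\bf y}_{i},{\bf y})-\xi_{i},\\
&\forall i=1,\dots,m,\ \text{and}\ \forall{\bf y}\in{\cal Y};\\
&{\bf w}\geq 0,\ {\boldsymbol{\xi}}\geq 0.
\end{aligned}
\tag{13}
$$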

Note that we add the ${\bf w}\geq 0$ constraint to ensure the submodular property of our energy functions, which we discuss in detail in §II-E. We are now ready to learn ${\bf w}$ and $\Psi$ (or ${\bf H}$ ) in the single optimization problem formulated in Eqn. (13), but it is not yet clear how to solve it. Next we demonstrate how to solve the optimization problem in Eqn. (13) by using column generation and cutting planes.

II-D Learning tree potentials using column generation

We aim to learn a set of decision trees ${\bf H}$ and the potential parameter ${\bf w}$ by solving the optimization problem in Eqn. (13). However, jointly learning ${\bf H}$ and ${\bf w}$ is generally difficult. Here we propose to apply column generation techniques [8, 30] to alternately construct the set of decision trees and solve for ${\bf w}$ . From the point of view of column generation, the dimension of the primal variable ${\bf w}$ is infinitely large; column generation iteratively selects (generates) variables for solving the optimization. In our case, the infinitely many dimensions of ${\bf w}$ correspond to infinitely many decision trees, thus we iteratively generate decision trees to solve the optimization.

We construct a working set of decision trees (denoted as $\mathcal{W}_{\bf H}$ ). During each column generation iteration we perform two steps. In the first step, we generate new decision trees and add them to $\mathcal{W}_{\bf H}$ . In the second step, we solve the optimization problem in Eqn. (13) restricted to the current working set $\mathcal{W}_{\bf H}$ to obtain the solution ${\bf w}$ . We repeat these two steps until convergence. Next we describe how to generate decision trees in a principled way by using the dual solution of the optimization in Eqn. (13), which is similar to the conventional column generation technique. First we derive the Lagrange dual problem of Eqn. (13), which can be written as

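$$
\begin{aligned}
\max_{{\boldsymbol{\lambda}},{\boldsymbol{\theta}}}\;\;&\sum_{i,{\bf y}}\lambda_{(i,{\bf y})}{\it\Delta}({\bf y}_{i},{\bf y})-\tfrac{1}{2}\Bigl\|\sum_{i,{\bf y}}\lambda_{(i,{\bf y})}\bigl[\Psi({\bf y},{\bf x}_{i};{\bf H})-\Psi({\bf y}_{i},{\bf x}_{i};{\bf H})\bigr]+{\boldsymbol{\theta}}\Bigr\|^{2}\\
{\rm s.t.}\!:\;\;&0\leq\sum_{{\bf y}}\lambda_{(i,{\bf y})}\leq\tfrac{C}{m},\ \forall i=1,\dots,m;\ {\boldsymbol{\theta}}\geq 0,\ {\boldsymbol{\lambda}}\geq 0.
\end{aligned}
\tag{14}
$$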

Here ${\boldsymbol{\theta}},\boldsymbol{\lambda}$ are the dual variables. When using the column generation technique, one needs to find the most violated constraint in the dual. However, the constraints of the dual problem do not involve the decision trees ${\bf H}$ . Instead of examining the dual constraints, we inspect the KKT condition, which is an important difference compared to existing column generation techniques. According to the KKT condition, at optimality the following condition holds for the primal solution ${\bf w}$ and the current working set $\mathcal{W}_{{\bf H}}$ :

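$${\bf w}\geq\sum_{i,{\bf y}}\lambda_{(i,{\bf y})}\bigl[\Psi({\bf y},{\bf x}_{i};{\bf H})-\Psi({\bf y}_{i},{\bf x}_{i};{\bf H})\bigr]. \tag{15}$$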

All of the generated ${\bf H}\in\mathcal{W}_{{\bf H}}$ satisfy the above condition. Generating the new decision trees which most violate this condition would therefore contribute the most to the optimization of Eqn. (13). Hence the strategy for generating new decision trees is to solve the following problem:

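$${\bf H}^{\star}=\operatorname*{argmax}_{{\bf H}}\;\sum_{i,{\bf y}}\lambda_{(i,{\bf y})}\bigl[\Psi({\bf y},{\bf x}_{i};{\bf H})-\Psi({\bf y}_{i},{\bf x}_{i};{\bf H})\bigr]. \tag{16}$$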

Then ${\bf H}^{\star}$ is added to the current working set $\mathcal{W}_{{\bf H}}$ . If ${\bf H}^{\star}$ still satisfies the condition in Eqn. (15), the current solution of ${\bf H}$ and ${\bf w}$ is already the globally optimal one.
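The KKT condition in Eqn. (15) can be traced back to stationarity of the Lagrangian. The derivation below is a sketch under stated assumptions: the primal is taken to be the standard margin-rescaling SSVM objective $\tfrac{1}{2}\|{\bf w}\|^{2}+\tfrac{C}{m}\sum_{i}\xi_{i}$ with the extra constraint ${\bf w}\geq 0$, and the shorthand $\delta\Psi_{i}({\bf y})$ and the multipliers $\mu_{i}$ are notation introduced here for illustration only.

```latex
\begin{aligned}
\delta\Psi_{i}({\bf y}) &:= \Psi({\bf y},{\bf x}_{i};{\bf H})-\Psi({\bf y}_{i},{\bf x}_{i};{\bf H}),\\
L &= \tfrac{1}{2}\|{\bf w}\|^{2}+\tfrac{C}{m}\sum_{i}\xi_{i}
   -\sum_{i,{\bf y}}\lambda_{(i,{\bf y})}\bigl[{\bf w}^{\top}\delta\Psi_{i}({\bf y})-\Delta({\bf y}_{i},{\bf y})+\xi_{i}\bigr]
   -{\boldsymbol{\theta}}^{\top}{\bf w}-\sum_{i}\mu_{i}\xi_{i},\\
\frac{\partial L}{\partial{\bf w}}=0 \;&\Rightarrow\; {\bf w}=\sum_{i,{\bf y}}\lambda_{(i,{\bf y})}\,\delta\Psi_{i}({\bf y})+{\boldsymbol{\theta}},\\
\frac{\partial L}{\partial\xi_{i}}=0 \;&\Rightarrow\; \sum_{{\bf y}}\lambda_{(i,{\bf y})}=\tfrac{C}{m}-\mu_{i}\leq\tfrac{C}{m},\\
{\boldsymbol{\theta}}\geq 0 \;&\Rightarrow\; {\bf w}\geq\sum_{i,{\bf y}}\lambda_{(i,{\bf y})}\,\delta\Psi_{i}({\bf y}).
\end{aligned}
```

Under these assumptions, the first stationarity condition gives the inequality on ${\bf w}$ once ${\boldsymbol{\theta}}\geq 0$ is used, and the second reproduces the upper bound $\tfrac{C}{m}$ on $\sum_{{\bf y}}\lambda_{(i,{\bf y})}$ that appears in the dual, which serves as a consistency check.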

The optimization in Eqn. (16) for generating new decision trees can be independently decomposed into solving the unary part and the pairwise part. Hence ${\bf H}^{\star}$ can be written as: ${\bf H}^{\star}=[{\bf H}^{(1)\star},{\bf H}^{(2)\star}]$ . For the unary part, we learn class-wise decision trees, namely, we generate $K$ decision trees corresponding to $K$ categories at each column generation iteration. Hence ${\bf H}^{(1)\star}$ is composed of $K$ decision trees: ${\bf H}^{(1)\star}=[{\hbar}_{1}^{(1)\star},\dots,{\hbar}_{K}^{(1)\star}]$ . More specifically, according to the definition of $\Psi({\bf y},{\bf x})$ in Eqn. (11), we solve the following $K$ problems:

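$$
\begin{aligned}
\forall c=1,\dots,K:\quad{\hbar}_{c}^{(1)\star}(\cdot)=\operatorname*{argmax}_{{\hbar}\in{\cal H}}\;\sum_{i,{\bf y}}\biggl[&\sum_{p\in{\cal N},\,y^{p}=c}\underbrace{\lambda_{(i,{\bf y})}{\hbar}_{y^{p}}^{(1)}({\bf x}_{i}^{p})}_{\text{positive}}\\
&-\sum_{p\in{\cal N},\,y_{i}^{p}=c}\underbrace{\lambda_{(i,{\bf y})}{\hbar}_{y_{i}^{p}}^{(1)}({\bf x}_{i}^{p})}_{\text{negative}}\biggr].
\end{aligned}
\tag{17}
$$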

To solve the above optimization problems, we train $K$ weighted decision tree classifiers. Specifically, when training decision trees for the $c$ -th class, the training data is composed of those superpixels whose ground truth label or predicted label is equal to the category label $c$ . Since the output of the decision tree is in $\{0,1\}$ and $\lambda_{(i,{\bf y})}\geq 0$ , the maximization in Eqn. (17) is achieved if ${\hbar}_{c}^{(1)}$ outputs 1 for each superpixel $p$ with $y^{p}=c$ , and outputs 0 for each superpixel $p$ with $y_{i}^{p}=c$ . Therefore, as indicated by the underbraces in Eqn. (17), superpixels with predicted labels of category $c$ are used as positive training examples, while superpixels with ground truth labels of category $c$ are used as negative training examples. The dual solution ${\boldsymbol{\lambda}}$ serves as the weighting of the training data.
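The construction of the weighted training set for one class can be sketched as follows. This is an illustration only: a weighted decision stump stands in for the decision tree trainer of [3], superpixels that are both predicted and labelled as class $c$ are skipped because their positive and negative terms cancel, and the names `build_training_set`, `fit_stump`, and the container layout are assumptions.

```python
def build_training_set(c, active, X, Y_true):
    """Collect weighted examples for the class-c tree.

    active: list of (i, y_pred, lam) with dual weight lam > 0.
    X[i][p]: feature tuple of superpixel p in image i; Y_true[i][p]: its label.
    """
    examples = []  # (features, target, weight)
    for i, y_pred, lam in active:
        for p, x in enumerate(X[i]):
            pred_c = y_pred[p] == c
            true_c = Y_true[i][p] == c
            if pred_c and not true_c:
                examples.append((x, 1, lam))   # positive: predicted as c
            elif true_c and not pred_c:
                examples.append((x, 0, lam))   # negative: ground truth is c
            # if both hold, the two terms cancel and the superpixel is skipped
    return examples

def fit_stump(examples):
    """Weighted stump maximizing sum_pos w*h(x) - sum_neg w*h(x), with h in {0, 1}."""
    best_score, best_h = float("-inf"), None
    for f in range(len(examples[0][0])):
        for thr in sorted({x[f] for x, _, _ in examples}):
            for sign in (1, -1):
                def h(x, f=f, thr=thr, sign=sign):
                    return 1 if sign * (x[f] - thr) >= 0 else 0
                score = sum(w * h(x) * (1 if t == 1 else -1) for x, t, w in examples)
                if score > best_score:
                    best_score, best_h = score, h
    return best_h, best_score

# One image with three superpixels and one active labeling (made-up numbers).
X = [[(0.1,), (0.9,), (0.8,)]]
Y_true = [[0, 1, 1]]
active = [(0, (1, 1, 0), 0.5)]
examples = build_training_set(1, active, X, Y_true)
h, score = fit_stump(examples)
```

Maximizing this score is the same as minimizing the weighted classification error on the collected examples, which is why an off-the-shelf weighted classifier can be used for this step.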

For the pairwise part, we generate one decision tree in each column generation iteration, hence ${\bf H}^{(2)\star}$ can be written as ${\bf H}^{(2)\star}=[{\hbar}^{(2)\star}]$ . The new decision tree for the pairwise part is generated as:

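$$
\begin{aligned}
{\hbar}^{(2)\star}(\cdot,\cdot)=\operatorname*{argmax}_{{\hbar}\in{\cal H}}\;\sum_{i,{\bf y}}\lambda_{(i,{\bf y})}\biggl[&\sum_{(p,q)\in{\cal S}}\underbrace{{\hbar}^{(2)}({\bf x}_{i}^{p},{\bf x}_{i}^{q})\,I(y^{p}\neq y^{q})}_{\text{positive}}\\
&-\sum_{(p,q)\in{\cal S}}\underbrace{{\hbar}^{(2)}({\bf x}_{i}^{p},{\bf x}_{i}^{q})\,I(y_{i}^{p}\neq y_{i}^{q})}_{\text{negative}}\biggr].
\end{aligned}
\tag{18}
$$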

Similar to the unary case, we train a weighted decision tree classifier with ${\boldsymbol{\lambda}}$ as the training example weighting. The positive and negative training data are indicated by the underbraces in Eqn. (18). ${\hbar}^{(2)}$ is the response of a decision tree applied to the pairwise features constructed from two neighbouring superpixels ( ${\bf x}^{p}$ , ${\bf x}^{q}$ ), e.g., color differences or shared boundary lengths.

With the above analysis, we can now apply column generation to jointly learn the decision trees ${\bf H}^{(1)},{\bf H}^{(2)}$ and ${\bf w}$ . The column generation (CG) procedure iterates the following two steps:

1. Solve Eqn. (17) and Eqn. (18) to generate decision trees ${\bf H}^{(1)\star}$ , ${\bf H}^{(2)\star}$ ;
2. Add ${\bf H}^{(1)\star}$ and ${\bf H}^{(2)\star}$ to the working set $\mathcal{W}_{{\bf H}}$ and re-solve for the primal solution ${\bf w}$ and dual solution ${\boldsymbol{\lambda}}$ .

We show two segmentation examples on the Oxford flower dataset produced by our method with different CG iterations in Fig. 1. As can be seen, our method refines the segmentation with the increase of CG iterations. Since this dataset is relatively simple, a few CG iterations are enough to get satisfactory results.

Solving the primal problem in the second step involves a large number of constraints due to the large output space $\{{\bf y}\in{\cal Y}\}$. We next show how to apply the cutting-plane technique [13] to solve this problem efficiently.

II-E Speeding up optimization using cutting-plane

To apply cutting-plane for solving the optimization in Eqn. (II-C), we first derive its $1$ -slack formulation. The $1$ -slack SSVMs formulation was first introduced by [13]. The $1$ -slack formulation of our method can be written as:

[TABLE]

Cutting-plane methods work by finding the most violated constraint for each example $i$

[TABLE]

at every iteration and adding it to the constraint working set. The sketch of our method is summarized in Algorithm 1, which calls Algorithm 2 to solve the $1$-slack optimization.

Implementation details

To deal with the unbalanced appearance of different categories in the dataset, we define ${\it\Delta}({\bf y}_{i},{\bf y})$ as a weighted Hamming loss, which weighs errors for a given class inversely proportionally to the frequency with which it appears in the training data, similar to [24]. In the inference problem of Eqn. (20), when the Hamming loss is used as the label cost ${\it\Delta}$, the label cost term can be absorbed into the unary part. We can therefore apply graph cuts to solve Eqn. (20) efficiently. For more complicated label cost functions, an efficient inference algorithm is proposed in [4]. During each CG iteration, our method first solves Eqn. (II-D), (II-D) given the current ${\bf x}$ and $\xi$, and then solves a quadratic programming (QP) problem given ${\bf H}$. When solving Eqn. (II-D), (II-D), we train weighted decision tree classifiers using the highly optimized decision tree training method of [3].
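The weighted Hamming loss and its absorption into the unary term can be illustrated with a small sketch. Several details here are assumptions and not taken from the paper: the class weight is set to the total label count divided by the class count, inference is taken to minimise energy so that the most violated labelling minimises energy minus loss, and the names `class_weights`, `weighted_hamming` and `loss_augmented_unary` are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Sequence


def class_weights(training_labels: Sequence[int]) -> Dict[int, float]:
    """Weight each class inversely to its frequency (assumed normalisation)."""
    counts = Counter(training_labels)
    total = sum(counts.values())
    return {c: total / n for c, n in counts.items()}


def weighted_hamming(
    y_true: Sequence[int], y_pred: Sequence[int], weights: Dict[int, float]
) -> float:
    """Sum the weight of the true class over all mislabelled superpixels."""
    return sum(weights[t] for t, p in zip(y_true, y_pred) if t != p)


def loss_augmented_unary(
    unary: Sequence[Sequence[float]], y_true: Sequence[int], weights: Dict[int, float]
) -> List[List[float]]:
    """Fold the loss into the unary energies.

    unary[p][c] is the energy of assigning label c to superpixel p. Because the
    loss decomposes over superpixels, subtracting weights[y_true[p]] from every
    wrong label makes the augmented energy equal to energy minus loss.
    """
    return [
        [u - (weights[t] if c != t else 0.0) for c, u in enumerate(row)]
        for row, t in zip(unary, y_true)
    ]
```

Since the augmented problem has the same unary-plus-pairwise form as the original energy, the same inference routine can be reused for it.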

Discussions on the submodularity

It is known that for graph cuts to achieve a globally optimal labelling in segmentation, the energy function must be submodular. For foreground/background segmentation, in which a (super-)pixel label takes a value in $\{0,1\}$, we show that our method preserves this property. An energy function is submodular if its pairwise term satisfies $\eta_{pq}(0,0)+\eta_{pq}(1,1)\leq\eta_{pq}(0,1)+\eta_{pq}(1,0)$. Recall that our pairwise energy is written as $\eta_{pq}(y^{p},y^{q})={\bf w}^{(2){\!\top}}{\bf H}^{(2)}({\bf x}^{p},{\bf x}^{q})I(y^{p}\neq y^{q})$. Clearly, $\eta_{pq}(0,0)=\eta_{pq}(1,1)=0$ because of the indicator function $I(y^{p}\neq y^{q})$. It therefore remains to ensure that $\eta_{pq}(1,0)+\eta_{pq}(0,1)\geq 0$. Given the non-negativity constraint we impose on ${\bf w}$, and since the outputs of the decision trees in our method take values in $\{0,1\}$, we have $\eta_{pq}(1,0)\geq 0$ and $\eta_{pq}(0,1)\geq 0$. This completes the proof of the submodularity of our model. In the case of multi-object segmentation, inference is performed by $\alpha$-expansion with graph cuts.
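The submodularity argument can be checked numerically on a single pair of neighbouring superpixels. The sketch below is illustrative only: `pairwise_energy` and `is_submodular` are hypothetical names, `w2` stands for ${\bf w}^{(2)}$ and `h2_outputs` for the binary tree responses ${\bf H}^{(2)}({\bf x}^{p},{\bf x}^{q})$.

```python
from typing import Sequence


def pairwise_energy(
    w2: Sequence[float], h2_outputs: Sequence[int], y_p: int, y_q: int
) -> float:
    """eta_pq(y_p, y_q) = w2^T H2(x_p, x_q) * I(y_p != y_q)."""
    if y_p == y_q:
        return 0.0
    return float(sum(w * h for w, h in zip(w2, h2_outputs)))


def is_submodular(w2: Sequence[float], h2_outputs: Sequence[int]) -> bool:
    """Check eta(0,0) + eta(1,1) <= eta(0,1) + eta(1,0) for binary labels."""
    def eta(a: int, b: int) -> float:
        return pairwise_energy(w2, h2_outputs, a, b)

    return eta(0, 0) + eta(1, 1) <= eta(0, 1) + eta(1, 0)
```

With non-negative weights and tree outputs in $\{0,1\}$ the check always passes, whereas a negative weight on an active tree, e.g. `is_submodular([-1.0], [1])`, fails it, which is why the non-negativity constraint matters for graph-cut inference.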

Discussions on the non-negative constraint on ${\bf w}$

Our learning framework aligns with boosting methods: we learn a non-negatively weighted ensemble of weak structured learners (constructed from decision trees), which are analogous to the weak learners in boosting methods such as AdaBoost and LPBoost [8], where non-negative weighting is commonly used. Further, a weak structured learner generated by our column generation method is expected to make a positive contribution to the learning objective. If it is of no use to the objective, its weight will approach zero. It is therefore reasonable to enforce the non-negativity constraint on ${\bf w}$.

III Experiments

To demonstrate the effectiveness of the proposed method, we first compare our model with the most closely related baseline methods, namely SVMs, AdaBoost and SSVMs. In Section III-C, we show that our method achieves state-of-the-art results by exploiting recent advances in feature learning [6, 16].

III-A Experimental setup

The datasets evaluated here include three binary datasets (Weizmann horse, Oxford flower and Graz-02) and two multi-class datasets (MSRC-21 and PASCAL VOC 2012). The Weizmann horse dataset (http://www.msri.org/people/members/eranb/) consists of 328 horse images with various backgrounds, with ground-truth masks available for each image. We use the same data split as in [5] and [17]. The Oxford 17 category flower dataset [26] is composed of 849 flower images. Those with too small a foreground are removed, which leaves 753 images for segmentation [26]. The data split stated in [26] is used to perform the evaluation. In our experiments, images of the Weizmann horse and the Oxford flower datasets are resized to 256 $\times$ 256. The Graz-02 dataset (http://www.emt.tugraz.at/~pinz/) contains 3 categories (bike, car and people). This dataset is considered challenging because the objects appear against various backgrounds and in different poses. We follow the evaluation protocol in [25] and use 150 images for training and 150 for testing for each category. The MSRC-21 dataset [32] is a popular multi-class segmentation benchmark with 591 images containing objects from 21 categories. We follow the standard split to divide the dataset into training/validation/test subsets. The PASCAL VOC 2012 dataset (http://host.robots.ox.ac.uk/pascal/VOC/voc2012/) is a widely used benchmark for semantic segmentation, which contains 2913 images in the trainval set and 1456 images in the test set, covering 21 categories. Unlike many state-of-the-art methods such as [38], we do not use any additional training data for this dataset.

We start by over-segmenting the images into superpixels using SLIC [1], with $\sim$700 superpixels generated per image. We extract dense SIFT descriptors and color histograms around each superpixel centroid with different block sizes (12 $\times$ 12, 24 $\times$ 24, 36 $\times$ 36). The dense SIFT descriptors are then quantized into bag-of-words features using nearest neighbour search with a codebook size of 400. To enforce spatial smoothness, we construct four types of pairwise features, also using different block sizes: color difference in LUV space, color histogram difference, texture difference in terms of LBP operators, and shared boundary length [10]. The number of column generation iterations of our CRFTree is set to 50 based on a validation set. We learn tree potentials with a tree depth of $2$. Training on the MSRC-21 dataset on a standard PC takes around 16 hours.
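The bag-of-words quantization step can be illustrated with a small sketch. It assumes Euclidean nearest-neighbour assignment and returns raw, unnormalised counts; neither detail is stated in the text, and `bow_histogram` is a hypothetical name.

```python
from typing import List, Sequence


def bow_histogram(
    descriptors: Sequence[Sequence[float]], codebook: Sequence[Sequence[float]]
) -> List[int]:
    """Count, for each codeword, how many descriptors have it as nearest neighbour."""
    hist = [0] * len(codebook)
    for d in descriptors:
        # Index of the codeword with the smallest squared Euclidean distance to d.
        nearest = min(
            range(len(codebook)),
            key=lambda k: sum((a - b) ** 2 for a, b in zip(d, codebook[k])),
        )
        hist[nearest] += 1
    return hist
```

In the setup described above the codebook would hold 400 codewords, giving a 400-dimensional histogram per superpixel block.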

III-B Comparing with baseline methods

We first compare CRFTree with some conventional methods, namely linear SVMs, AdaBoost and SSVMs, to demonstrate the superiority of our method. For SVMs and AdaBoost, each superpixel is classified independently without CRFs. In this part we mainly evaluate on the more challenging Graz-02 and MSRC-21 datasets. The regularization parameter $C$ of SVMs, SSVMs and our CRFTree is selected from $\{1,10,100,1000\}$ based on a validation set. We use depth-2 decision trees for training AdaBoost and our CRFTree. The maximum iteration number of AdaBoost is chosen from $\{50, 100, 200\}$. For our method, we treat the foreground and background as two categories in the binary case to learn class-wise potentials.

Graz-02

For a comprehensive evaluation, we use two measurements to quantify the performance on the Graz-02 dataset, which are intersection over union score and the pixel accuracy (including foreground and background). We report the results in Table I. As can be observed, AdaBoost based on a depth-2 decision tree performs better than the linear SVMs. On the other hand, structured methods which jointly consider local information and spatial consistency are able to significantly outperform the simple binary models. By introducing nonlinear and class-wise potential learning, our method is able to gain further improvement over SSVMs.

MSRC-21

We learn class-wise potentials using our CRFTree for each of the 21 classes on the MSRC dataset. The comparison results are summarized in Table II (upper part). Similar conclusions can be drawn as on the Graz-02 dataset, and our CRFTree again outperforms all of its baseline competitors.

III-C Comparing with state-of-the-art methods

Since features play a pivotal role in the performance of vision algorithms, we exploit recent advances in feature learning to pursue state-of-the-art results, i.e., unsupervised feature learning [6] and convolutional neural networks (CNN) [16]. Specifically, for the unsupervised feature learning, we first learn a dictionary $\bf B$ of size 400 with patch size 6 $\times$ 6 from the evaluated image dataset using k-means, and then use soft threshold coding [6] to encode patches extracted from each superpixel block. The final feature vectors (which we call encoding features here) are obtained by performing a three-level max pooling over the superpixel block. For the CNN features, we use the Alex model [16] trained on ImageNet (http://image-net.org). These two versions of our method are denoted as CRFTree (FL) and CRFTree (CNN) respectively. We only report the results of CRFTree (CNN) on the MSRC-21 and PASCAL VOC 2012 datasets, since our method already performs very well using the encoding features on the three binary datasets.

Weizmann horse

We quantify the performance by the global pixel-wise accuracy $S_{a}$ and the foreground intersection over union score $S_{o}$, as in [5]. $S_{a}$ measures the percentage of pixels correctly classified, while $S_{o}$ directly reflects the segmentation quality of the foreground. The results are reported in Table III. Our method performs better than the kernel structural learning method of [5], which may result from the fact that they only introduced nonlinearity into the unary part, while our method achieves nonlinearity on both unary and pairwise terms. The best $S_{a}$ score is obtained by [21]. However, their method relies on the assumption that a perfect bounding box of the horse is available for each test image, which is not practical. In contrast, we provide a principled and general way of nonlinearly learning CRF parameters. We show some segmentation examples of our method in Fig. 2.
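The two measures can be written down directly for flattened binary masks. This is a minimal sketch: the function names are hypothetical, the foreground label is assumed to be 1, and returning 1.0 when neither mask contains foreground is a convention chosen here, not taken from [5].

```python
from typing import Sequence


def pixel_accuracy(gt: Sequence[int], pred: Sequence[int]) -> float:
    """S_a: fraction of pixels whose predicted label matches the ground truth."""
    correct = sum(1 for g, p in zip(gt, pred) if g == p)
    return correct / len(gt)


def foreground_iou(gt: Sequence[int], pred: Sequence[int], fg: int = 1) -> float:
    """S_o: intersection over union of the foreground regions."""
    inter = sum(1 for g, p in zip(gt, pred) if g == fg and p == fg)
    union = sum(1 for g, p in zip(gt, pred) if g == fg or p == fg)
    # Assumed convention: two empty foregrounds count as a perfect match.
    return inter / union if union else 1.0
```

For example, with `gt = [1, 1, 0, 0]` and `pred = [1, 0, 0, 1]`, half of the pixels are correct, while the foreground intersection is one pixel out of a union of three.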

Oxford flower

As in [5], we also use $S_{a}$ and $S_{o}$ to measure the performance on the Oxford flower dataset, and report the results in Table IV. Our method performs comparably to the original work of [26] on this dataset in terms of $S_{o}$, while again obtaining better results than the closely related state-of-the-art work of [5]. It is also worth noting that the method in [26] is very domain specific, relying on modelling the flower's shape (center and petal), while ours is generally applicable.

Graz-02

As in the work of [25], [9], [2], [17], we also evaluate the F-score on the Graz-02 dataset, in addition to the intersection over union score and pixel accuracy mentioned above. The F-score is defined as $F=2pr/(p+r)$, where $p$ is the precision and $r$ is the recall. We summarize the results in Table V and Table I. From Table V, it can be seen that our method significantly outperforms all the compared methods, which demonstrates the power of nonlinear and class-wise potential learning. Furthermore, we can observe from Table I that, compared with the previous results, adding more features helps to improve the performance.
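The F-score definition above translates into a few lines of code for flattened binary masks. The sketch assumes the foreground label is 1 and returns 0.0 when precision or recall is undefined; both choices, and the name `f_score`, are assumptions made for illustration.

```python
from typing import Sequence


def f_score(gt: Sequence[int], pred: Sequence[int], fg: int = 1) -> float:
    """F = 2pr / (p + r), with precision p and recall r of the foreground class."""
    tp = sum(1 for g, q in zip(gt, pred) if g == fg and q == fg)
    fp = sum(1 for g, q in zip(gt, pred) if g != fg and q == fg)
    fn = sum(1 for g, q in zip(gt, pred) if g == fg and q != fg)
    if tp == 0:
        # Assumed convention: no true positives gives a score of zero.
        return 0.0
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return 2 * p * r / (p + r)
```

With `gt = [1, 1, 0, 0]` and `pred = [1, 0, 0, 1]` there is one true positive, one false positive and one false negative, so precision and recall are both one half.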

MSRC-21

The comparison with state-of-the-art works is reported in the lower part of Table II. As we can see, by incorporating more advanced features, our CRFTree gains significant improvements over the previous results, which only use bag-of-words and color histogram features. It is worth noting that our method performs better than the closely related work of Lucchi et al. [24], which exploits non-linear kernels. It should be pointed out that we did not employ any global potentials (while in [24], they improve the global and average per-category accuracy from 70 and 73 to 82 and 76 by adding global information). If global or higher-order potentials are incorporated into our model, further performance improvement can be expected. We show some qualitative evaluation examples in Fig. 3.

PASCAL VOC 2012

We generate deep features for each superpixel by averaging the pixel-wise feature map scores within the superpixel obtained from a pretrained FCN model [22]. We then train our CRFTree model on the standard PASCAL VOC 2012 training dataset with the generated deep features. Following the standard evaluation procedure for the PASCAL VOC challenge, we upload our segmentation results to the test server and use the average intersection over union as the evaluation metric. We compare against several state-of-the-art methods ([12], [7], [22], [38]) on the test set of the PASCAL VOC 2012 dataset. The results are reported in Table VI. As seen from the table, our CRFTree beats the Hypercolumn [12] and the CFM [7], and outperforms the FCN [22] by a notable margin. Although our method is outperformed by [38], it should be noted that their result is obtained by using extra training data (11,685 images vs 1456 images used for training our CRFTree). Some qualitative evaluation examples of our method are illustrated in Fig. 4.
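The superpixel-level averaging of pixel-wise scores can be sketched as follows. The data layout is an assumption made for illustration: `score_map[r][c]` holds the list of per-class scores at pixel (r, c), `superpixel_ids[r][c]` holds the index of the superpixel containing that pixel, and `superpixel_mean_features` is a hypothetical name.

```python
from typing import Dict, List, Sequence


def superpixel_mean_features(
    score_map: Sequence[Sequence[Sequence[float]]],
    superpixel_ids: Sequence[Sequence[int]],
) -> Dict[int, List[float]]:
    """Average the per-pixel score vectors over the pixels of each superpixel."""
    sums: Dict[int, List[float]] = {}
    counts: Dict[int, int] = {}
    for row_scores, row_ids in zip(score_map, superpixel_ids):
        for scores, sp in zip(row_scores, row_ids):
            if sp not in sums:
                sums[sp] = [0.0] * len(scores)
                counts[sp] = 0
            for k, s in enumerate(scores):
                sums[sp][k] += s
            counts[sp] += 1
    return {sp: [v / counts[sp] for v in vec] for sp, vec in sums.items()}
```

Each resulting vector has one entry per score channel and serves as the feature of the corresponding superpixel.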

IV Conclusion

Nonlinear structured learning has been a promising yet challenging topic in the community. In this work, we have proposed a nonlinear structured learning method of tree potentials for image segmentation. The unary and pairwise potentials are ensembles of class-wise trees, with the ensemble parameters and the trees jointly learned in a unified large-margin framework. In this way, nonlinearity is easily introduced into CRF learning. The resulting model involves an exponential number of variables and constraints. We therefore derive a novel algorithm combining a modified column generation method and the cutting-plane technique for efficient model training. We have demonstrated the superiority of the proposed nonlinear potential learning method by comparing against state-of-the-art methods on both binary and multi-class object segmentation datasets. A potential disadvantage of our method is that it is prone to overfitting due to its strong nonlinear learning capacity. This can be alleviated by using more training data. On the other hand, as we show in Table II, our method using pre-trained CNN features has shown the best performance. Further combining our method with deep learning techniques is therefore a worthwhile direction for future work.

Bibliography

[1] R. Achanta, K. Smith, A. Lucchi, P. Fua, and S. Süsstrunk. SLIC superpixels compared to state-of-the-art superpixel methods. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2012.
[2] D. Aldavert, A. Ramisa, R. L. de Mántaras, and R. Toledo. Fast and robust object segmentation with the integral linear classifier. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2010.
[3] R. Appel, T. J. Fuchs, P. Dollár, and P. Perona. Quickly boosting decision trees - pruning underachieving features early. In Proceedings of International Conference on Machine Learning, 2013.
[4] A. Bauer, S. Nakajima, and K.-R. Müller. Efficient exact inference with loss augmented objective in structured learning. IEEE Transactions on Neural Networks and Learning Systems, 2016.
[5] L. Bertelli, T. Yu, D. Vu, and B. Gokturk. Kernelized structural SVM learning for supervised object segmentation. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2011.
[6] A. Coates and A. Y. Ng. The importance of encoding versus training with sparse coding and vector quantization. In Proceedings of International Conference on Machine Learning, 2011.
[7] J. Dai, K. He, and J. Sun. Convolutional feature masking for joint object and stuff segmentation. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2015.
[8] A. Demiriz, K. P. Bennett, and J. Shawe-Taylor. Linear programming boosting via column generation. Machine Learning, 2002.
