Interactive Segmentation as Gaussian Process Classification
Minghao Zhou, Hong Wang, Qian Zhao, Yuexiang Li, Yawen Huang, Deyu Meng, Yefeng Zheng

TL;DR
This paper introduces GPCIS, a Gaussian process-based framework for interactive segmentation that explicitly propagates click information, leading to improved accuracy and efficiency over existing deep learning methods.
Contribution
It formulates interactive segmentation as a Gaussian process classification problem and develops an efficient, flexible inference method with theoretical guarantees.
Findings
Outperforms existing methods on benchmark datasets.
Explicit click information propagation improves segmentation accuracy.
Efficient sampling with linear complexity enhances practical usability.
Table 1: NoIC on DAVIS (ResNet50 backbone) under different test-time settings.
| NoIC | 36 | 30 | 21 | 15 | 15 | 8 | 2 |
|---|---|---|---|---|---|---|---|

Table 2: NoC@85 and NoC@90 of the compared methods on the four datasets.
| Backbone | Method | GrabCut [38] NoC@85 | GrabCut [38] NoC@90 | Berkeley [32] NoC@85 | Berkeley [32] NoC@90 | SBD [13] NoC@85 | SBD [13] NoC@90 | DAVIS [36] NoC@85 | DAVIS [36] NoC@90 | Average NoC@85 | Average NoC@90 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| DeepLab-LargeFOV [4] | ∗RIS-Net [24] (’17) | - | 5.00 | - | 6.03 | - | - | - | - | - | - |
| CAN [53] | LD [23] (’18) | 3.20 | 4.79 | - | - | 7.41 | 10.78 | 5.95 | 9.57 | - | - |
| FCN [29] | ∗DOS [51] (’16) | 5.08 | 6.08 | - | - | 9.22 | 12.80 | 9.03 | 12.58 | - | - |
| FCN [29] | ∗CMG [31] (’19) | - | 3.58 | - | 5.60 | - | - | - | - | - | - |
| DenseNet [17] | BRS [18] (’19) | 2.60 | 3.60 | - | 5.08 | 6.59 | 9.78 | 5.58 | 8.24 | - | 6.68 |
| Xception-65 [8] | ∗CA [22] (’20) | - | 3.07 | - | 4.94 | - | - | 5.16 | - | - | - |
| SegFormerB0-S2 [50, 7] | RITM [41] (’21) | 1.62 | 1.82 | 1.84 | 2.92 | 4.26 | 6.38 | 4.65 | 6.13 | 3.09 | 4.31 |
| SegFormerB0-S2 [50, 7] | FocalClick [7] (’22) | 1.66 | 1.90 | - | 3.14 | 4.34 | 6.51 | 5.02 | 7.06 | - | 4.65 |
| SegFormerB0-S2 [50, 7] | GPCIS (Ours) | 1.60 | 1.76 | 1.84 | 2.70 | 4.16 | 6.28 | 4.45 | 6.04 | 3.01 | 4.20 |
| HRNet18s-S2 [43, 7] | RITM [41] (’21) | 2.00 | 2.24 | 2.13 | 3.19 | 4.29 | 6.36 | 4.89 | 6.54 | 3.33 | 4.58 |
| HRNet18s-S2 [43, 7] | FocalClick [7] (’22) | 1.86 | 2.06 | - | 3.14 | 4.30 | 6.52 | 4.92 | 6.48 | - | 4.55 |
| HRNet18s-S2 [43, 7] | GPCIS (Ours) | 1.74 | 1.94 | 1.83 | 2.65 | 4.28 | 6.25 | 4.62 | 6.16 | 3.12 | 4.25 |
| ResNet50 [15] | ∗FCANet [26] (’20) | 2.18 | 2.62 | - | 4.66 | - | - | 5.54 | 8.83 | - | - |
| ResNet50 [15] | f-BRS-B [40] (’20) | 2.20 | 2.64 | 2.17 | 4.22 | 4.55 | 7.45 | 5.44 | 7.81 | 3.59 | 5.53 |
| ResNet50 [15] | CDNet [6] (’21) | 2.22 | 2.64 | - | 3.69 | 4.37 | 7.87 | 5.17 | 6.66 | - | 5.22 |
| ResNet50 [15] | RITM [41] (’21) | 2.16 | 2.30 | 1.90 | 2.95 | 3.97 | 5.92 | 4.56 | 6.05 | 3.15 | 4.31 |
| ResNet50 [15] | FocusCut [25] (’22) | 1.60 | 1.78 | 1.86 | 3.44 | 3.62 | 5.66 | 5.00 | 6.38 | 3.02 | 4.32 |
| ResNet50 [15] | FocalClick [7] (’22) | 1.92 | 2.14 | 1.87 | 2.86 | 3.84 | 5.82 | 4.61 | 6.01 | 3.06 | 4.21 |
| ResNet50 [15] | GPCIS (Ours) | 1.64 | 1.82 | 1.60 | 2.60 | 3.80 | 5.71 | 4.37 | 5.89 | 2.85 | 4.00 |


Table 3: Additional metrics on Berkeley and DAVIS, with the number of parameters and inference speed.
| Method | Berkeley [32] NoC100@90 | Berkeley [32] NoF100@90 | Berkeley [32] IoU&1 | Berkeley [32] IoU&5 | Berkeley [32] NoIC | DAVIS [36] NoC100@90 | DAVIS [36] NoF100@90 | DAVIS [36] IoU&1 | DAVIS [36] IoU&5 | DAVIS [36] NoIC | #Params (MB) | SPC (ms) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| f-BRS-B [40] | 6.21 | 2 | 77.06% | 85.00% | 1 | 22.62 | 57 | 70.97% | 83.87% | 0 | 39.44 | 116.53 |
| CDNet [6] | - | - | - | - | - | 18.59 | 48 | - | - | - | 39.90 | 57.76 |
| RITM [41] | 3.75 | 1 | 76.88% | 94.66% | 2 | 18.09 | 51 | 72.89% | 89.14% | 74 | 39.48 | 34.24 |
| FocusCut [25] | 4.63 | 1 | 78.89% | 92.89% | 1 | 19.00 | 45 | 72.71% | 87.58% | 6 | 40.36 | 950.68 |
| FocalClick [7] | 4.46 | 2 | 75.59% | 94.90% | 0 | 17.74 | 49 | 70.76% | 88.90% | 42 | 39.50 | 41.80 |
| GPCIS (Ours) | 3.36 | 1 | 79.43% | 95.11% | 0 | 17.03 | 44 | 75.67% | 89.60% | 2 | 39.39 | 38.82 |

| Variants | DKL-F | DKL-W | Concat | Avg. NoC@85 | Avg. NoC@90 | |
|---|---|---|---|---|---|---|
| (a) | ✗ | ✓ | ✓ | ✓ | 2.98 | 4.07 |
| (b) | ✓ | ✗ | ✓ | ✓ | 3.00 | 4.16 |
| (c) | ✓ | ✓ | ✗ | ✓ | 3.10 | 4.34 |
| (d) | ✓ | ✓ | ✓ | ✗ | 2.96 | 4.10 |
| (e) | ✓ | ✓ | ✓ | ✓ | 2.85 | 4.00 |
Interactive Segmentation as Gaussian Process Classification
Minghao Zhou1,2, Hong Wang2 (corresponding author), Qian Zhao1, Yuexiang Li2, Yawen Huang2, Deyu Meng1,3,4, Yefeng Zheng2
1Xi’an Jiaotong University, Xi’an, China 2Tencent Jarvis Lab, Shenzhen, China
3Peng Cheng Laboratory, Shenzhen, China 4Macau University of Science and Technology, Macau, China
[email protected] {hazelhwang, vicyxli, yawenhuang, yefengzheng}@tencent.com
{timmy.zhaoqian, dymeng}@mail.xjtu.edu.cn
Abstract
Click-based interactive segmentation (IS) aims to extract the target objects under user interaction. For this task, most of the current deep learning (DL)-based methods mainly follow the general pipelines of semantic segmentation. Albeit achieving promising performance, they do not fully and explicitly utilize and propagate the click information, inevitably leading to unsatisfactory segmentation results, even at clicked points. To address this issue, we propose to formulate the IS task as a Gaussian process (GP)-based pixel-wise binary classification model on each image. To solve this model, we utilize amortized variational inference to approximate the intractable GP posterior in a data-driven manner and then decouple the approximated GP posterior into double space forms for efficient sampling with linear complexity. Then, we correspondingly construct a GP classification framework, named GPCIS, which is integrated with the deep kernel learning mechanism for more flexibility. The main characteristics of the proposed GPCIS are: 1) Under the explicit guidance of the derived GP posterior, the information contained in clicks can be finely propagated to the entire image and then boost the segmentation; 2) The accuracy of predictions at clicks has good theoretical support. These merits of GPCIS as well as its good generality and high efficiency are substantiated by comprehensive experiments on several benchmarks, as compared with representative methods both quantitatively and qualitatively. Code will be released at https://github.com/zmhhmz/GPCIS_CVPR2023.
1 Introduction
Driven by the huge potential in reducing the pixel-wise annotation cost, interactive segmentation (IS) has sparked much research interest [14], which aims to segment the target objects under user interaction with less interaction cost. Among various types of user interaction [52, 1, 3, 27, 2, 49, 30, 54], in this paper, we focus on the popular click-based mode, where positive annotations are clicked on the target object while negative ones are clicked in the background regions [18, 40, 41, 25, 7].
Recent years have witnessed the promising success of deep learning (DL)-based methods in the IS task. The most commonly adopted research line is that the user interaction is encoded as click maps and fed into a deep neural network (DNN) together with input images to extract deep features for the subsequent segmentation [51, 41]. However, these methods generally suffer from two limitations:
- As shown in Fig. 1 (a), after extracting the deep features, they generally perform pixel-wise classification without specific designs for the IS task. As a result, during the last-layer classification, the deep features of different pixels are not fully interactive and the information contained in clicked pixels cannot be finely propagated to other pixels under explicit regularization.
- There is no explicit theoretical support that the clicked regions can be properly activated and correctly classified.
Although some researchers have proposed different strategies, e.g., non-local-based modules [6] and the backpropagating refinement scheme [18, 40], they usually incur extra computational cost and are not capable enough to deal with the two problems simultaneously. Besides, the relations between deep features of different pixels are generally characterized and captured based on off-the-shelf network modules. Such implicit design makes it hard to clearly understand the working mechanism underlying these methods.
To alleviate these aforementioned issues, inspired by the intrinsic capabilities of Gaussian process (GP) models, e.g., explicitly measuring the relations between data points by a kernel function, and promoting accurate predictions at training data via interpolation, we rethink the IS task and attempt to construct a GP-based inference framework for the specific IS task. Concretely, as shown in Fig. 1 (b), we propose to treat the IS task from an alternative perspective and reformulate it as a pixel-level binary classification problem on each image, where clicks are viewed as training pixels with classification labels, i.e., foreground or background, and the unclicked points as the to-be-classified testing pixels. With such understanding, we construct the corresponding GP classification model. To solve it, we propose to utilize the amortized variational inference to efficiently approximate the intractable GP posterior in a data-driven manner, and then adopt the decoupling techniques [47, 48] to achieve the GP posterior sampling with linear complexity. To improve the learning flexibility, we further embed the deep kernel learning strategy into the decoupled GP posterior inference procedure. Finally, by correspondingly integrating the derived GP posterior sampling mechanism with DNN backbones, we construct a GP Classification-based Interactive Segmentation framework, called GPCIS. In summary, our contributions are mainly three-fold:
- We propose to carefully formulate the IS task as a Gaussian process classification model on each image. To adapt the GP model to the IS task, we propose specific designs and accomplish the approximation and efficient sampling of the GP posterior, which are then effectively integrated with the deep kernel learning mechanism for more flexibility.
- We build a concise and clear interactive segmentation network under a theoretically sound framework. As shown in Fig. 1 (b), the correlation between the deep features of different pixels is modeled by GP posterior. With such explicit regularization, the information contained in clicks can be finely propagated to the entire image and boost the prediction of unclicked pixels. Besides, our method can provide rational theoretical support for accurate predictions at clicked points. These merits are finely validated in Sec. 5.2.
- Extensive experimental comparisons as well as model verification comprehensively substantiate the superiority of our proposed GPCIS in segmentation quality and interaction efficiency. It is worth mentioning that the proposed GPCIS can consistently achieve superior performance under different backbone segmentors, showing its fine generality.
2 Related Work
In this section, we briefly review the related work on the click-based interactive segmentation (IS) task.
Traditional methods for IS [11, 12, 19, 38] generally utilize the low-level features of to-be-segmented images and build optimization-based graphical models, which usually suffer from unsatisfactory performance and low efficiency. Motivated by the promising success of deep neural networks (DNN) [29, 4] in semantic segmentation, various methods have borrowed these pipelines for handling the IS task by transforming user interactions into click maps and taking them as the network input [51, 24, 23, 26]. In 99%AccuracyNet [10] and RITM [41], the mask predicted during the previous click was also regarded as the network input for helping the predictions for the current click. Recently, to better exploit the information contained in clicks and further propagate it to the entire image for promoting the segmentation of unclicked points, FCANet [26] put more emphasis on leveraging the first click and CDNet [6] designed the non-local-based conditional diffusion modules. Although these methods can deal with the relations between the features of different pixels to some extent, they can hardly provide any explicit theoretical basis for correct predictions at clicked points. To this end, BRS [18], f-BRS [40], and CA [22] have proposed to perform loss backpropagation during testing to adapt click maps or their network parameters to the current testing image. Clearly, the extra computation cost would adversely affect the efficiency of interactive segmentation. Recently, another research line, e.g., RIS-Net [24], FocalClick [7], and FocusCut [25], deals with the IS task from a local view to refine the segmentation results. Albeit attaining performance improvement, these methods have not fully exploited the relations between the deep features of clicks and those of unclicked points. To address this issue, in this paper, we build a concise and efficient model to explicitly model the relations between the deep features of the entire to-be-segmented image.
It is worth noting that [42] employs a Gaussian process model to develop an active learning framework for interactive segmentation, aiming to actively query pixels to be labeled.
3 Preliminaries: Gaussian Processes
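Gaussian processes (GP) [45] can be understood as the “Gaussian distribution over functions”. As a compelling tool, by directly modeling the prior and posterior of functions, it has been widely adopted in various tasks [44, 33, 28]. Mathematically, a GP is defined as a stochastic process where the joint distribution of any finite collection of its random variables is Gaussian. Given a mean function $m(\cdot)$ and a covariance function $k(\cdot,\cdot)$, a GP $f \sim \mathcal{GP}(m, k)$ satisfies $f(\mathbf{X}) \sim \mathcal{N}(m(\mathbf{X}), \mathbf{K}_{X,X})$ with mean $m(\mathbf{X})$ and covariance matrix $\mathbf{K}_{X,X} = k(\mathbf{X}, \mathbf{X})$, for any finite observations $\mathbf{X} = \{\mathbf{x}_i\}_{i=1}^{n}$. Specifically, for the GP prior, $m(\cdot)$ is generally assumed to be a constant zero function. The covariance function $k(\cdot,\cdot)$ can be elaborately designed to model the correlations between the data points.
Given noise-free latent observations $\mathbf{f} = f(\mathbf{X})$ at training data $\mathbf{X}$, the GP posterior at testing data $\mathbf{X}_*$ is written as [45]:
$$p(\mathbf{f}_* \mid \mathbf{X}_*, \mathbf{X}, \mathbf{f}) = \mathcal{N}(\boldsymbol{\mu}_*, \boldsymbol{\Sigma}_*), \tag{1}$$
where
$$\boldsymbol{\mu}_* = \mathbf{K}_{*,X}\mathbf{K}_{X,X}^{-1}\mathbf{f}, \qquad \boldsymbol{\Sigma}_* = \mathbf{K}_{*,*} - \mathbf{K}_{*,X}\mathbf{K}_{X,X}^{-1}\mathbf{K}_{X,*}, \tag{2}$$
with $\mathbf{K}_{*,X} = k(\mathbf{X}_*, \mathbf{X})$, $\mathbf{K}_{X,*} = \mathbf{K}_{*,X}^{\top}$, and $\mathbf{K}_{*,*} = k(\mathbf{X}_*, \mathbf{X}_*)$.
As seen, the GP posterior utilizes the relations between the testing data $\mathbf{X}_*$ and the training data $\mathbf{X}$ to estimate the distribution of the function at $\mathbf{X}_*$, where the relations are explicitly measured by the kernel function $k(\cdot,\cdot)$.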
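The noise-free GP posterior described above can be illustrated with a short NumPy sketch. It assumes a unit-variance RBF kernel and toy one-dimensional inputs; the names `rbf_kernel`, `gp_posterior`, and the `jitter` term are illustrative choices rather than anything taken from the paper. The point it shows is the interpolation property: at the training inputs the posterior mean reproduces the observed values and the posterior variance collapses to nearly zero.

```python
import numpy as np


def rbf_kernel(a, b, lengthscale=1.0):
    """Squared-exponential kernel matrix between the rows of a and b."""
    sq_dist = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    return np.exp(-0.5 * np.maximum(sq_dist, 0.0) / lengthscale**2)


def gp_posterior(x_train, f_train, x_test, jitter=1e-8):
    """Zero-mean GP posterior at x_test given noise-free values f_train."""
    # A tiny jitter keeps the linear solve numerically stable (assumption).
    k_tt = rbf_kernel(x_train, x_train) + jitter * np.eye(len(x_train))
    k_st = rbf_kernel(x_test, x_train)
    k_ss = rbf_kernel(x_test, x_test)
    mean = k_st @ np.linalg.solve(k_tt, f_train)
    cov = k_ss - k_st @ np.linalg.solve(k_tt, k_st.T)
    return mean, cov


x_train = np.array([[0.0], [1.0], [2.5]])
f_train = np.array([1.0, -0.5, 2.0])
# Test inputs: the three training inputs plus one point far from all of them.
x_test = np.vstack([x_train, [[4.0]]])

mean, cov = gp_posterior(x_train, f_train, x_test)
```

The far-away test point keeps a much larger posterior variance than the training inputs, which is the behavior the later sections rely on when clicks play the role of training data.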
4 Methodology
In this section, we first propose that the interactive segmentation (IS) problem can be regarded as a pixel-wise binary classification task on each input image. Based on such understanding, we carefully formulate this task with a GP classification model. Then, to solve it, we propose the corresponding algorithms to finely approximate and efficiently sample from the GP posterior. Finally, by flexibly combining the GP model with DNN backbones, we construct the entire inference framework. The details are given below.
4.1 Model Formulation
For the interactive segmentation on an RGB image , users iteratively impose positive or negative clicks on the image to segment the target object, where is the number of the pixels; is the number of the interactive clicks; and is the label (i.e., foreground/background) of the click. By feeding the to-be-segmented image to a DNN , we can extract the feature representations as , where denotes the features of pixel . Given the features at clicked pixels and their labels , our goal is to predict the labels of the unclicked pixels with the features , where is the number of unclicked pixels. Next, we aim to solve the two core problems: ❶ How to finely model the relations between the deep features of different pixels and fully propagate the information contained in clicks for boosting the correct predictions at unclicked pixels? ❷ How to guide and promote accurate predictions at clicks?
Inspired by the appealing properties of Gaussian process (GP) models for our task, e.g., the capability of explicitly modeling the relations between data points and accurately interpolating the training data, we propose to rethink the IS task from a micro perspective and formulate it as a pixel-level binary classification task on each image, where the features of clicked pixels are regarded as training data with labels and those of unclicked pixels as testing data. Based on such understanding, we attempt to handle the pixel-wise binary classification task via GP models.
Specifically, we define a GP with a zero-mean prior and a covariance function over the classification function , which takes the feature of pixel as input and outputs the score for binary classification, i.e., positive score for foreground and negative score for background. Then the inference process from the available click information to the unknown labels at can be transformed into the following GP classification model which aims to solve the posterior distribution of the labels given , mathematically expressed as:
[TABLE]
where is the GP posterior. For the binary classification task, it is conventionally set that , where is the sigmoid function [33].
As seen, the integral in Eq. (3) is explicitly intractable for our task. Fortunately, if we can obtain the GP posterior, the integral can be approximated with a Monte Carlo method [16]. Specifically, suppose is sampled from the derived GP posterior, we can approximately get that the probability of classifying the testing data into the foreground is . Hence, the key is how to obtain the GP posterior . Besides, after obtaining the GP posterior, how to achieve efficient sampling from it is also worth exploring since high inference efficiency is critical for the IS task. Next, we will answer the two questions.
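As a rough illustration of the Monte Carlo step, the sketch below draws latent scores from a Gaussian posterior and averages their sigmoid values per pixel. The posterior mean and covariance here are made-up toy numbers, and the function name `mc_foreground_probability` is hypothetical. Setting `num_samples=1` corresponds to the single-sample approximation described in the text; averaging several samples is a common variant, not something the paper is stated to do.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def mc_foreground_probability(mean, cov, num_samples=256, seed=0):
    """Monte Carlo estimate of E[sigmoid(f)] for f ~ N(mean, cov), per pixel."""
    rng = np.random.default_rng(seed)
    samples = rng.multivariate_normal(mean, cov, size=num_samples)
    return sigmoid(samples).mean(axis=0)


# Toy posterior over three pixels: confident foreground, uncertain, background.
mean = np.array([3.0, 0.0, -3.0])
cov = 0.25 * np.eye(3)
prob = mc_foreground_probability(mean, cov)
```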
4.2 GP Posterior Approximation and Sampling
In this subsection, we aim to approximate the GP posterior and achieve efficient sampling.
GP posterior approximation. It is easily known that the GP posterior can be rewritten as:
[TABLE]
where follows a Gaussian distribution as defined in Eq. (1); ; and . For the classification task, due to the non-Gaussian likelihood , is non-Gaussian and leads to that the GP posterior in Eq. (4) is intractable. Against this issue, previous methods [34, 33, 16] have proposed to approximate with a Gaussian variational distribution by minimizing their KL divergence as:
[TABLE]
To solve Eq. (5), conventional variational inference-based methods [34, 33, 16] independently optimize the objective on each training task (i.e., each training image in our IS case). These methods are generally time-consuming and fail to exploit the shared information among different images. In contrast, we follow amortized variational inference [21] to efficiently infer from and the distribution parameters for can be flexibly learned based on all the training images (i.e., the whole benchmark dataset) in an end-to-end manner. Specifically, the variational distribution is set as:
[TABLE]
where the mean function is designed as:
[TABLE]
where MLP represents a multi-layer perceptron parameterized by , which transforms the features from to . The activation function is the smooth version of ReLU, whose output is consistently positive. By empirically setting a small variance as 0.01, for any , we have , which has the same positive/negative sign as and helps the correct category prediction at clicks.
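The sign argument above can be made concrete with a small sketch. It assumes labels encoded as +1 (foreground) and -1 (background) and a mean of the form label times softplus(MLP(feature)); this functional form, the ReLU hidden activation, and the random weights are assumptions made for illustration, since the exact expression is not reproduced here. The 96 hidden units follow the implementation details given later in the paper.

```python
import numpy as np


def softplus(z):
    """Smooth version of ReLU; positive for every moderate input."""
    return np.logaddexp(0.0, z)


def variational_mean(features, labels, w1, b1, w2, b2):
    """Hypothetical mean: click label times softplus(MLP(feature))."""
    hidden = np.maximum(features @ w1 + b1, 0.0)     # one hidden layer (ReLU assumed)
    magnitude = softplus(hidden @ w2 + b2).ravel()   # positive scalar per click
    return labels * magnitude


rng = np.random.default_rng(0)
features = rng.normal(size=(5, 8))                   # 5 clicks, 8-dim toy features
labels = np.array([1.0, -1.0, 1.0, 1.0, -1.0])       # +1 foreground, -1 background
w1, b1 = 0.1 * rng.normal(size=(8, 96)), np.zeros(96)
w2, b2 = 0.1 * rng.normal(size=(96, 1)), np.zeros(1)

mu = variational_mean(features, labels, w1, b1, w2, b2)
```

Because the softplus output is positive, every entry of `mu` carries the sign of its click label regardless of the MLP weights, which is the property the paragraph above appeals to.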
By substituting Eq. (6) and derived in Eq. (4), we can rewrite the KL divergence in Eq. (5) as (please refer to the Supplementary Material (SM) for detailed derivations):
[TABLE]
where we simplify as .
By optimizing Eq. (8) over all the training images in an end-to-end manner, we can obtain the variational distribution . Then by substituting it into Eq. (4), we can easily derive that the GP posterior is Gaussian and can be approximated as (see SM for derivations):
[TABLE]
where
[TABLE]
Decoupling GP posterior for efficient sampling. From the analysis of Eq. (3), by sampling from the tractable GP posterior in Eq. (9), we can obtain the classification probability for unclicked pixels as . To sample , the standard approach is to compute with [47]. As seen, the computation cost of is cubic w.r.t. the number of unclicked pixels , i.e., , which severely affects the efficiency. To address this issue, we propose to adopt the techniques [47, 48] which decouple the GP posterior into a weight-space prior term and a function-space update term, largely reducing the sampling cost without sacrificing interpolation accuracy at clicks. Then, for the GP posterior in Eq. (9), we can derive the sampling framework as [47, 48] (see SM for derivations):
[TABLE]
where ; ; is constructed by a set of Fourier bases and the -th basis is expressed as [37]:
[TABLE]
where ; ; ; is sampled from the spectral density of the kernel . We will carefully design the kernel function in Sec. 4.3.
In practice, considering and , the cost of sampling from Eq. (11) is reduced from to [47, 48]. Note that in our practical implementation, to keep consistency with most DL-based methods [41, 6, 7], we execute an inference on the entire image with pixels. That is to say, we also sample using Eq. (11) in parallel with , by replacing the subscripts (i.e., the number of unclicked pixels) with the total number of pixels . Then, we can obtain the entire prediction results of pixels, i.e., .
Remark 1: It is worth mentioning that the proposed sampling strategy in Eq. (11) possesses two inherent characteristics: ❶ The relations between the deep features of clicked points and those of the unclicked points are fully utilized and explicitly modeled by the function-space update term, which enables the information contained in clicked regions to propagate to other regions. ❷ For training stability, the matrix inversion in Eqs. (8) and (11) is practically computed by , where is empirically set to 0.01 during training. In Eq. (11), if we replace the number of the unclicked pixels (subscripts ) with the number of clicked pixels (subscripts ) and set a small enough , we can obtain that , showing that the sampled has the same positive/negative sign as the labels . This implies that the proposed sampling strategy can provide theoretical support for encouraging accurate predictions at clicked points. The two characteristics are validated by the model verification in Sec. 5.2.
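The decoupled sampling idea and the interpolation property in Remark 1 can be sketched as follows. This is a minimal illustration under stated assumptions: a fixed unit-lengthscale RBF kernel (whose spectral density is a standard normal) instead of the learned deep kernel, random toy features in place of backbone features, and hypothetical names `fourier_features`, `decoupled_sample`, and `delta` for the small constant added before the matrix inversion. The sample is the sum of a weight-space prior draw built from random Fourier bases and a function-space update that only requires solving a system whose size is the number of clicks.

```python
import numpy as np


def rbf_kernel(a, b):
    sq = np.sum(a**2, 1)[:, None] + np.sum(b**2, 1)[None, :] - 2.0 * a @ b.T
    return np.exp(-0.5 * np.maximum(sq, 0.0))


def fourier_features(x, theta, tau):
    """Random Fourier bases approximating the unit-lengthscale RBF kernel."""
    num_bases = theta.shape[0]
    return np.sqrt(2.0 / num_bases) * np.cos(x @ theta.T + tau)


def decoupled_sample(x_all, x_click, f_click, num_bases=256, delta=1e-6, seed=0):
    """Return (weight-space prior map, function-space update map) at x_all."""
    rng = np.random.default_rng(seed)
    dim = x_all.shape[1]
    theta = rng.normal(size=(num_bases, dim))            # RBF spectral density
    tau = rng.uniform(0.0, 2.0 * np.pi, size=num_bases)  # random phases
    w = rng.normal(size=num_bases)                       # weight-space prior draw

    prior_all = fourier_features(x_all, theta, tau) @ w
    prior_click = fourier_features(x_click, theta, tau) @ w

    # Only a (clicks x clicks) system is solved; cost in pixels stays linear.
    k_cc = rbf_kernel(x_click, x_click) + delta * np.eye(len(x_click))
    k_ac = rbf_kernel(x_all, x_click)
    update = k_ac @ np.linalg.solve(k_cc, f_click - prior_click)
    return prior_all, update


rng = np.random.default_rng(1)
x_all = rng.normal(size=(200, 4))        # stand-in for per-pixel features
click_idx = np.array([3, 50, 120])
labels = np.array([1.0, -1.0, 1.0])      # +1 foreground, -1 background
f_click = 2.0 * labels                   # stand-in for a draw from the variational distribution

prior_map, update_map = decoupled_sample(x_all, x_all[click_idx], f_click)
scores = prior_map + update_map
prob = 1.0 / (1.0 + np.exp(-scores))     # sigmoid of the summed maps
```

With a small `delta`, the scores at the clicked pixels essentially equal `f_click`, so their signs match the click labels; increasing `delta` loosens this interpolation, which mirrors the trade-off discussed in Remark 1 ❷.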
4.3 Double Space Deep Kernel Learning
From Eqs. (11) and (12), we can see that the kernel affects both the function-space update and weight-space prior terms. Designing a proper and flexible kernel is important for better modeling the relations between pixels and extracting the prior knowledge underlying the segmentation function.
In the decoupling paradigm [47, 48], the adopted kernel function is generally pre-defined and fixed, which would lead to two potential limitations: 1) In function space, the kernel representation capacity would be constrained and the similarity measure between data points may not be optimal for our task; 2) In weight space, the prior term is not flexible enough to capture the prior knowledge underlying the IS task.
To address these issues, instead of adopting the fixed manually-designed kernels, inspired by deep kernel learning (DKL) [46], we propose to flexibly learn the kernel function in both function space and weight space from the abundant training images in a data-driven manner.
Specifically, we propose to perform double space DKL on , where represents the concatenation of the deep features and the image RGB values at pixel . Here, the concatenation of input image is for providing more information as validated in Sec. 5.4. In function space, to improve representation flexibility, we select a modified radial basis function (RBF) with scaling hyperparameters as the kernel function: where and is the -th element of . In weight space, since the hyperparameters are updated during network training, it is not suitable to sample in Eq. (12) from the kernel’s spectral density, thus it is set as a learnable parameter. To further improve flexibility and representation capacity of the weight-space prior term for better extracting the image prior, we parameterize the prior distribution of as . These hyperparameters in the double space, including , and , are trained in an end-to-end manner based on the entire training dataset.
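To give a feel for a kernel with per-dimension scaling applied to concatenated features, the sketch below uses a generic automatic-relevance-style RBF. The exact modified RBF of the paper is not reproduced here, so the form in `scaled_rbf_kernel`, the toy feature sizes, and the all-ones `scales` vector are assumptions for illustration; in the described method such scales would be learned end-to-end.

```python
import numpy as np


def scaled_rbf_kernel(a, b, scales):
    """RBF kernel with one scale per input dimension (assumed form)."""
    a_s = a * scales
    b_s = b * scales
    sq = np.sum(a_s**2, 1)[:, None] + np.sum(b_s**2, 1)[None, :] - 2.0 * a_s @ b_s.T
    return np.exp(-0.5 * np.maximum(sq, 0.0))


rng = np.random.default_rng(0)
deep_features = rng.normal(size=(6, 5))    # stand-in for backbone features
rgb = rng.uniform(size=(6, 3))             # RGB values scaled to [0, 1]
x_tilde = np.concatenate([deep_features, rgb], axis=1)

scales = np.ones(x_tilde.shape[1])         # placeholder for learned scales
k = scaled_rbf_kernel(x_tilde, x_tilde, scales)

# Zeroing the RGB scales makes the kernel ignore the concatenated image values.
scales_no_rgb = np.concatenate([np.ones(5), np.zeros(3)])
k_no_rgb = scaled_rbf_kernel(x_tilde, x_tilde, scales_no_rgb)
```

A scale of zero removes a dimension from the similarity measure, so learned scales let the model decide how much each feature channel and each color channel contributes to pixel similarity.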
Compared to the pre-fixed kernel design manner, the proposed double space DKL strategy is more flexible and it can utilize the powerful representation ability of DNNs to promote the performance, which is validated in Sec. 5.4.
4.4 The Proposed GPCIS Framework
Based on the derived GP posterior sampling procedure as well as the double space DKL mechanism, we can correspondingly construct the entire framework, called Gaussian Process Classification-based Interactive Segmentation (GPCIS). As presented in Fig. 2, similar to [41, 6, 7], we firstly input the image and the click maps together with the previous mask to a general backbone segmentor for extracting the deep features . By feeding the concatenation of and the image , i.e., , to the efficient GP posterior sampling framework in Eq. (11), we can generate a weight-space prior map and a function-space update map. Finally, we can obtain the segmentation result by adding the two maps followed by a sigmoid function.
Remark 2: As seen, in our proposed GPCIS, the correlation modeling on the deep features of different pixels corresponds explicitly to the derived GP posterior sampling strategy. Compared to the current methods [18, 40, 6, 25, 7], which are implicitly built on off-the-shelf network modules, our method has a clearer working mechanism.
Network training. For the proposed GPCIS framework, the involved parameters are automatically learned from the training data in an end-to-end manner, including for the backbone segmentor, for variational distribution , for function-space DKL, and for weight-space DKL. The training loss is set as (the entire algorithm flowchart is provided in SM):
[TABLE]
where is the output segmentation result; is the ground truth mask; is the weighting parameter which is empirically set to ; is the normalized focal loss [39] which is widely adopted by the existing IS methods [41, 6, 7]; is the optimization objective in Eq. (8).
5 Experiments
5.1 Experimental Settings
Datasets. We conduct IS experiments on four widely-adopted datasets: 1) GrabCut [38] contains 50 images with single object masks; 2) Berkeley [32] contains 96 images with 100 object masks; 3) SBD [13] contains 20,172 masks for 8,498 images as the training set, 6,671 instance-level masks for 2,857 images as the validation set. The annotated masks are polygonal; 4) DAVIS [36] contains 345 frames randomly sampled from 50 videos, with high-quality masks. We adopt the training split of SBD as the training set and take the other mentioned datasets for evaluation.
Evaluation metrics. Following [51, 23, 41, 25, 7], we adopt the same strategy to simulate the clicks, which generates the next click at the center of the largest error region by comparing the prediction and ground truth. The Number of Clicks (NoC) is adopted as the metric, which counts the average number of clicks needed to achieve the target Intersection over Union (IoU). Following [51, 23, 41, 25, 7], we set the IoU threshold to 85% and 90%. The evaluation metrics are denoted as NoC@85 and NoC@90, respectively. The default maximum number of clicks is 20. The Number of Failures (NoF) is also reported; it counts the number of images that cannot achieve the target IoU within the specified maximum number of clicks. Besides, we also report the average IoU at the N-th click, denoted IoU&N. To evaluate the correctness of predictions at clicks, we propose a new metric, NoIC, which counts the Number of Incorrectly classified Clicks over a testing dataset. Lower NoC, NoF, and NoIC, as well as higher IoU&N, indicate better performance.
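The metrics above can be sketched in plain Python from per-image IoU curves. The helper names and the toy IoU values are hypothetical, and one convention is assumed rather than stated in the text: an image that never reaches the target IoU contributes the maximum number of clicks to NoC.

```python
def clicks_to_reach(iou_per_click, target_iou, max_clicks=20):
    """Clicks needed to first reach target_iou, capped at max_clicks."""
    for num_clicks, iou in enumerate(iou_per_click[:max_clicks], start=1):
        if iou >= target_iou:
            return num_clicks
    return max_clicks  # assumed convention for failure cases


def noc_and_nof(iou_curves, target_iou, max_clicks=20):
    """Average number of clicks (NoC) and number of failures (NoF)."""
    counts = [clicks_to_reach(c, target_iou, max_clicks) for c in iou_curves]
    failures = sum(
        1 for c in iou_curves if max(c[:max_clicks], default=0.0) < target_iou
    )
    return sum(counts) / len(counts), failures


def iou_at_click(iou_curves, n):
    """Average IoU after the n-th click (IoU&N); n is 1-based."""
    return sum(c[n - 1] for c in iou_curves) / len(iou_curves)


def count_incorrect_clicks(predicted_labels, click_labels):
    """NoIC for one image: clicks whose predicted label differs from the click."""
    return sum(p != y for p, y in zip(predicted_labels, click_labels))


# Hypothetical per-image IoU after each successive click.
iou_curves = [
    [0.70, 0.88, 0.93],
    [0.91],
    [0.40, 0.55, 0.60],  # never reaches 0.85 within max_clicks=3
]

noc85, nof85 = noc_and_nof(iou_curves, target_iou=0.85, max_clicks=3)
iou_1 = iou_at_click(iou_curves, 1)
noic = count_incorrect_clicks([1, 0, 1], [1, 1, 1])
```

In this toy run the three images need 2, 1, and (capped) 3 clicks, so NoC@85 is 2.0 with one failure, and one of the three clicks is misclassified.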
Implementation details. We implement the proposed framework with PyTorch [35] based on 4 NVIDIA V100 GPUs. For the backbone segmentor, we adopt three different networks, including SegFormerB0-S2 [50, 7], HRNet18s-S2 [43, 7], and DeepLabv3+ [5] with ResNet50 [15], to substantiate the generality of our method. The initial learning rate is for SegFormerB0-S2 and ResNet50, and for HRNet18s-S2. It is divided by 10 at [190, 220] epochs and the total number of epochs is 230, as in [7]. We adopt the Adam optimizer [20] with the total batch size of 64 and the training patch size of . For inferring in Eq. (6), we adopt a one-hidden-layer MLP with 96 hidden units. More details are provided in SM.
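The step schedule described above (divide by 10 at epochs 190 and 220, 230 epochs in total) can be written as a small pure-Python function. The function name `learning_rate` is hypothetical and `base_lr = 1.0` is only a placeholder, not one of the initial rates used in the experiments; it is also an assumption that the drop takes effect from the milestone epoch itself.

```python
def learning_rate(base_lr, epoch, milestones=(190, 220), factor=0.1):
    """Step schedule: multiply base_lr by factor for each milestone reached."""
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * factor**drops


TOTAL_EPOCHS = 230
base_lr = 1.0  # placeholder value, not the initial rate from the experiments
schedule = [learning_rate(base_lr, e) for e in range(TOTAL_EPOCHS)]
```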
5.2 Model Verification
Decoupled GP posterior. We first conduct a model verification experiment to present the working mechanism underlying the decoupled GP posterior sampling framework in Eq. (11). From Fig. 3, we can clearly observe that the probability maps output by the weight-space prior term provide rough segmentation results of the target objects. This is mainly attributed to the proposed weight-space DKL strategy, which can flexibly learn the prior knowledge for the IS task from the training dataset. Besides, as presented, the function-space update term complements the prior term by utilizing the relations between pixels and assigning a larger probability to unclicked pixels that are semantically similar to the clicks. It thereby helps achieve better predictions at unclicked points by propagating the information of the clicks, such as in the regions far from the click on the tiger and the long stick. Owing to the mutual promotion of the weight-space prior and the function-space update, our method obtains accurate segmentation results, approaching the ground truth (GT) masks. The results comply with the analysis in Remark 1 ❶ and validate the rationality of our proposed method.
Accuracy at clicked points. Based on the backbone ResNet50 and the DAVIS dataset, we utilize the NoIC metric to evaluate the prediction accuracy at clicks of our proposed GPCIS under different test-time values of the stabilizing constant used in the matrix inversion (see Remark 1 ❷). From Table 1, where this constant is set to 0.01 during training, we observe that as its test-time value gets smaller, NoIC decreases almost monotonically, which supports the claim in Remark 1 ❷ that our proposed GPCIS can achieve accurate predictions at clicks when the constant is small enough. Hence, in the following experiments, we adopt the larger value (0.01) during training for stability and a smaller value during testing for more accurate predictions at clicks.
5.3 Performance Evaluation
In this section, based on the four datasets, i.e., GrabCut, Berkeley, SBD, and DAVIS, we comprehensively validate the effectiveness of our proposed method by comparing it with a series of IS methods [24, 23, 51, 31, 18, 22, 7, 26, 40, 6, 25]. For fair comparisons with the current state-of-the-art (SOTA) methods [7, 26, 40, 6, 25], we separately implement our proposed GPCIS with the backbone segmentor SegFormerB0-S2 and HRNet18s-S2 adopted by [7], and with ResNet50 widely adopted by [26, 40, 6, 25]. Note that our proposed method is orthogonal to most of the competitors and yet we do not adopt their exclusive designs, such as cropping click-centered patches with adaptive scopes in FocusCut [25], and local refinement and progressive merge in FocalClick [7]. RITM [41] is also reimplemented as our baseline under the same experimental settings. More experimental results are provided in SM.
Quantitative evaluation. Table 2 lists the NoC@85 and NoC@90 of all compared methods on the four datasets. The proposed GPCIS consistently achieves the lowest average NoC@85 and NoC@90 under three different backbone segmentors, which substantiates its effectiveness and generality. Note that although our method does not introduce the extra processing strategies contained in the SOTA method FocusCut, e.g., cropping click-centered patches with adaptive scopes, it still obtains superior (Berkeley and DAVIS) or at least comparable (GrabCut and SBD) performance relative to FocusCut.
For comprehensive comparisons, we provide more quantitative results on different metrics, as well as the number of network parameters and the inference efficiency. As listed in Table 3, the proposed GPCIS consistently outperforms the other competing methods on NoC100@90, NoF100@90, IoU&1, IoU&5, and the model size, where NoC100@90 and NoF100@90 represent the numbers of clicks and failures, respectively, to reach 90% IoU within 100 clicks. For NoIC and SPC, it remains competitive and is comparable to the top-ranked method. From Table 2 and Table 3, we can conclude that, compared to the other methods, GPCIS generalizes better and attains higher segmentation accuracy with fewer clicks and fewer failure cases. This indicates that our method has good potential for practical IS. Note that, compared to the baseline RITM, our inference is slightly slower due to the proposed GP posterior inference procedure. However, this cost is acceptable, or even negligible, considering the performance gains brought by our method.
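The click-count metrics above can be sketched as follows. This is an illustrative computation, not the evaluation code used for Table 3: it assumes that each image comes with a sequence of IoU values recorded after every click, and that an image which never reaches the target is counted as a failure and charged the full click budget. The function name and argument names are hypothetical.

```python
def noc_and_nof(iou_per_click, threshold=0.90, max_clicks=100):
    """Return (mean number of clicks, number of failures) over a dataset.

    iou_per_click: list of per-image sequences; entry k is the IoU after k + 1 clicks.
    Images that never reach `threshold` within `max_clicks` count as failures and
    contribute `max_clicks` to the mean (assumed convention).
    """
    clicks_needed = []
    failures = 0
    for ious in iou_per_click:
        # First click (1-based) at which the IoU target is met, if any.
        hit = next((k + 1 for k, iou in enumerate(ious[:max_clicks])
                    if iou >= threshold), None)
        if hit is None:
            failures += 1
            hit = max_clicks
        clicks_needed.append(hit)
    return sum(clicks_needed) / len(clicks_needed), failures


# Three toy images with a budget of 3 clicks: targets met at click 2, click 3, and never.
toy = [[0.50, 0.92], [0.30, 0.60, 0.95], [0.10, 0.20]]
mean_clicks, n_failures = noc_and_nof(toy, threshold=0.90, max_clicks=3)
```

Under these assumptions, NoC@85 and NoC@90 differ only in `threshold`, and NoC100@90 corresponds to `max_clicks=100`.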
Qualitative evaluation. Fig. 4 compares the output probability maps of different methods. For RITM and FocalClick, the regions far from the click are not fully activated and receive low prediction probability. Although FocusCut confidently segments the main part of the object, it mistakenly leaves out the upper part, which receives low prediction probability. In comparison, our proposed GPCIS achieves better segmentation results and approaches the GT mask, which is mainly attributed to the explicit modeling of the semantic relations between pixels. To further substantiate the effectiveness of our proposed inference process, we also provide more visual comparisons with the baseline RITM. From the first row in Fig. 5, we can observe that, without fully utilizing the information contained in the clicks, RITM fails to finely segment the whole target object. In contrast, our method nearly achieves an accurate segmentation of the three target instances, i.e., two persons and a drum, within three clicks. The second row shows that from the 3rd to the 16th click, RITM repeatedly clicks at the same location because it cannot provide correct predictions at the clicks. With its theoretical support, GPCIS alleviates this issue and obtains a 97% IoU within 8 clicks.
5.4 Ablation Studies
Based on the backbone segmentor ResNet50, we execute an ablation study to quantitatively evaluate the effect of the modules involved in our method on the average NoC@85/90 over GrabCut, Berkeley, SBD and DAVIS. Table 4 reports the results under different settings where variant (e) is the final strategy we adopt in comparison experiments above. By comparing (a) and (e), we can easily find that the proper guidance of is indeed helpful for network learning. In (b), we discard the deep kernel learning mechanism in function space and fix the kernel hyperparameters as and (). Similarly, in (c), we discard the deep kernel learning mechanism in weight space and set , , , and . During network training, they are not updated. As expected, without the DKL design in double space, the network flexibility is weakened, leading to degraded performance. Besides, by comparing (d) and (e), it shows that the concatenation of input image with deep features shown in Fig. 2 can further boost the information propagation across pixels and bring better segmentation performance.
6 Conclusion
In this paper, we have taken a new perspective on the interactive segmentation task and regarded it as a pixel-wise binary classification problem on each input image. Based on this understanding, we have formulated the task as a Gaussian process classification model. To solve this model, we have proposed to variationally approximate the GP posterior in a data-driven manner, along with a decoupled sampling strategy with linear complexity. Correspondingly, we have constructed an efficient and flexible GP classification framework integrated with double space deep kernel learning, called GPCIS, which has a clear working mechanism. On several benchmark datasets and with different backbone segmentors, we have conducted comprehensive experiments as well as model verification, which substantiate the superiority of GPCIS and its theoretical support for correct predictions at clicks. With its high efficiency and good generality, the proposed GPCIS has the potential to advance the interactive segmentation field.
References
- [1] David Acuna, Huan Ling, Amlan Kar, and Sanja Fidler. Efficient interactive annotation of segmentation datasets with Polygon-RNN++. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 859–868, 2018.
- [2] Junjie Bai and Xiaodong Wu. Error-tolerant scribbles based interactive image segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 392–399, 2014.
- [3] Lluis Castrejon, Kaustav Kundu, Raquel Urtasun, and Sanja Fidler. Annotating object instances with a Polygon-RNN. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5230–5238, 2017.
- [4] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4):834–848, 2017.
- [5] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision, pages 801–818, 2018.
- [6] Xi Chen, Zhiyan Zhao, Feiwu Yu, Yilei Zhang, and Manni Duan. Conditional diffusion for interactive segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7345–7354, 2021.
- [7] Xi Chen, Zhiyan Zhao, Yilei Zhang, Manni Duan, Donglian Qi, and Hengshuang Zhao. FocalClick: Towards practical interactive image segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1300–1309, 2022.
- [8] François Chollet. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1251–1258, 2017.
