The Mutex Watershed and its Objective: Efficient, Parameter-Free Graph Partitioning

Steffen Wolf; Alberto Bailoni; Constantin Pape; Nasim Rahaman; Anna Kreshuk; Ullrich Köthe; Fred A. Hamprecht

arXiv:1904.12654·cs.CV·April 20, 2021

TL;DR

The paper introduces the Mutex Watershed, an efficient, parameter-free graph partitioning algorithm capable of handling both attractive and repulsive cues, achieving state-of-the-art results in image segmentation benchmarks.

Contribution

It presents a simple, deterministic algorithm that globally optimizes a multicut-related objective, accommodating complex cues without seeds or thresholds.

Findings

01

Achieves the best results on ISBI 2012 EM segmentation benchmark.

02

Solves a multicut-related objective to global optimality.

03

Empirically linearithmic complexity.

Tables

Table 1. (a) Top five entries at time of submission. Our Mutex Watershed (MWS) is state-of-the-art without relying on the complex lifted multicut postprocessing used by most other top entries.

Method	Rand-Score	VI-Score
UNet + MWS	0.98792	0.99183
ResNet + LMC [76]	0.98788	0.99072
SCN + LMC [77]	0.98680	0.99144
M2FCN-MFA [78]	0.98383	0.98981
FusionNet + LMC [44]	0.98365	0.99130

Table 2. (b) Comparison to other segmentation strategies, all of which are based on our CNN. Runtimes were measured on a single thread of an Intel Xeon CPU E5-2650 v3 @ 2.30GHz.

Method	Rand-Score	VI-Score	Time [s]
MWS	0.98792	0.99183	43.3
MC-FULL	0.98029	0.99044	9415.8
LMC	0.97990	0.99007	966.0
THRESH	0.91435	0.96961	0.2
WSDT	0.88336	0.96312	4.4
MC-LOCAL	0.70990	0.86874	1410.7
WS	0.63958	0.89237	4.9

Equations

$$\forall i,j\in V:\qquad \Pi_{i\to j},\qquad \textrm{connected}(i,j;A^{+}),\qquad \textrm{cluster}(i;A^{+})$$

$$\forall i,j\in V:\qquad \textrm{mutex}(i,j;A^{+},A^{-})$$

$$M[C_{i}]=\{(u,v)\in A^{-}\mid u\in C_{i}\lor v\in C_{i}\}$$

$$\mathcal{O}(\max(E\log E,\,EM)).$$

$$\textrm{Empirical Mutex Watershed Complexity: }\mathcal{O}(E\log E)$$

$$\textrm{connected}(i,j;A^{+})\;\Rightarrow\;\textrm{not}\;\textrm{mutex}(i,j;A^{+},A^{-}).$$

$$x^{A}:=\mathbbm{1}\{e\notin A\}\in\{0,1\}^{|E|}.$$

$$\forall C\in\mathcal{C}^{-}(\mathcal{G},w):\ \sum_{e\in E_{C}}x_{e}^{A}\geq 1\quad\Leftrightarrow\quad\mathcal{C}^{-}(A,\mathcal{G},w)=\emptyset.$$

$$\forall C\in\mathcal{C}^{-}(\mathcal{G},w):\ \sum_{e\in E_{C}}x_{e}\geq 1.$$

$$\min_{x\in\mathsf{SC}(\mathcal{G},w)}\ \sum_{e\in E}|w_{e}|\,x_{e}.$$

$$|w_{e}|^{p}>\sum_{t\in E,\ w_{t}<w_{e}}|w_{t}|^{p}\qquad\forall e\in E,$$

$$\min_{x\in\mathsf{SC}(\mathcal{G},w)}\ \sum_{e\in E}|w_{e}|^{p}\,x_{e}$$

$$x^{\mathbf{MWS}}:=\mathbbm{1}\Big\{e\notin\mathbf{MWS}\Big(\mathcal{G},w,\textrm{connect\_all=True}\Big)\Big\}$$

$$\textrm{not}\;\textrm{mutex}(i,j,A^{+},A^{-})\ \Leftrightarrow$$

$$\textrm{not}\;\textrm{connected}(s,t,A^{+})\ \Leftrightarrow$$

$$\mathcal{C}^{-}\Big(\mathbf{MWS}\big(\mathcal{G},w,\textrm{connect\_all=True}\big)\Big)=\emptyset.$$

$$x^{\mathbf{MWS}}\in\mathsf{SC}(\mathcal{G},w).$$

$$\underset{A\subseteq E}{\textrm{argmin}}\ -\sum_{e\in A}|w_{e}|^{p}\qquad\textrm{s.t.}\quad\mathcal{C}^{-}(A,\mathcal{G},w)=\emptyset.$$

$$S(\mathcal{G},\tilde{A}):=\underset{A\subseteq(E\setminus\tilde{A})}{\textrm{argmin}}\ T(A)\quad\textrm{with}\quad T(A):=-\sum_{e\in A}|w_{e}|^{p},\qquad\textrm{s.t.}\quad\mathcal{C}^{-}(A\cup\tilde{A},\mathcal{G},w)=\emptyset,$$

$$\mathcal{C}^{-}(\tilde{A},\mathcal{G},w)=\emptyset.$$

$$\exists\,\tilde{e}\in E\setminus\tilde{A}\quad\textrm{s.t.}\quad\mathcal{C}^{-}(\tilde{A}\cup\{\tilde{e}\},\mathcal{G},w)=\emptyset$$

$$g:=\underset{e\in(E\setminus\tilde{A})}{\textrm{argmax}}\ |w(e)|\qquad\textrm{s.t.}\quad\mathcal{C}^{-}(\tilde{A}\cup\{e\},\mathcal{G},w)=\emptyset.$$

$$g\in S(\mathcal{G},\tilde{A}).$$

$$|w(e)|<|w(g)|\qquad\forall e\in S(\mathcal{G},\tilde{A}).$$

$$T(A^{\prime})=-|w_{g}|^{p}<-\sum_{t\in S(\mathcal{G},\tilde{A})}|w_{t}|^{p}=T\Big(S(\mathcal{G},\tilde{A})\Big)$$

$$S(\mathcal{G},\tilde{A})=\{g\}\cup S(\mathcal{G},\tilde{A}\cup\{g\}).$$

$$S(\mathcal{G},\tilde{A})=\underset{A\subseteq(E\setminus\tilde{A})}{\textrm{argmin}}\ T(A)\qquad\textrm{s.t.}\quad\mathcal{C}^{-}(A\cup\tilde{A},\mathcal{G},w)=\emptyset;\ g\in A$$

$$S(\mathcal{G},\tilde{A})=\{g\}\ \cup\underset{A\subseteq E\setminus(\tilde{A}\cup\{g\})}{\textrm{argmin}}\ T(A)\qquad\textrm{s.t.}\quad\mathcal{C}^{-}\Big(A\cup\{g\}\cup\tilde{A},\mathcal{G},w\Big)=\emptyset$$

$$1\geq\lambda_{0}>\lambda_{1}>\dots>\lambda_{t-1}>0$$

Full text

The Mutex Watershed and its Objective:

Efficient, Parameter-Free Signed Graph Partitioning

Steffen Wolf∗, Alberto Bailoni∗, Constantin Pape, Nasim Rahaman, Anna Kreshuk, Ullrich Köthe, and Fred A. Hamprecht†

∗ Authors contributed equally. † Corresponding author.

All authors are with HCI/IWR, Heidelberg University, Germany. E-mail: <firstname>.<lastname>@iwr.uni-heidelberg.de

A. Kreshuk and C. Pape are with EMBL, Heidelberg, Germany.

Abstract

Image partitioning, or segmentation without semantics, is the task of decomposing an image into distinct segments, or equivalently to detect closed contours. Most prior work either requires seeds, one per segment; or a threshold; or formulates the task as multicut / correlation clustering, an NP-hard problem. Here, we propose an efficient algorithm for graph partitioning, the “Mutex Watershed”. Unlike seeded watershed, the algorithm can accommodate not only attractive but also repulsive cues, allowing it to find a previously unspecified number of segments without the need for explicit seeds or a tunable threshold. We also prove that this simple algorithm solves to global optimality an objective function that is intimately related to the multicut / correlation clustering integer linear programming formulation. The algorithm is deterministic, very simple to implement, and has empirically linearithmic complexity. When presented with short-range attractive and long-range repulsive cues from a deep neural network, the Mutex Watershed gives the best results currently known for the competitive ISBI 2012 EM segmentation benchmark.

Index Terms:

Image segmentation, partitioning algorithms, greedy algorithms, optimization, integer linear programming, machine learning, convolutional neural networks.

1 Introduction

Most image partitioning algorithms are defined over a graph encoding purely attractive interactions. No matter whether a segmentation or clustering is then found agglomeratively (as in single linkage clustering / watershed) or divisively (as in spectral clustering or iterated normalized cuts), the user either needs to specify the desired number of segments or a termination criterion. An even stronger form of supervision is in terms of seeds, where one pixel of each segment needs to be designated either by a user or automatically. Unfortunately, clustering with automated seed selection remains a fragile and error-fraught process, because every missed or hallucinated seed causes an under- or oversegmentation error. Although the learning of good edge detectors boosts the quality of classical seed selection strategies (such as finding local minima of the boundary map, or thresholding boundary maps), non-local effects of seed placement along with strong variability in region sizes and shapes make it hard for any learned predictor to place exactly one seed in every true region.

In contrast to the above class of algorithms, multicut / correlation clustering partitions vertices with both attractive and repulsive interactions encoded into the edges of a graph. Multicut has the great advantage that a “natural” partitioning of a graph can be found, without needing to specify a desired number of clusters, or a termination criterion, or one seed per region. Its great drawback is that its optimization is NP-hard.

The main insight of this paper is that when both attractive and repulsive interactions between pixels are available, then a generalization of the watershed algorithm can be devised that segments an image without the need for seeds or stopping criteria or thresholds. It examines all graph edges, attractive and repulsive, sorted by their weight and adds these to an active set iff they are not in conflict with previous, higher-priority, decisions. The attractive subset of the resulting active set is a forest, with one tree representing each segment. However, the active set can have loops involving more than one repulsive edge. See Fig. 1 for a visual abstract.

In summary, our principal contributions are, first, a fast deterministic algorithm for graph partitioning with both positive and negative edge weights that does not need prior specification of the number of clusters (section 3); and second, its theoretical characterization, including a proof that it globally optimizes an objective related to the multicut / correlation clustering objective (section 4).

Combined with a deep net, the algorithm also happens to define the state-of-the-art in a competitive neuron segmentation challenge (section 5).

This is an extended version of [1], with the second principal contribution (section 4) being new.

2 Related Work

In the original watershed algorithm [2, 3], seeds were automatically placed at all local minima of the boundary map. Unfortunately, this leads to severe over-segmentation. Defining better seeds has been a recurring theme of watershed research ever since. The simplest solution is offered by the seeded watershed algorithm [4]: It relies on an oracle (an external algorithm or a human) to provide seeds and assigns each pixel to its nearest seed in terms of minimax path distance.

In the absence of an oracle, many automatic methods for seed selection have been proposed in the last decades with applications in the fields of medicine and biology. Many of these approaches rely on edge feature extraction and edge detection like gradient calculation [5, 6]. Other types of methods generate seeds by first performing feature extraction [7, 8], whereas others first extract regions of interest and then place seeds inside these regions by using thresholding [9], binarization [10], $k$-means [11] or other strategies [12, 13].

In applications where the number of regions is hard to estimate, simple automatic seed selection methods, e.g. defining seeds by connected regions of low boundary probability, don’t work: The segmentation quality is usually insufficient because multiple seeds are in the same region and/or seeds leak through the boundary. Thus, in these cases seed selection may be biased towards over-segmentation (with seeding at all minima being the extreme case). The watershed algorithm then produces superpixels that are merged into final regions by more or less elaborate postprocessing. This works better than using watersheds alone because it exploits the larger context afforded by superpixel adjacency graphs. Many criteria have been proposed to identify the regions to be preserved during merging, e.g. region dynamics [14], the waterfall transform [15], extinction values [16], region saliency [17], and $(\alpha,\omega)$-connected components [18]. A merging process controlled by criteria like these can be iterated to produce a hierarchy of segmentations where important regions survive to the next level. Variants of such hierarchical watersheds are reviewed and evaluated in [19].

These results highlight the close connection of watersheds to hierarchical clustering and minimum spanning trees/forests [20, 21], which inspired novel merging strategies and termination criteria. For example, [22] simply terminated hierarchical merging by fixing the number of surviving regions beforehand. [23] incorporate predefined sets of generalized merge constraints into the clustering algorithm. Graph-based segmentation according to [24] defines a measure of quality for the current regions and stops when the merge costs would exceed this measure. Ultrametric contour maps [25] combine the gPb (global probability of boundary) edge detector with an oriented watershed transform. Superpixels are agglomerated until the ultrametric distance between the resulting regions exceeds a learned threshold. An optimization perspective is taken in [26, 27], which introduces $h$-increasing energy functions and builds the hierarchy incrementally such that merge decisions greedily minimize the energy. The authors prove that the optimal cut corresponds to a different unique segmentation for every value of a free regularization parameter.

An important line of research is given by partitioning of graphs with both attractive and repulsive edges [28]. Solutions that optimally balance attraction and repulsion do not require external stopping criteria such as a predefined number of regions or seeds. This generalization leads to the NP-hard problem of correlation clustering or (synonymous) multicut (MC) partitioning. Fortunately, modern integer linear programming solvers in combination with incremental constraint generation can solve problem instances of considerable size [29], and good approximations exist for even larger problems [30, 31]. Reminiscent of strict minimizers [32] with minimal $L_{\infty}$-norm solution, our work solves the multicut objective optimally when all graph weights are raised to a large power.

Related to the proposed method, the greedy additive edge contraction (GAEC) [33] heuristic for the multicut also sequentially merges regions, but we handle attractive and repulsive interactions separately and define edge strength between clusters by a maximum instead of an additive rule. The greedy fixation algorithm introduced in [34] is closely related to the proposed method; it sorts attractive and repulsive edges by their absolute weight, merges nodes connected by attractive edges and introduces no-merge constraints for repulsive edges. However, similar to GAEC, it defines edge strength by an additive rule, which increases the algorithm’s runtime complexity compared to the presented Mutex Watershed. Also, it is not yet known what objective the algorithm optimizes globally, if any.

Another beneficial extension is the introduction of additional long-range edges. The strength of such edges can often be estimated with greater certainty than is achievable for the local edges used by watersheds on standard 4- or 8-connected pixel graphs. Such repulsive long-range edges have been used in [35] to represent object diameter constraints, which is still an MC-type problem. When long-range edges are also allowed to be attractive, the problem turns into the more complicated lifted multicut (LMC) [36]. Realistic problem sizes can only be solved approximately [33, 37], but watershed superpixels followed by LMC postprocessing achieve state-of-the-art results on important benchmarks [38]. Long-range edges are also used in [39], as side losses for the boundary detection convolutional neural network (CNN); but they are not used explicitly in any downstream inference.

In general, striking progress in watershed-based segmentation has been achieved by learning boundary maps with CNNs. This is nicely illustrated by the evolution of neurosegmentation for connectomics, an important field we also address in the experimental section. CNNs were introduced to this application in [40] and became, in much refined form [41], the winning entry of the ISBI 2012 Neuro-Segmentation Challenge [42]. Boundary maps and superpixels were further improved by progress in CNN architectures and data augmentation methods, using U-Nets [43], FusionNets [44] or inception modules [38]. Subsequent postprocessing with the GALA algorithm [45, 46], conditional random fields [47] or the lifted multicut [38] pushed the envelope of final segmentation quality. MaskExtend [48] applied CNNs to both boundary map prediction and superpixel merging, while flood-filling networks [49] eliminated superpixels altogether by training a recurrent neural network to perform region growing one region at a time.

Most networks mentioned so far learn boundary maps on pixels, but learning works equally well for edge-based watersheds, as was demonstrated in [50, 51] using edge weights generated with a CNN [52, 53]. Tailoring the learning objective to the needs of the watershed algorithm by penalizing critical edges along minimax paths [53] or end-to-end training of edge weights and region growing [54] improved results yet again.

Outside of connectomics, [55] obtained superior boundary maps from CNNs by learning not just boundary strength, but also its gradient direction. Holistically-nested edge detection [56, 57] couples the CNN loss at multiple resolutions using deep supervision and is successfully used as a basis for watershed segmentation of medical images in [58].

We adopt important ideas from this prior work (hierarchical single-linkage clustering, attractive and repulsive interactions, long-range edges, and CNN-based learning). The proposed efficient segmentation framework can be interpreted as a generalization of [23], because we also allow for soft repulsive interactions (which can be overridden by strong attractive edges), and constraints are generated on-the-fly.

3 The Mutex Watershed Algorithm as an Extension of Seeded Watershed

In this section we introduce the Mutex Watershed Algorithm, an efficient graph clustering algorithm that can ingest both attractive and repulsive cues. We first reformulate seeded watershed as a graph partitioning with infinitely repulsive edges and then derive the generalized algorithm for finitely repulsive edges, which obviates the need for seeds.

3.1 Definitions and notation

Let $\mathcal{G}=(V,E,w)$ be a weighted graph. The scalar attribute $w:E\rightarrow\mathbb{R}$ associated with each edge is a merge affinity: the higher this number, the higher the inclination of the two incident vertices to be assigned to the same cluster. Conversely, large negative affinity indicates a greater desire of the incident vertices to be in different clusters. In our application, each vertex corresponds to one pixel in the image to be segmented. We call an edge $e\in E$ repulsive if $w_{e}<0$ and we call it attractive if $w_{e}>0$ and collect them in $E^{-}=\{e\in E\,|\,w_{e}<0\}$ and $E^{+}=\{e\in E\;|\;w_{e}>0\}$ respectively.

The Mutex Watershed algorithm, defined in subsection 3.3, maintains disjoint active sets $A^{+}\subseteq E^{+}$, $A^{-}\subseteq E^{-}$, $A^{+}\cap A^{-}=\emptyset$ that encode merges and mutual exclusion constraints, respectively. Clusters are defined via the “connected” predicate:

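$$\begin{aligned}\forall i,j\in V:\quad \Pi_{i\to j}&=\{\textrm{paths }\pi\textrm{ from }i\textrm{ to }j\textrm{ with }\pi\subseteq E^{+}\},\\ \textrm{connected}(i,j;A^{+})&\;\Leftrightarrow\;\exists\,\pi\in\Pi_{i\to j}\textrm{ with }\pi\subseteq A^{+},\\ \textrm{cluster}(i;A^{+})&=\{i\}\cup\{j:\textrm{connected}(i,j;A^{+})\}.\end{aligned}$$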

Conversely, the active subset $A^{-}\subseteq E^{-}$ of repulsive edges defines mutual exclusion relations by using the following predicate:

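$$\forall i,j\in V:\quad\textrm{mutex}(i,j;A^{+},A^{-})\;\Leftrightarrow\;\exists\,e=(k,l)\in A^{-}\ \textrm{with}\ k\in\textrm{cluster}(i;A^{+}),\ l\in\textrm{cluster}(j;A^{+}),\ \textrm{and}\ \textrm{cluster}(i;A^{+})\neq\textrm{cluster}(j;A^{+}).$$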

Admissible active edge sets $A^{+}$ and $A^{-}$ must be chosen such that the resulting clustering is consistent, i.e. nodes engaged in a mutual exclusion constraint cannot be in the same cluster: $\textrm{mutex}(i,j;A^{+},A^{-})\Rightarrow\textrm{not}\;\textrm{connected}(i,j;A^{+})$. The “connected” and “mutex” predicates can be efficiently evaluated using a union-find data structure.
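As an illustration of how the “connected” predicate might be evaluated, the following is a minimal union-find sketch. The class name `UnionFind`, the helper `connected`, and the integer vertex labels are assumptions made for this example and are not taken from the paper.

```python
class UnionFind:
    """Disjoint-set forest over vertices 0..n-1 (illustrative helper, not from the paper)."""

    def __init__(self, n):
        self.parent = list(range(n))
        self.rank = [0] * n

    def find(self, i):
        # Follow parent pointers to the root, shortening the path on the way.
        while self.parent[i] != i:
            self.parent[i] = self.parent[self.parent[i]]
            i = self.parent[i]
        return i

    def union(self, i, j):
        # Activating an attractive edge (i, j) merges the two clusters.
        ri, rj = self.find(i), self.find(j)
        if ri == rj:
            return
        if self.rank[ri] < self.rank[rj]:
            ri, rj = rj, ri
        self.parent[rj] = ri
        if self.rank[ri] == self.rank[rj]:
            self.rank[ri] += 1


def connected(uf, i, j):
    """connected(i, j; A+): both vertices share a union-find root."""
    return uf.find(i) == uf.find(j)


uf = UnionFind(5)
for i, j in [(0, 1), (1, 2), (3, 4)]:  # toy active set A+ (assumed data)
    uf.union(i, j)
print(connected(uf, 0, 2), connected(uf, 2, 3))
```

In this sketch a cluster is identified by its root, so `cluster(i; A+)` corresponds to all vertices that return the same value from `find`.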

3.2 Seeded watershed from a mutex perspective

One interpretation of the proposed method is in terms of a generalization of the edge-based watershed algorithm [59, 60, 20] or image foresting transform [61]. This algorithm can only ingest a graph with purely attractive interactions, $E^{-}=\emptyset$ . Without further constraints, the algorithm would yield only the trivial result of a single cluster comprising all vertices. To obtain more interesting output, an oracle needs to provide seeds (e.g. one node per cluster). These seed vertices are all connected to an auxiliary node (see Fig. 2 (a)) by auxiliary edges with infinite merge affinity. A maximum spanning tree (MST) on this augmented graph can be found in linearithmic time; and the maximum spanning tree (or in the case of degeneracy: at least one of the maximum spanning trees) will include the auxiliary edges. When the auxiliary edges are deleted from the MST, a forest results, with each tree representing one cluster [20, 59, 61].

We now reformulate this well-known algorithm in a way that will later emerge as a special case of the proposed Mutex Watershed: we eliminate the auxiliary node and edges, and replace them by a set of infinitely repulsive edges, one for each pair of seeds (Fig. 2 (b)). Algorithm 1 is a variation of Kruskal’s MST algorithm operating on the seed mutex graph just defined, and gives results identical to seeded watershed on the original graph.

This algorithm differs from Kruskal’s only by the check for mutual exclusion in the if-statement. Obviously, the modified algorithm has the same effect as the original algorithm, because the final set $A^{+}$ is exactly the maximum spanning forest obtained after removing the auxiliary edges from the original solution.
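The listing of Algorithm 1 is not reproduced in this text, so the following is only a hedged sketch of the idea described above: Kruskal's procedure with one extra mutual exclusion check. It assumes that all seed pairs are infinitely repulsive, which lets the check reduce to "both clusters already contain a seed". The function name `seeded_watershed_mutex` and the `(u, v, weight)` edge format are choices made for this example.

```python
def seeded_watershed_mutex(n, edges, seeds):
    """Kruskal-style maximum spanning forest that never joins two seeded clusters.

    n     : number of vertices, labelled 0..n-1
    edges : list of (u, v, weight) with purely attractive weights
    seeds : vertices that must end up in different clusters
    """
    parent = list(range(n))
    has_seed = [False] * n
    for s in seeds:
        has_seed[s] = True

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    active = []  # the active set A+ (a forest)
    for u, v, w in sorted(edges, key=lambda e: -e[2]):
        ru, rv = find(u), find(v)
        if ru == rv:
            continue  # already connected, the edge would close a cycle
        if has_seed[ru] and has_seed[rv]:
            continue  # mutex: two seeds may never share a cluster
        parent[rv] = ru
        has_seed[ru] = has_seed[ru] or has_seed[rv]
        active.append((u, v))
    return [find(i) for i in range(n)], active


# Toy chain 0-1-2-3 with a weak middle edge and seeds at both ends.
labels, forest = seeded_watershed_mutex(
    4, [(0, 1, 5.0), (1, 2, 1.0), (2, 3, 4.0)], seeds=[0, 3]
)
```

On the toy chain the weak edge `(1, 2)` is the only one rejected, because by the time it is examined both of its clusters contain a seed.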

In the sequel, we generalize this construction by admitting less-than-infinitely repulsive edges. Importantly, these can be dense and are hence much easier to estimate automatically than seeds with their strict requirement of only-one-per-cluster.

3.3 Mutex Watersheds

We now introduce our core contribution: an algorithm that is empirically no more expensive than an MST computation, but that can ingest both attractive and repulsive cues and partition a graph into a number of clusters that does not need to be specified beforehand. Neither seeds nor hyperparameters that implicitly determine the number of resulting clusters are required.

The Mutex Watershed, Algorithm 2, proceeds as follows. Given a graph $\mathcal{G}=(V,E)$ with signed weights $w:E\rightarrow\mathbb{R}$, sort all edges $E$, attractive or repulsive, by their absolute weight in descending order into a priority queue. Iteratively pop all edges from the queue and add them to the active set one by one, provided that certain conditions are satisfied. More specifically, assuming connect_all is False, if the next edge popped from the priority queue is attractive and its incident vertices are not yet in the same tree, then connect the respective trees provided this is not ruled out by a mutual exclusion constraint. If on the other hand the edge popped is repulsive, and if its incident vertices are not yet in the same tree, then add a mutual exclusion constraint between the two trees. The output clustering is defined by the connected components of the final attractive active set $A^{+}$.

The crucial difference to Algorithm 1 is that mutex constraints are no longer pre-defined, but created dynamically whenever a repulsive edge is found. However, new exclusion constraints can never override earlier, high-priority merge decisions. In this case, the repulsive edge in question is simply ignored. Similarly, an attractive edge must never override earlier and thus higher-priority must-not-link decisions.

The boolean value of the connect_all input parameter of the algorithm does not influence the final output clustering, but defines the internal cluster connectedness: when it is set to True, the algorithm adds all attractive intra-cluster edges to the active set $A^{+}$. When it is set to False, a maximum spanning tree is built for each cluster, similarly to the seeded watershed. The connect_all=True variant of the algorithm will be helpful in section 4 to highlight the relation between the Mutex Watershed and the multicut problem.
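The procedure described above can be sketched in a few lines of Python. This is an illustrative reading of the prose, not the authors' reference implementation: the function name `mutex_watershed`, the `(u, v, weight)` edge format, and the choice to return edge indices for the active sets are all assumptions of this example.

```python
def mutex_watershed(n, edges, connect_all=False):
    """Sketch of the Mutex Watershed on vertices 0..n-1.

    edges : list of (u, v, w); w > 0 is attractive, w < 0 is repulsive
    Returns (labels, a_plus, a_minus) where the active sets hold edge indices.
    """
    parent = list(range(n))
    # Per cluster root: indices of active repulsive edges touching the cluster.
    mutexes = [set() for _ in range(n)]

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    a_plus, a_minus = [], []
    # Visit all edges by descending absolute weight.
    order = sorted(range(len(edges)), key=lambda k: -abs(edges[k][2]))
    for k in order:
        u, v, w = edges[k]
        ru, rv = find(u), find(v)
        if ru == rv:
            # Already in the same cluster: a repulsive edge is ignored, an
            # attractive one is only recorded when connect_all is requested.
            if w > 0 and connect_all:
                a_plus.append(k)
            continue
        if w > 0:
            # Merge only if no active repulsive edge links the two clusters.
            if mutexes[ru].isdisjoint(mutexes[rv]):
                if len(mutexes[ru]) < len(mutexes[rv]):
                    ru, rv = rv, ru
                parent[rv] = ru
                mutexes[ru] |= mutexes[rv]  # inherit constraints
                mutexes[rv] = set()
                a_plus.append(k)
        elif w < 0:
            # New mutual exclusion constraint between the two clusters.
            mutexes[ru].add(k)
            mutexes[rv].add(k)
            a_minus.append(k)
    return [find(i) for i in range(n)], a_plus, a_minus


# Toy graph (assumed data): two strong pairs kept apart by one repulsive edge.
toy = [(0, 1, 5.0), (2, 3, 4.0), (0, 3, -3.0), (1, 2, 2.0), (0, 2, 1.0)]
labels, a_plus, a_minus = mutex_watershed(4, toy)
```

In the toy graph the repulsive edge of weight -3 is examined before the attractive edges of weight 2 and 1, so both of those later merges are blocked and two clusters remain.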

Fig. 3 illustrates the proposed algorithm: Fig. 3a and Fig. 3b show examples of an unconstrained merge and an added mutex constraint, respectively; Fig. 3c and Fig. 3d show, respectively, an example of an attractive edge ( $w_{e}=14$ ) and repulsive edge ( $w_{e}=-13$ ) that are not added to the active set because their incident vertices are already “connected” and belong to the same tree of the forest $A^{+}$ ; finally, Fig. 3e shows an attractive edge ( $w_{e}=12$ ) that is ruled out by a previously introduced mutual exclusion relation.

3.4 Time Complexity Analysis

Before analyzing the time complexity of Algorithm 2, we first review the complexity of Kruskal’s algorithm. Using a union-find data structure (with path compression and union by rank), the time complexity of $\textrm{merge}(i,j)$ and $\textrm{connected}(i,j)$ is $\mathcal{O}(\alpha(V))$, where $\alpha$ is the slowly growing inverse Ackermann function, and the total runtime complexity is dominated by the initial sorting of the edges, $\mathcal{O}(E\log E)$ [62].

To check for mutex constraints efficiently, we maintain a set of all active mutex edges

$$M[C_{i}]=\{(u,v)\in A^{-}\mid u\in C_{i}\lor v\in C_{i}\}$$

for every $C_{i}=\textrm{cluster}(i)$ using hash tables, where insertion of new mutex edges (i.e. addmutex) and search have an average complexity of $\mathcal{O}(1)$. Note that every cluster can be efficiently identified by its union-find root node. For $\textrm{mutex}(i,j)$ we check whether $M[C_{i}]\cap M[C_{j}]=\emptyset$ by searching for all elements of the smaller hash table in the larger hash table. Therefore $\textrm{mutex}(i,j)$ has an average complexity of $\mathcal{O}(\min(|M[C_{i}]|,|M[C_{j}]|))$. Similarly, during $\textrm{merge}(i,j)$, mutex constraints are inherited by merging two hash tables, which also has an average complexity of $\mathcal{O}(\min(|M[C_{i}]|,|M[C_{j}]|))$.
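A small sketch of this bookkeeping, under the assumption that clusters are keyed by their union-find root and that Python sets stand in for the hash tables. The names `table`, `add_mutex`, `mutex` and `inherit` are hypothetical and chosen only for this example.

```python
def table(M, c):
    """Hash set M[c] of active mutex edges touching the cluster with root c."""
    return M.setdefault(c, set())


def add_mutex(M, ci, cj, edge):
    # Average O(1): register the repulsive edge with both clusters.
    table(M, ci).add(edge)
    table(M, cj).add(edge)


def mutex(M, ci, cj):
    # Look up every element of the smaller table in the larger one,
    # i.e. O(min(|M[ci]|, |M[cj]|)) on average.
    # Callers are expected to pass two different cluster roots.
    small, large = sorted((table(M, ci), table(M, cj)), key=len)
    return any(e in large for e in small)


def inherit(M, keep, drop):
    # After merging cluster `drop` into `keep`, insert the smaller table
    # into the larger one and store the result under the surviving root.
    small, large = sorted((M.pop(keep, set()), M.pop(drop, set())), key=len)
    large |= small
    M[keep] = large


M = {}
add_mutex(M, 0, 1, (0, 1))   # repulsive edge between clusters 0 and 1
inherit(M, keep=2, drop=0)   # cluster 0 is absorbed by cluster 2
```

After the merge, cluster 2 has inherited the constraint against cluster 1, which is what prevents a later attractive edge from joining them.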

In conclusion, the average runtime contributions of attractive edges, $\mathcal{O}(\max(|E^{+}|\cdot\alpha(V),|E^{+}|\cdot M))$ (checking mutex constraints and possibly merging), and of repulsive edges, $\mathcal{O}(\max(|E^{-}|\cdot\alpha(V),|E^{-}|))$ (insertion of one mutex edge), result in a total average runtime complexity of Algorithm 2:

$$\mathcal{O}(\max(E\log E,\,EM)),$$

where $M$ is the expected value of $\min(|M[C_{i}]|,|M[C_{j}]|)$ and $\alpha(V)\in\mathcal{O}(\log V)\subseteq\mathcal{O}(\log E)$. (Footnote 1: In the worst case $\mathcal{G}$ is a fully connected graph, with $|E|=|V|^{2}$, hence $\log|V|=\frac{1}{2}\log|E|$.)

In the worst case, $M\in\mathcal{O}(E)$ and the Mutex Watershed Algorithm has a runtime complexity of $\mathcal{O}(E^{2})$. Empirically, we find that $\mathcal{O}(EM)\approx\mathcal{O}(E\log E)$ by measuring the runtime of the Mutex Watershed for different sub-volumes of the ISBI challenge (see Figure 4), leading to the following result:

$$\textrm{Empirical Mutex Watershed Complexity: }\mathcal{O}(E\log E)$$

4 Theoretical characterization

Towards the Multicut framework. In section 3.3, we have introduced the Mutex Watershed (MWS) algorithm as a generalization of seeded watersheds and the Kruskal algorithm in particular. However, since we are considering graphs with negative edge weights, the MWS is conceptually closer to the multicut problem and related heuristics such as GAEC and GF [34]. Fortunately, due to the structure of the MWS it can be analyzed using dynamic programming. This section summarizes our second contribution, i.e. the proof that the Mutex Watershed Algorithm globally optimizes a precise objective related to the multicut.

4.1 Review of the Multicut problem and its objective

In the following, we will review the multicut problem not in its standard formulation but in the Cycle Covering Formulation introduced in [63], which is similar to the MWS formulation as it also considers the set of attractive and repulsive edges separately. Previously, in Sec. 3.1, we defined a clustering by introducing the concept of an active set of edges $A=A^{+}\cup A^{-}\subseteq E$ and the connected/mutex predicates. In particular, an active set describes a valid clustering if it does not include both a path of only attractive edges and a path with exactly one repulsive edge connecting any two nodes $i,j\in V$ :

$$\textrm{connected}(i,j;A^{+})\;\Rightarrow\;\textrm{not}\;\textrm{mutex}(i,j;A^{+},A^{-}).$$

In other words, an active set is consistent and describes a clustering if it does not contain any cycle with exactly one repulsive edge (known as conflicted cycles).

Definition 4.1.

Conflicted cycles – We call a cycle of $\mathcal{G}$ conflicted w.r.t. $(\mathcal{G},w)$ if it contains precisely one repulsive edge $e\in E^{-}$ , s.t. $w_{e}<0$ . We denote by $\mathcal{C}^{-}(\mathcal{G},w)\subseteq\mathcal{C}(\mathcal{G},w)$ the set of all conflicted cycles. Furthermore, given a set of edges $A\subseteq E$ , we denote by $\mathcal{C}^{-}(A,\mathcal{G},w)\subseteq\mathcal{C}^{-}(\mathcal{G},w)$ the set of conflicted cycles involving only edges in $A$ .

From now on, in order to describe different clustering solutions in the framework of (integer) linear programs, we associate each active set $A$ with the following edge indicator $x^{A}$

$$x^{A}:=\mathbbm{1}\{e\notin A\}\in\{0,1\}^{|E|}.$$

In this way, the cycle-free property $\mathcal{C}^{-}(A,\mathcal{G},w)=\emptyset$ of an active set can be reformulated in terms of linear inequalities:

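$$\mathcal{C}^{-}(A,\mathcal{G},w)=\emptyset\quad\Leftrightarrow\quad\forall C\in\mathcal{C}^{-}(\mathcal{G},w):\ \sum_{e\in E_{C}}x^{A}_{e}\geq 1. \tag{5}$$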

In words, the active set cannot contain conflicted cycles; or vice versa, every conflicted cycle must contain at least one edge that is not part of the active set. Following [63], via this property we describe the space of all possible clustering solutions by defining the convex hull $\mathsf{SC}(\mathcal{G},w)$ of all edge indicators corresponding to valid clusterings of $(\mathcal{G},w)$ :

Definition 4.2.

Let $\mathsf{SC}(\mathcal{G},w)$ denote the convex hull of all edge indicators $x\in\{0,1\}^{|E|}$ satisfying the following system of inequalities:

$$\forall C\in\mathcal{C}^{-}(\mathcal{G},w):\ \sum_{e\in E_{C}}x_{e}\geq 1. \tag{6}$$

That is, $\mathsf{SC}(\mathcal{G},w)$ contains all edge labelings for which every conflicted cycle is broken at least once. We call $\mathsf{SC}(\mathcal{G},w)$ the set covering polyhedron with respect to conflicted cycles, similarly to [63].

Fig. 5 summarizes these definitions and provides an example of consistent and inconsistent active sets with their associated clusterings and edge indicators.

As shown in [63], the multicut optimization problem can be formulated with constraints over conflicted cycles in terms of the following integer linear program (ILP), which is NP-hard:

$$\min_{x\in\mathsf{SC}(\mathcal{G},w)}\ \sum_{e\in E}|w_{e}|\,x_{e}. \tag{7}$$

The solution of the multicut problem is given by the clustering associated to the connected components of the active set $\hat{A}^{+}=\{e\in E^{+}|\hat{x}_{e}=0\}$ , where $\hat{x}\in\{0,1\}^{|E|}$ is the solution of (7).

4.2 Mutex Watershed Objective

We now define the Mutex Watershed objective that is minimized by the Mutex Watershed Algorithm (proof in subsection 4.3) and show how it is closely related to the multicut problem defined in Eq. (7). Lange et al. [63] introduce the concept of dominant edges in a graph. For example, an attractive edge $f\in E^{+}$ is called dominant if there exists a cut $B$ with $f\in E_{B}$ such that $|w_{f}|\geq\sum_{e\in E_{B}\setminus\{f\}}\left|w_{e}\right|$ . Dominant edges highlight an aspect of the multicut problem that can be used to search for optimal solutions more efficiently. Not all weighted graphs contain dominant edges; but if, assuming no ties, we raise all graph weights to a large enough power, a similar property emerges.

Definition 4.3.

Dominant power: Let $\mathcal{G}=(V,E,w)$ be an edge-weighted graph, with unique weights $w:E\rightarrow\mathbb{R}$ . We call $p\in\mathbb{N}^{+}$ a dominant power if:

$$|w_{e}|^{p}>\sum_{t\in E,\ |w_{t}|<|w_{e}|}|w_{t}|^{p}\qquad\forall e\in E. \tag{8}$$

In contrast to dominant edges [63], we do not consider edges on a cut but rather all edges with smaller absolute weight. Note that there exists a dominant power for any finite set of edges, since for any $e\in E$ we can divide (8) by $|w_{e}|^{p}$ and observe that the normalized weights $|w_{t}|^{p}/|w_{e}|^{p}$ with $|w_{t}|<|w_{e}|$ (and any finite sum of these weights) converge to 0 when $p$ tends to infinity.
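As an illustration of Def. 4.3, the sketch below searches for the smallest dominant power of a small weight set. It is only a toy check under stated assumptions: the function names and the search bound `p_max` are hypothetical, weights are assumed non-zero with distinct absolute values, and exact rational arithmetic is used so that large powers do not suffer from rounding.

```python
from fractions import Fraction


def is_dominant_power(weights, p):
    """Check |w_e|^p > sum of |w_t|^p over all edges t with smaller absolute weight."""
    magnitudes = sorted(abs(Fraction(w)) for w in weights)
    running_sum = Fraction(0)  # sum of p-th powers of all smaller magnitudes
    for m in magnitudes:
        if not m ** p > running_sum:
            return False
        running_sum += m ** p
    return True


def smallest_dominant_power(weights, p_max=1000):
    """Linear search for the smallest dominant power (p_max is an arbitrary cap)."""
    magnitudes = [abs(w) for w in weights]
    if 0 in magnitudes or len(set(magnitudes)) != len(magnitudes):
        raise ValueError("weights must be non-zero with unique absolute values")
    for p in range(1, p_max + 1):
        if is_dominant_power(weights, p):
            return p
    raise ValueError("no dominant power found up to p_max")


# For weights 3, -2, 1: p = 1 fails because 3 is not larger than 2 + 1,
# while p = 2 works because 9 > 4 + 1 and 4 > 1.
print(smallest_dominant_power([3, -2, 1]))
```

Once a power is dominant, every larger power is dominant as well, since multiplying both sides of the inequality by $|w_{e}|$ only increases the gap; a linear search from $p=1$ is therefore sufficient.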

By considering the multicut problem in Eq. (7) and raising the weights $|w_{e}|$ to a dominant power $p$ , we fundamentally change the problem structure:

Definition 4.4.

Mutex Watershed Objective: Let $\mathcal{G}=(V,E,w)$ be an edge-weighted graph, with unique weights $w:E\rightarrow\mathbb{R}$ and $p\in\mathbb{N}^{+}$ a dominant power. Then the Mutex Watershed Objective is defined as the integer linear program

$$\min_{x\in\mathsf{SC}(\mathcal{G},w)}\ \sum_{e\in E}|w_{e}|^{p}\,x_{e}. \tag{9}$$

where $\mathsf{SC}(\mathcal{G},w)$ is the convex hull defined in Def. 4.2.

In the following section, we will prove that this modified version of the multicut objective, which we call Mutex Watershed Objective, is indeed optimized by the Mutex Watershed Algorithm:

Theorem 4.1.

Let $\mathcal{G}=(V,E,w)$ be an edge-weighted graph, with unique weights $w:E\rightarrow\mathbb{R}$ and $p\in\mathbb{N}^{+}$ a dominant power. Then the edge indicator given by the Mutex Watershed Algorithm 2

[TABLE]

minimizes the Mutex Watershed Objective in Eq. (9).

4.3 Proof of optimality via dynamic programming

In this section we prove Theorem 4.1, i.e. that the Mutex Watershed Objective defined in Def. 4.4 is solved to optimality by the Mutex Watershed Algorithm 3. Particularly, in the following Sec. 4.3.1 we show that the edge indicator associated to the solution of the MWS algorithm lies in $\mathsf{SC}(\mathcal{G},w)$ , whereas in Sec. 4.3.2 we prove that it solves Eq. (9) to optimality.

4.3.1 Cycle consistency

The Mutex Watershed algorithm introduced in Sec. 3 iteratively builds an active set $A=A^{+}\cup A^{-}$ such that nodes engaged in a mutual exclusion constraint (encoded by edges in $A^{-}$ ) are never part of the same cluster. In other words, the active set built by the Mutex Watershed never includes a conflicted cycle at any iteration and is always consistent. In particular, for any attractive edge $(i,j)=e^{+}\in E^{+}$ and any consistent set $A$ that fulfills $\mathcal{C}^{-}(A,\mathcal{G},w)=\emptyset$ :

[TABLE]

Therefore, we can rewrite Algorithm 2 in the form of Algorithm 3. This new formulation makes it clear that

[TABLE]

Thus, thanks to Eq. (5) and Definition 4.2, it follows that the MWS edge indicator $x^{\mathbf{MWS}}$ defined in Theorem 4.1 lies in $\mathsf{SC}(\mathcal{G},w)$ :

[TABLE]
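The consistency-preserving greedy procedure discussed in this subsection can be sketched as follows. This is a simplified illustration, not the authors' reference implementation: the `(u, v, w)` edge encoding with signed weights, the function name `mutex_watershed` and the bookkeeping of mutex constraints as sets of component roots are assumptions made for the example. Edges are visited by decreasing absolute weight, and an edge is activated only if it does not close a conflicted cycle.

```python
def mutex_watershed(num_nodes, edges):
    """Greedy sketch: edges are (u, v, w) with w > 0 attractive and w < 0 repulsive."""
    parent = list(range(num_nodes))
    # mutex[r] holds the roots of all components that must stay separate from root r.
    mutex = [set() for _ in range(num_nodes)]

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a

    active = []
    for u, v, w in sorted(edges, key=lambda e: -abs(e[2])):
        ru, rv = find(u), find(v)
        if w > 0:
            if ru == rv:
                # Already connected: the edge cannot close a new conflicted cycle.
                active.append((u, v, w))
            elif rv not in mutex[ru]:
                # Merge component rv into ru and transfer its mutex constraints.
                parent[rv] = ru
                for m in mutex[rv]:
                    mutex[m].discard(rv)
                    mutex[m].add(ru)
                    mutex[ru].add(m)
                mutex[rv].clear()
                active.append((u, v, w))
        elif ru != rv:
            # Repulsive edge between different components: record the constraint.
            mutex[ru].add(rv)
            mutex[rv].add(ru)
            active.append((u, v, w))
    labels = [find(i) for i in range(num_nodes)]
    return labels, active


labels, active = mutex_watershed(3, [(0, 1, 0.9), (1, 2, 0.8), (0, 2, -0.85)])
print(labels, active)
```

In the toy triangle, the repulsive edge between nodes 0 and 2 is visited before the weaker attractive edge between nodes 1 and 2, so the latter is rejected and node 2 remains its own cluster.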

4.3.2 Optimality

We first note that the Mutex Watershed Objective (Def. 4.4) and Theorem 4.1 can easily be reformulated in terms of active sets to minimize

[TABLE]

We now generalize the Mutex Watershed (see Algorithm 4) and the objective such that an initial consistent set of active edges $\tilde{A}\subseteq E$ is supplied:

Definition 4.5.

Energy optimization subproblem. Let $\mathcal{G}=(V,E,w)$ be an edge-weighted graph. Define the optimal solution of the subproblem as

[TABLE]

where $\tilde{A}\subseteq E$ is a set of initially activated edges such that $\mathcal{C}^{-}(\tilde{A},\mathcal{G},w)=\emptyset$ .

We note that for $\tilde{A}=\emptyset$ , the optimal solution $S(\mathcal{G},\emptyset)$ is equivalent to the solution minimizing the Mutex Watershed Objective and Eq. (12).

Definition 4.6.

Incomplete, consistent initial set: For an edge-weighted graph $\mathcal{G}=(V,E,w)$ a set of edges $\tilde{A}\subseteq E$ is consistent if

[TABLE]

$\tilde{A}$ is incomplete if it is not the final solution and there exists a consistent edge $\tilde{e}$ that can be added to $\tilde{A}$ without violating the constraints.

[TABLE]

Definition 4.7.

First greedy step: Let us consider an incomplete, consistent initial active set $\tilde{A}\subseteq E$ on $\mathcal{G}=(V,E,w)$ . We define

[TABLE]

as the feasible edge with the highest weight, which is always the first greedy step of Algorithm 4.

In the following two lemmas, we prove that the Mutex Watershed problem has an optimal substructure property and a greedy choice property [62], which are sufficient to prove that the Mutex Watershed algorithm finds the optimum of the Mutex Watershed Objective.

Lemma 4.2.

Greedy-choice property. For an incomplete, consistent initial active set $\tilde{A}$ of the Mutex Watershed, the first greedy step $g$ is always part of the optimal solution

$$g\in S(\mathcal{G},\tilde{A}).$$

Proof.

We prove the lemma by contradiction, assuming that the first greedy choice is not part of the optimal solution, i.e. $g\notin S(\mathcal{G},\tilde{A})$ . Since $g$ is by definition the feasible edge with highest weight, it follows that:

[TABLE]

We now consider the alternative active set $A^{\prime}=\{g\}$ , that is a consistent solution, with

[TABLE]

which contradicts the optimality of $S(\mathcal{G},\tilde{A})$ . ∎

Lemma 4.3.

Optimal substructure property. Let us consider an initial active set $\tilde{A}$ , the optimization problem defined in Equation 13, and assume that the problem is incomplete and consistent (see Def. 4.6). Then it follows that:

1. After making the first greedy choice $g$ , we are left with a subproblem that can be seen as a new optimization problem of the same structure;

2. The optimal solution $S(\mathcal{G},\tilde{A})$ is always given by the combination of the first greedy choice and the optimal solution of the remaining subproblem.

Proof.

After making the first greedy choice and selecting the first feasible edge $g$ defined in Equation 17, we are clearly left with a new optimization problem of the same structure that has the following optimal solution: $S(\mathcal{G},\tilde{A}\cup\{g\})$ .

In order to prove the second point of the lemma, we now show that:

[TABLE]

Since Algorithm 4 fulfills the greedy-choice property, ${g\in S(\mathcal{G},\tilde{A})}$ and we can add the edge $g$ as an additional constraint to the optimal solution:

[TABLE]

which is equivalent to Equation 20. ∎

Proof of Theorem 4.1.

In Lemmas 4.2 and 4.3 we have proven that the optimization problem defined in Eq. (12) has the optimal substructure and greedy choice properties. It follows by induction that the final active set $\mathbf{MWS}\big(\mathcal{G},w,\textrm{connect\_all=True}\big)$ found by the Mutex Watershed Algorithm 3 is the optimal solution for the Mutex Watershed objective (12) [62]. ∎

4.4 Relation to the extended Power Watershed framework

The Power Watershed [64] is an important framework for graph-based image segmentation that includes several algorithms like seeded watershed, random walker and graph cuts. Recently, [65] extended the framework to even more general types of hierarchical optimization algorithms thanks to the use of $\Gamma$ -theory and $\Gamma$ -convergence [66, 67]. In this section, we show how the Mutex Watershed algorithm can also be included in this extended framework and how the framework suggests an optimization problem that is solved by the Mutex Watershed. (The connection between the Mutex Watershed and the extended Power Watershed framework was kindly pointed out by an anonymous reviewer.)

4.4.1 Mutex Watershed as hierarchical optimization algorithm

We first start by introducing the extended Power Watershed framework and restating the main theorem from [65]:

Theorem 4.4.

Extended Power Watershed Framework [65]. Consider three strictly positive integers $p,m,t\in\mathbb{N}^{+}$ and $t$ real numbers

[TABLE]

Given $t$ continuous functions $Q_{k}:\mathbb{R}^{m}\rightarrow\mathbb{R}$ with $0\leq k<t$ , define the function

[TABLE]

Then, if any sequence $(x_{p})_{p>0}$ of minimizers $x_{p}$ of $Q^{p}(x)$ is bounded (i.e. there exists $C>0$ such that for all $p>0$ , $||x_{p}||_{\infty}\leq C$ ), the sequence is convergent, up to taking a subsequence, toward a point of $M_{t-1}$ , which is the set of minimizers recursively defined in Algorithm 5.

Proof.

See [65] (Theorem 3.3). ∎

We now show that the Mutex Watershed algorithm can be seen as a special case of the generic hierarchical Algorithm 5, for a specific choice of scales $\lambda_{k}$ and functions ${Q_{k}(x):\mathbb{R}^{m}\rightarrow\mathbb{R}}$ (see definitions (25, 26) below).

Scales $\lambda_{k}$ : Let $\tilde{w}_{k}$ be the signed edge weights $w:E\rightarrow\mathbb{R}$ ordered by decreasing absolute value $|\tilde{w}_{1}|>|\tilde{w}_{2}|>\ldots>|\tilde{w}_{t-1}|$ . If two edges share the same weight, then the weight is called $\tilde{w}_{k}$ for both and $E_{k}\subseteq E$ denotes the set of all edges with weight $\tilde{w}_{k}$ . We then define the scales $\lambda_{k}$ as

[TABLE]

The continuous functions $Q_{k}(x):\mathbb{R}^{|E|}\rightarrow\mathbb{R}$ are defined as follows

[TABLE]

where $\mathsf{ISC}(\mathcal{G},w)$ is defined as:

[TABLE]

In words, $Q_{0}(x)$ is proportional to the distance between $x$ and the closest point on the set $\mathsf{ISC}(\mathcal{G},w)$ , whereas $Q_{k}(x)$ depends only on the indicators $x_{e}$ of edges in $E_{k}$ , for $k>0$ .

Algorithm 6 is obtained by substituting the scales $\lambda_{k}$ and functions $Q_{k}(x)$ (respectively defined in Eq. (25) and (26)) into Algorithm 5. The algorithm starts by setting $M_{0}$ to $\mathsf{ISC}(\mathcal{G},w)$ , i.e. by restricting the space of the solutions only to integer edge labelings $x$ that do not include any conflicted cycles. Then, in the following iterations $k\in\{1,\ldots,t-1\}$ , the algorithm solves a series of minimization sub-problems that in the most general case are NP-hard, even though they involve a smaller set of edges $E_{k}\subseteq E$ . Nevertheless, if we assume that all weights are distinct, then $|E_{k}|=1$ for all $k$ and the solution to the sub-problems amounts to checking if the new edge can be labeled with $x_{e}=0$ without introducing any conflicted cycles. This procedure is identical to Algorithm 2: at every iteration, the Mutex Watershed tries to add an edge to the active set $A$ , provided that no mutual exclusion constraints are violated.

In summary, the framework in [65] provides a new formulation of the Mutex Watershed Algorithm that is even applicable to graphs with tied edge weights. In practice, when edge weights are estimated by a CNN, we do not expect tied edge weights.

4.4.2 Convergence of the sequence of minimizers

In this section, we see how Theorem 4.4 also suggests a minimization problem that is solved by the Mutex Watershed algorithm. A short summary is given in the final paragraph of the section.

First, we make sure that the conditions of Theorem 4.4 are satisfied when we apply it to Algorithm 6:

Lemma 4.5.

Let us consider the scales $\lambda_{k}$ and continuous functions ${Q_{k}(x):\mathbb{R}^{|E|}\rightarrow\mathbb{R}}$ respectively defined in Eq. (25) and (26). For any value of $p\in\mathbb{N}^{+}$ , let $x_{p}\in\mathbb{R}^{|E|}$ be a minimizer of the function $Q^{p}(x)$ defined in Eq. (24). Then, the minimizer $x_{p}$ lies in the set $\mathsf{ISC}(\mathcal{G},w)$ . From this, it follows that any sequence of minimizers $(x_{p})_{p>0}$ is bounded and the conditions of Theorem 4.4 are satisfied.

Proof.

See Appendix A. ∎

Then, given any $p\in\mathbb{N}^{+}$ and the Def. (25, 26), we have that the minimization of the function $Q^{p}(x)$ defined in Eq. (24) is given by the following problem:

[TABLE]

where we used Lemma 4.5 and restricted the domain of the $\operatorname*{arg\,min}$ operation to $\mathsf{ISC}(\mathcal{G},w)$ , so that $Q_{0}(x)=0$ for all $x\in\mathsf{ISC}(\mathcal{G},w)$ .

It follows from Lemma 4.5 and Theorem 4.4 that a sequence of minimizers $(x_{p})_{p>0}$ of the problem (30) converges, up to taking a subsequence, to the solution $x^{*}$ returned by Algorithm 6. More specifically, we know that any minimizer $x_{p}$ of (30) is in the discrete set $\mathsf{ISC}(\mathcal{G},w)$ . Hence, the convergent sequence of minimizers $(x_{p})_{p>0}$ eventually becomes constant and there exists a $p^{\prime}\in\mathbb{N}^{+}$ large enough such that $x_{p}=x^{*}$ for all $p\geq p^{\prime}$ . In other words, in the case of unique weights and $p\geq p^{\prime}$ large enough, the solution $x^{*}$ of the Mutex Watershed Algorithm 6 solves the problem (30), which is just a rescaled version of the Mutex Watershed Objective we introduced in Sec. 4.2.
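The stabilization of the minimizers for growing $p$ can be observed on a toy graph by exhaustive search. The sketch below is illustrative only: the graph, the helper names and the unscaled cost $\sum_{e\notin A}|w_{e}|^{p}$ are assumptions chosen for the example, and the enumeration over all $2^{|E|}$ active sets is feasible only for a handful of edges.

```python
from itertools import combinations


def is_consistent(num_nodes, edges, active):
    """True if the active edge indices contain no conflicted cycle."""
    parent = list(range(num_nodes))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    for i in active:
        u, v, w = edges[i]
        if w > 0:
            parent[find(u)] = find(v)
    return not any(
        edges[i][2] < 0 and find(edges[i][0]) == find(edges[i][1]) for i in active
    )


def brute_force_minimizer(num_nodes, edges, p):
    """Consistent active set minimizing the sum of |w_e|^p over inactive edges."""
    indices = range(len(edges))
    best_cost, best_active = None, None
    for size in range(len(edges) + 1):
        for active in combinations(indices, size):
            if not is_consistent(num_nodes, edges, active):
                continue
            cost = sum(abs(edges[i][2]) ** p for i in indices if i not in active)
            if best_cost is None or cost < best_cost:
                best_cost, best_active = cost, frozenset(active)
    return best_active


# Hypothetical toy graph: one repulsive edge 0-1 and two attractive paths 0-2-1, 0-3-1.
EDGES = [(0, 1, -10), (0, 2, 8), (2, 1, 7), (0, 3, 6), (3, 1, 5)]
minimizers = {p: brute_force_minimizer(4, EDGES, p) for p in range(1, 6)}
for p, active in minimizers.items():
    print(p, sorted(active))
```

For $p=1$ it is cheapest to deactivate the repulsive edge (cost 10 versus 7 + 5), whereas for every $p\geq 2$ in this example the minimizer keeps the repulsive edge and deactivates the weaker edge of each attractive path. That second solution is also what the greedy Mutex Watershed procedure produces on this graph when traced by hand.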

To summarize, we used the extended Power Watershed framework to show that the Mutex Watershed provides a solution to the minimization problem in Eq. (30) for $p$ large enough. In particular, this problem suggested by the Power Watershed framework is the same one previously derived in Sec. 4.2 by linking the Mutex Watershed Algorithm to the multicut optimization problem.

5 Experiments

We evaluate the Mutex Watershed on the challenging task of neuron segmentation in electron microscopy (EM) image volumes. This application is of key interest in connectomics, a field of neuroscience that strives to reconstruct neural wiring diagrams spanning complete central nervous systems. The task requires segmentation of neurons from electron microscopy images of neural tissue – a challenging endeavor, since segmentation has to be based only on boundary information (cell membranes) and some of the boundaries are not very pronounced. Besides, cells contain membrane-bound organelles, which have to be suppressed in the segmentation. Some of the neuron protrusions are very thin, but all of those need to be preserved in the segmentation to arrive at the correct connectivity graph. While a lot of progress is being made, currently only manual tracing or proof-reading yields sufficient accuracy for correct circuit reconstruction [68].

We validate the Mutex Watershed algorithm on the most popular neural segmentation challenge: ISBI2012 [42]. We estimate the edge weights using a CNN as described in Section 5.1 and compare with other entries in the leaderboard as well as with other popular post-processing methods for the same network predictions in Section 5.2.

5.1 Estimating edge weights with a CNN

The common first step to EM segmentation is to predict which pixels belong to a cell membrane using a CNN. Different post-processing methods are then used to obtain a segmentation, see section 2 for an overview of such methods. The CNN can either be trained to predict boundary pixels [41, 38] or undirected affinities [39, 69] which express how likely it is for a pixel to belong to a different cell than its neighbors in the 6-neighborhood. In this case, the output of the network contains three channels, corresponding to left, down and next imaging plane neighbors in 3D. The affinities do not have to be limited to immediate neighbors – in fact, [39] have shown that introduction of long-range affinities is beneficial for the final segmentation even if they are only used to train the network. Building on the work of [39], we train a CNN to predict short- and long-range affinities and then use those directly as weights for the Mutex Watershed algorithm.

We estimate the affinities / edge weights for the neighborhood structure shown in Figure 6. To that end, we define local attractive and long-range repulsive edges. When attractive edges are only short-range, the solution will consist of spatially connected segments that cannot comprise “air bridges”. This holds true for both (lifted) multicut and for Mutex Watershed. We use a different pattern for in-plane and between-plane edges due to the great anisotropy of the data set. In more detail, we pick a sparse ring of in-plane repulsive edges and additional longer-range in-plane edges which are necessary to split regions reliably (see Figure 6(a)). We also added connections to the indirect neighbors in the lower adjacent slice to ensure correct 3D connectivity (see Figure 6(b)). In our experiments, we pick a subset of repulsive edges, by using strides of 2 in the XY-plane in order to avoid artifacts caused by occasional very thick membranes. Note that the stride is not applied to local (attractive) edges, but only to long-range (repulsive) edges. The particular pattern used was selected after inspecting the size of typical regions. The specific pattern is the only one we have tried and was not optimized over.

In total, $C^{+}$ attractive and $C^{-}$ repulsive edges are defined for each pixel, resulting in $C^{+}+C^{-}$ output channels in the network. We partition the set of attractive / repulsive edges into subsets $H^{+}$ and $H^{-}$ that contain all edges at a specific offset: $E^{+}={\bigcup_{c=1}^{C^{+}}}H^{+}_{c}$ for attractive edges, with $H^{-}$ defined analogously. Each element of the subsets $H^{+}_{c}$ and $H^{-}_{c}$ corresponds to a specific channel predicted by the network. We further assume that weights take values in $[0,1]$ .

Network architecture and training

We use the 3D U-Net [43, 70] architecture, as proposed in [69].

Our training targets for attractive / repulsive edges $\overset{*}{w}^{\pm}$ can be derived from a groundtruth label image $\overset{*}{L}$ according to

[TABLE]

Here, $i$ and $j$ are the indices of vertices / image pixels. Next, we define the loss terms

[TABLE]

for attractive edges (i.e. channels) and repulsive edges (i.e. channels).

Equation 33 is the Sørensen-Dice coefficient [71, 72] formulated for fuzzy set membership values. During training we minimize the sum of attractive and repulsive loss terms $\mathcal{J}=\sum_{c=1}^{C^{+}}\mathcal{J}^{+}_{c}+\sum_{c=1}^{C^{-}}\mathcal{J}^{-}_{c}$ . This corresponds to summing up the channel-wise Sørensen-Dice loss. The terms of this loss are robust against prediction and / or target sparsity, a desirable quality for neuron segmentation: since membranes are locally two-dimensional and thin, they occupy very few pixels in the three-dimensional volume. More precisely, if $w^{+}_{e}$ or $\overset{*}{w}^{+}_{e}$ (or both) are sparse, we can expect the denominator $\sum_{e}((w^{+}_{e})^{2}+(\overset{*}{w}^{+}_{e})^{2})$ to be small, which has the effect that the numerator is adaptively weighted higher. In this sense, the Sørensen-Dice loss at every pixel $i$ is conditioned on the global image statistics, which is not the case for a Hamming-distance based loss like Binary Cross-Entropy or Mean Squared Error.
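To make the channel-wise loss concrete, the following plain-Python sketch derives binary targets for one edge offset from a 2D label image and evaluates a soft Sørensen-Dice term with the squared denominator mentioned above. Several details are assumptions rather than the paper's exact definitions: the convention that a target of 1 means "both pixels share a label", the negative sign and missing factor of two in `dice_loss`, and both function names.

```python
def affinity_targets(labels, offset):
    """Targets for one offset (dy, dx): 1.0 if both pixels carry the same label.

    Assumed encoding for illustration; pairs that leave the image are skipped.
    """
    dy, dx = offset
    height, width = len(labels), len(labels[0])
    targets = []
    for y in range(height):
        for x in range(width):
            yy, xx = y + dy, x + dx
            if 0 <= yy < height and 0 <= xx < width:
                targets.append(1.0 if labels[y][x] == labels[yy][xx] else 0.0)
    return targets


def dice_loss(pred, target):
    """Negative soft Dice term for one channel: -sum(p*t) / sum(p^2 + t^2)."""
    numerator = sum(p * t for p, t in zip(pred, target))
    denominator = sum(p * p + t * t for p, t in zip(pred, target))
    # An all-zero prediction and target carry no signal; return 0 by convention.
    return -numerator / denominator if denominator > 0 else 0.0


label_image = [[1, 1, 2],
               [1, 2, 2]]
target = affinity_targets(label_image, offset=(0, 1))  # right-hand neighbor
print(target)
print(dice_loss(target, target))                   # perfect prediction
print(dice_loss([1.0 - t for t in target], target))  # fully wrong prediction
```

Because the denominator sums over the whole channel, a sparse target yields a small denominator, so each correctly predicted entry contributes more strongly than it would under a per-pixel loss.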

We optimize this loss using the Adam optimizer [73] and additionally condition learning rate decay on the Adapted Rand Score [42] computed on the training set every 100 iterations. During training, we augment the data set by performing in-plane rotations by multiples of 90 degrees, flips along the X- and Y-axis as well as elastic deformations. At prediction time, we use test time data augmentation, presenting the network with seven different versions of the input obtained by a combination of rotations by a multiple of 90 degrees, axis-aligned flips and transpositions. The network predictions are then inverse-transformed to correspond to the original image, and the results averaged.

5.2 ISBI Challenge

The ISBI 2012 EM Segmentation Challenge [42] is the neuron segmentation challenge with the largest number of competing entries. The challenge data contains two volumes of dimensions 1.5 $\times$ 2 $\times$ 2 microns and has a resolution of 50 $\times$ 4 $\times$ 4 nm per pixel. The groundtruth is provided as binary membrane labels, which can easily be converted to a 2D, but not 3D segmentation. To train a 3D model, we follow the procedure described in [38].

The test volume has private groundtruth; results can be submitted to the leaderboard. They are evaluated based on the Adapted Rand Score (Rand-Score) and the Variation of Information Score (VI-Score) [42].

Our method holds the top entry in the challenge’s leaderboard (http://brainiac2.mit.edu/isbi_challenge/leaders-board-new) at the time of submission, see Table I(a). This is especially remarkable insofar as it is simpler than the methods holding the other top entries. Three out of four rely on a CNN to predict boundary locations and postprocess its output with the complex pipeline described in [38]. This post-processing first generates superpixels via distance transform watersheds. Then it computes a merge cost for local and long-range connections between superpixels. Based on this, it defines a lifted multicut partitioning problem that is solved approximately. In contrast, our method finds an optimal solution of its objective purely on the pixel level.

Comparison with other segmentation methods

The weights predicted by the CNN described above can be post-processed directly by the Mutex Watershed algorithm. To ensure a fair comparison, we transform the same CNN predictions into a segmentation using basic and state-of-the-art post-processing methods. We start from simple thresholding (THRESH) and seeded watershed. Since these cannot take long-range repulsions into account, we generate a boundary map by taking the maximum values over the attractive edge channels (the maximum is chosen to preserve boundaries). Based on this boundary map, we introduce seeds at the local minima (WS) and at the maxima of the smoothed distance transform (WSDT). For both variants, the degree of smoothing was optimized such that each region receives as few seeds as possible, without however causing severe under-segmentation. The performance of these three baseline methods in comparison to Mutex Watershed is summarized in Table I(b). The methods were applied only in 2D, because the high degree of anisotropy leads to inferior results when applied in 3D. In contrast, the Mutex Watershed can be applied in 3D out of the box and yields significantly better 2D segmentation scores.
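The simplest of these baselines can be sketched in a few lines. The code below is a hedged illustration of the THRESH idea only: it assumes that each attractive channel stores boundary evidence in $[0,1]$ (high values meaning "likely membrane"), and the threshold value, the 4-connectivity and the function name are choices made for the example rather than settings reported in the paper.

```python
from collections import deque


def threshold_segmentation(channels, threshold=0.5):
    """Label 4-connected regions of pixels whose boundary evidence is below threshold.

    channels: list of 2D lists, one per attractive edge channel (assumed layout).
    Returns a 2D list of labels; 0 marks boundary pixels.
    """
    height, width = len(channels[0]), len(channels[0][0])
    # Boundary map: channel-wise maximum, so that no boundary evidence is lost.
    boundary = [[max(c[y][x] for c in channels) for x in range(width)]
                for y in range(height)]
    labels = [[0] * width for _ in range(height)]
    next_label = 0
    for y in range(height):
        for x in range(width):
            if boundary[y][x] >= threshold or labels[y][x]:
                continue
            next_label += 1
            labels[y][x] = next_label
            queue = deque([(y, x)])
            while queue:  # breadth-first flood fill of one region
                cy, cx = queue.popleft()
                for ny, nx in ((cy + 1, cx), (cy - 1, cx), (cy, cx + 1), (cy, cx - 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and not labels[ny][nx] and boundary[ny][nx] < threshold):
                        labels[ny][nx] = next_label
                        queue.append((ny, nx))
    return labels


channels = [[[0.1, 0.9, 0.1]],
            [[0.2, 0.1, 0.1]]]
print(threshold_segmentation(channels))
```

A single weak boundary pixel is enough to merge two regions under this scheme, which is consistent with under-segmentation being the typical failure of thresholding.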

Qualitatively, we show patches of results in Figure 7. The major failure case for WS (Figure 7(e)) and WSDT (Figure 7(f)) is over-segmentation caused by over-seeding a region. The major failure case for THRESH is under-segmentation due to weak boundary evidence (see Figure 7(d)). In contrast, the Mutex Watershed produces a better segmentation, only causing minor over-segmentation (see Figure 7(a), Figure 7(b)).

Note that, in contrast to most pixel-based postprocessing methods, our algorithm can take long range predictions into account. To compare with methods which share this property, we turn to the multicut and lifted multicut-based partitioning for neuron segmentations as introduced in [29] and [36]. As proposed in [74], we compute costs corresponding to edge cuts from the affinities estimated by the CNN via:

[TABLE]

We set up two multicut problems: the first is induced only by the short-range edges (MC-LOCAL), the other by short- and long-range edges together (MC-FULL). Note that the solution to the full connectivity problem can contain “air bridges”, i.e. pixels that are connected only by long-range edges, without a path along the local edges connecting them. However, we found this not to be a problem in practice. In addition, we set up a lifted multicut (LMC) problem from the same edge costs.

Both problems are NP-hard, hence it is not feasible to solve them exactly on large grid graphs. For our experiments, we use the approximate Kernighan-Lin [75, 33] solver. Even this allows us to only solve individual 2D problems at a time. The results for MC-LOCAL and MC-FULL can be found in Table I(b). The MC-LOCAL approach scores poorly because it under-segments heavily. This observation emphasizes the importance of incorporating the longer-range edges. The MC-FULL and LMC approaches perform well. Somewhat surprisingly, the Mutex Watershed yields a better segmentation still, despite being much cheaper in inference. We note that MC-FULL, LMC and the Mutex Watershed are all evaluated on the same long-range affinity maps (i.e. generated by the same CNN with the same set of weights).

6 Conclusion and Discussion

We have presented a fast algorithm for the clustering of graphs with both attractive and repulsive edges. The ability to consider both gives a valid alternative to other popular graph partitioning algorithms that rely on a stopping criterion or seeds. The proposed method has low computational complexity in imitation of its close relative, Kruskal’s algorithm. We have shown which objective this algorithm optimizes exactly, and that this objective emerges as a specific case of the multicut objective. It is possible that recent interesting work [63] on partial optimal solutions may open an avenue for an alternative proof.

Finally, we have found that the proposed algorithm, when presented with informative edge costs from a good neural network, outperforms all known methods on a competitive bioimage partitioning benchmark, including methods that operate on the very same network predictions.

7 Acknowledgments

This work was partially supported by the grants DFG HA 4364/8-1, DFG SFB 1129 from the Deutsche Forschungsgemeinschaft and the Baden-Württemberg Stiftung Elite PostDoc Program.

Appendix A Property of the minimizers of $Q^{p}(x)$

See Lemma 4.5.

Proof.

The function $Q^{p}(x)$ can be explicitly written as (see Eqs. (24), (25) and (26)):

[TABLE]

We then denote these two terms by:

[TABLE]

Intuitively, we now prove that the minimizer $x_{p}$ of $Q^{p}(x)$ lies in $\mathsf{ISC}(\mathcal{G},w)$ by showing that the first term $Q_{\textrm{A}}^{p}(x)$ is always “dominant” as compared to $Q_{\textrm{B}}^{p}(x)$ .

First, we note that the gradient of the first term $Q_{\textrm{A}}^{p}(x)$ always has norm equal to $|E|$ and points in the direction of the closest point $x^{\prime}\in\mathsf{ISC}(\mathcal{G},w)$ . Given a generic point $y\in\mathbb{R}^{|E|}$ , the only two cases when the gradient $\nabla_{x}\,Q_{\textrm{A}}^{p}(x)$ does not exist are: i) if $y\in\mathsf{ISC}(\mathcal{G},w)$ ; ii) if there are at least two points $x^{\prime\prime},x^{\prime\prime\prime}\in\mathsf{ISC}(\mathcal{G},w)$ such that $||y-x^{\prime\prime}||=||y-x^{\prime\prime\prime}||$ . Clearly, $Q_{\textrm{A}}^{p}(x)$ presents minima only in the first case, when $y\in\mathsf{ISC}(\mathcal{G},w)$ .

On the other hand, the second term $Q_{\textrm{B}}^{p}(x)$ is always differentiable and the norm of its gradient is never greater than $\sqrt{|E|}$ :

[TABLE]

where we used the fact that $\tilde{w}_{k}/2\tilde{w}_{1}<1$ for every $1\leq k<t$ . Thus, the magnitude of the gradient given by the first term is always larger compared to the one given by the second term. We then conclude that the objective can always be reduced unless $x_{p}$ is a point of $\mathsf{ISC}(\mathcal{G},w)$ . ∎

Bibliography

[1] S. Wolf, C. Pape, A. Bailoni, N. Rahaman, A. Kreshuk, U. Köthe, and F. Hamprecht, “The mutex watershed: Efficient, parameter-free image partitioning,” Proc. ECCV’18, 2018.
[2] L. Vincent and P. Soille, “Watersheds in digital spaces: an efficient algorithm based on immersion simulations,” IEEE Trans. Pattern Analysis Machine Intelligence, no. 6, pp. 583–598, 1991.
[3] S. Beucher and C. Lantuéjoul, “Use of watersheds in contour detection,” in Int. Workshop on Image Processing. Rennes, France: CCETT/IRISA, Sept. 1979.
[4] S. Beucher and F. Meyer, “The morphological approach to segmentation: the watershed transformation,” Optical Engineering, vol. 34, pp. 433–433, 1992.
[5] R. Pohle and K. D. Toennies, “Segmentation of medical images using adaptive region growing,” in Medical Imaging 2001: Image Processing, vol. 4322. International Society for Optics and Photonics, 2001, pp. 1337–1346.
[6] M. A. Alattar, N. F. Osman, and A. S. Fahmy, “Myocardial segmentation using constrained multi-seeded region growing,” in International Conference Image Analysis and Recognition. Springer, 2010, pp. 89–98.
[7] S. Poonguzhali and G. Ravindran, “A complete automatic region growing method for segmentation of masses on ultrasound images,” in 2006 International Conference on Biomedical and Pharmaceutical Engineering. IEEE, 2006, pp. 88–92.
[8] J. Wu, S. Poehlman, M. D. Noseworthy, and M. V. Kamath, “Texture feature based automated seeded region growing in abdominal MRI segmentation,” in 2008 International Conference on Bio Medical Engineering and Informatics, vol. 2. IEEE, 2008, pp. 263–267.
