Boolean matrix factorization meets consecutive ones property

Nikolaj Tatti; Pauli Miettinen

arXiv:1901.05797·cs.DS·May 16, 2019

TL;DR

This paper introduces a new variant of Boolean matrix factorization with the consecutive ones property, applies it to graph visualization, proves its computational hardness, and proposes efficient greedy algorithms with strong experimental results.

Contribution

It formulates the OBMF problem, proves its NP-hardness, and develops a greedy algorithm whose inner update step runs in linear time using pq-trees.

Findings

1. OBMF is NP-hard and hard to approximate.
2. The proposed greedy algorithm finds high-quality factorizations.
3. Algorithms scale well and are effective for graph visualization tasks.

Tables

Table 1: Properties of real-world data sets. Rank indicates the rank used in the decomposition.

data	rows	cols	% of 1s	sym.	rank
Les Misérables	77	77	8.57	Yes	10
Paleo	124	139	11.48	No	10
Newsgroups	100	348	6.30	No	10
Terms	100	100	48.54	Yes	10
Locations	3203	3203	8.42	Yes	50
Mammals	194	194	58.04	Yes	10

Table 2: Relative errors with asymmetric (left) and symmetric (right) algorithms on real-world data.

algorithm	Les Misérables	Paleo	Newsgroups	Terms	Locations	Mammals
obmf	0.3562	0.7123	0.7428	0.3166	0.4026	0.2569
cobmf	0.3562	0.7194	0.7428	0.3199	0.4027	0.2556
asso	0.3287	0.7087	0.7232	0.2865	0.3444	0.2617

Table 3: Average relative errors and standard deviation (mean ± std) with random columns as seeds for asymmetric algorithms on real-world data. Ten random samples.

algorithm	Les Misérables	Paleo	Newsgroups	Terms	Locations	Mammals
obmf	0.5147 ± 0.0836	0.7562 ± 0.0173	0.8293 ± 0.0223	0.3203 ± 0.0	0.5301 ± 0.0	0.2573 ± 0.0
cobmf	0.4639 ± 0.0489	0.7555 ± 0.0223	0.8277 ± 0.0200	0.3245 ± 0.01	0.5149 ± 0.0	0.2571 ± 0.0

Equations

$$(\bm{{A}}\circ\bm{{B}})_{ij}=\bigvee_{\ell=1}^{k}a_{i\ell}b_{\ell j}\;. \tag{1}$$

$$\bigl\lVert\bm{{D}}-(\bm{{X}}^{T}\circ\bm{{Y}})\bigr\rVert_{F}^{2}\;. \tag{2}$$

$$\bigl\lVert\bm{{D}}-(\bm{{X}}^{T}\circ\bm{{Y}})\bigr\rVert_{F}^{2}\;. \tag{3}$$

$$\mathit{border}(v)=\max(x,y),\quad\text{where}\quad x=\max_{i}\Bigl(\mathit{border}(c_{i})+\sum_{j=1}^{i-1}\mathit{total}(c_{j})\Bigr),\quad y=\max_{i}\Bigl(\mathit{border}(c_{i})+\sum_{j=i+1}^{\ell}\mathit{total}(c_{j})\Bigr)\;.$$

$$\mathit{inner}(v)=\max(x,y),\quad\text{where}\quad x=\max_{i}\mathit{inner}(c_{i}),\quad y=\max_{i<j}\Bigl(\mathit{border}(c_{i})+\mathit{border}(c_{j})+\sum_{\ell=i+1}^{j-1}\mathit{total}(c_{\ell})\Bigr)\;.$$

$$\mathit{border}(v)=b+\sum_{i}\max(\mathit{total}(c_{i}),0)\;.$$

$$\mathit{inner}(v)=\max(x,y),\quad\text{where}\quad x=\max_{i}\mathit{inner}(c_{i}),\quad y=\max(b_{1},0)+\max(b_{2},0)+\sum_{i}\max(\mathit{total}(c_{i}),0)\;.$$

$$\mathit{border}(c_{j})+\max_{i<j}\Bigl(\mathit{border}(c_{i})+\sum_{\ell=i+1}^{j-1}\mathit{total}(c_{\ell})\Bigr)$$

$$\mathit{border}(c_{j})+\Bigl(\sum_{\ell=1}^{j-1}\mathit{total}(c_{\ell})\Bigr)+\max_{i<j}t(i,j),$$

$$t(i,j)=\mathit{border}(c_{i})-\sum_{\ell=1}^{i}\mathit{total}(c_{\ell})\;.$$

$$\max_{i<j}t(i,j)=\max\Bigl(t(j-1,j),\ \max_{i<j-1}t(i,j)\Bigr),$$

$$\bigl\lVert\bm{{D}}-\bigl((\bm{{X}}^{T}\circ\bm{{Y}})\lor(\bm{{Y}}^{T}\circ\bm{{X}})\bigr)\bigr\rVert_{F}^{2}\;.$$

$$\frac{\bigl\lVert\bm{{D}}-\bm{{X}}^{T}\circ\bm{{Y}}\bigr\rVert_{F}^{2}}{\bigl\lVert\bm{{D}}\bigr\rVert_{F}^{2}}\;.$$

$$2(3m-n+2)=2k=\sum_{i=1}^{3m+1}(f_{i}+g_{i})=f_{1}+g_{3m+1}+\sum_{i=1}^{3m}(f_{i+1}+g_{i})\geq 2(3m+1)-2h,$$

Full text

Boolean matrix factorization meets consecutive ones property

This is an extended version of the paper of the same name presented at the 2019 SIAM International Conference on Data Mining.

Nikolaj Tatti, University of Helsinki, Helsinki, Finland

Pauli Miettinen, University of Eastern Finland, Kuopio, Finland. Part of this work was done while the author was with MPI-INF, Saarbrücken, Germany.

Abstract

Boolean matrix factorization is a natural and popular technique for summarizing binary matrices. In this paper, we study a problem of Boolean matrix factorization where we additionally require that the factor matrices have the consecutive ones property (OBMF). A major application of this optimization problem comes from graph visualization: standard techniques for visualizing graphs are circular or linear layouts, where nodes are ordered on a circle or on a line. A common problem with visualizing graphs is clutter due to too many edges. The standard approach to deal with this is to bundle edges together and represent them as ribbons. We show that we can use OBMF for edge bundling combined with circular or linear layout techniques.

We demonstrate that not only is this problem $\mathrm{NP}$-hard, but we cannot have a polynomial-time algorithm that yields a multiplicative approximation guarantee (unless $\mathrm{P}=\mathrm{NP}$). On the positive side, we develop a greedy algorithm where at each step we look for the best rank-1 factorization. Since even obtaining a rank-1 factorization is $\mathrm{NP}$-hard, we propose an iterative algorithm where we fix one side and find the other, reverse the roles, and repeat. We show that this step can be done in linear time using pq-trees. We also extend the problem to the cyclic ones property and symmetric factorizations. Our experiments show that our algorithms find high-quality factorizations and scale well.

1 Introduction

Matrix factorization is an immensely popular way of summarizing data as well as discovering signal in the data. While useful, the discovered factor matrices may be difficult to interpret and visualize. A popular variant for factorizing binary matrices is $k$-Boolean matrix factorization, which, essentially, summarizes the binary data as a union of $k$ tiles, that is, submatrices full of 1s. However, visualizing such a factorization is difficult as the discovered rows and columns can be any sets, and there is no insightful way of visualizing them all at once.

In this paper we consider $k$-Boolean matrix factorization such that the resulting matrix has a certain property: we can order the columns and the rows such that the matrix consists of a union of $k$ contiguous tiles. We do not know the order beforehand, and we discover the order as we discover the factorization.

Our motivation for discovering such a factorization is primarily the ease of exploring it: we can draw the factorization as $k$ tiles. While in certain cases such a constraint may be too restrictive, there are many settings where it comes naturally. As a specific example, consider visualizing graphs. A classic technique for visualizing a graph is using a linear or circular layout, where the nodes are drawn on a line or circle, and they are connected with arcs. The most common problem with visualizing graphs is clutter due to too many edges. To combat the clutter, edges are often grouped and drawn as ribbons (see Figure 3 for an example). The problem is to discover such ribbons and the node order, while minimizing the error. We show that we can use matrix factorization on the adjacency matrix of a graph to find the order and the groups.

We show that the factorization we seek can be expressed with the consecutive ones property (C1P). Namely, we will look for factor matrices $\bm{{X}}$ and $\bm{{Y}}$ whose columns can be shuffled such that each row has the form $[0,\ldots,0,1,\ldots,1,0,\ldots,0]$. We show that the problem is NP-hard, even if $k=1$, and it is inapproximable for $k>1$. On the positive side, we propose a greedy algorithm that searches for the factors in an iterative manner. The search is done by first fixing a vector in $\bm{{X}}$ and finding the optimal counterpart in $\bm{{Y}}$, then fixing the vector in $\bm{{Y}}$ and finding the optimal vector in $\bm{{X}}$, and so on, until convergence. We show that we can find the optimal counterpart in linear time using pq-trees.

We also consider three extensions of this factorization. The first variant, cyclic decomposition, allows factors to “wrap around the border.” The second variant is specifically designed for symmetric matrices, while the last variant combines the two. Performing cyclic and symmetric decomposition proves to be useful for cyclic layouts of graphs.

The rest of the paper is organized as follows: We present preliminary notation and define the matrix factorization and the cyclic version in Section 2. We present the search algorithm in Section 3. The symmetric extensions are given in Section 4. Section 5 is dedicated to experimental evaluation, and Section 6 to related work. Finally, we conclude the paper with remarks given in Section 7. All proofs are given in Appendix A.

2 Preliminary notation and problem definitions

We begin by presenting preliminary notation, and then present the two main problem definitions. Extended problems are discussed in Section 4.

2.1 Notation

Given an $n\text{-by-}k$ binary matrix $\bm{{A}}$ and a $k\text{-by-}m$ binary matrix $\bm{{B}}$, the Boolean matrix product $\bm{{A}}\circ\bm{{B}}$ is defined element-wise as

$$(\bm{{A}}\circ\bm{{B}})_{ij}=\bigvee_{\ell=1}^{k}a_{i\ell}b_{\ell j}\;. \tag{1}$$

The Boolean matrix sum of $\bm{{A}}\in\{0,1\}^{n\times m}$ and $\bm{{B}}\in\{0,1\}^{n\times m}$ is defined element-wise as $(\bm{{A}}\lor\bm{{B}})_{ij}=a_{ij}\lor b_{ij}$.

To measure the distance between two binary matrices, we use the squared Frobenius norm of their (normal) difference, $\left\lVert\bm{{A}}-\bm{{B}}\right\rVert_{F}^{2}$ . Notice that as $\bm{{A}}$ and $\bm{{B}}$ are both binary, this is the same as calculating the number of disagreements between $\bm{{A}}$ and $\bm{{B}}$ : $\left\lVert\bm{{A}}-\bm{{B}}\right\rVert_{F}^{2}=\left\lvert\{(i,j):a_{ij}\neq b_{ij}\}\right\rvert$ .
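The two definitions above can be illustrated with a minimal NumPy sketch. The function names `boolean_product` and `disagreements` and the small matrices are illustrative assumptions, not taken from the authors' C++ implementation.

```python
import numpy as np

def boolean_product(A, B):
    """Boolean matrix product: entry (i, j) is the OR over l of A[i, l] AND B[l, j]."""
    A = np.asarray(A, dtype=bool)
    B = np.asarray(B, dtype=bool)
    # The integer product counts the indices l with A[i, l] = B[l, j] = 1;
    # any positive count means the Boolean product is 1.
    return (A.astype(int) @ B.astype(int)) > 0

def disagreements(A, B):
    """Squared Frobenius distance of two binary matrices = number of differing entries."""
    return int(np.sum(np.asarray(A, dtype=bool) != np.asarray(B, dtype=bool)))

# Hypothetical factors: two tiles that overlap in the middle row.
X = np.array([[1, 1, 0],
              [0, 1, 1]])       # k-by-n, one factor per row
Y = np.array([[1, 1, 0, 0],
              [0, 1, 1, 1]])    # k-by-m
Z = boolean_product(X.T, Y)     # n-by-m reconstruction, a union of two tiles
```

Because the product is Boolean, the cell covered by both tiles is still 1 rather than 2, which is what distinguishes this product from the ordinary matrix product.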

We say that a binary matrix $\bm{{X}}$ has the consecutive ones property (C1P) if its columns can be permuted such that each row has the form $[0,\ldots,0,1,\ldots,1,0,\ldots,0]$, that is, the 1s form a contiguous interval. For the sake of presentation, we will also refer to these matrices as unimodal.

We say that a binary matrix $\bm{{X}}$ is cyclic if its columns can be permuted such that each row has a form of $[0,\ldots,0,1,\ldots,1,0,\ldots,0]$ or $[1,\ldots,1,0,\ldots,0,1,\ldots,1]$ .
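As a small illustration of the two definitions, the following sketch tests the unimodal and cyclic conditions for a given column order, and checks C1P by brute force over all column permutations. The helper names are hypothetical, and the brute-force check is exponential, so it is only meant for tiny examples; it is not the pq-tree method used later in the paper.

```python
from itertools import permutations

def is_interval(row, order):
    """True if the 1s of `row` are contiguous when columns are read in `order`."""
    bits = [row[c] for c in order]
    ones = [i for i, b in enumerate(bits) if b]
    return not ones or ones[-1] - ones[0] + 1 == len(ones)

def is_cyclic_interval(row, order):
    """True if the 1s are contiguous, or the 0s are (so the 1s wrap around)."""
    flipped = [1 - b for b in row]
    return is_interval(row, order) or is_interval(flipped, order)

def has_c1p(X):
    """Brute-force C1P test: try every column permutation (tiny inputs only)."""
    m = len(X[0])
    return any(all(is_interval(r, p) for r in X) for p in permutations(range(m)))

def is_cyclic(X):
    """Brute-force test of the cyclic property (tiny inputs only)."""
    m = len(X[0])
    return any(all(is_cyclic_interval(r, p) for r in X) for p in permutations(range(m)))

# The three rows {0,1}, {1,2}, {0,2} cannot all be intervals on a line,
# but they can on a circle.
triangle = [[1, 1, 0], [0, 1, 1], [1, 0, 1]]
```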

2.2 Problem definitions

Next we will give our two main optimization problems.

Problem 1 (Ordered BMF, obmf).

Given a binary matrix $\bm{{D}}$ and an integer $k\in\mathbb{N}$, find two unimodal binary matrices $\bm{{X}}$ and $\bm{{Y}}$ that minimize the number of disagreements

$$\bigl\lVert\bm{{D}}-(\bm{{X}}^{T}\circ\bm{{Y}})\bigr\rVert_{F}^{2}\;. \tag{2}$$

Problem 2 (Cyclic Ordered BMF, cobmf).

Given a binary matrix $\bm{{D}}$ and an integer $k\in\mathbb{N}$, find two cyclic binary matrices $\bm{{X}}$ and $\bm{{Y}}$ that minimize the number of disagreements

$$\bigl\lVert\bm{{D}}-(\bm{{X}}^{T}\circ\bm{{Y}})\bigr\rVert_{F}^{2}\;. \tag{3}$$

The matrix $\bm{{Z}}=\bm{{X}}^{T}\circ\bm{{Y}}$ given in Eq. 2 has a natural alternative characterization: the columns and the rows of $\bm{{Z}}$ can be permuted such that the resulting matrix is a union of $k$ contiguous tiles of 1s. Similarly, the matrix $\bm{{Z}}=\bm{{X}}^{T}\circ\bm{{Y}}$ given in Eq. 3 can be permuted such that the resulting matrix is a union of $k$ contiguous tiles, but we also allow the tiles to wrap around the border.

Unsurprisingly, the problems are computationally hard. First, we demonstrate that obmf is difficult even if $k=1$.

Theorem 1.

The obmf problem is $\mathrm{NP}$-hard, even if $k=1$.

Our next result shows that not only is obmf difficult to solve exactly, it also cannot be approximated within any multiplicative factor in polynomial time (unless $\mathrm{P}=\mathrm{NP}$). To show this, it is enough to demonstrate that testing for a zero-error solution is hard: an algorithm with a multiplicative guarantee must return a zero-error factorization whenever one exists, and would therefore decide this question.

Theorem 2.

Deciding whether obmf has a zero-error solution is $\mathrm{NP}$-complete.

The proofs of these and other statements are given in Appendix A.

3 Iterative greedy algorithm

3.1 Greedy algorithm

As we saw in the previous section, not only is the problem $\mathrm{NP}$-hard, but we also cannot construct any polynomial-time algorithm with a multiplicative guarantee (unless $\mathrm{P}=\mathrm{NP}$). Hence, we need to resort to heuristics. The most natural heuristic is a greedy one, where given a $(k-1)$-sized factorization we look for a $k$-sized factorization by adding one new row to each of $\bm{{X}}$ and $\bm{{Y}}$ (that is, one column to $\bm{{X}}^{T}$ and one row to $\bm{{Y}}$). Note that these rows need to be selected carefully such that $\bm{{X}}$ and $\bm{{Y}}$ remain unimodal, and we also need to maintain the permutation(s).

Unfortunately, Theorem 1 implies that we cannot even find the best solution for $k=1$ in polynomial time (unless $\mathrm{P}=\mathrm{NP}$). Fortunately, we can quickly solve a subproblem where we have fixed one side.

Problem 3 (Ordered BMF step, obmfstep).

Given a binary matrix $\bm{{D}}$ of size $n\text{-by-}m$ and two unimodal matrices, $\bm{{X}}^{\prime}$ of size $k\text{-by-}n$ and $\bm{{Y}}^{\prime}$ of size $(k-1)\text{-by-}m$ , find the decomposition $\bm{{X}}^{T}\circ\bm{{Y}}$ solving obmf such that $\bm{{X}}=\bm{{X}}^{\prime}$ and $\bm{{Y}}$ is obtained by adding one new row to $\bm{{Y}}^{\prime}$ .

We can use obmfstep as follows. Assume that we have already found a $(k-1)\text{-by-}n$ matrix $\bm{{X}}$ and a $(k-1)\text{-by-}m$ matrix $\bm{{Y}}$. We first extend $\bm{{X}}$ with a new row using a given seed (the strategy for selecting the seed is given later), and find the optimal new row for $\bm{{Y}}$ using obmfstep. We fix the discovered row, and use obmfstep to find the corresponding row for $\bm{{X}}$. Since we solve each step optimally, the error will never increase. We stop when the error stops decreasing. Note that we need to provide a seed for the initial row in $\bm{{X}}$. Here, we test several possible seeds $S$, and select the best. We experiment with several options in the experiments, but the default is that $S$ is equal to all singleton columns. The pseudo-code for the algorithm is given in Algorithm 1.
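The alternating scheme can be sketched for the very first factor, under the simplifying assumption that no earlier rows exist, so the ordering constraint is vacuous and the optimal counterpart simply keeps every column with positive gain. The function names, the reading of the default seeds as the individual columns of $\bm{{D}}$, and the toy matrix are assumptions made for this illustration; later factors would need the pq-tree step described in the rest of this section.

```python
import numpy as np

def best_counterpart(D, x):
    """Optimal unconstrained y for a fixed x: keep column u iff (#ones - #zeros) > 0."""
    rows = D[x.astype(bool)]                    # rows of D selected by x
    w = 2 * rows.sum(axis=0) - rows.shape[0]    # ones minus zeros, per column
    return (w > 0).astype(int)

def rank1_alternate(D, seed, max_iter=100):
    """Alternate between fixing x and fixing y until the error stops decreasing."""
    x = seed.copy()
    y = best_counterpart(D, x)
    err = int(np.sum(D != np.outer(x, y)))
    for _ in range(max_iter):
        x_new = best_counterpart(D.T, y)        # fix y, update x
        y_new = best_counterpart(D, x_new)      # fix x, update y
        new_err = int(np.sum(D != np.outer(x_new, y_new)))
        if new_err >= err:                      # no strict improvement: stop
            break
        x, y, err = x_new, y_new, new_err
    return x, y, err

def first_factor(D):
    """Try every column of D as a seed (an assumed reading of the default) and keep the best."""
    seeds = [D[:, j] for j in range(D.shape[1])]
    return min((rank1_alternate(D, s) for s in seeds), key=lambda r: r[2])

# Toy data: a single planted 3-by-3 tile.
D = np.zeros((5, 6), dtype=int)
D[1:4, 2:5] = 1
x, y, err = first_factor(D)
```

Each update is optimal for the side being changed, so the error sequence is non-increasing, which is the property the text relies on for termination.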

The remainder of this section is about solving obmfstep in linear time. Almost the same approach also works for the cyclic version, cobmfstep; we will point out the minor difference.

3.2 Expressing permutations with pq-trees

The complicated aspect of obmfstep is that we need to make sure that the new matrix is unimodal. Luckily, we can use pq-trees, a classic structure that allows us to express every permutation for which a set of binary vectors remains unimodal. In this section we give a brief review of pq-trees and the two main properties that are relevant to us.

Assume that we are given a universe $U$; in our case this will be either the rows or the columns of the input matrix. A pq-tree is a tree with each leaf corresponding to an element $u\in U$. There are two types of non-leaf nodes, and these types dictate which permutations we can perform on the children. We can permute the children of a p-node in any order, whereas the order of the children of a q-node is fixed but we can flip its direction. The leaves of the permuted tree then indicate an order. We denote the set of such orders by $\mathit{order}\mathopen{}\left(T\right)$, where $T$ is the pq-tree.
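To make the definition of $\mathit{order}(T)$ concrete, here is a minimal sketch that enumerates all leaf orders of a small pq-tree. The nested-tuple representation `("P", children)` / `("Q", children)` is an assumption chosen for this example, and the enumeration is exponential, so it serves only as an illustration and not as the update algorithm of Booth and Lueker.

```python
from itertools import permutations, product

# A tree is either a leaf (any non-tuple label) or a tuple (kind, children),
# where kind is "P" or "Q" and children is a list of trees.

def orders(tree):
    """Return the set of leaf orders (as tuples) that the pq-tree allows."""
    if not isinstance(tree, tuple):
        return {(tree,)}
    kind, children = tree
    child_orders = [orders(c) for c in children]
    if kind == "P":
        # Children of a p-node may appear in any order.
        arrangements = list(permutations(range(len(children))))
    else:
        # Children of a q-node keep their order, or the whole order is reversed.
        idx = tuple(range(len(children)))
        arrangements = [idx, idx[::-1]]
    result = set()
    for arr in arrangements:
        for combo in product(*(child_orders[i] for i in arr)):
            result.add(sum(combo, ()))   # concatenate the children's leaf orders
    return result

# Leaves 1, 2, 3 must stay in this order (or reversed); 0 and 4 are free.
T = ("P", [0, ("Q", [1, 2, 3]), 4])
all_orders = orders(T)
```

In every order of this example tree the sets $\{1,2\}$ and $\{2,3\}$ are contiguous, which is the kind of constraint that earlier rows of a factor matrix impose.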

Two seminal results are important to us. The first result states that there is a pq-tree $T$ such that $\mathit{order}\mathopen{}\left(T\right)$ contains exactly the orders under which a set of binary vectors remains unimodal.

Theorem 3 (Booth and Lueker [3]).

Given a universe $U$ and $k$ sets $S_{i}\subseteq U$ , there is a pq-tree $T$ such that $\mathit{order}\mathopen{}\left(T\right)$ are exactly the permutations of $U$ under which each $S_{i}$ is contiguous.

The second result states that we can efficiently update the pq-tree.

Theorem 4 (Booth and Lueker [3]).

Assume that we have a pq-tree $T$ over a universe $U$ and a set $S\subseteq U$. Let $P$ be the set of all permutations of $U$ where $S$ is contiguous. If $\mathit{order}\mathopen{}\left(T\right)\cap P\neq\emptyset$, then there is an $\mathcal{O}\mathopen{}\left(\left\lvert U\right\rvert\right)$-time algorithm that constructs a tree $T^{\prime}$ such that $\mathit{order}\mathopen{}\left(T^{\prime}\right)=\mathit{order}\mathopen{}\left(T\right)\cap P$. If $\mathit{order}\mathopen{}\left(T\right)\cap P=\emptyset$, then the same algorithm detects a failure.

The detailed description of the algorithm for updating the pq-tree can be found in [3].

3.3 Finding the optimal row

In this section we describe the algorithm that solves obmfstep. Assume that we have a pq-tree $T$ representing the permutations of columns in $\bm{{D}}$ allowed by the previously discovered rows in $\bm{{Y}}^{\prime}$ . When dealing with pq-trees it is notationally easier to deal with sets rather than with vectors. Naturally every binary vector $\bm{y}$ can be represented as a set $S=\left\{i:y_{i}=1\right\}$ .

Let us define $U$ to be the column indices of $\bm{{D}}$ ; these are exactly the leaves of $T$ . We say that a set $S\subseteq U$ is compatible with a pq-tree $T$ , if there is an order in $\mathit{order}\mathopen{}\left(T\right)$ where $S$ is contiguous. Obviously, compatible sets $S$ correspond exactly to suitable new rows in $\bm{{Y}}$ .

We can express obmfstep as an instance of the following problem.

Problem 4 (optset).

Given a universe $U$ , weights $w(u)$ for each $u\in U$ , and a pq-tree $T$ over the universe $U$ , find a set $S$ that is compatible with $T$ and maximizes the total weight $\sum_{u\in S}w(u)$ .

Recall that $u\in U$ corresponds to a column index of $\bm{{D}}$ . Define $w(u)$ to be the gain in the error-function if we were to use $u$ in our new row for $\bm{{Y}}$ . More formally, let $\bm{x}$ be the fixed counterpart in $\bm{{X}}$ for the new row in $\bm{{Y}}$ . Let $p$ be the number of ones in $\bm{{D}}$ at rows $\bm{x}$ and column $u$ that are not yet covered by the previous factors. Let $n$ be the number of zeros in $\bm{{D}}$ at rows $\bm{x}$ and column $u$ that are not yet covered by the previous factors. We define $w(u)=p-n$ . Solving optset with these weights solves obmfstep.

In order to solve cobmfstep, we solve optset using $w(u)=p-n$ , as above, yielding a set, say $S_{1}$ . In addition, we also solve optset using $w(u)=n-p$ , yielding a set, say $S_{2}$ . Then, we use either $S_{1}$ or $U\setminus S_{2}$ , whichever yields a better gain.
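The weights $w(u)=p-n$ can be computed with a few array operations. The sketch below assumes that the cells already covered by previous factors are available as a binary mask `covered` of the same shape as $\bm{{D}}$; that mask, the function name, and the toy data are assumptions made for this illustration.

```python
import numpy as np

def optset_weights(D, covered, x):
    """w(u) = p - n over the rows selected by x, ignoring already covered cells."""
    sel = np.asarray(x, dtype=bool)
    block = np.asarray(D, dtype=bool)[sel]           # rows of D chosen by x
    free = ~np.asarray(covered, dtype=bool)[sel]     # cells not yet covered
    p = np.sum(block & free, axis=0)                 # uncovered 1s per column
    n = np.sum(~block & free, axis=0)                # uncovered 0s per column
    return p - n

D = np.array([[1, 1, 0],
              [1, 0, 0],
              [0, 1, 1]])
covered = np.array([[1, 0, 0],
                    [0, 0, 0],
                    [0, 0, 0]])
w = optset_weights(D, covered, x=[1, 1, 0])
```

A column with positive weight reduces the error if it is added to the new row, and a column with negative weight increases it; covered cells contribute nothing because their status cannot change.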

In order to solve optset, we need an additional definition: Let $S$ be a compatible set of a pq-tree $T$ . If there is a permutation in $\mathit{order}\mathopen{}\left(T\right)$ with the first or the last element in $S$ , we call $S$ a border-compatible set.

Let $T$ be a pq-tree. To solve optset we will compute 3 counters for a node $v$ in $T$ , namely, $\mathit{inner}\mathopen{}\left(v\right)$ , $\mathit{border}\mathopen{}\left(v\right)$ , and $\mathit{total}\mathopen{}\left(v\right)$ . The counter $\mathit{total}\mathopen{}\left(v\right)$ corresponds to the total weight of leaves under $v$ , while the counter $\mathit{inner}\mathopen{}\left(v\right)$ corresponds to the best $S$ that is compatible with the subtree starting at $v$ . Finally, $\mathit{border}\mathopen{}\left(v\right)$ corresponds to the best $S$ that is border-compatible with the subtree starting at $v$ .

We should stress that, strictly by definition, $\mathit{inner}\mathopen{}\left(v\right)$ can represent an empty set, whereas the sets behind $\mathit{total}\mathopen{}\left(v\right)$ and $\mathit{border}\mathopen{}\left(v\right)$ should never be empty, even if they produce a negative value. Thus, $\mathit{inner}\mathopen{}\left(v\right)\geq 0$ but $\mathit{border}\mathopen{}\left(v\right)$ and $\mathit{total}\mathopen{}\left(v\right)$ can have negative values. Moreover, it is possible that $\mathit{border}\mathopen{}\left(v\right)$ represents every leaf of $v$, in which case $\mathit{border}\mathopen{}\left(v\right)=\mathit{total}\mathopen{}\left(v\right)$.

Naturally, we want to compute $\mathit{inner}\mathopen{}\left(r\right)$ , where $r$ is the root of $T$ . To obtain this value we compute each value iteratively, children first. We also maintain the lists of the children that were responsible for producing the optimal value. These lists are clear from the proofs of the following lemmata. This allows us to extract the optimal $S$ .

First, note that computing $\mathit{total}\mathopen{}\left(v\right)$ is trivial since $\mathit{total}\mathopen{}\left(v\right)=\sum_{c\in\mathit{ch}\mathopen{}\left(v\right)}\mathit{total}\mathopen{}\left(c\right)$ . If $v$ is a leaf-node, then $\mathit{border}\mathopen{}\left(v\right)=\mathit{total}\mathopen{}\left(v\right)$ and $\mathit{inner}\mathopen{}\left(v\right)=\max(0,\mathit{total}\mathopen{}\left(v\right))$ .

The next two lemmata establish how to compute the counters for q-nodes.

Lemma 5.

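Let $v$ be a q-node and let $c_{1},\ldots,c_{\ell}$ be its children. Then

$$\mathit{border}(v)=\max(x,y),\quad\text{where}\quad x=\max_{i}\Bigl(\mathit{border}(c_{i})+\sum_{j=1}^{i-1}\mathit{total}(c_{j})\Bigr),\quad y=\max_{i}\Bigl(\mathit{border}(c_{i})+\sum_{j=i+1}^{\ell}\mathit{total}(c_{j})\Bigr)\;.$$

Lemma 6.

Let $v$ be a q-node and let $c_{1},\ldots,c_{\ell}$ be its children. Then

$$\mathit{inner}(v)=\max(x,y),\quad\text{where}\quad x=\max_{i}\mathit{inner}(c_{i}),\quad y=\max_{i<j}\Bigl(\mathit{border}(c_{i})+\mathit{border}(c_{j})+\sum_{\ell=i+1}^{j-1}\mathit{total}(c_{\ell})\Bigr)\;.$$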

Our next step is to compute the counters for p-nodes. For that we need to define the following helper function: given a node $v$ we define $g(v)=\mathit{border}\mathopen{}\left(v\right)-\max(\mathit{total}\mathopen{}\left(v\right),0)$. We will use $g(v)$ in the next two lemmata describing how to compute the counters for a p-node.

Lemma 7.

Let $v$ be a p-node and let $c_{1},\ldots,c_{\ell}$ be its children. Define $b=\max_{i}g(c_{i})$. Then

$$\mathit{border}(v)=b+\sum_{i}\max(\mathit{total}(c_{i}),0)\;.$$

Note that since we require the set responsible for $\mathit{border}\mathopen{}\left(v\right)$ to be non-empty, it is possible that $\mathit{border}\mathopen{}\left(v\right)<0$. This can happen only if $b<0$ and every child $w$ of $v$ has $\mathit{total}\mathopen{}\left(w\right)<0$.

Lemma 8.

Let $v$ be a p-node and let $c_{1},\ldots,c_{\ell}$ be its children. Let $b_{1}$ and $b_{2}$ be the two largest values of $g(c_{i})$. Then

$$\mathit{inner}(v)=\max(x,y),\quad\text{where}\quad x=\max_{i}\mathit{inner}(c_{i}),\quad y=\max(b_{1},0)+\max(b_{2},0)+\sum_{i}\max(\mathit{total}(c_{i}),0)\;.$$
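The recurrences of Lemmas 5 to 8 can be sketched as a single recursive function. The nested-tuple tree representation and the names below are assumptions made for this example; the q-node case evaluates the pairwise maximum directly, so this sketch is quadratic in the number of children of a q-node rather than linear.

```python
def counters(tree, w):
    """Return (total, border, inner) for a subtree, following Lemmas 5 to 8."""
    if not isinstance(tree, tuple):                # leaf node
        t = w[tree]
        return t, t, max(0, t)
    kind, children = tree
    vals = [counters(c, w) for c in children]      # children first
    tot = [v[0] for v in vals]
    bor = [v[1] for v in vals]
    inn = [v[2] for v in vals]
    total = sum(tot)
    L = len(children)
    if kind == "Q":
        # Lemma 5: a border set is a prefix (or suffix) of full children
        # followed by a border set of one child.
        x = max(bor[i] + sum(tot[:i]) for i in range(L))
        y = max(bor[i] + sum(tot[i + 1:]) for i in range(L))
        border = max(x, y)
        # Lemma 6: stay inside one child, or span from child i to child j.
        spans = [bor[i] + bor[j] + sum(tot[i + 1:j])
                 for i in range(L) for j in range(i + 1, L)]
        inner = max(inn + spans)
    else:
        # Lemmas 7 and 8: g(c) = border(c) - max(total(c), 0).
        g = sorted((b - max(t, 0) for b, t in zip(bor, tot)), reverse=True)
        base = sum(max(t, 0) for t in tot)
        border = g[0] + base
        b2 = g[1] if L > 1 else 0
        inner = max(max(inn), max(g[0], 0) + max(b2, 0) + base)
    return total, border, inner

# Hypothetical weights on a small tree: a q-node (a, b, c) next to a free leaf d.
T = ("P", [("Q", ["a", "b", "c"]), "d"])
w = {"a": 5, "b": -10, "c": 4, "d": 2}
result = counters(T, w)   # best compatible set is {d, a} with weight 7
```

In the example, the p-node places leaf d next to the q-node and takes only the border leaf a from it, since including b would cost more than c gains.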

Note that using these lemmas every counter can be trivially computed in linear time, except for $\mathit{inner}\mathopen{}\left(v\right)$, where $v$ is a q-node. To compute $\mathit{inner}\mathopen{}\left(v\right)$ in linear time, it is enough if we can evaluate

$$\mathit{border}(c_{j})+\max_{i<j}\Bigl(\mathit{border}(c_{i})+\sum_{\ell=i+1}^{j-1}\mathit{total}(c_{\ell})\Bigr)$$

in constant time for a fixed $j$. Luckily, we can rewrite this function as

$$\mathit{border}(c_{j})+\Bigl(\sum_{\ell=1}^{j-1}\mathit{total}(c_{\ell})\Bigr)+\max_{i<j}t(i,j),$$

where

$$t(i,j)=\mathit{border}(c_{i})-\sum_{\ell=1}^{i}\mathit{total}(c_{\ell})\;.$$

Let $i(j)$ be the optimal $i$ for a fixed $j$. Since

$$\max_{i<j}t(i,j)=\max\Bigl(t(j-1,j),\ \max_{i<j-1}t(i,j)\Bigr),$$

we have either $i(j)=i(j-1)$ or $i(j)=j-1$. If we test each $j$ consecutively, then this allows us to compute $i(j)$ in constant time: we simply compare the solution $i=j-1$ to the best previous solution $i(j-1)$.
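The rewriting above amounts to a running-maximum scan over prefix sums. The following sketch, with hypothetical function names and 0-based indices, computes the span term of a q-node in one pass and compares it against the direct quadratic evaluation.

```python
def best_span_linear(border, total):
    """max over i < j of border[i] + border[j] + sum(total[i+1:j]), in one pass."""
    best = float("-inf")
    prefix = 0                  # sum of total[0..j-1] for the current j
    best_t = float("-inf")      # max over i < j of border[i] - sum(total[0..i])
    for j in range(len(border)):
        if j > 0:
            best = max(best, border[j] + prefix + best_t)
        # Make index j available as a candidate i for all later j.
        prefix += total[j]
        best_t = max(best_t, border[j] - prefix)
    return best

def best_span_quadratic(border, total):
    """Direct evaluation of the same maximum, used here only as a reference."""
    n = len(border)
    return max(border[i] + border[j] + sum(total[i + 1:j])
               for i in range(n) for j in range(i + 1, n))

# Leaves as children, so border equals total for each child.
vals = [2, -1, 3, -4]
span = best_span_linear(vals, vals)
```

Updating `best_t` with only the newest candidate corresponds to the observation that $i(j)$ is either $i(j-1)$ or $j-1$.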

In summary, each counter of $v$ can be computed in $\mathcal{O}\mathopen{}\left(\left\lvert\mathit{ch}\mathopen{}\left(v\right)\right\rvert\right)$ time. Thus we need $\mathcal{O}\mathopen{}\left(\ell\right)$ time in total, where $\ell$ is the number of nodes in $T$. Since $\ell\in\mathcal{O}\mathopen{}\left(\left\lvert U\right\rvert\right)$, we can compute the counters in $\mathcal{O}\mathopen{}\left(\left\lvert U\right\rvert\right)$ time, where $\left\lvert U\right\rvert$ is the number of columns in $\bm{{D}}$.

When computing the counters we also store which children were responsible for this value. Once we have computed $\mathit{inner}\mathopen{}\left(r\right)$ , where $r$ is the root of the tree, we can backtrack to obtain the optimal $S$ . This can be also done in linear time.

Computing the weights $w$ in optset can be done in $\mathcal{O}\mathopen{}\left(p\right)$ time, where $p$ is the number of 1s in the dataset $\bm{{D}}$ of size $n\text{-by-}m$ . Consequently, obmfstep can be done in $\mathcal{O}\mathopen{}\left(p+n+m\right)$ time.

4 Symmetric decomposition

We now propose an extension for symmetric matrices.

4.1 Definition

If $\bm{{D}}$ is symmetric (e.g. an adjacency matrix of an undirected graph), we have the following problem:

Problem 5 (Symmetric obmf, obmfsym).

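Given a binary matrix $\bm{{D}}$ and an integer $k\in\mathbb{N}$, find two binary matrices $\bm{{X}}$ and $\bm{{Y}}$ such that $[\bm{{X}};\bm{{Y}}]$ is unimodal and that minimize the number of disagreements

$$\bigl\lVert\bm{{D}}-\bigl((\bm{{X}}^{T}\circ\bm{{Y}})\lor(\bm{{Y}}^{T}\circ\bm{{X}})\bigr)\bigr\rVert_{F}^{2}\;.$$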

We define similarly cobmfsym, a cyclic and symmetric variant of obmf.

The unimodality condition in obmfsym states that we should be able to permute $\bm{{X}}$ and $\bm{{Y}}$ with the same permutation so that the rows are in form of $[0,\ldots,0,1,\ldots,1,0,\ldots,0]$ .

Notice that we do not use the more common symmetric decomposition $\bm{{D}}\approx\bm{{X}}^{T}\circ\bm{{X}}$ as this would lead to necessarily having the blocks around the diagonal.

4.2 Algorithm

The discovery algorithm for symmetric obmf is similar. Like with the regular obmf, we use a greedy algorithm as an iterative step for discovering new rows.

The first difference is that we maintain only one pq-tree, corresponding to the rows in both $\bm{{X}}$ and $\bm{{Y}}$ .

The second difference is that maximizing optset does not necessarily produce the optimal row, because $\bm{{X}}^{T}\circ\bm{{Y}}$ and $\bm{{Y}}^{T}\circ\bm{{X}}$ can have overlapping 1s. Instead, we can show that solving optset, with the weights as described in the previous section, minimizes $\bigl\lVert\bm{{D}}-\bm{{X}}^{T}\circ\bm{{Y}}\bigr\rVert_{F}^{2}+\bigl\lVert\bm{{D}}-\bm{{Y}}^{T}\circ\bm{{X}}\bigr\rVert_{F}^{2}$. It follows easily that minimizing this function yields a 2-approximation for finding the optimal counterpart row.

5 Experimental evaluation

In this section we study how well the algorithms from Sections 3 and 4.2 work with synthetic and real-world data. We denote the algorithms with the same names as the problems they are solving, and differentiate the algorithms from the problems via the font. That is, obmf is the algorithm for obmf, and so on. The algorithms are implemented in C++, and we make the source code and synthetic experiments freely available (https://cs.uef.fi/~pauli/bmf/ordered_bmf/).

5.1 Resilience to Noise

We start by evaluating the algorithms’ resilience to noise. To that end, we synthesized random matrices of size $95\times 95$ with block structure (6 blocks of size $20\times 20$ along the diagonal, with 5 overlapping rows and columns) and corrupted those matrices by flipping varying amounts of entries. The amount of flipped entries varied from $0\%$ to $50\%$ (of total elements), and we compared the quality of the results to both the noise-free matrix and the noisy matrix. The results are shown in Figure 1.

With lower levels of noise (up to about $35\%$ for obmf and cobmf and $25\%$ for the symmetric variants), the reconstruction of the original data is more accurate. With higher levels of noise, the noise has destroyed so much of the structure that the algorithms start fitting to the noise only, with a clear reduction of the quality versus the original data.

It is also worth noticing that obmf obtains exact decompositions when the data has no noise; the other methods introduce a slight error even in these cases emphasizing their more complex setting.

5.2 Scalability

In this section we test how well obmf scales to larger data sets and how well it benefits from multiple cores. These experiments were executed on a server with 40 cores of Intel Xeon E7-4870 processors running at $2.4\,\mathrm{GHz}$. The algorithm was compiled using GCC 8.1.0 and the parallel code uses the OpenMP library.

To test the scalability, we generated $n\text{-by-}n$ square matrices with $n=2^{i}$ for $i=9,\ldots,13$. All matrices have a density of approximately $24\%$. The results are presented in Figure 2a.

The algorithm shows very good scalability over the full range, although it does get slower when the data size increases from $2^{12}$ to $2^{13}$ . It should be noted, though, that as the density is constant, the number of non-zeros in the matrices increases as the square of the matrix size. Hence, obmf exhibits linear growth with respect to the number of non-zero elements.

Algorithm 1 is almost embarrassingly parallel over the different seed vectors. Hence, we parallelized the test of different seeds, and tested how the algorithm behaves with an increased number of cores. The results are in Figure 2b, where we can see that the speed-up is essentially linear up to $4$ cores, slightly slower until $16$ cores, and only marginal gains are available when increasing the number of cores to $32$, indicating that the algorithm has become memory bus constrained.

Overall, the experiments show that the algorithm scales very well, and is able to benefit from modern multi-core computers. We study further speed-up options later in Section 5.3.2.

5.3 Experiments with Real-World Data

We now turn to real-world data sets. We used six different real-world data sets, selected to offer a wide variety of different types of data. The data sets we used are as follows. Les Misérables is a standard benchmark data222http://moreno.ss.uci.edu/data.html of the characters of Victor Hugo’s novel Les Misérables. Paleo is a palaeontological data333NOW 030717, http://www.helsinki.fi/science/now/ in the form of a locations-by-genera matrix, giving information where different fossiles have been found. Newsgroups is a subset of the famous 20Newsgroups data444http://qwone.com/~jason/20Newsgroups/ consisting four newsgroups and $100$ terms. Terms the terms-by-terms co-occurrence matrix based on Newsgroups. Locations is locations-by-locations matrix indicating mammal species co-location in the northern hemisphere: the data has a $1$ in element $(i,j)$ if locations $i$ and $j$ have at least five mammals in common. The data is based on the IUC Red List data.555http://www.iucnredlist.org/technical-documents/spatial-data The final data set, Mammals, contains a species-by-species co-inhabitation matrix.666Available for research purposes from the Societas Europaea Mammalogica at http://www.european-mammals.org The data set properties are summarized in Table 1.

To the best of our knowledge, this is the first work to address the ordered Boolean matrix factorization problem. To understand what kind of an effect the ordering constraint has on the reconstruction error, we compare our results with those of asso [15]. The asso algorithm is a well-known method for computing the standard Boolean matrix factorization. We used an implementation available from the author (https://cs.uef.fi/~pauli/basso/basso-0.5.tar.gz), set the rank for asso to be the same as for our algorithms, and used threshold values $\tau\in\{0.2,0.4,0.6,0.8\}$.

For symmetric data sets, we also computed the symmetric Boolean factorization. This was done by first computing the standard $\bm{{X}}^{T}\circ\bm{{Y}}$ factorization, and then testing whether $\bm{{X}}^{T}\circ\bm{{X}}$ or $\bm{{Y}}^{T}\circ\bm{{Y}}$ gives smaller reconstruction error and using that one. This version of asso is denoted assosym.
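The selection step described above can be sketched in a few lines. The helper names are hypothetical, the factors are assumed to be $k$-by-$n$ binary NumPy arrays, and the error is taken to be the plain count of disagreeing entries rather than the relative error reported later.

```python
import numpy as np


def boolean_product(A, B):
    """Boolean matrix product: entry (i, j) is 1 if A[i, l] = B[l, j] = 1 for some l."""
    return (A.astype(int) @ B.astype(int) > 0).astype(int)


def reconstruction_error(D, R):
    """Number of entries where D and the reconstruction R disagree."""
    return int(np.sum(D != R))


def symmetrize(D, X, Y):
    """Return whichever of X or Y gives the smaller error for D ~ F^T o F."""
    err_x = reconstruction_error(D, boolean_product(X.T, X))
    err_y = reconstruction_error(D, boolean_product(Y.T, Y))
    return (X, err_x) if err_x <= err_y else (Y, err_y)


D = np.array([[1, 1, 0],
              [1, 1, 0],
              [0, 0, 1]])
X = np.array([[1, 1, 0],
              [0, 0, 1]])
Y = np.array([[1, 1, 0],
              [0, 0, 0]])
factor, error = symmetrize(D, X, Y)
print(error)
```

In this toy input, $\bm{{X}}^{T}\circ\bm{{X}}$ reproduces the matrix exactly while $\bm{{Y}}^{T}\circ\bm{{Y}}$ misses one entry, so the first factor is kept.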

5.3.1 Reconstruction errors

We first compute the reconstruction errors for the various data sets. To facilitate the comparisons, we report the relative reconstruction error

[TABLE]

The results for all data sets are given in Table 2.

In the case of asymmetric decompositions, asso is almost always slightly better than either obmf or cobmf. This is expected, as its factor matrices are not restricted to be unimodal or cyclic. The difference is, however, very small in many data sets (only $8\%$ in Les Misérables and $0.5\%$ in Paleo). A remarkable exception is the Mammals data, where asso is in fact worse than either obmf or cobmf. As the data set is the densest of the ones we tested, it is possible that asso was unable to obtain good candidates from it with the rounding thresholds we tried.

There is almost no difference between obmf and cobmf in terms of reconstruction error in these data sets. Usually, obmf is on par with or slightly better than cobmf, except again in Mammals, where cobmf is slightly better. The asymmetric data sets, Paleo and Newsgroups, cause the highest reconstruction errors at over $70\%$. It should be noted, though, that asso also has similarly high errors with these data sets, indicating that they might not have strong Boolean low-rank structure.

In symmetric decompositions, the relationship between the ordered BMF algorithms and asso is reversed, with assosym often being the worse method (with the exception of Terms). This is not very surprising, given that asso is not designed for symmetric decompositions. The errors are slightly worse than with the asymmetric algorithms, highlighting the complexity of finding the symmetric decompositions.

5.3.2 Changing the seeds

In the above experiments, we used the columns as the seeds $S$ for the algorithm (cf. Algorithm 1). This slows the algorithm down, as it has to attempt all of the potential seeds. In this section we study if we can improve the running time without hurting the reconstruction error by sampling only some of the columns for the seed set $S$ .

In particular, we sampled $10\%$ of the columns uniformly at random to create the seed set. As the algorithm scales linearly with the number of seeds, this provides an order of magnitude speed-up. To test the quality, we repeated the sampling ten times and report the average relative reconstruction errors and standard deviations in Table 3.
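A minimal sketch of this sampling protocol is given below. The function names are hypothetical, the fixed random seed is there only to make the example repeatable, and the population standard deviation is an assumption, as the text does not say which variant is reported.

```python
import numpy as np


def sample_seed_columns(num_columns, fraction=0.1, rng=None):
    """Sample a fraction of the column indices uniformly at random, without replacement."""
    rng = np.random.default_rng() if rng is None else rng
    size = max(1, int(round(fraction * num_columns)))
    return np.sort(rng.choice(num_columns, size=size, replace=False))


def mean_and_std(errors):
    """Average and (population) standard deviation over repeated runs."""
    errors = np.asarray(errors, dtype=float)
    return errors.mean(), errors.std()


rng = np.random.default_rng(0)  # fixed only for repeatability of the sketch
# Ten independent seed sets for a matrix with 139 columns (the size of Paleo in Table 1).
seed_sets = [sample_seed_columns(139, 0.1, rng) for _ in range(10)]
print([len(s) for s in seed_sets])
```

Each seed set would then be passed to the factorization in place of the full column set, and `mean_and_std` applied to the ten resulting errors.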

The first thing to notice in Table 3 is the low standard deviations: less than $3\%$ in almost all data sets. The reconstruction errors are also only slightly higher than those in Table 2; for instance, obmf with Paleo has only $6\%$ higher error on average when using random sampling. In most cases the speed-up obtained by the sampling is significant compared to the loss in accuracy.

5.4 Visualizing the Graphs

One of the motivations for the ordered BMF is that it allows the convenient visualization of graphs using edge bundles (or ribbons) between nodes that are placed in a circle. In this section we explore some of these visualizations and explain what we can learn from the respective data sets using them. In the following plots, the edge bundles and the ordering are obtained from the factorization. Further visualizations can be found in Appendix B.

The Les Misérables data: The visualization of the Les Misérables data is presented in Figure 3. Most edge bundles form a circular segment, indicating that all of the nodes under the segment are connected to each other (the characters appear in the same parts of the book). Some of the bundles are contained in other bundles, indicating an important subset of characters. Multiple bundles intersect on a node at the south-east of the circle called Valjean, the protagonist of the book.

The Mammals data: The second data set is the Mammals data, in Figure 4. For a clearer visualization, we only consider $134$ species that do not appear too frequently in the data, as the frequent species are neighbours of every other species in the graph. The edge bundles in Figure 4 are essentially rotating around the middle. This probably corresponds to the change of fauna when moving from north to south. The change is gradual, hence two consecutive edge bundles have a significant overlap, but over a longer distance, the change in the fauna becomes more obvious and the edge bundles are more disjoint. This gives a good intuition about the structure of the data.

6 Related Work

Boolean matrix factorization (BMF) has received increasing interest in the data analysis community [15, 12, 2, 17, 16, 13, 9, 10, 14, 11], proving to be a versatile tool for analyzing Boolean matrices. Many different algorithms have been proposed, including algorithms based on candidate creation and selection [15, 12], proximal alternations [10], and message passing [16], to name but a few. It has also found applications in diverse fields, such as bioinformatics [5], information extraction [4], and lifted inference [18]. To the best of our knowledge, however, the ordering constraint has not been studied in earlier work related to Boolean matrix factorization.

Tiling databases [6] can be seen as a restricted version of BMF, where the factorization cannot express any $0$s as $1$. Geometric tiling [8] is a variation thereof, where the tiles have to be consecutive. There are two main differences to our work: [8] uses a different optimization function (log-likelihood), and it assumes that the order is already given, for example, by spectral ordering, whereas we discover the order on the fly.

A binary matrix has the consecutive ones property (C1P) if its columns can be permuted so that all rows have all 1s consecutively. The pq-trees can be used to check for the C1P [3], and Atkins et al. [1] propose a spectral ordering algorithm. The spectral ordering approach is used in [8] to permute the data for finding the geometric tiles.
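To make the definition concrete, the following sketch tests the C1P by brute force over all column permutations. It is meant only for very small matrices and is not the pq-tree algorithm of [3], which avoids the exhaustive search; the function names are hypothetical.

```python
from itertools import permutations


def rows_are_consecutive(matrix, order):
    """True if, with columns arranged as `order`, every row has its 1s in one run."""
    for row in matrix:
        ones = [pos for pos, col in enumerate(order) if row[col] == 1]
        if ones and ones[-1] - ones[0] + 1 != len(ones):
            return False
    return True


def find_c1p_order(matrix):
    """Brute force over column permutations; returns a valid order or None."""
    num_cols = len(matrix[0]) if matrix else 0
    for order in permutations(range(num_cols)):
        if rows_are_consecutive(matrix, order):
            return order
    return None


has_c1p = [[1, 0, 1],
           [0, 1, 1]]
# Every pair of columns shares a row, but three columns cannot be pairwise adjacent.
no_c1p = [[1, 1, 0],
          [0, 1, 1],
          [1, 0, 1]]
print(find_c1p_order(has_c1p), find_c1p_order(no_c1p))
```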

7 Conclusions

Ordered Boolean matrix factorization (obmf) and its variations (cobmf, obmfsym) are restricted versions of Boolean matrix factorization, requiring the factors to have the consecutive ones property (or be cyclic, in the case of cobmf). This restriction facilitates the interpretation of the factorization, in particular in the case of the edge bundle visualizations of graphs, as we saw in Section 5.4. On the other hand, the restriction yields higher reconstruction errors, though our experiments show that the difference to a state-of-the-art Boolean matrix factorization algorithm is usually very small.

In this paper we laid the theoretical foundations of the obmf problem and its variations, and proposed algorithms based on the pq-trees. An important part of the proposed algorithm is the choice of the seed vectors. In this paper, we mostly used all columns of the data as the seed, though the experiments in Section 5.3.2 show that sampling the columns could work equally well. An interesting question for the future is whether other methods for selecting the seeds would yield better reconstruction errors.

In the problem setting of this paper, the user provides the rank of the decomposition and the goal is to minimize the reconstruction error over the rank-$k$ obmf decompositions. A common variant in the Boolean matrix factorization world is to make the rank a free variable and replace the target function with a measure that penalizes higher ranks (see, e.g. [14, 12, 10]). The Minimum Description Length principle is a common approach. The ordered nature of our factor matrices could help with finding more efficient MDL decompositions, as the factor matrices are easier to compress using run-length encoding or similar approaches.
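The compressibility argument can be illustrated with a simple interval code. The sketch below is not an MDL encoding from the cited works; it only shows, under the assumption that a factor row has already been permuted so that its 1s are consecutive, that two integers suffice to describe the row.

```python
def encode_row(row):
    """Encode a 0/1 row whose 1s are consecutive as (start, length); (0, 0) if empty."""
    ones = [i for i, value in enumerate(row) if value == 1]
    if not ones:
        return (0, 0)
    start, length = ones[0], len(ones)
    if ones[-1] - start + 1 != length:
        raise ValueError("row does not have consecutive ones")
    return (start, length)


def decode_row(code, width):
    """Rebuild the 0/1 row of the given width from its (start, length) code."""
    start, length = code
    return [1 if start <= i < start + length else 0 for i in range(width)]


row = [0, 1, 1, 1, 0]
code = encode_row(row)
print(code, decode_row(code, len(row)))
```

A general binary row of width $n$ needs $n$ bits, whereas this code needs two numbers between $0$ and $n$, which is the kind of saving an MDL-style score could exploit.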

Appendix A Proofs

Proof of Theorem 1.

In this case, we are looking for a decomposition of the form $\bm{{D}}\approx\bm{x}^{T}\bm{y}$, where $\bm{{D}}\in\{0,1\}^{n\times m}$, $\bm{x}\in\{0,1\}^{n}$, and $\bm{y}\in\{0,1\}^{m}$. Notice that (i) whether we use normal or Boolean algebra does not matter in this case; and (ii) we can always find the ordering after we have found the decomposition, as we only need to order the vectors $\bm{x}$ and $\bm{y}$. But this problem, the rank-1 binary matrix factorization problem, is known to be $\mathrm{NP}$-hard [7], finalizing the proof. ∎
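Observation (ii) can be illustrated directly: once the two vectors are known, sorting each so that its 1s come first yields an ordering in which the rank-1 reconstruction is a single contiguous block. The helper below is a hypothetical sketch of that step only; it does not address the hard part, which is finding the vectors.

```python
import numpy as np


def order_rank1(x, y):
    """Return row and column orders that place the 1s of x and y first."""
    row_order = np.argsort(-np.asarray(x), kind="stable")
    col_order = np.argsort(-np.asarray(y), kind="stable")
    return row_order, col_order


x = np.array([0, 1, 0, 1])
y = np.array([1, 0, 1])
rows, cols = order_rank1(x, y)
# Reconstruction with rows and columns permuted: the 1s form one block in the corner.
block = np.outer(x, y)[np.ix_(rows, cols)]
print(block)
```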

Proof of Theorem 2.

The decision problem is obviously in $\mathrm{NP}$ .

We prove the hardness by reduction from Hamiltonian path, where we are given a graph $G=(V,E)$ and asked whether there is a Hamiltonian path, that is, a path visiting every vertex exactly once.

Assume that we are given a graph $G=(V,E)$ with $n$ vertices and $m$ edges. Assume that we have some arbitrary order on the vertices $V=v_{1},\ldots,v_{n}$ , and on the edges $E=e_{1},\ldots,e_{m}$ .

Let us define $\bm{{D}}$ first. The dataset will be of size $(n+m+1)\text{-by-}(3m+1)$. To define the matrix, we split the rows in two parts $R=r_{1},\ldots,r_{n}$ and $S=s_{0},\ldots,s_{m}$, containing respectively $n$ and $m+1$ rows. Similarly, we split the columns in 3 parts, $X=x_{1},\ldots,x_{m}$, $Y=y_{1},\ldots,y_{m}$, and $Z=z_{0},\ldots,z_{m}$.

The 1s in $\bm{{D}}$ are as follows. For each edge $e_{\ell}=(v_{i},v_{j})$, we set the cells $(r_{i},x_{\ell})$, $(r_{j},x_{\ell})$, $(r_{i},y_{\ell})$, and $(r_{j},y_{\ell})$ to be 1. For two adjacent edges $e_{\ell}$ and $e_{\ell+1}$, we set the cells $(s_{\ell},y_{\ell})$, $(s_{\ell},z_{\ell})$, and $(s_{\ell},x_{\ell+1})$ to be 1. Finally, we set $(s_{0},x_{1})$, $(s_{0},z_{0})$, and $(s_{m},y_{m})$, $(s_{m},z_{m})$ to be 1. The remaining values are 0.
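The construction can be written out as a short sketch. Two points are assumptions on our part: "adjacent edges" is read as edges that are consecutive in the chosen edge order ($\ell=1,\ldots,m-1$), and the columns are laid out as $X$, then $Y$, then $Z$ (with $Z$ indexed $z_{0},\ldots,z_{m}$). The function name is hypothetical and the graph must have at least one edge.

```python
import numpy as np


def build_reduction_matrix(n, edges):
    """Build D for a graph with vertices 1..n and an ordered list of edges (i, j).

    Rows are r_1..r_n followed by s_0..s_m; columns are x_1..x_m, y_1..y_m, z_0..z_m.
    """
    m = len(edges)
    D = np.zeros((n + m + 1, 3 * m + 1), dtype=int)

    def r(i): return i - 1        # row index of r_i, i = 1..n
    def s(l): return n + l        # row index of s_l, l = 0..m
    def x(l): return l - 1        # column index of x_l, l = 1..m
    def y(l): return m + l - 1    # column index of y_l, l = 1..m
    def z(l): return 2 * m + l    # column index of z_l, l = 0..m

    # Each edge e_l = (v_i, v_j) marks its two endpoints in columns x_l and y_l.
    for l, (i, j) in enumerate(edges, start=1):
        for row in (r(i), r(j)):
            D[row, x(l)] = 1
            D[row, y(l)] = 1
    # Row s_l links y_l, z_l and x_{l+1} for consecutive edges e_l, e_{l+1}.
    for l in range(1, m):
        D[s(l), y(l)] = D[s(l), z(l)] = D[s(l), x(l + 1)] = 1
    # The first and last rows of S.
    D[s(0), x(1)] = D[s(0), z(0)] = 1
    D[s(m), y(m)] = D[s(m), z(m)] = 1
    return D


# Path graph v1 - v2 - v3 with n = 3 vertices and m = 2 edges.
D = build_reduction_matrix(3, [(1, 2), (2, 3)])
print(D.shape, int(D.sum()))
```

Under this reading the matrix has $4m$ ones from the edges, $3(m-1)$ from the rows $s_{1},\ldots,s_{m-1}$, and $4$ from $s_{0}$ and $s_{m}$.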

We argue that there is a zero-error solution for obmf using $k=3m-n+2$ if and only if there is a Hamiltonian path.

Let us prove the easy direction: assume that there is a Hamiltonian path. To that end, let us permute the rows and columns of $\bm{{D}}$ such that the factor matrices do not have gap zeros. Permute $\bm{{D}}$ as follows: Set the column order as $z_{0},x_{1},y_{1},z_{1},x_{2},y_{2},\ldots$. Order the rows in $R$ according to the Hamiltonian path, followed by the rows in $S$. We denote the resulting matrix by $\bm{{D}}^{\prime}$. There is a zero-error solution if the ones in $\bm{{D}}^{\prime}$ are a union of $k$ contiguous blocks. The $k$ blocks are as follows: $m+1$ blocks covering individual rows in $S$, $n-1$ blocks covering edges along the Hamiltonian path (this can be done since the corresponding rows in $R$ and the corresponding columns in $X$ and $Y$ are adjacent), and $2(m-n+1)$ blocks to cover the remaining edges, 2 blocks per edge. This covers all 1s using $m+1+n-1+2(m-n+1)=k$ blocks.
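As a sanity check of the block count, consider a triangle, which is our own illustrative instance with $n=3$ vertices and $m=3$ edges. Two of its edges lie on a Hamiltonian path and one does not, so the count works out as follows.

```latex
\begin{align*}
k &= 3m - n + 2 = 9 - 3 + 2 = 8,\\
\underbrace{m+1}_{\text{rows in } S}
 + \underbrace{n-1}_{\text{path edges}}
 + \underbrace{2(m-n+1)}_{\text{remaining edges}}
 &= 4 + 2 + 2 = 8 .
\end{align*}
```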

Let us prove the other direction. Assume that there is a zero-error solution, and let $\bm{{D}}^{\prime}$ be the permuted version of $\bm{{D}}$ with no gap zeros. Then the ones in $\bm{{D}}^{\prime}$ must be a union of $k$ contiguous blocks. For a column index $i$, we define $f_{i}$ to be the number of blocks that start at the $i$th column. Let us also define $g_{i}$ to be the number of blocks that end at the $i$th column. Trivially, $\sum_{i}(f_{i}+g_{i})=2k$.

We say that an edge $(v_{i},v_{j})\in E$ is active if the rows $r_{i}$ and $r_{j}$ are adjacent in $\bm{{D}}^{\prime}$. Let $h$ be the total number of active edges. Note that we have $h\leq n-1$. Assume for a moment that $h=n-1$ and let $w_{1},\ldots,w_{n}$ be the vertices ordered according to the order of $R$ in $\bm{{D}}^{\prime}$. Since $h=n-1$, we are forced to have $(w_{i},w_{i+1})\in E$. This implies that $w_{1},\ldots,w_{n}$ is a Hamiltonian path.

We will now argue that $h\geq n-1$ .

Consider two adjacent columns at $i$ and $i+1$. If neither of the columns is in $Z$, then each column contains a 1 that is not in the other column. This forces $g_{i}+f_{i+1}\geq 2$. The same argument holds if both columns are in $Z$.

Assume that the $i$th column is in $X$ and the $(i+1)$th column is in $Z$. Assume that $g_{i}+f_{i+1}=1$. Let $a$ and $b$ be the rows in $R$ that are active in the $i$th column. Since the columns in $Z$ have no 1s in the rows of $R$, the block(s) covering $a$ and $b$ must terminate at the $i$th column, so $g_{i}\geq 1$. As $g_{i}+f_{i+1}=1$, there is only one such block, implying that $a$ and $b$ are adjacent. The same result holds if we replace $X$ with $Y$ or swap the order of the two columns. To summarize, if $g_{i}+f_{i+1}=1$, then either the $i$th or the $(i+1)$th column corresponds to an active edge.

In addition, we must have $f_{1}\geq 1$ and $g_{3m+1}\geq 1$ as these columns have 1s. This leads to

[TABLE]

proving the result. ∎

Proof of Lemma 5.

Let $S$ be the optimal border-compatible set. Then there is $i$ such that $S$ is a union of the best border-compatible set of $c_{i}$ and either the union of all leaves in $c_{1},\ldots,c_{i-1}$ or $c_{i+1},\ldots,c_{\ell}$. ∎

Proof of Lemma 6.

Let $S$ be the optimal compatible set. Then $S$ is either included completely within one child, or there are indices $i<j$ such that $S$ is a union of the best border-compatible sets of $c_{i}$, $c_{j}$, and the union of all leaves in $c_{i+1},\ldots,c_{j-1}$. ∎

Proof of Lemma 7.

Let $S$ be the optimal border-compatible set. Then there is $i$ such that $S$ is a union of the best border-compatible set of $c_{i}$ and the union of all leaves of some children.

Let $w$ be a child of $v$. If $\mathit{total}\mathopen{}\left(w\right)\geq 0$, then having the leaves of $w$ in $S$ has a non-negative gain. Let $P$ be the set of these children. The total gain of having these children is $\sum_{i}\max(\mathit{total}\mathopen{}\left(c_{i}\right),0)$.

We need to transform one of the children to a partial. Let $w$ be a child of $v$. If $\mathit{total}\mathopen{}\left(w\right)<0$, then $w\notin P$ and adding $w$ will have a gain of $\mathit{border}\mathopen{}\left(w\right)$. If $\mathit{total}\mathopen{}\left(w\right)\geq 0$, then $w\in P$, and transforming $w$ from a fully-covered node to a partial node will have a gain of $\mathit{border}\mathopen{}\left(w\right)-\mathit{total}\mathopen{}\left(w\right)$. In summary, the gain is equal to $g(w)$. Thus, the child with the maximal $g(w)$ should be the partial child in $S$. ∎
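The computation in this proof can be sketched as follows. The representation of a child as a `(total, border)` pair and the function name are our own; $g(w)$ is taken to be $\mathit{border}(w)-\max(\mathit{total}(w),0)$, which is how the two cases above combine.

```python
def best_border_gain(children):
    """Gain of the best border-compatible set at a node.

    `children` is a non-empty list of (total, border) pairs, one per child.
    Every child with non-negative total is taken in full, and the child with
    the largest g(w) is then turned into the partial child.
    """
    full_gain = sum(max(total, 0) for total, _ in children)
    # g(w): change in gain when child w becomes the partial child.
    g = [border - max(total, 0) for total, border in children]
    return full_gain + max(g)


children = [(3, 4), (-2, 1), (5, 5)]
print(best_border_gain(children))
```

Only one pass over the children is needed, so the work at a node is linear in its number of children.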

Proof of Lemma 8.

Let $S$ be the optimal compatible set. Then $S$ is either included completely within one child, or $S$ is a union of some children and possibly up to two of the best border-compatible sets for some $c_{i}$ and $c_{j}$.

Let $w$ be a child of $v$. If $\mathit{total}\mathopen{}\left(w\right)\geq 0$, then having the leaves of $w$ in $S$ has a non-negative gain. Let $P$ be the set of these children. The total gain of having these children is $\sum_{i}\max(\mathit{total}\mathopen{}\left(c_{i}\right),0)$.

As shown in the proof of Lemma 7, $b_{1}$ and $b_{2}$ correspond to the top-2 border-compatible sets. It may happen that $b_{1}$ or $b_{2}$ is negative, in which case we simply do not add the corresponding set to $S$. Thus the total gain of border-compatible sets is $\max(b_{1},0)+\max(b_{2},0)$. ∎
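A sketch of the union case of this proof is given below. It assumes that $b_{1}$ and $b_{2}$ are the two largest values of $g(w)=\mathit{border}(w)-\max(\mathit{total}(w),0)$ over the children, and it deliberately leaves out the other case, where $S$ lies completely within one child. The `(total, border)` representation and the function name are hypothetical.

```python
def union_case_gain(children):
    """Gain when S is a union of full children plus up to two partial children.

    `children` is a list of (total, border) pairs, one per child.
    """
    full_gain = sum(max(total, 0) for total, _ in children)
    g = sorted((border - max(total, 0) for total, border in children), reverse=True)
    b1 = g[0] if len(g) > 0 else 0
    b2 = g[1] if len(g) > 1 else 0
    # A negative b means that making that child partial does not pay off.
    return full_gain + max(b1, 0) + max(b2, 0)


print(union_case_gain([(3, 4), (-2, 1), (5, 5)]))
print(union_case_gain([(3, 1), (2, 0)]))
```

The full answer for a node would be the larger of this value and the best compatible set found inside a single child.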

Appendix B Further Visualizations

Here we present further visualizations for the Terms and Locations data sets.

The Terms data

The visualization of the Terms data, in Figure 5, is markedly different from Figure 3. Here, most bundles overlap each other. This indicates that many of these terms are used together in different posts. Yet, we can also identify specialized groups of terms. At the left of Figure 5, we have a blue bundle, from mission to nasa, that contains terms used when discussing space programs. This overlaps with a larger orange bundle, from chip to tap, containing terms related to cryptography.

The Locations data

For the Locations data, in Figure 6, we cannot print any labels, as the data consists of $3203$ geographical locations. For these results, we did a rank-$10$ decomposition. Most of the edge bundles again form segments along the edge of the circle, corresponding to locations with similar fauna. A few larger edge bundles cover most of these locations as well, corresponding to more general biospheres. In this figure, many nodes have no edges drawn. This indicates that they were not part of any significant quasi-clique.

Bibliography

[1] J. E. Atkins, E. G. Boman, and B. Hendrickson. A Spectral Algorithm for Seriation and the Consecutive Ones Problem. SIAM J. Comput., 28(1):297–310, 1998.
[2] R. Bělohlávek and V. Vychodil. Discovery of optimal factors in binary data via a novel method of matrix decomposition. J. Comput. Syst. Sci., 76(1):3–20, 2010.
[3] K. S. Booth and G. S. Lueker. Testing for the consecutive ones property, interval graphs, and graph planarity using pq-tree algorithms. J. Comput. Syst. Sci., 13(3):335–379, 1976.
[4] E. Cergani and P. Miettinen. Discovering relations using matrix factorization methods. In CIKM ’13, pages 1549–1552, 2013.
[5] G. Corrado, T. Tebaldi, G. Bertamini, F. Costa, A. Quattrone, G. Viero, and A. Passerini. PTRcombiner: mining combinatorial regulation of gene expression from post-transcriptional interaction maps. BMC Genomics, 15(1), Apr. 2014.
[6] F. Geerts, B. Goethals, and T. Mielikäinen. Tiling databases. In DS ’04, pages 278–289, 2004.
[7] N. Gillis and S. A. Vavasis. On the Complexity of Robust PCA and $\ell_{1}$-norm Low-Rank Matrix Approximation. arXiv, 2015.
[8] A. Gionis, H. Mannila, and J. K. Seppänen. Geometric and Combinatorial Tiles in 0–1 Data. In PKDD ’04, pages 173–184, 2004.


This is an extended version of the paper of the same name presented at the 2019 SIAM International Conference on Data Mining.
