Boolean matrix factorization meets consecutive ones property
Nikolaj Tatti, Pauli Miettinen

TL;DR
This paper introduces a new variant of Boolean matrix factorization with the consecutive ones property, applies it to graph visualization, proves its computational hardness, and proposes efficient greedy algorithms with strong experimental results.
Contribution
It formulates the OBMF problem, proves its NP-hardness, and develops a greedy algorithm whose alternating update step is solved in linear time using pq-trees.
Findings
OBMF is NP-hard and hard to approximate.
The proposed greedy algorithm finds high-quality factorizations.
Algorithms scale well and are effective for graph visualization tasks.
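Table 1. Data set properties.

| data | rows | cols | % of 1s | sym. | rank |
|---|---|---|---|---|---|
| Les Misérables | | | | Yes | |
| Paleo | | | | No | |
| Newsgroups | | | | No | |
| Terms | | | | Yes | |
| Locations | | | | Yes | |
| Mammals | | | | Yes | |

Table 2. Relative reconstruction errors.

| | Les Mis | Paleo | News | Terms | Locations | Mammals |
|---|---|---|---|---|---|---|
| obmf | | | | | | |
| cobmf | | | | | | |
| asso | | | | | | |

Table 3. Average relative reconstruction errors and standard deviations with sampled seeds.

| | Les Misérables (avg) | Les Misérables (std) | Paleo (avg) | Paleo (std) | Newsgroups (avg) | Newsgroups (std) | Terms (avg) | Terms (std) | Locations (avg) | Locations (std) | Mammals (avg) | Mammals (std) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| obmf | | | | | | | | | | | | |
| cobmf | | | | | | | | | | | | |
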
This is an extended version of the paper of the same name presented at the 2019 SIAM International Conference on Data Mining.
Nikolaj Tatti, University of Helsinki, Helsinki, Finland
Pauli Miettinen, University of Eastern Finland, Kuopio, Finland. Part of this work was done while the author was with MPI-INF, Saarbrücken, Germany.
Abstract
Boolean matrix factorization is a natural and popular technique for summarizing binary matrices. In this paper, we study a variant of Boolean matrix factorization where we additionally require that the factor matrices have the consecutive ones property (OBMF). A major application of this optimization problem comes from graph visualization: standard techniques for visualizing graphs are circular or linear layouts, where nodes are ordered on a circle or on a line. A common problem with visualizing graphs is clutter due to too many edges. The standard approach to deal with this is to bundle edges together and represent them as ribbons. We show that we can use OBMF for edge bundling combined with circular or linear layout techniques.
We demonstrate that not only is this problem NP-hard, but we also cannot have a polynomial-time algorithm that yields a multiplicative approximation guarantee (unless P = NP). On the positive side, we develop a greedy algorithm where at each step we look for the best rank-1 factorization. Since even obtaining a rank-1 factorization is NP-hard, we propose an iterative algorithm where we fix one side and find the other, reverse the roles, and repeat. We show that this step can be done in linear time using pq-trees. We also extend the problem to the cyclic ones property and to symmetric factorizations. Our experiments show that our algorithms find high-quality factorizations and scale well.
1 Introduction
Matrix factorization is an immensely popular way of summarizing data as well as discovering signal in the data. While useful, the discovered factor matrices may be difficult to interpret and visualize. A popular variant for factorizing binary matrices is Boolean matrix factorization, which, essentially, summarizes the binary data as a union of tiles, that is, submatrices full of 1s. However, visualizing such a factorization is difficult as the discovered rows and columns can be any sets, and there is no insightful way of visualizing them all at once.
In this paper we consider Boolean matrix factorization such that the resulting matrix has a certain property: we can order the columns and the rows such that the matrix consists of a union of contiguous tiles. We do not know the order beforehand, and we discover the order as we discover the factorization.
Our motivation for discovering such a factorization is primarily the easy exploration of the result: we can draw the factorization as tiles. While in certain cases such a constraint may be too restrictive, there are many settings where this constraint comes naturally. As a specific example, consider visualizing graphs. A classic technique for visualizing a graph is using a linear or circular layout, where the nodes are drawn on a line or a circle, and they are connected with arcs. The most common problem with visualizing graphs is clutter due to too many edges. To combat the clutter, edges are often grouped, and drawn as ribbons (see Figure 3 for an example). The problem is to discover such ribbons and the node order, while minimizing the error. We show that we can use matrix factorization on the adjacency matrix of a graph to find the order and the groups.
We show that the factorization we seek can be expressed with the consecutive ones property (C1P). Namely, we will look for factor matrices $\mathbf{X}$ and $\mathbf{Y}$ whose columns can be shuffled such that each row has the form $[0, \ldots, 0, 1, \ldots, 1, 0, \ldots, 0]$. We show that the problem is NP-hard, even if $k = 1$, and that it cannot be approximated within any multiplicative factor in polynomial time (unless P = NP). On the positive side, we propose a greedy algorithm that searches for the factors in an iterative manner. The search is done by first fixing a vector in $\mathbf{X}$ and finding the optimal counterpart in $\mathbf{Y}$, then fixing the vector in $\mathbf{Y}$ and finding the optimal vector in $\mathbf{X}$, and so on, until convergence. We show that we can find the optimal counterpart in linear time using pq-trees.
We also consider three extensions of this factorization. The first variant, cyclic decomposition, allows factors to “wrap around the border.” The second variant is specifically designed for symmetric matrices, while the last variant combines the two. Performing cyclic and symmetric decomposition proves to be useful for the circular layout of graphs.
The rest of the paper is organized as follows: We present preliminary notation and define the matrix factorization and the cyclic version in Section 2. We present the search algorithm in Section 3. The symmetric extensions are given in Section 4. Section 5 is dedicated to experimental evaluation, and Section 6 to related work. Finally, we conclude the paper with remarks given in Section 7. All proofs are given in Appendix A.
2 Preliminary notation and problem definitions
We begin by presenting preliminary notation, and then present the two main problem definitions. Extended problems are discussed in Section 4.
2.1 Notation
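Given an $n \times k$ binary matrix $\mathbf{A}$ and a $k \times m$ binary matrix $\mathbf{B}$, the Boolean matrix product $\mathbf{A} \circ \mathbf{B}$ is defined element-wise as
$$(\mathbf{A} \circ \mathbf{B})_{ij} = \bigvee_{\ell=1}^{k} a_{i\ell} b_{\ell j}. \tag{1}$$
The Boolean matrix sum of two binary matrices $\mathbf{A}$ and $\mathbf{B}$ of the same size is defined element-wise as $(\mathbf{A} \lor \mathbf{B})_{ij} = a_{ij} \lor b_{ij}$.
To measure the distance between two binary matrices $\mathbf{A}$ and $\mathbf{B}$ of the same size, we use the squared Frobenius norm of their (normal) difference, $\lVert \mathbf{A} - \mathbf{B} \rVert_F^2$. Notice that as $\mathbf{A}$ and $\mathbf{B}$ are both binary, this is the same as the number of disagreements between $\mathbf{A}$ and $\mathbf{B}$: $\lVert \mathbf{A} - \mathbf{B} \rVert_F^2 = \lvert \{(i, j) : a_{ij} \neq b_{ij}\} \rvert$.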
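The two definitions above can be made concrete with a few lines of Python. This is only an illustrative sketch: matrices are plain lists of 0/1 lists, and the helper names `bool_product` and `disagreements` are hypothetical, not taken from the authors' C++ code.

```python
def bool_product(A, B):
    """Boolean product of an n-by-k and a k-by-m 0/1 matrix (lists of lists)."""
    n, k, m = len(A), len(B), len(B[0])
    # Entry (i, j) is 1 if some l has A[i][l] = B[l][j] = 1 (OR of ANDs).
    return [[int(any(A[i][l] and B[l][j] for l in range(k))) for j in range(m)]
            for i in range(n)]


def disagreements(A, B):
    """Number of cells in which two equally sized 0/1 matrices differ."""
    return sum(a != b for row_a, row_b in zip(A, B) for a, b in zip(row_a, row_b))


A = [[1, 0], [1, 1], [0, 1]]
B = [[1, 1, 0], [0, 1, 1]]
P = bool_product(A, B)          # the middle row is all ones, not [1, 2, 1]
D = [[1, 1, 0], [1, 0, 1], [0, 1, 1]]
error = disagreements(D, P)     # squared Frobenius distance for binary inputs
```

Because both arguments are binary, counting differing cells gives the same number as the squared Frobenius norm of the ordinary difference.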
We say that a binary matrix has the consecutive ones property (C1P) if its columns can be permuted such that each row has the form $[0, \ldots, 0, 1, \ldots, 1, 0, \ldots, 0]$, that is, the 1s form a contiguous interval. For the sake of presentation, we will also refer to these matrices as unimodal.
We say that a binary matrix is cyclic if its columns can be permuted such that each row has the form $[0, \ldots, 0, 1, \ldots, 1, 0, \ldots, 0]$ or $[1, \ldots, 1, 0, \ldots, 0, 1, \ldots, 1]$.
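To make the two definitions tangible, the following sketch tests them by brute force over all column orders. It is exponential in the number of columns and is meant for tiny examples only; the function names are hypothetical, and this is not the pq-tree machinery used later in the paper.

```python
from itertools import permutations


def runs_of_ones(bits, cyclic):
    """Number of maximal runs of 1s; a cyclic scan joins the last and first cell."""
    n = len(bits)
    first = 0 if cyclic else 1
    # A run starts wherever a 1 follows a 0 (bits[-1] is the last cell).
    runs = sum(1 for p in range(first, n) if bits[p] == 1 and bits[p - 1] == 0)
    if not cyclic and n and bits[0] == 1:
        runs += 1
    return runs


def find_order(M, cyclic=False):
    """First column order under which every row has at most one run, else None."""
    for order in permutations(range(len(M[0]))):
        if all(runs_of_ones([row[j] for j in order], cyclic) <= 1 for row in M):
            return order
    return None


unimodal = [[1, 0, 1], [0, 1, 1]]
triangle = [[1, 1, 0], [0, 1, 1], [1, 0, 1]]   # every pair of columns co-occurs
```

The `triangle` matrix is cyclic but not unimodal: on a line, only two of the three column pairs can be adjacent, whereas on a circle all three can.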
2.2 Problem definitions
Next we will give our two main optimization problems.
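**Problem 1 (Ordered BMF, obmf).**
Given an $n \times m$ binary matrix $\mathbf{D}$ and an integer $k$, find two unimodal binary matrices $\mathbf{X}$ of size $k \times n$ and $\mathbf{Y}$ of size $k \times m$ that minimize the number of disagreements
$$\lVert \mathbf{D} - \mathbf{X}^T \circ \mathbf{Y} \rVert_F^2. \tag{2}$$
**Problem 2 (Cyclic Ordered BMF, cobmf).**
Given an $n \times m$ binary matrix $\mathbf{D}$ and an integer $k$, find two cyclic binary matrices $\mathbf{X}$ of size $k \times n$ and $\mathbf{Y}$ of size $k \times m$ that minimize the number of disagreements
$$\lVert \mathbf{D} - \mathbf{X}^T \circ \mathbf{Y} \rVert_F^2. \tag{3}$$
The matrix $\mathbf{X}^T \circ \mathbf{Y}$ given in Eq. 2 has another natural characterization: the columns and the rows of $\mathbf{X}^T \circ \mathbf{Y}$ can be permuted such that the resulting matrix is a union of contiguous tiles of 1s. Similarly, the matrix given in Eq. 3 can be permuted such that the resulting matrix is a union of contiguous tiles, but we also allow the tiles to wrap around the border.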
Unsurprisingly, the problems are computationally intractable. First, we demonstrate that obmf is difficult even if $k = 1$.
**Theorem 1.**
The obmf problem is NP-hard, even if $k = 1$.
Our next result shows that obmf is not only difficult to solve, but also impossible to approximate. To show this, it is enough to demonstrate that testing for a zero-error solution is expensive.
**Theorem 2.**
Deciding whether obmf has a zero-error solution is NP-complete.
The proofs of these and other statements are given in Appendix A.
3 Iterative greedy algorithm
3.1 Greedy algorithm
As we saw in the previous section, not only is the problem NP-hard, but we also cannot construct any polynomial-time algorithm with a multiplicative guarantee. Hence, we need to resort to heuristics. The most natural heuristic is a greedy one, where given a $(k-1)$-sized factorization we look for a $k$-sized factorization by adding one row to each of $\mathbf{X}$ and $\mathbf{Y}$ (equivalently, one column to $\mathbf{X}^T$ and one row to $\mathbf{Y}$). Note that these rows need to be selected carefully such that $\mathbf{X}$ and $\mathbf{Y}$ remain unimodal, and we also need to maintain the permutation(s).
Unfortunately, Theorem 1 states that we cannot even find the best solution for $k = 1$ in polynomial time. Fortunately, we can quickly solve a subproblem where we have fixed one side.
**Problem 3 (Ordered BMF step, obmfstep).**
Given a binary matrix $\mathbf{D}$ of size $n \times m$ and two unimodal matrices, $\mathbf{X}$ of size $k \times n$ and $\mathbf{Y}$ of size $(k-1) \times m$, find the decomposition $\mathbf{X}^T \circ \mathbf{Y}'$ solving obmf such that $\mathbf{Y}'$ is unimodal and is obtained by adding one new row to $\mathbf{Y}$.
We can use obmfstep as follows. Assume that we have already found matrices $\mathbf{X}$ and $\mathbf{Y}$. We first extend $\mathbf{X}$ with a new row using a given seed (the strategy for selecting the seed is given later), and find the optimal new row for $\mathbf{Y}$ using obmfstep. We fix the discovered row, and use obmfstep to find the corresponding row for $\mathbf{X}$. Since we solve each step optimally, the error will never increase. We stop when the error stops decreasing. Note that we need to provide a seed for the initial row in $\mathbf{X}$. Here, we test several possible seeds and select the best. We experiment with several options in the experiments, but the default is that the seed set is equal to all singleton columns. The pseudo-code for the algorithm is given in Algorithm 1.
The remainder of this section is about solving obmfstep in linear time. Almost the same approach also works for the cyclic version, cobmfstep; we will point out the minute difference.
3.2 Expressing permutations with pq-trees
The complicated aspect of obmfstep is that we need to make sure that the new matrix is unimodal. Luckily, we can use pq-trees, a classic structure that allows us to express every permutation under which a set of binary vectors remains unimodal. In this section we give a brief review of pq-trees and the two main properties that are relevant to us.
Assume that we are given a universe $U$; in our case this will be either the rows or the columns of the input matrix. A pq-tree is a tree in which each leaf corresponds to an element of $U$. There are two types of non-leaf nodes, and these types dictate which permutations we can perform on the children. We can permute the children of a p-node in any order, whereas the order of the children of a q-node is fixed but we can flip the direction. The leaves of the permuted tree then indicate an order. We denote the set of such orders by $\mathrm{ord}(T)$, where $T$ is the pq-tree.
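The rules for p-nodes and q-nodes can be illustrated by enumerating every leaf order a small tree allows. The sketch below uses a hypothetical tuple encoding of nodes, `("leaf", x)`, `("P", children)` and `("Q", children)`; it is exponential and only meant to show the semantics, not how pq-trees are processed in practice.

```python
from itertools import permutations, product


def orders(node):
    """All leaf orders allowed by a pq-tree given as nested tuples."""
    kind, payload = node
    if kind == "leaf":
        return [[payload]]
    # p-node: children in any arrangement; q-node: given order or its reverse.
    arrangements = permutations(payload) if kind == "P" else [payload, payload[::-1]]
    result = []
    for arrangement in arrangements:
        # Each child subtree picks its own order independently of the others.
        for parts in product(*(orders(child) for child in arrangement)):
            order = [x for part in parts for x in part]
            if order not in result:
                result.append(order)
    return result


def leaf(x):
    return ("leaf", x)


tree = ("P", [leaf("a"), ("Q", [leaf("b"), leaf("c"), leaf("d")])])
allowed = orders(tree)
```

Here `a` may sit on either side of the q-node, and the q-node may be read as `b c d` or `d c b`, which gives four orders in total; `c` always stays between `b` and `d`.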
Two seminal results are important to us. The first result states that there is a pq-tree $T$ such that $\mathrm{ord}(T)$ contains exactly the orders under which a set of binary vectors remains unimodal.
**Theorem 3 (Booth and Lueker [3]).**
Given a universe $U$ and sets $S_1, \ldots, S_\ell \subseteq U$ that can be made contiguous simultaneously, there is a pq-tree $T$ such that $\mathrm{ord}(T)$ contains exactly the permutations of $U$ under which each $S_i$ is contiguous.
The second result states that we can efficiently update the pq-tree.
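**Theorem 4 (Booth and Lueker [3]).**
Assume that we have a pq-tree $T$ over a universe $U$ and a set $S \subseteq U$. Let $\Pi_S$ be the set of all permutations of $U$ where $S$ is contiguous. If $\mathrm{ord}(T) \cap \Pi_S \neq \emptyset$, then there is an $O(\lvert U \rvert)$-time algorithm that constructs a tree $T'$ such that $\mathrm{ord}(T') = \mathrm{ord}(T) \cap \Pi_S$. If $\mathrm{ord}(T) \cap \Pi_S = \emptyset$, then the same algorithm detects a failure.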
The detailed description of the algorithm for updating the pq-tree can be found in [3].
3.3 Finding the optimal row
In this section we describe the algorithm that solves obmfstep. Assume that we have a pq-tree $T$ representing the permutations of the columns of $\mathbf{D}$ allowed by the previously discovered rows in $\mathbf{Y}$. When dealing with pq-trees it is notationally easier to deal with sets rather than with vectors. Naturally, every binary vector $\mathbf{y}$ can be represented as the set $\{j : y_j = 1\}$.
Let us define $U$ to be the column indices of $\mathbf{D}$; these are exactly the leaves of $T$. We say that a set $S \subseteq U$ is compatible with a pq-tree $T$ if there is an order in $\mathrm{ord}(T)$ where $S$ is contiguous. Obviously, compatible sets correspond exactly to suitable new rows in $\mathbf{Y}$.
We can express obmfstep as an instance of the following problem.
**Problem 4 (optset).**
Given a universe $U$, weights $w(u)$ for each $u \in U$, and a pq-tree $T$ over the universe $U$, find a set $S \subseteq U$ that is compatible with $T$ and maximizes the total weight $\sum_{u \in S} w(u)$.
Recall that $u \in U$ corresponds to a column index of $\mathbf{D}$. Define $w(u)$ to be the gain in the error function if we were to use $u$ in our new row for $\mathbf{Y}$. More formally, let $\mathbf{x}$ be the fixed counterpart in $\mathbf{X}$ for the new row in $\mathbf{Y}$. Let $p(u)$ be the number of ones in $\mathbf{D}$ at rows $\{i : x_i = 1\}$ and column $u$ that are not yet covered by the previous factors. Let $q(u)$ be the number of zeros in $\mathbf{D}$ at rows $\{i : x_i = 1\}$ and column $u$ that are not yet covered by the previous factors. We define $w(u) = p(u) - q(u)$. Solving optset with these weights solves obmfstep.
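The weights just described, and the alternation of Section 3.1 that uses them, can be sketched as follows. This is a simplified stand-in rather than the authors' algorithm: it assumes the row and column orders are already fixed, so each step reduces to a maximum-sum interval (Kadane's algorithm) instead of a search over a pq-tree. The names `best_interval`, `side_weights` and `fit_tile`, and the `covered` mask marking cells already explained by earlier factors, are hypothetical.

```python
def best_interval(w):
    """Maximum-sum non-empty contiguous interval; returns (gain, lo, hi)."""
    best = (w[0], 0, 0)
    cur, lo = 0, 0
    for j, x in enumerate(w):
        if cur <= 0:
            cur, lo = x, j      # restart the interval at j
        else:
            cur += x
        if cur > best[0]:
            best = (cur, lo, j)
    return best


def side_weights(D, covered, fixed, fixed_is_rows):
    """Gain of each index on the free side: uncovered 1s minus uncovered 0s
    among the cells selected by the fixed 0/1 vector."""
    n, m = len(D), len(D[0])
    weights = []
    for f in range(m if fixed_is_rows else n):
        gain = 0
        for s, on in enumerate(fixed):
            if not on:
                continue
            i, j = (s, f) if fixed_is_rows else (f, s)
            if not covered[i][j]:
                gain += 1 if D[i][j] else -1
        weights.append(gain)
    return weights


def fit_tile(D, covered, seed_rows):
    """Alternate between the column side and the row side until the gain stalls."""
    x, gain = list(seed_rows), None
    while True:
        _, lo, hi = best_interval(side_weights(D, covered, x, True))
        y = [int(lo <= j <= hi) for j in range(len(D[0]))]
        g, lo, hi = best_interval(side_weights(D, covered, y, False))
        x = [int(lo <= i <= hi) for i in range(len(D))]
        if gain is not None and g <= gain:
            return x, y, g
        gain = g


D = [[1, 1, 0, 0],
     [1, 1, 0, 0],
     [0, 0, 1, 1]]
covered = [[False] * 4 for _ in range(3)]
tile = fit_tile(D, covered, [1, 0, 0])   # seed: the first row only
```

Starting from the first row as the seed, the column weights are `[1, 1, -1, -1]`, so the first two columns are chosen; the row step then extends the tile to the first two rows, after which the gain no longer improves.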
In order to solve cobmfstep, we solve optset using the weights $w$, as above, yielding a set, say $P$. In addition, we also solve optset using the negated weights $-w$, yielding a set, say $Q$. Then, we use either $P$ or the complement $U \setminus Q$, whichever yields a better gain.
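The complement trick can be checked on a fixed circular order, where a cyclic set is either an interval or the complement of one. The sketch below is a simplified illustration under that assumption (a fixed order instead of a pq-tree, and a quadratic interval search for brevity); `best_interval` and `best_cyclic` are hypothetical names.

```python
def best_interval(w):
    """Best contiguous interval, empty set allowed; returns (gain, indices)."""
    best, best_set = 0, []
    for i in range(len(w)):
        s = 0
        for j in range(i, len(w)):
            s += w[j]
            if s > best:
                best, best_set = s, list(range(i, j + 1))
    return best, best_set


def best_cyclic(w):
    """Best interval on a circle: a plain interval or the complement of one."""
    gain_p, p = best_interval(w)
    # Maximising -w over intervals finds the cheapest interval to leave out.
    gain_q, q = best_interval([-x for x in w])
    wrap_gain = sum(w) + gain_q
    if gain_p >= wrap_gain:
        return gain_p, p
    return wrap_gain, [j for j in range(len(w)) if j not in q]


wrapped = best_cyclic([3, -4, 1, 2])
plain = best_cyclic([1, 2, -5])
```

In the first call the best plain interval has gain 3, but leaving out the single element with weight -4 gives the wrap-around set {2, 3, 0} with gain 6. In the second call wrapping does not help.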
In order to solve optset, we need an additional definition: Let $S$ be a compatible set of a pq-tree $T$. If there is a permutation in $\mathrm{ord}(T)$ in which $S$ is contiguous and contains the first or the last element, we call $S$ a border-compatible set.
Let $T$ be a pq-tree. To solve optset we will compute 3 counters for a node $v$ in $T$, namely, $t(v)$, $o(v)$, and $b(v)$. The counter $t(v)$ corresponds to the total weight of the leaves under $v$, while the counter $o(v)$ corresponds to the best set that is compatible with the subtree rooted at $v$. Finally, $b(v)$ corresponds to the best set that is border-compatible with the subtree rooted at $v$.
We should stress that, strictly by definition, $o(v)$ can represent an empty set, whereas $b(v)$ and $t(v)$ should never be empty, even if they produce a negative value. Thus, $o(v) \geq 0$ but $b(v)$ and $t(v)$ can have negative values. Moreover, it is possible that $b(v)$ represents every leaf of $v$, in which case $b(v) = t(v)$.
Naturally, we want to compute $o(r)$, where $r$ is the root of $T$. To obtain this value we compute each value iteratively, children first. We also maintain the lists of the children that were responsible for producing the optimal value. These lists are clear from the proofs of the following lemmata. This allows us to extract the optimal set.
First, note that computing $t(v)$ is trivial since $t(v)$ is the sum of $t(c)$ over the children $c$ of $v$. If $v$ is a leaf-node, then $t(v) = b(v) = w(v)$ and $o(v) = \max(w(v), 0)$.
The next two lemmata establish how to compute the counters for q-nodes.
**Lemma 5.**
Let $v$ be a q-node and let $c_1, \ldots, c_\ell$ be its children. Then
[TABLE]
**Lemma 6.**
Let $v$ be a q-node and let $c_1, \ldots, c_\ell$ be its children. Then
[TABLE]
Our next step is to compute the counters for p-nodes. For that we need to define the following helper function: given a node we define . We will use in the next two lemmata describing on how to compute the counters for p-node.
**Lemma 7.**
Let be a p-node and let be its children. Define . Then
[TABLE]
Note that since we require the set responsible for be non-empty, it is possible that . This can happen only if and every child of has .
**Lemma 8.**
Let be a p-node and let be its children. Define and be the top-2 values of . Then
[TABLE]
Note that using these lemmas every counter can be trivially solved in linear time, except for , where is q-node. To compute in linear time, it is enough if we can solve
[TABLE]
in constant time for a fixed . Luckily, we can rewrite this function as
[TABLE]
where
[TABLE]
Let to be the optimal for a fixed . Since
[TABLE]
we have either or . If we were to test each consecutively, then this allows us to compute in constant time: we simply compare the solution to the best previous solution .
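In summary, each counter of a node $v$ can be computed in $O(d_v)$ time, where $d_v$ is the number of children of $v$. Thus we need $O(N)$ time in total, where $N$ is the number of nodes in $T$. Since $N \in O(m)$, we can compute the counters in $O(m)$ time, where $m$ is the number of columns in $\mathbf{D}$.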
When computing the counters we also store which children were responsible for this value. Once we have computed , where is the root of the tree, we can backtrack to obtain the optimal . This can be also done in linear time.
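A bottom-up computation in the spirit of this section can be sketched as follows. The recurrences below are an independent derivation, not a transcription of the paper's lemmata, and they deviate in one assumption: the border counter is allowed to represent an empty set, which keeps the formulas short. Nodes use a hypothetical tuple encoding, `("leaf", x)`, `("P", children)` and `("Q", children)`; only the optimal value is returned (backtracking is omitted), and a brute-force checker is included for comparison on tiny trees.

```python
from itertools import permutations, product


def counters(node, w):
    """Return (total, border, best) for a subtree.
    total: weight of all leaves; border: best prefix of some allowed order
    (may be empty); best: best contiguous block (may be empty)."""
    kind, payload = node
    if kind == "leaf":
        x = w[payload]
        return x, max(x, 0), max(x, 0)
    kids = [counters(c, w) for c in payload]
    total = sum(t for t, _, _ in kids)
    best = max(o for _, _, o in kids)            # block inside a single child
    if kind == "P":
        z = sum(max(t, 0) for t, _, _ in kids)   # take profitable children whole
        # Extra gain of cutting a child at its border instead of whole/none.
        gains = sorted((b - max(t, 0) for t, b, _ in kids), reverse=True)
        return total, z + gains[0], max(best, z + sum(gains[:2]))
    border = 0
    for seq in (kids, kids[::-1]):               # a q-node can be flipped
        run = 0
        for t, b, _ in seq:
            border = max(border, run + b)
            run += t
    run, left = 0, float("-inf")                 # left = max_i (b_i - T_i), i < j
    for t, b, _ in kids:
        best = max(best, left + run + b)         # run = T_{j-1}
        left = max(left, b - (run + t))
        run += t
    return total, border, best


def orders(node):
    """All leaf orders allowed by the tree (exponential; for checking only)."""
    kind, payload = node
    if kind == "leaf":
        return [[payload]]
    arrangements = permutations(payload) if kind == "P" else [payload, payload[::-1]]
    return [[x for part in parts for x in part]
            for arr in arrangements
            for parts in product(*(orders(c) for c in arr))]


def brute_force(node, w):
    """Best weight of a contiguous block over all allowed orders."""
    best = 0
    for order in orders(node):
        for i in range(len(order)):
            s = 0
            for j in range(i, len(order)):
                s += w[order[j]]
                best = max(best, s)
    return best


def leaf(x):
    return ("leaf", x)


tree = ("Q", [("P", [leaf("a"), leaf("b"), leaf("c")]), leaf("d"),
              ("P", [leaf("e"), leaf("f")])])
w = {"a": 2, "b": -3, "c": 1, "d": -1, "e": 4, "f": -2}
```

On this example the best compatible set is {a, c, d, e} with weight 6: the first p-node places `b` at the far left, the second places `f` at the far right, and the unavoidable `d` in the middle costs 1. The q-node loop keeps a running maximum over the left endpoint, which is the same kind of constant-time-per-child trick the text describes.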
Computing the weights in optset can be done in time, where is the number of 1s in the dataset of size . Consequently, obmfstep can be done in time.
4 Symmetric decomposition
We now propose an extension for symmetric matrices.
4.1 Definition
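If $\mathbf{D}$ is symmetric (e.g. an adjacency matrix of an undirected graph), we have the following problem:
**Problem 5 (Symmetric obmf, obmfsym).**
Given a symmetric binary matrix $\mathbf{D}$ and an integer $k$, find two binary matrices $\mathbf{X}$ and $\mathbf{Y}$ such that the stacked matrix $\begin{bmatrix} \mathbf{X} \\ \mathbf{Y} \end{bmatrix}$ is unimodal, that minimize the number of disagreements
$$\lVert \mathbf{D} - (\mathbf{X}^T \circ \mathbf{Y}) \lor (\mathbf{Y}^T \circ \mathbf{X}) \rVert_F^2.$$
We define similarly cobmfsym, a cyclic and symmetric variant of obmf.
The unimodality condition in obmfsym states that we should be able to permute $\mathbf{X}$ and $\mathbf{Y}$ with the same permutation so that the rows have the form $[0, \ldots, 0, 1, \ldots, 1, 0, \ldots, 0]$.
Notice that we do not use the more common symmetric decomposition $\mathbf{X}^T \circ \mathbf{X}$, as this would necessarily lead to having the blocks around the diagonal.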
4.2 Algorithm
The discovery algorithm for symmetric obmf is similar. As with the regular obmf, we use a greedy algorithm with an iterative step for discovering new rows.
The first difference is that we maintain only one pq-tree, corresponding to the rows in both $\mathbf{X}$ and $\mathbf{Y}$.
The second difference is that, as $\mathbf{X}^T \circ \mathbf{Y}$ and $\mathbf{Y}^T \circ \mathbf{X}$ can have overlapping 1s, maximizing optset does not necessarily produce the optimal row. Instead, we can show that solving optset, with the weights as described in the previous section, minimizes $\lVert \mathbf{D} - \mathbf{X}^T \circ \mathbf{Y} \rVert_F^2 + \lVert \mathbf{D} - \mathbf{Y}^T \circ \mathbf{X} \rVert_F^2$. It follows easily that minimizing this function yields a 2-approximation for finding the optimal counterpart row.
5 Experimental evaluation
In this section we study how well the algorithms from Sections 3 and 4.2 work with synthetic and real-world data. We denote the algorithms with the same names as the problems they are solving, and differentiate the algorithms from the problems via the font. That is, obmf is the algorithm for obmf, and so on. The algorithms are implemented in C++, and we make the source code and synthetic experiments freely available (https://cs.uef.fi/~pauli/bmf/ordered_bmf/).
5.1 Resilience to Noise
We start by evaluating the algorithms’ resilience to noise. To that end, we synthesized random matrices of size with block structure (6 blocks of size along the diagonal, with 5 overlapping rows and columns) and corrupted those matrices by flipping a varying number of entries. The amount of flipped entries varied from (of total elements) and we compared the quality of the results against both the noise-free matrix and the noisy matrix. The results are shown in Figure 1.
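A generator for this kind of synthetic data might look as follows. It is a sketch under assumptions: the block size, the number of blocks, the overlap and the noise fraction are free parameters here, and `block_matrix` and `flip_noise` are hypothetical names rather than the authors' code.

```python
import random


def block_matrix(n_blocks, block, overlap):
    """Square 0/1 matrix with square blocks of 1s along the diagonal;
    consecutive blocks share `overlap` rows and columns."""
    step = block - overlap
    size = step * (n_blocks - 1) + block
    M = [[0] * size for _ in range(size)]
    for b in range(n_blocks):
        lo = b * step
        for i in range(lo, lo + block):
            for j in range(lo, lo + block):
                M[i][j] = 1
    return M


def flip_noise(M, fraction, rng):
    """Flip a given fraction of all cells, chosen uniformly without replacement."""
    n, m = len(M), len(M[0])
    cells = rng.sample(range(n * m), round(fraction * n * m))
    noisy = [row[:] for row in M]
    for c in cells:
        noisy[c // m][c % m] ^= 1
    return noisy


clean = block_matrix(n_blocks=3, block=4, overlap=1)   # 10-by-10 example
noisy = flip_noise(clean, 0.1, random.Random(0))       # flips 10 of 100 cells
```

Keeping both `clean` and `noisy` mirrors the evaluation described above, where a factorization computed from the noisy matrix is compared against each of the two.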
With lower levels of noise ( for obmf and cobmf and for the symmetric variants), the reconstruction of the original data is more accurate. With higher levels of noise, the noise has destroyed so much of the structure that the algorithms start fitting to the noise only, with a clear reduction of the quality versus the original data.
It is also worth noticing that obmf obtains exact decompositions when the data has no noise; the other methods introduce a slight error even in these cases emphasizing their more complex setting.
5.2 Scalability
In this section we test how well obmf scales to larger data sets and how well it benefits from multiple cores. These experiments were executed on a server with 40 cores of Intel Xeon E7-4870 processors running at . The algorithm was compiled using GCC 8.1.0 and the parallel code uses the OpenMP library.
To test the scalability, we generated square matrices with for . All matrices have a density of approximately . The results are presented in Figure 2a.
The algorithm shows very good scalability over the full range, although it does get slower when the data size increases from to . It should be noted, though, that as the density is constant, the number of non-zeros in the matrices increases as the square of the matrix size. Hence, obmf exhibits linear growth with respect to the number of non-zero elements.
Algorithm 1 is almost embarrassingly parallel over the different seed vectors. Hence, we parallelized the testing of different seeds, and tested how the algorithm behaves with an increased number of cores. The results are in Figure 2b, where we can see that the speed-up is essentially linear up to cores, slightly slower until cores, and only marginal gains are available when increasing the number of cores to , indicating that at the algorithm has become memory bus constrained.
Overall, the experiments show that the algorithm scales very well, and is able to benefit from modern multi-core computers. We study further speed-up options later in Section 5.3.2.
5.3 Experiments with Real-World Data
We now turn to real-world data sets. We used six different real-world data sets, selected to offer a wide variety of different types of data. The data sets we used are as follows. Les Misérables is a standard benchmark data set (http://moreno.ss.uci.edu/data.html) of the characters of Victor Hugo’s novel Les Misérables. Paleo is a palaeontological data set (NOW 030717, http://www.helsinki.fi/science/now/) in the form of a locations-by-genera matrix, giving information on where different fossils have been found. Newsgroups is a subset of the famous 20Newsgroups data (http://qwone.com/~jason/20Newsgroups/) consisting of four newsgroups and terms. Terms is the terms-by-terms co-occurrence matrix based on Newsgroups. Locations is a locations-by-locations matrix indicating mammal species co-location in the northern hemisphere: the data has a 1 in element $(i, j)$ if locations $i$ and $j$ have at least five mammals in common. The data is based on the IUCN Red List data (http://www.iucnredlist.org/technical-documents/spatial-data). The final data set, Mammals, contains a species-by-species co-inhabitation matrix (available for research purposes from the Societas Europaea Mammalogica at http://www.european-mammals.org). The data set properties are summarized in Table 1.
To the best of our knowledge, this is the first work to address the ordered Boolean matrix factorization problem. To understand what kind of an effect the ordering constraint has on the reconstruction error, we compare our results with those of asso [15]. The asso algorithm is a well-known method for computing the standard Boolean matrix factorization. We used an implementation available from the author (https://cs.uef.fi/~pauli/basso/basso-0.5.tar.gz), set the rank for asso to be the same as for our algorithms, and used threshold values .
For symmetric data sets, we also computed the symmetric Boolean factorization. This was done by first computing the standard factorization, and then testing whether or gives smaller reconstruction error and using that one. This version of asso is denoted assosym.
5.3.1 Reconstruction errors
We first compute the reconstruction errors for the various data sets. To facilitate the comparisons, we report the relative reconstruction error
[TABLE]
The results of all datasets are given in Table 2.
In case of asymmetric decompositions, asso is – as expected, as its factor matrices are not restricted to unimodal or cyclic – almost always slightly better than either obmf or cobmf. This difference is, however, very small in many data sets (only in Les Misérables and in Paleo). A remarkable exception is the Mammals data, where asso is in fact worse than either obmf or cobmf. As the data set is the densest of the ones we tested, it is possible that asso was unable to obtain good candidates from it with the rounding thresholds we tried.
There is almost no difference between obmf and cobmf in terms of reconstruction error in these data sets. Usually, obmf is on par with or slightly better than cobmf, except again in Mammals, where cobmf is slightly better. The asymmetric data sets, Paleo and Newsgroups, cause the highest reconstruction errors at over . It should be noted, though, that asso also has similarly high errors with these data sets, indicating that they might not have strong Boolean low-rank structure.
In symmetric decompositions, the relationship between the ordered BMF algorithms and asso is reversed, with assosym being often the worse method (with the exception of Terms). This is not very surprising, given that asso is not designed for symmetric decompositions. The errors are slightly worse than with the asymmetric algorithms, highlighting the complexity of finding the symmetric decompositions.
5.3.2 Changing the seeds
In the above experiments, we used the columns as the seeds for the algorithm (cf. Algorithm 1). This slows the algorithm down, as it has to attempt all of the potential seeds. In this section we study if we can improve the running time without hurting the reconstruction error by sampling only some of the columns for the seed set .
In particular, we sampled of the columns uniformly at random to create the seed set. As the algorithm scales linearly with the number of seeds, this provides an order of magnitude speed-up. To test the quality, we repeated the sampling ten times and report the average relative reconstruction errors and standard deviations in Table 3.
The first thing to notice in Table 3 are the low standard deviations; less than in almost all data sets. The reconstruction errors are also only slightly higher than those in Table 2; for instance, obmf with Paleo has only higher error on average when using random sampling. In most cases the speed-up obtained by the sampling is significant compared to the loss in accuracy.
5.4 Visualizing the Graphs
One of the motivations for the ordered BMF is that it allows the convenient visualization of graphs using edge bundles (or ribbons) between nodes that are placed on a circle. In this section we explore some of these visualizations and explain what we can learn from the respective data sets using them. In the following plots, the edge bundles and the ordering are obtained from the factorization. Further visualizations can be found in Appendix B.
The Les Misérables data: The visualization of the Les Misérables data is presented in Figure 3. Most edge bundles form a circular segment indicating that all of the nodes under the segment are connected to each other (the characters appear in the same parts of the book). Some of the bundles are contained in other bundles, indicating an important subset of characters. Multiple bundles intersect at a node in the south-east of the circle called Valjean, the protagonist of the book.
The Mammals data: The second data set is the Mammals data, in Figure 4. For a clearer visualization, we only consider species that do not appear too frequently in the data, as such species are neighbours of every other species in the graph. The edge bundles in Figure 4 are essentially rotating around the middle. This probably corresponds to the change of fauna when moving from north to south. The change is gradual, hence two consecutive edge bundles have a significant overlap, but over a longer distance the change in the fauna becomes more obvious and the edge bundles are more disjoint. This gives a good intuition about the structure of the data.
6 Related Work
Boolean matrix factorization (BMF) has received increasing interest in the data analysis community [15, 12, 2, 17, 16, 13, 9, 10, 14, 11], proving to be a versatile tool for analyzing Boolean matrices. Many different algorithms have been proposed, including algorithms based on candidate creation and selection [15, 12], proximal alternations [10], and message passing [16], to name but a few. It has also found applications in diverse fields, such as bioinformatics [5], information extraction [4], and lifted inference [18]. To the best of our knowledge, however, the ordering constraint is not studied in earlier work related to Boolean matrix factorization.
Tiling databases [6] can be seen as a restricted version of BMF, where the factorization cannot express any 0s as 1s. Geometric tiling [8] is a variation thereof, where the tiles have to be consecutive. There are two main differences to our work: [8] uses a different optimization function, namely log-likelihood, and it assumes that the order is already given, for example, by spectral ordering, whereas we discover the order on the fly.
A binary matrix has the consecutive ones property (C1P) if its columns can be permuted so that all rows have all 1s consecutively. The pq-trees can be used to check for the C1P [3], and Atkins et al. [1] propose a spectral ordering algorithm. The spectral ordering approach is used in [8] to permute the data for finding the geometric tiles.
7 Conclusions
Ordered Boolean matrix factorization (obmf) and its variations (cobmf, obmfsym) are restricted versions of Boolean matrix factorization, requiring the factors to have the consecutive ones property (or be cyclic, in the case of cobmf). This restriction facilitates the interpretation of the factorization, in particular in the case of the edge bundle visualizations of graphs, as we saw in Section 5.4. On the other hand, the restriction yields higher reconstruction errors, though our experiments show that the difference to a state-of-the-art Boolean matrix factorization algorithm is usually very small.
In this paper we laid the theoretical foundations of the obmf problem and its variations, and proposed algorithms based on the pq-trees. An important part of the proposed algorithm is the choice of the seed vectors. In this paper, we mostly used all columns of the data as the seed, though the experiments in Section 5.3.2 show that sampling the columns could work equally well. An interesting question for the future is whether other methods for selecting the seeds would yield better reconstruction errors.
In the problem setting of this paper, the user provides the rank $k$ of the decomposition and the goal is to minimize the reconstruction error over the rank-$k$ obmf decompositions. A common variant in the Boolean matrix factorization world is to make the rank a free variable and replace the target function with a measure that penalizes higher ranks (see, e.g. [14, 12, 10]). The Minimum Description Length principle is a common approach. The ordered nature of our factor matrices could help with finding more efficient MDL decompositions, as the factor matrices are easier to compress using run-length encoding or similar approaches.
Appendix A Proofs
Proof of Theorem 1.
In this case, we are looking for a decomposition of the form $\mathbf{x}^T \mathbf{y}$, where $\mathbf{D} \in \{0,1\}^{n \times m}$, $\mathbf{x} \in \{0,1\}^{1 \times n}$, and $\mathbf{y} \in \{0,1\}^{1 \times m}$. Notice that (i) whether we use normal or Boolean algebra does not matter in this case; and (ii) we can always find the ordering after we have found the decomposition, as we only need to order the vectors $\mathbf{x}$ and $\mathbf{y}$. But this problem, the rank-1 binary matrix factorization problem, is known to be NP-hard [7], finalizing the proof. ∎
Proof of Theorem 2.
The decision problem is obviously in NP.
We prove the hardness by reduction from Hamiltonian path, where we are given a graph and asked whether there is a Hamiltonian path, that is, a path visiting every vertex exactly once.
Assume that we are given a graph with vertices and edges. Assume that we have some arbitrary order on the vertices , and on the edges .
Let us define first. The dataset will be of size . To define the matrix, we split the rows in two parts and , containing respectively and rows. Similarly, we split the columns in 3 parts, , , .
The 1s in are as follows. For each edge , we set the cells to be 1. For two adjacent edges and , we set the cells . Finally, we set , , and , to be 1. The remaining values are 0.
We argue that there is a zero-error solution for obmf using if and only if there is a Hamiltonian path.
Let us prove the easy direction: assume that there is a Hamiltonian path. To that end, let us permute the rows and columns such that the factor matrices do not have gap zeros. Permute as follows: Set the column order as . Order the rows in according to the Hamiltonian path, followed by the rows in . We denote the resulting matrix by . There is a zero-error solution if the ones in are a union of contiguous blocks. The blocks are as follows: blocks covering individual rows in , blocks covering edges along the Hamiltonian path (this can be done since the corresponding rows in and the corresponding columns in and are adjacent), and blocks to cover the remaining edges, 2 blocks per edge. This covers all 1s using blocks.
Let us prove the other direction. Assume that there is a zero-error solution, and let be the permuted version of with no gap zeros. Then the ones in must be a union of contiguous blocks. For a column index , we define to be the number of blocks started at the th column. Let us also define to be the number of blocks ended at the th column. Trivially, .
We say that an edge is active if and are adjacent in . Let be the total number of active edges. Note that we have . Assume for a moment that and let be the vertices ordered according to the order of in . Since , we are forced to have . This implies that is a Hamiltonian path.
We will now argue that .
Consider two adjacent columns at and . If neither of the columns is in , then both columns contain a 1 that is not in the other column. This forces . The same argument holds if both columns are in .
Assume that the th column is in and the th column is in . Assume that . Let and be the rows in that are active in the th column. Since does not have active rows, the block(s) covering and must terminate, and since , we have only one block, implying that and are adjacent. The same result holds if we replace with or permute the order of the two columns. To summarize, if , then either the th or the th column corresponds to an active edge.
In addition, we must have and as these columns have 1s. This leads to
[TABLE]
proving the result. ∎
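The proof relies on the fact that a Boolean product of factor matrices without gap zeros is a union of contiguous blocks, one block per factor. The following minimal sketch checks this on a toy rank-2 example; the matrices `B` and `C` and the helper names are assumptions chosen for illustration and are unrelated to the matrix built in the reduction.

```python
def boolean_product(B, C):
    """Boolean matrix product: entry (i, j) is OR over h of B[i][h] AND C[h][j]."""
    inner = len(C)
    return [[int(any(B[i][h] and C[h][j] for h in range(inner)))
             for j in range(len(C[0]))]
            for i in range(len(B))]


def interval(vec):
    """Return (first, last) index of the 1s; vec must have consecutive ones."""
    ones = [i for i, x in enumerate(vec) if x]
    assert ones and ones[-1] - ones[0] + 1 == len(ones), "needs consecutive ones"
    return ones[0], ones[-1]


# Hypothetical factors: every column of B and every row of C has consecutive ones.
B = [[1, 0],
     [1, 0],
     [1, 1],
     [0, 1],
     [0, 1]]
C = [[1, 1, 0, 0],
     [0, 1, 1, 1]]

product = boolean_product(B, C)

# Rebuild the same matrix as a union of one contiguous block per factor.
union_of_blocks = [[0] * len(C[0]) for _ in B]
for h in range(len(C)):
    r0, r1 = interval([row[h] for row in B])  # row range of block h
    c0, c1 = interval(C[h])                   # column range of block h
    for i in range(r0, r1 + 1):
        for j in range(c0, c1 + 1):
            union_of_blocks[i][j] = 1

print(product == union_of_blocks)
```

The blocks may overlap, as they do in the third row here, which is exactly what the Boolean OR permits.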
Proof of Lemma 5.
Let be the optimal border-compatible set. Then there is such that is a union of the best border-compatible set of and either the union of all leaves in or . ∎
Proof of Lemma 6.
Let be the optimal compatible set. Then is either included completely within one child, or there are indices such that is a union of the best border-compatible sets of , , and the union of all leaves in . ∎
Proof of Lemma 7.
Let be the optimal border-compatible set. Then there is such that is a union of the best border-compatible set of and the union of all leaves of some children.
Let be a child of . If , then having the leaves of in has positive gain. Let be these children. The total gain of having these children is .
We need to transform one of the children to a partial. Let be a child of . If , then and adding will have a gain of . If , then , and transforming from a fully-covered node to a partial node will have a gain of . In summary, the gain is equal to . Thus, the vertex with the maximal should be selected as the partial child in . ∎
Proof of Lemma 8.
Let be the optimal compatible set. Then is either included completely within one child, or is a union of some children and possibly up to two of the best border-compatible sets for some and .
Let be a child of . If , then having the leaves of in has positive gain. Let be these children. The total gain of having these children is .
As shown in the proof of Lemma 7, and correspond to the top-2 border-compatible sets. It may happen that or are negative, in which case we simply do not add them to . Thus the total gain of border-compatible sets is . ∎
Appendix B Further Visualizations
Here we present the visualizations for the Terms and Locations data sets.
The Terms data
The visualization of the Terms data, in Figure 5, is markedly different from Figure 3. Here, most bundles overlap each other. This indicates that many of these terms are used together in different posts. Yet, we can also identify specialized groups of terms. At the left of Figure 5, we have a blue bundle, from mission to nasa, that contains terms used when discussing space programs. This overlaps with a larger orange bundle, from chip to tap, containing terms related to cryptography.
The Locations data
For the Locations data, in Figure 6, we cannot print any labels, as the data consists of geographical locations. For these results, we did a rank- decomposition. Most of the edge bundles again form segments along the edge of the circle, corresponding to locations with similar fauna. A few larger edge bundles cover most of these locations as well, corresponding to more general biospheres. In this figure, many nodes have no edges drawn. This indicates that they were not part of any significant quasi-clique.
- [1] J. E. Atkins, E. G. Boman, and B. Hendrickson. A Spectral Algorithm for Seriation and the Consecutive Ones Problem. SIAM J. Comput., 28(1):297–310, 1998.
- [2] R. Bělohlávek and V. Vychodil. Discovery of optimal factors in binary data via a novel method of matrix decomposition. J. Comput. Syst. Sci., 76(1):3–20, 2010.
- [3] K. S. Booth and G. S. Lueker. Testing for the consecutive ones property, interval graphs, and graph planarity using pq-tree algorithms. J. Comput. Syst. Sci., 13(3):335–379, 1976.
- [4] E. Cergani and P. Miettinen. Discovering relations using matrix factorization methods. In CIKM ’13, pages 1549–1552, 2013.
- [5] G. Corrado, T. Tebaldi, G. Bertamini, F. Costa, A. Quattrone, G. Viero, and A. Passerini. PTRcombiner: mining combinatorial regulation of gene expression from post-transcriptional interaction maps. BMC Genomics, 15(1), Apr. 2014.
- [6] F. Geerts, B. Goethals, and T. Mielikäinen. Tiling databases. In DS ’04, pages 278–289, 2004.
- [7] N. Gillis and S. A. Vavasis. On the Complexity of Robust PCA and $\ell_{1}$-norm Low-Rank Matrix Approximation. arXiv, 2015.
- [8] A. Gionis, H. Mannila, and J. K. Seppänen. Geometric and Combinatorial Tiles in 0–1 Data. In PKDD ’04, pages 173–184, 2004.
