Bayesian Nonparametric Boolean Factor Models

Tammo Rukat; Christopher Yau

arXiv:1907.00063·stat.ML·July 2, 2019

TL;DR

This paper introduces a Bayesian nonparametric Boolean factor model with an Indian Buffet Process prior, which infers the number of latent dimensions while keeping posterior inference efficient for large-scale Boolean matrix factorisation.

Contribution

It extends existing Boolean factor models by removing the fixed number of latent factors constraint through an IBP prior, simplifying posterior inference and enhancing scalability.

Findings

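01

Recovered the true number of latent dimensions (2 to 10) on synthetic 200 × 500 data, with close-to-perfect recovery at 10% bit-flip noise.

02

Built on a sampler previously shown to scale to billions of observations; drawing 200 samples for a matrix with around 6 million entries takes 1-2 minutes on a laptop.

03

Applied the model to a real-world single-cell gene expression matrix with around 6 million entries (301 cells, approximately 21,000 genes).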

Equations

$$x_{nd} = \bigvee_{l} z_{nl} \wedge u_{dl}.$$

$$p(x_{nd} \mid \mathbf{u}, \mathbf{z}, \lambda) = \sigma\left[\lambda \tilde{x}_{nd}\left(1 - 2\prod_{l}(1 - z_{nl}u_{dl})\right)\right].$$

$$p(z_{nl} \mid \cdot) = \sigma\left[\lambda \tilde{z}_{nl}\sum_{d}\tilde{x}_{nd}u_{dl}\prod_{l^{\prime}\neq l}(1 - z_{nl^{\prime}}u_{dl^{\prime}}) + \text{logit}(p(z_{nl}))\right].$$

$$\mathbf{Z} \sim \text{IBP}(\alpha).$$

$$p(\mathbf{U} \mid q) = \prod_{d,l} q^{u_{dl}}(1 - q)^{1 - u_{dl}}.$$

$$p(z_{nl} = 1 \mid \cdot) = \sigma\left[\text{logit}\left(\frac{m_{-n,l}}{N}\right) + \lambda \tilde{z}_{nl}\sum_{d}\tilde{x}_{nd}u_{dl}\prod_{l^{\prime}\neq l}(1 - z_{nl^{\prime}}u_{dl^{\prime}})\right].$$

$$p(L^{\prime}_{n} \mid \cdot) = p(L^{\prime}_{n} \mid \mathbf{x}_{n}, \mathbf{z}_{n,l=1:L+L^{\prime}_{n}}, \mathbf{U}_{d=1:D,l=1:L}) \propto p(\mathbf{x}_{n} \mid \mathbf{z}_{n,l=1:L+L^{\prime}_{n}}, \mathbf{U}_{d=1:D,l=1:L}, L^{\prime}_{n})\, p(L^{\prime}_{n}).$$

$$p(x_{nd} \mid \mathbf{z}_{n,l=1:L+L^{\prime}_{n}}, \mathbf{u}_{d,l=1:L}, L^{\prime}_{n}) = \sum_{\mathbf{u}_{d,l=L+1:L^{\prime}_{n}}} \sigma\left[\lambda \tilde{x}_{nd}\left(1 - 2\prod_{l=1}^{L}(1 - z_{nl}u_{dl})\prod_{l=L+1}^{L^{\prime}_{n}}(1 - u_{dl})\right)\right] p(\mathbf{u}_{d,l=L+1:L^{\prime}_{n}}).$$

$$\begin{aligned}
&\sum_{\bar{N}} \log p(x_{nd} \mid \mathbf{z}_{n,l=1:L+L^{\prime}_{n}}, \mathbf{u}_{d,l=1:L}, L^{\prime}_{n}) \\
&\quad= \sum_{\bar{N}} \log\left[p(\mathbf{u}_{d,l=L+1:L^{\prime}_{n}} = \mathbf{0})\,\sigma(-\lambda\tilde{x}_{nd}) + p(\mathbf{u}_{d,l=L+1:L^{\prime}_{n}} \neq \mathbf{0})\,\sigma(\lambda\tilde{x}_{nd})\right] \\
&\quad= \sum_{\bar{N}} \log\left[(1-q)^{L^{\prime}_{n}}\sigma(-\lambda\tilde{x}_{nd}) + \left(1 - (1-q)^{L^{\prime}_{n}}\right)\sigma(\lambda\tilde{x}_{nd})\right] \\
&\quad= \text{FN}\,\log\left[(1-q)^{L^{\prime}_{n}}\sigma(-\lambda) + \left(1 - (1-q)^{L^{\prime}_{n}}\right)\sigma(\lambda)\right] + \text{TN}\,\log\left[(1-q)^{L^{\prime}_{n}}\sigma(\lambda) + \left(1 - (1-q)^{L^{\prime}_{n}}\right)\sigma(-\lambda)\right].
\end{aligned}$$

Taxonomy

Topics: Bayesian Methods and Mixture Models · Data Management and Algorithms · Statistical Methods and Inference

Full text

Bayesian Nonparametric Boolean Factor Models

Tammo Rukat

Amazon Research

[email protected]

Christopher Yau

University of Birmingham and The Alan Turing Institute

Work done while at the University of Oxford and the Alan Turing Institute

Abstract

We build upon probabilistic models for Boolean Matrix and Boolean Tensor factorisation that have recently been shown to solve these problems with unprecedented accuracy and to enable posterior inference to scale to billions of observations [6, 5]. Here, we lift the restriction of a pre-specified number of latent dimensions by introducing an Indian Buffet Process prior over factor matrices. Not only does the full factor-conditional take a computationally convenient form due to the logical dependencies in the model, but also the posterior over the number of non-zero latent dimensions is remarkably simple. It amounts to counting the number of false and true negative predictions, whereas positive predictions can be ignored. This constitutes a very transparent example of sampling-based posterior inference with an IBP prior and, importantly, lets us maintain extremely efficient inference. We discuss applications to simulated data, as well as to a real-world data matrix with 6 million entries.

1 Introduction

Boolean matrix factorisation decomposes binary data $\mathbf{X}=[x_{nd}]\in\{0,1\}^{N\times D}$ into a pair of low-rank, binary matrices $\mathbf{Z}=[z_{nl}]\in\{0,1\}^{N\times L}$ and $\mathbf{U}=[u_{dl}]\in\{0,1\}^{D\times L}$ . The data generating process is based on the Boolean product between binary matrices, which can be expressed through logical operations:

$$x_{nd} = \bigvee_{l} z_{nl} \wedge u_{dl}. \tag{1}$$

This model provides a framework for learning from binary data, similar to binary factor analysis or a clustering with joint assignments, where each observation is assigned to a subset of $L$ cluster centroids or codes. Here, one of the factor matrices represents a basis of binary codes, while the other contains indicator variables and provides a compact representation denoting the presence or absence of codes. A feature $x_{nd}$ takes a value of one if it equals one in any of the assigned codes. Note that formally the designation of $\mathbf{U}$ and $\mathbf{Z}$ as codes or compact representation is arbitrary. They denote subsets of rows and subsets of columns, respectively, but their roles would simply interchange upon transposition of the data matrix. Recently, a probabilistic model for Boolean matrix factorisation has been introduced [6], enabling sampling-based posterior inference that scales to billions of data points and outperforms previous approaches in finding accurate decompositions. In this workshop paper, we build upon this model and lift the restriction of a finite number of latent dimensions by specifying an Indian Buffet Process (IBP) as a prior over one of the factor matrices.
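As a concrete illustration of the Boolean product in eq. (1), the following minimal NumPy sketch computes $\mathbf{X}$ from given factors. The function name `boolean_product` and the toy matrices are illustrative assumptions and not part of the paper; the shapes follow the definitions above ($\mathbf{Z}$ is $N\times L$, $\mathbf{U}$ is $D\times L$).

```python
import numpy as np

def boolean_product(Z, U):
    """Boolean matrix product: x_nd = OR_l (z_nl AND u_dl).

    Z has shape (N, L) and U has shape (D, L), both with entries in {0, 1}.
    """
    Z = np.asarray(Z, dtype=np.int64)
    U = np.asarray(U, dtype=np.int64)
    # The integer product counts how many codes switch on each entry;
    # the Boolean product is one as soon as that count is positive.
    return (Z @ U.T > 0).astype(np.int8)

# Toy example (hypothetical values): 3 observations, 4 features, 2 codes.
Z = np.array([[1, 0],
              [0, 1],
              [1, 1]])
U = np.array([[1, 0],
              [1, 0],
              [0, 1],
              [0, 0]])
X = boolean_product(Z, U)
print(X)
```

The third observation is assigned both codes, so its row is the element-wise OR of the two code vectors.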

This approach has been studied in similar models [8, 3]. Nevertheless, our approach is methodologically interesting because the conditional distribution over the number of new latent dimensions takes an extremely simple, intuitive form, as we show in Section 2. Moreover, this work is of practical interest because it scales Bayesian nonparametric inference to very large datasets, as we demonstrate on a moderately sized example of single-cell gene expression data with 6 million data points in Section 3. We conclude the introduction with a brief description of the finite probabilistic model for Boolean Matrix Factorisation.

1.1 Probabilistic Boolean Matrix Factorisation

Denoting binary data as $\{0,1\}$ greatly simplifies notation in the following but is an otherwise arbitrary choice. We add i.i.d. Bernoulli noise at the observation level to the model in eq. (1) to find a factorial likelihood of the form

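$$p(x_{nd} \mid \mathbf{u}, \mathbf{z}, \lambda) = \sigma\left[\lambda \tilde{x}_{nd}\left(1 - 2\prod_{l}(1 - z_{nl}u_{dl})\right)\right]. \tag{2}$$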

The logistic sigmoid, $\sigma(y)=1/(1+\exp(-y))$, leads to a convenient expression by virtue of its property $\sigma(-y)=1-\sigma(y)$, together with the mapping from $\{0,1\}$ to $\{-1,1\}$ defined by $\tilde{x}=2x-1$. The noise is controlled by a global parameter $\lambda\in\mathbb{R}^{+}$. Due to the deterministic logical dependencies among the variables, the full conditional distribution for any entry of the factor matrices takes a simple form that lends itself to highly efficient computation:

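$$p(z_{nl} \mid \cdot) = \sigma\left[\lambda \tilde{z}_{nl}\sum_{d}\tilde{x}_{nd}u_{dl}\prod_{l^{\prime}\neq l}(1 - z_{nl^{\prime}}u_{dl^{\prime}}) + \text{logit}(p(z_{nl}))\right]. \tag{3}$$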

In particular, terms inside the sum over $d$ are known to be zero if either of the following two conditions holds: (i) $u_{dl}=0$; (ii) $\exists\,l^{\prime}\neq l$ such that $z_{nl^{\prime}}=u_{dl^{\prime}}=1$. Testing for these logical conditions can be implemented efficiently and in parallel. The noise parameter is updated after each sweep through the factors by setting it to its maximum likelihood estimate, which is available in closed form, akin to a Monte Carlo EM algorithm [6, 5].
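To make the structure of eq. (2) and eq. (3) tangible, here is a small sketch, under stated assumptions, that evaluates the row likelihood and the full conditional for one entry $z_{nl}$, and checks the closed form against a brute-force likelihood ratio. The function names, the flat prior (`prior_logit=0.0`) and the random toy values are assumptions for illustration; this is not the authors' implementation.

```python
import numpy as np

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))

def log_likelihood_row(x_n, z_n, U, lam):
    """Sum over d of log p(x_nd | u, z, lambda), following eq. (2)."""
    x_tilde = 2 * x_n - 1                     # map {0,1} -> {-1,1}
    no_code = np.prod(1 - z_n * U, axis=1)    # 1 where no code switches feature d on
    return np.sum(np.log(sigmoid(lam * x_tilde * (1 - 2 * no_code))))

def conditional_z(x_n, z_n, U, l, lam, prior_logit=0.0):
    """p(z_nl = 1 | rest), following eq. (3)."""
    x_tilde = 2 * x_n - 1
    others = np.delete(np.arange(U.shape[1]), l)
    # Feature d contributes only if u_dl = 1 and no other code already explains it.
    unexplained = np.prod(1 - z_n[others] * U[:, others], axis=1)
    return sigmoid(lam * np.sum(x_tilde * U[:, l] * unexplained) + prior_logit)

# Toy check with hypothetical random values: the closed form should agree with
# the posterior obtained from the ratio of the two row likelihoods.
rng = np.random.default_rng(0)
D, L, lam = 6, 3, 1.5
U = rng.integers(0, 2, size=(D, L))
x_n = rng.integers(0, 2, size=D)
z_n = rng.integers(0, 2, size=L)

l = 1
z_on, z_off = z_n.copy(), z_n.copy()
z_on[l], z_off[l] = 1, 0
brute_force = sigmoid(log_likelihood_row(x_n, z_on, U, lam)
                      - log_likelihood_row(x_n, z_off, U, lam))
closed_form = conditional_z(x_n, z_n, U, l, lam)
print(brute_force, closed_form)
```

The agreement follows from $\log[\sigma(y)/\sigma(-y)]=y$: only features with $u_{dl}=1$ that no other code explains change their likelihood when $z_{nl}$ is flipped. The same likelihood shows why the noise update has a closed form: every entry contributes $\sigma(\lambda)$ if the Boolean product reproduces it and $\sigma(-\lambda)$ otherwise, so the maximising $\lambda$ satisfies $\sigma(\lambda)=P/(ND)$, with $P$ the number of correctly reproduced entries.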

2 Taking the Infinite Limit

We use the likelihood in eq. (2) and specify an IBP prior on one of the factor matrices,

$$\mathbf{Z} \sim \text{IBP}(\alpha).$$

The IBP is a prior over binary matrices, where the entries in each column follow a Bernoulli distribution with parameter $\mu_{l}$ and where each $\mu_{l}$ is drawn independently from a Beta distribution. It results from integrating out the $\mu_{l}$ and taking the limit as $L\rightarrow\infty$, such that the distribution has support over an infinite number of latent dimensions. For the purpose of this paper, we omit further details and refer the interested reader to Griffiths and Ghahramani [2]. For the other factor, we use an independent Bernoulli prior,

$$p(\mathbf{U} \mid q) = \prod_{d,l} q^{u_{dl}}(1 - q)^{1 - u_{dl}}.$$

In order to retain a greater degree of symmetry between $U$ and $Z$ , we could alternatively choose a finite Beta-Bernoulli prior over the independent columns of $U$ . However, we refrain from doing so, because it would prohibit parallel inference for the rows of $U$ as described in [6].

The number of columns, $L$, is notionally infinite and, in practice, denotes the number of columns with at least one non-zero entry. The infinitely many other columns do not affect the likelihood and therefore do not need to be represented explicitly. We define the number of ones per column as $m_{l}=\sum_{n=1}^{N}z_{nl}$. Similarly, $m_{-n,l}$ omits row $n$ in the summation, denoting the number of times feature $l$ has been applied to observations $n^{\prime}\neq n$. Next we describe the sampling procedure for $Z$, while samples from $U$ are drawn as previously described [6].
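For readers unfamiliar with the IBP, the following sketch draws $\mathbf{Z}$ from the prior using the standard sequential "buffet" construction described in [2], and then computes the counts $m_{l}$ and $m_{-n,l}$ defined above. It illustrates the prior only and is not part of the inference scheme of this paper; the function name `sample_ibp` and all parameter values are assumptions.

```python
import numpy as np

def sample_ibp(N, alpha, rng):
    """Draw a binary matrix with N rows from IBP(alpha), one customer at a time."""
    columns = []                          # each column is a list of N zeros/ones
    for n in range(N):                    # customer n + 1 enters the buffet
        for col in columns:
            m_l = sum(col[:n])            # earlier customers who took this dish
            col[n] = int(rng.random() < m_l / (n + 1))
        # New dishes: Poisson(alpha / (n + 1)), so far taken by this customer only.
        for _ in range(rng.poisson(alpha / (n + 1))):
            new_col = [0] * N
            new_col[n] = 1
            columns.append(new_col)
    if not columns:
        return np.zeros((N, 0), dtype=np.int8)
    return np.array(columns, dtype=np.int8).T

rng = np.random.default_rng(1)
Z = sample_ibp(N=20, alpha=3.0, rng=rng)

m = Z.sum(axis=0)          # m_l: number of ones per column
m_minus_n = m - Z          # entry [n, l] is m_{-n,l}, the count omitting row n

# Independent Bernoulli(q) prior for the other factor (q and D are illustrative).
q, D = 0.3, 8
U = (rng.random((D, Z.shape[1])) < q).astype(np.int8)
print(Z.shape, m)
```

Only columns with at least one non-zero entry are ever stored, which mirrors the remark that the infinitely many empty columns need no explicit representation.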

2.1 Updates for existing codes

If $m_{-n,l}>0$, we sample from the conditional as usual, but with the infinite Beta-Bernoulli prior, $p(z_{nl}{=}1|\mathbf{z}_{-n,l})=\frac{m_{-n,l}}{N}$. In analogy to eq. (3), we find

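$$p(z_{nl} = 1 \mid \cdot) = \sigma\left[\text{logit}\left(\frac{m_{-n,l}}{N}\right) + \lambda \tilde{z}_{nl}\sum_{d}\tilde{x}_{nd}u_{dl}\prod_{l^{\prime}\neq l}(1 - z_{nl^{\prime}}u_{dl^{\prime}})\right].$$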

The prior contribution couples the rows of $Z$, such that updates cannot be computed in parallel.
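A minimal sketch of this update, assuming the same likelihood term as in eq. (3) and the prior odds $m_{-n,l}/(N-m_{-n,l})$, could look as follows. The function name `conditional_z_existing` and the toy matrices are hypothetical; the sketch evaluates the probability of $z_{nl}=1$ for one entry and makes the dependence on the other rows of $Z$ explicit.

```python
import numpy as np

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))

def conditional_z_existing(X, Z, U, n, l, lam):
    """p(z_nl = 1 | rest) for a column that other rows also use (m_{-n,l} > 0)."""
    N = Z.shape[0]
    m_minus = Z[:, l].sum() - Z[n, l]        # depends on all other rows of Z
    if m_minus == 0:
        raise ValueError("column is used by row n only; treat it in the new-code step")
    prior_logit = np.log(m_minus) - np.log(N - m_minus)   # logit(m_{-n,l} / N)
    x_tilde = 2 * X[n] - 1
    others = np.delete(np.arange(Z.shape[1]), l)
    unexplained = np.prod(1 - Z[n, others] * U[:, others], axis=1)
    return sigmoid(prior_logit + lam * np.sum(x_tilde * U[:, l] * unexplained))

# Hypothetical toy factors: N = 4 observations, D = 3 features, L = 2 codes.
Z = np.array([[1, 0], [1, 1], [0, 1], [1, 0]])
U = np.array([[1, 0], [1, 1], [0, 1]])
X = (Z @ U.T > 0).astype(int)

p = conditional_z_existing(X, Z, U, n=0, l=0, lam=1.0)
print(p)
```

Here two of the three other rows use code 0, so the prior log-odds are zero, and the two features that only code 0 can explain each add $\lambda$ to the log-odds. Because `m_minus` reads the current state of every other row, rows have to be updated one after another.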

2.2 Sampling new codes

In practice, we only need to represent columns with non-zero entries. However, we still need to sample from the remaining columns. Let $L^{\prime}_{n}$ denote the number of columns of $Z$ that contain a $1$ only in row $n$, and redefine $L$ to denote the number of remaining columns with non-zero entries. We can compute the probability of $L^{\prime}_{n}$ in order to sample the number of such columns. This corresponds to the number of new dishes ordered by customer $n$ and is independent of the other rows of $Z$, such that the conditional distribution is given by

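$$p(L^{\prime}_{n} \mid \cdot) = p(L^{\prime}_{n} \mid \mathbf{x}_{n}, \mathbf{z}_{n,l=1:L+L^{\prime}_{n}}, \mathbf{U}_{d=1:D,l=1:L}) \propto p(\mathbf{x}_{n} \mid \mathbf{z}_{n,l=1:L+L^{\prime}_{n}}, \mathbf{U}_{d=1:D,l=1:L}, L^{\prime}_{n})\, p(L^{\prime}_{n}).$$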

The prior is $\text{Poisson}(\frac{\alpha}{N})$. The likelihood factorises over $d$ and can be computed by marginalising over the new columns of $U$,

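$$p(x_{nd} \mid \mathbf{z}_{n,l=1:L+L^{\prime}_{n}}, \mathbf{u}_{d,l=1:L}, L^{\prime}_{n}) = \sum_{\mathbf{u}_{d,l=L+1:L^{\prime}_{n}}} \sigma\left[\lambda \tilde{x}_{nd}\left(1 - 2\prod_{l=1}^{L}(1 - z_{nl}u_{dl})\prod_{l=L+1}^{L^{\prime}_{n}}(1 - u_{dl})\right)\right] p(\mathbf{u}_{d,l=L+1:L^{\prime}_{n}}).$$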

Note that for positive predictions, that is for $x_{nd}$ where $\exists\,l\leq L^{\prime}{:}z_{nl}u_{dl}{=}1$, the term in parentheses is independent of the entries in the new columns of $U$, i.e. in the product that runs from $l{=}L{+}1$ to $l{=}L^{\prime}_{n}$. The intuition is that the logical disjunction explaining these data points already emits a one, independent of any additional arguments. Taking the logarithm of the factorial likelihood, we have a sum over the two different types of matrix entries $x_{nd}$: the positive predictions, $\bar{P}$, and the negative predictions, $\bar{N}$, defined as $x_{nd}$ where $\nexists\,l:z_{nl}u_{dl}=1$. Terms for the positive predictions are independent of $L^{\prime}_{n}$ and will cancel when normalising the probabilities for different values of $L^{\prime}_{n}$. For the negative predictions we have two cases to consider: the Boolean disjunction emits a one if any entry in the previously unused columns of $\mathbf{U}$ is one, and emits a zero otherwise. There exists a single configuration for the latter case, where all new entries are zero. We thus have

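$$\begin{aligned}
&\sum_{\bar{N}} \log p(x_{nd} \mid \mathbf{z}_{n,l=1:L+L^{\prime}_{n}}, \mathbf{u}_{d,l=1:L}, L^{\prime}_{n}) \\
&\quad= \sum_{\bar{N}} \log\left[p(\mathbf{u}_{d,l=L+1:L^{\prime}_{n}} = \mathbf{0})\,\sigma(-\lambda\tilde{x}_{nd}) + p(\mathbf{u}_{d,l=L+1:L^{\prime}_{n}} \neq \mathbf{0})\,\sigma(\lambda\tilde{x}_{nd})\right] \\
&\quad= \sum_{\bar{N}} \log\left[(1-q)^{L^{\prime}_{n}}\sigma(-\lambda\tilde{x}_{nd}) + \left(1 - (1-q)^{L^{\prime}_{n}}\right)\sigma(\lambda\tilde{x}_{nd})\right] \\
&\quad= \text{FN}\,\log\left[(1-q)^{L^{\prime}_{n}}\sigma(-\lambda) + \left(1 - (1-q)^{L^{\prime}_{n}}\right)\sigma(\lambda)\right] + \text{TN}\,\log\left[(1-q)^{L^{\prime}_{n}}\sigma(\lambda) + \left(1 - (1-q)^{L^{\prime}_{n}}\right)\sigma(-\lambda)\right].
\end{aligned}$$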

In the last step, we have subdivided the negative predictions into true negatives (TN), where $x_{nd}=0$, and false negatives (FN), where $x_{nd}=1$. Note that we can pre-compute the terms in the square brackets. These pre-computed quantities need only be updated for new values of $\lambda$. Thus, sampling $L^{\prime}_{n}$ essentially amounts to counting the number of false negative and true negative predictions in the current configuration of the factors. With the Poisson prior in eq. (6) we can now compute the posterior probability for new values of $L^{\prime}$. We truncate the distribution over $L^{\prime}$ by sampling only for $L^{\prime}<10$. The sampling procedure is sketched in Algorithm 1.
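As a rough illustration of this step (a sketch under assumptions, not the authors' Algorithm 1), the code below counts FN and TN for one row, combines the two log-likelihood terms with the $\text{Poisson}(\frac{\alpha}{N})$ prior, and normalises over $L^{\prime}_{n}<10$. It assumes that $q$ is the prior probability of a one in $U$, so that all $k$ new entries of a row of $U$ are zero with probability $(1-q)^{k}$; the function name `new_code_posterior` and the toy inputs are hypothetical.

```python
import numpy as np
from math import lgamma

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))

def new_code_posterior(x_n, z_n, U, lam, q, alpha, N, max_new=10):
    """Posterior over the number of new codes L'_n in {0, ..., max_new - 1}.

    z_n and U hold only the columns shared with other rows (the L existing codes).
    """
    predicted = (U @ z_n) > 0                  # positive predictions: ignored
    negatives = ~predicted
    FN = int(np.sum(negatives & (x_n == 1)))   # false negatives
    TN = int(np.sum(negatives & (x_n == 0)))   # true negatives

    k = np.arange(max_new)
    all_zero = (1.0 - q) ** k                  # all k new entries of u_d are zero
    log_lik = (FN * np.log(all_zero * sigmoid(-lam) + (1 - all_zero) * sigmoid(lam))
               + TN * np.log(all_zero * sigmoid(lam) + (1 - all_zero) * sigmoid(-lam)))

    rate = alpha / N                           # Poisson(alpha / N) prior on L'_n
    log_prior = k * np.log(rate) - rate - np.array([lgamma(i + 1) for i in k])

    log_post = log_lik + log_prior
    post = np.exp(log_post - log_post.max())   # subtract the max for stability
    return post / post.sum(), FN, TN

# Hypothetical toy row: one existing code that explains feature 0 only.
x_n = np.array([1, 1, 0, 0, 1])
z_n = np.array([1])
U = np.array([[1], [0], [0], [0], [0]])
post, FN, TN = new_code_posterior(x_n, z_n, U, lam=2.0, q=0.5, alpha=2.0, N=10)

rng = np.random.default_rng(0)
L_new = rng.choice(len(post), p=post)          # sampled number of new codes
print(FN, TN, post.round(3), L_new)
```

The bracketed terms depend only on $\lambda$, $q$ and $k$, so in a full sampler they could be tabulated once per value of $\lambda$ and reused for every row, leaving only the two counts to be computed per row.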

3 Experiments

3.1 Synthetic Data

We generate synthetic data of size $200\times 500$ with balanced density from a Boolean product of i.i.d. random matrices, varying the latent dimensionality from 2 to 10. Figures for these experiments are not shown. We draw 200 samples and discard the first 100 as burn-in. We investigate posterior means and modes of the distribution over latent dimensions, which indicate the ability to recover the true data-generating dimensionality. We find that the model reliably recovers the ground-truth number of latent dimensions. In most cases, the sampler locks onto a single posterior mode. We repeat these experiments after adding noise with independent bit-flip probability to the data, where we find close-to-perfect recovery for a noise level of 10% and a systematic overestimation of roughly 1 latent dimension for a noise level of 20%. The intuitive justification for this behaviour is that the algorithm cannot distinguish noise from true patterns at this noise level and thus introduces additional latent patterns.
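The data-generating setup can be sketched as follows. The paper does not spell out how "balanced density" is obtained, so this sketch assumes it means an expected density of 0.5 in $\mathbf{X}$ and picks the Bernoulli parameter of the factors accordingly; the function name `make_synthetic`, the seed and the default arguments are likewise assumptions.

```python
import numpy as np

def make_synthetic(N=200, D=500, L=5, noise=0.0, seed=0):
    """Boolean product of i.i.d. Bernoulli factors with optional bit-flip noise."""
    rng = np.random.default_rng(seed)
    # Assumption: "balanced" means an expected density of 0.5 in X. With i.i.d.
    # Bernoulli(p) factors, P(x_nd = 1) = 1 - (1 - p^2)^L, which is 0.5 for this p.
    p = np.sqrt(1.0 - 0.5 ** (1.0 / L))
    Z = (rng.random((N, L)) < p).astype(np.int64)
    U = (rng.random((D, L)) < p).astype(np.int64)
    X_clean = (Z @ U.T > 0).astype(np.int8)
    flips = rng.random((N, D)) < noise         # independent bit flips
    X = np.where(flips, 1 - X_clean, X_clean).astype(np.int8)
    return X, X_clean, Z, U

X, X_clean, Z, U = make_synthetic(L=5, noise=0.1)
print(X.shape, X_clean.mean(), (X != X_clean).mean())
```

Returning the noise-free matrix and the generating factors alongside the noisy data makes it straightforward to compare an inferred number of latent dimensions with the true `L`.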

3.2 Data from Single-Cell Gene Expression

We show results for a real-world dataset from single-cell RNA expression analysis, a revolutionary experimental technique that facilitates the measurement of gene expression at the level of a single cell [1]. The dataset, described in [4], consists of 301 cells of 9 known cell types. The number of sequencing reads per nucleotide is low, so we binarise the data, which then indicate the presence or absence of expression in approximately 21,000 genes, with approximately 35% of the cell/gene pairs being expressed. The data matrix has around 6 million entries, but drawing 200 samples from the factor matrices takes 1-2 minutes on a laptop. This is based on a Python implementation with substantial scope for further optimisation. Figure 1 shows the inferred cell-specific factor matrix. Each of the 301 rows depicts the marginal posterior mean of the binary representation of a single-cell profile. Each column is a latent dimension that corresponds to a subset of the 21,000 genes. We see that the representation has a strong specificity for cell types, while some latent properties are shared across different cells. The corresponding gene sets are biologically plausible.

4 Conclusion and Future work

We have shown that the probabilistic model for Boolean Matrix Factorisation [6] can be efficiently extended using an IBP prior to infer a posterior distribution over the number of latent dimensions. Due to the logical structure of the posterior, computing full conditionals for Gibbs sampling is extremely fast. In particular, drawing samples from the distribution over additional latent dimensions amounts to counting the true negative and false negative predictions. The result is a flexible, nonparametric model for the analysis of binary data with outstanding scalability.

In future work, we will extend this to data of arbitrary arity, as previously shown for the finite case [5]. Moreover, we will explore a fully parallel GPU-based implementation using the stick-breaking construction [7], as previously proposed [9].

Bibliography

[1] P. C. Blainey and S. R. Quake. Dissecting Genomic Diversity, One Cell At a Time. Nat. Methods, 11(1):19–21, Jan 2014.
[2] T. L. Griffiths and Z. Ghahramani. The Indian Buffet Process: An Introduction and Review. J. Mach. Learn. Res., 12:1185–1224, Jul 2011.
[3] E. Meeds, Z. Ghahramani, R. M. Neal, and S. T. Roweis. Modeling dyadic data with binary latent factors. Advances in Neural Information Processing Systems, 19:977, 2007.
[4] A. A. Pollen, T. J. Nowakowski, J. Shuga, X. Wang, A. A. Leyrat, J. H. Lui, N. Li, L. Szpankowski, B. Fowler, P. Chen, et al. Low-coverage single-cell mRNA sequencing reveals cellular heterogeneity and activated signaling pathways in developing cerebral cortex. Nature Biotechnology, 32(10):1053, 2014.
[5] T. Rukat, C. Holmes, and C. Yau. Probabilistic boolean tensor decomposition. In J. Dy and A. Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 4413–4422, Stockholmsmässan, Stockholm, Sweden, 10–15 Jul 2018. PMLR.
[6] T. Rukat, C. C. Holmes, M. K. Titsias, and C. Yau. Bayesian Boolean Matrix Factorisation. Proceedings of the 34th Annual International Conference on Machine Learning, pages 2969–2978, Jul 2017.
[7] Y. W. Teh, D. Görür, and Z. Ghahramani. Stick-breaking construction for the Indian buffet process. In M. Meila and X. Shen, editors, Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics, volume 2 of Proceedings of Machine Learning Research, pages 556–563, San Juan, Puerto Rico, 21–24 Mar 2007. PMLR.
[8] F. Wood, T. Griffiths, and Z. Ghahramani. A Non-Parametric Bayesian Method for Inferring Hidden Causes. arXiv preprint arXiv:1206.6865, 2012.