Learning Mixtures of Sparse Linear Regressions Using Sparse Graph Codes

Dong Yin; Ramtin Pedarsani; Yudong Chen; Kannan Ramchandran

arXiv:1703.00641·cs.IT·August 3, 2018

TL;DR

This paper introduces a novel algorithm called Mixed-Coloring for efficiently estimating and demixing mixtures of sparse linear regressions, leveraging coding theory to achieve near-optimal sample and computational complexities, especially in noisy settings.

Contribution

The paper presents a new coding-theoretic approach and algorithm for mixture of sparse linear regressions that improves efficiency and accuracy over existing methods.

Findings

1. Achieves order-optimal sample and time complexity in noiseless settings.

2. Near-optimal sample and time complexity for two-component mixtures in noisy settings.

3. Significantly faster than the EM algorithm in experiments with large dimensions.

Abstract

In this paper, we consider the mixture of sparse linear regressions model. Let $β^{(1)}, \dots, β^{(L)} \in C^{n}$ be $L$ unknown sparse parameter vectors with a total of $K$ non-zero coefficients. Noisy linear measurements are obtained in the form $y_{i} = x_{i}^{H} β^{(ℓ_{i})} + w_{i}$ , each of which is generated randomly from one of the sparse vectors with the label $ℓ_{i}$ unknown. The goal is to estimate the parameter vectors efficiently with low sample and computational costs. This problem presents significant challenges as one needs to simultaneously solve the demixing problem of recovering the labels $ℓ_{i}$ as well as the estimation problem of recovering the sparse vectors $β^{(ℓ)}$ . Our solution to the problem leverages the connection between modern coding theory and statistical inference. We introduce a new algorithm, Mixed-Coloring,…

Tables

Table I: Sample complexity of the Mixed-Coloring algorithm

$L$	$2$	$3$	$4$
$p^{*}$	$5.1 \times 10^{- 6}$	$8.8 \times 10^{- 6}$	$8.1 \times 10^{- 6}$
$m = C K$	$33.39 K$	$37.80 K$	$40.32 K$

Table II: Design parameters of the Mixed-Coloring algorithm

Parameter	Description	$L = 2$	$L = 3$	$L = 4$
$p^{*}$	error fraction	$5.1 \times 10^{- 6}$	$8.8 \times 10^{- 6}$	$8.1 \times 10^{- 6}$
$d$	left degree of bipartite graph	$15$	$15$	$13$
$R$	number of type-I / type-II index measurements in each bin	$3$	$5$	$8$
$V$	number of verification measurements in each bin	$3$	$5$	$8$
$M$	number of bins	$3.71 K$	$2.52 K$	$1.68 K$
$m$	total number of measurements, $m = (2 R + V) M$	$33.39 K$	$37.80 K$	$40.32 K$
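
The last row of Table II follows from the rows above it through $m = (2R + V)M$. The short Python sketch below recomputes the oversampling ratio $C = m/K$ from the tabulated $R$, $V$, and $M/K$. The dictionary layout and the function name `oversampling_ratio` are illustrative choices, not taken from the paper.

```python
# Design parameters copied from Table II: per-bin counts R, V and bins per non-zero M/K.
# The dictionary layout and function name are illustrative, not from the paper.
DESIGNS = {
    2: {"R": 3, "V": 3, "M_over_K": 3.71},
    3: {"R": 5, "V": 5, "M_over_K": 2.52},
    4: {"R": 8, "V": 8, "M_over_K": 1.68},
}


def oversampling_ratio(R, V, M_over_K):
    """Return C = m / K, using m = (2R + V) * M from Table II."""
    return (2 * R + V) * M_over_K


if __name__ == "__main__":
    for L, params in sorted(DESIGNS.items()):
        print(f"L = {L}: m = {oversampling_ratio(**params):.2f} K")
```

The same relation can be used to check the $m/K$ rows of Table V from its $M/K$, $R$, and $V$ rows.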

Table III: Design parameters of the Robust Mixed-Coloring algorithm

Parameter	Description	Choice
$p^{*}$	error fraction	$5.1 \times 10^{- 6}$
$d$	left degree of bipartite graph	$15$
$N$	number of repetitions of each query vector	$Θ (polylog (n))$
$P_{1}$	number of binary indexing vectors in each bin	$⌈ \log_{2} (n) ⌉$
$P_{2}$	number of verification vectors in each bin	$Θ (\log (n))$
$P_{3}$	number of summation check vectors in each bin	$\binom{P_{1} + P_{2}}{2} = Θ (\log^{2} (n))$
$M$	number of bins	$3.71 K$
$m$	total number of measurements	$m = K N (P_{1} + P_{2} + P_{3}) = Θ (K polylog (n))$

Table IV: Comparison of the Mixed-Coloring algorithm (M-C) and the EM-style algorithm (EM). The Mixed-Coloring algorithm is advantageous in time complexity for both sparse and dense problems, and is advantageous in sample complexity for sparse problems.

Problem type	$(n, K)$	$\frac{sample(M-C)}{sample(EM)}$	$\frac{run-time(M-C)}{run-time(EM)}$
Sparse	$(100, 20)$	$0.57$	$0.00806$
Sparse	$(500, 50)$	$0.33$	$0.00272$
Dense	$(100, 100)$	$2.78$	$0.0526$
Dense	$(500, 500)$	$3.00$	$0.0270$
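
Table IV reports ratios, so the speedup of Mixed-Coloring over EM is the reciprocal of the run-time ratio. The sketch below converts the tabulated ratios into speedup factors. The tuple layout and the function name `speedup` are illustrative assumptions.

```python
# Rows copied from Table IV: (problem type, (n, K), sample ratio, run-time ratio).
# Ratios are Mixed-Coloring divided by EM; the tuple layout is an illustrative choice.
ROWS = [
    ("Sparse", (100, 20), 0.57, 0.00806),
    ("Sparse", (500, 50), 0.33, 0.00272),
    ("Dense", (100, 100), 2.78, 0.0526),
    ("Dense", (500, 500), 3.00, 0.0270),
]


def speedup(runtime_ratio):
    """Speedup of Mixed-Coloring over EM: the reciprocal of the run-time ratio."""
    return 1.0 / runtime_ratio


if __name__ == "__main__":
    for kind, (n, K), sample_ratio, runtime_ratio in ROWS:
        print(f"{kind} (n={n}, K={K}): {speedup(runtime_ratio):.0f}x faster, "
              f"{sample_ratio:.2f}x the samples")
```

For the sparse problem with $(n, K) = (500, 50)$, the reciprocal of $0.00272$ exceeds $300$, which matches the speedup quoted in the abstract.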

Table V: Constants in the sample complexity results.

$L = 2$	$d$	11	12	13	14	15	16	17	18
	$p^{*} / 10^{- 6}$	6.7	8.7	1.9	3.1	5.1	1.6	0.5	7.4
	$M / K$	2.95	3.17	3.23	3.46	3.71	3.78	3.86	4.37
	$R$	4	4	4	3	3	3	3	3
	$V$	4	3	3	4	3	3	3	2
	$m / K$	35.4	34.87	35.53	34.6	33.39	34.02	34.74	34.96
$L = 3$	$d$	11	12	13	14	15	16	17	18
	$p^{*} / 10^{- 6}$	4.4	5.2	2.7	9.2	8.8	2.8	6.2	2.3
	$M / K$	1.94	2.08	2.17	2.39	2.52	2.56	2.76	2.81
	$R$	7	6	6	5	5	5	5	5
	$V$	7	7	6	6	5	5	4	4
	$m / K$	40.74	39.52	39.06	38.24	37.80	38.4	38.64	39.34
$L = 4$	$d$	11	12	13	14	15	16	17	18
	$p^{*} / 10^{- 6}$	7.8	8.7	8.1	5.6	4.2	3.3	4.0	5.0
	$M / K$	1.48	1.59	1.68	1.76	1.85	1.93	2.04	2.16
	$R$	9	9	8	8	7	7	7	6
	$V$	11	8	8	7	8	7	6	7
	$m / K$	42.92	41.34	40.32	40.48	40.7	40.53	40.8	41.04

Equations

$$y_{i}=\boldsymbol{x}_{i}^{\rm H}\boldsymbol{\beta}^{(\ell)}+w_{i}\quad\text{with probability } q_{\ell},\ \text{for } \ell\in[L],$$

$$\mathbb{P}\big\{|{\rm supp}(\hat{\boldsymbol{\beta}}^{(\ell)})|\geq(1-p^{*})|{\rm supp}(\boldsymbol{\beta}^{(\ell)})|\big\}\geq 1-\mathcal{O}(1/K).$$

$$\mathbb{P}\big\{\hat{\beta}_{j}^{(\ell)}=\beta_{j}^{(\ell)}\big\}\geq 1-p^{*}-\mathcal{O}(1/K).$$

$$\mathbb{D}\triangleq\{\pm\Delta,\pm 2\Delta,\ldots,\pm b\Delta\}\subset\mathbb{R},$$

$$\mathbb{P}\big\{|{\rm supp}(\hat{\boldsymbol{\beta}}^{(\ell)})|\geq(1-p^{*})|{\rm supp}(\boldsymbol{\beta}^{(\ell)})|\big\}\geq 1-\mathcal{O}(1/K).$$

$$\mathbb{P}\big\{\hat{\beta}_{j}^{(\ell)}=\beta_{j}^{(\ell)}\big\}\geq 1-p^{*}-\mathcal{O}(1/K).$$

$$\begin{bmatrix}y_{1}\\ y_{2}\end{bmatrix}=\begin{bmatrix}\boldsymbol{x}_{1}^{\rm H}\\ \boldsymbol{x}_{2}^{\rm H}\end{bmatrix}\boldsymbol{\beta}^{(1)}=\begin{bmatrix}0&r_{2}&r_{3}&0&0&r_{6}&0&0\\ 0&r_{2}W&r_{3}W^{2}&0&0&r_{6}W^{5}&0&0\end{bmatrix}\boldsymbol{\beta}^{(1)}.$$

$$y_{i}=x_{i,2}\beta_{2}^{(1)}+x_{i,3}\beta_{3}^{(1)},\quad i=1,2$$

$$y_{i}\leftarrow y_{i}-x_{i,2}\beta_{2}^{(1)},\quad i=1,2.$$

$$\boldsymbol{H}=\begin{bmatrix}0&1&1&0&1\\ 1&0&1&1&1\\ 1&1&0&1&0\end{bmatrix}.$$

$$Q_{\ell}=\big[1-(1-q_{\ell})^{V}\big]\big[1-(1-q_{\ell})^{R}\big]^{2}.$$

$$\xi_{k}^{(\ell)}=Q_{\ell}\binom{K_{\ell}}{k}\Big(\frac{d}{M}\Big)^{k}\Big(1-\frac{d}{M}\Big)^{K_{\ell}-k}.$$

$$\xi_{k}^{(\ell)}\approx Q_{\ell}\frac{\lambda_{\ell}^{k}e^{-\lambda_{\ell}}}{k!}.$$

$$\rho_{k}^{(\ell)}=\frac{kM}{K_{\ell}d}\xi_{k}^{(\ell)}=Q_{\ell}\frac{\lambda_{\ell}^{k-1}e^{-\lambda_{\ell}}}{(k-1)!}.$$

$$q_{s}^{(\ell)}=1-(1-\rho_{1}^{(\ell)})^{d}=\Theta(1).$$

$$[\boldsymbol{x}_{1}\ \cdots\ \boldsymbol{x}_{P}]^{\rm T}=[\boldsymbol{B}^{\rm T}\ \boldsymbol{V}^{\rm T}\ \boldsymbol{C}^{\rm T}]^{\rm T}\operatorname{diag}(\boldsymbol{h}).$$

$$\mathbb{P}\big\{|K_{s}^{(\ell)}-K_{\ell}q_{s}^{(\ell)}|\leq\delta K_{\ell}\big\}\geq 1-2\exp(-2\delta^{2}K_{\ell}).$$

$$Q_{\ell}=\big[1-(1-q_{\ell})^{V}\big]\big[1-(1-q_{\ell})^{R}\big]^{2}.$$

$$\xi_{k}^{(\ell)}=Q_{\ell}\binom{K_{\ell}}{k}\Big(\frac{d}{M}\Big)^{k}\Big(1-\frac{d}{M}\Big)^{K_{\ell}-k}.$$

$$\xi_{k}^{(\ell)}\approx Q_{\ell}\frac{\lambda_{\ell}^{k}e^{-\lambda_{\ell}}}{k!}.$$

$$\rho_{k}^{(\ell)}=\frac{kM}{K_{\ell}d}\xi_{k}^{(\ell)}=Q_{\ell}\frac{\lambda_{\ell}^{k-1}e^{-\lambda_{\ell}}}{(k-1)!},$$

$$q_{s}^{(\ell)}=1-(1-\rho_{1}^{(\ell)})^{d},$$

$$\mathbb{P}\big\{|K_{s}^{(\ell)}-K_{\ell}q_{s}^{(\ell)}|\leq\delta K_{\ell}\big\}\geq 1-2\exp(-2\delta^{2}K_{\ell}),$$

$$\mathbb{P}\big\{|M_{s}^{(\ell)}-M\nu_{\ell}|\leq\delta M\big\}\geq 1-2\exp(-2\delta^{2}M).$$

$$q_{d}^{(\ell)}=\frac{1-\mathbb{P}\{\bar{B}\}-\mathbb{P}\{\bar{D}\}+\mathbb{P}\{\bar{B}\cap\bar{D}\}}{1-\mathbb{P}\{\bar{D}\}}=\frac{1-(1-\rho_{1}^{(\ell)})^{d}-(1-\rho_{2}^{(\ell)})^{d}+(1-\rho_{1}^{(\ell)}-\rho_{2}^{(\ell)})^{d}}{1-(1-\rho_{2}^{(\ell)})^{d}}.$$

$$\mathbb{P}\big\{|M_{s}^{(\ell)}-M\nu_{\ell}|\leq\delta M\big\}\geq 1-2\exp(-2\delta^{2}M),$$

$$\frac{2M\nu_{\ell}}{K_{\ell}q_{s}^{(\ell)}}>1,$$

$$\frac{K_{G}^{(\ell)}}{K_{\ell}}-\zeta_{\ell}q_{s}^{(\ell)}\leq\delta,$$

$$\zeta_{\ell}+\exp\Big(-2\frac{\zeta_{\ell}M\nu_{\ell}}{K_{\ell}q_{s}^{(\ell)}}\Big)=1,$$
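
The formulas for $Q_{\ell}$, $\xi_{k}^{(\ell)}$, $\rho_{k}^{(\ell)}$, and $q_{s}^{(\ell)}$ listed above can be evaluated numerically. The Python sketch below does so for the $L=2$ design of Table II. Two things are assumptions: the value $\lambda_{\ell} = K_{\ell}d/M$, which is not stated in this excerpt but is the value that makes the two expressions for $\rho_{k}^{(\ell)}$ agree, and the concrete $K$, which is a hypothetical choice. The function names are illustrative.

```python
import math


def q_factor(q, R, V):
    """Q_l = [1 - (1 - q)^V] * [1 - (1 - q)^R]^2."""
    return (1 - (1 - q) ** V) * (1 - (1 - q) ** R) ** 2


def xi_binomial(k, K_l, d, M, Q):
    """xi_k = Q * C(K_l, k) * (d/M)^k * (1 - d/M)^(K_l - k)."""
    p = d / M
    return Q * math.comb(K_l, k) * p ** k * (1 - p) ** (K_l - k)


def xi_poisson(k, lam, Q):
    """Poisson approximation: xi_k ~ Q * lam^k * exp(-lam) / k!."""
    return Q * lam ** k * math.exp(-lam) / math.factorial(k)


def rho(k, lam, Q):
    """rho_k = Q * lam^(k-1) * exp(-lam) / (k-1)!."""
    return Q * lam ** (k - 1) * math.exp(-lam) / math.factorial(k - 1)


def q_s(rho_1, d):
    """q_s = 1 - (1 - rho_1)^d."""
    return 1 - (1 - rho_1) ** d


if __name__ == "__main__":
    K = 20000                # hypothetical total sparsity
    K_l = K // 2             # two components of equal sparsity (assumption)
    d, R, V = 15, 3, 3       # L = 2 column of Table II
    M = round(3.71 * K)      # number of bins from Table II
    lam = K_l * d / M        # assumed definition of lambda_l (see lead-in)
    Q = q_factor(0.5, R, V)
    for k in (1, 2, 3):
        print(k, xi_binomial(k, K_l, d, M, Q), xi_poisson(k, lam, Q))
    print("q_s =", q_s(rho(1, lam, Q), d))
```

For large $K_{\ell}$ the binomial and Poisson columns agree closely, which is the approximation the listed formula for $\xi_{k}^{(\ell)}$ relies on.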

Taxonomy

Topics: Machine Learning and Algorithms · Sparse and Compressive Sensing Techniques · Advanced biosensing and bioanalysis techniques

Full text

Learning Mixtures of Sparse Linear Regressions Using Sparse Graph Codes

Dong Yin, Ramtin Pedarsani, Yudong Chen, and Kannan Ramchandran

D. Yin and K. Ramchandran are with the Department of EECS at UC Berkeley, email: {dongyin, kannanr}@eecs.berkeley.edu. R. Pedarsani is with the Department of ECE at UC Santa Barbara, email: [email protected]. Y. Chen is with the School of ORIE at Cornell University, email: [email protected]. This paper was presented in part at the IEEE 55th Annual Allerton Conference on Communication, Control, and Computing, 2017.

Abstract

In this paper, we consider the mixture of sparse linear regressions model. Let $\boldsymbol{\beta}^{(1)},\ldots,\boldsymbol{\beta}^{(L)}\in\mathbb{C}^{n}$ be $L$ unknown sparse parameter vectors with a total of $K$ non-zero elements. Noisy linear measurements are obtained in the form $y_{i}=\boldsymbol{x}_{i}^{\rm H}\boldsymbol{\beta}^{(\ell_{i})}+w_{i}$ , each of which is generated randomly from one of the sparse vectors with the label $\ell_{i}$ unknown. The goal is to estimate the parameter vectors efficiently with low sample and computational costs. This problem presents significant challenges as one needs to simultaneously solve the demixing problem of recovering the labels $\ell_{i}$ as well as the estimation problem of recovering the sparse vectors $\boldsymbol{\beta}^{(\ell)}$ .

Our solution to the problem leverages the connection between modern coding theory and statistical inference. We introduce a new algorithm, Mixed-Coloring, which samples the mixture strategically using query vectors $\boldsymbol{x}_{i}$ constructed based on ideas from sparse graph codes. Our novel code design allows for both efficient demixing and parameter estimation. To find $K$ non-zero elements, it is clear that we need at least $\Theta(K)$ measurements, and thus the time complexity is at least $\Theta(K)$ . In the noiseless setting, for a constant number of sparse parameter vectors, our algorithm achieves the order-optimal sample and time complexities of $\Theta(K)$ . In the presence of Gaussian noise (footnote 1: the proposed algorithm works even when the noise is non-Gaussian in nature, but the guarantees on sample and time complexities are difficult to obtain), for the problem with two parameter vectors (i.e., $L=2$ ), we show that the Robust Mixed-Coloring algorithm achieves near-optimal $\Theta(K\operatorname{polylog}(n))$ sample and time complexities. When $K=\mathcal{O}(n^{\alpha})$ for some constant $\alpha\in(0,1)$ (i.e., $K$ is sublinear in $n$ ), we can achieve sample and time complexities both sublinear in the ambient dimension. In one of our experiments, to recover a mixture of two regressions with dimension $n=500$ and sparsity $K=50$ , our algorithm is more than $300$ times faster than the EM algorithm, with about one third of its sample cost.

I Introduction

Mixture and latent variable models, such as Gaussian mixtures and subspace clustering, are expressive, flexible, and widely used in a broad range of problems including background modeling [1], speaker identification [2] and recommender systems [3]. However, parameter estimation in mixture models is notoriously difficult due to the non-convexity of the likelihood functions and the existence of local optima. In particular, it often requires a large sample size and many re-initializations of the algorithms to achieve an acceptable accuracy.

Our goal is to develop provably fast and efficient algorithms for mixture models—with sample and time complexities sublinear in the problem’s ambient dimension when the parameter vectors of interest are sparse—by leveraging the underlying low-dimensional structures.

In this paper we focus on a powerful class of models called mixtures of linear regressions [4]. We consider the sparse setting with a query-based algorithmic framework. In particular, we assume that each query-measurement pair $(\boldsymbol{x}_{i},y_{i})$ is generated from a sparse linear model chosen randomly from $L$ possible models (footnote 2: we use $\boldsymbol{x}_{i}^{\rm H}$ to denote the conjugate transpose of $\boldsymbol{x}_{i}$ ; in this paper, for any positive integer $N$ , $[N]$ denotes the set $\{1,2,\ldots,N\}$ ):

$$y_{i}=\boldsymbol{x}_{i}^{\rm H}\boldsymbol{\beta}^{(\ell)}+w_{i}\quad\text{with probability } q_{\ell},\ \text{for } \ell\in[L], \tag{1}$$

where $w_{i}$ is noise. Here, the probability $q_{\ell}>0$ is also called the mixture weight of $\boldsymbol{\beta}^{(\ell)}$ , and they satisfy $\sum_{\ell=1}^{L}q_{\ell}=1$ . The total number of non-zero elements in the parameter vectors $\{\boldsymbol{\beta}^{(\ell)}\in\mathbb{C}^{n},\ell\in[L]\}$ is assumed to be $K$ . The goal is to estimate the $\boldsymbol{\beta}^{(\ell)}$ ’s, without knowing which $\boldsymbol{\beta}^{(\ell)}$ generates each query-measurement pair. We also note that when $L=1$ , we recover the compressive sensing problem that has been extensively studied in recent years [5, 6].
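
To make the measurement model concrete, the sketch below draws one query-measurement pair from a mixture of sparse vectors. It is a minimal illustration, not the paper's sampling code: the function name `sample_measurement`, the sparse-dictionary representation of each $\boldsymbol{\beta}^{(\ell)}$, and the use of real-valued Gaussian noise are all assumptions made for the example.

```python
import random


def sample_measurement(x, betas, weights, sigma=0.0, rng=random):
    """Draw one measurement y = x^H beta^(l) + w, with hidden label l ~ weights.

    x       : list of (possibly complex) query entries, length n
    betas   : list of sparse vectors, each a dict {index: value} (illustrative format)
    weights : mixture weights q_l, summing to 1
    sigma   : noise standard deviation; 0 gives the noiseless setting
    Returns (y, label); a real algorithm would only observe y.
    """
    label = rng.choices(range(len(betas)), weights=weights, k=1)[0]
    # x^H beta: conjugate the query entry, then multiply by the coefficient.
    y = sum(x[j].conjugate() * value for j, value in betas[label].items())
    if sigma > 0:
        y += rng.gauss(0.0, sigma)  # real Gaussian noise (assumption)
    return y, label


if __name__ == "__main__":
    rng = random.Random(0)
    betas = [{1: 2.0, 4: -1.0}, {2: 3.0}]  # K = 3 non-zeros in total
    x = [1.0] * 6
    for _ in range(5):
        print(sample_measurement(x, betas, [0.5, 0.5], rng=rng))
```

The label is returned here only so the example can be inspected. In the problem studied in the paper the label is unknown, which is what makes demixing necessary.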

A mixture of regressions provides a flexible model for various heterogeneous settings where the regression coefficients differ for different subsets of observations. This model has been applied to a broad range of tasks including medicine measurement design [7], behavioral health care [8] and music perception modeling [9]. Here, we study the problem when the query vectors $\boldsymbol{x}_{i}$ can be designed by the user; in Section I-B we discuss several practical applications that motivate the study of this query-based setting. Our results show that by appropriately exploiting this design freedom, one can achieve significant reduction in the sample and computational costs.

To recover $K$ unknown non-zero elements, the number of linear measurements needed scales at least as $\Theta(K)$ . The corresponding time complexity is also at least $\Theta(K)$ , which is the time needed to write down $K$ numbers as the solution. We introduce a new algorithm, called the Mixed-Coloring algorithm, that matches these sublinear sample and time complexity lower bounds. The design of query vectors and decoding algorithm leverages ideas from sparse graph codes such as low-density parity-check (LDPC) codes [10]. For any $L=\Theta(1)$ , our algorithm recovers the parameter vectors with optimal $\Theta(K)$ sample and time complexities in the noiseless setting, both in theory and empirically. In the noisy setting, for compressive sensing problems (i.e., $L=1$ ), it is known from an information-theoretic point of view that the optimal sample complexity is $\Theta(K\log(n/K))$ [11, 12]. In this work, we show that when the noise is Gaussian distributed, $L=2$ , and the non-zero elements take values in a finite quantized set, the Robust Mixed-Coloring algorithm has $\Theta(K\operatorname{polylog}(n))$ sample and time complexities. Since our problem is harder than compressive sensing, the sample and time complexities of our algorithm are optimal up to polylogarithmic factors. When $K=\mathcal{O}(n^{\alpha})$ for some $\alpha\in(0,1)$ , the sample and time complexities are sublinear in the ambient dimension $n$ . In the noisy setting with continuous-valued parameter vectors, we provide experimental results and show that our algorithm can successfully recover the best quantized approximation of the parameter vectors, provided that the continuous-valued parameter vectors are close to the quantized grid in $\ell_{\infty}$ norm (footnote 3: in Section VI, we formally define the perturbation of the continuous-valued parameter vector $\boldsymbol{\beta}$ with respect to the quantized alphabet; the notion of perturbation essentially measures the distance between the continuous-valued parameter vector and the quantized grid in $\ell_{\infty}$ norm). Prior literature on this problem that does not utilize the design freedom typically has sample and time complexities that are at least polynomial in $n$ ; we provide a survey of prior work and a more detailed comparison in Section III. Empirically, we find that our algorithm is orders of magnitude faster than standard Expectation-Maximization (EM) algorithms for mixture of regressions. For example, in one of our experiments, detailed in Section VI, we consider recovering a mixture of two regressions with dimension $n=500$ and sparsity $K=50$ ; our algorithm is more than $300$ times faster than the EM algorithm, with about $1/3$ of its sample cost.

I-A Algorithm Overview

Our Mixed-Coloring algorithm solves two problems simultaneously: (i) rapid demixing, namely identifying the label $\ell_{i}$ of the vector $\boldsymbol{\beta}^{(\ell_{i})}$ that generates each measurement $y_{i}$ ; (ii) efficient identification of the location and value of the non-zero elements of the $\boldsymbol{\beta}^{(\ell)}$ ’s. The main idea is to use a divide-and-conquer approach that iteratively reduces the original problem into simpler ones with much sparser parameter vectors. More specifically, we design $\Theta(K)$ sets of sparse query vectors, with each set only associated with a subset of all the non-zero elements. The design of the query vectors ensures that we can first identify the sets which are associated with a single non-zero element (called singletons), and recover the location and value of that element (motivated by a balls-and-bins model that we utilize for designing our measurements, we call them singleton balls, shown as shaded balls in Figure 1(b)). We further identify the pairs of singleton balls which have the same (but unknown) label, indicated by the edges in Figure 1(b). Results from random graph theory guarantee that, with high probability, the $L$ largest connected components (giant components) of the singleton graph have different labels, and thus we recover a fraction of the non-zero elements in each $\boldsymbol{\beta}^{(\ell)}$ , as shown in Figure 1(c). We can then iteratively enlarge the recovered fraction with a guess-and-check method until finding all the non-zero elements. We revisit Figure 1 when describing the details of our algorithm in Section IV.
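
The giant-component step described above can be illustrated with a standard union-find pass. The sketch assumes the singleton balls are already numbered and that the same-label pairs are given as edges. How those edges are detected from measurements is part of the paper's design and is not shown here. The function name `largest_components` is hypothetical.

```python
from collections import Counter


def largest_components(num_nodes, edges, L):
    """Return the L largest connected components of an undirected graph.

    num_nodes : number of singleton balls, labeled 0 .. num_nodes - 1
    edges     : iterable of (u, v) pairs known to share the same (unknown) label
    L         : number of mixture components
    """
    parent = list(range(num_nodes))

    def find(u):
        # Path halving keeps the trees shallow.
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u

    for u, v in edges:
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_u] = root_v

    sizes = Counter(find(u) for u in range(num_nodes))
    roots = [root for root, _ in sizes.most_common(L)]
    return [sorted(u for u in range(num_nodes) if find(u) == root) for root in roots]


if __name__ == "__main__":
    edges = [(0, 1), (1, 2), (3, 4), (5, 6), (6, 7), (7, 8)]
    print(largest_components(10, edges, L=2))
```

Each returned component would then serve as the initially recovered part of one parameter vector, to be enlarged by the guess-and-check step.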

I-B Motivation

Our problem is a natural extension of the setting of compressive sensing, in which one often has full freedom of designing query vectors in order to estimate a sparse parameter vector. In many applications, the unknown sparse parameter vector can be affected by latent variables, leading to a mixture of sparse linear regressions, and these scenarios have been observed in neuroscience [13], genetics [14], psychology [7], etc. Here, we provide a concrete example motivated by neuroscience applications. In neural signal processing, sensors are used to measure the brain activities, represented by an unknown sparse vector $\boldsymbol{\beta}$ . The sensors can be modeled as digital filters, and one can design the linear filter weights ( $\boldsymbol{x}_{i}$ ’s) when measuring the neural signal. Multiple sensors are usually placed in a particular area of the brain in order to acquire enough compressed measurements. However, there may be more than one neuron affecting a particular area of the brain, as shown in Figure 2, and different neurons may have different activities, corresponding to the $\boldsymbol{\beta}^{(\ell)}$ ’s. Consequently, each sensor may be measuring one of several different sparse signals. Further, if we use the sensors multiple times, a single sensor may even obtain measurements that are generated by different neurons, since neurons at different depths may be active during different time periods. Thus, the problem can be formulated as a mixture of sparse linear regressions. Variants of this problem, such as neural spike sorting [13], have been studied in neuroscience. While the common solution is to use clustering algorithms on the spike signals, we believe that our algorithm has the potential to improve sensor design and reduce sample and time complexities.

In addition, our work demonstrates the power of design freedom in tackling sparse mixture problems by highlighting the significant performance gap between algorithms that can exploit the design freedom and those that cannot. We also believe that our ideas are applicable more broadly to other latent-variable problems that require experimental designs, such as survey designs in psychology with mixed types of respondents and biology experiments with mixed cell interior environments.

I-C Organization

We summarize our main results in Section II, discuss related work in Section III, present the details of our algorithm in the noiseless and noisy settings in Sections IV and V, respectively, provide experimental results in Section VI, and conclude in Section VII.

II Main Results

In this section, we present the recovery guarantees for the Mixed-Coloring algorithm, and provide bounds on its sample and time complexities. We assume there are $L$ unknown $n$ -dimensional parameter vectors $\boldsymbol{\beta}^{(1)},\ldots,\boldsymbol{\beta}^{(L)}$ . Each $\boldsymbol{\beta}^{(\ell)}$ has $K_{\ell}$ non-zero elements, i.e., $|{\rm supp}(\boldsymbol{\beta}^{(\ell)})|=|\{j:\beta_{j}^{(\ell)}\neq 0\}|=K_{\ell}$ . Let $K=\sum_{\ell=1}^{L}K_{\ell}$ be the total number of non-zero elements. Using the query vectors $\{\boldsymbol{x}_{i}\}\subset\mathbb{C}^{n}$ , the Mixed-Coloring algorithm obtains $m$ measurements $y_{i}$ , $i\in[m]$ , generated independently according to the model (1), and outputs an estimate $\{\hat{\boldsymbol{\beta}}^{(\ell)},\ell\in[L]\}$ of the unknown parameter vectors. We defer more details to Sections IV and V.

Our results are stated in the asymptotic regime where $n$ and $K$ approach infinity. A constant is a quantity that does not depend on $n$ and $K$ , with the associated Big-O notations $\mathcal{O}(\cdot)$ and $\Theta(\cdot)$ . We assume that $L$ is a known and fixed constant, and the mixture weights satisfy $q_{\ell}=\Theta(1)$ for each $\ell\in[L]$ and thus are of the same order. Similarly, the sparsity levels of the parameter vectors are also of the same order with $K_{\ell}=\Theta(K)$ .

II-A Guarantees for the Noiseless Setting

In the noiseless case, i.e., $w_{i}\equiv 0$ , we consider for generality the complex-valued setting with $\boldsymbol{\beta}^{(\ell)}\in\mathbb{C}^{n}$ (our results can be easily applied to the real case). We make a mild technical assumption, which stipulates that if any pair of parameter vectors have overlapping support, then the elements in the overlap are different.

Assumption 1.

For each pair $\ell_{1},\ell_{2}\in[L]$ , $\ell_{1}\neq\ell_{2}$ and each index $j\in{\rm supp}(\boldsymbol{\beta}^{(\ell_{1})})\cap{\rm supp}(\boldsymbol{\beta}^{(\ell_{2})})$ , we have $\beta_{j}^{(\ell_{1})}\neq\beta_{j}^{(\ell_{2})}$ .

We need this assumption due to our element-wise recovery strategy. However, this assumption is mild in practice. In particular, in the noiseless case, if the non-zero elements are generated from some continuous distribution, it is a measure zero event that two elements at the same coordinate share exactly the same value. Under the above setting, we have the following recovery guarantees for the Mixed-Coloring algorithm.

Theorem 1.

Consider the asymptotic regime where $n$ and $K$ approach infinity. Under Assumption 1, for any fixed constant $p^{*}\in(0,1)$ , there exists a constant $C>0$ such that if the number of measurements is $m\geq CK$ , then the Mixed-Coloring algorithm guarantees the following three properties for each $\ell\in[L]$ (up to a label permutation):

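1. (No false discovery) For each $j\in{\rm supp}(\boldsymbol{\beta}^{(\ell)})$ , $\hat{\beta}^{(\ell)}_{j}$ equals either $\beta^{(\ell)}_{j}$ or $0$ ; for each $j\notin{\rm supp}(\boldsymbol{\beta}^{(\ell)})$ , $\hat{\beta}^{(\ell)}_{j}=0$ .

2. (Support recovery)

$$\mathbb{P}\big\{|{\rm supp}(\hat{\boldsymbol{\beta}}^{(\ell)})|\geq(1-p^{*})|{\rm supp}(\boldsymbol{\beta}^{(\ell)})|\big\}\geq 1-\mathcal{O}(1/K).$$

3. (Element-wise recovery) For each $j\in{\rm supp}(\boldsymbol{\beta}^{(\ell)})$ ,

$$\mathbb{P}\big\{\hat{\beta}_{j}^{(\ell)}=\beta_{j}^{(\ell)}\big\}\geq 1-p^{*}-\mathcal{O}(1/K).$$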

Moreover, the computational time of the Mixed-Coloring algorithm is $\Theta(K)$ .

As we can see, to recover an arbitrarily large fraction of the non-zero elements, our Mixed-Coloring algorithm has order-optimal $\Theta(K)$ sample and time complexities. More specifically, the first property ensures that the Mixed-Coloring algorithm has no false discovery: for zero elements in the parameter vectors, our algorithm does not produce non-zero estimates, and for non-zero elements, our algorithm outputs either the true value or zero. The second property ensures that the Mixed-Coloring algorithm recovers a $(1-p^{*})$ fraction of the non-zero elements with high probability. The third property ensures that for each non-zero element, the probability that it can be recovered is asymptotically at least $1-p^{*}$ . In fact, the recovered fraction of the non-zero elements is uniformly distributed on the support of the parameter vectors.

The error fraction $p^{*}$ is an input parameter to the algorithm, and can be made arbitrarily close to zero by adjusting the oversampling ratio $C\equiv C(p^{*},L,\{q_{\ell}\})$ . By more careful analysis, one can show that the dependence of $C$ on $p^{*}$ is $C=\mathcal{O}(\log(1/p^{*}))$ (see the proof of Lemma 6 in Appendix A). Thus, when $p^{*}$ approaches $0$ , the sample and time complexities grow slowly as $\log(1/p^{*})$ . Here, since we set $p^{*}$ as a constant, we hide this dependence in the constant $C$ . Given the number of components $L$ , mixture weights $\{q_{\ell}\}$ and the target $p^{*}$ , the value of the constant $C$ can be computed numerically. Table I gives some of the $C$ values for several $p^{*}$ and $L$ , under the setting $q_{\ell}=1/L,\forall\ell\in[L]$ . We see that the value of $C$ is quite modest. More details of computing the constants in the sample complexity can be found in Appendix B.

We can in fact boost the above guarantee to recover all the non-zero elements, by running the Mixed-Coloring algorithm $\Theta(\log K)$ times independently and aggregating the results by majority voting. By property 2 in Theorem 1 and a union bound argument, this procedure exactly recovers all the parameter vectors with probability $1-\mathcal{O}(1/\operatorname{poly}(K))$ with $\Theta(K\log K)$ sample and time complexities.
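
A back-of-the-envelope calculation shows why $\Theta(\log K)$ repetitions suffice. The sketch below makes simplifying assumptions that are not in the text: runs are independent, each non-zero element is missed in a single run with probability exactly $p^{*}$ (the $\mathcal{O}(1/K)$ term is ignored), and an element counts as recovered if at least one run returns it, which the no-false-discovery property makes safe. This is a simpler aggregation rule than the majority voting mentioned above. The function name `runs_for_full_recovery` is hypothetical.

```python
import math


def runs_for_full_recovery(K, p_star, c=1.0):
    """Smallest T with K * p_star**T <= K**(-c), via a union bound over K elements.

    Assumes independent runs and a per-run miss probability of exactly p_star
    (a simplification of the guarantee in Theorem 1).
    """
    return math.ceil((1 + c) * math.log(K) / math.log(1 / p_star))


if __name__ == "__main__":
    for K in (10**3, 10**6, 10**9):
        T = runs_for_full_recovery(K, p_star=5.1e-6)
        print(f"K = {K}: T = {T}, union bound = {K * 5.1e-6 ** T:.3g}")
```

Under these assumptions the required number of runs grows linearly in $\log K$ , so the total cost is $\Theta(K\log K)$ measurements.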

II-B Guarantees for the Noisy Setting

An extension of the previous algorithm, Robust Mixed-Coloring, handles noise in the measurement model (1), in the case of two parameter vectors which appear equally likely, i.e., $L=2$ and $q_{\ell}=1/2$ , $\ell=1,2$ . Many interesting applications have binary latent factors: gene mutation present/absent, gender, healthy/sick individual, child/adult, etc.; see also the examples in [4, 9, 15]. Our goal is to design a query-based algorithm that can simultaneously conduct fast demixing and robust estimation in the presence of noise. Even if there are only two possible parameter vectors, achieving this goal is highly non-trivial, and we believe that our framework provides useful insights into this problem. Extending our results to the setting with $L>2$ is an important and interesting direction, and we leave it to future work.

The noise $w_{i}$ is assumed to be i.i.d. Gaussian with mean zero and constant variance $\sigma^{2}$ . We note that the Gaussian noise assumption is mainly for theoretical reasons. As one can see in Subsection V-B, our algorithm uses the EM algorithm as a subroutine to estimate the component means of a mixture of two random variables. The analysis of the EM algorithm is known to be hard due to the non-convexity of the likelihood functions. For simplicity, in this paper we assume that the noise is Gaussian and employ the recent convergence results on the EM algorithm for two-component Gaussian mixtures in [16]. Since the EM algorithm is widely used for non-Gaussian noise and is shown to have good performance in many applications, we believe that our algorithm can work well in practice even if the noise is not Gaussian distributed.

In the noisy setting, we make an additional assumption that the non-zero elements in the parameter vectors take value in a finite quantized set.

Assumption 2.

The non-zero elements of the parameter vectors satisfy $\beta_{j}^{(\ell)}\in\mathbb{D},\forall\beta_{j}^{(\ell)}\neq 0,\ell\in[L]$ , where

$$\mathbb{D}\triangleq\{\pm\Delta,\pm 2\Delta,\ldots,\pm b\Delta\}\subset\mathbb{R}.$$

The positive constants $\Delta$ and $b$ are known to the algorithms.

Here, we note that this assumption is mild in practice. As stated in Theorem 2, the quantization step size $\Delta$ can be as small as a constant multiple of the standard deviation of the noise, and this quantization step size should be small enough for most applications. In fact, for continuous-valued parameter vectors, in the noisy setting, it is fundamental that the non-zero elements can only be recovered up to a certain precision. Moreover, in our empirical results in Section VI, the Robust Mixed-Coloring algorithm works even when the assumption is violated. In this case, the algorithm produces the best quantized approximation to the unknown parameter vectors, provided that they are not too far off the quantized set. We would also like to mention that it is a major challenge to develop algorithms with sublinear complexity and provable guarantees in noisy settings of sparse mixed regression, even with the assumption of quantized non-zero elements. Thus, even with this mild simplifying assumption, our work represents significant progress. Establishing strong theoretical guarantees for a fast recovery algorithm with sublinear sample and time complexities for the continuous alphabet setting remains an open problem.
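
The quantized alphabet of Assumption 2 and the idea of a best quantized approximation can be sketched as follows. The helper names are hypothetical. The distance function is only an informal reading of "close to the quantized grid in $\ell_{\infty}$ norm", since the formal definition of perturbation is deferred to Section VI. The step-size helper uses the condition $\Delta/\sigma\geq 4/\sqrt{3}$ from Theorem 2.

```python
import math


def quantized_alphabet(delta, b):
    """The set D = {±Δ, ±2Δ, ..., ±bΔ}, returned as a sorted list."""
    return sorted(sign * k * delta for k in range(1, b + 1) for sign in (-1, 1))


def nearest_level(value, delta, b):
    """Element of D closest to a real value (ties go to the smaller level)."""
    return min(quantized_alphabet(delta, b), key=lambda t: abs(t - value))


def linf_distance_to_grid(beta, delta, b):
    """Largest |beta_j - nearest level| over the non-zero entries of beta.

    An informal stand-in for the paper's perturbation measure (assumption).
    """
    return max((abs(v - nearest_level(v, delta, b)) for v in beta if v != 0),
               default=0.0)


def min_step(sigma):
    """Smallest Δ allowed by the condition Δ/σ >= 4/sqrt(3) in Theorem 2."""
    return 4.0 * sigma / math.sqrt(3.0)


if __name__ == "__main__":
    delta, b = 0.5, 4
    beta = [0.0, 0.52, 0.0, -1.47, 1.0]
    print([nearest_level(v, delta, b) if v != 0 else 0.0 for v in beta])
    print(linf_distance_to_grid(beta, delta, b), min_step(sigma=0.2))
```

Zero entries are left at zero because $\mathbb{D}$ constrains only the non-zero elements.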

When the quantization assumption holds, exact recovery is possible, as guaranteed in the following theorem. The Robust Mixed-Coloring algorithm maintains sublinear sample and time complexities, and recovers the parameter vectors in the presence of i.i.d. Gaussian noise.

Theorem 2.

Consider the asymptotic regime where $K$ and $n$ approach infinity. Suppose that the noise in the measurements is i.i.d. Gaussian with mean $0$ and variance $\sigma^{2}$ , and that $\Delta/\sigma\geq\frac{4}{\sqrt{3}}$ . When $L=2$ and Assumptions 1 and 2 hold, if the number of measurements is $m=\Theta(K\operatorname{polylog}(n))$ , then the Robust Mixed-Coloring algorithm guarantees the following three properties for each $\ell\in\{1,2\}$ (up to a label permutation):

1. (No false discovery) With probability at least $1-\mathcal{O}(1/\operatorname{poly}(n))$ , for each $j\in{\rm supp}(\boldsymbol{\beta}^{(\ell)})$ , $\hat{\beta}^{(\ell)}_{j}$ equals either $\beta^{(\ell)}_{j}$ or $0$ ; for each $j\notin{\rm supp}(\boldsymbol{\beta}^{(\ell)})$ , $\hat{\beta}^{(\ell)}_{j}=0$ .

2. (Support recovery)

[TABLE]

3. (Element-wise recovery) For each $j\in{\rm supp}(\boldsymbol{\beta}^{(\ell)})$ ,

[TABLE]

Moreover, the computational time of the Robust Mixed-Coloring algorithm is $\Theta(K\operatorname{polylog}(n))$ .

We can make remarks similar to those in the noiseless case: 1) with high probability, the Robust Mixed-Coloring algorithm has no false discovery, 2) the algorithm can recover an arbitrarily large $(1-p^{*})$ fraction of the supports, and 3) each element is recovered with probability asymptotically at least $1-p^{*}$ . As we can see, the sample and time complexities of the Robust Mixed-Coloring algorithm are both $\Theta(K\operatorname{polylog}(n))$ , and thus, when $K=\mathcal{O}(n^{\alpha})$ for some $\alpha\in(0,1)$ , we can achieve sample and time complexities that are sublinear in the ambient dimension $n$ . We also note that Assumption 1 is still needed in the noisy setting, i.e., the two parameter vectors differ on the overlap of their supports. This assumption can still be mild in the sublinear regime where $K=o(n)$ . In particular, if the supports of the two parameter vectors are independently drawn from certain distributions, then the probability that the two parameter vectors have overlapping supports vanishes as $n$ approaches infinity. As for the dependence on $p^{*}$ , we again note that when $p^{*}$ approaches $0$ , the sample and time complexities grow slowly as $\log(1/p^{*})$ . Here, since we set $p^{*}$ as a constant, we hide this dependence in the big-O notation.

Similar to the noiseless case, by running the Robust Mixed-Coloring algorithm $\Theta(\log K)$ times, one can exactly recover the two parameter vectors with probability $1-\mathcal{O}(1/\operatorname{poly}(K))$ . In this case, the sample and time complexities are $\Theta(K\log(K)\operatorname{polylog}(n))$ , and further, if we assume that $K=\Theta(n^{\alpha})$ for some constant $\alpha$ , we can still conclude that the sample and time complexities for full recovery are $\Theta(K\operatorname{polylog}(n))$ .

III Related Work

III-A Mixture of Regressions

Parameter estimation using the expectation-maximization (EM) algorithm is studied empirically in [17]. In [18], an $\ell_{1}$ -penalized EM algorithm is proposed for the sparse setting. Theoretical analysis of the EM algorithm is difficult due to non-convexity. Progress was made in [19], [16], and [20] under stylized Gaussian settings with dense $\boldsymbol{\beta}$ , for which a sample complexity of $\Theta(n\operatorname{polylog}(n))$ is proved given a suitable initialization of EM. The algorithm uses a grid search initialization step to guarantee that the EM algorithm can find the globally optimal solution, under the assumption that the query vectors are i.i.d. Gaussian distributed. The time complexity is polynomial in $n$ . An alternative algorithm is proposed in [15], which achieves the optimal $\mathcal{O}(n)$ sample complexity, but has high computational cost due to the use of semidefinite lifting. The algorithm in [21] makes use of tensor decomposition techniques, but suffers from a high sample complexity of $\mathcal{O}(n^{6})$ . In comparison, our approach has near-optimal sample and time complexities by utilizing the potential design freedom. The classification version of this problem has also been studied in [22].

III-B Coding-theoretic Methods and Group Testing

Many modern error-correcting codes, such as LDPC codes and polar codes [23], have their roots in communication problems; they exploit redundancy to achieve robustness and use structural design to allow for fast decoding. These properties of codes have recently found applications in statistical problems, including graph sketching [24], sparse covariance estimation [25], low-rank approximation [26], and discrete inference [27]. Most related to our approach is the work in [28, 29, 30, 31], which applies sparse graph codes with peeling-style decoding algorithms to compressive sensing and phase retrieval problems. In our setting we need to handle a mixture distribution, which requires more sophisticated query design and novel demixing algorithms that go beyond the standard peeling-style decoding.

Another line of work relevant to our scheme is designing measurements in group testing [32] via error-correcting codes and expander graphs [33, 34, 35, 36]. These results bear some similarities to our algorithm as they also exploit linear sketches of data for efficient sparsity pattern recovery. Our scheme differs from these works since we tackle problems in real and complex fields, whereas in group testing problems one considers binary OR operations. In addition, we aim to solve the demixing problem in sparse recovery, and this is a more challenging task that has not been studied in the context of group testing.

III-C Combinatorial and Dimension Reduction Techniques

Our results demonstrate the power of strategic query and coding theoretic tools in mixture problems, and can be considered as efficient linear sketching of a mixture of sparse vectors. In this sense, our work is in line with recent works that make use of combinatorial and dimension reduction techniques in high-dimensional and large scale statistical problems. These techniques, such as locality-sensitive hashing [37], sketching of convex optimization [38], and coding-theoretic methods [39], allow one to design highly efficient and robust algorithms applicable to computationally challenging datasets without compromising statistical accuracy.

IV Mixed-Coloring Algorithm for Noiseless Recovery

In this section, we provide details of the Mixed-Coloring algorithm in the noiseless setting. We first provide some primitives that serve as important ingredients in the algorithm, and then describe the design of query vectors and decoding algorithm in detail.

IV-A Primitives

The algorithm makes use of four basic primitives: summation check, indexing, guess-and-check, and peeling, which are described below.

Summation Check: Suppose that we generate two query vectors $\boldsymbol{x}_{1}$ and $\boldsymbol{x}_{2}$ independently from some continuous distribution on $\mathbb{C}^{n}$ , and a third query vector of the form $\boldsymbol{x}_{1}+\boldsymbol{x}_{2}$ . Let $y_{1}$ , $y_{2}$ , and $y_{3}$ be the corresponding measurements. We check the sum of the measurements and in the noiseless case, if $y_{3}=y_{1}+y_{2}$ , then we know that these three measurements are generated from the same parameter vector $\boldsymbol{\beta}^{(\ell)}$ almost surely. In this case we call $\{y_{1},y_{2}\}$ a consistent pair of measurements as they are from the same $\boldsymbol{\beta}^{(\ell)}$ (the third measurement $y_{3}$ is now redundant).
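The summation check can be illustrated numerically. The following is a minimal sketch, not the authors' implementation: the dimension `n`, the two example parameter vectors, the tolerance `tol`, and the helper names `measure` and `summation_check` are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8  # hypothetical ambient dimension


def random_complex(size):
    # Draw from a continuous distribution on C^n (complex Gaussian here).
    return rng.standard_normal(size) + 1j * rng.standard_normal(size)


# Two hypothetical sparse parameter vectors (values chosen for illustration).
beta = [np.zeros(n, dtype=complex), np.zeros(n, dtype=complex)]
beta[0][[2, 4]] = [1.5, -2.0]
beta[1][[1, 6]] = [0.7, 3.0]

x1, x2 = random_complex(n), random_complex(n)
x3 = x1 + x2  # third query vector used for the check


def measure(x, label):
    # Noiseless measurement y = x^H beta^(label); np.vdot conjugates x.
    return np.vdot(x, beta[label])


def summation_check(y1, y2, y3, tol=1e-9):
    # Passes when y3 = y1 + y2 up to floating-point error.
    return abs(y3 - (y1 + y2)) < tol


# All three measurements come from the same parameter vector: check passes.
same = summation_check(measure(x1, 0), measure(x2, 0), measure(x3, 0))
# The second measurement comes from the other vector: check fails (a.s.).
mixed = summation_check(measure(x1, 0), measure(x2, 1), measure(x3, 0))
```

In the mixed case the residual equals $\boldsymbol{x}_{2}^{H}(\boldsymbol{\beta}^{(1)}-\boldsymbol{\beta}^{(2)})$, which is non-zero almost surely for a continuous query distribution.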

Indexing: The indexing procedure is to find the locations and values of the non-zero elements by carefully designed query vectors. In the noiseless case, this can be done by suitably designed ratio tests. We sketch the idea of ratio test here. Consider a consistent pair of measurements $\{y_{1},y_{2}\}$ and corresponding query vectors $\{\boldsymbol{x}_{1},\boldsymbol{x}_{2}\}$ . We design the query vectors such that the information of the locations of the non-zero elements is encoded in the relative phase between $y_{1}$ and $y_{2}$ . In particular, we generate $n$ i.i.d. random variables $r_{j},j\in[n]$ uniformly distributed on the unit circle. Letting $W=e^{{\bf i}\frac{2\pi}{n}}$ where ${\bf i}$ is the imaginary unit, we set the $j$ -th entries of $\boldsymbol{x}_{1}$ and $\boldsymbol{x}_{2}$ to be either $x_{1,j}=x_{2,j}=0$ , or $x_{1,j}=r_{j}$ and $x_{2,j}=r_{j}W^{j-1}$ . (The locations of the zeros are determined using sparse-graph codes and discussed later.) Below is an example of such a consistent pair of measurements and the corresponding linear system:

[TABLE]

Suppose that $\boldsymbol{\beta}^{(1)}$ is $3$ -sparse and of the form $\boldsymbol{\beta}^{(1)}=[0~{}0~{}{*}~{}0~{}{*}~{}0~{}0~{}{*}]^{\rm T}$ . There is only one non-zero element, $\beta_{3}^{(1)}$ , that contributes to the measurements $y_{1}$ and $y_{2}$ . In this case the consistent measurement pair $\{y_{1},y_{2}\}$ is called a singleton. A singleton can be detected by testing the integrality of the relative phase of the ratio $y_{2}/y_{1}$ . In the above example, since $y_{1}=r_{3}\beta_{3}^{(1)}$ and $y_{2}=r_{3}W^{2}\beta_{3}^{(1)}$ , we observe that $|y_{1}|=|y_{2}|$ and the relative phase $\angle(y_{2}/y_{1})=2\cdot\frac{2\pi}{8}$ is an integral multiple of $\frac{2\pi}{8}$ . We therefore know that with probability one, this consistent pair is a singleton, and moreover the corresponding non-zero element is located at the third coordinate with value $\beta_{3}^{(1)}=y_{1}/r_{3}$ . In general, for a consistent measurement pair $\{y_{1},y_{2}\}$ , if we observe that $|y_{1}|=|y_{2}|$ and the relative phase $\angle(y_{2}/y_{1})=k\cdot\frac{2\pi}{n}$ for some nonnegative integer $k$ , then we know that this consistent pair is a singleton, and the corresponding non-zero element is located at the $(k+1)$ -th coordinate with value $y_{1}/r_{k+1}$ . We would like to remark that the indexing step can also be done using real-valued query vectors.
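The ratio test can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the authors' code: the function name `ratio_test`, the tolerance `tol`, and the example values are hypothetical, and the measurements are formed exactly as written in the text ($y_{1}=r_{3}\beta_{3}$, $y_{2}=r_{3}W^{2}\beta_{3}$). Indices in the code are 0-based, so a returned index `k` corresponds to coordinate $k+1$.

```python
import numpy as np


def ratio_test(y1, y2, r, tol=1e-6):
    """Return (k, value) if (y1, y2) passes the singleton test, else None.

    k is 0-based, so it corresponds to coordinate k + 1 in the text.
    """
    n = len(r)
    # A singleton has equal, non-zero magnitudes.
    if abs(y1) < tol or abs(abs(y1) - abs(y2)) > tol:
        return None
    # Relative phase in units of 2*pi/n; a singleton gives an integer.
    k_real = (np.angle(y2 / y1) % (2 * np.pi)) * n / (2 * np.pi)
    k = int(round(k_real))
    if abs(k_real - k) > tol:
        return None
    k %= n
    return k, y1 / r[k]


rng = np.random.default_rng(0)
n = 8
W = np.exp(2j * np.pi / n)
r = np.exp(1j * rng.uniform(0, 2 * np.pi, n))  # i.i.d. on the unit circle

# Singleton: only the third coordinate (index 2) contributes.
b3 = 2.0 - 1.0j
single = ratio_test(r[2] * b3, r[2] * W**2 * b3, r)

# Two contributing coordinates (indices 2 and 5): the test should reject.
b6 = 0.5 + 1.5j
y1 = r[2] * b3 + r[5] * b6
y2 = r[2] * W**2 * b3 + r[5] * W**5 * b6
double = ratio_test(y1, y2, r)
```

Here `single` recovers index 2 (the third coordinate) together with the value `b3`, while the two-element pair fails the magnitude or phase condition with probability one over the draw of `r`.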

Guess-and-check and Peeling: After the ratio tests, we have already found some singletons, i.e., consistent pairs that are only associated with a single non-zero element. Ideally, we would like to iteratively reduce the problem by subtracting off recovered elements, in a Gaussian elimination-like manner, and find other non-zero elements. However, although we have recovered the locations and values of some non-zero elements, we still do not know their labels, and the uncertainty in the labels brings additional difficulty to the problem. To resolve this issue, we use a guess-and-check strategy. In the example above, suppose instead that $\boldsymbol{\beta}^{(1)}$ is $4$ -sparse, i.e., $\boldsymbol{\beta}^{(1)}=[0~{}{*}~{}{*}~{}0~{}{*}~{}0~{}0~{}{*}]^{\rm T}$ , in which case the consistent pair

[TABLE]

is associated with two non-zero elements of $\boldsymbol{\beta}^{(1)}$ . Suppose that, in a previous iteration of the algorithm we have recovered the location and value of $\beta^{(1)}_{2}$ . At this point, we only know that this non-zero element is located at the second coordinate, and has value $\beta$ ( $\beta=\beta^{(1)}_{2}$ ), but we do not know that this element belongs to the parameter vector $\boldsymbol{\beta}^{(1)}$ , nor do we know that the consistent pair $\{y_{1},y_{2}\}$ is generated by $\boldsymbol{\beta}^{(1)}$ . Despite the uncertainty in the labels, we guess that, this non-zero element belongs to the parameter vector that generates $\{y_{1},y_{2}\}$ . Then we can peel off (i.e., subtract) this recovered element by

[TABLE]

The updated measurement pair satisfies $y_{i}=x_{i,3}\beta^{(1)}_{3},i=1,2$ . Then, we can check whether our previous guess is correct, by doing ratio test on the updated pair. In this example, since the updated measurements are only associated with $\beta^{(1)}_{3}$ (i.e., this pair becomes a singleton), they can pass the ratio test. Then, we know that the peeling step is valid, and that the previous non-zero element at the second coordinate (with value $\beta$ ) and the newly recovered element at the third coordinate belong to the same parameter vector almost surely. If the updated measurement pair cannot pass the ratio test, there are two possibilities: 1) this pair is generated by some other parameter vector, or 2) this pair is associated with more than two non-zero elements. In this case, we keep both the measurement pairs before and after peeling for future usage. In general, the guess-and-check strategy and the peeling step can be combined to detect that two non-zero elements are from the same parameter vectors.
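The guess-and-check and peeling steps can be sketched in the same way. The code below is a minimal illustration under assumptions, not the authors' implementation: the helpers `pair` and `peel`, the dictionaries describing which non-zero elements fall in the bin, and all numerical values are hypothetical. Indices are 0-based, so index 1 is the second coordinate.

```python
import numpy as np


def ratio_test(y1, y2, r, tol=1e-6):
    """Singleton test: return (0-based index, value) or None."""
    n = len(r)
    if abs(y1) < tol or abs(abs(y1) - abs(y2)) > tol:
        return None
    k_real = (np.angle(y2 / y1) % (2 * np.pi)) * n / (2 * np.pi)
    k = int(round(k_real))
    if abs(k_real - k) > tol:
        return None
    k %= n
    return k, y1 / r[k]


rng = np.random.default_rng(1)
n = 8
W = np.exp(2j * np.pi / n)
r = np.exp(1j * rng.uniform(0, 2 * np.pi, n))


def pair(elements):
    # Consistent pair produced by the non-zero elements {index: value}
    # of one parameter vector that fall in this bin.
    y1 = sum(r[j] * v for j, v in elements.items())
    y2 = sum(r[j] * W**j * v for j, v in elements.items())
    return y1, y2


def peel(y1, y2, j, value):
    # Subtract the contribution of a recovered element at index j.
    return y1 - r[j] * value, y2 - r[j] * W**j * value


# Previously recovered element: second coordinate (index 1), label unknown.
known = (1, 1.0 + 2.0j)

# Pair generated by the vector that owns the known element (indices 1, 2).
same_vec = {1: known[1], 2: -0.5 + 0.3j}
ok = ratio_test(*peel(*pair(same_vec), *known), r)

# Pair generated by a different vector (indices 2, 6): the guess is wrong.
other_vec = {2: 0.8 - 1.1j, 6: 1.7 + 0.4j}
bad = ratio_test(*peel(*pair(other_vec), *known), r)
```

A correct guess leaves a singleton that passes the ratio test and reveals the element at index 2; a wrong guess leaves a residual with several terms, which fails the test almost surely.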

The continuing execution of these four primitives is made possible by the design of the query vectors using sparse-graph codes, which we describe next.

IV-B Design of Query Vectors

As illustrated in Figure 4, we construct $M=\Theta(K)$ sets of query vectors (called bins). The query vectors in each bin are associated with some coordinates of the parameter vectors (i.e., the query vectors are non-zero only on those coordinates). The association between the coordinates and bins is determined by a $d$ -left regular bipartite graph with $n$ left nodes (coordinates) and $M$ right nodes (bins), where each left node is connected to $d=\Theta(1)$ right nodes chosen independently and uniformly at random. Here, we note that other designs of the bipartite graph may also be employed, such as expander graphs [40, 41]. As we see in later sections, as long as the bipartite graph structure allows for a density evolution analysis [42], we may be able to use such graphs. In this paper, we choose to use a $d$ -left regular bipartite graph since it is amenable to a transparent analysis and already achieves order-optimal sample and time complexities in the noiseless setting. In our design, each bin consists of three query vectors. The values of the non-zero elements of the first two query vectors are in the form of (2), enabling the ratio test. The third query vector equals the sum of the first two and is used for the summation check.

More precisely, we first design three random vectors $\boldsymbol{r}_{1},\boldsymbol{r}_{2},\boldsymbol{r}_{3}\in\mathbb{C}^{n}$ , where $\boldsymbol{r}_{1}=[r_{1},r_{2},\ldots,r_{n}]^{\rm T}$ consists of elements that are i.i.d. uniformly distributed on the unit circle $\{z:|z|=1\}$ , $\boldsymbol{r}_{2}$ is the vector obtained by modulating the elements of $\boldsymbol{r}_{1}$ by the Fourier coefficients $W^{j}$ , $j=0,\ldots,n-1$ , $W=e^{{\bf i}\frac{2\pi}{n}}$ , and $\boldsymbol{r}_{3}=\boldsymbol{r}_{1}+\boldsymbol{r}_{2}$ . Let $\boldsymbol{H}\in\{0,1\}^{M\times n}$ be the biadjacency matrix of the bipartite graph, and $\boldsymbol{h}_{i}^{\rm T}$ be the $i$ -th row of $\boldsymbol{H}$ . Then, the query vectors of the $i$ -th bin are $\boldsymbol{r}_{1}{\rm diag}(\boldsymbol{h}_{i})$ , $\boldsymbol{r}_{2}{\rm diag}(\boldsymbol{h}_{i})$ , and $\boldsymbol{r}_{3}{\rm diag}(\boldsymbol{h}_{i})$ . We provide a concrete example with $n=5$ , $M=3$ , $d=2$ in Figure 3. The biadjacency matrix is given in (4).

[TABLE]

If the query vectors in each bin were used only once, then we would have very few bins passing the summation check and hence few consistent pairs. Instead, we use the first two query vectors repeatedly for $R=\Theta(1)$ times, obtaining two sets of measurements, each of size $R$ and called type-I and type-II index measurements. We use the third query vector $V=\Theta(1)$ times to obtain a set of verification measurements. We therefore have $2R+V$ measurements associated with each of the $M$ bins, hence a total of $m=(2R+V)M=\Theta(K)$ measurements, as shown in Figure 4.
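The query design described above can be sketched in a few lines. This is an illustrative sketch under assumptions: the values of `R` and `V` are arbitrary, and each coordinate is connected to `d` distinct bins (sampling without replacement), which is one possible reading of the random graph construction.

```python
import numpy as np

rng = np.random.default_rng(0)
n, M, d = 5, 3, 2  # sizes of the small example in the text
R, V = 2, 2        # hypothetical repetition counts

# d-left regular bipartite graph: each coordinate (column) picks d bins (rows).
# Assumption: the d bins of a coordinate are distinct.
H = np.zeros((M, n), dtype=int)
for j in range(n):
    H[rng.choice(M, size=d, replace=False), j] = 1

W = np.exp(2j * np.pi / n)
r1 = np.exp(1j * rng.uniform(0, 2 * np.pi, n))  # i.i.d. on the unit circle
r2 = r1 * W ** np.arange(n)                     # modulated by W^j, j = 0..n-1
r3 = r1 + r2                                    # used for the summation check

# queries[i, t] is the t-th query vector of bin i, zeroed outside the bin.
queries = np.stack([np.stack([r1 * h, r2 * h, r3 * h]) for h in H])

# Each of the first two vectors is used R times and the third V times.
m = (2 * R + V) * M
```

With these example values, each bin contributes $2R+V=6$ measurements, for a total of $m=18$.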

IV-C Decoding Algorithm

We provide an outline of the decoding algorithm in Algorithm 1. The decoding algorithm first finds consistent pairs (by summation check) in each bin, within which singletons are identified (by the ratio test). The ratio test also recovers the locations and values of several non-zero elements, some of which can then be associated with the same $\boldsymbol{\beta}^{(\ell)}$ by guess-and-check. Using tools from random graph theory, we can separate part of the recovered non-zero elements from different parameter vectors. At this point, for each $\boldsymbol{\beta}^{(\ell)}$ , we have recovered some of its non-zero elements (including their locations, values and labels). We iteratively conduct a combined operation of guess-and-check and peeling, so that we can subtract the recovered elements from the remaining consistent pairs, until no more non-zero elements can be found. Below we elaborate on these steps.

Finding Consistent Pairs: The decoding procedure starts by finding all the consistent pairs. In each bin, we perform summation checks on all triplets $(y_{1},y_{2},y_{3})$ in which $y_{1}$ , $y_{2}$ , and $y_{3}$ are the type-I index measurement, type-II index measurement and verification measurement, respectively. If a triplet passes the summation check, then a consistent pair $\{y_{1},y_{2}\}$ is found. Note that in each bin the number of triplets of the above form is a constant, so this step can be done in $\Theta(K)$ time. The subsequent steps of the algorithm are based on the consistent pairs found in this step. We also note that, since for every $\ell$ , the probability that each measurement is generated by the parameter vector $\boldsymbol{\beta}^{(\ell)}$ is a constant $q_{\ell}$ , the probability that one can find a consistent pair in a particular bin is a constant.

We classify the consistent pairs into a few different types. As we have seen, each consistent pair is only associated with a subset of the non-zero elements of a particular parameter vector, due to the design of the bipartite graph. As before, a consistent pair associated with only one non-zero element is called a singleton, and we call this non-zero element a singleton ball. The consistent pairs associated with two non-zero elements are called doubletons, and those associated with more than one non-zero element are called multitons (doubletons are also multitons). This terminology is useful in the following discussion.

Finding Singletons: Each non-zero element of the parameter vectors can be identified by its label-location-value triplet $(\ell,j,\beta^{(\ell)}_{j})$ . We visualize these triplets (i.e., non-zero elements) as balls, as shown in Figure 1(a), and initially their labels, locations and values are unknown. (Note that the graph in Figure 1 differs from the bipartite graph that we use to design the query vectors.) We run the ratio test on the consistent pairs to identify singletons and their associated singleton balls. The singleton balls found are illustrated in Figure 1(b) as shaded balls. The ratio test also recovers the locations and values of these singleton balls, although at this point we do not know the label $\ell$ of the balls.

To better understand the algorithm, here we analyze the expected number of singleton balls that belong to parameter vector $\boldsymbol{\beta}^{(\ell)}$ . We show that a constant fraction of the non-zero elements in $\boldsymbol{\beta}^{(\ell)}$ can be found as singleton balls in this stage. First, we analyze the probability $Q_{\ell}$ that a particular bin can produce a consistent pair that is generated by $\boldsymbol{\beta}^{(\ell)}$ . According to our probabilistic model, the measurements are generated independently, and therefore, we have

[TABLE]

Denote by $\xi_{k}^{(\ell)}$ the probability of the event that a particular bin produces a consistent pair that is generated by $\boldsymbol{\beta}^{(\ell)}$ , and is associated with $k$ non-zero elements in $\boldsymbol{\beta}^{(\ell)}$ . Since each non-zero element is associated with $d$ bins among the $M$ bins independently and uniformly at random, for a consistent pair generated by $\boldsymbol{\beta}^{(\ell)}$ , the number of non-zero elements associated with this pair is binomial distributed with parameters $K_{\ell}$ (recall that $K_{\ell}$ is the number of non-zero elements in $\boldsymbol{\beta}^{(\ell)}$ ) and $\frac{d}{M}$ , and we have

[TABLE]

In addition, we can use Poisson distribution to approximate the binomial distribution when $\lambda_{\ell}:=\frac{K_{\ell}d}{M}$ is a constant and $K_{\ell}$ approaches infinity, i.e., we have

[TABLE]
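As a numerical sanity check on the Poisson approximation of the bin degree distribution, the sketch below compares the binomial probabilities with parameters $K_{\ell}$ and $d/M$ against the Poisson probabilities with mean $\lambda_{\ell}=K_{\ell}d/M$. The values of `K_l`, `d`, and `M` are hypothetical, and the code covers only the degree distribution, not the probability of finding a consistent pair.

```python
from math import comb, exp, factorial

K_l, d, M = 2000, 3, 3000  # hypothetical sizes
p = d / M                  # probability that a given element lands in the bin
lam = K_l * d / M          # Poisson mean lambda_l


def binom_pmf(k):
    # P(exactly k of the K_l non-zero elements are associated with the bin)
    return comb(K_l, k) * p**k * (1 - p) ** (K_l - k)


def poisson_pmf(k):
    return exp(-lam) * lam**k / factorial(k)


# Pointwise gap between the two distributions for small degrees.
gaps = [abs(binom_pmf(k) - poisson_pmf(k)) for k in range(8)]
```

For these values $\lambda_{\ell}=2$, and the two distributions agree closely at every degree, which is what the asymptotic analysis relies on.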

Let us ignore the zero elements in $\boldsymbol{\beta}^{(\ell)}$ and consider the bipartite graph representing the association between the $K_{\ell}$ non-zero elements (left nodes) in $\boldsymbol{\beta}^{(\ell)}$ and the $M$ bins (right nodes). We know that the total number of edges in this bipartite graph is $K_{\ell}d$ , and we denote by $\rho_{k}^{(\ell)}$ the expected fraction of the edges that are connected to a right node (bin) with degree $k$ . Thus, we have

[TABLE]

We then proceed to analyze the expected fraction of the singleton balls. Let $q_{s}^{(\ell)}$ be the probability that a non-zero element in $\boldsymbol{\beta}^{(\ell)}$ becomes a singleton ball in a certain consistent pair. The event is equivalent to the event that at least one of its $d$ associated right nodes (bins) has degree $1$ . Then, when $K_{\ell}$ approaches infinity, we have asymptotically

[TABLE]

Thus, the expected number of non-zero elements in $\boldsymbol{\beta}^{(\ell)}$ that are found as singleton balls in this stage of the algorithm is $q_{s}^{(\ell)}K_{\ell}=\Theta(K)$ , and this implies that we can recover a constant fraction of the non-zero elements in each parameter vector, without knowing their labels. We can further prove high probability bounds for this fraction, and more details of this analysis are relegated to Lemma 2 in Appendix A.

Recovering a Subset of Non-zero Elements: The next step is crucial: for two singleton balls and a consistent pair associated with the locations of these two balls, we run the guess-and-check and peeling operations to detect if these two singleton balls indeed have the same label (or equivalently, the two non-zero elements are in the same parameter vector). If so, we call this consistent pair a strong doubleton (i.e., doubletons that contain two singleton balls that we find in the previous stage of the algorithm), and connect these two balls with an edge, as shown in Figure 1(b). Doing so creates a graph over the balls (i.e., non-zero elements), and each connected component of the graph is from a single parameter vector. Since each non-zero element is associated with a constant number of consistent pairs (due to using a $d$ -left regular bipartite graph with constant $d$ ), this step can in fact be done efficiently in $\Theta(K)$ time without enumerating all the combinations of singleton ball pairs. Similar to the analysis of singleton balls, we can analyze the number of strong doubletons. In fact, we can show that with high probability, a constant fraction of the consistent pairs are strong doubletons. More details are relegated to Lemma 3 in Appendix A.

By carefully choosing the parameters $d$ , $M$ , $R$ , and $V$ , and using tools from random graph theory, we can ensure that with high probability the $L$ largest connected components (called giant components) correspond to the $L$ parameter vectors, and each of these components has size $\Theta(K)$ . (To keep the main sections concise, we omit here the condition on the design parameters $d$ , $M$ , $R$ , and $V$ needed to form giant components of size $\Theta(K)$ ; the precise statement of this condition is relegated to Lemma 4 in Appendix A.) Then, the labels of the balls in these components are identified. This is illustrated in Figure 1(c) for $L=2$ , where colors represent the labels. More details of this demixing process are provided in Lemma 4 in Appendix A. In summary, at this point we have recovered the labels, locations and values of a constant fraction of the non-zero elements (i.e., balls) of each parameter vector.
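The grouping of singleton balls into connected components can be sketched with a union-find structure. This is an illustrative sketch, not the authors' implementation: the function name `giant_components`, the integer ball identifiers, and the toy edge list (standing in for the strong doubletons found in the previous step) are assumptions.

```python
from collections import defaultdict


def giant_components(num_balls, edges, L):
    """Return the L largest connected components of the graph over balls."""
    parent = list(range(num_balls))

    def find(u):
        # Find the representative of u, compressing the path as we go.
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u

    # Each strong doubleton certifies that two balls share a label.
    for u, v in edges:
        parent[find(u)] = find(v)

    groups = defaultdict(set)
    for u in range(num_balls):
        groups[find(u)].add(u)
    return sorted(groups.values(), key=len, reverse=True)[:L]


# Toy example: 7 singleton balls, 4 strong doubletons, ball 6 is isolated.
edges = [(0, 1), (1, 2), (3, 4), (4, 5)]
components = giant_components(7, edges, L=2)
```

In this toy example the two largest components are $\{0,1,2\}$ and $\{3,4,5\}$, each of which would receive its own label (color); the isolated ball is left for the iterative decoding stage.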

Iterative Decoding: The decoding procedure proceeds by identifying the labels of the remaining balls via iteratively applying the guess-and-check and peeling primitives. The connected components in Figure 1(c) are therefore expanded, until no more changes can be made, as illustrated in Figure 1(d).

We provide an example of this iterative procedure in Figure 5. Recall that the association between the coordinates of the parameter vectors and the bins (or consistent pairs) is determined by a bipartite graph. Here, we only show one consistent pair for each bin and omit the zero elements. The non-zero elements and the consistent pairs are shown as balls and squares, respectively, as in Figure 5(a). The steps described in the last part recover a subset of these balls, which are shown in red and blue in Figure 5(b). For simplicity, let us call the corresponding $\boldsymbol{\beta}^{(\ell)}$ ’s red and blue parameter vectors, respectively. If a consistent pair is generated by the red (blue) parameter vector, we say that this consistent pair is red (blue). Now consider the consistent pair 1, which is associated with the coordinates at which the balls $a$ , $b$ and $c$ are located. Although we do not know whether this pair is red or blue, we can guess that this pair is blue, and peel the blue balls $a$ and $b$ off from consistent pair 1. Since this consistent pair is indeed blue, the updated measurements can pass the ratio test, and thus we can recover the label, location and value of the non-zero element represented by blue ball $c$ . Similarly, by guessing that the consistent pair 3 is red and peeling off the recovered red ball $v$ from the consistent pair $3$ , we can recover red ball $w$ , as illustrated in Figure 5(c). We continue this process iteratively, guessing the color of the consistent pairs and peeling off balls recovered in the previous iterations to recover more balls. For example, we peel off balls $b$ and $c$ from the measurement pair $2$ to recover ball $d$ , and ball $w$ from pair $4$ to recover ball $z$ , resulting in Figure 5(d).

IV-D Choice of Design Parameters

We have completed the description of the Mixed-Coloring algorithm for the noiseless case. The algorithm involves several design parameters including $d$ , $R$ , $V$ , and $M$ . We need to choose these parameters properly in order to guarantee successful decoding. The precise conditions that these parameters need to satisfy are somewhat technical, and they are relegated to Lemmas 4 and 6 in Appendix A. More specifically, Lemma 4 provides the condition under which the $L$ giant components in Figure 1(c) correspond to the $L$ parameter vectors and each of these components has size $\Theta(K)$ ; Lemma 6 provides the condition under which the iterative decoding process can find an arbitrarily large fraction of the non-zero elements, with an analysis based on density evolution from modern coding theory [43]. In addition, Lemma 7 in Appendix A provides a high-probability bound on the recovered fraction of non-zero elements. For concrete settings, the optimal choices of these parameters can actually be computed numerically via the density evolution analysis. In particular, for any upper bound $p^{*}\in(0,1)$ on the error fraction, we can find the proper values of $d$ , $R$ , $V$ , and $M$ for which the peeling process is guaranteed to proceed successfully and recover all but a fraction $p^{*}$ of the non-zero elements. As examples, in Table II we list the optimal values of the design parameters for $L=2,3$ or $4$ parameter vectors with equal probability (mixing weights), i.e., $q_{\ell}=\frac{1}{L}$ . The details of these numerical computations are provided in Appendix B.

V Robust Mixed-Coloring Algorithm for Noisy Recovery

The key idea of the Robust Mixed-Coloring algorithm is to turn the noisy problem into a noiseless one. We keep the overall structure of the Robust Mixed-Coloring algorithm the same as its noiseless counterpart. We still use a balls-and-bins model to design the query vectors. In particular, we keep using a $d$ -left regular bipartite graph to represent the association between coordinates and bins (sets of measurements), and the algorithm still proceeds as shown in Algorithm 1. However, the steps in the Mixed-Coloring algorithm that rely on the absence of noise in the measurements need to be robustified. In particular, in the presence of noise, the ratio test method for indexing (finding the location and value of non-zero elements) and the summation check for finding consistent measurements (measurements that are generated by the same parameter vector) need to be modified. To this end, we use a new design of query vectors, and employ an EM-based noise reduction scheme to effectively obtain the noiseless measurements of these query vectors. We provide more details of the Robust Mixed-Coloring algorithm in the following.

V-A Design of Query Vectors

We keep the high-level design of query vectors as in the noiseless setting. This means that we still use a $d$ -left regular bipartite graph with $n$ left nodes and $M$ right nodes to represent the association between the coordinates and the bins. However, we change the design of query vectors within each bin, and in particular, we design three types of query vectors. The first type, called binary indexing vectors, encodes the location information using binary representations with $\lceil\log_{2}(n)\rceil$ bits (as opposed to using the relative phases in the noiseless case). The second type is called verification vectors, which are used to verify the singleton balls (or equivalently, non-zero elements) found by the binary indexing vectors. We robustify the indexing process by replacing the ratio test query vectors with these two types of query vectors. A similar approach is considered in [31] for compressive phase retrieval. The third type of query vectors is used for consecutive summation check, which finds consistent sets of measurements, and robustifies the summation check step in the noiseless case.

We now provide details of the design. Let $\boldsymbol{H}$ denote the biadjacency matrix of the bipartite graph. For a particular bin (we omit the label of the bin for simplicity), let $\boldsymbol{h}\in\{0,1\}^{n}$ denote the association between this bin and the coordinates. We design $P=\Theta(\log^{2}(n))$ query vectors $\boldsymbol{x}_{i}\in\mathbb{R}^{n}$ , $i\in[P]$ for this bin as follows:

[TABLE]

where $\boldsymbol{B}\in\{0,1\}^{P_{1}\times n}$ , $\boldsymbol{V}\in\{1,-1\}^{P_{2}\times n}$ , and $\boldsymbol{C}\in\mathbb{Z}^{P_{3}\times n}$ correspond to the three types of new query vectors, meaning that they are used for binary indexing, verification, and consecutive summation check, respectively. The matrix $\boldsymbol{B}$ has $P_{1}=\lceil\log_{2}(n)\rceil$ rows, and the $i$ -th column of $\boldsymbol{B}$ is the binary representation of integer $i-1$ . The matrix $\boldsymbol{V}$ has $P_{2}=\Theta(\log(n))$ rows and consists of i.i.d. Rademacher entries, i.e., the entries of $\boldsymbol{V}$ are equally likely to be either $1$ or $-1$ . The matrix $\boldsymbol{C}$ contains $P_{3}={P_{1}+P_{2}\choose 2}$ rows, and the rows of $\boldsymbol{C}$ are indexed by pairs $(j,k)$ , $1\leq j<k\leq P_{1}+P_{2}$ . Let $\boldsymbol{D}=[\boldsymbol{B}^{\rm T}~{}\boldsymbol{V}^{\rm T}]^{\rm T}$ be a collection of the first two matrices. The row of $\boldsymbol{C}$ indexed by $(j,k)$ (denoted by $\boldsymbol{c}_{(j,k)}^{\rm T}$ ) is the summation of the $j$ -th and the $k$ -th row of $\boldsymbol{D}$ , i.e., $\boldsymbol{c}_{(j,k)}^{\rm T}=\boldsymbol{d}_{j}^{\rm T}+\boldsymbol{d}_{k}^{\rm T}$ . Here, we give a simple example with $n=4$ , $P_{1}=2$ , $P_{2}=2$ , and $P_{3}=6$ in Figure 6.
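The three matrices can be built explicitly for the small example with $n=4$ , $P_{1}=2$ , $P_{2}=2$ , and $P_{3}=6$ . The sketch below is illustrative only: the bit order of the binary representation (most significant bit in the first row), the random seed, and the bin membership vector `h` are assumptions.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
n = 4
P1 = (n - 1).bit_length()  # equals ceil(log2(n)) for n >= 2
P2 = 2                     # Theta(log n) in general; fixed here for the example

# Binary indexing matrix: column i (1-based) is the binary form of i - 1.
# Assumption: most significant bit in the first row.
B = np.array([[(j >> b) & 1 for j in range(n)] for b in reversed(range(P1))])

# Verification matrix with i.i.d. Rademacher (+1 / -1) entries.
V = rng.choice([1, -1], size=(P2, n))

# Summation-check matrix: one row d_j + d_k for every pair j < k of rows of D.
D = np.vstack([B, V])
C = np.array([D[j] + D[k] for j, k in combinations(range(P1 + P2), 2)])

# Query vectors of one bin: every row restricted to the bin's coordinates.
h = np.array([1, 0, 1, 1])  # hypothetical bin membership
X = np.vstack([B, V, C]) * h
```

Here `X` has $P_{1}+P_{2}+P_{3}=10$ rows, and each column of `B` decodes back to its own 0-based column index.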

V-B Decoding Algorithm

We now describe the decoding part of the Robust Mixed-Coloring algorithm. As mentioned, we use an EM-based noise reduction scheme to find the noiseless measurements of the query vectors, and we also use robustified versions of the summation check and the indexing process. Other parts of the algorithm are kept the same. We elaborate on the details in the following.

Noise Reduction: Due to the presence of noise, the first step that we need to take is a noise reduction operation. More specifically, we use each query vector $N=\Theta(\operatorname{polylog}(n))$ times, repeatedly. According to our model, in the presence of Gaussian noise, if $\boldsymbol{x}_{i}^{\rm T}\boldsymbol{\beta}^{(1)}=\boldsymbol{x}_{i}^{\rm T}\boldsymbol{\beta}^{(2)}$ , the $N$ measurements are i.i.d. Gaussian distributed; otherwise the $N$ measurements are independently distributed as a mixture of two equally weighted Gaussian random variables. Therefore, the problem becomes a standard estimation problem for a one-dimensional Gaussian mixture distribution. We propose an EM algorithm with an initialization step using the method of moments to estimate the two centers of the mixture. The performance of our proposed EM algorithm is characterized by Theorem 3, proved in Appendix E.

Theorem 3.

Suppose that $\Delta/\sigma\geq\frac{4}{\sqrt{3}}$ . Then, by using $N=\Theta(\operatorname{polylog}(n))$ measurements, the proposed EM algorithm, with initialization via method of moments, can recover the exact value of the two centers of the mixture of Gaussian distributions with probability at least $1-\mathcal{O}(1/{\rm poly}(n))$ . (Note that in our problem, the centers take quantized values and the quantization step is known to the decoder, so the estimation can take the exact value of the true centers.)

In addition, we can see that since each query vector in each bin is repeated $N=\Theta(\operatorname{polylog}(n))$ times, and there are $P=\Theta(\log^{2}(n))$ query vectors in each bin, the total number of measurements we get for each bin is $NP=\Theta(\operatorname{polylog}(n))$ . Since there are $\Theta(K)$ bins, the total number of measurements of the Robust Mixed-Coloring algorithm is $\Theta(K\operatorname{polylog}(n))$ .
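The noise reduction step can be sketched as a one-dimensional EM routine. This is a minimal sketch under stated assumptions, not the procedure analyzed in Appendix E: it assumes the noise level $\sigma$ is known, the two mixing weights are exactly $1/2$, a fixed number of iterations is run, and the final estimates are snapped to multiples of $\Delta$. The function name `em_two_centers` is hypothetical.

```python
import math
import random


def em_two_centers(samples, sigma, delta, iters=50):
    """Estimate the two centers of an equal-weight 1-D Gaussian mixture."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((y - mean) ** 2 for y in samples) / n
    # Method of moments: var = sigma^2 + ((mu2 - mu1) / 2)^2.
    half_gap = math.sqrt(max(var - sigma ** 2, 0.0))
    mu1, mu2 = mean - half_gap, mean + half_gap
    for _ in range(iters):
        # E-step: posterior probability that each sample came from center 1.
        resp = []
        for y in samples:
            z = ((y - mu1) ** 2 - (y - mu2) ** 2) / (2.0 * sigma ** 2)
            z = max(-700.0, min(700.0, z))  # keep exp() from overflowing
            resp.append(1.0 / (1.0 + math.exp(z)))
        w1 = sum(resp)
        w2 = n - w1
        # M-step: responsibility-weighted means (skipped if a side is empty).
        if w1 > 0:
            mu1 = sum(r * y for r, y in zip(resp, samples)) / w1
        if w2 > 0:
            mu2 = sum((1.0 - r) * y for r, y in zip(resp, samples)) / w2
    # Snap both estimates to the known quantization grid.
    snapped = [round(mu / delta) * delta for mu in (mu1, mu2)]
    return tuple(sorted(snapped))


# Synthetic data with centers 2 and 5, sigma = 0.4, Delta = 1.
rng = random.Random(1)
data = [rng.choice((2.0, 5.0)) + rng.gauss(0.0, 0.4) for _ in range(2000)]
centers = em_two_centers(data, sigma=0.4, delta=1.0)
```

The synthetic example uses $\Delta/\sigma=2.5$, which satisfies the condition $\Delta/\sigma\geq 4/\sqrt{3}$ of Theorem 3.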

Consecutive Summation Check: After the noise reduction operations, for each query vector $\boldsymbol{x}_{i}$ , $i\in[P]$ , we get at most two “centers” $\{y_{i,1},y_{i,2}\}$ (called denoised measurements), which correspond to the inner product of the query vector and the parameter vectors in the noiseless case. However, we do not know the correspondence between the denoised measurements and the two parameter vectors. This means that we can have either $(y_{i,1},y_{i,2})=(\boldsymbol{x}_{i}^{\rm T}\boldsymbol{\beta}^{(1)},\boldsymbol{x}_{i}^{\rm T}\boldsymbol{\beta}^{(2)})$ or $(y_{i,1},y_{i,2})=(\boldsymbol{x}_{i}^{\rm T}\boldsymbol{\beta}^{(2)},\boldsymbol{x}_{i}^{\rm T}\boldsymbol{\beta}^{(1)})$ . Therefore, we need to use the consecutive summation check method to find the denoised measurements which are generated by the same parameter vector.

We illustrate the consecutive summation check process using a simple example in Figure 7. Assume that we have three query vectors $\boldsymbol{x}_{1}$ , $\boldsymbol{x}_{2}$ , $\boldsymbol{x}_{3}$ , and two summation check query vectors $\boldsymbol{x}_{1}+\boldsymbol{x}_{2}$ and $\boldsymbol{x}_{2}+\boldsymbol{x}_{3}$ . Suppose that the denoised measurements that we get for $\boldsymbol{x}_{i}$ , $i=1,2,3$ are $(y_{1,1},y_{1,2})=(1,5)$ , $(y_{2,1},y_{2,2})=(2,4)$ , and $(y_{3,1},y_{3,2})=(2,3)$ , and the denoised measurements for the summation check query vectors are $(y_{(1,2),1},y_{(1,2),2})=(5,7)$ and $(y_{(2,3),1},y_{(2,3),2})=(5,6)$ . By matching summations, one can easily find that the only possible case that we can observe these denoised measurements is that $(y_{1,1},y_{2,2},y_{3,1})$ and $(y_{1,2},y_{2,1},y_{3,2})$ are generated by the same parameter vector (we call them consistent sets), respectively, as shown in different colors in Figure 7. In our algorithm, we need to conduct consecutive summation check on all the denoised indexing and verification measurements, using the denoised summation check measurements. We also mention that the reason that we need summations of all the ${P_{1}+P_{2}\choose 2}$ pairs of the first $P_{1}+P_{2}$ query vectors is that we might have the two denoised measurements taking the same value, i.e., $\boldsymbol{x}_{i}^{\rm T}\boldsymbol{\beta}^{(1)}=\boldsymbol{x}_{i}^{\rm T}\boldsymbol{\beta}^{(2)}$ , and then we have to conduct summation check on two query vectors which are not adjacent. We provide the precise procedures of consecutive summation check in Algorithm 2.
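The matching in the three-query example can be reproduced with a short sketch. It only handles the simple chain of adjacent query vectors and raises an error when both orderings fit or neither fits (for instance when the two denoised values of a query coincide), which is the situation the all-pairs design is meant to cover. The function name `chain_consistent_sets` is hypothetical, and Algorithm 2 is not reproduced here.

```python
def chain_consistent_sets(pairs, sums):
    """Split denoised pairs into two consistent sets along a chain.

    pairs[i] holds the two denoised values for query x_{i+1};
    sums[i] holds the two denoised values for query x_{i+1} + x_{i+2}.
    """
    first, second = [pairs[0][0]], [pairs[0][1]]
    for (b1, b2), s in zip(pairs[1:], sums):
        target = sorted(s)
        straight = sorted((first[-1] + b1, second[-1] + b2)) == target
        swapped = sorted((first[-1] + b2, second[-1] + b1)) == target
        if straight == swapped:
            # Either both orderings fit or neither does.
            raise ValueError("ambiguous or inconsistent summation check")
        if swapped:
            b1, b2 = b2, b1
        first.append(b1)
        second.append(b2)
    return first, second


# Values from the example: (1,5), (2,4), (2,3) with sums (5,7) and (5,6).
set_a, set_b = chain_consistent_sets(
    pairs=[(1, 5), (2, 4), (2, 3)],
    sums=[(5, 7), (5, 6)],
)
```

Here `set_a` collects $(y_{1,1},y_{2,2},y_{3,1})$ and `set_b` collects $(y_{1,2},y_{2,1},y_{3,2})$, the two consistent sets named in the text.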

Indexing: We conduct the indexing process on the consistent sets. The purpose of the indexing process is to check whether there is a single non-zero element associated with a set of consistent measurements (i.e., whether these measurements form a singleton), and find the location and value of the non-zero element. Recall that after the swapping procedures in Algorithm 2, we obtain two consistent sets of denoised measurements $(y_{1,1},\ldots,y_{P_{1}+P_{2},1})$ and $(y_{1,2},\ldots,y_{P_{1}+P_{2},2})$ . Without loss of generality, we omit the second subscript and use $(y_{1},\ldots,y_{P_{1}+P_{2}})$ to denote one of the consistent sets of denoised measurements.

We check the first $P_{1}$ denoised measurements, which correspond to the binary indexing query vectors. We can see that for the consistent set to be a singleton, it is necessary that all the non-zero denoised binary indexing measurements take the same value in $\mathbb{D}$ , say $a\Delta$ . The only possible location index $j$ of the non-zero element satisfies the fact that integer $j-1$ has binary representation $\{\frac{1}{a\Delta}y_{i}\}_{i=1}^{P_{1}}\in\{0,1\}^{P_{1}}$ . For instance, in the example in Figure 6, suppose that we find a consistent set of measurements generated by $\boldsymbol{\beta}^{(1)}$ , and the quantization step size $\Delta=1$ , i.e., the non-zero elements take integer values. Assume that we observe the denoised consistent measurements $(\boldsymbol{b}_{1}^{\rm T}{\rm diag}(\boldsymbol{h})\boldsymbol{\beta}^{(1)},\boldsymbol{b}_{2}^{\rm T}{\rm diag}(\boldsymbol{h})\boldsymbol{\beta}^{(1)})=(2,2)$ . Then, it is possible that this is a singleton, and $\beta_{4}^{(1)}=2$ is the only non-zero element associated with these consistent measurements.

However, the procedure above is not enough to guarantee that the consistent measurements form a singleton. We continue the example in Figure 6. Suppose that the bipartite graph gives us $\boldsymbol{h}=[0~{}1~{}1~{}1]^{\rm T}$ . Then, when we observe measurements $(\boldsymbol{b}_{1}^{\rm T}{\rm diag}(\boldsymbol{h})\boldsymbol{\beta}^{(1)},\boldsymbol{b}_{2}^{\rm T}{\rm diag}(\boldsymbol{h})\boldsymbol{\beta}^{(1)})=(2,2)$ , we can have either ${\rm diag}(\boldsymbol{h})\boldsymbol{\beta}^{(1)}=[0~{}0~{}0~{}2]^{\rm T}$ or ${\rm diag}(\boldsymbol{h})\boldsymbol{\beta}^{(1)}=[0~{}2~{}2~{}0]^{\rm T}$ ; and in the latter case, this set of measurements does not form a singleton any more. To verify that this consistent set is a singleton, we need to use the next $P_{2}$ denoised verification measurements. Recall that for the verification query vectors, we design a Rademacher matrix $\boldsymbol{V}\in\{-1,1\}^{P_{2}\times n}$ with elements $V_{i,j}$ , $i\in[P_{2}],j\in[n]$ , and use the rows of $\boldsymbol{V}{\rm diag}(\boldsymbol{h})$ as query vectors. Here, we make the following claim on the singleton verification procedure.

Lemma 1.

Suppose that all the denoised binary indexing measurements $y_{1},\ldots,y_{P_{1}}$ take value in $\{0,a\Delta\}$ , and the sequence $\{\frac{1}{a\Delta}y_{i}\}_{i=1}^{P_{1}}$ forms the binary representation of integer $j-1$ . Then, if the verification measurements satisfy $y_{i}=a\Delta V_{i-P_{1},j}$ for all $i=P_{1}+1,\ldots,P_{1}+P_{2}$ , and $P_{2}=\Theta(\log(n))$ , with probability $1-\mathcal{O}(1/\operatorname{poly}(n))$ , this consistent set is indeed a singleton with the non-zero element located at the $j$ -th coordinate and taking value $a\Delta$ .

This result is a corollary of the Johnson-Lindenstrauss Lemma [44], and we provide the proof in Appendix D. If the denoised measurements pass the verification in Lemma 1, we know that with high probability, the consistent set of measurements is indeed a singleton, and we also obtain the location and value of the non-zero element. We provide the precise procedures of the indexing algorithm in Algorithm 3.
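The indexing and verification test of Lemma 1 can be sketched as follows. This is an illustration rather than Algorithm 3: the function name `decode_singleton`, the most-significant-bit-first reading of the index measurements, the tolerance argument, and the handling of the all-zero index pattern (where the value is read from the first verification measurement) are assumptions of this example. The matrix `V_demo` is a fixed, hand-picked sign matrix so the outcome is deterministic.

```python
def decode_singleton(y_index, y_verify, V, tol=1e-9):
    """Return (j, value) with 1-indexed location j, or None if the test fails."""
    nonzero = [y for y in y_index if abs(y) > tol]
    if nonzero:
        value = nonzero[0]
        # All non-zero index measurements must share one value.
        if any(abs(y - value) > tol for y in nonzero):
            return None
    else:
        # Index bits are all zero, so the only candidate is j = 1;
        # read the value from a verification measurement (assumption).
        value = y_verify[0] * V[0][0]
        if abs(value) <= tol:
            return None
    bits = "".join("1" if abs(y) > tol else "0" for y in y_index)
    j = int(bits, 2) + 1  # bits encode j - 1, most significant bit first
    if j > len(V[0]):
        return None
    # Verification: each measurement must equal value * V[i][j-1].
    for y, row in zip(y_verify, V):
        if abs(y - value * row[j - 1]) > tol:
            return None
    return j, value


V_demo = [[1, -1, 1, -1], [1, 1, -1, 1]]
# diag(h) beta = [0, 0, 0, 2]: a true singleton at coordinate 4.
single = decode_singleton([2, 2], [-2, 2], V_demo)
# diag(h) beta = [0, 2, 2, 0]: same index measurements, not a singleton.
double = decode_singleton([2, 2], [0, 0], V_demo)
```

Both inputs share the index measurements $(2,2)$ from the Figure 6 example; only the verification measurements separate the singleton from the two-element case.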

So far, we have demonstrated how we robustify the summation check and indexing process. Once these two parts are robustified, other parts of the algorithm, such as finding giant components, guess-and-check, and peeling-style iterative decoding can proceed as in the noiseless case. We relegate the analysis of Robust Mixed-Coloring algorithm to Appendix C. Again, there are a few design parameters in the Robust Mixed-Coloring algorithm, and we summarize the choices of these parameters in Table III for a particular target error fraction $p^{*}$ .

Finally, we point out that extending the Robust Mixed-Coloring algorithm to cases where $L>2$ is an important future direction. Although the summation check technique does not provably work in the noisy setting when $L>2$ , we believe that using similar but more sophisticated design, we may still be able to obtain consistent sets of measurements.

VI Experimental Results

In this section, we test the sample and time complexities of the Mixed-Coloring algorithm in both noiseless and noisy cases to verify our theoretical results. All simulations are done on a laptop with 2.8 GHz Intel Core i7 CPU and 16 GB memory using Python.

We first investigate the sample complexity of Mixed-Coloring algorithm in the noiseless case. The goal of this experiment is to show that in numerical experiments, the number of measurements that we need to successfully recover the parameter vectors matches the predictions of the density evolution analysis. We use the optimal parameters $(d,R,V)$ from numerical calculations of the density evolution, presented in Table II. We generate instances with different number of measurements $m$ by choosing different number of bins $M$ . Recall that $m=(2R+V)M$ , and thus varying the number of bins is equivalent to varying the total number of measurements. The parameter vectors that we use have equal sparsity, i.e., $K_{\ell}=\frac{1}{L}K$ , and the mixing weights are equal for all the parameter vectors, i.e., $q_{\ell}=\frac{1}{L}$ . The supports of the parameter vectors are chosen uniformly at random, and the values of the non-zero elements are generated from Gaussian distribution. We choose a few pairs of $L$ and $K$ , increase the total number of measurements, and record the empirical success probability and running time averaged over $100$ trials. Here, we use a sufficiently small $p^{*}$ so that the success event is equivalent to recovery of all the non-zero elements. The results are shown in Figure 8(a). The phase transition occurs at some $C=m/K$ that matches the values in Table I, predicted by our theory. More specifically, when $L=2$ , $L=3$ , and $L=4$ , we need about $33K$ , $38K$ , and $40K$ measurements for successful recovery, respectively.

We also test the time complexity of our algorithm in the noiseless case. We use the design parameters that can guarantee successful recovery, as we find in the experiment on sample complexity. More specifically, for $L=2$ , we choose $(d,R,V)=(15,3,3)$ , and $m=34.2K$ (i.e., $M=3.8K$ ); for $L=3$ , we choose $(d,R,V)=(15,5,5)$ , and $m=39K$ (i.e., $M=2.6K$ ); and for $L=4$ , we choose $(d,R,V)=(13,8,8)$ , and $m=43.2K$ (i.e., $M=1.8K$ ). As shown in Figure 8(b), the running time is linear in $K$ and does not depend on $n$ .
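The quoted values of $m$ follow from $m=(2R+V)M$. The short check below recomputes them per non-zero element; the dictionary layout and the function name are illustrative only.

```python
def measurements_per_nonzero(R, V, bins_per_nonzero):
    """Return m / K given R, V and M / K, using m = (2R + V) M."""
    return (2 * R + V) * bins_per_nonzero


# L -> (d, R, V, M / K) as listed in the time-complexity experiment.
settings = {2: (15, 3, 3, 3.8), 3: (15, 5, 5, 2.6), 4: (13, 8, 8, 1.8)}
m_over_k = {
    L: measurements_per_nonzero(R, V, ratio)
    for L, (d, R, V, ratio) in settings.items()
}
```

Up to floating-point rounding, this gives $34.2$, $39$, and $43.2$ for $L=2,3,4$, each slightly above the corresponding phase-transition point reported in the sample-complexity experiment.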

Similar experiments are performed for the noisy case using the Robust Mixed-Coloring algorithm, under the quantization assumption. We still focus on the case where the two parameter vectors appear equally likely and have the same sparsity. We use quantization step size $\Delta=1$ and the quantized alphabet $\mathbb{D}=\{\pm 1,\pm 2,\ldots,\pm 5\}$ , and the values of the non-zero elements are chosen uniformly at random from $\mathbb{D}$ . Figure 9(a) shows the minimum number of queries $m$ required for 100 consecutive successes, for different $n$ and $K$ . We observe that the sample complexity is linear in $K$ and sublinear in $n$ . The running time exhibits a similar behavior, as shown in Figure 9(b). Both observations agree with the prediction of our theory.

We also compare the Mixed-Coloring algorithm with a state-of-the-art EM-style algorithm (equivalent to alternating minimization in the noiseless setting) from [19]. These comparisons are not entirely fair, since our algorithm is based on carefully designed query vectors, while the algorithm in [19] uses a random design, i.e., the entries of $\boldsymbol{x}_{i}$ ’s are i.i.d. Gaussian. However, this is exactly where the intellectual value of our work lies: we expose the gains available by careful design. We consider four test cases with $(L,n,K)=(2,100,20),(2,500,50),(2,100,100),(2,500,500)$ , with the first two cases being sparse problems and the last two being relatively dense problems. We find the minimum number of queries that leads to a 100% success rate in 100 trials, and the average running time. For the Mixed-Coloring algorithm, we use $d=15$ , $R=V=3$ and $M=3.8K$ . The parameters of the EM-style algorithm are chosen as suggested in the original paper [19]. As shown in Table IV, in both sparse and dense problems, our Mixed-Coloring algorithm is several orders of magnitude faster. As for the sample complexity, our algorithm requires a smaller number of samples in the sparse cases, while in dense problems, the sample complexity of our algorithm is within a constant factor (about 3) of that of the alternating minimization algorithm. For the noisy setting, our algorithm is most powerful in the high dimensional setting, i.e., large $n$ , due to the $\text{polylog}(n)$ factors. However, in this setting, it takes a prohibitively long time for state-of-the-art algorithms such as [18] to converge, and thus, we do not present the comparison in the noisy setting.

We further test the Robust Mixed-Coloring algorithm when the quantization assumption is violated. For any $\beta\in\mathbb{R}$ , we define $D(\beta)=\arg\min_{a\in\mathbb{D}}|a-\beta|\boldsymbol{1}(\beta\neq 0)$ , where $\boldsymbol{1}(\cdot)$ denotes the indicator function. This means that $D(\beta)$ is the element in $\mathbb{D}$ which is the closest one to $\beta$ , when $\beta\neq 0$ . For a vector $\boldsymbol{\beta}\in\mathbb{R}^{n}$ , we define $D(\boldsymbol{\beta})=\{D(\beta_{j})\}_{j=1}^{n}$ . We define the perturbation of a vector $\boldsymbol{\beta}$ as $\text{Perturbation}(\boldsymbol{\beta})=\max_{j\in[n]}|\beta_{j}-D(\beta_{j})|/\Delta$ .
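The quantizer $D(\cdot)$ and the perturbation measure can be written out directly. The sketch below assumes the alphabet $\mathbb{D}=\{\pm 1,\ldots,\pm 5\}$ with $\Delta=1$ used in the noisy experiments, reads the indicator in the definition as mapping $\beta=0$ to $0$, and breaks ties toward the first alphabet element; the function names are hypothetical.

```python
ALPHABET = [a for a in range(-5, 6) if a != 0]  # D = {+-1, ..., +-5}
DELTA = 1.0


def quantize(beta, alphabet=ALPHABET):
    """Map a non-zero scalar to its nearest alphabet element; keep 0 as 0."""
    if beta == 0:
        return 0
    return min(alphabet, key=lambda a: abs(a - beta))


def quantize_vector(beta_vec):
    return [quantize(b) for b in beta_vec]


def perturbation(beta_vec, delta=DELTA):
    """Largest distance of any entry from its quantized value, in units of delta."""
    return max(abs(b - quantize(b)) for b in beta_vec) / delta


example = [0.0, 2.3, -4.9, 0.0, 1.0]
quantized = quantize_vector(example)
level = perturbation(example)
```

For this example vector the quantized version is $[0,2,-5,0,1]$ and the perturbation is about $0.3$, set by the entry $2.3$.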

In this experiment, we generate sparse parameter vectors $\boldsymbol{\beta}^{(\ell)}$ , $\ell\in[L]$ with a total number of $K$ non-zero elements. These non-zero elements are generated randomly while keeping the perturbation of the parameter vectors under a certain level by adding bounded noise to the quantized non-zero elements. We record the probability of success for different numbers of bins $M$ and different perturbation levels. Here the success event is defined as recovery of $D(\boldsymbol{\beta}^{(\ell)})$ for all $\ell\in[L]$ . The result is shown in Figure 10. We see that the Robust Mixed-Coloring algorithm works without the quantization assumption as long as the perturbations are not too large.

VII Conclusions

We proposed the Mixed-Coloring algorithm as a query based learning algorithm for mixtures of sparse linear regressions. Our algorithm leverages the connection between modern coding theory and statistical inference. The design of the query vectors and the recovery algorithm are based on ideas from sparse graph codes. Our novel code design allows for both efficient demixing and parameter estimation. In the noiseless setting, for a constant number of sparse parameter vectors, our algorithm achieves the order-optimal sample and time complexities of $\Theta(K)$ . In the presence of Gaussian noise, for the problem with two parameter vectors (i.e., $L=2$ ), we show that the Robust Mixed-Coloring algorithm achieves near-optimal $\Theta(K\operatorname{polylog}(n))$ sample and time complexities. Our experiments justified the theoretical results, and we observe that the run-time of our algorithm can be orders of magnitude smaller than that of the state-of-the-art algorithms. In the noisy scenario, studying the Robust Mixed-Coloring algorithm with more than two parameter vectors and obtaining theoretical results for the continuous alphabet case are two important future directions.

Appendix A Proof of Theorem 1

A-A Proof Outline

We prove Theorem 1 in this section. The proof includes two major steps: (i) show that the expectation of the fraction of non-zero elements which are not recovered can be arbitrarily small; (ii) show that this fraction concentrates around its mean with high probability. The first part mainly uses density evolution techniques which are commonly used in coding theory, and the second part uses Doob’s martingale argument.

A-B Notation

We briefly recall the Mixed-Coloring algorithm in the noiseless case and declare the notation that we use for the rest of the proof.

Recall that the parameter vector $\boldsymbol{\beta}^{(\ell)}$ has $K_{\ell}$ non-zero elements. We call these $K_{\ell}$ non-zero elements balls in color $\ell$ . We design a $d$ -left regular bipartite graph with $n$ left nodes and $M$ right nodes, representing the $n$ coordinates and the $M$ bins, respectively. We denote the $i$ -th bin by $\mathcal{B}_{i}$ . We use the matrix $\boldsymbol{H}\in\{0,1\}^{M\times n}$ to represent the biadjacency matrix of the bipartite graph, i.e., $H_{i,j}=1$ if and only if the $i$ -th bin is associated with the $j$ -th coordinate. Recall that we design three query vectors in the form of (2), for the purpose of ratio test. The third query vector is the summation of the first two and is used for summation check. We repeat the first two query vectors $R$ times, respectively, and get $R$ type-I and $R$ type-II index measurements. We repeat the third query vector $V$ times and get $V$ verification measurements. For the $j$ -th verification measurement of the $i$ -th bin, we define a sub-bin $\mathcal{B}_{i}^{j}$ . If we can find one type-I index measurement and one type-II index measurement such that the summation of the two measurements is equal to the $j$ -th verification measurement, we know that these three measurements are generated by the same parameter vector, say $\boldsymbol{\beta}^{(\ell)}$ . The two index measurements are called a consistent pair. Then, we say that the sub-bin $\mathcal{B}_{i}^{j}$ has color $\ell$ . We define the color set $\mathcal{C}_{i}^{j}$ of $\mathcal{B}_{i}^{j}$ . If we can find a consistent pair corresponding to the $j$ -th verification measurement, we let $\mathcal{C}_{i}^{j}=\{\ell\}$ , otherwise $\mathcal{C}_{i}^{j}=\emptyset$ . We further define the color set of bin $\mathcal{B}_{i}$ as $\mathcal{C}_{i}=\cup_{j=1}^{V}\mathcal{C}_{i}^{j}$ .

A-C Number of Singleton Balls

In this section, we analyze the number of singleton balls in color $\ell$ found in the first stage of the algorithm. We can show that this number is concentrated around a constant fraction of $K_{\ell}$ with high probability.

Lemma 2.

Let $K_{s}^{(\ell)}$ be the number of singleton balls in color $\ell$ found in the first stage. Then, there exists a constant $q_{s}^{(\ell)}$ (recall that in our paper, constants are defined as quantities which do not depend on $n$ and $K$ ) such that for any constant $\delta>0$ ,

[TABLE]

Proof.

We first specify some terminology. For a bin $\mathcal{B}_{i}$ , we say that this bin has color $\ell$ when $\ell\in\mathcal{C}_{i}$ . One should notice that if there is more than one sub-bin in color $\ell$ in bin $\mathcal{B}_{i}$ , these sub-bins are identical. Therefore, we can say that a bin $\mathcal{B}_{i}$ contains $k$ balls in color $\ell$ , when $\mathcal{B}_{i}$ has at least one sub-bin $\mathcal{B}_{i}^{j}$ in color $\ell$ , and the sub-bin is associated with $k$ non-zero elements in $\boldsymbol{\beta}^{(\ell)}$ . Equivalently, the coded parameter vector $\tilde{\boldsymbol{\beta}}_{i}^{\ell}={\rm diag}(\boldsymbol{h}_{i})\boldsymbol{\beta}^{(\ell)}$ satisfies $|{\rm supp}(\tilde{\boldsymbol{\beta}}_{i}^{\ell})|=k$ , $k\geq 0$ .

First, we analyze the probability $Q_{\ell}$ that a particular bin $\mathcal{B}_{i}$ has color $\ell$ . According to our model, the measurements are generated independently, therefore, we have

[TABLE]

Then, we use $\xi_{k}^{(\ell)}$ to denote the probability of the event that a particular bin contains $k$ balls in color $\ell$ . Since each ball is associated with $d$ bins among the $M$ bins independently and uniformly at random, the number of balls in color $\ell$ that a bin contains is binomial distributed with parameters $K_{\ell}$ and $\frac{d}{M}$ , and we have

[TABLE]

In addition, we can use Poisson distribution to approximate the binomial distribution when $\lambda_{\ell}:=\frac{K_{\ell}d}{M}$ is a constant and $K_{\ell}$ approaches infinity. In the following analysis, we use the approximation

[TABLE]
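As a numerical sanity check on the Poisson approximation, the sketch below compares a binomial distribution with parameters $K_{\ell}$ and $\frac{d}{M}$ against a Poisson distribution with mean $\lambda_{\ell}=\frac{K_{\ell}d}{M}$. The specific sizes ($K_{\ell}=5000$, $d=15$, $M=38000$) are illustrative values chosen for this example, not settings taken from the proof.

```python
import math


def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return math.comb(n, k) * p ** k * (1.0 - p) ** (n - k)


def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


K_ELL, D_LEFT, M_BINS = 5000, 15, 38000  # illustrative sizes only
lam = K_ELL * D_LEFT / M_BINS
max_gap = max(
    abs(binomial_pmf(k, K_ELL, D_LEFT / M_BINS) - poisson_pmf(k, lam))
    for k in range(0, 11)
)
```

For these sizes the largest pointwise gap over $k=0,\ldots,10$ stays below $10^{-3}$, in line with the standard bound that the total variation distance between the two distributions is at most $K_{\ell}(d/M)^{2}$.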

Consider the bipartite graph representing the association between the balls in color $\ell$ and the $M$ bins. We know that there are $K_{\ell}d$ edges connected to the balls in color $\ell$ , and we use $\rho_{k}^{(\ell)}$ to denote the expected fraction of these $K_{\ell}d$ edges which are connected to a bin which contains $k$ balls in color $\ell$ , $k\geq 1$ . Then, we have

[TABLE]

and equivalently, $\rho_{k}^{(\ell)}$ is also the probability that an edge, which is chosen from the $K_{\ell}d$ edges uniformly at random, is connected to a bin $\mathcal{B}_{i}$ containing $k$ balls in color $\ell$ .

Let $q_{s}^{(\ell)}$ be the probability that a ball in color $\ell$ is a singleton ball. The event that this ball is a singleton ball is equivalent to the event that at least one of its $d$ associated bins contains one ball with color $\ell$ . Then, when $K_{\ell}$ approaches infinity, we have

[TABLE]

and this is because in the limit $K_{\ell}\rightarrow\infty$ , the correlations between the $d$ edges connected to a ball become negligible; this technique is often used in the theoretical analysis of density evolution in coding theory, and we use this type of asymptotic argument several times in the proofs. Let $K_{s}^{(\ell)}$ be the number of singleton balls in color $\ell$ , then we have $\mathbb{E}[K_{s}^{(\ell)}]=K_{\ell}q_{s}^{(\ell)}$ . Using the asymptotic argument and by Hoeffding’s inequality, we also have for any constant $\delta>0$ ,

[TABLE]

and this means that the number of singleton balls in color $\ell$ is highly concentrated around $K_{\ell}q_{s}^{(\ell)}$ . ∎

A-D Initial Fractions

We construct the graph $\mathcal{G}_{\ell}$ whose nodes correspond to the singleton balls in color $\ell$ found in the previous stage, and analyze the number of edges in $\mathcal{G}_{\ell}$ , which is equal to the number of strong doubletons in color $\ell$ . Here, for clarification, we emphasize that the graph $\mathcal{G}_{\ell}$ corresponds to a sub-graph with the same color in Figure 1, rather than the bipartite graph that we use to design the query vectors. Then, we can show that the number of strong doubletons is concentrated around a constant fraction of $M$ with high probability.

Lemma 3.

Let $M_{s}^{(\ell)}$ be the number of strong doubletons in color $\ell$ found in the second stage. Then, there exists a constant $\nu_{\ell}>0$ such that for any constant $\delta>0$ ,

[TABLE]

Proof.

We know that the expected number of doubletons in color $\ell$ is $M\xi_{2}^{(\ell)}$ . Then, we analyze the probability that a doubleton is a strong doubleton. Similar to the analysis in [30], for a particular ball in color $\ell$ , we let $B$ denote the event that this ball is in a singleton, and $D$ denote the event that this ball is in a doubleton. We have the conditional probability that a ball in a doubleton is also a singleton ball:

[TABLE]

Then we know the probability that a doubleton is a strong doubleton is $(q_{d}^{(\ell)})^{2}$ , and the expected number of strong doubletons in color $\ell$ is $M\xi_{2}^{(\ell)}(q_{d}^{(\ell)})^{2}$ . Let $\nu_{\ell}=\xi_{2}^{(\ell)}(q_{d}^{(\ell)})^{2}$ and $M_{s}^{(\ell)}$ be the number of edges in graph $\mathcal{G}_{\ell}$ . The expectation of $M_{s}^{(\ell)}$ is $\mathbb{E}[M_{s}^{(\ell)}]=M\nu_{\ell}$ , and according to Hoeffding’s inequality, we have for any $\delta>0$

[TABLE]

meaning that the number of edges is highly concentrated around $M\nu_{\ell}$ . ∎

Then, we get the following result on the size of the giant component of $\mathcal{G}_{\ell}$ , using the asymptotic behavior of the Erdos-Renyi random graphs.

Lemma 4.

Let $K^{(\ell)}_{G}$ be the size of the largest connected component (giant component) of $\mathcal{G}_{\ell}$ . If the parameters of the Mixed-Coloring algorithm satisfy

[TABLE]

then, for any constant $\delta>0$ , with probability $1-\mathcal{O}(1/K_{\ell})$ , the initial fraction of the balls in color $\ell$ which are recovered after the second stage satisfies

[TABLE]

where the constant $\zeta_{\ell}$ is the unique solution of the equation

[TABLE]

and other connected components in $\mathcal{G}_{\ell}$ are of sizes $\mathcal{O}(\log(K_{\ell}))$ .

Proof.

This result is a direct corollary of the asymptotic behavior of the Erdos-Renyi random graphs [45], and we only give a brief proof here. First, we condition on the number of singleton balls that we find in the first stage, i.e., $K_{s}^{(\ell)}$ and the number of edges in $\mathcal{G}_{\ell}$ , i.e., $M_{s}^{(\ell)}$ . By symmetry, we know that the $M_{s}^{(\ell)}$ edges are uniformly chosen from the ${K_{s}^{(\ell)}\choose 2}$ possible edges. Therefore, the graph $\mathcal{G}_{\ell}$ is an Erdos-Renyi random graph. According to the results on the giant component of Erdos-Renyi random graphs, we know that if the limit

[TABLE]

then with probability at least $1-\mathcal{O}(1/K_{s}^{(\ell)})$ , the size of the giant component of graph $\mathcal{G}_{\ell}$ is linear in $K_{s}^{(\ell)}$ , and other connected components have sizes $\mathcal{O}(\log(K_{s}^{(\ell)}))$ . By (5) and (6), we know that for any constant $\epsilon_{1}>0$ , there exists a constant $\alpha_{1}>0$ , such that, with probability at least $1-\mathcal{O}(\exp(-\alpha_{1}K_{\ell}))$ ,

[TABLE]

and that, for any constant $\epsilon_{2}>0$ , there exists a constant $\alpha_{2}>0$ , such that, with probability at least $1-\mathcal{O}(\exp(-\alpha_{2}M))$ ,

[TABLE]

We also know that when $K_{s}^{(\ell)}\in I_{K}$ and $M_{s}^{(\ell)}\in I_{M}$ happen, the limit $\theta$ approaches $\frac{2M\nu_{\ell}}{K_{\ell}q_{s}^{(\ell)}}$ . Let $A$ be the event that the size of the largest connected component (giant component) of $\mathcal{G}_{\ell}$ , i.e., $K_{G}^{(\ell)}$ satisfies (8), and other connected components in $\mathcal{G}_{\ell}$ are of sizes $\mathcal{O}(\log(K_{\ell}))$ . Then, according to the aforementioned property of Erdos-Renyi random graphs, conditioned on $K_{s}^{(\ell)}\in I_{K}$ and $M_{s}^{(\ell)}\in I_{M}$ , we have

[TABLE]

Then, we have

[TABLE]

which completes the proof. ∎
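The giant-component behavior used in this proof can be checked numerically. The sketch below relies on the standard Erdős-Rényi fact that, when the average degree $\theta$ (twice the number of edges divided by the number of nodes) exceeds $1$, the giant component contains a fraction $\zeta$ of the nodes with $\zeta=1-e^{-\theta\zeta}$. This is stated as a textbook result and is not claimed to be the exact form of the equation for $\zeta_{\ell}$ in Lemma 4. Edges are sampled with replacement and may include self-loops, which is an approximation to choosing edges uniformly without replacement; all names are hypothetical.

```python
import math
import random


def giant_fraction_theory(theta, iters=200):
    """Solve z = 1 - exp(-theta * z) by fixed-point iteration from z = 1."""
    z = 1.0
    for _ in range(iters):
        z = 1.0 - math.exp(-theta * z)
    return z


def giant_fraction_sim(num_nodes, num_edges, seed=0):
    """Largest-component fraction of a random graph, via union-find."""
    rng = random.Random(seed)
    parent = list(range(num_nodes))
    size = [1] * num_nodes

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]  # path halving
            u = parent[u]
        return u

    for _ in range(num_edges):
        ru, rv = find(rng.randrange(num_nodes)), find(rng.randrange(num_nodes))
        if ru != rv:
            if size[ru] < size[rv]:
                ru, rv = rv, ru
            parent[rv] = ru
            size[ru] += size[rv]
    return max(size[find(u)] for u in range(num_nodes)) / num_nodes


theta = 2.0
n_nodes = 20000
predicted = giant_fraction_theory(theta)
simulated = giant_fraction_sim(n_nodes, int(theta * n_nodes / 2))
```

With $\theta=2$ the simulated and predicted fractions should agree to within a few percent at this graph size.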

A-E Tree-like Assumption

By Lemma 4, we know that we can recover a constant fraction of the non-zero elements with probability $1-\mathcal{O}(1/K_{\ell})$ . Then, we study the iterative decoding process. The analysis is based on density evolution, which is a common and powerful technique in coding theory. Similar to the density evolution analysis of many modern error-correcting codes [46], our derivation of density evolution is based on a tree-like assumption. Here, we state the tree-like assumption first and provide the results on the probability that the tree-like assumption holds.

As we have mentioned, the association between the balls in color $\ell$ (non-zero elements in $\boldsymbol{\beta}^{(\ell)}$ ) and the bins can be represented by a $d$ -left regular bipartite graph. We label the edges by an ordered pair of a ball $b$ and a bin $\mathcal{B}$ , denoted by $e=(b,\mathcal{B})$ . We define the level- $C^{*}$ neighborhood of $e$ , denoted by $N_{e}^{C^{*}}$ as the subgraph of all the edges and nodes on paths with length less than or equal to $C^{*}$ , which start from $b$ and the first edge of the paths are not $e$ [30]. We have the following results on the probability that $N_{e}^{C^{*}}$ is a tree, or equivalently, cycle-free, for a constant $C^{*}$ .

Lemma 5.

[30] For a fixed constant $C^{*}$ , $N_{e}^{2C^{*}}$ is a tree with probability at least $1-\mathcal{O}(\log(K_{\ell})^{C^{*}}/K_{\ell})$ .

We conduct the density evolution analysis conditioned on the event that $N_{e}^{2C^{*}}$ is a tree for an edge $e$ which is chosen from the $K_{\ell}d$ edges uniformly at random. Then, we take the complementary event into consideration and complete the analysis.

A-F Density Evolution

Recall that in the first iteration, we find all the singletons, and in the second iteration, we find the strong doubletons and form the giant component. Let $p_{j}^{(\ell)}$ be the probability that at the $j$-th iteration of the learning algorithm, a ball in color $\ell$ , which is chosen from the $K_{\ell}$ balls uniformly at random, is not recovered, $j\geq 2$ . Here, $p_{2}^{(\ell)}$ corresponds to the probability that after the second iteration, a randomly chosen ball in color $\ell$ is not in the giant component. According to the previous section, we know that by choosing parameters which satisfy (7), we have $p_{2}^{(\ell)}=\frac{K_{\ell}-K_{G}^{(\ell)}}{K_{\ell}}=\Theta(1)$ with probability $1-\mathcal{O}(1/K_{\ell})$ . Now we analyze the relationship between $p_{j+1}^{(\ell)}$ and $p_{j}^{(\ell)}$ for $j\geq 2$ .

Consider the iterative decoding process as a message passing process. First, we know that at iteration $j+1$ , a ball in color $\ell$ passes a message to a bin through an edge claiming that it is colored, if and only if at least one of the other $d-1$ neighborhood bins contains a resolvable multiton in color $\ell$ . Second, a sub-bin in color $\ell$ becomes a resolvable multiton if and only if all the other balls in this sub-bin are colored. This message passing process is illustrated in Figure 11. Under the tree-like assumption, the messages passed among the balls and bins are independent, and we have

[TABLE]

which gives us

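$$p_{j+1}^{(\ell)}=\left(1-Q_{\ell}\left(e^{-\lambda_{\ell}p_{j}^{(\ell)}}-e^{-\lambda_{\ell}}\right)\right)^{d-1}.$$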

As we can see, the major difference between the density evolution of the Mixed-Coloring algorithm and the PhaseCode algorithm in [30] (for compressive phase retrieval via sparse-graph codes) is that there is a constant probability $Q_{\ell}$ that a bin has a sub-bin in color $\ell$ .

Next, we show that after a constant number of iterations, $p_{j}^{(\ell)}$ can be arbitrarily small.

Lemma 6.

If we choose parameters satisfying

[TABLE]

then for any constant $\delta>0$ , there exists a constant $T$ , such that $p_{T}^{(\ell)}<\delta$ .

Proof.

Let $f_{\ell}(t)=(1-Q_{\ell}(e^{-\lambda_{\ell}t}-e^{-\lambda_{\ell}}))^{d-1}$ , then we have $p_{j+1}^{(\ell)}=f_{\ell}(p_{j}^{(\ell)})$ . It is easy to see that $f_{\ell}(1)=1$ , $f_{\ell}(0)>0$ , and $f_{\ell}$ is a monotonically increasing function. We also have

[TABLE]

We know that if there is

[TABLE]

then there exists at least one fixed point $t\in(0,1)$ such that $f_{\ell}(t)=t$ . We use $p_{\ell}^{*}$ to represent the largest fixed point of $f_{\ell}(t)$ in $(0,1)$ . Now we argue that the fixed point can be made arbitrarily small by choosing proper parameters. Suppose that for a certain set of parameters $\lambda_{\ell}$ and $d$ , the fixed point is $p_{\ell}^{*}$ , then if we keep $\lambda_{\ell}$ and increase $d$ to $\tilde{C}d$ , where $\tilde{C}>1$ is a constant, then we can see that the new fixed point is upper bounded by $(p_{\ell}^{*})^{\tilde{C}}$ , and in this way, the fixed point can be made an arbitrarily small constant. As shown in [30], as long as we can choose parameters to make the fixed point $p_{\ell}^{*}<\delta/2$ , then, there exists a constant number of iterations $T$ , depending on $\delta$ , such that $p_{T}^{(\ell)}<\delta$ .

Then, we investigate how the sample complexity depends on $p_{\ell}^{*}$ . First, since $p_{\ell}^{*}$ is a fixed point of the iteration (9), we have

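$$p_{\ell}^{*}=\left(1-Q_{\ell}\left(e^{-\lambda_{\ell}p_{\ell}^{*}}-e^{-\lambda_{\ell}}\right)\right)^{d-1}.$$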

Since $p_{\ell}^{*}$ is usually very small, we use the approximation $e^{-\lambda_{\ell}p_{\ell}^{*}}\approx 1$ , and thus we have

$$p_{\ell}^{*}\approx\left(1-Q_{\ell}\left(1-e^{-\lambda_{\ell}}\right)\right)^{d-1},$$

which gives us $d=\mathcal{O}(\log(1/p_{\ell}^{*}))$ . Further, since we keep $\lambda_{\ell}=\frac{K_{\ell}d}{M}$ as a constant, we know that $M=\mathcal{O}(\log(1/p_{\ell}^{*}))$ as a function of $p_{\ell}^{*}$ .

∎
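The recursion $p_{j+1}^{(\ell)}=f_{\ell}(p_{j}^{(\ell)})$ with $f_{\ell}(t)=(1-Q_{\ell}(e^{-\lambda_{\ell}t}-e^{-\lambda_{\ell}}))^{d-1}$ is easy to iterate numerically. In the sketch below, $d=15$ and $\lambda_{\ell}=\frac{K_{\ell}d}{M}$ with $K_{\ell}=K/2$ and $M=3.8K$ follow the $L=2$ experimental setting, while $Q_{\ell}=0.9$ and the starting value $p_{2}^{(\ell)}=0.5$ are purely hypothetical placeholders, since their actual values depend on quantities not computed here.

```python
import math


def density_evolution(p_start, q, lam, d, iters=10):
    """Iterate p <- (1 - q * (exp(-lam * p) - exp(-lam))) ** (d - 1)."""
    trajectory = [p_start]
    for _ in range(iters):
        t = trajectory[-1]
        trajectory.append(
            (1.0 - q * (math.exp(-lam * t) - math.exp(-lam))) ** (d - 1)
        )
    return trajectory


def small_fixed_point_approx(q, lam, d):
    """Approximate fixed point obtained by setting exp(-lam * p) to 1."""
    return (1.0 - q * (1.0 - math.exp(-lam))) ** (d - 1)


LAM = 0.5 * 15 / 3.8  # lambda = K_l * d / M with K_l = K / 2, M = 3.8 K
Q_ASSUMED = 0.9       # hypothetical value of Q_l
traj = density_evolution(p_start=0.5, q=Q_ASSUMED, lam=LAM, d=15)
approx = small_fixed_point_approx(Q_ASSUMED, LAM, d=15)
```

Under these assumed values the iterates drop below $10^{-6}$ within a few steps and settle close to the approximate fixed point, which illustrates why $d=\mathcal{O}(\log(1/p_{\ell}^{*}))$ suffices.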

Then, we can prove the following lemma showing that the number of uncolored balls in color $\ell$ is concentrated around $K_{\ell}p_{T}^{(\ell)}$ with high probability.

Lemma 7.

Let $Z_{\ell}$ be the number of uncolored balls in color $\ell$ after $T$ iterations. Then for any $\delta>0$, there exists a constant $c_{1}$ such that, conditioned on the event that $p_{2}^{(\ell)}=\Theta(1)$ and for $K_{\ell}$ large enough,

[TABLE]

The proof of Lemma 7 is the same as in [30], and uses a Doob martingale argument and Azuma’s concentration bound. Note also that the event that the tree-like assumption does not hold is already accounted for in (12). Combining Lemmas 4, 6, and 7, we have shown that for a specific $\ell\in[L]$, there exist proper parameters of the algorithm such that after a constant number of iterations, the Mixed-Coloring algorithm recovers an arbitrarily large fraction of the balls in color $\ell$ with probability $1-\mathcal{O}(1/K_{\ell})$. Since $L$ is a constant and $K_{\ell}=\Theta(K)$, the results above imply that for an arbitrarily small constant $p^{*}\in(0,1)$,

[TABLE]

Then, we turn to the first and the third properties in Theorem 1. According to our ratio test scheme, whenever we have a singleton, we find the exact location and value of the non-zero element, and thus our algorithm has no false discovery. As for the element-wise recovery, because we use a $d$-left regular random bipartite graph (each left node is connected to $d$ right nodes uniformly at random), the recovered $(1-p^{*})$ fraction of the support is also uniformly distributed on the support of $\boldsymbol{\beta}^{(\ell)}$. Thus, for each $j\in{\rm supp}(\boldsymbol{\beta}^{(\ell)})$, $\mathbb{P}\{\hat{\beta}^{(\ell)}_{j}=\beta^{(\ell)}_{j}\mid|{\rm supp}(\hat{\boldsymbol{\beta}}^{(\ell)})|\geq(1-p^{*})|{\rm supp}(\boldsymbol{\beta}^{(\ell)})|\}\geq 1-p^{*}$. Then, by the law of total probability, we have

[TABLE]

Thus, we have proved the three properties in Theorem 1.

A-G Time Complexity

In this section, we analyze the time complexity of the algorithm. First, note that there are $M=\Theta(K)$ bins and each bin has a constant number of sub-bins. Since refining the measurements of each bin takes $\Theta(1)$ operations, the time complexity of refining measurements is $\Theta(K)$. Next, to find all the singletons, we need to check all the colored sub-bins; since checking each sub-bin takes $\Theta(1)$ operations, the time complexity of this stage is $\Theta(K)$. In the third stage, we find all the strong doubletons. There are $\Theta(K)$ singleton balls, and each singleton ball is connected to $d$ bins. For each of these bins, we subtract the measurements contributed by the singleton ball from the refined measurements in the sub-bins, and run the ratio test to see whether the bin is a strong doubleton. Processing each bin therefore takes $\Theta(1)$ operations, and since $d$ is also a constant, the time complexity of finding strong doubletons is also $\Theta(K)$. We then obtain a graph with $\Theta(K)$ nodes and $\Theta(K)$ edges, corresponding to the singleton balls and strong doubletons, respectively. Using breadth-first search, the connected components can be found in $\Theta(K)$ time. In the last stage, we iteratively find the other uncolored balls. For each unprocessed sub-bin, since we do not know the color of the sub-bin, there are $L$ possible remaining measurements. Each time we find a new ball, we update at most $dV$ remaining measurements and run the ratio test. Therefore, coloring a new ball takes $\Theta(1)$ operations. Since there are $\Theta(K)$ uncolored balls after finding the giant components, the time complexity of the last stage is also $\Theta(K)$. Thus, we have shown that the time complexity of the Mixed-Coloring algorithm is $\Theta(K)$, which completes the proof of Theorem 1.
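The linear-time claim for the connected-component stage rests on standard breadth-first search. The sketch below illustrates that step only, under the assumption that singleton balls are indexed `0..num_nodes-1` and each strong doubleton is given as a pair of ball indices; the function name and the toy graph are hypothetical and not taken from the paper.

```python
from collections import deque


def connected_components(num_nodes, edges):
    """Return the connected components of an undirected graph via BFS.

    Each node and each edge is visited a constant number of times, so the
    running time is O(num_nodes + len(edges)).
    """
    adjacency = [[] for _ in range(num_nodes)]
    for u, v in edges:
        adjacency[u].append(v)
        adjacency[v].append(u)

    label = [-1] * num_nodes  # component index of each node, -1 = unvisited
    components = []
    for start in range(num_nodes):
        if label[start] != -1:
            continue
        label[start] = len(components)
        component = [start]
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adjacency[u]:
                if label[v] == -1:
                    label[v] = label[start]
                    component.append(v)
                    queue.append(v)
        components.append(component)
    return components


# Toy example: nodes are singleton balls, edges are strong doubletons.
toy_edges = [(0, 1), (1, 2), (3, 4)]
comps = connected_components(6, toy_edges)
giant = max(comps, key=len)  # largest component
print(comps, giant)
```

With $\Theta(K)$ nodes and $\Theta(K)$ edges, the bound $O(|V|+|E|)$ gives the $\Theta(K)$ cost used in the analysis.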

Appendix B Computing the Constants in the Sample Complexity

In this section, we give exact constants in the sample complexity results. For simplicity, we assume that $K_{\ell}=K/L$ and $q_{\ell}=1/L$ for all $\ell\in[L]$ . We let $c:=M/K$ , and thus we have $\lambda_{\ell}=\frac{K_{\ell}d}{M}=\frac{d}{Lc}$ . We analyze the minimum number of measurements that we need to reach a certain reliability target. More precisely, we set the maximum error floor to be $p_{\max}^{*}$ , and numerically calculate the error floor for different values of $d$ , $c$ , $R$ , and $V$ . Then, we minimize the number of total measurements, which is proportional to $(2R+V)c$ with the constraint that the error floor $p^{*}\leq p_{\max}^{*}$ . As we have shown in previous parts, the parameters should also satisfy (7) and (10). We know that if (7) is satisfied, when $K$ is large enough, there should be a giant component with size linear in $K$ for each color, where $\theta>1$ is a threshold that we can choose. Therefore, we select optimal parameters with three constraints, which are (10), (7), and $p^{*}\leq p_{\max}^{*}$ .

The results of the numerical calculation are shown in Table V. In these experiments, we set $p_{\max}^{*}=10^{-5}$ , $\theta=2$ , and we fix the left degree $d$ and choose different values of $c$ , $R$ , and $V$ to minimize the number of measurements with the three constraints. Then we compare the optimal number of measurements over different choices of $d$ and find the optimal $d$ . As we can see, to reach the same reliability level, for $L=2,3,4$ , the optimal number of measurements we need is $33.39K$ , $37.80K$ , and $40.32K$ , respectively. The number of measurements we need only increases slightly with $L$ , and the optimal $d$ is around 13 and 15.

Appendix C Proof of Theorem 2

In this section, we analyze the performance of the Robust Mixed-Coloring algorithm and prove Theorem 2. Recall that the overall structure of the Robust Mixed-Coloring algorithm is the same as its noiseless counterpart. If one can always perfectly find the consistent sets of measurements, as well as the correct location and value of the non-zero elements, then the recovery guarantee in the noisy setting is exactly the same as in the noiseless setting. In the noisy setting, finding the correct consistent sets of measurements and the correct location and value of the non-zero elements relies on the success of two events: 1) the EM-based algorithm always finds the correct denoised measurements, and 2) the verification procedure identifies all the singletons and does not misclassify other consistent sets as singletons.

We provide details below. We define two error events: $E_{\ell}^{1}$, the event that there exists at least one instance in which the EM algorithm does not find the correct denoised measurements, and $E_{\ell}^{2}$, the event that there exists at least one instance in which the verification query vectors misclassify a singleton as a non-singleton or vice versa. According to Theorem 3, the failure probability of each EM operation is $\mathcal{O}(1/\operatorname{poly}(n))$. According to the proof in Appendix A-G, there are $\Theta(K)$ bin-level operations during the algorithm, and processing each bin requires $\Theta(\log^{2}(n))$ EM operations, since there are $\Theta(\log^{2}(n))$ query vectors in each bin. Therefore, the total number of EM operations is $\Theta(K\log^{2}(n))$. By the union bound, we know that $\mathbb{P}\{E_{\ell}^{1}\}\leq\mathcal{O}(K\log^{2}(n)/\operatorname{poly}(n))=\mathcal{O}(1/\operatorname{poly}(n))$. According to Lemma 1, each verification has failure probability $\mathcal{O}(1/\operatorname{poly}(n))$, and a similar union bound argument gives $\mathbb{P}\{E_{\ell}^{2}\}\leq\mathcal{O}(1/\operatorname{poly}(n))$. When neither $E_{\ell}^{1}$ nor $E_{\ell}^{2}$ happens, the algorithm always finds the correct location and value of the recovered elements, and in this case there is no false discovery. By the union bound, we know that $\mathbb{P}\{\overline{E_{\ell}^{1}}\cap\overline{E_{\ell}^{2}}\}\geq 1-\mathcal{O}(1/\operatorname{poly}(n))$. Therefore, the probability that there is no false discovery is $1-\mathcal{O}(1/\operatorname{poly}(n))$, which proves the first property in the theorem.
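The union bound step for $E_{\ell}^{1}$ can be written out as a short worked inequality. This is a sketch under labeled assumptions: the per-operation failure probability is taken to be at most $Cn^{-c}$ for hypothetical constants $C>0$ and $c>1$, the number of EM operations is at most $C'K\log^{2}(n)$ for a hypothetical constant $C'$, and $K\leq Ln$ because each of the $L$ parameter vectors has at most $n$ non-zero entries.

```latex
\mathbb{P}\{E_{\ell}^{1}\}
  \;\leq\; \sum_{\text{EM operations}} C n^{-c}
  \;\leq\; C\,C'\,K\log^{2}(n)\,n^{-c}
  \;\leq\; C\,C'\,L\,n^{1-c}\log^{2}(n)
  \;=\; \mathcal{O}\!\left(\frac{1}{\operatorname{poly}(n)}\right).
```

The last step uses that $L$ is a constant and that $\log^{2}(n)$ grows more slowly than any positive power of $n$, so the bound stays polynomially small whenever $c>1$.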

Then, we turn to the second property. We define the error event $E_{\ell}$ that fewer than a $1-p^{*}$ fraction of the $K_{\ell}$ non-zero elements of the parameter vector $\boldsymbol{\beta}^{(\ell)}$ are recovered by the algorithm. If neither $E_{\ell}^{1}$ nor $E_{\ell}^{2}$ happens, the analysis of the robustified algorithm is exactly the same as in the noiseless setting. Therefore, according to Theorem 1, we know that $\mathbb{P}\{E_{\ell}\mid\overline{E_{\ell}^{1}}\cap\overline{E_{\ell}^{2}}\}\leq\mathcal{O}(1/K_{\ell})$. Then, we can apply the law of total probability and get

[TABLE]

which proves the second property in the theorem. The third property in the theorem can be derived using the method in (14) in the proof of Theorem 1, and we omit the details here. Thus, we have proved the three properties in Theorem 2. The time complexity can be analyzed using the same method as in the noiseless case, provided in Appendix A-G. The only difference is that, the bin-level operation takes $\Theta(1)$ time in the noiseless setting, while in the noisy setting it takes $\Theta(\operatorname{polylog}(n))$ time. Therefore, the time complexity of the Robust Mixed-Coloring algorithm is $\Theta(K\operatorname{polylog}(n))$ .

Appendix D Proof of Lemma 1

We first provide a simplified interpretation of Lemma 1.

Lemma 8.

Let $\boldsymbol{V}\in\{-1,1\}^{P_{2}\times n}$ be a Rademacher matrix with $P_{2}=\Theta(\log(n))$. Denote the $j$-th column of $\boldsymbol{V}$ by $\boldsymbol{v}_{j}$. Suppose that $\boldsymbol{h}\in\{0,1\}^{n}$ and $\boldsymbol{\beta}\in\mathbb{D}^{n}$, where $\mathbb{D}=\{\pm\Delta,\pm 2\Delta,\ldots,\pm b\Delta\}$. Let $\boldsymbol{y}=\boldsymbol{V}{\rm diag}(\boldsymbol{h})\boldsymbol{\beta}$. Suppose that ${\rm diag}(\boldsymbol{h})\boldsymbol{\beta}\neq a\Delta\boldsymbol{e}_{j}$ for some $a\Delta\in\mathbb{D}$ and canonical basis vector $\boldsymbol{e}_{j}$. Then, with probability $1-\mathcal{O}(1/\operatorname{poly}(n))$, $\boldsymbol{y}\neq a\Delta\boldsymbol{v}_{j}$.

Here, $\boldsymbol{\beta}$ is the parameter vector that generates the consistent set of measurements, and $\boldsymbol{h}$ denotes the association between the bin and the coordinates. Define $\tilde{\boldsymbol{\beta}}={\rm diag}(\boldsymbol{h})\boldsymbol{\beta}\in\mathbb{D}^{n}$ , and we have $\boldsymbol{y}=\boldsymbol{V}\tilde{\boldsymbol{\beta}}$ . Our goal is to justify that, when $\tilde{\boldsymbol{\beta}}\neq a\Delta\boldsymbol{e}_{j}$ , with high probability, $\boldsymbol{y}\neq a\Delta\boldsymbol{v}_{j}$ .

Suppose that $\tilde{\boldsymbol{\beta}}\neq a\Delta\boldsymbol{e}_{j}$ but $\boldsymbol{y}=a\Delta\boldsymbol{v}_{j}$ . Then we have $\boldsymbol{V}(\tilde{\boldsymbol{\beta}}-a\Delta\boldsymbol{e}_{j})=\boldsymbol{0}$ . According to a corollary of the Johnson-Lindenstrauss Lemma [44] (one can also refer to Section 4 in [47]), we know that

[TABLE]

Therefore, with $P_{2}=\Theta(\log(n))$ verification query vectors, we can guarantee that with probability at least $1-\mathcal{O}(1/\operatorname{poly}(n))$ we do not identify $\tilde{\boldsymbol{\beta}}$ as $a\Delta\boldsymbol{e}_{j}$, which means that we do not misclassify a non-singleton as a singleton.
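The verification test analyzed above amounts to checking whether $\boldsymbol{y}=a\Delta\boldsymbol{v}_{j}$ holds in every row. The sketch below illustrates the check in the noise-free case, with all values expressed as integer multiples of $\Delta$ so that comparisons are exact. The helper names, the sizes `n` and `P2`, and the test vectors are hypothetical, and the entries of `V` are drawn as i.i.d. $\pm 1$ signs, which is the usual meaning of a Rademacher matrix.

```python
import random


def rademacher_matrix(rows, cols, rng):
    """Matrix with i.i.d. entries drawn uniformly from {-1, +1}."""
    return [[rng.choice((-1, 1)) for _ in range(cols)] for _ in range(rows)]


def measure(V, beta):
    """Noise-free verification measurements y = V @ beta (integer arithmetic)."""
    return [sum(v * b for v, b in zip(row, beta)) for row in V]


def passes_singleton_check(V, y, j, a):
    """True iff y equals a * v_j, i.e. the bin looks like the singleton a * e_j."""
    return all(y_r == a * row[j] for y_r, row in zip(y, V))


rng = random.Random(0)
n, P2 = 64, 24  # hypothetical sizes; P2 plays the role of Theta(log n)
V = rademacher_matrix(P2, n, rng)

# A true singleton: only coordinate 5 is non-zero, with value 3 (in units of Delta).
singleton = [0] * n
singleton[5] = 3

# A doubleton that shares coordinate 5 but has a second non-zero entry.
doubleton = [0] * n
doubleton[5] = 3
doubleton[9] = -2

ok_singleton = passes_singleton_check(V, measure(V, singleton), j=5, a=3)
ok_doubleton = passes_singleton_check(V, measure(V, doubleton), j=5, a=3)
print(ok_singleton, ok_doubleton)
```

In the doubleton case the residual $\boldsymbol{V}(\tilde{\boldsymbol{\beta}}-a\Delta\boldsymbol{e}_{j})$ is a non-zero multiple of a single $\pm 1$ column, so the check rejects it deterministically; for residuals with more non-zero entries the rejection holds only with high probability, which is what the lemma quantifies.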

Appendix E Proof of Theorem 3

In this section, we provide a method to estimate the parameters of a mixture of two Gaussian random variables, and give the theoretical analysis that proves Theorem 3. The estimation method is based on the EM algorithm with a method-of-moments initialization.

Recall the setting of Theorem 3. Let $z_{i}$ ’s be i.i.d. samples of Bernoulli $(\frac{1}{2})$ distribution, and $w_{i}$ ’s be i.i.d. samples of Gaussian distribution with mean zero and variance $\sigma^{2}$ , independently of $z_{i}$ ’s, $i\in[N]$ . Suppose that random variables $y_{i}$ ’s are generated in the following way:

$$y_{i}=(1-z_{i})\mu_{1}+z_{i}\mu_{2}+w_{i},\quad i\in[N].$$

Then, we can consider $y_{i}$ as a mixture of two Gaussian random variables with means $\mu_{1}$ and $\mu_{2}$ , respectively. We assume that $\sigma^{2}$ is known, and the parameters $\mu_{1}$ and $\mu_{2}$ are unknown and take value in a finite and quantized set $\mathbb{D}=\{k\Delta:k\in\mathbb{Z},|k|\leq b\}$ , for some $\Delta>0$ . Without loss of generality, we assume that $\mu_{1}\leq\mu_{2}$ . (Note that we allow $\mu_{1}=\mu_{2}$ here.) Our goal is to get accurate estimation of $\mu_{1}$ and $\mu_{2}$ .

The first step is to compute the sample mean of the first $N_{1}$ samples, i.e., $\bar{y}=\frac{1}{N_{1}}\sum_{i=1}^{N_{1}}y_{i}$ . Since we know that the mean of $y_{i}$ ’s takes value in the set $\mathbb{D}^{+}=\{\frac{k}{2}\Delta:k\in\mathbb{Z},|k|\leq 2b\}$ , we find the element in $\mathbb{D}^{+}$ which is the closest one to $\bar{y}$ as the estimator of the mean of $y_{i}$ , i.e., $\frac{1}{2}(\mu_{1}+\mu_{2})$ ,

$$\hat{\mu}=\arg\min_{\mu\in\mathbb{D}^{+}}|\mu-\bar{y}|.$$

We have the following result on the accuracy of the estimator $\hat{\mu}$ .

Lemma 9.

There exist universal constants $c_{1}$ and $c_{2}$ such that for any $\delta>0$ , if

[TABLE]

we have $\hat{\mu}=\frac{1}{2}(\mu_{1}+\mu_{2})$ with probability at least $1-6\delta$ .

We prove Lemma 9 in Appendix E-A. In the second step, we subtract $\hat{\mu}$ from the other $N-N_{1}$ samples, and get centered random variables $\tilde{y}_{i}=y_{i}-\hat{\mu}$ , $i=N_{1}+1,\ldots,N$ . We assume that $\hat{\mu}$ is the actual mean of the $y_{i}$ ’s, meaning that $\hat{\mu}=\frac{1}{2}(\mu_{1}+\mu_{2})$ . Then, we know that if $\mu_{1}=\mu_{2}$ , the centered random variables $\tilde{y}_{i}$ ’s are i.i.d. $\mathcal{N}(0,\sigma^{2})$ distributed; otherwise $\tilde{y}_{i}$ ’s are i.i.d. mixtures of two Gaussian distributions

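$$\tilde{y}_{i}\sim\frac{1}{2}\mathcal{N}(-\theta_{*},\sigma^{2})+\frac{1}{2}\mathcal{N}(\theta_{*},\sigma^{2}),$$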

where $\theta_{*}=\frac{1}{2}(\mu_{2}-\mu_{1})\geq 0$ . Then, we make an initial estimation of $\theta_{*}$ using $N_{2}$ of the $N-N_{1}$ centered random variables. Specifically, we compute

[TABLE]

We have the following result on $\theta_{0}$ :

Lemma 10.

Condition on the event that $\hat{\mu}=\frac{1}{2}(\mu_{1}+\mu_{2})$ . There exist universal constants $c_{3}$ and $c_{4}$ , such that for any $\delta>0$ , when

[TABLE]

then $\theta_{0}$ satisfies:

(1) if $\mu_{1}=\mu_{2}$, $\theta_{0}<\frac{\Delta}{4}$ with probability at least $1-2\delta$;

(2) if $\mu_{1}\neq\mu_{2}$, $|\theta_{0}-\theta_{*}|<\frac{\theta_{*}}{4}$ with probability at least $1-2\delta$.

We prove Lemma 10 in Appendix E-B. If $\theta_{0}<\frac{\Delta}{4}$, we declare that $\mu_{1}=\mu_{2}$ and output the estimators $\hat{\mu}_{1}=\hat{\mu}_{2}=\hat{\mu}$. Otherwise, we run a standard EM algorithm on the remaining $N_{3}:=N-(N_{1}+N_{2})$ samples, using $\theta_{0}$ as the initialization, to estimate $\theta_{*}$. Here, we briefly review the standard EM algorithm for mixtures of Gaussian distributions. For $t=0,1,2,\ldots$, perform the following two steps:

E step: compute the expected log-likelihood.

[TABLE]

where

[TABLE]

M step: compute

[TABLE]

We run the EM algorithm for $T$ iterations, and take the element in $\mathbb{D}^{+}$ that is closest to $\theta_{T}$ as the estimator of $\theta_{*}$, i.e., $\hat{\theta}_{*}=\arg\min_{\theta\in\mathbb{D}^{+}}|\theta-\theta_{T}|$. Then, we output the estimates of $\mu_{1}$ and $\mu_{2}$ as $\hat{\mu}_{1}=\hat{\mu}-\hat{\theta}_{*}$ and $\hat{\mu}_{2}=\hat{\mu}+\hat{\theta}_{*}$.
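The three-stage procedure described above (rounded sample mean, moment-based initialization, EM refinement with final rounding) can be sketched in a few lines. This is an illustrative reading of the text rather than the authors' implementation. Several details are assumptions: the initialization is taken as $\theta_{0}=\sqrt{\max(0,\frac{1}{N_{2}}\sum\tilde{y}_{i}^{2}-\sigma^{2})}$, motivated by $\mathbb{E}[\tilde{y}_{i}^{2}]=\theta_{*}^{2}+\sigma^{2}$; the EM update is written in the closed form $\theta_{t+1}=\frac{1}{N_{3}}\sum\tanh(\theta_{t}\tilde{y}_{i}/\sigma^{2})\,\tilde{y}_{i}$ for the symmetric two-component mixture; and all function names, sample sizes, and parameter values are hypothetical.

```python
import math
import random


def nearest_half_grid(x, delta, b):
    """Closest element of D+ = {k * delta / 2 : |k| <= 2b} to x."""
    k = round(x / (delta / 2.0))
    k = max(-2 * b, min(2 * b, k))
    return k * delta / 2.0


def estimate_two_means(y, sigma, delta, b, n1, n2, em_iters=20):
    # Stage 1: sample mean of the first n1 samples, rounded to D+.
    mu_hat = nearest_half_grid(sum(y[:n1]) / n1, delta, b)
    centered = [v - mu_hat for v in y[n1:]]

    # Stage 2: moment-based initialization (clamping at 0 is an assumption).
    second_moment = sum(v * v for v in centered[:n2]) / n2
    theta = math.sqrt(max(0.0, second_moment - sigma ** 2))
    if theta < delta / 4.0:
        return mu_hat, mu_hat  # declare mu_1 = mu_2

    # Stage 3: EM on the remaining samples, then round to D+.
    rest = centered[n2:]
    for _ in range(em_iters):
        theta = sum(math.tanh(theta * v / sigma ** 2) * v for v in rest) / len(rest)
    theta_hat = nearest_half_grid(theta, delta, b)
    return mu_hat - theta_hat, mu_hat + theta_hat


def sample_mixture(mu1, mu2, sigma, n, rng):
    """Equal-weight mixture of N(mu1, sigma^2) and N(mu2, sigma^2)."""
    return [(mu2 if rng.random() < 0.5 else mu1) + rng.gauss(0.0, sigma)
            for _ in range(n)]


rng = random.Random(1)
y_two = sample_mixture(-1.0, 2.0, 0.2, 6000, rng)
est_two = estimate_two_means(y_two, sigma=0.2, delta=1.0, b=5, n1=2000, n2=2000)

y_one = sample_mixture(1.0, 1.0, 0.2, 6000, rng)
est_one = estimate_two_means(y_one, sigma=0.2, delta=1.0, b=5, n1=2000, n2=2000)
print(est_two, est_one)
```

The example uses $\Delta=1$ and $\sigma=0.2$, so that $\theta_{*}/\sigma\geq 2.5$ whenever the means differ, in line with the separation condition $\eta\geq\frac{4}{\sqrt{3}}$ used in the analysis.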

Here, we review the result in [16], which characterizes the performance of the EM algorithm.

Lemma 11.

[16] Suppose that $\mu_{1}<\mu_{2}$ and that $\eta:=\frac{\theta_{*}}{\sigma}\geq\frac{4}{\sqrt{3}}$. Conditioned on the event that $\hat{\mu}=\frac{1}{2}(\mu_{1}+\mu_{2})$ and the event that $|\theta_{0}-\theta_{*}|<\frac{\theta_{*}}{4}$, there exist universal constants $c_{5}$, $c_{6}$, and $c_{7}$ such that, for any $\delta>0$, when $N_{3}\geq c_{5}\log(\frac{1}{\delta})$, we have

[TABLE]

with probability at least $1-\delta$ , where $\kappa\leq\exp(-c_{7}\eta^{2})$ .

Then, we have the direct corollary:

Corollary 1.

Assume, as in Theorem 3, that $\hat{\mu}=\frac{1}{2}(\mu_{1}+\mu_{2})$, $|\theta_{0}-\theta_{*}|<\frac{\theta_{*}}{4}$, and $\eta=\frac{\theta_{*}}{\sigma}\geq\frac{4}{\sqrt{3}}$. Then, when

[TABLE]

and

[TABLE]

we have $\hat{\theta}_{*}=\theta_{*}$ with probability at least $1-\delta$ , for any $\delta>0$ .

We prove Corollary 1 in Appendix E-C. We have the following theorem to characterize the performance of the proposed estimation algorithm.

Theorem 4.

If $N_{1}$ , $N_{2}$ , $N_{3}$ , and $T$ satisfy (15), (16), (17), and (18), respectively, and $\frac{\Delta}{\sigma}\geq\frac{4}{\sqrt{3}}$ , then the proposed estimation algorithm outputs correct estimations $\hat{\mu}_{1}=\mu_{1}$ and $\hat{\mu}_{2}=\mu_{2}$ with probability at least $1-9\delta$ , for any $\delta>0$ .

Proof.

Let $A_{1}$, $A_{2}$, and $A_{3}$ be the events that $\hat{\mu}=\frac{1}{2}(\mu_{1}+\mu_{2})$, that $|\theta_{0}-\theta_{*}|<\frac{\theta_{*}}{4}$, and that $\hat{\theta}_{*}=\theta_{*}$, respectively, and let $A$ be the event that $\hat{\mu}_{1}=\mu_{1}$ and $\hat{\mu}_{2}=\mu_{2}$. Then, by Lemma 9, we know that $\mathbb{P}\{A_{1}\}\geq 1-6\delta$.

If $\mu_{1}=\mu_{2}$ , by Lemma 10, we know that $\mathbb{P}\{A|A_{1}\}\geq 1-2\delta$ . Then $\mathbb{P}\{A\}\geq\mathbb{P}\{A|A_{1}\}\mathbb{P}\{A_{1}\}\geq 1-8\delta$ . If $\mu_{1}<\mu_{2}$ , by Lemma 10, we know that $\mathbb{P}\{A_{2}|A_{1}\}\geq 1-2\delta$ , and by Corollary 1, we know that $\mathbb{P}\{A_{3}|A_{2},A_{1}\}\geq 1-\delta$ . Then, $\mathbb{P}\{A\}\geq\mathbb{P}\{A_{1}\}\mathbb{P}\{A_{2}|A_{1}\}\mathbb{P}\{A_{3}|A_{2},A_{1}\}\geq 1-9\delta$ . ∎
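The final inequality in the proof can be verified by expanding the product of the three lower bounds. The worked step below is a sketch that assumes $0<\delta\leq\frac{1}{6}$, so that every factor is nonnegative; for $\delta>\frac{1}{9}$ the bound $1-9\delta$ is negative and the claim holds trivially.

```latex
(1-6\delta)(1-2\delta)(1-\delta)
  = (1-8\delta+12\delta^{2})(1-\delta)
  = 1-9\delta+20\delta^{2}-12\delta^{3}
  \;\geq\; 1-9\delta ,
\qquad\text{since } 20\delta^{2}-12\delta^{3}=4\delta^{2}(5-3\delta)\geq 0 .
```

The same expansion with only two factors, $(1-6\delta)(1-2\delta)=1-8\delta+12\delta^{2}\geq 1-8\delta$, gives the bound used in the case $\mu_{1}=\mu_{2}$.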

Then, we can derive Theorem 3 in the main paper by setting $\delta=\mathcal{O}(1/\operatorname{poly}(n))$ and $N=N_{1}+N_{2}+N_{3}$ .

E-A Proof of Lemma 9

First, we can see that to get an accurate estimation, it suffices to have $|\bar{y}-\frac{1}{2}(\mu_{1}+\mu_{2})|<\frac{\Delta}{4}$. Let $N_{11}=\sum_{i=1}^{N_{1}}(1-z_{i})$ and $N_{12}=\sum_{i=1}^{N_{1}}z_{i}$. We have

[TABLE]

By Hoeffding’s inequality, we have

[TABLE]

and similarly

[TABLE]

By Chernoff’s inequality, we have

[TABLE]

By triangle inequality and union bound, we get

[TABLE]

which completes the proof.

E-B Proof of Lemma 10

Let $A_{1}$ be the event that $\hat{\mu}=\frac{1}{2}(\mu_{1}+\mu_{2})$ . In this lemma, all the probabilities are conditioned on the event $A_{1}$ .

First, consider the case when $\mu_{1}=\mu_{2}$ , i.e., $\theta_{*}=0$ . Let $\tilde{y}:=\frac{1}{\sigma^{2}}\sum_{i=N_{1}+1}^{N_{1}+N_{2}}\tilde{y}_{i}^{2}$ . Then, we know that $\tilde{y}$ is $\chi^{2}$ distributed with $N_{2}$ degrees of freedom. By the concentration result of $\chi^{2}$ distribution, we have for any $\epsilon>0$ ,

[TABLE]

Then, we have

[TABLE]

which implies that if

[TABLE]

conditioned on $A_{1}$ , the probability that $\theta_{0}<\frac{\Delta}{4}$ is at least $1-2\delta$ .

Then, we consider the case when $\mu_{1}\neq\mu_{2}$. In this case, we have $\theta_{*}\geq\frac{\Delta}{2}$, and we study the probability that $|\theta_{0}-\theta_{*}|\leq\frac{\theta_{*}}{4}$. We again define $\tilde{y}:=\frac{1}{\sigma^{2}}\sum_{i=N_{1}+1}^{N_{1}+N_{2}}\tilde{y}_{i}^{2}$. We can see that $\tilde{y}$ has a noncentral $\chi^{2}$ distribution with $N_{2}$ degrees of freedom and noncentrality parameter $\nu=N_{2}\frac{\theta_{*}^{2}}{\sigma^{2}}$. According to the concentration results for the noncentral $\chi^{2}$ distribution, we have for all $\epsilon>0$,

[TABLE]

We analyze the probability that $\frac{\theta_{0}}{\theta_{*}}<\frac{5}{4}$ . We substitute $\tilde{y}$ and $\nu$ in (23) with $N_{2}(\frac{\theta_{0}^{2}}{\sigma^{2}}+1)$ and $N_{2}\frac{\theta_{*}^{2}}{\sigma^{2}}$ , respectively. By some rearrangements, we get

[TABLE]

Then, we know that if $N_{2}$ is large enough such that $\frac{\sigma}{\theta_{*}^{2}}\sqrt{\frac{(\theta_{*}^{2}+\sigma^{2})\epsilon}{N_{2}}}\leq\frac{9}{64}$ and $\frac{\sigma^{2}\epsilon}{\theta_{*}^{2}N_{2}}\leq\frac{9}{64}$ , then we have

[TABLE]

By simple algebra and the fact that $\theta_{*}\geq\frac{\Delta}{2}$, one can see that there exists a universal constant $c_{3}$ such that if $N_{2}$ satisfies

[TABLE]

then the probability that $\frac{\theta_{0}}{\theta_{*}}<\frac{5}{4}$ conditioned on $A_{1}$ is at least $1-\delta$ . Similarly, using (24), we know that when (25) is satisfied, we can guarantee that $\frac{\theta_{0}}{\theta_{*}}>\frac{3}{4}$ with probability at least $1-\delta$ . We can complete the proof by union bound.

E-C Proof of Corollary 1

To guarantee that $\hat{\theta}_{*}=\theta_{*}$, we need $|\theta_{T}-\theta_{*}|<\frac{\Delta}{2}$. By Lemma 11, it suffices to guarantee two facts:

[TABLE]

and

[TABLE]

Conditioning on the event that $|\theta_{0}-\theta_{*}|<\frac{\theta_{*}}{4}$ and $\theta_{*}<b\Delta$ , we know that it is sufficient to have $T>\frac{\log(b)}{\log(1/\kappa)}$ and $N_{3}>\frac{16c_{6}^{2}}{(1-\kappa)^{2}}b^{2}(b^{2}\Delta^{2}+\sigma^{2})\log(\frac{1}{\delta})$ .
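The two sufficient conditions on $T$ and $N_{3}$ stated above are easy to evaluate numerically. The helper below is a small sketch that plugs numbers into those two expressions. The function name is hypothetical, and the values passed for `kappa` and `c6` are placeholders, since the paper only states that $c_{6}$ is a universal constant and that $\kappa\leq\exp(-c_{7}\eta^{2})$.

```python
import math


def em_budget(b, delta_grid, sigma, kappa, c6, fail_prob):
    """Smallest integers strictly above the two sufficient bounds.

    T  > log(b) / log(1 / kappa)
    N3 > 16 c6^2 / (1 - kappa)^2 * b^2 * (b^2 Delta^2 + sigma^2) * log(1 / delta)
    """
    t_bound = math.log(b) / math.log(1.0 / kappa)
    n3_bound = (16.0 * c6 ** 2 / (1.0 - kappa) ** 2
                * b ** 2
                * (b ** 2 * delta_grid ** 2 + sigma ** 2)
                * math.log(1.0 / fail_prob))
    return math.floor(t_bound) + 1, math.floor(n3_bound) + 1


# Placeholder values for kappa and c6; they are not given in the paper.
T, N3 = em_budget(b=5, delta_grid=1.0, sigma=0.2, kappa=0.5, c6=1.0, fail_prob=1e-3)
print(T, N3)
```

The number of iterations depends only logarithmically on $b$, while the sample requirement grows like $b^{4}\Delta^{2}$ for large $b$, so the quantization range is the dominant factor in this bound.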

Bibliography

[1] M. Harville, “A framework for high-level feedback to adaptive, per-pixel, mixture-of-Gaussian background models,” in Proceedings of the 7th European Conference on Computer Vision, pp. 543–560, Springer, 2002.
[2] D. A. Reynolds, T. F. Quatieri, and R. B. Dunn, “Speaker verification using adapted Gaussian mixture models,” Digital Signal Processing, vol. 10, no. 1, pp. 19–41, 2000.
[3] A. Zhang, N. Fawaz, S. Ioannidis, and A. Montanari, “Guess who rated this movie: identifying users through subspace clustering,” in Proceedings of the 28th Conference on Uncertainty in Artificial Intelligence, pp. 944–953, AUAI Press, 2012.
[4] R. De Veaux, “Mixtures of linear regressions,” Computational Statistics and Data Analysis, vol. 8, no. 3, 1989.
[5] E. J. Candès, J. Romberg, and T. Tao, “Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information,” IEEE Transactions on Information Theory, vol. 52, no. 2, pp. 489–509, 2006.
[6] E. J. Candès, J. K. Romberg, and T. Tao, “Stable signal recovery from incomplete and inaccurate measurements,” Communications on Pure and Applied Mathematics, vol. 59, no. 8, pp. 1207–1223, 2006.
[7] E. Blackwell, C. F. M. de Leon, and G. E. Miller, “Applying mixed regression models to the analysis of repeated-measures data in psychosomatic medicine,” Psychosomatic Medicine, vol. 68, no. 6, 2006.
[8] P. Deb and M. Holmes, “Estimates of use and costs of behavioural health care: a comparison of standard and finite mixture models,” Econometric Analysis of Health Data, pp. 87–99, 2002.
