The Kikuchi Hierarchy and Tensor PCA

Alexander S. Wein; Ahmed El Alaoui; Cristopher Moore

arXiv:1904.03858·cs.DS·August 21, 2025

TL;DR

This paper introduces a new hierarchy of algorithms inspired by statistical physics for tensor PCA, matching sum-of-squares performance with simpler proofs and extending to related problems like refuting random XOR formulas.

Contribution

It proposes a Kikuchi Hessian-based hierarchy that generalizes belief propagation, achieving SOS-level performance for tensor PCA and related problems with simpler proofs.

Findings

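1. The hierarchy matches SOS performance for tensor PCA.

2. It yields polynomial-time and subexponential-time algorithms that achieve the same (conjecturally optimal) tradeoff between runtime and statistical power as SOS.

3. It simplifies the proofs and extends to refuting random $k$-XOR formulas (for even $k$).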

Equations

$${\bm{Y}}=\lambda x_{*}^{\otimes p}+{\bm{G}}$$

$${\bm{G}}:=\frac{1}{\sqrt{p!}}\sum_{\pi\in S_{p}}\widetilde{{\bm{G}}}^{\pi},$$

$${\bm{Y}}=\lambda x_{*}^{\otimes p}+{\bm{G}}.$$

$$\big\{{\bm{Y}}_{E}=\lambda x_{*}^{E}+{\bm{G}}_{E}\;:\;E\subseteq[n],\ |E|=p\big\},$$

$$\lim_{n\to\infty}\mathbb{P}_{\lambda}(f({\bm{Y}})=1)=1\quad\text{and}\quad\lim_{n\to\infty}\mathbb{P}_{0}(f({\bm{Y}})=0)=1.$$

$$\limsup_{n\to\infty}\big\{\mathbb{P}_{0}(f({\bm{Y}})=1)+\mathbb{P}_{\lambda}(f({\bm{Y}})=0)\big\}<1.$$

$$\mathrm{corr}(\hat{x},x)=\frac{|\langle\hat{x},x\rangle|}{\|\hat{x}\|\,\|x\|}.$$

$${\bm{Y}}\{u\}_{i}=\sum_{j_{1},\ldots,j_{p-1}}{\bm{Y}}_{i,j_{1},\ldots,j_{p-1}}u_{j_{1}}\cdots u_{j_{p-1}}.$$

$$\mathrm{corr}(\widehat{x},x_{*})\geq 1-c\lambda^{-1}\tau^{1-p}n^{(1-p)/2}.$$

$$\lambda\geq\ell^{1/2-p/4}n^{-p/4}\,\mathrm{polylog}(n).$$

$${\bm{M}}_{S,T}=\begin{cases}{\bm{Y}}_{S\bigtriangleup T}&\text{if }|S\bigtriangleup T|=p,\\ 0&\text{otherwise}.\end{cases}$$

$${\bm{V}}_{ii}(v)=0\quad\forall i\in[n],\qquad\text{and}\qquad{\bm{V}}_{ij}(v)=\frac{1}{2}\sum_{S,T\in\binom{[n]}{\ell}}v_{S}v_{T}\,\mathbf{1}_{S\bigtriangleup T=\{i,j\}}\quad\forall i\neq j.$$

$$d_{\ell}:=\binom{n-\ell}{p/2}\binom{\ell}{p/2}.$$

$$\mathbb{P}_{0}\Big(\lambda_{\max}({\bm{M}})\geq\frac{\lambda d_{\ell}}{2}\Big)\vee\mathbb{P}_{\lambda}\Big(\lambda_{\max}({\bm{M}})\leq\frac{\lambda d_{\ell}}{2}\Big)\leq 2n^{\ell}e^{-\lambda^{2}d_{\ell}/8}.$$

$$\mathbb{P}(x\,|\,{\bm{Y}})\propto\exp\Big\{\lambda\sum_{i_{1}<\cdots<i_{p}}{\bm{Y}}_{i_{1}\cdots i_{p}}x_{i_{1}}\cdots x_{i_{p}}\Big\}=\exp\Big\{\lambda\sum_{|E|=p}{\bm{Y}}_{E}x^{E}\Big\}$$

$$\mathbb{P}(x\,|\,{\bm{Y}})\propto\exp\Big\{\frac{\lambda}{N}\sum_{|S\bigtriangleup T|=p}{\bm{Y}}_{S\bigtriangleup T}x^{S}x^{T}\Big\}\,,$$

$$\frac{1}{\|u\|^{2}}\sum_{|S\bigtriangleup T|=p}{\bm{Y}}_{S\bigtriangleup T}u_{S}u_{T}$$

$$u_{S}\leftarrow\sum_{T\,:\,|S\bigtriangleup T|=p}{\bm{Y}}_{S\bigtriangleup T}u_{T}.$$

$$u\leftarrow{\bm{M}}u.$$

$$\frac{1}{2^{n}}(u^{x})^{\top}\overline{{\bm{M}}}u^{x}=\frac{1}{2^{n}}\sum_{S,T\subseteq[n]\,:\,|S\bigtriangleup T|=p}{\bm{Y}}_{S\bigtriangleup T}x^{S}x^{T}=\sum_{|E|=p}{\bm{Y}}_{E}x^{E}.$$

$$\inf_{\mu}F(\mu),$$

$$F(\mu):=\mathbb{E}_{x\sim\mu}[H(x)]-\frac{1}{\beta}S(\mu),$$

$$E(b)=-\lambda\sum_{|S|=p}{\bm{Y}}_{S}\sum_{x_{S}\in\{\pm 1\}^{S}}b_{S}(x_{S})\,x^{S}.$$

$$S(b)=\sum_{0<|S|\leq r}c_{S}S_{S}(b),\qquad S_{S}(b)=-\sum_{x_{S}\in\{\pm 1\}^{S}}b_{S}(x_{S})\log b_{S}(x_{S}),$$

$$\sum_{T\supseteq S,\ |T|\leq r}c_{T}=1,$$

$${\bm{M}}_{S,T}={\bm{Y}}_{S\bigtriangleup T}\mathbf{1}_{|S\bigtriangleup T|=p}\quad\text{where}\quad{\bm{Y}}_{S\bigtriangleup T}=\lambda x_{*}^{S\bigtriangleup T}+{\bm{G}}_{S\bigtriangleup T}.$$

$${\bm{M}}=\lambda{\bm{X}}+{\bm{Z}},$$

$$u_{S}^{\varphi}=\prod_{i=1}^{m}\big(\mathbf{1}_{a_{i}\in S}-\mathbf{1}_{b_{i}\in S}\big),\qquad|S|=\ell.$$

$$\mu_{0}=d_{\ell}=\binom{\ell}{p/2}\binom{n-\ell}{p/2}.$$

$$\mu_{m}=\sum_{s=0}^{\min(m,p/2)}(-1)^{s}\binom{m}{s}\binom{\ell-m}{p/2-s}\binom{n-\ell-m}{p/2-s},\qquad 0\leq m\leq\ell.$$

Taxonomy

Topics: Tensor decomposition and applications · Machine Learning and Algorithms · Algorithms and Data Compression

Methods: Principal Components Analysis

Full text

The Kikuchi Hierarchy and Tensor PCA

Alexander S. Wein, Department of Mathematics, Courant Institute of Mathematical Sciences, NYU. Partially supported by NSF grant DMS-1712730 and by the Simons Collaboration on Algorithms and Geometry.

Ahmed El Alaoui, Departments of Electrical Engineering and Statistics, Stanford University. Partially supported by IIS-1741162 and ONR N00014-18-1-2729.

Cristopher Moore, Santa Fe Institute. Partially supported by NSF grant BIGDATA-1838251.

Abstract

For the tensor PCA (principal component analysis) problem, we propose a new hierarchy of increasingly powerful algorithms with increasing runtime. Our hierarchy is analogous to the sum-of-squares (SOS) hierarchy but is instead inspired by statistical physics and related algorithms such as belief propagation and AMP (approximate message passing). Our level- $\ell$ algorithm can be thought of as a linearized message-passing algorithm that keeps track of $\ell$ -wise dependencies among the hidden variables. Specifically, our algorithms are spectral methods based on the Kikuchi Hessian, which generalizes the well-studied Bethe Hessian to the higher-order Kikuchi free energies.

It is known that AMP, the flagship algorithm of statistical physics, has substantially worse performance than SOS for tensor PCA. In this work we ‘redeem’ the statistical physics approach by showing that our hierarchy gives a polynomial-time algorithm matching the performance of SOS. Our hierarchy also yields a continuum of subexponential-time algorithms, and we prove that these achieve the same (conjecturally optimal) tradeoff between runtime and statistical power as SOS. Our proofs are much simpler than prior work, and also apply to the related problem of refuting random $k$ -XOR formulas. The results we present here apply to tensor PCA for tensors of all orders, and to $k$ -XOR when $k$ is even.

Our methods suggest a new avenue for systematically obtaining optimal algorithms for Bayesian inference problems, and our results constitute a step toward unifying the statistical physics and sum-of-squares approaches to algorithm design.

1 Introduction

High-dimensional Bayesian inference problems are widely studied, including planted clique [Jer92, AKS98], sparse PCA [JL04], and community detection [DKMZ11b, DKMZ11a], just to name a few. For these types of problems, two general strategies, or meta-algorithms, have emerged. The first is rooted in statistical physics and includes the belief propagation (BP) algorithm [Pea86, YFW03] along with variants such as approximate message passing (AMP) [DMM09], and related spectral methods such as linearized BP [KMM+13, BLM15], and the Bethe Hessian [SKZ14]. The second meta-algorithm is the sum-of-squares (SOS) hierarchy [Sho87, Par00, Las01], a hierarchy of increasingly powerful semidefinite programming relaxations to polynomial optimization problems, along with spectral methods inspired by it [HSS15, HSSS16]. Both of these meta-algorithms are known to achieve statistically-optimal performance for many problems. Furthermore, when they fail to perform a task, this is often seen as evidence that no polynomial-time algorithm can succeed. Such reasoning takes the form of free energy barriers in statistical physics [LKZ15a, LKZ15b] or SOS lower bounds (e.g., [BHK+16]). Thus, we generally expect both meta-algorithms to have optimal statistical performance among all efficient algorithms.

A fundamental question is whether we can unify statistical physics and SOS, showing that the two approaches yield, or at least predict, the same performance on a large class of problems. However, one barrier to this comes from the tensor principal component analysis (PCA) problem [RM14], on which the two meta-algorithms seem to have very different performance. For an integer $p\geq 2$ , in the order- $p$ tensor PCA or spiked tensor problem we observe a $p$ -fold $n\times n\times\cdots\times n$ tensor

$${\bm{Y}}=\lambda x_{*}^{\otimes p}+{\bm{G}}$$

where the parameter $\lambda\geq 0$ is a signal-to-noise ratio (SNR), $x_{*}\in\mathbb{R}^{n}$ is a planted signal with normalization $\|x_{*}\|=\sqrt{n}$ drawn from a simple prior such as the uniform distribution on $\{\pm 1\}^{n}$, and ${\bm{G}}$ is a symmetric noise tensor with $\mathcal{N}(0,1)$ entries. Information-theoretically, it is possible to recover $x_{*}$ given ${\bm{Y}}$ (in the limit $n\to\infty$, with $p$ fixed) when $\lambda\gg n^{(1-p)/2}$ [RM14, LML+17]. (Here we ignore log factors, so $A\gg B$ can be understood to mean $A\geq B\,\mathrm{polylog}(n)$.) However, this information-theoretic threshold corresponds to exhaustive search. We would also like to understand the computational threshold, i.e., for what values of $\lambda$ there is an efficient algorithm.

The sum-of-squares hierarchy gives a polynomial-time algorithm to recover $x_{*}$ when $\lambda\gg n^{-p/4}$ [HSS15], and SOS lower bounds suggest that no polynomial-time algorithm can do better [HSS15, HKP+17]. However, AMP, the flagship algorithm of statistical physics, is suboptimal for $p\geq 3$ and fails unless $\lambda\gg n^{-1/2}$ [RM14]. Various other “local” algorithms such as the tensor power method, Langevin dynamics, and gradient descent also fail below this “local” threshold $\lambda\sim n^{-1/2}$ [RM14, AGJ18]. This casts serious doubt on the optimality of the statistical physics approach.

In this paper we resolve this discrepancy and “redeem” the statistical physics approach. The Bethe free energy associated with AMP is merely the first level of a hierarchy of Kikuchi free energies [Kik51, Kik94, YFW03]. From these Kikuchi free energies, we derive a hierarchy of increasingly powerful algorithms for tensor PCA, similar in spirit to generalized belief propagation [YFW03]. Roughly speaking, our level- $\ell$ algorithm can be thought of as an iterative message-passing algorithm that reasons about $\ell$ -wise dependencies among the hidden variables. As a result, it has time and space complexity $n^{O(\ell)}$ . Specifically, the level- $\ell$ algorithm is a spectral method on a $n^{O(\ell)}\times n^{O(\ell)}$ submatrix of (a first-order approximation of) the Kikuchi Hessian, i.e., the matrix of second derivatives of the Kikuchi free energy. This generalizes the Bethe Hessian spectral method, which has been successful in the setting of community detection [SKZ14]. We note that the Ph.D. dissertation of Saade [Saa16] proposed the Kikuchi Hessian as a direction for future research.

For order-$p$ tensor PCA with $p$ even, we show that level $\ell=p/2$ of the Kikuchi hierarchy gives an algorithm that succeeds down to the SOS threshold $\lambda\sim n^{-p/4}$, closing the gap between SOS and statistical physics. Furthermore, by taking $\ell=n^{\delta}$ levels for various values of $\delta\in(0,1)$, we obtain a continuum of subexponential-time algorithms that achieve a precise tradeoff between runtime and the signal-to-noise ratio, exactly the same tradeoff curve that SOS is known to achieve [BGG+16, BGL16]. (The strongest SOS results only apply to a variant of the spiked tensor model with Rademacher observations, but we do not expect this difference to be important; see Section 2.6.) We obtain similar results when $p$ is odd, by combining a matrix related to the Kikuchi Hessian with a construction similar to [CGL04]; see Appendix F.2.

Our approach also applies to the problem of refuting random $k$ -XOR formulas when $k$ is even, showing that we can strongly refute random formulas with $n$ variables and $m\gg n^{k/2}$ clauses in polynomial time, and with a continuum of subexponential-time algorithms that succeed at lower densities. This gives a much simpler proof of the results of [RRS17], using only the matrix Chernoff bound instead of intensive moment calculations; see Appendix F.1. We leave for future work the problem of giving a similar simplification of [RRS17] when $k$ is odd.

Our results redeem the statistical physics approach to algorithm design and give hope that the Kikuchi hierarchy provides a systematic way to derive optimal algorithms for a large class of Bayesian inference problems. We see this as a step toward unifying the statistical physics and SOS approaches. Indeed, we propose the following informal meta-conjecture: for high-dimensional inference problems with planted solutions (and related problems such as refuting random constraint satisfaction problems) the SOS hierarchy and the Kikuchi hierarchy both achieve the optimal tradeoff between runtime and statistical power.

After the initial appearance of this paper, some related independent work has appeared. A hierarchy of algorithms similar to ours is proposed by [Has19], but with a different motivation based on a system of quantum particles. Also, [BCR19] gives an alternative “redemption” of local algorithms based on replicated gradient descent.

2 Preliminaries and Prior Work

2.1 Notation

Our asymptotic notation (e.g., $O(\cdot),o(\cdot),\Omega(\cdot),\omega(\cdot)$ ) pertains to the limit $n\to\infty$ (large dimension) and may hide constants depending on $p$ (tensor order), which we think of as fixed. We say an event occurs with high probability if it occurs with probability $1-o(1)$ .

A tensor ${\bm{T}}\in(\mathbb{R}^{n})^{\otimes p}$ is an $n\times n\times\cdots\times n$ ( $p$ times) multi-array with entries denoted by ${\bm{T}}_{i_{1},\ldots,i_{p}}$ , where $i_{k}\in[n]\mathrel{\mathop{:}}=\{1,2,\ldots,n\}$ . We call $p$ the order of ${\bm{T}}$ and $n$ the dimension of ${\bm{T}}$ . For a vector $u\in\mathbb{R}^{n}$ , the rank-1 tensor $u^{\otimes p}$ is defined by $(u^{\otimes p})_{i_{1},\ldots,i_{p}}=\prod_{k=1}^{p}u_{i_{k}}$ . A tensor ${\bm{T}}$ is symmetric if ${\bm{T}}_{i_{1},\ldots,i_{p}}={\bm{T}}_{i_{\pi(1)},\ldots,i_{\pi(p)}}$ for any permutation $\pi\in S_{p}$ . For a symmetric tensor, if $E=\{i_{1},\ldots,i_{p}\}\subseteq[n]$ , we will often write ${\bm{T}}_{E}\mathrel{\mathop{:}}={\bm{T}}_{i_{1},\ldots,i_{p}}$ .

2.2 The Spiked Tensor Model

A general formulation of the spiked tensor model is as follows. For an integer $p\geq 2$ , let $\widetilde{{\bm{G}}}\in(\mathbb{R}^{n})^{\otimes p}$ be an asymmetric tensor with entries i.i.d. $\mathcal{N}(0,1)$ . We then symmetrize this tensor and obtain a symmetric tensor ${\bm{G}}$ ,

$${\bm{G}}:=\frac{1}{\sqrt{p!}}\sum_{\pi\in S_{p}}\widetilde{{\bm{G}}}^{\pi},$$

where $S_{p}$ is the symmetric group of permutations of $[p]$, and $\widetilde{{\bm{G}}}^{\pi}_{i_{1},\ldots,i_{p}}:=\widetilde{{\bm{G}}}_{i_{\pi(1)},\ldots,i_{\pi(p)}}$. Note that if $i_{1},\ldots,i_{p}$ are distinct then ${\bm{G}}_{i_{1},\ldots,i_{p}}\sim\mathcal{N}(0,1)$. We draw a ‘signal’ vector or ‘spike’ $x_{*}\in\mathbb{R}^{n}$ from a prior distribution $P_{\mathtt{x}}$ supported on the sphere $\mathcal{S}^{n-1}=\{x\in\mathbb{R}^{n}\,:\,\|x\|=\sqrt{n}\}$. Then we let ${\bm{Y}}\in(\mathbb{R}^{n})^{\otimes p}$ be the tensor

$${\bm{Y}}=\lambda x_{*}^{\otimes p}+{\bm{G}}.$$

We will mostly focus on the Rademacher-spiked model where $x_{*}$ is uniform in $\{\pm 1\}^{n}$ , i.e., $P_{\mathtt{x}}=2^{-n}\prod_{i}\big{(}\delta(x_{i}-1)+\delta(x_{i}+1)\big{)}$ . We will sometimes state results without specifying the prior $P_{\mathtt{x}}$ , in which case the result holds for any prior normalized so that $\|x_{*}\|=\sqrt{n}$ . Let $\mathbb{P}_{\lambda}$ denote the law of the tensor ${\bm{Y}}$ . The parameter $\lambda=\lambda(n)$ may depend on $n$ . We will consider the limit $n\to\infty$ with $p$ held fixed.
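To make the symmetrization step concrete, here is a minimal Python sketch that builds ${\bm{G}}$ from an i.i.d. tensor $\widetilde{{\bm{G}}}$. It assumes the $1/\sqrt{p!}$ normalization, which is the one under which entries with distinct indices have unit variance, as noted above. The function name `symmetrize` and the dictionary-of-tuples representation are illustrative choices, and the brute-force loop is only practical for very small $n$ and $p$.

```python
import itertools
import math
import random


def symmetrize(g_tilde, n, p):
    """Return G = (1/sqrt(p!)) * sum over permutations pi of g_tilde^pi.

    g_tilde maps every index tuple in range(n)**p to a float
    (an assumed, illustrative layout).
    """
    perms = list(itertools.permutations(range(p)))
    scale = 1.0 / math.sqrt(math.factorial(p))  # assumed normalization
    g = {}
    for idx in itertools.product(range(n), repeat=p):
        # (g_tilde^pi)_{i_1..i_p} = g_tilde_{i_pi(1)..i_pi(p)}
        g[idx] = scale * sum(
            g_tilde[tuple(idx[k] for k in perm)] for perm in perms
        )
    return g


rng = random.Random(0)
n, p = 3, 3
g_tilde = {
    idx: rng.gauss(0.0, 1.0)
    for idx in itertools.product(range(n), repeat=p)
}
g = symmetrize(g_tilde, n, p)
```

By construction, `g` takes the same value on any reordering of an index tuple, up to floating-point rounding.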

Our algorithms will depend only on the entries ${\bm{Y}}_{i_{1},\ldots,i_{p}}$ where the indices $i_{1},\ldots,i_{p}$ are distinct: that is, on the collection

$$\big\{{\bm{Y}}_{E}=\lambda x_{*}^{E}+{\bm{G}}_{E}\;:\;E\subseteq[n],\ |E|=p\big\},$$

where ${\bm{G}}_{E}\sim\mathcal{N}(0,1)$ and for a vector $x\in\mathbb{R}^{n}$ we write $x^{E}=\prod_{i\in E}x_{i}$ .
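As an illustration of this reduced collection, the sketch below samples a Rademacher spike and the observations ${\bm{Y}}_{E}$ directly, one per $p$-subset $E$. The function name `sample_distinct_entries`, the seed argument, and the choice to key observations by sorted index tuples are assumptions made for the example.

```python
import itertools
import math
import random


def sample_distinct_entries(n, p, lam, seed=0):
    """Sample x_* uniform in {-1,+1}^n and Y_E = lam * x_*^E + G_E, |E| = p."""
    rng = random.Random(seed)
    x_star = [rng.choice((-1, 1)) for _ in range(n)]  # Rademacher spike
    observations = {}
    for subset in itertools.combinations(range(n), p):  # sorted p-subsets E
        x_e = math.prod(x_star[i] for i in subset)  # x_*^E
        observations[subset] = lam * x_e + rng.gauss(0.0, 1.0)  # G_E ~ N(0,1)
    return x_star, observations


x_star, obs = sample_distinct_entries(n=6, p=4, lam=2.0)
```

The dictionary has $\binom{n}{p}$ entries, which is the only part of ${\bm{Y}}$ the algorithms in this paper read.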

Perhaps one of the simplest statistical tasks is binary hypothesis testing. In our case this amounts to, given a tensor ${\bm{Y}}$ as input with the promise that it was sampled from $\mathbb{P}_{\lambda}$ with $\lambda\in\{0,\bar{\lambda}\}$ , determining whether $\lambda=0$ or $\lambda=\bar{\lambda}$ . We refer to $\mathbb{P}_{\lambda}$ for $\lambda>0$ as the planted distribution, and $\mathbb{P}_{0}$ as the null distribution.

Definition 2.1.

We say that an algorithm (or test) $f:(\mathbb{R}^{n})^{\otimes p}\to\{0,1\}$ achieves strong detection between $\mathbb{P}_{0}$ and $\mathbb{P}_{\lambda}$ if

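$$\lim_{n\to\infty}\mathbb{P}_{\lambda}(f({\bm{Y}})=1)=1\quad\text{and}\quad\lim_{n\to\infty}\mathbb{P}_{0}(f({\bm{Y}})=0)=1.$$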

Additionally we say that $f$ achieves weak detection between $\mathbb{P}_{0}$ and $\mathbb{P}_{\lambda}$ if the sum of Type-I and Type-II errors remains strictly below 1:

$$\limsup_{n\to\infty}\big\{\mathbb{P}_{0}(f({\bm{Y}})=1)+\mathbb{P}_{\lambda}(f({\bm{Y}})=0)\big\}<1.$$

An additional goal is to recover the planted vector $x_{*}$ . Note that when $p$ is even, $x_{*}$ and $-x_{*}$ have the same posterior probability. Thus, our goal is to recover $x_{*}$ up to a sign.

Definition 2.2.

The normalized correlation between vectors $\hat{x},x\in\mathbb{R}^{n}$ is

$$\mathrm{corr}(\hat{x},x)=\frac{|\langle\hat{x},x\rangle|}{\|\hat{x}\|\,\|x\|}.$$

Definition 2.3.

An estimator $\hat{x}=\hat{x}({\bm{Y}})$ achieves weak recovery if $\mathrm{corr}(\hat{x},x_{*})$ is lower-bounded by a strictly positive constant—and we write $\mathrm{corr}(\hat{x},x_{*})=\Omega(1)$ —with high probability, and achieves strong recovery if $\mathrm{corr}(\hat{x},x_{*})=1-o(1)$ with high probability.
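The normalized correlation of Definition 2.2 is straightforward to compute. The following sketch (with the illustrative name `corr`) takes the absolute value of the inner product, so that $\hat{x}$ and $-\hat{x}$ score the same, matching the sign ambiguity discussed above.

```python
import math


def corr(x_hat, x):
    """Normalized correlation |<x_hat, x>| / (||x_hat|| * ||x||)."""
    inner = sum(a * b for a, b in zip(x_hat, x))
    norm_hat = math.sqrt(sum(a * a for a in x_hat))
    norm_x = math.sqrt(sum(b * b for b in x))
    return abs(inner) / (norm_hat * norm_x)


perfect = corr([1.0, 1.0], [-1.0, -1.0])    # sign flip does not matter
orthogonal = corr([1.0, 0.0], [0.0, 1.0])   # no correlation
```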

We expect that strong detection and weak recovery are generally equally difficult, although formal implications are not known in either direction. We will see in Section 2.5 that in some regimes, weak recovery and strong recovery are equivalent.

The matrix case.

When $p=2$, the spiked tensor model reduces to the spiked Wigner model. We know from random matrix theory that when $\lambda=\hat{\lambda}/\sqrt{n}$ with $\hat{\lambda}>1$, strong detection is possible by thresholding the maximum eigenvalue of ${\bm{Y}}$, and weak recovery is achieved by PCA, i.e., taking the leading eigenvector [FP07, BGN11]. For many spike priors including Rademacher, strong detection and weak recovery are statistically impossible when $\hat{\lambda}<1$ [MRZ15, DAM15, PWBM18], so there is a sharp phase transition at $\hat{\lambda}=1$. (Note that weak detection is still possible below $\hat{\lambda}=1$ [EKJ18].) A more sophisticated algorithm is AMP (approximate message passing) [DMM09, FR18, LKZ15a, DAM15], which can be thought of as a modification of the matrix power method which uses certain nonlinear transformations to exploit the structure of the spike prior. For many spike priors including Rademacher, AMP is known to achieve the information-theoretically optimal correlation with the true spike [DAM15, DMK+16]. However, like PCA, AMP achieves zero correlation asymptotically when $\hat{\lambda}<1$. For certain spike priors (e.g., sparse priors), statistical-computational gaps can appear in which it is information-theoretically possible to succeed for some $\hat{\lambda}<1$ but we do not expect that polynomial-time algorithms can do so [LKZ15a, LKZ15b, BMV+18].

The tensor case.

The tensor case $p\geq 3$ was introduced by [RM14], who also proposed various algorithms. Information-theoretically, there is a sharp phase transition similar to the matrix case: for many spike priors $P_{\mathtt{x}}$ including Rademacher, if $\lambda=\hat{\lambda}n^{(1-p)/2}$ then weak recovery is possible when $\hat{\lambda}>\lambda_{c}$ and impossible when $\hat{\lambda}<\lambda_{c}$ for a particular constant $\lambda_{c}=\lambda_{c}(p,P_{\mathtt{x}})$ depending on $p$ and $P_{\mathtt{x}}$ [LML+17]. Strong detection undergoes the same transition, being possible if $\hat{\lambda}>\lambda_{c}$ and impossible otherwise [Che17, CHL18, JLM18] (see also [PWB16]). In fact, it is shown in these works that even weak detection is impossible when $\hat{\lambda}<\lambda_{c}$, in sharp contrast with the matrix case. There are polynomial-time algorithms (e.g., SOS) that succeed at both strong detection and strong recovery when $\lambda\gg n^{-p/4}$ for any spike prior [RM14, HSS15, HSSS16], which is above the information-theoretic threshold by a factor that diverges with $n$. There are also SOS lower bounds suggesting that (for many priors) no polynomial-time algorithm can succeed at strong detection or weak recovery when $\lambda\ll n^{-p/4}$ [HSS15, HKP+17]. Thus, unlike the matrix case, we expect a large statistical-computational gap, i.e., a “possible-but-hard” regime when $n^{(1-p)/2}\ll\lambda\ll n^{-p/4}$.

2.3 The Tensor Power Method and Local Algorithms

Various algorithms have been proposed and analyzed for tensor PCA [RM14, HSS15, HSSS16, ADGM16, AGJ18]. We will present two such algorithms that are simple, representative, and relevant to the discussion. The first is the tensor power method [AGH+14, RM14, AGJ17].

Algorithm 2.4.

(Tensor Power Method) For a vector $u\in\mathbb{R}^{n}$ and a tensor ${\bm{Y}}\in(\mathbb{R}^{n})^{\otimes p}$ , let ${\bm{Y}}\{u\}\in\mathbb{R}^{n}$ denote the vector

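$${\bm{Y}}\{u\}_{i}=\sum_{j_{1},\ldots,j_{p-1}}{\bm{Y}}_{i,j_{1},\ldots,j_{p-1}}u_{j_{1}}\cdots u_{j_{p-1}}.$$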

The tensor power method begins with an initial guess $u\in\mathbb{R}^{n}$ (e.g., chosen at random) and repeatedly iterates the update rule $u\leftarrow{\bm{Y}}\{u\}$ until $u/\|u\|$ converges.
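A minimal sketch of Algorithm 2.4 for $p=3$ is given below. It makes two simplifications that are not part of the algorithm statement: the tensor is stored as nested lists, and the loop runs for a fixed number of iterations (renormalizing after each step) instead of testing for convergence. The names `apply_tensor` and `tensor_power_method` are illustrative.

```python
import math


def apply_tensor(Y, u):
    """Compute Y{u}_i = sum_{j,k} Y[i][j][k] * u[j] * u[k] for an order-3 Y."""
    n = len(u)
    return [
        sum(Y[i][j][k] * u[j] * u[k] for j in range(n) for k in range(n))
        for i in range(n)
    ]


def tensor_power_method(Y, u, iters=20):
    """Iterate u <- Y{u}, renormalizing each step (fixed iteration budget)."""
    for _ in range(iters):
        w = apply_tensor(Y, u)
        norm = math.sqrt(sum(a * a for a in w))
        if norm == 0.0:  # degenerate direction; stop instead of dividing by 0
            break
        u = [a / norm for a in w]
    return u


# Noiseless toy input: Y = x (x) x (x) x, so Y{u} = <x, u>^2 * x.
x = [1, -1, 1, 1]
Y = [[[a * b * c for c in x] for b in x] for a in x]
u_hat = tensor_power_method(Y, [1.0, 0.0, 0.0, 0.0])
```

On this noiseless input the iteration lands on $x/\|x\|$ after one step. With noise present, success depends on $\lambda$ and on the initialization, as discussed next.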

The tensor power method appears to only succeed when $\lambda\gg n^{-1/2}$ [RM14], which is worse than the SOS threshold $\lambda\sim n^{-p/4}$ . The AMP algorithm of [RM14] is a more sophisticated variant of the tensor power method, but AMP also fails unless $\lambda\gg n^{-1/2}$ [RM14]. Two other related algorithms, gradient descent and Langevin dynamics, also fail unless $\lambda\gg n^{-1/2}$ [AGJ18]. Following [AGJ18], we refer to all of these algorithms (tensor power method, AMP, gradient descent, Langevin dynamics) as local algorithms, and we refer to the corresponding threshold $\lambda\sim n^{-1/2}$ as the local threshold. Here “local” is not a precise notion, but roughly speaking, local algorithms keep track of a current guess for $x_{*}$ and iteratively update it to a nearby vector that is more favorable in terms of e.g. the log-likelihood. This discrepancy between local algorithms and SOS is what motivated the current work.

2.4 The Tensor Unfolding Algorithm

We have seen that local algorithms do not seem able to reach the SOS threshold. Let us now describe one of the simplest algorithms that does reach this threshold: tensor unfolding. Tensor unfolding was first proposed by [RM14], where it was shown to succeed when $\lambda\gg n^{-\lfloor p/2\rfloor/2}$ and conjectured to succeed when $\lambda\gg n^{-p/4}$ (the SOS threshold). For the case $p=3$, the same algorithm was later reinterpreted as a spectral relaxation of SOS, and proven to succeed when $\lambda\gg n^{-3/4}=n^{-p/4}$ [HSS15], confirming the conjecture of [RM14]. (The analysis of [HSS15] applies to a close variant of the spiked tensor model in which the noise tensor is asymmetric; we do not expect this difference to be important.) We now present the tensor unfolding method, restricting to the case $p=3$ for simplicity. There is a natural extension to all $p$ [RM14], and (a close variant of) this algorithm will in fact appear as level $\ell=\lfloor p/2\rfloor$ in our hierarchy of algorithms (see Section 3 and Appendix C).

Algorithm 2.5.

(Tensor Unfolding) Given an order-3 tensor ${\bm{Y}}\in(\mathbb{R}^{n})^{\otimes 3}$ , flatten it to an $n\times n^{2}$ matrix ${\bm{M}}$ , i.e., let ${\bm{M}}_{i,jk}={\bm{Y}}_{ijk}$ . Compute the leading eigenvector of ${\bm{M}}{\bm{M}}^{\top}$ .

If we use the matrix power method to compute the leading eigenvector, we can restate the tensor unfolding method as an iterative algorithm: keep track of state vectors $u\in\mathbb{R}^{n}$ and $v\in\mathbb{R}^{n^{2}}$ , initialize $u$ randomly, and alternate between applying the update steps $v\leftarrow{\bm{M}}^{\top}u$ and $u\leftarrow{\bm{M}}v$ . We will see later (see Section 4.1) that this can be interpreted as a message-passing algorithm between singleton indices, represented by $u$ , and pairs of indices, represented by $v$ . Thus, tensor unfolding is not “local” in the sense of Section 2.3 because it keeps a state of size $O(n^{2})$ (keeping track of pairwise information) instead of size $O(n)$ . We can, however, think of it as local on a “lifted” space, and this allows it to surpass the local threshold.
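The iterative restatement above can be sketched as follows for $p=3$. The function name `tensor_unfolding`, the nested-list tensor layout, the fixed iteration count, and the per-step normalization are assumptions made for the example. The matrix ${\bm{M}}$ is never formed explicitly, since ${\bm{M}}_{i,jk}={\bm{Y}}_{ijk}$.

```python
import math
import random


def tensor_unfolding(Y, iters=50, seed=0):
    """Power iteration on M M^T, where M[i][(j, k)] = Y[i][j][k] (order 3)."""
    n = len(Y)
    rng = random.Random(seed)
    u = [rng.gauss(0.0, 1.0) for _ in range(n)]  # random initialization
    for _ in range(iters):
        # v <- M^T u : state indexed by pairs (j, k)
        v = [
            [sum(Y[i][j][k] * u[i] for i in range(n)) for k in range(n)]
            for j in range(n)
        ]
        # u <- M v : state indexed by single indices i
        u = [
            sum(Y[i][j][k] * v[j][k] for j in range(n) for k in range(n))
            for i in range(n)
        ]
        norm = math.sqrt(sum(a * a for a in u))
        u = [a / norm for a in u]
    return u


# Noiseless toy input: Y = 0.5 * x (x) x (x) x.
x = [1, -1, -1, 1, 1]
Y = [[[0.5 * a * b * c for c in x] for b in x] for a in x]
u_hat = tensor_unfolding(Y, iters=5)
```

The intermediate `v` has $n^{2}$ entries, which is the "lifted" state that distinguishes this method from the local algorithms of Section 2.3.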

Other methods have also been shown to achieve the SOS threshold $\lambda\sim n^{-p/4}$ , including SOS itself and various spectral methods inspired by it [HSS15, HSSS16].

2.5 Boosting and Linearization

One fundamental difference between the matrix case ($p=2$) and the tensor case ($p\geq 3$) is a boosting property. The following result, implicit in [RM14], shows that for $p\geq 3$, if $\lambda$ is substantially above the information-theoretic threshold (i.e., $\lambda\gg n^{(1-p)/2}$) then weak recovery can be boosted to strong recovery via a single power iteration. We give a proof in Appendix D.

Proposition 2.6.

Let ${\bm{Y}}\sim\mathbb{P}_{\lambda}$ with any spike prior $P_{\mathtt{x}}$ supported on $\mathcal{S}^{n-1}$ . Suppose we have an initial guess $u\in\mathbb{R}^{n}$ satisfying $\mathrm{corr}(u,x_{*})\geq\tau$ . Obtain $\widehat{x}$ from $u$ via a single iteration of the tensor power method: $\widehat{x}={\bm{Y}}\{u\}$ . There exists a constant $c=c(p)>0$ such that with high probability,

$$\mathrm{corr}(\widehat{x},x_{*})\geq 1-c\lambda^{-1}\tau^{1-p}n^{(1-p)/2}.$$

In particular, if $\tau>0$ is any constant and $\lambda=\omega(n^{(1-p)/2})$ then $\mathrm{corr}(\widehat{x},x_{*})=1-o(1)$.

For $p\geq 3$ , since we do not expect polynomial-time algorithms to succeed when $\lambda=O(n^{(1-p)/2})$ , this implies an “all-or-nothing” phenomenon: for a given $\lambda=\lambda(n)$ , the optimal polynomial-time algorithm will either achieve correlation that is asymptotically 0 or asymptotically 1. This is in stark contrast to the matrix case where, for $\lambda=\hat{\lambda}/\sqrt{n}$ , the optimal correlation is a constant (in $[0,1]$ ) depending on both $\hat{\lambda}$ and the spike prior $P_{\mathtt{x}}$ .

This boosting result substantially simplifies things when $p\geq 3$ because it implies that the only important question is identifying the threshold for weak recovery, instead of trying to achieve the optimal correlation. Heuristically, since we only want to attain the optimal threshold, statistical physics suggests that we can use a simple “linearized” spectral algorithm instead of a more sophisticated nonlinear algorithm. To illustrate this in the matrix case ($p=2$), one needs to use AMP in order to achieve optimal correlation, but one can achieve the optimal threshold using linearized AMP, which boils down to computing the top eigenvector. In the related setting of community detection in the stochastic block model, one needs to use belief propagation to achieve optimal correlation [DKMZ11b, DKMZ11a, MNS14], but one can achieve the optimal threshold using a linearized version of belief propagation, which is a spectral method on the non-backtracking walk matrix [KMM+13, BLM15] or the related Bethe Hessian [SKZ14]. Our spectral methods for tensor PCA are based on the Kikuchi Hessian, which is a generalization of the Bethe Hessian.

2.6 Subexponential-time Algorithms for Tensor PCA

The degree-$\ell$ sum-of-squares algorithm is a large semidefinite program that requires runtime $n^{O(\ell)}$ to solve. Oftentimes the regime of interest is when $\ell$ is constant, so that the algorithm runs in polynomial time. However, one can also explore the power of subexponential-time algorithms by letting $\ell=n^{\delta}$ for $\delta\in(0,1)$, which gives an algorithm with runtime roughly $2^{n^{\delta}}$. Results of this type are known for tensor PCA [RRS17, BGG+16, BGL16]. The strongest such results are for a different variant of tensor PCA, which we now define.

Definition 2.7.

In the order- $p$ discrete spiked tensor model with spike prior $P_{\mathtt{x}}$ (normalized so that $\|x_{*}\|=\sqrt{n}$ ) and SNR parameter $\lambda\geq 0$ , we draw a spike $x_{*}\sim P_{\mathtt{x}}$ and then for each $1\leq i_{1}\leq\cdots\leq i_{p}\leq n$ , we independently observe a $\{\pm 1\}$ -valued random variable ${\bm{Y}}_{i_{1},\ldots,i_{p}}$ with $\mathbb{E}[{\bm{Y}}_{i_{1},\ldots,i_{p}}]=\lambda(x_{*}^{\otimes p})_{i_{1},\ldots,i_{p}}$ .

This model differs from our usual one in that the observations are conditionally Rademacher instead of Gaussian, but we do not believe this makes an important difference. However, for technical reasons, the known SOS results are strongest in this discrete setting.

Theorem 2.8 ([BGL16, HSS15]).

For any $1\leq\ell\leq n$ , there is an algorithm with runtime $n^{O(\ell)}$ that achieves strong detection and strong recovery in the order- $p$ discrete spiked tensor model (with any spike prior) whenever

$$\lambda\geq\ell^{1/2-p/4}n^{-p/4}\,\mathrm{polylog}(n).$$

The work of [BGL16] shows how to certify an upper bound on the injective norm of a random $\{\pm 1\}$ -valued tensor, which immediately implies the algorithm for strong detection. When combined with [HSS15], this can also be made into an algorithm for strong recovery (see Lemma 4.4 of [HSS15]). Similar (but weaker) SOS results are also known for the standard spiked tensor model (see [RRS17] and arXiv version 1 of [BGL16]), and we expect that Theorem 2.8 also holds for this case.

When $\ell=n^{\delta}$ for $\delta\in(0,1)$, Theorem 2.8 implies that we have an algorithm with runtime $n^{O(n^{\delta})}=2^{n^{\delta+o(1)}}$ that succeeds when $\lambda\gg n^{\delta/2-p\delta/4-p/4}$. Note that this interpolates smoothly between the polynomial-time threshold ($\lambda\sim n^{-p/4}$) when $\delta=0$, and the information-theoretic threshold ($\lambda\sim n^{(1-p)/2}$) when $\delta=1$. We will prove (for $p$ even) that our algorithms achieve this same tradeoff, and we expect this tradeoff to be optimal. (One form of evidence suggesting that this tradeoff is optimal is based on the low-degree likelihood ratio; see [KWB19].)
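The interpolation can be checked with exact arithmetic. The helper below (an illustrative name, `snr_exponent`) returns the exponent $e$ such that the threshold at level $\ell=n^{\delta}$ is $\lambda\sim n^{e}$, ignoring polylogarithmic factors.

```python
from fractions import Fraction


def snr_exponent(p, delta):
    """Exponent delta/2 - p*delta/4 - p/4 of n in the level-n^delta threshold."""
    delta = Fraction(delta)
    return delta / 2 - Fraction(p) * delta / 4 - Fraction(p, 4)


p = 4
poly_time = snr_exponent(p, 0)               # equals -p/4
info_theoretic = snr_exponent(p, 1)          # equals (1 - p)/2
halfway = snr_exponent(p, Fraction(1, 2))    # an intermediate level
```

For $p=4$ this gives exponents $-1$, $-3/2$, and $-5/4$ respectively, so the intermediate level sits strictly between the two endpoint thresholds.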

3 Main results

In this section we present our main results about detection and recovery in the spiked tensor model. We propose a hierarchy of spectral methods, which are directly derived from the hierarchy of Kikuchi free energies. Specifically, the symmetric difference matrix defined below appears (approximately) as a submatrix of the Hessian of the Kikuchi free energy. The full details of this derivation are given in Section 4 and Appendix E. For now we simply state the algorithms and results.

We will restrict our attention to the Rademacher-spiked tensor model, which is the setting in which we derived our algorithms. However, we show in Appendix B that the same algorithm works for a large class of priors (at least for strong detection). Furthermore, we show in Appendix F.1 that the same algorithm can also be used for refuting random $k$ -XOR formulas (when $k$ is even).

We will also restrict to the case where the tensor order $p$ is even. The case of odd $p$ is discussed in Appendix C, where we give an algorithm derived from the Kikuchi Hessian and conjecture that it achieves optimal performance. We are unable to prove this, but we are able to prove that optimal results are attained by a related algorithm (see Appendix F.2).

Our approach requires the introduction of two matrices:

The symmetric difference matrix of order $\ell$ .

Let $p$ be even and let ${\bm{Y}}\in(\mathbb{R}^{n})^{\otimes p}$ be the observed order- $p$ symmetric tensor. We will only use the entries ${\bm{Y}}_{i_{1},\ldots,i_{p}}$ for which the indices $i_{1},\ldots,i_{p}$ are distinct; we denote such entries by ${\bm{Y}}_{E}$ where $E\subseteq[n]$ with $|E|=p$ . Fix an integer $\ell\in[p/2,n-p/2]$ , and consider the symmetric ${n\choose\ell}\times{n\choose\ell}$ matrix ${\bm{M}}$ indexed by sets $S\subseteq[n]$ of size $\ell$ , having entries

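$${\bm{M}}_{S,T}=\begin{cases}{\bm{Y}}_{S\bigtriangleup T}&\text{if }|S\bigtriangleup T|=p,\\ 0&\text{otherwise.}\end{cases}\tag{2}$$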

Here $\bigtriangleup$ denotes the symmetric difference between sets. The leading eigenvector of ${\bm{M}}$ is intended to be an estimate of $(x_{*}^{S})_{|S|=\ell}$ where $x^{S}\mathrel{\mathop{:}}=\prod_{i\in S}x_{i}$ . The following voting matrix is a natural rounding scheme to extract an estimate of $x_{*}$ from such a vector.

The voting matrix.

To a vector $v\in\mathbb{R}^{{n\choose\ell}}$ we associate the following symmetric $n\times n$ ‘voting’ matrix ${\bm{V}}(v)$ having entries

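$${\bm{V}}_{ij}(v)=\begin{cases}\sum_{S\in{[n]\choose\ell}}v_{S}\,v_{S\bigtriangleup\{i,j\}}\,\mathbf{1}_{i\in S,\,j\notin S}&\text{if }i\neq j,\\ 0&\text{otherwise.}\end{cases}$$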

Let us define the important quantity

$$d_{\ell}:={\ell\choose p/2}{n-\ell\choose p/2}.$$

This is the number of sets $T$ of size $\ell$ such that $|S\bigtriangleup T|=p$ for a given set $S$ of size $\ell$ . Now we are in position to formulate our algorithms for detection and recovery.
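The two definitions above can be sketched on a toy instance. The following is a minimal illustration, not the authors' implementation: it reads the symmetric difference matrix as $M_{S,T}=Y_{S\bigtriangleup T}$ when $|S\bigtriangleup T|=p$ and $0$ otherwise, stores the tensor entries in a dictionary keyed by index sets (a representation chosen here for convenience), and uses hypothetical helper names `build_M` and `d_ell`. Feeding in an all-ones tensor makes every row sum equal to the number of neighbors of $S$ , which should match $d_{\ell}$ .

```python
from itertools import combinations
from math import comb
import numpy as np

def build_M(Y, n, ell, p):
    """Symmetric difference matrix: M[S, T] = Y[S ^ T] if |S ^ T| = p, else 0.

    Y is a dict mapping each frozenset E of size p to the entry Y_E
    (a storage choice made for this sketch).
    """
    sets = [frozenset(c) for c in combinations(range(n), ell)]
    M = np.zeros((len(sets), len(sets)))
    for a, S in enumerate(sets):
        for b, T in enumerate(sets):
            E = S ^ T                      # symmetric difference of the two index sets
            if len(E) == p:
                M[a, b] = Y[E]
    return M, sets

def d_ell(n, ell, p):
    """Number of size-ell sets T with |S ^ T| = p, for any fixed S of size ell."""
    return comb(ell, p // 2) * comb(n - ell, p // 2)

# Toy sizes (assumed for illustration only).
n, ell, p = 7, 2, 4
Y_ones = {frozenset(c): 1.0 for c in combinations(range(n), p)}
X, sets = build_M(Y_ones, n, ell, p)
row_degrees = X.sum(axis=1)                # each row sum counts the neighbors of S
```

The double loop over all pairs of sets costs ${n\choose\ell}^{2}$ operations and is only meant for small $n$ ; a practical version would enumerate the $d_{\ell}$ neighbors of each set directly.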

Algorithm 3.1 (Detection for even $p$ ).

1. Compute the top eigenvalue $\lambda_{\max}({\bm{M}})$ of the symmetric difference matrix ${\bm{M}}$ .

2. Reject the null hypothesis $\mathbb{P}_{0}$ (i.e., return ‘1’) if $\lambda_{\max}({\bm{M}})\geq\lambda d_{\ell}/2$ .

Algorithm 3.2 (Recovery for even $p$ ).

1. Compute a (unit-norm) leading eigenvector $v_{\textup{top}}\in\mathbb{R}^{{n\choose\ell}}$ of ${\bm{M}}$ . (We define a leading eigenvector to be an eigenvector whose eigenvalue is maximal, although our results still hold for an eigenvector whose eigenvalue is maximal in absolute value.)

2. Form the associated voting matrix ${\bm{V}}(v_{\textup{top}})$ .

3. Compute a leading eigenvector $\widehat{x}$ of ${\bm{V}}(v_{\textup{top}})$ , and output $\widehat{x}$ .
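The steps of Algorithms 3.1 and 3.2 can be sketched end to end on a small instance. This is an illustrative sketch under several assumptions: the helper names (`build_M`, `voting_matrix`, `detect_and_recover`) are hypothetical, the distinct-index entries are generated as $Y_{E}=\lambda x_{*}^{E}+g_{E}$ with i.i.d. standard Gaussian $g_{E}$ (the form used in Appendix A), the voting matrix is taken to be $V_{ij}(v)=\sum_{S}v_{S}v_{S\bigtriangleup\{i,j\}}\mathbf{1}_{i\in S,j\notin S}$ , and the overlap is measured as a normalized absolute inner product. The value of $\lambda$ is deliberately very large so that the toy example is far above any threshold.

```python
from itertools import combinations
from math import comb
import numpy as np

def build_M(Y, n, ell, p):
    """M[S, T] = Y[S ^ T] if |S ^ T| = p, else 0 (sets of size ell)."""
    sets = [frozenset(c) for c in combinations(range(n), ell)]
    M = np.zeros((len(sets), len(sets)))
    for a, S in enumerate(sets):
        for b, T in enumerate(sets):
            E = S ^ T
            if len(E) == p:
                M[a, b] = Y[E]
    return M, sets

def voting_matrix(v, sets, n):
    """V[i, j] = sum over S with i in S, j not in S of v[S] * v[S ^ {i, j}]."""
    index = {S: a for a, S in enumerate(sets)}
    V = np.zeros((n, n))
    for a, S in enumerate(sets):
        for i in S:
            for j in set(range(n)) - S:
                V[i, j] += v[a] * v[index[S ^ {i, j}]]
    return V

def detect_and_recover(Y, n, ell, p, lam):
    M, sets = build_M(Y, n, ell, p)
    eigvals, eigvecs = np.linalg.eigh(M)          # eigenvalues in ascending order
    d = comb(ell, p // 2) * comb(n - ell, p // 2)
    reject_null = eigvals[-1] >= lam * d / 2      # Algorithm 3.1
    V = voting_matrix(eigvecs[:, -1], sets, n)    # Algorithm 3.2, steps 1 and 2
    x_hat = np.linalg.eigh(V)[1][:, -1]           # Algorithm 3.2, step 3
    return reject_null, x_hat

rng = np.random.default_rng(0)
n, ell, p, lam = 7, 2, 4, 100.0                   # toy sizes, lam far above threshold
x_star = rng.choice([-1.0, 1.0], size=n)
Y = {frozenset(c): lam * np.prod(x_star[list(c)]) + rng.standard_normal()
     for c in combinations(range(n), p)}
reject_null, x_hat = detect_and_recover(Y, n, ell, p, lam)
# Overlap up to a global sign (the normalization here is an assumption of this sketch).
overlap = abs(x_hat @ x_star) / (np.linalg.norm(x_hat) * np.linalg.norm(x_star))
```

Dense eigendecomposition is used only for brevity; since only the top eigenpair is needed, an iterative method on the sparse matrix would be the natural choice at larger sizes.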

The next two theorems characterize the performance of Algorithms 3.1 and 3.2 for the strong detection and recovery tasks, respectively. The proofs can be found in Appendix A.

Theorem 3.3.

Consider the Rademacher-spiked tensor model with $p$ even. For all $\lambda\geq 0$ and $\ell\in[p/2,n-p/2]$ , we have

[TABLE]

Therefore, Algorithm 3.1 achieves strong detection between $\mathbb{P}_{0}$ and $\mathbb{P}_{\lambda}$ if $\frac{1}{8}\lambda^{2}d_{\ell}-\ell\log n\to+\infty$ as $n\to+\infty$ .

Theorem 3.4.

Consider the Rademacher-spiked tensor model with $p$ even. Let $\widehat{x}\in\mathbb{R}^{n}$ be the output of Algorithm 3.2. There exists an absolute constant $c_{0}>0$ such that for all $\epsilon>0$ and $\delta\in(0,1)$ , if $\ell\leq n\epsilon^{2}$ and $\lambda\geq c_{0}\epsilon^{-4}\sqrt{\log(n^{\ell}/\delta)\big{/}d_{\ell}}$ , then $\mathrm{corr}(\widehat{x},x_{*})\geq 1-c_{0}\epsilon$ with probability at least $1-\delta$ .

Remark 3.5.

If $\ell=o(n)$ , we have $d_{\ell}=\Theta(n^{p/2}\ell^{p/2})$ , and so the above theorems imply that strong detection and strong recovery are possible as soon as $\lambda\gg\ell^{-(p-2)/4}n^{-p/4}\sqrt{\log n}$ . Comparing with Theorem 2.8, this scaling coincides with the guarantees achieved by the level- $\ell$ SOS algorithm of [BGL16], up to a possible discrepancy in logarithmic factors.

Due to the particularly simple structure of the symmetric difference matrix ${\bm{M}}$ (in particular, the fact that its entries are simply entries of ${\bm{Y}}$ ), the proof of detection (Theorem 3.3) follows from a straightforward application of the matrix Chernoff bound. In contrast, the corresponding SOS results [RRS17, BGG+16, BGL16] work with more complicated matrices involving high powers of the entries of ${\bm{Y}}$ , and the analysis is much more involved.

Our proof of recovery is unusual in that the signal component of ${\bm{M}}$ , call it ${\bm{X}}$ , is not rank-one; it even has a vanishing spectral gap when $\ell\gg 1$ . Thus, the leading eigenvector of ${\bm{M}}$ does not correlate well with the leading eigenvector of ${\bm{X}}$ . While this may seem to render recovery hopeless at first glance, this is not the case, due to the fact that many eigenvectors (actually, eigenspaces) of ${\bm{X}}$ contain non-trivial information about the spike $x_{*}$ , as opposed to only the top one. We prove this by exploiting the special structure of ${\bm{X}}$ through the Johnson scheme, and using tools from Fourier analysis on a slice of the hypercube, in particular a Poincaré-type inequality by [Fil16].

Removing the logarithmic factor

Both Theorem 3.3 and Theorem 3.4 involve a logarithmic factor in $n$ in the lower bound on SNR $\lambda$ . These log factors are an artifact of the matrix Chernoff bound, and we believe they can be removed. (The analysis of [HSS15] removes the log factors for the tensor unfolding algorithm, which is essentially the case $p=3$ and $\ell=1$ of our algorithm.) This suggests the following precise conjecture on the power of polynomial-time algorithms.

Conjecture 3.6.

Fix $p$ and let $\ell$ be constant (not depending on $n$ ). There exists a constant $c_{p}(\ell)>0$ with $c_{p}(\ell)\to 0$ as $\ell\to\infty$ (with $p$ fixed) such that if $\lambda\geq c_{p}(\ell)n^{-p/4}$ then Algorithm 3.1 and Algorithm 3.2 (which run in time $n^{O(\ell)}$ ) achieve strong detection and strong recovery, respectively.

Specifically, we expect $c_{p}(\ell)\sim\ell^{-(p-2)/4}$ for large $\ell$ .

4 Motivating the Symmetric Difference Matrices

In this section we motivate the symmetric difference matrices used in our algorithms. In Section 4.1 we give some high-level intuition, including an explanation of how our algorithms can be thought of as iterative message-passing procedures among subsets of size $\ell$ . In Section 4.2 we give a more principled derivation based on the Kikuchi Hessian, with many of the calculations deferred to Appendix E.

4.1 Intuition: Higher-Order Message-Passing and Maximum Likelihood

As stated previously, our algorithms will choose to ignore the entries ${\bm{Y}}_{i_{1},\ldots,i_{p}}$ for which $i_{1},\ldots,i_{p}$ are not distinct; these entries turn out to be unimportant asymptotically. We restrict to the Rademacher-spiked tensor model, as this yields a clean and simple derivation. The posterior distribution for the spike $x_{*}$ given the observed tensor ${\bm{Y}}$ is

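$$\mathbb{P}(x\,|\,{\bm{Y}})\;\propto\;\exp\Big(\lambda\sum_{i_{1}<\cdots<i_{p}}{\bm{Y}}_{i_{1}\cdots i_{p}}x_{i_{1}}\cdots x_{i_{p}}\Big)=\exp\Big(\lambda\sum_{|E|=p}{\bm{Y}}_{E}\,x^{E}\Big)\tag{5}$$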

over the domain $x\in\{\pm 1\}^{n}$ . We take $p$ to be even; a similar derivation works for odd $p$ . Now fix $\ell\in[p/2,n-p/2]$ . We can write the above as

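$$\mathbb{P}(x\,|\,{\bm{Y}})\;\propto\;\exp\Big(\frac{\lambda}{N}\sum_{|S\bigtriangleup T|=p}{\bm{Y}}_{S\bigtriangleup T}\,x^{S}x^{T}\Big),\tag{6}$$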

where the sum is over ordered pairs $(S,T)$ of sets $S,T\subseteq[n]$ with $|S|=|T|=\ell$ and $|S\bigtriangleup T|=p$ , and where $N=d_{\ell}{n\choose\ell}/{n\choose p}$ is the number of terms $(S,T)$ with a given symmetric difference $E$ .
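The count $N$ can be checked by brute force on a small case. The sketch below (toy sizes chosen here for illustration) fixes one index set $E$ of size $p$ , counts the ordered pairs $(S,T)$ of size- $\ell$ sets with $S\bigtriangleup T=E$ , and compares with $d_{\ell}{n\choose\ell}/{n\choose p}$ .

```python
from itertools import combinations
from math import comb

n, ell, p = 7, 2, 4                       # toy sizes (assumed for illustration)
sets = [frozenset(c) for c in combinations(range(n), ell)]
E = frozenset(range(p))                   # one fixed index set of size p

# Ordered pairs (S, T) of size-ell sets whose symmetric difference is exactly E.
count = sum(1 for S in sets for T in sets if S ^ T == E)

d = comb(ell, p // 2) * comb(n - ell, p // 2)
N = d * comb(n, ell) // comb(n, p)        # formula from the text
```

By symmetry the count does not depend on which $E$ is fixed, so checking a single $E$ suffices.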

A natural message-passing algorithm to maximize the log-likelihood is the following. For each $S\subseteq[n]$ of size $|S|=\ell$ , keep track of a variable $u_{S}\in\mathbb{R}$ , which is intended to be an estimate of $x_{*}^{S}\mathrel{\mathop{:}}=\prod_{i\in S}(x_{*})_{i}$ . Note that there are consistency constraints that $(x_{*}^{S})_{|S|=\ell}$ must obey, such as $x_{*}^{S}x_{*}^{T}x_{*}^{V}=1$ when $S\bigtriangleup T\bigtriangleup V=\varnothing$ ; we will relax the problem and will not require our vector $u=(u_{S})_{|S|=\ell}$ to obey such constraints. Instead we simply attempt to maximize

$$\frac{1}{\|u\|^{2}}\sum_{|S\bigtriangleup T|=p}{\bm{Y}}_{S\bigtriangleup T}\,u_{S}u_{T}\tag{7}$$

over all $u\in\mathbb{R}^{n\choose\ell}$ . To do this, we iterate the update equations

$$u_{S}\leftarrow\sum_{T\,:\,|S\bigtriangleup T|=p}{\bm{Y}}_{S\bigtriangleup T}\,u_{T}.\tag{8}$$

We call $S$ and $T$ neighbors if $|S\bigtriangleup T|=p$ . Intuitively, each neighbor $T$ of $S$ sends a message $m_{T\to S}\mathrel{\mathop{:}}={\bm{Y}}_{S\bigtriangleup T}u_{T}$ to $S$ , indicating $T$ ’s opinion about $u_{S}$ . We update $u_{S}$ to be the sum of all incoming messages.

Now note that the sum in (7) is simply $\|u\|^{-2}\,u^{\top}{\bm{M}}u$ where ${\bm{M}}$ is the symmetric difference matrix, and (8) can be written as

$$u\leftarrow{\bm{M}}u.$$

Thus our natural message-passing scheme is precisely power iteration against ${\bm{M}}$ , and so we should take the leading eigenvector $v_{\textup{top}}$ of ${\bm{M}}$ as our estimate of $(x_{*}^{S})_{|S|=\ell}$ (up to a scaling factor). Finally, defining our voting matrix ${\bm{V}}(v_{\textup{top}})$ and taking its leading eigenvector is a natural method for rounding $v_{\textup{top}}$ to a vector of the form $u^{x}$ where $u^{x}_{S}=x^{S}$ , thus restoring the consistency constraints we ignored before.
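The message-passing view can be sketched directly, without ever forming ${\bm{M}}$ . In this illustration each set $S$ sums the messages ${\bm{Y}}_{S\bigtriangleup T}u_{T}$ from its neighbors; the renormalization after each round is an addition of this sketch (it keeps the numbers bounded and does not change the direction of $u$ ). To make the outcome easy to verify, the observations are taken noise-free, $Y_{E}=x^{E}$ , which is an assumption of the example rather than the model of the paper.

```python
from itertools import combinations
import numpy as np

n, ell, p = 7, 2, 4                                   # toy sizes (illustrative)
rng = np.random.default_rng(1)
x = rng.choice([-1.0, 1.0], size=n)
sets = [frozenset(c) for c in combinations(range(n), ell)]

# Noise-free observations Y_E = x^E (assumption made for this demo).
Y = {frozenset(c): float(np.prod(x[list(c)])) for c in combinations(range(n), p)}

# S and T are neighbors when |S ^ T| = p.
neighbors = {S: [T for T in sets if len(S ^ T) == p] for S in sets}

u = {S: rng.standard_normal() for S in sets}          # random initial estimates
for _ in range(200):
    # Each S sums the incoming messages m_{T -> S} = Y_{S ^ T} * u_T.
    new_u = {S: sum(Y[S ^ T] * u[T] for T in neighbors[S]) for S in sets}
    norm = np.sqrt(sum(val ** 2 for val in new_u.values()))
    u = {S: val / norm for S, val in new_u.items()}   # renormalize (added here)

# Compare with the target vector (x^S)_{|S| = ell}, up to a global sign.
target = np.array([np.prod(x[list(S)]) for S in sets])
alignment = abs(sum(u[S] * t for S, t in zip(sets, target))) / np.sqrt(len(sets))
```

In this noise-free toy case the iteration is power iteration on a conjugate of the Kneser-type adjacency matrix, so `alignment` approaches 1 geometrically.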

Indeed, if we carry out this procedure on all subsets $S\subseteq[n]$ then this works as intended, and no rounding is necessary: consider the $2^{n}\times 2^{n}$ matrix $\overline{{\bm{M}}}_{S,T}={\bm{Y}}_{S\bigtriangleup T}\mathbf{1}_{|S\bigtriangleup T|=p}$ . It is easy to verify that the eigenvectors of $\overline{{\bm{M}}}$ are precisely the Fourier basis vectors on the hypercube, namely vectors of the form $u^{x}$ where $u^{x}_{S}=x^{S}$ and $x\in\{\pm 1\}^{n}$ . Moreover, the eigenvalue associated to $u^{x}$ is

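$$\sum_{T\subseteq[n]\,:\,|S\bigtriangleup T|=p}{\bm{Y}}_{S\bigtriangleup T}\,x^{S}x^{T}=\sum_{|E|=p}{\bm{Y}}_{E}\,x^{E}.$$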

This is the expression in the log-likelihood in (5). Thus the leading eigenvector of $\overline{{\bm{M}}}$ is exactly $u^{x}$ where $x$ is the maximum-likelihood estimate of $x_{*}$ .
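The eigenvector claim for $\overline{{\bm{M}}}$ is easy to confirm numerically for small $n$ . The sketch below (toy sizes and variable names are choices of this example) builds the full $2^{n}\times 2^{n}$ matrix from arbitrary entries ${\bm{Y}}_{E}$ and checks that every Fourier vector $u^{x}$ satisfies $\overline{{\bm{M}}}u^{x}=\big(\sum_{|E|=p}{\bm{Y}}_{E}x^{E}\big)u^{x}$ .

```python
from itertools import combinations, product
import numpy as np

n, p = 5, 2                                           # toy sizes: a 32 x 32 matrix
rng = np.random.default_rng(2)
Y = {frozenset(c): rng.standard_normal() for c in combinations(range(n), p)}

# All subsets of [n], of every size.
all_sets = [frozenset(c) for k in range(n + 1) for c in combinations(range(n), k)]

M_bar = np.zeros((2 ** n, 2 ** n))
for a, S in enumerate(all_sets):
    for b, T in enumerate(all_sets):
        if len(S ^ T) == p:
            M_bar[a, b] = Y[S ^ T]

def monomial(x, S):
    """x^S = product of x_i over i in S (the empty product is 1)."""
    return float(np.prod([x[i] for i in S]))

max_residual = 0.0
for x in product([-1.0, 1.0], repeat=n):
    u_x = np.array([monomial(x, S) for S in all_sets])
    eigenvalue = sum(Y[E] * monomial(x, E) for E in Y)
    residual = np.abs(M_bar @ u_x - eigenvalue * u_x).max()
    max_residual = max(max_residual, residual)
```

The check works for any values of ${\bm{Y}}_{E}$ because it only uses the identity $x^{S\bigtriangleup E}=x^{S}x^{E}$ for $x\in\{\pm 1\}^{n}$ .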

This procedure succeeds all the way down to the information-theoretic threshold $\lambda\sim n^{(1-p)/2}$ , but takes exponential time. Our contribution can be viewed as showing that even when we restrict to the submatrix ${\bm{M}}$ of $\overline{{\bm{M}}}$ supported on subsets of size $\ell$ , the leading eigenvector still allows us to recover $x_{*}$ whenever the SNR is sufficiently large. Proving this requires us to perform Fourier analysis over a slice of the hypercube rather than the simpler setting of the entire hypercube, which we do by appealing to Johnson schemes and some results of [Fil16].

4.2 Variational Inference and Kikuchi Free Energy

We now introduce the Kikuchi approximations to the free energy (or simply the Kikuchi free energies) of the above posterior (5) [Kik51, Kik94], the principle from which our algorithms are derived. For concreteness we restrict to the case of the Rademacher-spiked tensor model, but the Kikuchi free energies can be defined for general graphical models [YFW03].

The posterior distribution in (5) is a Gibbs distribution $\mathbb{P}(x\,|\,{\bm{Y}})\propto e^{-\beta H(x)}$ with random Hamiltonian $H(x):=-\lambda\sum_{i_{1}<\cdots<i_{p}}{\bm{Y}}_{i_{1}\cdots i_{p}}x_{i_{1}}\cdots x_{i_{p}}$ , and inverse temperature $\beta=1$ . We let $Z_{n}(\beta;{\bm{Y}}):=\sum_{x\in\{\pm 1\}^{n}}e^{-\beta H(x)}$ be the partition function of the model, and denote by $F_{n}(\beta;{\bm{Y}}):=-\frac{1}{\beta}\log Z_{n}(\beta;{\bm{Y}})$ its free energy. It is a classical fact that the Gibbs distribution has the following variational characterization. Fix a finite domain $\Omega$ (e.g., $\{\pm 1\}^{n}$ ), $\beta>0$ and $H:\Omega\to\mathbb{R}$ . Consider the optimization problem

$$\inf_{\mu}F(\mu),\tag{9}$$

where the infimum is over probability distributions $\mu$ supported on $\Omega$ , and define the free energy functional $F$ of $\mu$ by

$$F(\mu):=\mathbb{E}_{x\sim\mu}[H(x)]-\frac{1}{\beta}\mathcal{S}(\mu),\tag{10}$$

where $\mathcal{S}(\mu)$ is the Shannon entropy of $\mu$ , i.e., $\mathcal{S}(\mu)=-\sum_{x\in\Omega}\mu(x)\log\mu(x)$ . Then the unique optimizer of (9) is the Gibbs distribution $\mu^{*}(x)\propto\exp(-\beta H(x))$ . If we specialize this statement to our setting, $\mu^{*}=\mathbb{P}(\cdot|{\bm{Y}})$ and $F_{n}(1;{\bm{Y}})=F(\mu^{*})$ . We refer to [WJ08] for more background.
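The variational characterization can be illustrated on a tiny domain. The sketch assumes the free energy functional has the standard form $F(\mu)=\mathbb{E}_{x\sim\mu}[H(x)]-\frac{1}{\beta}\mathcal{S}(\mu)$ , uses an arbitrary random energy function on $\{\pm 1\}^{3}$ , and compares the Gibbs distribution against randomly drawn competitors. It is a numerical sanity check, not a proof.

```python
from itertools import product
import numpy as np

rng = np.random.default_rng(3)
beta = 1.0
omega = list(product([-1, 1], repeat=3))        # finite domain with 8 states
H = rng.standard_normal(len(omega))             # arbitrary energies, one per state

def free_energy(mu):
    """F(mu) = E_mu[H] - S(mu) / beta, with S the Shannon entropy (assumed form)."""
    entropy = -np.sum(mu * np.log(mu))
    return float(mu @ H - entropy / beta)

weights = np.exp(-beta * H)
Z = weights.sum()                               # partition function
gibbs = weights / Z
F_gibbs = free_energy(gibbs)
F_exact = -np.log(Z) / beta                     # -(1/beta) log Z

# Random strictly positive distributions on omega as competitors.
F_random = [free_energy(rng.dirichlet(np.ones(len(omega)))) for _ in range(100)]
```

The gap $F(\mu)-F(\mu^{*})$ equals $\frac{1}{\beta}$ times the Kullback-Leibler divergence from $\mu$ to $\mu^{*}$ , which is why no competitor can do better.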

In light of the above variational characterization, a natural algorithmic strategy to learn the posterior distribution is to minimize the free energy functional $F(\mu)$ over distributions $\mu$ . However, this is a priori intractable because (for a high-dimensional domain such as $\Omega=\{\pm 1\}^{n}$ ) an exponential number of parameters are required to represent $\mu$ . The idea underlying the belief propagation algorithm [Pea86, YFW03] is to work only with locally-consistent marginals, or beliefs, instead of a complete distribution $\mu$ . Standard belief propagation works with beliefs on singleton variables and on pairs of variables. The Bethe free energy is a proxy for the free energy that only depends on these beliefs, and belief propagation is a certain procedure that iteratively updates the beliefs in order to locally minimize the Bethe free energy. The level- $r$ Kikuchi free energy is a generalization of the Bethe free energy that depends on $r$ -wise beliefs and gives (in principle) increasingly better approximations of $F(\mu^{*})$ as $r$ increases. Our algorithms are based on the principle of locally minimizing Kikuchi free energy, which we define next.

We now define the level- $r$ Kikuchi approximation to the free energy. We require $r\geq p$ , i.e., the Kikuchi level needs to be at least as large as the interactions present in the data (although the $r<p$ case could be handled by defining a modified graphical model with auxiliary variables). The Bethe free energy is the case $r=2$ .

For $S\subseteq[n]$ with $0<|S|\leq r$ , let $b_{S}:\{\pm 1\}^{S}\to\mathbb{R}$ denote the belief on $S$ , which is a probability mass function over $\{\pm 1\}^{|S|}$ representing our belief about the joint distribution of $x_{S}\mathrel{\mathop{:}}=\{x_{i}\}_{i\in S}$ . Let $b=\{b_{S}:S\in{[n]\choose\leq r}\}$ denote the set of beliefs on $s$ -wise interactions for all $s\leq r$ . Following [YFW03], the Kikuchi free energy is a real-valued functional $\mathcal{K}$ of $b$ having the form $\mathcal{E}-\frac{1}{\beta}\mathcal{S}$ (in our case, $\beta=1$ ). Here the first term is the ‘energy’ term

$$\mathcal{E}(b):=-\lambda\sum_{|S|=p}{\bm{Y}}_{S}\sum_{x_{S}\in\{\pm 1\}^{S}}b_{S}(x_{S})\,x^{S},$$

where, recall, $x^{S}\mathrel{\mathop{:}}=\prod_{i\in S}x_{i}$ . (This is a proxy for the term $\mathbb{E}_{x\sim\mu}[H(x)]$ in (10).) The second term in $\mathcal{K}$ is the ‘entropy’ term

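$$\mathcal{S}(b):=\sum_{0<|S|\leq r}c_{S}\,\mathcal{S}_{S}(b),\qquad\mathcal{S}_{S}(b):=-\sum_{x_{S}\in\{\pm 1\}^{S}}b_{S}(x_{S})\log b_{S}(x_{S}),$$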

where the overcounting numbers are $c_{S}:=\sum_{T\supseteq S,\,|T|\leq r}(-1)^{|T\setminus S|}$ . These are defined so that for any $S\subseteq[n]$ with $0<|S|\leq r$ ,

$$\sum_{T\supseteq S,\,|T|\leq r}c_{T}=1,$$

which corrects for overcounting. Notice that $\mathcal{E}$ and $\mathcal{S}$ each take the form of an “expectation” with respect to the beliefs $b_{S}$ ; these would be actual expectations were the beliefs the marginals of an actual probability distribution. This situation is to be contrasted with the notion of a pseudo-expectation, which plays a central role in the theory underlying the sum-of-squares algorithm.
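The overcounting numbers depend on $S$ only through $|S|$ , so their defining property can be checked with binomial coefficients. In the sketch below the function names are hypothetical; `corrected_count` sums $c_{T}$ over all $T\supseteq S$ with $|T|\leq r$ for a fixed $S$ of size $s$ , which should equal 1 for every $0<s\leq r$ .

```python
from math import comb

def overcounting_number(s, n, r):
    """c_S for |S| = s: sum over T containing S, |T| <= r, of (-1)^(|T| - |S|)."""
    # There are comb(n - s, t - s) supersets T of size t.
    return sum((-1) ** (t - s) * comb(n - s, t - s) for t in range(s, r + 1))

def corrected_count(s, n, r):
    """Sum of c_T over all T containing a fixed S of size s, with |T| <= r."""
    return sum(comb(n - s, t - s) * overcounting_number(t, n, r)
               for t in range(s, r + 1))

n, r = 6, 3                                   # toy sizes (illustrative)
numbers = [overcounting_number(s, n, r) for s in range(1, r + 1)]
counts = [corrected_count(s, n, r) for s in range(1, r + 1)]
```

For these sizes the overcounting numbers are $6,-3,1$ for $|S|=1,2,3$ , and each corrected count is 1.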

Our algorithms are based on the Kikuchi Hessian, a generalization of the Bethe Hessian matrix that was introduced in the setting of community detection [SKZ14]. The Bethe Hessian is the Hessian of the Bethe free energy with respect to the moments of the beliefs, evaluated at belief propagation’s so-called “uninformative fixed point.” The bottom eigenvector of the Bethe Hessian is a natural estimator for the planted signal because it represents the best direction for local improvement of the Bethe free energy, starting from belief propagation’s uninformative starting point. We generalize this method and compute the analogous Kikuchi Hessian matrix. The full derivation is given in Appendix E. The order- $\ell$ symmetric difference matrix (2) (approximately) appears as a submatrix of the level- $r$ Kikuchi Hessian whenever $r\geq\ell+p/2$ .

5 Conclusion

We have presented a hierarchy of spectral algorithms for tensor PCA, inspired by variational inference and statistical physics. In particular, the core idea of our approach is to locally minimize the Kikuchi free energy. We specifically implemented this via the Kikuchi Hessian, but there may be many other viable approaches to minimizing the Kikuchi free energy such as generalized belief propagation [YFW03]. Broadly speaking, we conjecture that for many average-case problems, algorithms based on Kikuchi free energy and algorithms based on sum-of-squares should both achieve the optimal tradeoff between runtime and statistical power. One direction for further work is to verify that this analogy holds for problems other than tensor PCA; in particular, we show here that it also applies to refuting random $k$ -XOR formulas when $k$ is even.

Perhaps one benefit of the Kikuchi hierarchy over the sum-of-squares hierarchy is that it has allowed us to systematically obtain spectral methods, simply by computing a certain Hessian matrix. Furthermore, the algorithms we obtained are simpler than their SOS counterparts. We are hopeful that the Kikuchi hierarchy will provide a roadmap for systematically deriving simple and optimal algorithms for a large class of problems.

Acknowledgments

We thank Alex Russell for suggesting the matrix Chernoff bound (Theorem A.4). For helpful discussions, we thank Afonso Bandeira, Sam Hopkins, Pravesh Kothari, Florent Krzakala, Tselil Schramm, Jonathan Shi, and Lenka Zdeborová. This project started during the workshop Spin Glasses and Related Topics held at the Banff International Research Station (BIRS) in the Fall of 2018. We thank our hosts at BIRS as well as the workshop organizers: Antonio Auffinger, Wei-Kuo Chen, Dmitry Panchenko, and Lenka Zdeborová.

Appendix A Analysis of Symmetric Difference and Voting Matrices

We adopt the notation $x^{S}\mathrel{\mathop{:}}=\prod_{i\in S}x_{i}$ for $x\in\{\pm 1\}^{n}$ and $S\subseteq[n]$ . Recall the matrix ${\bm{M}}$ indexed by sets $S\subseteq[n]$ of size $\ell$ , having entries

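$${\bm{M}}_{S,T}=\begin{cases}{\bm{Y}}_{S\bigtriangleup T}&\text{if }|S\bigtriangleup T|=p,\\ 0&\text{otherwise.}\end{cases}$$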

First, observe that we can restrict our attention to the case where the spike is the all-ones vector $x_{*}=\mathbbm{1}$ without loss of generality. To see this, conjugate ${\bm{M}}$ by a diagonal matrix ${\bm{D}}$ with diagonal entries ${\bm{D}}_{S,S}=x_{*}^{S}$ and obtain $({\bm{M}}^{\prime})_{S,T}=({\bm{D}}^{-1}{\bm{M}}{\bm{D}})_{S,T}={\bm{Y}}^{\prime}_{S\bigtriangleup T}\mathbf{1}_{|S\bigtriangleup T|=p}$ where ${\bm{Y}}^{\prime}_{S\bigtriangleup T}=x_{*}^{S}x_{*}^{T}{\bm{Y}}_{S\bigtriangleup T}=x_{*}^{S\bigtriangleup T}{\bm{Y}}_{S\bigtriangleup T}=\lambda+g^{\prime}_{S\bigtriangleup T}$ where $g^{\prime}_{S\bigtriangleup T}=x_{*}^{S}x_{*}^{T}g_{S\bigtriangleup T}$ . By symmetry of the Gaussian distribution, $(g^{\prime}_{E})_{|E|=p}$ are i.i.d. $\mathcal{N}(0,1)$ random variables. Therefore, the two matrices have the same spectrum and the eigenvectors of ${\bm{M}}$ can be obtained from those of ${\bm{M}}^{\prime}$ by pre-multiplying by ${\bm{D}}$ . So from now on we write

$${\bm{M}}=\lambda{\bm{X}}+{\bm{Z}},\tag{13}$$

where ${\bm{X}}_{S,T}=\mathbf{1}_{|S\bigtriangleup T|=p}$ and ${\bm{Z}}_{S,T}=g_{S\bigtriangleup T}\mathbf{1}_{|S\bigtriangleup T|=p}$ , where $(g_{E})_{|E|=p}$ is a collection of i.i.d. $\mathcal{N}(0,1)$ r.v.’s.

A.1 Structure of ${\bm{X}}$

The matrix ${\bm{X}}$ is the adjacency matrix of a regular graph $J_{n,\ell,p}$ on ${n\choose\ell}$ vertices, where vertices are represented by sets, and two sets $S$ and $T$ are connected by an edge if $|S\bigtriangleup T|=p$ , or equivalently $|S\cap T|=\ell-p/2$ . This matrix belongs to the Bose-Mesner algebra of the $(n,\ell)$ -Johnson association scheme (see for instance [Sch79, GS10]). This is the algebra of ${n\choose\ell}\times{n\choose\ell}$ real- or complex-valued symmetric matrices where the entry ${\bm{X}}_{S,T}$ depends only on the size of the intersection $|S\cap T|$ . In addition to this set of matrices being an algebra, it is a commutative algebra, which means that all such matrices are simultaneously diagonalizable and share the same eigenvectors.

Filmus [Fil16] provides a common eigenbasis for this algebra: for $0\leq m\leq\ell$ , let $\varphi=(a_{1},b_{1},\ldots,a_{m},b_{m})$ be a sequence of $2m$ distinct elements of $[n]$ . Let $|\varphi|=2m$ denote its total length. Now define a vector $u^{\varphi}\in\mathbb{R}^{{n\choose\ell}}$ having coordinates

$$u^{\varphi}_{S}=\prod_{i=1}^{m}\big(\mathbf{1}_{a_{i}\in S}-\mathbf{1}_{b_{i}\in S}\big).$$

In the case $m=0$ , $\varphi$ is the empty sequence $\varnothing$ and we have $u^{\varnothing}=\mathbbm{1}$ (the all-ones vector).

Proposition A.1.

Each $u^{\varphi}$ is an eigenvector of ${\bm{X}}$ . Furthermore, the linear space $\mathcal{Y}_{m}:=\text{span}\{u^{\varphi}:|\varphi|=2m\}$ for $0\leq m\leq\ell$ is an eigenspace of ${\bm{X}}$ (i.e., all vectors $u^{\varphi}$ with sequences $\varphi$ of length $2m$ have the same eigenvalue $\mu_{m}$ ). Lastly $\mathbb{R}^{{n\choose\ell}}=\bigoplus_{m=0}^{\ell}\mathcal{Y}_{m}$ , and $\dim\mathcal{Y}_{m}={n\choose m}-{n\choose m-1}$ . (By convention, ${n\choose-1}=0$ .)

Proof.

The first two statements are the content of Lemma 4.3 in [Fil16]. The dimension of $\mathcal{Y}_{m}$ is given in Lemma 2.1 in [Fil16]. ∎

We note that $(u^{\varphi})_{|\varphi|=2m}$ are not linearly independent; an orthogonal basis, called the Young basis, consisting of linear combinations of the $u^{\varphi}$ ’s is given explicitly in [Fil16].

We see from the above proposition that ${\bm{X}}$ has $\ell+1$ distinct eigenvalues $\mu_{0}\geq\mu_{1}\geq\cdots\geq\mu_{\ell}$ , each one corresponding to the eigenspace $\mathcal{Y}_{m}$ . The first eigenvalue is the degree of the graph $J_{n,\ell,p}$ :

$$\mu_{0}=d_{\ell}={\ell\choose p/2}{n-\ell\choose p/2}.$$

We provide an explicit formula for all the remaining eigenvalues:

Lemma A.2.

The eigenvalues of ${\bm{X}}$ are as follows:

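$$\mu_{m}=\sum_{s=0}^{\min(m,\,p/2)}(-1)^{s}{m\choose s}{\ell-m\choose p/2-s}{n-\ell-m\choose p/2-s},\qquad 0\leq m\leq\ell.\tag{15}$$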

Proof.

These are the so-called Eberlein polynomials, which are polynomials in $m$ of degree $p$ (see, e.g., [Sch79]). We refer to [Bur17] for formulae in more general contexts, but we give a proof here for completeness. Let $A=\{a_{1},\ldots,a_{m}\}$ and $B=\{b_{1},\ldots,b_{m}\}$ . Note that $u^{\varphi}_{S}$ is nonzero if and only if $|S\cap\{a_{i},b_{i}\}|=1$ for each $1\leq i\leq m$ . By symmetry, we can assume that $A\subseteq S$ and $S\cap B=\varnothing$ . Then $\mu_{m}$ is the sum over all sets $T$ , such that $|S\bigtriangleup T|=p$ and $|T\cap\{a_{i},b_{i}\}|=1$ for each $i$ , of $(-1)^{s}$ where $s=|T\cap B|$ . For each $s$ there are ${m\choose s}$ choices of this set of indices, giving the first binomial. Adding these $b_{i}$ to $T$ and removing these $a_{i}$ from $S$ contributes $2s$ to $|S\bigtriangleup T|$ . To achieve $|S\bigtriangleup T|=p$ , we also need to remove $p/2-s$ elements of $S\setminus A$ from $S$ , giving the second binomial. We also need to add $p/2-s$ elements of $\overline{S\cup B}$ to $T$ , giving the third binomial. Finally, we have $s\leq m$ and $s\leq p/2$ . ∎
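The counting argument in the proof can be cross-checked numerically. The sketch below assumes the eigenvalue formula obtained by multiplying the three binomials described in the proof, $\mu_{m}=\sum_{s}(-1)^{s}{m\choose s}{\ell-m\choose p/2-s}{n-\ell-m\choose p/2-s}$ , together with the multiplicities ${n\choose m}-{n\choose m-1}$ from Proposition A.1, and compares the predicted spectrum with a direct eigendecomposition of ${\bm{X}}$ . Function names and toy sizes are choices of this example.

```python
from itertools import combinations
from math import comb
import numpy as np

def mu(m, n, ell, p):
    """Eigenvalue of X on the m-th eigenspace (formula assumed from the proof)."""
    h = p // 2
    return sum((-1) ** s * comb(m, s) * comb(ell - m, h - s) * comb(n - ell - m, h - s)
               for s in range(min(m, h) + 1))

def multiplicity(m, n):
    """dim Y_m = C(n, m) - C(n, m - 1), with the convention C(n, -1) = 0."""
    return comb(n, m) - (comb(n, m - 1) if m >= 1 else 0)

n, ell, p = 8, 3, 2                                   # toy sizes (illustrative)
sets = [frozenset(c) for c in combinations(range(n), ell)]
X = np.array([[1.0 if len(S ^ T) == p else 0.0 for T in sets] for S in sets])

# Predicted spectrum: each mu_m repeated according to its multiplicity.
predicted = sorted(mu(m, n, ell, p)
                   for m in range(ell + 1)
                   for _ in range(multiplicity(m, n)))
computed = np.linalg.eigvalsh(X)                      # ascending order
```

For these sizes ${\bm{X}}$ is the adjacency matrix of the Johnson graph on 3-subsets of an 8-element set, and the predicted eigenvalues are $15,7,1,-3$ with multiplicities $1,7,20,28$ .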

As the following lemma shows, the succeeding eigenvalues decay rapidly with $m$ .

Lemma A.3.

Let $3\leq p\leq\sqrt{n}$ and let $\ell<n/p^{2}$ . For all $0\leq m\leq\ell$ it holds that

[TABLE]

Proof.

The terms in (15) have alternating signs. We will show that they decrease in absolute value beyond the first nonzero term, so that it gives a bound on $\mu_{m}$ . We consider two cases. First, suppose $m\leq\ell-p/2$ so that the $s=0$ term is positive. Then the $(s+1)$ st term divided by the $s$ th term is, in absolute value,

[TABLE]

It follows that the $s=0$ term is an upper bound,

[TABLE]

and so

[TABLE]

Next we consider the case $m>\ell-p/2$ , so that the first nonzero term has $s=p/2-\ell+m\geq 1$ . Intuitively, this reduces $\mu_{m}$ by at least a factor of $n$ , and we will show this is the case. For the terms with $s\geq p/2-\ell+m$ , the ratio of absolute values is again bounded by

[TABLE]

It follows that the $s=p/2-\ell+m$ term gives a bound on the absolute value,

[TABLE]

in which case

[TABLE]

Combining these two cases gives the stated result. ∎

A.2 Proof of Strong Detection

Here we prove our strong detection result, Theorem 3.3. The proof doesn’t exploit the full details of the structure exhibited above. Instead, the proof is a straightforward application of the matrix Chernoff bound for Gaussian series [Oli10] (see also Theorem 4.1.1 of [Tro12]):

Theorem A.4.

Let $\{{\bm{A}}_{i}\}$ be a finite sequence of fixed symmetric $d\times d$ matrices, and let $\{\xi_{i}\}$ be independent $\mathcal{N}(0,1)$ random variables. Let ${\bm{\Sigma}}=\sum_{i}\xi_{i}{\bm{A}}_{i}$ . Then, for all $t\geq 0$ ,

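$$\mathbb{P}\big(\lambda_{\max}({\bm{\Sigma}})\geq t\big)\leq d\,e^{-t^{2}/(2\sigma^{2})},\qquad\text{where }\sigma^{2}=\big\|\mathbb{E}[{\bm{\Sigma}}^{2}]\big\|_{\textup{op}}.$$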

Let us first write ${\bm{M}}$ in the form of a Gaussian series. For a set $E\in{[n]\choose p}$ , define the ${n\choose\ell}\times{n\choose\ell}$ matrix ${\bm{A}}_{E}$ as

$$({\bm{A}}_{E})_{S,T}=\mathbf{1}_{S\bigtriangleup T=E}.$$

It is immediate that for $\lambda=0$ , ${\bm{M}}={\bm{Z}}=\sum_{|E|=p}g_{E}{\bm{A}}_{E}$ where $(g_{E})_{|E|=p}$ is a collection of i.i.d. $\mathcal{N}(0,1)$ random variables. The second moment of this random matrix is

$$\mathbb{E}[{\bm{Z}}^{2}]=\sum_{|E|=p}{\bm{A}}_{E}^{2}=d_{\ell}\,{\bm{I}},$$

since ${\bm{A}}_{E}^{2}$ is the diagonal matrix with $({\bm{A}}_{E}^{2})_{S,S}=\mathbf{1}_{|S\bigtriangleup E|=\ell}$ (the only $T$ with $S\bigtriangleup T=E$ is $T=S\bigtriangleup E$ , which must have size $\ell$ ), and summing over all $E$ gives $d_{\ell}$ on the diagonal. The operator norm of the second moment is then $d_{\ell}$ . It follows that for all $t\geq 0$ ,

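$$\mathbb{P}\big(\lambda_{\max}({\bm{Z}})\geq t\big)\leq{n\choose\ell}e^{-t^{2}/(2d_{\ell})}\leq n^{\ell}e^{-t^{2}/(2d_{\ell})}.$$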

Now letting $t=\frac{\lambda d_{\ell}}{2}$ yields the first statement of the theorem.
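The second-moment computation used above can be verified on a small case. The sketch takes $({\bm{A}}_{E})_{S,T}=\mathbf{1}_{S\bigtriangleup T=E}$ , which is the reading consistent with ${\bm{Z}}=\sum_{|E|=p}g_{E}{\bm{A}}_{E}$ , and checks that $\sum_{|E|=p}{\bm{A}}_{E}^{2}$ is $d_{\ell}$ times the identity. Names and sizes are choices of this example.

```python
from itertools import combinations
from math import comb
import numpy as np

n, ell, p = 6, 3, 2                                   # toy sizes (illustrative)
sets = [frozenset(c) for c in combinations(range(n), ell)]
N = len(sets)

def A(E):
    """(A_E)[S, T] = 1 if S ^ T == E, else 0 (assumed definition)."""
    return np.array([[1.0 if S ^ T == E else 0.0 for T in sets] for S in sets])

second_moment = np.zeros((N, N))
for c in combinations(range(n), p):
    A_E = A(frozenset(c))
    second_moment += A_E @ A_E                        # A_E^2 is diagonal with 0/1 entries

d = comb(ell, p // 2) * comb(n - ell, p // 2)
sigma_squared = np.linalg.norm(second_moment, ord=2)  # operator norm of the second moment
```

Since the Gaussian coefficients are independent with unit variance, the cross terms vanish in expectation and $\mathbb{E}[{\bm{Z}}^{2}]$ reduces to this sum of squares.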

As for the second statement, we have $\lambda_{\max}({\bm{M}})\geq\lambda\|{\bm{X}}\|_{\textup{op}}-\|{\bm{Z}}\|_{\textup{op}}=\lambda d_{\ell}-\|{\bm{Z}}\|_{\textup{op}}$ where ${\bm{Z}}$ is defined in (13). Applying the same bound we have

[TABLE]

A.3 Proof of Strong Recovery

Here we prove our strong recovery result, Theorem 3.4. Let $v_{0}=v_{\textup{top}}({\bm{M}})$ be a unit-norm leading eigenvector of ${\bm{M}}$ . For a fixed $m\in[\ell]$ (to be determined later on), we write the orthogonal decomposition $v_{0}=v^{(m)}+v^{\perp}$ , where $v^{(m)}\in\bigoplus_{s\leq m}\mathcal{Y}_{s}$ , and $v^{\perp}$ in the orthogonal complement. The goal is to first show that if $m$ is proportional to $\ell$ then $v^{\perp}$ has small Euclidean norm, so that $v_{0}$ and $v^{(m)}$ have high inner product. The second step of the argument is to approximate the voting matrix ${\bm{V}}(v_{0})$ by ${\bm{V}}(v^{(m)})$ , and then use Fourier-analytic tools to reason about the latter.

Let us start with the first step.

Lemma A.5.

With ${\bm{Z}}$ defined as in (13), we have

[TABLE]

Proof.

Let us absorb the factor $\lambda$ in the definition of the matrix ${\bm{X}}$ . Let $\{u_{0},\cdots,u_{d}\}$ be a set of eigenvectors of ${\bm{X}}$ which also form an orthogonal basis for $\bigoplus_{s\leq m}\mathcal{Y}_{s}$ , with $u_{0}$ being the top eigenvector of ${\bm{X}}$ ( $u_{0}$ is the normalized all-ones vector). We start with the inequality $u_{0}^{\top}{\bm{M}}u_{0}\leq v_{0}^{\top}{\bm{M}}v_{0}$ . The left-hand side of the inequality is $\mu_{0}+u_{0}^{\top}{\bm{Z}}u_{0}$ . The right-hand side is $v_{0}^{\top}{\bm{X}}v_{0}+v_{0}^{\top}{\bm{Z}}v_{0}$ . Moreover $v_{0}^{\top}{\bm{X}}v_{0}=v_{0}^{\top}{\bm{X}}v^{(m)}+v_{0}^{\top}{\bm{X}}v^{\perp}$ . Since $v^{(m)}\in\bigoplus_{s\leq m}\mathcal{Y}_{s}$ , by Proposition A.1, ${\bm{X}}v^{(m)}$ belongs to the space as well, and therefore $v_{0}^{\top}{\bm{X}}v^{(m)}=v^{(m)\top}{\bm{X}}v^{(m)}$ . Similarly $v_{0}^{\top}{\bm{X}}v^{\perp}=(v^{\perp})^{\top}{\bm{X}}v^{\perp}$ , so $v_{0}^{\top}{\bm{X}}v_{0}=v^{(m)\top}{\bm{X}}v^{(m)}+(v^{\perp})^{\top}{\bm{X}}v^{\perp}$ . Therefore the inequality becomes

[TABLE]

Since $v^{\perp}$ is orthogonal to the top $m$ eigenspaces of ${\bm{X}}$ we have $(v^{\perp})^{\top}{\bm{X}}v^{\perp}\leq\mu_{m+1}\big{\|}v^{\perp}\big{\|}_{2}^{2}$ . Moreover, $v^{(m)\top}{\bm{X}}v^{(m)}\leq\mu_{0}\big{\|}v^{(m)}\big{\|}^{2}$ , hence

[TABLE]

By rearranging and applying the triangle inequality we get,

[TABLE]

∎

Combining this fact with Lemma A.3, recalling that $\mu_{0}=d_{\ell}$ we obtain

Lemma A.6.

For any $\epsilon>0$ and $\delta\in(0,1)$ , if $\lambda\geq\epsilon^{-1}\sqrt{2\log(n^{\ell}/\delta)/d_{\ell}}$ , then

[TABLE]

with probability at least $1-\delta$ .

Proof.

Lemma A.3 implies $\frac{1}{\mu_{0}-\mu_{m+1}}\leq\frac{1}{\mu_{0}}\frac{1}{1-(1-\frac{m+1}{\ell})^{p/2}}\leq\frac{1}{\mu_{0}}\cdot\frac{\ell}{m}$ . Therefore, Lemma A.5 implies

[TABLE]

The operator norm of the noise can be bounded by a matrix Chernoff bound [Oli10, Tro12], similarly to our argument in the proof of detection: for all $t\geq 0$

[TABLE]

Therefore, letting $\lambda\geq\epsilon^{-1}\sqrt{2\log(n^{\ell}/\delta)/d_{\ell}}$ we obtain the desired result. ∎

A.3.1 Analysis of the Voting Matrix

Recall that the voting matrix ${\bm{V}}(v)$ of a vector $v\in\mathbb{R}^{{n\choose\ell}}$ has zeros on the diagonal, and off-diagonal entries

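$${\bm{V}}_{ij}(v)=\sum_{S\in{[n]\choose\ell}}v_{S}\,v_{S\bigtriangleup\{i,j\}}\,\mathbf{1}_{i\in S,\,j\notin S},\qquad i\neq j.$$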

It will be more convenient in our analysis to work with ${\bm{V}}(v^{(m)})$ instead of ${\bm{V}}(v_{0})$ . To this end we produce the following approximation result:

Lemma A.7.

Let $u,e\in\mathbb{R}^{n\choose\ell}$ and $v=u+e$ . Then

[TABLE]

In particular,

[TABLE]

Proof.

Let us introduce the shorthand notation $\langle u,v\rangle_{ij}\mathrel{\mathop{:}}=\sum_{S\in{[n]\choose\ell}}u_{S}v_{S\bigtriangleup\{i,j\}}\mathbf{1}_{i\in S,j\notin S}$ . We have

[TABLE]

where the last step uses the bound $(a+b+c)^{2}\leq 3(a^{2}+b^{2}+c^{2})$ . Now we expand

[TABLE]

Plugging this back into (17) yields the desired result. To obtain $\|{\bm{V}}(v_{0})-{\bm{V}}(v^{(m)})\|_{F}^{2}\leq 9\ell^{2}\|v^{\perp}\|^{2}$ we just bound $2\|v^{(m)}\|^{2}+\|v^{\perp}\|^{2}$ by $3$ . ∎

Let us also state the following lemma, which will be needed later on:

Lemma A.8.

For $u\in\mathbb{R}^{n\choose\ell}$ , $\|{\bm{V}}(u)\|_{F}^{2}\leq\ell^{2}\|u\|^{4}$ .

Proof.

Note that $\|{\bm{V}}(u)\|_{F}^{2}=\sum_{i,j}{\bm{V}}_{ij}(u)^{2}=\sum_{i,j}\langle u,u\rangle_{ij}^{2}$ and so the desired result follows immediately from (18). ∎
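Lemma A.8 can be sanity-checked numerically on random vectors. The sketch below takes the off-diagonal entries of the voting matrix to be ${\bm{V}}_{ij}(u)=\langle u,u\rangle_{ij}=\sum_{S}u_{S}u_{S\bigtriangleup\{i,j\}}\mathbf{1}_{i\in S,j\notin S}$ , as used in the proof, and records the largest observed ratio $\|{\bm{V}}(u)\|_{F}^{2}/(\ell^{2}\|u\|^{4})$ . Names and sizes are choices of this example, and a random test of course does not replace the proof.

```python
from itertools import combinations
import numpy as np

n, ell = 7, 3                                         # toy sizes (illustrative)
sets = [frozenset(c) for c in combinations(range(n), ell)]
index = {S: a for a, S in enumerate(sets)}

def voting_matrix(u):
    """V[i, j] = sum over S with i in S, j not in S of u[S] * u[S ^ {i, j}]."""
    V = np.zeros((n, n))
    for a, S in enumerate(sets):
        for i in S:
            for j in set(range(n)) - S:
                V[i, j] += u[a] * u[index[S ^ {i, j}]]
    return V

rng = np.random.default_rng(4)
worst_ratio = 0.0
for _ in range(50):
    u = rng.standard_normal(len(sets))
    V = voting_matrix(u)
    ratio = np.sum(V ** 2) / (ell ** 2 * np.linalg.norm(u) ** 4)
    worst_ratio = max(worst_ratio, ratio)

V_last_is_symmetric = bool(np.allclose(V, V.T))       # V(u) is symmetric by construction
```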

Next, in the main technical part of the proof, we show that ${\bm{V}}(v^{(m)})$ is close to a multiple of the all-ones matrix in Frobenius norm:

Proposition A.9.

Let $\hat{\mathbbm{1}}=\mathbbm{1}/\sqrt{n}$ , $\alpha=\ell\|v^{(m)}\|^{2}$ and $\eta=\frac{m}{\ell}+\frac{\ell}{n}$ . Then

[TABLE]

Before proving the above proposition, let us put the results together and prove our recovery result.

A.3.2 Proof of Theorem 3.4

For $\epsilon,\delta>0$ , assume $\lambda\geq\epsilon^{-1}\sqrt{2\log(n^{\ell}/\delta)/d_{\ell}}$ . By Lemma A.6 and Lemma A.7 we have

[TABLE]

with probability at least $1-\delta$ . Moreover, by Proposition A.9, we have

[TABLE]

with $\alpha=\ell\|v^{(m)}\|^{2}\leq\ell$ and $\eta=\frac{m}{\ell}+\frac{\ell}{n}$ . Therefore, by the triangle inequality we have

[TABLE]

with probability at least $1-\delta$ . Now let us choose a value of $m$ that achieves a good tradeoff of the above two error terms: $m=\ell\sqrt{\epsilon}$ . Let us also use the inequality $\sqrt{a+b}\leq\sqrt{a}+\sqrt{b}$ for positive $a,b$ , to obtain

[TABLE]

under the same event.

Now let $\widehat{x}$ be a leading eigenvector of ${\bm{V}}(v_{0})$ , and let ${\bm{R}}={\bm{V}}(v_{0})-\alpha\hat{\mathbbm{1}}\hat{\mathbbm{1}}^{\top}$ . Since $\widehat{x}^{\top}{\bm{V}}(v_{0})\widehat{x}\geq\hat{\mathbbm{1}}^{\top}{\bm{V}}(v_{0})\hat{\mathbbm{1}}$ , we have

[TABLE]

Therefore

[TABLE]

Since $\alpha=\ell(1-\|v^{\perp}\|^{2})$ , and $\|v^{\perp}\|^{2}\leq\epsilon\frac{\ell}{m}\leq\sqrt{\epsilon}$ , the bound (19) (together with $\|{\bm{R}}\|_{\textup{op}}\leq\|{\bm{R}}\|_{F}$ ) implies

[TABLE]

To conclude the proof of our theorem, we let $\ell\leq n\sqrt{\epsilon}$ , $\epsilon<1/16$ and then replace $\epsilon$ by $\epsilon^{4}$ : we obtain $\langle\widehat{x},\hat{\mathbbm{1}}\rangle^{2}\geq 1-48\epsilon$ with probability at least $1-\delta$ if $\lambda\geq\epsilon^{-4}\sqrt{2\log(n^{\ell}/\delta)/d_{\ell}}$ .
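The eigenvector step of this proof can be sanity-checked numerically. The sketch below assumes the intermediate inequality takes the form $\langle\widehat{x},\hat{\mathbbm{1}}\rangle^{2}\geq 1-2\|{\bm{R}}\|_{\textup{op}}/\alpha$ , which follows from $\widehat{x}^{\top}{\bm{V}}\widehat{x}\geq\hat{\mathbbm{1}}^{\top}{\bm{V}}\hat{\mathbbm{1}}$ when ${\bm{V}}=\alpha\hat{\mathbbm{1}}\hat{\mathbbm{1}}^{\top}+{\bm{R}}$ ; the constants in the elided displays may differ. The matrix `R` here is an arbitrary small symmetric perturbation and not the actual voting-matrix error.

```python
import numpy as np

def alignment_lower_bound(alpha, R):
    """Return 1 - 2*||R||_op / alpha, a lower bound on <x_hat, 1_hat>^2."""
    return 1.0 - 2.0 * np.linalg.norm(R, 2) / alpha

rng = np.random.default_rng(0)
n, alpha = 50, 3.0
one_hat = np.ones(n) / np.sqrt(n)

# Illustrative symmetric perturbation with small operator norm (an assumption).
G = rng.standard_normal((n, n))
R = 0.02 * (G + G.T) / 2

V = alpha * np.outer(one_hat, one_hat) + R
eigvals, eigvecs = np.linalg.eigh(V)   # eigenvalues are returned in ascending order
x_hat = eigvecs[:, -1]                 # leading eigenvector of V

overlap_sq = float(x_hat @ one_hat) ** 2
bound = alignment_lower_bound(alpha, R)
print(overlap_sq, bound)
```

The bound only uses the variational characterization of the top eigenvector, so it holds for any symmetric `R` and any `alpha > 0`.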

A.3.3 Proof of Proposition A.9

Let $\alpha=\ell\|v^{(m)}\|^{2}$ . By Lemma A.8 we have $\big{\|}{\bm{V}}(v^{(m)})\big{\|}_{F}^{2}\leq\alpha^{2}$ . Therefore

[TABLE]

Now we need a lower bound on $\sum_{i,j=1}^{n}{\bm{V}}_{ij}(v^{(m)})$ . This will crucially rely on the fact that $v^{(m)}$ lies in the span of the top $m$ eigenspaces of ${\bm{X}}$ :

Lemma A.10.

For a fixed $m\leq\ell$ , let $v\in\bigoplus_{s=0}^{m}\mathcal{Y}_{s}$ . Then

[TABLE]

We plug the result of the above lemma in (21) to obtain

[TABLE]

as desired.

A.3.4 A Poincaré Inequality on a Slice of the Hypercube

To prove Lemma A.10, we need some results on Fourier analysis on the slice of the hypercube ${[n]\choose\ell}$ . Following [Fil16], we define the following. First, given a function $f:{[n]\choose\ell}\mapsto\mathbb{R}$ , we define its expectation as its average value over all sets of size $\ell$ , and write

$$\mathop{\mathbb{E}}_{|S|=\ell}[f(S)]=\frac{1}{{n\choose\ell}}\sum_{|S|=\ell}f(S).$$

We also define its variance as

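$$\mathbb{V}[f]=\mathop{\mathbb{E}}_{|S|=\ell}[f(S)^{2}]-\Big(\mathop{\mathbb{E}}_{|S|=\ell}[f(S)]\Big)^{2}.$$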

Moreover, we identify a vector $u\in\mathbb{R}^{{n\choose\ell}}$ with a function on sets of size $\ell$ in the obvious way: $f(S)=u_{S}$ .

Definition A.11.

For $u\in\mathbb{R}^{n\choose\ell}$ and $i,j\in[n]$ , let $u^{(ij)}$ denote the vector having coordinates

[TABLE]

(The operation $u\mapsto u^{(ij)}$ exchanges the roles of $i$ and $j$ whenever possible.)

Definition A.12.

For $u\in\mathbb{R}^{n\choose\ell}$ and $i,j\in[n]$ , define the influence of the pair $(i,j)$ as

[TABLE]

and the total influence as

[TABLE]

With this notation, our main tool is a version of Poincaré’s inequality on ${[n]\choose\ell}$ :

Lemma A.13 (Lemma 5.6 in [Fil16]).

For $v\in\bigoplus_{s=0}^{m}\mathcal{Y}_{s}$ we have

[TABLE]

Proof of Lemma A.10.

We have $m\mathop{\mathbb{E}}_{|S|=\ell}[v_{S}^{2}]\geq m\mathbb{V}[v]\geq\mathrm{Inf}[v]=\frac{1}{n}\sum_{i<j}\mathrm{Inf}_{ij}[v]$ , and

[TABLE]

Since $\mathop{\mathbb{E}}_{|S|=\ell}[\mathbf{1}_{|S\cap\{i,j\}|=1}(v_{S}^{(ij)})^{2}]=\mathop{\mathbb{E}}_{|S|=\ell}[\mathbf{1}_{|S\cap\{i,j\}|=1}v_{S}^{2}]$ , the above is equal to

[TABLE]

Now,

[TABLE]

Therefore

[TABLE]

and so

[TABLE]

as desired. ∎

Appendix B Detection for General Priors

While we have mainly focused on the Rademacher-spiked tensor model, we now show that our algorithm works just as well (at least for detection) for a much larger class of spike priors.

Theorem B.1.

Let $p\geq 2$ be even. Consider the spiked tensor model with a spike prior $P_{\mathtt{x}}$ that draws the entries of $x_{*}$ i.i.d. from some distribution $\pi$ on $\mathbb{R}$ (which does not depend on $n$ ), normalized so that $\mathbb{E}[\pi^{2}]=1$ . There is a constant $C$ (depending on $p$ and $\pi$ ) such that if $\lambda\geq C\ell^{1/2}d_{\ell}^{-1/2}\sqrt{\log n}$ then Algorithm 3.1 achieves strong detection.

Proof.

From (16), we have $\|{\bm{Z}}\|_{\textup{op}}=O(\sqrt{\ell d_{\ell}\log n})$ with high probability, and so it remains to give a lower bound on $\|{\bm{X}}\|_{\textup{op}}$ . Letting $u_{S}=\prod_{i\in S}\mathrm{sgn}((x_{*})_{i})$ for $|S|=\ell$ ,

[TABLE]

where

[TABLE]

We have

[TABLE]

and

[TABLE]

We have

[TABLE]

Also, $\mathrm{Cov}(|x_{*}^{S\bigtriangleup T}|,|x_{*}^{S^{\prime}\bigtriangleup T^{\prime}}|)=0$ unless $S\bigtriangleup T$ and $S^{\prime}\bigtriangleup T^{\prime}$ have nonempty intersection. Using Lemma B.2 (below), the fraction of terms in (23) that are nonzero is at most $p^{2}/n$ and so

[TABLE]

By Chebyshev’s inequality, it follows from (22) and (24) that $u^{\top}{\bm{X}}u\geq\frac{1}{2}C(\pi,p){n\choose\ell}d_{\ell}$ with probability at least $1-\frac{4p^{2}}{C(\pi,p)^{2}n}$ . This implies $\|{\bm{X}}\|_{\textup{op}}\geq\frac{1}{2}C(\pi,p)d_{\ell}$ with the same probability, and so we have strong detection provided $\lambda\geq c_{0}\ell^{1/2}d_{\ell}^{-1/2}\sqrt{\log n}$ for a particular constant $c_{0}=c_{0}(\pi,p)$ . ∎

Above, we made use of the following lemma.

Lemma B.2.

Fix $A\subseteq[n]$ with $|A|=a$ . Let $B$ be chosen uniformly at random from all subsets of $[n]$ of size $b$ . Then $\mathbb{P}(A\cap B\neq\varnothing)\leq\frac{ab}{n}$ .

Proof.

Each element of $A$ will lie in $B$ with probability $b/n$ , so the result follows by a union bound over the elements of $A$ . ∎
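As a quick check of Lemma B.2, the exact probability is $1-{n-a\choose b}/{n\choose b}$ , since $B$ misses $A$ exactly when all $b$ of its elements come from the other $n-a$ . The sketch below compares this with the union bound $ab/n$ ; the function name and the parameter ranges are illustrative choices.

```python
from math import comb

def prob_intersect(n: int, a: int, b: int) -> float:
    """Exact P(A and B intersect) for fixed |A| = a and uniform random |B| = b in [n]."""
    # B is disjoint from A exactly when all b elements avoid the a elements of A.
    return 1.0 - comb(n - a, b) / comb(n, b)

n = 100
checks = []
for a in range(0, 11):
    for b in range(0, 11):
        exact = prob_intersect(n, a, b)
        checks.append((a, b, exact, a * b / n))

# Smallest slack between the union bound and the exact probability.
worst_gap = min(bound - exact for _, _, exact, bound in checks)
print(worst_gap)
```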

Appendix C The Odd- $p$ Case

When $p$ is odd, the Kikuchi Hessian still gives rise to a spectral algorithm. While we conjecture that this algorithm is optimal, we unfortunately only know how to prove sub-optimal results for it. (However, we can prove optimal results for a related algorithm; see Appendix F.2.) We now state the algorithm and its conjectured performance.

Let $p$ be odd and fix an integer $\ell\in[\lfloor p/2\rfloor,n-\lceil p/2\rceil]$ . Consider the symmetric difference matrix ${\bm{M}}\in\mathbb{R}^{{n\choose\ell}\times{n\choose\ell+1}}$ with entries

[TABLE]

where $S,T\subseteq[n]$ with $|S|=\ell$ and $|T|=\ell+1$ .
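A minimal construction sketch for small $n$ , assuming the entries are ${\bm{M}}_{S,T}={\bm{Y}}_{S\bigtriangleup T}$ when $|S\bigtriangleup T|=p$ and $0$ otherwise (by analogy with the even- $p$ symmetric difference matrix; the displayed definition is not reproduced here). The dictionary `Y`, keyed by $p$ -subsets, is a hypothetical stand-in for the symmetric tensor restricted to distinct indices.

```python
from itertools import combinations
import random

def symmetric_difference_matrix(Y, n, p, ell):
    """Rows: ell-subsets S. Columns: (ell+1)-subsets T. Entry Y[S ^ T] when |S ^ T| = p."""
    rows = list(combinations(range(n), ell))
    cols = list(combinations(range(n), ell + 1))
    M = [[0.0] * len(cols) for _ in rows]
    for r, S in enumerate(rows):
        for c, T in enumerate(cols):
            D = frozenset(S) ^ frozenset(T)   # symmetric difference of S and T
            if len(D) == p:
                M[r][c] = Y[D]
    return rows, cols, M

random.seed(0)
n, p, ell = 6, 3, 1
# Hypothetical input: one Gaussian entry per p-subset of {0, ..., n-1}.
Y = {frozenset(U): random.gauss(0.0, 1.0) for U in combinations(range(n), p)}
rows, cols, M = symmetric_difference_matrix(Y, n, p, ell)
print(len(rows), len(cols))
```

With $p=3$ and $\ell=1$ the rows are single indices and the columns are pairs, so under this assumption the matrix is the $n\times{n\choose 2}$ flattening used in tensor unfolding.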

Algorithm C.1 (Recovery for odd $p$ ).

Let $u$ be a (unit-norm) top left-singular vector of ${\bm{M}}$ and let $v={\bm{M}}^{\top}u$ be the corresponding top right-singular vector. Output $\widehat{x}=\widehat{x}({\bm{Y}})\in\mathbb{R}^{n}$ , defined by

[TABLE]

Notice that the rounding step, which extracts an $n$ -dimensional vector $\widehat{x}$ from the singular vectors of ${\bm{M}}$ , is slightly simpler than in the even- $p$ case, in that it does not require forming a voting matrix. We conjecture that, as in the even case, this algorithm matches the performance of SOS.

Conjecture C.2.

Consider the Rademacher-spiked tensor model with $p\geq 3$ odd. If

[TABLE]

then (i) there is a threshold $\tau=\tau(n,p,\ell,\lambda)$ such that strong detection can be achieved by thresholding the top singular value of ${\bm{M}}$ at $\tau$ , and (ii) Algorithm C.1 achieves strong recovery.

Similarly to the proof of Theorem 3.4, the matrix Chernoff bound (Theorem A.4) can be used to show that strong recovery is achievable when $\lambda\gg\ell^{-(p-1)/4}n^{-(p-1)/4}$ , which is weaker than SOS when $\ell\ll n$ . We now explain the difficulties involved in improving this. We can decompose ${\bm{M}}$ into a signal part and a noise part: ${\bm{M}}=\lambda{\bm{X}}+{\bm{Z}}$ . In the regime of interest, $\ell^{-(p-2)/4}n^{-p/4}\ll\lambda\ll\ell^{-(p-1)/4}n^{-(p-1)/4}$ , the signal term is smaller in operator norm than the noise term, i.e., $\lambda\|{\bm{X}}\|_{\textup{op}}\ll\|{\bm{Z}}\|_{\textup{op}}$ . While at first sight this would seem to suggest that detection and recovery are hopeless, we actually expect that $\lambda{\bm{X}}$ still affects the top singular value and singular vectors of ${\bm{M}}$ . This phenomenon is already present in the analysis of tensor unfolding (the case $p=3$ , $\ell=1$ ) [HSS15], but it seems that new ideas are required to extend the analysis beyond this case.

Appendix D Proof of Boosting

Definition D.1.

For a tensor ${\bm{G}}\in(\mathbb{R}^{n})^{\otimes p}$ , the injective tensor norm is

[TABLE]

where $u^{(j)}\in\mathbb{R}^{n}$ . For a symmetric tensor ${\bm{G}}$ , it is known [Wat90] that equivalently,

[TABLE]

Proof of Proposition 2.6.

Write $\hat{x}=\lambda\langle u,x_{*}\rangle^{p-1}x_{*}+\Delta$ where $\|\Delta\|\leq\|{\bm{G}}\|_{\textup{inj}}\|u\|^{p-1}$ . We have

[TABLE]

and

[TABLE]

and so

[TABLE]

Our prior $P_{\mathtt{x}}$ is supported on the sphere of radius $\sqrt{n}$ , so $\|x_{*}\|=\sqrt{n}$ . We need to control the injective norm of the tensor ${\bm{G}}$ . To this end we use Theorem 2.12 in [ABČ13] (see also Lemma 2.1 of [RM14]): there exists a constant $c(p)>0$ (called $E_{0}(p)$ in [ABČ13]) such that for all $\epsilon>0$ ,

[TABLE]

Letting $\epsilon=c(p)$ we obtain

[TABLE]

with probability tending to 1 as $n\to\infty$ . ∎

Appendix E Computing the Kikuchi Hessian

In Section 4 we defined the Kikuchi free energy and explained the high-level idea of how the symmetric difference matrices are derived from the Kikuchi Hessian. We now carry out the Kikuchi Hessian computation in full detail. This is a heuristic (non-rigorous) computation, but we believe these methods are important, as we hope they will be useful for systematically obtaining optimal spectral methods for a wide variety of problems.

E.1 Derivatives of Kikuchi Free Energy

Following [SKZ14], we parametrize the beliefs in terms of the moments $m_{S}=\mathbb{E}[x^{S}]$ . Specifically,

[TABLE]

We imagine $m_{T}$ are close enough to zero so that $b_{S}$ is a positive measure. One can check that these beliefs indeed have the prescribed moments: for $T\subseteq S$ ,

[TABLE]

Thus we can think of the Kikuchi free energy $\mathcal{K}$ as a function of the moments $\{m_{S}\}_{0<|S|\leq r}$ . This parametrization forces the beliefs to be consistent, i.e., if $T\subseteq S$ then the marginal distribution $b_{S}|T$ is equal to $b_{T}$ .

We now compute first and second derivatives of $\mathcal{K}=\mathcal{E}-\mathcal{S}$ with respect to the moments $m_{S}$ . First, the energy term:

[TABLE]

Now the entropy term:

[TABLE]

From (25), for $\varnothing\subset T\subseteq S$ ,

[TABLE]

and so

[TABLE]

For $\varnothing\subset T\subseteq S$ , $\varnothing\subset T^{\prime}\subseteq S$ ,

[TABLE]

Finally, if $T\not\subseteq S$ then $\frac{\partial\mathcal{S}_{S}}{\partial m_{T}}=0$ .

E.2 The Case $r=p$

We first consider the simplest case, where $r$ is as small as possible: $r=p$ . (We need to require $r\geq p$ in order to express the energy term in terms of the beliefs.)

E.2.1 Trivial Stationary Point

There is a “trivial stationary point” of the Kikuchi free energy where the beliefs only depend on local information. Specifically, if $|S|<p$ then $b_{S}$ is the uniform distribution over $\{\pm 1\}^{|S|}$ , and if $|S|=p$ then

[TABLE]

i.e.,

[TABLE]

where

[TABLE]

Note that these beliefs are consistent (if $T\subseteq S$ with $|S|\leq p$ then $b_{S}|T=b_{T}$ ) and so there is a corresponding set of moments $\{m_{S}\}_{|S|\leq p}$ .

We now check that this is indeed a stationary point of the Kikuchi free energy. Using (26) and (27) we have for $\varnothing\subset T\subseteq S$ and $|S|\leq p$ ,

[TABLE]

Thus if $|T|<p$ we have $\frac{\partial\mathcal{K}}{\partial m_{T}}=0$ , and if $|T|=p$ we have

[TABLE]

This confirms that we indeed have a stationary point.
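The consistency of these beliefs can be checked by brute force for a single top-level set. The sketch below assumes the form $b_{S}(x_{S})\propto\exp(\lambda{\bm{Y}}_{S}x^{S})$ for $|S|=p$ (the form stated in Section E.2.2) and verifies that the normalizer is $2^{p}\cosh(\lambda{\bm{Y}}_{S})$ , that the top moment is $\tanh(\lambda{\bm{Y}}_{S})$ , and that all lower-order moments vanish, as they must if the smaller beliefs are uniform. The numerical values of `lam` and `y` are arbitrary.

```python
from itertools import product, combinations
from math import cosh, exp, tanh, prod

def top_level_belief(lam: float, y: float, p: int):
    """Return ({x: b_S(x)}, Z) with b_S(x) proportional to exp(lam * y * x^S), x in {-1,+1}^p."""
    weights = {x: exp(lam * y * prod(x)) for x in product((-1, 1), repeat=p)}
    Z = sum(weights.values())
    return {x: w / Z for x, w in weights.items()}, Z

def moment(belief, T):
    """E_b[x^T] for a subset T of the coordinate positions of S."""
    return sum(b * prod(x[i] for i in T) for x, b in belief.items())

lam, y, p = 0.3, 1.7, 3
belief, Z = top_level_belief(lam, y, p)
full = tuple(range(p))
# All nonempty proper subsets of the coordinates of S.
proper = [T for r in range(1, p) for T in combinations(full, r)]
print(Z, moment(belief, full))
```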

E.2.2 Hessian

We now compute the Kikuchi Hessian, the matrix indexed by subsets $T$ with $0<|T|\leq p$ with entries ${\bm{H}}_{T,T^{\prime}}=\frac{\partial^{2}\mathcal{K}}{\partial m_{T}\partial m_{T^{\prime}}}$ , evaluated at the trivial stationary point. Similarly to the Bethe Hessian [SKZ14], we expect the bottom eigenvector of the Kikuchi Hessian to be a good estimate for (the moments of) the true signal. This is because this bottom eigenvector indicates the best local direction for improving the Kikuchi free energy, starting from the trivial stationary point. If all eigenvalues of ${\bm{H}}$ are positive then the trivial stationary point is a local minimum and so an algorithm acting locally on the beliefs should not be able to escape from it, and should not learn anything about the signal. On the other hand, a negative eigenvalue (or even an eigenvalue close to zero) indicates a (potential) direction for improvement.

Remark E.1.

When $p$ is odd, we cannot hope for a substantially negative eigenvalue because $x_{*}$ and $-x_{*}$ are not equally good solutions and so the Kikuchi free energy should be locally cubic instead of quadratic. Still, we believe that the bottom eigenvector of the Kikuchi Hessian (which will have eigenvalue close to zero) yields a good algorithm. For instance, we will see in the next section that this method yields a close variant of tensor unfolding when $r=p=3$ .

Recall that for $\varnothing\subset T\subseteq S$ , and $\varnothing\subset T^{\prime}\subseteq S$ ,

[TABLE]

If $|S|<p$ then $b_{S}$ is uniform on $\{\pm 1\}^{|S|}$ (at the trivial stationary point) and so $\frac{\partial^{2}\mathcal{S}_{S}}{\partial m_{T}\partial m_{T^{\prime}}}=-\mathbbm{1}_{T=T^{\prime}}$ . If $|S|=p$ then $b_{S}(x_{S})=\frac{1}{Z_{S}}\exp(\lambda{\bm{Y}}_{S}x^{S})$ where $Z_{S}=\sum_{x_{S}}\exp(\lambda{\bm{Y}}_{S}x^{S})=2^{|S|}\cosh(\lambda{\bm{Y}}_{S})$ , and so

[TABLE]

where $\sqcup$ denotes disjoint union. (Note that we have replaced $\bigtriangleup$ with $\sqcup$ due to the restriction $T,T^{\prime}\subseteq S$ .) We can now compute the Hessian:

[TABLE]

where we used (11) in the last step. Suppose $\lambda\ll 1$ (since otherwise tensor PCA is very easy). If $T=T^{\prime}$ then, using the $\cosh$ Taylor series, we have the leading-order approximation

[TABLE]

This means ${\bm{H}}\approx\tilde{\bm{H}}$ where

[TABLE]

E.2.3 The Case $r=p=3$

We now restrict to the case $r=p=3$ and show that the Kikuchi Hessian recovers (a close variant of) the tensor unfolding method. Recall that in this case the computational threshold is $\lambda\sim n^{-3/4}$ and so we can assume $\lambda\ll n^{-1/2}$ (or else the problem is easy). We have

[TABLE]

This means we can write

[TABLE]

where $\alpha=\frac{1}{2}n^{2}\lambda^{2}$ and ${\bm{M}}$ is the $n\times{n\choose 2}$ flattening of ${\bm{Y}}$ , i.e., ${\bm{M}}_{i,\{j,k\}}={\bm{Y}}_{ijk}\mathbf{1}_{\{i,j,k\text{ distinct}\}}$ .

Since we are looking for the minimum eigenvalue of $\tilde{\bm{H}}$ , we can restrict ourselves to the submatrix $\tilde{\bm{H}}^{\leq 2}$ indexed by sets of size 1 and 2. We have

[TABLE]

An eigenvector $[u\;v]^{\top}$ of $\tilde{\bm{H}}^{\leq 2}$ with eigenvalue $\beta$ satisfies

[TABLE]

which implies $(1-\beta)v=\lambda{\bm{M}}^{\top}u$ and so $\lambda^{2}{\bm{M}}{\bm{M}}^{\top}u=(\alpha-\beta)(1-\beta)u$ . This means either $u$ is an eigenvector of $\lambda^{2}{\bm{M}}{\bm{M}}^{\top}$ with eigenvalue $(\alpha-\beta)(1-\beta)$ , or $u=0$ and $\beta\in\{1,\alpha\}$ . Conversely, if $u$ is an eigenvector of $\lambda^{2}{\bm{M}}{\bm{M}}^{\top}$ with eigenvalue $(\alpha-\beta)(1-\beta)\neq 0$ , then $[u\;v]^{\top}$ with $v=(1-\beta)^{-1}\lambda{\bm{M}}^{\top}u$ is an eigenvector of $\tilde{\bm{H}}^{\leq 2}$ with eigenvalue $\beta$ . Letting $\mu_{1}>\cdots>\mu_{n}>0$ be the eigenvalues of $\lambda^{2}{\bm{M}}{\bm{M}}^{\top}$ , $\tilde{\bm{H}}^{\leq 2}$ has $2n$ eigenvalues of the form

[TABLE]

and the remaining eigenvalues are $\alpha$ or $1$ . Thus, the $u$ -part of the bottom eigenvector of $\tilde{\bm{H}}^{\leq 2}$ is precisely the leading eigenvector of ${\bm{M}}{\bm{M}}^{\top}$ . This is a close variant of the tensor unfolding spectral method (see Section 2.4), and we expect that its performance is essentially identical.
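The spectral correspondence above can be checked numerically. The sketch below assumes the block form $\tilde{\bm{H}}^{\leq 2}=\begin{pmatrix}\alpha{\bm{I}}&-\lambda{\bm{M}}\\-\lambda{\bm{M}}^{\top}&{\bm{I}}\end{pmatrix}$ , which is what the stated eigenvector equations imply, and the eigenvalue formula $\beta=\frac{1}{2}\big(\alpha+1\pm\sqrt{(\alpha-1)^{2}+4\mu_{i}}\big)$ obtained by solving $(\alpha-\beta)(1-\beta)=\mu_{i}$ . A Gaussian matrix stands in for the flattening ${\bm{M}}$ , and the values of `n` and `lam` are arbitrary.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(1)
n, lam = 6, 0.05
pairs = list(combinations(range(n), 2))
N = len(pairs)
alpha = 0.5 * n**2 * lam**2

# Stand-in for the n x C(n,2) flattening: Gaussian entries, zero when i is in {j, k}.
M = rng.standard_normal((n, N))
for col, (j, k) in enumerate(pairs):
    M[j, col] = 0.0
    M[k, col] = 0.0

# Assumed block form, reconstructed from the eigenvector equations in the text.
H = np.block([[alpha * np.eye(n), -lam * M],
              [-lam * M.T, np.eye(N)]])
actual, vecs = np.linalg.eigh(H)                 # ascending eigenvalues

mu = np.linalg.eigvalsh(lam**2 * M @ M.T)
disc = np.sqrt((alpha - 1.0) ** 2 + 4.0 * mu)
predicted = np.sort(np.concatenate([(alpha + 1 - disc) / 2,
                                    (alpha + 1 + disc) / 2,
                                    np.ones(N - n)]))   # kernel of M gives eigenvalue 1

# Compare the u-part of the bottom eigenvector with the top eigenvector of M M^T.
u_part = vecs[:n, 0] / np.linalg.norm(vecs[:n, 0])
top = np.linalg.eigh(M @ M.T)[1][:, -1]
overlap = abs(float(u_part @ top))
print(overlap)
```

The minus branch is decreasing in $\mu_{i}$ , so the smallest eigenvalue of the block matrix pairs with the largest $\mu_{i}$ , which is why the $u$ -part aligns with the leading eigenvector of ${\bm{M}}{\bm{M}}^{\top}$ .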

E.2.4 The General Case: $r\geq p$

One difficulty when $r>p$ is that there is no longer a trivial stationary point that we can write down in closed form. There is, however, a natural guess for “uninformative” beliefs that only depend on the local information: for $0<|S|\leq r$ ,

[TABLE]

for the appropriate normalizing factor $Z_{S}$ . Unfortunately, these beliefs are not quite consistent, so we need separate moments for each set $S$ :

[TABLE]

Provided $\lambda\ll 1$ , we can check that $m^{(S)}_{T}\approx m^{(S^{\prime})}_{T}$ to first order, and so the above beliefs are at least approximately consistent:

[TABLE]

and so

[TABLE]

which does not depend on $S$ . Thus we will ignore the slight inconsistencies and carry on with the derivation. As above, the important calculation is, for $T,T^{\prime}\subseteq S$ ,

[TABLE]

Analogous to (28), we compute the Kikuchi Hessian

[TABLE]

If we fix $\ell_{1},\ell_{2}$ , the submatrix ${\bm{H}}^{(\ell_{1},\ell_{2})}=({\bm{H}}_{T,T^{\prime}})_{|T|=\ell_{1},|T^{\prime}|=\ell_{2}}$ takes the form

[TABLE]

for certain scalars $a(\ell_{1},\ell_{2})$ and $b(\ell_{1},\ell_{2})$ , where ${\bm{I}}$ is the identity matrix and ${\bm{M}}^{(\ell_{1},\ell_{2})}\in\mathbb{R}^{{[n]\choose\ell_{1}}\times{[n]\choose\ell_{2}}}$ is the symmetric difference matrix

[TABLE]

Instead of working with the entire Kikuchi Hessian, we choose to work with ${\bm{M}}^{(\ell,\ell)}$ , which (when $p$ is even) appears as a diagonal block of the Kikuchi Hessian whenever $r\geq\ell+p/2$ (since there must exist $|T|=|T^{\prime}|=\ell$ with $|T\cup T^{\prime}|\leq r$ and $|T\bigtriangleup T^{\prime}|=p$ ). Our theoretical results (see Section 3) show that indeed ${\bm{M}}^{(\ell,\ell)}$ yields algorithms matching the (conjectured optimal) performance of sum-of-squares. When $p$ is odd, ${\bm{M}}^{(\ell,\ell)}=0$ and so we propose to instead focus on ${\bm{M}}^{(\ell,\ell+1)}$ ; see Appendix C.

Appendix F Extensions

F.1 Refuting Random $k$ -XOR Formulas for $k$ Even

Our symmetric difference matrices can be used to give a simple algorithm and proof for a related problem: strongly refuting random $k$ -XOR formulas (see [AOW15, RRS17] and references therein). This is essentially a variant of the spiked tensor problem with sparse Rademacher observations instead of Gaussian ones. It is known [RRS17] that this problem exhibits a smooth tradeoff between subexponential runtime and the number of constraints required, but the proof of [RRS17] involves intensive moment calculations. When $k$ is even, we will give a simple algorithm and a simple proof using the matrix Chernoff bound that achieves the same tradeoff. SOS lower bounds suggest that this tradeoff is optimal [Gri01, Sch08].

When $k$ is odd, we expect that the construction in Section F.2 should achieve the optimal tradeoff, but we do not have a proof for this case.

F.1.1 Setup

Let $x_{1},\ldots,x_{n}$ be $\{\pm 1\}$ -valued variables. A $k$ -XOR formula $\Phi$ with $m$ constraints is specified by a sequence of subsets $U_{1},\ldots,U_{m}$ with $U_{i}\subseteq[n]$ and $|U_{i}|=k$ , along with values $b_{1},\ldots,b_{m}$ with $b_{i}\in\{\pm 1\}$ . For $1\leq i\leq m$ , constraint $i$ is satisfied if $x^{U_{i}}=b_{i}$ , where $x^{U_{i}}\mathrel{\mathop{:}}=\prod_{j\in U_{i}}x_{j}$ . We write $P_{\Phi}(x)$ for the number of constraints satisfied by $x$ . We will consider a uniformly random $k$ -XOR formula in which each $U_{i}$ is chosen uniformly and independently from the ${n\choose k}$ possible $k$ -subsets, and each $b_{i}$ is chosen uniformly and independently from $\{\pm 1\}$ .

Given a formula $\Phi$ , the goal of strong refutation is to certify an upper bound on the number of constraints that can be satisfied. In other words, our algorithm should output a bound $B=B(\Phi)$ such that for every formula $\Phi$ , $\max_{x\in\{\pm 1\}^{n}}P_{\Phi}(x)\leq B(\Phi)$ . (Note that this must be satisfied always, not merely with high probability.) At the same time, we want the bound $B$ to be small with high probability over a random $\Phi$ . Since a random assignment $x$ will satisfy roughly half the constraints, the best bound we can hope for is $B=\frac{m}{2}(1+\varepsilon)$ with $\varepsilon>0$ small.
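The setup can be made concrete with a small sampler and a brute-force evaluation of $\max_{x}P_{\Phi}(x)$ . This is only a sketch for tiny $n$ (the search is exponential in $n$ ); the function names and parameter values are illustrative and do not come from the paper.

```python
import random
from itertools import product
from math import prod

def random_kxor(n: int, k: int, m: int, rng: random.Random):
    """Sample m constraints (U_i, b_i): U_i a uniform k-subset of range(n), b_i a uniform sign."""
    return [(tuple(sorted(rng.sample(range(n), k))), rng.choice((-1, 1)))
            for _ in range(m)]

def num_satisfied(formula, x) -> int:
    """P_Phi(x): number of constraints with prod_{j in U_i} x_j == b_i."""
    return sum(1 for U, b in formula if prod(x[j] for j in U) == b)

rng = random.Random(0)
n, k, m = 8, 4, 40
formula = random_kxor(n, k, m, rng)

# Brute force over all 2^n assignments; feasible only for very small n.
best = max(num_satisfied(formula, x) for x in product((-1, 1), repeat=n))
print(best, m)
```

Each constraint is satisfied by exactly half of all assignments, so the average of $P_{\Phi}$ over $x$ is $m/2$ and the maximum is always at least $m/2$ .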

F.1.2 Algorithm

Let $k\geq 2$ be even and let $\ell\geq k/2$ . Given a $k$ -XOR formula $\Phi$ , construct the order- $\ell$ symmetric difference matrix ${\bm{M}}\in\mathbb{R}^{{[n]\choose\ell}\times{[n]\choose\ell}}$ as follows. For $S,T\subseteq[n]$ with $|S|=|T|=\ell$ , let

[TABLE]

and let

[TABLE]

Define the parameter

$$d_{\ell}={\ell\choose k/2}{n-\ell\choose k/2},$$

which, for any fixed $|S|=\ell$ , is the number of sets $|T|=\ell$ such that $|S\bigtriangleup T|=k$ . For an assignment $x\in\{\pm 1\}^{n}$ , let $u^{x}\in\mathbb{R}^{[n]\choose\ell}$ be defined by $u^{x}_{S}=x^{S}$ for all $|S|=\ell$ . We have

[TABLE]

since for any fixed $U_{i}$ (of size $k$ ), the number of $(S,T)$ pairs such that $S\bigtriangleup T=U_{i}$ is ${n\choose\ell}d_{\ell}{n\choose k}^{-1}$ . Thus we can perform strong refutation by computing $\|{\bm{M}}\|$ :

[TABLE]
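The refutation argument can be traced on a toy instance. The sketch assumes ${\bm{M}}_{S,T}=\sum_{i:\,U_{i}=S\bigtriangleup T}b_{i}$ , which is consistent with the counting identity above, and takes $d_{\ell}={\ell\choose k/2}{n-\ell\choose k/2}$ , the number of $T$ with $|S\bigtriangleup T|=k$ . To stay within the standard library it replaces $\|{\bm{M}}\|$ by the maximum absolute row sum, a cruder upper bound on the operator norm of a symmetric matrix; any upper bound on $\|{\bm{M}}\|$ still yields a valid certificate.

```python
from itertools import combinations, product
from math import comb, prod
import random

n, k, ell, m = 6, 2, 2, 12
rng = random.Random(1)
formula = [(frozenset(rng.sample(range(n), k)), rng.choice((-1, 1))) for _ in range(m)]

subsets = [frozenset(S) for S in combinations(range(n), ell)]
d_ell = comb(ell, k // 2) * comb(n - ell, k // 2)

# Assumed entries: M[S][T] = sum of b_i over constraints whose U_i equals S ^ T.
sign_sum = {}
for U, b in formula:
    sign_sum[U] = sign_sum.get(U, 0) + b
M = [[sign_sum.get(S ^ T, 0) for T in subsets] for S in subsets]

def num_satisfied(x):
    return sum(1 for U, b in formula if prod(x[j] for j in U) == b)

def quad_form(x):
    """(u^x)^T M u^x with u^x_S = prod_{j in S} x_j."""
    u = [prod(x[j] for j in S) for S in subsets]
    return sum(u[i] * M[i][j] * u[j] for i in range(len(u)) for j in range(len(u)))

# Number of ordered pairs (S, T) whose symmetric difference is one fixed k-set.
pairs_per_U = comb(n, ell) * d_ell // comb(n, k)

# Crude stand-in for ||M||: the max absolute row sum upper-bounds the operator norm.
norm_bound = max(sum(abs(v) for v in row) for row in M)
certified = m / 2 + comb(n, k) * norm_bound / (2 * d_ell)
print(certified)
```

The certificate follows from $(u^{x})^{\top}{\bm{M}}u^{x}\leq\|{\bm{M}}\|\,{n\choose\ell}$ together with the identity $(u^{x})^{\top}{\bm{M}}u^{x}={n\choose\ell}d_{\ell}{n\choose k}^{-1}(2P_{\Phi}(x)-m)$ .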

Theorem F.1.

Let $k\geq 2$ be even and let $k/2\leq\ell\leq n-k/2$ . Let $\beta\in(0,1)$ . If

[TABLE]

then $\|{\bm{M}}\|$ certifies

[TABLE]

with probability at least $1-3{n\choose\ell}^{-1}$ over a uniformly random $k$ -XOR formula $\Phi$ with $m$ constraints.

If $k$ is constant and $\ell=n^{\delta}$ with $\delta\in(0,1)$ , the condition (38) becomes

[TABLE]

matching the result of [RRS17]. In fact, our result is tighter by polylog factors.

F.1.3 Binomial Tail Bound

The main ingredients in the proof of Theorem F.1 will be the matrix Chernoff bound (Theorem A.4) and the following standard binomial tail bound.

Proposition F.2.

Let $X\sim\mathrm{Binomial}(n,p)$ . For $p<\frac{u}{n}<1$ ,

[TABLE]

Proof.

We begin with the standard Binomial tail bound [AG89]

[TABLE]

for $p<\frac{u}{n}<1$ , where

[TABLE]

Since $\log(x)\geq 1-1/x$ ,

[TABLE]

and the desired result follows. ∎
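The two steps of this proof can be compared numerically with the exact tail. In the sketch below, `kl_bound` is the standard Chernoff bound $\exp(-n\,D(u/n\,\|\,p))$ with $D(a\|p)=a\log\frac{a}{p}+(1-a)\log\frac{1-a}{1-p}$ , and `relaxed_bound` is what results from applying $\log(x)\geq 1-1/x$ to the second term. This relaxed expression is our own working of that step and may differ in form from the statement of Proposition F.2.

```python
from math import comb, exp, log

def binom_upper_tail(n: int, p: float, u: int) -> float:
    """Exact P(X >= u) for X ~ Binomial(n, p)."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(u, n + 1))

def kl_bound(n: int, p: float, u: int) -> float:
    """Chernoff bound exp(-n * D(u/n || p)), valid for p < u/n < 1."""
    a = u / n
    d = a * log(a / p) + (1 - a) * log((1 - a) / (1 - p))
    return exp(-n * d)

def relaxed_bound(n: int, p: float, u: int) -> float:
    """Weaker bound exp(-u*log(u/(n p)) + u - n p), from log(x) >= 1 - 1/x."""
    return exp(-u * log(u / (n * p)) + u - n * p)

n, p = 200, 0.05
# u ranges over values with p < u/n < 1 (here n*p = 10).
rows = [(u, binom_upper_tail(n, p, u), kl_bound(n, p, u), relaxed_bound(n, p, u))
        for u in range(11, 60)]
for u, exact, kl, relaxed in rows[:3]:
    print(u, exact, kl, relaxed)
```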

F.1.4 Proof

Proof of Theorem F.1.

We need to bound $\|{\bm{M}}\|$ with high probability over a uniformly random $k$ -XOR formula $\Phi$ . First, fix the subsets $U_{1},\ldots,U_{m}$ and consider the randomness of the signs $b_{i}$ . We can write ${\bm{M}}$ as a Rademacher series

[TABLE]

where

[TABLE]

By the matrix Chernoff bound (Theorem A.4),

[TABLE]

where

[TABLE]

In particular,

[TABLE]

Now we will bound $\sigma^{2}$ with high probability over the random choice of $U_{1},\ldots,U_{m}$ . We have

[TABLE]

where $D_{S}$ is the number of $i$ for which $|S\bigtriangleup U_{i}|=\ell$ . This means $\sigma^{2}=\max_{|S|=\ell}D_{S}$ . For fixed $S\subseteq[n]$ with $|S|=\ell$ , the number of sets $U\subseteq[n]$ with $|U|=k$ such that $|S\bigtriangleup U|=\ell$ is $d_{\ell}$ and so $D_{S}\sim\mathrm{Binomial}\left(m,p\right)$ with $p\mathrel{\mathop{:}}=d_{\ell}{n\choose k}^{-1}$ . Using the Binomial tail bound (Proposition F.2) and a union bound over $S$ ,

[TABLE]

Provided

[TABLE]

we have

[TABLE]

Let $\beta\in(0,1)$ . From (37), to certify $P_{\Phi}(x)\leq\frac{m}{2}(1+\beta)$ it suffices to have $\|{\bm{M}}\|\leq\beta md_{\ell}{n\choose k}^{-1}=\beta pm$ . Therefore, from (39), it suffices to have

[TABLE]

From (40), this will occur provided

[TABLE]

and

[TABLE]

Note that (42) is subsumed by (41). This completes the proof. ∎

F.2 Odd-Order Tensors

When the tensor order $p$ is odd, we have given an algorithm for tensor PCA based on the Kikuchi Hessian (see Appendix C) but are unfortunately unable to give a tight analysis of it. Here we present a related algorithm for which we are able to give a better analysis, matching SOS. The idea of the algorithm is to use a construction from the SOS literature that transforms an order- $p$ tensor (with $p$ odd) into an order- $2(p-1)$ tensor via the Cauchy–Schwarz inequality [CGL04]. We then apply a variant of our symmetric difference matrix to the resulting even-order tensor. A similar construction was given independently in the recent work [Has19] and shown to give optimal performance for all $\ell\leq n^{\delta}$ for a certain constant $\delta>0$ . The proof we give here applies to the full range of $\ell$ values: $\ell\ll n$ . Our proof uses a certain variant of the matrix Bernstein inequality combined with some fairly simple moment calculations.

F.2.1 Setup

For simplicity, we consider the following version of the problem. Let $p\geq 3$ be odd and let ${\bm{Y}}\in(\mathbb{R}^{n})^{\otimes p}$ be an asymmetric tensor with i.i.d. Rademacher (uniform $\pm 1$ ) entries. Our goal is to certify an upper bound on the Rademacher injective norm, defined as

[TABLE]

The true value is $O(\sqrt{n})$ with high probability. In time $n^{\ell}$ (where $\ell=n^{\delta}$ with $\delta\in(0,1)$ ) we will certify the bound $\|{\bm{Y}}\|_{\pm}\leq n^{p/4}\ell^{1/2-p/4}\mathrm{polylog}(n)$ , matching the results of [BGG+16, BGL16]. Such certification results can be turned into recovery results using sum-of-squares; see Lemma 4.4 of [HSS15]. To certify a bound on the injective norm instead of the Rademacher injective norm (where $x$ is constrained to the sphere instead of the hypercube), one should use the basis-invariant version of the symmetric difference matrices given by [Has19] (but we do not do this here).

F.2.2 Algorithm

We will use a trick from [CGL04] which is often used in the sum-of-squares literature. For any $\|x\|=1$ , we have by the Cauchy–Schwarz inequality,

[TABLE]

where $p=2q+1$ and ${\bm{T}}_{abcd}\mathrel{\mathop{:}}=\sum_{e\in[n]}{\bm{Y}}_{ace}{\bm{Y}}_{bde}$ where $a,b,c,d\in[n]^{q}$ . We have $\mathbb{E}[{\bm{T}}]_{abcd}=n\cdot\mathbf{1}\{ac=bd\}$ and so $\langle\mathbb{E}[{\bm{T}}],x^{\otimes 4q}\rangle=n\sum_{ac}(x^{a}x^{c})^{2}=n$ . Let $\tilde{\bm{T}}={\bm{T}}-\mathbb{E}[{\bm{T}}]$ , i.e., $\tilde{\bm{T}}_{abcd}={\bm{T}}_{abcd}\cdot\mathbf{1}\{ac\neq bd\}$ . Define the $n^{\ell}\times n^{\ell}$ matrix ${\bm{M}}$ as follows. For $S,T\in[n]^{\ell}$ ,

[TABLE]

where $S\stackrel{{\scriptstyle ab,cd}}{{\longleftrightarrow}}T$ roughly means that $S$ is obtained from $T$ by replacing $ab$ by $cd$ , or $cd$ by $ab$ ; the formal definition is given below. Also, $N_{ab,cd}$ denotes the number of $(S,T)$ pairs for which $S\stackrel{{\scriptstyle ab,cd}}{{\longleftrightarrow}}T$ .
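The Cauchy–Schwarz step can be illustrated in the smallest case $p=3$ (so $q=1$ and $a,b,c,d$ are single indices). The sketch assumes the inequality reads $\langle{\bm{Y}},x^{\otimes p}\rangle^{2}\leq\langle{\bm{T}},x^{\otimes 4q}\rangle$ for $\|x\|=1$ , which is what Cauchy–Schwarz over the last index gives, and checks that the entries with $ac=bd$ equal $n$ exactly for Rademacher ${\bm{Y}}$ . The dimension and the number of test vectors are arbitrary.

```python
import random
from math import sqrt

random.seed(0)
n = 4
# Asymmetric order-3 tensor with i.i.d. Rademacher entries.
Y = [[[random.choice((-1, 1)) for _ in range(n)] for _ in range(n)] for _ in range(n)]

def T(a, b, c, d):
    """T_{abcd} = sum_e Y_{ace} Y_{bde} for p = 3."""
    return sum(Y[a][c][e] * Y[b][d][e] for e in range(n))

def form_Y(x):
    """<Y, x (x) x (x) x>."""
    return sum(Y[a][c][e] * x[a] * x[c] * x[e]
               for a in range(n) for c in range(n) for e in range(n))

def form_T(x):
    """<T, x^{(x)4}> = sum over a, b, c, d of T_{abcd} x_a x_b x_c x_d."""
    idx = range(n)
    return sum(T(a, b, c, d) * x[a] * x[b] * x[c] * x[d]
               for a in idx for b in idx for c in idx for d in idx)

def random_unit_vector():
    g = [random.gauss(0.0, 1.0) for _ in range(n)]
    norm = sqrt(sum(v * v for v in g))
    return [v / norm for v in g]

samples = [random_unit_vector() for _ in range(20)]
gaps = [form_T(x) - form_Y(x) ** 2 for x in samples]
print(min(gaps))
```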

Definition F.3.

For $S,T\in[n]^{\ell}$ and $a,b,c,d\in[n]^{q}$ , we write $S\stackrel{{\scriptstyle ab,cd}}{{\longleftrightarrow}}T$ if there are distinct indices $i_{1},\ldots,i_{2q}\in[\ell]$ such that either: (i) $S_{i_{j}}=(ab)_{i_{j}}$ and $T_{i_{j}}=(cd)_{i_{j}}$ for all $j\in[2q]$ , the values in $a,b,c,d$ do not appear anywhere else in $S$ or $T$ , and $S,T$ are identical otherwise: $S_{i}=T_{i}$ for all $i\notin\{i_{1},\ldots,i_{2q}\}$ ; or (ii) the same holds but with $ab$ and $cd$ interchanged. (Here $ab$ denotes concatenation.)

Note that

[TABLE]

The above construction ensures that

[TABLE]

This means we can certify an upper bound on $\|{\bm{Y}}\|_{\pm}$ by computing $\|{\bm{M}}\|$ :

[TABLE]

Theorem F.4.

Let $p\geq 3$ be odd and let $p-1\leq\ell\leq\min\{n-(p-1),\frac{n}{4(p-1)},\frac{n}{8\log n}\}$ . Then $\|{\bm{M}}\|$ certifies

[TABLE]

with probability at least $1-n^{-\ell}$ over an i.i.d. Rademacher ${\bm{Y}}$ .

F.2.3 Proof

We will use the following variant of the matrix Bernstein inequality; this is a special case ( ${\bm{A}}_{k}=R\cdot{\bm{I}}$ ) of [Tro12], Theorem 6.2.

Theorem F.5 (Matrix Bernstein).

Consider a finite sequence $\{{\bm{X}}_{i}\}$ of independent random symmetric $d\times d$ matrices. Suppose $\mathbb{E}[{\bm{X}}_{i}]=0$ and $\|\mathbb{E}[{\bm{X}}_{i}^{r}]\|\leq\frac{r!}{2}R^{r}$ for $r=2,3,4,\ldots$ . Then

[TABLE]

For $e\in[n]$ , let

[TABLE]

We will apply Theorem F.5 to the sum ${\bm{M}}=\sum_{e}{\bm{M}}^{(e)}$ . Note that $\mathbb{E}[{\bm{M}}^{(e)}]=0$ . To bound the moments $\|\mathbb{E}[({\bm{M}}^{(e)})^{r}]\|$ , we will use the following basic fact.

Lemma F.6.

If ${\bm{A}}$ is a symmetric matrix,

[TABLE]

Proof.

Let $v$ be the leading eigenvector of ${\bm{A}}$ so that ${\bm{A}}v=\lambda v$ where $\|{\bm{A}}\|=|\lambda|$ . Normalize $v$ so that $\|v\|_{1}=1$ . Then $\|{\bm{A}}v\|_{1}=|\lambda|\cdot\|v\|_{1}$ and so

[TABLE]

∎
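A numerical illustration of this lemma, assuming its statement is $\|{\bm{A}}\|\leq\max_{j}\sum_{i}|{\bm{A}}_{ij}|$ (the bound that the proof above yields). The random symmetric test matrices and the helper name are illustrative.

```python
import numpy as np

def max_abs_row_sum(A: np.ndarray) -> float:
    """max_i sum_j |A_ij|; equals the max absolute column sum when A is symmetric."""
    return float(np.abs(A).sum(axis=1).max())

rng = np.random.default_rng(2)
results = []
for d in (2, 5, 20, 50):
    G = rng.standard_normal((d, d))
    A = (G + G.T) / 2                       # random symmetric test matrix
    op_norm = float(np.linalg.norm(A, 2))   # spectral (operator) norm
    results.append((d, op_norm, max_abs_row_sum(A)))

for d, op_norm, row_sum in results:
    print(d, op_norm, row_sum)
```

The row-sum quantity is cheap to compute and to bound combinatorially, which is how the lemma is used in the proof of Theorem F.4.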

Proof of Theorem F.4.

For any fixed $e$ , we have by Lemma F.6,

[TABLE]

Let $\pi$ denote a “path” of the form

[TABLE]

such that $S_{0}=S$ , $(a_{i},c_{i})\neq(b_{i},d_{i})$ , and $S_{i-1}\stackrel{{\scriptstyle a_{i}b_{i},c_{i}d_{i}}}{{\longleftrightarrow}}S_{i}$ . Then we have

[TABLE]

Among tuples of the form $(a_{i},c_{i})$ and $(b_{i},d_{i})$ , each must occur an even number of times (or else the term associated with $\pi$ is zero). There are $2r$ such tuples, so there are ${2r\choose r}\,r!\,2^{-r}$ ways to pair them up. Once $S_{i-1}$ is chosen, there are at most $2(\ell n)^{q}$ choices for $(a_{i},c_{i})$ , and the same is true for $(b_{i},d_{i})$ . Once $S_{i-1},a_{i},b_{i},c_{i},d_{i}$ are chosen, there are at most $(2q)!$ possible choices for $S_{i}$ . This means

[TABLE]

where $\bar{N}$ is defined in (43). Since ${2r\choose r}\leq 4^{r}$ , we can apply Theorem F.5 with $R=8(2q)!(\ell n)^{q}\bar{N}^{-1}$ . This yields

[TABLE]

Let $t=R\sqrt{8\ell n\log n}$ . Provided $\ell\leq n/(8\log n)$ we have $Rt\leq nR^{2}$ and so

[TABLE]

Thus with high probability we certify

[TABLE]

We have the following bound on $\bar{N}$ :

[TABLE]

provided $\ell\leq n/(8q)$ . Therefore we certify

[TABLE]

as desired.

∎

Bibliography

[ABČ13] Antonio Auffinger, Gérard Ben Arous, and Jiří Černý. Random matrices and complexity of spin glasses. Communications on Pure and Applied Mathematics, 66(2):165–201, 2013.
[ADGM16] Anima Anandkumar, Yuan Deng, Rong Ge, and Hossein Mobahi. Homotopy analysis for tensor PCA. arXiv preprint arXiv:1610.09322, 2016.
[AG89] Richard Arratia and Louis Gordon. Tutorial on large deviations for the binomial distribution. Bulletin of Mathematical Biology, 51(1):125–131, 1989.
[AGH+14] Animashree Anandkumar, Rong Ge, Daniel Hsu, Sham M Kakade, and Matus Telgarsky. Tensor decompositions for learning latent variable models. The Journal of Machine Learning Research, 15(1):2773–2832, 2014.
[AGJ17] Animashree Anandkumar, Rong Ge, and Majid Janzamin. Analyzing tensor power method dynamics in overcomplete regime. The Journal of Machine Learning Research, 18(1):752–791, 2017.
[AGJ18] Gérard Ben Arous, Reza Gheissari, and Aukosh Jagannath. Algorithmic thresholds for tensor PCA. arXiv preprint arXiv:1808.00921, 2018.
[AKS98] Noga Alon, Michael Krivelevich, and Benny Sudakov. Finding a large hidden clique in a random graph. Random Structures & Algorithms, 13(3-4):457–466, 1998.
[AOW15] Sarah R Allen, Ryan O’Donnell, and David Witmer. How to refute a random CSP. In 2015 IEEE 56th Annual Symposium on Foundations of Computer Science, pages 689–708. IEEE, 2015.
