Spectral Approximate Inference

Sejun Park; Eunho Yang; Se-Young Yun; Jinwoo Shin

arXiv:1905.05348 · cs.LG · May 15, 2019


TL;DR

This paper introduces a spectral-based approximation method for computing the partition function in graphical models, overcoming limitations of local iterative algorithms by leveraging global spectral features for improved robustness and accuracy.

Contribution

It presents a polynomial-time approximation scheme for low-rank GMs and a spectral mean-field scheme for high-rank GMs, enhancing robustness over prior methods.

Findings

01. The spectral approach outperforms prior algorithms in accuracy.

02. The method is robust and does not suffer from convergence issues.

03. Experiments demonstrate improved efficiency and reliability.


Equations (113)

$$P(\mathbf{x})=\frac{1}{Z}\exp\big(\langle\boldsymbol{\theta},\mathbf{x}\rangle+\mathbf{x}^{T}A\mathbf{x}\big)$$

$$P(\mathbf{x})\propto\exp\Big(\sum_{i\in\mathcal{V}}\theta_{i}x_{i}+2\sum_{(i,j)\in\mathcal{E}}A_{ij}x_{i}x_{j}\Big).$$

$$Z=Z(\boldsymbol{\theta},A):=\sum_{\mathbf{x}\in\Omega}\exp\big(\langle\boldsymbol{\theta},\mathbf{x}\rangle+\mathbf{x}^{T}A\mathbf{x}\big).$$

$$Z=\sum_{\mathbf{x}\in\Omega}\exp\Big(\langle\boldsymbol{\theta},\mathbf{x}\rangle+\sum_{j=1}^{r}\lambda_{j}\langle\mathbf{v}_{j},\mathbf{x}\rangle^{2}\Big)$$

$$Z=\sum_{\mathbf{x}\in\Omega}\exp\Big(\langle\boldsymbol{\theta},\mathbf{x}\rangle+\sum_{j=1}^{r}\text{sign}(\lambda_{j})\langle\mathbf{u}_{j},\mathbf{x}\rangle^{2}\Big)$$

$$\sum_{\mathbf{x}\in\Omega}\exp\big(\langle\boldsymbol{\theta},\mathbf{x}\rangle\big)\exp\Big(\sum_{j=1}^{r}\text{sign}(\lambda_{j})\big(c\cdot f_{j}(\mathbf{x})\big)^{2}\Big).$$

$$\begin{aligned}\sum_{\mathbf{x}\in\Omega}\exp\big(\langle\boldsymbol{\theta},\mathbf{x}\rangle\big)\exp\Big(\sum_{j=1}^{r}\text{sign}(\lambda_{j})\big(c\cdot f_{j}(\mathbf{x})\big)^{2}\Big)&=\sum_{\mathbf{k}\in\mathbf{f}(\Omega)}\Big(\sum_{\mathbf{x}\in\mathbf{f}^{-1}(\mathbf{k})}\exp\big(\langle\boldsymbol{\theta},\mathbf{x}\rangle\big)\Big)\times\exp\Big(\sum_{j=1}^{r}\text{sign}(\lambda_{j})(c\cdot k_{j})^{2}\Big)\\&=\sum_{\mathbf{k}\in\mathbf{f}(\Omega)}t(\mathbf{k})\exp\Big(\sum_{j=1}^{r}\text{sign}(\lambda_{j})(c\cdot k_{j})^{2}\Big).\end{aligned}$$

$$f_{j}(\mathbf{x})=\arg\min_{k_{j}\in\mathbb{Z}}\big|c\cdot k_{j}-\langle\mathbf{u}_{j},\mathbf{x}\rangle\big|$$

$$\{(-1,\dots,-1)\}=\mathcal{S}_{0}\subset\mathcal{S}_{1}\subset\dots\subset\mathcal{S}_{n-1}\subset\mathcal{S}_{n}=\Omega.$$

$$f_{j}\big((-1,\dots,-1)\big):=\arg\min_{k_{j}\in\mathbb{Z}}\Big|c\cdot k_{j}+\sum_{i=1}^{n}u_{ji}\Big|,$$

$$f_{j}(\mathbf{x}):=f_{j}(\mathbf{x}^{\prime})+\widehat{u}_{ji}$$

$$c\cdot\big(f_{j}(\mathbf{x}^{\prime})+\widehat{u}_{ji}\big)\approx\langle\mathbf{u}_{j},\mathbf{x}^{\prime}\rangle+2u_{ji}=\langle\mathbf{u}_{j},\mathbf{x}\rangle$$

$$|\mathcal{B}|\leq 2^{r}\prod_{j=1}^{r}\Big(\frac{1}{c}\sqrt{|\lambda_{j}|n}+\frac{n}{2}+1\Big).$$

$$Z\approx\sum_{\mathbf{k}\in\mathcal{B}}t(\mathbf{k})\exp\Big(\sum_{j=1}^{r}\text{sign}(\lambda_{j})(c\cdot k_{j})^{2}\Big),$$

$$t_{i}(\mathbf{k}):=\sum_{\mathbf{x}\in\mathbf{f}^{-1}(\mathbf{k})\cap\mathcal{S}_{i}}\exp\big(\langle\boldsymbol{\theta},\mathbf{x}\rangle\big),$$

$$t_{0}(\mathbf{k})=\begin{cases}\exp\big(-\sum_{i=1}^{n}\theta_{i}\big)&\text{if }\mathbf{k}=\mathbf{f}\big((-1,\dots,-1)\big)\\0&\text{otherwise}\end{cases}$$

$$\left|\log\frac{\widehat{Z}}{Z}\right|\leq\frac{1}{4}rc^{2}(n+1)^{2}+c\sqrt{n}(n+1)\sum_{j=1}^{r}\sqrt{|\lambda_{j}|},$$

$$(1-\varepsilon)Z\leq\widehat{Z}\leq(1+\varepsilon)Z,$$

$$c=\min\left(\sqrt{\frac{\varepsilon}{r}}\,\frac{1}{n+1},\ \frac{\varepsilon}{4\big(\sum_{j}\sqrt{|\lambda_{j}|}\big)\sqrt{n}(n+1)}\right).$$

$$Z=Z(\boldsymbol{\theta},A)=\exp\big(-\text{Tr}(D)\big)\cdot Z(\boldsymbol{\theta},A+D).$$

$$Z=\frac{1}{2}\sum_{\mathbf{x}^{\prime}\in\{-1,1\}^{n+1}}\exp\big((\mathbf{x}^{\prime})^{T}A^{\prime}\mathbf{x}^{\prime}\big)=\frac{1}{2}Z^{\prime}$$

$$\begin{aligned}\mathbb{E}_{\mathbf{x}\sim U_{\Omega}}\big[\exp(\mathbf{x}^{T}A\mathbf{x})\big]&=\mathbb{E}_{\mathbf{y}\sim P_{\mathcal{Y}}}\Big[\exp\Big(\sum_{j=1}^{n}\lambda_{j}y_{j}^{2}\Big)\Big]\\&\approx\mathbb{E}_{\mathbf{y}\sim q}\Big[\exp\Big(\sum_{j=1}^{n}\lambda_{j}y_{j}^{2}\Big)\Big]=\prod_{j=1}^{n}\mathbb{E}_{y_{j}\sim q_{j}}\big[\exp(\lambda_{j}y_{j}^{2})\big],\end{aligned}$$

$$\begin{aligned}Z&\approx 2^{n}\prod_{j=1}^{n}\mathbb{E}_{y_{j}\sim q_{j}}\big[\exp(\lambda_{j}y_{j}^{2})\big]\\&=2^{n}\prod_{j=1}^{n}\mathbb{E}_{\mathbf{x}\sim U_{\Omega}}\big[\exp\big(\lambda_{j}\langle\mathbf{v}_{j},\mathbf{x}\rangle^{2}\big)\big],\end{aligned}$$

$$\mathbb{E}_{\mathbf{x}\sim U_{\Omega}}\big[\exp(\mathbf{x}^{T}A\mathbf{x})\big]$$


Taxonomy

Topics: Blind Source Separation Techniques · Sparse and Compressive Sensing Techniques · Neural Networks and Applications

Full text


Abstract

Given a graphical model (GM), computing its partition function is the most essential inference task, but it is computationally intractable in general. To address the issue, iterative approximation algorithms exploring certain local structure/consistency of GM have been investigated as popular choices in practice. However, due to their local/iterative nature, they often output poor approximations or even do not converge, e.g., in low-temperature regimes (hard instances of large parameters). To overcome the limitation, we propose a novel approach utilizing the global spectral feature of GM. Our contribution is two-fold: (a) we first propose a fully polynomial-time approximation scheme (FPTAS) for approximating the partition function of GM associating with a low-rank coupling matrix; (b) for general high-rank GMs, we design a spectral mean-field scheme utilizing (a) as a subroutine, where it approximates a high-rank GM into a product of rank-1 GMs for an efficient approximation of the partition function. The proposed algorithm is more robust in its running time and accuracy than prior methods, i.e., neither suffers from the convergence issue nor depends on hard local structures, as demonstrated in our experiments.

Machine Learning, ICML

1 Introduction

Graphical models (GMs) provide a succinct representation of a joint probability distribution over a set of random variables by encoding their conditional dependencies in graphical structures. GMs have been studied in various fields of machine learning, including computer vision (Freeman et al., 2000), speech recognition (Bilmes, 2004) and deep learning (Salakhutdinov & Larochelle, 2010). Most inference problems arising in GMs, e.g., obtaining desired samples and computing marginal distributions, can be easily reduced to computing their partition function (normalizing constant). However, computing the partition function is #P-hard in general even to approximate (Jerrum & Sinclair, 1993), which is thus a fundamental barrier for inference tasks of GM.

Variational inference is one of the most popular heuristics in practice for estimating the partition function. It is typically achieved via running iterative local message-passing algorithms, e.g., mean-field approximation (Parisi, 1988; Jain et al., 2018) and belief propagation (Pearl, 1982; Wainwright et al., 2005). The Markov chain Monte Carlo (MCMC) method (Neal, 2001; Efthymiou et al., 2016) is another popular approach: it usually samples from GMs via Markov chains with a local transition, e.g., the Gibbs sampler (Geman & Geman, 1984), and estimates a target expectation by averaging over samples. Unfortunately, for both variational and MCMC methods it is hard to guarantee convergence/mixing under a fixed computation budget, and both are known to output poor approximations in the low-temperature regime, i.e., large parameters of GM, due to the absence of the so-called correlation decay (Weitz, 2006; Bandyopadhyay & Gamarnik, 2008). On the other hand, variable elimination (Dechter, 1999; Dechter & Rish, 2003; Liu & Ihler, 2011; Xue et al., 2016; Wrigley et al., 2017; Ahn et al., 2018a, b) is one of the popular ‘convergence-free’ methods for approximating the partition function. At each step, it marginalizes a chosen variable and generates complex high-order factors approximating the marginalized variable and its associated factors. Hence, it is guaranteed to terminate after marginalizing all variables. However, the performance of variable elimination schemes also degrades significantly in the low-temperature regime, due to their local/iterative nature of processing variables one by one.

Contribution. In this paper, we propose a completely new approach by investigating the global information of GM, to overcome the limitation of prior methods. To this end, we study the spectral feature of the coupling matrix of GM and propose a partition function approximation algorithm utilizing the eigenvectors and eigenvalues. In particular, if the matrix-rank and parameters of GM are bounded, i.e., $O(1)$ , then we prove that the proposed algorithm is a fully polynomial-time approximation scheme (FPTAS), even for GMs with high treewidth. Such polynomial-time approximation schemes have been typically investigated in the literature under certain structured GMs (Temperley & Fisher, 1961; Pearl, 1982; Dechter, 1999; Jerrum et al., 2004), and high-temperature regimes (Zhang et al., 2011; Li et al., 2013; Patel & Regts, 2017) or homogeneity of GM parameters (Jerrum & Sinclair, 1993; Sinclair et al., 2014; Molkaraie, 2016; Liu et al., 2017; Patel & Regts, 2017; Molkaraie & Gómez, 2018). Our theoretical result provides a new class of GMs for the direction.

Despite the theoretical value of the proposed algorithm for low-rank GMs, it is very expensive to run for general high-rank GMs, as its complexity grows exponentially with respect to the rank. To address this issue, we decompose the partition function of a high-rank GM into a product of those of rank-1 GMs. Then, we run the proposed FPTAS to compute all rank-1 partition functions and combine them to approximate the original partition function. To improve the approximation, we additionally suggest solving a semi-definite program to discover a better spectral decomposition of the partition function. In a sense, our approach is of mean-field type, but it differs from the traditional ones, which decompose the GM itself without spectral pre-processing. We present an illustration of the proposed scheme in Figure 1.

The proposed mean-field scheme can be universally applied to any GMs without the rank restriction. Its computational complexity scales well for large GMs without suffering from the convergence issue. Furthermore, its approximation quality is quite robust against hard GM instances of heterogeneous parameters since the utilized spectral feature grows linearly with respect to the inverse temperature, i.e., scale of parameters. Our experiments demonstrate that the proposed scheme indeed outperforms mean-field approximation, belief propagation and variable elimination, in particular, significantly in the low-temperature regimes where the prior methods fail.

2 Spectral Inference for Low-Rank GMs

We begin with introducing the definition of the pairwise binary graphical model (GM). Given a vector $\boldsymbol{\theta}\in\mathbb{R}^{n}$ and a symmetric matrix $A\in\mathbb{R}^{n\times n}$ , we define GM as the following joint distribution on $\mathbf{x}\in\Omega:=\{-1,1\}^{n}$ :

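$$P(\mathbf{x})=\frac{1}{Z}\exp\big(\langle\boldsymbol{\theta},\mathbf{x}\rangle+\mathbf{x}^{T}A\mathbf{x}\big) \tag{1}$$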

where $\langle\cdot,\cdot\rangle$ denotes the inner product and $Z$ is the normalizing constant. The above definition of GM coincides with the following conventional definition associating with an undirected graph $\mathcal{G}=(\mathcal{V},\mathcal{E})$ defined as:

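$$P(\mathbf{x})\propto\exp\Big(\sum_{i\in\mathcal{V}}\theta_{i}x_{i}+2\sum_{(i,j)\in\mathcal{E}}A_{ij}x_{i}x_{j}\Big), \tag{2}$$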

where $\mathcal{V}=\{1,\dots,n\}$ and $\mathcal{E}=\{(i,j):A_{ij}\neq 0,~{}i<j\}$ .

The normalizing constant $Z$ of (1) is called the partition function defined as follows:

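$$Z=Z(\boldsymbol{\theta},A):=\sum_{\mathbf{x}\in\Omega}\exp\big(\langle\boldsymbol{\theta},\mathbf{x}\rangle+\mathbf{x}^{T}A\mathbf{x}\big). \tag{3}$$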

Computing $Z$ is one of the most essential inference tasks arising in GMs. However, it is known to be computationally intractable in general, i.e., #P-hard even to approximate (Jerrum & Sinclair, 1993). In particular, the case when the magnitudes of the entries of $A$ are large is called the low-temperature regime (Sykes et al., 1965), where $Z$ is known to be provably harder to approximate (Sly & Sun, 2012; Galanis et al., 2014). This is indeed the regime where known heuristics also fail badly.
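For intuition about why a direct computation is infeasible, the definition of $Z$ can be evaluated by brute force for tiny models. The following is a minimal sketch, not part of the paper's method; the function name `partition_function` and the toy values of `theta` and `A` are assumptions chosen for illustration. Its cost is $2^{n}$ summands, which is exactly the blow-up the rest of the section avoids.

```python
import itertools
import math

def partition_function(theta, A):
    """Exact Z(theta, A) by enumerating every x in {-1, 1}^n (2^n terms)."""
    n = len(theta)
    total = 0.0
    for x in itertools.product((-1, 1), repeat=n):
        linear = sum(theta[i] * x[i] for i in range(n))
        quadratic = sum(A[i][j] * x[i] * x[j] for i in range(n) for j in range(n))
        total += math.exp(linear + quadratic)
    return total

# Hypothetical 2-variable model with a single symmetric coupling.
theta = [0.5, -0.25]
A = [[0.0, 0.3],
     [0.3, 0.0]]
Z = partition_function(theta, A)
```

Already at $n=50$ the loop would visit about $10^{15}$ configurations, so this sketch is only useful as a reference value for small test cases.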

In this section, we show that $Z$ can be approximated in polynomial time if there exists a diagonal matrix $D$ such that the rank of $A+D$ is bounded, i.e., $O(1)$ . For clarity, we primarily focus on the case when $A$ itself is of low rank (i.e., $D=0$ ) and then describe at the end of this section how our results extend to the case when $A+D$ is of low rank for a diagonal matrix $D$ .

2.1 Overall Approach: Approximate Inference via Spectral Decomposition

To design such a polynomial-time algorithm, we first reformulate $Z$ using the eigenvalues/eigenvectors of $A$ as follows:

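$$Z=\sum_{\mathbf{x}\in\Omega}\exp\Big(\langle\boldsymbol{\theta},\mathbf{x}\rangle+\sum_{j=1}^{r}\lambda_{j}\langle\mathbf{v}_{j},\mathbf{x}\rangle^{2}\Big) \tag{4}$$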

where $\lambda_{j}$ and $\mathbf{v}_{j}$ denote the $j$ -th largest non-zero eigenvalue and its corresponding unit eigenvector of $A$ and $r$ denotes the rank of $A$ . We note that such a decomposition is always possible because $A$ is a real symmetric matrix, i.e., all eigenvalues are real. However, even with a small rank $r$ , a naive computation of $Z$ is still intractable as it is a summation over exponentially many terms. Our main idea is approximating $\lambda_{j}\langle\mathbf{v}_{j},\mathbf{x}\rangle^{2}$ in (4) to its quantized value in order to drastically reduce the number of summations. Toward this, we rewrite (4) as

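$$Z=\sum_{\mathbf{x}\in\Omega}\exp\Big(\langle\boldsymbol{\theta},\mathbf{x}\rangle+\sum_{j=1}^{r}\text{sign}(\lambda_{j})\langle\mathbf{u}_{j},\mathbf{x}\rangle^{2}\Big)$$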

where $\text{sign}(\lambda_{j})\in\{-1,1\}$ denotes the sign of $\lambda_{j}$ and $\mathbf{u}_{j}=\sqrt{|\lambda_{j}|}\mathbf{v}_{j}$ . Here we deliberately choose some mapping $f_{j}:\Omega\rightarrow\mathbb{Z}$ (it will be explicitly described in Section 2.2) so that $c\cdot f_{j}(\mathbf{x})\approx\langle\mathbf{u}_{j},\mathbf{x}\rangle$ for some fixed constant $c>0$ and hence $Z$ can be nicely approximated as

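$$Z\approx\sum_{\mathbf{x}\in\Omega}\exp\big(\langle\boldsymbol{\theta},\mathbf{x}\rangle\big)\exp\Big(\sum_{j=1}^{r}\text{sign}(\lambda_{j})\big(c\cdot f_{j}(\mathbf{x})\big)^{2}\Big). \tag{5}$$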

Note that $c$ decides a quantization interval and $c\cdot f_{j}(\mathbf{x})$ represents a quantized value of $\langle\mathbf{u}_{j},\mathbf{x}\rangle$ . Namely, for each $\mathbf{x}\in\Omega$ , we will design $\mathbf{f}(\mathbf{x})=[f_{j}(\mathbf{x})]_{j=1}^{r}\in\mathbb{Z}^{r}$ for approximating $\langle\mathbf{u}_{j},\mathbf{x}\rangle$ for all $j$ .
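The two ingredients above, the spectral rewriting of $\mathbf{x}^{T}A\mathbf{x}$ and the quantization of $\langle\mathbf{u}_{j},\mathbf{x}\rangle$ with interval $c$, can be checked numerically on a toy case. The sketch below is illustrative only: the rank-2 matrix, the vectors in `V`, the eigenvalues in `lams`, and the use of plain nearest-integer rounding as a stand-in for $f_{j}$ are all assumptions, not the paper's final construction.

```python
import itertools
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

# Hypothetical rank-2 example: orthonormal v_1, v_2 and eigenvalues of mixed sign.
root2 = math.sqrt(2.0)
V = [[1 / root2, 1 / root2, 0.0], [1 / root2, -1 / root2, 0.0]]
lams = [1.5, -0.8]
n = 3

# A = sum_j lambda_j v_j v_j^T, and the rescaled vectors u_j = sqrt(|lambda_j|) v_j.
A = [[sum(l * v[i] * v[j] for l, v in zip(lams, V)) for j in range(n)] for i in range(n)]
U = [[math.sqrt(abs(l)) * vi for vi in v] for l, v in zip(lams, V)]
signs = [1 if l > 0 else -1 for l in lams]

c = 0.01  # quantization interval
max_identity_gap = 0.0
max_quantization_error = 0.0
for x in itertools.product((-1, 1), repeat=n):
    quadratic = sum(A[i][j] * x[i] * x[j] for i in range(n) for j in range(n))
    spectral = sum(s * dot(u, x) ** 2 for s, u in zip(signs, U))
    max_identity_gap = max(max_identity_gap, abs(quadratic - spectral))
    for u in U:
        k = round(dot(u, x) / c)  # nearest integer, so c * k is the quantized value
        max_quantization_error = max(max_quantization_error, abs(c * k - dot(u, x)))
```

With nearest-integer rounding the quantized value is within $c/2$ of $\langle\mathbf{u}_{j},\mathbf{x}\rangle$, which is why a smaller $c$ gives a finer approximation at the price of more distinct values of $\mathbf{k}$.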

Given such $\mathbf{f}$ , we further process (5) as

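$$\begin{aligned}\sum_{\mathbf{x}\in\Omega}\exp\big(\langle\boldsymbol{\theta},\mathbf{x}\rangle\big)\exp\Big(\sum_{j=1}^{r}\text{sign}(\lambda_{j})\big(c\cdot f_{j}(\mathbf{x})\big)^{2}\Big)&=\sum_{\mathbf{k}\in\mathbf{f}(\Omega)}\Big(\sum_{\mathbf{x}\in\mathbf{f}^{-1}(\mathbf{k})}\exp\big(\langle\boldsymbol{\theta},\mathbf{x}\rangle\big)\Big)\times\exp\Big(\sum_{j=1}^{r}\text{sign}(\lambda_{j})(c\cdot k_{j})^{2}\Big)\\&=\sum_{\mathbf{k}\in\mathbf{f}(\Omega)}t(\mathbf{k})\exp\Big(\sum_{j=1}^{r}\text{sign}(\lambda_{j})(c\cdot k_{j})^{2}\Big).\end{aligned} \tag{6}$$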

In the above, the first equality is from replacing the summation over $\Omega$ by that over $\mathbf{f}(\Omega)$ , i.e., for $\mathbf{k}=[k_{j}]_{j=1}^{r}\in\mathbf{f}(\Omega)$ , each $k_{j}$ represents a possible value of $f_{j}(\mathbf{x})$ . For the second equality, we define $t(\mathbf{k}):=\sum_{\mathbf{x}\in\mathbf{f}^{-1}(\mathbf{k})}\exp\big{(}\langle\boldsymbol{\theta},\mathbf{x}\rangle\big{)}.$ Finally, from (5) and (6), one can observe that if $t(\mathbf{k})$ is easy to compute and the cardinality of $\mathbf{f}(\Omega)$ is small, then the partition function $Z$ can be efficiently approximated. In the following section, we provide more details on how to choose $\mathbf{f}$ for the desired property.

2.2 How to Choose $\mathbf{f}$ and Compute $t(\mathbf{k})$

Choice of $\mathbf{f}$ . A naive choice of $\mathbf{f}$ can be

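$$f_{j}(\mathbf{x})=\arg\min_{k_{j}\in\mathbb{Z}}\big|c\cdot k_{j}-\langle\mathbf{u}_{j},\mathbf{x}\rangle\big| \tag{7}$$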

for all $j$ . However, with the above choice of $\mathbf{f}$ , it is unclear how to compute $t(\mathbf{k})$ efficiently (in polynomial-time). To address the issue, we propose a recursive construction of $\mathbf{f}$ by relaxing (7): we iteratively define $\mathbf{f}(\mathbf{x})$ for $\mathbf{x}\in\mathcal{S}_{i}\setminus\mathcal{S}_{i-1}$ where $\mathcal{S}_{i}:=\{\mathbf{x}\in\Omega:x_{\ell}=-1,~{}\forall\ell>i\}$ so that

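$$\{(-1,\dots,-1)\}=\mathcal{S}_{0}\subset\mathcal{S}_{1}\subset\dots\subset\mathcal{S}_{n-1}\subset\mathcal{S}_{n}=\Omega.$$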

First, we define $\mathbf{f}$ for $\mathcal{S}_{0}$ following (7):

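$$f_{j}\big((-1,\dots,-1)\big):=\arg\min_{k_{j}\in\mathbb{Z}}\Big|c\cdot k_{j}+\sum_{i=1}^{n}u_{ji}\Big|, \tag{8}$$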

for all $j$ . The construction of $\mathbf{f}$ for the remaining configurations $\mathcal{S}_{n}\setminus\mathcal{S}_{0}$ is done recursively. Suppose that $\mathbf{f}(\mathbf{x})$ is defined for $\mathbf{x}\in\mathcal{S}_{i-1}$ . Then, we define $\mathbf{f}$ for $\mathbf{x}\in\mathcal{S}_{i}\setminus\mathcal{S}_{i-1}$ as follows:

$$f_{j}(\mathbf{x}):=f_{j}(\mathbf{x}^{\prime})+\widehat{u}_{ji} \tag{9}$$

where we define $\widehat{u}_{ji}:=\arg\min_{\widehat{u}_{ji}\in\mathbb{Z}}|c\cdot\widehat{u}_{ji}-2u_{ji}|$ and $\mathbf{x}^{\prime}\in\mathcal{S}_{i-1}$ such that $x^{\prime}_{\ell}=x_{\ell}$ except for $\ell=i$ , i.e., $x^{\prime}_{i}=-1$ . Here, (9) is motivated by the following approximation: $c\cdot f_{j}(\mathbf{x}^{\prime})\approx\langle\mathbf{u}_{j},\mathbf{x}^{\prime}\rangle$ and the definition of $\widehat{u}_{ji}$ implies that

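$$c\cdot\big(f_{j}(\mathbf{x}^{\prime})+\widehat{u}_{ji}\big)\approx\langle\mathbf{u}_{j},\mathbf{x}^{\prime}\rangle+2u_{ji}=\langle\mathbf{u}_{j},\mathbf{x}\rangle$$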

where the equality is due to $x_{i}=1$ and $x^{\prime}_{i}=-1$ .
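Unrolling the recursion, $f_{j}(\mathbf{x})$ equals $f_{j}\big((-1,\dots,-1)\big)$ plus the sum of $\widehat{u}_{ji}$ over the coordinates with $x_{i}=1$. A minimal sketch of this for a single vector $\mathbf{u}_{j}$ is given below; the function name `recursive_f`, the example vector, and the use of Python's `round` for the integer arg-min (ties are broken by its round-half-to-even rule) are assumptions made for illustration. Since the initial rounding and each of the at most $n$ later roundings deviates by at most $c/2$, the sketch also records the worst deviation of $c\cdot f_{j}(\mathbf{x})$ from $\langle\mathbf{u}_{j},\mathbf{x}\rangle$.

```python
import itertools

def recursive_f(x, u, c):
    """Quantized index for one vector u, built from (-1, ..., -1) one flip at a time."""
    k = round(-sum(u) / c)           # value on the all-minus-one configuration
    for xi, ui in zip(x, u):
        if xi == 1:
            k += round(2 * ui / c)   # integer step for flipping this coordinate to +1
    return k

u = [0.37, -0.52, 0.81, 0.14, -0.66]  # hypothetical u_j
c = 0.05
n = len(u)

worst = 0.0
for x in itertools.product((-1, 1), repeat=n):
    inner = sum(ui * xi for ui, xi in zip(u, x))
    worst = max(worst, abs(c * recursive_f(x, u, c) - inner))
```

The price of the recursive definition is a looser quantization than the naive choice (up to $(n+1)c/2$ instead of $c/2$), in exchange for the additive structure that makes $t(\mathbf{k})$ computable by dynamic programming.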

In essence, we have so far constructed $\mathbf{f}$ via a dynamic programming to approximate (7), which allows us to compute $t(\mathbf{k})$ efficiently. Furthermore, our choice of $\mathbf{f}$ ensures that $|\mathbf{f}(\Omega)|$ is bounded. Before describing how to compute $t(\mathbf{k})$ , let us discuss the bound of $|\mathbf{f}(\Omega)|$ . For bounding $|\mathbf{f}(\Omega)|$ , we discover a bounded set $\mathcal{B}\subset\mathbb{Z}^{r}$ so that $\mathbf{f}(\Omega)\subset\mathcal{B}$ instead of characterizing $|\mathbf{f}(\Omega)|$ directly. We explicitly describe such $\mathcal{B}$ as follows.

Claim 1.

$\mathbf{f}(\Omega)\subset\mathcal{B}$ where

[TABLE]

Furthermore, $|\mathcal{B}|$ is bounded by

$$|\mathcal{B}|\leq 2^{r}\prod_{j=1}^{r}\Big(\frac{1}{c}\sqrt{|\lambda_{j}|n}+\frac{n}{2}+1\Big).$$

We present the proof of Claim 1 in the supplementary material. Finally, given $t(\mathbf{k})$ and $\mathcal{B}$ as defined in Claim 1, we approximate the partition function $Z$ as follows (see (5) and (6)):

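$$Z\approx\sum_{\mathbf{k}\in\mathcal{B}}t(\mathbf{k})\exp\Big(\sum_{j=1}^{r}\text{sign}(\lambda_{j})(c\cdot k_{j})^{2}\Big),$$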

where $t(\mathbf{k})=0$ if $\mathbf{k}\notin\mathbf{f}(\Omega)$ .

Computation of $t(\mathbf{k})$ . We are now ready to describe how to compute $t(\mathbf{k})$ . Since $t(\mathbf{k})=0$ for $\mathbf{k}\notin\mathcal{B}$ , it suffices to compute $t(\mathbf{k})$ for all $\mathbf{k}\in\mathcal{B}$ . Similar to the construction of $\mathbf{f}$ , we recursively compute

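$$t_{i}(\mathbf{k}):=\sum_{\mathbf{x}\in\mathbf{f}^{-1}(\mathbf{k})\cap\mathcal{S}_{i}}\exp\big(\langle\boldsymbol{\theta},\mathbf{x}\rangle\big),$$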

i.e., $t_{n}(\mathbf{k})=t(\mathbf{k})$ . The recursive computation of $t_{i}(\mathbf{k})$ is based on the following claim.

Claim 2.

$t_{i}(\mathbf{k})=t_{i-1}(\mathbf{k})+\exp(2\theta_{i})\cdot t_{i-1}(\mathbf{k}-[\widehat{u}_{ji}]_{j=1}^{r})$ .

The proof of Claim 2 is presented in the supplementary material. The above claim implies that once $t_{i-1}(\mathbf{k})$ for $\mathbf{k}\in\mathcal{B}$ is obtained, $t_{i}(\mathbf{k})$ can be efficiently computed using $t_{i-1}(\mathbf{k})$ . Here, we consider $t_{i-1}(\mathbf{k})=0$ for $\mathbf{k}\notin\mathcal{B}$ . Initially, one can find $t_{0}(\mathbf{k})$ as follows:

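$$t_{0}(\mathbf{k})=\begin{cases}\exp\big(-\sum_{i=1}^{n}\theta_{i}\big)&\text{if }\mathbf{k}=\mathbf{f}\big((-1,\dots,-1)\big)\\0&\text{otherwise}\end{cases}$$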

where $\mathbf{f}\big{(}(-1,\dots,-1)\big{)}$ is defined in (8).
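Putting Sections 2.1 and 2.2 together, the whole procedure is a dynamic program over the quantized indices $\mathbf{k}$. The sketch below is one possible reading of it, not the authors' implementation: the names `spectral_partition_function` and `brute_force`, the dictionary-based table (which stores only reachable $\mathbf{k}$ rather than the full set $\mathcal{B}$), and the toy inputs are assumptions. It takes the vectors $\mathbf{u}_{j}$ and signs as given instead of computing an eigendecomposition.

```python
import itertools
import math

def spectral_partition_function(theta, signs, U, c):
    """Approximate Z for x^T A x = sum_j signs[j] * <U[j], x>^2 with interval c."""
    n, r = len(theta), len(U)
    # t_0: all mass sits on the index of the all-minus-one configuration.
    k0 = tuple(round(-sum(U[j]) / c) for j in range(r))
    t = {k0: math.exp(-sum(theta))}
    for i in range(n):
        step = tuple(round(2 * U[j][i] / c) for j in range(r))  # hat-u for variable i
        new_t = dict(t)                                         # configurations with x_i = -1
        for k, weight in t.items():                             # configurations with x_i = +1
            shifted = tuple(a + b for a, b in zip(k, step))
            new_t[shifted] = new_t.get(shifted, 0.0) + math.exp(2 * theta[i]) * weight
        t = new_t
    return sum(w * math.exp(sum(s * (c * kj) ** 2 for s, kj in zip(signs, k)))
               for k, w in t.items())

def brute_force(theta, signs, U):
    """Reference value by enumerating all 2^n configurations."""
    total = 0.0
    for x in itertools.product((-1, 1), repeat=len(theta)):
        linear = sum(t * xi for t, xi in zip(theta, x))
        quad = sum(s * sum(ui * xi for ui, xi in zip(u, x)) ** 2 for s, u in zip(signs, U))
        total += math.exp(linear + quad)
    return total

# Hypothetical rank-2 instance with one positive and one negative direction.
theta = [0.3, -0.2, 0.5, 0.1, -0.4, 0.2]
U = [[0.5, -0.3, 0.8, 0.1, -0.6, 0.4], [0.2, 0.7, -0.1, 0.3, 0.5, -0.4]]
signs = [1, -1]
Z_hat = spectral_partition_function(theta, signs, U, c=1e-4)
Z_ref = brute_force(theta, signs, U)
```

The update inside the loop is the forward form of Claim 2: each existing entry is kept (the case $x_{i}=-1$) and also pushed to the shifted index with an extra factor $\exp(2\theta_{i})$ (the case $x_{i}=1$). A dictionary is used here for brevity; an array indexed over $\mathcal{B}$ would match the stated complexity more directly.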

2.3 Provable Guarantee

The succinct description of the proposed approximate inference algorithm described in Section 2.1 and 2.2 is given in Algorithm 1. We further prove the following theoretical guarantee of the algorithm.

Theorem 3.

Algorithm 1 outputs $\widehat{Z}$ such that

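$$\left|\log\frac{\widehat{Z}}{Z}\right|\leq\frac{1}{4}rc^{2}(n+1)^{2}+c\sqrt{n}(n+1)\sum_{j=1}^{r}\sqrt{|\lambda_{j}|},$$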

in $O\big{(}n2^{r}\prod_{j=1}^{r}(\sqrt{|\lambda_{j}|n}/c+n/2+1)\big{)}$ time.

The proof of Theorem 3 is presented in the supplementary material. As expected, a smaller quantization interval $c$ provides a smaller error bound, but a higher complexity (and vice versa). From Theorem 3, given $\varepsilon\in(0,1/2)$ , one can check that Algorithm 1 guarantees

$$(1-\varepsilon)Z\leq\widehat{Z}\leq(1+\varepsilon)Z,$$

if we choose

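$$c=\min\left(\sqrt{\frac{\varepsilon}{r}}\,\frac{1}{n+1},\ \frac{\varepsilon}{4\big(\sum_{j}\sqrt{|\lambda_{j}|}\big)\sqrt{n}(n+1)}\right).$$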

Under the choice of $c$ , the algorithm complexity becomes $O\big{(}(\frac{9}{\varepsilon}r\max(\lambda_{\max},1))^{r}n^{2r+1}\big{)}$ where $\lambda_{\max}=\max_{j}|\lambda_{j}|$ . Therefore, if the rank and parameters of GM are bounded, i.e., $r,A_{ij}=O(1)$ for all $i,j$ , Algorithm 1 is a fully polynomial-time approximation scheme (FPTAS) for approximating $Z$ .

Finally, we remark that the following simple trick allows a FPTAS for approximating the partition function of a richer class of GMs: for any diagonal matrix $D$ , one can check

$$Z=Z(\boldsymbol{\theta},A)=\exp\big(-\text{Tr}(D)\big)\cdot Z(\boldsymbol{\theta},A+D). \tag{10}$$

Namely, if there exists a diagonal matrix $D$ such that the rank of $A+D$ is $O(1)$ (possibly, $A$ is not of low-rank though), then one can run Algorithm 1 to approximate $Z(\boldsymbol{\theta},A+D)$ and use it to derive $Z(\boldsymbol{\theta},A)$ from (10).
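The identity behind this trick holds because $x_{i}^{2}=1$, so $\mathbf{x}^{T}D\mathbf{x}=\text{Tr}(D)$ for every $\mathbf{x}\in\Omega$. A small brute-force check is sketched below; the helper `partition_function` and the numeric values of `theta`, `A`, and `D` are illustrative assumptions.

```python
import itertools
import math

def partition_function(theta, A):
    """Exact Z(theta, A) by enumeration; only feasible for small n."""
    n = len(theta)
    total = 0.0
    for x in itertools.product((-1, 1), repeat=n):
        linear = sum(theta[i] * x[i] for i in range(n))
        quadratic = sum(A[i][j] * x[i] * x[j] for i in range(n) for j in range(n))
        total += math.exp(linear + quadratic)
    return total

theta = [0.2, -0.1, 0.4]
A = [[0.0, 0.5, -0.3],
     [0.5, 0.0, 0.2],
     [-0.3, 0.2, 0.0]]
D = [1.0, -0.5, 0.25]  # diagonal entries of a hypothetical D

A_plus_D = [[A[i][j] + (D[i] if i == j else 0.0) for j in range(3)] for i in range(3)]
lhs = partition_function(theta, A)
rhs = math.exp(-sum(D)) * partition_function(theta, A_plus_D)
```

Both sides agree up to floating-point error, so the diagonal can be chosen freely, for instance to lower the rank of $A+D$.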

3 Spectral Inference for High-Rank GMs

In the previous section, we introduced a FPTAS algorithm for approximating the partition function for the special class of low-rank GMs. However, for general (high-rank) GMs, Algorithm 1 is intractable to run as its complexity grows exponentially with respect to the rank. In this section, we address the issue by proposing a new efficient partition function approximation algorithm for general GMs of arbitrary rank. The proposed algorithm utilizes Algorithm 1 as a subroutine. Our main idea is to decompose the partition function of GM into a product of that of rank-1 GMs using the mean-field approximation, and then handle each rank-1 GM via Algorithm 1.

Throughout this section, we assume GMs with $\boldsymbol{\theta}=0$ . Such a restriction does not harm the generality of our method due to the following:

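$$Z=\frac{1}{2}\sum_{\mathbf{x}^{\prime}\in\{-1,1\}^{n+1}}\exp\big((\mathbf{x}^{\prime})^{T}A^{\prime}\mathbf{x}^{\prime}\big)=\frac{1}{2}Z^{\prime}$$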

where $A^{\prime}=\begin{bmatrix}A&\frac{1}{2}\boldsymbol{\theta}\\ \frac{1}{2}\boldsymbol{\theta}^{T}&0\end{bmatrix}$ and $Z^{\prime}=\sum_{\mathbf{x}^{\prime}}\exp\left((\mathbf{x}^{\prime})^{T}A^{\prime}\mathbf{x}^{\prime}\right)$ is the partition function of a GM with $A^{\prime}$ . Namely, computing the partition function of any GM is easily reducible to computing that of an alternative GM with $\boldsymbol{\theta}=0$ .
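The reduction works because the extra variable $x_{n+1}$ multiplies $\langle\boldsymbol{\theta},\mathbf{x}\rangle$, and flipping the sign of all of $\mathbf{x}^{\prime}$ leaves the quadratic form unchanged, so each half of the enlarged sum equals $Z$. The sketch below builds $A^{\prime}$ and checks the factor $\frac{1}{2}$ by enumeration; the helper names `augment` and `quadratic_partition_function` and the toy model are assumptions.

```python
import itertools
import math

def augment(theta, A):
    """Return A' = [[A, theta/2], [theta^T/2, 0]] as a list of lists."""
    A_prime = [row[:] + [0.5 * theta[i]] for i, row in enumerate(A)]
    A_prime.append([0.5 * t for t in theta] + [0.0])
    return A_prime

def quadratic_partition_function(A, theta=None):
    """Sum of exp(<theta, x> + x^T A x) over x in {-1, 1}^n; theta defaults to 0."""
    n = len(A)
    theta = theta if theta is not None else [0.0] * n
    total = 0.0
    for x in itertools.product((-1, 1), repeat=n):
        linear = sum(theta[i] * x[i] for i in range(n))
        quadratic = sum(A[i][j] * x[i] * x[j] for i in range(n) for j in range(n))
        total += math.exp(linear + quadratic)
    return total

theta = [0.7, -0.3, 0.2]
A = [[0.0, 0.4, -0.1],
     [0.4, 0.0, 0.6],
     [-0.1, 0.6, 0.0]]
Z = quadratic_partition_function(A, theta)
Z_prime = quadratic_partition_function(augment(theta, A))
```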

3.1 Overall Approach: From High-Rank to Low-Rank

To handle high-rank GMs, we first reformulate the partition function $Z$ by substituting the summation over $\mathbf{x}$ with the expectation over $\mathbf{x}$ drawn from the uniform distribution $U_{\Omega}$ over $\Omega$ :

$$Z=2^{n}\,\mathbb{E}_{\mathbf{x}\sim U_{\Omega}}\big[\exp(\mathbf{x}^{T}A\mathbf{x})\big]. \tag{11}$$

Then, for approximating the above expectation, we consider the following mean-field approximation via some fully factorized distribution $q(\mathbf{y})=\prod_{j=1}^{n}q_{j}(y_{j})$ , where $y_{j}=\langle\mathbf{v}_{j},\mathbf{x}\rangle$ , $\mathbf{y}=[y_{j}]_{j=1}^{n}$ :

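$$\begin{aligned}\mathbb{E}_{\mathbf{x}\sim U_{\Omega}}\big[\exp(\mathbf{x}^{T}A\mathbf{x})\big]&=\mathbb{E}_{\mathbf{y}\sim P_{\mathcal{Y}}}\Big[\exp\Big(\sum_{j=1}^{n}\lambda_{j}y_{j}^{2}\Big)\Big]\\&\approx\mathbb{E}_{\mathbf{y}\sim q}\Big[\exp\Big(\sum_{j=1}^{n}\lambda_{j}y_{j}^{2}\Big)\Big]=\prod_{j=1}^{n}\mathbb{E}_{y_{j}\sim q_{j}}\big[\exp(\lambda_{j}y_{j}^{2})\big],\end{aligned} \tag{12}$$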

where $P_{\mathcal{Y}}(\mathbf{y}):=\sum_{\mathbf{x}\in\Omega\,:\,y_{j}=\langle\mathbf{v}_{j},\mathbf{x}\rangle,~{}\forall j}U_{\Omega}(\mathbf{x})$ for $\mathbf{y}\in\mathcal{Y}$ and $\mathcal{Y}:=\big{\{}\mathbf{y}=[y_{j}=\langle\mathbf{v}_{j},\mathbf{x}\rangle]_{j=1}^{n}:\mathbf{x}\in\Omega\big{\}}$ . Now, we prove the following claim that the choice of $q_{j}(y_{j})=P_{\mathcal{Y}}(y_{j})$ (the marginal probability of the joint distribution $P_{\mathcal{Y}}$ ) is optimal for the mean-field approximation in (12), with respect to the Kullback-Leibler (KL) divergence. The proof of Claim 4 is presented in the supplementary material.

Claim 4.

$\text{KL}\big{(}P_{\mathcal{Y}}(\mathbf{y})||\prod_{j=1}^{n}q_{j}(y_{j})\big{)}$ is minimized when $q_{j}(y_{j})=P_{\mathcal{Y}}(y_{j})$ for all $j$ .

In summary, under the choice of $q_{j}(y_{j})=P_{\mathcal{Y}}(y_{j})$ , we use the following approximation for $Z$ from (11) and (12):

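$$\begin{aligned}Z&\approx 2^{n}\prod_{j=1}^{n}\mathbb{E}_{y_{j}\sim q_{j}}\big[\exp(\lambda_{j}y_{j}^{2})\big]\\&=2^{n}\prod_{j=1}^{n}\mathbb{E}_{\mathbf{x}\sim U_{\Omega}}\big[\exp\big(\lambda_{j}\langle\mathbf{v}_{j},\mathbf{x}\rangle^{2}\big)\big],\end{aligned} \tag{13}$$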

where it is easy to check that $2^{n}\mathbb{E}_{\mathbf{x}\sim U_{\Omega}}\big{[}\exp\big{(}\lambda_{j}\langle\mathbf{v}_{j},\mathbf{x}\rangle^{2}\big{)}\big{]}$ is equivalent to the partition function of a rank-1 GM induced by $\lambda_{j},\mathbf{v}_{j}$ and can be efficiently approximated using Algorithm 1. We further remark that the mean-field approximation quality in (13) is expected to be better if variables $y_{j}=\langle\mathbf{v}_{j},\mathbf{x}\rangle$ for all $j$ are closer to independence. Hence, it is quite a reasonable approximation since for $i\neq j$ , $\langle\mathbf{v}_{i},\mathbf{x}\rangle$ , $\langle\mathbf{v}_{j},\mathbf{x}\rangle$ are pairwise uncorrelated, i.e., $\mathbb{E}_{\mathbf{x}\sim U_{\Omega}}[\langle\mathbf{v}_{i},\mathbf{x}\rangle\langle\mathbf{v}_{j},\mathbf{x}\rangle]=0$ , due to the orthogonality of eigenvectors $\mathbf{v}_{i},\mathbf{v}_{j}$ .
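To make the spectral mean-field idea concrete, the sketch below works directly in spectral form on a tiny model: it fixes an orthonormal basis $\mathbf{v}_{1},\dots,\mathbf{v}_{4}$ (rows of a scaled $4\times 4$ Hadamard matrix, an assumption chosen so that no eigen-solver is needed) and compares the exact $Z$ with the product of per-direction expectations. All names and numeric values are illustrative, and the per-direction expectations are computed by enumeration rather than by Algorithm 1.

```python
import itertools
import math

# Orthonormal directions v_j: rows of a 4x4 Hadamard matrix scaled by 1/2.
V = [[h / 2.0 for h in row] for row in
     [[1, 1, 1, 1], [1, -1, 1, -1], [1, 1, -1, -1], [1, -1, -1, 1]]]
n = 4
states = list(itertools.product((-1, 1), repeat=n))

def y(j, x):
    """Spectral coordinate y_j = <v_j, x>."""
    return sum(vi * xi for vi, xi in zip(V[j], x))

def exact_Z(lams):
    """Sum over x of exp(sum_j lambda_j <v_j, x>^2), i.e. Z for A = sum_j lambda_j v_j v_j^T."""
    return sum(math.exp(sum(l * y(j, x) ** 2 for j, l in enumerate(lams))) for x in states)

def mean_field_Z(lams):
    """2^n times the product over j of E_x[exp(lambda_j <v_j, x>^2)], x uniform on {-1, 1}^n."""
    product = 1.0
    for j, l in enumerate(lams):
        product *= sum(math.exp(l * y(j, x) ** 2) for x in states) / len(states)
    return 2 ** n * product

def correlation(i, j):
    """E_x[<v_i, x> <v_j, x>] under the uniform distribution."""
    return sum(y(i, x) * y(j, x) for x in states) / len(states)

lams = [0.3, -0.2, 0.1, -0.4]  # hypothetical eigenvalues
Z_exact, Z_mf = exact_Z(lams), mean_field_Z(lams)
```

When only one eigenvalue is non-zero, the product collapses to a single factor and the approximation is exact; with several non-zero eigenvalues the gap reflects the remaining dependence among the $y_{j}$, which are uncorrelated but not independent.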

We remark that our mean-field approximation (13) is different from the traditional one (Parisi, 1988). The latter seeks a mean-field distribution of the $x_{i}$ ’s minimizing the KL divergence with the original distribution $\mathbb{P}(\mathbf{x})$ , while our approach minimizes the KL divergence between $q(\mathbf{y})$ and $P_{\mathcal{Y}}(\mathbf{y})$ , i.e., after spectral processing.

3.2 Improving (13) via Controlling the Diagonal of $A$

Recall that varying the diagonal of $A$ only changes the partition function by a constant multiplicative factor, as in (10). To exploit this, we optimize the diagonal of $A$ to improve our mean-field approximation. To this end, we build the following mean-field approximation by introducing the additional freedom of choosing a diagonal matrix $D$ :

[TABLE]

where $\lambda^{D}_{j},\mathbf{y}^{D},q^{D},\mathcal{Y}^{D},P_{\mathcal{Y}^{D}}$ are those for $A+D$ (analogous to $\lambda_{j},\mathbf{y},q,\mathcal{Y},P_{\mathcal{Y}}$ of $A$ ). Since it is intractable to find the optimal selection for $D$ by directly minimizing the approximation gap of (14) (as computing the true expectations is intractable), we propose to set the free parameter $D$ by solving the following semi-definite programming (SDP):

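$$\underset{D}{\text{maximize}}\ \ \text{Tr}(D)\quad\text{subject to}\quad A+D\preceq 0,\ \ D\ \text{diagonal}. \tag{15}$$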

The intuition behind solving (15) is provided in Section 3.3. We also provide its empirical justification through experimental studies in Section 4.2. We remark that the SDP (15) is equivalent to (the dual of) the popular semi-definite relaxation of the max-cut problem (Goemans & Williamson, 1995) and the maximum eigenvalue minimization problem (Delorme & Poljak, 1993). As for the complexity of solving (15), the interior point method (Alizadeh, 1995; Helmberg et al., 1996) has $O(n^{3.5}\log(1/\varepsilon))$ running time and the first-order method (Nesterov, 2007) has $O(n^{3}\sqrt{\log n}/\varepsilon)$ running time, where $\varepsilon>0$ denotes the target precision to the optimum. We also refer to Section 3 of (Waldspurger et al., 2015) and Section 4 of (Goemans & Williamson, 1995) for more details.

From (11), (14) and (15), our final approximation becomes

[TABLE]

where $D$ is a solution of (15) and $\mathbf{v}^{D}_{j}$ is an eigenvector of $A+D$ corresponding to $\lambda_{j}^{D}$ . It is trivial that the above approximation with $D=0$ reduces to (13). Finally, we formally state the proposed algorithm in Algorithm 2.

3.3 Intuition for (15)

Now, we describe the intuition why we consider the semi-definite programming (15). To this end, let us re-write the approximation error in (14) as the following alternative view:

[TABLE]

where $\widehat{Z}$ denotes the approximated partition function. One can easily check that the approximation error is $0$ when $\lambda^{D}_{1}=\dots=\lambda^{D}_{n}=0$ . Thus, we can expect a very accurate estimation when all eigenvalues of $A+D$ are close to 0. One can also observe that if there exists $\lambda_{j}^{D}>0$ , then the error might be very large, as $\sup_{y\in\mathbb{R}^{n}}\sum_{j=1}^{n}\lambda_{j}^{D}y_{j}^{2}=\infty$ and the supports of $P_{\mathcal{Y}^{D}}$ and $q^{D}$ are different. Based on these intuitions, we suggest solving the following problem:

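$$\underset{D}{\text{maximize}}\ \ \sum_{j=1}^{n}\lambda_{j}^{D}\quad\text{subject to}\quad\lambda_{j}^{D}\leq 0\ \ \text{for all }j. \tag{16}$$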

The optimization (16) is equivalent to (15) since $\text{Tr}(D)=\sum_{j=1}^{n}\lambda_{j}^{D}-\text{Tr}(A)$ and the condition $\lambda^{D}_{j}\leq 0$ for all $j$ is equivalent to $A+D\preceq 0$ .

4 Experimental Results

In this section, we evaluate our algorithms on diverse environments including both synthetic and UAI datasets to corroborate our theorem and claims.

4.1 Setups

To begin with, we describe our overall experimental settings. We compare our algorithms against the standard inference schemes dominantly used in most applications: belief propagation (BP) (Pearl, 1982), mean-field approximation (MF) (Parisi, 1988), mini-bucket elimination (MBE) (Dechter & Rish, 2003) and weighted mini-bucket elimination (WMBE) (Liu & Ihler, 2011). Since all baselines trade off computation cost against performance, we choose 200 iterations for BP, 1000 iterations for MF and an ibound of 10 for MBE and WMBE, for fair comparisons. Below these are referred to as BP-200, MF-1000, MBE-10 and WMBE-10, respectively. In the case of BP and MF, their performance saturates with the above choice in most cases and there is no gain from running more iterations. On the other hand, one can improve the approximation quality of MBE and WMBE with a larger ibound, but their complexity grows exponentially with respect to it. We also report the running times of the algorithms in our implementation in round brackets following their names, e.g., BP-200 (2s) means that 200 iterations of BP run in 2 seconds (on average) for the tested GMs.

Throughout all our experiments, we fix $c=\sqrt{|\lambda_{j}|}/1000$ for Algorithm 1 and Algorithm 2 to bound the running time regardless of the eigenvalues. For solving the semi-definite program (SDP) (15), we use CVX (Grant et al., 2008) with the SDPT3 solver (Toh et al., 1999) in MATLAB.

For generating synthetic GMs to evaluate on, we first choose the graph structure (it will be specified in each setting) and randomly sample $\theta_{i}\sim\text{Unif}[-1,1]$ on its vertices and $A_{ij}\sim\text{Unif}[-s,s]$ on its edges where Unif denotes the uniform distribution and $s$ indicates the strength of pairwise couplings. For measuring the running time for all experiments, we run algorithms using a single thread of CPU. To reduce experimental noise, we average 100 random GMs for each plot unless otherwise stated.
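The sampling procedure for a complete graph can be sketched as follows. This is an illustrative reading of the setup, not the authors' code: the function name `sample_complete_graph_gm`, the fixed seed, and the decision to draw each symmetric matrix entry $A_{ij}$ (rather than an edge weight under some other parametrization) from $\text{Unif}[-s,s]$ are assumptions.

```python
import random

def sample_complete_graph_gm(n, s, seed=0):
    """Draw theta_i ~ Unif[-1, 1] and symmetric couplings A_ij ~ Unif[-s, s], zero diagonal."""
    rng = random.Random(seed)
    theta = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            A[i][j] = A[j][i] = rng.uniform(-s, s)  # one draw per undirected edge
    return theta, A

theta, A = sample_complete_graph_gm(n=20, s=2.0)
```

Larger values of `s` correspond to the low-temperature regime discussed earlier, where the couplings dominate the model.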

4.2 Investigating the Semi-Definite Programming (15)

In this section, we investigate empirical effects and running time of the proposed SDP (15).

Effect of solving (15). We first investigate how (15) helps the mean-field approximation (14) used in Algorithm 2 compared to other choices of the diagonal matrix $D$. In particular, we consider three other choices for comparison. The first choice is $D=0$, which does not change the diagonal of $A$. The second choice is $D=-\max_{j}\lambda_{j}\times I$, which sets the entries of $D$ by the maximum eigenvalue of $A$ so that $A+D\preceq 0$. The last choice is $D_{ii}=-\sum_{j=1}^{n}|A_{ij}|$, which forces $A+D$ to be a diagonally dominant matrix, so that $A+D\preceq 0$. The second and third choices can be thought of as feasible, yet non-optimal, solutions of (15). Figure 2(a) reports the log partition error for GMs on the complete graph with 20 vertices. One can observe that solving (15) is important for the approximation performance of Algorithm 2.
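The three baseline choices of $D$ are easy to construct and check numerically. The sketch below is an illustration under assumptions: the helper names are hypothetical, the coupling matrix is a random complete-graph instance with zero diagonal, and it only verifies $A+D\preceq 0$ by an eigenvalue test; it does not solve (15).

```python
import numpy as np

def diagonal_choices(A):
    """Return the three baseline diagonal matrices D compared against (15)."""
    n = A.shape[0]
    lam_max = np.linalg.eigvalsh(A).max()
    return {
        "zero": np.zeros((n, n)),                           # D = 0
        "max_eig": -lam_max * np.eye(n),                    # D = -max_j lambda_j * I
        "diag_dominant": -np.diag(np.abs(A).sum(axis=1)),   # D_ii = -sum_j |A_ij|
    }

def is_negative_semidefinite(M, tol=1e-9):
    """Check M <= 0 in the PSD order via its largest eigenvalue."""
    return bool(np.linalg.eigvalsh(M).max() <= tol)

rng = np.random.default_rng(0)
n, s = 20, 1.0
upper = np.triu(rng.uniform(-s, s, size=(n, n)), k=1)
A = upper + upper.T  # symmetric couplings on the complete graph, zero diagonal

nsd = {name: is_negative_semidefinite(A + D)
       for name, D in diagonal_choices(A).items()}
```

With a zero-diagonal, nonzero $A$ the trace of $A$ is zero, so $A$ has a positive eigenvalue and $D=0$ is not negative semi-definite, while the other two choices are (by eigenvalue shifting and by Gershgorin's circle theorem, respectively).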

Running time for solving (15). Now, we discuss the empirical complexity of solving (15). Our solver SDPT3 uses the primal-dual interior point method (Toh et al., 1999). To measure the running time of the solver, we generate random GMs on complete graphs, varying the number of vertices from $100$ to $5000$. Figure 2(b) illustrates the average running time of our solver, where each point is averaged over 10 random GMs. We compare the running time of our solver with quadratic and cubic polynomials in $n$. One can observe that the empirical running time to solve (15) is between $O(n^{2})$ and $O(n^{3})$, which is better than the $O(n^{3.5})$ theoretical bound of the interior point method (Helmberg et al., 1996).

4.3 Evaluation of Algorithm 1 under Low-Rank GMs

We evaluate Algorithm 1, which is used as a subroutine of Algorithm 2, under rank-1 GMs. We choose a random eigenvalue $\lambda\in\{-1,1\}$ and a random eigenvector $\mathbf{v}\sim\text{Unif}\big(\{\mathbf{v}\in\mathbb{R}^{n}:\|\mathbf{v}\|_{2}=1\}\big)$ to generate rank-1 GMs by choosing $A=\lambda\mathbf{v}\mathbf{v}^{T}$ and $\theta_{i}\sim\text{Unif}[-1,1]$. Given $\mathbf{v}$, we scale $\lambda$ so that the average value of $|A_{ij}|$ equals some constant $s$ (the coupling strength in Figure 2(c)), i.e.,

[TABLE]

We remark that rank-1 GMs have the special property that if their eigenvalue $\lambda$ is positive (or negative), they are equivalent to ferromagnetic (or antiferromagnetic) models, i.e., $A_{ij}\geq 0$ (or $A_{ij}\leq 0$) for $i\neq j$, respectively. Figure 2(c) reports the performance of the algorithms under rank-1 GMs. As expected from our theoretical results (Theorem 3), our algorithm is nearly exact, while the other algorithms fail. In particular, BP, MBE and WMBE output very poor approximations since they usually fail in antiferromagnetic cases, i.e., for a negative eigenvalue. The superior performance of Algorithm 1 under rank-1 GMs implies that the approximation error of Algorithm 2 would mainly come from the mean-field approximation (14).
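One way to read the equivalence to (anti)ferromagnetic models is through a sign relabeling $x_{i}\mapsto\text{sign}(v_{i})\,x_{i}$, which leaves the partition function unchanged and maps $A=\lambda\mathbf{v}\mathbf{v}^{T}$ to a matrix with entries $\lambda|v_{i}||v_{j}|$. This reading is an assumption; the text does not spell out the transformation. The sketch below checks it numerically on an unscaled rank-1 instance (the rescaling of $\lambda$ to match $s$ is omitted).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10

# Uniform direction on the unit sphere: normalize a standard Gaussian vector.
g = rng.standard_normal(n)
v = g / np.linalg.norm(g)
lam = rng.choice([-1.0, 1.0])
A = lam * np.outer(v, v)  # rank-1 coupling matrix

# Relabel x_i -> sign(v_i) * x_i; couplings become A'_ij = lam * |v_i| * |v_j|.
sigma = np.sign(v)
A_gauged = A * np.outer(sigma, sigma)

# After relabeling, every off-diagonal coupling has the sign of lam.
off_diag = ~np.eye(n, dtype=bool)
same_sign = bool(np.all(lam * A_gauged[off_diag] >= 0))
```

The same relabeling applied to the bias gives $\theta_{i}\mapsto\text{sign}(v_{i})\,\theta_{i}$, so the relabeled model has the same set of energies as the original one.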

4.4 Evaluation of Algorithm 2 under High-Rank GMs

We now evaluate the empirical performance of Algorithm 2 under synthetic high-rank GMs and UAI datasets (Gogate, 2014). In all cases, we have checked through simulations that BP and MF do not achieve better accuracy than BP-200 and MF-1000, respectively, even if we run the algorithms for many more iterations.

To generate synthetic GMs, we consider Erdős-Rényi (ER) random graphs, complete bipartite graphs, complete graphs, and grid graphs. The experimental results are reported in Figure 3(a)-3(e). In all cases, one can observe that our algorithm significantly outperforms the others in the high coupling region, i.e., the low-temperature regime. It is known that MF outputs better approximations than the others as the underlying graph structure becomes dense, e.g., the complete graph (Ellis & Newman, 1978). However, our algorithm performs remarkably better than MF even in such cases. In particular, MF and BP exhibit high variance in their approximation errors in high coupling regions, while ours does not.

We also evaluate our algorithms on GMs on grid graphs from a dataset of the UAI 2014 inference competition. It provides 8 GMs on grid graphs, where 4 of them have 100 vertices ($10\times 10$) and the other 4 have 400 vertices ($20\times 20$). Figure 3(f) reports the approximation error and the running time of each algorithm. In these experiments, our algorithm consistently has small errors, while the other algorithms often fail badly.

Finally, we compare the running times of the algorithms under GMs on complete graphs of 100-500 vertices, which are reported in Figure 4. Here, we do not report WMBE since it is slower than MBE. One can observe that Algorithm 2 scales as well as BP, while MBE does not. MF is the fastest, but it is the worst in approximation quality under grid and UAI GMs, as reported in the earlier experimental results.

5 Conclusion

In this paper, we provide a completely new angle for designing approximate inference algorithms for graphical models. The proposed algorithms scale well to large-scale models, like prior iterative message-passing schemes, and outperform them in approximation quality, significantly so for hard instances. For future work, we plan to extend our spectral approach to estimating the marginal distributions and/or related inference in higher-order or continuous models.

Acknowledgement

This work was supported by an IITP grant funded by the Korea government (MSIT) (No.2017-0-01779, XAI). We would like to acknowledge Sungsoo Ahn for helpful discussions and sharing code.

Appendix A Proof of Claim 1

We first prove $\mathbf{f}(\Omega)\subset\mathcal{B}$ . To this end we introduce the following inequalities for all $\mathbf{x}\in\{-1,1\}^{n}$ :

[TABLE]

which directly leads us to $|c\cdot f_{j}(\mathbf{x})|\leq\|\mathbf{u}_{j}\|_{1}+c\cdot(n+1)/2\leq c\cdot b_{j}$, and therefore $\mathbf{f}(\Omega)\subset\mathcal{B}$. Here, the first inequality of (17) is trivial. The second inequality of (17) follows from the fact that the error between $c\cdot f_{j}(\mathbf{x})$ and $\langle\mathbf{u}_{j},\mathbf{x}\rangle$ arises from a series of quantizations, which occurs once in (8) and at most $n$ times in (9). Since each quantization incurs an error of at most $c/2$, the second inequality of (17) holds.
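The accumulation of quantization errors can be checked numerically. Since (8) and (9) are not reproduced here, the sketch below uses a hypothetical stand-in for $f_{j}$: it rounds once for the all-$(-1)$ configuration and once more for each coordinate flipped to $+1$, which gives at most $n+1$ roundings of error at most $c/2$ each. The function name and the exact rounding rule are assumptions for illustration only.

```python
import numpy as np

def quantized_projection(u, x, c):
    """Hypothetical stand-in for f_j: an integer k with c * k close to <u, x>.

    One rounding for the base configuration (all entries -1), then one
    rounding per coordinate flipped to +1, so at most n + 1 roundings.
    """
    base = -np.ones_like(u)
    k = int(np.round(np.dot(u, base) / c))
    for i in np.flatnonzero(x > 0):
        k += int(np.round(2.0 * u[i] / c))  # flipping x_i from -1 to +1 adds 2 * u_i
    return k

rng = np.random.default_rng(0)
n, c = 12, 0.05  # illustrative values
u = rng.standard_normal(n)
u /= np.linalg.norm(u)

worst = 0.0
for _ in range(2000):
    x = rng.choice([-1.0, 1.0], size=n)
    err = abs(c * quantized_projection(u, x, c) - np.dot(u, x))
    worst = max(worst, err)

bound = c * (n + 1) / 2  # the c(n+1)/2 bound from the proof
```

Under this stand-in, the worst observed error never exceeds `bound`, matching the counting argument: one rounding of size at most $c/2$ for the base term and at most $n$ further roundings.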

Now we prove the bound of $|\mathcal{B}|$ . From the definition of $\mathcal{B}$ and $b_{j}$ , one can easily observe that the following bound on $|\mathcal{B}|$ holds:

[TABLE]

where the inequality is from $\|\mathbf{v}_{j}\|_{1}\leq\sqrt{n}\|\mathbf{v}_{j}\|_{2}=\sqrt{n}$ .

Appendix B Proof of Claim 2

Claim 2 holds since

[TABLE]

In the above, $\mathbf{g}_{i}:\mathcal{S}_{i}\setminus\mathcal{S}_{i-1}\rightarrow\mathcal{S}_{i-1}$ is a bijection defined by $\mathbf{g}_{i}(\mathbf{x})=\mathbf{x}^{\prime}$ such that $x^{\prime}_{\ell}=x_{\ell}$ except for $\ell=i$. The second equality of (18) is from replacing the summation over $\mathbf{f}^{-1}(\mathbf{k})\cap(\mathcal{S}_{i}\setminus\mathcal{S}_{i-1})$ by that over $\mathbf{g}_{i}\big(\mathbf{f}^{-1}(\mathbf{k})\cap(\mathcal{S}_{i}\setminus\mathcal{S}_{i-1})\big)$. The third equality of (18) is based on (9), which implies that for all $\mathbf{x}\in\mathcal{S}_{i}\setminus\mathcal{S}_{i-1}$, $\mathbf{x}^{\prime}=\mathbf{g}_{i}(\mathbf{x})$ satisfies

[TABLE]

Hence, (19) leads us to

[TABLE]

and the third equality of (18) follows. The fourth equality of (18) directly follows from the definition of $\mathbf{g}_{i}$, which gives $x^{\prime}_{i}=-1$ and $\big(\mathbf{g}_{i}^{-1}(\mathbf{x}^{\prime})\big)_{i}=x_{i}=1$.

Appendix C Proof of Theorem 3

We first prove the computational complexity of Algorithm 1. Since each of $t(\mathbf{k}),t^{\prime}(\mathbf{k})$ requires $O(|\mathcal{B}|)$ memory and $|\mathcal{B}|\leq 2^{r}\prod_{j=1}^{r}(\sqrt{|\lambda_{j}|n}/c+n/2+1)$ from Claim 1, the space complexity of Algorithm 1 is $O\big(2^{r}\prod_{j=1}^{r}(\sqrt{|\lambda_{j}|n}/c+n/2+1)\big)$. In addition, as the algorithm iterates $n$ times and each iteration accesses $t(\mathbf{k})$ and $t^{\prime}(\mathbf{k})$, Algorithm 1 has $O\big(n2^{r}\prod_{j=1}^{r}(\sqrt{|\lambda_{j}|n}/c+n/2+1)\big)$ computational complexity.
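As a worked instance of this bound, suppose the experimental choice $c=\sqrt{|\lambda_{j}|}/1000$ from Section 4.1 is substituted into the $j$-th factor (how the per-eigenvalue $c$ combines with the single $c$ in the bound is an assumption made here for illustration). Each factor then no longer depends on $\lambda_{j}$:

$$\frac{\sqrt{|\lambda_{j}|n}}{c}+\frac{n}{2}+1=1000\sqrt{n}+\frac{n}{2}+1,\qquad\text{so}\qquad 2^{r}\prod_{j=1}^{r}\Big(\frac{\sqrt{|\lambda_{j}|n}}{c}+\frac{n}{2}+1\Big)=2^{r}\Big(1000\sqrt{n}+\frac{n}{2}+1\Big)^{r}.$$

For example, with $n=100$ and $r=1$ the bound on $|\mathcal{B}|$ evaluates to $2\,(10000+50+1)=20102$, independently of the magnitude of the eigenvalue.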

Now we provide the bound on the partition function approximation. First, we recall the following error bound introduced in the proof of Claim 1:

[TABLE]

Using (20), we provide a bound for $|\left\langle\mathbf{u}_{j},\mathbf{x}\right\rangle^{2}-(c\cdot f_{j}(\mathbf{x}))^{2}|$ as follows

[TABLE]

where the first inequality is from (20) and the second inequality is from $|\langle\mathbf{u}_{j},\mathbf{x}\rangle|\leq\|\mathbf{u}_{j}\|_{1}\leq\sqrt{|\lambda_{j}|n}$ . From (21), the error bound can be derived as

[TABLE]

where the last inequality follows from (21). One can obtain the same bound for $\widehat{Z}/Z$, and this completes the proof of Theorem 3.
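The step from the error bound (20) to the bound on the squared terms can be spelled out with a difference-of-squares argument. The following is a sketch consistent with the two inequalities described in the proof; the displayed form of (21) in the paper may group the terms differently. Writing $a=\langle\mathbf{u}_{j},\mathbf{x}\rangle$ and $b=c\cdot f_{j}(\mathbf{x})$, and using $|a-b|\leq c(n+1)/2$ together with $|a|\leq\|\mathbf{u}_{j}\|_{1}\leq\sqrt{|\lambda_{j}|n}$:

$$|a^{2}-b^{2}|=|a-b|\,|a+b|\leq|a-b|\,\big(2|a|+|a-b|\big)\leq\frac{c(n+1)}{2}\Big(2\sqrt{|\lambda_{j}|n}+\frac{c(n+1)}{2}\Big).$$

The middle step uses $|a+b|\leq 2|a|+|b-a|$, which is the triangle inequality applied to $a+b=2a+(b-a)$.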

Appendix D Proof of Claim 4

The result of Claim 4 directly follows from the following inequality:

[TABLE]

where the last inequality follows from the source coding theorem (Shannon, 1948).

Bibliography

Ahn, S., Chertkov, M., Shin, J., and Weller, A. Gauged mini-bucket elimination for approximate inference. In International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 10–19, 2018a.
Ahn, S., Chertkov, M., Weller, A., and Shin, J. Bucket renormalization for approximate inference. In International Conference on Machine Learning (ICML), pp. 109–118, 2018b.
Alizadeh, F. Interior point methods in semidefinite programming with applications to combinatorial optimization. SIAM Journal on Optimization, 5(1):13–51, 1995.
Bandyopadhyay, A. and Gamarnik, D. Counting without sampling: Asymptotics of the log-partition function for certain statistical physics models. Random Structures & Algorithms, 33(4):452–479, 2008.
Bilmes, J. A. Graphical models and automatic speech recognition. In Mathematical Foundations of Speech and Language Processing, pp. 191–245. Springer, 2004.
Dechter, R. Bucket elimination: A unifying framework for reasoning. Artificial Intelligence, 113(1-2):41–85, 1999.
Dechter, R. and Rish, I. Mini-buckets: A general scheme for bounded inference. Journal of the ACM (JACM), 50(2):107–153, 2003.
Delorme, C. and Poljak, S. Laplacian eigenvalues and the maximum cut problem. Mathematical Programming, 62(1-3):557–574, 1993.
