Dimension-independent Sparse Fourier Transform

Michael Kapralov; Ameya Velingker; Amir Zandieh

arXiv:1902.10633·cs.DS·February 28, 2019

TL;DR

This paper introduces an algorithm that computes the Fourier transform of a $k$ -sparse signal in any dimension $d$ in time polynomial in the sparsity $k$ and in $\log N$ , where $N$ is the signal size, avoiding the exponential dependence on $d$ incurred by earlier sublinear-time techniques.

Contribution

It presents the first Sparse Fourier Transform algorithm whose runtime, $\operatorname{poly}(k,\log N)$ , has no exponential dependence on the dimension. The main tool is a new class of adaptive aliasing filters.

Findings

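1. Computes the DFT of a $k$ -sparse signal in $\operatorname{poly}(k,\log N)$ time in any dimension $d$ .

2. Introduces adaptive aliasing filters, which isolate a frequency of a $k$ -Fourier sparse signal using $O(k)$ samples in time domain.

3. Gives an $\widetilde{O}(k^{2})$ time algorithm for worst case support with randomized coefficients and an $\widetilde{O}(k)$ time algorithm for random support with worst case coefficients.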

Equations

$$\begin{array}{ll}\textbf{Input:}&\text{access to }x:[n]^{d}\to{\mathbb{C}},\ \text{integer }k\geq 1\text{ such that }|\mathrm{supp}\,\widehat{x}|\leq k\\ \textbf{Output:}&\text{nonzero elements of }\widehat{x}\text{ and their coefficients}\end{array}$$

$$\widehat{(x_{\cdot-a}\cdot G)}_{\bm{j}\cdot n/b}=\sum_{\bm{f}\in[n]^{d}}\widehat{x}_{\bm{f}}\,e^{2\pi i\bm{f}^{T}a/n}\cdot\widehat{G}_{\bm{j}\cdot n/b-\bm{f}}.$$

$$\sum_{\bm{j}\in[n]^{d}}x_{\bm{j}}G_{\bm{t}-\bm{j}}$$

$$x_{\bm{t}}-\frac{1}{N}\sum_{\bm{f}\in C}\widehat{x}_{\bm{f}}\cdot e^{2\pi i\frac{\bm{f}^{T}\bm{t}}{n}}.$$

$$\begin{array}{ll}\textbf{Input:}&\text{access to }x:[n]^{d}\to{\mathbb{C}},\ \text{subset }S\subseteq[n]^{d}\text{ such that }\mathrm{supp}\,\widehat{x}\subseteq S\\ \textbf{Output:}&\widehat{x}_{S}\end{array}$$

$$\mathrm{FrequencyCone}_{T}(v):=\left\{f\in[n]:f\equiv f_{v}\pmod{2^{l_{T}(v)}}\right\}.$$

$$\sum_{j\in[n]}x_{j}G_{t-j}=\frac{1}{n}\sum_{f\in\mathrm{FrequencyCone}_{T}(v)}\widehat{x}_{f}\,e^{2\pi i\frac{ft}{n}}.$$

$$\widehat{x}_{f}=\beta_{f}e^{2\pi i\theta}\quad\text{for uniformly random }\theta\in[0,2\pi),$$

$$H_{c}^{n}=\{f\in[n]:w(f)\leq c\},$$

$$X_{c}^{n}=\{x\in{\mathbb{C}}^{n}:\mathrm{supp}\,\widehat{x}\subseteq H_{c}^{n}\}$$

$$\widehat{x}_{\bm{f}}=\begin{cases}\beta_{\bm{f}}&\text{with probability }k/n^{d}\\ 0&\text{with probability }1-k/n^{d},\end{cases}$$

$$\sum_{j\in[n]}x_{j}G_{t-j}=\frac{1}{n}\sum_{f^{\prime}\in[n]}\widehat{x}_{f^{\prime}}\cdot\widehat{G}_{f^{\prime}}\cdot e^{2\pi i\frac{f^{\prime}t}{n}}=\frac{1}{n}\widehat{x}_{f}\,e^{2\pi i\frac{ft}{n}}$$

$$\emptyset=\mathrm{supp}\,\widehat{G}\cap\bigcup_{\substack{u\neq v\\ u:\text{ leaf of }T}}\mathrm{FrequencyCone}_{T}(u)=\mathrm{supp}\,\widehat{G}\cap\bigcup_{\substack{u\neq v\\ u:\text{ leaf of }T}}\{f_{u}\}=\mathrm{supp}\,\widehat{G}\cap(S\setminus\{f_{v}\}),$$

$$\widehat{G}_{q}(\xi)=\begin{cases}\widehat{G}_{q-1}(\xi)\cdot\frac{1+e^{2\pi i\frac{\xi-f_{v}}{2^{q}}}}{2}&\text{if }v_{q-1}\text{ has two children in }T\\ \widehat{G}_{q-1}(\xi)&\text{otherwise}.\end{cases}$$

$$(\mathrm{supp}\,\widehat{G})\cap\bigcup_{\substack{u\neq v\\ u:\text{ leaf of }T}}\mathrm{FrequencyCone}_{T}(u)=\emptyset$$

$$\widehat{G}(f)=1\quad\text{for all }f\in\mathrm{FrequencyCone}_{T}(v).$$

$$\widehat{G}_{j+1}(\xi)=\widehat{G}_{j}(\xi)\cdot\frac{1+e^{2\pi i\frac{\xi-f_{v}}{2^{j+1}}}}{2}.$$

$$\widehat{G}_{j+1}(f)=\widehat{G}_{j}(f)\cdot\frac{1+e^{2\pi i\frac{f-f_{v}}{2^{j+1}}}}{2}=\widehat{G}_{j}(f)\cdot\frac{1+e^{2\pi i\frac{f_{u}-f_{v}}{2^{j+1}}}}{2}=0,$$

$$\widehat{G}(f^{\prime})=\prod_{\substack{q\in\{1,2,\dots,l\}\\ v_{q-1}\text{ has two children in }T}}\frac{1+e^{2\pi i\frac{f^{\prime}-f_{v}}{2^{q}}}}{2}=1,$$

$$G_{q}(t)=\begin{cases}G_{q-1}(t)*\frac{\delta(t)+e^{-2\pi if_{v}/2^{q}}\delta\left(t+\frac{n}{2^{q}}\right)}{2}&\text{if }v_{q-1}\text{ has two children in }T\\ G_{q-1}(t)&\text{otherwise}\end{cases}$$

$$f=\sum_{r=1}^{d}f_{r}\cdot n^{r-1}.$$

$$\xi_{q}=\frac{\xi-(\xi\bmod n^{q-1})}{n^{q-1}}\pmod{n}.$$

$$\mathrm{FrequencyCone}_{T}(v):=\left\{\bm{f}\in[n]^{d}:\bm{f}\equiv\bm{f}_{v}\pmod{2^{l_{T}(v)}}\right\},$$

$$\mathrm{FrequencyCone}_{T}(v)=\mathrm{FrequencyCone}_{T^{\prime}}(u)\cup\mathrm{FrequencyCone}_{T^{\prime}}(w)$$

$$\sum_{\bm{j}\in[n]^{d}}x_{\bm{j}}G_{\bm{t}-\bm{j}}=\frac{1}{N}\sum_{\bm{f}\in\mathrm{FrequencyCone}_{T}(v)}\widehat{x}_{\bm{f}}\,e^{2\pi i\frac{\bm{f}^{T}\bm{t}}{n}}.$$

$$\widehat{G}^{(q)}(\bm{f})=\widehat{G}^{(q-1)}(\bm{f})\cdot\widehat{G}_{q}(f_{q})$$

$$\sum_{v\in L}2^{-w_{T}(v)}=1.$$

$$\mathrm{supp}\,(\widehat{x}-\widehat{\chi}^{(t)})\subseteq S^{(t)}\quad\text{and}\quad|S^{(t)}|=|S|-t$$

Taxonomy

Topics: Mathematical Approximation and Integration · Sparse and Compressive Sensing Techniques · Digital Filter Design and Implementation

Full text

Michael Kapralov (EPFL)

Ameya Velingker (Google Research). This work was completed while the author was a research scientist in the School of Computer and Communication Sciences, EPFL.

Amir Zandieh (EPFL)

Abstract

The Discrete Fourier Transform (DFT) is a fundamental computational primitive, and the fastest known algorithm for computing the DFT is the FFT (Fast Fourier Transform) algorithm. One remarkable feature of FFT is the fact that its runtime depends only on the size $N$ of the input vector, but not on the dimensionality of the input domain: FFT runs in time $O(N\log N)$ irrespective of whether the DFT in question is on $\mathbb{Z}_{N}$ or $\mathbb{Z}_{n}^{d}$ for some $d>1$ , where $N=n^{d}$ .

The state of the art for Sparse FFT, i.e. the problem of computing the DFT of a signal that has at most $k$ nonzeros in Fourier domain, is very different: all current techniques for sublinear time computation of Sparse FFT incur an exponential dependence on the dimension $d$ in the runtime. In this paper we give the first algorithm that computes the DFT of a $k$ -sparse signal in time $\operatorname{poly}(k,\log N)$ in any dimension $d$ , avoiding the curse of dimensionality inherent in all previously known techniques. Our main tool is a new class of filters that we refer to as adaptive aliasing filters: these filters allow isolating frequencies of a $k$ -Fourier sparse signal using $O(k)$ samples in time domain and $O(k\log N)$ runtime per frequency, in any dimension $d$ .

We also investigate natural average case models of the input signal: (1) worst case support in Fourier domain with randomized coefficients and (2) random locations in Fourier domain with worst case coefficients. Our techniques lead to an $\widetilde{O}(k^{2})$ time algorithm for the former and an $\widetilde{O}(k)$ time algorithm for the latter.

1 Introduction

The Discrete Fourier Transform (DFT) is one of the most widely used computational primitives in modern computing, with numerous applications in data analysis, signal processing, and machine learning. The fastest algorithm for computing the DFT is the Fast Fourier Transform (FFT) algorithm of Cooley and Tukey, which has been recognized as one of the 10 most important algorithms of the 20th century [Cip00]. The FFT algorithm is very efficient: it computes the Discrete Fourier Transform of a length $N$ complex-valued signal in time $O(N\log N)$ . This applies to vectors in any dimension: FFT works in $O(N\log N)$ time irrespective of whether the DFT is on the line, on a $\sqrt{N}\times\sqrt{N}$ grid, or is in fact the Hadamard transform on $\{0,1\}^{d}$ , with $d=\log_{2}N$ .

In many applications of the Discrete Fourier Transform, the input signal $x\in{\mathbb{C}}^{N}$ satisfies sparsity or approximate sparsity constraints: the Fourier transform $\widehat{x}$ of $x$ has a small number of coefficients $k$ or is close to a signal with a small number of coefficients (e.g., this phenomenon is the motivation for compression schemes such as JPEG and MPEG). This has motivated a rich line of work on the Sparse FFT problem: given access to a signal $x\in{\mathbb{C}}^{N}$ in time domain that is sparse in Fourier domain, compute the $k$ nonzero coefficients in sublinear (i.e., $o(N)$ ) time.

Very efficient algorithms for the Sparse FFT problem have been developed in the literature [GL89, KM91, Man92, GGI+02, AGS03, GMS05, Iwe10a, Aka10, HIKP12b, HIKP12a, LWC12, BCG+12, HAKI12, PR13, HKPV13, IKP14, IK14, Kap16, PS15, CKPS16, Kap17]. The state-of-the-art approach, due to [HIKP12a], yields an $O(k\log N)$ runtime algorithm for the following exact $k$ -sparse Fourier transform problem: given access to an input signal of length $N$ whose Fourier transform has at most $k$ nonzeros, output the nonzero coefficients and their values. This highly efficient algorithm comes with a caveat, however: the runtime of $O(k\log N)$ only holds for the Fourier transform on the line, namely, $\mathbb{Z}_{N}$ . The algorithm naturally extends to higher dimensions, namely, $\mathbb{Z}_{n}^{d}$ , where $N=n^{d}$ , but with an exponential loss in runtime; the runtime becomes $O(k\log^{d}N)$ as opposed to $O(k\log N)$ . Interestingly, the other extreme of $d=\log_{2}N$ , i.e., the Hadamard transform, has been known to admit an $O(k\log N)$ algorithm since the seminal work of Goldreich and Levin [GL89]. However, all intermediate values of $d$ exhibit a curse of dimensionality. This is in sharp contrast with FFT itself, which runs in time $O(N\log N)$ , where $N=n^{d}$ is the length of the input signal, in any dimension $d$ . The focus of our work is to design sublinear time algorithms for Sparse FFT that avoid this curse of dimensionality. Our main point of attention is the Sparse FFT problem:

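$$\begin{array}{ll}\textbf{Input:}&\text{access to }x:[n]^{d}\to{\mathbb{C}},\ \text{integer }k\geq 1\text{ such that }|\mathrm{supp}\,\widehat{x}|\leq k\\ \textbf{Output:}&\text{nonzero elements of }\widehat{x}\text{ and their coefficients}\end{array}\tag{1}$$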

Our main result is the first sublinear time algorithm for the exact Sparse FFT problem (1) whose runtime does not depend exponentially on the dimension $d$ , as stated in the following theorem.

Theorem 1 (Main result, informal version of Theorem 2.1 in Section 2.1).

For any integer $n$ that is a power of two and any positive integer $d$ , there exists a deterministic algorithm that, given access to a signal $x\in{\mathbb{C}}^{n^{d}}$ with $\|{\widehat{x}}\|_{0}\leq k$ , recovers $\widehat{x}$ in time $\operatorname{poly}(k,\log N)$ .

We note that this is the first sublinear time Sparse FFT algorithm that avoids an exponential dependence on the dimension $d$ . One should note that the runtime still depends on $d$ , since $\log_{2}N=d\log_{2}n$ is lower bounded by $d$ , but this dependence is polynomial as opposed to exponential.
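To make the input/output contract of problem (1) concrete, the following is a minimal sketch of the dense baseline: it reads all $N=n^{d}$ samples and runs a full $d$ -dimensional FFT, so it takes $O(N\log N)$ time rather than the $\operatorname{poly}(k,\log N)$ time of Theorem 1. The function name `dense_baseline`, the tolerance, and the test parameters are illustrative assumptions, and NumPy's normalization convention is used.

```python
import numpy as np

def dense_baseline(x, tol=1e-9):
    """Solve problem (1) the slow way: full d-dimensional FFT, then keep nonzeros."""
    x_hat = np.fft.fftn(x)
    support = np.argwhere(np.abs(x_hat) > tol)
    return {tuple(int(c) for c in f): x_hat[tuple(f)] for f in support}

# Hypothetical test instance: a k-sparse spectrum on the grid [n]^d.
rng = np.random.default_rng(0)
n, d, k = 8, 3, 5                        # N = n**d = 512
flat = rng.choice(n**d, size=k, replace=False)
x_hat_true = np.zeros(n**d, dtype=complex)
# Magnitudes in [1, 2) with random phases, so no coefficient falls below tol.
x_hat_true[flat] = (1 + rng.random(k)) * np.exp(2j * np.pi * rng.random(k))
x_hat_true = x_hat_true.reshape((n,) * d)

x = np.fft.ifftn(x_hat_true)             # time-domain signal the algorithm may query
recovered = dense_baseline(x)            # {frequency tuple: coefficient}
```

The same call works unchanged for any `d`, which mirrors the dimension-insensitivity of FFT itself; the difficulty addressed by the paper is doing this without touching all $N$ samples.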

1.1 Significance of our results and related work

Significance of our results.

The state of the art in high dimensional Sparse Fourier Transforms presents an interesting conundrum: algorithms with runtime $O(k\log N)$ are known for $d=1$ (Discrete Fourier Transform on the line, see [HIKP12a]) and $d=\log_{2}N$ (the Hadamard transform, see [GL89]), but for all intermediate values of $d$ the runtime scales exponentially in $d$ . Given that FFT itself is dimension-insensitive, this strongly suggests that exciting new algorithmic techniques can be developed for the high-dimensional version of the problem. Our paper designs the first approach to high dimensional Sparse FFT that does not suffer from the curse of dimensionality, and naturally leads to several exciting open problems that we hope will spur further progress in this area.

In addition, we note that rather high-dimensional versions of the Fourier transform arise in applications (e.g., 2D, 3D and 4D-NMR in medical imaging), and designing practical Sparse FFT algorithms for this regime is an important problem. We hope that new techniques for dimension-independent Sparse FFT will lead to progress in this direction as well.

Sample complexity of high-dimensional Sparse FFT.

We note that, besides runtime, another very important parameter of a Sparse FFT algorithm is sample complexity, i.e., the number of samples that an algorithm needs to access in time domain in order to compute the top few coefficients of the Fourier transform. The sample complexity of Sparse FFT, unlike runtime, does not suffer from a curse of dimensionality. Indeed, there exist several algorithms with $\widetilde{O}(N)$ runtime that can recover the top $k$ coefficients of $\widehat{x}$ using only $k\operatorname{poly}(\log N)$ accesses in time domain, irrespective of the dimensionality of the problem. This can be achieved, for example, using either results on the restricted isometry property (RIP) [CT06, RV08, Bou14, CGV12, HR17], or using the filtering approach developed in the Sparse FFT literature, with $\widetilde{O}(N)$ decoding time. Thus, the challenge is to achieve sublinear runtime without an exponential dependence on the dimension.

We now outline existing approaches to Sparse FFT and explain why they fail to scale well in high dimensions:

State-of-the-art approaches to Sparse FFT and their lack of scalability in high dimensions.

The main idea behind many recently developed algorithms for the Sparse FFT problem is the “hashing” approach inherited from sparse recovery with arbitrary linear measurements. Given access to a signal $x:[n]^{d}\to{\mathbb{C}}$ , one designs linear measurements of $x$ that allow one to “hash” the nonzero positions of $\widehat{x}$ into a number of “buckets.” The number of buckets $B=b^{d}$ is chosen to be a constant factor larger than the sparsity $k$ to ensure that a large constant fraction of the nonzero positions of $\widehat{x}$ are isolated in their buckets. Every isolated element can be recovered and subtracted from $x$ for future iterations of the same hashing scheme, thereby ensuring convergence. The idea of hashing is implemented via filtering: one designs a filter $G:[n]^{d}\to{\mathbb{C}}$ such that $\widehat{G}$ approximates a “bucket,” i.e., $\widehat{G}$ is close to $1$ on an $\ell_{\infty}$ ball of side length $\approx(N/B)^{1/d}=n/b$ in dimension $d$ . The content of the $\bm{j}$ -th ‘bucket’, for $\bm{j}\in[\bm{B}]$ , is then

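$$\widehat{(x_{\cdot-a}\cdot G)}_{\bm{j}\cdot n/b}=\sum_{\bm{f}\in[n]^{d}}\widehat{x}_{\bm{f}}\,e^{2\pi i\bm{f}^{T}a/n}\cdot\widehat{G}_{\bm{j}\cdot n/b-\bm{f}}.\tag{2}$$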

Since $\widehat{G}$ is essentially $1$ on the $\ell_{\infty}$ ball around the center $\bm{j}\cdot n/b$ of the ‘bucket’ and essentially zero outside, (2) gives the algorithm time domain access to the restriction of $\widehat{x}$ to the “bucket,” i.e., the essential support of $\widehat{G}$ , where $a\in[n]^{d}$ is the location in time domain at which the signal is being accessed. A pseudorandom permutation of the frequency space ensures that such a bucket is likely to contain just a single element of the support, which enables the algorithm to recover at least a constant fraction of elements in a single round and perform iterative recovery. Furthermore, if the (essential) support of $G$ in time domain is small, one obtains an efficient algorithm.

The difficulty that arises in using (2) in high dimensions is the fact that it is not known how to ensure that $\widehat{G}$ is close to $1$ in an appropriately defined “bucket” while simultaneously ensuring that $|\mathrm{supp}\,G|$ is small. For example, the filters constructed in [HIKP12a] ensure that $\widehat{G}$ is polynomially close to $1$ in Fourier domain, but this comes at the expense of $|\mathrm{supp}\,G|$ being larger than $k$ (the ideal support size) by a factor of $\Theta(\log n)$ , and this effect is even more pronounced in higher dimensions, resulting in a $\log^{d}n$ loss in runtime. The other extreme would be to choose $G$ to be equal to $1$ on an $\ell_{\infty}$ ball with $k$ points around the origin, but in that case, its Fourier transform $\widehat{G}$ is the sinc function, which is only a constant factor approximation to the indicator of the corresponding $\ell_{\infty}$ box in Fourier domain (i.e., the ideal “bucket”). In dimension $d$ , the approximation degrades to $c^{d}$ for some constant $c\in(0,1)$ , leading to exponential loss in runtime. Indeed, suppose that all elements of $\widehat{x}$ have roughly the same value. Then for a given element $\bm{f}\in\mathrm{supp}\,\widehat{x}$ , the expected contribution of other elements to the noise in the “bucket” that $\bm{f}$ is hashed to is $||\widehat{x}||_{2}^{2}/B$ , but the contribution of $\widehat{x}_{\bm{f}}$ to its own bucket is (most of the time) only $c^{d}$ of its value, and, hence, only an exponentially small fraction of coefficients can be recovered in a given round of hashing (see footnote 1).

Footnote 1: In addition, the discussion above assumes the presence of an approximate pairwise hashing lemma for high dimensions that does not lose an exponential factor in the dimension (it is known that such a lemma holds with at most about a factor of $2^{d}$ loss [IK14], but no dimension-independent version is available in the literature).
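The $c^{d}$ degradation of the box filter can be checked numerically. The sketch below is an illustration under stated assumptions, not a computation from the paper: it takes a one-dimensional box of $b$ time samples (normalized so that $\widehat{G}_{0}=1$ ), measures the average of $|\widehat{G}|$ over the ideal bucket of half-width $n/(2b)$ , and raises that constant to the power $d$ , which is what a tensor-product filter gives when the coordinates of a frequency are spread independently over the bucket. The values of `n` and `b` are arbitrary.

```python
import numpy as np

n, b = 1024, 16                          # signal length and number of buckets (assumed)
G = np.zeros(n)
G[:b] = 1.0 / b                          # box filter on b time samples, sum = 1
G_hat = np.fft.fft(G)                    # Dirichlet (periodic sinc) kernel

half = n // (2 * b)
bucket = np.r_[0:half + 1, n - half:n]   # frequencies within n/(2b) of the origin
c = np.abs(G_hat[bucket]).mean()         # average passband gain in one dimension

for d in (1, 2, 4, 8, 16):
    # Average gain of the d-fold tensor product filter on its l_infinity bucket.
    print(d, c ** d)
```

Because `c` is a constant strictly below 1, the printed gain shrinks geometrically in `d`, which is the effect the paragraph above describes.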

Related work.

In [CI17], the authors presented a deterministic Sparse Fourier transform algorithm for the Hadamard transform, i.e., $d=\log_{2}N$ , that runs in nearly linear time in the sparsity parameter $k$ , but it is not known how this extends to lower dimensions. In [Iwe10b, Iwe12] the author gives an $\widetilde{O}(k^{2})$ time deterministic algorithm for the Sparse Fourier Transform, but the algorithm only applies to a related but distinctly easier problem. Specifically, the problem considers a continuous function on $[0,2\pi)$ whose Fourier transform is bandlimited and sparse. The presented algorithm requires sampling the signal at arbitrary locations in $[0,2\pi)$ . A natural approach is to emulate off-grid sampling (i.e., sampling at arbitrary points in $[0,2\pi)$ ) given the discrete samples that we have access to; this is achieved in [MZIC17], giving an $\widetilde{O}(k^{2})$ time deterministic algorithm for one-dimensional Sparse FFT. However, this is a challenging task in the multi-dimensional setting for several reasons. First, we are operating under the sparsity assumption alone, and no powerful general interpolation techniques that work under the sparsity assumption alone are available, to the best of our knowledge. Furthermore, even if the function were bandlimited, a natural approach to interpolation would involve some form of Taylor expansion or semi-equispaced Fourier Transform; however, both approaches incur a $\log^{d}N$ loss in dimension $d$ . Indeed, similar exponential dependence on the dimensionality of the problem manifests itself in Fast Multipole Methods [GR87b, BG97] and the Sparse FFT algorithms mentioned above. 
Finally, one should also note that whereas the problem of computing the Fourier transform on a $p\times q$ grid with $p$ mutually prime with $q$ is equivalent to a one-dimensional Fourier transform on $\mathbb{Z}_{pq}$ , the standard case of side lengths that are powers of two (for which we have the most efficient FFT algorithms) does not admit such a reduction. Furthermore, such a reduction appears to be quite challenging in high dimensions for reasons outlined above, and even more so for highly oscillatory functions that Sparse FFT algorithms need to handle.

2 Overview of our results and techniques

Prior works on Sparse FFT have primarily focused on efficiently implementing hashing-based ideas developed in the extensive literature on sparse recovery using general linear measurements (e.g., [GHI+13]), which meets with several difficulties. In particular, the presence of multiplicative subgroups in $\mathbb{Z}_{n}^{d}$ has been a hurdle in analyzing Sparse FFT algorithms: while aliasing filters have optimal performance from the point of view of the uncertainty principle, their applications have been limited due to the fact that frequencies that belong to the same subgroup get hashed together if such filters are used, making it impossible to reason about isolation of individual frequencies. At the same time, FFT itself owes much of its efficiency to the very same multiplicative subgroups of $\mathbb{Z}_{n}^{d}$ , and a natural question is whether one can design a Sparse FFT algorithm that operates on similar principles. This is precisely the approach that we take.

Adaptive aliasing filters.

The main technical innovation that allows us to avoid exponential dependence on the dimension and obtain Theorem 1 is a new family of filters for isolating a subset of frequencies in Fourier domain in a sparse signal $\widehat{x}$ using few samples in time domain. We refer to the family of filters as adaptive aliasing filters.

Definition 1 ( $(\bm{f},S)$ -isolating filter, informal version of Definition 11, see Section 4).

Suppose $n$ is a power of two integer and $S\subseteq[n]^{d}$ for a positive integer $d$ . Then, for any frequency $\bm{f}\in S$ , a filter $G:[n]^{d}\to{\mathbb{C}}$ is called $(\bm{f},S)$ -isolating if $\widehat{G}_{\bm{f}}=1$ and $\widehat{G}_{\bm{f}^{\prime}}=0$ for every $\bm{f}^{\prime}\in S\setminus\{\bm{f}\}$ .

We explain the intuition behind the construction of the filter in Section 2.1 below and provide the details later in Section 4.

The reason why an $(\bm{f},S)$ -isolating filter $G$ is useful lies in the fact that for every signal $x\in{\mathbb{C}}^{n^{d}}$ with $\mathrm{supp}\,\widehat{x}\subseteq S$ we have, for all $\bm{t}\in[n]^{d}$ ,

$$\sum_{\bm{j}\in[n]^{d}}x_{\bm{j}}G_{\bm{t}-\bm{j}}=\frac{1}{N}\sum_{\bm{f}^{\prime}\in[n]^{d}}\widehat{x}_{\bm{f}^{\prime}}\cdot\widehat{G}_{\bm{f}^{\prime}}\cdot e^{2\pi i\frac{(\bm{f}^{\prime})^{T}\bm{t}}{n}}=\frac{1}{N}\widehat{x}_{\bm{f}}\,e^{2\pi i\frac{\bm{f}^{T}\bm{t}}{n}}.$$

Thus, the filter $G$ enables access to the time domain representation of the restriction of $\widehat{x}$ to $\bm{f}$ in time proportional to $|\mathrm{supp}\,G|$ , at any point $\bm{t}$ . Of course, this is only useful if the support of $G$ is small. The main technical lemma of our paper shows that for every support set $S\subseteq[n]^{d}$ , there exists an $\bm{f}\in S$ that can be isolated efficiently:

Lemma 1 (Informal version of Corollary 2 in Section 4).

For every power of two $n\geq 1$ , positive integer $d$ , and set $S\subseteq[n]^{d}$ , there exists an $\bm{f}\in S$ and an $(\bm{f},S)$ -isolating filter $G$ such that $|\mathrm{supp}\leavevmode\nobreak\ G|\leq|S|$ .

The proof of the lemma is given in Section 4.
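A one-dimensional numerical check of Definition 1 and Lemma 1 can be sketched as follows. It builds $\widehat{G}$ from the recursive product rule listed among the paper's equations (one factor $\frac{1+e^{2\pi i(\xi-f)/2^{q}}}{2}$ for every level $q$ at which the ancestor of $f$ has two children), and it picks $f$ as the frequency with the fewest such levels; that selection rule, the helper names, and the example set `S` are assumptions made for illustration. The dense `ifft` is used only to inspect the filter and is not how a sublinear algorithm would proceed.

```python
import numpy as np

def branching_levels(f, S, n):
    """Levels q in 1..log2(n) where the level-(q-1) ancestor of leaf f has two children."""
    levels = []
    for q in range(1, n.bit_length()):
        half, full = 1 << (q - 1), 1 << q
        # Some g in S agrees with f modulo 2^(q-1) but differs modulo 2^q.
        if any((g - f) % half == 0 and (g - f) % full != 0 for g in S):
            levels.append(q)
    return levels

def isolating_filter(f, S, n):
    """Return (G, G_hat) with G_hat[f] = 1 and G_hat[g] = 0 for g in S, g != f."""
    xi = np.arange(n)
    G_hat = np.ones(n, dtype=complex)
    for q in branching_levels(f, S, n):
        G_hat *= (1 + np.exp(2j * np.pi * (xi - f) / 2**q)) / 2
    return np.fft.ifft(G_hat), G_hat

n, S = 64, [3, 10, 17, 40, 41]           # hypothetical support set
f = min(S, key=lambda g: len(branching_levels(g, S, n)))
G, G_hat = isolating_filter(f, S, n)
support_size = int(np.sum(np.abs(G) > 1e-9))

# Check the filtering identity at one time point t for a signal supported on S.
rng = np.random.default_rng(1)
x_hat = np.zeros(n, dtype=complex)
x_hat[S] = rng.normal(size=len(S)) + 1j * rng.normal(size=len(S))
x = np.fft.ifft(x_hat)
t = 5
lhs = sum(x[j] * G[(t - j) % n] for j in range(n))
rhs = x_hat[f] * np.exp(2j * np.pi * f * t / n) / n
print(f, support_size, abs(lhs - rhs))
```

Each factor doubles the time-domain support at most, so the filter has $2^{w}$ nonzeros when $f$ has $w$ branching ancestors; for this `S` the chosen frequency has $w=2$ , giving a support of 4, which is at most $|S|=5$ .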

Accessing the residual signal.

Lemma 1 suggests a natural approach to the estimation problem with Fourier measurements in high dimensions: iteratively construct an $(\bm{f},S)$ -isolating filter $G$ , estimate $\bm{f}$ , remove $\bm{f}$ from $S$ , and proceed. The hope is that we can essentially assume that we are given access to ${\mathcal{F}}^{-1}(\widehat{x}_{S\setminus\{\bm{f}\}})$ once we have estimated $\bm{f}$ . In general, if we have been able to estimate the values of $\widehat{x}_{\bm{f}}$ for all $\bm{f}\in C$ with some $C\subseteq S$ , then we would like to obtain access to

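$$x_{\bm{t}}-\frac{1}{N}\sum_{\bm{f}\in C}\widehat{x}_{\bm{f}}\cdot e^{2\pi i\frac{\bm{f}^{T}\bm{t}}{n}}.$$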

Note that we would need $x_{\bm{t}}$ for $\bm{t}$ in the support of $G$ at the next iteration, and this support is generally a rather complicated set of size $\Omega(k)$ , from which we need to subtract the inverse Fourier transform of the signal estimated so far. This problem is the non-uniform Fourier transform problem, and no subquadratic methods for subtraction are known even in dimension $d=1$ when the set in time domain that we want to compute the inverse Fourier transform on is arbitrary. Even if the target set is an $\ell_{\infty}$ -box, the best known algorithms for this problem run in time $\Omega(k\log^{d}(1/\epsilon))$ , where $\epsilon>0$ is the precision parameter of the computation; this reduces to quadratic time even when $d=\Omega(\log k/\log\log k)$ and inverse polynomial in $k$ precision is desired. Thus, subtracting from time domain would result in at least cubic runtime in $k$ . Instead, we subtract the influence of the residual in frequency domain, which requires $O(k)$ evaluations of $\widehat{G}$ (as we show, $\widehat{G}$ can be evaluated at a cost of just $O(\log N)$ ). Note that it is crucial here that we peel off one coefficient at a time. Any improvements to this process, if they were to achieve $k^{2-\Omega(1)}$ runtime overall, would likely also imply improvements in the computation of approximate non-uniform Fourier transform: given a $k$ -sparse signal $\widehat{x}$ and a set $T\subseteq[n]^{d}$ with $|T|\leq k$ , output $y:[n]^{d}\to{\mathbb{C}}$ such that $||(x-y)_{T}||_{2}^{2}\leq\epsilon||x||_{2}^{2}$ . However, it seems plausible that quadratic runtime in $k$ is essentially optimal for the non-uniform Fourier transform problem: specifically, that under natural complexity theoretic assumptions there exists no algorithm for the $\epsilon$ -approximate non-uniform Fourier transform problem with runtime $k^{2-\Omega(1)}$ when $d=\Omega(\log k)$ and $\epsilon<1/k^{C}$ for sufficiently large constant $C$ . 
We note that current techniques do not provide a subquadratic algorithm even for simple sets $T$ such as the $\ell_{\infty}$ box with $k$ points in dimension $d=\Omega(\log k/\log\log k)$ (due to the $k\log^{d}(1/\epsilon)$ dependence mentioned above; a similar exponential dependence on the dimension is present in Fast Multipole Methods [GR87a, BG]). For an arbitrary set $T$ no subquadratic algorithm is known even when $d=1$ .

Putting it together: estimation with Fourier measurements

Combining the aforementioned ideas, we are able to develop a deterministic algorithm for the estimation problem with Fourier measurements in high dimensions:

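$$\begin{array}{ll}\textbf{Input:}&\text{access to }x:[n]^{d}\to{\mathbb{C}},\ \text{subset }S\subseteq[n]^{d}\text{ such that }\mathrm{supp}\,\widehat{x}\subseteq S\\ \textbf{Output:}&\widehat{x}_{S}\end{array}\tag{3}$$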

For the estimation problem (3) we obtain the following result.

Theorem 2 (Estimation guarantee, informal version of Theorem 5 in Section 5).

Suppose $n$ is a power of two integer, $d$ is a positive integer, and $S\subseteq[n]^{d}$ . Then, for any signal $x\in{\mathbb{C}}^{n^{d}}$ with $\mathrm{supp}\leavevmode\nobreak\ {\widehat{x}}\subseteq S$ , the procedure Estimate $(x,S,n,d)$ (see Algorithm 2) recovers $\widehat{x}$ . Moreover, the sample complexity of this procedure is $O(|S|^{2})$ and its runtime is $O(|S|^{2}\cdot\log N)$ . Furthermore, the procedure Estimate is deterministic.
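The peeling strategy described above (isolate one frequency, read its coefficient, and subtract the already-estimated coefficients in frequency domain) can be sketched in one dimension as follows. This is an illustrative reconstruction, not the paper's Algorithm 2: the helper names, the rule for choosing which frequency to peel, and the example instance are assumptions, and the naive search for that frequency means the sketch does not attempt to match the stated $O(|S|^{2}\log N)$ runtime.

```python
import numpy as np

def branching_levels(f, S, n):
    """Levels q where the level-(q-1) ancestor of leaf f has two children in Tree(S, n)."""
    levels = []
    for q in range(1, n.bit_length()):
        half, full = 1 << (q - 1), 1 << q
        if any((g - f) % half == 0 and (g - f) % full != 0 for g in S):
            levels.append(q)
    return levels

def filter_time(f, levels, n):
    """Sparse time-domain filter {t: G_t}; each level convolves with a two-point mask."""
    G = {0: 1.0 + 0j}
    for q in levels:
        shift, phase = n >> q, np.exp(-2j * np.pi * f / 2**q)
        nxt = {}
        for t, c in G.items():
            nxt[t] = nxt.get(t, 0) + c / 2
            u = (t - shift) % n               # delta(t + n/2^q) sits at time -n/2^q
            nxt[u] = nxt.get(u, 0) + c * phase / 2
        G = nxt
    return G

def filter_hat(xi, f, levels):
    """Evaluate G_hat at a single frequency xi with one factor per level."""
    out = 1.0 + 0j
    for q in levels:
        out *= (1 + np.exp(2j * np.pi * (xi - f) / 2**q)) / 2
    return out

def estimate(sample, S, n):
    """Recover x_hat on S from time-domain queries sample(t); returns (estimates, #queries)."""
    remaining, est, used = set(S), {}, 0
    while remaining:
        f = min(remaining, key=lambda g: len(branching_levels(g, remaining, n)))
        levels = branching_levels(f, remaining, n)
        G = filter_time(f, levels, n)
        used += len(G)
        h = sum(sample((-t) % n) * c for t, c in G.items())   # sum_j x_j G_{-j}
        # n*h = x_hat[f] + sum over already-estimated g of x_hat[g] * G_hat(g).
        est[f] = n * h - sum(v * filter_hat(g, f, levels) for g, v in est.items())
        remaining.remove(f)
    return est, used

rng = np.random.default_rng(3)
n, S = 64, [3, 10, 17, 40, 41]                # hypothetical instance
x_hat = np.zeros(n, dtype=complex)
x_hat[S] = rng.normal(size=len(S)) + 1j * rng.normal(size=len(S))
x = np.fft.ifft(x_hat)
est, used = estimate(lambda t: x[t], S, n)
print(used, max(abs(est[f] - x_hat[f]) for f in S))
```

Each round queries at most as many samples as there are frequencies left, so the total number of queries is at most $|S|(|S|+1)/2$ , in line with the $O(|S|^{2})$ sample complexity of Theorem 2.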

In the rest of this section, we give an overview of our techniques. Throughout the section, we present our results for the one-dimensional setting, as this makes notation simpler. All our results translate to the high-dimensional setting without any loss—see Section 4.2 for details.

2.1 Recovery via adaptive aliasing filters

Our main theorem is the following, which presents an algorithm for problem (1) for worst-case signals.

Theorem (Sparse FFT for worst-case signals).

For any power of two integer $n$ , any positive integer $d$ , and any signal $x\in{\mathbb{C}}^{n^{d}}$ with $\|{\widehat{x}}\|_{0}=k$ , the procedure SparseFFT $(x,n,d,k)$ in Algorithm 4 recovers ${\widehat{x}}$ . Moreover, the sample complexity of this procedure is $O(k^{3}\log^{2}k\log^{2}N)$ and its runtime is $O(k^{3}\log^{2}k\log^{2}N)$ .

The major difference between estimation and recovery (i.e., problem (3) vs. (1)) is the fact that in the latter problem, the set $S$ of frequencies is unknown to us: the algorithm is only given access to $x$ and an upper bound on the sparsity of $\widehat{x}$ . Since our $(f,S)$ -isolating filter is adaptive, i.e., depends on $S$ , this appears to present a challenge. However, we circumvent this challenge by constructing a sequence of successive approximations to the set $S$ . In dimension $1$ , these approximations amount to reducing $S$ modulo $2^{j}$ for all $j=1,\ldots,\log_{2}n$ , and adaptively probing to learn which of the residue classes are nonzero. As before, our approach extends seamlessly to high dimensions by simply concatenating the $d$ coordinates into a single vector. Note that this is in sharp contrast to all previously known approaches, which are more efficient in low dimensions, but incur an exponential loss overall. We would like to note that at a high level one can view our filtering approach as a way to prune the FFT computation graph in a way that suffices for recovery of a $k$ -Fourier sparse vector.
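The successive approximations of $S$ modulo $2^{j}$ can be visualized with plain aliasing: subsampling $x$ with stride $n/2^{j}$ and taking a length- $2^{j}$ DFT yields, for every residue $r$ , the sum of $\widehat{x}_{f}$ over $f\equiv r\pmod{2^{j}}$ . The sketch below is only meant to show this structure and is not the paper's probing procedure: it reads $2^{j}$ samples at level $j$ (so it is not sublinear at deep levels), and a class whose coefficients cancel would look empty. The function name `class_sums` and the example instance are assumptions.

```python
import numpy as np

def class_sums(x, j):
    """For each r in [2^j], the sum of x_hat[f] over f congruent to r modulo 2^j."""
    n = len(x)
    stride = n >> j
    # Subsampling in time aliases frequencies; the factor undoes NumPy's 1/n in ifft.
    return stride * np.fft.fft(x[::stride])

rng = np.random.default_rng(2)
n, S = 64, [3, 10, 17, 40, 41]                # hypothetical support
x_hat = np.zeros(n, dtype=complex)
x_hat[S] = (1 + rng.random(len(S))) * np.exp(2j * np.pi * rng.random(len(S)))
x = np.fft.ifft(x_hat)

for j in range(1, n.bit_length()):
    sums = class_sums(x, j)
    nonzero = sorted(r for r in range(2**j) if abs(sums[r]) > 1e-9)
    print(j, nonzero)                         # residues of S modulo 2^j
```

At the last level the residue classes are single frequencies, so the printed list is the support itself.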

We outline the main ideas in the one-dimensional setting here to simplify the presentation (see Section 4.2 for the high-dimensional version of the argument). Let $N=n$ be the length of the signal and $d=1$ be the dimension for $n$ a power of two. We define $T^{\mathrm{full}}_{n}$ to be a full binary tree of height $\log_{2}n$ and define a labelling scheme on the vertices as follows.

Definition.

Suppose $n$ is a power of two. Let $T^{\mathrm{full}}_{n}$ be a full binary tree of height $\log_{2}n$ , where for every $j\in\{0,1,\dots,\log_{2}n\}$ , the nodes at level $j$ (i.e., at distance $j$ from the root) are labeled with integers in $[2^{j}]$ . For a node $v\in T^{\mathrm{full}}_{n}$ , we let $f_{v}$ be its label. The label of the root is $f_{\mathrm{root}}=0$ . The labelling of $T^{\mathrm{full}}_{n}$ satisfies the condition that for every $j\in[\log_{2}n]$ and every $v$ at level $j$ , the right and left children of $v$ have labels $f_{v}$ and $f_{v}+2^{j}$ , respectively. Note that the root of $T^{\mathrm{full}}_{n}$ is at level 0, while the leaves are at level $\log_{2}n$ .

The tree captures the computation graph of the FFT algorithm, where leaves correspond to frequencies in $[n]$ (given by the label), and for any $j\in\{0,1,\dots,\log_{2}n\}$ , the nodes at level $j$ (i.e., at distance $j$ from the root) correspond to congruence classes of frequencies modulo $2^{j}$ , as specified by the labelling (see Figure 1(a)).
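To make the labelling concrete, the following minimal Python sketch generates the labels of $T^{\mathrm{full}}_{n}$ level by level, using only the rule stated above (the right child keeps $f_{v}$ , the left child gets $f_{v}+2^{j}$ ). The function name `full_tree_levels` and the list-of-lists representation are illustrative assumptions and are not taken from the paper.

```python
def full_tree_levels(n):
    """Return levels[j] = labels of the nodes at level j of the full tree."""
    assert n >= 1 and n & (n - 1) == 0, "n must be a power of two"
    levels = [[0]]                      # the root has label 0
    j = 0
    while (1 << j) < n:
        nxt = []
        for f in levels[j]:
            nxt.append(f)               # right child keeps the label f_v
            nxt.append(f + (1 << j))    # left child gets the label f_v + 2^j
        levels.append(nxt)
        j += 1
    return levels


levels = full_tree_levels(8)
for j, labels in enumerate(levels):
    # The labels at level j are exactly the residues modulo 2^j.
    assert sorted(labels) == list(range(1 << j))
```

In this representation the leaves appear in bit-reversed order, which is the familiar ordering of the FFT computation graph.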

Note that the full FFT algorithm starts from the root of $T^{\mathrm{full}}_{n}$ and computes the congruence classes of the Fourier transform of signal $x$ at each level of this tree iteratively. Because it can reuse the computations from each level for computing the next levels, the total runtime of FFT is $O(n\log_{2}n)$ .

In order to speed up the computation for sparse signals, we introduce the notion of a splitting tree, which is nothing but the subtree of $T^{\mathrm{full}}_{n}$ that contains the nonzero locations of $\widehat{x}$ together with paths connecting them to the root. Given a set $S\subseteq[n]$ (the support of $\widehat{x}$ in Fourier domain), we define the splitting tree of the set $S$ as follows:

Definition (Splitting tree).

Let $n$ be a power of two. For every $S\subseteq[n]$ , the splitting tree $T=\mathrm{Tree}(S,n)$ of a set $S$ is a binary tree that is the subtree of $T^{\mathrm{full}}_{n}$ that contains, for every $j\in[\log_{2}n]$ , all nodes $v\in T^{\mathrm{full}}_{n}$ at level $j$ such that $\{f\in S:f\equiv f_{v}\pmod{2^{j}}\}\neq\emptyset$ .
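As a small illustration of this definition, the sketch below represents $\mathrm{Tree}(S,n)$ by the set of node labels at each level, using the fact that a level- $j$ node is kept exactly when its residue class modulo $2^{j}$ meets $S$ . The helper name `splitting_tree` and the per-level representation are assumptions made for illustration; the leaf level $\log_{2}n$ is included so that the leaves are the elements of $S$ .

```python
def splitting_tree(S, n):
    """Return levels[j] = sorted labels of the level-j nodes of Tree(S, n)."""
    assert n >= 1 and n & (n - 1) == 0, "n must be a power of two"
    log_n = n.bit_length() - 1
    # A node at level j with label r is present iff some f in S has f = r mod 2^j.
    return [sorted({f % (1 << j) for f in S}) for j in range(log_n + 1)]


tree = splitting_tree({1, 5, 6, 20}, 32)
```

Each level has at most $|S|$ nodes, so the whole tree has at most $|S|(\log_{2}n+1)$ nodes regardless of $n$ .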

An illustration of such a tree is given in Figure 1(b). In order to recover the identities of the elements in $S$ , our algorithm performs an exploration of this tree. At every point, the algorithm constructs a filter $G$ that isolates frequencies in a given subtree and tests whether that subtree contains a nonzero signal. In order to make this work, we need a construction of filters that isolates the entire subtree as opposed to only one element, as Definition 1 does. Fortunately, the actual $(f,S)$ -isolating filters $G$ constructed in Lemma 1 satisfy precisely this property. The stronger isolation properties are captured by the following definition:

Definition (Frequency cone of a leaf of $T$ ).

For every power of two $n$ , subtree $T$ of $T^{\mathrm{full}}_{n}$ , and vertex $v\in T$ which is at level $l_{T}(v)$ from the root, define the frequency cone of $v$ with respect to $T$ as

$$\operatorname{\mathrm{FrequencyCone}}_{T}(v):=\left\{f\in[n]:f\equiv f_{v}\pmod{2^{l_{T}(v)}}\right\}.$$

Intuitively, the frequency cone of a node $v$ in $T$ captures all potential nonzeros of $\widehat{x}$ that belong to the subtree of $v$ in $T$ (see Figure 2). Our adaptive filter construction lets us obtain time domain access to the corresponding part of the frequency space:

Definition ( $(v,T)$ -isolating filter).

For every integer $n$ , subtree $T$ of $T^{\mathrm{full}}_{n}$ , and leaf $v$ of $T$ , a filter $G\in{\mathbb{C}}^{n}$ is called $(v,T)$ -isolating if the following conditions hold:

• For all $f\in\operatorname{\mathrm{FrequencyCone}}_{T}(v)$ , we have $\widehat{G}_{f}=1$ .

• For every $f^{\prime}\in\bigcup_{\begin{subarray}{c}u\neq v\\ u:\text{ leaf of }T\end{subarray}}\operatorname{\mathrm{FrequencyCone}}_{T}(u)$ , we have $\widehat{G}_{f^{\prime}}=0$ .

Note that for all signals $x\in{\mathbb{C}}^{n}$ with $\mathrm{supp}\leavevmode\nobreak\ \widehat{x}\subseteq\bigcup_{u:\text{\leavevmode\nobreak\ leaf of\leavevmode\nobreak\ }T}\operatorname{\mathrm{FrequencyCone}}_{T}(u)$ and $t\in[n]$ ,

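$$\sum_{j\in[n]}x_{j}G_{t-j}=\frac{1}{n}\sum_{f\in\operatorname{\mathrm{FrequencyCone}}_{T}(v)}\widehat{x}_{f}\,e^{2\pi i\frac{ft}{n}}.$$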

Iterative tree exploration process leading to an algorithm with $\widetilde{O}(k^{3})$ runtime.

Now that we have defined the framework for our algorithm, we need to specify the order in which the algorithm will be accessing the leaves of the tree in order to minimize runtime. This is governed by the cost of constructing and using a $(v,T)$ -isolating filter for various nodes $v$ in $T$ . To quantify cost, we introduce the notion of a weight of a leaf in the tree.

Definition (Weight of a leaf).

Suppose $n$ is a power of two. Let $T$ be a subtree of $T^{\mathrm{full}}_{n}$ . Then for any leaf $v\in T$ , we define its weight $w_{T}(v)$ with respect to $T$ to be the number of ancestors of $v$ in tree $T$ with two children.

It turns out that the techniques from Lemma 1 also yield the following.

Lemma 2 (Informal version of Lemma 3 in Section 4).

Suppose $n$ is a power of two. Let $T$ be a subtree of $T^{\mathrm{full}}_{n}$ . Then for any leaf $v\in T$ , there exists a $(v,T)$ -isolating filter $G$ with $|\text{supp}\leavevmode\nobreak\ G|\leq 2^{w_{T}(v)}$ such that $G$ and $\widehat{G}$ can be evaluated at $O(\log N)$ cost per point.

Before describing the algorithm we give an example illustrating filter support in time domain. Consider a complete binary tree $T$ of height $h\ll\log_{2}n$ . Suppose that $v_{0}$ is some vertex at level $h_{0}<h$ of this tree. Now we take the subtree rooted at $v_{0}$ and add an appendage of length $\log_{2}n-h$ to $v_{0}$ . The appendage is a path of $\log_{2}n-h$ nodes each of which has a single child. This doesn’t change the weight of any of the leaves of the original tree because every node on the appendage has exactly one child. One can see an example of such a tree in Fig. 3. Suppose that the leaf $v$ is a leaf of the subtree rooted at $v_{0}$ , which is moved far from the root by the appendage. In order to isolate $v$ from the elements that are not in the subtree of $v_{0}$ we need a filter which is $(n/2^{h_{0}})$ -periodic in time domain, and in order to isolate it from the rest of the elements in the subtree of $v_{0}$ the filter needs to sample the signal at a fine grid of length $2^{h-h_{0}}$ . Note that the support of a $(v,T)$ -isolating filter $G$ is $\mathrm{supp}\leavevmode\nobreak\ {G}=\left\{i+(n/2^{h_{0}})\cdot j;\,j\in[2^{h_{0}}],i\in[2^{h-h_{0}}]\right\}$ . In Fig. 3 we exhibit a $(v,T)$ -isolating filter $G$ which is constructed based on Lemma 3, where $v$ and $T$ correspond to this instance of splitting tree.

Given Lemma 2, our algorithm is natural. We find the vertex $v^{*}=\text{argmin}_{v\in T}w_{T}(v)$ , which, by Kraft’s inequality, satisfies $w_{T}(v^{*})\leq\log_{2}k$ . We then define an auxiliary tree $T^{\prime}$ by appending a left child $a$ and a right child $b$ to $v^{*}$ . Then for each of the children $a,b$ , we, in turn, construct a filter $G$ that isolates them from the rest of $T$ (i.e., from the frequency cones of other nodes in $T$ ) and check whether the corresponding restricted signals are nonzero. The latter is unfortunately a nontrivial task, since the sparsity of these signals can be as high as $k$ , and detecting whether a $k$ -sparse signal is nonzero requires $\Omega(k)$ samples. However, a fixed set of $k\log^{3}N$ locations that satisfies the restricted isometry property (RIP) can be selected, and accessing the signal on those values suffices to test whether it is nonzero. The overall runtime becomes $\widetilde{O}(k^{3})$ : the isolating filter has support at most $2k$ , while the number of samples needed to test whether the two subtrees of $v^{*}$ are nonempty is $\widetilde{O}(k)$ , so peeling off $\leq k$ elements takes $\widetilde{O}(k^{3})$ time overall. This results in Theorem 2.1 (the algorithm is presented as Algorithm 4).
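The bound $w_{T}(v^{*})\leq\log_{2}k$ can be checked on a small example. The sketch below computes the weight of every leaf of $\mathrm{Tree}(S,n)$ directly from the definition and evaluates the Kraft sum $\sum_{v}2^{-w_{T}(v)}$ . The helper name `leaf_weights`, the quadratic-time test for branching ancestors, and the choice of $S$ are assumptions made for illustration only.

```python
from math import log2


def leaf_weights(S, n):
    """Map each f in S to the weight of the leaf labelled f in Tree(S, n)."""
    log_n = n.bit_length() - 1
    weights = {}
    for f in S:
        w = 0
        for j in range(log_n):
            # The level-j ancestor of f has two children iff some s in S agrees
            # with f modulo 2^j but differs from f modulo 2^(j+1).
            if any((s - f) % (1 << j) == 0 and (s - f) % (1 << (j + 1)) != 0
                   for s in S):
                w += 1
        weights[f] = w
    return weights


S = {0, 1, 2, 4}                       # frequencies of Hamming weight <= 1 in [8]
weights = leaf_weights(S, 8)
kraft_sum = sum(2.0 ** -w for w in weights.values())
min_weight = min(weights.values())
```

Contracting every single-child node turns $T$ into a binary tree in which each internal node has two children and each leaf sits at depth $w_{T}(v)$ , so the Kraft sum equals 1 and some leaf must have weight at most $\log_{2}k$ .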

$\widetilde{O}(k^{2})$ runtime under random phase assumption.

We note that the runtime can be easily reduced to $\widetilde{O}(k^{2})$ if assumptions are made on the signal that ensure that its energy is evenly spread across time domain, making $\widetilde{O}(1)$ samples sufficient to detect whether the signal is zero or not. This occurs, for instance, if a signal’s frequencies satisfy distributional assumptions (e.g., the values have random phases). We present such a result in Section 7. It seems that even under this assumption on the values of the signal, since the support of the signal in Fourier domain is worst case, reducing the runtime below $k^{2}$ likely requires a major advance in techniques for non-uniform Fourier transform computation.

More formally, we introduce the notion of a worst-case signal with random phase as follows:

Definition (Worst-case signal with random phase).

For any positive integer $d$ and power of two $n$ , we define $x$ to be a worst-case signal with random phase having values $\{\beta_{\bm{f}}\}_{\bm{f}\in[n]^{d}}$ if

[TABLE]

independently for every $\bm{f}\in[n]^{d}$ . Furthermore, if $k$ of the values $\{\beta_{\bm{f}}\}_{\bm{f}\in[n]^{d}}$ are nonzero, then $x$ is said to be a worst-case $k$ -sparse signal with random phase and is guaranteed to have sparsity $k$ .

Note that “worst-case” in the above definition signifies the fact that the support of the signal is arbitrary (having no distributional assumptions), subject to a potential sparsity constraint. We then present the following theorem:

Theorem (Sparse FFT for worst-case signals with random phase).

For any power of two integer $n$ , positive integer $d$ , and worst-case $k$ -sparse signal with random phase $x\in{\mathbb{C}}^{n^{d}}$ , the procedure SparseFFT-RandomPhase $(x,n,d,k)$ in Algorithm 5 recovers ${\widehat{x}}$ with probability $1-\frac{1}{N^{2}}$ . Moreover, the sample complexity of this procedure is $O(k^{2}\log^{4}N)$ and its runtime is $O(k^{2}\log^{4}N)$ .

Impossibility of reducing the number of iterations (rounds of adaptivity): signals with low Hamming weight support.

We note that our algorithm differs from all prior works in that it uses many rounds of adaptivity. Indeed, the samples that our algorithm takes are guided by values of the signal that have been read in previously queried locations, which is in contrast to most prior Sparse Fourier Transform algorithms. Two notable exceptions in recent literature include the adaptive block Sparse FFT algorithms of [CKSZ17] and [CKPS16].

Our algorithm uses $k$ rounds of adaptivity, peeling off one element at a time. It would be desirable to reduce the number of rounds of adaptivity by perhaps peeling off many elements in one batch as opposed to one at a time. For example, if the locations of the nonzeros of $\widehat{x}$ are uniformly random in $[n]^{d}$ , then the splitting tree of $x$ is likely to be rather balanced (see, e.g. Fig. 5 for an illustration), so perhaps one can find a filter $G$ that has small support and can be efficiently used to isolate many coefficients at once? Indeed, this intuition turns out to be correct for signals with uniformly random supports—we show in Section 2.2 below (with details presented in Section 8) that this idea yields a $\widetilde{O}(k)$ time algorithm. However, rather surprisingly, adversarial instances exist that force the peeling process to use $k^{1-o(1)}$ rounds of adaptivity in the worst case, making our analysis essentially tight. We now present this adversarial instance.

Definition (Hamming ball).

For any power of two integer $n$ and any integer $0\leq c\leq\log_{2}n$ , we define $H_{c}^{n}$ to be the closed Hamming ball of radius $c$ centered at 0:

$$H_{c}^{n}=\left\{f\in[n]:w(f)\leq c\right\},$$

where $w(f)$ is the Hamming weight of the binary representation of $f$ , i.e., $w(f)$ is the number of ones in the binary representation of $f$ .

Note that $|H_{c}^{n}|=\sum_{j=0}^{c}\binom{\log_{2}n}{j}$ .
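The size formula can be confirmed by direct enumeration. The following sketch lists all $f\in[n]$ of Hamming weight at most $c$ and compares the count with $\sum_{j=0}^{c}\binom{\log_{2}n}{j}$ ; the function name `hamming_ball` and the parameters $n=32$ , $c=2$ are illustrative choices.

```python
from math import comb


def hamming_ball(n, c):
    """All f in [n] whose binary representation has at most c ones."""
    return [f for f in range(n) if bin(f).count("1") <= c]


n, c = 32, 2
H = hamming_ball(n, c)
log_n = n.bit_length() - 1
expected_size = sum(comb(log_n, j) for j in range(c + 1))
assert len(H) == expected_size
```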

Definition (Class of signals with low Hamming support).

For any power of two integer $n$ and any integer $c$ , let $\mathcal{X}^{n}_{c}$ denote the class of signals in ${\mathbb{C}}^{n}$ with support $H^{n}_{c}$ as in Definition 2.1,

[TABLE]

Note that for any $x\in\mathcal{X}^{n}_{c}$ we have that $\|x\|_{0}=\sum_{i=0}^{c}\binom{\log_{2}n}{i}$ , so for any $c\leq(\frac{1}{2}-\epsilon)\log_{2}n$ , the signals that are contained in class $\mathcal{X}^{n}_{c}$ are $\Theta\Big(\binom{\log_{2}n}{c}\Big)$ -sparse.

Definition (Low Hamming weight binary trees).

Suppose $n$ is a power of two integer. Then, we define a low Hamming weight binary tree $T_{c}^{n}$ inductively for $c=0,1,\dots,\log_{2}n$ :

1. $T_{0}^{n}$ is defined to be the unique tree of depth $\log_{2}n$ that has a single leaf node and satisfies the property that each non-leaf node has a single right child only. Thus, $T_{0}^{n}$ has $\log_{2}n+1$ nodes.

2. For any $c>0$ , $T_{c}^{n}$ is constructed as follows: Take $T_{0}^{n}$ and label the nodes in order from the root to the leaf as $0,1,\dots,\log_{2}n$ . Then, for each node $0\leq j<\log_{2}n$ , take a copy of $T_{c-1}^{n/2^{j+1}}$ and let its root be the left child of node $j$ . The resulting tree defines $T_{c}^{n}$ .

Note that all the leaves of $T^{n}_{c}$ are at level $\log_{2}n$ .

It is not hard to see that $T^{n}_{c}$ is in fact the splitting tree for the set $H^{n}_{c}$ and, hence, the number of its leaves is $\sum_{i=0}^{c}\binom{\log_{2}n}{i}$ . An illustration of the tree $T^{n}_{c}$ for $c=2$ and $n=32$ is shown in Figure 4.
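The leaf count can also be read off the inductive construction. In the sketch below, `num_leaves(c, m)` counts the leaves of $T_{c}^{n}$ for $m=\log_{2}n$ by following the two construction steps, and the result is compared with $\sum_{i=0}^{c}\binom{m}{i}$ . The function name and the memoised recursion are assumptions made for illustration.

```python
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def num_leaves(c, m):
    """Number of leaves of the tree T_c^n, where m = log2(n)."""
    if c == 0 or m == 0:
        return 1  # a single path (or a single node) has exactly one leaf
    # The spine T_0^n contributes one leaf. Spine node j (0 <= j < m) receives a
    # copy of T_{c-1}^{n / 2^(j+1)}, which has depth m - j - 1.
    return 1 + sum(num_leaves(c - 1, m - j - 1) for j in range(m))


matches = all(
    num_leaves(c, m) == sum(comb(m, i) for i in range(c + 1))
    for m in range(10)
    for c in range(m + 1)
)
```

The agreement follows from the hockey-stick identity $\sum_{j=0}^{m-1}\binom{m-1-j}{i}=\binom{m}{i+1}$ applied to each term of the inner sum.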

We prove the following theorem in Section 6 (see Theorem 6):

Theorem 3 (Informal version of Theorem 6).

A peeling process with threshold $\tau\leq\log_{2}k+O(1)$ (i.e. any threshold that allows isolation of an element at cost bounded by $O(k)$ ) must take $k^{1-o(1)}$ iterations to terminate.

To add to the result above, we note that the lower bound on the number of rounds of adaptivity is not the only cause for quadratic runtime in our algorithm. The other cause is the necessity to update the residual signal as more and more elements are recovered, i.e. perform non-uniform Fourier transform computations. Since no subquadratic approaches to this problem are known in high dimensions, it seems plausible that a $k^{2-\Omega(1)}$ runtime algorithm for high-dimensional FFT would also shed light on the complexity of this intriguing problem.

2.2 Runtime $\widetilde{O}(k)$ for random supports through a batched peeling process

To complement our lower bound of $k^{1-o(1)}$ rounds of adaptive pruning for worst-case signals using our adaptive aliasing filters, we show that if the support of the signal is uniformly random, adaptive aliasing filters can be used to achieve an algorithm with $\widetilde{O}(k)$ runtime. A beautiful $\widetilde{O}(k)$ runtime and optimal $O(k)$ sample complexity algorithm for this model was given in [GHI+13]. The algorithm was stated for $d=2$ but readily extends to high dimensions. Unfortunately, it comes with a major restriction, namely, the sparsity $k$ must be $o(N^{1/d})$ . Our approach is different and extends to all $k\leq N$ .

We now introduce the notion of a Fourier sparse random support signal:

Definition (Random support signal).

For any positive integer $d$ , power of two $n$ , and arbitrary $\beta:[n]^{d}\to{\mathbb{C}}$ , we define $x:[n]^{d}\to{\mathbb{C}}$ to be a random support signal of Fourier sparsity $k$ (with values given by $\beta$ ) if $\widehat{x}$ is the signal defined by

[TABLE]

where the $\widehat{x}_{\bm{f}}$ are independently chosen for the various $\bm{f}\in[n]^{d}$ .

In other words, we assume a Bernoulli model for $\mathrm{supp}\leavevmode\nobreak\ \widehat{x}$ , while the values at the frequencies that are chosen to be in the support are arbitrary.

Our algorithmic result for such signals is stated below.

Theorem (Sparse FFT algorithm for random support signals).

Suppose $d$ is a positive integer and $n$ and $k$ are powers of two. For any signal $x\in{\mathbb{C}}^{n^{d}}$ such that $x$ is a random support signal of Fourier sparsity $k$ , the procedure SparseFFT $(x,n,d,k)$ (see Algorithm 8) returns $\widehat{x}$ with probability $9/10$ . Moreover, the runtime and sample complexity of this procedure are $\widetilde{O}(k)$ .

The algorithm is motivated by the idea of speeding up our algorithm for worst-case signals (Algorithm 4, also see Theorem 2.1) by reducing the number of iterations of the process from $\Theta(k)$ down to $O(\log k)$ . Such a reduction (which we show to be impossible for worst-case signals in Section 6) requires the ability to peel off many elements of the residual in a single phase of the algorithm, which turns out to be possible if the support of $\widehat{x}$ is chosen uniformly at random as in Definition 2.2. Indeed, if one considers the splitting tree $T$ of a signal with uniformly random support (see Fig. 5 for an illustration), one sees that

(a) a large constant fraction of nodes $v\in T$ satisfy $w_{T}(v)\leq\log_{2}k+O(1)$ ;

(b) the adaptive aliasing filters $G$ constructed for such nodes will have significantly overlapping support in time domain.

We provide the intuition for this for the one-dimensional setting ( $d=1$ ) to simplify notation (changes required in higher dimensions are minor). In this setting, property (b) above is simply a manifestation of the fact that since the support is uniformly random, any given congruence class modulo $B^{\prime}=Ck$ for a large enough constant $C>1$ is likely to contain only a single element of the support of $\widehat{x}$ . Our adaptive aliasing filters provide a way to only partition frequency space along a carefully selected subset of bits in $[\log_{2}N]$ , but due to the randomness assumption, one can isolate most of the elements by simply partitioning using the bottom $\log_{2}k+O(1)$ bits. This essentially corresponds to hashing $\widehat{x}$ into $B^{\prime}=Ck$ buckets at computational cost $O(B^{\prime}\log B^{\prime})=O(k\log k)$ . While this scheme is efficient, it unfortunately only recovers a constant fraction of coefficients. One solution would be to hash into $B=Ck^{2}$ buckets (i.e., consider congruence classes modulo $Ck^{2}$ ), which would result in perfect hashing with good constant probability, allowing us to recover the entire signal in a single round. However, this hashing scheme would result in a runtime of $\Omega(k^{2}\log k)$ and is, hence, not satisfactory. On the other hand, hashing into $Ck^{2}$ buckets is clearly wasteful, as most buckets would be empty. Our main algorithmic contribution is a way of “implicitly” hashing into $Ck^{2}$ buckets, i.e., getting access to the nonempty buckets, at an improved cost of $\widetilde{O}(k)$ .
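The contrast between hashing into $Ck$ and $Ck^{2}$ buckets can be seen in a quick simulation. The sketch below draws a uniformly random support of exactly $k$ frequencies (a simplification of the Bernoulli model used in the paper), buckets them by residue modulo $B$ , and reports the fraction that land alone in their bucket. The function name `isolated_fraction`, the constant $C=4$ , and the fixed seed are assumptions made for illustration.

```python
import random
from collections import Counter


def isolated_fraction(N, k, B, seed=0):
    """Fraction of a random size-k support that is alone in its bucket mod B."""
    rng = random.Random(seed)
    support = rng.sample(range(N), k)        # uniformly random support
    load = Counter(f % B for f in support)   # bucket = congruence class mod B
    isolated = sum(1 for f in support if load[f % B] == 1)
    return isolated / k


k = 256
frac_linear = isolated_fraction(N=2**20, k=k, B=4 * k)          # B = C k
frac_quadratic = isolated_fraction(N=2**20, k=k, B=4 * k * k)   # B = C k^2
```

With $B=Ck$ a constant fraction of the support typically collides, while with $B=Ck^{2}$ collisions are rare; the explicit cost of the second option is what the implicit hashing scheme avoids.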

Our algorithm uses an iterative approach, and the main underlying observation is very simple. Suppose that we are given the ability to “implicitly” hash into $B$ buckets for some $B$ , namely, get access to the nonempty buckets. If $B$ is at least $\text{min}(Ck^{2},N)$ , we know that there are no collisions with high probability and we are done. If not, then we show that, given access to nonempty buckets in the $B$ -hashing (i.e. a hashing into $B$ buckets), we can get access to the nonempty buckets of a $(\Gamma B)$ -hashing for some appropriately chosen constant $\Gamma>1$ at a polylogarithmic cost in the size of each nonempty bucket of the $B$ -bucketing by essentially computing the Fourier transform of the signal restricted to nonempty buckets in the $B$ -bucketing. We then proceed iteratively in this manner, starting with $B=Ck$ , for which we can perform the hashing explicitly. Since the number of nonzero frequencies remaining in the residual after $t$ iterations of this process decays geometrically in $t$ , we can also afford to use a smaller number of buckets $B^{\prime}$ in the hashing that we construct explicitly, ensuring that the runtime is dominated by the first iteration.

Ultimately, the algorithm takes the following form. At every iteration, we explicitly compute a hashing into $B^{\mathrm{base}}\leq Ck$ buckets. Then, using a list of nonempty buckets in a $B^{\mathrm{prev}}$ -hashing from the previous iteration, we extend this list to a list of nonempty buckets in a $B^{\mathrm{next}}$ -bucketing at polylogarithmic cost per bucket (by solving a well-conditioned linear system, see Algorithm 6), where $B^{\mathrm{next}}=\Gamma\cdot B^{\mathrm{prev}}$ for some large enough constant $\Gamma>1$ . Meanwhile, we reduce $B^{\mathrm{base}}$ by a factor of $\Gamma$ , thus maintaining the invariant $B^{\mathrm{base}}\cdot B^{\mathrm{next}}\approx k^{2}$ at all times (note that this is satisfied at the start, when $B^{\mathrm{base}}=B^{\mathrm{prev}}\approx k$ , and $B^{\mathrm{base}}\cdot B^{\mathrm{next}}$ remains invariant at each iteration). Therefore, after a logarithmic number of iterations, we have effectively emulated hashing into $\approx k^{2}$ buckets but at a total cost of roughly one hashing computation into $\approx k$ buckets (see Figure 6 for an illustration).
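The bucket schedule described above can be traced with a toy loop. The sketch below only tracks the pair $(B^{\mathrm{base}},B^{\mathrm{next}})$ across iterations and performs no hashing; the constants $C=4$ and $\Gamma=2$ , the stopping rule, and the name `bucket_schedule` are assumptions chosen for illustration, not values from the paper.

```python
def bucket_schedule(k, C=4, Gamma=2):
    """List the (B_base, B_next) pairs until B_next reaches C * k^2."""
    B_base, B_prev = C * k, C * k
    schedule = []
    while B_prev < C * k * k and B_base >= Gamma:
        B_next = Gamma * B_prev     # refine the implicit hashing
        B_base //= Gamma            # shrink the explicit hashing
        schedule.append((B_base, B_next))
        B_prev = B_next
    return schedule


k = 16
schedule = bucket_schedule(k)
products = {B_base * B_next for B_base, B_next in schedule}
explicit_cost = sum(B_base for B_base, _ in schedule)
```

In this toy run the product $B^{\mathrm{base}}\cdot B^{\mathrm{next}}$ stays fixed at $(Ck)^{2}$ , the number of iterations is $\log_{\Gamma}k$ , and the explicit bucket counts form a geometric series bounded by $Ck$ .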

Organization.

In Section 3, we introduce basic definitions and notation that will be used throughout the paper. Section 4 introduces our main technical tool of adaptive aliasing filters, which are used in the various algorithms found in this paper. Section 5 shows how to use the adaptive aliasing filters to solve the problem of estimation for Fourier measurements for worst-case signals, i.e., problem (3), thereby proving Theorem 2. Section 6 then shows that the inherent tree pruning process used to subtract off recovered frequencies and access residual signals in the estimation algorithm is essentially optimal.

Section 7 proves our main theorem, Theorem 2.1, for problem (1) on worst-case signals. Additionally, it shows how to improve on the runtime under the assumption that the signal is a worst-case signal with random phase, thereby proving Theorem 2.1.

Finally, Section 8 discusses how to obtain an algorithm for problem (1) on random support signals and proves Theorem 2.2.

3 Preliminaries and notation

In this section, we introduce some notation and basic definitions that we will use in the paper.

For any positive integer $n$ , we use the notation $[n]$ to denote the set of integers $\{0,1,\dots,n-1\}$ . We are interested in computing the Fourier transform of discrete signals of size $N$ in dimension $d$ , where $N=n^{d}$ for some $n\geq 2$ . Such a signal will be a function $[n]^{d}\to{\mathbb{C}}$ . However, we will often identify $[n]^{d}\to{\mathbb{C}}$ with ${\mathbb{C}}^{n^{d}}$ for convenience (and often use the two interchangeably depending on the context). This correspondence is formally defined later in Definition 9. We first need the notion of an inner product.

Definition 2 (Inner product).

Let $\bm{t}$ and $\bm{f}$ be two vectors in dimension $d$ . We denote the inner product of $\bm{t}$ and $\bm{f}$ by $\bm{f}^{T}\bm{t}=\sum_{q=1}^{d}f_{q}t_{q}$ .

Let us define the Fourier transform of a signal.

Definition 3 (Fourier transform).

For any positive integers $d$ and $n$ , the Fourier transform of a signal $x\in{\mathbb{C}}^{n^{d}}$ is denoted by $\widehat{x}$ , where for any $\bm{f}\in[n]^{d}$ , we define $\widehat{x}_{\bm{f}}=\sum_{\bm{t}\in[n]^{d}}x_{\bm{t}}e^{-2\pi i\frac{\bm{f}^{T}\bm{t}}{n}}$ .

Note that in the case of $n=2$ , the Fourier transform reduces to the Hadamard transform of size $N=2^{d}$ .
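A direct, quadratic-time evaluation of this definition makes the convention explicit (negative exponent, no normalization) and confirms the Hadamard special case. The sketch below stores a signal as a dictionary over $[n]^{d}$ ; the function name `dft`, this representation, and the test signal are illustrative assumptions.

```python
import cmath
from itertools import product


def dft(x, n, d):
    """Naive d-dimensional DFT of x, a dict mapping t in [n]^d to a value."""
    points = list(product(range(n), repeat=d))
    return {
        f: sum(
            x[t] * cmath.exp(-2j * cmath.pi * sum(fq * tq for fq, tq in zip(f, t)) / n)
            for t in points
        )
        for f in points
    }


# For n = 2 the character e^{-2 pi i f^T t / 2} equals (-1)^{f^T t}.
n, d = 2, 3
x = {t: float(i + 1) for i, t in enumerate(product(range(n), repeat=d))}
x_hat = dft(x, n, d)
hadamard = {
    f: sum(x[t] * (-1) ** sum(fq * tq for fq, tq in zip(f, t)) for t in x)
    for f in x
}
max_err = max(abs(x_hat[f] - hadamard[f]) for f in x)
```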

Claim 1 (Parseval’s theorem).

For any positive integers $n$ and $d$ , any signal $x\in{\mathbb{C}}^{n^{d}}$ satisfies $\|\widehat{x}\|_{2}^{2}=n^{d}\cdot\|x\|_{2}^{2}$ .

Definition 4 (Unit impulse).

For any positive integers $n$ and $d$ , the unit impulse function $\delta\in{\mathbb{C}}^{n^{d}}$ is defined as the function given by $\delta(\bm{t})=1$ for $\bm{t}=0$ and $\delta(\bm{t})=0$ for $\bm{t}\neq 0$ .

Claim 2.

For any positive integers $d$ , $n$ , and any $\bm{a}\in[n]^{d}$ , the inverse Fourier transform of $\widehat{x}:[n]^{d}\to{\mathbb{C}}$ given by $\widehat{x}_{\bm{f}}=e^{2\pi i\frac{\bm{a}^{T}\bm{f}}{n}}$ is $x_{\bm{t}}=\delta(\bm{t}+\bm{a})$ .
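Claims 1 and 2 are easy to sanity-check numerically in one dimension under the convention of Definition 3, with the inverse transform carrying the factor $1/n$ . The helper names `dft1` and `idft1`, the test signal, and the shift $a=3$ in the sketch below are assumptions made for illustration.

```python
import cmath


def dft1(x):
    """Naive DFT: x_hat[f] = sum_t x[t] * exp(-2 pi i f t / n)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * f * t / n) for t in range(n))
            for f in range(n)]


def idft1(x_hat):
    """Inverse of dft1, with the 1/n normalization."""
    n = len(x_hat)
    return [sum(x_hat[f] * cmath.exp(2j * cmath.pi * f * t / n) for f in range(n)) / n
            for t in range(n)]


n, a = 8, 3
x = [complex(t, -t * t) for t in range(n)]
x_hat = dft1(x)
energy_time = sum(abs(v) ** 2 for v in x)
energy_freq = sum(abs(v) ** 2 for v in x_hat)   # Claim 1: equals n * energy_time

# Claim 2: the phase e^{2 pi i a f / n} is the transform of an impulse at t = -a mod n.
phase = [cmath.exp(2j * cmath.pi * a * f / n) for f in range(n)]
impulse = idft1(phase)
```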

Claim 3 (Convolution theorem).

Suppose $d$ and $n$ are positive integers. Then, for any signals $x,y\in{\mathbb{C}}^{n^{d}}$ , $\widehat{(x*y)}=\widehat{x}\cdot\widehat{y}$ , where $x*y$ is the convolution of $x$ and $y$ which itself is a signal in ${\mathbb{C}}^{n^{d}}$ defined as, $(x*y)_{\bm{t}}=\sum_{\bm{\tau}\in[n]^{d}}x_{\bm{\tau}}y_{\bm{t}-\bm{\tau}}$ for all $\bm{t}\in[n]^{d}$ .
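The convolution theorem can be verified on a small one-dimensional example, reading the index $\bm{t}-\bm{\tau}$ modulo $n$ (cyclic convolution). The helper names and the two test signals in the following sketch are illustrative assumptions.

```python
import cmath


def dft1(x):
    """Naive DFT: x_hat[f] = sum_t x[t] * exp(-2 pi i f t / n)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * f * t / n) for t in range(n))
            for f in range(n)]


def circular_convolution(x, y):
    """(x * y)[t] = sum_tau x[tau] * y[(t - tau) mod n]."""
    n = len(x)
    return [sum(x[tau] * y[(t - tau) % n] for tau in range(n)) for t in range(n)]


x = [1, 2, 0, -1, 0.5, 0, 3, 1j]
y = [0, 1, 0, 0, 2, 0, 0, -1]
lhs = dft1(circular_convolution(x, y))              # transform of the convolution
rhs = [a * b for a, b in zip(dft1(x), dft1(y))]     # pointwise product of transforms
max_err = max(abs(a - b) for a, b in zip(lhs, rhs))
```

This identity is what lets a filter $G$ with $\widehat{G}\in\{0,1\}$ on the relevant frequencies act as a frequency-domain mask through time-domain convolution.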

We will require the notion of a tensor product of signals. Given $d$ signals $G_{1},G_{2},\dots,G_{d}:[n]\to{\mathbb{C}}$ , the tensor product constructs a signal in ${\mathbb{C}}^{n^{d}}$ that is defined as follows.

Definition 5 (Tensor multiplication).

Suppose $d$ and $n$ are positive integers. Given functions $G_{1},G_{2},\dots,G_{d}:[n]\to{\mathbb{C}}$ , we define the tensor product $(G_{1}\times G_{2}\times\cdots\times G_{d}):[n]^{d}\to{\mathbb{C}}$ as $\left(G_{1}\times G_{2}\times\cdots\times G_{d}\right){(\bm{j})}=G_{1}({j_{1}})\cdot G_{2}({j_{2}})\cdots G_{d}({j_{d}})$ for all $\bm{j}=(j_{1},j_{2},\dots,j_{d})\in[n]^{d}$ .

Note that the tensor product is essentially a generalization of the usual outer product on two vectors to $d$ vectors.

Claim 4 (Fourier transform of a tensor product).

For any integers $n$ and $d$ and $G_{1},G_{2},\dots,G_{d}\in{\mathbb{C}}^{n}$ , let $G:[n]^{d}\to{\mathbb{C}}$ denote the tensor product $G=G_{1}\times G_{2}\times\cdots\times G_{d}$ . Then, the $d$ -dimensional Fourier transform $\widehat{G}$ of $G$ is the tensor product of $\widehat{G}_{1},\widehat{G}_{2},\cdots,\widehat{G}_{d}$ , i.e., $\widehat{G}=\widehat{G}_{1}\times\widehat{G}_{2}\times\cdots\times\widehat{G}_{d}$ .
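For $d=2$ the claim states that the two-dimensional transform of an outer product is the outer product of the one-dimensional transforms. The sketch below checks this numerically; the helper names `dft1` and `dft2` and the two test vectors are assumptions made for illustration.

```python
import cmath
from itertools import product


def dft1(g):
    """Naive one-dimensional DFT."""
    n = len(g)
    return [sum(g[t] * cmath.exp(-2j * cmath.pi * f * t / n) for t in range(n))
            for f in range(n)]


def dft2(G, n):
    """Naive two-dimensional DFT of a dict G over [n]^2."""
    return {
        (f1, f2): sum(
            G[(t1, t2)] * cmath.exp(-2j * cmath.pi * (f1 * t1 + f2 * t2) / n)
            for t1, t2 in product(range(n), repeat=2)
        )
        for f1, f2 in product(range(n), repeat=2)
    }


n = 4
G1 = [1, 0, 2, 0]
G2 = [0.5, 1j, 0, -1]
G = {(j1, j2): G1[j1] * G2[j2] for j1, j2 in product(range(n), repeat=2)}
G_hat = dft2(G, n)
G1_hat, G2_hat = dft1(G1), dft1(G2)
max_err = max(abs(G_hat[(f1, f2)] - G1_hat[f1] * G2_hat[f2]) for f1, f2 in G_hat)
```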

Definition 6.

For any positive integers $d$ , $n$ , and $k$ , a signal $x:[n]^{d}\to{\mathbb{C}}$ is called Fourier $k$ -sparse if $\|\widehat{x}\|_{0}=k$ .

Definition 7 (The Restricted isometry property).

We say that a matrix $A\in{\mathbb{C}}^{q\times n}$ satisfies the restricted isometry property (RIP) of order $k$ if for every $k$ -sparse vector $x\in{\mathbb{C}}^{n}$ , i.e., $\|x\|_{0}\leq k$ , it holds that $\frac{1}{2}\|x\|^{2}_{2}\leq\|Ax\|_{2}^{2}\leq\frac{3}{2}\|x\|^{2}_{2}$ .

We will use the following theorem from [HR17].

Theorem 4 (The Restricted Isometry Property [HR17, Theorem 3.7]).

For sufficiently large $N$ and $k$ , and a unitary matrix $M\in{\mathbb{C}}^{N\times N}$ satisfying $\|M\|_{\infty}=O\left(\frac{1}{\sqrt{N}}\right)$ , the following holds. For some $q=O\left(k\log^{2}k\log N\right)$ let $A\in{\mathbb{C}}^{q\times N}$ be a matrix whose $q$ rows are chosen uniformly and independently from the rows of $M$ , multiplied by $\sqrt{\frac{N}{q}}$ . Then, with probability $1-\frac{1}{N^{10}}$ , the matrix $A$ satisfies the restricted isometry property of order $k$ , as per Definition 7.

4 Adaptive aliasing filters

In this section, we introduce a new class of filters that forms the basis of our algorithm for estimation of worst case Fourier sparse signals. For simplicity, we begin by introducing the filters in the one-dimensional setting and then show how they naturally extend to the multidimensional setting (using tensoring). Throughout the section, we assume that the input is a signal $x\in{\mathbb{C}}^{n}$ with $\mathrm{supp}\leavevmode\nobreak\ \widehat{x}=S$ for some $S\subseteq[n]$ .

4.1 One-dimensional Fourier transform

We use the full binary tree $T^{\mathrm{full}}_{n}$ and the corresponding labels of vertices as defined earlier.

Next, we recall the splitting tree $T=\mathrm{Tree}(S,n)$ of a set $S\subseteq[n]$ , also defined earlier.

The splitting tree $T=\mathrm{Tree}(S,n)$ can be constructed easily in $O(|S|\log n)$ time, given $S$ . We provide simple pseudocode in Algorithm 9.

For every node $v\in T$ , the level of $v$ , denoted by $l_{T}(v)$ , is the distance from $v$ to the root. The following basic claim will be useful and follows immediately from the definition of $T=\mathrm{Tree}(S,n)$ :

Claim 5.

For every integer power of two $n$ , if $T$ is a subtree of $T^{\mathrm{full}}$ , then for every node $v\in T$ , the labels of nodes that belong to the subtree $T_{v}$ of $T$ rooted at $v$ are congruent to $f_{v}$ modulo $2^{l_{T}(v)}$ . Furthermore, every node $u\in T$ at level $l_{T}(v)$ or higher which satisfies $f_{u}\equiv f_{v}\pmod{2^{l_{T}(v)}}$ belongs to $T_{v}$ .

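Recall that the weight $w_{T}(v)$ of a leaf $v\in T$ is the number of ancestors of $v$ in $T$ with two children.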

Definition 8 ( $(f,S)$ -isolating filter).

For every power of two $n$ , set $S\subseteq[n]$ , and $f\in S$ , a filter $G\in{\mathbb{C}}^{n}$ is called $(f,S)$ -isolating if $\widehat{G}_{f}=1$ , and $\widehat{G}_{f^{\prime}}=0$ for all $f^{\prime}\in S\setminus\{f\}$ .

In particular, if $G$ is $(f,S)$ -isolating, then for every signal $x\in{\mathbb{C}}^{n}$ with $\mathrm{supp}\leavevmode\nobreak\ \widehat{x}\subseteq S$ , we have

$$\sum_{j\in[n]}x_{j}G_{t-j}=\frac{1}{n}\widehat{x}_{f}\,e^{2\pi i\frac{ft}{n}}$$

for all $t\in[n]$ , by convolution theorem, see Claim 3.

While the definitions above suffice to state our estimation primitive, our Sparse FFT algorithm requires a filter $G$ that satisfies a more refined property due to the fact that throughout the execution of the algorithm, the identity of $\mathrm{supp}\leavevmode\nobreak\ \widehat{x}$ is only partially known. We encode this knowledge as a subtree $T$ of $T^{\mathrm{full}}_{n}$ whose leaves are not necessarily at level $\log_{2}n$ . Hence, every leaf $v\in T$ corresponds to a set of frequencies in the support of $\widehat{x}$ whose full identities have not been discovered yet. This is captured by the frequency cone $\operatorname{\mathrm{FrequencyCone}}_{T}(v)$ of a vertex $v$ with respect to $T$ , as defined earlier.

Note that under this definition, the frequency cone of a vertex $v$ of $T$ corresponds to the subtree rooted at $v$ when $T$ is embedded inside $T^{\mathrm{full}}_{n}$ (see Figure 2). Recall also the definition of a $(v,T)$ -isolating filter given earlier. Note that in particular, for all signals $x\in{\mathbb{C}}^{n}$ with $\mathrm{supp}\leavevmode\nobreak\ \widehat{x}\subseteq\bigcup_{u:\text{ leaf of }T}\operatorname{\mathrm{FrequencyCone}}_{T}(u)$ and $t\in[n]$ ,

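$$\sum_{j\in[n]}x_{j}G_{t-j}=\frac{1}{n}\sum_{f\in\operatorname{\mathrm{FrequencyCone}}_{T}(v)}\widehat{x}_{f}\,e^{2\pi i\frac{ft}{n}}.$$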

Lemma 3 (Filter properties).

For every power of two $n$ , subtree $T$ of $T^{\mathrm{full}}_{n}$ , and leaf $v\in T$ , the procedure FilterPreProcess $(T,v,n)$ outputs a static data structure $\mathbf{g}\in{\mathbb{C}}^{\log_{2}n}$ in time $O(\log_{2}n)$ such that, given $\mathbf{g}$ , the following conditions hold:

1. The primitive FilterTime $(\mathbf{g},n)$ outputs a filter $G$ such that $|\mathrm{supp}\leavevmode\nobreak\ {G}|=2^{w_{T}(v)}$ and $G$ is a $(v,T)$ -isolating filter. Moreover, the procedure runs in time $O(2^{w_{T}(v)}+\log_{2}n)$ .

2. For every $\xi\in[n]$ , the primitive FilterFrequency $(\mathbf{g},n,\xi)$ computes the Fourier transform of $G$ at frequency $\xi$ , namely, $\widehat{G}(\xi)$ , in time $O(\log_{2}n)$ .

Before we prove Lemma 3, we establish the following corollary, assuming the statement of Lemma 3 holds.

Corollary 1.

Suppose $n$ is a power of two, $S\subseteq[n]$ , and $f\in S$ . Then, let $T=\mathrm{Tree}(S,n)$ be the splitting tree of $S$ . If $v$ is the leaf of $T$ with label $f_{v}=f$ , while $\mathbf{g}$ is the output of FilterPreProcess $(T,v,n)$ , and $G$ is the filter computed by FilterTime $(\mathbf{g},n)$ , then the following conditions hold:

(1) $G$ is an $(f,S)$ -isolating filter.

(2) $|\mathrm{supp}\leavevmode\nobreak\ G|=2^{w_{T}(v)}$ .

Proof.

Indeed, given a subset $S$ and $f\in S$ , if $T=\mathrm{Tree}(S,n)$ , then all the leaves of $T$ are at level $\log_{2}n$ and the set of labels of the leaves is exactly $S$ . Hence, for every leaf $v$ of $T$ , one has $\operatorname{\mathrm{FrequencyCone}}_{T}(v)=\{f_{v}\}$ . By Lemma 3, $G$ is a $(v,T)$ -isolating filter. Therefore, by Definition 1,

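$$\widehat{G}(f^{\prime})=0\quad\text{for all }f^{\prime}\in\bigcup_{\begin{subarray}{c}u\neq v\\ u:\text{ leaf of }T\end{subarray}}\operatorname{\mathrm{FrequencyCone}}_{T}(u)=S\setminus\{f\},$$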

and $\widehat{G}(f)=1$ for all $f\in\operatorname{\mathrm{FrequencyCone}}_{T}(v)=\{f_{v}\}$ . This implies (1), see the definition of $(f,S)$ -isolating filters in Definition 8. Property (2) follows directly from Lemma 3. ∎

Now, we prove Lemma 3.

Proof of Lemma 3: Let $v$ be a leaf of $T$ , $l=l_{T}(v)$ denote the level of $v$ (i.e., distance from the root), $r$ denote the root of $T$ , and $v_{0},v_{1},\ldots,v_{l}$ denote the path from root to $v$ in $T$ , where $v_{0}=r$ and $v_{l}=v$ .

We first show how to efficiently construct a $(v,T)$ -isolating filter in the Fourier domain, i.e., how to efficiently construct $\widehat{G}$ . Then we derive the time domain representation of $G$ . We iteratively define a sequence of functions $G_{0},G_{1},\dots,G_{l}$ (with Fourier transforms $\widehat{G}_{0},\widehat{G}_{1},\dots,\widehat{G}_{l}$ , respectively) by traversing the path from the root to $v$ in $T$ , after which we let $G$ be the final filter constructed on this path, i.e., $G:=G_{l}$ (and $\widehat{G}:=\widehat{G}_{l}$ ). We start with $\widehat{G}_{0}(\xi)=1$ for all $\xi\in[n]$ . Then, we iteratively define $\widehat{G}_{q}$ in terms of $\widehat{G}_{q-1}$ according to the following update rule for all $q=1,2,\ldots,l$ :

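$$\widehat{G}_{q}(\xi)=\begin{cases}\widehat{G}_{q-1}(\xi)\cdot\frac{1+e^{2\pi i\frac{\xi-f_{v}}{2^{q}}}}{2}&\text{if }v_{q-1}\text{ has two children in }T,\\ \widehat{G}_{q-1}(\xi)&\text{otherwise,}\end{cases}\tag{4}$$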

for every $\xi\in[n]$ .

We now show that $G={G}_{l}$ is a $(v,T)$ -isolating filter. It is enough to show that $G$ satisfies

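$$\widehat{G}(f)=0\quad\text{for all }f\in\bigcup_{\begin{subarray}{c}u\neq v\\ u:\text{ leaf of }T\end{subarray}}\operatorname{\mathrm{FrequencyCone}}_{T}(u)\tag{5}$$

and

$$\widehat{G}(f)=1\quad\text{for all }f\in\operatorname{\mathrm{FrequencyCone}}_{T}(v).\tag{6}$$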

We now prove (5). Consider a leaf $u$ of $T$ distinct from $v$ . Recall that $v_{0},v_{1},\dots,v_{l}$ denotes the root to $v$ path in $T$ . Let $j$ be the largest integer such that $v_{j}$ is a common ancestor of $v$ and $u$ .

By definition of the tree $T$ (Definition 2.1) and because $v_{j}$ is at level $j$ , one has that the label of the right child $a$ of $v_{j}$ is $f_{v_{j}}$ , and the label of the left child $b$ is $f_{v_{j}}+2^{j}$ . Furthermore, using this together with Claim 5, we get that the labels of nodes in the subtree $T_{a}$ of $T$ rooted at the right child $a$ of $v_{j}$ are all congruent to $f_{a}=f_{v_{j}}$ modulo $2^{j+1}$ , and the labels in the subtree $T_{b}$ rooted at the left child $b$ of $v_{j}$ are all congruent to $f_{b}=f_{v_{j}}+2^{j}$ modulo $2^{j+1}$ .

Suppose that $v$ belongs to the right subtree of $v_{j}$ , and $u$ belongs to the left subtree (the other case is symmetric). We thus get that $f_{v}\equiv f_{v_{j}}\pmod{2^{j+1}}$ , and $f_{u}\equiv f_{v_{j}}+2^{j}\pmod{2^{j+1}}$ . It now suffices to note that by construction of $\widehat{G}$ (see (4)), we have that for all $\xi\in[n]$ ,

$$\widehat{G}_{j+1}(\xi)=\widehat{G}_{j}(\xi)\cdot\frac{1+e^{2\pi i\frac{\xi-f_{v}}{2^{j+1}}}}{2}.$$

By Claim 5, for all $f\in\operatorname{\mathrm{FrequencyCone}}_{T}(u)$ one has that $f\equiv f_{u}\pmod{2^{l_{T}(u)}}$ and hence, $f\equiv f_{u}\pmod{2^{j+1}}$ because $j+1\leq l_{T}(u)$ . Therefore, by substituting $\xi=f$ in the above, we get

$$\frac{1+e^{2\pi i\frac{f-f_{v}}{2^{j+1}}}}{2}=\frac{1+e^{\pi i}}{2}=0,$$

implying that $\widehat{G}_{j+1}(f)=0$ and, hence, $\widehat{G}_{l}(f)=0$ , as required.

It remains to prove (6). Consider any $f^{\prime}\in\operatorname{\mathrm{FrequencyCone}}_{T}(v)$ , and note that by Claim 5, $f^{\prime}\equiv f_{v}\pmod{2^{l}}$ . Using this in (4), we get

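$$\widehat{G}(f^{\prime})=\prod_{\begin{subarray}{c}q\in\{1,\dots,l\}:\\ v_{q-1}\text{ has two children in }T\end{subarray}}\frac{1+e^{2\pi i\frac{f^{\prime}-f_{v}}{2^{q}}}}{2}=1,$$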

since $f^{\prime}-f_{v}\equiv 0\pmod{2^{q}}$ for every $q=0,\ldots,l$ .

Next, note that the primitive FilterPreProcess( $T,v,n$ ) preprocesses the tree $T$ by traversing the path from root to leaf $v$ in time $O(\log_{2}n)$ . Given $\mathbf{g}$ , the primitive FilterFrequency $(\mathbf{g},n,\xi)$ implements (4) for successive values of $q$ , and the runtime of this algorithm is $O(\log_{2}n)$ because of the for loop passing through vector $\mathbf{g}$ .

Finally, it remains to show that the filter $G$ in time domain can be computed efficiently and has a small support. First note that by Claim 2, the inverse Fourier transform of $\frac{1+e^{2\pi i\frac{\xi-f_{v}}{2^{q}}}}{2}$ is $\frac{\delta(t)+e^{-2\pi if_{v}/2^{q}}\delta\left(t+\frac{n}{2^{q}}\right)}{2}$ .

By the duality of convolution in the time domain and multiplication in Fourier domain (see Claim 3), we can equivalently define $G$ (see (4)) by letting $G_{0}(t)=\delta(t)$ and setting

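$$G_{q}(t)=\begin{cases}\left(G_{q-1}\ast\frac{1}{2}\left(\delta(\cdot)+e^{-2\pi i{f_{v}}/{2^{q}}}\delta\left(\cdot+\frac{n}{2^{q}}\right)\right)\right)(t)&\text{if }v_{q-1}\text{ has two children in }T,\\ G_{q-1}(t)&\text{otherwise,}\end{cases}\tag{7}$$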

for every $q=1,\dots,l$ . Thus, $G=G_{l}$ is the time domain representation of the filter $\widehat{G}$ defined in (4). We now note that convolving any function with a function supported on two points, e.g., $\frac{1}{2}\left(\delta(t)+e^{-2\pi i{f_{v}}/{2^{q}}}\delta(t+\frac{n}{2^{q}})\right)$ , at most doubles the support. Since the number of times the convolution is performed in obtaining $G_{l}$ from $G_{0}$ (as per (7)) is $w_{T}(v)$ , the support size of $G$ is at most $2^{w_{T}(v)}$ . Given $\mathbf{g}$ , the primitive FilterTime $(\mathbf{g},n)$ implements the above algorithm for construction of $G$ and, therefore, runs in time $O(2^{w_{T}(v)}+\log_{2}n)$ . ∎
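The construction in the proof can be sketched in one dimension as follows. This is a minimal illustration rather than the paper's pseudocode: the function names `filter_preprocess`, `filter_frequency`, and `filter_time` are hypothetical, $T$ is assumed to be the splitting tree of a set `S` (so the leaf is a frequency `f` in `S`), the data structure `g` is stored as a list of booleans recording which ancestors $v_{q-1}$ have two children, and the preprocessing scans `S` by brute force instead of walking a prebuilt tree in $O(\log_{2}n)$ time. The Fourier convention assumed is $\widehat{G}(\xi)=\sum_{t}G(t)e^{-2\pi i\xi t/n}$ .

```python
import cmath


def filter_preprocess(S, f, n):
    """Return g, where g[q-1] is True iff ancestor v_{q-1} of leaf f has two children.

    Assumption: T is the splitting tree of S, so v_{q-1} branches exactly when
    some s in S agrees with f modulo 2^(q-1) but differs from it modulo 2^q.
    """
    levels = n.bit_length() - 1  # log2(n) for a power of two n
    g = []
    for q in range(1, levels + 1):
        lo, hi = 1 << (q - 1), 1 << q
        g.append(any((s - f) % lo == 0 and (s - f) % hi != 0 for s in S))
    return g


def filter_frequency(g, f, xi):
    """Evaluate G_hat(xi) by applying update rule (4) once per branching level."""
    val = 1.0 + 0j
    for q, split in enumerate(g, start=1):
        if split:
            val *= (1 + cmath.exp(2j * cmath.pi * (xi - f) / (1 << q))) / 2
    return val


def filter_time(g, f, n):
    """Build G in the time domain as a dict {t: G(t)} by repeated two-point convolution."""
    G = {0: 1.0 + 0j}  # G_0 = delta
    for q, split in enumerate(g, start=1):
        if split:
            shift = n >> q  # n / 2^q
            phase = cmath.exp(-2j * cmath.pi * f / (1 << q))
            nxt = {}
            for t, c in G.items():
                # Convolve with (delta(t) + phase * delta(t + n/2^q)) / 2.
                nxt[t] = nxt.get(t, 0) + c / 2
                t2 = (t - shift) % n
                nxt[t2] = nxt.get(t2, 0) + c * phase / 2
            G = nxt
    return G
```

For `S = {1, 5, 6, 11}` and `n = 16`, the leaf `f = 6` has a single branching ancestor (the root), so this sketch yields a filter supported on two points, while the leaf `f = 1` has three branching ancestors and a filter supported on eight points.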

4.2 $d$ -dimensional Fourier transform

In this section, we show that our construction of adaptive aliasing filters from the previous section naturally extends to higher dimensions without any loss by tensoring.

Definition 9 (Flattening of ${[n]}^{d}$ to ${[{n}^{d}]}$ . Unflattening of ${[n^{d}]}$ to ${[{n}]^{d}}$ ).

For every power of two $n$ , positive integer $d$ , and $\bm{f}=(f_{1},\ldots,f_{d})\in[n]^{d}$ we define the flattening of $\bm{f}$ as

[TABLE]

Similarly, for a subset $S\subseteq[n]^{d}$ we let $\widetilde{S}:=\{\widetilde{\bm{f}}:\bm{f}\in S\}$ denote the flattening of $S$ .

For $\widetilde{\bm{\xi}}\in[n^{d}]$ , we define the unflattening of $\widetilde{\bm{\xi}}$ as $\bm{\xi}=(\xi_{1},\dots,\xi_{d})\in[n]^{d}$ , where

[TABLE]

for every $q=1,\dots,d$ . Similarly, for a subset $\widetilde{R}\subseteq[n^{d}]$ , we let ${R}:=\{{\bm{\xi}}\in[n]^{d}:\widetilde{\bm{\xi}}\in\widetilde{R}\}$ denote the unflattening of $\widetilde{R}$ .
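As an illustration of flattening and unflattening, the sketch below uses one natural convention: the base- $n$ positional encoding in which $f_{1}$ is the least significant digit. This convention is an assumption made for the example, chosen because the splitting tree branches on low-order bits first; the helper names `flatten` and `unflatten` are hypothetical.

```python
def flatten(f, n):
    """Map (f_1, ..., f_d) in [n]^d to an integer in [n^d], with f_1 least significant."""
    return sum(fq * n ** q for q, fq in enumerate(f))


def unflatten(xi, n, d):
    """Inverse of flatten: recover the d base-n digits of xi, least significant first."""
    return tuple((xi // n ** q) % n for q in range(d))
```

Under this assumed convention the two maps are mutually inverse bijections between $[n]^{d}$ and $[n^{d}]$ .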

Definition 10 (Multidimensional splitting tree).

Suppose $d$ is a positive integer and $n$ is a power of two. For every $S\subseteq[n]^{d}$ , the flattened splitting tree of $S$ is defined as $\widetilde{T}=\mathrm{Tree}(\widetilde{S},n^{d})$ where $\widetilde{S}$ is flattening of $S$ .

The unflattened splitting tree of $S$ is denoted by $T$ and is obtained from the flattened splitting tree $\widetilde{T}$ by unflattening the labels $\widetilde{\bm{f}}_{v}$ of all nodes $v\in\widetilde{T}$ .

Definition 11 (Multidimensional $(\bm{f},S)$ -isolating filter).

Suppose $n$ is a power of two integer and $S\subseteq[n]^{d}$ for a positive integer $d$ . Then, for any frequency $\bm{f}\in S$ , a filter $G:[n]^{d}\to{\mathbb{C}}$ is called $(\bm{f},S)$ -isolating if $\widehat{G}_{\bm{f}}=1$ and $\widehat{G}_{\bm{f}^{\prime}}=0$ for every $\bm{f}^{\prime}\in S\setminus\{\bm{f}\}$ .

Definition 12 (Frequency cone of a leaf of $T$ in high dimensions).

Suppose $d$ is a positive integer, $n$ is a power of two, and $N=n^{d}$ . For every unflattened subtree $T$ of $T^{\mathrm{full}}_{N}$ and $v\in T$ , we define the frequency cone of $v$ as

[TABLE]

where $l_{T}(v)$ denotes the level of $v$ in $T$ (i.e., the distance from the root).

Claim 6.

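For every positive integer $d$ , power of two $n$ , every subtree $T$ of $T^{\mathrm{full}}_{n^{d}}$ , and every leaf $v\in T$ of height $l_{T}(v)<d\log_{2}n$ , let $T^{\prime}=T\cup\{\text{left child }u\text{ of }v\}\cup\{\text{right child }w\text{ of }v\}$ . Then the following holds:

$$\operatorname{\mathrm{FrequencyCone}}_{T}(v)=\operatorname{\mathrm{FrequencyCone}}_{T^{\prime}}(u)\cup\operatorname{\mathrm{FrequencyCone}}_{T^{\prime}}(w).$$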

Definition 13 (Multidimensional $(v,T)$ -isolating filter).

Suppose $d$ is a positive integer, $n$ is a power of two, and $N=n^{d}$ . For every subtree $T$ of $T^{\mathrm{full}}_{N}$ and vertex $v\in T$ , a filter $G\in{\mathbb{C}}^{n^{d}}$ is called $(v,T)$ -isolating if $\widehat{G}_{\bm{f}}=1$ for all $\bm{f}\in\operatorname{\mathrm{FrequencyCone}}_{T}(v)$ and for every $\bm{f}^{\prime}\in\bigcup_{\begin{subarray}{c}u\neq v\\ u:\text{\leavevmode\nobreak\ leaf of\leavevmode\nobreak\ }T\end{subarray}}\operatorname{\mathrm{FrequencyCone}}_{T}(u)$ one has $\widehat{G}_{\bm{f}^{\prime}}=0$ .

In particular, for every signal $x\in{\mathbb{C}}^{n^{d}}$ with $\mathrm{supp}\leavevmode\nobreak\ \widehat{x}\subseteq\bigcup_{u:\text{\leavevmode\nobreak\ leaf of\leavevmode\nobreak\ }T}\operatorname{\mathrm{FrequencyCone}}_{T}(u)$ and for all $\bm{t}\in[n]^{d}$ ,

[TABLE]

Lemma 4 (Construction of a multidimensional isolating filter).

Suppose $n$ is a power of two and $d$ is a positive integer. Let $N=n^{d}$ . For every subtree $T$ of $T^{\mathrm{full}}_{N}$ and every leaf $v\in T$ , there exists a $(v,T)$ -isolating filter $G$ such that $|\mathrm{supp}\leavevmode\nobreak\ G|=2^{w_{T}(v)}$ . Such a filter $G$ can be constructed in time $O(2^{w_{T}(v)}+\log_{2}N)$ . Moreover, for any frequency $\bm{\xi}\in[n]^{d}$ , the Fourier transform of $G$ at frequency $\bm{\xi}$ , i.e., $\widehat{G}(\bm{\xi})$ , can be computed in time $O(\log_{2}N)$ .

The proof of Lemma 4 appears in Appendix A. The key idea is to choose $q^{*}$ to be the smallest positive integer such that $l_{T}(v)\leq q^{*}\cdot\log_{2}n$ . One then defines successive filters $G^{(0)},G^{(1)},\dots,G^{(q^{*})}$ by letting $\widehat{G}^{(0)}=1$ and

[TABLE]

for $q=1,2,\dots,q^{*}$ , where $\widehat{G}_{q}$ is an isolating filter corresponding to the projection of the leaves of tree $T$ into coordinate $q$ . The final filter $G=G^{(q^{*})}$ turns out to be $(v,T)$ -isolating.

4.3 Putting it together

Claim 7.

For any binary tree $T$ let $L$ be the set of leaves of $T$ . There exists a leaf $v\in L$ such that $w_{T}(v)\leq\log_{2}|L|$ .

Proof.

Let $T^{\prime}$ be the tree obtained by “collapsing” $T$ , i.e., removing all nodes (and incident edges) of $T$ that have exactly one child. Then, observe that the leaves of $T$ are still preserved in $T^{\prime}$ , except that they are at possibly varying levels. In particular, a leaf $v$ in $T^{\prime}$ will be at level $w_{T}(v)$ . Thus, by applying Kraft’s inequality to $T^{\prime}$ (which is an equality because every node in $T^{\prime}$ is either a leaf or has two children), we see that

$$\sum_{v\in L}2^{-w_{T}(v)}=1.$$

Therefore, there exists a $v\in L$ such that $2^{-w_{T}(v)}\geq\frac{1}{|L|}$ and, therefore, $w_{T}(v)\leq\log_{2}|L|$ , as desired. ∎
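The Kraft equality used in this proof is easy to check numerically. The sketch below assumes that $T$ is the splitting tree of a set `S` of frequencies in $[n]$ , and computes each leaf weight by brute force; `leaf_weight` and `kraft_sum` are hypothetical helper names introduced only for this illustration.

```python
from fractions import Fraction


def leaf_weight(S, f, n):
    """Number of ancestors of leaf f (in the splitting tree of S) that have two children."""
    levels = n.bit_length() - 1  # log2(n) for a power of two n
    return sum(
        any((s - f) % (1 << (q - 1)) == 0 and (s - f) % (1 << q) != 0 for s in S)
        for q in range(1, levels + 1)
    )


def kraft_sum(S, n):
    """Exact value of the sum over leaves of 2^(-w_T(v))."""
    return sum(Fraction(1, 2 ** leaf_weight(S, f, n)) for f in S)
```

For `S = {1, 5, 6, 11}` and `n = 16` the leaf weights are 3, 3, 1, and 2, so the sum is $\frac{1}{8}+\frac{1}{8}+\frac{1}{2}+\frac{1}{4}=1$ and the lightest leaf has weight $1\leq\log_{2}4$ .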

This gives us the main result of this section, and the main technical lemma of the paper:

Corollary 2.

For every power of two $n$ , every positive integer $d$ , and every $S\subseteq[n]^{d}$ , there exists an $\bm{f}\in S$ and an $(\bm{f},S)$ -isolating filter $G$ (as defined in Definition 11) such that $|\mathrm{supp}\leavevmode\nobreak\ G|\leq|S|$ .

Proof.

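Let $T$ be the unflattened splitting tree of $S$ (Definition 10), whose leaves correspond to the elements of $S$ . By Claim 7, some leaf $v$ of $T$ satisfies $w_{T}(v)\leq\log_{2}|S|$ , and by Lemma 4 there is a $(v,T)$ -isolating filter $G$ with $|\mathrm{supp}\leavevmode\nobreak\ G|=2^{w_{T}(v)}\leq|S|$ . Since every leaf of $T$ is at level $d\log_{2}n$ , its frequency cone consists of a single frequency, so $G$ is $(\bm{f}_{v},S)$ -isolating. ∎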

5 Estimation of sparse high-dimensional signals in quadratic time

In this section, we use the filters that we have constructed in Section 4 in order to show the first result of the paper, a deterministic algorithm for estimation of Fourier-sparse signals in time which is quadratic in the sparsity.

Theorem 5 (Estimation guarantee).

Suppose $n$ is a power of two integer and $d$ is a positive integer and $S\subseteq[n]^{d}$ . Then, for any signal $x\in{\mathbb{C}}^{n^{d}}$ with $\mathrm{supp}\leavevmode\nobreak\ {\widehat{x}}\subseteq S$ , the procedure Estimate $(x,S,n,d)$ (see Algorithm 2) returns $\widehat{x}$ . Moreover, the sample complexity of this procedure is $O(|S|^{2})$ and its runtime is $O(|S|^{2}\cdot d\log_{2}n)$ .

Proof.

The proof is by induction on the iteration number $t=0,1,2,\dots$ of the while loop in Algorithm 2. Since the tree $T$ loses one of its leaves at each iteration and initially has $|S|$ leaves, the algorithm terminates after $|S|$ iterations. Let $\widehat{\chi}^{(t)}$ denote the signal $\widehat{\chi}$ after iteration $t$ , let ${T}^{(t)}$ denote the tree ${T}$ after iteration $t$ , and let $S^{(t)}$ denote the set of frequencies corresponding to leaves of $T^{(t)}$ , i.e., $S^{(t)}=\{\bm{f}_{u}:u\text{ is a leaf of }T^{(t)}\}$ . In particular, $\widehat{\chi}^{(0)}=0$ , ${T}^{(0)}$ is the unflattened splitting tree of ${S}$ , and $S^{(0)}=S$ .

We claim that for each $t=0,1,\dots,|S|$ , we have

[TABLE]

Base case of induction:

We have $S^{(0)}=S$ and $\widehat{\chi}^{(0)}\equiv 0$ , which immediately implies (8) for $t=0$ .

Inductive step:

For the inductive hypothesis, let $r\geq 1$ and assume that (8) holds for $t=r-1$ . The main loop of the algorithm finds $v=\text{argmin}_{u:\text{ leaf of }{T}^{(r-1)}}w_{{T}^{(r-1)}}(u)$ . By Claim 7 along with inductive hypothesis, $w_{{T}^{(r-1)}}(v)\leq\log_{2}|S^{(r-1)}|\leq\log_{2}|S|$ . Note that the main loop of the algorithm constructs a $(\bm{f}_{v},S^{(r-1)})$ -isolating filter $G$ , along with $\widehat{G}$ . In order to do so, the algorithm constructs trees $T^{v}_{q}$ for all $q\in\{1,...,d\}$ which in total takes time $O(|S|d\log_{2}n)$ . Given $T^{v}_{q}$ ’s, the algorithm constructs filter $G$ and $\widehat{G}$ in time $O\left(2^{w_{{T}^{(r-1)}}(v)}+d\log_{2}n\right)=O\left(|S|+d\log_{2}n\right)$ , by Lemma 4. Moreover, the filter $G$ has support size $2^{w_{{T}^{(r-1)}}(v)}\leq|S|$ by Lemma 4.

By Lemma 4, computing the quantity $h_{\bm{f}}=\sum_{\bm{\xi}\in[n]^{d}}\widehat{\chi}^{(r-1)}_{\bm{\xi}}\cdot\widehat{G}(\bm{\xi})$ in line 15 of Algorithm 2 can be done in time $O(\|\widehat{\chi}^{(r-1)}\|_{0}\cdot d\log_{2}n)=O(|S|\cdot d\log_{2}n)$ . By the convolution theorem (Claim 3), the quantity $h_{\bm{f}}$ satisfies $h_{\bm{f}}=n^{d}\cdot(\chi^{(r-1)}\ast G)_{0}$ , and thus

[TABLE]

where the last transition is due to the fact that $G$ is $\left(\bm{f}_{v},S^{(r-1)}\right)$ -isolating along with the inductive hypothesis of $\mathrm{supp}\leavevmode\nobreak\ {\left(\widehat{x}-\widehat{\chi}^{(r-1)}\right)}\subseteq S^{(r-1)}$ .

The algorithm thus sets $\widehat{\chi}^{(r)}(\cdot)\leftarrow\widehat{\chi}^{(r-1)}(\cdot)+\left(\widehat{x}-\widehat{\chi}^{(r-1)}\right)_{\bm{f}_{v}}\cdot\delta_{\bm{f}_{v}}(\cdot)$ . Moreover, it updates the tree ${T}^{(r)}\leftarrow\mathrm{Tree}.\textsc{remove}({T}^{(r-1)},v)$ . Also note that the set $S^{(r)}$ gets updated to $S^{(r-1)}\setminus\{\bm{f}_{v}\}$ accordingly. This establishes (8) for $t=r$ , thereby completing the inductive step.

The number of steps is exactly $|S|$ , as follows from the inductive claim. Thus, the total runtime is $O(|S|^{2}\cdot d\log_{2}n)$ . ∎
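The peeling loop analyzed in this proof can be sketched in one dimension ( $d=1$ ). This is an illustration under stated assumptions, not Algorithm 2 itself: the names `split_flags`, `g_hat`, `g_time`, and `estimate` are hypothetical, the splitting tree is recomputed by brute force from the remaining frequencies in every round, and the Fourier convention assumed is $\widehat{x}_{f}=\sum_{t}x_{t}e^{-2\pi ift/n}$ , under which $\sum_{f}\widehat{x}_{f}\widehat{G}(f)=n\sum_{t}G(t)\,x(-t)$ .

```python
import cmath


def split_flags(S, f, n):
    """flags[q-1] is True iff the level-(q-1) ancestor of leaf f branches in Tree(S, n)."""
    levels = n.bit_length() - 1
    return [any((s - f) % (1 << (q - 1)) == 0 and (s - f) % (1 << q) != 0 for s in S)
            for q in range(1, levels + 1)]


def g_hat(flags, f, xi):
    """Fourier-domain value of the isolating filter for leaf f at frequency xi."""
    val = 1 + 0j
    for q, split in enumerate(flags, start=1):
        if split:
            val *= (1 + cmath.exp(2j * cmath.pi * (xi - f) / (1 << q))) / 2
    return val


def g_time(flags, f, n):
    """Time-domain filter as a dict {t: G(t)} with 2^(number of True flags) entries."""
    G = {0: 1 + 0j}
    for q, split in enumerate(flags, start=1):
        if split:
            shift, phase = n >> q, cmath.exp(-2j * cmath.pi * f / (1 << q))
            nxt = {}
            for t, c in G.items():
                nxt[t] = nxt.get(t, 0) + c / 2
                t2 = (t - shift) % n
                nxt[t2] = nxt.get(t2, 0) + c * phase / 2
            G = nxt
    return G


def estimate(x, S, n):
    """Recover x_hat on S, assuming supp(x_hat) is a subset of S; x is a callable t -> x(t)."""
    remaining, chi_hat = set(S), {}
    while remaining:
        flags = {f: split_flags(remaining, f, n) for f in remaining}
        f = min(remaining, key=lambda s: sum(flags[s]))  # leaf of smallest weight
        G = g_time(flags[f], f, n)
        # n * sum_t G(t) x(-t) equals sum_xi x_hat[xi] * G_hat(xi).
        filtered = n * sum(c * x((-t) % n) for t, c in G.items())
        # Subtract the contribution of the frequencies that were already recovered.
        h = sum(v * g_hat(flags[f], f, xi) for xi, v in chi_hat.items())
        chi_hat[f] = filtered - h
        remaining.remove(f)
    return chi_hat
```

Each round reads $2^{w}\leq|S|$ samples of `x`, where $w$ is the weight of the chosen leaf, so the sketch uses at most $|S|^{2}$ samples in total, in line with the sample complexity stated in Theorem 5.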

6 A lower bound of $k^{1-o(1)}$ rounds of tree pruning

One apparent disadvantage of our algorithm presented in the previous section is the fact that it only estimates elements of the Fourier spectrum one at a time, thereby taking $k$ rounds to estimate all elements in the spectrum. Since the isolation of one element takes up to $k$ time due to the support size of $G$ , the resulting bound on the runtime is quadratic in $k$ . A natural conjecture is that our analysis is not tight, and one can achieve better runtime by removing several nodes of weight at most $\log_{2}k+O(1)$ at once. If one could argue that the filters $G$ that isolate the nodes removed in one round have nontrivial overlap, runtime improvements could be achieved. In this section we present a class of signals on which $k^{1-o(1)}$ rounds of pruning the tree are required, showing that our analysis is essentially optimal.

Tree pruning process

Suppose $n$ is a power of two integer and $\tau$ is a positive integer. Let $T$ be a subtree of $T^{\mathrm{full}}_{n}$ . The tree pruning process, $\mathcal{P}(T,\tau,n)$ , is an iterative algorithm that performs the following operations on $T$ successively until $T$ is empty:

1. Find $\tilde{S}_{\tau}=\{\text{leaves }v\text{ of }T:w_{T}(v)\leq\tau\}$ , i.e., the set of leaves of weight no more than $\tau$ .

2. For each $v\in\tilde{S}_{\tau}$ (in an arbitrary order) remove $v$ from $T$ together with the path from $v$ to its closest ancestor that has two children (i.e., run $T.remove(v)$ ; see Algorithm 9).
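The two steps above can be simulated directly on small inputs. The sketch below assumes that $T$ is the splitting tree of a set `S` of frequencies in $[n]$ , so that removing a leaf together with its path to the closest branching ancestor amounts to taking the splitting tree of the remaining frequencies; `leaf_weight` and `pruning_rounds` are hypothetical names, and the weights are recomputed by brute force at the start of each round.

```python
def leaf_weight(S, f, n):
    """Number of ancestors of leaf f (in the splitting tree of S) that have two children."""
    levels = n.bit_length() - 1  # log2(n) for a power of two n
    return sum(
        any((s - f) % (1 << (q - 1)) == 0 and (s - f) % (1 << q) != 0 for s in S)
        for q in range(1, levels + 1)
    )


def pruning_rounds(S, tau, n):
    """Count the rounds the pruning process needs to empty the splitting tree of S."""
    remaining, rounds = set(S), 0
    while remaining:
        # Step 1: collect every leaf whose current weight is at most tau.
        light = {f for f in remaining if leaf_weight(remaining, f, n) <= tau}
        if not light:
            raise ValueError("no leaf of weight <= tau; the process would stall")
        # Step 2: remove all of them; the tree of the remaining set is the pruned tree.
        remaining -= light
        rounds += 1
    return rounds
```

On `S = {1, 5, 6, 11}` with `n = 16`, this sketch needs three rounds for $\tau=1$ , two rounds for $\tau=2$ , and a single round for $\tau=3$ .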

We show that for every $k$ and sufficiently large integer $n$ there exists a tree $T$ with $k$ leaves such that $\mathcal{P}(T,\tau,n)$ with $\tau=\log_{2}k+O(1)$ requires $k^{1-o(1)}$ rounds to terminate. This in particular shows that our $k^{2}$ runtime analysis from section 5 cannot be improved by reusing work done in a single iteration, and hence our analysis is essentially optimal. Our construction is one-dimensional, although higher dimensional extensions can be readily obtained.

Theorem 6.

For any integer constant $c\geq 1$ and every sufficiently large power of two $n$ , there exists $k=\Theta(\log^{c}n)$ such that, for $\tau=\log_{2}k+O(1)$ , the following holds: there exists a subtree $T$ of $T^{\mathrm{full}}_{n}$ with $k$ leaves such that the tree pruning process $\mathcal{P}(T,\tau,n)$ requires $k^{1-o(1)}$ iterations to terminate.

The following simple lemma is crucial to our analysis.

Lemma 5 (Monotonicity of tree pruning process).

Suppose $n$ is a power of two, $T^{\prime}$ is a subtree of $T^{\mathrm{full}}_{n}$ , and $T$ is a subtree of $T^{\prime}$ . Then for every integer $\tau$ the number of rounds that it takes $\mathcal{P}(T,\tau,n)$ to collapse $T$ is at most the number of rounds that it takes $\mathcal{P}(T^{\prime},\tau,n)$ to collapse $T^{\prime}$ .

Proof.

For $j=0,1,2,\dots$ , let $T^{(j)}$ (respectively $T^{\prime(j)}$ ) denote the tree obtained by performing $j$ rounds of the tree pruning process (with threshold $\tau$ ) to $T$ (respectively $T^{\prime}$ ). Note that $T^{(0)}=T$ and $T^{\prime(0)}=T^{\prime}$ .

We claim that $T^{(j)}$ is a subtree of $T^{\prime(j)}$ for all $j=0,1,\dots$ , which implies the desired conclusion. We use induction on $j$ . The base case $j=0$ is trivial. Now, we prove the inductive step. Suppose $j>0$ . By the inductive hypothesis, we have that $T^{(j-1)}$ is a subtree of $T^{\prime(j-1)}$ . Thus, for any leaf $v$ that appears in both $T^{(j-1)}$ and $T^{\prime(j-1)}$ , we have $w_{T^{(j-1)}}(v)\leq w_{T^{\prime(j-1)}}(v)$ (this is because any node in $T^{\prime(j-1)}$ along the path from the root to $v$ that has exactly one child will also have exactly one child in $T^{(j-1)}$ ). Hence, if $v$ is removed from $T^{\prime(j-1)}$ in the $j$ -th iteration of the process, then it is also removed from $T^{(j-1)}$ during the $j$ -th iteration. Hence, $T^{(j)}$ is a subtree of $T^{\prime(j)}$ , which completes the inductive step and, therefore, proves the claim. ∎

We recall a few definitions. \lowhamming* Note that $|H_{c}^{n}|=\sum_{j=0}^{c}\binom{\log_{2}n}{j}$ .

\lowhammingsupport

Note that for any $x\in\mathcal{X}^{n}_{c}$ we have that $\|x\|_{0}=\sum_{i=0}^{c}\binom{\log_{2}n}{i}$ , so for any $c\leq(\frac{1}{2}-\epsilon)\log_{2}n$ , the signals that are contained in class $\mathcal{X}^{n}_{c}$ are $\Theta\Big{(}\binom{\log_{2}n}{c}\Big{)}$ -sparse.

\lowhammingweighttrees

It is not hard to see that $T^{n}_{c}$ is in fact the splitting tree for the set $H^{n}_{c}$ and, hence, the number of its leaves is $\sum_{i=0}^{c}\binom{\log_{2}n}{i}$ .

Now, we are ready to prove Theorem 6.

Proof of Theorem 6: Let us choose the tree $T$ to be $T^{n}_{c}$ for some positive integer $c$ . We will set parameter $c$ at the end of the proof. Let $D(n,c,\tau)$ denote the number of iterations required to collapse $T_{c}^{n}$ with threshold $\tau$ . We prove that

[TABLE]

for any power of two integer $n$ , any integer $0\leq c\leq\log_{2}n$ , and any positive integer $\tau$ . We use induction on $c$ .

Base: Note that for $c=0$ , the tree $T_{c}^{n}$ has one leaf, which gets removed in the first iteration of the tree pruning process. Thus, $D(n,0,\tau)=1$ for any power of two $n$ and $\tau\geq 1$ , and so, (9) holds for $c=0$ .

Inductive step: Suppose $c>0$ . For any $T_{c}^{n}$ , we label the nodes along the path from the root to the rightmost leaf (i.e., the path formed by starting at the root and repeatedly following the right child) in order as $0,1,\dots,\log_{2}n$ .

Note that if $n\leq 2^{\tau}$ , then

[TABLE]

Thus, (9) does indeed hold for $n\leq 2^{\tau}$ .

Now, suppose $n>2^{\tau}$ . Recall that a copy of $T_{c-1}^{n/2^{j+1}}$ is rooted at the left child of node $j$ of $T_{c}^{n}$ for all $j=0,1,\dots,\tau-1$ . We divide the pruning process on $T_{c}^{n}$ into two phases. The first phase consists of the process up until the point at which the left subtree of node $j$ in $T_{c}^{n}$ completely collapses for some $j\in\{0,1,\dots,\tau-1\}$ , while the second phase consists of the process thereafter. Thus, the number of rounds in the first phase is just the number of rounds until the top $\tau$ left subtrees collapse.

Note that during the first phase, the behavior of the collapsing process on the left subtree of node $j$ corresponds to running a collapsing process with threshold $\tau-j-1$ on $T_{c-1}^{n/2^{j+1}}$ . Thus, the number of rounds in the first phase is,

[TABLE]

By the inductive hypothesis (on $c$ ), we have that for $j=0,1,\dots,\tau-1$

[TABLE]

which implies that $R\geq\frac{1}{(c-1)!}\cdot\left(\frac{\log_{2}n-1}{\tau-1}\right)^{c-1}$ since we assumed $\tau\leq\log_{2}n$ .

Now, let $T^{\prime}$ be the tree obtained after performing $R$ rounds of the collapsing process on $T_{c}^{n}$ . Moreover, let $T^{\prime\prime}$ be the tree obtained by further removing any left subtrees of nodes $0,1,\dots,\tau-1$ . By Lemma 5, we have that the number of rounds needed to collapse $T^{\prime}$ is at least the number of rounds needed to collapse $T^{\prime\prime}$ . Moreover, observe that the number of rounds needed to collapse $T^{\prime\prime}$ is precisely $D(n/2^{\tau},c,\tau)$ , thus, the number of rounds in the second phase is at least $D(n/2^{\tau},c,\tau)$ , and so,

[TABLE]

Note that a similar argument gives us

[TABLE]

for all $a=0,1,\dots,\lfloor(\log_{2}n-1)/\tau\rfloor-1$ (this condition ensures that $\tau\leq\log_{2}(n/2^{a\tau})$ , as required by our argument above). Hence, it follows that

[TABLE]

which establishes (9) for $n>2^{\tau}$ . This completes the inductive step.

Recall that $k=\Theta\Big{(}\binom{\log_{2}n}{c}\Big{)}$ , so for any constant $c$ one has $k=\Theta(\binom{\log_{2}n}{c})\leq(e\log_{2}n/c)^{c}$ . Setting $\tau=\log_{2}k+O(1)$ , we get

[TABLE]

as required.

∎

7 Sparse FFT for worst-case sparse signals and worst case signals with random phase

In this section we prove the main result of the paper, namely \sfftworstcase*

We also study Fourier sparse signals $x$ whose nonzero frequencies are distributed arbitrarily (worst-case) and whose values at the nonzero frequencies are independently chosen to have a uniformly random phase. Recall Definition 2.1:

\randsign

For this model we prove the stronger result: \sfftrandphase*

The main property that allows us to obtain the stronger result is the fact that a small number of time domain samples from such a signal suffice to approximate its energy with high confidence (whereas $\Omega(k)$ samples are required in general for a worst-case $k$ -sparse signal). This is reflected by the following lemma.

Lemma 6.

For any positive integer $d$ , power of two $n$ , and worst-case signal with random phase $x$ , we have

[TABLE]

where $s=Cd^{3}\log_{2}^{3}n$ for some absolute constant $C>0$ and $\bm{t}_{1},\bm{t}_{2},\dots,\bm{t}_{s}\sim\mathrm{Unif}([n]^{d})$ are i.i.d. random variables. The probability is over the randomness in choosing the various $\bm{t}_{j}$ as well as the randomness in the choice of phase for each frequency of $\widehat{x}$ .

For completeness we present a proof for this lemma in Appendix A.

7.1 Proofs of Theorems 2.1 and 2.1

Given the construction of our adaptive aliasing filter from the previous section, our sparse recovery algorithms follow by a reduction to the estimation problem. We find the vertex $v^{*}=\text{argmin}_{v\in T}w_{T}(v)$ , which, by Kraft’s inequality, satisfies $w_{T}(v^{*})\leq\log_{2}k$ . We then define an auxiliary tree $T^{\prime}$ by appending a left child $a$ and a right child $b$ to $v^{*}$ . Then for each of the children $a,b$ , we, in turn, construct a filter $G$ that isolates them from the rest of $T$ (i.e., from the frequency cones of other nodes in $T$ ) and check whether the corresponding restricted signals are nonzero. The latter is unfortunately a nontrivial task, since the sparsity of these signals can be as high as $k$ , and detecting whether a $k$ -sparse signal is nonzero requires $\Omega(k)$ samples. However, a fixed set of $k\log^{3}N$ locations that satisfies the restricted isometry property (RIP) can be selected, and accessing the signal on those locations suffices to test whether it is nonzero. If the signal is further assumed to be a worst-case signal with random phase, then a polylogarithmic number of samples suffices. The following lemma (Lemma 7) makes the latter claim formal. The algorithm is presented as Algorithm 4.

Lemma 7 (ZeroTest guarantee).

Suppose $d$ is a positive integer and $n$ is a power of two. Assume $T$ is a subtree of $T^{\mathrm{full}}_{n^{d}}$ . Suppose that signals $x,\widehat{\chi}\in{\mathbb{C}}^{n^{d}}$ satisfy $\mathrm{supp}\leavevmode\nobreak\ {(\widehat{x}-\widehat{\chi})}\subseteq\bigcup_{u:\text{\leavevmode\nobreak\ leaf of\leavevmode\nobreak\ }T}\operatorname{\mathrm{FrequencyCone}}_{T}(u)$ . Suppose that $\mathbf{\Delta}$ is a multiset of samples from $[n]^{d}$ which satisfies the following for every leaf $v$ of $T$ :

[TABLE]

where $y=(\widehat{x}-\widehat{\chi})_{\operatorname{\mathrm{FrequencyCone}}_{T}(v)}$ is the signal obtained by restricting $\widehat{x}-\widehat{\chi}$ to frequencies $\bm{\xi}\in\operatorname{\mathrm{FrequencyCone}}_{T}(v)$ and zeroing it out on all other frequencies.

Then the following conditions hold:

•

ZeroTest $(x,\widehat{\chi},T,v,n,d,\mathbf{\Delta})$ outputs true if $\mathrm{supp}\leavevmode\nobreak\ {(\widehat{x}-\widehat{\chi})}\cap{\operatorname{\mathrm{FrequencyCone}}_{T}(v)}\neq\emptyset$ ; otherwise, it outputs false.

•

The sample complexity of this procedure is $O(2^{w_{T}(v)}\cdot|\mathbf{\Delta}|)$ , where $w_{T}(v)$ is the weight of leaf $v$ in $T$ (see Definition 2.1).

•

The runtime of the ZeroTest procedure is

[TABLE]

where $|T|$ denotes the number of leaves of $T$ .

Proof.

Consider lines 14-15 in Algorithm 3. By Claim 3, we have that

[TABLE]

Thus,

[TABLE]

Note that, by Lemma 4, the filter $G$ used in Algorithm 3 is a $(v,T)$ -isolating filter. Therefore, by the assumption $\mathrm{supp}\leavevmode\nobreak\ (\widehat{x}-\widehat{\chi})\subseteq\bigcup_{u:\text{\leavevmode\nobreak\ leaf of\leavevmode\nobreak\ }T}\operatorname{\mathrm{FrequencyCone}}_{T}(u)$ and the definition of a $(v,T)$ -isolating filter (see Definition 13), we have

[TABLE]

Note that $H_{\bm{f}}^{\Delta}$ is essentially the inverse Fourier transform of $(\widehat{x}-\widehat{\chi})_{\operatorname{\mathrm{FrequencyCone}}_{T}(v)}$ , where $(\widehat{x}-\widehat{\chi})_{\operatorname{\mathrm{FrequencyCone}}_{T}(v)}$ denotes the signal obtained by restricting $\widehat{x}-\widehat{\chi}$ to frequencies $\bm{\xi}\in\operatorname{\mathrm{FrequencyCone}}_{T}(v)$ and zeroing out the signal on all other frequencies. By the assumption of the lemma we have the following:

[TABLE]

Therefore the first claim of the lemma holds.

Note that in order to construct a $(v,T)$ -isolating filter $G$ , along with $\widehat{G}$ , the algorithm constructs trees $T^{v}_{q}$ for all $q\in\{1,...,d\}$ , which has total time complexity $O(|T|d\log_{2}n)$ . Given $T^{v}_{q}$ ’s, the algorithm constructs filter $G$ and $\widehat{G}$ in time $O\left(2^{w_{{T}}(v)}+d\log_{2}n\right)$ , by Lemma 4. Moreover, the filter $G$ has support size $2^{w_{{T}}(v)}$ , by Lemma 4.

By Lemma 4, computing the quantities $h_{\bm{f}}^{\Delta}=\frac{1}{n^{d}}\sum_{\bm{\xi}\in[n]^{d}}e^{2\pi i\frac{\bm{\xi}^{T}\Delta}{n}}\cdot\widehat{\chi}_{\bm{\xi}}\widehat{G}_{\bm{\xi}}$ for all $\Delta$ in line 14 of Algorithm 3 can be done in time $O\left(\|\widehat{\chi}\|_{0}\cdot(|\mathbf{\Delta}|+d\log_{2}n)\right)=O\left(\|\widehat{\chi}\|_{0}\cdot|\mathbf{\Delta}|\right)$ . Given the values of $h_{\bm{f}}^{\Delta}$ for various $\Delta$ , computing all $\big{\{}|H^{\Delta}_{f^{*}}|^{2}\big{\}}_{\Delta\in\mathbf{\Delta}}$ in line 15 takes time $O\left(2^{w_{T}(v)}\cdot|\mathbf{\Delta}|\right)$ . Therefore the total runtime of this procedure is

[TABLE]

as desired.

Because the support size of $G$ is $2^{w_{T}(v)}$ , computing all $\big{\{}|H^{\Delta}_{f^{*}}|^{2}\big{\}}_{\Delta\in\mathbf{\Delta}}$ in line 15 of the algorithm requires $O(2^{w_{T}(v)}\cdot|\mathbf{\Delta}|)$ samples from $x$ , which proves the second claim of the lemma. ∎

We now prove our main result:

Proof of Theorems 2.1 and 2.1: Note that Algorithms 4 and 5 are identical except in line 2. We first analyze the common code of the algorithms (after line 2) under the assumption that the set $\mathbf{\Delta}$ in every call to ZeroTest is replaced with a more powerful set that satisfies the precondition of Lemma 7, so that ZeroTest correctly tests the zero hypothesis on its input signal with probability $1$ . We then establish a coupling between this idealized execution and the actual execution for both Algorithms 4 and 5, leading to our result.

Let $m=|\mathbf{\Delta}|$ denote the size of the set $\mathbf{\Delta}$ . We prove that the following properties are maintained throughout the execution of SparseFFT (Algorithm 4) and SparseFFT-RandomPhase (Algorithm 5):

(1)

$\mathrm{supp}\leavevmode\nobreak\ {(\widehat{x}-\widehat{\chi})}\subseteq\bigcup_{u:\text{ leaf of }T}\operatorname{\mathrm{FrequencyCone}}_{T}(u)$ ;

(2)

For every leaf $u$ of tree $T$ one has $\mathrm{supp}\leavevmode\nobreak\ {(\widehat{x}-\widehat{\chi})}\cap\operatorname{\mathrm{FrequencyCone}}_{T}(u)\neq\emptyset$ ;

(3)

If $\widehat{x}$ is a worst-case signal with random phase, then $\widehat{x}-\widehat{\chi}$ is a worst-case signal with random phase;

(4)

The quantity $\phi=(d\log_{2}n+1)\|\widehat{x}-\widehat{\chi}\|_{0}-\sum_{u:\text{ leaf of }T}l_{T}(u)$ always decreases by at least 1 on every iteration of Algorithm 4 or 5;

(5)

At all times, $\|\widehat{x}-\widehat{\chi}\|_{0}\leq k$ .

The base of the induction is provided by the first iteration, at which point $T$ is a single vertex $T=\{r\}$ where $r$ is the root with $\bm{f}_{r}=0$ and $\widehat{\chi}=0$ . Conditions (1), (2), (3), and (5) are satisfied since $\operatorname{\mathrm{FrequencyCone}}_{T}(r)=[n]^{d}$ and $\mathrm{supp}\leavevmode\nobreak\ (\widehat{x}-\widehat{\chi})=\mathrm{supp}\leavevmode\nobreak\ {\widehat{x}}\neq\emptyset$ and $\widehat{x}-\widehat{\chi}=\widehat{x}$ is a worst-case signal with random phase if $\widehat{x}$ is a worst-case signal with random phase.

We now prove the inductive step. We assume that conditions (1), (2), (3), and (5) of the inductive hypothesis are satisfied at the beginning of a certain iteration and argue that they are maintained at the end of the iteration. We also show that the value of the quantity $\phi$ defined in (4) at the end of the loop is smaller than its value at the start of the loop by at least one. Let $v\in T$ be the smallest weight leaf chosen by the algorithm in line 4. We now consider two cases.

Case 1: $l_{T}(v)=d\log_{2}n$ . Since $G$ is a $(v,T)$ -isolating filter, we have by Definition 13 that for every signal $z\in{\mathbb{C}}^{n^{d}}$ with Fourier support $\mathrm{supp}\leavevmode\nobreak\ \widehat{z}\subseteq\bigcup_{u:\text{\leavevmode\nobreak\ leaf in\leavevmode\nobreak\ }T}\operatorname{\mathrm{FrequencyCone}}_{T}(u)$ and for all $\bm{t}\in[n]^{d}$ ,

[TABLE]

By condition (1) of the inductive hypothesis one has $\mathrm{supp}\leavevmode\nobreak\ (\widehat{x}-\widehat{\chi})\subseteq\bigcup_{u:\text{ leaf of }T}\operatorname{\mathrm{FrequencyCone}}_{T}(u),$ and thus we can apply (10) with $z=x-\chi$ and $\bm{t}=0$ , obtaining

[TABLE]

Note that by Claim 3,

[TABLE]

where $h_{\bm{f}}$ is the quantity computed in line 15. We thus get that

[TABLE]

because $\operatorname{\mathrm{FrequencyCone}}_{T}(v)=\{\bm{f}_{v}\}$ due to the assumption that $l_{T}(v)=d\log_{2}n$ . The algorithm thus performs the update $\widehat{\chi}(\cdot)\leftarrow\widehat{\chi}(\cdot)+\widehat{(x-\chi)}_{\bm{f}_{v}}\delta_{\bm{f}_{v}}(\cdot)$ , so at the end of the loop we have $\widehat{(x-\chi)}_{\bm{f}_{v}}=0$ , which means that $\bm{f}_{v}$ is no longer in $\mathrm{supp}\leavevmode\nobreak\ {\widehat{(x-\chi)}}$ . Moreover, $v$ gets removed from the tree $T$ , so that $\{\bm{f}_{v}\}=\operatorname{\mathrm{FrequencyCone}}_{T}(v)$ is excluded from $\bigcup_{u:\text{ leaf of }T}\operatorname{\mathrm{FrequencyCone}}_{T}(u)$ . Note that this also implies that $\widehat{(x-\chi)}$ remains a worst-case signal with random phase. Therefore, conditions (1), (2), and (3) hold.

Now, note that $\|\widehat{(x-\chi)}\|_{0}$ will decrease by 1 exactly because $\bm{f}_{v}$ is no longer in $\mathrm{supp}\leavevmode\nobreak\ {\widehat{(x-\chi)}}$ and the rest of the support is unchanged. This shows that condition (5) holds. Also, $\sum_{u:\text{ leaf of }T}l_{T}(u)$ decreases by exactly $d\log_{2}n$ because the level of $v$ was $l_{T}(v)=d\log_{2}n$ and $v$ gets removed from $T$ . So $\phi$ will decrease by exactly one as required in condition (4).

Case 2: $l_{T}(v)<d\log_{2}n$ . We first check that the invocation of ZeroTest satisfies the preconditions of Lemma 7. We need to ensure that for the residual signal $\widehat{x}-\widehat{\chi}$ one has

[TABLE]

where $T^{\prime}$ is the tree obtained from $T$ by adding two children of $v$ (line 19). This follows, since by the inductive hypothesis we have

[TABLE]

and because by Claim 6 we have

[TABLE]

We thus get that the preconditions of Lemma 7 are satisfied, and the output of ZeroTest $(x,\widehat{\chi},T^{\prime},w,n,d,\mathbf{\Delta})$ is true if $(\widehat{x}-\widehat{\chi})_{\operatorname{\mathrm{FrequencyCone}}_{T^{\prime}}(w)}\neq 0$ and false otherwise. A similar analysis shows that the algorithm correctly tests the zero hypothesis on $(\widehat{x}-\widehat{\chi})_{\operatorname{\mathrm{FrequencyCone}}_{T^{\prime}}(u)}$ . We thus get, letting $T_{new}$ denote the tree $T$ at the end of the while loop, that

[TABLE]

and for every $v\in T_{new}$ one has $\mathrm{supp}\,(\widehat{x}-\widehat{\chi})\cap\operatorname{\mathrm{FrequencyCone}}_{T_{new}}(v)\neq\emptyset$ . Hence, because $\widehat{(x-\chi)}$ remains unchanged, conditions (1), (2), and (3) hold at the end of the loop.

Now, we show that $\phi$ decreases by at least one. By the inductive hypothesis, $\mathrm{supp}\,(\widehat{x}-\widehat{\chi})\cap\operatorname{\mathrm{FrequencyCone}}_{T}(v)\neq\emptyset$ , so at least one of $w$ or $u$ is added to $T$ , because $\operatorname{\mathrm{FrequencyCone}}_{T}(v)=\operatorname{\mathrm{FrequencyCone}}_{T^{\prime}}(u)\cup\operatorname{\mathrm{FrequencyCone}}_{T^{\prime}}(w)$ . Note that $l_{T_{new}}(w)=l_{T_{new}}(u)=l_{T}(v)+1$ , hence $\sum_{u^{\prime}:\text{ leaf of }T_{new}}l_{T_{new}}(u^{\prime})\geq\sum_{u^{\prime}:\text{ leaf of }T}l_{T}(u^{\prime})+1$ . Because $\|\widehat{x}-\widehat{\chi}\|_{0}$ remains unchanged, the value of $\phi$ decreases by at least one, hence conditions (4) and (5) hold.

Because $l_{T}(u)\leq d\log_{2}n$ for every leaf $u\in T$ , it follows from condition (2) that the quantity $\phi=(d\log_{2}n+1)\|\widehat{x}-\widehat{\chi}\|_{0}-\sum_{u:\text{ leaf of }T}l_{T}(u)$ is non-negative. At the first iteration, $\widehat{\chi}=0$ and $T=\{r\}$ , where $r$ is the root with $l_{T}(r)=0$ . Hence, $\phi=\|\widehat{x}\|_{0}(1+d\log_{2}n)$ at the first iteration. Because $\phi$ decreases by at least 1 at each iteration, the algorithm terminates after $O(\|\widehat{x}\|_{0}\cdot d\log_{2}n)$ iterations. By Lemma 7 along with Claim 7, both the runtime and the sample complexity of each iteration are $O(km)$ , so the total runtime and sample complexity are both $O(k^{2}m\cdot d\log_{2}n)$ .
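The potential argument above can be checked on a toy instance. The following is a minimal sketch, not the paper's algorithm: it assumes a one-dimensional setting in which a node at level $l$ owns the frequencies sharing its lowest $l$ bits, it replaces ZeroTest with an exact membership check, and it picks an arbitrary leaf rather than the smallest weight leaf. The names `simulate_potential`, `num_bits`, and `residue` are hypothetical.

```python
def simulate_potential(support, num_bits):
    """Track phi over a toy run of the tree exploration loop.

    support  -- set of nonzero frequencies, integers in [0, 2**num_bits)
    num_bits -- plays the role of d*log2(n), the maximum leaf level
    """
    residual = set(support)
    # A leaf is (level, residue); its cone is {f : f mod 2**level == residue}.
    leaves = [(0, 0)]

    def phi():
        return (num_bits + 1) * len(residual) - sum(level for level, _ in leaves)

    history = [phi()]
    while leaves:
        level, residue = leaves.pop()
        if level == num_bits:
            # Case 1: the cone is a single frequency; subtract it, drop the leaf.
            residual.discard(residue)
        else:
            # Case 2: split on the next bit and keep only nonempty children.
            # The exact membership check stands in for ZeroTest.
            modulus = 1 << (level + 1)
            for bit in (0, 1):
                child = residue | (bit << level)
                if any(f % modulus == child for f in residual):
                    leaves.append((level + 1, child))
        history.append(phi())
    return history


history = simulate_potential({3, 5, 12}, num_bits=4)
drops = [before - after for before, after in zip(history, history[1:])]
```

In this run `history` starts at $k(d\log_{2}n+1)$ with $k=3$ and $d\log_{2}n=4$, every entry of `drops` is at least one, and the loop ends with $\phi=0$, which is the bound on the number of iterations used in the proof.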

Finally, observe that throughout this analysis we have assumed that the set $\mathbf{\Delta}$ satisfies the precondition of Lemma 7 for all the invocations of ZeroTest by our algorithm.

In reality, there are two cases. The first case is for worst-case signals (Algorithm 4, Theorem 2.1). In this case, the algorithm chooses $\mathbf{\Delta}$ to be a multiset which corresponds to Fourier measurements with RIP of order $k$ . Let $F^{-1}_{N}$ be the matrix of the $d$ -dimensional inverse Fourier transform with $N=n^{d}$ points. The matrix $M={\sqrt{N}}F^{-1}_{N}$ is unitary. If we let $M_{\bm{\Delta}}$ denote the submatrix of $M$ whose rows are sampled from $M$ according to the multiset $\bm{\Delta}$ defined in line 5 of Algorithm 4, then by Theorem 4 there exists a multiset $\bm{\Delta}$ of size $m=O(k\log_{2}^{2}k\cdot d\log_{2}n)$ such that $\frac{\sqrt{N}}{|\bm{\Delta}|}M_{\bm{\Delta}}$ satisfies the restricted isometry property of order $k$ . Therefore, for every signal $y\in{\mathbb{C}}^{n^{d}}$ :

[TABLE]

As we have shown in condition (5) of the induction, $\|\widehat{x}-\widehat{\chi}\|_{0}\leq k$ . Hence, for every leaf $v$ of the tree, $\|(\widehat{x}-\widehat{\chi})_{\operatorname{\mathrm{FrequencyCone}}_{T}(v)}\|_{0}\leq k$ , and therefore the precondition of Lemma 7 is satisfied.

The second case is for worst-case signals with random phase (Algorithm 5, Theorem 2.1). We have shown in condition (3) of the induction that $x-\chi$ is a worst-case signal with random phase in every iteration of the algorithm. Therefore, for every leaf $v$ of the tree, $(\widehat{x}-\widehat{\chi})_{\operatorname{\mathrm{FrequencyCone}}_{T}(v)}$ is a worst-case signal with random phase. In this case, the multiset $\bm{\Delta}$ is defined in line 5 of Algorithm 5, and therefore by Lemma 6, for a fixed leaf $v$ of the tree $T$ , the following holds with probability at least $1-1/n^{4d}$ :

[TABLE]

where $y=(\widehat{x}-\widehat{\chi})_{\operatorname{\mathrm{FrequencyCone}}_{T}(v)}$ .

This shows that in the second case, which corresponds to Theorem 2.1, the failure probability of procedure ZeroTest is at most $\frac{1}{n^{4d}}$ . Moreover, the above analysis shows that SFFT makes at most $O(kd\log_{2}n)$ calls to ZeroTest. Therefore, by a union bound, the overall failure probability of the calls to ZeroTest is $O\left((kd\log_{2}n)\frac{1}{n^{4d}}\right)\leq O(n^{-2d})$ . Hence, we obtain the desired result.

∎

8 Signals with random support in high dimension

In this section, we consider Fourier sparse signals whose support in the frequency domain is chosen randomly, while the values at the nonzero frequencies are chosen arbitrarily (worst-case). In other words, we assume a Bernoulli model for $\mathrm{supp}\leavevmode\nobreak\ \widehat{x}$ , while the values at the frequencies that are chosen to be in the support are arbitrary. We will present an algorithm that runs in time $O(k\log^{O(1)}N)$ . The model for random support signals can be found in Definition 2.2 (Section 2), which we restate here for convenience of the reader:

\defrandsupp

8.1 Outline of our approach

The algorithm is motivated by the idea of speeding up our algorithm for worst-case signals (Algorithm 4, also see Theorem 2.1) by reducing the number of iterations of the process from $\Theta(k)$ down to $O(\log N)$ . Such a reduction (which we show to be impossible for worst-case signals in Section 6) requires the ability to peel off many elements of the residual in a single phase of the algorithm, which turns out to be possible if the support of $\widehat{x}$ is chosen uniformly at random as in Definition 2.2. Indeed, if one considers the splitting tree $T$ of a signal with uniformly random support (see Figure 5 for an illustration), one sees that

(a) a large constant fraction of nodes $v\in T$ satisfy $w_{T}(v)\leq\log_{2}k+O(1)$ ;

(b) the adaptive aliasing filters $G$ constructed for such nodes will have significantly overlapping support in time domain.

We provide the intuition for this for the one-dimensional setting ( $d=1$ ) to simplify notation (changes required in higher dimensions are minor). In this setting, property (b) above is simply a manifestation of the fact that since the support is uniformly random, any given non-empty congruence class modulo $B^{\prime}=Ck$ for a large enough constant $C>1$ is likely to contain only a single element of the support of $\widehat{x}$ . Our adaptive aliasing filters provide a way to only partition frequency space along a carefully selected subset of bits in $[\log_{2}N]$ , but due to the randomness assumption, one can isolate most of the elements by simply partitioning using the bottom $\log_{2}k+O(1)$ bits. This essentially corresponds to hashing $\widehat{x}$ into $B^{\prime}=Ck$ buckets at computational cost $O(B^{\prime}\log B^{\prime})=O(k\log k)$ . While this scheme is efficient, it unfortunately only recovers a constant fraction of coefficients. One solution would be to hash into $B=Ck^{2}$ buckets (i.e., consider congruence classes modulo $Ck^{2}$ ), which would result in perfect hashing with good constant probability, allowing us to recover the entire signal in a single round. However, this hashing scheme would result in a runtime of $\Omega(k^{2}\log k)$ and is, hence, not satisfactory. On the other hand, hashing into $Ck^{2}$ buckets is clearly wasteful, as most buckets would be empty. Our main algorithmic contribution is a way of “implicitly” hashing into $Ck^{2}$ buckets, i.e., getting access to the nonempty buckets, at an improved cost of $\widetilde{O}(k)$ .
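The contrast between hashing into $Ck$ and $Ck^{2}$ buckets can be seen in a small simulation. This is an illustrative sketch only: it draws a uniform $k$ -subset instead of a Bernoulli set, and the values of `N`, `k`, `C` and the helper name `isolated_fraction` are assumptions chosen for the example.

```python
import random
from collections import Counter


def isolated_fraction(support, num_buckets):
    """Fraction of support elements that are alone in their residue class."""
    load = Counter(f % num_buckets for f in support)
    alone = sum(1 for f in support if load[f % num_buckets] == 1)
    return alone / len(support)


rng = random.Random(0)
N, k, C = 2**20, 64, 8
# A uniform k-subset stands in for the Bernoulli support model.
support = rng.sample(range(N), k)

frac_linear = isolated_fraction(support, C * k)          # Ck buckets
frac_quadratic = isolated_fraction(support, C * k * k)   # Ck^2 buckets
```

With $Ck$ buckets most, but typically not all, frequencies are isolated, whereas with $Ck^{2}$ buckets collisions become rare. The cost is that almost all of the $Ck^{2}$ buckets are empty, which is the waste the implicit hashing scheme is meant to avoid.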

Our algorithm uses an iterative approach, and the main underlying observation is very simple. Suppose that we are given the ability to “implicitly” hash into $B$ buckets for some $B$ , namely, get access to the nonempty buckets. If $B$ is at least $\text{min}(Ck^{2},N)$ , we know that there are no collisions with high probability and we are done. If not, then we show that, given access to nonempty buckets in the $B$ -hashing (i.e. a hashing into $B$ buckets), we can get access to the nonempty buckets of a $(\Gamma B)$ -hashing for some appropriately chosen constant $\Gamma>1$ at a polylogarithmic cost in the size of each nonempty bucket of the $B$ -bucketing by essentially computing the Fourier transform of the signal restricted to nonempty buckets in the $B$ -bucketing. We then proceed iteratively in this manner, starting with $B=Ck$ , for which we can perform the hashing explicitly. Since the number of nonzero frequencies remaining in the residual after $t$ iterations of this process decays geometrically in $t$ , we can also afford to use a smaller number of buckets $B^{\prime}$ in the hashing that we construct explicitly, ensuring that the runtime is dominated by the first iteration.

Ultimately, the algorithm takes the following form. At every iteration, we explicitly compute a hashing into $B^{\mathrm{base}}\leq Ck$ buckets. Then, using a list of nonempty buckets in a $B^{\mathrm{prev}}$ -hashing from the previous iteration, we extend this list to a list of nonempty buckets in a $B^{\mathrm{next}}$ -bucketing at polylogarithmic cost per bucket (by solving a well-conditioned linear system, see Algorithm 6), where $B^{\mathrm{next}}=\Gamma\cdot B^{\mathrm{prev}}$ for some large enough constant $\Gamma>1$ . Meanwhile, we reduce $B^{\mathrm{base}}$ by a factor of $\Gamma$ , thus maintaining the invariant $B^{\mathrm{base}}\cdot B^{\mathrm{next}}\approx k^{2}$ at all times (note that this is satisfied at the start, when $B^{\mathrm{base}}=B^{\mathrm{prev}}\approx k$ , and $B^{\mathrm{base}}\cdot B^{\mathrm{next}}$ remains invariant at each iteration). Therefore, after a logarithmic number of iterations, we have effectively emulated hashing into $\approx k^{2}$ buckets, but at a total cost of roughly one hashing computation into $\approx k$ buckets (see Figure 6 for an illustration).
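The bucket schedule described above can be written out explicitly. The sketch below only tracks the three bucket counts; the constants `C=4` and `gamma=2`, the stopping rule, and the function name `bucket_schedule` are assumptions made for illustration and are not taken from the paper's pseudocode.

```python
def bucket_schedule(k, C=4, gamma=2):
    """List (B_base, B_prev, B_next) per iteration until B_next reaches C*k*k."""
    b_base, b_prev = C * k, C * k
    rounds = []
    while b_base >= 1:
        b_next = gamma * b_prev
        rounds.append((b_base, b_prev, b_next))
        if b_next >= C * k * k:
            break
        # Shrink the explicit hashing while the implicit one is refined.
        b_base //= gamma
        b_prev = b_next
    return rounds


rounds = bucket_schedule(k=64)
total_explicit_buckets = sum(b_base for b_base, _, _ in rounds)
```

For `k=64` the schedule has $\log_{2}k=6$ rounds, the product $B^{\mathrm{base}}\cdot B^{\mathrm{next}}$ is the same in every round, and the explicit bucket counts form a geometric series, so `total_explicit_buckets` stays below twice the first-round value $Ck$ . This mirrors the claim that the total cost is dominated by the first iteration.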

Bucketing in high dimensions (MakeBucket function).

We note that our vectorial notation for buckets in high dimensions (see Section 8.2) allows us to continue talking about bucketings with $\bm{B}^{base},\bm{B}^{prev},\bm{B}^{next}$ buckets, even though now the number of buckets is in fact a vector of length $d$ . In fact, in dimension $d$ the only property of the bucketing that matters for our analysis is the number of buckets $|\bm{B}^{base}|,|\bm{B}^{prev}|,|\bm{B}^{next}|$ , and the shape of each bucket is not important (this is due to the fact that the support is sampled from a permutation invariant distribution). In order to avoid unnecessary notation overload, in Algorithm 8 we introduce procedure MakeBucket that constructs a bucketing $\mathbf{B}$ of size $|\mathbf{B}|=b$ of the following simple form. The vector $\bm{B}$ is defined by

[TABLE]

The pseudocode for MakeBucket, which implements the formula above, is given in Algorithm 8.

8.2 Notation

We will need notation for vectorial operations, e.g., entrywise multiplication and division of vectors, which we introduce in the following definition.

Definition 14 (Entrywise vectorial arithmetic).

Suppose that $\bm{B}=(B_{1},B_{2},\cdots,B_{d})$ , $\bm{j}=(j_{1},j_{2},\cdots,j_{d})$ and $\bm{t}=(t_{1},t_{2},\cdots,t_{d})$ are $d$ -dimensional vectors and $a$ is a scalar value. Then we define the following operations,

[TABLE]
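As a reading aid, the entrywise operations can be mirrored with tuples. This is a sketch based on how the notation is used later in this section ( $\frac{n}{\bm{B}}$ , $\bm{j}\cdot\bm{t}$ , reduction modulo $\bm{B}$ , $[\bm{B}]$ , and $|\bm{B}|$ as a number of buckets); the function names are hypothetical and the exact list of operations in Definition 14 may differ.

```python
from itertools import product
from math import prod


def entrywise_mul(j, t):
    return tuple(a * b for a, b in zip(j, t))


def entrywise_div(j, B):
    # Exact division; assumes every B_i divides j_i.
    return tuple(a // b for a, b in zip(j, B))


def entrywise_mod(j, B):
    return tuple(a % b for a, b in zip(j, B))


def scalar_over(a, B):
    # The vector (a / B_1, ..., a / B_d), e.g. n / B.
    return tuple(a // b for b in B)


def num_buckets(B):
    # |B|, read here as the product of the entries.
    return prod(B)


def index_set(B):
    # [B] = [B_1] x ... x [B_d].
    return list(product(*(range(b) for b in B)))


n = 16
B = (4, 2)
f = (13, 7)
bucket_of_f = entrywise_mod(f, B)
stride = scalar_over(n, B)
```

Here the frequency `f` falls into bucket `bucket_of_f`, and `stride` is the per-coordinate sampling step $\frac{n}{\bm{B}}$ that appears in the shifts used later for hashing.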

8.4 Filtering, hashing, and bucketing in high dimensions

We introduce the main definitions here. Our techniques in this section use a version of our adaptive aliasing filters that is tailored to the assumption that the support of $\widehat{x}$ is chosen uniformly at random. Since the signal is assumed to be sampled from a distribution, we are able to design a fast algorithm by adapting to a distribution as opposed to a given realization of the support of $\widehat{x}$ . The next definition is essentially a simplified version of the definition of a frequency cone from Section 2 (see Definition 1):

Definition 15 (Congruence classes of support).

Suppose $d$ and $n$ are positive integers such that $n$ is a power of two. Let $\bm{B}=(B_{1},B_{2},\dots,B_{d})$ be a vector of powers of two such that $B_{j}\mid n$ for $j=1,2,\dots,d$ . For every $\bm{b}\in[\bm{B}]$ , and signal $x\in{\mathbb{C}}^{n^{d}}$ , we define the $(\bm{B},\bm{b})$ -congruence class of $\mathrm{supp}\leavevmode\nobreak\ {\widehat{x}}$ to be the set $S_{x}(\bm{B},\bm{b})$ , given by

[TABLE]

We access the signal using a bucketing operation, defined below.

Definition 16 (Bucketing in high dimensions).

Suppose $d$ and $n$ are positive integers such that $n$ is a power of two. Let $\bm{B}=(B_{1},B_{2},\dots,B_{d})$ be a vector of powers of two such that $B_{j}\mid n$ for all $j=1,2,\dots,d$ . For every $\bm{a}\in[n]^{d}$ , $\bm{b}\in[\bm{B}]$ , and signal $x\in{\mathbb{C}}^{n^{d}}$ , we define the $(\bm{B},\bm{b})$ -bucketing of $x$ with shift $\bm{a}$ to be $U_{x}^{\bm{a}}(\bm{B},\bm{b})$ , given by

[TABLE]

The following definition of Bernoulli set provides a compact way of referring to the distribution of $\text{supp\leavevmode\nobreak\ }\widehat{x}$ :

Definition 17 (Bernoulli set).

For every power of two $n$ and positive integer $d$ , let $N=n^{d}$ and set $S\subseteq[n]^{d}$ be a random set such that each $\bm{j}\in[n]^{d}$ is independently chosen to be in $S$ with probability $k/N$ ,

[TABLE]

Moreover, for any $\bm{B}=(B_{1},B_{2},\dots,B_{d})$ such that $B_{1},B_{2},\dots,B_{d}\mid n$ , we define

[TABLE]

The next lemma is crucial for our analysis. The lemma considers two bucketings $\bm{B}$ and $\bm{B}^{\prime}$ , where $\bm{B}$ is a refinement of $\bm{B}^{\prime}$ . The object of interest is the number of buckets in the bucketing $\bm{B}$ that contain at least two elements of a Bernoulli set $S$ , i.e. non-singleton buckets. The lemma shows that as long as the product of the number of buckets in $\bm{B}$ and $\bm{B}^{\prime}$ is at least $k^{2}$ , the elements (i.e. frequencies) of a Bernoulli set $S$ that belong to non-singleton buckets in $\bm{B}$ must be rather uniformly spread over the coarser bucketing $\bm{B}^{\prime}$ . Specifically, no bucket in $\bm{B}^{\prime}$ contains more than $O(\log N)$ such frequencies with high probability. We will use this lemma with $\bm{B}^{\prime}=\bm{B}^{base}$ and with both $\bm{B}=\bm{B}^{next}$ and $\bm{B}=\bm{B}^{prev}$ (see proof of Theorem 2.2 below).

Lemma 8 (Refinement lemma).

For any integers $n$ and $k$ that are powers of two, suppose $\bm{B}=(B_{1},B_{2},\dots,B_{d})$ and $\bm{B}^{\prime}=(B_{1}^{\prime},B_{2}^{\prime},\dots,B_{d}^{\prime})$ satisfy $B_{j}^{\prime}\mid B_{j}\mid n$ for all $j=1,2,\dots,d$ as well as $|\bm{B}|\cdot|\bm{B}^{\prime}|\geq k^{2}$ . Then, with probability at least $1-\frac{1}{N^{3}}$ , we have that for all $\bm{b}\in[\bm{B}^{\prime}]$ ,

[TABLE]

The proof is a simple probabilistic argument and is given in Appendix B.
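The statement of the refinement lemma can be probed empirically. The following one-dimensional sketch draws a Bernoulli set, marks the frequencies that collide in the fine bucketing, and reports the largest number of such frequencies in any coarse bucket. The parameter values and the name `max_collided_per_coarse_bucket` are assumptions for illustration, and the constant hidden in $O(\log N)$ is not estimated here.

```python
import random
from collections import Counter
from math import log2


def max_collided_per_coarse_bucket(support, B_fine, B_coarse):
    """Largest number of collided frequencies landing in one coarse bucket.

    A frequency is collided if its residue class modulo B_fine holds at
    least two elements of the support.
    """
    fine_load = Counter(f % B_fine for f in support)
    collided = [f for f in support if fine_load[f % B_fine] >= 2]
    coarse_load = Counter(f % B_coarse for f in collided)
    return max(coarse_load.values(), default=0)


rng = random.Random(1)
N, k = 2**16, 2**8
# Bernoulli set: each frequency is kept independently with probability k/N.
support = [f for f in range(N) if rng.random() < k / N]

# B_coarse | B_fine | N and B_fine * B_coarse = k^2.
B_fine, B_coarse = 2**10, 2**6
worst = max_collided_per_coarse_bucket(support, B_fine, B_coarse)
```

Since `B_coarse` divides `B_fine`, frequencies that collide in the fine bucketing also share a coarse bucket, so collided frequencies arrive in groups. The lemma says that even so, no coarse bucket accumulates more than $O(\log N)$ of them; in this sketch `worst` can be compared against `log2(N)`.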

8.5 Hashing in high dimensions by using downsampling

The main primitive we need for developing an algorithm with $\tilde{O}(k)$ sample complexity is a hashing function based on downsampling, presented below as Algorithm 6. The algorithm takes as input the list $R$ of buckets in the hashing into $\bm{B}^{\mathrm{prev}}$ (that will later be guaranteed to be a superset of the nonempty buckets in a $\bm{B}^{\mathrm{prev}}$ -hashing of the residual signal $x-\chi$ ) and outputs a list of potentially nonempty buckets in the hashing into $\bm{B}^{\mathrm{next}}$ , together with evaluations of the corresponding hashed signals at a point $\bm{\alpha}$ that is given as a parameter.

Lemma 9 (Hashing in high dimensions).

Suppose $d$ and $n$ are positive integers such that $n$ is a power of two. Suppose $\bm{B}^{\mathrm{base}}=(B^{\mathrm{base}}_{1},B^{\mathrm{base}}_{2},\dots,B^{\mathrm{base}}_{d})$ , $\bm{B}^{\mathrm{prev}}=(B^{\mathrm{prev}}_{1},B^{\mathrm{prev}}_{2},\dots,B^{\mathrm{prev}}_{d})$ and $\bm{B}^{\mathrm{next}}=(B^{\mathrm{next}}_{1},B^{\mathrm{next}}_{2},\dots,B^{\mathrm{next}}_{d})$ are vectors of powers of two such that $B^{\mathrm{prev}}_{j}\mid B^{\mathrm{next}}_{j}$ and $B^{\mathrm{base}}_{j}\mid B^{\mathrm{prev}}_{j}$ for all $j$ . Moreover, let $\bm{\alpha}\in[n]^{d}$ be a shift vector. For any signals $x,\widehat{\chi}\in{\mathbb{C}}^{n^{d}}$ suppose that

[TABLE]

where $S_{x-\chi}(\bm{B}^{\mathrm{prev}},\bm{b})$ is the $(\bm{B}^{\mathrm{prev}},\bm{b})$ -congruence class of $\mathrm{supp}\,\widehat{(x-\chi)}$ as per Definition 15. Then the procedure Hashing $(x,\widehat{\chi},n,d,\bm{B}^{\mathrm{base}},\bm{B}^{\mathrm{prev}},\bm{B}^{\mathrm{next}},\bm{\alpha},R)$ outputs a set $W$ that with probability $1-\frac{1}{n^{10d}}$ satisfies

[TABLE]

Moreover, the sample complexity of this procedure is

[TABLE]

while the time complexity of this procedure is

[TABLE]

where $m=\frac{|\bm{B}^{\mathrm{next}}|}{|\bm{B}^{\mathrm{prev}}|}\max_{\bm{b}\in[\bm{B}^{\mathrm{base}}]}\left|\left\{\bm{r}\in R:\bm{r}\equiv\bm{b}\pmod{\bm{B}^{\mathrm{base}}}\right\}\right|$ .

Proof.

Recall that the algorithm uses a coarse $\bm{B}^{\mathrm{base}}$ -bucketing to refine a $\bm{B}^{\mathrm{prev}}$ -bucketing, for which only the non-empty buckets are computed, to a $\bm{B}^{\mathrm{next}}$ -bucketing. Let $x^{\prime}=x-\chi$ and $\bm{a}=\bm{\alpha}+\bm{\beta}\cdot\frac{n}{\bm{B}^{\mathrm{next}}}$ , where $\bm{\alpha}=\bm{a}\bmod\frac{n}{\bm{B}^{\mathrm{next}}}$ . Suppose (12) holds.

Note that for any $\bm{b}\in[\bm{B}^{\mathrm{base}}]$ ,

[TABLE]

where $\bm{\phi}(\bm{b},\bm{r},\bm{s})$ denotes $\bm{b}+\bm{r}\bm{B}^{\mathrm{base}}+\bm{s}\bm{B}^{\mathrm{prev}}\in[\bm{B}^{\mathrm{next}}]$ for ease of notation. Hence, the $\bm{B}^{\mathrm{base}}$ -bucketing and the $\bm{B}^{\mathrm{next}}$ -bucketing are related via the linear system defined in (14), for any collection of values of $\bm{\beta}$ . We now show how to choose a relatively small number of values of $\bm{\beta}$ such that this linear system is well-conditioned. Note that $\frac{\bm{B}^{\mathrm{next}}}{\bm{B}^{\mathrm{prev}}}$ is a vector of length $d$ , as per our vectorial notation in Section 8.2, which denotes by how much we want to further refine each bucket of the $\bm{B}^{\mathrm{prev}}$ -bucketing.

Choosing $\bm{\beta}$ ’s that make the linear system in (14) well-conditioned:

For every $\bm{b}\in[\bm{B}^{\mathrm{base}}]$ , let $\mathbf{A}_{\bm{b}}$ denote the $\left|\bm{B}^{\mathrm{next}}\right|\times\left|\frac{\bm{B}^{\mathrm{next}}}{\bm{B}^{\mathrm{base}}}\right|$ matrix whose rows are indexed by $\bm{\beta}\in[\bm{B}^{\mathrm{next}}]$ and columns are indexed by $(\bm{r},\bm{s})\in[\frac{\bm{B}^{\mathrm{prev}}}{\bm{B}^{\mathrm{base}}}]\times[\frac{\bm{B}^{\mathrm{next}}}{\bm{B}^{\mathrm{prev}}}]$ , with entries defined by

[TABLE]

Moreover, let $\bm{v}_{\bm{b}}$ denote the column vector of length $\left|\frac{\bm{B}^{\mathrm{next}}}{\bm{B}^{\mathrm{base}}}\right|$ with entries indexed by $(\bm{r},\bm{s})\in[\frac{\bm{B}^{\mathrm{prev}}}{\bm{B}^{\mathrm{base}}}]\times[\frac{\bm{B}^{\mathrm{next}}}{\bm{B}^{\mathrm{prev}}}]$ and given by

[TABLE]

while we let $\bm{w}_{\bm{b}}$ denote the column vector of length $\left|\bm{B}^{\mathrm{next}}\right|$ with entries indexed by $\bm{\beta}\in[\bm{B}^{\mathrm{next}}]$ and given by

[TABLE]

Then, (14) implies the following linear system of equations,

[TABLE]

Next, for any $\bm{b}\in[\bm{B}^{\mathrm{base}}]$ , let

[TABLE]

By the assumption of the lemma on the set $R$ , one has that the vectors ${\bm{v}_{\bm{b}}}$ in (15) satisfy $\mathrm{supp}\leavevmode\nobreak\ {\bm{v}_{\bm{b}}}\subseteq R_{\bm{b}}$ for every $\bm{b}\in[\bm{B}^{\mathrm{base}}]$ . This shows that the linear system in (17) can be solved very efficiently by randomly sampling its rows. More formally, suppose that $S$ is a multiset such that $S=\left\{\bm{\beta}_{i}:\bm{\beta}_{i}\sim\mathrm{Unif}([\bm{B}^{\mathrm{next}}]),\forall i\in[Cm\log^{2}_{2}m\cdot d\log_{2}n]\right\}$ where $m=\max_{\bm{b}^{\prime}\in[\bm{B}^{\mathrm{base}}]}|R_{\bm{b}^{\prime}}|$ and let $\mathbf{A}^{S}_{\bm{b}}$ denote the submatrix of $\mathbf{A}_{\bm{b}}$ whose rows are selected with respect to set $S$ . Then by Theorem 4, the matrix $\frac{1}{|S|}\mathbf{A}_{\bm{b}}^{S}$ satisfies RIP of order $m$ . We will use this property to solve the system (17) efficiently.

Let $\widetilde{\mathbf{A}}_{\bm{b}}$ be a submatrix of $\mathbf{A}_{\bm{b}}$ with size $|S|\times|R_{\bm{b}}|$ . Suppose that its rows are selected with respect to set $S$ and its columns are selected with respect to $R_{\bm{b}}$ . More specifically, $\widetilde{\mathbf{A}}_{\bm{b}}$ is a $|S|\times|R_{\bm{b}}|$ matrix whose rows are indexed by $\bm{\beta}\in S$ and columns are indexed by $(\bm{r},\bm{s})\in R_{\bm{b}}$ , with entries defined by

[TABLE]

Moreover, let $\bm{v}^{(\bm{b})}$ denote the column vector of length $|R_{\bm{b}}|$ with entries indexed by elements of $R_{\bm{b}}$ and given by

[TABLE]

while we let $\bm{w}^{(\bm{b})}$ denote the column vector of length $|S|$ with entries indexed by elements of $S$ and given by

[TABLE]

Then, (17) implies that

[TABLE]

Now, because the matrix

[TABLE]

satisfies RIP of order $|R_{\bm{b}}|$ for all $\bm{b}\in[\bm{B}^{\mathrm{base}}]$ , one has that the condition number of $\widetilde{\mathbf{A}_{\bm{b}}}^{T}\widetilde{\mathbf{A}_{\bm{b}}}$ is at most $\sqrt{3}$ . Therefore, a linear least squares solver can compute $\bm{v}^{(\bm{b})}$ efficiently and in a numerically stable way using the reduced linear system in (19). Note that lines 11-14 of Algorithm 6 carry out this procedure and compute $\bm{v}^{(\bm{b})}$ for each $\bm{b}\in[\bm{B}^{\mathrm{base}}]$ .
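The idea of recovering a vector with known small support from a few randomly sampled rows of a Fourier-type matrix can be sketched as follows. This is a simplified one-dimensional stand-in: the matrix entries $e^{2\pi i\beta s/M}$ , the normal-equations solver, and the names `solve_linear_system` and `restricted_least_squares` are assumptions for the example and do not reproduce the paper's matrix $\mathbf{A}_{\bm{b}}$ or its LeastSquaresSolver.

```python
import cmath
import random


def solve_linear_system(M, rhs):
    """Gauss-Jordan elimination with partial pivoting (small complex systems)."""
    size = len(M)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(M)]
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(size):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [u - factor * v for u, v in zip(aug[r], aug[col])]
    return [aug[i][size] / aug[i][i] for i in range(size)]


def restricted_least_squares(betas, columns, w, modulus):
    """Minimize ||A v - w|| with A[beta][s] = exp(2*pi*i*beta*s/modulus), s in columns."""
    A = [[cmath.exp(2j * cmath.pi * beta * s / modulus) for s in columns]
         for beta in betas]
    rows, m = len(betas), len(columns)
    # Normal equations (A^H A) v = A^H w on the restricted columns only.
    gram = [[sum(A[i][p].conjugate() * A[i][q] for i in range(rows))
             for q in range(m)] for p in range(m)]
    rhs = [sum(A[i][p].conjugate() * w[i] for i in range(rows)) for p in range(m)]
    return solve_linear_system(gram, rhs)


rng = random.Random(7)
modulus = 64
columns = [3, 10, 41]                 # known candidate support (role of R_b)
true_v = [1 + 2j, -0.5j, 3.0]
betas = [rng.randrange(modulus) for _ in range(12)]   # sampled rows (role of S)
w = [sum(val * cmath.exp(2j * cmath.pi * beta * s / modulus)
         for s, val in zip(columns, true_v)) for beta in betas]
recovered = restricted_least_squares(betas, columns, w, modulus)
```

Because the unknown is restricted to the known columns, the system has only `len(columns)` unknowns regardless of `modulus`, which is the reason the solve is cheap once the candidate set is small.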

Computing $U^{\bm{a}}_{x-\chi}(\bm{B}^{\mathrm{base}},\bm{b})$ :

Now we show how to compute $U_{x-\chi}^{\bm{a}}(\bm{B}^{\mathrm{base}},\bm{b})$ for any $\bm{b}\in[\bm{B}^{\mathrm{base}}]$ and $\bm{a}=\bm{\alpha}+\bm{\beta}\cdot\frac{n}{\bm{B}^{\mathrm{next}}}$ . By standard downsampling properties, we have that if $Z^{(\bm{\alpha},\bm{\beta})}:[\bm{B}^{\mathrm{base}}]\to{\mathbb{C}}$ is the signal defined by

[TABLE]

then its Fourier transform is given by

[TABLE]

Hence,

[TABLE]

which demonstrates how to compute $U_{x-\chi}^{\bm{a}}(\bm{B}^{\mathrm{base}},\bm{b})$ . Note that lines 6-8 of Algorithm 6 simply compute $U_{x-\chi}^{\bm{a}}(\bm{B}^{\mathrm{base}},\bm{b})$ , for some $\bm{b}\in[\bm{B}^{\mathrm{base}}]$ with $\bm{a}=\bm{\alpha}+\bm{\beta}\cdot\frac{n}{\bm{B}^{\mathrm{next}}}$ for some $\bm{\beta}$ .
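The downsampling step can be illustrated in one dimension with a naive DFT. The sketch assumes the convention $\widehat{x}_{f}=\sum_{t}x_{t}e^{-2\pi ift/n}$ and reads the bucketed value as $\sum_{f\equiv b\ (\mathrm{mod}\ B)}\widehat{x}_{f}e^{2\pi ifa/n}$ ; both the normalization and the function names are assumptions and may differ from the paper's definitions by constant factors.

```python
import cmath


def dft(x):
    """Naive DFT: X_f = sum_t x_t exp(-2*pi*i*f*t/n)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * f * t / n) for t in range(n))
            for f in range(n)]


def inverse_dft(X):
    n = len(X)
    return [sum(X[f] * cmath.exp(2j * cmath.pi * f * t / n) for f in range(n)) / n
            for t in range(n)]


def bucket_by_downsampling(x, B, a):
    """Bucket values from only B time samples: x at a, a + n/B, a + 2n/B, ..."""
    n = len(x)
    stride = n // B
    z = [x[(a + j * stride) % n] for j in range(B)]
    return [(n / B) * value for value in dft(z)]


def bucket_directly(X, B, a):
    """Reference: sum the shifted Fourier coefficients in each residue class."""
    n = len(X)
    U = [0j] * B
    for f, coeff in enumerate(X):
        U[f % B] += coeff * cmath.exp(2j * cmath.pi * f * a / n)
    return U


n, B, a = 32, 4, 5
X = [0j] * n
X[3], X[7], X[18] = 2 + 1j, -1j, 0.5
x = inverse_dft(X)
fast = bucket_by_downsampling(x, B, a)
slow = bucket_directly(X, B, a)
```

The two computations agree, and buckets 0 and 1 are empty because no frequency in the support is congruent to 0 or 1 modulo 4. Only `B` time-domain samples are read by `bucket_by_downsampling`, which is where the sample complexity of $O(|\bm{B}^{\mathrm{base}}|)$ per shift comes from.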

Sample complexity and Runtime:

Lines 6-8 of Algorithm 6 compute $U_{x-\chi}^{\bm{a}}(\bm{B}^{\mathrm{base}},\bm{b})$ , for some $\bm{b}\in[\bm{B}^{\mathrm{base}}]$ with $\bm{a}=\bm{\alpha}+\bm{\beta}\cdot\frac{n}{\bm{B}^{\mathrm{next}}}$ for some $\bm{\beta}$ , in time $O(\left|\bm{B}^{\mathrm{base}}\right|\log_{2}\left(|\bm{B}^{\mathrm{base}}|\right)+\|\widehat{\chi}\|_{0})$ and with sample complexity $O\left(\left|\bm{B}^{\mathrm{base}}\right|\right)$ , according to the rule (20). This shows that the vector $\bm{w}_{\bm{b}}$ in (16) can be constructed efficiently.

Note that lines 11-14 of Algorithm 6 solve a least squares problem and compute $\bm{v}^{(\bm{b})}$ for each $\bm{b}\in[\bm{B}^{\mathrm{base}}]$ in time $O(|S|^{3})$ , as the time complexity of the LeastSquaresSolver procedure is $O(|S|^{3})$ . Moreover, by (18), it follows that for a fixed $\bm{b}\in[\bm{B}^{\mathrm{base}}]$ , line 16 simply adds all pairs $(\bm{b}^{\prime},U_{x-\chi}^{\bm{\alpha}}(\bm{B}^{\mathrm{next}},\bm{b}^{\prime}))$ with $\bm{b}^{\prime}\in[\bm{B}^{\mathrm{next}}]$ satisfying $(\bm{b}^{\prime}\bmod\bm{B}^{\mathrm{prev}})\in R$ . Also, any $\bm{b}^{\prime}\in[\bm{B}^{\mathrm{next}}]$ for which there exists $\bm{f}\in\mathrm{supp}\,\widehat{(x-\chi)}$ with $\bm{f}\equiv\bm{b}^{\prime}\pmod{\bm{B}^{\mathrm{next}}}$ must satisfy $\bm{f}\equiv\bm{b}^{\prime}\pmod{\bm{B}^{\mathrm{prev}}}$ and, hence, $S_{x-\chi}(\bm{B}^{\mathrm{prev}},\bm{b}^{\prime}\bmod\bm{B}^{\mathrm{prev}})\neq\emptyset$ . This shows that $(\bm{b}^{\prime}\bmod\bm{B}^{\mathrm{prev}})\in R$ (by (12)). Thus, it follows that after looping over all $\bm{b}\in[\bm{B}^{\mathrm{base}}]$ , the final $W$ satisfies (13), as desired.

Note that the sample complexity of Algorithm 6 is determined by the total number of samples required to construct the various $Z^{(\bm{\alpha},\bm{\beta})}$ . For any fixed $\bm{\beta}\in S$ , constructing $Z^{(\bm{\alpha},\bm{\beta})}_{\bm{j}}$ requires $|\bm{B}^{\mathrm{base}}|$ samples from $x$ . Since there are $|S|$ values of $\bm{\beta}$ that are relevant, it follows that the total sample complexity is

[TABLE]

The time complexity of this procedure is due to two computations. First, constructing each $\widehat{Z}^{(\bm{\alpha},\bm{\beta})}$ for a fixed $\bm{\beta}$ takes time $O(\left|\bm{B}^{\mathrm{base}}\right|\log_{2}\left(|\bm{B}^{\mathrm{base}}|\right)+\|\widehat{\chi}\|_{0})$ . Second, computing the $\bm{v}^{(\bm{b})}$ vector for each fixed $\bm{b}\in[\bm{B}^{\mathrm{base}}]$ requires time $O(|S|^{3})$ . Therefore, the total time complexity is

[TABLE]

∎

8.6 Resolving buckets in the hashed signal

The other major building block we need for developing a sparse FFT algorithm is a function that tests bucketings of signals with various shifts for emptiness and one-sparsity. Such a primitive takes in a list of buckets of a hashed signal and determines whether each bucket is empty or not. If a bucket is not empty, then we determine whether the bucket consists of exactly one frequency using a one-sparse test. If so, we can determine this frequency and the value of the signal at this frequency from the bucketed signals. If not, then we retain the bucket for the next iteration, in which we will hash to more buckets.

Lemma 10.

(TestBuckets in high dimensions) Suppose $d$ and $n$ are positive integers such that $n$ is a power of two. Suppose $\bm{B}=(B_{1},B_{2},\dots,B_{d})$ is a vector of powers of two such that $B_{j}\mid n$ for all $j$ . Suppose $x\in{\mathbb{C}}^{n^{d}}$ is a signal such that $W_{\bm{\alpha}}(\bm{b})=U_{x}^{\bm{\alpha}}(\bm{B},\bm{b})$ is a $(\bm{B},\bm{b})$ -bucketing of $x$ with shift $\bm{\alpha}$ for all $\bm{\alpha}\in\mathcal{A}$ and $\bm{b}\in\mathrm{Dom}(W_{\bm{\alpha}})$ where $\mathcal{A}$ is a multiset of $q$ i.i.d. samples from $\mathrm{Unif}([n]^{d})$ for some

[TABLE]

and $S_{x}(\bm{B},\bm{b})$ for all $\bm{b}\in[\bm{B}]$ are Congruence classes of $\mathrm{supp}\leavevmode\nobreak\ {\widehat{x}}$ . Also suppose that Algorithm 7 takes in the quantities $W_{\bm{\alpha}}(\bm{b})$ for all $\bm{\alpha}\in\{e_{1},\dots,e_{d}\}$ , standard basis vectors in $[n]^{d}$ . Then, TestBuckets $(\{W_{\bm{\alpha}}\}_{\bm{\alpha}\in\mathcal{A}\cup\{e_{1},\dots,e_{d}\}},n,d,\bm{B})$ returns $\widehat{\chi}$ and $R$ that with probability $1-\frac{1}{n^{10d}}$ satisfy the following:

• For any $\bm{b}\in\mathrm{Dom}(W_{\bm{\alpha}})$ such that $S_{x}(\bm{B},\bm{b})$ is a singleton set, $S_{x}(\bm{B},\bm{b})=\{\bm{f}\}$ , we have $\widehat{\chi}_{\bm{f}}=\widehat{x}_{\bm{f}}$ .

• We have $R=\left\{\bm{b}\in\mathrm{Dom}(W_{\bm{\alpha}}):|S_{x}(\bm{B},\bm{b})|\geq 2\right\}$ .

Moreover, the runtime of this procedure is $O\left(|\mathcal{A}|\cdot|\mathrm{Dom}(W_{\bm{\alpha}})|\right)$ .

Proof.

Let $F^{-1}_{N}$ be the matrix of the $d$ -dimensional inverse Fourier transform with $N=n^{d}$ points. The matrix $M={\sqrt{N}}F^{-1}_{N}$ is a unitary matrix and all of its elements have absolute value $\frac{1}{\sqrt{N}}$ . If we let $M_{\mathcal{A}}$ denote the submatrix of $M$ whose rows are sampled from $M$ according to the multiset $\mathcal{A}$ , then by Theorem 4, $\frac{\sqrt{N}}{|\mathcal{A}|}M_{\mathcal{A}}$ satisfies the restricted isometry property of order $\max_{\bm{b}\in[\bm{B}]}|S_{x}(\bm{B},\bm{b})|+1$ with probability $1-\frac{1}{N^{10}}$ . In the rest of the proof we condition on the event that the matrix $\frac{\sqrt{N}}{|\mathcal{A}|}M_{\mathcal{A}}$ satisfies RIP of order $\max_{\bm{b}\in[\bm{B}]}|S_{x}(\bm{B},\bm{b})|+1$ .

Now, note that by definition of $U_{x}(\bm{B},\bm{b})$ and $S_{x}(\bm{B},\bm{b})$ (definitions 16 and 15 respectively) we have,

[TABLE]

therefore, for every $\bm{b}\in\mathrm{Dom}(W_{\bm{\alpha}})$ the following holds true,

[TABLE]

thus the zero test in line 5 of Algorithm 7 works correctly for all buckets.

Now note that if $S_{x}(\bm{B},\bm{b})$ is a singleton set $\{\bm{f}\}$ then,

[TABLE]

Therefore, for every $q=1,2,\dots,d$ , the following holds,

[TABLE]

thus, $\frac{n}{2\pi}\cdot\phi\left(\frac{W_{e_{q}}(\bm{b})}{W_{0}(\bm{b})}\right)=f_{q}$ . Also note that $W_{0}(\bm{b})=\widehat{x}_{\bm{f}}$ . This is precisely what is implemented in lines 6-7 of Algorithm 7. On the other hand, if the hypothesis that $S_{x}(\bm{B},\bm{b})$ is a singleton set is incorrect, the algorithm will detect this. Using the notation $v=W_{0}(\bm{b})$ as in line 7 of Algorithm 7,

[TABLE]

where $\widehat{x}^{\prime}=\widehat{x}(\cdot)-v\delta_{\bm{f}}(\cdot)$ . Because $\widehat{x}^{\prime}_{S_{x}(\bm{B},\bm{b})\cup\{\bm{f}\}}$ is at most $\left(\max_{\bm{b}\in[\bm{B}]}|S_{x}(\bm{B},\bm{b})|+1\right)$ -sparse and the matrix $\frac{\sqrt{N}}{|\mathcal{A}|}M_{\mathcal{A}}$ satisfies RIP of order $\max_{\bm{b}\in[\bm{B}]}|S_{x}(\bm{B},\bm{b})|+1$ , we have that

[TABLE]

thus, the one sparse test in line 8 of Algorithm 7 works correctly for all buckets.

It is straightforward to see that the runtime of this procedure is $O\left(|\mathcal{A}|\cdot|\mathrm{Dom}(W_{\bm{\alpha}})|\right)$ . ∎
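The three steps used in the proof (zero test, frequency recovery from phase ratios, one-sparse test) can be illustrated on a single bucket. The sketch below is not Algorithm 7 itself. It assumes that the value of a bucket under shift $\bm{\alpha}$ equals $\sum_{\bm{f}}\widehat{x}_{\bm{f}}e^{2\pi i\bm{f}^{T}\bm{\alpha}/n}$ over the frequencies hashed to that bucket, and the names `bucket_value` and `classify_bucket`, the tolerance `tol`, and the number of random shifts are hypothetical choices made for illustration.

```python
import numpy as np

def bucket_value(freqs, coeffs, alpha, n):
    # Assumed model of one bucket under shift alpha:
    # sum over the frequencies f in the bucket of coeff_f * exp(2*pi*i*<f, alpha>/n).
    return np.sum(coeffs * np.exp(2j * np.pi * (freqs @ alpha) / n))

def classify_bucket(freqs, coeffs, n, num_shifts, rng, tol=1e-9):
    d = freqs.shape[1]
    shifts = rng.integers(0, n, size=(num_shifts, d))  # random shifts (the multiset A)
    w_rand = np.array([bucket_value(freqs, coeffs, a, n) for a in shifts])
    # Zero test: an empty bucket is zero under every shift.
    if np.all(np.abs(w_rand) < tol):
        return "empty", None, None
    w0 = bucket_value(freqs, coeffs, np.zeros(d, dtype=int), n)
    if abs(w0) < tol:
        return "multi", None, None
    # Candidate frequency: coordinate q is read off the phase of W_{e_q} / W_0.
    f_hat = np.zeros(d, dtype=int)
    for q in range(d):
        e_q = np.zeros(d, dtype=int)
        e_q[q] = 1
        phase = np.angle(bucket_value(freqs, coeffs, e_q, n) / w0) % (2 * np.pi)
        f_hat[q] = int(round(n * phase / (2 * np.pi))) % n
    # One-sparse test: a single frequency must explain all random shifts.
    predicted = w0 * np.exp(2j * np.pi * (shifts @ f_hat) / n)
    if np.allclose(w_rand, predicted, atol=tol):
        return "single", f_hat, w0
    return "multi", None, None

rng = np.random.default_rng(0)
n = 8
single = classify_bucket(np.array([[3, 5]]), np.array([2 - 1j]), n, 16, rng)
multi = classify_bucket(np.array([[3, 5], [1, 2]]), np.array([1.0, 1.0]), n, 16, rng)
empty = classify_bucket(np.empty((0, 2), dtype=int), np.empty(0), n, 16, rng)
```

In this toy setting the bucket contents are passed in explicitly so that the three outcomes can be checked. In the lemma the procedure only sees the values $W_{\bm{\alpha}}(\bm{b})$ , and the correctness of the one-sparse test rests on the RIP argument above rather than on an exact comparison.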

8.7 Sparse FFT for signals with random support in nearly linear time

Now, we are ready to present the main theorem of this section.

\sfftalg

The theorem is a consequence of the following lemma.

Lemma 11.

Let $\bm{B}^{\mathrm{base},(t)}$ , $\bm{B}^{\mathrm{prev},(t)}$ , $\bm{B}^{\mathrm{next},(t)}$ , $R^{(t)}$ , and $\chi^{(t)}$ denote the values of $\bm{B}^{\mathrm{base}}$ , $\bm{B}^{\mathrm{prev}}$ , $\bm{B}^{\mathrm{next}}$ , $R$ , and $\chi$ , respectively, at the start of iteration $t$ of the main for loop in Algorithm 8. Then, for all $t=0,1,\dots,L$ , we define the event $\mathcal{E}_{t}$ to be the occurrence of the following statements:

1. $R^{(t)}=\{\bm{b}\in[\bm{B}^{\mathrm{prev},(t)}]:S_{x-\chi^{(t)}}(\bm{B}^{\mathrm{prev},(t)},\bm{b})\neq\emptyset\}$ .

2. $\mathrm{supp}\leavevmode\nobreak\ (\widehat{x-\chi^{(t)}})\subseteq\mathrm{supp}\leavevmode\nobreak\ \widehat{x}$ and $\mathrm{supp}\leavevmode\nobreak\ {\widehat{\chi}^{(t)}}\cap\mathrm{supp}\leavevmode\nobreak\ (\widehat{x-\chi^{(t)}})=\emptyset$ .

3. If $t>0$ then $\left|S_{x-\chi^{(t)}}\left(\bm{B}^{\mathrm{prev},(t)},\bm{\xi}\pmod{\bm{B}^{\mathrm{prev},(t)}}\right)\right|\geq 2$ for every $\bm{\xi}\in\mathrm{supp}\leavevmode\nobreak\ {(\widehat{X}-\widehat{\chi}^{(t)})}$ .

Then, $\mathcal{E}_{0}$ holds with probability 1, while $\mathrm{Pr}[\mathcal{E}_{t}\mid\mathcal{E}_{0},\mathcal{E}_{1},\dots,\mathcal{E}_{t-1}]\geq 1-\frac{1}{n^{2d}}$ for $t=1,\dots,L$ .

Proof.

Note that for $t=0$ , we have $R^{(t)}=\left[\bm{B}^{\mathrm{prev},(0)}\right]$ . Thus, condition (1) trivially holds. Condition (2) also trivially holds, since $\widehat{\chi}^{(0)}=0$ . Furthermore, (3) does not apply for event $\mathcal{E}_{0}$ . Thus, $\mathcal{E}_{0}$ holds with probability 1.

Now, assume that $\mathcal{E}_{0},\mathcal{E}_{1},\dots,\mathcal{E}_{m}$ hold for some $m\geq 0$ . We consider the probability of $\mathcal{E}_{m+1}$ occurring, conditioned on the aforementioned events. Note that it follows from the values that Algorithm 8 assigns to $\bm{B}^{\mathrm{base},(m)}$ , $\bm{B}^{\mathrm{prev},(m)}$ and $\bm{B}^{\mathrm{next},(m)}$ along with condition (1) of the inductive hypothesis that Lemma 9 can be applied to invocations of Hashing $(x,\widehat{\chi}^{(m)},n,d,\bm{B}^{\mathrm{base},(m)},\bm{B}^{\mathrm{prev},(m)},\bm{B}^{\mathrm{next},(m)},\bm{\alpha},R)$ in line 13 of Algorithm 8, and hence the output of Hashing procedure satisfies the following,

[TABLE]

Moreover, it is clear that for every $\bm{\alpha}\in[n]^{d}$ and every $\bm{b}^{\prime}\in[\bm{B}^{\mathrm{next},(m)}]$ , one has that $W_{\bm{\alpha}}(\bm{b}^{\prime})$ is $(\bm{B}^{\mathrm{next},(m)},\bm{b}^{\prime})$ -bucketing of $x-\chi^{(m)}$ .

Now, note that by condition (2) of the inductive hypothesis along with definition of Congruence class presented in Definition 15, it follows that $S_{x-\chi^{(m)}}(\bm{B}^{\mathrm{next},(m)},\bm{b})\subseteq S_{x}(\bm{B}^{\mathrm{next},(m)},\bm{b})$ for all $\bm{b}\in[\bm{B}^{\mathrm{next},(m)}]$ . Hence by Lemma 12, with probability $1-\frac{1}{n^{3d}}$ ,

[TABLE]

This shows that the set $\mathcal{A}$ defined in line 12 of Algorithm 8 satisfies the precondition of Lemma 10. Therefore, by Lemma 10, a call to the TestBuckets procedure in line 14 of Algorithm 8, with probability $1-\frac{1}{n^{10d}}$ , outputs $(\widehat{\chi}^{\prime},R^{(m+1)})$ such that the following hold:

(a) For any $\bm{b}^{\prime}\in\mathrm{Dom}(W_{\bm{\alpha}})$ such that $S_{x-\chi^{(m)}}(\bm{B}^{\mathrm{next},(m)},\bm{b}^{\prime})$ is a singleton set, $S_{x-\chi^{(m)}}(\bm{B}^{\mathrm{next},(m)},\bm{b}^{\prime})=\{\bm{f}\}$ , one has $\widehat{\chi}^{\prime}_{\bm{f}}=(\widehat{x-\chi^{(m)}})_{\bm{f}}$ .

(b) $R^{(m+1)}=\left\{\bm{b}^{\prime}\in\mathrm{Dom}(W_{\bm{\alpha}}):|S_{x-\chi^{(m)}}(\bm{B}^{\mathrm{next},(m)},\bm{b}^{\prime})|\geq 2\right\}$ .

In order to complete the inductive step, it suffices to show that (a) and (b) imply conditions (1), (2), and (3) in the definition of $\mathcal{E}_{t}$ for $t=m+1$ .

Observe that condition (a) implies that $\mathrm{supp}\leavevmode\nobreak\ {\widehat{\chi}^{\prime}}\subseteq\mathrm{supp}\leavevmode\nobreak\ (\widehat{x-\chi^{(m)}})$ . Therefore, since $\chi^{(m+1)}=\chi^{(m)}+\chi^{\prime}$ (by line 15), it follows that $\mathrm{supp}\leavevmode\nobreak\ (\widehat{x-\chi^{(m+1)}})\subseteq\mathrm{supp}\leavevmode\nobreak\ (\widehat{x-\chi^{(m)}})$ . Hence, by the inductive hypothesis $\mathcal{E}_{m}$ we have $\mathrm{supp}\leavevmode\nobreak\ (\widehat{x-\chi^{(m+1)}})\subseteq\mathrm{supp}\leavevmode\nobreak\ {\widehat{x}}$ . Also note that for every $\bm{f}\in\mathrm{supp}\leavevmode\nobreak\ (\widehat{x-\chi^{(m+1)}})$ we have that $\bm{f}\in\mathrm{supp}\leavevmode\nobreak\ (\widehat{x-\chi^{(m)}})$ and hence by condition (2) of the inductive hypothesis $\mathcal{E}_{m}$ , one has $\widehat{\chi}^{(m)}_{\bm{f}}=0$ . Condition (a) implies that $\widehat{\chi}^{\prime}_{\bm{f}}=0$ for every $\bm{f}\in\mathrm{supp}\leavevmode\nobreak\ (\widehat{x-\chi^{(m+1)}})$ and hence $\widehat{\chi}^{(m+1)}_{\bm{f}}=\widehat{\chi}^{(m)}_{\bm{f}}+\widehat{\chi}^{\prime}_{\bm{f}}=0$ for every such $\bm{f}$ . This establishes condition (2) of $\mathcal{E}_{t}$ for $t=m+1$ .

Next, note that $\bm{B}^{\mathrm{prev},(m+1)}=\bm{B}^{\mathrm{next},(m)}$ (by line 16). Condition (a), along with condition (1) of the inductive hypothesis $\mathcal{E}_{m}$ and (21), implies that there exists no $\bm{b}^{\prime}\in[\bm{B}^{\mathrm{prev},(m+1)}]$ such that $|S_{x-\chi^{(m+1)}}(\bm{B}^{\mathrm{prev},(m+1)},\bm{b}^{\prime})|=1$ . Also note that condition (b) implies that $|S_{x-\chi^{(m+1)}}(\bm{B}^{\mathrm{prev},(m+1)},\bm{b}^{\prime})|\geq 2$ for every $\bm{b}^{\prime}\in R^{(m+1)}$ . Therefore, $R^{(m+1)}$ satisfies condition (1) of $\mathcal{E}_{t}$ for $t=m+1$ . This also establishes condition (3) of $\mathcal{E}_{t}$ for $t=m+1$ .

By a union bound we have that with probability $1-\frac{1}{n^{3d}}-\frac{1}{n^{10d}}\geq 1-\frac{1}{n^{2d}}$ , event $\mathcal{E}_{m+1}$ holds true as desired. ∎

Now we are ready to prove Theorem 2.2.

Proof of Theorem 2.2.

Note that by Lemma 11, there exist events $\mathcal{E}_{0},\mathcal{E}_{1},\dots,\mathcal{E}_{L}$ such that $\mathrm{Pr}[\mathcal{E}_{0}]=1$ and $\mathrm{Pr}[\mathcal{E}_{t}\mid\mathcal{E}_{0},\mathcal{E}_{1},\dots,\mathcal{E}_{t-1}]\geq 1-\frac{1}{n^{2d}}$ for $t=1,2,\dots,L$ . Observe that

[TABLE]

Note that condition (3) of $\mathcal{E}_{L}$ implies that the existence of a $\bm{\xi}\in[n]^{d}$ such that $\widehat{\chi}^{(L)}_{\bm{\xi}}\neq\widehat{x}_{\bm{\xi}}$ requires $S^{(\bm{B}^{\mathrm{prev},(L)})}\neq\emptyset$ . Now, recall that after the main for loop in Algorithm 8 finishes execution, we have

[TABLE]

Thus, by Lemma 14, we have that $\mathbb{E}\left[\left|S^{\left(\bm{B}^{\mathrm{prev}}\right)}\right|\right]\leq\frac{k^{2}}{k\cdot\Gamma^{L+2}}=\frac{k}{\Gamma^{L+2}}\leq\frac{1}{100}$ , by our choice of $\Gamma$ and $L$ . Thus, by Markov’s inequality, with probability $\geq\frac{99}{100}$ over the randomness in the choice of $S=\mathrm{supp}\leavevmode\nobreak\ {\widehat{x}}$ , we have that $S^{\left(\bm{B}^{\mathrm{prev}}\right)}=\emptyset$ . Hence, by a union bound, we have that $\mathrm{Pr}\left[\mathcal{E}_{L}\land\left(S^{(\bm{B}^{\mathrm{prev}})}=\emptyset\right)\right]\geq\frac{9}{10}$ . Thus, by condition (3) in Lemma 11, we see that with probability $\geq\frac{9}{10}$ , the output $\widehat{\chi}$ of Algorithm 8 satisfies $\widehat{\chi}_{\bm{f}}=\widehat{x}_{\bm{f}}$ for all $\bm{f}\in[n]^{d}$ , which proves the correctness of Algorithm 8.

Now, let us compute the sample complexity of Algorithm 8. Note that for each iteration $t$ of the main for loop in SparseFFT, we have $\frac{|\bm{B}^{\mathrm{next}}|}{|\bm{B}^{\mathrm{prev}}|}=\Gamma$ . Also, by condition (1) of $\mathcal{E}_{t}$ in Lemma 11,

[TABLE]

Therefore, since $|\bm{B}^{\mathrm{prev}}|\cdot|\bm{B}^{\mathrm{base}}|\geq k^{2}$ , by Lemma 8, we have that

[TABLE]

with probability $\geq 1-\frac{1}{N^{3}}$ .

Moreover, $|\bm{B}^{\mathrm{base},(t)}|=O\left(\frac{k}{\Gamma^{t}}\right)$ . Therefore, by Lemma 9, each call to Hashing in the $t$ -th iteration has sample complexity

[TABLE]

Hence, because in each iteration Hashing is invoked $O((\log^{2}N)(\log\log N)^{2})$ times, the total sample complexity for the algorithm is

[TABLE]

since $\Gamma=O(1)$ .

Finally, we compute the time complexity of Algorithm 8. By Lemma 9, we have that the time complexity for each call to Hashing in the $t$ -th iteration of the main for loop is

[TABLE]

Thus, the total time complexity due to calls to Hashing is

[TABLE]

which can be simplified as

[TABLE]

since $\Gamma=O(1)$ . Moreover, the call to TestBuckets in the $t$ -th iteration of the main for loop has time complexity

[TABLE]

Hence, the total time complexity due to calls to TestBuckets is

[TABLE]

Therefore, by (22) and (23), the total time complexity of Algorithm 8 is

[TABLE]

as desired. ∎
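The simplifications "since $\Gamma=O(1)$" in the proof above rest on the fact that quantities shrinking like $k/\Gamma^{t}$ sum to $O(k)$ over the iterations. A minimal numeric sketch of this geometric sum follows; the function name `total_base_size` and the concrete values of `k`, `gamma`, and `num_iters` are hypothetical and are not the parameter settings of Algorithm 8.

```python
def total_base_size(k, gamma, num_iters):
    # Sum of k / gamma^t for t = 0, ..., num_iters (one term per iteration).
    return sum(k / gamma ** t for t in range(num_iters + 1))

k, gamma, num_iters = 1024, 4, 5
total = total_base_size(k, gamma, num_iters)
# The infinite geometric series gives the bound k * gamma / (gamma - 1).
geometric_bound = k * gamma / (gamma - 1)
```

For any constant `gamma` greater than 1 the bound is a constant multiple of `k`, independent of the number of iterations.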

Acknowledgements

Michael Kapralov is supported in part by ERC Starting Grant 759471.

Appendix A Proofs and pseudocode omitted from section 4

Proof of Lemma 4: Let $v$ be a leaf of $T$ , let $l=l_{T}(v)$ denote the level of $v$ , let $r$ denote the root of $T$ , and let $v_{0},v_{1},\ldots,v_{l}$ denote the path from root to $v$ in $T$ , where $v_{0}=r$ and $v_{l}=v$ . Let $q^{*}$ denote the smallest positive integer such that $l\leq q^{*}\cdot\log_{2}n$ . Note that $q^{*}\leq d$ .

For $q\in\{0,1,\ldots,d\}$ let ${T}^{(q)}$ be a subtree of $T^{\mathrm{full}}_{N}$ which denotes the result of truncating $T$ to contain only the nodes that are at distance at most $q\log_{2}n$ from the root.

We construct the $(v,{T})$ -isolating filter $\widehat{G}$ iteratively by starting with $\widehat{G}^{(0)}=1$ and refining $\widehat{G}^{(q-1)}$ to $\widehat{G}^{(q)}$ over $q^{*}$ steps. The filters $\widehat{G}^{(q)}$ will be $(v_{q\cdot\log_{2}n},{T}^{(q)})$ -isolating for $q=0,1,\ldots,q^{*}-1$ and $\widehat{G}^{(q^{*})}$ will be $(v_{l},{T}^{(q^{*})})$ -isolating. Since ${T}^{(q^{*})}={T}$ and $v_{l}=v$ , the filter $\widehat{G}^{(q^{*})}$ will be $(v,{T})$ -isolating, as required.

For every $q\in\{1,...,q^{*}\}$ let $T_{q}^{v}$ be the subtree of $T$ which is rooted at $v_{(q-1)\cdot\log_{2}n}$ and is restricted to contain only the nodes that are at distance at most $\log_{2}n$ from $v_{(q-1)\cdot\log_{2}n}$ . For every node $u\in T_{q}^{v}$ the label of $u$ is defined to be $f_{u}=(\bm{f}_{u})_{q}$ , i.e., the $q$ th coordinate of $\bm{f}_{u}$ , where $\bm{f}_{u}$ is the label of node $u$ in tree $T$ .

We now define $\widehat{G}^{(q)}$ for $q=1,\ldots,q^{*}$ . We start by letting $\widehat{G}^{(0)}=1$ and letting for every $\bm{f}=(f_{1},\ldots,f_{q},\ldots,f_{d})\in[n]^{d}$

[TABLE]

where $\widehat{G}_{q}$ is a $(v_{q\cdot\log_{2}n},T_{q}^{v})$ -isolating filter for all $q=1,...,q^{*}-1$ and $\widehat{G}_{q^{*}}$ is a $(v_{l},T_{q^{*}}^{v})$ -isolating filter. By Lemma 3, for every $q=1,\ldots,q^{*}$ there exists such a $G_{q}$ with $|\mathrm{supp}\leavevmode\nobreak\ {G_{q}}|=2^{w_{T_{q}^{v}}(v_{q\cdot\log_{2}n})}$ , and it can be constructed in time $O(2^{w_{T_{q}^{v}}(v_{q\cdot\log_{2}n})}+\log_{2}n)$ . Such a filter can be computed in Fourier domain at any desired frequency in time $O(\log_{2}n)$ . Note that $\widehat{G}^{(q)}$ is a tensor product of $q$ filters in dimension one. We now show by induction on $q$ that $\widehat{G}^{(q)}$ is a $(v_{q\cdot\log_{2}n},{T}^{(q)})$ -isolating filter.

The base of the induction is provided by $q=0$ : since $v_{0}$ is the root of ${T}^{(0)}$ , we have that $\operatorname{\mathrm{FrequencyCone}}_{{T}^{(0)}}(v_{0})=[n]^{d}$ and $\widehat{G}^{(0)}\equiv 1$ as required.

We now prove the inductive step: $q-1\to q$ . We first show that $\widehat{G}^{(q)}_{\bm{f}^{\prime}}=0$ for every $\bm{f}^{\prime}\in\bigcup_{\begin{subarray}{c}u\neq v_{q\cdot\log_{2}n}\\ u:\text{\leavevmode\nobreak\ leaf of\leavevmode\nobreak\ }T^{(q)}\end{subarray}}\operatorname{\mathrm{FrequencyCone}}_{T^{(q)}}(u)$ . Let $u$ be a leaf of $T^{(q)}$ distinct from $v_{q\cdot\log_{2}n}$ . Let $u^{\prime}$ denote the leaf of ${T}^{(q-1)}$ which is the ancestor of $u$ . We consider two cases.

Case 1: $\bm{f}^{\prime}\not\in\operatorname{\mathrm{FrequencyCone}}_{{T}^{(q-1)}}(v_{(q-1)\log_{2}n})$

Suppose that $u^{\prime}\neq v_{(q-1)\cdot\log_{2}n}$ . Note that $l_{T}(u^{\prime})\leq(q-1)\log_{2}n$ , and also note that

[TABLE]

Thus for every $\bm{f}^{\prime}\in\operatorname{\mathrm{FrequencyCone}}_{T^{(q)}}(u)$ it is true that $\bm{f}^{\prime}\in\operatorname{\mathrm{FrequencyCone}}_{{T}^{(q-1)}}(u^{\prime})$ . By the inductive hypothesis we have that $\widehat{G}^{(q-1)}$ is $(v_{(q-1)\log_{2}n},{T}^{(q-1)})$ -isolating, and hence by the assumption of $u^{\prime}\neq v_{(q-1)\cdot\log_{2}n}$ , one has $\widehat{G}^{(q-1)}(\bm{f}^{\prime})=0$ for every such $\bm{f}^{\prime}$ , and thus $\widehat{G}^{(q)}(\bm{f}^{\prime})=\widehat{G}^{(q-1)}(\bm{f}^{\prime})\cdot\widehat{G}_{q}(f^{\prime}_{q})=0$ as required.

Case 2: $\bm{f}^{\prime}\in\operatorname{\mathrm{FrequencyCone}}_{{T}^{(q-1)}}(v_{(q-1)\log_{2}n})$

Suppose that $v_{(q-1)\cdot\log_{2}n}$ is an ancestor of $u$ . Therefore, by definition of $T_{q}^{v}$ , one can see that $u$ is a leaf in $T_{q}^{v}$ . Hence, by definition of $T^{v}_{q}$ , for every $\bm{f}^{\prime}\in\operatorname{\mathrm{FrequencyCone}}_{T^{(q)}}(u)$ , it is true that $f^{\prime}_{q}\in\operatorname{\mathrm{FrequencyCone}}_{T_{q}^{v}}(u)$ . Recall that $\widehat{G}_{q}$ is a $(v_{q\cdot\log_{2}n},T_{q}^{v})$ -isolating filter and therefore, $\widehat{G}_{q}(f^{\prime}_{q})=0$ , and thus $\widehat{G}^{(q)}(\bm{f}^{\prime})=\widehat{G}^{(q-1)}(\bm{f}^{\prime})\cdot\widehat{G}_{q}(f^{\prime}_{q})=0$ as required.

Now we show that $\widehat{G}^{(q)}_{\bm{f}}=1$ for all $\bm{f}\in\operatorname{\mathrm{FrequencyCone}}_{T^{(q)}}(v_{q\cdot\log_{2}n})$ . Note that $v_{q\cdot\log_{2}n}$ is a leaf in $T_{q}^{v}$ . Hence, for every $\bm{f}\in\operatorname{\mathrm{FrequencyCone}}_{T^{(q)}}(v_{q\cdot\log_{2}n})$ , it is true that $f_{q}\in\operatorname{\mathrm{FrequencyCone}}_{T_{q}^{v}}(v_{q\cdot\log_{2}n})$ . Since $\widehat{G}_{q}$ is a $(v_{q\cdot\log_{2}n},T_{q}^{v})$ -isolating filter, $\widehat{G}_{q}(f_{q})=1$ . Now, note that

[TABLE]

Thus for every $\bm{f}\in\operatorname{\mathrm{FrequencyCone}}_{T^{(q)}}(v_{q\cdot\log_{2}n})$ it is true that $\bm{f}\in\operatorname{\mathrm{FrequencyCone}}_{{T}^{(q-1)}}(v_{(q-1)\cdot\log_{2}n})$ . By the inductive hypothesis we have that $\widehat{G}^{(q-1)}$ is $(v_{(q-1)\log_{2}n},{T}^{(q-1)})$ -isolating, and hence $\widehat{G}^{(q-1)}(\bm{f})=1$ , and thus $\widehat{G}^{(q)}(\bm{f})=\widehat{G}^{(q-1)}(\bm{f})\cdot\widehat{G}_{q}(f_{q})=1$ as required.

It remains to note that $w_{{T}}({v})=\sum_{q=1}^{q^{*}}w_{T_{q}^{v}}({v_{q\cdot\log_{2}n}})$ . By Lemma 3, for every $q\in\{1,...,q^{*}\}$ one has $|\mathrm{supp}\leavevmode\nobreak\ G_{q}|=2^{w_{T_{q}^{v}}(v_{q\cdot\log_{2}n})}$ , so $|\mathrm{supp}\leavevmode\nobreak\ G|=2^{w_{T}(v)}$ , as required (note that the support size of the convolution of two filters is at most the product of support sizes of each filter).

The total runtime for constructing this filter has two parts. The first part is the computation time of the $G_{q}$ ’s for all $q\in\{1,...,q^{*}\}$ , which takes $\sum_{q=1}^{q^{*}}O\left(2^{w_{T_{q}^{v}}({v_{q\cdot\log_{2}n}})}+\log_{2}n\right)=O\left(2^{w_{T}(v)}+d\log_{2}n\right)$ by Lemma 3. The second part is the time needed for computing the tensor product of all $G_{q}$ ’s, which is $O\left(\|G_{1}\|_{0}\cdot...\cdot\|G_{q^{*}}\|_{0}\right)=O(2^{w_{T}(v)})$ . Therefore the total runtime is $O\left(2^{w_{T}(v)}+d\log_{2}n\right)$ . Moreover, the total time for computing $\widehat{G}(\bm{\xi})$ is the sum of the times needed for computing all $\widehat{G}_{q}(\xi_{q})$ ’s for $q=1,\cdots,q^{*}$ , which is $O(d\log_{2}n)$ by Lemma 3.

∎
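The construction above relies on two facts about tensor products: the Fourier transform of a tensor product factors into one-dimensional transforms, and its time-domain support size is the product of the one-dimensional support sizes. The sketch below checks both for $d=2$ on a toy example. The filters `g1` and `g2` are hypothetical two-point aliasing filters chosen for illustration, not the filters of Lemma 3, and the code uses NumPy's unnormalized forward FFT, which may differ from the paper's Definition 3 by normalization and sign convention.

```python
import numpy as np

n = 8
# Hypothetical one-dimensional filters in time domain, each supported on 2 points.
g1 = np.zeros(n, dtype=complex)
g1[0], g1[n // 2] = 0.5, 0.5      # Fourier transform: 1 on even frequencies, 0 on odd
g2 = np.zeros(n, dtype=complex)
g2[0], g2[n // 2] = 0.5, -0.5     # Fourier transform: 1 on odd frequencies, 0 on even

# Tensor product in time domain and its two-dimensional Fourier transform.
G = np.multiply.outer(g1, g2)
G_hat = np.fft.fftn(G)

# The same transform, computed coordinate by coordinate.
G_hat_factored = np.multiply.outer(np.fft.fft(g1), np.fft.fft(g2))

support_size = np.count_nonzero(G)  # product of the one-dimensional support sizes
```

Here `G_hat` equals 1 exactly on frequencies with even first coordinate and odd second coordinate, and 0 elsewhere, so the two-dimensional filter isolates the intersection of the two one-dimensional frequency classes.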

Proof of Lemma 6: Let $N=n^{d}$ . Recall that for every $\bm{t}\in[n]^{d}$ ,

[TABLE]

Because all $\widehat{x}_{\bm{f}}$ ’s are zero mean independent random variables, for every fixed $\bm{t}\in[n]^{d}$ one has that for every $\bm{f}\in[n]^{d}$ the random variables $\widehat{x}_{\bm{f}}\cdot e^{2\pi i\frac{\bm{f}^{T}\bm{t}}{n}}$ are zero mean and independent. Observe that for all $\bm{f}\in[n]^{d}$ , we have $|\widehat{x}_{\bm{f}}\cdot e^{2\pi i\frac{\bm{f}^{T}\bm{t}}{n}}|=|\beta_{\bm{f}}|\leq\|\beta\|_{\infty}$ and also $\mathbb{E}\left[|\widehat{x}_{\bm{f}}\cdot e^{2\pi i\frac{\bm{f}^{T}\bm{t}}{n}}|^{2}\right]=|\beta_{\bm{f}}|^{2}$ . Therefore by Bernstein’s inequality we have that for every fixed $\bm{t}\in[n]^{d}$ ,

[TABLE]

If we choose $\theta=C_{1}\log_{2}N\cdot\|\beta\|_{2}$ for some absolute constant $C_{1}>0$ ,

[TABLE]

for large enough constant $C_{1}$ . By a union bound over all $\bm{t}\in[n]^{d}$ we get that, $|x_{\bm{t}}|^{2}\leq\frac{C_{1}^{2}\log^{2}_{2}N}{N^{2}}\|\beta\|^{2}_{2}$ for all $\bm{t}\in[n]^{d}$ with probability $1-\frac{1}{2N^{4}}$ .

Now note that by Parseval’s theorem, Claim 1,

[TABLE]

Conditioning on $\|x\|_{\infty}^{2}\leq\frac{C_{1}^{2}\log^{2}_{2}N}{N^{2}}\|\beta\|^{2}_{2}$ , by the Chernoff-Hoeffding bound we have,

[TABLE]

where the probability is over the i.i.d. random variables $\bm{t}_{1},\bm{t}_{2},\dots,\bm{t}_{s}\sim\mathrm{Unif}([n]^{d})$ and $C_{2}$ is some positive constant. Therefore, by the choice of $s=C\log_{2}^{3}N$ for some large enough constant $C$ we have that,

[TABLE]

By a union bound over these two events we have that $\frac{1}{2}\cdot\frac{\|\beta\|_{2}^{2}}{N^{2}}\leq\frac{1}{s}\sum_{j=1}^{s}|x_{\bm{t}_{j}}|^{2}\leq\frac{3}{2}\cdot\frac{\|\beta\|_{2}^{2}}{N^{2}}$ with probability at least $1-\frac{1}{N^{4}}$ .

∎
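The statement proved above says that the average of $|x_{\bm{t}_{j}}|^{2}$ over $s=C\log_{2}^{3}N$ uniformly random time points estimates $\|\beta\|_{2}^{2}/N^{2}$ up to a constant factor. A minimal simulation is sketched below. The helper name `estimate_energy`, the magnitudes in `beta`, the constant 2 in the choice of `s`, and the seed are all hypothetical, and the code assumes the inverse transform carries a $1/N$ factor, as NumPy's `ifftn` does.

```python
import numpy as np

def estimate_energy(x, s, rng):
    # Average of |x_t|^2 over s i.i.d. uniformly random sample points t.
    idx = tuple(rng.integers(0, dim, size=s) for dim in x.shape)
    return np.mean(np.abs(x[idx]) ** 2)

rng = np.random.default_rng(1)
n, d, k = 32, 2, 64
N = n ** d

beta = rng.uniform(0.5, 2.0, size=k)            # hypothetical coefficient magnitudes
phases = rng.uniform(0.0, 2 * np.pi, size=k)    # uniform phases give zero-mean coefficients
x_hat = np.zeros(N, dtype=complex)
x_hat[rng.choice(N, size=k, replace=False)] = beta * np.exp(1j * phases)
x = np.fft.ifftn(x_hat.reshape((n,) * d))       # ifftn includes the 1/N normalization

target = np.sum(beta ** 2) / N ** 2             # ||beta||_2^2 / N^2
exact_average = np.mean(np.abs(x) ** 2)         # equals target by Parseval's theorem
s = 2 * int(np.log2(N)) ** 3
estimate = estimate_energy(x, s, rng)
```

The exact average over all $N$ time points matches the target by Parseval's theorem, while the sampled estimate only reads `s` entries of `x`.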

Appendix B Proof of Lemma 8

Lemma 12.

For every power of two integer $n$ and positive integer $d$ , if $x\in{\mathbb{C}}^{n^{d}}$ is a random support signal as per Definition 2.2, the following conditions hold. If $N=n^{d}$ , $\bm{B}=(B_{1},B_{2},\dots,B_{d})$ is a vector of powers of two such that $B_{j}\mid n$ for all $j=1,2,\dots,d$ and $|\bm{B}|\geq 4k$ , then, with probability at least $1-\frac{1}{N^{3}}$ over $x$ ,

[TABLE]

for all $\bm{b}\in[\bm{B}]$ (where $S_{x}(\bm{B},\bm{b})$ is the set from Definition 15).

Proof.

Note that for every $\bm{b}\in[\bm{B}]$ , we have that for each $\bm{f}\in[n]^{d}$ with $\bm{f}\equiv\bm{b}\pmod{\bm{B}}$ , $\mathrm{Pr}[\bm{f}\in S]=k/N$ . Then, since there are $N/|\bm{B}|$ such $\bm{f}$ for every fixed $\bm{b}\in[\bm{B}]$ , it follows that,

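$$\mathbb{E}\left[|S_{x}(\bm{B},\bm{b})|\right]=\frac{N}{|\bm{B}|}\cdot\frac{k}{N}=\frac{k}{|\bm{B}|}\leq\frac{1}{4}.$$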

Hence, by the Chernoff bound, it follows that $|S_{x}(\bm{B},\bm{b})|=O(\log N)$ with probability $1-N^{-4}$ for any fixed $\bm{b}\in[\bm{B}]$ . Finally, by a union bound over all $\bm{b}\in[\bm{B}]$ , we have the desired result. ∎
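The concentration of bucket loads in this proof can be observed in a small simulation. The sketch below assumes the Bernoulli support model used in the proof (each frequency belongs to $S$ independently with probability $k/N$) and hashes each support element to its residue modulo $\bm{B}$ . The function name `bucket_loads`, the parameter values, and the use of $\log_{2}N$ as a comparison threshold are hypothetical; the constant hidden in $O(\log N)$ is not specified here.

```python
import numpy as np

def bucket_loads(n, d, k, B, rng):
    N = n ** d
    # Bernoulli support: each frequency in [n]^d is included independently w.p. k/N.
    mask = rng.random((n,) * d) < k / N
    support = np.argwhere(mask)                  # shape (|S|, d)
    residues = support % np.asarray(B)           # entrywise reduction modulo B
    loads = np.zeros(B, dtype=int)
    np.add.at(loads, tuple(residues.T), 1)       # count support elements per bucket
    return support, loads

rng = np.random.default_rng(2)
n, d, k = 64, 2, 16
B = (8, 8)                                       # |B| = 64 >= 4k
support, loads = bucket_loads(n, d, k, B, rng)
max_load = loads.max()
threshold = np.log2(n ** d)                      # hypothetical stand-in for O(log N)
```

With these parameters each bucket has expected load $k/|\bm{B}|=1/4$ , so the maximum load stays far below the threshold.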

We now prove a lemma about the size of the sets $S^{(\bm{B})}$ :

Lemma 13.

For any power of two integers $n$ and $k$ , positive integer $d$ and $\bm{B}=(B_{1},B_{2},\dots,B_{d})$ such that $B_{1},B_{2},\dots,B_{d}\mid n$ and $\bm{f}=(f_{1},\dots,f_{d})\in[\bm{B}]$ , we have that $\mathrm{Pr}[{\bf f}\in S^{(\bm{B})}]\leq\left(\frac{k}{|\bm{B}|}\right)^{2}$ , where $S^{(\bm{B})}$ is defined as in Definition 17.

Proof.

Suppose $\bm{f}\in[\bm{B}]$ . Then, observe that there are $\left(\frac{n}{B_{1}}\right)\left(\frac{n}{B_{2}}\right)\cdots\left(\frac{n}{B_{d}}\right)=\frac{N}{B_{1}B_{2}\cdots B_{d}}$ elements $\bm{g}=(g_{1},\dots,g_{d})\in[n]^{d}$ such that $\bm{f}\equiv\bm{g}\pmod{\bm{B}}$ . Note that $\bm{f}\in S^{(\bm{B})}$ if at least two of these elements lie in $S$ . Thus, for every $\bm{f}\in[\bm{B}]$ we have,

[TABLE]

since $|\bm{B}|\leq n^{d}=N$ . ∎

As a consequence, we have a bound on the expected size of $S^{(\bm{B})}$ .

Lemma 14.

For any power of two integers $n$ and $k$ , any $\bm{B}=(B_{1},B_{2},\dots,B_{d})$ such that $B_{1},B_{2},\dots,B_{d}\mid n$ , we have $\mathbb{E}\left[|S^{(\bm{B})}|\right]\leq\frac{k^{2}}{|\bm{B}|}$ .

Proof.

Simply note that

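$$\mathbb{E}\left[|S^{(\bm{B})}|\right]=\sum_{\bm{f}\in[\bm{B}]}\mathrm{Pr}\left[\bm{f}\in S^{(\bm{B})}\right]\leq|\bm{B}|\cdot\left(\frac{k}{|\bm{B}|}\right)^{2}=\frac{k^{2}}{|\bm{B}|},$$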

by Lemma 13. ∎
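The bounds of Lemmas 13 and 14 can be compared against the exact collision probability in a small computation. The sketch below assumes that a bucket belongs to $S^{(\bm{B})}$ exactly when at least two of its $N/|\bm{B}|$ candidate frequencies lie in $S$ , and that each frequency lies in $S$ independently with probability $k/N$ . The function name `collision_probability` and the parameter values are hypothetical.

```python
def collision_probability(N, num_buckets, k):
    m = N // num_buckets          # number of frequencies congruent to a fixed bucket
    p = k / N                     # Bernoulli inclusion probability
    # Pr[Binomial(m, p) >= 2] = 1 - Pr[0 successes] - Pr[1 success].
    return 1.0 - (1.0 - p) ** m - m * p * (1.0 - p) ** (m - 1)

N, k = 4096, 16
results = []
for num_buckets in (64, 256, 1024):
    prob = collision_probability(N, num_buckets, k)
    per_bucket_bound = (k / num_buckets) ** 2          # Lemma 13
    expected_size = num_buckets * prob                 # E|S^(B)| by linearity
    expected_bound = k ** 2 / num_buckets              # Lemma 14
    results.append((prob, per_bucket_bound, expected_size, expected_bound))
```

Increasing the number of buckets by a factor $\Gamma$ shrinks the bound on the expected number of collided buckets by the same factor, which is what drives the peeling process toward an empty $S^{(\bm{B})}$ .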

We are now ready to prove Lemma 8.

Proof of Lemma 8: Consider a fixed $\bm{b}\in[\bm{B}^{\prime}]$ . Note that there are $m=\frac{B_{1}B_{2}\cdots B_{d}}{B_{1}^{\prime}B_{2}^{\prime}\cdots B_{d}^{\prime}}$ values of $\bm{f}\in[\bm{B}]$ such that $\bm{f}\equiv\bm{b}\pmod{\bm{B}^{\prime}}$ . Moreover, by Lemma 13, each such $\bm{f}$ lies in $S^{(\bm{B})}$ with identical probability

[TABLE]

and these events are all independent. Thus,

[TABLE]

Thus, by the Chernoff bound, we have that

[TABLE]

with probability at least $1-\frac{1}{N^{4}}$ , as desired. Finally, taking a union bound over all $|\bm{B}^{\prime}|\leq N$ values of $\bm{b}\in[\bm{B}^{\prime}]$ gives the desired result. ∎

