A Faster Algorithm Enumerating Relevant Features over Finite Fields

Mikito Nanashima

arXiv:1903.06412·cs.LG·July 16, 2019

TL;DR

This paper introduces a novel, efficient algorithm for learning k-juntas over finite fields, expanding Fourier detection techniques beyond binary cases and connecting to problems like LDME and the light bulb problem.

Contribution

It extends Fourier detection methods to finite fields and provides the first non-trivial algorithm for k-juntas over such fields, answering an open question.

Findings

1. Achieves an $O(n^{0.8k})$-time learning algorithm for $k$-juntas over finite fields.

2. First non-trivial algorithm for multi-labeled data in this context.

3. Reduces the problem to well-studied problems like LDME and LBP, enabling the use of existing techniques.

Abstract

We consider the problem of enumerating relevant features hidden in other irrelevant information for multi-labeled data, which is formalized as learning juntas. A $k$-junta function is a function which depends on only $k$ coordinates of the input. For relatively small $k$ w.r.t. the input size $n$, learning $k$-junta functions is one of the fundamental problems in machine learning, both theoretically and practically. Over the last two decades, much effort has gone into designing efficient learning algorithms for Boolean junta functions, and several novel techniques have been developed. In the real world, however, multi-labeled data seem to arise much more often than binary-labeled data. Thus, it is natural to ask whether these techniques can be applied to larger alphabets. In this paper, we extend the Fourier detection techniques for the binary alphabet to…

Equations

$$\mathrm{Cor}(f,\chi_{\alpha}):=\left|\mathrm{E}_{x,f}\left[e(f(x))\overline{e(\chi_{\alpha}(x))}\right]\right|\geq\rho,$$

$$\Pr\left[\left|\frac{1}{m}\sum_{i=1}^{m}X_{i}-\mu\right|>\epsilon\right]<2e^{-\frac{2m\epsilon^{2}}{(b-a)^{2}}}.$$

$$\mathrm{Tr}(c_{j}(f(x)-f(y)))=\mathrm{Tr}(c_{j}f(x))-\mathrm{Tr}(c_{j}f(y))\neq 0.$$

$$\widehat{f}(\alpha)=\mathrm{E}[e(f(x)-\chi_{\alpha}(x))]=\mathrm{E}[e(f(x)-\chi_{\alpha^{\prime}}(x))]\cdot\mathrm{E}[e(-\alpha_{i}x_{i})]=0,$$

$$f_{A}^{a}(x)=\sum_{\alpha:A\alpha=a^{m}}\widehat{af}(\alpha)e(\chi_{\alpha}(x))=\begin{cases}\sum_{\alpha:A\alpha=1^{m}}\widehat{af}(a\alpha)e(a\chi_{\alpha}(x))&(\text{if }a\neq 0)\\1&(\text{if }a=0)\end{cases}$$

$$f_{A}^{a}(x)=\mathrm{E}_{z\sim\mathbb{F}_{q}^{m}}\left[e(af(x+A^{T}z))\overline{e(\chi_{a^{m}}(z))}\right].$$

$$SD(X,X^{\prime})=\frac{1}{2}\sum_{x\in\mathbb{F}_{q}}\left|\Pr[X=x]-\Pr[X^{\prime}=x]\right|,$$

$$CD(X,X^{\prime})=\max_{a\in\mathbb{F}_{q}}\left|\mathrm{E}[e(aX)]-\mathrm{E}[e(aX^{\prime})]\right|.$$

$$CD(X,X^{\prime})\leq 2\cdot SD(X,X^{\prime})\leq\sqrt{q-1}\cdot CD(X,X^{\prime}).$$

$$(k+1)q^{k}\cdot O\left(n\cdot q^{k}\ln\frac{(k+1)q^{k}+k}{\delta}\right)+k\cdot T(n,k,1/q^{k+1})\cdot\mathrm{poly}\left(n,q^{k},\ln\frac{(k+1)q^{k}+k}{\delta}\right)=T(n,k,1/q^{k+1})\cdot\mathrm{poly}(n,q^{k},\ln\delta^{-1}).$$

$$\widehat{g}(\alpha)=\widehat{f_{A}^{a}}(\alpha).$$

$$\begin{aligned}\widehat{g}(\alpha)&=\mathrm{E}_{x}\left[g(x)\overline{e(\chi_{\alpha}(x))}\right]\\&=\mathrm{E}_{z}\left[\mathrm{E}_{x}\left[e(af(x+A^{T}z))\overline{e(\chi_{\alpha}(x+A^{T}z))}\right]e(\chi_{\alpha}(A^{T}z))\overline{e(\chi_{a^{m}}(z))}\right]\\&=\widehat{af}(\alpha)\,\mathrm{E}_{z}\left[e(\chi_{A\alpha}(z))\overline{e(\chi_{a^{m}}(z))}\right]\\&=\widehat{af}(\alpha)\,\mathbf{1}\{A\alpha=a^{m}\}=\widehat{f_{A}^{a}}(\alpha).\end{aligned}$$

$$\begin{aligned}\mathrm{E}_{b_{x}}[e(ab_{x})]&=\mathrm{E}_{z}\left[e(af(x+A^{T}z)-\chi_{a^{m}}(z))\right]\\&=\mathrm{E}_{z}\left[e(af(x+A^{T}z))\overline{e(\chi_{a^{m}}(z))}\right]=f_{A}^{a}(x)\quad(\because\text{the identity of Lemma 3}).\end{aligned}$$

$$\Pr[f(x)-\chi_{\alpha}(x)=a_{q}]=\sum_{v\in\mathbb{F}_{q}}\Pr[f(x)-\chi_{\alpha^{\prime}}(x)=v]\Pr[\alpha_{i}x_{i}=a_{q}-v]=\frac{1}{q},$$

$$\Pr[\mathrm{Tr}(f(x)-\chi_{\alpha}(x))=a_{p}]=\sum_{a_{q}\in\mathrm{Tr}^{-1}(a_{p})}\Pr[f(x)-\chi_{\alpha^{\prime}}(x)=a_{q}]=\frac{|\mathrm{Tr}^{-1}(a_{p})|}{q}=\frac{p^{\ell-1}}{p^{\ell}}=\frac{1}{p}.$$

$$\Pr_{x}[x^{T}\alpha=a\text{ and }x^{T}\beta=b]=\frac{1}{q^{2}}.$$

$$\Pr_{x}[x^{T}\alpha=a\text{ and }x^{T}\beta=b]=\begin{cases}1/q&(\text{if }b=ca)\\0&(\text{otherwise}).\end{cases}$$

$$\alpha_{i}x_{i}+\alpha_{j}x_{j}=v_{1}\text{ and }c\alpha_{i}x_{i}+c^{\prime}\alpha_{j}x_{j}=v_{2}.$$

$$\alpha_{1}x_{1}+\dots+\alpha_{n}x_{n}$$

$$\Pr_{A\sim\mathbb{F}_{q}^{m\times n}}\left[A\alpha=1^{m}\text{ and }A\beta\neq 1^{m}\text{ for each }\beta\in\mathbb{F}_{q}^{D}\setminus\{\alpha\}\right]\geq\frac{q^{m-k}-1}{q^{2m-k}}$$

$$\Pr_{A\sim\mathbb{F}_{q}^{(k+1)\times k}}\left[A\alpha=1^{k+1}\text{ and }A\beta\neq 1^{k+1}\text{ for each }\beta\in\mathbb{F}_{q}^{D}\setminus\{\alpha\}\right]\geq\frac{1}{q^{k+2}}$$

$$\Pr_{A}[A\alpha=1^{m}]=\frac{1}{q^{m}}\text{ and }\Pr_{A}\left[A\beta\neq 1^{m}\text{ for each }\beta\in\mathbb{F}_{q}^{D}\setminus\{\alpha\}\mid A\alpha=1^{m}\right]\geq 1-\frac{1}{q^{m-k}}.$$

$$\Pr_{x}[x^{T}\beta=1\text{ and }x^{T}\alpha=1]\leq\frac{1}{q^{2}}.$$

$$\Pr_{x}[x^{T}\beta=1\mid x^{T}\alpha=1]=\frac{\Pr_{x}[x^{T}\beta=1\text{ and }x^{T}\alpha=1]}{\Pr_{x}[x^{T}\alpha=1]}\leq\frac{q}{q^{2}}=\frac{1}{q}$$

$$\Pr_{A}[A\beta=1^{m}\mid A\alpha=1^{m}]\leq\frac{1}{q^{m}}.$$

$$\Pr_{A}\left[\exists\beta\in\mathbb{F}_{q}^{D}\setminus\{\alpha\}\text{ s.t. }A\beta=1^{m}\mid A\alpha=1^{m}\right]\leq\frac{q^{k}}{q^{m}},$$

$$(x,b):=\left(y-A^{T}z,\;f(y)-\sum_{j}z_{j}\right).$$

Taxonomy

Topics: Machine Learning and Algorithms · Algorithms and Data Compression · Complexity and Algorithms in Graphs

Full text

A Faster Algorithm Enumerating Relevant Features over Finite Fields

Mikito Nanashima

Tokyo Institute of Technology


We consider the problem of enumerating relevant features hidden in other irrelevant information for multi-labeled data, which is formalized as learning juntas.

A $k$-junta function is a function which depends on only $k$ coordinates of the input. For relatively small $k$ w.r.t. the input size $n$, learning $k$-junta functions is one of the fundamental problems in machine learning, both theoretically and practically. Over the last two decades, much effort has gone into designing efficient learning algorithms for Boolean junta functions, and several novel techniques have been developed. In the real world, however, multi-labeled data seem to arise much more often than binary-labeled data. Thus, it is natural to ask whether these techniques can be applied to larger alphabets.

In this paper, we extend the Fourier detection techniques for the binary alphabet to any finite field $\mathbb{F}_{q}$ and give, roughly speaking, an $O(n^{0.8k})$-time learning algorithm for $k$-juntas over $\mathbb{F}_{q}$. Our algorithm is the first non-trivial (i.e., non-brute-force) algorithm for this class, even in the case $q=3$, and it gives an affirmative answer to the question posed by Mossel et al. [15].

Our algorithm consists of two reductions: (1) from learning juntas to the learning with discrete memoryless errors (LDME) problem, an extension of the learning with errors (LWE) problem introduced by Regev [17], and (2) from LDME to the light bulb problem (LBP) introduced by L. Valiant [21]. Since the reduced problem (i.e., LBP) is a binary problem regardless of the alphabet size of the original problem (i.e., learning juntas), we can directly apply the techniques developed for the binary problem in previous work.

1 Introduction

1.1 Background and Motivation

In both practical and theoretical senses, separating relevant information from irrelevant information is a fundamental challenge in data analysis. In many machine learning settings, collected data may contain many irrelevant features together with relevant ones (e.g., DNA sequences and big data), and efficient techniques for selecting relevant features are widely required. This problem is captured by learning juntas, which is one of the most challenging and important issues in computational learning theory. Informally, we say that an $n$-input function $f:\mathcal{X}^{n}\to\mathcal{Y}$ is a $k$-junta ($k\leq n$) iff $f$ depends on at most $k$ coordinates of the input. Our task is to find the relevant coordinates (i.e., features) of a $k$-junta function $f$, called a target function, from passively collected examples of the form $(x,f(x))\in\mathcal{X}^{n}\times\mathcal{Y}$.

In the special case where the domain of a target function is binary, that is, $\mathcal{X}=\mathbb{F}_{2}$, the junta learning problem is of particular theoretical importance. For $k=O(\log{n})$, learning $k$-junta functions is a special case of learning polynomial-size DNF (disjunctive normal form) formulas and log-depth decision trees, both of which are notorious open problems in computational learning theory, even in the uniform-distribution model (i.e., examples are distributed uniformly over $\mathbb{F}_{2}^{n}$). Therefore, any affirmative answer to these problems must include an efficient learning algorithm for log-juntas. Despite much effort by many researchers, no efficient (i.e., polynomial-time) learning algorithm for log-juntas has been found. From another point of view (i.e., parameterized complexity, introduced in [7]), learning juntas can be regarded as a parameterized learning problem for general Boolean functions, and in fact, fixed-parameter intractability results are known for (proper) learning of juntas under arbitrary example distributions [2]. In the uniform-distribution model, however, no convincing argument for intractability has been found so far. For further details about learning juntas, see the survey by Blum [4].

On the positive side, some elegant techniques for learning Boolean juntas in the uniform-distribution model have been developed since the problem was posed in [3, 5]. Clearly, any $k$-junta function can be learned in time $O(n^{k})$ with high probability by brute-force search over all $\binom{n}{k}\leq n^{k}$ candidate sets of relevant coordinates. The first polynomial-factor improvement was found by Mossel et al. [15], who reduced the running time to $O(n^{\frac{\omega}{\omega+1}k})\leq O(n^{0.706k})$, where $\omega$ denotes the exponent of the running time $O(n^{\omega})$ of fast $n\times n$ matrix multiplication, with the best known bound $\omega<2.3728639$ [14]. A further improvement was made by G. Valiant [19], who gave a learning algorithm running in time $O(n^{\frac{\omega}{4}k})\leq O(n^{0.6k})$, which is the fastest known at present. The main contributions of [19] are a subquadratic algorithm for the light bulb problem, which was posed in [21], and a reduction from learning Boolean juntas to the light bulb problem.

In the real world, multi-labeled data such as questionnaires or DNA sequences (i.e., (A,T,G,C)) seem to arise much more often than binary-labeled data. It is therefore natural to ask whether the techniques for learning Boolean juntas can be adapted to more general domains. Although the learning problem for $k$-juntas over a finite alphabet of size $q\in\mathbb{N}$ was mentioned as a direction for future work in [15], there are far fewer learnability results in the general case than in the binary case. Clearly, the problem can be solved in time $O(n^{k})$, as in the case of $\mathbb{F}_{2}$. The subsequent work [9] implicitly gave a non-trivial $O(n^{\frac{\omega}{3}k})\leq O(n^{0.8k})$-time algorithm in the case where $q=2^{\ell}$ for some $\ell\in\mathbb{N}$, by reducing the learning problem to $q-1$ learning problems for junta functions with range $\mathcal{Y}=\mathbb{F}_{2}$. However, to the best of our knowledge, no non-trivial learning algorithm for juntas over more general domains has been known, even in the case $q=3$. In this paper, we investigate the learnability of juntas over arbitrary finite fields and explicitly give the first non-trivial learning algorithm for such classes.

1.2 Our Contributions

Let $\mathbb{F}_{q}$ be an arbitrary finite field of order $q=p^{\ell}$, where $p=\mathrm{char}(\mathbb{F}_{q})$. In this paper, we focus on $k$-junta functions over $\mathbb{F}_{q}$ as target functions. Formally, $k$-junta functions are defined as follows.

Definition 1.

For a function $f:\mathbb{F}_{q}^{n}\to\mathbb{F}_{q}$, we say that a coordinate $i\in\{1,\ldots,n\}$ is relevant if $f(x)\neq f(y)$ for some points $x,y\in\mathbb{F}_{q}^{n}$ that differ only at the coordinate $i$. For $k\leq n$, we say that a function $f$ is a $k$-junta if $f$ has at most $k$ relevant coordinates.
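Definition 1 can be checked directly on a small truth table. The following is a minimal sketch, assuming a prime $q$ (so that arithmetic is plain arithmetic modulo $q$), 0-based coordinates, and full query access to $f$; the function `example_f` is a hypothetical target chosen only for illustration, and this brute-force check is not the learning algorithm of the paper, which only sees random examples.

```python
from itertools import product

def relevant_coordinates(f, n, q):
    """Return the 0-based coordinates on which f: (Z_q)^n -> Z_q depends.

    Brute force over the full truth table (q**n points), so this is only
    meant for tiny n; it is not the paper's learning algorithm.
    """
    relevant = set()
    for x in product(range(q), repeat=n):
        fx = f(x)
        for i in range(n):
            if i in relevant:
                continue
            # Change only coordinate i and see whether the label moves.
            for v in range(q):
                y = x[:i] + (v,) + x[i + 1:]
                if f(y) != fx:
                    relevant.add(i)
                    break
    return sorted(relevant)

def example_f(x):
    # Hypothetical 2-junta on 4 inputs over F_3: depends on x[0] and x[2] only.
    return (x[0] * x[2] + 2 * x[2]) % 3

print(relevant_coordinates(example_f, n=4, q=3))  # prints [0, 2]
```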

We state the learning junta problem more formally. The learning setting mainly follows the framework of PAC (Probably Approximately Correct) learning, first introduced by L. Valiant [20]. The number of relevant coordinates is given in advance by some fixed function $k:\mathbb{N}\to\mathbb{N}$, and the learning algorithm knows the function $k$. The learning algorithm is given an example oracle $\mathbb{O}(f)$ as its only access to the target function $f:\mathbb{F}_{q}^{n}\to\mathbb{F}_{q}$. On each access, $\mathbb{O}(f)$ returns an example $(x,f(x))\in\mathbb{F}_{q}^{n}\times\mathbb{F}_{q}$, where $x$ is selected uniformly at random from $\mathbb{F}_{q}^{n}$.

The learning junta problem is formally stated as follows. In this paper, we use the term “with high probability (w.h.p. for short)” to mean with some constant probability.

Learning $k$ -juntas (over finite field)

Input: $n,k\in\mathbb{N}$, and an example oracle $\mathbb{O}(f)$ where $f:\mathbb{F}_{q}^{n}\to\mathbb{F}_{q}$ is a $k$-junta

Goal: Find all (at most $k$ ) relevant coordinates w.h.p.

As described in [10], the failure probability can be reduced to any given $\delta\in(0,1)$ by $O(\ln{\delta^{-1}})$ independent repetitions. The reader may think that the above formulation differs from the usual PAC learning model in that the learning algorithm does not output a hypothesis function. However, as described in [4, 15], the difficulty of learning juntas lies not in finding what the function is but in finding where the relevant coordinates are. In fact, for learning juntas, the above formulation is equivalent to the usual PAC learnability under the uniform distribution (up to a multiplicative factor of $\textup{\text{poly}}(n,q^{k})$).

In this paper, we will prove the following main result.

Theorem 1 (main).

For any $\epsilon>0$ and $k=O(\log_{q}{n})$, $k$-juntas over any finite field $\mathbb{F}_{q}$ are learnable in time $n^{\frac{\omega}{3}k+\epsilon}\cdot\textup{\text{poly}}(n,q^{k})$.

Our learning algorithm mainly follows the line of work by [8, 19] and consists of two reductions that generalize their reductions for the binary domain to any finite field $\mathbb{F}_{q}$ .

In the first step, we reduce the learning juntas problem to another learning problem, learning with discrete memoryless errors (LDME). Roughly speaking, the task of LDME is to learn a linear function $\chi_{\alpha}:\mathbb{F}_{q}^{n}\to\mathbb{F}_{q}$ with $\alpha\in\mathbb{F}_{q}^{n}$ when the label may be corrupted by random noise, where $\chi_{\alpha}(x)=\alpha_{1}x_{1}+\ldots+\alpha_{n}x_{n}$ with arithmetic in $\mathbb{F}_{q}$. For simplicity, we regard a randomized function as the target function in order to capture the noise.

Learning with Discrete Memoryless Errors: LDME

Input: $n,k\in\mathbb{N}$ , $\rho\in(0,1]$ , and an example oracle $\mathbb{O}(f)$ ,

where $f:\mathbb{F}_{q}^{n}\to\mathbb{F}_{q}$ is randomized. The distribution of the value $f(x)$ is determined only by the value of $\chi_{\alpha}(x)$ (not by $x$ itself), where $1\leq|\alpha|\leq k$. The target function $f$ is close to $\chi_{\alpha}$ in the sense of correlation:

$$\mathrm{Cor}(f,\chi_{\alpha}):=\left|\mathrm{E}_{x,f}\left[e(f(x))\overline{e(\chi_{\alpha}(x))}\right]\right|\geq\rho,$$

where the mapping $e:\mathbb{F}_{q}\to\mathbb{C}$ is defined by $e(a)=e^{\frac{2\pi i}{p}\mathrm{Tr}(a)}$ for $a\in\mathbb{F}_{q}$ and $\mathrm{Tr}(a):=\sum_{j=0}^{\ell-1}a^{p^{j}}\in\mathbb{F}_{p}$ .

Goal: Find the coefficients $a\alpha\in\mathbb{F}_{q}^{n}$ for some $a\in\mathbb{F}_{q}\setminus\{0\}$ w.h.p.

We call the above function $\chi_{\alpha}$ the target linear function. The reason why we allow the algorithm to output $a\alpha$ instead of $\alpha$ is that the linear function $\chi_{a\alpha}$ may also have large correlation, that is, $\mathrm{Cor}(f,\chi_{a\alpha})\geq\rho$.

Let us briefly review the background of the above problem. LDME, first introduced in [9], is an extension of the well-known learning with errors problem (LWE), which is known as one of the most challenging problems in learning theory and is even used as a hardness assumption in cryptography (see [17, 18]). The difference between the two problems lies in the noise setting. In LWE, the (unknown) noise distribution is fixed in advance, while in LDME, the distribution is determined by the value of the target linear function; in other words, there are $q$ unknown noise distributions in total. Note also that we adopt a slightly different closeness condition between $f$ and $\chi_{\alpha}$ than [9]. In the previous formulation, the given $\rho$ was a lower bound on the agreement probability, i.e., the probability that $f=\chi_{\alpha}$. In our formulation by correlation, however, the agreement probability is not always large. For example, our closeness condition may hold even when the difference $f-\chi_{\alpha}$ is close to some constant.

We first present the reduction from the learning juntas problem to LDME, which is a generalization of the binary case in [8]. The details are given in Section 3.

Theorem 2.

If there exists a learning algorithm for solving LDME in time $T(n,k,\rho)$ , then there exists a learning algorithm for $k$ -juntas over $\mathbb{F}_{q}$ in time $T(n,k,1/q^{k+1})\cdot\textup{\text{poly}}(n,q^{k})$ .

In the second step, we reduce LDME to the light bulb problem (LBP), which was first introduced in [21] and is also a fundamental problem in machine learning and data analysis. Roughly speaking, the task of LBP is to find a correlated pair among otherwise uncorrelated vectors. The formal definition is as follows:

Light Bulb Problem: LBP

Input: a set $S=\{x^{1},\ldots,x^{n}\}$ of $n$ vectors, and $\rho\in(0,1]$ ,

where $x^{i}\in\{\pm 1\}^{d}$ for each $i\in[n]$ . The instance $S$ contains a single correlated pair $(x^{i^{*}},x^{j^{*}})$ satisfying $\langle x^{i^{*}},x^{j^{*}}\rangle\geq\rho d$ , and the other pairs of vectors are selected independently and uniformly at random.

Goal: Find indices of the correlated pair $(i^{*},j^{*})$ w.h.p.

LBP is clearly solved in time $O(n^{2}d)$ by computing the inner products of all pairs. As a breakthrough result, the first subquadratic algorithm for LBP was found by [19]. Moreover, for the case where $\rho\geq n^{-\Theta(1)}$, a faster algorithm was presented in [12]. Other subquadratic algorithms have also been proposed in [13, 1].

Fact 1 ([12, Corollary 2.2]).
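The quadratic baseline mentioned above can be sketched as follows. This is only the naive all-pairs search, not any of the subquadratic algorithms of [19, 12, 13, 1]. The instance generator `make_lbp_instance`, its planted indices, and the way the pair is planted (copying one vector and flipping a quarter of its entries, so the inner product is exactly $d/2$) are assumptions made for this illustration.

```python
import random

def make_lbp_instance(n, d, planted=(3, 7), seed=0):
    """Hypothetical generator: n random +-1 vectors with one planted pair."""
    rng = random.Random(seed)
    vectors = [[rng.choice((-1, 1)) for _ in range(d)] for _ in range(n)]
    i, j = planted
    # Copy x^i and flip its first d//4 coordinates, so that
    # <x^i, x^j> = d - 2 * (d // 4), i.e. correlation about 1/2.
    vectors[j] = [-v if t < d // 4 else v for t, v in enumerate(vectors[i])]
    return vectors

def naive_light_bulb(vectors):
    """Return the pair with the largest inner product, in O(n^2 d) time."""
    best_pair, best_ip = None, None
    n = len(vectors)
    for i in range(n):
        for j in range(i + 1, n):
            ip = sum(a * b for a, b in zip(vectors[i], vectors[j]))
            if best_ip is None or ip > best_ip:
                best_pair, best_ip = (i, j), ip
    return best_pair, best_ip

vectors = make_lbp_instance(n=20, d=256)
pair, inner = naive_light_bulb(vectors)
print(pair, inner / 256)
```

With $d=256$, uncorrelated pairs have inner products concentrated around $0$ with standard deviation $16$, so the planted value $128$ stands out clearly.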

For any $0<\epsilon<\omega/3$ and $n^{-\Theta(1)}<\rho<1$ , if $d\geq 5\rho^{-\frac{4\omega}{9\epsilon}-\frac{2}{3}}\ln{n}$ , then there is a randomized algorithm for solving LBP with probability $1-o(1)$ in time $\tilde{O}(n^{\frac{2\omega}{3}+\epsilon}\rho^{-\frac{8\omega}{9\epsilon}-\frac{4}{3}})$ .

We present the second reduction from LDME to LBP. Note that the reduced problem is a binary problem regardless of the alphabet size of the original problem. The details are given in Section 4.

Theorem 3.

Assume that there exist $d\geq\Omega(\frac{\log{N}}{\rho^{2}})$ and an algorithm for solving LBP of degree $d$ in time $T(N,\rho)$ w.h.p., where $N$ is the number of vectors in LBP. Then for any target linear function $\chi_{\alpha}:\mathbb{F}_{q}^{n}\to\mathbb{F}_{q}$ $(1\leq|\alpha|\leq k)$ and any correlation $\rho$ , LDME is solved w.h.p. in time $\textup{\text{poly}}(n,\rho^{-1})\cdot d\cdot T\left((qn)^{\frac{k}{2}},\frac{\rho}{2q^{3}}\right)$ .

In our reduction, the size of data is stretched from $n$ to $O(n^{\frac{k}{2}})$ . Thus, the naive quadratic algorithm for LBP does not improve the trivial upper bound on the running time of LDME at all. However, by combining our reductions with the subquadratic algorithm for LBP, we have a non-trivial learnability result which holds for any finite field, and Theorem 1 immediately follows from Theorems 2 and 3, and Fact 1.

In Theorem 1, the condition $k=O(\log_{q}n)$ essentially comes from the condition $\rho>n^{-\Theta(1)}$ in Fact 1. Therefore, by adopting another subquadratic algorithm for LBP that works for any $\rho\in(0,1]$ (e.g., [19]), we obtain a non-trivial learnability result for any $k\leq n$. We remark that our reduction, combined with such a subquadratic algorithm, also gives a non-trivial learning algorithm for LDME and, in particular, for LWE parameterized by $k$.

2 Preliminaries

We use $\log$ to denote the base-2 logarithm and $\ln$ to denote the natural logarithm. For any integer $n$, we define the set $[n]:=\{1,2,\ldots,n\}$. Let $\mathbb{F}_{q}$ be a finite field of order $q=p^{\ell}$, where $p=\mathrm{char}(\mathbb{F}_{q})$. We define the trace function $\mathrm{Tr}:\mathbb{F}_{q}\to\mathbb{F}_{p}$ by $\mathrm{Tr}(a):=\sum_{j=0}^{\ell-1}a^{p^{j}}$. Note that for any $a,b\in\mathbb{F}_{q}$, $\mathrm{Tr}(a)+\mathrm{Tr}(b)=\mathrm{Tr}(a+b)$, and $\mathrm{Tr}(\cdot)$ takes on each value in $\mathbb{F}_{p}$ equally often.
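The two stated properties of the trace can be verified on a small extension field. The sketch below assumes the concrete model $\mathbb{F}_{9}=\mathbb{F}_{3}[t]/(t^{2}+1)$, with an element $a+bt$ stored as the pair `(a, b)`; this representation and the helper names are choices made for this example, not notation from the paper.

```python
from collections import Counter
from itertools import product

P, ELL = 3, 2  # F_9 modelled as F_3[t] / (t^2 + 1); (a, b) stands for a + b*t

def add(u, v):
    return ((u[0] + v[0]) % P, (u[1] + v[1]) % P)

def mul(u, v):
    a, b = u
    c, d = v
    # (a + b t)(c + d t) with t^2 = -1
    return ((a * c - b * d) % P, (a * d + b * c) % P)

def power(u, exponent):
    result = (1, 0)
    for _ in range(exponent):
        result = mul(result, u)
    return result

def trace(u):
    """Tr(u) = u + u^p + ... + u^(p^(ell-1)), returned as an element of F_p."""
    total = (0, 0)
    for j in range(ELL):
        total = add(total, power(u, P ** j))
    assert total[1] == 0  # the trace always lands in the prime subfield
    return total[0]

elements = list(product(range(P), repeat=ELL))
print(Counter(trace(u) for u in elements))  # each value of F_3 appears 3 times
```

Here each value of $\mathbb{F}_{3}$ is attained $p^{\ell-1}=3$ times, which is the counting fact used later in the proof of Lemma 2.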

For $\alpha\in\mathbb{F}_{q}^{n}$, we define the weight of $\alpha$ by $|\alpha|=|\{i\in[n]:\alpha_{i}\neq 0\}|$. For $\alpha\neq 0^{n}$, we also define its initial $init(\alpha)$ as the first non-zero entry of $\alpha$, that is, $init(\alpha)=v$ iff there exists $i\in[n]$ such that $\alpha_{i}=v\neq 0$ and $\alpha_{j}=0$ for each $1\leq j<i$. Note that if $\alpha,\alpha^{\prime}\in\mathbb{F}_{q}^{n}\setminus\{0^{n}\}$ satisfy $\alpha\neq\alpha^{\prime}$ and $init(\alpha)=init(\alpha^{\prime})$, then there is no $c\in\mathbb{F}_{q}$ such that $\alpha=c\alpha^{\prime}$ (i.e., $\alpha$ and $\alpha^{\prime}$ are linearly independent over $\mathbb{F}_{q}$).

For any $J\subseteq[n]$, we define a subspace $\mathbb{F}_{q}^{J}\leq\mathbb{F}_{q}^{n}$ by $\mathbb{F}_{q}^{J}=\{x\in\mathbb{F}_{q}^{n}:x_{i}=0\text{ for each }i\in\bar{J}\}$, where $\bar{J}=[n]\setminus J$. For any $\alpha\in\mathbb{F}_{q}^{n}$ and $J\subseteq[n]$, we also define $\alpha^{J}\in\mathbb{F}_{q}^{J}$ by $\alpha^{J}_{i}=\alpha_{i}$ if $i\in J$ and $\alpha^{J}_{i}=0$ otherwise.

For a subset $J\subseteq[n]$, we call a pair $(J,\bar{J})$ a partition of $[n]$. In addition, if $J$ consists of $\lceil n/2\rceil$ cyclically consecutive coordinates, we say that the partition $(J,\bar{J})$ is consecutive. Clearly, the index set $[n]$ has exactly $n$ consecutive partitions. We now introduce the following useful lemma, which says that any subset of $[n]$ is split into two halves (up to rounding) by at least one consecutive partition of $[n]$.

Lemma 1.

For any $\alpha\in\mathbb{F}_{q}^{n}$ with $|\alpha|=k$, there exists at least one consecutive partition $(J,\bar{J})$ that satisfies $|\alpha^{J}|=\lceil k/2\rceil$ and $|\alpha^{\bar{J}}|=\lfloor k/2\rfloor$.

Proof.

See Appendix A.1. ∎
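Lemma 1 only concerns the support of $\alpha$, so it can be sanity-checked by exhaustive search for small $n$. The following sketch assumes 0-based coordinates and represents $\alpha$ by its support set; the helper names are illustrative and this is a check, not the proof from Appendix A.1.

```python
from itertools import combinations

def consecutive_partitions(n):
    """Yield the n sets J of ceil(n/2) cyclically consecutive coordinates."""
    size = (n + 1) // 2
    for start in range(n):
        yield {(start + t) % n for t in range(size)}

def balanced_partition(support, n):
    """Return some consecutive J with |support & J| = ceil(k/2), or None."""
    k = len(support)
    for J in consecutive_partitions(n):
        if len(support & J) == (k + 1) // 2:
            return J
    return None

def check_lemma(n_max):
    """Exhaustively test every support set in [n] for all n <= n_max."""
    for n in range(1, n_max + 1):
        for k in range(n + 1):
            for support in combinations(range(n), k):
                if balanced_partition(set(support), n) is None:
                    return False
    return True

print(check_lemma(8))
print(balanced_partition({0, 1, 2, 5}, n=8))
```

Once $|\alpha^{J}|=\lceil k/2\rceil$ holds, the complement automatically contains the remaining $\lfloor k/2\rfloor$ support coordinates.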

We use the term “a truth table” to denote a table of values of a function over $\mathbb{F}_{q}$, as in the binary case. For any function $f:\mathbb{F}_{q}^{n}\to\mathbb{F}_{q}$ and value $a\in\mathbb{F}_{q}$, we define a function $af:\mathbb{F}_{q}^{n}\to\mathbb{F}_{q}$ by $af(x)=a\cdot f(x)$. For a subset $J\subseteq[n]$, we define a restriction $\tau$ on $J$ as a partial assignment to $J$, and we use $f|_{\tau}:\mathbb{F}_{q}^{|\bar{J}|}\to\mathbb{F}_{q}$ to denote the restricted function obtained from $f$ by fixing the variables in $J$ according to $\tau$. We use $|\tau|$ to denote the size of a restriction $\tau$, that is, $|\tau|=|J|$.

For a finite set $S$ , we write $x\leftarrow_{u}S$ for a random sampling of $x$ according to the uniform distribution over $S$ . In the subsequent discussions, we assume the basic facts about probability theory, especially, pairwise independence and the union bound. We will make extensive use of the following tail bound.

Fact 2 (Hoeffding inequality [11]).

For real values $a,b\in\mathbb{R}$ , let $X_{1},\ldots,X_{m}$ be independent and identically distributed random variables with $X_{i}\in[a,b]$ and $\mathrm{E}[X_{i}]=\mu$ for each $i\in[m]$ . Then for any $\epsilon>0$ , the following inequality holds:

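$$\Pr\left[\left|\frac{1}{m}\sum_{i=1}^{m}X_{i}-\mu\right|>\epsilon\right]<2e^{-\frac{2m\epsilon^{2}}{(b-a)^{2}}}.$$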

2.1 Fourier Analysis

We introduce some basics of Fourier analysis over finite fields. For further details, see [16, 9]. For each $a\in\mathbb{F}_{q}$, let $e(a):=e^{\frac{2\pi i}{p}\mathrm{Tr}(a)}\in\mathbb{C}$. For $a,b\in\mathbb{F}_{q}$, it is easy to see that $e(a+b)=e(a)e(b)$ and $e(-a)=\overline{e(a)}$. For any two functions $f,g:\mathbb{F}_{q}^{n}\to\mathbb{C}$, we define their inner product by $\langle f,g\rangle=\mathrm{E}_{x}[f(x)\overline{g(x)}]$. Then the family $\{e(\chi_{\alpha})\}_{\alpha\in\mathbb{F}_{q}^{n}}$ of $q^{n}$ functions forms an orthonormal basis, that is, $\langle e(\chi_{\alpha}),e(\chi_{\beta})\rangle=1$ if $\alpha=\beta$, and $\langle e(\chi_{\alpha}),e(\chi_{\beta})\rangle=0$ otherwise. Therefore, any function $f:\mathbb{F}_{q}^{n}\to\mathbb{C}$ has a unique Fourier expansion $f(x)=\sum_{\alpha}\widehat{f}(\alpha)e(\chi_{\alpha}(x))$, where $\widehat{f}(\alpha)$ is the Fourier coefficient given by $\widehat{f}(\alpha)=\langle f,e(\chi_{\alpha})\rangle$.

For a function $f:\mathbb{F}_{q}^{n}\to\mathbb{F}_{q}$ and $\alpha\in\mathbb{F}_{q}^{n}$, we also define its Fourier coefficient at $\alpha$ by $\widehat{f}(\alpha)=\langle e(f),e(\chi_{\alpha})\rangle$ (we use the same notation as above). Note that, unlike a complex-valued function, $f$ is not always uniquely determined by its Fourier coefficients, because the value $f(x)\in\mathbb{F}_{q}$ is mapped to $\mathrm{Tr}(f(x))\in\mathbb{F}_{p}$ in the definition of $e(\cdot)$, and there exist different functions $f,g:\mathbb{F}_{q}^{n}\to\mathbb{F}_{q}$ that satisfy $\mathrm{Tr}(f)=\mathrm{Tr}(g)$. Our algorithm uses the above analysis extensively; more specifically, it maps the target function $f$ to $\mathrm{Tr}(f)$ and uses Fourier analysis over $\mathbb{F}_{p}$. However, in the setting of learning juntas, some relevant coordinates for $f$ may become irrelevant for $\mathrm{Tr}(f)$. This loss of information is overcome by considering the $p^{\ell-1}$ functions $c_{1}f,\ldots,c_{p^{\ell-1}}f$ simultaneously for distinct elements $c_{1},\ldots,c_{p^{\ell-1}}\in\mathbb{F}_{q}\setminus\{0\}$, as indicated by the following simple lemma. Note that, for any $c\in\mathbb{F}_{q}$, we can easily simulate the example oracle $\mathbb{O}(cf)$ from $\mathbb{O}(f)$ by multiplying each label by the value $c$.

Lemma 2.

For any function $f:\mathbb{F}_{q}^{n}\to\mathbb{F}_{q}$ , distinct elements $c_{1},\ldots,c_{p^{\ell-1}}\in\mathbb{F}_{q}\setminus\{0\}$ , and relevant coordinate $i\in[n]$ for $f$ , there exists $j\in[p^{\ell-1}]$ such that $i$ is also relevant for $\mathrm{Tr}(c_{j}f):\mathbb{F}_{q}^{n}\to\mathbb{F}_{p}$ .

Proof.

By the definition of relevant coordinates, there exist $x,y\in\mathbb{F}_{q}^{n}$ such that $x$ and $y$ differ only at the coordinate $i$ and $v:=f(x)-f(y)\neq 0$. Since $c_{1},\ldots,c_{p^{\ell-1}}$ are distinct and nonzero, the $p^{\ell-1}$ values $c_{1}v,\ldots,c_{p^{\ell-1}}v$ are also distinct and nonzero. The trace function $\mathrm{Tr}(\cdot)$ takes each value exactly $p^{\ell-1}$ times and $\mathrm{Tr}(0)=0$, so at most $p^{\ell-1}-1$ nonzero elements have trace $0$. Thus there exists $j\in[p^{\ell-1}]$ satisfying $\mathrm{Tr}(c_{j}v)\neq 0$, which implies

$$\mathrm{Tr}(c_{j}(f(x)-f(y)))=\mathrm{Tr}(c_{j}f(x))-\mathrm{Tr}(c_{j}f(y))\neq 0.$$

Therefore, $i$ is also relevant for the function $\mathrm{Tr}(c_{j}f)$. ∎

We also introduce the following fact which plays a crucial role in learning juntas.

Fact 3.

If a function $f:\mathbb{F}_{q}^{n}\to\mathbb{F}_{q}$ satisfies that $\widehat{f}(\alpha)\neq 0$ for some $\alpha\in\mathbb{F}_{q}^{n}$ , then all coordinates $i\in[n]$ with $\alpha_{i}\neq 0$ are relevant.

Proof.

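By contraposition. If there exists an irrelevant coordinate $i\in[n]$ such that $\alpha_{i}\neq 0$, then

$$\widehat{f}(\alpha)=\mathrm{E}[e(f(x)-\chi_{\alpha}(x))]=\mathrm{E}[e(f(x)-\chi_{\alpha^{\prime}}(x))]\cdot\mathrm{E}[e(-\alpha_{i}x_{i})]=0,$$

where $\alpha^{\prime}=(\alpha_{1},\ldots,\alpha_{i-1},0,\alpha_{i+1},\ldots,\alpha_{n})$. The expectation factorizes because $f(x)-\chi_{\alpha^{\prime}}(x)$ does not depend on $x_{i}$, and the second factor vanishes because $\alpha_{i}x_{i}$ is uniformly distributed over $\mathbb{F}_{q}$. ∎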
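Fact 3 can be observed numerically on a tiny example. The sketch below assumes a prime field $\mathbb{F}_{p}$ (so $\mathrm{Tr}$ is the identity and $e(a)=e^{2\pi i a/p}$), computes every Fourier coefficient by direct summation over the truth table, and uses a hypothetical target `junta` that ignores its last coordinate.

```python
import cmath
from itertools import product

def e(a, p):
    """Additive character of the prime field F_p."""
    return cmath.exp(2j * cmath.pi * (a % p) / p)

def chi(alpha, x, p):
    """Linear function chi_alpha(x) = alpha_1 x_1 + ... + alpha_n x_n in F_p."""
    return sum(ai * xi for ai, xi in zip(alpha, x)) % p

def fourier_coefficient(f, alpha, n, p):
    """hat f(alpha) = E_x[ e(f(x)) * conj(e(chi_alpha(x))) ] by brute force."""
    total = 0
    for x in product(range(p), repeat=n):
        total += e(f(x) - chi(alpha, x, p), p)
    return total / p ** n

def junta(x):
    # Hypothetical target over F_3^3: coordinate 2 is irrelevant.
    return (x[0] * x[1]) % 3

for alpha in product(range(3), repeat=3):
    c = fourier_coefficient(junta, alpha, n=3, p=3)
    if abs(c) > 1e-9:
        print(alpha, round(abs(c), 4))
```

Every $\alpha$ printed by this loop has $\alpha_{3}=0$ (index 2 in the code), in line with Fact 3: a non-zero coefficient can only be supported on relevant coordinates.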

2.2 $(a,A)$ -Projection

We define a notion of $(a,A)$ -projection which is a generalization of $A$ -projection in $\mathbb{F}_{2}$ by [8].

Definition 2 ( $(a,A)$ -projection).

For $f:\mathbb{F}_{q}^{n}\to\mathbb{F}_{q}$ , $A\in\mathbb{F}_{q}^{m\times n}$ , and $a\in\mathbb{F}_{q}$ , we define $f^{a}_{A}:\mathbb{F}_{q}^{n}\to\mathbb{C}$ by

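$$f_{A}^{a}(x)=\sum_{\alpha:A\alpha=a^{m}}\widehat{af}(\alpha)e(\chi_{\alpha}(x))=\begin{cases}\sum_{\alpha:A\alpha=1^{m}}\widehat{af}(a\alpha)e(a\chi_{\alpha}(x))&(\text{if }a\neq 0)\\1&(\text{if }a=0)\end{cases}$$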

Lemma 3.

For $A\in\mathbb{F}_{q}^{m\times n}$ and $a\in\mathbb{F}_{q}$ ,

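$$f_{A}^{a}(x)=\mathrm{E}_{z\sim\mathbb{F}_{q}^{m}}\left[e(af(x+A^{T}z))\overline{e(\chi_{a^{m}}(z))}\right].$$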

Moreover, if an example and its label are given by $(x,b)=(y-A^{T}z,f(y)-\sum z_{i})$ for $y\leftarrow_{u}\mathbb{F}_{q}^{n}$ and $z\leftarrow_{u}\mathbb{F}_{q}^{m}$ , then for any $x\in\mathbb{F}_{q}^{n}$ , $\mathrm{E}_{b_{x}}[e(ab_{x})]=f^{a}_{A}(x)$ , where $b_{x}$ denotes a random variable according to the distribution of $b$ conditioned on the example $x$ .

Proof.

It is essentially the same as the proof in [8]. For completeness, see Appendix A.2. ∎
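As a sanity check of the first claim of Lemma 3, the two expressions for $f^{a}_{A}$ can be compared numerically. This is a minimal sketch under the assumption of the prime field $\mathbb{F}_{3}$ (so $\mathrm{Tr}$ is the identity); the matrix `A_EXAMPLE` and the function `f_example` are arbitrary choices made for illustration only.

```python
import cmath
from itertools import product

P = 3  # prime field F_3

def e(a):
    return cmath.exp(2j * cmath.pi * (a % P) / P)

def dot(u, v):
    return sum(s * t for s, t in zip(u, v)) % P

def mat_vec(A, v):
    return tuple(dot(row, v) for row in A)

def coeff_af(f, a, alpha, n):
    """Fourier coefficient of the function a*f at alpha."""
    points = list(product(range(P), repeat=n))
    return sum(e(a * f(x) - dot(alpha, x)) for x in points) / len(points)

def projection_by_definition(f, a, A, x, n):
    """Sum of hat{af}(alpha) e(chi_alpha(x)) over all alpha with A alpha = a^m."""
    target = (a % P,) * len(A)
    return sum(coeff_af(f, a, alpha, n) * e(dot(alpha, x))
               for alpha in product(range(P), repeat=n)
               if mat_vec(A, alpha) == target)

def projection_by_lemma(f, a, A, x, n):
    """E_z[ e(a f(x + A^T z)) * conj(e(chi_{a^m}(z))) ] over z in F_3^m."""
    m = len(A)
    total = 0
    for z in product(range(P), repeat=m):
        shifted = tuple((x[i] + sum(A[r][i] * z[r] for r in range(m))) % P
                        for i in range(n))
        total += e(a * f(shifted) - a * sum(z))
    return total / P ** m

A_EXAMPLE = ((1, 2, 0), (0, 1, 1))  # hypothetical 2 x 3 matrix over F_3

def f_example(x):
    return (x[0] * x[1] + x[2]) % P

def max_gap():
    return max(abs(projection_by_definition(f_example, a, A_EXAMPLE, x, 3)
                   - projection_by_lemma(f_example, a, A_EXAMPLE, x, 3))
               for a in range(P) for x in product(range(P), repeat=3))

print(max_gap() < 1e-9)
```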

2.3 Statistical Distance and Character Distance

For our proofs, we introduce the following two distances between random variables taking values in $\mathbb{F}_{q}$, which were first introduced in [6].

Definition 3 (statistical/character distance).

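For random variables $X,X^{\prime}$ taking values in $\mathbb{F}_{q}$, we define their statistical distance $SD(X,X^{\prime})$ by

$$SD(X,X^{\prime})=\frac{1}{2}\sum_{x\in\mathbb{F}_{q}}\left|\Pr[X=x]-\Pr[X^{\prime}=x]\right|,$$

and we also define their character distance $CD(X,X^{\prime})$ by

$$CD(X,X^{\prime})=\max_{a\in\mathbb{F}_{q}}\left|\mathrm{E}[e(aX)]-\mathrm{E}[e(aX^{\prime})]\right|.$$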

In the case where $q$ is not prime, we adopt a different definition for $e(\cdot)$ from one in the original paper [6]. However, it is easily checked that the following fact holds from exactly the same argument.

Fact 4 ([6, Claim 33]).

For any random variables $X,X^{\prime}$ taking values in $\mathbb{F}_{q}$ ,

[TABLE]

In particular, $SD(X,X^{\prime})=0$ if and only if $CD(X,X^{\prime})=0$ .

3 Reduction from Learning Juntas to LDME

In this paper, for simplicity, we assume the following computational model:

• A learning algorithm can uniformly select an element in $\mathbb{F}_{q}$ with probability 1 in constant steps. In fact, a usual randomized model with binary coins may fail in selecting such random elements with exponentially small probability, but we can deal with this probability as a general error probability (i.e., confidence error). For the same reason, we allow algorithms to flip a biased coin which lands heads up with a rational probability (of the polynomial-time computable denominator).

• A learning algorithm with an example oracle $\mathbb{O}(f)$ , where $f:\mathbb{F}_{q}^{n}\to\mathbb{F}_{q}$ is $k$ -junta, can simulate an oracle $\mathbb{O}(f|_{\tau})$ w.r.t. any restriction $\tau$ of the size $|\tau|\leq k$ . In fact, this simulation is done by taking several examples until getting an example consistent with $\tau$ . Since the probability that an example consistent with $\tau$ is sampled is at least $q^{-k}$ , the failure probability becomes exponentially small by taking $\textup{\text{poly}}(q^{k})$ examples. We can also deal with this error probability as a general confidence error, and the additional running time is at most $\textup{\text{poly}}(n,q^{k})$ .
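The simulation of $\mathbb{O}(f|_{\tau})$ by rejection sampling might look like the sketch below. The names `make_oracle`, `restricted_oracle`, and `max_tries` are hypothetical, and field elements are represented as integers in `range(q)`.

```python
import random

def make_oracle(f, n, q, rng):
    """Example oracle O(f): returns (x, f(x)) for uniform x in F_q^n."""
    def oracle():
        x = tuple(rng.randrange(q) for _ in range(n))
        return x, f(x)
    return oracle

def restricted_oracle(oracle, tau, max_tries):
    """Simulate O(f|_tau) by rejection sampling.

    tau maps a coordinate index to its fixed value. Each call returns an
    example on the free coordinates, or None if no consistent example shows
    up within max_tries draws (treated as a confidence error).
    """
    def sample():
        for _ in range(max_tries):
            x, label = oracle()
            if all(x[i] == v for i, v in tau.items()):
                free = tuple(xi for i, xi in enumerate(x) if i not in tau)
                return free, label
        return None
    return sample

# Toy usage over F_3: fix coordinate 0 to the value 2.
rng = random.Random(0)
oracle = make_oracle(lambda x: (x[0] + x[1]) % 3, 3, 3, rng)
sample = restricted_oracle(oracle, {0: 2}, max_tries=1000)
draws = [sample() for _ in range(50)]
```

Each draw succeeds with probability at least $q^{-|\tau|}$ per attempt, so `max_tries` on the order of $\textup{poly}(q^{k})$ makes a `None` result exponentially unlikely.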

3.1 Overview of the Reduction

Our learning algorithm (main1) has two phases, a checking phase (lines 6 and 7) and a detection phase (line 9), and alternates between them as in the MOS algorithm [15]. The algorithm starts the checking phase with an empty set $R$ . In the subsequent steps, the relevant coordinates found by the algorithm are put in $R$ . In the checking phase, the algorithm verifies whether $R$ contains all relevant coordinates of the target function $f$ by checking that the restricted functions $f|_{\tau}$ are constant for all restrictions $\tau$ on $R$ . If $R$ contains all relevant coordinates, then the algorithm outputs $R$ and halts; otherwise it moves on to the detection phase. In the detection phase, the algorithm finds at least one relevant coordinate, adds the coordinates found to $R$ , and returns to the checking phase. Since the algorithm finds at least one relevant coordinate in each loop, the number of repetitions is at most $k$ .

In the detection phase, we reduce the task of finding relevant coordinates to LDME in the subroutine addRC by $(a,A)$ -projection. In our reduction, the target linear function $\chi_{\alpha}$ satisfies that $\widehat{cf}(a\alpha)\neq 0$ for some $c,a\in\mathbb{F}_{q}\setminus\{0\}$ . Therefore, if the algorithm for LDME finds $\alpha$ (up to constant factor), then the learning algorithm can find at least one relevant coordinate $i$ such that $\alpha_{i}\neq 0$ by Fact 3.

3.2 Algorithms and Analysis

First we introduce two simple subroutines. For the proofs of their correctness (i.e., Lemmas 4 and 5), see Appendix B.

Algorithm 1 checks whether the target function is constant or not by simply examining that the collected examples take the same value. As mentioned in Section 3.1, we will use this subroutine to determine the end of learning in the checking phase.

Algorithm 2 checks whether the given $\alpha\in\mathbb{F}_{q}^{n}$ has a nonzero entry at an irrelevant coordinate. Our learning algorithm main1 may find an undesirable candidate $\alpha$ in the detection phase. We therefore use this subroutine to check that the nonzero entries of the candidate $\alpha$ lie only on relevant coordinates, so that no irrelevant coordinate is added to the container $R$ for relevant coordinates.

Lemma 4.

For any input $(n,k,\delta,\mathbb{O}(f))$ , const outputs $a\in\mathbb{F}_{q}$ if $f\equiv a$ , otherwise $\bot$ with probability at least $1-\delta$ .

Proof.

See Appendix B.1. ∎
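A possible reading of the subroutine const is sketched below. The sample size $m=\lceil q^{k}\ln(2/\delta)\rceil$ is taken from the bound $e^{-m/q^{k}}\leq\delta/2$ used in Appendix B.1; the explicit parameter `q` and the use of `None` for $\bot$ are assumptions of this sketch.

```python
import math
import random

def const(n, k, delta, oracle, q):
    """Return the constant value if all collected labels agree, else None (bottom)."""
    # m chosen so that exp(-m / q^k) <= delta / 2, as in Appendix B.1.
    m = math.ceil(q ** k * math.log(2 / delta))
    labels = {oracle()[1] for _ in range(m)}
    return labels.pop() if len(labels) == 1 else None

# Toy usage over F_3 with n = 4.
rng = random.Random(0)

def constant_oracle():
    x = tuple(rng.randrange(3) for _ in range(4))
    return x, 2

def junta_oracle():
    x = tuple(rng.randrange(3) for _ in range(4))
    return x, x[0]  # a 1-junta, not constant

constant_result = const(4, 1, 0.01, constant_oracle, 3)
junta_result = const(4, 1, 0.01, junta_oracle, 3)
```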

Lemma 5.

For any input $(n,k,\alpha,\delta,\mathbb{O}(f))$ , if $\widehat{f}(\alpha)\neq 0$ , then checkRC outputs true with probability at least $1-\delta$ . Otherwise if $\alpha_{i}\neq 0$ for an irrelevant coordinate $i\in[n]$ , checkRC outputs false with probability at least $1-\delta$ .

In general, there is a case where $\widehat{f}(\alpha)=0$ and all $i\in[n]$ satisfying $\alpha_{i}\neq 0$ are relevant. In the above lemma, we do not say anything about such a case.

Proof.

See Appendix B.2. ∎
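The test behind checkRC can be illustrated as follows. This is a sketch under several assumptions: $q=p$ is prime (so the trace is the identity), the acceptance threshold is placed at $\frac{1}{p}+\frac{1}{2q^{k}}$ (midway between the two cases analyzed in Appendix B.2), and the sample size satisfies $e^{-m/(2q^{2k})}\leq\delta/p$. The exact condition in line 6 of Algorithm 2 is not reproduced here.

```python
import math
import random
from collections import Counter

def check_rc(n, k, alpha, delta, oracle, p):
    """Accept alpha if some value of f(x) - chi_alpha(x) is noticeably over-represented."""
    threshold = 1 / p + 1 / (2 * p ** k)            # assumed threshold
    m = math.ceil(2 * p ** (2 * k) * math.log(p / delta))
    counts = Counter()
    for _ in range(m):
        x, label = oracle()
        chi = sum(a * xi for a, xi in zip(alpha, x)) % p
        counts[(label - chi) % p] += 1
    return any(c / m >= threshold for c in counts.values())

# Toy usage over F_3: f(x) = x_0, so only coordinate 0 is relevant.
rng = random.Random(1)

def oracle():
    x = tuple(rng.randrange(3) for _ in range(3))
    return x, x[0]

accepts_relevant = check_rc(3, 1, (1, 0, 0), 1e-6, oracle, 3)
accepts_irrelevant = check_rc(3, 1, (0, 1, 0), 1e-6, oracle, 3)
```

For $\alpha=(1,0,0)$ the difference $f(x)-\chi_{\alpha}(x)$ is identically zero, so the candidate is accepted. For $\alpha=(0,1,0)$ the difference is uniform, and the candidate is rejected except with negligible probability.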

Algorithm 3 is a core part of our reduction, which reduces the task of finding candidates for relevant coordinates to LDME, checks whether the candidates are indeed relevant, and returns them to the main algorithm. Let LDME $(n,k,\rho)$ be the learning algorithm for LDME.

We briefly explain how the subroutine addRC works. The details will be addressed in Lemma 6 and Appendix B.3.

If the given set $R$ does not contain all relevant coordinates, then for some restriction $\tau$ on $R$ , the restricted function $f|_{\tau}$ is not constant. By Lemma 2, there exists an element $c_{j}$ such that $\mathrm{Tr}(c_{j}f|_{\tau})$ is also non-constant. This subroutine works for such a restriction $\tau$ and an element $c_{j}$ , and finds new relevant coordinates for the function $\mathrm{Tr}(c_{j}f|_{\tau})$ . In fact, the subroutine tries all (at most $q^{k}$ ) restrictions on $R$ and elements $c_{j}$ (line 4).

Let $n^{\prime}:=n-|R|$ . For the (non-constant) restricted function $c_{j}f|_{\tau}:\mathbb{F}_{q}^{n^{\prime}}\to\mathbb{F}_{q}$ , addRC repeats the following process: (1) selects a matrix $A\in\mathbb{F}_{q}^{(k+1)\times n^{\prime}}$ at random (line 6), (2) selects a value $a\in\mathbb{F}_{q}$ (line 7), and (3) executes LDME with the example oracle simulated as in Lemma 3 w.r.t the selected $A$ and $a$ (line 8).

Let $g=c_{j}f|_{\tau}$ . Since the function $\mathrm{Tr}(g):\mathbb{F}_{q}^{n^{\prime}}\to\mathbb{F}_{p}$ is not constant, it has a non-zero coefficient $\widehat{g}(\alpha)\neq 0$ with $|\alpha|>0$ , which means that $g$ has some correlation with the linear function $\chi_{\alpha}$ . In fact $g$ may have correlation with other linear functions, but the number of such linear functions is small because $g$ is also a $k$ -junta. Simply speaking, the role of $A$ is to filter out some of these correlations on simulated examples, and we can show that a non-negligible fraction of the matrices $A$ remove all the correlations except for the one with the linear function $\chi_{\alpha}$ (Claim 2 in Appendix B.3). In other words, the simulated examples depend on only $\chi_{\alpha}$ , which is exactly an instance of LDME. Meanwhile, the role of $a$ is to enhance the correlation with the target linear function, and for a good choice of $a$ , the correlation is bounded below by $1/q^{k+1}$ (Claim 4 in Appendix B.3).

If the algorithm LDME finds $c\alpha$ for some constant $c\in\mathbb{F}_{q}\setminus\{0\}$ , by Fact 3 and the fact that $\widehat{g}(\alpha)\neq 0$ , all coordinates taking non-zero values are relevant for $g=c_{j}f|_{\tau}$ . Moreover, they are also relevant for $f|_{\tau}$ because the algorithm selected non-zero $c_{j}$ . Therefore, we can reduce the task of finding relevant coordinates to LDME of the correlation bound $\rho=1/q^{k+1}$ .

In fact, for bad choices of $A$ and $a$ , the algorithm may find undesirable candidates $\alpha$ . To avoid adding irrelevant coordinates to $R$ in such a case, addRC executes checkRC for every candidate found by LDME (line 13).

Lemma 6.

If the algorithm LDME solves LDME in time $T(n,k,\rho)$ w.h.p. and $R$ does not contain all relevant coordinates, then the subroutine addRC adds at least one relevant coordinate to $R$ with probability at least $1-\delta$ , and its running time is bounded above by $T(n,k,1/q^{k+1})\cdot\textup{\text{poly}}(n,k,\ln{\delta^{-1}})$ .

Proof.

The outline is given above. For the complete proof, see Appendix B.3. ∎

Algorithm 4 is our learning algorithm. Now we prove its learnability by Lemma 6. Theorem 2 immediately follows from Lemma 7 by substituting some constant for $\delta$ .

Lemma 7.

If the algorithm LDME solves LDME in time $T(n,k,\rho)$ w.h.p., then the algorithm main1 outputs all relevant coordinates for any $k$ -junta function $f:\mathbb{F}_{q}^{n}\to\mathbb{F}_{q}$ with probability at least $1-\delta$ , and its running time is bounded above by $T(n,k,1/q^{k+1})\cdot\textup{\text{poly}}(n,k,\ln{\delta^{-1}})$ .

Proof.

First we show that the algorithm halts within at most $k+1$ loops assuming that all subroutines succeed. If $R$ contains all relevant coordinates, then for all restrictions $\tau$ on $R$ , the restricted functions $f|_{\tau}$ must be constant, thus the algorithm halts and outputs $R$ in line 7. On the other hand, if $R$ does not contain some relevant coordinates, addRC adds at least one relevant coordinate to $R$ by Lemma 6. Since $f$ has at most $k$ relevant coordinates, addRC is executed at most $k$ times, and the main loop is repeated at most $k+1$ times.

In fact, the algorithm may fail in executing const and addRC. The number of the executions is at most $(k+1)q^{k}+k$ . Thus if we set their confidence parameter as $\delta/((k+1)q^{k}+k)$ , then by the union bound, the total failure probability is bounded above by $\delta$ . By Lemma 6, the total running time is at most

[TABLE]

∎

4 Reduction from LDME to LBP

First we introduce the following simple lemmas and their corollaries as observations of LDME.

Lemma 8.

Let $X$ be a random variable taking values in $\mathbb{F}_{q}$ . For $0\leq\rho\leq 1$ , if $|\mathrm{E}[e(X)]|\geq\rho$ , then there exists $a\in\mathbb{F}_{q}$ such that $\Pr[X=a]\geq\frac{1}{q}+\frac{\rho}{q^{2}}$ .

Proof.

See Appendix C.1. ∎
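Lemma 8 can be checked numerically for a concrete distribution. The sketch below assumes $q$ is prime with $e(x)=\exp(2\pi ix/q)$; the helper names and the example distributions are hypothetical.

```python
import cmath

def char_expectation(dist):
    """|E[e(X)]| for X on F_q (q prime assumed) with Pr[X = a] = dist[a]."""
    q = len(dist)
    return abs(sum(pa * cmath.exp(2j * cmath.pi * a / q)
                   for a, pa in enumerate(dist)))

def lemma8_holds(dist):
    """Check max_a Pr[X = a] >= 1/q + rho/q^2 with rho = |E[e(X)]|."""
    q = len(dist)
    rho = char_expectation(dist)
    return max(dist) >= 1 / q + rho / q ** 2 - 1e-12  # small float tolerance

skewed = [0.5, 0.3, 0.2]          # biased distribution on F_3
uniform = [1 / 3, 1 / 3, 1 / 3]   # rho = 0, bound is tight
point_mass = [1.0, 0.0, 0.0, 0.0, 0.0]
```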

Lemma 9.

Let $\alpha,\beta\in\mathbb{F}_{q}^{n}\setminus\{0^{n}\}$ and $X$ be a random variable taking values in $\mathbb{F}_{q}$ . If the distribution of $X$ is determined by only the value of $\chi_{\alpha}(x)$ where $x\leftarrow_{u}\mathbb{F}_{q}^{n}$ , and $\alpha$ and $\beta$ are linearly independent over $\mathbb{F}_{q}$ , then for all $a\in\mathbb{F}_{q}$ , $\Pr_{x,X}[X-\chi_{\beta}(x)=a]=\frac{1}{q}$ .

Proof.

See Appendix C.2. ∎

As a corollary, we have the following facts about LDME. Let $\alpha,\beta\in\mathbb{F}_{q}^{n}\setminus\{0^{n}\}$ , $\chi_{\alpha}$ be a target linear function, and $f$ be the target (randomized) function, that is, $\mathrm{Cor}(f,\chi_{\alpha})\geq\rho$ . If $\beta=\alpha$ , then by Lemma 8, there exists some value $a\in\mathbb{F}_{q}$ such that $\Pr[f(x)-\chi_{\beta}(x)=a]\geq 1/q+\rho/q^{2}$ . On the other hand, if $\beta$ and $\alpha$ are linearly independent, then by Lemma 9, $\Pr[f(x)-\chi_{\beta}(x)=a]=1/q$ for all $a\in\mathbb{F}_{q}$ . We essentially use the difference in our reduction. Note that we do not say anything about the case where $\beta\neq\alpha$ but they are linearly dependent (i.e., $\beta=c\alpha$ for some $c\in\mathbb{F}_{q}\setminus\{0,1\}$ ).
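The gap between the two cases can be seen by exact enumeration on a toy instance. In the sketch below, the randomized target is replaced by a deterministic function `g` of $\chi_{\alpha}(x)$ (a special case chosen for illustration), and $q$ is assumed prime; all names are hypothetical.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

def chi(alpha, x, q):
    """Linear function chi_alpha(x) over F_q (q prime assumed)."""
    return sum(a * xi for a, xi in zip(alpha, x)) % q

def residual_distribution(g, alpha, beta, n, q):
    """Exact distribution of g(chi_alpha(x)) - chi_beta(x) for uniform x in F_q^n."""
    counts = Counter((g(chi(alpha, x, q)) - chi(beta, x, q)) % q
                     for x in product(range(q), repeat=n))
    return {a: Fraction(counts[a], q ** n) for a in range(q)}

q, n = 3, 3
alpha, beta = (1, 2, 0), (0, 1, 1)   # linearly independent over F_3

# beta = alpha with a noiseless label: the residual is concentrated on 0.
same = residual_distribution(lambda v: v, alpha, alpha, n, q)
# beta independent of alpha: the residual is exactly uniform (Lemma 9).
indep = residual_distribution(lambda v: (v * v) % q, alpha, beta, n, q)
```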

4.1 Overview of the Reduction

Our learning algorithm is Algorithm 6 (main2), and the main idea is similar to the split-and-list idea in previous work [19, 12]. Let $\alpha\in\mathbb{F}_{q}^{n}$ be the coefficients of a target linear function with $|\alpha|\leq k$ . First we select, by brute-force search, a consecutive partition $(J,\bar{J})$ that divides the nonzero entries of $\alpha$ in half (line 6), then list the values of linear functions $\chi_{\beta}$ of weight $1\leq|\beta|\leq k/2$ where $\beta$ is contained in either $\mathbb{F}_{q}^{J}$ or $\mathbb{F}_{q}^{\bar{J}}$ (lines 8:1–4). To exclude linearly dependent linear functions, we fix an initial value of the coefficient vector for each part of the partition. Since there are at most $(q-1)^{2}$ patterns for the initial values, we can easily guess the pair of initial values consistent with $\alpha^{J}$ and $\alpha^{\bar{J}}$ .
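The listing step for one side of the partition can be sketched as below. It assumes that the "initial value" of a coefficient vector means its first nonzero entry, which is an interpretation rather than something stated in this section; the function name and the sparse dictionary representation are hypothetical.

```python
from itertools import combinations, product

def list_candidates(J, q, max_weight, s):
    """Yield vectors supported in J with weight 1..max_weight whose first
    nonzero entry equals s, as dicts mapping index -> nonzero value."""
    for w in range(1, max_weight + 1):
        for supp in combinations(sorted(J), w):
            # First nonzero entry is pinned to s; the rest range over F_q \ {0}.
            for rest in product(range(1, q), repeat=w - 1):
                yield dict(zip(supp, (s,) + rest))

# Toy usage: J = {0, 1, 2}, q = 3, weight at most 2, initial value 1.
candidates = list(list_candidates([0, 1, 2], 3, 2, 1))
```

Pinning the first nonzero entry keeps one representative per line through the origin, so no two listed vectors are nonzero scalar multiples of each other. In the toy usage there are $3+3\cdot 2=9$ candidates.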

As above, we stretch a noisy example to $O(n^{\frac{k}{2}})$ entries taking values in $\mathbb{F}_{q}$ . Then, we translate the stretched data into an instance of LBP, that is, a $\{\pm 1\}$ -valued instance. We can observe the following three facts. First, each entry takes values uniformly over $\mathbb{F}_{q}$ . Second, the pair of entries corresponding to $\alpha$ (we may call it a target pair) has some correlation in the sense that they take a certain value $a\in\mathbb{F}_{q}$ with relatively high probability, where we refer to such a value $a$ as a concentrated value. Finally, other pairs are distributed pairwise independently.

Now we translate each entry $a\in\mathbb{F}_{q}$ into $1$ or $-1$ as follows: (1) For the case where $a$ is concentrated, we change the entry to $1$ (line 8:5), (2) for the case where $a$ is not concentrated, we flip a biased coin with the head probability $q/(2(q-1))$ , and if it comes up with head, then we change the entry to $-1$ , otherwise to $1$ (line 8:6). Because each entry is uniformly distributed, the probability that the entry is changed to $-1$ is exactly $\frac{q-1}{q}\cdot\frac{q}{2(q-1)}=\frac{1}{2}$ , that is, uniformly distributed over $\{\pm 1\}$ . Moreover, by pairwise independence, all pairs except for the target pair are also independently distributed. On the other hand, in the target pair, the correlation remains even in resulting binary instance. In other words, the reduced instance is just the one of LBP.
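The translation to $\{\pm 1\}$ can be sketched as follows, with the marginal probability computed exactly using rational arithmetic. Function names are hypothetical, and field elements are represented as integers.

```python
import random
from fractions import Fraction

def head_probability(q):
    """Head probability q / (2(q-1)) of the biased coin."""
    return Fraction(q, 2 * (q - 1))

def prob_minus_one_if_uniform(q):
    """Pr[output = -1] for a uniform entry: (q-1)/q * q/(2(q-1))."""
    return Fraction(q - 1, q) * head_probability(q)

def binarize(entry, concentrated, q, rng):
    """Map an F_q entry to +1 or -1 as described above."""
    if entry == concentrated:
        return 1
    # Non-concentrated value: heads -> -1, tails -> +1.
    return -1 if rng.random() < float(head_probability(q)) else 1

rng = random.Random(0)
marginals = {q: prob_minus_one_if_uniform(q) for q in (2, 3, 4, 5, 7, 9)}
```

For $q=2$ the head probability is $1$, so the map reduces to the deterministic relabeling of the binary case.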

4.2 Algorithms and Analysis

First, we introduce the following simple subroutine Algorithm 5, which checks whether a candidate linear function found in the main routine is indeed a target linear function or not. In fact, it can be also implemented by the standard empirical estimation of the correlation. The merit of our implementation by using the conditions in Lemmas 8 and 9 is simply to avoid calculations of complex numbers.

Lemma 10.

Let $\chi_{\alpha}$ be a target linear function. The subroutine checkCor outputs true if the given $\gamma$ satisfies $|\mathrm{E}[e(f(x)-\chi_{\gamma}(x))]|\geq\rho$ with probability at least $1-\delta$ . On the other hand, if $\gamma$ and $\alpha$ are linearly independent, checkCor outputs false with probability at least $1-\delta$ in time $poly(n,\rho^{-1},\ln{\delta^{-1}})$ .

Proof.

The lemma follows from Lemmas 8 and 9 and the standard probabilistic argument. For the complete proof, see Appendix C.3. ∎

Algorithm 6 is our main reduction from LDME to LBP. Let LBP $(S,\rho)$ be a subroutine for solving LBP (of the degree $d$ ) with high probability. W.l.o.g., we can assume the failure probability is at most $1/4$ by constant number of repetitions.

The proof of Lemma 11 is informally given as mentioned in Section 4.1, and we give the complete proof in Appendix C.4. Theorem 3 immediately follows from Lemma 11 by substituting some constant for $\delta$ .

Lemma 11.

Assume that the subroutine LBP solves LBP for some $d\geq\Omega(\frac{\log{N}}{\rho^{2}})$ in time $T(N,\rho)$ w.h.p., where $N$ is the number of the vectors. Then the algorithm main2 $(n,k,\rho,\delta)$ solves LDME for any target linear function $\chi_{\alpha}$ $(1\leq|\alpha|\leq k)$ in time $\textup{\text{poly}}(n,\rho^{-1},\ln{\delta^{-1}})\cdot d\cdot T((qn)^{\frac{k}{2}},\frac{\rho}{2q^{3}})$ with probability at least $1-\delta$ .

5 Discussions and Future Directions

We introduced the reduction from learning juntas over any finite fields to LBP, and gave the first non-trivial learning algorithm for such a class. Our results also enhance the motivation of designing an efficient algorithm for LBP, because it automatically improves the upper bound for learning $k$ -juntas for not only the binary domain but also any finite field.

However, by our reduction, if we could construct a linear-time algorithm for LBP, the upper bound will be improved to at best $O(n^{\frac{k}{2}})$ . Therefore, unlike in the binary case, it is open whether there exists a scenario that the polynomial factor can be improved to less than $k/2$ . Remember that we first reduced the learning juntas problem to LDME which was the extension of the challenging learning problem, LWE. For further improvement, such a hard problem should be avoided.

In addition, our reduction makes extensive use of the properties of finite fields. Thus, it is also open whether we can design a non-trivial learning algorithm that works for any finite alphabet, in particular, $q=6$ .

Appendix A Proofs of Lemmas in Section 2

A.1 Proof of Lemma 1

For convenience, we say $i\in[n]$ is supportive if $\alpha_{i}\neq 0$ . For $i\in[n]$ , let $J_{i}\subset[n]$ be a subset which consists of cyclically consecutive $\lceil n/2\rceil$ coordinates from $i$ , and $m_{i}$ be the number of supportive coordinates contained in $J_{i}$ . For $J_{1}$ , the remaining $\lfloor n/2\rfloor$ coordinates contain $k-m_{1}$ supportive coordinates, thus $k-m_{1}\leq m_{\lceil n/2\rceil+1}\leq k-m_{1}+1$ (because $J_{\lceil n/2\rceil+1}$ also contains the first coordinate in the case where $n$ is odd). If $m_{1}=\lceil k/2\rceil$ , then $(J_{1},\bar{J_{1}})$ is a desired partition. So we assume that $m_{1}\neq\lceil k/2\rceil$ . If $m_{1}\leq\lceil k/2\rceil-1$ , we have $m_{\lceil n/2\rceil+1}\geq k-m_{1}\geq\lfloor k/2\rfloor+1\geq\lceil k/2\rceil$ . Otherwise if $m_{1}\geq\lceil k/2\rceil+1$ , we have $m_{\lceil n/2\rceil+1}\leq k-m_{1}+1\leq\lfloor k/2\rfloor$ . Since the difference between $m_{i}$ and $m_{i+1}$ must be $0$ or $\pm 1$ , there exists at least one coordinate $i$ satisfying $m_{i}=\lceil k/2\rceil$ in any case.
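The argument above is constructive, and the search it suggests can be sketched as a scan over the $n$ cyclic windows. The function name is hypothetical, and indices are 0-based here, unlike the 1-based indexing of the proof.

```python
import math

def balanced_window(alpha):
    """Return a cyclically consecutive window J of ceil(n/2) coordinates that
    contains exactly ceil(k/2) of the k nonzero entries of alpha."""
    n = len(alpha)
    k = sum(1 for a in alpha if a != 0)
    half, target = math.ceil(n / 2), math.ceil(k / 2)
    for i in range(n):
        J = [(i + t) % n for t in range(half)]
        if sum(1 for j in J if alpha[j] != 0) == target:
            return J
    return None  # not reached if the argument above applies

window_even = balanced_window([1, 2, 1, 1, 0, 0, 0, 0])
window_odd = balanced_window([0, 0, 1, 0, 1])
```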

A.2 Proof of Lemma 3

Let $g:\mathbb{F}_{q}^{n}\to\mathbb{C}$ be the right-hand side of (1). It is enough to show that for any $\alpha\in\mathbb{F}_{q}^{n}$ ,

[TABLE]

From the definition of $\widehat{g}(\alpha)$ , it follows that

[TABLE]

For the second part, notice that for any $x\in\mathbb{F}_{q}^{n}$ and $z\in\mathbb{F}_{q}^{m}$ , exactly one element $y_{z}\in\mathbb{F}_{q}^{n}$ satisfying $y_{z}-A^{T}z=x$ is determined. Therefore,

[TABLE]

Appendix B Proofs of Lemmas in Section 3

B.1 Proof of Lemma 4

If $f$ is constant, then the algorithm obviously outputs the value with probability 1. If $f$ is not constant, then there are two entries which have different values in the truth table of $f$ . The probability that each value appears is at least $q^{-k}$ because the value of the truth table is affected by only at most $k$ coordinates. If $m$ examples contain these values as their labels, then the algorithm will output $\bot$ . The probability that each value does not appear in $m$ labels is bounded above by $(1-q^{-k})^{m}\leq e^{-m/q^{k}}\leq\frac{\delta}{2}$ . By the union bound, the failure probability is at most $\delta$ .

B.2 Proof of Lemma 5

First, we consider the case where $\widehat{f}(\alpha)\neq 0$ . Assume that $\Pr[\mathrm{Tr}(f(x)-\chi_{\alpha}(x))=a]<\frac{1}{p}+\frac{1}{q^{k}}$ for all $a\in\mathbb{F}_{p}$ . Since $\alpha$ does not have a nonzero value at irrelevant coordinates by Fact 3, the value $f-\chi_{\alpha}$ is determined by at most $k$ coordinates of $x$ , so each of these probabilities is an integer multiple of $q^{-k}$ . Because $\frac{1}{p}$ is itself an integer multiple of $q^{-k}$ , the assumption yields $\Pr[\mathrm{Tr}(f(x)-\chi_{\alpha}(x))=a]\leq\frac{1}{p}$ for all $a\in\mathbb{F}_{p}$ . As these $p$ probabilities sum to 1, this implies $\Pr[\mathrm{Tr}(f(x)-\chi_{\alpha}(x))=a]=\frac{1}{p}$ for all $a\in\mathbb{F}_{p}$ and $\widehat{f}(\alpha)=0$ , which is a contradiction. Thus, there exists $a^{\prime}\in\mathbb{F}_{p}$ such that $\Pr[\mathrm{Tr}(f(x)-\chi_{\alpha}(x))=a^{\prime}]\geq\frac{1}{p}+\frac{1}{q^{k}}$ . By the Hoeffding inequality, the probability that the condition in line 6 does not hold w.r.t. $a^{\prime}$ is bounded above by $e^{-\frac{m}{2q^{2k}}}\leq\frac{\delta}{p}<\delta$ .

On the other hand, if there exists $i\in[n]$ such that $i$ is irrelevant and $\alpha_{i}\neq 0$ , then for any $a_{q}\in\mathbb{F}_{q}$ ,

[TABLE]

where $\alpha^{\prime}_{i}=0$ and $\alpha^{\prime}_{j}=\alpha_{j}$ for $j\neq i$ . For any $a_{p}\in\mathbb{F}_{p}$ , this implies

[TABLE]

By the Hoeffding inequality, the probability that the condition in line 6 holds is bounded above by $e^{-\frac{m}{2q^{2k}}}\leq\frac{\delta}{p}$ . Therefore, by the union bound, the probability that the condition holds for some $a_{p}\in\mathbb{F}_{p}$ (i.e., the failure probability) is at most $\delta$ .

B.3 Proof of Lemma 6

In this section, we show the correctness of the subroutine addRC. First, we introduce the following simple fact. The reader may skip the proof of Claim 1 because it is quite basic and not essential.

Claim 1.

For any vectors $\alpha,\beta\in\mathbb{F}_{q}^{n}\setminus\{0^{n}\}$ , the following holds:

(i) If $\beta\neq c\alpha$ for any $c\in\mathbb{F}_{q}$ (i.e., $\alpha$ and $\beta$ are linearly independent), then for any $a,b\in\mathbb{F}_{q}$ ,

[TABLE]

(ii) If $\beta=c\alpha$ ( $c\neq 0$ ), then for any $a,b\in\mathbb{F}_{q}$ ,

[TABLE]

In other words, if $\alpha,\beta(\neq 0^{n})$ satisfy condition (i), then $\chi_{\alpha}(x)$ and $\chi_{\beta}(x)$ are uniformly and pairwise independently distributed w.r.t. the uniform selection of $x\in\mathbb{F}_{q}^{n}$ .

Proof.

(i) If $\beta\neq c\alpha$ for any $c\in\mathbb{F}_{q}$ , there are two coordinates $i,j\in[n]$ satisfying $\beta_{i}=c\alpha_{i}$ , $\beta_{j}=c^{\prime}\alpha_{j}$ , $c\neq c^{\prime}$ , and $\alpha_{i},\alpha_{j}\neq 0$ . First we select values in $\mathbb{F}_{q}^{[n]\setminus\{i,j\}}$ , and for any choice, the remaining condition takes the following form: for some $v_{1},v_{2}\in\mathbb{F}_{q}$ ,

[TABLE]

Since $\alpha_{i}c^{\prime}\alpha_{j}-\alpha_{j}c\alpha_{i}=\alpha_{i}\alpha_{j}(c^{\prime}-c)\neq 0$ , the above equations have a unique solution w.r.t. $(x_{i},x_{j})$ . The probability that they take the values of the unique solution is exactly $q^{-2}$ .

(ii) If $\beta=c\alpha$ ( $c\neq 0$ ), the condition takes the following form:

[TABLE]

Obviously, the probability is $q^{-1}$ if $a=c^{-1}b$ ; otherwise, the probability is $0$ . ∎

Next, we show that for a small subspace $\mathbb{F}_{q}^{D}$ , a fixed vector $\alpha\in\mathbb{F}_{q}^{D}$ is the only vector satisfying $A\alpha=1^{m}$ with non-negligible probability w.r.t. the uniform selection of $A$ .

Claim 2.

For any subset $D\subseteq[n]$ $(|D|\leq k)$ , $\alpha\in\mathbb{F}_{q}^{D}\setminus\{0^{n}\}$ , and $m\geq k$ ,

[TABLE]

Especially, if the parameter $m$ is selected as $m=k+1$ , then

[TABLE]

Proof.

The second part immediately follows from the first one, thus we give only a proof of the first part. It is sufficient to show that

[TABLE]

Since $\alpha\neq 0^{n}$ , $\Pr_{x\sim\mathbb{F}_{q}^{n}}[x^{T}\alpha=1]=q^{-1}$ holds, thus we have $\Pr_{A}[A\alpha=1^{m}]=q^{-m}$ . By Claim 1, for any $\beta\neq\alpha$ , we have

[TABLE]

Therefore,

[TABLE]

and

[TABLE]

Since $|D|\leq k$ , the number of vectors $\beta\in\mathbb{F}_{q}^{D}$ is at most $q^{k}$ . Hence, by the union bound,

[TABLE]

which is equivalent to the second part of the inequality (2). ∎
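For very small parameters, the probability of the event in Claim 2 can be computed exactly by enumerating all matrices. The sketch below assumes the event is "$\alpha$ is the unique vector in $\mathbb{F}_{q}^{D}$ with $A\alpha=1^{m}$" (as described in the text preceding the claim), restricts $A$ to its columns in $D$, assumes $q$ is prime, and compares against the $1/q^{k+2}$ bound quoted in the proof of Lemma 6. The function name is hypothetical.

```python
from fractions import Fraction
from itertools import product

def unique_solution_probability(alpha, q, m):
    """Exact Pr_A[alpha is the only beta in F_q^d with A beta = 1^m],
    over uniform A in F_q^{m x d}, where d = len(alpha) plays the role of |D|."""
    d = len(alpha)
    vectors = list(product(range(q), repeat=d))
    good = 0
    for A in product(vectors, repeat=m):   # each row of A is a vector in F_q^d
        sols = [b for b in vectors
                if all(sum(r[i] * b[i] for i in range(d)) % q == 1 for r in A)]
        if sols == [tuple(alpha)]:
            good += 1
    return Fraction(good, q ** (m * d))

# Toy instance: q = 3, |D| = k = 2, m = k + 1 = 3.
prob = unique_solution_probability((1, 0), q=3, m=3)
lower_bound = Fraction(1, 3 ** (2 + 2))
```

In this instance the exact value is $8/243$, comfortably above $1/3^{4}=1/81$.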

Let $f:\mathbb{F}_{q}^{n}\to\mathbb{F}_{q}$ be a $k$ -junta and $D\subseteq[n]$ be the set of relevant coordinates of $f$ . In the following claims, we assume that there exists $\alpha\in\mathbb{F}_{q}^{D}\setminus\{0^{n}\}$ satisfying $\widehat{f}(\alpha)\neq 0$ and the event in Claim 2 occurs for $D$ , $\alpha$ , and $m=k+1$ . By the definition of $(a,A)$ -projection, the projected function satisfies $f^{a}_{A}\equiv\widehat{af}(a\alpha)e(a\chi_{\alpha})$ for any $a\in\mathbb{F}_{q}\setminus\{0\}$ , because $af$ has the same set $D$ of relevant coordinates. In addition, we assume that the example $(x,b)$ is simulated as follows: for $y\leftarrow_{u}\mathbb{F}_{q}^{n}\text{ and }z\leftarrow_{u}\mathbb{F}_{q}^{k+1}$ ,

[TABLE]

Claim 3.

Let $\alpha\in\mathbb{F}_{q}^{n}$ . If the $(a,A)$ -projected function satisfies $f^{a}_{A}\equiv\widehat{af}(a\alpha)e(a\chi_{\alpha})$ for all $a\in\mathbb{F}_{q}\setminus\{0\}$ , and the example $(x,b)$ is simulated as the above, then the conditional distribution of $b_{x}$ is determined by only the value of $\chi_{\alpha}(x)$ , that is, for $x,x^{\prime}\in\mathbb{F}_{q}^{n}$ , if $\chi_{\alpha}(x)=\chi_{\alpha}(x^{\prime})$ , then $SD(b_{x},b_{x^{\prime}})=0$ .

Proof.

By Lemma 3 and the assumption, $\mathrm{E}[e(ab_{x})]=f^{a}_{A}(x)=\widehat{af}(a\alpha)e(a\chi_{\alpha}(x))$ for any $a\in\mathbb{F}_{q}\setminus\{0\}$ . By Fact 4,

[TABLE]

∎

In the algorithm addRC, an example of LDME is simulated as $(x,a\cdot b)$ for some $a\in\mathbb{F}_{q}\setminus\{0\}$ . Obviously if the distribution of $b_{x}$ is determined, then the distribution of $a\cdot b_{x}$ is also determined. In addition, it is also obvious that the value of $\chi_{\alpha}(x)=a^{-1}\chi_{a\alpha}(x)$ is determined by the value of $\chi_{a\alpha}(x)$ . Therefore, the above claim implies that the simulated oracle in the algorithm addRC indeed returns an instance of LDME for the target linear function $\chi_{a\alpha}(x)$ . Finally, we show that the simulated instance has a large correlation with the linear function $\chi_{a\alpha}$ if the algorithm addRC chooses a “good” $a\in\mathbb{F}_{q}\setminus\{0\}$ .

Claim 4.

We assume the same notations and conditions as in Claim 3. In addition, if the $k$ -junta function $f$ satisfies $\widehat{f}(\alpha)\neq 0$ and the parameter $m$ is selected by $m=k+1$ , (i.e., $A\in\mathbb{F}_{q}^{(k+1)\times n}$ ), then

[TABLE]

Proof.

For any $a\in\mathbb{F}_{q}\setminus\{0\}$ ,

[TABLE]

Thus, it is enough to show that $\max_{a\in\mathbb{F}_{q}\setminus\{0\}}|\widehat{af}(a\alpha)|\geq 1/q^{k+1}$ . Let $U^{(1)}_{q},\ldots,U^{(n)}_{q}$ and $U^{\prime}_{q}$ be independently and uniformly distributed random variables over $\mathbb{F}_{q}$ , and let $U^{n}_{q}=(U^{(1)}_{q},\ldots,U^{(n)}_{q})$ .

[TABLE]

By the assumption, $\mathrm{E}[e(f(U^{n}_{q})-\chi_{\alpha}(U^{n}_{q}))]=\widehat{f}(\alpha)\neq 0$ . Since $\mathrm{E}[e(U^{\prime}_{q})]=0$ , they must not be statistically identical, that is, $SD(f(U^{n}_{q})-\chi_{\alpha}(U^{n}_{q}),U^{\prime}_{q})\neq 0$ . In addition, by Fact 3, $f(x)-\chi_{\alpha}(x)$ is $k$ -junta. Therefore, by the definition of statistical distance, $2\cdot SD(f(U^{n}_{q})-\chi_{\alpha}(U^{n}_{q}),U^{\prime}_{q})\geq 1/q^{k}$ . Now we have

[TABLE]

∎

Now we give the proof of Lemma 6.

Proof (Lemma 6).

First, for simplicity, let us assume that execution of checkRC succeeds with probability 1. If $f$ is not constant and some relevant coordinates are not contained in $R$ , then there exists a restriction $\tau$ on $R$ such that the restricted function $f|_{\tau}$ is not constant. In this case, by Lemma 2, there exist $j\in[p^{\ell-1}]$ and $\alpha\in\mathbb{F}_{q}^{|\bar{R}|}$ such that $|\alpha|\geq 1$ and $\widehat{c_{j}f|_{\tau}}(\alpha)\neq 0$ .

For convenience, we write $f$ for the restricted function, i.e., $f:=c_{j}f|_{\tau}$ . For the set $D$ of relevant coordinates of $f$ , $|D|\leq k$ . By Claim 2 and the argument following Claim 2, for all $a^{\prime}\in\mathbb{F}_{q}\setminus\{0\}$ , $f^{a^{\prime}}_{A}\equiv\widehat{a^{\prime}f}(a^{\prime}\alpha)e(a^{\prime}\chi_{\alpha})$ with probability at least $1/q^{k+2}$ w.r.t. the uniform selection of $A$ . Since addRC selects $A$ more than $q^{k+2}\ln{4/\delta}$ times, at least one of the selected $A$ ’s satisfies this condition with probability at least $1-\delta/4$ . Thus in the following argument, we assume that the algorithm addRC succeeds in selecting such an $A$ .

If the algorithm addRC succeeds in selecting the above matrix $A$ , then by Claims 3 and 4, there exists $a\in\mathbb{F}_{q}\setminus\{0\}$ such that the simulated noisy example in line 10 corresponds to the example from LDME of the correlation $\rho\geq 1/q^{k+1}$ . By the assumption, the repetition of LDME recovers $\alpha$ up to a constant factor (i.e., finds $a^{\prime}\alpha$ for some $a^{\prime}\in\mathbb{F}_{q}\setminus\{0\}$ ) with probability at least $1-\delta/4$ . If LDME is solved successfully, then at least one relevant coordinate is added to $R$ in line 14.

If the algorithm addRC fails in selecting $A$ and $a$ , the subroutine LDME may return some undesirable candidate. In this case, the subroutine checkRC returns false in line 13, and irrelevant coordinates are not added to $R$ . Therefore, by the union bound, the failure probability is at most $\delta/4+\delta/4=\delta/2$ under the condition that checkRC succeeds with probability 1.

In fact, our algorithm checkRC may fail. Since the number of executions of checkRC is at most $Mq^{k+3}$ , by the union bound, the probability that some executions of checkRC fail is at most $\delta/2$ . Thus, the total failure probability is at most $\delta$ . The total running time is bounded above by

[TABLE]

∎

Appendix C Proofs of Lemmas in Section 4

C.1 Proof of Lemma 8

For simplicity, let $p_{a}:=\Pr[X=a]$ for $a\in\mathbb{F}_{q}$ . First we show that

[TABLE]

By contraposition, we assume that $|p_{a}-\frac{1}{q}|<\frac{\rho}{q}$ for any $a\in\mathbb{F}_{q}$ . Then,

[TABLE]

where the second equality follows from the fact that $\sum_{a\in\mathbb{F}_{q}}{e(a)}=0$ .

Now we have that $|p_{a}-\frac{1}{q}|\geq\frac{\rho}{q}$ for some $a\in\mathbb{F}_{q}$ . If $p_{a}-\frac{1}{q}\geq\frac{\rho}{q}$ , then $p_{a}\geq\frac{1}{q}+\frac{\rho}{q}\geq\frac{1}{q}+\frac{\rho}{q^{2}}$ . Therefore, the remaining case is that $p_{a}\leq\frac{1}{q}-\frac{\rho}{q}$ . In this case,

[TABLE]

Thus, there exists $b\in\mathbb{F}_{q}$ such that $p_{b}\geq\frac{1}{q}+\frac{\rho}{q(q-1)}\geq\frac{1}{q}+\frac{\rho}{q^{2}}.$

C.2 Proof of Lemma 9

The lemma immediately follows from Claim 1 as follows:

[TABLE]

C.3 Proof of Lemma 10

If $|\mathrm{E}[e(f(x)-\chi_{\gamma}(x))]|\geq\rho$ , then by Lemma 8, there exists $a\in\mathbb{F}_{q}$ such that $\Pr[f(x)-\chi_{\gamma}(x)=a]\geq 1/q+\rho/q^{2}$ . Since checkCor tries all $a\in\mathbb{F}_{q}$ , by the Hoeffding inequality, the condition in line 6 is not satisfied with probability at most

[TABLE]

On the other hand, if $\chi_{\alpha}$ is a target linear function, and $\gamma$ and $\alpha$ are linearly independent, then by Lemma 9, $\Pr[f(x)-\chi_{\gamma}(x)=a]=1/q$ for each $a\in\mathbb{F}_{q}$ . By Hoeffding inequality and the union bound, the error probability that the condition in line 6 is satisfied is at most

[TABLE]

C.4 Proof of Lemma 11

In this section, we show the correctness of the algorithm main2. We use $\alpha$ to denote the coefficients of the target linear function, that is, the distribution of the target randomized function $f(x)$ is determined only by $\chi_{\alpha}(x)$ for each $x\in\mathbb{F}_{q}^{n}$ . We assume that a partition $(J,\bar{J})$ is consecutive and divides a nonzero part of $\alpha$ into half as in Lemma 1.

We begin with the analysis of non-target pairs for each row in the reduced instance.

Claim 5.

If a partition $(J,\bar{J})$ and linearly independent vectors $\beta,\beta^{\prime}\in\mathbb{F}_{q}^{J}\cup\mathbb{F}_{q}^{\bar{J}}$ satisfy that $\alpha^{J}\neq 0^{n}$ , $\alpha^{\bar{J}}\neq 0^{n}$ , and for any $a\in\mathbb{F}_{q}\setminus\{1\},\alpha^{J}\neq a\beta,\alpha^{J}\neq a\beta^{\prime},\alpha^{\bar{J}}\neq a\beta,\alpha^{\bar{J}}\neq a\beta^{\prime}\text{ and }\beta+\beta^{\prime}\neq\alpha$ , then $\chi_{\beta}$ and $\chi_{\beta^{\prime}}$ are uniformly and pairwise independently distributed under any condition about $\chi_{\alpha}$ , i.e., for any $v_{1},v_{2},v_{3}\in\mathbb{F}_{q}$ ,

[TABLE]

The proof of Claim 5 is not so essential, thus the reader may skip over it.

Proof.

Since $\alpha\neq 0^{n}$ , it is enough to show that, for any $v_{1},v_{2},v_{3}\in\mathbb{F}_{q}$ ,

[TABLE]

W.l.o.g., we can assume that $\beta\in\mathbb{F}_{q}^{J}$ and $\beta\neq\alpha^{J}$ (in this case, either $\beta^{\prime}=\alpha^{J}$ or $\beta^{\prime}=\alpha^{\bar{J}}$ may hold). First consider the case where $\beta^{\prime}\in\mathbb{F}_{q}^{\bar{J}}$ . We select three coordinates $(i_{1},i_{2},i_{3})$ as follows: by the linear independence of $\beta$ and $\alpha^{J}$ , we can select $(i_{1},i_{2})$ such that $(\alpha_{i_{1}},\alpha_{i_{2}})$ and $(\beta_{i_{1}},\beta_{i_{2}})$ are also linearly independent. Then, we select $i_{3}\in\bar{J}$ to satisfy that $\beta^{\prime}_{i_{3}}\neq 0$ . Now we have the three vectors $\{(\alpha_{i_{1}},\alpha_{i_{2}},\alpha_{i_{3}}),(\beta_{i_{1}},\beta_{i_{2}},0),(0,0,\beta^{\prime}_{i_{3}})\}$ . It is not difficult to see that they are linearly independent.

Otherwise if $\beta^{\prime}\in\mathbb{F}_{q}^{J}$ , we select $i_{3}$ satisfying $\alpha_{i_{3}}\neq 0$ , and we can select $(i_{1},i_{2})$ such that $(\beta_{i_{1}},\beta_{i_{2}})$ and $(\beta^{\prime}_{i_{1}},\beta^{\prime}_{i_{2}})$ are also linearly independent. Then we have three vectors $\{(\alpha_{i_{1}},\alpha_{i_{2}},\alpha_{i_{3}}),$ $(\beta_{i_{1}},\beta_{i_{2}},0),(\beta^{\prime}_{i_{1}},\beta^{\prime}_{i_{2}},0)\}$ which are also linearly independent.

In any case, for any assignment to $[n]\setminus\{i_{1},i_{2},i_{3}\}$ , the solution of the remaining linear system in $x_{i_{1}},x_{i_{2}},x_{i_{3}}$ is uniquely determined, and the claim holds as in the proof of Claim 1. ∎
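The fact underlying this proof, that linearly independent linear functions of a uniform input are jointly uniform, can be checked by brute force on a toy instance. The following is a minimal sketch, assuming a prime $q$ (so that arithmetic modulo $q$ realizes $\mathbb{F}_{q}$); the helper names `chi` and `joint_counts` and the concrete vectors are hypothetical choices for illustration and do not come from the paper.

```python
from itertools import product
from collections import Counter

def chi(coeffs, x, q):
    """Linear function chi_coeffs(x) = <coeffs, x> over F_q (q assumed prime)."""
    return sum(c * xi for c, xi in zip(coeffs, x)) % q

def joint_counts(vectors, q, n):
    """Count how often each tuple (chi_v(x) for v in vectors) occurs over all x in F_q^n."""
    counts = Counter()
    for x in product(range(q), repeat=n):
        counts[tuple(chi(v, x, q) for v in vectors)] += 1
    return counts

# Hypothetical toy instance: q = 3, n = 4, J = {0, 1}, J-bar = {2, 3}.
q, n = 3, 4
alpha = (1, 2, 1, 0)        # alpha^J = (1, 2, 0, 0), alpha^{J-bar} = (0, 0, 1, 0)
beta = (1, 1, 0, 0)         # supported on J, not a multiple of alpha^J
beta_prime = (0, 0, 2, 1)   # supported on J-bar, not a multiple of alpha^{J-bar}

counts = joint_counts([alpha, beta, beta_prime], q, n)
# If the three vectors are linearly independent, every value triple
# (v3, v1, v2) should occur equally often, namely q^(n-3) times.
is_uniform = len(counts) == q**3 and set(counts.values()) == {q**(n - 3)}
```

Conditioning on any value of $\chi_{\alpha}$ then leaves $(\chi_{\beta},\chi_{\beta^{\prime}})$ uniform over $\mathbb{F}_{q}^{2}$, which is the kind of statement Claim 5 makes.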

In the reduction, we assume that the initial values $s_{1}$ and $s_{2}$ are consistent with $\alpha$, that is, $s_{1}=init(\alpha^{J})$ and $s_{2}=init(\alpha^{\bar{J}})$. Any pair of indices $(\beta,\beta^{\prime})$ except for $(\alpha^{J},\alpha^{\bar{J}})$ satisfies the conditions in Claim 5, because they are non-zero and their initial values are fixed. In addition, the value of $f(x)$ depends only on the value of $\chi_{\alpha}$. Therefore, by Claim 5, the pair of entries indexed by $(\beta,\beta^{\prime})$ is also uniformly and independently distributed.

For an element $a\in\mathbb{F}_{q}$ and a random variable $X$ taking values in $\mathbb{F}_{q}$, we use ${X}_{bin}^{a}$ to denote the $\{\pm 1\}$-valued random variable given by the operation in line 8 of main2, i.e.,

(1) if $X$ takes the value $a$, set ${X}_{bin}^{a}=1$;

(2) otherwise, flip a biased coin with head probability $p_{h}=q/(2(q-1))$, and set ${X}_{bin}^{a}=-1$ if it comes up heads and ${X}_{bin}^{a}=1$ if it comes up tails.

For any $a\in\mathbb{F}_{q}$, if $X$ is uniformly distributed over $\mathbb{F}_{q}$, then $\Pr[{X}_{bin}^{a}=-1]=\frac{q-1}{q}\cdot\frac{q}{2(q-1)}=\frac{1}{2}$, and hence $\Pr[{X}_{bin}^{a}=1]=\frac{1}{2}$ as well. Moreover, it is easy to see that if $X,Y$ are uniformly and pairwise independently distributed, then ${X}_{bin}^{a}$ and ${Y}_{bin}^{a}$ are also uniformly and pairwise independently distributed over $\{\pm 1\}$. Therefore, any pair of entries indexed by $(\beta,\beta^{\prime})\neq(\alpha^{J},\alpha^{\bar{J}})$ is distributed uniformly and independently.
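The binarization step and the claim that it maps a uniform symbol to an unbiased sign can be sketched as follows. This is an illustrative sketch rather than the paper's code: the function names `head_probability`, `binarize`, and `prob_minus_one_uniform` are hypothetical, elements of $\mathbb{F}_{q}$ are represented by the integers $0,\dots,q-1$, and exact rational arithmetic is used for the probability.

```python
from fractions import Fraction
import random

def head_probability(q):
    """p_h = q / (2(q-1)), the bias of the coin used when X != a."""
    return Fraction(q, 2 * (q - 1))

def binarize(x, a, q, rng):
    """Map a symbol x in F_q to {+1, -1} following the two-case rule above."""
    if x == a:
        return 1
    # Heads (probability p_h) gives -1, tails gives +1.
    return -1 if rng.random() < float(head_probability(q)) else 1

def prob_minus_one_uniform(q):
    """Exact Pr[X_bin^a = -1] when X is uniform over F_q."""
    return Fraction(q - 1, q) * head_probability(q)

# The sign is unbiased for every field size tried here.
unbiased = all(prob_minus_one_uniform(q) == Fraction(1, 2) for q in (2, 3, 5, 7, 11))
```

The choice of $p_{h}$ is exactly what cancels the factor $\frac{q-1}{q}$, so a uniform symbol becomes a fair $\pm 1$ bit regardless of $q$.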

Now we move on to the analysis of the target pair, that is, the pair of entries corresponding to $(\alpha^{J},\alpha^{\bar{J}})$ .

*Claim 6.*

Let $(J,\bar{J})$ be any partition of $[n]$ . If a randomized function $f:\mathbb{F}_{q}^{n}\to\mathbb{F}_{q}$ has a correlation with $\chi_{\alpha}$ as $\mathrm{Cor}(f,\chi_{\alpha})\geq\rho$ , then there exist $a_{1},a_{2}\in\mathbb{F}_{q}$ such that

[TABLE]

Proof.

By Lemma 8, $\mathrm{Cor}(f,\chi_{\alpha})\geq\rho$ implies that there exists $a_{1}\in\mathbb{F}_{q}$ such that

[TABLE]

Therefore,

[TABLE]

∎

Then we estimate the correlation between the target pair in the reduced instance.

*Claim 7.*

Let $a\in\mathbb{F}_{q}$ and $\mu\in[0,1]$. If random variables $X$ and $Y$ in $\mathbb{F}_{q}$ satisfy

[TABLE]

then,

[TABLE]

where $p_{h}=\frac{q}{2(q-1)}$ as in the definition of ${X}_{bin}^{a}$ .

Proof.

Let $p_{1},p_{2},p_{3},p_{4}$ denote probabilities as

[TABLE]

Then, it follows that $p_{1}+p_{2}+p_{3}+p_{4}=1$ , $p_{1}\geq\frac{1}{q^{2}}+\mu$ , and

[TABLE]

Therefore, the probability is bounded below by

[TABLE]

∎
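The effect of binarization on a correlated pair can be illustrated with an exact computation. The sketch below rests on assumptions that are not confirmed by the text above: $p_{1},\dots,p_{4}$ are taken to be the probabilities of the events $(X=a,Y=a)$, $(X=a,Y\neq a)$, $(X\neq a,Y=a)$, $(X\neq a,Y\neq a)$; both marginals are assumed to satisfy $\Pr[X=a]=\Pr[Y=a]=1/q$; and the two biased coins are assumed independent. The function names are hypothetical.

```python
from fractions import Fraction

def product_expectation(p1, p2, p3, p4, q):
    """Exact E[X_bin^a * Y_bin^a] given the four joint probabilities.

    Conditioned on X != a, the sign has mean 1 - 2*p_h = -1/(q-1);
    conditioned on X == a, it is always +1. The two coins are assumed independent.
    """
    c = 1 - 2 * Fraction(q, 2 * (q - 1))
    return p1 + (p2 + p3) * c + p4 * c * c

def biased_instance(q, mu):
    """Joint probabilities with uniform marginals at a and Pr[X = Y = a] = 1/q^2 + mu."""
    p1 = Fraction(1, q * q) + mu
    p2 = Fraction(1, q) - p1
    p3 = Fraction(1, q) - p1
    p4 = 1 - p1 - p2 - p3
    return p1, p2, p3, p4

q, mu = 3, Fraction(1, 100)
expectation = product_expectation(*biased_instance(q, mu), q)
# Under the stated assumptions this equals mu * (q / (q-1))^2, which is at least mu.
```

Under these assumptions an excess $\mu$ in $\Pr[X=Y=a]$ translates into a product expectation of $\mu\,(q/(q-1))^{2}\geq\mu$, while $\mu=0$ gives expectation $0$.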

For our settings, take $X=f(x)-\chi_{\alpha^{J}}(x)-a_{1}$ , $Y=\chi_{\alpha^{\bar{J}}}(x)$ , and $\mu=\rho/q^{3}$ . Then we have

[TABLE]

and

[TABLE]

Therefore, if we take sufficiently many samples, then the target pair has correlation at least $\frac{\rho}{2q^{3}}$ w.h.p. Now we give the proof of Lemma 11.

Proof (Lemma 11).

As in the proof of Lemma 6, we assume that all executions of checkCor succeed. Under this condition, even if an incorrect candidate is found in the brute-force search over $(J,\bar{J}),a_{1},a_{2},s_{1}$, and $s_{2}$, the algorithm main2 does not output such an incorrect answer by Lemma 10. In fact, it is easily checked that the number of executions of checkCor in lines 4 and 17 is at most $n(q-1)$ and $nq^{2}(q-1)^{2}\cdot M$, respectively. Therefore, by the union bound, the probability that at least one execution fails is bounded above by

[TABLE]

Let $\alpha\in\mathbb{F}_{q}^{n}$ be the coefficients of the target linear function and $f$ be the target randomized function corrupted with noise. If $|\alpha|=1$, then by our assumption on checkCor, the target linear function must be found in line 4. Therefore, we assume that $2\leq|\alpha|\leq k$. In this case, we show that the reduced binary instance is an instance of LBP with correlation $\frac{\rho}{2q^{3}}$ w.h.p. We assume that, as mentioned in the definition of the algorithm main2, all columns are labeled by vectors in $\mathbb{F}_{q}^{n}$. In addition, assume that the algorithm main2 succeeds in selecting $(J,\bar{J}),a_{1},a_{2},s_{1}$, and $s_{2}$ satisfying that

• $1\leq|\alpha^{J}|\leq\lceil k/2\rceil$ and $1\leq|\alpha^{\bar{J}}|\leq\lfloor k/2\rfloor$ (by Lemma 1, such a consecutive partition must exist)

• $\Pr_{x,f}[f(x)-\chi_{\alpha^{J}}(x)-a_{1}=\chi_{\alpha^{\bar{J}}}(x)=a_{2}]\geq 1/q^{2}+\rho/q^{3}$ (by Claim 6, such values of $a_{1},a_{2}$ must exist)

• $init(\alpha^{J})=s_{1}$ and $init(\alpha^{\bar{J}})=s_{2}$

Then the reduced instance must contain the pair of columns indexed by $(\alpha^{J},\alpha^{\bar{J}})$, which we call the target pair. For any pair of columns other than the target pair, as mentioned in the observation following Claim 5, the pair in the reduced instance is uniformly and independently distributed over $\{\pm 1\}^{d}$. On the other hand, for each row of the target pair, the product of the two entries is also $\{\pm 1\}$-valued and its expectation is at least $\rho/q^{3}$ by Claim 7. If we select the sample size $d$ to be more than $\frac{8q^{6}}{\rho^{2}}\ln 4$, then by the Hoeffding inequality, the probability that their inner product does not exceed $\rho/(2q^{3})$ is bounded above by

[TABLE]

In other words, with probability at least $3/4$, the algorithm reduces LDME to LBP with correlation $\rho/(2q^{3})$. W.l.o.g., we can assume that the failure probability of LBP is at most $1/4$ (otherwise, this is achieved by a constant number of repetitions). Thus, for each trial in lines 8 and 16, the probability that LBP does not find the target pair is at most $1/2$. Therefore, by repeating these trials at least $\log(2/\delta)$ times, the failure probability decreases to $\delta/2$. Even if we consider the possibility that checkCor may fail, the total failure probability is bounded above by $\delta/2+\delta/2=\delta$. The total running time is bounded above by

[TABLE]

∎
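The parameter choices in this proof can be sanity-checked numerically. The sketch below is only illustrative and makes two assumptions not stated in the text: it uses the one-sided form of Hoeffding's inequality for variables in $[-1,1]$ (a two-sided form with a leading factor of 2 would change the constant), and it reads $\log$ as the base-2 logarithm. The function names and the sample values of $q$, $\rho$, $\delta$ are hypothetical.

```python
import math

def sample_size(q, rho):
    """Smallest integer d strictly greater than (8 q^6 / rho^2) * ln 4."""
    return math.floor(8 * q**6 / rho**2 * math.log(4)) + 1

def hoeffding_lower_tail(d, eps, width=2.0):
    """One-sided Hoeffding bound exp(-2 d eps^2 / width^2) for d samples
    taking values in an interval of the given width (here [-1, 1])."""
    return math.exp(-2 * d * eps**2 / width**2)

def repetitions(delta):
    """Number of independent trials so that (1/2)^trials <= delta / 2."""
    return math.ceil(math.log2(2 / delta))

# Hypothetical parameters for illustration.
q, rho, delta = 3, 0.1, 0.01
d = sample_size(q, rho)
eps = rho / q**3 - rho / (2 * q**3)   # gap between the mean rho/q^3 and the threshold
tail = hoeffding_lower_tail(d, eps)   # probability the empirical mean falls below threshold
reps = repetitions(delta)
```

With these values the tail bound stays below $1/4$, and the repetition count drives the per-trial failure probability of $1/2$ down to at most $\delta/2$.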

