A Faster Algorithm Enumerating Relevant Features over Finite Fields
Mikito Nanashima

TL;DR
This paper introduces a novel, efficient algorithm for learning k-juntas over finite fields, expanding Fourier detection techniques beyond binary cases and connecting to problems like LDME and the light bulb problem.
Contribution
It extends Fourier detection methods to finite fields and provides the first non-trivial algorithm for k-juntas over such fields, answering an open question.
Findings
Achieves an $O(n^{0.8k})$-time learning algorithm for k-juntas over finite fields.
First non-trivial algorithm for multi-labeled data in this context.
Reduces the problem to well-studied problems like LDME and LBP, enabling the use of existing techniques.
A Faster Algorithm Enumerating Relevant Features over Finite Fields
Mikito Nanashima
Tokyo Institute of Technology
We consider the problem of enumerating relevant features hidden in other irrelevant information for multi-labeled data, which is formalized as learning juntas.
A $k$-junta function is a function which depends on only $k$ coordinates of the input. For relatively small $k$ w.r.t. the input size $n$, learning $k$-junta functions is one of the fundamental problems, both theoretically and practically, in machine learning. For the last two decades, much effort has been made to design efficient learning algorithms for Boolean junta functions, and some novel techniques have been developed. However, in the real world, multi-labeled data seem to arise much more often than binary-labeled data. Thus, it is a natural question whether these techniques can be applied to more general alphabet sizes.
In this paper, we expand the Fourier detection techniques for the binary alphabet to any finite field $\mathbb{F}_q$, and give, roughly speaking, an $O(n^{0.8k})$-time learning algorithm for $k$-juntas over $\mathbb{F}_q$. Note that our algorithm is the first non-trivial (i.e., non-brute force) algorithm for such a class even in the case where $q = 3$, and we give an affirmative answer to the question posed by Mossel et al. [15].
Our algorithm consists of two reductions: (1) from learning juntas to the learning with discrete memoryless errors (LDME) problem, which is an extension of the learning with errors (LWE) problem introduced by Regev [17], and (2) from LDME to the light bulb problem (LBP) introduced by L. Valiant [21]. Since the reduced problem (i.e., LBP) is a kind of binary problem regardless of the alphabet size of the original problem (i.e., learning juntas), we can directly apply the techniques for the binary problem from previous work.
1 Introduction
1.1 Background and Motivation
In both practical and theoretical senses, it is a fundamental challenge to separate relevant information from irrelevant information in data analysis. In many machine learning settings, collected data may contain many irrelevant features together with relevant features (e.g., DNA sequences and big data), and efficient techniques for selecting relevant features are widely needed. This problem is captured by learning juntas, which is one of the most challenging and important issues in computational learning theory. Informally, we say an $n$-input function $f$ is $k$-junta ($k \le n$) iff $f$ depends on at most $k$ coordinates of the input. Our task is to find the relevant coordinates (i.e., features) of a $k$-junta function $f$, called a target function, from passively collected examples of the form $(x, f(x))$.
In the special case where the domain of a target function is binary, that is, $\{0,1\}^n$, the learning junta problem is theoretically important. For $k = O(\log n)$, learning $k$-junta functions is a special case of learning polynomial-size DNF (disjunctive normal form) formulas and log-depth decision trees, which are notorious open problems in computational learning theory, even in the uniform-distribution model (i.e., examples are distributed uniformly over $\{0,1\}^n$). Therefore, an affirmative answer to such problems necessarily requires an efficient learning algorithm for log-juntas. Despite much effort by many researchers, efficient (i.e., polynomial-time) learning algorithms for log-juntas have not been found. From another point of view (i.e., parameterized complexity introduced by [7]), the learning juntas problem can be regarded as a parameterized learning problem for general Boolean functions, and in fact, fixed-parameter intractability results have been found for (proper) learning of juntas under arbitrary example distributions in [2]. However, in the uniform-distribution model, no convincing argument for intractability has been found so far. For further details about learning juntas, see the survey by Blum [4].
On the positive side, some elegant techniques for learning Boolean juntas have been developed in the uniform-distribution model since the problem was posed in [3, 5]. Obviously, any $k$-junta function can be learned in time roughly $n^k$ with high probability by brute-force search over all patterns of relevant coordinates. The first polynomial factor improvement was found by Mossel et al. [15], and the running time was reduced to roughly $n^{\frac{\omega}{\omega+1}k}$, where $\omega$ denotes the exponent of fast matrix multiplication, with the best known bound of $\omega < 2.373$ in [14]. Further improvement has been made by G. Valiant [19], who developed a faster learning algorithm running in time roughly $n^{0.6k}$, which is the best learning algorithm at present. The main contributions of [19] are a subquadratic algorithm for the light bulb problem, which was posed in [21], and a reduction from learning Boolean juntas to the light bulb problem.
In the real world, multi-labeled data such as questionnaires or DNA sequences (i.e., sequences over (A,T,G,C)) seem to arise much more often than binary-labeled data. It is therefore a natural question whether the techniques for learning Boolean juntas can be adapted to more general domains. Although the learning problem for $k$-juntas over a finite alphabet was mentioned as a direction for future work in [15], there are far fewer learnability results in the general case than in the binary case. Obviously, it can be solved in time as in the case . The subsequent work [9] implicitly gave the non-trivial -time algorithm in the case where for some , by reducing the learning problem to learning problems for junta functions of the range . However, to the best of our knowledge, no non-trivial learning algorithm for juntas over more general domains has been known, even in the case where . In this paper, we investigate the learnability of juntas over arbitrary finite fields, and explicitly give the first non-trivial learning algorithm for such classes.
1.2 Our Contributions
Let $\mathbb{F}_q$ be an arbitrary finite field of order $q$ where . In this paper, we focus on $k$-junta functions over $\mathbb{F}_q$ as target functions. Formally, $k$-junta functions are defined as follows.
**Definition 1.**
For a function $f: \mathbb{F}_q^n \to \mathbb{F}_q$, we say that a coordinate $i \in [n]$ is relevant if $f(x) \neq f(y)$ for some points $x, y$ of $\mathbb{F}_q^n$ such that $x$ and $y$ differ only at the coordinate $i$. For $k \le n$, we say that a function $f$ is $k$-junta if $f$ has at most $k$ relevant coordinates.
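Definition 1 can be checked directly by exhaustive search on small instances. The following is a minimal sketch, assuming field elements are encoded as the integers `0..q-1` (only equality of values is used, so no field arithmetic is needed); the function names and the example target are hypothetical and introduced only for illustration.

```python
from itertools import product

def relevant_coordinates(f, n, q):
    """Return the set of relevant coordinates of f by brute force.

    A coordinate i is relevant if f(x) != f(y) for some x, y that
    differ only at coordinate i (Definition 1). Runs in time about
    n * q^(n+1), so it is only meant for tiny n and q.
    """
    relevant = set()
    for x in product(range(q), repeat=n):
        fx = f(x)
        for i in range(n):
            if i in relevant:
                continue
            for v in range(q):
                if v == x[i]:
                    continue
                y = x[:i] + (v,) + x[i + 1:]
                if f(y) != fx:
                    relevant.add(i)
                    break
    return relevant

def is_k_junta(f, n, q, k):
    """f is k-junta if it has at most k relevant coordinates."""
    return len(relevant_coordinates(f, n, q)) <= k

# Hypothetical target over F_3 with n = 4: depends on coordinates 0 and 3 only.
def example_target(x):
    return (x[0] * x[3] + x[0]) % 3
```

For instance, `relevant_coordinates(example_target, 4, 3)` reports coordinates 0 and 3. The learning problem studied here is harder than this check because the learner only sees random examples, not the whole truth table.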
We state the learning junta problem more formally. The learning setting mainly follows the framework of PAC (Probably Approximately Correct) learning, which was first introduced by L. Valiant [20]. The number of relevant coordinates is given in advance by some fixed function , and a learning algorithm knows the function . The learning algorithm is given an example oracle as the only access to the target function $f$. For each access to , it returns an example $(x, f(x))$, where $x$ is selected uniformly at random over $\mathbb{F}_q^n$.
The learning junta problem is formally stated as follows. In this paper, we will use the term "with high probability" (w.h.p. for short) to mean with some constant probability.
Learning $k$-juntas (over finite field)
Input: , and an example oracle where $f$ is $k$-junta
Goal: Find all (at most $k$) relevant coordinates w.h.p.
As described in [10], the failure probability can be reduced to any given by independent repetitions. The reader may think the above formulation differs from the usual PAC learning model in the sense that the learning algorithm will not output a hypothesis function. However, as described in [4, 15], the difficulty of learning juntas comes from the task of finding not what the function is but where the relevant coordinates are. In fact, the above formulation is equivalent to the usual PAC learnability under uniform distribution in learning juntas (within the multiplicative factor of ).
In this paper, we will prove the following main result.
**Theorem 1 (main).**
For any and , $k$-juntas over any finite field $\mathbb{F}_q$ are learnable in time .
Our learning algorithm mainly follows the line of work by [8, 19] and consists of two reductions that generalize their reductions for the binary domain to any finite field $\mathbb{F}_q$.
In the first step, we reduce the learning juntas problem to another learning problem, learning with discrete memoryless errors (LDME). Simply speaking, the task of LDME is to learn a linear function with under the condition that the label may be corrupted with random noise, where with arithmetic in . For simplicity, we regard a randomized function as a target function to capture the noise.
Learning with Discrete Memoryless Errors: LDME
Input: , , and an example oracle ,
where is randomized. The distribution of the value is determined by only a value of (not itself), where . The target function is close to in the sense of correlation as follows:
[TABLE]
where the mapping is defined by for and .
Goal: Find the coefficients for some w.h.p.
We call the above function a target linear function. The reason why we allow the algorithm to output instead of is that the linear function may also have large correlation, that is, .
Let us briefly review the background of the above problem. LDME, first introduced in [9], is an extension of the well-known learning with errors problem (LWE), which is known as one of the most challenging problems in learning theory and is even used as a hardness assumption in cryptography (see [17, 18]). The difference between them is the noise setting. In LWE, the (unknown) distribution of noise is fixed in advance, while in LDME the distribution is determined for each value of the target linear function; in other words, there exist $q$ unknown distributions of the noise in total. Note that, in addition, we adopt a slightly different condition on the closeness between and compared to [9]. In the previous formulation, the given was the lower bound for the agreement probability that . However, in our formulation by correlation, the agreement probability is not always large. For example, even in the case where the difference is close to some constant, our condition on the closeness may hold.
We first present the reduction from the learning juntas problem to LDME, which is a generalization of the binary case in [8]. The detail will be given in Section 3.
**Theorem 2.**
If there exists a learning algorithm for solving LDME in time , then there exists a learning algorithm for $k$-juntas over $\mathbb{F}_q$ in time .
In the second step, we reduce LDME to the light bulb problem (LBP), which was first introduced in [21] and is also a fundamental problem in machine learning and data analysis. Roughly speaking, the task of LBP is to find a correlated pair among the other uncorrelated pairs. The formal definition is as follows:
Light Bulb Problem: LBP
Input: a set of vectors, and ,
where for each . The instance contains a single correlated pair satisfying , and the other pairs of vectors are selected independently and uniformly at random.
Goal: Find indices of the correlated pair w.h.p.
It is obvious that LBP is solved in time by calculating inner products of all pairs. As a breakthrough result, the first subquadratic algorithm for LBP has been found by [19]. Moreover, in the case where , a faster algorithm was presented by [12]. Other subquadratic algorithms also have been proposed in [13, 1].
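The quadratic baseline mentioned above can be sketched as follows. This is a minimal illustration, assuming the usual formulation of LBP with vectors over $\{-1,+1\}$ and a planted pair whose normalized inner product is about `rho`; the instance generator, its parameters, and all names are hypothetical and not taken from the paper.

```python
import random
from itertools import combinations

def make_lbp_instance(num_vectors, dim, rho, rng):
    """Random +/-1 vectors with one planted pair of correlation about rho."""
    vectors = [[rng.choice((-1, 1)) for _ in range(dim)]
               for _ in range(num_vectors)]
    i, j = rng.sample(range(num_vectors), 2)
    # Each entry of v_j agrees with v_i with probability (1 + rho) / 2,
    # so the expected normalized inner product of the pair is rho.
    vectors[j] = [a if rng.random() < (1 + rho) / 2 else -a
                  for a in vectors[i]]
    return vectors, tuple(sorted((i, j)))

def naive_lbp(vectors):
    """Return the pair with the largest absolute inner product.

    Examines all pairs, i.e. quadratically many inner products.
    """
    def score(pair):
        u, v = vectors[pair[0]], vectors[pair[1]]
        return abs(sum(a * b for a, b in zip(u, v)))
    return max(combinations(range(len(vectors)), 2), key=score)
```

With enough coordinates per vector, the planted pair stands out from the random pairs, and `naive_lbp` recovers it. The subquadratic algorithms cited above avoid enumerating all pairs.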
**Fact 1 ([12, Corollary 2.2]).**
For any and , if , then there is a randomized algorithm for solving LBP with probability in time .
We present the second reduction from LDME to LBP. Note that the reduced problem is a kind of binary problem regardless of the alphabet size of the original problem. The detail will be given in Section 4.
**Theorem 3.**
Assume that there exist and an algorithm for solving LBP of degree in time w.h.p., where is the number of vectors in LBP. Then for any target linear function and any correlation , LDME is solved w.h.p. in time .
In our reduction, the size of data is stretched from to . Thus, the naive quadratic algorithm for LBP does not improve the trivial upper bound on the running time of LDME at all. However, by combining our reductions with the subquadratic algorithm for LBP, we have a non-trivial learnability result which holds for any finite field, and Theorem 1 immediately follows from Theorems 2 and 3, and Fact 1.
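As a rough consistency check on the exponents involved (a back-of-the-envelope sketch, not the paper's exact statement), suppose that the stretched LBP instance has about $N \approx n^{k/2}$ vectors, as the split-and-list overview in Section 4.1 suggests, ignoring factors that depend only on $q$ and $k$, and suppose that LBP is solved in time $N^{c}$. The symbols $N$ and $c$ are introduced here only for illustration.

```latex
\[
  N \approx n^{k/2}
  \quad\Longrightarrow\quad
  N^{c} \approx n^{ck/2},
  \qquad
  c = 2 \;\Rightarrow\; n^{k},
  \qquad
  c = 1.6 \;\Rightarrow\; n^{0.8k}.
\]
```

Under these assumptions, a quadratic LBP algorithm only recovers the brute-force bound, while any exponent $c < 2$ gives a non-trivial bound; an exponent of about $1.6$ corresponds to the $O(n^{0.8k})$ running time stated for the main result.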
In Theorem 1, the condition that essentially comes from the condition that in Fact 1. Therefore, by adopting another subquadratic algorithm for LBP that works for any (e.g., [19]), we have a non-trivial learnability result for any . Remark that our reduction and such a subquadratic algorithm also give the non-trivial learning algorithm for LDME, in particular, LWE parameterized by .
2 Preliminaries
We use $\log$ to denote the logarithm of base 2, and $\ln$ to denote the natural logarithm. For any integer $n$, we define a set $[n] = \{1, \ldots, n\}$. Let $\mathbb{F}_q$ be a finite field of order $q$ where . We define a trace function by . Note that for any , , and takes on each value in equally often.
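To make the trace function concrete, here is a small sketch for the field with four elements. It assumes the standard field trace onto the prime subfield, which for $\mathbb{F}_4$ over $\mathbb{F}_2$ is $\mathrm{Tr}(x) = x + x^2$; the integer encoding of field elements and the helper names are assumptions made for this illustration.

```python
# GF(4) = {0, 1, a, a+1} encoded as the integers 0..3
# (bit 1 is the coefficient of a), with a^2 = a + 1.

def gf4_add(x, y):
    """Addition in characteristic 2 is bitwise XOR."""
    return x ^ y

def gf4_mul(x, y):
    """Carry-less multiplication reduced modulo a^2 + a + 1 (0b111)."""
    prod = 0
    for bit in range(2):
        if (y >> bit) & 1:
            prod ^= x << bit
    if prod & 0b100:
        prod ^= 0b111
    return prod

def trace(x):
    """Field trace from GF(4) to GF(2): Tr(x) = x + x^2."""
    return gf4_add(x, gf4_mul(x, x))

values = [trace(x) for x in range(4)]  # [0, 0, 1, 1]
```

The list `values` shows the two properties noted above in this small case: the trace is additive, and it takes each value of the prime subfield equally often (here, twice each).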
For , we define the weight of by . For , we also define its initial by the first non-zero value of , that is, iff there exists such that and for each . Note that if satisfy and , then there is no such that (i.e., and are linearly independent over ).
For any , we define a subspace by , where . For any and , we also define by if .
For a subset , we call a pair a partition of . In addition, if consists of cyclically consecutive coordinates, we say that the partition is consecutive. Obviously an index set has exactly consecutive partitions. Now we introduce the following useful lemma, which says that any subset in is divided into exactly half by at least one consecutive partition of .
**Lemma 1.**
For any with , there exists at least one consecutive partition which satisfies that and .
Proof.
See Appendix A.1. ∎
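The balancing property can be explored by brute force. The sketch below makes two assumptions that are not recoverable from the text above: a consecutive partition is taken to be a cyclic window of $\lceil n/2 \rceil$ coordinates together with its complement, and "divided in half" is taken to mean that the window contains exactly $\lceil |S|/2 \rceil$ elements of the subset $S$. Coordinates are 0-indexed and all names are hypothetical.

```python
from math import ceil

def consecutive_windows(n):
    """Yield the n cyclic windows of length ceil(n/2) over {0, ..., n-1}."""
    length = ceil(n / 2)
    for start in range(n):
        yield {(start + j) % n for j in range(length)}

def balanced_consecutive_partition(n, subset):
    """Find a window I (with complement J) holding ceil(|S|/2) elements of S.

    Returns (I, J), or None if no such window exists.
    """
    subset = set(subset)
    target = ceil(len(subset) / 2)
    for window in consecutive_windows(n):
        if len(subset & window) == target:
            return window, set(range(n)) - window
    return None
```

Sliding the window by one position changes the count by at most one, which is why a balanced window can always be found under these assumptions; only $n$ windows need to be tried.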
We use the term "a truth table" to denote a table of values of a function over as in the binary case. For any function and value , we define a function by . For a subset , we define a restriction on as a partial assignment to , and we use to denote the restricted function of which variables are partially assigned on . We use to denote the size of a restriction , that is, .
For a finite set , we write for a random sampling of according to the uniform distribution over . In the subsequent discussions, we assume the basic facts about probability theory, especially, pairwise independence and the union bound. We will make extensive use of the following tail bound.
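**Fact 2 (Hoeffding inequality [11]).**
For real values $a < b$, let $X_1, \ldots, X_m$ be independent and identically distributed random variables with $X_i \in [a, b]$ and $\mathrm{E}[X_i] = \mu$ for each $i \in [m]$. Then for any $\epsilon > 0$, the following inequality holds:
$$\Pr\left[\left|\frac{1}{m}\sum_{i=1}^{m} X_i - \mu\right| \ge \epsilon\right] \le 2\exp\left(-\frac{2m\epsilon^2}{(b-a)^2}\right).$$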
2.1 Fourier Analysis
We introduce some basics of Fourier analysis over finite fields. For further details, see [16, 9]. For each , let . For , it is easy to see that and . For any two functions , we define their inner product by . Then a family of functions forms an orthonormal basis, that is, if , otherwise, . Therefore, any function has a unique Fourier expansion of the form , where is a Fourier coefficient given by .
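The orthonormality claim can be checked numerically in a small case. The sketch below assumes a prime field, where the trace is the identity, and the standard characters $\chi_\alpha(x) = \omega^{\langle \alpha, x \rangle}$ with $\omega = e^{2\pi i/q}$ and inner product $\langle f, g \rangle = \mathrm{E}_x[f(x)\overline{g(x)}]$; these formulas and all names are assumptions made for illustration, since the exact notation is not recoverable from the text above.

```python
import cmath
from itertools import product

Q, N = 3, 2  # illustrative choice: the prime field F_3 and two coordinates
OMEGA = cmath.exp(2j * cmath.pi / Q)

def chi(alpha, x):
    """Character chi_alpha(x) = omega^<alpha, x> over the prime field F_Q."""
    return OMEGA ** (sum(a * b for a, b in zip(alpha, x)) % Q)

def inner(f, g):
    """Inner product <f, g> = E_x[f(x) * conj(g(x))], x uniform over F_Q^N."""
    points = list(product(range(Q), repeat=N))
    return sum(f(x) * g(x).conjugate() for x in points) / len(points)

def fourier_coefficient(f, alpha):
    """Fourier coefficient of a complex-valued f at alpha: <f, chi_alpha>."""
    return inner(f, lambda x: chi(alpha, x))
```

Evaluating `inner` on every pair of characters gives 1 on the diagonal and 0 elsewhere, up to floating-point error, which is the orthonormality used to define the Fourier expansion.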
For a function and , we also define its Fourier coefficient on by (we use the same notation as above). Let us remark that, unlike complex-valued functions, does not always have a unique Fourier form, because the value is mapped onto in the definition of , and there exist different functions which satisfy . Our algorithm will extensively use the above analysis; more specifically, it will map the target function to and use the Fourier analysis over . However, in the setting of learning juntas, some relevant coordinates for may turn out to be irrelevant for . This lack of information will be overcome by considering functions simultaneously for distinct elements , as indicated by the following simple lemma. Note that, for any , we can easily simulate the example oracle from by multiplying each label by the value .
**Lemma 2.**
For any function , distinct elements , and relevant coordinate for , there exists such that is also relevant for .
Proof.
By the definition of relevant coordinates, there exists such that and differ only at the coordinate and . Since are distinct and nonzero, the values are also distinct and nonzero. The trace function takes each value exactly times and , thus there exists satisfying , which implies
[TABLE]
Therefore, is also relevant for the function . ∎
We also introduce the following fact which plays a crucial role in learning juntas.
**Fact 3.**
If a function $f$ satisfies $\hat{f}(\alpha) \neq 0$ for some $\alpha$, then all coordinates $i$ with $\alpha_i \neq 0$ are relevant.
Proof.
By contraposition. If there exists an irrelevant coordinate such that ,
[TABLE]
where . ∎
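Fact 3 can be observed on a toy example. The sketch below assumes a prime field and takes the Fourier coefficient of an $\mathbb{F}_q$-valued function $f$ at $\alpha$ to be $\mathrm{E}_x[\omega^{f(x) - \langle \alpha, x \rangle}]$ with $\omega = e^{2\pi i/q}$; this formula, the target function, and all names are assumptions made for illustration.

```python
import cmath
from itertools import product

Q, N = 3, 3
OMEGA = cmath.exp(2j * cmath.pi / Q)
POINTS = list(product(range(Q), repeat=N))

def target(x):
    """Hypothetical 2-junta over F_3: coordinate 2 is irrelevant."""
    return (x[0] * x[1] + 2 * x[0]) % Q

def fourier_coefficient(f, alpha):
    """E_x[omega^(f(x) - <alpha, x>)] for x uniform over F_Q^N."""
    total = 0
    for x in POINTS:
        dot = sum(a * b for a, b in zip(alpha, x))
        total += OMEGA ** ((f(x) - dot) % Q)
    return total / len(POINTS)

def support(alpha):
    """Coordinates where alpha is non-zero."""
    return {i for i, a in enumerate(alpha) if a != 0}

# Every alpha with a non-zero coefficient is supported on relevant coordinates.
nonzero = [alpha for alpha in POINTS
           if abs(fourier_coefficient(target, alpha)) > 1e-9]
found = set().union(*(support(alpha) for alpha in nonzero))
```

Here every vector in `nonzero` is supported inside the relevant coordinates, and `found` collects exactly those coordinates. This is why detecting a single non-zero coefficient is enough to identify relevant coordinates.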
2.2 -Projection
We define a notion of -projection which is a generalization of -projection in by [8].
**Definition 2 (-projection).**
For , , and , we define by
[TABLE]
**Lemma 3.**
For and ,
[TABLE]
Moreover, if an example and its label are given by for and , then for any , , where denotes a random variable according to the distribution of conditioned on the example .
Proof.
It is essentially the same as the proof in [8]. For completeness, see Appendix A.2. ∎
2.3 Statistical Distance and Character Distance
For our proofs, we introduce the following two distances between random variables taking values in $\mathbb{F}_q$, which were first introduced in [6].
**Definition 3 (statistical/character distance).**
For random variables taking values in , we define their statistical distance by
[TABLE]
and we also define their character distance by
[TABLE]
In the case where is not prime, we adopt a different definition for from one in the original paper [6]. However, it is easily checked that the following fact holds from exactly the same argument.
**Fact 4 ([6, Claim 33]).**
For any random variables taking values in ,
[TABLE]
In particular, if and only if .
3 Reduction from Learning Juntas to LDME
In this paper, for simplicity, we assume the following computational model:
- A learning algorithm can uniformly select an element in with probability 1 in constant steps. In fact, a usual randomized model with binary coins may fail in selecting such random elements with exponentially small probability, but we can deal with this probability as a general error probability (i.e., confidence error). For the same reason, we allow algorithms to flip a biased coin which lands heads up with a rational probability (of the polynomial-time computable denominator).
- A learning algorithm with an example oracle , where is -junta, can simulate an oracle w.r.t. any restriction of the size . In fact, this simulation is done by taking several examples until getting an example consistent with . Since the probability that an example consistent with is sampled is at least , the failure probability becomes exponentially small by taking examples. We can also deal with this error probability as a general confidence error, and the additional running time is at most .
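The simulation of a restricted oracle described in the second item is plain rejection sampling. The following is a minimal sketch, assuming field elements are encoded as `0..q-1` and a restriction is given as a dictionary from coordinates to fixed values; the names and the retry cap `max_tries` are hypothetical choices for this illustration.

```python
import random

def example_oracle(f, n, q, rng):
    """Return one example (x, f(x)) with x uniform over {0, ..., q-1}^n."""
    x = tuple(rng.randrange(q) for _ in range(n))
    return x, f(x)

def restricted_oracle(f, n, q, restriction, rng, max_tries=100_000):
    """Simulate the oracle for f under a restriction by rejection sampling.

    restriction maps a coordinate index to its fixed value. Each draw is
    consistent with probability q ** (-len(restriction)), so the expected
    number of draws is q ** len(restriction).
    """
    for _ in range(max_tries):
        x, y = example_oracle(f, n, q, rng)
        if all(x[i] == v for i, v in restriction.items()):
            return x, y
    raise RuntimeError("no example consistent with the restriction was drawn")
```

Since the restrictions used by the algorithm fix only a small number of coordinates, the expected number of draws per simulated example stays small relative to the overall running time.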
3.1 Overview of the Reduction
Our learning algorithm (main1) has two phases, a checking phase (lines 6 and 7) and a detection phase (line 9), and repeats them alternately as in the MOS algorithm [15]. The algorithm starts the checking phase with a set empty. In the following steps, the relevant coordinates found by the algorithm will be put in . In the checking phase, the algorithm verifies whether contains all relevant coordinates of the target function by checking that the restricted functions are constant for all restrictions on . If contains all relevant coordinates, then the algorithm outputs and halts; otherwise it moves on to the detection phase. In the detection phase, the algorithm finds at least one relevant coordinate, adds the coordinates found to , and returns to the checking phase. Since the algorithm finds at least one relevant coordinate in each loop, the number of repetitions is at most $k$.
In the detection phase, we reduce the task of finding relevant coordinates to LDME in the subroutine addRC by -projection. In our reduction, the target linear function satisfies that for some . Therefore, if the algorithm for LDME finds (up to constant factor), then the learning algorithm can find at least one relevant coordinate such that by Fact 3.
3.2 Algorithms and Analysis
First we introduce two simple subroutines. For the proofs of their correctness (i.e., Lemmas 4 and 5), see Appendix B.
Algorithm 1 checks whether the target function is constant or not by simply examining that the collected examples take the same value. As mentioned in Section 3.1, we will use this subroutine to determine the end of learning in the checking phase.
Algorithm 2 checks whether the given has nonzero entry at an irrelevant coordinate. Our learning algorithm main1 may find an undesirable candidate in the detection phase, thus we must check whether the candidate consists of only a part of relevant coordinates by this subroutine not to add any irrelevant coordinate to the container for relevant coordinates.
**Lemma 4.**
For any input , const outputs if , otherwise with probability at least .
Proof.
See Appendix B.1. ∎
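The idea behind the constancy check can be sketched in a few lines. This is an illustration only: the actual number of examples drawn by const and its output convention are not recoverable from the text above, so the sample count `m`, the use of `None` for "not constant", and all names are assumptions.

```python
import random

def make_oracle(f, n, q, rng):
    """Build a zero-argument example oracle returning (x, f(x)), x uniform."""
    def oracle():
        x = tuple(rng.randrange(q) for _ in range(n))
        return x, f(x)
    return oracle

def const(oracle, m):
    """Draw m examples; return the common label, or None if labels differ.

    If the target is constant, the answer is always correct. If it is a
    non-constant junta, two different labels each appear with noticeable
    probability, so a moderate m exposes them with high probability.
    """
    labels = {oracle()[1] for _ in range(m)}
    return next(iter(labels)) if len(labels) == 1 else None
```

The check can only err on non-constant targets, which matches the one-sided guarantee stated in Lemma 4.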
**Lemma 5.**
For any input , if , then checkRC outputs true with probability at least . Otherwise if for an irrelevant coordinate , checkRC outputs false with probability at least .
In general, there is a case where and all satisfying are relevant. In the above lemma, we do not say anything about such a case.
Proof.
See Appendix B.2. ∎
Algorithm 3 is a core part of our reduction, which reduces the task of finding candidates for relevant coordinates to LDME, checks whether the candidates are indeed relevant, and returns them to the main algorithm. Let LDME be the learning algorithm for LDME.
We briefly explain how the subroutine addRC works. The details will be addressed in Lemma 6 and Appendix B.3.
If the given set does not contain all relevant coordinates, then for some restriction on , the restricted function is not constant. By Lemma 2, there exists an element such that is also non-constant. This subroutine works for such a restriction and an element , and finds new relevant coordinates for the function . In fact, the subroutine tries all (at most ) restrictions on and elements (line 4).
Let . For the (non-constant) restricted function , addRC repeats the following process: (1) selects a matrix at random (line 6), (2) selects a value (line 7), and (3) executes LDME with the example oracle simulated as in Lemma 3 w.r.t the selected and (line 8).
Let . Since the function is not constant, it has a non-zero coefficient of , which means that has some correlation with the linear function . In fact may have correlation with other linear functions, but the number of such linear functions is small because is also $k$-junta. Simply speaking, the role of is to filter out some of these correlations on simulated examples, and we can show that a non-negligible fraction of the matrices removes all the correlations except for the linear function (Claim 2 in Appendix B.3). In other words, the simulated examples depend only on , and this is just an instance of LDME. Meanwhile, the role of is to enhance the correlation with the target linear function, and for a good choice of , the correlation is bounded below by (Claim 4 in Appendix B.3).
If the algorithm LDME finds for some constant , by Fact 3 and the fact that , all coordinates taking non-zero values are relevant for . Moreover, they are also relevant for because the algorithm selected non-zero . Therefore, we can reduce the task of finding relevant coordinates to LDME of the correlation bound .
In fact, for a bad choice of and , the algorithm may find undesirable candidates . To avoid adding irrelevant coordinates to in such a case, addRC executes checkRC for any candidate found by LDME (line 13).
**Lemma 6.**
If the algorithm LDME solves LDME in time w.h.p. and does not contain all relevant coordinates, then the subroutine addRC adds at least one relevant coordinate to with probability at least , and its running time is bounded above by .
Proof.
The outline is given above. For the complete proof, see Appendix B.3. ∎
Algorithm 4 is our learning algorithm. Now we prove its learnability by Lemma 6. Theorem 2 immediately follows from Lemma 7 by substituting some constant for .
**Lemma 7.**
If the algorithm LDME solves LDME in time w.h.p., then the algorithm main1 outputs all relevant coordinates for any -junta function with probability at least , and its running time is bounded above by .
Proof.
First we show that the algorithm halts at most loops assuming that all subroutines succeed. If contains all relevant coordinates, then for all restrictions on , the restricted functions must be constant, thus the algorithm halts and outputs in line 7. On the other hand, if does not contain some relevant coordinates, addRC adds at least one relevant coordinate to by Lemma 6. Since has at most relevant coordinates, addRC is executed at most times, and the main loop is repeated at most times.
In fact, the algorithm may fail in executing const and addRC. The number of the executions is at most . Thus if we set their confidence parameter as , then by the union bound, the total failure probability is bounded above by . By Lemma 6, the total running time is at most
[TABLE]
∎
4 Reduction from LDME to LBP
First we introduce the following simple lemmas and their corollaries as observations of LDME.
**Lemma 8.**
Let be a random variable taking values in . For , if , then there exists such that .
Proof.
See Appendix C.1. ∎
**Lemma 9.**
Let and be a random variable taking values in . If the distribution of is determined by only the value of where , and and are linearly independent over , then for all , .
Proof.
See Appendix C.2. ∎
As a corollary, we have the following facts about LDME. Let , be a target linear function, and be the target (randomized) function, that is, . If , then by Lemma 8, there exists some value such that . On the other hand, if and are linearly independent, then by Lemma 9, for all . We essentially use the difference in our reduction. Note that we do not say anything about the case where but they are linearly dependent (i.e., for some ).
4.1 Overview of the Reduction
Our learning algorithm is Algorithm 6 (main2) and the main idea is similar to the split-and-list idea in previous work [19, 12]. Let be the coefficients of a target linear function with . First we select, by brute-force search, a consecutive partition that divides the nonzero entries of in half (line 6), then list the values of linear functions of weight where is contained in either or (lines 8:1–4). To avoid including linearly dependent linear functions, we fix an initial value of the coefficient vector for each partition. Since there are at most patterns for the initial values, we can easily guess the pair of initial values consistent with and .
As above, we stretch a noisy example to entries taking values in $\mathbb{F}_q$. Then, we translate the stretched data into an instance of LBP, that is, a $\{+1,-1\}$-valued instance. We can observe the following three facts. First, each entry takes values uniformly over $\mathbb{F}_q$. Second, the pair of entries corresponding to (we may call it a target pair) has some correlation in the sense that they take a certain value with relatively high probability, where we refer to such a value as a concentrated value. Finally, other pairs are distributed pairwise independently.
Now we translate each entry into $+1$ or $-1$ as follows: (1) if is concentrated, we change the entry to (line 8:5); (2) if is not concentrated, we flip a biased coin with head probability , and if it comes up heads, we change the entry to , otherwise to (line 8:6). Because each entry is uniformly distributed, the probability that the entry is changed to is exactly $1/2$, that is, the result is uniformly distributed over $\{+1,-1\}$. Moreover, by pairwise independence, all pairs except for the target pair are also independently distributed. On the other hand, in the target pair, the correlation remains even in the resulting binary instance. In other words, the reduced instance is just an instance of LBP.
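The binarization step can be sketched as follows. The exact head probability and sign convention are not recoverable from the text above, so this sketch assumes that the concentrated value is mapped to $+1$ and that heads also maps to $+1$; under that assumption the head probability $h$ is determined by requiring $1/q + (1 - 1/q)\,h = 1/2$ for a uniform entry. All names are hypothetical.

```python
import random
from fractions import Fraction

def head_probability(q):
    """Solve 1/q + (1 - 1/q) * h = 1/2 for h, giving (q - 2) / (2 (q - 1))."""
    return Fraction(q - 2, 2 * (q - 1))

def binarize(entry, concentrated, q, rng):
    """Map an entry of F_q (encoded as 0..q-1) to +1 or -1.

    The concentrated value always maps to +1 (assumed convention); any other
    value maps to +1 with probability head_probability(q), else to -1.
    """
    if entry == concentrated:
        return 1
    return 1 if rng.random() < float(head_probability(q)) else -1
```

With this choice a uniformly distributed entry becomes a uniform $\pm 1$ value, while an entry that hits the concentrated value more often than $1/q$ is biased toward $+1$, which is what preserves the correlation of the target pair.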
4.2 Algorithms and Analysis
First, we introduce the following simple subroutine, Algorithm 5, which checks whether a candidate linear function found in the main routine is indeed a target linear function or not. In fact, it can also be implemented by the standard empirical estimation of the correlation. The merit of our implementation using the conditions in Lemmas 8 and 9 is simply that it avoids calculations with complex numbers.
**Lemma 10.**
Let be a target linear function. The subroutine checkCor outputs true if the given satisfies with probability at least . On the other hand, if and are linearly independent, checkCor outputs false with probability at least in time .
Proof.
The lemma follows from Lemmas 8 and 9 and the standard probabilistic argument. For the complete proof, see Appendix C.3. ∎
Algorithm 6 is our main reduction from LDME to LBP. Let LBP be a subroutine for solving LBP (of the degree ) with high probability. W.l.o.g., we can assume the failure probability is at most by a constant number of repetitions.
The proof of Lemma 11 is informally given as mentioned in Section 4.1, and we give the complete proof in Appendix C.4. Theorem 3 immediately follows from Lemma 11 by substituting some constant for .
**Lemma 11.**
Assume that the subroutine LBP solves LBP for some in time w.h.p., where is the number of the vectors. Then the algorithm main2 solves LDME for any target linear function in time with probability at least .
5 Discussions and Future Directions
We introduced the reduction from learning juntas over any finite field to LBP, and gave the first non-trivial learning algorithm for such a class. Our results also strengthen the motivation for designing an efficient algorithm for LBP, because it automatically improves the upper bound for learning $k$-juntas not only for the binary domain but also for any finite field.
However, by our reduction, even if we could construct a linear-time algorithm for LBP, the upper bound would be improved to at best . Therefore, unlike in the binary case, it is open whether there exists a scenario in which the polynomial factor can be improved to less than . Remember that we first reduced the learning juntas problem to LDME, which is an extension of the challenging learning problem LWE. For further improvement, such a hard problem should be avoided.
In addition, our reduction makes extensive use of the properties of finite fields. Thus, it is also open whether we can design a non-trivial learning algorithm that works for any finite alphabet, in particular, .
Appendix A Proofs of Lemmas in Section 2
A.1 Proof of Lemma 1
For convenience, we say is supportive if . For , let be the subset consisting of cyclically consecutive coordinates from , and be the number of supportive coordinates contained in . For , the remaining coordinates contain supportive coordinates, thus (because also contains the first coordinate in the case where is odd). If , then is a desired partition. So we assume that . If , we have . Otherwise, if , we have . Since the difference between and must be [math] or , there exists at least one coordinate satisfying in either case.
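The sliding-window argument above can be illustrated with a small sketch. The exact parameters of Lemma 1 are not reproduced in this text, so the code assumes (as an illustration only) that windows have length ⌈n/2⌉ and that the goal is a window holding exactly ⌈k/2⌉ of the k supportive (nonzero) coordinates. The function name `balanced_cyclic_window` is hypothetical.

```python
from math import ceil

def balanced_cyclic_window(a):
    """Find a start index i such that the cyclic window of length ceil(n/2)
    beginning at i contains exactly ceil(k/2) nonzero entries of a,
    where k is the total number of nonzero entries.

    Assumption (not taken from the paper): window length and target
    count are ceil(n/2) and ceil(k/2). Returns None if no window fits.
    """
    n = len(a)
    k = sum(1 for v in a if v != 0)
    length, target = ceil(n / 2), ceil(k / 2)
    # Number of supportive coordinates in the window starting at index 0.
    count = sum(1 for v in a[:length] if v != 0)
    for i in range(n):
        if count == target:
            return i
        # Shift the window by one: a[i] leaves, a[(i + length) % n] enters,
        # so the count changes by -1, 0, or +1 (discrete intermediate value).
        count += (a[(i + length) % n] != 0) - (a[i] != 0)
    return None
```

Because consecutive windows differ in count by at most one, and the window opposite the first one holds roughly the complementary count, the search cannot skip over the target value; under the stated assumptions it always returns an index for non-empty input.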
A.2 Proof of Lemma 3
Let be the right-hand side of (1). It is enough to show that for any ,
[TABLE]
From the definition of , it follows that
[TABLE]
For the second part, notice that for any and , exactly one element satisfying is determined. Therefore,
[TABLE]
Appendix B Proofs of Lemmas in Section 3
B.1 Proof of Lemma 4
If is constant, then the algorithm obviously outputs the value with probability 1. If is not constant, then there are two entries which have different values in the truth table of . The probability that each value appears is at least because the value of the truth table is affected by only at most coordinates. If examples contain these values as their labels, then the algorithm will output . The probability that each value does not appear in labels is bounded above by . By the union bound, the failure probability is at most .
B.2 Proof of Lemma 5
First, we consider the case where . Assume that for all . Since does not have nonzero value at irrelevant coordinates by Fact 3, the value is determined by at most coordinates of , and for all . This implies for all and , which is a contradiction. Thus, there exists such that . By the Hoeffding inequality, the probability that the condition in line 6 does not hold w.r.t. is bounded above by .
On the other hand, if there exists such that is irrelevant and , then for any ,
[TABLE]
where and for . For any , this implies
[TABLE]
By the Hoeffding inequality, the probability that the condition in line 6 holds is bounded above by . Therefore, by the union bound, the probability that the condition holds for some (i.e., the failure probability) is at most .
B.3 Proof of Lemma 6
In this section, we show the correctness of the subroutine addRC. First, we introduce the following simple fact. The reader may skip the proof of Claim 1 because it is quite basic and not essential.
Claim 1.
For any vectors , the following holds:
(i) If for any (i.e., and are linearly independent), then for any ,
[TABLE]
(ii) If (), then for any ,
[TABLE]
In other words, if satisfies the condition (i), then and are uniformly and pairwise independently distributed w.r.t. the uniform selection of .
Proof.
(i) If for any , there are two coordinates satisfying , , , and . First we select values in , and for any choice, the remaining condition takes the following form: for some ,
[TABLE]
Since , the above equations have a unique solution w.r.t. . The probability that they take the values of the unique solution is exactly .
(ii) If (), the condition takes the following form:
[TABLE]
Obviously, the probability is if ; otherwise, the probability is [math]. ∎
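The dichotomy in Claim 1 can be checked by brute force on a tiny instance. The sketch below assumes a prime field F_p and looks at a single uniformly random vector x (one row of a random matrix, by symmetry); the function names are hypothetical and the code is an illustration, not part of the paper's algorithms.

```python
from collections import Counter
from itertools import product

def joint_distribution(u, v, p):
    """Count how often each pair (<u,x> mod p, <v,x> mod p) occurs
    as x ranges over all of F_p^n. Assumes p is prime."""
    counts = Counter()
    for x in product(range(p), repeat=len(u)):
        a = sum(ui * xi for ui, xi in zip(u, x)) % p
        b = sum(vi * xi for vi, xi in zip(v, x)) % p
        counts[(a, b)] += 1
    return counts

def is_uniform_pair(u, v, p):
    """True iff every pair in F_p x F_p occurs equally often, i.e. the two
    inner products are uniform and independent under a uniform x."""
    counts = joint_distribution(u, v, p)
    expected = p ** (len(u) - 2)
    return len(counts) == p * p and all(c == expected for c in counts.values())
```

For example, over F_3 the vectors (1, 2, 0) and (0, 1, 1) are linearly independent and give a uniform joint distribution, while (1, 2, 0) and its multiple (2, 1, 0) only ever produce pairs of the form (a, 2a).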
Next, we show that for a small subspace , only one vector satisfies with non-negligible probability w.r.t. the uniform selection of .
Claim 2.
For any subset , , and ,
[TABLE]
Especially, if the parameter is selected as , then
[TABLE]
Proof.
The second part immediately follows from the first one, thus we give only a proof of the first part. It is sufficient to show that
[TABLE]
Since , holds, thus we have . By Claim 1, for any , we have
[TABLE]
Therefore,
[TABLE]
and
[TABLE]
Since , the number of vectors is at most . Hence, by the union bound,
[TABLE]
which is equivalent to the second part of the inequality (2). ∎
Let be a -junta and be the set of relevant coordinates of . In the following claims, we assume that there exists satisfying and the event in Claim 2 occurs for , , and . By the definition of -projection, the projected function satisfies for any , because has the same domain . In addition, we assume that the example is simulated as follows: for ,
[TABLE]
Claim 3.
Let . If the -projected function satisfies for all , and the example is simulated as above, then the conditional distribution of is determined only by the value of , that is, for , if , then .
Proof.
By Lemma 3 and the assumption, for any . By Fact 4,
[TABLE]
∎
In the algorithm addRC, an example of LDME is simulated as for some . Obviously, if the distribution of is determined, then the distribution of is also determined. In addition, it is also obvious that the value of is determined by the value of . Therefore, the above claim implies that the simulated oracle in the algorithm addRC indeed returns an instance of LDME for the target linear function . Finally, we show that the simulated instance has a large correlation with the linear function if the algorithm addRC chooses a "good" .
Claim 4.
We assume the same notations and conditions as in Claim 3. In addition, if the -junta function satisfies and the parameter is selected by , (i.e., ), then
[TABLE]
Proof.
For any ,
[TABLE]
Thus, it is enough to show that . Let and be independently and uniformly distributed random variables over , and let .
[TABLE]
By the assumption, . Since , they must not be statistically identical, that is, . In addition, by Fact 3, is a -junta. Therefore, by the definition of statistical distance, . Now we have
[TABLE]
∎
Now we give the proof of Lemma 6.
Proof (Lemma 6).
First, for simplicity, let us assume that execution of checkRC succeeds with probability 1. If is not constant and some relevant coordinates are not contained in , then there exists a restriction on such that the restricted function is not constant. In this case, by Lemma 2, there exist and such that and .
For convenience, we regard as the restricted function as . For the set of relevant coordinates of , . By Claim 2 and the argument following Claim 2, for all , with probability at least w.r.t. the uniform selection of . Since addRC tries to select more than times, at least one of the selected 's satisfies this condition with probability at least . Thus in the following argument, we assume that the algorithm addRC succeeds in selecting such an .
If the algorithm addRC succeeds in selecting the above matrix , then by Claims 3 and 4, there exists such that the simulated noisy example in line 10 corresponds to the example from LDME of the correlation . By the assumption, the repetition of LDME recovers up to a constant factor (i.e., finds for some ) with probability at least . If LDME is solved successfully, then at least one relevant coordinate is added to in line 14.
If the algorithm addRC fails in selecting and , the subroutine LDME may return some undesirable candidate. In this case, the subroutine checkRC returns false in line 13, and irrelevant coordinates are not added to . Therefore, by the union bound, the failure probability is at most under the condition that checkRC succeeds with probability 1.
In fact, our algorithm checkRC may fail. Since the number of executions of checkRC is at most , by the union bound, the probability that some executions of checkRC fail is at most . Thus, the total failure probability is at most . The total running time is bounded above by
[TABLE]
∎
Appendix C Proofs of Lemmas in Section 4
C.1 Proof of Lemma 8
For simplicity, let for . First we show that
[TABLE]
By contraposition, we assume that for any . Then,
[TABLE]
where the second equality follows from the fact that .
Now we have that for some . If , then . Therefore, the remaining case is that . In this case,
[TABLE]
Thus, there exists such that
C.2 Proof of Lemma 9
The lemma immediately follows from Claim 1 as follows:
[TABLE]
C.3 Proof of Lemma 10
If , then by Lemma 8, there exists such that . Since checkCor tries all , by the Hoeffding inequality, the condition in line 6 is not satisfied with probability at most
[TABLE]
On the other hand, if is a target linear function, and and are linearly independent, then by Lemma 9, for each . By the Hoeffding inequality and the union bound, the probability that the condition in line 6 is erroneously satisfied is at most
[TABLE]
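The two bounds in this proof follow the usual pattern of a Hoeffding tail bound for one empirical estimate combined with a union bound over all tested candidates. The sketch below shows that pattern for samples bounded in [0, 1]; the thresholds and constants used by checkCor are not reproduced here, and the function names are hypothetical.

```python
import math

def hoeffding_tail(m, eps):
    """Two-sided Hoeffding bound on Pr[|empirical mean - true mean| >= eps]
    for m independent samples taking values in [0, 1]."""
    return 2.0 * math.exp(-2.0 * m * eps * eps)

def samples_needed(eps, delta):
    """Smallest m for which hoeffding_tail(m, eps) <= delta."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps * eps))

def union_bound(per_test_failure, num_tests):
    """Upper bound on the probability that at least one of num_tests
    estimates fails, each failing with probability per_test_failure."""
    return min(1.0, per_test_failure * num_tests)
```

For instance, estimating a mean to within 0.05 with failure probability at most 0.01 takes `samples_needed(0.05, 0.01)` samples under this bound; dividing the target failure probability by the number of candidates before calling `samples_needed` accounts for the union bound.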
C.4 Proof of Lemma 11
In this section, we show the correctness of the algorithm main2. We use to denote the coefficients of the target linear function, that is, the distribution of the target randomized function is determined only by for each . We assume that a partition is consecutive and divides a nonzero part of into half as in Lemma 1.
We begin with the analysis of non-target pairs for each row in the reduced instance.
Claim 5.
If a partition and linearly independent vectors satisfy that , , and for any , then and are uniformly and pairwise independently distributed under any condition about , i.e., for any ,
[TABLE]
The proof of Claim 5 is not so essential, thus the reader may skip over it.
Proof.
Since , it is enough to show that, for any ,
[TABLE]
W.l.o.g., we can assume that and (in this case, either or may hold). First consider the case where . We select three coordinates as follows: by the linear independence of and , we can select such that and are also linearly independent. Then, we select to satisfy that . Now we have the three vectors . It is not difficult to see that they are linearly independent.
Otherwise, if , we select satisfying , and we can select such that and are also linearly independent. Then we have three vectors which are also linearly independent.
In any case, for any assignment to , the solution of the remaining linear system in is uniquely determined, and the claim holds as in the proof of Claim 1. ∎
In the reduction, we assume that the initial values and are consistent with , that is, and . Any pair of indices except for satisfies the conditions in Claim 5, because they are non-zero and their initial values are fixed. In addition, the value of depends on only the value of . Therefore, by Claim 5, the pair of entries indexed by are also uniformly and independently distributed.
For an element and a random variable taking values in , we use to denote a -valued random variable given by operation in line 8 of main2, i.e.,
- (1) if takes , set as ;
- (2) otherwise, flip a biased coin with head probability , and if it comes up heads (resp. tails), set as (resp. ).
For any , if is uniformly distributed over , then . Moreover, it is easy to see that if are uniformly and pairwise independently distributed, then and are also uniformly and pairwise independently distributed over . Therefore, any pair of entries indexed by is selected uniformly and independently.
Now we move on to the analysis of the target pair, that is, the pair of entries corresponding to .
Claim 6.
Let be any partition of . If a randomized function has a correlation with as , then there exist such that
[TABLE]
Proof.
By Lemma 8, implies that there exists such that
[TABLE]
Therefore,
[TABLE]
∎
Then we estimate the correlation between the target pair in the reduced instance.
Claim 7.
Let and . If random variables and in satisfy
[TABLE]
then,
[TABLE]
where as in the definition of .
Proof.
Let denote probabilities as
[TABLE]
[TABLE]
Then, it follows that , , and
[TABLE]
Therefore, the probability is bounded below by
[TABLE]
∎
For our settings, take , , and . Then we have
[TABLE]
and
[TABLE]
Therefore, if we take sufficiently many samples, then the target pair has a correlation at least w.h.p. Now we give the proof of Lemma 11.
Proof (Lemma 11).
As in the proof of Lemma 6, we assume that all executions of checkCor will succeed. Under the condition, even if an incorrect candidate is found in brute-force search in , and , the algorithm main2 does not output such an incorrect answer by Lemma 10. In fact, it is easily checked that the number of executions of checkCor in lines 4 and 17 is at most and , respectively. Therefore, by the union bound, the probability that at least one execution fails is bounded above by
[TABLE]
Let be the coefficients of the target linear function and be the target randomized function corrupted with noise. If , then by our assumption on checkCor, the target linear function must be found in line 4. Therefore, we assume that . In this case, we show that the reduced binary instance is the one of LBP with the correlation w.h.p. We assume that, as mentioned in the definition of the algorithm main2, all columns are labeled by vectors in . In addition, assume that the algorithm main2 succeeds in selecting , and satisfying that
- and (by Lemma 1, such a consecutive partition must exist)
- (by Claim 6, such values of must exist)
- and
Then, the reduced instance must contain the pair of columns indexed by , which we call the target pair. For any pair of columns except for the target pair, as mentioned in the observation following Claim 5, the pair in the reduced instance is uniformly and independently distributed over . On the other hand, for each row of the target pair, their product is also -valued and the expectation is at least by Claim 7. If we select the sample size to be more than , then by the Hoeffding inequality, the probability that their inner product does not exceed is bounded above by
[TABLE]
In other words, with probability at least , the algorithm reduces LDME to LBP of the correlation . W.l.o.g., we can assume that the failure probability of LBP is at most (otherwise, it is achieved by a constant number of repetitions). Thus, for each trial in lines 8 and 16, the probability that LBP does not find the target pair is at most . Therefore, by repeating these trials at least times, the failure probability decreases to . Even if we consider the possibility that checkCor may fail, the total failure probability is bounded above by . The total running time is bounded above by
[TABLE]
∎
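To make the object of the reduction concrete, the following sketch builds a light-bulb-style instance (uniform ±1 columns with one planted correlated pair) and recovers the pair with a naive quadratic scan over all pairs. This is only a baseline for illustration: it is not the subroutine LBP assumed in Lemma 11, whose value lies in running faster than this scan, and the function names, parameters, and instance sizes are hypothetical.

```python
import random
from itertools import combinations

def planted_instance(num_cols, num_rows, rho, target, seed=0):
    """Generate uniform +/-1 columns, then overwrite column target[1] so that
    each entry agrees with column target[0] with probability (1 + rho) / 2,
    giving an expected per-row product of rho for the planted pair."""
    rng = random.Random(seed)
    cols = [[rng.choice((-1, 1)) for _ in range(num_rows)]
            for _ in range(num_cols)]
    i, j = target
    agree = (1 + rho) / 2
    cols[j] = [v if rng.random() < agree else -v for v in cols[i]]
    return cols

def naive_light_bulb(columns):
    """Return the pair (i, j), i < j, with the largest inner product,
    together with that inner product. Quadratic in the number of columns."""
    best_pair, best_score = None, float("-inf")
    for i, j in combinations(range(len(columns)), 2):
        score = sum(a * b for a, b in zip(columns[i], columns[j]))
        if score > best_score:
            best_pair, best_score = (i, j), score
    return best_pair, best_score
```

With enough rows, the planted pair's inner product concentrates around rho times the number of rows, while every other pair stays near zero, which is the separation the Hoeffding argument in the proof relies on.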
