Learning first-order definable concepts over structures of small degree
Martin Grohe, Martin Ritzert

TL;DR
This paper introduces a logical framework for machine learning where concepts are defined by first-order formulas over structures with small degree, demonstrating efficient learnability in polylogarithmic time.
Contribution
It shows that first-order definable concepts over structures of small degree can be learned efficiently in the PAC setting, combining logic and complexity theory.
Findings
Concepts definable by first-order formulas are learnable in polylogarithmic time.
The framework applies to structures with polylogarithmic degree.
Efficient learning is achieved within the PAC model.
Abstract
We consider a declarative framework for machine learning where concepts and hypotheses are defined by formulas of a logic over some background structure. We show that within this framework, concepts defined by first-order formulas over a background structure of at most polylogarithmic degree can be learned in polylogarithmic time in the "probably approximately correct" learning sense.
Learning first-order definable concepts over structures of small degree
Martin Grohe
RWTH Aachen University
Martin Ritzert
RWTH Aachen University
1 Introduction
This paper studies, from a theoretical perspective, a role that logic might play as the foundation of a more declarative approach to machine learning. Machine learning algorithms produce a hypothesis about some unknown target function defined on an instance space. In a supervised learning setting, the input of a learning algorithm (the “data”) consists of a sequence of labelled examples, that is, instances labelled by the value of the target function on them. The quality of the hypothesis is measured in terms of how well it generalises, that is, predicts correct values of the target function on new data items. In this paper, we focus on Boolean classification problems, where the target function has the range $\{0,1\}$. In this case, we usually speak of a target concept. We also consider a setting where the target concept is not deterministic, but a random variable.
The type of hypothesis we get is determined by the learning algorithm we use. For example, if we use support vector machines, the hypothesis is a linear halfspace of the instance space (assuming the instance space is $\mathbb{R}^n$ for some $n$, so that the hypothesis is a halfspace determined by a hyperplane); if we use decision tree learning, the hypothesis is a decision tree; and if we use deep learning, the hypothesis is specified by the weights and structure of a neural network. The natural workflow would be to first decide on a model of how the target concept might look, or rather, what kind of hypothesis might be appropriate. Then the learning algorithm solves an optimisation problem by choosing the parameters of the model in such a way that they fit the data. For example, if the instance space is $\mathbb{R}^n$ and we choose a linear model, the parameters of the model consist of a vector and a number, which together specify the hyperplane. Then we may choose an algorithm such as a support vector machine or the perceptron algorithm for computing the parameters. (Arguably, we could also call the support vector machine the “model” and the solver for the quadratic optimisation system behind it the “algorithm”.)
From a declarative viewpoint, it seems desirable to separate the choice of the model from the choice of the algorithm. Then as logicians, we will ask which language we best use to describe the model. A natural and very flexible framework to do this is the following. We first choose a background structure $B$. For example, if we have numerical data, $B$ may be the field of reals, possibly expanded by additional functions like the sigmoid function. If we have graph data, our background structure may be a finite labelled graph. Given the background structure, we can specify a parametric model by a formula $\varphi(\bar{x}\mathbin{;}\bar{y})$ of some logic L, for example first-order logic (FO). This formula has two types of free variables, the instance variables $\bar{x}=(x_1,\ldots,x_k)$ and the parameter variables $\bar{y}=(y_1,\ldots,y_\ell)$. The instance space of our model is $U(B)^k$, where $U(B)$ denotes the universe of our background structure $B$. For each choice $\bar{v}\in U(B)^\ell$ of parameters, the formula defines a function $\llbracket\varphi(\bar{x}\mathbin{;}\bar{v})\rrbracket^{B}\colon U(B)^k\to\{0,1\}$ by
$$\llbracket\varphi(\bar{x}\mathbin{;}\bar{v})\rrbracket^{B}(\bar{u}) := \begin{cases} 1 & \text{if } B\models\varphi(\bar{u}\mathbin{;}\bar{v}),\\ 0 & \text{otherwise,} \end{cases} \tag{1.1}$$
which we regard as a concept or hypothesis over our instance space. Here $B\models\varphi(\bar{u}\mathbin{;}\bar{v})$ means that $B$ satisfies $\varphi$ if the variables $\bar{x}$ are interpreted by the values $\bar{u}$ and the variables $\bar{y}$ by the values $\bar{v}$. Depending on the context, we often call $\llbracket\varphi(\bar{x}\mathbin{;}\bar{v})\rrbracket^{B}$ an L-definable model or hypothesis.
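To make the definition concrete, the following minimal Python sketch evaluates a hypothesis of this form over a small finite structure. It is an illustration under stated assumptions only: the directed graph `E`, the set `R` of red vertices, and the formula $\varphi(x;y) := E(x,y)\vee R(x)$ are hypothetical, and the formula is represented as a plain Python predicate rather than as a syntactic object.

```python
# Hypothetical background structure: a directed graph with some red vertices.
E = {("a", "b"), ("b", "c"), ("c", "a"), ("d", "a")}  # binary relation
R = {"c"}                                             # unary relation


def phi(x, y):
    """Hypothetical formula phi(x; y) := E(x, y) or R(x), on 1-tuples x and y."""
    return (x[0], y[0]) in E or x[0] in R


def make_hypothesis(formula, params):
    """Return the function u -> 1 if formula(u; params) holds, and 0 otherwise."""
    def hypothesis(instance):
        return 1 if formula(instance, params) else 0
    return hypothesis


# Fixing the parameter tuple ("a",) turns the formula into a Boolean labelling.
h = make_hypothesis(phi, ("a",))
labels = {u: h((u,)) for u in "abcd"}
```

Choosing a different parameter tuple yields a different hypothesis from the same formula, which is exactly the sense in which the formula acts as a parametric model.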
Example 1.1.
Let $B$ be a structure over a vocabulary consisting of one binary and one unary relation symbol. $B$ may be viewed as a directed graph in which some vertices are coloured red. Consider, for example, the graph shown in Figure 1.
As input for a learning algorithm, we receive a training sequence consisting of some vertices labelled $0$ or $1$. In our example, this may be the sequence $\big((a,0),(b,1),(g,0),(k,1)\big)$. From these training examples, we are supposed to figure out a global labelling function.
Consider the first-order formula
[TABLE]
If we take as parameters , then the hypothesis is consistent with the training examples.
Working in such a rich declarative framework, however, it is easy to get carried away by the expressiveness it gives. It is important, therefore, to make sure that the models can still be learned by efficient algorithms. There are two basic algorithmic problems, which we may call parameter learning (or parameter estimation) and model learning (or model estimation). For both, assume that we have a background structure $B$ and a logic L. In the parameter learning problem, we assume a fixed L-formula $\varphi(\bar{x}\mathbin{;}\bar{y})$, and we want to find parameters $\bar{v}$ that fit the data. In the model learning problem, we are only given the data, and we want to find an L-formula $\varphi(\bar{x}\mathbin{;}\bar{y})$ as well as parameters $\bar{v}$ fitting the data. To avoid overfitting, we want to choose the formula to be as simple as possible, according to some metric (for example, length or quantifier rank). At first sight, it seems that the parameter learning problem is the simpler one, but this is not necessarily the case (as we shall discuss in Section 3).
These considerations suggest the following research program: Identify logics suitable for expressing relevant models for machine learning and study their algorithmic learnability, that is, efficient learning algorithms and also complexity theoretic or information theoretic lower bounds. Ideally, the logics would be expressive enough to define all feasible models and at the same time only permit the definition of feasible models. Note the similarity of these desiderata with those for database query languages (e.g. [6]). The theoretical side of this program may be viewed as a “descriptive complexity theory of machine learning”, and this is where our technical contributions are.
Before we describe our results, let us discuss one more technical issue. The input of a learning algorithm consists of the training examples, but in our framework of learning definable concepts the algorithm also needs access to the background structure $B$. One possible scenario is that $B$ is a fixed infinite structure, for example the field of real numbers, and we consider an abstract computation model where the algorithm can store an element of the structure in a single memory cell and has access to the operations of the structure. A second is that $B$ is a finite structure, for example a graph describing the world wide web or a social network. In this case, we can simply regard $B$ as part of the input, but we may think of $B$ as still being too large to fit into main memory and only give our algorithms limited access to $B$, such as local access that only allows the algorithm to retrieve the neighbours of a vertex that we already know (we can think of this as being able to follow links). This is the scenario we consider here.
1.1 Our results
We give a learning algorithm for the model learning problem for first-order logic. The twist of our result is that if the degree of the background structure is at most polylogarithmic, then the algorithm works in sublinear, in fact, polylogarithmic time, in the size of the background structure. It came as a surprise to us that this is possible at all. In analysing the algorithm, we take a data-complexity point of view [36], that is, we measure the running time in terms of the size of the structure and hide the dependence on the (presumably small) formula in the constants.
We only consider relational structures in this paper. The maximum degree of a structure $B$ is the maximum degree of its Gaifman graph, in which two vertices are adjacent if they appear together in some tuple of some relation of $B$ (see Section 2.1 for details). The $k$-ary learning problem over a background structure $B$ has instance space $U(B)^k$. The goal is to learn an unknown target concept, that is, an unknown $\{0,1\}$-valued function on $U(B)^k$.
A learning algorithm for the $k$-ary learning problem over some background structure $B$ receives as input a finite sequence $T$ of training examples. In addition, we grant our learning algorithms local access to the background structure $B$ (in the sense described above, see Section 2.1 for details). We usually let $t$ be the length of the sequence $T$. The training examples are pairs $(\bar{u},c)$, where $\bar{u}\in U(B)^k$ and $c\in\{0,1\}$. We say that a hypothesis $H$ is consistent with $T$ if for all examples $(\bar{u},c)$ in $T$ we have $H(\bar{u})=c$. Given $T$, the learning algorithm is supposed to compute a hypothesis $H$, which in our setting is always of the form $\llbracket\varphi(\bar{x}\mathbin{;}\bar{v})\rrbracket^{B}$ for some first-order formula $\varphi(\bar{x}\mathbin{;}\bar{y})$ and parameter tuple $\bar{v}$. Of course the algorithm is not supposed to return the whole hypothesis $H$ explicitly, but just the formula $\varphi$ and the parameter tuple $\bar{v}$. However, we also allow our learning algorithms to reject an input, to account for the situation that after seeing the training examples the algorithm realises that its assumption about the model was wrong, that is, there simply is no hypothesis of the form $\llbracket\varphi(\bar{x}\mathbin{;}\bar{v})\rrbracket^{B}$ consistent with the training examples.
If a learning algorithm returns a hypothesis $H$ specified by a formula $\varphi(\bar{x}\mathbin{;}\bar{y})$ and the parameter tuple $\bar{v}$, then this hypothesis is useless if we cannot efficiently determine the value $H(\bar{u})$ for a given $\bar{u}$. We say that the hypotheses returned by a learning algorithm can be evaluated within a given time bound if there is an algorithm that, given a pair $(\varphi,\bar{v})$ returned by the learning algorithm and a tuple $\bar{u}\in U(B)^k$, computes $\llbracket\varphi(\bar{x}\mathbin{;}\bar{v})\rrbracket^{B}(\bar{u})$ within that time bound.
The goal is to produce a hypothesis that generalises well, that is, approximates the target concept closely. To capture theoretically what it means for a hypothesis to generalise well, we will use the framework of probably approximately correct learning. However, let us first state our main algorithmic result.
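Theorem 1.1.
Let $k,\ell,q\in\mathbb{N}$. Then there is a $q^*\in\mathbb{N}$ and a learning algorithm for the $k$-ary learning problem over some finite background structure $B$ with the following properties.
1. If the algorithm returns a hypothesis $H$, then $H$ is of the form $\llbracket\varphi^*(\bar{x}\mathbin{;}\bar{v}^*)\rrbracket^{B}$ for some first-order formula $\varphi^*(\bar{x}\mathbin{;}\bar{y})$ of quantifier rank at most $q^*$ and $\bar{v}^*\in U(B)^\ell$, and $H$ is consistent with the input sequence $T$ of training examples.
2. If there is a first-order formula $\varphi(\bar{x}\mathbin{;}\bar{y})$ of quantifier rank $q$ and some tuple $\bar{v}\in U(B)^\ell$ of parameters such that $\llbracket\varphi(\bar{x}\mathbin{;}\bar{v})\rrbracket^{B}$ is consistent with the input sequence $T$, then the algorithm always returns a hypothesis and never rejects.
3. The algorithm runs in time $(\log n+d+t)^{O(1)}$ with only local access to $B$, where $n:=|B|$ and $d:=\Delta(B)$ and $t:=|T|$.
4. The hypotheses returned by the algorithm can be evaluated in time $(\log n+d)^{O(1)}$ with only local access to $B$.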
Note that if the maximum degree $d$ and the length $t$ of the training sequence are polylogarithmic in the order $n$ of the background structure, then the overall running time of the algorithm is polylogarithmic in $n$.
The proof of Theorem 1.1 critically relies on the locality of first-order logic.
We also prove a generalisation of Theorem 1.1 (Theorem 4.6), where instead of insisting on a consistent hypothesis, we compute, within the same polylogarithmic time bound, a hypothesis that minimises the training error. The training error of a concept or hypothesis is the fraction of training examples on which it is wrong. The algorithm we obtain returns a hypothesis defined by a formula whose quantifier rank is bounded in terms of $q$, and whose training error is at most that of the best hypothesis definable by a formula of quantifier rank $q$.
A variant of Theorem 1.1 (Theorem 4.5) for bounded degree background structures even applies to infinite structures. For this to work, we need another level of abstraction in our computation model: we allow it to store an element of the structure in a single memory cell and to access it in a single computation step. We call this the uniform cost measure. This is in line with the standard uniform-cost RAM model that is also underlying the analysis of algorithmic meta theorems for bounded degree graphs [32, 16, 24, 10, 33]. Under the uniform cost measure, we even obtain a learning algorithm (in fact the same algorithm as in Theorem 1.1) running in time $(d+t)^{O(1)}$ and producing hypotheses that can be evaluated in time $d^{O(1)}$, where $d$ is the maximum degree of the background structure and $t$ the length of the training sequence.
Let us briefly discuss the implications of our results in Valiant’s [35] framework of probably approximately correct (PAC) learning. A detailed technical discussion and a precise statement of our results follows in Section 5. The basic assumption of PAC-learning is that there is an unknown probability distribution on the instance space and that all instances are drawn independently from this distribution, both the training instances and the new instances that we want to classify with our hypothesis. The generalisation error of a hypothesis is then defined as the probability that the hypothesis is wrong on an instance drawn randomly from the distribution. A PAC-learning algorithm is supposed to generate, with high confidence $1-\delta$, a hypothesis with a small generalisation error of at most $\epsilon$. The confidence is the probability, taken over the randomly chosen training examples, that the algorithm succeeds. The number of training examples the algorithm has access to is bounded in terms of the error parameter $\epsilon$ and the confidence parameter $\delta$, and the running time is supposed to be polynomial in the size of its input, which in our setting is $t\log n$, or just $t$ under the uniform cost measure. To obtain a PAC-learning algorithm, we usually need to make assumptions about the target concept. We say that a class $\mathcal{C}$ of concepts is PAC-learnable if there is a learning algorithm that meets the PAC-criterion whenever the target concept is from $\mathcal{C}$, regardless of the probability distribution.
Corollary 1.2.
Let . For a background structure of maximum degree at most , let be the class of all concepts , where is a first-order formula of quantifier rank at most with , and .
Then is PAC-learnable by an algorithm that only has local access to and runs in time polynomial in and .
Theorem 4.5 is a more detailed statement of this result. In addition to Theorem 1.1 (or Theorem 4.5 for the uniform cost measure), the corollary also relies on a result from [17] stating that first-order definable set families on graphs of bounded degree, such as the family in the corollary, have bounded VC-dimension, which by a general theorem due to Blumer, Ehrenfeucht, Haussler and Warmuth [4] implies that a number of training examples only depending on and is sufficient.
We also prove a PAC-learning result for background structures with polylogarithmic degree (Theorem 5.2) and a generalisation in the so-called “agnostic” PAC-learning framework which deals with target concepts that are random variables (Theorem 5.9).
1.2 Related Work
Closest to our framework is that of inductive logic programming (ILP) (see, for example, [7, 26, 29, 30, 31]). However, there are important differences. First of all, our framework is by no means restricted to first-order logic, and in future work we intend to look at other languages which may be more suitable for expressing concepts relevant in the machine learning context. However, the present paper is only concerned with first-order logic. A second difference that is more significant for this paper is that we represent the background knowledge in a background structure, whereas in ILP it is represented by a background theory. This leads to quite different intuitions. Whereas the scarce positive PAC-learnability results in the ILP framework are mostly obtained by syntactically restricting the formulas defining models (see, however, [20, 19]), our results exploit structural restrictions—small degree—of the background structure.
To the best of our knowledge, the idea of a learning algorithm having only local access to the background structure is new here, and this is precisely what enables the polylogarithmic running time of our algorithms. We are not aware of results from the ILP context that lead to algorithms which are sublinear in the background knowledge. The local access approach seems related to ideas used in property testing on bounded degree graphs (see, for example, [14, 15, 9]). We leave it for future work to explore possible connections.
Our framework of learning definable concepts over a background structure has been considered before by Grohe and Turán [17]. However, the results from [17] are not algorithmic. They only bound the VC-dimension of definable concept classes over certain classes of structures. As mentioned above, we use one of the results of [17] in the proof of Corollary 1.2.
An alternative logical learning framework, also with a strong foundation in descriptive complexity theory, has recently been proposed by Crouch, Immerman, and Moss [8] and Jordan and Kaiser [22] (also see [23, 21]). Here the goal is to learn a logical reduction between structures; instances are pairs of structures. There is other interesting recent work on learning in a logic setting in database theory (for example, [1, 5]) and verification (for example, [27, 13]). While there is no direct technical connection between our work and these research directions, they all seem to be similar in spirit. Exploring the exact technical relations remains future work.
2 Background from Logic
2.1 Structures
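In this paper, we only consider relational structures. A relational vocabulary $\tau$ is a finite set of relation symbols, each with a prescribed arity. A $\tau$-structure $A$ consists of a set $U(A)$, the universe of $A$, and for each $k$-ary $R\in\tau$ a $k$-ary relation $R(A)\subseteq U(A)^k$. A structure $A$ is finite if its universe is finite, and the order of $A$ is $|A|:=|U(A)|$. (For infinite $A$, we let $|A|:=\infty$.)
The union $A\cup B$ of two $\tau$-structures $A,B$ is the $\tau$-structure with universe $U(A)\cup U(B)$ and relations $R(A\cup B):=R(A)\cup R(B)$ for all $R\in\tau$. The intersection $A\cap B$ is defined similarly. A substructure of a $\tau$-structure $A$ is a $\tau$-structure $B$ with $U(B)\subseteq U(A)$ and $R(B)\subseteq R(A)$ for all $R\in\tau$. For a subset $V\subseteq U(A)$, the substructure induced by $A$ on $V$ is the structure $A[V]$ with universe $V$ and $R(A[V]):=R(A)\cap V^k$ for each $k$-ary $R\in\tau$.
The Gaifman graph of a $\tau$-structure $A$ is the graph $G_A$ with vertex set $U(A)$ and an edge $uv$ for all $u,v\in U(A)$ such that $u\neq v$ and there is a $k$-ary relation symbol $R\in\tau$ and a tuple $(u_1,\ldots,u_k)\in R(A)$ with $u,v\in\{u_1,\ldots,u_k\}$. The Gaifman graph allows us to transfer graph theoretic notions from graphs to arbitrary relational structures. In particular, the degree $\deg^A(u)$ of an element $u\in U(A)$ is the number of neighbours of $u$ in $G_A$, and the maximum degree $\Delta(A)$ is $\max_{u\in U(A)}\deg^A(u)$ if this maximum exists and $\infty$ if it does not. The distance $\operatorname{dist}^A(u,v)$ between two elements $u,v\in U(A)$ is the length of the shortest path from $u$ to $v$ in $G_A$, and $\infty$ if there is no path from $u$ to $v$. If $\operatorname{dist}^A(u,v)=1$ then we say that $v$ is a neighbour of $u$.
The $r$-neighbourhood $N_r^A(u)$ of $u$ in $A$ is the set of all vertices of distance at most $r$ from $u$. For a tuple $\bar{u}=(u_1,\ldots,u_k)$, we let $N_r^A(\bar{u}):=\bigcup_{i=1}^{k}N_r^A(u_i)$. To avoid cluttering the notation even more, we also use $N_r^A(\bar{u})$ to denote the induced substructure $A[N_r^A(\bar{u})]$.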
In all these notations we omit the superscript A if the structure is clear from the context.
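The notions of this subsection can be illustrated by a short Python sketch. It is a minimal sketch under our own assumptions: a structure is represented as a universe together with a dictionary mapping (hypothetical) relation names to sets of tuples, and the function names `gaifman_graph` and `neighbourhood` are ours.

```python
from collections import deque
from itertools import combinations


def gaifman_graph(universe, relations):
    """Adjacency sets: distinct elements are adjacent if they share a tuple."""
    adj = {a: set() for a in universe}
    for tuples in relations.values():
        for tup in tuples:
            for a, b in combinations(set(tup), 2):
                adj[a].add(b)
                adj[b].add(a)
    return adj


def neighbourhood(adj, sources, r):
    """Elements at distance at most r from some element of `sources` (BFS)."""
    dist = {s: 0 for s in sources}
    queue = deque(dist)
    while queue:
        a = queue.popleft()
        if dist[a] == r:
            continue  # do not expand beyond radius r
        for b in adj[a]:
            if b not in dist:
                dist[b] = dist[a] + 1
                queue.append(b)
    return set(dist)


# Hypothetical structure with a binary relation E and a ternary relation S.
universe = {1, 2, 3, 4, 5}
relations = {"E": {(1, 2), (2, 3)}, "S": {(3, 4, 5)}}
adj = gaifman_graph(universe, relations)
max_degree = max(len(neighbours) for neighbours in adj.values())
```

In this toy structure the ternary tuple makes the elements 3, 4 and 5 pairwise adjacent, so element 3 has degree 3, which is also the maximum degree.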
Let us briefly review the computation model described in the introduction. We say that an algorithm has local access to a $\tau$-structure $B$ if it can query an oracle in the following two ways.
Relation queries:
Is $(u_1,\ldots,u_k)\in R(B)$, for a given $k$-ary $R\in\tau$ and given elements $u_1,\ldots,u_k$?
Neighbourhood queries:
Return a list of all neighbours of a given element $u\in U(B)$.
A relation query requires constant time. A neighbourhood query requires time proportional to the size of the (representation of the) answer, which is the degree of $u$ times the space required to store a single element of $B$. With the uniform cost measure, storing an element of $B$ requires constant space, and with the logarithmic cost measure it requires space $O(\log n)$, where $n:=|B|$. Unless explicitly stated otherwise, we assume the logarithmic cost measure.
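The following Python sketch illustrates the local access model. It is only a sketch under our own assumptions: the class `LocalAccess`, its method names, and the query counters are hypothetical and serve to show that exploring a neighbourhood costs a number of queries that depends on the size of the neighbourhood, not on the order of the structure.

```python
class LocalAccess:
    """Oracle wrapper that exposes only relation and neighbourhood queries."""

    def __init__(self, relations, adj):
        self._relations = relations  # relation name -> set of tuples
        self._adj = adj              # adjacency sets of the Gaifman graph
        self.relation_queries = 0
        self.neighbour_queries = 0

    def holds(self, name, tup):
        """Relation query: is the tuple in the named relation?"""
        self.relation_queries += 1
        return tuple(tup) in self._relations[name]

    def neighbours(self, element):
        """Neighbourhood query: list all neighbours of a known element."""
        self.neighbour_queries += 1
        return sorted(self._adj[element])


def ball(oracle, source, r):
    """Collect the r-neighbourhood of `source` using neighbourhood queries only."""
    seen = {source}
    frontier = [source]
    for _ in range(r):
        next_frontier = []
        for a in frontier:
            for b in oracle.neighbours(a):
                if b not in seen:
                    seen.add(b)
                    next_frontier.append(b)
        frontier = next_frontier
    return seen


# Hypothetical background structure: a path 0 - 1 - ... - 9.
relations = {"E": {(i, i + 1) for i in range(9)}}
adj = {i: set() for i in range(10)}
for a, b in relations["E"]:
    adj[a].add(b)
    adj[b].add(a)

oracle = LocalAccess(relations, adj)
found = ball(oracle, 0, 2)
```

Here the 2-neighbourhood of the element 0 is found with two neighbourhood queries, and this number would stay the same if the path were extended to any length.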
2.2 First-Order Logic
Let us briefly review the definition of first-order logic FO. First-order formulas of vocabulary $\tau$ are formed from atomic formulas $x=y$ and $R(x_1,\ldots,x_k)$, where $R\in\tau$ is a $k$-ary relation symbol and $x,y,x_1,\ldots,x_k$ are variables, by the Boolean connectives $\neg$ (negation), $\wedge$ (conjunction), $\vee$ (disjunction), $\to$ (implication) and existential and universal quantification $\exists x$ and $\forall x$, respectively, all with the usual semantics. The set of all first-order formulas of vocabulary $\tau$ is denoted by $\mathrm{FO}[\tau]$. The free variables of a formula are those not in the scope of a quantifier, and we write $\varphi(\bar{x})$ to indicate that the free variables of the formula $\varphi$ are among $\bar{x}$. A sentence is a formula without free variables. We write $A\models\varphi(\bar{u})$ to denote that $A$ satisfies $\varphi$ if $\bar{x}$ is interpreted by $\bar{u}$. As explained in the introduction, when describing (machine learning) models, we partition the free variables of a formula into instance variables and parameter variables, and we use a semicolon to separate the two parts, as in $\varphi(\bar{x}\mathbin{;}\bar{y})$ or $\varphi(\bar{u}\mathbin{;}\bar{v})$. We use the notation $\llbracket\varphi(\bar{x}\mathbin{;}\bar{v})\rrbracket^{B}$ introduced in (1.1) for the instantiation of the model with parameters $\bar{v}$ in a background structure $B$.
The quantifier rank of a first-order formula $\varphi$ is the nesting depth of quantifiers in $\varphi$.
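As a small illustration, quantifier rank can be computed by structural recursion on the formula. The sketch below assumes a hypothetical encoding of formulas as nested Python tuples; the tag names (`"atom"`, `"eq"`, `"not"`, and so on) are our own convention and not part of the paper.

```python
def quantifier_rank(formula):
    """Nesting depth of quantifiers in a formula given as a nested tuple."""
    op = formula[0]
    if op in ("atom", "eq"):
        return 0
    if op == "not":
        return quantifier_rank(formula[1])
    if op in ("and", "or", "implies"):
        return max(quantifier_rank(formula[1]), quantifier_rank(formula[2]))
    if op in ("exists", "forall"):
        # formula = (quantifier, variable, subformula)
        return 1 + quantifier_rank(formula[2])
    raise ValueError(f"unknown connective: {op!r}")


# exists y (E(x, y) and forall z not E(y, z)): two nested quantifiers.
nested = ("exists", "y",
          ("and",
           ("atom", "E", "x", "y"),
           ("forall", "z", ("not", ("atom", "E", "y", "z")))))

# (exists y E(x, y)) and (exists z E(z, x)): two quantifiers side by side.
parallel = ("and",
            ("exists", "y", ("atom", "E", "x", "y")),
            ("exists", "z", ("atom", "E", "z", "x")))
```

The two example formulas both contain two quantifiers, but only the first nests them, so the ranks differ.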
2.3 Locality
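Let us fix a vocabulary $\tau$. An $\mathrm{FO}[\tau]$-formula $\varphi(\bar{x})$ is $r$-local if for all $\tau$-structures $A$ and all tuples $\bar{u}$ of elements,
$$A\models\varphi(\bar{u})\iff N_r^A(\bar{u})\models\varphi(\bar{u}).$$
A formula is local if it is $r$-local for some $r$. For all $r$ there is an $\mathrm{FO}[\tau]$-formula $\operatorname{dist}_{\le r}(x,y)$ of quantifier rank $O(\log r)$ stating that the distance between $x$ and $y$ is at most $r$. We write $\operatorname{dist}(x,y)\le r$ instead of $\operatorname{dist}_{\le r}(x,y)$, and $\operatorname{dist}(x,y)>r$ for its negation. A basic local sentence of radius $r$ is a first-order sentence of the form
$$\exists x_1\ldots\exists x_k\Big(\bigwedge_{1\le i<j\le k}\operatorname{dist}(x_i,x_j)>2r\;\wedge\;\bigwedge_{i=1}^{k}\psi(x_i)\Big), \tag{2.1}$$
where $\psi(x)$ is $r$-local.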
Theorem 2.1 (Gaifman’s Locality Theorem [12]).
Every first-order formula is equivalent to a Boolean combination of basic local sentences and local formulas.
This notion of locality defined above is semantical, but we also need a syntactical version. The radius- relativisation of an -formula is the formula obtained from by replacing each subformula by the formula and every subformula by Note that for every the radius- relativisation is -local. Moreover, if is -local then and are equivalent. Note that the transition from to increases the quantifier rank by a factor of .
An -formula is syntactically -local if it is the radius- relativisation of some -formula . Every syntactically -local formula is -local, and conversely, every -local formula of quantifier rank is equivalent to a syntactically -local formula of quantifier rank (its own radius- relativisation).
We say that a basic local sentence of the form (2.1) is syntactically basic local if the $r$-local formula $\psi$ is syntactically $r$-local. A formula is in Gaifman normal form if it is a Boolean combination of syntactically basic local sentences and syntactically local formulas. The locality radius of a formula $\varphi$ in Gaifman normal form is the least $r$ such that all basic local sentences in $\varphi$ have radius at most $r$ and all local formulas are syntactically $r'$-local for some $r'\le r$.
2.4 Types
Let $A$ be a $\tau$-structure and $\bar{u}\in U(A)^k$. For every $q$, the (first-order) $q$-type of $\bar{u}$ in $A$ is the set of all $\varphi(\bar{x})\in\mathrm{FO}[\tau]$ of quantifier rank at most $q$ such that $A\models\varphi(\bar{u})$. Types are infinite sets of formulas, but we can syntactically normalise formulas in such a way that there are only finitely many normalised formulas of fixed quantifier rank and with a fixed set of free variables, and that every formula can effectively be transformed into an equivalent normalised formula of the same quantifier rank. We represent a type by the set of normalised formulas it contains.
We need the following Feferman-Vaught style composition lemma [11] (also see [28]). For tuples $\bar{u}=(u_1,\ldots,u_k)$ and $\bar{v}=(v_1,\ldots,v_\ell)$, we let $\bar{u}\bar{v}:=(u_1,\ldots,u_k,v_1,\ldots,v_\ell)$.
Lemma 2.2 (Composition Lemma [11]).
Let be -structures such that and . Let , , , and such that and . Then
[TABLE]
We also need a “local” version of types. Let be a -structure and , and let . The local -type of in is the set of all syntactically -local of quantifier rank at most such that , or equivalently, .
As a corollary to Lemma 2.2, we obtain the following composition lemma for local types.
Corollary 2.3 (Local Composition Lemma).
Let be -structures, , , , and such that and and and . Then
[TABLE]
3 Parameter Learning
Recall the two different modes of learning we described in the introduction: parameter learning and model learning. In our context, for the parameter learning problem we assume that we have a fixed $\mathrm{FO}[\tau]$-formula $\varphi(\bar{x}\mathbin{;}\bar{y})$. The input to our learning algorithm is a sequence $T$ of training examples over some background structure $B$ to which we have local access. Our goal is to find a tuple $\bar{v}$ such that $\llbracket\varphi(\bar{x}\mathbin{;}\bar{v})\rrbracket^{B}$ is consistent with $T$ (or at least approximately consistent).
The following simple example shows that parameter learning requires reading the whole background structure and thus is not possible in sublinear time, whereas our model learning algorithms only need polylogarithmic time.
Example 3.1.
Let $P$ be a unary relation symbol, and let $\varphi(x\mathbin{;}y):=P(y)$. Then for every $\{P\}$-structure $B$ and every $v\in U(B)$, the function $\llbracket\varphi(x\mathbin{;}v)\rrbracket^{B}$ is constant $1$ if $v\in P(B)$ and constant $0$ otherwise.
Now suppose that the background structure $B$ is such that $P(B)=\{v^*\}$ for some $v^*\in U(B)$, and the unknown target concept is $\llbracket\varphi(x\mathbin{;}v^*)\rrbracket^{B}$, that is, constant $1$. Then our learning algorithm receives only positive training examples, and it needs to find the parameter $v^*$. However, unless $v^*$ happens to be one of the training examples, this requires reading the whole structure in the worst case. If the algorithm only has local access to $B$, it is actually impossible to find $v^*$, because the Gaifman graph of $B$ is the trivial graph with vertex set $U(B)$ and empty edge set.
Thus parameter learning is not possible in our setting with only local access to the background structure. However, there is an intermediate mode of learning between parameter and model learning, where we assume that we know the formula defining the target concept, but are still allowed to modify it when formulating our hypothesis. This is potentially much easier than constructing the formula from scratch, as we are required to do in the model learning setting.
For instance, in Example 3.1, looking at the formula we immediately know that the target concept is constant, and just from one labelled example we know if it is $0$ or $1$. Then we can either return the universally true formula or the universally false formula as our hypothesis (we do not even need a parameter here).
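The contrast between the two modes can be illustrated with a small Python sketch. It assumes, as our reading of Example 3.1, that the formula is $\varphi(x;y) := P(y)$; the function names and the toy universe are hypothetical. Note that the scan below enumerates the whole universe, which is exactly what an algorithm with only local access cannot do.

```python
def hypothesis(P, v):
    """Hypothesis defined by phi(x; y) := P(y) with parameter v: constant in x."""
    return lambda x: 1 if v in P else 0


def find_parameter_by_scan(universe, P):
    """Parameter learning by exhaustive search, counting membership tests."""
    queries = 0
    for v in universe:
        queries += 1
        if v in P:
            return v, queries
    return None, queries


def constant_hypothesis(training):
    """Intermediate mode: read the constant off one labelled example."""
    _, label = training[0]
    return lambda x: label


universe = list(range(1000))
P = {999}                          # the only element satisfying P
training = [(3, 1), (17, 1)]       # only positive examples

v, queries = find_parameter_by_scan(universe, P)
h = constant_hypothesis(training)
```

In this worst case the scan inspects every element of the universe before it finds the parameter, whereas the modified hypothesis is obtained from a single training example without touching the background structure at all.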
4 Model Learning
In this section, we look at the model learning problem for first-order logic over structures of small degree. We prove Theorem 1.1 and several variants and generalisations of it.
4.1 Consistent Hypotheses
We start by proving Theorem 1.1.
Throughout this section, we fix and a vocabulary . Let such that every -formula with at most free variables and quantifier rank at most is equivalent to a formula in Gaifman normal form of locality radius at most such that the quantifier rank of every syntactically local formula in the Boolean combination has quantifier rank at most . Such exist by Gaifman’s theorem and because up to logical equivalence there only exist finitely many -formulas of quantifier rank at most with at most free variables.
Lemma 4.1.
Let be a -structure, and let be an -formula of quantifier rank at most and with . Then for all
[TABLE]
Proof.
It follows from Gaifman’s Locality Theorem and the choice of that $\big(B\models\psi(\bar{w})\iff B\models\psi(\bar{w}')\big)$ if $\big(B\models\chi(\bar{w})\iff B\models\chi(\bar{w}')\big)$ for all syntactically -local formulas of quantifier rank at most . If then the latter equivalence holds. ∎
To simplify the presentation, we fix a background structure $B$. We also fix the length $t$ of our training sequences. Of course our algorithm will neither depend on the specific structure nor on $t$.
We fix a tuple of instance variables and a tuple of parameter variables. We let be the set of all -formulas of quantifier rank at most and
[TABLE]
Similarly, we let be the set of all syntactically -local -formulas of quantifier rank at most and
[TABLE]
Moreover, we let $\mathcal{T}:=\big(U(B)^{k}\times\{0,1\}\big)^{t}$ be the set of all training sequences of length $t$ for the $k$-ary learning problem over $B$. For every $T=\big((\bar{u}_{1},c_{1}),\ldots,(\bar{u}_{t},c_{t})\big)\in\mathcal{T}$ and we let
[TABLE]
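Recall that a hypothesis $H$ is consistent with $T=\big((\bar{u}_{1},c_{1}),\ldots,(\bar{u}_{t},c_{t})\big)$ if for all $i\in\{1,\ldots,t\}$ we have $H(\bar{u}_i)=c_i$.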
What we need to prove is that for all , if there is a that is consistent with then we can find a consistent with within the time bounds specified in Theorem 1.1.
The following lemma is the crucial step in our proof.
Lemma 4.2.
Let be consistent with some . Then there is a formula and a tuple such that is consistent with .
Proof.
Let $T=\big((\bar{u}_{1},c_{1}),\ldots,(\bar{u}_{t},c_{t})\big)$. Let and such that is consistent with .
For some , we define and as follows: we let . Now suppose that is already defined. If there is a such that , then we pick such a (arbitrarily if there are more than one) and let and . If there is no such , we let and stop the construction.
We let . To simplify the notation, we further assume (without loss of generality) that for all . We let and . Possibly, or is the empty tuple. Observe that and
[TABLE]
and
[TABLE]
Furthermore,
[TABLE]
Claim 1.
Let such that
[TABLE]
Then .
Now let be the set of indices of the positive examples, that is, . For every , let , where , be the conjunction of all normalised formulas in the type . Then for all we have
[TABLE]
Now we let . Then it follows from Claim 1 and (4.3) that is consistent with . Furthermore, all the and hence are syntactically -local of quantifier rank at most .
It remains to transform into a formula with the right number of parameter variables. We simply do this by adding redundant variables, but we have to be careful that the resulting formula is still syntactically -local. Since is syntactically -local, all its quantifiers are relativised to the -neighbourhood of the free variables, that is, of the form
[TABLE]
or
[TABLE]
To obtain from , we replace (4.4) by
[TABLE]
and similarly for the universal quantifier in (4.5). Then is syntactically -local and has the same quantifier rank as . Hence . Moreover, for all and all we have
[TABLE]
We choose an arbitrary and let . Then for all we have
[TABLE]
Hence is consistent with .
Proof 4.4 (Proof of Theorem 1.1).
The pseudocode for our learning algorithm is shown in Figure 2. The algorithm proceeds by brute-force: it goes through all formulas and all tuples and checks, in lines 4–7, if is consistent with . If it is, the algorithm returns , otherwise it proceeds to the next . If it does not find any consistent , it rejects. To see that the consistency test in lines 4–7 is correct, note that
[TABLE]
because is -local.
Hence the algorithm is correct, that is, satisfies conditions (1) and (2) of Theorem 1.1: it obviously satisfies (1), and it follows from Lemma 4.2 that it satisfies (2).
Note that the set (in line 1) can be computed from with only local access to .
To analyse the running time of , let and and . Note that for all we have
[TABLE]
Thus the representation size of the substructure of is , which is if we treat as constants. It requires time polynomial in the size of to test if the structure satisfies . Thus the overall running time of lines 4–7 is . We have
[TABLE]
and . Hence the two outer loops add a factor of , and the overall runtime is
[TABLE]
This proves Theorem 1.1(3).
Finally, any hypothesis returned by can be evaluated in time with only local access to , because the formula is -local, and thus to compute we only need to look at the substructure of . This proves Theorem 1.1(4).
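The brute-force structure of the algorithm can be sketched in a few lines of Python. This is not the pseudocode of Figure 2 but a simplified illustration under our own assumptions: the finite list of candidate formulas is given explicitly as Python predicates (standing in for the finitely many normalised local formulas), the set `candidates` of possible parameter values is assumed to have been collected beforehand from a neighbourhood of the training instances, and all names are hypothetical.

```python
from itertools import product


def brute_force_learn(formulas, candidates, ell, training):
    """Return the first (name, parameters) consistent with `training`, or None."""
    for name, phi in formulas:
        for params in product(candidates, repeat=ell):
            if all(phi(u, params) == bool(c) for u, c in training):
                return name, params
    return None  # reject: no candidate hypothesis is consistent


def adjacent(a, b):
    """Edge relation of a hypothetical path graph 0 - 1 - 2 - 3 - 4 - 5."""
    return abs(a - b) == 1


# Hypothetical candidate formulas phi(x; y) with one instance and one parameter.
formulas = [
    ("x = y", lambda u, v: u[0] == v[0]),
    ("E(x, y)", lambda u, v: adjacent(u[0], v[0])),
]

training = [((1,), 1), ((3,), 1), ((5,), 0), ((2,), 0)]
candidates = [0, 1, 2, 3, 4, 5]  # elements near the training instances

result = brute_force_learn(formulas, candidates, 1, training)
rejected = brute_force_learn(formulas, candidates, 1, [((1,), 1), ((1,), 0)])
```

On the toy input the learner finds that "being adjacent to the vertex 2" is consistent with all four examples, and it rejects the contradictory second training sequence. The two nested loops correspond to the two outer loops in the running time analysis, and the inner consistency test only ever evaluates formulas on the training instances together with the candidate parameters.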
Let us now analyse the algorithm under the uniform cost measure. Everything remains unchanged, except that the log-factors in the running time disappear. One advantage of the uniform cost model is that we can even apply it to infinite background structures . We obtain the following theorem.
Theorem 4.5.
Let . Then there is a and a learning algorithm for the -ary learning problem over some (possibly infinite) background structure with properties 1) and 2) of Theorem 1.1 and the following two properties.
- 3u) The algorithm runs in time under the uniform cost measure with only local access to , where and .
- 4u) The hypotheses returned by can be evaluated in time under the uniform cost measure with only local access to .
4.2 Minimising the Training Error
We continue to work in the same framework as before, that is, we consider the -ary learning problem over a background structure . Let $T=\big((\bar{u}_{1},c_{1}),\ldots,(\bar{u}_{t},c_{t})\big)\in\big(U(B)^{k}\times\{0,1\}\big)^{t}$ be a training sequence. The training error of a hypothesis on is the fraction of examples on which is wrong, that is,
[TABLE]
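As a concrete reading of this definition, the following minimal Python sketch computes the fraction of training examples on which a hypothesis is wrong. The representation of a hypothesis as a callable on instance tuples and the name `training_error` are assumptions made for illustration.

```python
from typing import Callable, Sequence, Tuple


def training_error(hypothesis: Callable[[tuple], bool],
                   training: Sequence[Tuple[tuple, int]]) -> float:
    """Fraction of examples (u, c) in `training` with hypothesis(u) != c."""
    if not training:
        raise ValueError("training sequence must be non-empty")
    wrong = sum(1 for u, c in training if hypothesis(u) != bool(c))
    return wrong / len(training)


# Usage: the hypothesis "x is even" against four labelled instances.
is_even = lambda u: u[0] % 2 == 0
sample = [((0,), 1), ((1,), 0), ((2,), 0), ((3,), 1)]
print(training_error(is_even, sample))
```

A hypothesis is consistent with the training sequence exactly when this value is 0.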
The next theorem is a generalisation of Theorem 1.1 where, instead of insisting on a consistent hypothesis, we try to find a hypothesis with minimal training error.
Theorem 4.6.
Let . Then there is a and a learning algorithm for the -ary learning problem over some finite background structure with the following properties.
1. always returns a hypothesis for some -local first-order formula of quantifier rank and .
2. If there is a first-order formula of quantifier rank and some tuple of parameters such that , where is the input sequence , then for the hypothesis returned by on input .
3. The algorithm runs in time with only local access to , where and and .
4. The hypotheses returned by can be evaluated in time with only local access to .
To prove the theorem, we use the same notation as in the previous section: we fix and define as before. We let be a -structure. We define and as before. We let and $\mathcal{T}:=\big(U(B)^{k}\times\{0,1\}\big)^{t}$.
The proof of Theorem 4.6 relies on the following generalisation of Lemma 4.2.
Lemma 4.7.
Let such that for some . Then there is a formula and a tuple such that
[TABLE]
Proof 4.8.
Let $T=\big((\bar{u}_{1},c_{1}),\ldots,(\bar{u}_{t},c_{t})\big)$. Let and such that $\operatorname{err}_{T}\big(\llbracket\varphi(\bar{x}\mathbin{;}\bar{v})\rrbracket^{B}\big)\leq\epsilon$. Then there exists a subsequence of such that and is consistent with .
By Lemma 4.2, there is a formula and a tuple such that is consistent with . Then $\operatorname{err}_{T}\big(\llbracket\varphi^{*}(\bar{x}\mathbin{;}\bar{v}^{*})\rrbracket^{B}\big)\leq\epsilon$.
Proof 4.9 (Proof of Theorem 4.6).
The pseudocode for our learning algorithm is shown in Figure 2. The algorithm is very similar to the algorithm of Theorem 1.1, except that we do not check for consistency, but count the errors of all hypotheses and return the one with minimum error. The runtime of is essentially the same as that of .
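The change from checking consistency to counting errors can be sketched as follows. As an illustration only, formulas are again modelled as hypothetical callables `phi(instance, params)` and `candidates` is an assumed finite pool of parameter elements; the function name `learn_min_error` is ours.

```python
from itertools import product
from typing import Callable, Sequence, Tuple

Formula = Callable[[tuple, tuple], bool]


def learn_min_error(formulas: Sequence[Formula],
                    candidates: Sequence,
                    ell: int,
                    training: Sequence[Tuple[tuple, int]]
                    ) -> Tuple[Tuple[Formula, tuple], float]:
    """Return a (formula, parameters) pair of minimum training error and that error."""
    if not training:
        raise ValueError("training sequence must be non-empty")
    best, best_errors = None, len(training) + 1
    for phi in formulas:
        for params in product(candidates, repeat=ell):
            errors = sum(1 for u, c in training if phi(u, params) != bool(c))
            if errors < best_errors:          # keep the first minimiser found
                best, best_errors = (phi, params), errors
    if best is None:
        raise ValueError("no hypothesis to choose from")
    return best, best_errors / len(training)


# Usage: threshold formula "x >= parameter" on a sample that no threshold fits exactly.
at_least = lambda u, v: u[0] >= v[0]
noisy = [((1,), 0), ((2,), 1), ((3,), 0), ((4,), 1), ((5,), 1)]
(phi, params), err = learn_min_error([at_least], range(7), 1, noisy)
print(params, err)
```

Unlike the consistency search, this loop never rejects: it always visits every candidate hypothesis, which matches the remark that the running time is essentially unchanged.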
There is an analogous generalisation of Theorem 4.5.
Theorem 4.10.
Let . Then there is a and a learning algorithm for the -ary learning problem over some (possibly infinite) background structure with properties 1) and 2) of Theorem 4.6 and the following two properties.
- 3u) The algorithm runs in time under the uniform cost measure with only local access to , where and .
- 4u) The hypotheses returned by can be evaluated in time under the uniform cost measure with only local access to .
Proof 4.11.
The proof is the same as that of Theorem 4.6, except that in the analysis of the algorithm the log-factor disappears because of the uniform cost measure.
5 PAC Learning
In this section, we sketch some of the basic principles of algorithmic learning theory and show how they apply in our context. For more background, we refer the reader to [3, 25, 34].
So far, we have focussed on the training error of our learning algorithms. But of course that is not the error we are mainly interested in; our goal is to generate hypotheses with a low generalisation error. Probably approximately correct (PAC) learning gives us a framework for analysing the generalisation error theoretically.
Consider a learning problem with instance space where we want to learn an unknown target concept . As before, we are mainly interested in the case that for some background structure and that for some first-order formula and parameter tuple . The basic assumption of PAC learning is that there is an (unknown) probability distribution on the instance space and that instances are drawn independently from this distribution; this applies to the training instances as well as to the new instances that we want to classify with our hypothesis. We define the generalisation error of a hypothesis to be the probability that is wrong on a random instance, that is,
[TABLE]
We allow our algorithm to make a small generalisation error controlled by the error parameter . As the hypothesis depends on the randomly chosen training examples, we must allow for an error caused by unusually bad examples as well. We usually quantify this second type of error by the confidence parameter . Our goal is to generate a hypothesis , which of course depends on the training sequence and may also depend on and , such that
[TABLE]
Here indicates that the training examples are drawn independently from . Intuitively, a hypothesis satisfying (5.2) is probably (referring to the high confidence of at least ) approximately (referring to the low error of at most ) correct.
Ideally, we would like that for all target concepts , all probability distributions on , and all generates hypotheses that are probably approximately correct. This is something we usually cannot achieve. Thus we make assumptions about the target concept, which we formalise by considering target concepts from a concept class . We also specify the hypothesis class and a function that determines the number of training examples required by the algorithm. We say that a learning algorithm is a -PAC-learning algorithm if for all probability distributions on , all target concepts , and all , given a sequence of training examples and , the algorithm generates a hypothesis that satisfies (5.2).
5.1 Sample Size Bound
It is a basic insight from computational learning theory that if the hypothesis class is finite we need roughly training examples to achieve probable approximate correctness. The following well-known lemma makes this precise. For a proof, see for example [34].
Lemma 5.1 (Sample Size Bound).
Suppose that the hypothesis class is finite and that the length of the training sequence satisfies
[TABLE]
Then for all probability distributions on and all target functions ,
[TABLE]
This means that if is finite and the training sequence is long enough (as specified in (5.3)) then with high confidence, every consistent hypothesis will have low generalisation error.
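To get a feeling for the orders of magnitude involved, the following Python sketch evaluates the standard textbook form of the bound for finite hypothesis classes in the realisable case, t ≥ (1/ε) ln(|H|/δ). The constants in the lemma as stated here may differ from this form, and the function name `sample_size_realisable` is an assumption made for illustration.

```python
import math


def sample_size_realisable(h_size: int, eps: float, delta: float) -> int:
    """Smallest integer t with t >= ln(h_size / delta) / eps.

    Textbook bound for a finite hypothesis class of size `h_size`,
    error parameter `eps` and confidence parameter `delta`.
    """
    if h_size < 1 or not (0 < eps < 1) or not (0 < delta < 1):
        raise ValueError("need h_size >= 1 and 0 < eps, delta < 1")
    return math.ceil(math.log(h_size / delta) / eps)


# Usage: the dependence on the class size is only logarithmic.
for size in (10**3, 10**6, 10**9):
    print(size, sample_size_realisable(size, eps=0.1, delta=0.05))
```

The logarithmic dependence on the number of hypotheses is the reason why a hypothesis class of size polynomial in the background structure still leads to a small number of required examples.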
If we combine this lemma with Theorem 1.1, we obtain the following result.
Theorem 5.2.
Let . Then there are and a learning algorithm for the -ary learning problem over some finite background -structure with the following properties.
1. Let be tuples of length , respectively. Let be the class of all for a -formula of quantifier rank and , and let be the class of all for a syntactically -local -formula of quantifier rank and . Let
[TABLE]
where .
Then is a -PAC-learning algorithm.
2. The algorithm runs in time with only local access to , where .
Proof 5.3.
Let be the algorithm of Theorem 1.1. Given a training sequence of length consistent with a concept from , it generates a hypothesis consistent with . It follows from Lemma 5.1 that is probably approximately correct.
5.2 VC Dimension
We cannot apply the sample size bound in a situation where the background structure is infinite. But there is an improved sample size bound that even holds for (some) infinite hypothesis classes. For this bound, we replace the factor in the sample size bound by the VC-dimension of . The VC-dimension is a combinatorial measure for the complexity of a set system.
Let be a set and . A set is shattered by if for every there is an such that is the restriction of to . The VC-dimension of , denoted by , is the maximum size of a set shattered by , or if this maximum does not exist. has finite VC-dimension if . Observe that for finite we have .
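For small finite set systems, shattering and the VC-dimension can be checked by exhaustive search. The following Python sketch does this; it represents each hypothesis as a set of points, and the function names `is_shattered` and `vc_dimension` are ours. It is meant only to make the definition concrete and is exponential in the size of the universe.

```python
from itertools import combinations
from typing import Hashable, Iterable, Sequence


def is_shattered(points: Sequence[Hashable], hypotheses) -> bool:
    """True if every subset of `points` arises as the restriction of some hypothesis."""
    patterns = {tuple(x in h for x in points) for h in hypotheses}
    return len(patterns) == 2 ** len(points)


def vc_dimension(universe: Iterable[Hashable], hypotheses) -> int:
    """Maximum size of a subset of `universe` shattered by `hypotheses`."""
    universe = list(universe)
    hypotheses = [frozenset(h) for h in hypotheses]
    if not hypotheses:
        raise ValueError("hypothesis class must be non-empty")
    dim = 0
    for size in range(1, len(universe) + 1):
        if any(is_shattered(s, hypotheses) for s in combinations(universe, size)):
            dim = size
        else:
            break  # subsets of shattered sets are shattered, so larger sizes fail too
    return dim


# Usage: all intervals {i, ..., j} on the points 0..4, plus the empty set.
intervals = [set(range(i, j + 1)) for i in range(5) for j in range(i, 5)] + [set()]
print(vc_dimension(range(5), intervals))
```

Intervals cannot realise the pattern "first and last point but not the middle one" on any three points, so no three-element set is shattered.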
The following lemma due to Blumer, Ehrenfeucht, Haussler and Warmuth [4] relates VC-dimension to PAC-learning.
Lemma 5.4 (VC-Dimension Sample Size Bound [4]).
There is a constant such that the following holds. Suppose that the hypothesis class has finite VC-dimension and that the length of the training sequence satisfies
[TABLE]
Then for all probability distributions on and all target functions ,
[TABLE]
We can apply this improved sample size bound in our setting for bounded degree structures, because the VC-dimension of first-order definable models is bounded on bounded degree structures.
Lemma 5.5 ([17]).
Let . Then there is an such that the following holds. Let be a structure of maximum degree , let be the class of all for a -formula of quantifier rank with and . Then .
Grohe and Turán [17] bounded the VC-dimension of first-order definable concept classes on a wide range of further classes of structures, among them planar graphs and graphs of bounded tree width. Adler and Adler [2] extended this to all nowhere dense graph classes.
Using the VC-Dimension Sample Size Bound and the previous lemma, we obtain the following theorem.
Theorem 5.6.
Let . Then there are and a learning algorithm for the -ary learning problem over some (possibly infinite) background -structure of maximum degree at most with the following properties.
1. Let be tuples of length , respectively. Let be the class of all for a -formula of quantifier rank and , and let be the class of all for a syntactically -local -formula of quantifier rank and . Let
[TABLE]
Then is a -PAC-learning algorithm.
2. The algorithm runs in time under the uniform cost measure and with only local access to .
Proof 5.7.
This follows by combining Theorem 4.5 with Lemmas 5.4 and 5.5.
Corollary 1.2 follows from this theorem. (In fact, the theorem should be viewed as a precise version of Corollary 1.2.)
5.3 Agnostic PAC-Learning
Agnostic learning is a generalisation of our setting where there is no deterministic target function (or concept), but only a probabilistic one. In practice, this may occur in a situation where the instances in our abstract instance space do not fully capture the relevant properties of the real-world objects they describe. This can easily happen, because typically the instances are tuples only describing certain features of the objects.
We continue to consider Boolean classification problems on an instance space . Instead of a probability distribution on and a target concept , we now assume that we have a probability distribution on . We define the generalisation error of a hypothesis to be
[TABLE]
To quantify the quality of a learning algorithm, we compare the quality of the hypothesis of our algorithm with the best possible hypothesis coming from a certain class . For classes and a function , we say that a learning algorithm is an agnostic -PAC-learning algorithm if for all probability distributions on and all , given a sequence of training examples and , the algorithm generates a hypothesis that satisfies
[TABLE]
The agnostic PAC learning framework was introduced by Haussler [18].
To obtain agnostic PAC-learning algorithms, we can use the following Uniform Convergence Lemma instead of the Sample Size Bound of Lemma 5.1. Recall that denotes the training error of a hypothesis on a training sequence .
Lemma 5.8 (Uniform Convergence).
Suppose that the hypothesis class is finite and that the length of the training sequence satisfies
[TABLE]
Then for all probability distributions on ,
[TABLE]
A proof can be found in [34].
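For comparison with the realisable case, the following Python sketch evaluates a standard sample size that suffices for uniform convergence over a finite class. It is obtained from Hoeffding's inequality and a union bound: t ≥ ln(2|H|/δ) / (2ε²) guarantees that, with probability at least 1 − δ, training and generalisation error differ by at most ε for every hypothesis. The constants in the lemma as stated here may differ, and the function name is an assumption made for illustration.

```python
import math


def sample_size_uniform_convergence(h_size: int, eps: float, delta: float) -> int:
    """Smallest integer t with t >= ln(2 * h_size / delta) / (2 * eps**2).

    Derived from Hoeffding's inequality plus a union bound over a finite
    hypothesis class of size `h_size`.
    """
    if h_size < 1 or not (0 < eps < 1) or not (0 < delta < 1):
        raise ValueError("need h_size >= 1 and 0 < eps, delta < 1")
    return math.ceil(math.log(2 * h_size / delta) / (2 * eps ** 2))


# Usage: the dependence on eps is quadratic, in contrast to the realisable case.
for eps in (0.2, 0.1, 0.05):
    print(eps, sample_size_uniform_convergence(1000, eps, delta=0.05))
```

The quadratic dependence on 1/ε is the price of dropping the assumption that some hypothesis is consistent with the data.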
Now from Theorem 4.6 we get an agnostic PAC-learning algorithm for first-order definable concept classes on structures of small degree.
Theorem 5.9.
Let . Then there are and a learning algorithm for the -ary learning problem over some finite background -structure with the following properties.
1. Let be tuples of length , respectively. Let be the class of all for a -formula of quantifier rank and , and let be the class of all for a syntactically -local -formula of quantifier rank and . Let
[TABLE]
where .
Then is an agnostic -PAC-learning algorithm.
2. The algorithm runs in time with only local access to , where .
Proof 5.10.
Let be the algorithm of Theorem 4.6. Let be a probability distribution on . Let such that
[TABLE]
As is finite, this minimum exists. By the Uniform Convergence Lemma (Lemma 5.8) applied to and we have
[TABLE]
(assuming that the constant is sufficiently large).
Given a training sequence , the algorithm generates a hypothesis with at most the training error of on . Applying the Uniform Convergence Lemma again, this time to and , we get
[TABLE]
This implies that
[TABLE]
Hence is an agnostic -PAC-learning algorithm.
6 Conclusions
We prove that first-order definable models are learnable in polylogarithmic time on finite structures of polylogarithmic degree. In view of the simple example showing that sublinear parameter learning is impossible, which for a long time made us (or rather, the first author) believe that sublinear learning is impossible in general in our framework, this result came as a surprise to us.
It is less surprising that the proof relies on the locality of first-order logic. In fact, the proof is not very difficult, but it has to be set up in the right way. In particular, the use of syntactic locality and the notion of local types, which to the best of our knowledge is new, is essential. Let us remark that we cannot use Hanf’s Locality Theorem, which is usually easier to handle than Gaifman’s, because in structures of (poly)logarithmic degree the number of isomorphism types of local neighbourhoods grows too fast.
Algorithmically, our paper is not very sophisticated: our algorithms are simple brute-force algorithms that are practically useless due to enormous hidden constants. It is a very interesting question whether there are also practical algorithms for learning FO-definable models (which may not be more efficient than ours in the worst case, but nevertheless work better). One approach would be to map the data (consisting of tuples of elements and the background structure) to high dimensional feature vectors, maybe even obtain a kernel, and then apply conventional machine learning algorithms.
As we have outlined in the introduction, our results may be viewed as a contribution to a descriptive complexity theory of machine learning. Within such a theory, many questions, both technical and conceptual, remain open. For example, are there sublinear learning algorithms for first-order logic on other classes of structures such as words, trees, planar graphs? At a more fundamental level, what is a good computation model (replacing our “local access” model) on background structures that are still sparse, but have large maximum degree? In fact, sublinear algorithms seem unlikely for most classes of high maximum degree. But what about fixed-parameter tractable learning algorithms? As soon as we allow formulas with arbitrarily many instance and parameter variables (that is, allow unbounded and ), fixed-parameter tractability becomes nontrivial on any of the classes suggested above. And what about other logics, for example monadic second-order logic or modal and temporal logics? Finally, one should generalise the framework from Boolean classification problems to other types of learning problems.
If our version of a declarative approach to machine learning is supposed to have any impact in practice, maybe the most important question is: what are suitable logics and background structures for expressing relevant and feasible machine learning models?
Acknowledgements
We would like to thank Kristian Kersting and Daniel Neider for very helpful comments on an earlier version of this paper.
- [1] A. Abouzied, D. Angluin, C. Papadimitriou, J. Hellerstein, and A. Silberschatz. Learning and verifying quantified boolean queries by example. In R. Hull and W. Fan, editors, Proceedings of the 32nd ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pages 49–60, 2013.
- [2] H. Adler and I. Adler. Interpreting nowhere dense graph classes as a classical notion of model theory. European Journal of Combinatorics, 36:322–330, 2014.
- [3] A. Blum, J. Hopcroft, and R. Kannan. Foundations of data science. Unpublished manuscript available at https://www.cs.cornell.edu/jeh/book 2016 June 9.pdf, 2016.
- [4] A. Blumer, A. Ehrenfeucht, D. Haussler, and M. Warmuth. Learnability and the Vapnik-Chervonenkis dimension. J. ACM, 36:929–965, 1989.
- [5] A. Bonifati, R. Ciucanu, and S. Staworko. Learning join queries from user examples. ACM Trans. Database Syst., 40(4):24:1–24:38, 2016.
- [6] A. Chandra and D. Harel. Structure and complexity of relational queries. Journal of Computer and System Sciences, 25:99–128, 1982.
- [7] W. Cohen and C. Page. Polynomial learnability and inductive logic programming: Methods and results. New Generation Computing, 13:369–404, 1995.
- [8] M. Crouch, N. Immerman, and J. Moss. Finding reductions automatically. In A. Blass, N. Dershowitz, and W. Reisig, editors, Fields of Logic and Computation: Essays Dedicated to Yuri Gurevich on the Occasion of His 70th Birthday, volume 6300 of Lecture Notes in Computer Science, pages 181–200. Springer Verlag, 2010.
