Learning first-order definable concepts over structures of small degree

Martin Grohe; Martin Ritzert

arXiv:1701.05487·cs.LG·January 20, 2017


TL;DR

This paper introduces a logical framework for machine learning where concepts are defined by first-order formulas over structures with small degree, demonstrating efficient learnability in polylogarithmic time.

Contribution

It shows that first-order definable concepts over structures of small degree can be learned efficiently in the PAC setting, combining logic and complexity theory.

Findings

1. Concepts definable by first-order formulas are learnable in polylogarithmic time.
2. The framework applies to structures with polylogarithmic degree.
3. Efficient learning is achieved within the PAC model.

Abstract

We consider a declarative framework for machine learning where concepts and hypotheses are defined by formulas of a logic over some background structure. We show that within this framework, concepts defined by first-order formulas over a background structure of at most polylogarithmic degree can be learned in polylogarithmic time in the "probably approximately correct" learning sense.

Equations

$$\llbracket\varphi(\bar{x}\mathbin{;}\bar{v})\rrbracket^{B}(\bar{u}):=\begin{cases}1&\text{if }B\models\varphi(\bar{u}\mathbin{;}\bar{v}),\\0&\text{otherwise,}\end{cases}$$

$$\varphi(x\mathbin{;}y_{1},y_{2}):=\;\ldots\;\land\neg\exists z\big(E(y_{2},z)\land E(z,x)\big).$$

$$A\models\psi(\bar{u})\iff N_{r}(\bar{u})\models\psi(\bar{u}).$$

$$\exists x_{1}\ldots\exists x_{k}\big{(}\bigwedge_{1\leq i<j\leq k}\delta_{>2r}(x_{i},x_{j})\wedge\bigwedge_{i=1}^{k}\psi(x_{i})\big{)},$$

$$\operatorname{tp}_{q}(A\cup B,\bar{u}\bar{v})=\operatorname{tp}_{q}(A^{\prime}\cup B^{\prime},\bar{u}^{\prime}\bar{v}^{\prime}).$$

$$\operatorname{ltp}_{q,r}(A,\bar{u}\bar{v})=\operatorname{ltp}_{q,r}(A^{\prime},\bar{u}^{\prime}\bar{v}^{\prime}).$$

$$\operatorname{ltp}_{q^{*},r^{*}}(B,\bar{w})=\operatorname{ltp}_{q^{*},r^{*}}(B,\bar{w}^{\prime})\implies\big(B\models\psi(\bar{w})\iff B\models\psi(\bar{w}^{\prime})\big)$$

$${\mathcal{C}}:=\big{\{}\llbracket\varphi(\bar{x}\mathbin{;}\bar{v})\rrbracket^{B}\mathbin{\big{|}}\varphi(\bar{x}\mathbin{;}\bar{y})\in\Phi,\bar{v}\in U(B)^{\ell}\big{\}},$$

$${\mathcal{C}}^{*}:=\big{\{}\llbracket\varphi^{*}(\bar{x}\mathbin{;}\bar{v})\rrbracket^{B}\mathbin{\big{|}}\varphi(\bar{x}\mathbin{;}\bar{y})\in\Phi^{*},\bar{v}\in U(B)^{\ell}\big{\}}.$$

$$N_{r}(T):=\bigcup_{i=1}^{t}N_{r}(\bar{u}_{i}).$$

$$N^{\circ}\subseteq N_{2\ell r^{*}+r^{*}}(T)$$

$$N^{\circ}=\bigcup_{i=1}^{t}N_{r^{*}}(\bar{u}_{i})\cup\bigcup_{i=1}^{m}N_{r^{*}}(v_{i}).$$

$$N_{r^{*}}(\bar{v}^{\bullet})\cap N^{\circ}=\emptyset.$$

$$\operatorname{ltp}_{q^{*},r^{*}}(B,\bar{u}_{i}\bar{v}^{\circ})=\operatorname{ltp}_{q^{*},r^{*}}(B,\bar{u}_{j}\bar{v}^{\circ}).$$

$$B\models\vartheta_{p}(\bar{u}^{\prime},\bar{v}^{\prime})\iff\operatorname{ltp}_{q^{*},r^{*}}(B,\bar{u}^{\prime}\bar{v}^{\prime})=\operatorname{ltp}_{q^{*},r^{*}}(B,\bar{u}_{p}\bar{v}^{\circ}).$$

$$\exists z\Big(\Big(\bigvee_{i=1}^{k}\delta_{\leq r^{*}}(x_{i},z)\lor\bigvee_{j=1}^{m}\delta_{\leq r^{*}}(y_{j},z)\Big)\land\dots\Big)$$

$$\forall z\Big(\Big(\bigvee_{i=1}^{k}\delta_{\leq r^{*}}(x_{i},z)\lor\bigvee_{j=1}^{m}\delta_{\leq r^{*}}(y_{j},z)\Big)\to\dots\Big).$$

$$\exists z\Big(\Big(\bigvee_{i=1}^{k}\delta_{\leq r^{*}}(x_{i},z)\lor\bigvee_{j=1}^{\ell}\delta_{\leq r^{*}}(y_{j},z)\Big)\land\Big(\bigvee_{i=1}^{k}\delta_{\leq r^{*}}(x_{i},z)\lor\bigvee_{j=1}^{m}\delta_{\leq r^{*}}(y_{j},z)\Big)\land\dots\Big),$$

$$B\models\varphi^{*}(\bar{u}^{\prime}\mathbin{;}\bar{v}^{\prime})\iff B\models\varphi^{\circ}(\bar{u}\mathbin{;}(v_{1}^{\prime},\dots,v_{m}^{\prime})).$$

$$B\models\varphi^{*}(\bar{u}_{i}\mathbin{;}\bar{v}^{*})\iff B\models\varphi^{\circ}(\bar{u}_{i}\mathbin{;}\bar{v}^{\circ})\iff c_{i}=1.$$

$$N_{r^{*}}(\bar{u}\bar{v}^{*})\models\varphi^{*}(\bar{u},\bar{v}^{*})\iff B\models\varphi^{*}(\bar{u},\bar{v}^{*}),$$

$$|N_{r^{*}}(\bar{u}\bar{v}^{*})|\leq(k+\ell)\cdot 2d^{r^{*}}.$$

$$|N|\leq 2tkd^{2\ell r^{*}}=(t+d)^{O(1)}$$

$$(t+d)^{O(1)}\cdot(\log n+d)^{O(1)}\cdot t=(\log n+d+t)^{O(1)}.$$

$$\operatorname{err}_{T}(H):=\frac{1}{t}\big{|}\big{\{}i\in[t]\mathbin{\big{|}}H(\bar{u}_{i})\neq c_{i}\big{\}}\big{|}.$$

$$\operatorname{err}_{T}\big{(}\llbracket\varphi^{*}(\bar{x}\mathbin{;}\bar{v}^{*})\rrbracket^{B}\big{)}\leq\epsilon.$$

$$\operatorname{err}_{D,C^{*}}(H):=\Pr_{x\sim D}\big(H(x)\neq C^{*}(x)\big).$$

$$\Pr_{T\sim D}\big(\operatorname{err}_{D,C^{*}}(H(\epsilon,\delta,T))\leq\epsilon\big)\geq 1-\delta.$$

$$t\geq\frac{\ln(|H|/\delta)}{\epsilon}.$$


Taxonomy

Topics: Computability, Logic, AI Algorithms · Logic, Reasoning, and Knowledge · Advanced Algebra and Logic

Full text

Learning first-order definable concepts over structures of small degree

Martin Grohe

RWTH Aachen University

Martin Ritzert

RWTH Aachen University

1 Introduction

This paper studies, from a theoretical perspective, a role that logic might play as the foundation of a more declarative approach to machine learning. Machine learning algorithms produce a hypothesis $H$ about some unknown target function $C^{*}$ defined on an instance space ${{\mathbb{U}}}$ . In a supervised learning setting, the input of a learning algorithm (the “data”) consists of a sequence of labelled examples, that is, instances $u\in{{\mathbb{U}}}$ labelled by the value $C^{*}(u)$ . The quality of the hypothesis $H$ is measured in terms of how well it generalises, that is, predicts correct values of the target function on new data items. In this paper, we focus on Boolean classification problems, where the target function has the range $\{0,1\}$ . In this case, we usually speak of a target concept. We also consider a setting where the target concept is not deterministic, but a random variable.

The type of hypothesis we get is determined by the learning algorithm we use. For example, if we use support vector machines, the hypothesis is a linear halfspace of the instance space¹, if we use decision tree learning, then the hypothesis is a decision tree, and if we use deep learning the hypothesis is specified by the weights and structure of a neural network. The natural workflow would be to first decide on a model of how the target concept might look, or rather, what kind of hypothesis might be appropriate. Then the learning algorithm solves an optimisation problem by choosing the parameters of the model in such a way that they fit the data. For example, if the instance space is $\mathbb{R}^{\ell}$ and we choose a linear model, the parameters of the model consist of a vector $\boldsymbol{a}\in\mathbb{R}^{\ell}$ and a number $b\in\mathbb{R}$, specifying the hyperplane $\{\boldsymbol{u}\in\mathbb{R}^{\ell}\mid\boldsymbol{a}\cdot\boldsymbol{u}-b\geq 0\}$. Then we may choose an algorithm such as support vector machine² or the perceptron algorithm for computing the parameters.

¹ We assume the instance space is $\mathbb{R}^{\ell}$ for some $\ell$ and the hypothesis is a halfspace determined by a hyperplane.

² Arguably, we could also call the support vector machine the “model” and the solver for the quadratic optimisation system behind it the “algorithm”.

From a declarative viewpoint, it seems desirable to separate the choice of the model from the choice of the algorithm. Then as logicians, we will ask which language we best use to describe the model. A natural and very flexible framework to do this is the following. We first choose a background structure $B$. For example, if we have numerical data, $B$ may be the field of reals, possibly expanded by additional functions like the sigmoid function $x\mapsto\frac{1}{1+e^{-x}}$. If we have graph data, our background structure may be a finite labelled graph. Given the background structure, we can specify a parametric model by a formula $\varphi(\bar{x}\mathbin{;}\bar{y})$ of some logic L, for example first-order logic (FO). This formula has two types of free variables, the instance variables $\bar{x}=(x_{1},\ldots,x_{k})$ and the parameter variables $\bar{y}=(y_{1},\ldots,y_{\ell})$. The instance space ${{\mathbb{U}}}$ of our model is $U(B)^{k}$, where $U(B)$ denotes the universe of our background structure $B$. For each choice $\bar{v}\in U(B)^{\ell}$ of parameters, the formula defines a function $\llbracket\varphi(\bar{x}\mathbin{;}\bar{v})\rrbracket^{B}:U(B)^{k}\to\{0,1\}$ by

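$$\llbracket\varphi(\bar{x}\mathbin{;}\bar{v})\rrbracket^{B}(\bar{u}):=\begin{cases}1&\text{if }B\models\varphi(\bar{u}\mathbin{;}\bar{v}),\\0&\text{otherwise,}\end{cases}\tag{1.1}$$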

which we regard as a concept or hypothesis over our instance space. Here $B\models\varphi(\bar{u}\mathbin{;}\bar{v})$ means that $B$ satisfies $\varphi$ if the variables $\bar{x}$ are interpreted by the values $\bar{u}$ and the variables $\bar{y}$ by the values $\bar{v}$ . Depending on the context, we often call $\llbracket\varphi(\bar{x}\mathbin{;}\bar{v})\rrbracket^{B}$ an L-definable model or hypothesis.
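To make this definition concrete, the following is a minimal Python sketch, assuming a finite structure stored as a dictionary of relations and a formula supplied as an ordinary Python predicate. The class `Structure`, the function `hypothesis`, the toy graph and the sample formula are all illustrative assumptions and do not come from the paper.

```python
from typing import Callable, Dict, Set, Tuple

Element = str
Tup = Tuple[Element, ...]


class Structure:
    """A finite relational structure: a universe plus named relations."""

    def __init__(self, universe: Set[Element], relations: Dict[str, Set[Tup]]):
        self.universe = universe
        self.relations = relations

    def holds(self, name: str, *args: Element) -> bool:
        # True iff the tuple `args` belongs to the relation called `name`.
        return tuple(args) in self.relations[name]


def hypothesis(B: Structure, phi: Callable[[Structure, Tup, Tup], bool], v: Tup):
    """Return [[phi(x; v)]]^B as a function from instance tuples to {0, 1}."""
    def h(u: Tup) -> int:
        return 1 if phi(B, u, v) else 0
    return h


# Hypothetical background structure: a directed path a -> b -> c -> d.
B = Structure(
    universe={"a", "b", "c", "d"},
    relations={"E": {("a", "b"), ("b", "c"), ("c", "d")}},
)


def phi(B: Structure, u: Tup, v: Tup) -> bool:
    # Sample formula phi(x; y) := exists z (E(y, z) and E(z, x)).
    (x,), (y,) = u, v
    return any(B.holds("E", y, z) and B.holds("E", z, x) for z in B.universe)


H = hypothesis(B, phi, ("a",))          # fix the parameter y := a
labels = {x: H((x,)) for x in sorted(B.universe)}
```

With the parameter fixed to `a`, the hypothesis labels exactly those vertices that can be reached from `a` in two steps. Changing the parameter tuple yields a different concept from the same formula, which is the sense in which the formula is a parametric model.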

Example 1.1.

Let $B$ be an $\{E,R\}$ -structure, where $E$ is a binary and $R$ a unary relation symbol. $B$ may be viewed as a directed graph in which some vertices are coloured red. Consider, for example, the graph shown in Figure 1.

As input for a learning algorithm, we receive a training sequence consisting of some vertices labelled $0$ or $1$. In our example, this may be the sequence $\big{(}(a,0),(b,1),(g,0),(k,1)\big{)}$. From these training examples, we are supposed to figure out a global labelling function.

Consider the first-order formula

$$\varphi(x\mathbin{;}y_{1},y_{2}):=\;\ldots\;\land\neg\exists z\big(E(y_{2},z)\land E(z,x)\big).$$

If we take as parameters $v_{1}:=j,v_{2}:=e$ , then the hypothesis $\llbracket\varphi(x\mathbin{;}v_{1},v_{2})\rrbracket^{B}$ is consistent with the training examples.

Working in such a rich declarative framework, however, it is easy to get carried away by the expressiveness it gives. It is important, therefore, to make sure that the models can still be learned by efficient algorithms. There are two basic algorithmic problems, which we may call parameter learning (or parameter estimation) and model learning (or model estimation). For both, assume that we have a background structure $B$ and a logic L. In the parameter learning problem, we assume a fixed L-formula $\varphi(\bar{x}\mathbin{;}\bar{y})$ , and we want to find parameters that fit the data. In the model learning problem, we are only given the data, and we want to find an L-formula $\varphi(\bar{x}\mathbin{;}\bar{y})$ as well as parameters fitting the data. To avoid overfitting, we want to choose the formula $\varphi(\bar{x}\mathbin{;}\bar{y})$ to be as simple as possible, according to some metric (for example, length or quantifier-rank). At first sight, it seems that the parameter learning problem is the simpler one, but this is not necessarily the case (as we shall discuss in Section 3).

These considerations suggest the following research program: Identify logics suitable for expressing relevant models for machine learning and study their algorithmic learnability, that is, efficient learning algorithms and also complexity theoretic or information theoretic lower bounds. Ideally, the logics would be expressive enough to define all feasible models and at the same time only permit the definition of feasible models. Note the similarity of these desiderata with those for database query languages (e.g. [6]). The theoretical side of this program may be viewed as a “descriptive complexity theory of machine learning”, and this is where our technical contributions are.

Before we describe our results, let us discuss one more technical issue. The input of a learning algorithm consists of the training examples, but in our framework of learning definable concepts the algorithm also needs access to the background structure $B$ . One possible scenario is that $B$ is a fixed infinite structure, for example the field of real numbers, and we consider an abstract computation model where the algorithm can store an element of the structure in a single memory cell and has access to the operations of the structure. A second is that $B$ is a finite structure, for example a graph describing the world wide web or a social network. In this case, we can simply regard $B$ as part of the input, but we may think of $B$ as still being too large to fit into main memory and only give our algorithms limited access to $B$ , such as local access that only allows the algorithm to retrieve the neighbours of a vertex that we already know (we can think of this as being able to follow links). This is the scenario we consider here.

1.1 Our results

We give a learning algorithm for the model learning problem for first-order logic. The twist of our result is that if the degree of the background structure is at most polylogarithmic, then the algorithm works in sublinear, in fact, polylogarithmic time, in the size of the background structure. It came as a surprise to us that this is possible at all. In analysing the algorithm, we take a data-complexity point of view [36], that is, we measure the running time in terms of the size of the structure and hide the dependence on the (presumably small) formula in the constants.

We only consider relational structures in this paper. The maximum degree $\Delta(B)$ of a structure $B$ is the maximum degree of its Gaifman graph, in which two vertices are adjacent if they appear together in some tuple of some relation of $B$ (see Section 2.1 for details). The $k$ -ary learning problem over a background structure $B$ has instance space $U(B)^{k}$ . The goal is to learn an unknown target concept $C^{*}:U(B)^{k}\to\{0,1\}$ .

A learning algorithm for the $k$ -ary learning problem over some background structure $B$ receives as input a finite sequence $T$ of training examples. In addition, we grant our learning algorithms local access to the background structure (in the sense described above, see Section 2.1 for details). We usually let $t:=|T|$ be the length of the sequence $T$ . The training examples are pairs $(\bar{u},C^{*}(\bar{u}))$ , where $\bar{u}\in U(B)^{k}$ . We say that $H:U(B)^{k}\to\{0,1\}$ is consistent with $T$ if for all $(\bar{u},c)\in T$ we have $H(\bar{u})=c$ . Given $T$ , the learning algorithm is supposed to compute a hypothesis $H:U(B)^{k}\to\{0,1\}$ , which in our setting is always of the form $\llbracket\varphi(\bar{x}\mathbin{;}\bar{v})\rrbracket^{B}$ for some first-order formula $\varphi(\bar{x}\mathbin{;}\bar{y})$ and parameter tuple $\bar{v}\in U(B)^{\ell}$ . Of course the algorithm is not supposed to return the whole set $\llbracket\varphi(\bar{x}\mathbin{;}\bar{v})\rrbracket^{B}$ , but just the formula $\varphi(\bar{x}\mathbin{;}\bar{y})$ and the parameter tuple $\bar{v}$ . However, we also allow our learning algorithms to reject an input, to account for the situation that after seeing the training examples the algorithm realises that its assumption about the model was wrong, that is, there simply is no hypothesis of the form $\llbracket\varphi(\bar{x}\mathbin{;}\bar{v})\rrbracket^{B}$ consistent with the training examples.

If a learning algorithm $\mathfrak{L}$ returns a hypothesis $\llbracket\varphi(\bar{x}\mathbin{;}\bar{v})\rrbracket^{B}$ specified by a formula $\varphi(\bar{x}\mathbin{;}\bar{y})$ and the parameter tuple $\bar{v}$ , then this hypothesis is useless if we cannot efficiently determine the value $\llbracket\varphi(\bar{x}\mathbin{;}\bar{v})\rrbracket^{B}(\bar{u})$ for a given $\bar{u}\in U(B)^{k}$ . We say that the hypotheses returned by $\mathfrak{L}$ can be evaluated in time $\mathfrak{t}$ if there is an algorithm that, given a pair $\varphi(\bar{x}\mathbin{;}\bar{y})$ , $\bar{v}$ returned by $\mathfrak{L}$ and a tuple $\bar{u}\in U(B)^{k}$ , computes $\llbracket\varphi(\bar{x}\mathbin{;}\bar{v})\rrbracket^{B}(\bar{u})$ in time $\mathfrak{t}$ .

The goal is to produce a hypothesis $H$ that generalises well, that is, approximates the target concept $C^{*}$ closely. To capture theoretically what it means for a hypothesis to generalise well, we will use the framework of probably approximately correct learning. However, let us first state our main algorithmic result.

Theorem 1.1.

Let $k,\ell,q\in\mathbb{N}$. Then there is a $q^{*}\in\mathbb{N}$ and a learning algorithm $\mathfrak{L}$ for the $k$-ary learning problem over some finite background structure $B$ with the following properties.

1. If the algorithm returns a hypothesis $H$, then $H$ is of the form $\llbracket\varphi^{*}(\bar{x}\mathbin{;}\bar{v}^{*})\rrbracket^{B}$ for some first-order formula $\varphi^{*}(\bar{x}\mathbin{;}\bar{y})$ of quantifier rank at most $q^{*}$ and $\bar{v}^{*}\in U(B)^{\ell}$, and $H$ is consistent with the input sequence $T$ of training examples.

2. If there is a first-order formula $\varphi(\bar{x}\mathbin{;}\bar{y})$ of quantifier rank $q$ and some tuple $\bar{v}\in U(B)^{\ell}$ of parameters such that $\llbracket\varphi(\bar{x}\mathbin{;}\bar{v})\rrbracket^{B}$ is consistent with the input sequence $T$, then $\mathfrak{L}$ always returns a hypothesis and never rejects.

3. The algorithm runs in time $(\log n+d+t)^{O(1)}$ with only local access to $B$, where $n:=|U(B)|$ and $d:=|\Delta(B)|$ and $t:=|T|$.

4. The hypotheses returned by $\mathfrak{L}$ can be evaluated in time $(\log n+d)^{O(1)}$ with only local access to $B$.

Note that if the maximum degree $d$ and the length $t$ of the training sequence are polylogarithmic in $n$ , then the overall running time of the algorithm is polylogarithmic in $n$ .
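The arithmetic behind this remark can be spelled out as follows. The constants $a$, $b$, $c$ are placeholders introduced here for illustration, and we assume $\log n\geq 1$. Suppose $d\leq(\log n)^{a}$ and $t\leq(\log n)^{b}$ with $a,b\geq 1$, and that the running time is at most $(\log n+d+t)^{c}$. Then each of the three summands is at most $(\log n)^{\max\{a,b\}}$, so

$$(\log n+d+t)^{c}\leq\big(3(\log n)^{\max\{a,b\}}\big)^{c}=3^{c}\,(\log n)^{c\cdot\max\{a,b\}},$$

which is polylogarithmic in $n$.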

The proof of Theorem 1.1 critically relies on the locality of first-order logic.

We also prove a generalisation of Theorem 1.1 (Theorem 4.6), where instead of insisting on a consistent hypothesis, we compute, within the same polylogarithmic time bound, a hypothesis that minimises the training error. The training error of a concept or hypothesis is the fraction of training examples on which it is wrong. The algorithm we obtain returns a hypothesis $\llbracket\varphi^{*}(\bar{x}\mathbin{;}\bar{v}^{*})\rrbracket^{B}$ for a formula $\varphi^{*}(\bar{x}\mathbin{;}\bar{y})$ of quantifier rank $q^{*}$ bounded in terms of $q,k,\ell$ that matches the training error of the best $\llbracket\varphi(\bar{x}\mathbin{;}\bar{v})\rrbracket^{B}$ for formulas $\varphi$ of quantifier rank $q$ .
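As a small illustration of the two notions used here, the sketch below computes the training error of a hypothesis and checks consistency. The training sequence is the one from Example 1.1. The hypothesis `H` is a made-up labelling chosen only to show a non-zero error, and the function names are assumptions of this sketch.

```python
from typing import Callable, Sequence, Tuple

Instance = Tuple[str, ...]
Example = Tuple[Instance, int]


def training_error(H: Callable[[Instance], int], T: Sequence[Example]) -> float:
    """Fraction of training examples (u, c) in T with H(u) != c."""
    if not T:
        raise ValueError("training sequence must be non-empty")
    wrong = sum(1 for u, c in T if H(u) != c)
    return wrong / len(T)


def is_consistent(H: Callable[[Instance], int], T: Sequence[Example]) -> bool:
    """H is consistent with T iff it agrees with every label in T."""
    return all(H(u) == c for u, c in T)


# Training sequence from Example 1.1 (unary instances, so k = 1).
T = [(("a",), 0), (("b",), 1), (("g",), 0), (("k",), 1)]


def H(u: Instance) -> int:
    # Hypothetical hypothesis: label b and g with 1, everything else with 0.
    return 1 if u[0] in {"b", "g"} else 0


err = training_error(H, T)      # H is wrong on g and on k
consistent = is_consistent(H, T)
```

A hypothesis is consistent with `T` exactly when its training error is zero, so minimising the training error generalises the search for a consistent hypothesis.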

A variant of Theorem 1.1 (Theorem 4.5) for bounded degree background structures even applies to infinite structures. For this to work, we need another level of abstraction in our computation model: we allow it to store an element of the structure in a single memory cell and to access it in a single computation step. We call this the uniform cost measure. This is in line with the standard uniform-cost RAM model that is also underlying the analysis of algorithmic meta theorems for bounded degree graphs [32, 16, 24, 10, 33]. Under the uniform cost measure, we even obtain a learning algorithm (in fact the same algorithm $\mathfrak{L}$ as in Theorem 1.1) running in time $(d+t)^{O(1)}$ and producing hypotheses that can be evaluated in time $d^{O(1)}$ .

Let us briefly discuss the implications of our results in Valiant’s [35] framework of probably approximately correct (PAC) learning. A detailed technical discussion and a precise statement of our results follow in Section 5. The basic assumption of PAC-learning is that there is an unknown probability distribution on the instance space and that instances are drawn independently from this distribution; this applies to the training instances as well as to the new instances that we want to classify with our hypothesis. The generalisation error of a hypothesis is then defined as the probability that the hypothesis is wrong on an instance drawn randomly from the distribution. A PAC-learning algorithm is supposed to generate, with high confidence $1-\delta$, a hypothesis with a small generalisation error $\epsilon$. The confidence is the probability, taken over the randomly chosen training examples, that the algorithm succeeds. The number $t$ of training examples the algorithm has access to is bounded in terms of the error parameter $\epsilon$ and the confidence parameter $\delta$, and the running time is supposed to be polynomial in the size of its input, which in our setting is $t\log n$, or just $t$ under the uniform cost measure. To obtain a PAC-learning algorithm, we usually need to make assumptions about the target concept. We say that a class ${\mathcal{C}}$ of concepts is PAC-learnable if there is a learning algorithm that meets the PAC-criterion whenever the target concept is from ${\mathcal{C}}$, regardless of the probability distribution.
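To give a feeling for how the number of examples depends on $\epsilon$ and $\delta$, here is a minimal sketch based on the standard bound $t\geq\ln(|H|/\delta)/\epsilon$ for a learner that outputs a consistent hypothesis from a finite class of size $|H|$. The function name `sample_size` and the numbers in the example are assumptions made for illustration; the paper's own bounds go through the VC-dimension instead.

```python
import math


def sample_size(num_hypotheses: int, epsilon: float, delta: float) -> int:
    """Smallest integer t with t >= ln(|H| / delta) / epsilon."""
    if num_hypotheses < 1:
        raise ValueError("the hypothesis class must be non-empty")
    if not (0 < epsilon < 1 and 0 < delta < 1):
        raise ValueError("epsilon and delta must lie strictly between 0 and 1")
    return math.ceil(math.log(num_hypotheses / delta) / epsilon)


# Hypothetical numbers: 1000 hypotheses, error 0.1, failure probability 0.05.
t = sample_size(1000, 0.1, 0.05)
```

The bound grows linearly in $1/\epsilon$ but only logarithmically in $1/\delta$ and in the size of the class, which matches the shape of the running time in Corollary 1.2.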

Corollary 1.2.

Let $d,k,\ell,q\in\mathbb{N}$ . For a background structure $B$ of maximum degree at most $d$ , let ${\mathcal{C}}$ be the class of all concepts $\llbracket\varphi(\bar{x}\mathbin{;}\bar{v})\rrbracket^{B}$ , where $\varphi(\bar{x}\mathbin{;}\bar{y})$ is a first-order formula of quantifier rank at most $q$ with $|\bar{x}|=k$ , $|\bar{y}|=\ell$ and $\bar{v}\in U(B)^{\ell}$ .

Then ${\mathcal{C}}$ is PAC-learnable by an algorithm that only has local access to $B$ and runs in time polynomial in $1/\epsilon$ and $\log 1/\delta$ .

Theorem 4.5 is a more detailed statement of this result. In addition to Theorem 1.1 (or Theorem 4.5 for the uniform cost measure), the corollary also relies on a result from [17] stating that first-order definable set families on graphs of bounded degree, such as the family ${\mathcal{C}}$ in the corollary, have bounded VC-dimension, which by a general theorem due to Blumer, Ehrenfeucht, Haussler and Warmuth [4] implies that a number of training examples only depending on $\epsilon$ and $\delta$ is sufficient.

We also prove a PAC-learning result for background structures with polylogarithmic degree (Theorem 5.2) and a generalisation in the so-called “agnostic” PAC-learning framework which deals with target concepts that are random variables (Theorem 5.9).

1.2 Related Work

Closest to our framework is that of inductive logic programming (ILP) (see, for example, [7, 26, 29, 30, 31]). However, there are important differences. First of all, our framework is by no means restricted to first-order logic, and in future work we intend to look at other languages which may be more suitable for expressing concepts relevant in the machine learning context. However, the present paper is only concerned with first-order logic. A second difference that is more significant for this paper is that we represent the background knowledge in a background structure, whereas in ILP it is represented by a background theory. This leads to quite different intuitions. Whereas the scarce positive PAC-learnability results in the ILP framework are mostly obtained by syntactically restricting the formulas defining models (see, however, [20, 19]), our results exploit structural restrictions—small degree—of the background structure.

To the best of our knowledge, the idea of a learning algorithm having only local access to the background structure is new here, and this is precisely what enables the polylogarithmic running time of our algorithms. We are not aware of results from the ILP context that lead to algorithms which are sublinear in the background knowledge. The local access approach seems related to ideas used in property testing on bounded degree graphs (see, for example, [14, 15, 9]). We leave it for future work to explore possible connections.

Our framework of learning definable concepts over a background structure has been considered before by Grohe and Turán [17]. However, the results from [17] are not algorithmic. They only bound the VC-dimension of definable concept classes over certain classes of structures. As mentioned above, we use one of the results of [17] in the proof of Corollary 1.2.

An alternative logical learning framework, also with a strong foundation in descriptive complexity theory, has recently been proposed by Crouch, Immerman, and Moss [8] and Jordan and Kaiser [22] (also see [23, 21]). Here the goal is to learn a logical reduction between structures; instances are pairs of structures. There is other interesting recent work on learning in a logic setting in database theory (for example, [1, 5]) and verification (for example, [27, 13]). While there is no direct technical connection between our work and these research directions, they all seem to be similar in spirit. Exploring the exact technical relations remains future work.

2 Background from Logic

2.1 Structures

In this paper, we only consider relational structures. A relational vocabulary is a finite set $\rho$ of relation symbols, each with a prescribed arity. A $\rho$ -structure $A$ consists of a set $U(A)$ , the universe of $A$ , and for each $k$ -ary $R\in\rho$ a $k$ -ary relation $R(A)\subseteq U(A)^{k}$ . A structure $A$ is finite if its universe $U(A)$ is finite, and the order of $A$ is $|A|:=|U(A)|$ . (For infinite $A$ , we let $|A|:=\infty$ .)

The union of two $\rho$-structures $A,B$ is the $\rho$-structure $A\cup B$ with universe $U(A\cup B):=U(A)\cup U(B)$ and relations $R(A\cup B):=R(A)\cup R(B)$ for all $R\in\rho$. The intersection $A\cap B$ is defined similarly. A substructure of a $\rho$-structure $A$ is a $\rho$-structure $B$ with $U(B)\subseteq U(A)$ and $R(B)\subseteq R(A)$ for all $R\in\rho$. For a subset $V\subseteq U(A)$, the substructure induced by $A$ on $V$ is the structure $A[V]$ with universe $U(A[V])=V$ and $R(A[V]):=R(A)\cap V^{k}$ for each $k$-ary $R\in\rho$.

The Gaifman graph of a $\rho$ -structure $A$ is the graph $G_{A}$ with vertex set $V(G_{A}):=U(A)$ and an edge $uv$ for all $u,v\in U(A)$ such that $u\neq v$ and there is a $k$ -ary relation symbol $R\in\rho$ and a tuple $(v_{1},\ldots,v_{k})\in R(A)$ with $u,v\in\{v_{1},\ldots,v_{k}\}$ . The Gaifman graph allows us to transfer graph theoretic notions from graphs to arbitrary relational structures. In particular, the degree $\deg^{A}(u)$ of an element $u\in U(A)$ is the number of neighbours of $u$ in $G_{A}$ , and the maximum degree $\Delta(A)$ is $\max\{\deg^{A}(u)\mid u\in U(A)\}$ if this maximum exists and $\infty$ if it does not. The distance $\operatorname{dist}^{A}(u,v)$ between two elements $u,v\in U(A)$ in $A$ is the length of the shortest path from $u$ to $v$ in $G_{A}$ , and $\infty$ if there is no path from $u$ to $v$ . If $\operatorname{dist}^{A}(u,v)=1$ then we say that $u$ is a neighbour of $v$ .

The $r$ -neighbourhood $N_{r}^{A}(u)$ of $u$ in $A$ is the set of all vertices of distance at most $r$ from $u$ . For a tuple $\bar{u}=(u_{1},\ldots,u_{k})$ , we let $N_{r}^{A}(\bar{u}):=\bigcup_{i=1}^{k}N^{A}_{r}(u_{i})$ . To avoid cluttering the notation even more, we also use $N_{r}^{A}(\bar{u})$ to denote the induced substructure $A[N_{r}^{A}(\bar{u})]$ .
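The definitions of the Gaifman graph, the maximum degree and the $r$-neighbourhood can be sketched in a few lines of Python. The representation (a dictionary of relations, adjacency sets) and the function names are assumptions of this sketch, and the small structure at the end is hypothetical.

```python
from collections import deque
from itertools import combinations


def gaifman_graph(universe, relations):
    """Adjacency sets: u ~ v iff u != v and both occur in some tuple of some relation."""
    adj = {u: set() for u in universe}
    for tuples in relations.values():
        for tup in tuples:
            # Every pair of distinct elements of a tuple becomes an edge.
            for u, v in combinations(set(tup), 2):
                adj[u].add(v)
                adj[v].add(u)
    return adj


def max_degree(adj):
    """Maximum degree of the Gaifman graph (0 for an empty universe)."""
    return max((len(nb) for nb in adj.values()), default=0)


def r_neighbourhood(adj, sources, r):
    """Elements at distance at most r from some element of `sources` (multi-source BFS)."""
    dist = {s: 0 for s in sources}
    queue = deque(dist)
    while queue:
        u = queue.popleft()
        if dist[u] == r:
            continue  # do not expand beyond radius r
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return set(dist)


# Hypothetical structure with a binary relation E and a ternary relation S.
universe = {1, 2, 3, 4, 5, 6}
relations = {"E": {(1, 2), (2, 3), (3, 4)}, "S": {(4, 5, 6)}}
adj = gaifman_graph(universe, relations)
```

Passing a tuple as `sources` gives the neighbourhood of a tuple, that is, the union of the neighbourhoods of its entries.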

In all these notations we omit the superscript A if the structure $A$ is clear from the context.

Let us briefly review the computation model described in the introduction. We say that an algorithm $\mathfrak{A}$ has local access to a $\rho$-structure $A$ if it can query an oracle in the following two ways.

Relation queries:

Is $(u_{1},\ldots,u_{k})\in R$ ?

Neighbourhood queries:

Return a list of all neighbours for a given $u\in U(A)$ .

A relation query requires constant time. A neighbourhood query requires time proportional to the size of the (representation of the) answer, which is the degree of $u$ times the space required to store a single element of $A$ . With the uniform cost measure, storing an element of $A$ requires constant space, and with the logarithmic cost measure it requires space $O(\log|A|)$ . Unless explicitly stated otherwise, we assume the logarithmic cost measure.
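The local access model can be illustrated with a small oracle wrapper that counts queries. This is only a sketch under the assumption that the structure is held in memory behind the oracle; the class `LocalAccess`, the function `explore` and the path graph used as data are hypothetical.

```python
class LocalAccess:
    """Oracle that exposes a structure only through the two query types."""

    def __init__(self, relations, adjacency):
        self._relations = relations      # hidden from the learner
        self._adjacency = adjacency      # Gaifman graph, also hidden
        self.relation_queries = 0
        self.neighbourhood_queries = 0

    def relation_query(self, name, tup):
        """Is the tuple `tup` in the relation called `name`?"""
        self.relation_queries += 1
        return tuple(tup) in self._relations[name]

    def neighbours(self, u):
        """Return a list of all neighbours of u in the Gaifman graph."""
        self.neighbourhood_queries += 1
        return sorted(self._adjacency[u])


def explore(oracle, start, r):
    """Collect the r-neighbourhood of `start` using neighbourhood queries only."""
    seen = {start}
    frontier = [start]
    for _ in range(r):
        next_frontier = []
        for u in frontier:
            for w in oracle.neighbours(u):
                if w not in seen:
                    seen.add(w)
                    next_frontier.append(w)
        frontier = next_frontier
    return seen


# Hypothetical background structure: a directed path on n vertices.
n = 1000
relations = {"E": {(i, i + 1) for i in range(n - 1)}}
adjacency = {i: {j for j in (i - 1, i + 1) if 0 <= j < n} for i in range(n)}

oracle = LocalAccess(relations, adjacency)
ball = explore(oracle, 0, 3)
```

In this example the exploration issues three neighbourhood queries no matter how large `n` is, which is the kind of behaviour that makes running times sublinear in the size of the structure possible.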

2.2 First-Order Logic

Let us briefly review the definition of first-order logic FO. First-order formulas of vocabulary $\rho$ are formed from atomic formulas $x=y$ and $R(x_{1},\ldots,x_{k})$, where $R\in\rho$ is a $k$-ary relation symbol and $x,y,x_{1},\ldots,x_{k}$ are variables, by the Boolean connectives $\neg$ (negation), $\wedge$ (conjunction), $\vee$ (disjunction), $\to$ (implication) and existential and universal quantification $\exists x,\forall x$, respectively, all with the usual semantics. The set of all first-order formulas of vocabulary $\rho$ is denoted by $\text{FO}[\rho]$. The free variables of a formula are those not in the scope of a quantifier, and we write $\varphi(x_{1},\ldots,x_{k})$ to indicate that the free variables of the formula $\varphi$ are among $x_{1},\ldots,x_{k}$. A sentence is a formula without free variables. We write $A\models\varphi(u_{1},\ldots,u_{k})$ to denote that $A$ satisfies $\varphi$ if $x_{i}$ is interpreted by $u_{i}$. As explained in the introduction, when describing (machine learning) models, we partition the free variables of a formula into instance variables and parameter variables, and we use a semicolon to separate the two parts, as in $\varphi(x_{1},\ldots,x_{k}\mathbin{;}y_{1},\ldots,y_{\ell})$ or $\varphi(\bar{x}\mathbin{;}\bar{y})$. We use the notation $\llbracket\varphi(\bar{x}\mathbin{;}\bar{v})\rrbracket^{B}$ introduced in (1.1) for the instantiation of the model with parameters $\bar{v}$ in a background structure $B$.

The quantifier rank of a first-order formula $\varphi$ is the nesting depth of quantifiers in $\varphi$ .
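To make the definition concrete, here is a minimal sketch that computes the quantifier rank of a formula. The nested-tuple encoding of formulas and the helper `edge` are assumptions made for this illustration only.

```python
def quantifier_rank(phi):
    """Nesting depth of quantifiers in a formula given as nested tuples."""
    op = phi[0]
    if op in ("atom", "eq"):
        return 0
    if op == "not":
        return quantifier_rank(phi[1])
    if op in ("and", "or", "implies"):
        return max(quantifier_rank(phi[1]), quantifier_rank(phi[2]))
    if op in ("exists", "forall"):
        # phi = (quantifier, variable, body)
        return 1 + quantifier_rank(phi[2])
    raise ValueError(f"unknown connective: {op!r}")

def edge(a, b):
    """Atomic formula E(a, b) for a hypothetical binary relation symbol E."""
    return ("atom", "E", (a, b))

# exists z (E(x, z) and forall w E(z, w)): two nested quantifiers.
nested = ("exists", "z", ("and", edge("x", "z"), ("forall", "w", edge("z", "w"))))
# (exists z E(x, z)) and (exists w E(y, w)): quantifiers side by side.
parallel = ("and", ("exists", "z", edge("x", "z")), ("exists", "w", edge("y", "w")))
print(quantifier_rank(nested), quantifier_rank(parallel))
```

The two example formulas contain the same number of quantifiers, but only the nesting depth counts towards the rank.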

2.3 Locality

Let us fix a vocabulary $\rho$ . An $\text{FO}[\rho]$ -formula $\psi(\bar{x})$ is $r$ -local if for all $\rho$ -structures $A$ and all tuples $\bar{u}$ of elements,

$$A\models\psi(\bar{u})\iff N_{r}(\bar{u})\models\psi(\bar{u}).$$

A formula is local if it is $r$ -local for some $r$ . For all $r\geq 0$ there is an $\text{FO}[\rho]$ -formula $\delta_{\leq r}(x,y)$ of quantifier rank $O(\log r)$ stating that the distance between $x$ and $y$ is at most $r$ . We write $\delta_{>r}(x,y)$ instead of $\neg\delta_{\leq r}(x,y)$ . A basic local sentence of radius $r$ is a first-order sentence of the form

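$$\exists x_{1}\ldots\exists x_{k}\Big(\bigwedge_{1\leq i<j\leq k}\delta_{>2r}(x_{i},x_{j})\wedge\bigwedge_{i=1}^{k}\psi(x_{i})\Big),\tag{2.1}$$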

where $\psi$ is $r$ -local.

Theorem 2.1 (Gaifman’s Locality Theorem [12]).

Every first-order formula is equivalent to a Boolean combination of basic local sentences and local formulas.

The notion of locality defined above is semantic, but we also need a syntactic version. The radius- $r$ relativisation of an $\text{FO}[\rho]$ -formula $\varphi(x_{1},\ldots,x_{k})$ is the formula $\varphi_{[\leq r]}(x_{1},\ldots,x_{k})$ obtained from $\varphi$ by replacing each subformula $\exists y\psi$ by the formula $\exists y\left(\bigvee_{i=1}^{k}\delta_{\leq r}(x_{i},y)\wedge\psi\right)$ and every subformula $\forall y\psi$ by $\forall y\left(\bigvee_{i=1}^{k}\delta_{\leq r}(x_{i},y)\to\psi\right).$ Note that for every $\varphi(\bar{x})$ the radius- $r$ relativisation $\varphi_{[\leq r]}(\bar{x})$ is $r$ -local. Moreover, if $\varphi(\bar{x})$ is $r$ -local then $\varphi(\bar{x})$ and $\varphi_{[\leq r]}(\bar{x})$ are equivalent. Note that the transition from $\varphi(\bar{x})$ to $\varphi_{[\leq r]}(\bar{x})$ increases the quantifier rank by a factor of $O(\log r)$ .
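As a small worked example of the relativisation, take a formula with a single free variable $x$ (so $k=1$). The binary relation symbol $E$ is an assumption made for this illustration.

```latex
% Worked example: one free variable x (k = 1), radius r.
% E is a hypothetical binary relation symbol.
\[
\varphi(x) := \exists y\,\big(E(x,y)\wedge\forall z\,\neg E(y,z)\big)
\]
\[
\varphi_{[\leq r]}(x) = \exists y\,\Big(\delta_{\leq r}(x,y)\wedge\big(E(x,y)\wedge\forall z\,(\delta_{\leq r}(x,z)\to\neg E(y,z))\big)\Big)
\]
```

Both quantifiers are guarded by the distance to the free variable $x$, not by the distance to the quantified variable $y$, which is what makes the truth of the relativised formula depend only on $N_{r}(x)$.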

An $\text{FO}[\rho]$ -formula $\varphi(\bar{x})$ is syntactically $r$ -local if it is the radius- $r$ relativisation of some $\text{FO}[\rho]$ -formula $\varphi^{\prime}(\bar{x})$ . Every syntactically $r$ -local formula is $r$ -local, and conversely, every $r$ -local formula of quantifier rank $q$ is equivalent to a syntactically $r$ -local formula of quantifier rank $O(q\cdot\log r)$ (its own radius- $r$ relativisation).

We say that a basic local sentence of the form (2.1) is syntactically basic local if the $r$ -local formula $\psi(x)$ is syntactically $r$ -local. A formula is in Gaifman normal form if it is a Boolean combination of syntactically basic local sentences and syntactically local formulas. The locality radius of a formula $\varphi$ in Gaifman normal form is the least $r$ such that all basic local sentences in $\varphi$ have radius at most $r$ and all local formulas are syntactically $r^{\prime}$ -local for some $r^{\prime}\leq r$ .

2.4 Types

Let $A$ be a $\rho$ -structure and $\bar{u}=(u_{1},\ldots,u_{k})\in U(A)^{k}$ . For every $q\geq 0$ , the (first-order) $q$ -type of $\bar{u}$ in $A$ is the set $\operatorname{tp}_{q}(A,\bar{u})$ of all $\varphi(x_{1},\ldots,x_{k})\in\text{FO}[\rho]$ of quantifier rank at most $q$ such that $A\models\varphi(u_{1},\ldots,u_{k})$ . Types are infinite sets of formulas, but we can syntactically normalise formulas in such a way that there are only finitely many normalised formulas of fixed quantifier rank and with a fixed set of free variables, and that every formula can effectively be transformed into an equivalent normalised formula of the same quantifier rank. We represent a type by the set of normalised formulas it contains.

We need the following Feferman-Vaught style composition lemma [11] (also see [28]). For tuples $\bar{u}=(u_{1},\ldots,u_{k})$ and $\bar{v}=(v_{1},\ldots,v_{\ell})$ , we let $\bar{u}\bar{v}:=(u_{1},\ldots,u_{k},v_{1},\ldots,v_{\ell})$ .

Lemma 2.2 (Composition Lemma [11]).

Let $A,A^{\prime},B,B^{\prime}$ be $\rho$ -structures such that $A\cap B=\emptyset$ and $A^{\prime}\cap B^{\prime}=\emptyset$ . Let $\bar{u}\in U(A)^{k}$ , $\bar{v}\in U(B)^{\ell}$ , $\bar{u}^{\prime}\in U(A^{\prime})^{k}$ , and $\bar{v}^{\prime}\in U(B^{\prime})^{\ell}$ such that $\operatorname{tp}_{q}(A,\bar{u})=\operatorname{tp}_{q}(A^{\prime},\bar{u}^{\prime})$ and $\operatorname{tp}_{q}(B,\bar{v})=\operatorname{tp}_{q}(B^{\prime},\bar{v}^{\prime})$ . Then

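$$\operatorname{tp}_{q}(A\cup B,\bar{u}\bar{v})=\operatorname{tp}_{q}(A^{\prime}\cup B^{\prime},\bar{u}^{\prime}\bar{v}^{\prime}).$$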

We also need a “local” version of types. Let $A$ be a $\rho$ -structure and $\bar{u}\in U(A)^{k}$ , and let $q,r\geq 0$ . The local $(q,r)$ -type of $\bar{u}$ in $A$ is the set $\operatorname{ltp}_{q,r}(A,\bar{u})$ of all syntactically $r$ -local $\varphi(\bar{x})\in\text{FO}[\rho]$ of quantifier rank at most $q$ such that $A\models\varphi(\bar{u})$ , or equivalently, $N_{r}^{A}(\bar{u})\models\varphi(\bar{u})$ .

As a corollary to Lemma 2.2, we obtain the following composition lemma for local types.

Corollary 2.3 (Local Composition Lemma).

Let $A,A^{\prime}$ be $\rho$ -structures, $\bar{u}\in U(A)^{k}$ , $\bar{v}\in U(A)^{\ell}$ , $\bar{u}^{\prime}\in U(A^{\prime})^{k}$ , and $\bar{v}^{\prime}\in U(A^{\prime})^{\ell}$ such that $N_{r}(\bar{u})\cap N_{r}(\bar{v})=\emptyset$ and $N_{r}(\bar{u}^{\prime})\cap N_{r}(\bar{v}^{\prime})=\emptyset$ and $\operatorname{ltp}_{q,r}(A,\bar{u})=\operatorname{ltp}_{q,r}(A^{\prime},\bar{u}^{\prime})$ and $\operatorname{ltp}_{q,r}(A,\bar{v})=\operatorname{ltp}_{q,r}(A^{\prime},\bar{v}^{\prime})$ . Then

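$$\operatorname{ltp}_{q,r}(A,\bar{u}\bar{v})=\operatorname{ltp}_{q,r}(A^{\prime},\bar{u}^{\prime}\bar{v}^{\prime}).$$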

3 Parameter Learning

Recall the two different modes of learning we described in the introduction: parameter learning and model learning. In our context, for the parameter learning problem we assume that we have a fixed $\text{FO}[\rho]$ -formula $\varphi(\bar{x}\mathbin{;}\bar{y})$ . The input to our learning algorithm is a sequence $T$ of training examples over some background structure $B$ to which we have local access. Our goal is to find a tuple $\bar{v}$ such that $\llbracket\varphi(\bar{x}\mathbin{;}\bar{v})\rrbracket^{B}$ is consistent with $T$ (or at least approximately consistent).

The following simple example shows that parameter learning requires reading the whole background structure and thus is not possible in sublinear time, whereas our model learning algorithms only need polylogarithmic time.

Example 3.1.

Let $P$ be a unary relation symbol, and let $\varphi(x\mathbin{;}y):=P(y)$ . Then for every $\{P\}$ -structure $B$ and every $v\in U(B)$ , the function $\llbracket\varphi(x\mathbin{;}v)\rrbracket^{B}$ is constant $1$ if $v\in P(B)$ and constant $0$ otherwise.

Now suppose that the background structure $B$ is such that $P(B)=\{v^{*}\}$ for some $v^{*}\in U(B)$ , and the unknown target concept is $\llbracket\varphi(x\mathbin{;}v^{*})\rrbracket^{B}$ , that is, constant $1$ . Then our learning algorithm receives only positive training examples, and it needs to find the parameter $v^{*}$ . However, unless $v^{*}$ happens to be one of the training examples, this requires reading the whole structure $B$ in the worst case. If the algorithm only has local access to $B$ , it is actually impossible to find $v^{*}$ , because the graph $G_{B}$ is the trivial graph with vertex set $U(B)$ and empty edge set.

Thus parameter learning is not possible in our setting with only local access to the background structure. However, there is an intermediate mode of learning between parameter and model learning, where we assume that we know the formula $\varphi(\bar{x}\mathbin{;}\bar{y})$ defining the target concept, but are still allowed to modify it when formulating our hypothesis. This is potentially much easier than constructing the formula from scratch, as we are required to do in the model learning setting.

For instance, in Example 3.1, looking at $\varphi$ we immediately know that the target concept is constant, and just from one labelled example we know if it is $0$ or $1$ . Then we can either return the universally true formula $\varphi^{\prime}(x\mathbin{;}):=(x=x)$ or the universally false formula $\varphi^{\prime\prime}(x\mathbin{;}):=(\neg x=x)$ as our hypothesis (we do not even need a parameter here).

4 Model Learning

In this section, we look at the model learning problem for first-order logic over structures of small degree. We prove Theorem 1.1 and several variants and generalisations of it.

4.1 Consistent Hypotheses

We start by proving Theorem 1.1.

Throughout this section, we fix $k,\ell,q\in\mathbb{N}$ and a vocabulary $\rho$ . Let $q^{*},r^{*}\in\mathbb{N}$ be such that every $\text{FO}[\rho]$ -formula $\varphi$ with at most $k+\ell$ free variables and quantifier rank at most $q$ is equivalent to a formula $\varphi^{*}$ in Gaifman normal form of locality radius at most $r^{*}$ such that every syntactically local formula in the Boolean combination $\varphi^{*}$ has quantifier rank at most $q^{*}$ . Such $q^{*},r^{*}$ exist by Gaifman’s theorem and because up to logical equivalence there only exist finitely many $\text{FO}[\rho]$ -formulas of quantifier rank at most $q$ with at most $k+\ell$ free variables.

Lemma 4.1.

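Let $B$ be a $\rho$ -structure, and let $\psi(\bar{z})$ be an $\text{FO}[\rho]$ -formula of quantifier rank at most $q$ and with $m:=|\bar{z}|\leq k+\ell$ . Then for all $\bar{w},\bar{w}^{\prime}\in U(B)^{m}$ ,

$$\operatorname{ltp}_{q^{*},r^{*}}(B,\bar{w})=\operatorname{ltp}_{q^{*},r^{*}}(B,\bar{w}^{\prime})\implies\big(B\models\psi(\bar{w})\iff B\models\psi(\bar{w}^{\prime})\big).$$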

Proof.

It follows from Gaifman’s Locality Theorem and the choice of $q^{*},r^{*}$ that $\big{(}B\models\psi(\bar{w})\iff B\models\psi(\bar{w}^{\prime})\big{)}$ if $\big{(}B\models\chi(\bar{w})\iff B\models\chi(\bar{w}^{\prime})\big{)}$ for all syntactically $r^{*}$ -local formulas $\chi(\bar{z})$ of quantifier rank at most $q^{*}$ . If $\operatorname{ltp}_{q^{*},r^{*}}(B,\bar{w})=\operatorname{ltp}_{q^{*},r^{*}}(B,\bar{w}^{\prime})$ then the latter equivalence holds. ∎

To simplify the presentation, we fix a background structure $B$ . We also fix the length $t\geq 1$ of our training sequences. Of course our algorithm will neither depend on the specific structure $B$ nor on $t$ .

We fix a tuple $\bar{x}=(x_{1},\ldots,x_{k})$ of instance variables and a tuple $\bar{y}=(y_{1},\ldots,y_{\ell})$ of parameter variables. We let $\Phi$ be the set of all $\text{FO}[\rho]$ -formulas $\varphi(\bar{x}\mathbin{;}\bar{y})$ of quantifier rank at most $q$ and

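$${\mathcal{C}}:=\big\{\llbracket\varphi(\bar{x}\mathbin{;}\bar{v})\rrbracket^{B}\bigm|\varphi(\bar{x}\mathbin{;}\bar{y})\in\Phi,\ \bar{v}\in U(B)^{\ell}\big\}.$$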

Similarly, we let $\Phi^{*}$ be the set of all syntactically $r^{*}$ -local $\text{FO}[\rho]$ -formulas $\varphi^{*}(\bar{x}\mathbin{;}\bar{y})$ of quantifier rank at most $q^{*}$ and

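$${\mathcal{C}}^{*}:=\big\{\llbracket\varphi^{*}(\bar{x}\mathbin{;}\bar{v}^{*})\rrbracket^{B}\bigm|\varphi^{*}(\bar{x}\mathbin{;}\bar{y})\in\Phi^{*},\ \bar{v}^{*}\in U(B)^{\ell}\big\}.$$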

Moreover, we let ${\mathcal{T}}:=\big{(}U(B)^{k}\times\{0,1\}\big{)}^{t}$ be the set of all training sequences of length $t$ for the $k$ -ary learning problem over $B$ . For every $T=\big{(}(\bar{u}_{1},c_{1}),\ldots,(\bar{u}_{t},c_{t})\big{)}\in{\mathcal{T}}$ and $r\in\mathbb{N}$ we let

$$N_{r}(T):=\bigcup_{i=1}^{t}N_{r}(\bar{u}_{i}).$$

Recall that $T\in{\mathcal{T}}$ is consistent with $C\subseteq U(B)^{k}$ if for all $(\bar{u},c)\in T$ we have $C(\bar{u})=c$ .

What we need to prove is that for all $T\in{\mathcal{T}}$ , if there is a $C\in{\mathcal{C}}$ that is consistent with $T$ then we can find a $C^{*}\in{\mathcal{C}}^{*}$ consistent with $T$ within the time bounds specified in Theorem 1.1.

The following lemma is the crucial step in our proof.

Lemma 4.2.

Let $T\in{\mathcal{T}}$ be consistent with some $C\in{\mathcal{C}}$ . Then there is a formula $\varphi^{*}(\bar{x}\mathbin{;}\bar{y})\in\Phi^{*}$ and a tuple $\bar{v}^{*}\in N_{2\ell r^{*}}(T)^{\ell}$ such that $\llbracket\varphi^{*}(\bar{x}\mathbin{;}\bar{v}^{*})\rrbracket^{B}$ is consistent with $T$ .

Proof.

Let $T=\big{(}(\bar{u}_{1},c_{1}),\ldots,(\bar{u}_{t},c_{t})\big{)}$ . Let $\varphi(\bar{x}\mathbin{;}\bar{y})\in\Phi$ and $\bar{v}=(v_{1},\ldots,v_{\ell})\in U(B)^{\ell}$ such that $C=\llbracket\varphi(\bar{x}\mathbin{;}\bar{v})\rrbracket^{B}$ is consistent with $T$ .

For some $m\leq\ell$ , we define $v^{(1)},\ldots,v^{(m)}\in\{v_{1},\ldots,v_{\ell}\}$ and $N^{(0)},N^{(1)},\ldots,N^{(m)}\subseteq U(B)$ as follows: we let $N^{(0)}:=N_{r^{*}}(T)$ . Now suppose that $N^{(i)}$ is already defined. If there is a $v\in\{v_{1},\ldots,v_{\ell}\}\setminus\{v^{(1)},\ldots,v^{(i)}\}$ such that $\operatorname{dist}^{B}(v,N^{(i)})\leq r^{*}$ , then we pick such a $v$ (arbitrarily if there is more than one) and let $v^{(i+1)}:=v$ and $N^{(i+1)}:=N^{(i)}\cup N_{r^{*}}(v^{(i+1)})$ . If there is no such $v$ , we let $m:=i$ and stop the construction.
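The construction above is part of the analysis (the learner never sees $\bar{v}$ ), but it can be sketched as a short greedy procedure. The adjacency dictionary, the helper `ball` and the function name `split_parameters` are assumptions made for this illustration.

```python
from collections import deque

def ball(adj, sources, r):
    """Set of elements at distance at most r from some source."""
    dist = {u: 0 for u in sources}
    queue = deque(dist)
    while queue:
        u = queue.popleft()
        if dist[u] == r:
            continue
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return set(dist)

def split_parameters(adj, example_elements, params, r):
    """Greedily absorb parameters that are close to the growing region.

    Returns (near, far, region): the absorbed parameters in the order
    they were picked, the remaining parameters, and the final region.
    """
    region = ball(adj, example_elements, r)  # N^(0) = N_r(T)
    near, far = [], list(params)
    while True:
        # dist(v, region) is at most r exactly when the r-ball of v meets region.
        pick = next(
            (v for v in far if not ball(adj, [v], r).isdisjoint(region)), None
        )
        if pick is None:
            return near, far, region
        far.remove(pick)
        near.append(pick)
        region |= ball(adj, [pick], r)

# Toy example: the path 0 - 1 - ... - 9, one example element 0, radius 1.
adj = {i: [j for j in (i - 1, i + 1) if 0 <= j <= 9] for i in range(10)}
near, far, region = split_parameters(adj, [0], [2, 9, 4], 1)
print(near, far, sorted(region))
```

In the toy run, parameter 4 only becomes close after parameter 2 has been absorbed, which is why the region has to be grown step by step.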

We let $N^{\circ}:=N^{(m)}$ . To simplify the notation, we further assume (without loss of generality) that $v^{(i)}=v_{i}$ for all $i\in[m]$ . We let $\bar{v}^{\circ}:=(v_{1},\ldots,v_{m})$ and $\bar{v}^{\bullet}:=(v_{m+1},\ldots,v_{\ell})$ . Possibly, $\bar{v}^{\circ}$ or $\bar{v}^{\bullet}$ is the empty tuple. Observe that $\bar{v}^{\circ}\in N_{2\ell r^{*}}(T)^{m}$ and

[TABLE]

and

[TABLE]

Furthermore,

[TABLE]

Claim 1.

Let $i,j\in[t]$ such that

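$$\operatorname{ltp}_{q^{*},r^{*}}(B,\bar{u}_{i}\bar{v}^{\circ})=\operatorname{ltp}_{q^{*},r^{*}}(B,\bar{u}_{j}\bar{v}^{\circ}).$$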

Then $c_{i}=c_{j}$ .

Now let $P\subseteq[t]$ be the set of indices of the positive examples, that is, $P:=\{p\in[t]\mid c_{p}=1\}$ . For every $p\in P$ , let $\vartheta_{p}(\bar{x},\bar{y}^{\circ})$ , where $\bar{y}^{\circ}:=(y_{1},\ldots,y_{m})$ , be the conjunction of all normalised formulas in the type $\operatorname{ltp}_{q^{*},r^{*}}(B,\bar{u}_{p}\bar{v}^{\circ})$ . Then for all $\bar{u}^{\prime}\in U(B)^{k},\bar{v}^{\prime}\in U(B)^{m}$ we have

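$$B\models\vartheta_{p}(\bar{u}^{\prime},\bar{v}^{\prime})\iff\operatorname{ltp}_{q^{*},r^{*}}(B,\bar{u}^{\prime}\bar{v}^{\prime})=\operatorname{ltp}_{q^{*},r^{*}}(B,\bar{u}_{p}\bar{v}^{\circ}).\tag{4.3}$$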

Now we let $\varphi^{\circ}(\bar{x}\mathbin{;}\bar{y}^{\circ}):=\bigvee_{p\in P}\vartheta_{p}(\bar{x},\bar{y}^{\circ})$ . Then it follows from Claim 1 and (4.3) that $\llbracket\varphi^{\circ}(\bar{x}\mathbin{;}\bar{v}^{\circ})\rrbracket^{B}$ is consistent with $T$ . Furthermore, all the $\vartheta_{p}$ and hence $\varphi^{\circ}$ are syntactically $r^{*}$ -local of quantifier rank at most $q^{*}$ .

It remains to transform $\varphi^{\circ}(\bar{x}\mathbin{;}\bar{y}^{\circ})$ into a formula $\varphi^{*}(\bar{x}\mathbin{;}\bar{y})$ with the right number of parameter variables. We simply do this by adding redundant variables, but we have to be careful that the resulting formula is still syntactically $r^{*}$ -local. Since $\varphi^{\circ}(\bar{x}\mathbin{;}\bar{y}^{\circ})$ is syntactically $r^{*}$ -local, all its quantifiers are relativised to the $r^{*}$ -neighbourhood of the free variables, that is, of the form

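$$\exists z\Big(\Big(\bigvee_{i=1}^{k}\delta_{\leq r^{*}}(x_{i},z)\vee\bigvee_{j=1}^{m}\delta_{\leq r^{*}}(y_{j},z)\Big)\wedge\psi\Big)\tag{4.4}$$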

or

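$$\forall z\Big(\Big(\bigvee_{i=1}^{k}\delta_{\leq r^{*}}(x_{i},z)\vee\bigvee_{j=1}^{m}\delta_{\leq r^{*}}(y_{j},z)\Big)\to\psi\Big).\tag{4.5}$$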

To obtain $\varphi^{*}(\bar{x}\mathbin{;}\bar{y})$ from $\varphi^{\circ}(\bar{x}\mathbin{;}\bar{y}^{\circ})$ , we replace (4.4) by

[TABLE]

and similarly for the universal quantifier in (4.5). Then $\varphi^{*}(\bar{x}\mathbin{;}\bar{y})$ is syntactically $r^{*}$ -local and has the same quantifier rank as $\varphi^{\circ}(\bar{x}\mathbin{;}\bar{y}^{\circ})$ . Hence $\varphi^{*}(\bar{x}\mathbin{;}\bar{y})\in\Phi^{*}$ . Moreover, for all $\bar{u}^{\prime}\in U(B)^{k}$ and all $\bar{v}^{\prime}=(v^{\prime}_{1},\ldots,v^{\prime}_{\ell})\in U(B)^{\ell}$ we have

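$$B\models\varphi^{*}(\bar{u}^{\prime}\mathbin{;}\bar{v}^{\prime})\iff B\models\varphi^{\circ}(\bar{u}^{\prime}\mathbin{;}v^{\prime}_{1},\ldots,v^{\prime}_{m}).$$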

We choose an arbitrary $v\in N_{2\ell r^{*}}(T)$ and let $\bar{v}^{*}=(v_{1},\ldots,v_{m},\overbrace{v,\ldots,v}^{(\ell-m)\text{ times}})$ . Then for all $i\in[t]$ we have

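$$B\models\varphi^{*}(\bar{u}_{i}\mathbin{;}\bar{v}^{*})\iff B\models\varphi^{\circ}(\bar{u}_{i}\mathbin{;}\bar{v}^{\circ})\iff c_{i}=1.$$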

Hence $\llbracket\varphi^{*}(\bar{x}\mathbin{;}\bar{v}^{*})\rrbracket^{B}$ is consistent with $T$ .

Proof 4.4 (Proof of Theorem 1.1).

The pseudocode for our learning algorithm $\mathfrak{L}$ is shown in Figure 2. The algorithm proceeds by brute force: it goes through all formulas $\varphi^{*}\in\Phi^{*}$ and all tuples $\bar{v}^{*}\in N_{2\ell r^{*}}(T)^{\ell}$ and checks, in lines 4–7, if $\llbracket\varphi^{*}(\bar{x}\mathbin{;}\bar{v}^{*})\rrbracket^{B}$ is consistent with $T$ . If it is, the algorithm returns $\varphi^{*},\bar{v}^{*}$ , otherwise it proceeds to the next $\varphi^{*},\bar{v}^{*}$ . If it does not find any consistent $\varphi^{*},\bar{v}^{*}$ , it rejects. To see that the consistency test in lines 4–7 is correct, note that

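$$\llbracket\varphi^{*}(\bar{x}\mathbin{;}\bar{v}^{*})\rrbracket^{B}(\bar{u})=1\iff B\models\varphi^{*}(\bar{u}\mathbin{;}\bar{v}^{*})\iff N_{r^{*}}(\bar{u}\bar{v}^{*})\models\varphi^{*}(\bar{u}\mathbin{;}\bar{v}^{*})$$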

because $\varphi^{*}$ is $r^{*}$ -local.

Hence the algorithm is correct, that is, satisfies conditions (1) and (2) of Theorem 1.1: it obviously satisfies (1), and it follows from Lemma 4.2 that it satisfies (2).

Note that the set $N$ (in line 1) can be computed from $T$ with only local access to $B$ .
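The brute-force search described above can be sketched as follows. This is only an illustration under strong simplifying assumptions: formulas are modelled as Python predicates taking an instance tuple and a parameter tuple, `candidates` stands for the set $N_{2\ell r^{*}}(T)$ , and the toy structure and the names `equals` and `adjacent` are hypothetical.

```python
from itertools import product

def consistent_learner(formulas, candidates, ell, training):
    """Return the first (formula, parameters) pair consistent with `training`,
    or None if no pair is consistent (the 'reject' case)."""
    for phi in formulas:
        for params in product(candidates, repeat=ell):
            if all(int(phi(u, params)) == c for u, c in training):
                return phi, params
    return None

# Toy background structure: the path 0 - 1 - 2 - 3 - 4 - 5.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}

def equals(u, v):
    """Stands for the formula x = y."""
    return u[0] == v[0]

def adjacent(u, v):
    """Stands for the formula E(x, y)."""
    return u[0] in adj[v[0]]

# Examples labelled by "x is a neighbour of 2".
training = [((1,), 1), ((3,), 1), ((4,), 0)]
result = consistent_learner([equals, adjacent], range(6), 1, training)
print(result[0].__name__, result[1])
```

The sketch evaluates each predicate directly; in the setting of the proof, the evaluation of $\varphi^{*}$ would be restricted to the substructure induced by $N_{r^{*}}(\bar{u}\bar{v}^{*})$ .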

To analyse the running time of $\mathfrak{L}$ , let $n:=|B|$ and $d:=|\Delta(B)|$ and $t:=|T|$ . Note that for all $\bar{u}\in U(B)^{k},\bar{v}^{*}\in U(B)^{\ell}$ we have

[TABLE]

Thus the representation size of the substructure $N_{r^{*}}(\bar{u}\bar{v}^{*})$ of $B$ is $O((k+\ell)\cdot d^{r^{*}}\cdot\log n)$ , which is $(\log n+d)^{O(1)}$ if we treat $k,\ell,r^{*}$ as constants. It requires time polynomial in the size of $N_{r^{*}}(\bar{u}\bar{v}^{*})$ to test if the structure satisfies $\varphi^{*}$ . Thus the overall running time of lines 4–7 is $(\log n+d)^{O(1)}\cdot t$ . We have
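For orientation, a standard counting argument bounds the number of elements in such a neighbourhood. The sketch below assumes that every element has at most $d\geq 1$ neighbours and that $\bar{u}\bar{v}^{*}$ has $k+\ell$ entries; the constants are not optimised and need not match those used in the analysis above.

```latex
% Counting bound, assuming maximum degree d >= 1.
% w is a single element; \bar{u}\bar{v}^{*} has k + \ell entries.
\begin{align*}
|N_{r}(w)| &\le \sum_{i=0}^{r} d^{i} \le (r+1)\cdot d^{r},\\
|N_{r^{*}}(\bar{u}\bar{v}^{*})| &\le (k+\ell)\cdot\sum_{i=0}^{r^{*}} d^{i} \le (k+\ell)\cdot(r^{*}+1)\cdot d^{r^{*}}.
\end{align*}
```

With $k,\ell,r^{*}$ treated as constants, the number of elements is polynomial in $d$ , and under the logarithmic cost measure each element contributes a factor of $O(\log n)$ to the representation size.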

[TABLE]

and $|\Phi^{*}|=O(1)$ . Hence the two outer loops add a factor of $(t+d)^{O(1)}$ , and the overall runtime is

$$(\log n+d+t)^{O(1)}.$$

This proves Theorem 1.1(3).

Finally, any hypothesis $\llbracket\varphi^{*}(\bar{x}\mathbin{;}\bar{v}^{*})\rrbracket^{B}$ returned by $\mathfrak{L}$ can be evaluated in time $(\log n+d)^{O(1)}$ with only local access to $B$ , because the formula $\varphi^{*}$ is $r^{*}$ -local, and thus to compute $\llbracket\varphi^{*}(\bar{x}\mathbin{;}\bar{v}^{*})\rrbracket^{B}(\bar{u})$ we only need to look at the substructure $N_{r^{*}}(\bar{u}\bar{v}^{*})$ of $B$ . This proves Theorem 1.1(4).

Let us now analyse the algorithm under the uniform cost measure. Everything remains unchanged, except that the log-factors in the running time disappear. One advantage of the uniform cost model is that we can even apply it to infinite background structures $B$ . We obtain the following theorem.

Theorem 4.5.

Let $k,\ell,q\in\mathbb{N}$ . Then there is a $q^{*}\in\mathbb{N}$ and a learning algorithm $\mathfrak{L}$ for the $k$ -ary learning problem over some (possibly infinite) background structure $B$ with properties 1) and 2) of Theorem 1.1 and the following two properties.

3u) The algorithm runs in time $(d+t)^{O(1)}$ under the uniform cost measure with only local access to $B$ , where $d:=|\Delta(B)|$ and $t:=|T|$ .

4u) The hypotheses returned by $\mathfrak{L}$ can be evaluated in time $d^{O(1)}$ under the uniform cost measure with only local access to $B$ .

4.2 Minimising the Training Error

We continue to work in the same framework as before, that is, we consider the $k$ -ary learning problem over a background structure $B$ . Let $T=\big{(}(\bar{u}_{1},c_{1}),\ldots,(\bar{u}_{t},c_{t})\big{)}\in\big{(}U(B)^{k}\times\{0,1\}\big{)}^{t}$ be a training sequence. The training error of a hypothesis $H:U(B)^{k}\to\{0,1\}$ on $T$ is the fraction of examples on which $H$ is wrong, that is,

$$\operatorname{err}_{T}(H):=\frac{1}{t}\cdot\big|\{i\in[t]\mid H(\bar{u}_{i})\neq c_{i}\}\big|.$$
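As a quick illustration of the definition, the training error is simply the fraction of misclassified examples. The function name `training_error` and the toy hypothesis are assumptions made for this sketch.

```python
def training_error(hypothesis, training):
    """Fraction of training examples (u, c) with hypothesis(u) != c."""
    wrong = sum(1 for u, c in training if hypothesis(u) != c)
    return wrong / len(training)

def is_even(u):
    """Toy hypothesis on 1-tuples of integers: 1 if the entry is even, else 0."""
    return int(u[0] % 2 == 0)

# The third example is labelled against the hypothesis.
training = [((0,), 1), ((1,), 0), ((2,), 0), ((3,), 0)]
print(training_error(is_even, training))
```

A hypothesis is consistent with the training sequence exactly when this value is zero.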

The next theorem is a generalisation of Theorem 1.1 where, instead of insisting on a consistent hypothesis, we try to find a hypothesis with minimal training error.

Theorem 4.6.

Let $k,\ell,q\in\mathbb{N}$ . Then there is a $q^{*}\in\mathbb{N}$ and a learning algorithm $\mathfrak{M}$ for the $k$ -ary learning problem over some finite background structure $B$ with the following properties.

1) $\mathfrak{M}$ always returns a hypothesis $H=\llbracket\varphi^{*}(\bar{x}\mathbin{;}\bar{v}^{*})\rrbracket^{B}$ for some $r^{*}$ -local first-order formula $\varphi^{*}(\bar{x}\mathbin{;}\bar{y})$ of quantifier rank $q^{*}$ and $\bar{v}^{*}\in U(B)^{\ell}$ .

2) If there is a first-order formula $\varphi(\bar{x}\mathbin{;}\bar{y})$ of quantifier rank $q$ and some tuple $\bar{v}\in U(B)^{\ell}$ of parameters such that $\operatorname{err}_{T}(\llbracket\varphi(\bar{x}\mathbin{;}\bar{v})\rrbracket^{B})\leq\epsilon$ , where $T$ is the input sequence, then $\operatorname{err}_{T}(H)\leq\epsilon$ for the hypothesis $H$ returned by $\mathfrak{M}$ on input $T$ .

3) The algorithm runs in time $(\log n+d+t)^{O(1)}$ with only local access to $B$ , where $n:=|U(B)|$ and $d:=|\Delta(B)|$ and $t:=|T|$ .

4) The hypotheses returned by $\mathfrak{M}$ can be evaluated in time $(\log n+d)^{O(1)}$ with only local access to $B$ .

To prove the theorem, we use the same notation as in the previous section: we fix $\rho,k,\ell,q$ and define $q^{*},r^{*}$ as before. We let $B$ be a $\rho$ -structure. We define $\Phi,\Phi^{*}$ and ${\mathcal{C}},{\mathcal{C}}^{*}$ as before. We let $t\geq 1$ and ${\mathcal{T}}:=\big{(}U(B)^{k}\times\{0,1\}\big{)}^{t}$ .

The proof of Theorem 4.6 relies on the following generalisation of Lemma 4.2.

Lemma 4.7.

Let $T\in{\mathcal{T}}$ such that $\operatorname{err}_{T}(C)\leq\epsilon$ for some $C\in{\mathcal{C}}$ . Then there is a formula $\varphi^{*}(\bar{x}\mathbin{;}\bar{y})\in\Phi^{*}$ and a tuple $\bar{v}^{*}\in N_{2\ell r^{*}}(T)^{\ell}$ such that

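$$\operatorname{err}_{T}\big(\llbracket\varphi^{*}(\bar{x}\mathbin{;}\bar{v}^{*})\rrbracket^{B}\big)\leq\epsilon.$$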

Proof 4.8.

Let $T=\big{(}(\bar{u}_{1},c_{1}),\ldots,(\bar{u}_{t},c_{t})\big{)}$ . Let $\varphi(\bar{x}\mathbin{;}\bar{y})\in\Phi$ and $\bar{v}=(v_{1},\ldots,v_{\ell})\in U(B)^{\ell}$ such that $\operatorname{err}_{T}\big{(}\llbracket\varphi(\bar{x}\mathbin{;}\bar{v})\rrbracket^{B}\big{)}\leq\epsilon$ . Then there exists a subsequence $S$ of $T$ such that $|S|\geq(1-\epsilon)\cdot t$ and $\llbracket\varphi(\bar{x}\mathbin{;}\bar{v})\rrbracket^{B}$ is consistent with $S$ .

By Lemma 4.2, there is a formula $\varphi^{*}(\bar{x}\mathbin{;}\bar{y})\in\Phi^{*}$ and a tuple $\bar{v}^{*}\in N_{2\ell r^{*}}(S)^{\ell}$ such that $\llbracket\varphi^{*}(\bar{x}\mathbin{;}\bar{v}^{*})\rrbracket^{B}$ is consistent with $S$ . Then $\operatorname{err}_{T}\big{(}\llbracket\varphi^{*}(\bar{x}\mathbin{;}\bar{v}^{*})\rrbracket^{B}\big{)}\leq\epsilon$ .

Proof 4.9 (Proof of Theorem 4.6).

The pseudocode for our learning algorithm $\mathfrak{M}$ is shown in Figure 2. The algorithm is very similar to the algorithm $\mathfrak{L}$ of Theorem 1.1, except that we do not check for consistency, but count the errors of all hypotheses and return the one with minimum error. The runtime of $\mathfrak{M}$ is essentially the same as that of $\mathfrak{L}$ .
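The change from a consistency check to error counting can be sketched as below. As before, this is an illustration under simplifying assumptions: formulas are modelled as Python predicates, `candidates` stands for $N_{2\ell r^{*}}(T)$ and is assumed to be non-empty, and the toy structure and the name `adjacent` are hypothetical.

```python
from itertools import product

def min_error_learner(formulas, candidates, ell, training):
    """Return (formula, parameters, training error) with minimum training error.

    Assumes that `formulas` and `candidates` are non-empty.
    """
    best = None
    for phi in formulas:
        for params in product(candidates, repeat=ell):
            errors = sum(1 for u, c in training if int(phi(u, params)) != c)
            if best is None or best[0] > errors:
                best = (errors, phi, params)
    errors, phi, params = best
    return phi, params, errors / len(training)

# Toy background structure: the path 0 - 1 - 2 - 3 - 4 - 5.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}

def adjacent(u, v):
    """Stands for the formula E(x, y)."""
    return u[0] in adj[v[0]]

# "x is a neighbour of 2", with one noisy label on the last example.
training = [((1,), 1), ((3,), 1), ((4,), 0), ((5,), 1)]
phi, params, err = min_error_learner([adjacent], range(6), 1, training)
print(phi.__name__, params, err)
```

No hypothesis in this toy class is consistent with the noisy sequence, so a consistency-based learner would reject, whereas the error-minimising variant still returns a hypothesis.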

There is an analogous generalisation of Theorem 4.5.

Theorem 4.10.

Let $k,\ell,q\in\mathbb{N}$ . Then there is a $q^{*}\in\mathbb{N}$ and a learning algorithm $\mathfrak{M}$ for the $k$ -ary learning problem over some (possibly infinite) background structure $B$ with properties 1) and 2) of Theorem 4.6 and the following two properties.

3u) The algorithm runs in time $(d+t)^{O(1)}$ under the uniform cost measure with only local access to $B$ , where $d:=|\Delta(B)|$ and $t:=|T|$ .

4u) The hypotheses returned by $\mathfrak{M}$ can be evaluated in time $d^{O(1)}$ under the uniform cost measure with only local access to $B$ .

Proof 4.11.

The proof is the same as that of Theorem 4.6, except that in the analysis of the algorithm the log-factor disappears because of the uniform cost measure.

5 PAC Learning

In this section, we sketch some of the basic principles of algorithmic learning theory and show how they apply in our context. For more background, we refer the reader to [3, 25, 34].

So far, we have focussed on the training error of our learning algorithms. But of course that is not the error we are mainly interested in; our goal is to generate hypotheses with a low generalisation error. Probably approximately correct (PAC) learning gives us a framework for analysing the generalisation error theoretically.

Consider a learning problem with instance space ${\mathbb{U}}$ where we want to learn an unknown target concept $C^{*}:{\mathbb{U}}\to\{0,1\}$ . As before, we are mainly interested in the case that ${\mathbb{U}}=U(B)^{k}$ for some background structure $B$ and that $C^{*}=\llbracket\varphi(\bar{x}\mathbin{;}\bar{v})\rrbracket^{B}$ for some first-order formula $\varphi(\bar{x}\mathbin{;}\bar{y})$ and parameter tuple $\bar{v}\in U(B)^{\ell}$ . The basic assumption of PAC learning is that there is an (unknown) probability distribution ${\mathcal{D}}$ on the instance space ${{\mathbb{U}}}$ and that instances are drawn independently from this distribution; this applies to the training instances as well as to the new instances that we want to classify with our hypothesis. We define the generalisation error of a hypothesis $H$ to be the probability that $H$ is wrong on a random instance, that is,

$$\operatorname{err}_{{\mathcal{D}}}(H):=\Pr_{u\sim{\mathcal{D}}}\big(H(u)\neq C^{*}(u)\big).\tag{5.1}$$

We allow our algorithm to make a small generalisation error controlled by the error parameter $\epsilon$ . As the hypothesis $H$ depends on the randomly chosen training examples, we must allow for an error caused by unusually bad examples as well. We usually quantify this second type of error by the confidence parameter $\delta$ . Our goal is to generate a hypothesis $H=H(T,\epsilon,\delta)$ , which of course depends on the training sequence $T\in({\mathbb{U}}\times\{0,1\})^{t}$ and may also depend on $\epsilon$ and $\delta$ , such that

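$$\Pr_{T\sim{\mathcal{D}}}\big(\operatorname{err}_{{\mathcal{D}}}(H(T,\epsilon,\delta))\leq\epsilon\big)\geq 1-\delta.\tag{5.2}$$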

Here $T\sim{\mathcal{D}}$ indicates that the training examples are drawn independently from ${\mathcal{D}}$ . Intuitively, a hypothesis satisfying (5.2) is probably (referring to the high confidence of at least $1-\delta$ ) approximately (referring to the low error of at most $\epsilon$ ) correct.

Ideally, we would like a learning algorithm $\mathfrak{L}$ that, for all target concepts $C^{*}\subseteq{\mathbb{U}}$ , all probability distributions ${\mathcal{D}}$ on ${\mathbb{U}}$ , and all $\epsilon,\delta>0$ , generates hypotheses that are probably approximately correct. This is something we usually cannot achieve. Thus we make assumptions about the target concept, which we formalise by considering target concepts from a concept class ${\mathcal{C}}$ . We also specify the hypothesis class ${\mathcal{H}}$ and a function $t$ that determines the number of training examples required by the algorithm. We say that a learning algorithm $\mathfrak{L}$ is a $({\mathbb{U}},{\mathcal{C}},{\mathcal{H}},t)$ -PAC-learning algorithm if for all probability distributions ${\mathcal{D}}$ on ${\mathbb{U}}$ , all target concepts $C^{*}\in{\mathcal{C}}$ , and all $\epsilon,\delta>0$ , given a sequence $T$ of $t(\epsilon,\delta)$ training examples and $\epsilon,\delta$ , the algorithm generates a hypothesis $H(T,\epsilon,\delta)\in{\mathcal{H}}$ that satisfies (5.2).

5.1 Sample Size Bound

It is a basic insight from computational learning theory that if the hypothesis class ${\mathcal{H}}$ is finite we need roughly $\log|{\mathcal{H}}|$ training examples to achieve probable approximate correctness. The following well-known lemma makes this precise. For a proof, see for example [34].

Lemma 5.1 (Sample Size Bound).

Suppose that the hypothesis class ${\mathcal{H}}$ is finite and that the length $t$ of the training sequence satisfies

$$t\geq\frac{1}{\epsilon}\ln\frac{|{\mathcal{H}}|}{\delta}.\tag{5.3}$$

Then for all probability distributions ${\mathcal{D}}$ on ${\mathbb{U}}$ and all target functions $C^{*}$ ,

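$$\Pr_{T\sim{\mathcal{D}}}\big(\text{for all }H\in{\mathcal{H}}\text{: if }H\text{ is consistent with }T\text{ then }\operatorname{err}_{{\mathcal{D}}}(H)\leq\epsilon\big)\geq 1-\delta.$$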

This means that if ${\mathcal{H}}$ is finite and the training sequence is long enough (as specified in (5.3)) then with high confidence, every consistent hypothesis will have low generalisation error.
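To get a feeling for the numbers, the sketch below evaluates the textbook form of the bound for finite hypothesis classes, $t\geq\frac{1}{\epsilon}\big(\ln|{\mathcal{H}}|+\ln\frac{1}{\delta}\big)$ . Treating this as the exact form of (5.3) is an assumption; the function name `sample_size` is hypothetical.

```python
import math

def sample_size(hypothesis_count, epsilon, delta):
    """Smallest integer t with t >= (ln|H| + ln(1/delta)) / epsilon.

    Assumes the textbook form of the bound for finite hypothesis classes.
    """
    return math.ceil((math.log(hypothesis_count) + math.log(1 / delta)) / epsilon)

# 1000 hypotheses, error at most 0.1, confidence at least 0.95.
print(sample_size(1000, 0.1, 0.05))
```

The dependence on $|{\mathcal{H}}|$ is only logarithmic, which is why a hypothesis class of size polynomial in $n$ leads to a sample size that is logarithmic in $n$ .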

If we combine this lemma with Theorem 1.1, we obtain the following result.

Theorem 5.2.

Let $k,\ell,q\in\mathbb{N}$ . Then there are $q^{*},r^{*},s^{*}\in\mathbb{N}$ and a learning algorithm $\mathfrak{L}$ for the $k$ -ary learning problem over some finite background $\rho$ -structure $B$ with the following properties.

1) Let $\bar{x},\bar{y}$ be tuples of length $k,\ell$ , respectively. Let ${\mathcal{C}}$ be the class of all $\llbracket\varphi(\bar{x}\mathbin{;}\bar{v})\rrbracket^{B}$ for a $\text{FO}[\rho]$ -formula $\varphi(\bar{x}\mathbin{;}\bar{y})$ of quantifier rank $q$ and $\bar{v}\in U(B)^{\ell}$ , and let ${\mathcal{H}}$ be the class of all $\llbracket\varphi^{*}(\bar{x}\mathbin{;}\bar{v}^{*})\rrbracket^{B}$ for a syntactically $r^{*}$ -local $\text{FO}[\rho]$ -formula $\varphi^{*}(\bar{x}\mathbin{;}\bar{y})$ of quantifier rank $q^{*}$ and $\bar{v}^{*}\in U(B)^{\ell}$ . Let

[TABLE]

where $n:=|B|$ .

Then $\mathfrak{L}$ is a $(U(B)^{k},{\mathcal{C}},{\mathcal{H}},t)$ -PAC-learning algorithm.

2) The algorithm runs in time $(\log n+d+1/\epsilon+\log 1/\delta)^{O(1)}$ with only local access to $B$ , where $d:=|\Delta(B)|$ .

Proof 5.3.

Let $\mathfrak{L}$ be the algorithm of Theorem 1.1. Given a training sequence $T$ of length $t$ consistent with a concept from ${\mathcal{C}}$ , it generates a hypothesis $H\in{\mathcal{H}}$ consistent with $T$ . It follows from Lemma 5.1 that $H$ is probably approximately correct.

5.2 VC Dimension

We cannot apply the sample size bound in a situation where the background structure is infinite. But there is an improved sample size bound that even holds for (some) infinite hypothesis classes. For this bound, we replace the factor $\ln|{\mathcal{H}}|$ in the sample size bound by the VC-dimension of ${\mathcal{H}}$ . The VC-dimension is a combinatorial measure for the complexity of a set system.

Let ${\mathbb{U}}$ be a set and ${\mathcal{H}}\subseteq 2^{{\mathbb{U}}}$ . A set $V\subseteq{\mathbb{U}}$ is shattered by ${\mathcal{H}}$ if for every $I:V\to\{0,1\}$ there is an $H\in{\mathcal{H}}$ such that $I$ is the restriction of $H$ to $V$ . The VC-dimension of ${\mathcal{H}}$ , denoted by $\operatorname{VC}({\mathcal{H}})$ , is the maximum size of a set shattered by ${\mathcal{H}}$ , or $\infty$ if this maximum does not exist. ${\mathcal{H}}$ has finite VC-dimension if $\operatorname{VC}({\mathcal{H}})<\infty$ . Observe that for finite ${\mathcal{H}}$ we have $\operatorname{VC}({\mathcal{H}})\leq\log|{\mathcal{H}}|$ .
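For small finite examples, shattering and the VC-dimension can be checked by brute force. The sketch below is only an illustration: hypotheses are modelled as Python functions into $\{0,1\}$ , and the threshold and interval classes are toy classes chosen for this example, not classes from this paper.

```python
from itertools import combinations

def is_shattered(points, hypotheses):
    """True if every 0/1 labelling of `points` is realised by some hypothesis."""
    patterns = {tuple(h(p) for p in points) for h in hypotheses}
    return len(patterns) == 2 ** len(points)

def vc_dimension(universe, hypotheses):
    """Maximum size of a shattered subset of a finite universe.

    Stops at the first size with no shattered set, which is sound because
    every subset of a shattered set is shattered.
    """
    best = 0
    for size in range(1, len(universe) + 1):
        if any(is_shattered(s, hypotheses) for s in combinations(universe, size)):
            best = size
        else:
            break
    return best

universe = list(range(5))
# Thresholds: x is labelled 1 iff x >= a.
thresholds = [lambda x, a=a: int(x >= a) for a in range(6)]
# Intervals: x is labelled 1 iff a <= x <= b (empty when a > b).
intervals = [lambda x, a=a, b=b: int(a <= x <= b) for a in range(5) for b in range(5)]
print(vc_dimension(universe, thresholds), vc_dimension(universe, intervals))
```

The search is exponential in the size of the universe, so it is only meant for sanity checks on tiny instances.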

The following lemma due to Blumer, Ehrenfeucht, Haussler and Warmuth [4] relates VC-dimension to PAC-learning.

Lemma 5.4 (VC-Dimension Sample Size Bound [4]).

There is a constant $c$ such that the following holds. Suppose that the hypothesis class ${\mathcal{H}}$ has finite VC-dimension and that the length $t$ of the training sequence satisfies

[TABLE]

Then for all probability distributions ${\mathcal{D}}$ on ${\mathbb{U}}$ and all target functions $C^{*}$ ,

[TABLE]

We can apply this improved sample size bound in our setting for bounded degree structures, because the VC-dimension of first-order definable models is bounded on bounded degree structures.

Lemma 5.5 ([17]).

Let $d,k,\ell,q\in\mathbb{N}$ . Then there is an $m\in\mathbb{N}$ such that the following holds. Let $B$ be a structure of maximum degree $\Delta(B)\leq d$ , let ${\mathcal{C}}$ be the class of all $\llbracket\varphi(\bar{x}\mathbin{;}\bar{v})\rrbracket^{B}$ for a $\text{FO}[\rho]$ -formula $\varphi(\bar{x}\mathbin{;}\bar{y})$ of quantifier rank $q$ with $|\bar{x}|=k,|\bar{y}|=\ell$ and $\bar{v}\in U(B)^{\ell}$ . Then $\operatorname{VC}({\mathcal{C}})\leq m$ .

Grohe and Turán [17] bounded the VC-dimension of first-order definable concept classes on a wide range of further classes of structures, among them planar graphs and graphs of bounded tree width. Adler and Adler [2] extended this to all nowhere dense graph classes.

Using the VC-Dimension Sample Size Bound and the previous lemma, we obtain the following theorem.

Theorem 5.6.

Let $d,k,\ell,q\in\mathbb{N}$ . Then there are $q^{*},r^{*},s^{*}\in\mathbb{N}$ and a learning algorithm $\mathfrak{L}$ for the $k$ -ary learning problem over some (possibly infinite) background $\rho$ -structure $B$ of maximum degree at most $d$ with the following properties.

1) Let $\bar{x},\bar{y}$ be tuples of length $k,\ell$ , respectively. Let ${\mathcal{C}}$ be the class of all $\llbracket\varphi(\bar{x}\mathbin{;}\bar{v})\rrbracket^{B}$ for a $\text{FO}[\rho]$ -formula $\varphi(\bar{x}\mathbin{;}\bar{y})$ of quantifier rank $q$ and $\bar{v}\in U(B)^{\ell}$ , and let ${\mathcal{H}}$ be the class of all $\llbracket\varphi^{*}(\bar{x}\mathbin{;}\bar{v}^{*})\rrbracket^{B}$ for a syntactically $r^{*}$ -local $\text{FO}[\rho]$ -formula $\varphi^{*}(\bar{x}\mathbin{;}\bar{y})$ of quantifier rank $q^{*}$ and $\bar{v}^{*}\in U(B)^{\ell}$ . Let

[TABLE]

Then $\mathfrak{L}$ is a $(U(B)^{k},{\mathcal{C}},{\mathcal{H}},t)$ -PAC-learning algorithm.

2) The algorithm runs in time $(1/\epsilon+\log 1/\delta)^{O(1)}$ under the uniform cost measure and with only local access to $B$ .

Proof 5.7.

This follows by combining Theorem 4.5 with Lemmas 5.4 and 5.5.

Corollary 1.2 follows from this theorem. (In fact, the theorem should be viewed as a precise version of Corollary 1.2.)

5.3 Agnostic PAC-Learning

Agnostic learning is a generalisation of our setting where there is no deterministic target function (or concept), but only a probabilistic one. In practice, this may occur in a situation where the instances in our abstract instance space do not fully capture the relevant properties of the real-world objects they describe. This can easily happen, because typically the instances are tuples only describing certain features of the objects.

We continue to consider Boolean classification problems on an instance space ${\mathbb{U}}$ . Instead of a probability distribution on ${\mathbb{U}}$ and a target concept $C^{*}\subseteq{\mathbb{U}}$ , we now assume that we have a probability distribution ${\mathcal{D}}$ on ${\mathbb{U}}\times\{0,1\}$ . We define the generalisation error of a hypothesis $H$ to be

[TABLE]

To quantify the quality of a learning algorithm, we compare the quality of the hypothesis of our algorithm with the best possible hypothesis coming from a certain class ${\mathcal{C}}$ . For classes ${\mathcal{C}},{\mathcal{H}}\subseteq 2^{{\mathbb{U}}}$ and a function $t$ , we say that a learning algorithm $\mathfrak{L}$ is an agnostic $({\mathbb{U}},{\mathcal{C}},{\mathcal{H}},t)$ -PAC-learning algorithm if for all probability distributions ${\mathcal{D}}$ on ${\mathbb{U}}\times\{0,1\}$ and all $\epsilon,\delta>0$ , given a sequence $T$ of $t(\epsilon,\delta)$ training examples and $\epsilon,\delta$ , the algorithm generates a hypothesis $H(T,\epsilon,\delta)\in{\mathcal{H}}$ that satisfies

[TABLE]

The agnostic PAC-learning framework was introduced by Haussler [18].
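To make these definitions concrete, the following minimal sketch computes the generalisation error exactly for a small, explicitly given distribution on ${\mathbb{U}}\times\{0,1\}$, together with the excess error of a hypothesis over the best concept in a class. The toy instance space, the distribution, the threshold concepts, and all function names are illustrative assumptions and do not come from the paper.

```python
from fractions import Fraction

def generalisation_error(hypothesis, dist):
    """Probability mass of pairs (u, c) with hypothesis(u) != c.

    `dist` maps (instance, label) pairs to probabilities summing to 1.
    """
    return sum(p for (u, c), p in dist.items() if hypothesis(u) != c)

def excess_error(hypothesis, concept_class, dist):
    """Error of `hypothesis` minus the best error achievable in `concept_class`."""
    best = min(generalisation_error(c, dist) for c in concept_class)
    return generalisation_error(hypothesis, dist) - best

# Hypothetical toy instance space {0, 1, 2}; instance 2 occurs with both labels.
dist = {
    (0, 0): Fraction(1, 4),
    (1, 1): Fraction(1, 4),
    (2, 1): Fraction(3, 8),
    (2, 0): Fraction(1, 8),  # same instance, conflicting label
}

# Assumed concept class: thresholds "u >= a" for a in 0..3.
concepts = [lambda u, a=a: int(u >= a) for a in range(4)]

best_error = min(generalisation_error(c, dist) for c in concepts)
```

In this toy distribution, instance 2 carries both labels, so every hypothesis has positive generalisation error. The agnostic criterion therefore compares a hypothesis with the best concept in the class rather than with error zero.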

To obtain agnostic PAC-learning algorithms, we can use the following Uniform Convergence Lemma instead of the Sample Size Bound of Lemma 5.1. Recall that $\operatorname{err}_{T}(H)$ denotes the training error of a hypothesis $H$ on a training sequence $T$ .

**Lemma 5.8** (Uniform Convergence).

Suppose that the hypothesis class ${\mathcal{H}}$ is finite and that the length $t$ of the training sequence satisfies

[TABLE]

Then for all probability distributions ${\mathcal{D}}$ on ${\mathbb{U}}\times\{0,1\}$ ,

[TABLE]

A proof can be found in [34].
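As a rough illustration of how such a bound is used, the sketch below computes a sufficient sample size from the standard textbook form for finite classes, $t\geq\ln(2|{\mathcal{H}}|/\delta)/(2\epsilon^{2})$, which follows from Hoeffding's inequality and a union bound. Using this particular form and these constants is an assumption; the lemma as stated in the paper may use different constants. The function name is hypothetical.

```python
import math

def uniform_convergence_sample_size(num_hypotheses, epsilon, delta):
    """Smallest integer t with 2 * |H| * exp(-2 * t * epsilon^2) <= delta.

    Assumes the Hoeffding-plus-union-bound form of uniform convergence
    for a finite hypothesis class of size `num_hypotheses`.
    """
    if num_hypotheses < 1:
        raise ValueError("the hypothesis class must be non-empty")
    if epsilon <= 0 or not 0 < delta < 1:
        raise ValueError("need epsilon > 0 and 0 < delta < 1")
    return math.ceil(math.log(2 * num_hypotheses / delta) / (2 * epsilon ** 2))

# Example: 1000 hypotheses, accuracy 0.1, confidence 0.95.
t = uniform_convergence_sample_size(1000, 0.1, 0.05)  # 530
```

The bound grows only logarithmically in $|{\mathcal{H}}|$: under the same assumed form, moving from $10^{3}$ to $10^{6}$ hypotheses less than doubles the required number of examples.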

Now from Theorem 4.6 we get an agnostic PAC-learning algorithm for first-order definable concept classes on structures of small degree.

**Theorem 5.9.**

Let $k,\ell,q\in\mathbb{N}$. Then there are $q^{*},r^{*},s^{*}\in\mathbb{N}$ and a learning algorithm $\mathfrak{M}$ for the $k$-ary learning problem over some finite background $\rho$-structure $B$ with the following properties.

1. Let $\bar{x},\bar{y}$ be tuples of length $k,\ell$, respectively. Let ${\mathcal{C}}$ be the class of all $\llbracket\varphi(\bar{x}\mathbin{;}\bar{v})\rrbracket^{B}$ for a $\text{FO}[\rho]$-formula $\varphi(\bar{x}\mathbin{;}\bar{y})$ of quantifier rank $q$ and $\bar{v}\in U(B)^{\ell}$, and let ${\mathcal{H}}$ be the class of all $\llbracket\varphi^{*}(\bar{x}\mathbin{;}\bar{v}^{*})\rrbracket^{B}$ for a syntactically $r^{*}$-local $\text{FO}[\rho]$-formula $\varphi^{*}(\bar{x}\mathbin{;}\bar{y})$ of quantifier rank $q^{*}$ and $\bar{v}^{*}\in U(B)^{\ell}$. Let

   [TABLE]

   where $n:=|B|$.

   Then $\mathfrak{M}$ is an agnostic $(U(B)^{k},{\mathcal{C}},{\mathcal{H}},t)$-PAC-learning algorithm.

2. The algorithm runs in time $(\log n+d+1/\epsilon+\log 1/\delta)^{O(1)}$ with only local access to $B$, where $d:=\Delta(B)$.

**Proof 5.10.**

Let $\mathfrak{M}$ be the algorithm of Theorem 4.6. Let ${\mathcal{D}}$ be a probability distribution on $U(B)^{k}\times\{0,1\}$, and let $C^{*}\in{\mathcal{C}}$ be such that

[TABLE]

As ${\mathcal{C}}$ is finite, this minimum exists. By the Uniform Convergence Lemma (Lemma 5.8) applied to ${\mathcal{C}}$ and $\epsilon/2,\delta/2$ we have

[TABLE]

(assuming that the constant $s^{*}$ is sufficiently large).

Given a training sequence $T$ , the algorithm $\mathfrak{M}$ generates a hypothesis $H$ with at most the training error of $C^{*}$ on $T$ . Applying the Uniform Convergence Lemma again, this time to ${\mathcal{H}}$ and $\epsilon/2,\delta/2$ , we get

[TABLE]

This implies that

[TABLE]

Hence $\mathfrak{M}$ is an agnostic $(U(B)^{k},{\mathcal{C}},{\mathcal{H}},t)$ -PAC-learning algorithm.
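The final step of this argument can be spelled out as a chain of inequalities. This is a sketch of the standard reasoning, not a transcription of the displayed formulas above: it assumes that both applications of the Uniform Convergence Lemma succeed simultaneously, which by the union bound happens with probability at least $1-\delta$, and it writes $\operatorname{err}_{{\mathcal{D}}}$ for the generalisation error, a notation assumed here by analogy with $\operatorname{err}_{T}$.

```latex
\begin{align*}
\operatorname{err}_{\mathcal{D}}(H)
  &\le \operatorname{err}_{T}(H) + \frac{\epsilon}{2}
     && \text{(uniform convergence on } \mathcal{H}\text{)} \\
  &\le \operatorname{err}_{T}(C^{*}) + \frac{\epsilon}{2}
     && \text{(training error of } H \text{ is at most that of } C^{*}\text{)} \\
  &\le \operatorname{err}_{\mathcal{D}}(C^{*}) + \epsilon
     && \text{(uniform convergence on } \mathcal{C}\text{)} \\
  &= \min_{C\in\mathcal{C}} \operatorname{err}_{\mathcal{D}}(C) + \epsilon .
\end{align*}
```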

6 Conclusions

We prove that first-order definable models are learnable in polylogarithmic time on finite structures of polylogarithmic degree. In view of the simple example showing that sublinear parameter learning is impossible, which for a long time made us (or rather, the first author) believe that sublinear learning is impossible in general in our framework, this result came as a surprise to us.

It is less surprising that the proof relies on the locality of first-order logic. In fact, the proof is not very difficult, but it has to be set up in the right way. In particular, the use of syntactic locality and the notion of local types, which to the best of our knowledge is new, is essential. Let us remark that we cannot use Hanf’s Locality Theorem, which is usually easier to handle than Gaifman’s, because in structures of (poly)logarithmic degree the number of isomorphism types of local neighbourhoods grows too fast.

Algorithmically, our paper is not very sophisticated: our algorithms are simple brute-force algorithms that are practically useless due to enormous hidden constants. It is a very interesting question whether there are also practical algorithms for learning FO-definable models (which may not be more efficient than ours in the worst case, but nevertheless work better). One approach would be to map the data (consisting of tuples of elements and the background structure) to high dimensional feature vectors, maybe even obtain a kernel, and then apply conventional machine learning algorithms.
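To indicate what such a feature map might look like, the sketch below maps a tuple of vertices to the vector of neighbourhood sizes at each distance up to a radius $r$, reading only the adjacency lists of visited vertices. This particular feature map is purely an illustrative assumption: it is not proposed in the paper and is not claimed to capture first-order definability. The graph encoding and function names are hypothetical.

```python
from collections import deque

def distance_profile(adj, start, radius):
    """Count the vertices at each distance 0..radius from `start`.

    Only adjacency lists of visited vertices are read, in the spirit of
    local access to the background structure.
    """
    dist = {start: 0}
    queue = deque([start])
    counts = [0] * (radius + 1)
    while queue:
        v = queue.popleft()
        counts[dist[v]] += 1
        if dist[v] == radius:
            continue  # do not explore beyond the radius
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return counts

def feature_vector(adj, instance, radius):
    """Concatenate the distance profiles of all entries of a tuple."""
    features = []
    for u in instance:
        features.extend(distance_profile(adj, u, radius))
    return features

# Hypothetical background structure: the path 0 - 1 - 2 - 3 - 4.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
example = feature_vector(path, (0, 2), 1)  # [1, 1, 1, 2]
```

Vectors of this kind could then be passed to a conventional learner. Whether any such encoding preserves enough of the local type information is exactly the open question raised above.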

As we have outlined in the introduction, our results may be viewed as a contribution to a descriptive complexity theory of machine learning. Within such a theory, many questions, both technical and conceptual, remain open. For example, are there sublinear learning algorithms for first-order logic on other classes of structures such as words, trees, or planar graphs? At a more fundamental level, what is a good computation model (replacing our “local access” model) on background structures that are still sparse, but have large maximum degree? In fact, sublinear algorithms seem unlikely for most classes of high maximum degree. But what about fixed-parameter tractable learning algorithms? As soon as we allow formulas with arbitrarily many instance and parameter variables (that is, allow unbounded $k$ and $\ell$), fixed-parameter tractability becomes nontrivial on any of the classes suggested above. And what about other logics, for example monadic second-order logic or modal and temporal logics? Finally, one should generalise the framework from Boolean classification problems to other types of learning problems.

If our version of a declarative approach to machine learning is supposed to have any impact in practice, maybe the most important question is: what are suitable logics and background structures for expressing relevant and feasible machine learning models?

Acknowledgements

We would like to thank Kristian Kersting and Daniel Neider for very helpful comments on an earlier version of this paper.

Bibliography

[1] A. Abouzied, D. Angluin, C. Papadimitriou, J. Hellerstein, and A. Silberschatz. Learning and verifying quantified boolean queries by example. In R. Hull and W. Fan, editors, Proceedings of the 32nd ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pages 49–60, 2013.
[2] H. Adler and I. Adler. Interpreting nowhere dense graph classes as a classical notion of model theory. European Journal of Combinatorics, 36:322–330, 2014.
[3] A. Blum, J. Hopcroft, and R. Kannan. Foundations of data science. Unpublished manuscript available at https://www.cs.cornell.edu/jeh/book 2016 June 9.pdf, 2016.
[4] A. Blumer, A. Ehrenfeucht, D. Haussler, and M. Warmuth. Learnability and the Vapnik-Chervonenkis dimension. J. ACM, 36:929–965, 1989.
[5] A. Bonifati, R. Ciucanu, and S. Staworko. Learning join queries from user examples. ACM Trans. Database Syst., 40(4):24:1–24:38, 2016.
[6] A. Chandra and D. Harel. Structure and complexity of relational queries. Journal of Computer and System Sciences, 25:99–128, 1982.
[7] W. Cohen and C. Page. Polynomial learnability and inductive logic programming: Methods and results. New Generation Computing, 13:369–404, 1995.
[8] M. Crouch, N. Immerman, and J. Moss. Finding reductions automatically. In A. Blass, N. Dershowitz, and W. Reisig, editors, Fields of Logic and Computation: Essays Dedicated to Yuri Gurevich on the Occasion of His 70th Birthday, volume 6300 of Lecture Notes in Computer Science, pages 181–200. Springer Verlag, 2010.
