Counting the learnable functions of structured data

Pietro Rotondo; Marco Cosentino Lagomarsino; Marco Gherardi

arXiv:1903.12021·cond-mat.dis-nn·May 20, 2020

Open Access

TL;DR

This paper extends Cover's function counting theorem to structured data, providing analytical tools to understand the capacity of neural networks in recognizing invariant, non-pointlike patterns such as objects with transformations.

Contribution

It develops a new function counting theory for structured data, deriving formulas for the number of dichotomies and classifier capacity for complex, correlated patterns.

Findings

01. Derived analytical expressions for dichotomies of structured data.

02. Obtained a closed-form formula for classifier capacity on polytopes.

03. Enhanced theoretical understanding of neural network generalization and invariant recognition.

Abstract

Cover's function counting theorem is a milestone in the theory of artificial neural networks. It provides an answer to the fundamental question of determining how many binary assignments (dichotomies) of $p$ points in $n$ dimensions can be linearly realized. Regrettably, it has proved hard to extend the same approach to more advanced problems than the classification of points. In particular, an emerging necessity is to find methods to deal with structured data, and specifically with non-pointlike patterns. A prominent case is that of invariant recognition, whereby identification of a stimulus is insensitive to irrelevant transformations on the inputs (such as rotations or changes in perspective in an image). An object is therefore represented by an extended perceptual manifold, consisting of inputs that are classified similarly. Here, we develop a function counting theory for structured…

Equations

$$\phi(\xi_{i}) = \theta(\xi_{i} \cdot w),$$

$$C_{n,p+1} = C_{n,p} + C_{n-1,p},$$

$$C_{n,p} = 2 \sum_{i=0}^{n-1} \binom{p-1}{i},$$

$$(-1,1) \ni \rho = \frac{1}{n}\, \xi_{i} \cdot \bar{\xi}_{i}$$

$$\Psi_{2}(\rho) = \frac{2}{\pi} \arctan \sqrt{\frac{1+\rho}{1-\rho}} .$$

$$C_{n,p+1} = \Psi_{2}(\rho) \left(C_{n,p} + C_{n-1,p}\right) + \left[1 - \Psi_{2}(\rho)\right] R_{n,p} .$$

$$R_{n,p} = C_{n-1,p} + C_{n-2,p} .$$

$$C_{n,p+1} = \Psi_{2}(\rho)\, C_{n,p} + C_{n-1,p} + \left[1 - \Psi_{2}(\rho)\right] C_{n-2,p} .$$

$$C_{0,p} = 0, \qquad C_{n>0,1} = 2 \left\{1 - \left[1 - \Psi_{2}(\rho)\right] \delta_{n,1}\right\} .$$

$$K_{i,p} = \sum_{m=0}^{p-1} \binom{p-1}{m,\; i-2m}\, \Psi_{2}(\rho)^{p-1-i+m} \left[1 - \Psi_{2}(\rho)\right]^{m},$$

$$\binom{n}{m_{1},\; m_{2}} = \frac{n!}{m_{1}!\, m_{2}!\, (n - m_{1} - m_{2})!}$$

$$C_{n,p} = 2 \sum_{i=0}^{n-2} K_{i,p} + 2\, \Psi_{2}(\rho)\, K_{n-1,p} .$$

$$C_{n,p} \approx 2 \sum_{i=0}^{n-1} K_{i,p} .$$

$$C_{p/\alpha_{\mathrm{c}},\, p} \approx 2^{p-1},$$

$$\bar{\imath} = (p-1) \sum_{l=0}^{2} l\, P(\gamma_{j} \to \gamma_{j} + l),$$

$$\alpha_{\mathrm{c}} \approx \frac{p-1}{\bar{\imath}} = \frac{2}{3 - 2\Psi_{2}(\rho)} .$$

$$\bar{\xi}_{p+1} = \left\{\xi_{p+1}^{2}, \dots, \xi_{p+1}^{k}\right\} \subset \xi_{p+1} .$$

$$Q^{k-1}\left(C_{n,p}^{(k)}, C_{n-1,p}^{(k)}, \dots, C_{n-k+1,p}^{(k)}\right) .$$

$$C_{n,p+1}^{(k)} = \tilde{\Psi}_{k} \left[Q^{k-1}(\dots) - R_{n,p}^{k-1}\right] + R_{n,p}^{k-1} .$$

$$\tilde{\Psi}_{k} = \frac{\int_{S^{n-1}} dw \prod_{\mu,\nu=1}^{k} \theta\left((w \cdot \xi_{p+1}^{\mu})(w \cdot \xi_{p+1}^{\nu})\right)}{\int_{S^{n-1}} dw \prod_{\mu,\nu=2}^{k} \theta\left((w \cdot \xi_{p+1}^{\mu})(w \cdot \xi_{p+1}^{\nu})\right)} .$$

$$\rho_{\mu\nu} = \frac{1}{n}\, \xi_{i}^{\mu} \cdot \xi_{i}^{\nu}, \qquad i = 1, \dots, p; \quad \mu, \nu = 1, \dots, k .$$

$$\tilde{\Psi}_{k}\left(\{\rho_{\mu\nu}\}_{\mu,\nu=1,\dots,k}\right) = \frac{\Psi_{k}\left(\{\rho_{\mu\nu}\}_{\mu,\nu=1,\dots,k}\right)}{\Psi_{k-1}\left(\{\rho_{\mu\nu}\}_{\mu,\nu=2,\dots,k}\right)},$$

$$R_{n,p}^{k-1} = Q^{k-1}\left(C_{n-1,p}^{(k)}, C_{n-2,p}^{(k)}, \dots, C_{n-k,p}^{(k)}\right) .$$

$$C_{n,p+1}^{(k)} = Q^{k}\left(C_{n,p}^{(k)}, C_{n-1,p}^{(k)}, \dots, C_{n-k,p}^{(k)}\right),$$

$$Q^{k}(x_{n}, \dots, x_{n-k}) = \tilde{\Psi}_{k}\, Q^{k-1}(x_{n}, \dots, x_{n-k+1}) + (1 - \tilde{\Psi}_{k})\, Q^{k-1}(x_{n-1}, \dots, x_{n-k}),$$

$$C_{n,p+1}^{(k)} = \sum_{l=0}^{k} \theta_{k}(l)\, C_{n-l,p}^{(k)} .$$

$$\theta_{k}(l) = \tilde{\Psi}_{k}\, \theta_{k-1}(l) + (1 - \tilde{\Psi}_{k})\, \theta_{k-1}(l-1),$$

$$C_{n,p+1}^{(3)} = \tilde{\Psi}_{3} \Psi_{2}\, C_{n,p}^{(3)} + \left[\tilde{\Psi}_{3} + \Psi_{2} (1 - \tilde{\Psi}_{3})\right] C_{n-1,p}^{(3)} + \left[\tilde{\Psi}_{3} (1 - \Psi_{2}) + (1 - \tilde{\Psi}_{3})\right] C_{n-2,p}^{(3)} + (1 - \tilde{\Psi}_{3})(1 - \Psi_{2})\, C_{n-3,p}^{(3)} .$$

$$\alpha_{\mathrm{c}} = \frac{\sum_{l=0}^{k} \theta_{k}(l)}{\sum_{l=0}^{k} l\, \theta_{k}(l)} = \frac{\lambda_{0}(k)}{\lambda_{1}(k)},$$

$$\lambda_{m}(k) = \sum_{l=0}^{k} l^{m}\, \theta_{k}(l) .$$

Taxonomy

Topics: Neural Networks and Applications · Topological and Geometric Data Analysis · Statistical Methods and Inference

Full text

Counting the learnable functions of structured data

Pietro Rotondo

School of Physics and Astronomy, University of Nottingham, Nottingham, NG7 2RD, UK

Centre for the Mathematics and Theoretical Physics of Quantum Non-equilibrium Systems, University of Nottingham, Nottingham NG7 2RD, UK

Marco Cosentino Lagomarsino

Università degli Studi di Milano, Via Celoria 16, 20133 Milano, Italy

I.N.F.N. Sezione di Milano

Marco Gherardi

Università degli Studi di Milano, Via Celoria 16, 20133 Milano, Italy

I.N.F.N. Sezione di Milano

Abstract

Cover’s function counting theorem is a milestone in the theory of artificial neural networks. It provides an answer to the fundamental question of determining how many binary assignments (dichotomies) of $p$ points in $n$ dimensions can be linearly realized. Regrettably, it has proved hard to extend the same approach to more advanced problems than the classification of points. In particular, an emerging necessity is to find methods to deal with structured data, and specifically with non-pointlike patterns. A prominent case is that of invariant recognition, whereby identification of a stimulus is insensitive to irrelevant transformations on the inputs (such as rotations or changes in perspective in an image). An object is therefore represented by an extended perceptual manifold, consisting of inputs that are classified similarly. Here, we develop a function counting theory for structured data of this kind, by extending Cover’s combinatorial technique, and we derive analytical expressions for the average number of dichotomies of generically correlated sets of patterns. As an application, we obtain a closed formula for the capacity of a binary classifier trained to distinguish general polytopes of any dimension. These results may help extend our theoretical understanding of generalization, feature extraction, and invariant object recognition by neural networks.

I Introduction

Machine learning and deep learning demonstrate astonishing results in applications Krizhevsky et al. (2012); Goodfellow et al. (2014, 2016), sometimes beyond our theoretical reach. This provides a formidable challenge for theorists who wish to develop a framework for their understanding Baldassi et al. (2016); Baity-Jesi et al. (2018). A landmark achievement in learning theory is Cover’s function counting theorem, which counts the number of binary classification functions, or “dichotomies”, that can be realized by given architectures Cover (1965). This foundational result made it possible to quantify the complexity of a learning model and the advantage gained in using non-linear kernels, provided a benchmark for the performance of both artificial and natural neural networks, and is a handy tool for several applications Brunel et al. (2004); Engel and Broeck (2001); Hertz et al. (1991); Opper and Kinzel (1996); Ganguli and Sompolinsky (2010); Chung et al. (2018a).

Other commonly used methods in this area come from statistical physics (pioneered by E. Gardner Gardner (1987); Gardner and Derrida (1988)). With respect to these, Cover’s method has the advantage of offering a simple geometric insight and of being valid at a finite number of dimensions, while statistical physics methods typically apply in the “thermodynamic limit” of infinite dimensions. Yet, despite its benefits and relative simplicity, Cover’s analytical technique has so far eluded efforts to extend it Engel and Broeck (2001).

Uncorrelated random patterns are commonly taken as a simplifying assumption for the theoretical investigation of artificial neural networks. Yet, it is becoming apparent that providing a theoretical framework that includes structure in the input data is essential. This need is emerging in different contexts: (a) The invariant representation of perceptual stimuli by brains (e.g., the coherent perception of differently rotated and rescaled objects in vision, or the recognition of the same sound in different acoustic environments in audition) prompted the formalization of perceptual manifolds as extended patterns Tenenbaum et al. (2000); Seung and Lee (2000); Roweis and Saul (2000); Ranzato et al. (2007); Goodfellow et al. (2009); Anselmi et al. (2016); Chung et al. (2016, 2018a, 2018b). Perceptual manifolds are the regions in input space corresponding to all variations of a stimulus that do not modify the object’s identification. (b) The discovery of spatial maps in rodent brains O’Keefe and Dostrovsky (1971) motivated extensions of associative memory models to attractors that are not point-like but occupy a region in configuration space Cocco et al. (2018). (c) The problem of local generalization and robustness to noise, a main theme of machine learning, can be cast as a problem of non-pointlike patterns Szegedy et al. (2013); Novak et al. (2018); Borra et al. (2019). (d) The description of the input patterns as modular combinations of elementary features (a well studied aspect of empirical datasets Pang and Maslov (2013); Mazzolini et al. (2018)), was shown to induce a multi-layer structure in certain network architectures Mézard (2017).

Here, we develop a theory that extends Cover’s approach to non point-like patterns, by counting only those dichotomies that assign the same label to different variants of the same input. Our theory (i) enables the exact computation of the (average) number of dichotomies of structured data, (ii) gives direct access to quantities at finite size, and (iii) naturally disentangles combinatorial and geometric aspects, thus lending itself to further generalizations.

II Number of admissible dichotomies

The central quantity obtained by Cover’s function counting method is the number $C_{n,p}$ of linearly-realizable dichotomies of $p$ points $\xi_{1},\ldots,\xi_{p}$ in $n$ dimensions. A dichotomy of this set is a function $\phi$ mapping each point $\xi_{i}$ to its $\left\{0,1\right\}$ binary label (see Fig. 1). A linearly-realizable dichotomy is identified by a vector $w\in\mathbb{R}^{n}$ :

$$\phi(\xi_{i}) = \theta(\xi_{i} \cdot w), \tag{1}$$

where $\theta$ is the Heaviside theta function. The hyperplane perpendicular to the vector $w$ separates the space into two half-spaces, where the points mapped to $0$ and $1$ lie respectively. There are $2^{p}$ dichotomies, but only $C_{n,p}$ of them are linearly realizable. We focus on linearly realizable dichotomies, and will therefore omit this specification when it is clear from the context.

It turns out that $C_{n,p}$ does not depend on the $\xi_{i}$ ’s, as long as they are in general position (meaning that no subset of $n$ points is linearly dependent) Cover (1965). Structure in the data may thus appear not to affect $C_{n,p}$ at all. However, in general we do not wish to admit all possible dichotomies. For instance, among the hand-written digits in MNIST we could choose to admit dichotomies separating “1” and “I”, but not two similar-looking “0”s. Our definition of structure is based on such a restriction: a data set is qualified as structured whenever only a subset of all possible dichotomies is considered admissible. $C_{n,p}$ will then be the number of admissible dichotomies that can be realized linearly.

Here we focus on a rather general definition of admissibility, inspired by the literature cited above. We consider datasets of $kp$ points, structured as $p$ multiplets of $k$ points each. A dichotomy $\phi$ is admissible if different points $\xi$ in the same multiplet are classified coherently, i.e., if $\phi(\xi)$ is constant on each multiplet. We will restrict the points $\xi$ to lie on the unit sphere $S^{n-1}$ , meaning that $\xi^{2}=1$ , but this technical requirement can be easily relaxed. (A useful consequence of this is that setting the overlap between two points determines their distance.) The ensemble we consider fixes all the overlaps between the points in a multiplet, equally for all multiplets, but the relative positions and orientations of the multiplets are unspecified. The quantities we will compute are averages over all possible positions and orientations of the multiplets.

Because of the convexity of linear separability, separating the multiplets is equivalent to separating the polytopes whose vertices are the points in the multiplets. (These polytopes play the role of the perceptual manifolds of Ref. Chung et al. (2018a).) For instance, $k=2$ corresponds to segments, $k=3$ to triangles, $k=4$ to tetrahedra.

III Single points ( $k=1$ )

Let us first outline Cover’s original computation. Imagine starting with $p$ points and adding the $(p+1)$ th point $\xi_{p+1}$ to $\left\{\xi_{1},\ldots,\xi_{p}\right\}$ . For each dichotomy $\phi$ of the $p$ points $\xi_{1},\ldots,\xi_{p}$ one of two possibilities is satisfied: either (i) $\phi$ can be realized by a hyperplane passing through $\xi_{p+1}$ (equivalently, $\phi$ can be realized by a vector $w$ such that $\xi_{p+1}\cdot w=0$ ), or (ii) it cannot. If (i) is true, then $w$ can be rotated infinitesimally to yield both $\xi_{p+1}\cdot w\gtrless 0$ ; otherwise, the half-space where $\xi_{p+1}$ lies is fixed. Therefore, for each dichotomy $\phi$ of $\left\{\xi_{1},\ldots,\xi_{p}\right\}$ satisfying (i) there are 2 different dichotomies $\phi_{1}$ and $\phi_{2}$ of $\left\{\xi_{1},\ldots,\xi_{p},\xi_{p+1}\right\}$ agreeing with $\phi$ on the common points [i.e., such that $\phi_{1,2}(\xi_{i})=\phi(\xi_{i})$ for $i=1,\ldots,p$ ]. If the number of dichotomies satisfying (i) is $M$ , then the number of those satisfying (ii) is $C_{n,p}-M$ , and one can write $C_{n,p+1}=2M+C_{n,p}-M$ . The condition (i) is in the form of a single linear constraint, therefore $M$ is the number of dichotomies of $p$ points in $n-1$ dimensions, $M=C_{n-1,p}$ . Thus $C_{n,p}$ satisfies the recursion

$$C_{n,p+1} = C_{n,p} + C_{n-1,p}, \tag{2}$$

with boundary conditions $C_{n>0,1}=2$ (a single point can be classified either way) and $C_{0,p}=0$ .

The solution to Eq. (2) can be obtained by observing that the contribution of the boundary value $C_{n-i,1}$ to $C_{n,p}$ is given by the number of directed paths $\{\gamma_{j}\}_{j=1,\ldots,p}$ , with $\gamma_{j}\in\mathbb{N}$ , that start from $\gamma_{1}=n-i$ and end in $\gamma_{p}=n$ , where at each step $\gamma_{j+1}$ can be either $\gamma_{j}$ or $\gamma_{j}+1$ . The number of such paths is simply the binomial coefficient ${{p-1}\choose i}$ . Summing over the boundary gives

$$C_{n,p} = 2 \sum_{i=0}^{n-1} \binom{p-1}{i}, \tag{3}$$

where it is assumed that ${{p-1}\choose i}=0$ whenever $i>p-1$ .

Let us consider the fraction $c_{n,p}$ of linearly realizable dichotomies $c_{n,p}=C_{n,p}/2^{p}$ . For finite $n$ and $p$ , the capacity $\alpha_{\mathrm{c}}$ can be defined as the ratio $p/n$ at which half of all dichotomies can be realized: $c_{n,n\alpha_{\mathrm{c}}}=1/2$ . From the explicit expression (3) one sees that $c_{n,p}=1$ if $p\leq n$ , $c_{n,p}\to 0$ for $p\to\infty$ , and $c_{n,2n}=1/2$ , which pinpoints the well-known capacity $\alpha_{\mathrm{c}}=2$ .
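The single-point results above can be checked numerically. The following is a minimal sketch, not the authors' code: the function names `cover_recursive` and `cover_closed` are illustrative assumptions, and the boundary $C_{n\leq 0,p}=0$ is applied to every non-positive $n$.

```python
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def cover_recursive(n: int, p: int) -> int:
    """C_{n,p} from the recursion C_{n,p+1} = C_{n,p} + C_{n-1,p}."""
    if n <= 0:
        return 0  # boundary C_{0,p} = 0
    if p == 1:
        return 2  # boundary C_{n>0,1} = 2: one point can take either label
    return cover_recursive(n, p - 1) + cover_recursive(n - 1, p - 1)


def cover_closed(n: int, p: int) -> int:
    """C_{n,p} = 2 * sum_{i=0}^{n-1} binom(p-1, i); comb() is 0 when i > p-1."""
    return 2 * sum(comb(p - 1, i) for i in range(n))


def realizable_fraction(n: int, p: int) -> float:
    """c_{n,p} = C_{n,p} / 2^p."""
    return cover_closed(n, p) / 2 ** p


if __name__ == "__main__":
    for n in (3, 5, 10):
        # Fractions at p = n (all dichotomies) and p = 2n (half of them).
        print(n, realizable_fraction(n, n), realizable_fraction(n, 2 * n))
```

Both routines agree on every pair $(n,p)$ tried, and the fraction at $p=2n$ is exactly $1/2$ at any finite $n$, which is the statement $\alpha_{\mathrm{c}}=2$.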

IV Segments (doublets, $k=2$ )

The first step towards the general problem is the case where data are structured as pairs of points. Alongside the set of points $\xi=\left\{\xi_{1},\ldots,\xi_{p}\right\}$ , let us consider another set $\bar{\xi}=\left\{\bar{\xi}_{1},\ldots,\bar{\xi}_{p}\right\}$ . The multiplets discussed above are the doublets $\{\xi_{i},\bar{\xi}_{i}\}$ . Each doublet is such that the overlap between the two partners is fixed:

$$(-1,1) \ni \rho = \frac{1}{n}\, \xi_{i} \cdot \bar{\xi}_{i} \tag{4}$$

for all $i$ . The admissible dichotomies $\phi$ are those for which $\phi(\xi_{i})=\phi(\bar{\xi}_{i})$ for all $i$ ; their total number is $2^{p}$ .

The recursion step now corresponds to the addition of the $(p+1)$ th doublet $\{\xi_{p+1}$ , $\bar{\xi}_{p+1}\}$ . Repeating Cover’s reasoning for the point $\bar{\xi}_{p+1}$ alone gives a number of dichotomies equal to $Q_{n,p}=C_{n,p}+C_{n-1,p}$ . This is the number of dichotomies of the set $\{\xi_{1},\bar{\xi}_{1},\xi_{2},\bar{\xi}_{2},\ldots,\xi_{p},\bar{\xi}_{p},\bar{\xi}_{p+1}\}$ that are admissible on the first $p$ doublets [meaning that $\phi(\xi_{i})=\phi(\bar{\xi}_{i})$ for all $i=1,\ldots,p$ ]. A number $R_{n,p}$ of such dichotomies are realizable by a hyperplane passing through the point $\xi_{p+1}$ . These are all admissible, thanks to the freedom in the choice of $\phi(\xi_{p+1})$ by an infinitesimal adjustment of the hyperplane. Among the other $Q_{n,p}-R_{n,p}$ dichotomies, on average, a fraction $\Psi_{2}$ will happen to assign the same label to $\xi_{p+1}$ and $\bar{\xi}_{p+1}$ . $\Psi_{2}$ can be computed as the fraction of hyperplanes keeping $\xi_{p+1}$ and $\bar{\xi}_{p+1}$ in the same half-space; the calculation is carried out in the Appendix. Importantly, $\Psi_{2}$ is a function of the overlap $\rho$ alone:

$$\Psi_{2}(\rho) = \frac{2}{\pi} \arctan \sqrt{\frac{1+\rho}{1-\rho}} . \tag{5}$$

Note that $\Psi_{2}(\rho)=1-\Psi_{2}(-\rho)$ as expected from its definition. The foregoing argument leads to the following estimate for the total number of admissible dichotomies:

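$$C_{n,p+1} = \Psi_{2}(\rho) \left(C_{n,p} + C_{n-1,p}\right) + \left[1 - \Psi_{2}(\rho)\right] R_{n,p} . \tag{6}$$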

In order to compute $R_{n,p}$ it suffices to repeat Cover’s reasoning with respect to the point $\bar{\xi}_{p+1}$ , this time in $n-1$ dimensions because of the constraint imposed by the hyperplane passing through $\xi_{p+1}$ , thereby obtaining

$$R_{n,p} = C_{n-1,p} + C_{n-2,p} . \tag{7}$$

Finally the recursion for $C_{n,p}$ reads

$$C_{n,p+1} = \Psi_{2}(\rho)\, C_{n,p} + C_{n-1,p} + \left[1 - \Psi_{2}(\rho)\right] C_{n-2,p} . \tag{8}$$

The boundary conditions are now slightly different from those for the case $k=1$ in Eq. (2). In fact, in $n=1$ dimension the number of admissible dichotomies of a single pair of points ( $p=1$ ) is $2$ only when both points lie on the same half-line, otherwise it is $0$; on average, it is $2\Psi_{2}(\rho)$ . The boundary conditions are then

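$$C_{0,p} = 0, \qquad C_{n>0,1} = 2 \left\{1 - \left[1 - \Psi_{2}(\rho)\right] \delta_{n,1}\right\} . \tag{9}$$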

To find the solution of the recursion (8), similarly to the single point case, consider all the directed paths $\{\gamma_{j}\}_{j=1,\ldots,p}$ propagating from the boundary to $C_{n,p}$ , where $\gamma_{j+1}$ at each step can be $\gamma_{j}$ , $\gamma_{j}+1$ , or $\gamma_{j}+2$ . Contrary to the one point case, different paths with the same endpoints can now give different contributions to $C_{n,p}$ , since the three types of steps correspond to three different factors ( $\Psi_{2}$ , $1$ , and $1-\Psi_{2}$ respectively). The contribution $K_{i,p}$ of a path from $\gamma_{1}=n-i$ to $\gamma_{p}=n$ is

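$$K_{i,p} = \sum_{m=0}^{p-1} \binom{p-1}{m,\; i-2m}\, \Psi_{2}(\rho)^{p-1-i+m} \left[1 - \Psi_{2}(\rho)\right]^{m}, \tag{10}$$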

where the multinomial coefficient is defined as

$$\binom{n}{m_{1},\; m_{2}} = \frac{n!}{m_{1}!\, m_{2}!\, (n - m_{1} - m_{2})!} \tag{11}$$

(with the obvious analytical extension for negative factorials). Summation over the non-zero boundary $i=0,\ldots,n-1$ yields the number of admissible dichotomies

$$C_{n,p} = 2 \sum_{i=0}^{n-2} K_{i,p} + 2\, \Psi_{2}(\rho)\, K_{n-1,p} . \tag{12}$$

It is easy to see (by the multinomial theorem) that $C_{n,p}=2^{p}$ if $p\leq n/2$ ; this locates the usual Vapnik-Chervonenkis dimension Vapnik and Chervonenkis (1971), $d_{\mathrm{VC}}=n$ , as the total number of points is $2p$ .
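As a consistency check on the doublet case, the recursion with its boundary conditions can be iterated directly and compared with the path-sum solution. The sketch below is illustrative only: the function names are assumptions, the overlap `rho` is treated as a plain number in $(-1,1)$, and $C_{n,p}$ is taken to vanish for all $n\leq 0$.

```python
from functools import lru_cache
from math import atan, factorial, isclose, pi, sqrt


def psi2(rho: float) -> float:
    """Fraction of hyperplanes that do not separate a doublet with overlap rho."""
    return (2.0 / pi) * atan(sqrt((1.0 + rho) / (1.0 - rho)))


def doublet_recursive(n: int, p: int, psi: float) -> float:
    """Average number of admissible dichotomies, iterating the doublet recursion."""

    @lru_cache(maxsize=None)
    def c(dim: int, pairs: int) -> float:
        if dim <= 0:
            return 0.0  # C_{0,p} = 0 (assumed also for negative dim)
        if pairs == 1:
            return 2.0 * psi if dim == 1 else 2.0  # boundary for a single doublet
        return (psi * c(dim, pairs - 1)
                + c(dim - 1, pairs - 1)
                + (1.0 - psi) * c(dim - 2, pairs - 1))

    return c(n, p)


def path_weight(i: int, p: int, psi: float) -> float:
    """K_{i,p}: total weight of directed walks climbing i levels in p - 1 steps."""
    total = 0.0
    for m in range(i // 2 + 1):      # m = number of +2 steps, weight (1 - psi)
        ones = i - 2 * m             # number of +1 steps, weight 1
        zeros = p - 1 - i + m        # number of +0 steps, weight psi
        if zeros < 0:
            continue                 # more levels than the walk can climb
        coeff = factorial(p - 1) // (factorial(m) * factorial(ones) * factorial(zeros))
        total += coeff * psi ** zeros * (1.0 - psi) ** m
    return total


def doublet_closed(n: int, p: int, psi: float) -> float:
    """Path-sum solution: 2 * sum_{i<=n-2} K_{i,p} + 2 * psi * K_{n-1,p}."""
    bulk = sum(path_weight(i, p, psi) for i in range(n - 1))
    return 2.0 * bulk + 2.0 * psi * path_weight(n - 1, p, psi)


if __name__ == "__main__":
    psi = psi2(0.3)
    print(doublet_recursive(6, 3, psi), doublet_closed(6, 3, psi))  # both 2^3 since p <= n/2
    print(isclose(doublet_recursive(5, 9, psi), doublet_closed(5, 9, psi)))
```

Setting `psi = 1.0` makes the two partners of each doublet coincide, and the closed form then reduces to Cover's count for single points.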

An estimate for the capacity, valid for large $n$ , can be obtained by approximating Eq. (12) as

$$C_{n,p} \approx 2 \sum_{i=0}^{n-1} K_{i,p} . \tag{13}$$

The capacity $\alpha_{\mathrm{c}}$ is such that

$$C_{p/\alpha_{\mathrm{c}},\, p} \approx 2^{p-1}, \tag{14}$$

i.e., it corresponds to the value of $n$ for which the sum of $K_{i,p}$ takes half its maximum value. The quantity $K_{i,p}$ can be interpreted as the partition function of an ensemble of directed random walks $\{\gamma_{j}\}_{j=1,\ldots,p}$ of $p-1$ steps, with the same boundary conditions as for $k=1$ , and the following transition probabilities: $P\left(\gamma_{j}\to\gamma_{j}\right)=\Psi_{2}/2$ , $P\left(\gamma_{j}\to\gamma_{j}+1\right)=1/2$ , $P\left(\gamma_{j}\to\gamma_{j}+2\right)=(1-\Psi_{2})/2$ . The normalization factor $2$ in the denominator is the sum of the weights $\Psi_{2}$ , $1$ , and $1-\Psi_{2}$ . The capacity therefore corresponds to the median of the distribution function of the walk’s endpoint $i$ . We approximate the median with the mean

$$\bar{\imath} = (p-1) \sum_{l=0}^{2} l\, P(\gamma_{j} \to \gamma_{j} + l), \tag{15}$$

which evaluates to $\bar{\imath}=(3/2-\Psi_{2})(p-1)$ , and finally we obtain

$$\alpha_{\mathrm{c}} \approx \frac{p-1}{\bar{\imath}} = \frac{2}{3 - 2\Psi_{2}(\rho)} . \tag{16}$$

This result, with $\Psi_{2}$ given by Eq. (5), was found in Lopez et al. (1995) by means of replica calculations, and appeared more recently in other contexts in Borra et al. (2019); Chung et al. (2016). Our derivation is somewhat more elementary, and naturally highlights the role of the geometric quantity $\Psi_{2}(\rho)$ .

Figure 2 compares the analytical formulas (12) and (16) with numerical results obtained by training a linear classifier with random doublets at varying dimension $n$ , number of points $p$ , and overlap $\rho$ . Equation (12) matches perfectly as expected. Equation (16) is surprisingly precise even at very small sizes; deviations are less than $1\%$ already for $n=5$ .
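The mean-step argument behind the capacity estimate for doublets is short enough to spell out in code. This is a sketch under the stated transition probabilities, with hypothetical helper names; it only reproduces the arithmetic of the estimate and is not the numerical experiment of Fig. 2.

```python
from math import atan, pi, sqrt


def psi2(rho: float) -> float:
    """Fraction of hyperplanes that do not separate a doublet with overlap rho."""
    return (2.0 / pi) * atan(sqrt((1.0 + rho) / (1.0 - rho)))


def step_probabilities(psi: float) -> dict:
    """Transition probabilities of the directed walk: step size -> probability."""
    return {0: psi / 2.0, 1: 0.5, 2: (1.0 - psi) / 2.0}


def mean_step(psi: float) -> float:
    """Average climb per step; the mean endpoint is (p - 1) times this value."""
    return sum(size * prob for size, prob in step_probabilities(psi).items())


def capacity_doublets(rho: float) -> float:
    """Capacity estimate (p - 1) / mean endpoint = 1 / mean_step."""
    return 1.0 / mean_step(psi2(rho))


if __name__ == "__main__":
    for rho in (-0.5, 0.0, 0.5, 0.9):
        print(rho, capacity_doublets(rho), 2.0 / (3.0 - 2.0 * psi2(rho)))
```

The two printed columns coincide because the mean step equals $3/2-\Psi_{2}$; the estimate approaches the single-point value $2$ as $\rho\to 1$.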

V Polytopes (multiplets, generic $k$ )

Let us now move to the general case where the data are structured in multiplets of $k$ points. We consider dichotomies of $k$ sets of points $\xi^{\mu}=\{\xi^{\mu}_{1},\ldots,\xi^{\mu}_{p}\}$ , with $\mu=1,\ldots,k$ . The $i$ th multiplet is the set $\xi_{i}=\{\xi_{i}^{1},\ldots,\xi_{i}^{k}\}$ . A dichotomy $\phi$ is admissible if the images of all $k$ partner points in each multiplet are equal: $\phi\left(\xi^{\mu}_{i}\right)=\phi\left(\xi^{\nu}_{i}\right)$ for all $\mu,\nu=1,\ldots,k$ , separately for all $i=1,\ldots,p$ . For clarity, we denote the number of admissible dichotomies by $C_{n,p}^{(k)}$ , as shown in Fig. 1.

A recursion relation for $C_{n,p}^{(k)}$ can be obtained by carefully extending the method used for the doublet case. At the $(p+1)$ th step, we consider the multiplet $\xi_{p+1}$ , composed of the $k$ points $\xi^{1}_{p+1},\ldots,\xi^{k}_{p+1}$ . Let us exclude momentarily the point $\xi_{p+1}^{1}$ , and suppose we know how to apply Cover’s method to the set of $k-1$ points

$$\bar{\xi}_{p+1} = \left\{\xi_{p+1}^{2}, \dots, \xi_{p+1}^{k}\right\} \subset \xi_{p+1} . \tag{17}$$

This would give an expression, let us call it

$$Q^{k-1}\left(C_{n,p}^{(k)}, C_{n-1,p}^{(k)}, \dots, C_{n-k+1,p}^{(k)}\right) . \tag{18}$$

The fact that $Q^{k-1}$ is a function of $C_{n-l,p}^{(k)}$ with $l=0,\ldots,k-1$ will be clear in the following. Intuitively, the case $k=1$ involves only $l=0$ and $l=1$ , the case $k=2$ adds $l=2$ because it uses the expression for $k=1$ in $n-1$ dimensions, and the same pattern repeats inductively up to $k-1$ points.

The quantity $Q^{k-1}$ represents the number of dichotomies of the set $\xi_{1}\cup\xi_{2}\cup\cdots\cup\xi_{p}\cup\bar{\xi}_{p+1}$ that are admissible on the first $p$ multiplets [meaning that $\phi(\xi_{i}^{\mu})=\phi(\xi_{i}^{\nu})$ for all $\mu,\nu=1,\ldots,k$ and all $i=1,\ldots,p$ ] and admissible on the $k-1$ points in $\bar{\xi}_{p+1}$ [meaning that $\phi(\xi_{p+1}^{\mu})=\phi(\xi_{p+1}^{\nu})$ for all $\mu,\nu=2,\ldots,k$ ]. A number $R_{n,p}^{k-1}$ of these dichotomies are realizable by a hyperplane passing through the excluded point $\xi_{p+1}^{1}$ , and are therefore all admissible. Of the remaining $Q^{k-1}(\ldots)-R_{n,p}^{k-1}$ ones, a fraction $\tilde{\Psi}_{k}$ assign the same value to $\xi_{p+1}^{1}$ and to the points in $\bar{\xi}_{p+1}$ , and are therefore admissible on the whole multiplet $\xi_{p+1}$ . Therefore,

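$$C_{n,p+1}^{(k)} = \tilde{\Psi}_{k} \left[Q^{k-1}(\dots) - R_{n,p}^{k-1}\right] + R_{n,p}^{k-1} . \tag{19}$$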

While $\Psi_{2}$ was a probability (over all possible hyperplanes), $\tilde{\Psi}_{k}$ is a conditional probability, namely the probability that a uniform vector $w$ on the sphere $S^{n-1}$ does not separate the multiplet $\xi_{p+1}$ , conditioned on the event that $w$ does not separate the set $\bar{\xi}_{p+1}$ :

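$$\tilde{\Psi}_{k} = \frac{\int_{S^{n-1}} dw \prod_{\mu,\nu=1}^{k} \theta\left((w \cdot \xi_{p+1}^{\mu})(w \cdot \xi_{p+1}^{\nu})\right)}{\int_{S^{n-1}} dw \prod_{\mu,\nu=2}^{k} \theta\left((w \cdot \xi_{p+1}^{\mu})(w \cdot \xi_{p+1}^{\nu})\right)} . \tag{20}$$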

The dependence of $\tilde{\Psi}_{k}$ on the relative positions of the points is discussed in the Appendix, where it is shown that (i) the calculation of $\tilde{\Psi}_{k}$ can be reduced from $n$ -dimensional to $k$ -dimensional integrals, and (ii) $\tilde{\Psi}_{k}$ depends on $n$ only through the $k(k-1)/2$ overlaps $\rho_{\mu\nu}$ between the points in a multiplet, which we fix for all multiplets:

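$$\rho_{\mu\nu} = \frac{1}{n}\, \xi_{i}^{\mu} \cdot \xi_{i}^{\nu}, \qquad i = 1, \dots, p; \quad \mu, \nu = 1, \dots, k . \tag{21}$$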

This property allows us to treat $\tilde{\Psi}_{k}$ as a constant in the recursions, thus simplifying the computations. Note that, since it is a conditional probability, $\tilde{\Psi}_{k}$ can be written as a ratio of probabilities:

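$$\tilde{\Psi}_{k}\left(\{\rho_{\mu\nu}\}_{\mu,\nu=1,\dots,k}\right) = \frac{\Psi_{k}\left(\{\rho_{\mu\nu}\}_{\mu,\nu=1,\dots,k}\right)}{\Psi_{k-1}\left(\{\rho_{\mu\nu}\}_{\mu,\nu=2,\dots,k}\right)}, \tag{22}$$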

where $\Psi_{k}$ depends on $k(k-1)/2$ overlaps between $k$ points, and denotes the fraction of hyperplanes not separating the $k$ points. This definition, together with the identity $\Psi_{1}=1$ , implies that the geometric quantity computed above for $k=2$ is $\Psi_{2}(\rho)=\tilde{\Psi}_{2}(\rho)$ .

The number $R_{n,p}^{k-1}$ can be obtained by again applying Cover’s method with respect to the set $\bar{\xi}_{p+1}$ , this time in $n-1$ dimensions because the hyperplane is constrained to pass through $\xi_{p+1}^{1}$ . Hence

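$$R_{n,p}^{k-1} = Q^{k-1}\left(C_{n-1,p}^{(k)}, C_{n-2,p}^{(k)}, \dots, C_{n-k,p}^{(k)}\right) . \tag{23}$$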

Finally, from Eqs. (19) and (23), the recursion for $C_{n,p}^{(k)}$ is

$$C_{n,p+1}^{(k)} = Q^{k}\left(C_{n,p}^{(k)}, C_{n-1,p}^{(k)}, \dots, C_{n-k,p}^{(k)}\right), \tag{24}$$

where the functions $Q^{k}$ (having $k+1$ arguments) satisfy the recursive functional relation

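$$Q^{k}(x_{n}, \dots, x_{n-k}) = \tilde{\Psi}_{k}\, Q^{k-1}(x_{n}, \dots, x_{n-k+1}) + (1 - \tilde{\Psi}_{k})\, Q^{k-1}(x_{n-1}, \dots, x_{n-k}), \tag{25}$$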

with the boundary $Q^{1}\left(x_{n},x_{n-1}\right)=x_{n}+x_{n-1}$ given by the form of Eq. (2) for a single point.

The recursion in $k$ can be solved, thus yielding again a recursion for $C_{n,p+1}^{(k)}$ in $n$ and $p$ only. Let us call $\theta_{k}(l)$ the coefficients in the solved recursion:

$$C_{n,p+1}^{(k)} = \sum_{l=0}^{k} \theta_{k}(l)\, C_{n-l,p}^{(k)} . \tag{26}$$

Equation (25) then becomes

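$$\theta_{k}(l) = \tilde{\Psi}_{k}\, \theta_{k-1}(l) + (1 - \tilde{\Psi}_{k})\, \theta_{k-1}(l-1), \tag{27}$$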

with boundaries $\theta_{1}(0)=\theta_{1}(1)=1$ and $\theta_{k}(-1)=\theta_{k}(k+1)=0$ . For instance, setting $k=2$ in Eqs. (26) and (27) recovers the recursion for doublets, Eq. (8), as expected. For $k=3$ one obtains

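$$C_{n,p+1}^{(3)} = \tilde{\Psi}_{3} \Psi_{2}\, C_{n,p}^{(3)} + \left[\tilde{\Psi}_{3} + \Psi_{2} (1 - \tilde{\Psi}_{3})\right] C_{n-1,p}^{(3)} + \left[\tilde{\Psi}_{3} (1 - \Psi_{2}) + (1 - \tilde{\Psi}_{3})\right] C_{n-2,p}^{(3)} + (1 - \tilde{\Psi}_{3})(1 - \Psi_{2})\, C_{n-3,p}^{(3)} . \tag{28}$$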

In the process of deriving the foregoing recursion relations we considered the points $\xi_{p+1}^{\mu}$ in a particular order, therefore explicitly breaking invariance under permutations within the multiplets. We restore the invariance a posteriori, by prescribing that all $\tilde{\Psi}_{l}$ (with $l\leq k$ ) be symmetrized with respect to all $k(k-1)/2$ overlaps. For instance, when $k=3$ , the $\Psi_{2}=\tilde{\Psi}_{2}$ appearing in Eq. (28) is to be understood as $[\Psi_{2}(\rho_{12})+\Psi_{2}(\rho_{13})+\Psi_{2}(\rho_{23})]/3$ . The validity of this prescription is supported by the numerical results shown in Fig. 3; see also the limit case (ii) in the Discussion below.
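The coefficients $\theta_{k}(l)$ of the solved recursion can be generated mechanically from their own recursion in $k$. The following is a minimal sketch: the function name `theta_coefficients` and the convention of passing the list $[\tilde{\Psi}_{2},\ldots,\tilde{\Psi}_{k}]$ are assumptions made for illustration, and the values of $\tilde{\Psi}_{l}$ are taken as given (already symmetrized) numbers.

```python
def theta_coefficients(psi_tilde: list) -> list:
    """Return [theta_k(0), ..., theta_k(k)] for psi_tilde = [Psi~_2, ..., Psi~_k].

    Uses theta_k(l) = Psi~_k * theta_{k-1}(l) + (1 - Psi~_k) * theta_{k-1}(l - 1)
    with theta_1(0) = theta_1(1) = 1 and theta outside its range equal to 0.
    """
    theta = [1.0, 1.0]  # theta_1(0), theta_1(1)
    for psi in psi_tilde:
        # padded[j] holds theta_{k-1}(j - 1), so indices -1 and k are zero.
        padded = [0.0] + theta + [0.0]
        theta = [psi * padded[l + 1] + (1.0 - psi) * padded[l]
                 for l in range(len(theta) + 1)]
    return theta


if __name__ == "__main__":
    psi_2, psi_3 = 0.25, 0.5
    print(theta_coefficients([psi_2]))         # doublets: [Psi_2, 1, 1 - Psi_2]
    print(theta_coefficients([psi_2, psi_3]))  # triplets: the four k = 3 coefficients
```

For $k=2$ the output is $[\Psi_{2},\,1,\,1-\Psi_{2}]$, the coefficients of the doublet recursion, and for $k=3$ the four entries match the bracketed factors written out for triplets. The coefficients always sum to $2$.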

The solution for $C_{n,p}^{(k)}$ (with the appropriate boundary conditions) can be obtained, for instance via generating functions, but we do not give it here. Instead, we focus on the capacity, which can be computed by the same approximate method used for $k=2$ [Eqs. (15) and (16)]:

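$$\alpha_{\mathrm{c}} = \frac{\sum_{l=0}^{k} \theta_{k}(l)}{\sum_{l=0}^{k} l\, \theta_{k}(l)} = \frac{\lambda_{0}(k)}{\lambda_{1}(k)}, \tag{29}$$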

where we have defined the moments

$$\lambda_{m}(k) = \sum_{l=0}^{k} l^{m}\, \theta_{k}(l) . \tag{30}$$

Summing Eq. (27) over $l$ shows that $\lambda_{0}(k)=\lambda_{0}(k-1)$ and therefore $\lambda_{0}(k)=\lambda_{0}(1)=2$ . By multiplying Eq. (27) by $l$ and summing over $l$ , one obtains $\lambda_{1}(k)=\lambda_{1}(k-1)+(1-\tilde{\Psi}_{k})\lambda_{0}(k-1)$ . The boundary condition $\lambda_{1}(1)=1$ then fixes the solution

$$\lambda_{1}(k) = 1 + 2 \sum_{l=2}^{k} \left(1 - \tilde{\Psi}_{l}\right) . \tag{31}$$

Finally, substituting $\lambda_{0}(k)$ and $\lambda_{1}(k)$ into Eq. (29) yields a remarkably simple formula for the capacity:

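$$\alpha_{\mathrm{c}} = \frac{2}{1 + 2 \sum_{l=2}^{k} \left(1 - \tilde{\Psi}_{l}\right)} = \frac{1}{k - \frac{1}{2} - \sum_{l=2}^{k} \tilde{\Psi}_{l}} . \tag{32}$$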

Figure 3 compares our theory with numerical computations in the case of triplets ( $k=3$ ), for triangles with three, two, and no sides of the same length. The agreement is excellent. The function $\tilde{\Psi}_{3}$ is a double integral (given in the Appendix), which we evaluate numerically.

VI Discussion

Our extension of Cover’s combinatorial technique to structured data yields closed expressions for $C_{n,p}^{(k)}$ at finite $n$ and $p$ , for any $k$ [we have written explicitly the result for $k=2$ in Eq. (12)]. Besides this, our main result is Eq. (32), which expresses the capacity as a simple function of the quantities $\tilde{\Psi}_{l}$ . Regarding these quantities, the merit of our method is twofold: first, the $\tilde{\Psi}_{l}$ ’s are revealed to be the only relevant parameters characterizing the linear separability of the multiplets; second, they have a very simple geometric interpretation in terms of probabilities.

We mention three simple limit cases of Eq. (32). (i) If all the points in each multiplet coincide, then $\tilde{\Psi}_{l}=1$ for all $l=1,\ldots,k$ and we recover the classic single-point result $\alpha_{\mathrm{c}}=2$ . (ii) When $k=3$ and two points of a triplet coincide, the overlaps are $\{\rho,\rho,1\}$ . Symmetrizing $\tilde{\Psi}_{3}(\rho,\rho,1)$ gives $\Psi_{3}(\rho,\rho,1)\left[2/\Psi_{2}(\rho)+1/\Psi_{2}(1)\right]/3$ , where $\Psi_{3}(\rho,\rho,1)$ is the fraction of hyperplanes not separating the three points. Clearly $\Psi_{3}(\rho,\rho,1)=\Psi_{2}(\rho)$ , and one recovers Eq. (16) for $k=2$ as expected. (iii) If $\Psi_{2}=0$ and $\tilde{\Psi}_{l}=0$ for all $l=3,\ldots,k$ , Eq. (32) gives $\alpha_{\mathrm{c}}=2/(2k-1)$ . This prediction matches that obtained in Chung et al. (2018a) for $(k-1)$ -dimensional linear manifolds. However, this turns out to be an unphysical limit in our framework, since the $\tilde{\Psi}_{l}$ cannot all vanish. For instance, for $k=3$ , equilateral triplets with overlaps $\{\rho,\rho,\rho\}$ lie on a linear manifold passing through the origin when $\rho$ takes its minimum value $\rho_{\triangle}=-1/2$ . The same happens for isosceles triplets $\{\rho,\rho/2,\rho/2\}$ at $\rho_{\vartriangle}=1-\sqrt{3}$ . Interestingly, the capacity evaluated at the respective minimum $\rho$ is $\alpha_{\mathrm{c}}\approx 0.46154$ for both geometries, to be compared to the value $\alpha_{\mathrm{c}}=2/5$ found for two-dimensional linear manifolds.
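These limit cases can be reproduced from the moments $\lambda_{0}$ and $\lambda_{1}$ of the coefficients $\theta_{k}(l)$. The sketch below uses hypothetical function names and takes the $\tilde{\Psi}_{l}$ as input numbers. For the equilateral triplet at $\rho_{\triangle}=-1/2$ it assumes $\tilde{\Psi}_{3}=0$, on the grounds that three unit vectors with pairwise overlap $-1/2$ sum to zero, so no hyperplane through the origin leaves all three strictly on one side; this step is our reading of the geometry rather than a formula quoted from the text.

```python
from math import atan, pi, sqrt


def psi2(rho: float) -> float:
    """Fraction of hyperplanes that do not separate a doublet with overlap rho."""
    return (2.0 / pi) * atan(sqrt((1.0 + rho) / (1.0 - rho)))


def theta_coefficients(psi_tilde: list) -> list:
    """[theta_k(0), ..., theta_k(k)] for psi_tilde = [Psi~_2, ..., Psi~_k]."""
    theta = [1.0, 1.0]
    for psi in psi_tilde:
        padded = [0.0] + theta + [0.0]  # padded[j] = theta_{k-1}(j - 1)
        theta = [psi * padded[l + 1] + (1.0 - psi) * padded[l]
                 for l in range(len(theta) + 1)]
    return theta


def capacity(psi_tilde: list) -> float:
    """Capacity as the ratio lambda_0(k) / lambda_1(k) of the theta moments."""
    theta = theta_coefficients(psi_tilde)
    lambda_0 = sum(theta)
    lambda_1 = sum(l * t for l, t in enumerate(theta))
    return lambda_0 / lambda_1


if __name__ == "__main__":
    print(capacity([1.0, 1.0]))          # (i) coinciding points: 2
    print(capacity([0.0, 0.0]))          # (iii) all Psi~ vanish, k = 3: 2/5
    print(capacity([psi2(-0.5), 0.0]))   # equilateral triplet at rho = -1/2
    # Isosceles triplet at rho = 1 - sqrt(3): symmetrized Psi_2 over the three overlaps.
    rho = 1.0 - sqrt(3.0)
    psi_sym = (psi2(rho) + 2.0 * psi2(rho / 2.0)) / 3.0
    print(capacity([psi_sym, 0.0]))
```

With $\Psi_{2}(-1/2)=1/3$ the equilateral case gives $\lambda_{1}=13/3$ and hence $\alpha_{\mathrm{c}}=6/13\approx 0.46154$. The isosceles case gives the same number because its three angles also add up to $2\pi$, so the symmetrized $\Psi_{2}$ is again $1/3$.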

Another interesting, albeit less elementary, limit case would be $k\to\infty$, taken in such a way that the points generate a sphere of radius $\kappa$; then Eq. (32) should reproduce the well-known capacity with margin $\kappa$ [Gardner (1987)], which has never been obtained by combinatorial methods [Chung et al. (2018a); Engel and Broeck (2001)].

Other applications and extensions of the theory appear possible. First, the capacity is written in Eq. (29) as a combination of the zeroth and first moments, but higher-order moments can be computed similarly and give access to other useful quantities. For instance, the second moment is related to the width of the crossover region separating the regime where $c_{n,p}\approx 1$ from the regime where $c_{n,p}\approx 0$. Second, it would be interesting to express our results for general (non-linear) separating surfaces, in the same spirit as Cover’s original work, and in view of useful applications.
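To illustrate what is meant by the crossover region, the sketch below uses only the classic single-point case ($k=1$), where Cover's count $C_{n,p}=2\sum_{i=0}^{n-1}\binom{p-1}{i}$ satisfies the recurrence $C_{n,p+1}=C_{n,p}+C_{n-1,p}$. It is not an implementation of the structured-data formulas. The function names and the thresholds $0.1$ and $0.9$ used to define a width are assumptions made for illustration.

```python
from math import comb

def cover_count(n, p):
    """Cover's number of linearly realizable dichotomies of p points
    in general position in n dimensions (single-point case, k=1)."""
    return 2 * sum(comb(p - 1, i) for i in range(n))

def cover_fraction(n, p):
    """Fraction c_{n,p} = C_{n,p} / 2^p of realizable dichotomies."""
    return cover_count(n, p) / 2 ** p

def crossover_width(n, lo=0.1, hi=0.9):
    """Width, in units of alpha = p/n, of the window where lo < c_{n,p} < hi.
    The thresholds are an arbitrary illustrative choice."""
    inside = [p for p in range(n, 4 * n + 1) if lo < cover_fraction(n, p) < hi]
    return len(inside) / n

for n in (50, 200):
    row = [round(cover_fraction(n, int(a * n)), 4) for a in (1.5, 2.0, 2.5)]
    print(f"n={n}: c at alpha=1.5, 2, 2.5 ->", row,
          "| width:", crossover_width(n))
```

The fraction equals $1/2$ at $p=2n$ for every $n$, and the window around $\alpha=2$ shrinks as $n$ grows, which is the behavior that a second-moment calculation would quantify.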

Acknowledgements.

We would like to dedicate this work to the memory of Bruno Bassetti. P.R. acknowledges funding by the European Union through the H2020 - MCIF Grant No. 766442.

Appendix A Computation of $\Psi_{k}$

Computation of $\Psi_{2}(\rho)$.

The fraction of hyperplanes assigning the same value to two points $\xi$ and $\bar{\xi}$ is given by:

[TABLE]

The normalization factor is

[TABLE]

where $\Omega_{n}$ is the solid angle in $n$ dimensions. Gram-Schmidt (GS) orthonormalization of $\xi$ and $\bar{\xi}$ yields

[TABLE]

where $\rho=\xi\cdot\bar{\xi}/n$ is the overlap between the two points. Inverting Eq. (35) gives

[TABLE]

Having orthonormalized the points allows us to safely exploit the $(n-2)$-dimensional spherical symmetry of the integral in the space orthogonal to $\xi_{1}$ and $\xi_{2}$, and to reduce it to an integral over the two-dimensional solid angle:

[TABLE]

which evaluates to the result in Eq. (5) and shows that $\Psi_{2}$ depends on the two points only through their overlap, $\Psi_{2}=\Psi_{2}(\rho)$.
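The computation above can be cross-checked by direct sampling. The following is a minimal sketch, assuming that Eq. (5) coincides with the standard result for random hyperplanes through the origin, namely that two points at angle $\arccos\rho$ receive the same label with probability $1-\arccos(\rho)/\pi$. The function names, the dimension `n`, and the sample size are illustrative choices, not taken from the paper.

```python
import numpy as np

def psi2_closed_form(rho):
    """Standard probability that a random hyperplane through the origin
    does not separate two points with overlap rho (assumed to match Eq. (5))."""
    return 1.0 - np.arccos(rho) / np.pi

def psi2_monte_carlo(rho, n=10, samples=200_000, seed=0):
    """Estimate Psi_2 by sampling isotropic weight vectors w."""
    rng = np.random.default_rng(seed)
    # Two points with squared norm n and overlap rho = xi . xi_bar / n,
    # written directly in a Gram-Schmidt-like basis.
    xi = np.zeros(n)
    xi[0] = np.sqrt(n)
    xi_bar = np.zeros(n)
    xi_bar[0] = rho * np.sqrt(n)
    xi_bar[1] = np.sqrt(n * (1.0 - rho ** 2))
    # Gaussian vectors are isotropic, so only their direction matters here.
    w = rng.standard_normal((samples, n))
    same_label = np.sign(w @ xi) == np.sign(w @ xi_bar)
    return float(same_label.mean())

for rho in (-0.5, 0.0, 0.7):
    print(rho, psi2_monte_carlo(rho), psi2_closed_form(rho))
```

Because only the components of $w$ in the plane spanned by the two points affect the labels, the estimate does not depend on `n`, in line with the reduction to a two-dimensional integral.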

Computation of $\Psi_{3}(\rho_{12},\rho_{13},\rho_{23})$.

Eq. (22) expresses the conditional probability $\tilde{\Psi}_{k}$ in terms of the probabilities $\Psi_{k}$. $\Psi_{k}$ is defined as the fraction of hyperplanes assigning the same value to the $k$ points $\xi_{1},\xi_{2},\dots,\xi_{k}$:

[TABLE]

with $\mathcal{N}$ given by Eq. (34). For $k=3$, the Gram-Schmidt procedure gives:

[TABLE]

where $\rho_{\mu\nu}=\xi^{\mu}\cdot\xi^{\nu}/n$ are the overlaps, and $g=(\rho_{23}-\rho_{12}\rho_{13})/\sqrt{1-\rho_{12}^{2}}$. Again, thanks to the spherical symmetry in the space orthogonal to the $\xi^{\mu}$’s, the result is an integral over the $3$-dimensional solid angle:

[TABLE]

where the measure $\mathrm{d}\Omega_{3}$ can be expressed via the angles $\phi_{1}$ and $\phi_{2}$, with $x_{1}=\sin\phi_{1}\cos\phi_{2}$, $x_{2}=\sin\phi_{1}\sin\phi_{2}$ and $x_{3}=\cos\phi_{1}$. As above, this computation shows that $\Psi_{3}$ depends on the points only through their overlaps, $\Psi_{3}=\Psi_{3}(\rho_{12},\rho_{13},\rho_{23})$. The results presented in Fig. 3 have been obtained by numerically integrating Eq. (39).
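The structure of this computation can be sketched as follows. This is not the authors' code and does not reproduce Eq. (39) verbatim. It assumes that the Gram-Schmidt coefficients are the entries of the Cholesky factor of the overlap matrix (which reproduces the quantity $g$ defined above), and that the integrand is the indicator that all three local fields share the same sign. The grid sizes and function names are illustrative. The closed form used as a cross-check is the standard identity for sign agreement of three jointly Gaussian variables and is not taken from the source.

```python
import numpy as np

def overlap_matrix(r12, r13, r23):
    return np.array([[1.0, r12, r13],
                     [r12, 1.0, r23],
                     [r13, r23, 1.0]])

def psi3_quadrature(r12, r13, r23, n1=600, n2=1200):
    """Midpoint-rule integral over the unit sphere of the indicator
    that the three points receive the same label."""
    # Rows of L are the coordinates of the points in the Gram-Schmidt basis.
    L = np.linalg.cholesky(overlap_matrix(r12, r13, r23))
    phi1 = (np.arange(n1) + 0.5) * np.pi / n1
    phi2 = (np.arange(n2) + 0.5) * 2.0 * np.pi / n2
    P1, P2 = np.meshgrid(phi1, phi2, indexing="ij")
    x = np.stack([np.sin(P1) * np.cos(P2),
                  np.sin(P1) * np.sin(P2),
                  np.cos(P1)])               # shape (3, n1, n2)
    h = np.tensordot(L, x, axes=1)           # local fields, shape (3, n1, n2)
    same = np.all(h > 0, axis=0) | np.all(h < 0, axis=0)
    weight = np.sin(P1) * (np.pi / n1) * (2.0 * np.pi / n2)
    return float((same * weight).sum() / (4.0 * np.pi))

def psi3_closed_form(r12, r13, r23):
    """Standard identity: 1 - (sum of pairwise angles) / (2 pi)."""
    angles = np.arccos(r12) + np.arccos(r13) + np.arccos(r23)
    return 1.0 - angles / (2.0 * np.pi)

r12, r13, r23 = 0.3, 0.2, 0.5
L = np.linalg.cholesky(overlap_matrix(r12, r13, r23))
g = (r23 - r12 * r13) / np.sqrt(1.0 - r12 ** 2)
print("Cholesky entry vs g:", L[2, 1], g)
print("quadrature vs closed form:",
      psi3_quadrature(r12, r13, r23), psi3_closed_form(r12, r13, r23))
```

The Cholesky factorization requires a non-degenerate triplet, so this sketch should not be called at the limiting overlaps where the three points become linearly dependent.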

The procedure for $k=2,3$ can be extended to $k>3$. The final result has the following structure:

[TABLE]

where the functions $v_{\alpha}$ appearing in the $\theta$’s can be systematically derived in a similar way from the GS procedure. This shows that $\tilde{\Psi}_{k}$, related to $\Psi_{k}$ by Eq. (22), depends in general on the $\xi^{\mu}$’s only through the overlaps $\rho_{\mu\nu}$, and that it can be written in terms of $k$-dimensional integrals.
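The dependence on the overlaps alone suggests a simple numerical estimator for generic $k$. The sketch below is an illustration under stated assumptions, not the method used in the paper: for an isotropic Gaussian weight vector, the $k$ local fields are jointly Gaussian with covariance equal to the overlap matrix, so they can be sampled directly without constructing the points. The function name `psi_k_monte_carlo`, the sample size, and the seed are hypothetical choices.

```python
import numpy as np

def psi_k_monte_carlo(overlaps, samples=200_000, seed=0):
    """Estimate Psi_k, the fraction of hyperplanes giving the same label
    to all k points, from the k x k overlap matrix alone."""
    G = np.asarray(overlaps, dtype=float)
    rng = np.random.default_rng(seed)
    # Local fields h_mu = w . xi^mu for Gaussian w have covariance G
    # (up to a positive factor that does not affect the signs).
    h = rng.multivariate_normal(np.zeros(len(G)), G, size=samples)
    same = np.all(h > 0, axis=1) | np.all(h < 0, axis=1)
    return float(same.mean())

rho = 0.3
# Triplet with two coincident points: overlaps {rho, rho, 1}.
psi3_degenerate = psi_k_monte_carlo([[1.0, rho, rho],
                                     [rho, 1.0, 1.0],
                                     [rho, 1.0, 1.0]])
psi2 = psi_k_monte_carlo([[1.0, rho],
                          [rho, 1.0]])
# Equilateral triplet at its minimum overlap rho = -1/2.
psi3_equilateral_min = psi_k_monte_carlo([[1.0, -0.5, -0.5],
                                          [-0.5, 1.0, -0.5],
                                          [-0.5, -0.5, 1.0]])
print(psi3_degenerate, psi2, psi3_equilateral_min)
```

Sampling the fields from the overlap matrix also handles degenerate configurations, such as coincident points, where an explicit orthonormalization would break down. The two checks correspond to the statement $\Psi_{3}(\rho,\rho,1)=\Psi_{2}(\rho)$ and to the vanishing of $\Psi_{3}$ for triplets lying on a plane through the origin.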

Bibliography

1. Krizhevsky et al. (2012). A. Krizhevsky, I. Sutskever, and G. E. Hinton, in Advances in Neural Information Processing Systems 25, edited by F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger (Curran Associates, Inc., 2012) pp. 1097–1105.
2. Goodfellow et al. (2014). I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, in Advances in Neural Information Processing Systems 27, edited by Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger (Curran Associates, Inc., 2014) pp. 2672–2680.
3. Goodfellow et al. (2016). I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning (The MIT Press, 2016).
4. Baldassi et al. (2016). C. Baldassi, C. Borgs, J. T. Chayes, A. Ingrosso, C. Lucibello, L. Saglietti, and R. Zecchina, Proceedings of the National Academy of Sciences 113, E7655 (2016), https://www.pnas.org/content/113/48/E7655.full.pdf.
5. Baity-Jesi et al. (2018). M. Baity-Jesi, L. Sagun, M. Geiger, S. Spigler, G. B. Arous, C. Cammarota, Y. LeCun, M. Wyart, and G. Biroli, in Proceedings of the 35th International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 80, edited by J. Dy and A. Krause (PMLR, Stockholmsmässan, Stockholm Sweden, 2018) pp. 314–323.
6. Cover (1965). T. M. Cover, IEEE Transactions on Electronic Computers EC-14, 326 (1965).
7. Brunel et al. (2004). N. Brunel, V. Hakim, P. Isope, J.-P. Nadal, and B. Barbour, Neuron 43, 745 (2004).
8. Engel and Broeck (2001). A. Engel and C. P. L. V. d. Broeck, Statistical Mechanics of Learning (Cambridge University Press, New York, NY, USA, 2001).
