Counting the learnable functions of structured data
Pietro Rotondo, Marco Cosentino Lagomarsino, Marco Gherardi

TL;DR
This paper extends Cover's function counting theorem to structured data, providing analytical tools to understand the capacity of neural networks in recognizing invariant, non-pointlike patterns such as objects with transformations.
Contribution
It develops a new function counting theory for structured data, deriving formulas for the number of dichotomies and classifier capacity for complex, correlated patterns.
Findings
Derived analytical expressions for dichotomies of structured data.
Obtained a closed-form formula for classifier capacity on polytopes.
Enhanced theoretical understanding of neural network generalization and invariant recognition.
Pietro Rotondo
School of Physics and Astronomy, University of Nottingham, Nottingham, NG7 2RD, UK
Centre for the Mathematics and Theoretical Physics of Quantum Non-equilibrium Systems, University of Nottingham, Nottingham NG7 2RD, UK
Marco Cosentino Lagomarsino
Università degli Studi di Milano, Via Celoria 16, 20133 Milano, Italy
I.N.F.N. Sezione di Milano
Marco Gherardi
Università degli Studi di Milano, Via Celoria 16, 20133 Milano, Italy
I.N.F.N. Sezione di Milano
Abstract
Cover’s function counting theorem is a milestone in the theory of artificial neural networks. It provides an answer to the fundamental question of determining how many binary assignments (dichotomies) of points in dimensions can be linearly realized. Regrettably, it has proved hard to extend the same approach to more advanced problems than the classification of points. In particular, an emerging necessity is to find methods to deal with structured data, and specifically with non-pointlike patterns. A prominent case is that of invariant recognition, whereby identification of a stimulus is insensitive to irrelevant transformations on the inputs (such as rotations or changes in perspective in an image). An object is therefore represented by an extended perceptual manifold, consisting of inputs that are classified similarly. Here, we develop a function counting theory for structured data of this kind, by extending Cover’s combinatorial technique, and we derive analytical expressions for the average number of dichotomies of generically correlated sets of patterns. As an application, we obtain a closed formula for the capacity of a binary classifier trained to distinguish general polytopes of any dimension. These results may help extend our theoretical understanding of generalization, feature extraction, and invariant object recognition by neural networks.
I Introduction
Machine learning and deep learning demonstrate astonishing results in applications Krizhevsky et al. (2012); Goodfellow et al. (2014, 2016), sometimes beyond our theoretical reach. This provides a formidable challenge for theorists who wish to develop a framework for their understanding Baldassi et al. (2016); Baity-Jesi et al. (2018). A landmark achievement in learning theory is Cover’s function counting theorem, which counts the number of binary classification functions, or “dichotomies”, that can be realized by given architectures Cover (1965). This foundational result made it possible to quantify the complexity of a learning model and the advantage gained in using non-linear kernels, provided a benchmark for the performance of both artificial and natural neural networks, and is a handy tool for several applications Brunel et al. (2004); Engel and Broeck (2001); Hertz et al. (1991); Opper and Kinzel (1996); Ganguli and Sompolinsky (2010); Chung et al. (2018a).
Other commonly used methods in this area come from statistical physics (pioneered by E. Gardner Gardner (1987); Gardner and Derrida (1988)). With respect to these, Cover’s method has the advantage of offering a simple geometric insight and of being valid at a finite number of dimensions, while statistical physics methods typically apply in the “thermodynamic limit” of infinite dimensions. Yet, despite its benefits and relative simplicity, Cover’s analytical technique has so far eluded efforts to extend it Engel and Broeck (2001).
Uncorrelated random patterns are commonly taken as a simplifying assumption for the theoretical investigation of artificial neural networks. Yet, it is becoming apparent that providing a theoretical framework that includes structure in the input data is essential. This need is emerging in different contexts: (a) The invariant representation of perceptual stimuli by brains (e.g., the coherent perception of differently rotated and rescaled objects in vision, or the recognition of the same sound in different acoustic environments in audition) prompted the formalization of perceptual manifolds as extended patterns Tenenbaum et al. (2000); Seung and Lee (2000); Roweis and Saul (2000); Ranzato et al. (2007); Goodfellow et al. (2009); Anselmi et al. (2016); Chung et al. (2016, 2018a, 2018b). Perceptual manifolds are the regions in input space corresponding to all variations of a stimulus that do not modify the object’s identification. (b) The discovery of spatial maps in rodent brains O’Keefe and Dostrovsky (1971) motivated extensions of associative memory models to attractors that are not point-like but occupy a region in configuration space Cocco et al. (2018). (c) The problem of local generalization and robustness to noise, a main theme of machine learning, can be cast as a problem of non-pointlike patterns Szegedy et al. (2013); Novak et al. (2018); Borra et al. (2019). (d) The description of the input patterns as modular combinations of elementary features (a well studied aspect of empirical datasets Pang and Maslov (2013); Mazzolini et al. (2018)), was shown to induce a multi-layer structure in certain network architectures Mézard (2017).
Here, we develop a theory that extends Cover’s approach to non-pointlike patterns, by counting only those dichotomies that assign the same label to different variants of the same input. Our theory (i) enables the exact computation of the (average) number of dichotomies of structured data, (ii) gives direct access to quantities at finite size, and (iii) naturally disentangles combinatorial and geometric aspects, thus lending itself to further generalizations.
II Number of admissible dichotomies
The central quantity obtained by Cover’s function counting method is the number $C_{n,p}$ of linearly realizable dichotomies of $p$ points in $n$ dimensions. Let $X_p=\{\xi_1,\dots,\xi_p\}$ denote the set of points. A dichotomy of this set is a function $\phi: X_p\to\{0,1\}$ mapping each point to its binary label (see Fig. 1). A linearly realizable dichotomy is identified by a vector $W\in\mathbb{R}^n$:
$$\phi(\xi_i)=\theta(W\cdot\xi_i),\qquad i=1,\dots,p, \tag{1}$$
where $\theta$ is the Heaviside theta function. The hyperplane perpendicular to the vector $W$ separates the space into two half-spaces, where the points mapped to $0$ and $1$ lie respectively. There are $2^p$ dichotomies, but only $C_{n,p}$ of them are linearly realizable. We focus on linearly realizable dichotomies, and will therefore omit this specification when it is clear from the context.
It turns out that $C_{n,p}$ does not depend on the $\xi_i$’s, as long as they are in general position (meaning that no subset of $n$ or fewer points is linearly dependent) Cover (1965). Structure in the data may thus appear not to affect $C_{n,p}$ at all. However, in general we do not wish to admit all possible dichotomies. For instance, among the hand-written digits in MNIST we could choose to admit dichotomies separating “1” and “I”, but not two similar-looking “0”s. Our definition of structure is based on such a restriction: a data set is qualified as structured whenever only a subset of all possible dichotomies is considered admissible. $C_{n,p}$ will then be the number of admissible dichotomies that can be realized linearly.
Here we focus on a rather general definition of admissibility, inspired by the literature cited above. We consider datasets of $kp$ points, structured as $p$ multiplets of $k$ points each. A dichotomy is admissible if different points in the same multiplet are classified coherently, i.e., if $\phi$ is constant on each multiplet. We will restrict the points to lie on the unit sphere $S^{n-1}$, meaning that $|\xi|^2=1$, but this technical requirement can be easily relaxed. (A useful consequence of this is that setting the overlap between two points determines their distance.) The ensemble we consider fixes all the overlaps between the points in a multiplet, equally for all multiplets, but the relative positions and orientations of the multiplets are unspecified. The quantities we will compute are averages over all possible positions and orientations of the multiplets.
Because of the convexity of linear separability, separating the multiplets is equivalent to separating the polytopes whose vertices are the points in the multiplets. (These polytopes play the role of the perceptual manifolds of Ref. Chung et al. (2018a).) For instance, $k=2$ corresponds to segments, $k=3$ to triangles, $k=4$ to tetrahedra.
III Single points ($k=1$)
Let us first outline Cover’s original computation. Imagine starting with $p$ points $X_p=\{\xi_1,\dots,\xi_p\}$ and adding the $(p+1)$th point $\xi_{p+1}$ to $X_p$. For each dichotomy $\phi$ of the $p$ points one of two possibilities is satisfied: either (i) $\phi$ can be realized by a hyperplane passing through $\xi_{p+1}$ (equivalently, $\phi$ can be realized by a vector $W$ such that $W\cdot\xi_{p+1}=0$), or (ii) it cannot. If (i) is true, then $W$ can be rotated infinitesimally to yield both $\phi(\xi_{p+1})=0$ and $\phi(\xi_{p+1})=1$; otherwise, the half-space where $\xi_{p+1}$ lies is fixed. Therefore, for each dichotomy $\phi$ of $X_p$ satisfying (i) there are 2 different dichotomies $\phi'$ and $\phi''$ of $X_{p+1}$ agreeing with $\phi$ on the common points [i.e., such that $\phi'(\xi_i)=\phi''(\xi_i)=\phi(\xi_i)$ for $i\le p$]. If the number of dichotomies satisfying (i) is $M_{n,p}$, then the number of those satisfying (ii) is $C_{n,p}-M_{n,p}$, and one can write $C_{n,p+1}=2M_{n,p}+(C_{n,p}-M_{n,p})=C_{n,p}+M_{n,p}$. The condition (i) is in the form of a single linear constraint, therefore $M_{n,p}$ is the number of dichotomies of $p$ points in $n-1$ dimensions, $M_{n,p}=C_{n-1,p}$. Thus $C_{n,p}$ satisfies the recursion
$$C_{n,p+1}=C_{n,p}+C_{n-1,p}, \tag{2}$$
with boundary conditions $C_{n,1}=2$ (a single point can be classified either way) and $C_{0,p}=0$.
The solution to Eq. (2) can be obtained by observing that the contribution of the boundary value $C_{n-k,1}$ to $C_{n,p}$ is given by the number of directed paths on the $(n,p)$ lattice that start from $(n-k,1)$ and end in $(n,p)$, where at each step $(\Delta n,\Delta p)$ can be either $(0,1)$ or $(1,1)$. The number of such paths is simply the binomial coefficient $\binom{p-1}{k}$. Summing over the boundary gives
$$C_{n,p}=2\sum_{k=0}^{n-1}\binom{p-1}{k}, \tag{3}$$
where it is assumed that $\binom{p-1}{k}=0$ whenever $k>p-1$.
Let us consider the fraction of linearly realizable dichotomies $c_{n,p}=C_{n,p}/2^p$. For finite $n$ and $p$, the capacity $\alpha_c$ can be defined as the ratio $p/n$ at which half of all dichotomies can be realized: $c_{n,\alpha_c n}=1/2$. From the explicit expression (3) one sees that $c_{n,p}=1$ if $p\le n$, $c_{n,p}\to 0$ for $p\to\infty$, and $c_{n,2n}=1/2$, which pinpoints the well-known capacity $\alpha_c=2$.
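The counting argument of this section can be checked with a short script. The following is a minimal sketch, assuming the recursion has Cover's standard form C(n, p+1) = C(n, p) + C(n-1, p) with boundaries C(n, 1) = 2 and C(0, p) = 0, and the closed form 2 Σ_{k<n} binom(p-1, k). The function names `cover_recursive` and `cover_closed` are illustrative and do not come from the paper.

```python
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def cover_recursive(n: int, p: int) -> int:
    """Number of linearly realizable dichotomies of p points in n dimensions,
    computed from the recursion C(n, p) = C(n, p-1) + C(n-1, p-1)."""
    if n <= 0:
        return 0  # boundary: nothing can be realized in zero dimensions
    if p == 1:
        return 2  # boundary: a single point can be labeled either way
    return cover_recursive(n, p - 1) + cover_recursive(n - 1, p - 1)


def cover_closed(n: int, p: int) -> int:
    """Closed form C(n, p) = 2 * sum_{k=0}^{n-1} binom(p-1, k).
    math.comb returns 0 when k > p-1, matching the convention in the text."""
    return 2 * sum(comb(p - 1, k) for k in range(n))


if __name__ == "__main__":
    n = 5
    for p in (3, 5, 10, 20):
        fraction = cover_closed(n, p) / 2 ** p
        print(f"n={n} p={p} C={cover_closed(n, p)} fraction={fraction:.4f}")
```

Evaluating the fraction at p = 2n gives exactly one half for every n, which is the finite-size statement behind the capacity value of 2.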
IV Segments (doublets, $k=2$)
The first step towards the general problem is the case where data are structured as pairs of points. Alongside the set of points $X_p=\{\xi_1,\dots,\xi_p\}$, let us consider another set $\bar X_p=\{\bar\xi_1,\dots,\bar\xi_p\}$. The multiplets discussed above are the doublets $\{\xi_i,\bar\xi_i\}$. Each doublet is such that the overlap between the two partners is fixed:
$$\xi_i\cdot\bar\xi_i=\rho \tag{4}$$
for all $i=1,\dots,p$. The admissible dichotomies are those for which $\phi(\xi_i)=\phi(\bar\xi_i)$ for all $i$; their total number is $2^p$.
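The recursion step now corresponds to the addition of the $(p+1)$th doublet $\{\xi_{p+1},\bar\xi_{p+1}\}$. Repeating Cover’s reasoning for the point $\xi_{p+1}$ alone gives a number of dichotomies equal to $C_{n,p}+C_{n-1,p}$. This is the number of dichotomies of the set $X_p\cup\bar X_p\cup\{\xi_{p+1}\}$ that are admissible on the first $p$ doublets [meaning that $\phi(\xi_i)=\phi(\bar\xi_i)$ for all $i\le p$]. A number $R_{n,p}$ of such dichotomies are realizable by a hyperplane passing through the point $\bar\xi_{p+1}$. These are all admissible, thanks to the freedom in the choice of $\phi(\bar\xi_{p+1})$ by an infinitesimal adjustment of the hyperplane. Among the other $C_{n,p}+C_{n-1,p}-R_{n,p}$ dichotomies, on average, a fraction $\psi_2$ will happen to assign the same label to $\xi_{p+1}$ and $\bar\xi_{p+1}$. $\psi_2$ can be computed as the fraction of hyperplanes keeping $\xi_{p+1}$ and $\bar\xi_{p+1}$ in the same half-space; the calculation is carried out in the Appendix. Importantly, $\psi_2$ is a function of the overlap $\rho$ alone:
$$\psi_2(\rho)=\frac{2}{\pi}\arctan\sqrt{\frac{1+\rho}{1-\rho}}=1-\frac{\arccos\rho}{\pi}. \tag{5}$$
Note that $\psi_2(\rho)\to 1$ for $\rho\to 1$, as expected from its definition. The foregoing argument leads us to estimate the total number of admissible dichotomies as
$$C_{n,p+1}=\psi_2\,\big(C_{n,p}+C_{n-1,p}-R_{n,p}\big)+R_{n,p}. \tag{6}$$
In order to compute $R_{n,p}$ it suffices to repeat Cover’s reasoning with respect to the point $\xi_{p+1}$, this time in $n-1$ dimensions because of the constraint imposed by the hyperplane passing through $\bar\xi_{p+1}$, thereby obtaining
$$R_{n,p}=C_{n-1,p}+C_{n-2,p}. \tag{7}$$
Finally the recursion for $C_{n,p}$ reads
$$C_{n,p+1}=\psi_2\,C_{n,p}+C_{n-1,p}+(1-\psi_2)\,C_{n-2,p}. \tag{8}$$
The boundary conditions are now slightly different from those for the case $k=1$ in Eq. (2). In fact, in dimension $n=1$ the number of admissible dichotomies of a single pair of points ($p=1$) is $2$ only when both points lie on the same half-line, otherwise it is $0$; on average, it is $2\psi_2$. The boundary conditions are then
$$C_{n,p}=0\ \ \text{for}\ n\le 0,\qquad C_{1,1}=2\psi_2,\qquad C_{n,1}=2\ \ \text{for}\ n\ge 2. \tag{9}$$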
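To find the solution of the recursion (8), similarly to the single point case, consider all the directed paths propagating from the boundary $(n-k,1)$ to $(n,p)$, where at each step $(\Delta n,\Delta p)$ can be $(0,1)$, $(1,1)$, or $(2,1)$. Contrary to the one point case, different paths with the same endpoints can now give different contributions to $C_{n,p}$, since the three types of steps correspond to three different factors ($\psi_2$, $1$, and $1-\psi_2$ respectively). The contribution of the paths from $(n-k,1)$ to $(n,p)$ that contain $j$ steps of type $(2,1)$ is
$$C_{n-k,1}\binom{p-1}{p-1-k+j,\;k-2j,\;j}\,\psi_2^{\,p-1-k+j}\,(1-\psi_2)^{j}, \tag{10}$$
where the multinomial coefficient is defined as
$$\binom{a+b+c}{a,\;b,\;c}=\frac{(a+b+c)!}{a!\,b!\,c!} \tag{11}$$
(with the obvious analytical extension for negative factorials, so that the coefficient vanishes whenever one of $a$, $b$, $c$ is negative). Summation over the non-zero boundary yields the number of admissible dichotomies
$$C_{n,p}=\sum_{k=0}^{n-1}C_{n-k,1}\sum_{j=0}^{\lfloor k/2\rfloor}\binom{p-1}{p-1-k+j,\;k-2j,\;j}\,\psi_2^{\,p-1-k+j}\,(1-\psi_2)^{j}, \tag{12}$$
with the boundary values $C_{n-k,1}$ given by Eq. (9).
It is easy to see (by the multinomial theorem) that $C_{n,p}=2^p$ if $p\le n/2$; this locates the usual Vapnik-Chervonenkis dimension Vapnik and Chervonenkis (1971), $d_{\mathrm{VC}}=n$, as the total number of points is $2p$.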
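The doublet recursion lends itself to a direct numerical check. The sketch below assumes the recursion takes the form C(n, p+1) = ψ₂ C(n, p) + C(n-1, p) + (1-ψ₂) C(n-2, p) implied by the argument of this section, with boundary values C(1, 1) = 2ψ₂, C(n≥2, 1) = 2 and C(n≤0, p) = 0. It also assumes ψ₂(ρ) = 1 - arccos(ρ)/π, which is the probability that a random hyperplane through the origin does not separate two unit vectors with overlap ρ. The names `psi2` and `doublet_count` are hypothetical.

```python
from functools import lru_cache
from math import acos, pi


def psi2(rho: float) -> float:
    """Fraction of hyperplanes through the origin that do not separate
    two unit vectors with overlap rho (assumed form, see lead-in)."""
    return 1.0 - acos(rho) / pi


def doublet_count(n: int, p: int, rho: float) -> float:
    """Average number of admissible, linearly realizable dichotomies
    of p doublets in n dimensions, from the three-term recursion."""
    psi = psi2(rho)

    @lru_cache(maxsize=None)
    def count(m: int, q: int) -> float:
        if m <= 0:
            return 0.0                           # boundary: no dimensions left
        if q == 1:
            return 2.0 * psi if m == 1 else 2.0  # boundary for a single doublet
        return (psi * count(m, q - 1)
                + count(m - 1, q - 1)
                + (1.0 - psi) * count(m - 2, q - 1))

    return count(n, p)


if __name__ == "__main__":
    for p in (1, 2, 3, 4, 6):
        print(p, doublet_count(6, p, rho=0.3), 2 ** p)
```

As long as 2p ≤ n the count equals 2^p for any overlap, consistent with the statement about the Vapnik-Chervonenkis dimension; at ρ = 1 the doublets collapse to single points and the count reduces to Cover's.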
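An estimate for the capacity, valid for large $n$, can be obtained by approximating Eq. (12) as
$$C_{n,p}\approx 2\sum_{k=0}^{n-1}Z_{k,p},\qquad Z_{k,p}=\sum_{j=0}^{\lfloor k/2\rfloor}\binom{p-1}{p-1-k+j,\;k-2j,\;j}\,\psi_2^{\,p-1-k+j}\,(1-\psi_2)^{j}, \tag{13}$$
that is, by replacing the boundary value $2\psi_2$ with $2$. The capacity $\alpha_c=p/n$ is such that
$$\frac{1}{2^{p-1}}\sum_{k=0}^{n-1}Z_{k,p}=\frac{1}{2}, \tag{14}$$
i.e., it corresponds to the value of $n$ for which the sum of $Z_{k,p}$ takes half its maximum value. The quantity $Z_{k,p}/2^{p-1}$ can be interpreted as the partition function of an ensemble of directed random walks of $p-1$ steps, with the same boundary conditions as for $C_{n,p}$, and the following transition probabilities: $P(\Delta n=0)=\psi_2/2$, $P(\Delta n=1)=1/2$, $P(\Delta n=2)=(1-\psi_2)/2$. The normalization factor $2$ at the denominator is the sum of the weights $\psi_2$, $1$, and $1-\psi_2$. The capacity therefore corresponds to the median of the distribution function of the walk’s endpoint $n$. We approximate the median with the mean
$$\langle n\rangle\simeq p\sum_{\Delta n=0}^{2}\Delta n\,P(\Delta n), \tag{15}$$
which evaluates to $(3/2-\psi_2)\,p$, and finally we obtain
$$\alpha_c=\frac{2}{3-2\psi_2}. \tag{16}$$
This result, with $\psi_2$ given by Eq. (5), was found in Lopez et al. (1995) by means of replica calculations, and appeared more recently in other contexts in Borra et al. (2019); Chung et al. (2016). Our derivation is somewhat more elementary, and naturally highlights the role of the geometric quantity $\psi_2$.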
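The quality of the median-equals-mean approximation can be probed numerically. The sketch below assumes the doublet recursion C(n, p+1) = ψ₂ C(n, p) + C(n-1, p) + (1-ψ₂) C(n-2, p) with boundaries C(1, 1) = 2ψ₂ and C(n≥2, 1) = 2, and compares the finite-size crossing point of the fraction C(n, p)/2^p with the estimate 2/(3 - 2ψ₂) that follows from a mean step of 3/2 - ψ₂. All function names are illustrative.

```python
def doublet_fraction(n: int, p: int, psi: float) -> float:
    """Fraction C(n, p) / 2^p of admissible dichotomies of p doublets
    that are linearly realizable in n >= 1 dimensions."""
    # row[m] holds C(m, q) / 2^q for the current number of doublets q
    row = [0.0, psi] + [1.0] * (n - 1)        # q = 1
    for _ in range(p - 1):
        new = [0.0] * (n + 1)
        for m in range(1, n + 1):
            two_below = row[m - 2] if m >= 2 else 0.0
            # one recursion step, divided by 2 to keep the normalization
            new[m] = 0.5 * (psi * row[m] + row[m - 1] + (1.0 - psi) * two_below)
        row = new
    return row[n]


def numerical_capacity(n: int, psi: float) -> float:
    """Smallest p/n at which the realizable fraction is no longer above 1/2."""
    p = 1
    while doublet_fraction(n, p, psi) > 0.5:
        p += 1
    return p / n


def doublet_capacity(psi: float) -> float:
    """Large-n estimate obtained by approximating the median with the mean."""
    return 2.0 / (3.0 - 2.0 * psi)


if __name__ == "__main__":
    for psi in (0.5, 0.75, 1.0):
        print(psi, numerical_capacity(60, psi), doublet_capacity(psi))
```

The value ψ₂ = 1 corresponds to coincident partners, where the estimate returns the single-point capacity 2.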
Figure 2 compares the analytical formulas (12) and (16) with numerical results obtained by training a linear classifier with random doublets at varying dimension , number of points , and overlap . Equation (12) matches perfectly as expected. Equation (16) is surprisingly precise even at very small sizes; deviations are less than already for .
V Polytopes (multiplets, generic $k$)
Let us now move to the general case where the data are structured in multiplets of $k$ points. We consider dichotomies of $k$ sets of points $X^j_p=\{\xi^j_1,\dots,\xi^j_p\}$, with $j=1,\dots,k$. The $i$th multiplet is the set $\{\xi^1_i,\dots,\xi^k_i\}$. A dichotomy is admissible if the images of all partner points in each multiplet are equal: $\phi(\xi^j_i)=\phi(\xi^l_i)$ for all $j,l$, separately for all $i$. For clarity, we denote the number of admissible dichotomies by $C^{(k)}_{n,p}$, as shown in Fig. 1.
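A recursion relation for $C^{(k)}_{n,p}$ can be obtained by carefully extending the method used for the doublet case. At the $(p+1)$th step, we consider the multiplet $\Xi_{p+1}$, composed of the points $\xi^1_{p+1},\dots,\xi^k_{p+1}$. Let us exclude momentarily the point $\xi^k_{p+1}$, and suppose we know how to apply Cover’s method to the set of points
$$X^1_p\cup\dots\cup X^k_p\cup\hat\Xi_{p+1},\qquad \hat\Xi_{p+1}=\{\xi^1_{p+1},\dots,\xi^{k-1}_{p+1}\}. \tag{17}$$
This would give an expression, let us call it
$$Q^{k-1}_{n,p}\equiv Q^{k-1}\big(C^{(k)}_{n,p},\,C^{(k)}_{n-1,p},\,\dots,\,C^{(k)}_{n-k+1,p}\big). \tag{18}$$
The fact that $Q^{k-1}_{n,p}$ is a function of $C^{(k)}_{n-l,p}$ with $l=0,\dots,k-1$ will be clear in the following. Intuitively, the case $k=2$ involves only $C_{n,p}$ and $C_{n-1,p}$, the case $k=3$ adds $C_{n-2,p}$ because it uses the expression for $Q^{1}$ in $n-1$ dimensions, and the same pattern repeats inductively up to $k$ points.
The quantity $Q^{k-1}_{n,p}$ represents the number of dichotomies of the set (17) that are admissible on the first $p$ multiplets [meaning that $\phi(\xi^j_i)=\phi(\xi^l_i)$ for all $i\le p$ and all $j,l$] and admissible on the points in $\hat\Xi_{p+1}$ [meaning that $\phi(\xi^j_{p+1})=\phi(\xi^l_{p+1})$ for all $j,l\le k-1$]. A number $R^{k-1}_{n,p}$ of these dichotomies are realizable by a hyperplane passing through the excluded point $\xi^k_{p+1}$, and are therefore all admissible. Of the remaining ones, a fraction $\psi_k$ assign the same value to $\xi^k_{p+1}$ and to the points in $\hat\Xi_{p+1}$, and are therefore admissible on the whole multiplet $\Xi_{p+1}$. Therefore,
$$C^{(k)}_{n,p+1}=\psi_k\,\big(Q^{k-1}_{n,p}-R^{k-1}_{n,p}\big)+R^{k-1}_{n,p}. \tag{19}$$
While $\psi_2$ was a probability (over all possible hyperplanes), $\psi_k$ is a conditional probability, namely the probability that a uniform vector $W$ on the sphere does not separate the multiplet $\Xi_{p+1}$, conditioned on the event that $W$ does not separate the set $\hat\Xi_{p+1}$:
$$\psi_k=P\big(W\ \text{does not separate}\ \Xi_{p+1}\,\big|\,W\ \text{does not separate}\ \hat\Xi_{p+1}\big). \tag{20}$$
The dependence of $\psi_k$ on the relative positions of the points is discussed in the Appendix, where it is shown that (i) the calculation of $\psi_k$ can be reduced from $n$-dimensional to $k$-dimensional integrals, and (ii) $\psi_k$ depends on the $\xi$’s only through the overlaps between the points in a multiplet, which we fix for all multiplets:
$$\xi^j_i\cdot\xi^l_i=\rho_{jl}\qquad\text{for all}\ i. \tag{21}$$
This property allows us to treat $\psi_k$ as a constant in the recursions, thus simplifying the computations. Note that, since it is a conditional probability, $\psi_k$ can be written as a ratio of probabilities:
$$\psi_k=\frac{\Psi_k}{\Psi_{k-1}}, \tag{22}$$
where $\Psi_k$ depends on the $k(k-1)/2$ overlaps between $k$ points, and denotes the fraction of hyperplanes not separating the $k$ points. This definition, together with the identity $\Psi_1=1$, implies that the geometric quantity computed above for $k=2$ is $\psi_2=\Psi_2$.
The number $R^{k-1}_{n,p}$ can be obtained by applying again Cover’s method with respect to the set $\hat\Xi_{p+1}$, this time in $n-1$ dimensions because the hyperplane is constrained to pass through $\xi^k_{p+1}$. Hence
$$R^{k-1}_{n,p}=Q^{k-1}_{n-1,p}. \tag{23}$$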
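Finally, from Eqs. (19) and (23), the recursion for $C^{(k)}_{n,p}$ is
$$C^{(k)}_{n,p+1}=Q^{k}\big(C^{(k)}_{n,p},\,C^{(k)}_{n-1,p},\,\dots,\,C^{(k)}_{n-k,p}\big), \tag{24}$$
where the functions $Q^{k}$ (having $k+1$ arguments) satisfy the recursive functional relation
$$Q^{k}(x_0,\dots,x_k)=\psi_k\,Q^{k-1}(x_0,\dots,x_{k-1})+(1-\psi_k)\,Q^{k-1}(x_1,\dots,x_k), \tag{25}$$
with the boundary $Q^{1}(x_0,x_1)=x_0+x_1$ given by the form of Eq. (2) for a single point.
The recursion in $k$ can be solved, thus yielding again a recursion for $C^{(k)}_{n,p}$ in $n$ and $p$ only. Since $Q^{k}$ is linear in its arguments, let us call $\theta^k_l$ the coefficients in the solved recursion, $Q^{k}(x_0,\dots,x_k)=\sum_{l=0}^{k}\theta^k_l\,x_l$; by Eq. (25) they satisfy
$$\theta^k_l=\psi_k\,\theta^{k-1}_l+(1-\psi_k)\,\theta^{k-1}_{l-1},\qquad \theta^1_0=\theta^1_1=1, \tag{26}$$
with $\theta^{k}_{l}=0$ for $l<0$ or $l>k$.
Equation (24) then becomes
$$C^{(k)}_{n,p+1}=\sum_{l=0}^{k}\theta^k_l\,C^{(k)}_{n-l,p}, \tag{27}$$
with boundary conditions at $n\le 0$ and at $p=1$ analogous to those in Eq. (9). For instance, setting $k=2$ in Eqs. (26) and (27) recovers the recursion for doublets, Eq. (8), as expected. For $k=3$ one obtains
$$C^{(3)}_{n,p+1}=\psi_2\psi_3\,C^{(3)}_{n,p}+(\psi_2+\psi_3-\psi_2\psi_3)\,C^{(3)}_{n-1,p}+(1-\psi_2\psi_3)\,C^{(3)}_{n-2,p}+(1-\psi_2)(1-\psi_3)\,C^{(3)}_{n-3,p}. \tag{28}$$
In the process of deriving the foregoing recursion relations we considered the points in a particular order, therefore explicitly breaking invariance under permutations within the multiplets. We restore the invariance a posteriori, by prescribing that all $\psi_m$ (with $m\le k$) be symmetrized with respect to all overlaps. For instance, when $k=3$, the $\psi_2$ appearing in Eq. (28) is to be intended as $[\Psi_2(\rho_{12})+\Psi_2(\rho_{13})+\Psi_2(\rho_{23})]/3$. The validity of this prescription is substantiated by the numerical results shown in Fig. 3; see also the limit case (ii) in the Discussion below.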
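The coefficients of the recursion for generic multiplets can be generated mechanically. The sketch below assumes that the functional relation has the form Q^k(x_0, ..., x_k) = ψ_k Q^{k-1}(x_0, ..., x_{k-1}) + (1-ψ_k) Q^{k-1}(x_1, ..., x_k) with Q^1(x_0, x_1) = x_0 + x_1, which is what the counting argument of this section implies. Under that assumption each new point multiplies the coefficient list by the two-term factor (ψ_m, 1-ψ_m). The function name `theta_coefficients` and the list-based interface are illustrative choices.

```python
def theta_coefficients(psis):
    """Coefficients [theta_0, ..., theta_k] of the linear recursion
    C(n, p+1) = sum_l theta_l * C(n-l, p) for multiplets of k points.

    psis is the list [psi_2, ..., psi_k]; an empty list gives k = 1.
    """
    coeffs = [1.0, 1.0]  # k = 1: Cover's recursion C(n,p) + C(n-1,p)
    for psi in psis:
        new = [0.0] * (len(coeffs) + 1)
        for l, c in enumerate(coeffs):
            new[l] += psi * c              # branch not shifted in n
            new[l + 1] += (1.0 - psi) * c  # branch shifted by one dimension
        coeffs = new
    return coeffs


if __name__ == "__main__":
    print(theta_coefficients([0.5]))       # doublets: psi_2, 1, 1 - psi_2
    print(theta_coefficients([0.3, 0.7]))  # triplets
```

For a single ψ₂ the output reproduces the three weights ψ₂, 1, 1-ψ₂ of the doublet recursion, and the coefficients always sum to 2, the number of labels available to each multiplet.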
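The solution for $C^{(k)}_{n,p}$ (with the appropriate boundary conditions) can be obtained, for instance via generating functions, but we do not give it here. Instead, we focus on the capacity, which can be computed by the same approximate method used for $k=2$ [Eqs. (15) and (16)]:
$$\alpha_c=\frac{a_k}{b_k}, \tag{29}$$
where we have defined the moments
$$a_k=\sum_{l=0}^{k}\theta^k_l,\qquad b_k=\sum_{l=0}^{k}l\,\theta^k_l \tag{30}$$
of the coefficients $\theta^k_l$ appearing in the recursion (27). Summing Eq. (26) over $l$ shows that $a_k=a_{k-1}$ and therefore $a_k=a_1=2$. By multiplying Eq. (26) by $l$ and summing over $l$, one obtains $b_k=b_{k-1}+2(1-\psi_k)$. The boundary condition $b_1=1$ then fixes the solution
$$b_k=1+2\sum_{m=2}^{k}(1-\psi_m). \tag{31}$$
Finally, substituting $a_k$ and $b_k$ into Eq. (29) yields a remarkably simple formula for the capacity:
$$\alpha_c=\frac{2}{2k-1-2\sum_{m=2}^{k}\psi_m}=\Big(k-\frac{1}{2}-\sum_{m=2}^{k}\psi_m\Big)^{-1}. \tag{32}$$
Figure 3 compares our theory with numerical computations in the case of triplets ($k=3$), for triangles with three, two, and no sides of the same length. The agreement is excellent. The function $\Psi_3$ is a double integral (given in the Appendix), which we evaluate numerically.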
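The capacity formula is simple enough to evaluate in a few lines. The sketch below assumes that the zeroth moment of the recursion coefficients equals 2 and that the first moment obeys b_k = b_{k-1} + 2(1-ψ_k) with b_1 = 1, so that the capacity (in multiplets per dimension) is 2 / (2k - 1 - 2 Σ ψ_m). These relations follow from the recursion structure described above, but the function names and the list-based interface are illustrative, not the authors'.

```python
def first_moment(psis):
    """First moment b_k of the recursion coefficients, built step by step
    from b_1 = 1 and b_k = b_{k-1} + 2 * (1 - psi_k)."""
    b = 1.0
    for psi in psis:
        b += 2.0 * (1.0 - psi)
    return b


def multiplet_capacity(psis):
    """Capacity in multiplets per dimension for multiplets of k points.

    psis is the list [psi_2, ..., psi_k]; an empty list gives k = 1.
    """
    k = len(psis) + 1
    return 2.0 / (2 * k - 1 - 2.0 * sum(psis))


if __name__ == "__main__":
    print(multiplet_capacity([]))          # single points
    print(multiplet_capacity([1.0, 1.0]))  # three coincident points
    print(multiplet_capacity([0.0, 0.0]))  # all psi_m vanishing, k = 3
    print(2.0 / first_moment([0.4, 0.6]), multiplet_capacity([0.4, 0.6]))
```

Coincident points give back the single-point value 2, and setting every ψ_m to zero gives 2/(2k-1), the value quoted in the Discussion for linear manifolds.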
VI Discussion
Our extension of Cover’s combinatorial technique to structured data allows one to obtain closed expressions of $C^{(k)}_{n,p}$ at finite $n$ and $p$, for any $k$ [we have written explicitly the result for $k=2$ in Eq. (12)]. Besides this, our main result is Eq. (32), which expresses the capacity $\alpha_c$ as a simple function of the quantities $\psi_m$. Regarding these quantities, the merit of our method is twofold: first, the $\psi_m$’s are revealed to be the only relevant parameters characterizing the linear separability of the multiplets; second, they have a very simple geometric interpretation in terms of probabilities.
We mention three simple limit cases of Eq. (32). (i) If all the points in each multiplet coincide, then $\psi_m=1$ for all $m$ and we recover the single-point classic result $\alpha_c=2$. (ii) When $k=3$ and two points of a triplet coincide the overlaps are $\rho_{12}=1$ and $\rho_{13}=\rho_{23}=\rho$. After symmetrization, $\psi_2$ and $\psi_3$ depend only on $\Psi_2(\rho)$ and on $\Psi_3$, the fraction of hyperplanes not separating the three points. Clearly $\Psi_3=\Psi_2(\rho)$, and one recovers Eq. (16) for $k=2$ as expected. (iii) If $\psi_m=0$ for all $m=2,\dots,k$, Eq. (32) gives $\alpha_c=2/(2k-1)$. This prediction matches that obtained in Chung et al. (2018a) for $(k-1)$-dimensional linear manifolds. However, this turns out to be an unphysical limit in our framework, since the $\psi_m$ cannot be all vanishing. For instance, for $k=3$, equilateral triplets with overlaps $\rho$ lie on a linear manifold passing through the origin when $\rho$ takes its minimum value $\rho=-1/2$. The same happens for isosceles triplets at the minimum value of their overlaps. Interestingly, the capacity evaluated at the respective minimum is $\alpha_c=6/13$ for both geometries, to be compared to the value $2/5$ found for two-dimensional linear manifolds.
Another interesting, albeit less elementary, limit case would be , taken in such a way that the points generate a sphere of radius ; then Eq. (32) should reproduce the well-known capacity with margin Gardner (1987), which has never been obtained by combinatorial methods Chung et al. (2018a); Engel and Broeck (2001).
Other applications and extensions of the theory appear possible. First, the capacity is written in Eq. (29) as a combination of the zeroth and first moments, but higher-order moments can be computed similarly and give access to other useful quantities. For instance, the second moment is related to the width of the crossover region separating the regimes where the fraction of realizable dichotomies is close to $1$ and close to $0$, respectively. Second, it would be interesting to express our results for general (non-linear) separating surfaces, in the same spirit as Cover’s original work, and in view of useful applications.
Acknowledgements.
We would like to dedicate this work to the memory of Bruno Bassetti. P.R. acknowledges funding by the European Union through the H2020 - MCIF Grant No. 766442.
Appendix A Computation of
Computation of .
The fraction of hyperplanes assigning the same value to two points and is given by:
[TABLE]
The normalization factor is
[TABLE]
where is the solid angle in dimensions. Gram-Schmidt (GS) orthonormalization of and yields
[TABLE]
where is the overlap between the two points. Inverting Eq. (35) gives
[TABLE]
Having orthonormalized the points allows to safely exploit the -dimensional spherical symmetry of the integral in the space orthogonal to and , and to reduce it to an integral over the two-dimensional solid angle:
[TABLE]
which evaluates to the result in Eq. (5), and shows that .
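The two-point result can be checked by direct sampling. The sketch below assumes that the fraction of hyperplanes through the origin that keep two unit vectors with overlap ρ in the same half-space is 1 - arccos(ρ)/π, equivalently (2/π) arctan √((1+ρ)/(1-ρ)), and compares it with a Monte Carlo estimate in which the normal vector is drawn from an isotropic Gaussian. The function names, the dimension, and the sample size are arbitrary illustrative choices.

```python
import math
import random


def psi2_exact(rho: float) -> float:
    """Probability that a random hyperplane through the origin does not
    separate two unit vectors with overlap rho (assumed closed form)."""
    return 1.0 - math.acos(rho) / math.pi


def psi2_monte_carlo(rho: float, n: int = 5, samples: int = 20000, seed: int = 0) -> float:
    """Monte Carlo estimate with Gaussian normal vectors in n >= 2 dimensions."""
    rng = random.Random(seed)
    # Two unit vectors with overlap rho, written in an orthonormal basis
    xi = [1.0] + [0.0] * (n - 1)
    xi_bar = [rho, math.sqrt(1.0 - rho * rho)] + [0.0] * (n - 2)
    same_side = 0
    for _ in range(samples):
        w = [rng.gauss(0.0, 1.0) for _ in range(n)]
        a = sum(wi * x for wi, x in zip(w, xi))
        b = sum(wi * x for wi, x in zip(w, xi_bar))
        if (a > 0) == (b > 0):
            same_side += 1
    return same_side / samples


if __name__ == "__main__":
    for rho in (-0.5, 0.0, 0.3, 0.9):
        print(rho, psi2_exact(rho), psi2_monte_carlo(rho))
```

Changing `n` leaves the estimate unchanged within sampling noise, since only the two components of the normal vector in the plane of the two points enter the sign test.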
Computation of .
Eq. (22) expresses the conditional probability in terms of the probabilities . is defined as the fraction of hyperplanes assigning the same value to the points :
[TABLE]
with given by Eq. (34). For , the Gram-Schmidt procedure gives:
[TABLE]
where are the overlaps, and . Again, thanks to the spherical symmetry in the space orthogonal to the ’s the result is an integral over the -dimensional solid angle:
[TABLE]
where the measure can be expressed via the angles and , and , and . As above, this computation shows that . The results presented in Fig. 3 have been obtained by integrating numerically Eq. (39).
The procedure for can be extended to . The final result has the following structure:
[TABLE]
where the functions appearing in the ’s can be systematically derived in a similar way from the GS procedure. This shows that , related to by Eq. (22), depends in general on the ’s only through the overlaps , and it can be written in terms of -dimensional integrals.
References
- Krizhevsky et al. (2012): A. Krizhevsky, I. Sutskever, and G. E. Hinton, in Advances in Neural Information Processing Systems 25, edited by F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger (Curran Associates, Inc., 2012) pp. 1097–1105.
- Goodfellow et al. (2014): I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, in Advances in Neural Information Processing Systems 27, edited by Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger (Curran Associates, Inc., 2014) pp. 2672–2680.
- Goodfellow et al. (2016): I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning (The MIT Press, 2016).
- Baldassi et al. (2016): C. Baldassi, C. Borgs, J. T. Chayes, A. Ingrosso, C. Lucibello, L. Saglietti, and R. Zecchina, Proceedings of the National Academy of Sciences 113, E7655 (2016), https://www.pnas.org/content/113/48/E7655.full.pdf.
- Baity-Jesi et al. (2018): M. Baity-Jesi, L. Sagun, M. Geiger, S. Spigler, G. B. Arous, C. Cammarota, Y. Le Cun, M. Wyart, and G. Biroli, in Proceedings of the 35th International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 80, edited by J. Dy and A. Krause (PMLR, Stockholmsmässan, Stockholm Sweden, 2018) pp. 314–323.
- Cover (1965): T. M. Cover, IEEE Transactions on Electronic Computers EC-14, 326 (1965).
- Brunel et al. (2004): N. Brunel, V. Hakim, P. Isope, J.-P. Nadal, and B. Barbour, Neuron 43, 745 (2004).
- Engel and Broeck (2001): A. Engel and C. P. L. V. d. Broeck, Statistical Mechanics of Learning (Cambridge University Press, New York, NY, USA, 2001).
