TL;DR
This paper introduces neural persistence, a novel topological complexity measure for neural networks that captures structural properties and can guide training efficiency improvements.
Contribution
We propose neural persistence as a new complexity measure based on algebraic topology, enabling structural analysis and training optimization of neural networks.
Findings
Neural persistence reflects effects of dropout and batch normalization.
It can be used as a stopping criterion to reduce training time.
It achieves comparable accuracy to traditional early stopping methods.
| Data set | # Runs | # Epochs | Architecture | Optimizer | Batch Size | Hyperparameters |
|---|---|---|---|---|---|---|
| MNIST | 50 | 40 | | Adam | 32 | , , |
| | | | | | | , , , Batch Normalization |
| | | | | | | , , , Dropout 50% |
Neural Persistence: A Complexity Measure for Deep Neural Networks Using Algebraic Topology
Bastian Rieck1,2,†, Matteo Togninalli1,2,†, Christian Bock1,2,†, Michael Moor1,2, Max Horn1,2, Thomas Gumbsch1,2, Karsten Borgwardt1,2
1 Department of Biosystems Science and Engineering, ETH Zurich, Switzerland
2 SIB Swiss Institute of Bioinformatics, Switzerland
† These authors contributed equally
Abstract
While many approaches to make neural networks more fathomable have been proposed, they are restricted to interrogating the network with input data. Measures for characterizing and monitoring structural properties, however, have not been developed. In this work, we propose neural persistence, a complexity measure for neural network architectures based on topological data analysis on weighted stratified graphs. To demonstrate the usefulness of our approach, we show that neural persistence reflects best practices developed in the deep learning community such as dropout and batch normalization. Moreover, we derive a neural persistence-based stopping criterion that shortens the training process while achieving accuracies comparable to those of early stopping based on validation loss.
1 Introduction
The practical successes of deep learning in various fields such as image processing (Simonyan & Zisserman, 2015; He et al., 2016; Hu et al., 2018), biomedicine (Ching et al., 2018; Rajpurkar et al., 2017; Rajkomar et al., 2018), and language translation (Bahdanau et al., 2015; Sutskever et al., 2014; Wu et al., 2016) still outpace our theoretical understanding. While hyperparameter adjustment strategies exist (Bengio, 2012), formal measures for assessing the generalization capabilities of deep neural networks have yet to be identified (Zhang et al., 2017). Previous approaches for improving theoretical and practical comprehension focus on interrogating networks with input data. These methods include
i) feature visualization of deep convolutional neural networks (Zeiler & Fergus, 2014; Springenberg et al., 2015),
ii) sensitivity and relevance analysis of features (Montavon et al., 2017),
iii) a descriptive analysis of the training process based on information theory (Tishby & Zaslavsky, 2015; Shwartz-Ziv & Tishby, 2017; Saxe et al., 2018; Achille & Soatto, 2018), and
iv) a statistical analysis of interactions of the learned weights (Tsang et al., 2018).
Additionally, Raghu et al. (2017) develop a measure of expressivity of a neural network and use it to explore the empirical success of batch normalization, as well as for the definition of a new regularization method. They note that one key challenge remains, namely to provide meaningful insights while maintaining theoretical generality. This paper presents a method for elucidating neural networks in light of both aspects.
We develop neural persistence, a novel measure for characterizing neural network structural complexity. In doing so, we adopt a new perspective that integrates both network weights and connectivity while not relying on interrogating networks through input data. Neural persistence builds on computational techniques from algebraic topology, specifically topological data analysis (TDA), which was already shown to be beneficial for feature extraction in deep learning (Hofer et al., 2017) and describing the complexity of GAN sample spaces (Khrulkov & Oseledets, 2018). More precisely, we rephrase deep networks with fully-connected layers into the language of algebraic topology and develop a measure for assessing the structural complexity of
i) individual layers, and
ii) the entire network.
In this work, we present the following contributions:
- We introduce neural persistence, a novel measure for characterizing the structural complexity of neural networks that can be efficiently computed.
- We prove its theoretical properties, such as upper and lower bounds, thereby arriving at a normalization for comparing neural networks of varying sizes.
- We demonstrate the practical utility of neural persistence in two scenarios:
i) it correctly captures the benefits of dropout and batch normalization during the training process, and
ii) it can be easily used as a competitive early stopping criterion that does not require validation data.
2 Background: Topological data analysis
Topological data analysis (TDA) recently emerged as a field that provides computational tools for analysing complex data within a rigorous mathematical framework that is based on algebraic topology. This paper uses persistent homology, a theory that was developed to understand high-dimensional manifolds (Edelsbrunner et al., 2002; Edelsbrunner & Harer, 2010), and has since been successfully employed in characterizing graphs (Sizemore et al., 2017; Rieck et al., 2018), finding relevant features in unstructured data (Lum et al., 2013), and analysing image manifolds (Carlsson et al., 2008). This section gives a brief summary of the key concepts; please refer to Edelsbrunner & Harer (2010) for an extensive introduction.
Simplicial homology
The central object in algebraic topology is a simplicial complex $K$, i.e. a high-dimensional generalization of a graph, which is typically used to describe complex objects such as manifolds. Various notions to describe the connectivity of $K$ exist, one of them being simplicial homology. Briefly put, simplicial homology uses matrix reduction algorithms (Munkres, 1996) to derive a set of groups, the homology groups, for a given simplicial complex $K$. Homology groups describe topological features (colloquially also referred to as holes) of a certain dimension $d$, such as connected components ($d = 0$), tunnels ($d = 1$), and voids ($d = 2$). The information from the $d$th homology group is summarized in a simple complexity measure, the $d$th Betti number $\beta_d$, which merely counts the number of $d$-dimensional features: a circle, for example, has Betti numbers $(1, 1)$, i.e. one connected component and one tunnel, while a filled circle has Betti numbers $(1, 0)$, i.e. one connected component but no tunnel. In the context of analysing simple feedforward neural networks for two classes, Bianchini & Scarselli (2014) calculated bounds of Betti numbers of the decision region belonging to the positive class, and were thus able to show the implications of different activation functions. These ideas were extended by Guss & Salakhutdinov (2018) to obtain a measure of the topological complexity of decision boundaries.
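As a small illustration of the Betti numbers described above, the following sketch computes $\beta_0$ and $\beta_1$ for a graph, i.e. a simplicial complex with vertices and edges only. The function name `betti_numbers` and the toy graphs are assumptions made for this example; filled-in faces such as the 'filled circle' are deliberately not modelled.

```python
def betti_numbers(num_vertices, edges):
    """Return (beta_0, beta_1) of a graph seen as a 1-dimensional simplicial complex."""
    parent = list(range(num_vertices))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    components = num_vertices
    for u, v in edges:
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_u] = root_v
            components -= 1

    # Euler characteristic of a graph: |V| - |E| = beta_0 - beta_1.
    beta_1 = len(edges) - num_vertices + components
    return components, beta_1


circle = [(0, 1), (1, 2), (2, 3), (3, 0)]  # a 4-cycle is a combinatorial circle
path = [(0, 1), (1, 2), (2, 3)]            # an open path has no tunnel

print(betti_numbers(4, circle))
print(betti_numbers(4, path))
```

The cycle yields one component and one tunnel, whereas the path, which can be contracted to a point, has no tunnel.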
Persistent homology
For the analysis of real-world data sets, however, Betti numbers turn out to be of limited use because their representation is too coarse and unstable. This prompted the development of persistent homology. Given a simplicial complex $K$ with an additional set of weights $a_0 \leq a_1 \leq \dots \leq a_{m-1} \leq a_m$, which are commonly thought to represent the idea of a scale, it is possible to put $K$ in a filtration, i.e. a nested sequence of simplicial complexes $\emptyset = K_0 \subseteq K_1 \subseteq \dots \subseteq K_{m-1} \subseteq K_m = K$. This filtration is thought to represent the ‘growth’ of $K$ as the scale is being changed. During this growth process, topological features can be created (new vertices may be added, for example, which creates a new connected component) or destroyed (two connected components may merge into one). Persistent homology tracks these changes and represents the creation and destruction of a feature as a point $(a_i, a_j) \in \mathbb{R}^2$ for indices $i \leq j$ with respect to the filtration. The collection of all points corresponding to $d$-dimensional topological features is called the $d$th persistence diagram $\mathcal{D}_d$. It can be seen as a collection of Betti numbers at multiple scales. Given a point $(x, y) \in \mathcal{D}_d$, the quantity $\mathrm{pers}(x, y) := |y - x|$ is referred to as its persistence. Typically, high persistence is considered to correspond to features, while low persistence is considered to indicate noise (Edelsbrunner et al., 2002).
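The creation and destruction of connected components can be made concrete with a toy example. The sketch below assumes, for illustration only, that all vertices are created at scale 0 and that edges enter in order of increasing weight; the names `zero_dim_persistence` and `betti0_at` are hypothetical helpers, not part of any library.

```python
def zero_dim_persistence(num_vertices, weighted_edges):
    """Return (creation, destruction) pairs of connected components.

    Assumption: every vertex is created at scale 0; an edge (u, v, w)
    enters the filtration at scale w.
    """
    parent = list(range(num_vertices))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    pairs = []
    for u, v, w in sorted(weighted_edges, key=lambda e: e[2]):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:          # the edge merges two components
            parent[root_u] = root_v
            pairs.append((0.0, w))    # one component is destroyed at scale w
    return pairs


def betti0_at(num_vertices, pairs, scale):
    """Number of connected components that are still alive at the given scale."""
    return num_vertices - sum(1 for _, death in pairs if death <= scale)


# Two tight clusters {0, 1, 2} and {3, 4}, joined by one long edge.
edges = [(0, 1, 0.10), (1, 2, 0.20), (0, 2, 0.30), (3, 4, 0.15), (2, 3, 0.90)]
pairs = zero_dim_persistence(5, edges)

for creation, destruction in pairs:
    print(creation, destruction, "persistence:", abs(destruction - creation))
print([betti0_at(5, pairs, s) for s in (0.0, 0.25, 1.0)])
```

The pair with the largest persistence corresponds to the separation of the two clusters, while the short-lived pairs reflect merges within a cluster.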
3 A novel measure for neural network complexity
This section details neural persistence, our novel measure for assessing the structural complexity of neural networks. By exploiting both network structure and weight information through persistent homology, our measure captures network expressiveness and goes beyond mere connectivity properties. Subsequently, we describe its calculation, provide theorems for theoretical and empirical bounds, and show the existence of complexity regimes for neural networks. To summarize this section, Figure 1 illustrates how our method treats a neural network.
3.1 Neural persistence
Given a feedforward neural network with an arrangement of neurons and their connections $E$, let $W$ refer to the set of weights. Since $W$ is typically changing during training, we require a function $\varphi\colon E \to W$ that maps a specific edge to a weight. Fixing an activation function, the connections form a stratified graph.
Definition 1 (Stratified graph and layers).
A stratified graph is a multipartite graph $G = (V, E)$ satisfying $V = V_0 \sqcup V_1 \sqcup \dots$, such that if $u \in V_i$, $v \in V_j$, and $(u, v) \in E$, we have $j = i + 1$. Hence, edges are only permitted between adjacent vertex sets. Given $k \in \mathbb{N}$, the $k$th layer of a stratified graph is the unique subgraph $G_k := (V_k \sqcup V_{k+1}, E_k := E \cap \{V_k \times V_{k+1}\})$.
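To make Definition 1 tangible, the following sketch builds a stratified graph from the dense weight matrices of a fully-connected network and extracts a single layer. The representation (consecutive integer vertex ids, a dictionary of weighted edges) and the names `stratified_graph` and `layer` are assumptions of this example.

```python
import numpy as np


def stratified_graph(weight_matrices):
    """Build vertex sets and weighted edges from dense weight matrices.

    weight_matrices[k] is assumed to have shape (|V_k|, |V_{k+1}|).
    """
    sizes = [weight_matrices[0].shape[0]] + [w.shape[1] for w in weight_matrices]
    offsets = [0]
    for size in sizes:
        offsets.append(offsets[-1] + size)
    vertex_sets = [list(range(offsets[i], offsets[i + 1])) for i in range(len(sizes))]

    edges = {}
    for k, w in enumerate(weight_matrices):
        for i, u in enumerate(vertex_sets[k]):
            for j, v in enumerate(vertex_sets[k + 1]):
                edges[(u, v)] = float(w[i, j])  # edges only join adjacent vertex sets
    return vertex_sets, edges


def layer(vertex_sets, edges, k):
    """Return the vertices and edges of the k-th layer (a bipartite subgraph)."""
    src, dst = set(vertex_sets[k]), set(vertex_sets[k + 1])
    layer_edges = {e: w for e, w in edges.items() if e[0] in src and e[1] in dst}
    return vertex_sets[k] + vertex_sets[k + 1], layer_edges


rng = np.random.default_rng(0)
weights = [rng.normal(size=(4, 3)), rng.normal(size=(3, 2))]  # a toy 4-3-2 network
vertex_sets, edges = stratified_graph(weights)
layer_vertices, layer_edges = layer(vertex_sets, edges, 1)
print(vertex_sets)
print(layer_vertices, len(layer_edges))
```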
This enables calculating the persistent homology of $G$ and each $G_k$, using the filtration induced by sorting all weights, which is common practice in topology-based network analysis (Carstens & Horadam, 2013; Horak et al., 2009) where weights often represent closeness or node similarity. However, our context requires a novel filtration because the weights arise from an incremental fitting procedure, namely the training, which could theoretically lead to unbounded values. When analysing geometrical data with persistent homology, one typically selects a filtration based on the (Euclidean) distance between data points (Bubenik, 2015). The filtration then connects points that are increasingly distant from each other, starting from points that are direct neighbours. Our network filtration aims to mimic this behaviour in the context of fully-connected neural networks. Our framework does not explicitly take activation functions into account; however, activation functions influence the evolution of weights during training.
Filtration
Given the set of weights $W$ for one training step, let $w_{\max} := \max_{w \in W} |w|$. Furthermore, let $W' := \{|w| / w_{\max} \mid w \in W\}$ be the set of transformed weights, indexed in non-ascending order, such that $1 = w'_0 \geq w'_1 \geq \dots \geq 0$. This permits us to define a filtration for the $k$th layer $G_k$ as $G_k^{(0)} \subseteq G_k^{(1)} \subseteq \dots$, where $G_k^{(i)} := (V_k \sqcup V_{k+1}, \{(u, v) \mid (u, v) \in E_k \wedge \varphi'(u, v) \geq w'_i\})$ and $\varphi'(u, v) \in W'$ denotes the transformed weight of an edge. We tailored this filtration towards the analysis of neural networks, for which large (absolute) weights indicate that certain neurons exert a larger influence over the final activation of a layer. The strength of a connection is thus preserved by the filtration, and weaker weights with $|w| \approx 0$ remain close to $0$. Moreover, since $w' \in [0, 1]$ holds for the transformed weights, this filtration makes the network invariant to scaling, which simplifies the comparison of different networks.
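The weight transformation and the resulting nested edge sets can be sketched in a few lines. This is an illustration under the assumption that the normalization is applied to the weight matrix of a single layer; `transformed_weights` and `filtration_edges` are hypothetical helper names.

```python
import numpy as np


def transformed_weights(weights):
    """Map weights to |w| / w_max, so that all values lie in [0, 1]."""
    w = np.abs(np.asarray(weights, dtype=float))
    w_max = w.max()
    if w_max == 0.0:
        raise ValueError("all weights are zero; the transformation is undefined")
    return w / w_max


def filtration_edges(weights, threshold):
    """Edges (i, j) whose transformed weight is at least the threshold."""
    w = transformed_weights(weights)
    rows, cols = np.nonzero(w >= threshold)
    return {(int(i), int(j)) for i, j in zip(rows, cols)}


W = np.array([[0.5, -2.0, 0.1],
              [-1.0, 0.0, 1.5]])

# Thresholds in non-ascending order, starting at 1.
thresholds = np.sort(transformed_weights(W).ravel())[::-1]
steps = [filtration_edges(W, t) for t in thresholds]

for t, edge_set in zip(thresholds, steps):
    print(round(float(t), 3), sorted(edge_set))
```

Lowering the threshold only ever adds edges, and multiplying `W` by a positive constant leaves the transformed weights unchanged, which is the scale invariance mentioned above.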
Persistence diagrams
Having set up the filtration, we can calculate persistent homology for every layer $G_k$. As the filtration contains at most $1$-simplices (edges), we capture zero-dimensional topological information, i.e. how connected components are created and merged during the filtration. This information is structurally equivalent to calculating a maximum spanning tree using the weights, or performing hierarchical clustering with a specific setup (Carlsson & Mémoli, 2010). While it would theoretically be possible to include higher-dimensional information about each layer $G_k$, for example in the form of cliques (Rieck et al., 2018), we focus on zero-dimensional information in this paper, because of the following advantages:
i) the resulting values are easily interpretable as they essentially describe the clustering of the network at multiple weight thresholds,
ii) previous research (Rieck & Leitte, 2016; Hofer et al., 2017) indicates that zero-dimensional topological information is already capturing a large amount of information, and
iii) persistent homology calculations are highly efficient in this regime (see below).
We thus calculate zero-dimensional persistent homology with this filtration. The resulting persistence diagrams have a special structure: since our filtration solely sorts edges, all vertices are present at the beginning of the filtration, i.e. they are already part of $G_k^{(0)}$ for each $k$. As a consequence, they are assigned a weight of $1$, resulting in $|V_k \sqcup V_{k+1}|$ connected components. Hence, entries in the corresponding persistence diagram $\mathcal{D}_k$ are of the form $(1, x)$, with $x \in W'$, and will be situated below the diagonal, similar to superlevel set filtrations (Bubenik, 2015; Cohen-Steiner et al., 2009). Using the $p$-norm of a persistence diagram, as introduced by Cohen-Steiner et al. (2010), we obtain the following definition for neural persistence.
Definition 2 (Neural persistence).
The neural persistence of the $k$th layer $G_k$, denoted by $\mathrm{NP}(G_k)$, is the $p$-norm of the persistence diagram $\mathcal{D}_k$ resulting from our previously-introduced filtration, i.e.
$$\mathrm{NP}(G_k) := \|\mathcal{D}_k\|_p := \left( \sum_{(c, d) \in \mathcal{D}_k} \mathrm{pers}(c, d)^p \right)^{\frac{1}{p}},$$
which (for $p = 2$) captures the Euclidean distance of points in $\mathcal{D}_k$ to the diagonal.
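The norm in Definition 2 is straightforward to evaluate once a persistence diagram is available. The sketch below assumes a diagram given as a plain list of (creation, destruction) tuples; the name `diagram_norm` and the toy diagram are made up for illustration.

```python
import math


def diagram_norm(diagram, p=2):
    """p-norm of a persistence diagram given as (creation, destruction) tuples."""
    return sum(abs(d - c) ** p for c, d in diagram) ** (1.0 / p)


# Toy diagram with entries of the form (1, x), as produced by the filtration.
diagram = [(1.0, 1.0), (1.0, 0.4), (1.0, 0.2)]

print(diagram_norm(diagram, p=2))
print(diagram_norm(diagram, p=1))
```

A point on the diagonal, such as `(1.0, 1.0)`, has zero persistence and therefore does not contribute to the norm.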
The $p$-norm is known to be a stable summary (Cohen-Steiner et al., 2010) of topological features in a persistence diagram. For neural persistence to be a meaningful measure of structural complexity, it should increase as a neural network is learning. We evaluate this and other properties in Section 4.
Algorithm 1 provides pseudocode for the calculation process. It is highly efficient: the filtration (line 4) amounts to sorting all $n$ weights of a network, which has a computational complexity of $\mathcal{O}(n \log n)$. Calculating persistent homology of this filtration (line 5) can be realized using an algorithm based on union–find data structures (Edelsbrunner et al., 2002). This has a computational complexity of $\mathcal{O}(n \cdot \alpha(n))$, where $\alpha(\cdot)$ refers to the extremely slow-growing inverse of the Ackermann function (Cormen et al., 2009, Chapter 22). We make our implementation and experiments available under https://github.com/BorgwardtLab/Neural-Persistence.
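The calculation can be sketched with a sort followed by a union–find pass, in the spirit of the description above. This is not the authors' implementation: the function name `neural_persistence` is hypothetical, the normalization is applied per layer, and only the pairs created by merging components (the edges of a maximum spanning tree) are counted, which is an assumption about how the single never-merging component is treated.

```python
import numpy as np


def neural_persistence(weight_matrix, p=2):
    """Sketch of the neural persistence of one fully-connected layer."""
    w = np.abs(np.asarray(weight_matrix, dtype=float))
    if w.max() == 0.0:
        raise ValueError("all weights are zero; the filtration is undefined")
    w = w / w.max()                      # transformed weights in [0, 1]
    n_in, n_out = w.shape
    parent = list(range(n_in + n_out))   # inputs: 0..n_in-1, outputs follow

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    total, merges = 0.0, 0
    # Sorting step: visit edges by non-ascending transformed weight.
    for flat in np.argsort(-w, axis=None):
        i, j = divmod(int(flat), n_out)
        root_u, root_v = find(i), find(n_in + j)
        if root_u != root_v:             # union-find step: two components merge
            parent[root_u] = root_v
            total += (1.0 - w[i, j]) ** p  # persistence of the pair (1, w')
            merges += 1
            if merges == n_in + n_out - 1:
                break                    # spanning tree complete
    return total ** (1.0 / p)


W = np.array([[1.0, -0.5],
              [0.5, 0.25]])
print(neural_persistence(W))
print(neural_persistence(10.0 * W))  # unchanged under rescaling
```

Because the weights are divided by their largest absolute value before sorting, rescaling the matrix does not change the result.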
3.2 Properties of neural persistence
We now establish properties of neural persistence that permit the comparison of networks with different architectures. As a first step, we derive bounds for the neural persistence of a single layer $G_k$.
Theorem 1.
Let $G_k$ be a layer of a neural network according to Definition 1. Furthermore, let $\varphi_k\colon E_k \to W'$ denote the function that assigns each edge of $G_k$ a transformed weight. Using the filtration from Section 3.1 to calculate persistent homology, the neural persistence $\mathrm{NP}(G_k)$ of the $k$th layer satisfies
$$0 \leq \mathrm{NP}(G_k) \leq \left( \max_{e \in E_k} \varphi_k(e) - \min_{e \in E_k} \varphi_k(e) \right) \left( |V_k \sqcup V_{k+1}| - 1 \right)^{\frac{1}{p}},$$
where $|V_k \sqcup V_{k+1}|$ denotes the cardinality of the vertex set, i.e. the number of neurons in the layer.
Proof.
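We prove this constructively and show that the bounds can be realized. For the lower bound, let $G_k^-$ be a fully-connected layer with $n$ vertices and, given $\theta \in [0, 1]$, let $\varphi_k(e) := \theta$ for every edge $e$. Since a vertex $v$ is created before its incident edges, the filtration degenerates to a lexicographical ordering of vertices and edges, and all points in $\mathcal{D}_k$ will be of the form $(\theta, \theta)$. Thus, $\mathrm{NP}(G_k^-) = 0$. For the upper bound, let $G_k^+$ again be a fully-connected layer with $n$ vertices and let $a, b \in [0, 1]$ with $a < b$. Select one edge $e'$ at random and define a weight function as $\varphi(e') := b$ and $\varphi(e) := a$ otherwise. In the filtration, the addition of the first edge will create a pair of the form $(b, b)$, while all other pairs will be of the form $(b, a)$. Consequently, we have
$$\mathrm{NP}(G_k^+) = \left( \mathrm{pers}(b, b)^p + (n - 1) \cdot \mathrm{pers}(b, a)^p \right)^{\frac{1}{p}} \tag{3}$$
$$= (n - 1)^{\frac{1}{p}} \cdot (b - a) = \left( |V_k \sqcup V_{k+1}| - 1 \right)^{\frac{1}{p}} \cdot \left( \max_{e \in E_k} \varphi_k(e) - \min_{e \in E_k} \varphi_k(e) \right),$$
so our upper bound can be realized. To show that this term cannot be exceeded by $\mathrm{NP}(G)$ for any $G$, suppose we perturb the weight function $\widetilde{\varphi}(e) := \varphi(e) + \epsilon \in [0, 1]$. This cannot increase $\mathrm{NP}$, however, because each difference in Equation 3 is maximized by $\max \varphi_k(e) - \min \varphi_k(e)$. ∎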
We can use the upper bound of Theorem 1 to normalize the neural persistence of a layer, making it possible to compare layers (and neural networks) that feature different architectures, i.e. a different number of neurons.
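Definition 3 (Normalized neural persistence).
For a layer $G_k$ following Definition 1, using the upper bound of Theorem 1, the normalized neural persistence $\widetilde{\mathrm{NP}}(G_k)$ is defined as the neural persistence of $G_k$ divided by its upper bound, i.e. $\widetilde{\mathrm{NP}}(G_k) := \mathrm{NP}(G_k) \cdot \mathrm{NP}(G_k^+)^{-1}$.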
The normalized neural persistence of a layer permits us to extend the definition to an entire network. While this is more complex than using a single filtration for a neural network, this permits us to side-step the problem of different layers having different scales.
Definition 4 (Mean normalized neural persistence).
Considering a network as a stratified graph $G$ according to Definition 1, we average the normalized neural persistence values over all $l$ layers to obtain the mean normalized neural persistence, i.e. $\overline{\mathrm{NP}}(G) := \frac{1}{l} \sum_{k=0}^{l-1} \widetilde{\mathrm{NP}}(G_k)$.
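Definitions 3 and 4 can be sketched as follows. The code assumes that the upper bound of Theorem 1 takes the form (largest minus smallest transformed weight) times (number of neurons minus one) to the power $1/p$, and it counts only spanning-tree pairs; both are assumptions of this illustration, and all function names are hypothetical.

```python
import numpy as np


def _transform(weight_matrix):
    w = np.abs(np.asarray(weight_matrix, dtype=float))
    if w.max() == 0.0:
        raise ValueError("all weights are zero; the filtration is undefined")
    return w / w.max()


def neural_persistence(weight_matrix, p=2):
    """Neural persistence of one layer via a maximum spanning tree (sketch)."""
    w = _transform(weight_matrix)
    n_in, n_out = w.shape
    parent = list(range(n_in + n_out))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    total = 0.0
    for flat in np.argsort(-w, axis=None):
        i, j = divmod(int(flat), n_out)
        root_u, root_v = find(i), find(n_in + j)
        if root_u != root_v:
            parent[root_u] = root_v
            total += (1.0 - w[i, j]) ** p
    return total ** (1.0 / p)


def normalized_neural_persistence(weight_matrix, p=2):
    """Neural persistence divided by the assumed upper bound of Theorem 1."""
    w = _transform(weight_matrix)
    n = sum(w.shape)  # number of neurons in the layer
    upper = (w.max() - w.min()) * (n - 1) ** (1.0 / p)
    if upper == 0.0:
        return 0.0    # all weights equal: no structure to measure
    return neural_persistence(weight_matrix, p) / upper


def mean_normalized_neural_persistence(weight_matrices, p=2):
    """Average of the normalized per-layer values over all layers."""
    values = [normalized_neural_persistence(w, p) for w in weight_matrices]
    return float(np.mean(values))


rng = np.random.default_rng(42)
network = [rng.normal(size=(20, 10)), rng.normal(size=(10, 5))]
per_layer = [normalized_neural_persistence(w) for w in network]
print(per_layer)
print(mean_normalized_neural_persistence(network))
```

Normalizing each layer separately keeps every per-layer value in the unit interval, so layers of different width contribute on the same scale to the mean.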
While Theorem 1 gives a lower and upper bound in a general setting, it is possible to obtain empirical bounds when we consider the tuples that result from the computation of a persistence diagram. Recall that our filtration ensures that the persistence diagram of a layer contains tuples of the form $(1, w_i)$, with $w_i \in [0, 1]$ being a transformed weight. Exploiting this structure permits us to obtain bounds that could be used prior to calculating the actual neural persistence value in order to make the implementation more efficient.
Theorem 2.
Let $G_k$ be a layer of a neural network as in Theorem 1 with $n$ vertices and $m$ edges whose edge weights are sorted in non-descending order, i.e. $w_0 \leq w_1 \leq \dots \leq w_{m-1}$. Then $\mathrm{NP}(G_k)$ can be empirically bounded by
$$\|\mathbb{1} - \mathbf{w}_{\max}\|_p \leq \mathrm{NP}(G_k) \leq \|\mathbb{1} - \mathbf{w}_{\min}\|_p,$$
where $\mathbf{w}_{\max} = (w_{m-1}, w_{m-2}, \dots, w_{m-n})^T$ and $\mathbf{w}_{\min} = (w_0, w_1, \dots, w_{n-1})^T$ are the vectors containing the $n$ largest and $n$ smallest weights, respectively.
Proof.
See Section A.2 in the appendix. ∎
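The empirical bounds only require sorted weights, which the following sketch uses to sandwich the actual value. As an assumption of this illustration, the number of selected weights equals the number of spanning-tree edges (number of neurons minus one), so that the bounds match the way `neural_persistence` counts pairs here; all function names are hypothetical.

```python
import numpy as np


def _transform(weight_matrix):
    w = np.abs(np.asarray(weight_matrix, dtype=float))
    if w.max() == 0.0:
        raise ValueError("all weights are zero; the filtration is undefined")
    return w / w.max()


def neural_persistence(weight_matrix, p=2):
    w = _transform(weight_matrix)
    n_in, n_out = w.shape
    parent = list(range(n_in + n_out))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    total = 0.0
    for flat in np.argsort(-w, axis=None):
        i, j = divmod(int(flat), n_out)
        root_u, root_v = find(i), find(n_in + j)
        if root_u != root_v:
            parent[root_u] = root_v
            total += (1.0 - w[i, j]) ** p
    return total ** (1.0 / p)


def empirical_bounds(weight_matrix, p=2):
    """Bounds from the k largest and k smallest transformed weights."""
    w = np.sort(_transform(weight_matrix).ravel())  # non-descending order
    k = sum(np.shape(weight_matrix)) - 1            # assumed number of selected weights
    lower = np.sum((1.0 - w[-k:]) ** p) ** (1.0 / p)  # largest weights
    upper = np.sum((1.0 - w[:k]) ** p) ** (1.0 / p)   # smallest weights
    return float(lower), float(upper)


rng = np.random.default_rng(7)
W = rng.normal(size=(30, 10))
lower, upper = empirical_bounds(W)
value = neural_persistence(W)
print(lower, value, upper)
```

Computing the bounds needs no union–find pass, so they are cheap to obtain before the full calculation.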
Complexity regimes in neural persistence
As an application of the two theorems, we briefly take a look at how neural persistence changes for different classes of simple neural networks. To this end, we train a perceptron on the ‘MNIST’ data set. Since our measure uses the weight matrix of a perceptron, we can compare its neural persistence with the neural persistence of random weight matrices, drawn from different distributions. Moreover, we can compare trained networks with respect to their initial parameters. Figure 2 depicts the neural persistence values as well as the lower bounds according to Theorem 2 for different settings. We can see that a network in which the optimizer diverges (due to improperly selected parameters) is similar to a random Gaussian matrix. Trained networks, on the other hand, are clearly distinguished from all other networks. Uniform matrices have a significantly lower neural persistence than Gaussian ones. This is in line with the intuition that the latter type of networks induces functional sparsity because few neurons have large absolute weights. For clarity, we refrain from showing the empirical upper bounds because most weight distributions are highly right-tailed; the bound will not be as tight as the lower bound. These results are in line with a previous analysis (Sizemore et al., 2017) of small weighted networks, in which persistent homology is seen to outperform traditional graph-theoretical complexity measures such as the clustering coefficient (see also Section A.1 in the appendix). For deeper networks, additional experiments discuss the relation between validation accuracy and neural persistence (Section A.5), the impact of different data distributions, as well as the variability of neural persistence for architectures of varying depth (Section A.6).
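The difference between random weight distributions can be explored with a small experiment. The sketch below compares one Gaussian and one uniform random matrix of a perceptron-like shape (784 inputs, 10 outputs); the shape, the seed, the distribution parameters, and the spanning-tree-based `neural_persistence` function are all assumptions of this example rather than the setup behind Figure 2.

```python
import numpy as np


def neural_persistence(weight_matrix, p=2):
    """Neural persistence of one layer via a maximum spanning tree (sketch)."""
    w = np.abs(np.asarray(weight_matrix, dtype=float))
    w = w / w.max()
    n_in, n_out = w.shape
    parent = list(range(n_in + n_out))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    total = 0.0
    for flat in np.argsort(-w, axis=None):
        i, j = divmod(int(flat), n_out)
        root_u, root_v = find(i), find(n_in + j)
        if root_u != root_v:
            parent[root_u] = root_v
            total += (1.0 - w[i, j]) ** p
    return total ** (1.0 / p)


rng = np.random.default_rng(0)
shape = (784, 10)  # assumed perceptron shape for MNIST-sized inputs

np_gaussian = neural_persistence(rng.normal(0.0, 1.0, size=shape))
np_uniform = neural_persistence(rng.uniform(-1.0, 1.0, size=shape))

print("Gaussian:", np_gaussian)
print("Uniform: ", np_uniform)
```

In a Gaussian matrix, a few entries with large absolute value dominate the normalization, so most transformed weights end up far from 1, which increases the persistence of the corresponding pairs.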
4 Experiments
This section demonstrates the utility and relevance of neural persistence for fully connected deep neural networks. We examine how commonly used regularization techniques (batch normalization and dropout) affect neural persistence of trained networks. Furthermore, we develop an early stopping criterion based on neural persistence and we compare it to the traditional criterion based on validation loss. We used different architectures with ReLU activation functions across experiments. The brackets denote the number of units per hidden layer. In addition, the Adam optimizer with hyperparameters tuned via cross-validation was used unless noted otherwise. Please refer to Table A.2 in the appendix for further details about the experiments.
4.1 Deep learning best practices in light of neural persistence
We compare the mean normalized neural persistence (see Definition 4) of a two-layer (with an architecture of ) neural network to two models where batch normalization (Ioffe & Szegedy, 2015) or dropout (Srivastava et al., 2014) are applied. Figure 3 shows that the networks designed according to best practices yield higher normalized neural persistence values on the ‘MNIST’ data set in comparison to an unmodified network. The effect of dropout on the mean normalized neural persistence is more pronounced and this trend is directly analogous to the observed accuracy on the test set. These results are consistent with expectations if we consider dropout to be similar to ensemble learning (Hara et al., 2016). As individual parts of the network are trained independently, a higher degree of per-layer redundancy is expected, resulting in a different structural complexity. Overall, these results indicate that for a fixed architecture approaches targeted at increasing the neural persistence during the training process may be of particular interest.
4.2 Early stopping based on neural persistence
Neural persistence can be used as an early stopping criterion that does not require a validation data set to prevent overfitting: if the mean normalized neural persistence does not increase by more than $\Delta_{\min}$ during a certain number of epochs $g$, the training process is stopped. This procedure is called ‘patience’ and Algorithm 2 describes it in detail. A similar variant of this algorithm, using validation loss instead of persistence, is the state-of-the-art for early stopping in training (Bengio, 2012; Chollet et al., 2015). To evaluate the efficacy of our measure, we compare it against validation loss in an extensive set of scenarios. More precisely, for a training process with at most $G$ epochs, we define a $G \times G$ parameter grid consisting of the ‘patience’ parameter $g$ and a burn-in rate $b$ (both measured in epochs). The burn-in rate $b$ defines the number of epochs after which an early stopping criterion starts monitoring, thereby preventing underfitting. Subsequently, we set $\Delta_{\min} = 0$ for all measures to remain comparable and scale-invariant, as non-zero values could implicitly favour one of them due to scaling. For each data set, we perform 100 training runs of the same architecture, monitoring validation loss and mean normalized neural persistence every quarter epoch. The early stopping behaviour of both measures is simulated for each combination of $b$ and $g$ and their performance over all runs is summarized in terms of median test accuracy and median stopping epoch; if a criterion is not triggered for one run, we report the test accuracy at the end of the training and the number of training epochs. This results in a scatterplot, where each point (corresponding to a single parameter combination) shows the difference in epochs and the absolute difference in test accuracy (measured in percent). The quadrants permit an intuitive explanation: $Q_2$, for example, contains all configurations for which our measure stops earlier, while achieving a higher accuracy.
Since $b$ and $g$ are typically chosen to be small in an early stopping scenario, we use grey points to indicate uncommon configurations for which $b$ or $g$ is larger than half of the total number of epochs. Furthermore, to summarize the performance of our measure, we calculate the barycentre of all configurations (green square).
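The patience procedure described above can be sketched as a function that replays a recorded curve. This is an illustration and not Algorithm 2 itself: the name `stopping_epoch`, the zero-based epoch indices, and the choice to ignore all values observed during burn-in are assumptions.

```python
def stopping_epoch(values, patience, burn_in=0, min_delta=0.0):
    """Return the index at which training would stop, or None if never triggered.

    values[i] is the monitored quantity after epoch i (higher is better),
    e.g. the mean normalized neural persistence.
    """
    best = float("-inf")
    stale = 0  # consecutive monitored epochs without sufficient improvement
    for epoch, value in enumerate(values):
        if epoch < burn_in:
            continue  # monitoring has not started yet
        if value > best + min_delta:
            best = value
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return None


curve = [0.10, 0.30, 0.45, 0.50, 0.50, 0.49, 0.50, 0.51]  # made-up values

print(stopping_epoch(curve, patience=2))
print(stopping_epoch(curve, patience=4))
print(stopping_epoch(curve, patience=2, burn_in=4))
```

The same function can replay a validation loss curve by passing the negated loss values, which makes both criteria comparable for a given pair of patience and burn-in.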
Figure 4(a) depicts the comparison with validation loss for the ‘Fashion-MNIST’ (Xiao et al., 2017) data set; please refer to Section A.3 in the appendix for more data sets. Here, we observe that most common configurations are in $Q_2$ or in $Q_3$, i.e. our criterion stops earlier. The barycentre is at , showing that out of 625 configurations, on average we stop half an epoch earlier than validation loss, while losing virtually no accuracy (). Figure 4(c) depicts detailed differences in accuracy and epoch for our measure when compared to validation loss; each cell in a heatmap corresponds to a single parameter configuration of $b$ and $g$. In the heatmap of accuracy differences, blue, white, and red represent parameter combinations for which we obtain higher, equal, or lower accuracy, respectively, than with validation loss for the same parameters. Similarly, in the heatmap of epoch differences, green represents parameter combinations for which we stop earlier than validation loss. For , we stop earlier ( epochs on average), while losing only accuracy. Finally, Figure 4(d) shows how often each measure is triggered. Ideally, each measure should consist of a dark green triangle, as this would indicate that each configuration stops all the time. For this data set, we observe that our method stops for more parameter combinations than validation loss, but not as frequently for all of them. To ensure comparability across scenarios, we did not use the validation data as additional training data when stopping with neural persistence; we refer to Section A.7 for additional experiments in data scarcity scenarios. We observe that our method stops earlier when overfitting can occur, and it stops later when longer training is beneficial.
5 Discussion
In this work, we presented neural persistence, a novel topological measure of the structural complexity of deep neural networks. We showed that this measure captures topological information that pertains to deep learning performance. Being rooted in a rich body of research, our measure is theoretically well-defined and, in contrast to previous work, generally applicable as well as computationally efficient. We showed that our measure correctly identifies networks that employ best practices such as dropout and batch normalization. Moreover, we developed an early stopping criterion that exhibits competitive performance while not relying on a separate validation data set. Thus, by saving valuable data for training, we managed to boost accuracy, which can be crucial for enabling deep learning in regimes of smaller sample sizes. Following Theorem 2, we also experimented with using the $p$-norm of all weights of the neural network as a proxy for neural persistence. However, this did not yield an early stopping measure because it was never triggered, thereby suggesting that neural persistence captures salient information that would otherwise be hidden among all the weights of a network. We extended our framework to convolutional neural networks (see Section A.4) by deriving a closed-form approximation, and observed that an early stopping criterion based on neural persistence for convolutional layers will require additional work. Furthermore, we conjecture that assessing dissimilarities of networks by means of persistence diagrams (making use of higher-dimensional topological features), for example, will lead to further insights regarding their generalization and learning abilities. Another interesting avenue for future research would concern the analysis of the ‘function space’ learned by a neural network. On a more general level, neural persistence demonstrates the great potential of topological data analysis in machine learning.
Appendix A Appendix
A.1 Comparison with graph-theoretical measures
Traditional complexity/structural measures from graph theory, such as the clustering coefficient, the average shortest path length, and global/local efficiency are already known to be insufficiently accurate to characterize different models of complex random networks (Sizemore et al., 2017). Our experiments indicate that this holds true for (deep) neural networks, too. As a brief example, we trained a perceptron on the MNIST data set with batch stochastic gradient descent (), achieving a test accuracy of . Moreover, we intentionally ‘sabotaged’ the training by setting the learning rate to $1 \times 10^{-5}$ such that SGD is unable to converge properly. This leads to networks with accuracies ranging from –. A complexity measure should be capable of distinguishing both classes of networks. However, as Figure A.1 (top) shows, this is not the case for the clustering coefficient. Neural persistence (bottom), on the other hand, results in two regimes that can clearly be distinguished, with the trained networks having a significantly smaller variance.
A.2 Proof of Theorem 2
Proof.
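We may consider the filtration from Section 3.1 to be a subset selection problem with constraints, where we select $n$ out of $m$ weights. The neural persistence $\mathrm{NP}(G_k)$ of a layer thus only depends on the selected weights that appear as tuples of the form $(1, w_i)$ in $\mathcal{D}_k$. Letting $\mathbf{w}$ denote the vector of selected weights arising from the persistence diagram calculation, we can rewrite neural persistence as $\mathrm{NP}(G_k) = \|\mathbb{1} - \mathbf{w}\|_p$. Furthermore, $\mathbf{w}$ satisfies $\|\mathbf{w}_{\max}\|_p \geq \|\mathbf{w}\|_p \geq \|\mathbf{w}_{\min}\|_p$. Since all transformed weights are non-negative in our filtration, it follows that (note the reversal of the two terms)
$$\|\mathbb{1} - \mathbf{w}_{\max}\|_p \leq \|\mathbb{1} - \mathbf{w}\|_p \leq \|\mathbb{1} - \mathbf{w}_{\min}\|_p,$$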
and the claim follows. ∎
A.3 Additional visualizations and analyses for early stopping
Due to space constraints and the large number of configurations that we investigated for our early stopping experiments, this section contains additional plots that follow the same schematic: the top row shows the differences in accuracy and epoch for our measure when compared to the commonly-used validation loss. Each cell in the heatmap corresponds to a single configuration of $b$ and $g$. In the heatmap of accuracy differences, blue represents parameter combinations for which we obtain a higher accuracy than validation loss for the same parameters; white indicates combinations for which we obtain the same accuracy, while red highlights combinations in which our accuracy decreases. Similarly, in the heatmap of epoch differences, green represents parameter combinations for which we stop earlier than validation loss for the same parameter. The scatterplots in Section 4.2 show an ‘unrolled’ version of this heatmap, making it possible to count how many parameter combinations result in early stops while also increasing accuracy, for example. The heatmaps, by contrast, make it possible to compare the behaviour of the two measures with respect to each parameter combination. Finally, the bottom row of every plot shows how many times each measure was triggered for every parameter combination. We consider a measure to be triggered if its stopping condition is satisfied prior to the last training epoch. Due to the way the parameter grid is set up, no configuration above the diagonal can stop, because $b + g$ would be larger than the total number of training epochs. This permits us to compare the ‘slopes’ of cells for each measure. Ideally, each measure should consist of a dark green triangle, as this would indicate that each parameter configuration stops all the time.
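The triangular structure of the trigger plots follows directly from replaying a curve over the whole parameter grid, as the sketch below illustrates. The helper names, the zero-based epoch indices, and the made-up curve are assumptions; in this convention, a configuration can only trigger if the sum of burn-in and patience is smaller than the number of recorded epochs.

```python
def stopping_epoch(values, patience, burn_in=0, min_delta=0.0):
    """Index at which a patience criterion stops, or None if never triggered."""
    best = float("-inf")
    stale = 0
    for epoch, value in enumerate(values):
        if epoch < burn_in:
            continue
        if value > best + min_delta:
            best, stale = value, 0
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return None


def trigger_grid(values):
    """Replay one curve for every (burn-in, patience) combination."""
    total = len(values)
    return {
        (b, g): stopping_epoch(values, patience=g, burn_in=b)
        for b in range(total)
        for g in range(1, total + 1)
    }


curve = [0.10, 0.30, 0.45, 0.50, 0.50, 0.49, 0.50, 0.51, 0.51, 0.51]
grid = trigger_grid(curve)

# Print a small trigger map: '#' if the criterion fires, '.' otherwise.
for g in range(len(curve), 0, -1):
    print("".join("#" if grid[(b, g)] is not None else "." for b in range(len(curve))))
```

Repeating this for every training run and counting the triggers per cell gives the kind of frequency map shown in the bottom rows of the plots.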
MNIST
Please refer to Figures A.2 and A.3. The colours in the difference matrix of the top row are slightly skewed because in a certain configuration, our measure loses of accuracy when stopping. However, there are many other configurations in which virtually no accuracy is lost and in which we are able to stop more than four epochs earlier. The heatmaps in the bottom row again indicate that neural persistence is capable of stopping for more parameter combinations in general. We do not trigger as often for some of them, though.
CIFAR-10
Please refer to Figure A.4. In general, we observe that this data set is more sensitive with respect to the parameters for early stopping. While there are several configurations in which neural persistence stops with an increase of almost in accuracy, there are also scenarios in which we cannot stop training earlier, or have to train longer (up to epochs out of epochs in total). The second row of plots shows that our measure triggers reliably for more configurations than validation loss. Overall, the scatterplot of all scenarios (Figure A.5) shows that most practical configurations are again located in and . While we may thus find certain configurations in which we reliably outperform validation loss as an early stopping criterion, we also want to point out that our measure behaves correctly for many practical configurations. Points in , where we train longer and achieve a higher accuracy, are characterized by a high patience of approximately epochs and a low burn-in rate , or vice versa. This is caused by the training for CIFAR-10, which does not reliably converge for FCNs. Figure A.6 demonstrates this by showing loss curves and the mean normalized neural persistence curves of five runs over training (loss curves have been averaged over all runs; standard deviations are shown in grey; we show the first half of the training to highlight the behaviour for practical early stopping conditions). For ‘Fashion-MNIST’, we observe that the mean normalized neural persistence exhibits clear change points during the training process, which can be exploited for early stopping. For ‘CIFAR-10’, we observe a rather incremental growth for some runs (with no clearly-defined maximum), making it harder to derive a generic early stopping criterion that does not depend on fine-tuned parameters. Hence, we hypothesize that neural persistence cannot be used reliably in scenarios where the architecture is incapable of learning the data set.
In the future, we plan to experiment with deliberately selected ‘bad’ and ‘good’ architectures in order to evaluate to what extent our topological measure is capable of assessing their suitability for training, but this is beyond the scope of this paper.
IMDB
Please refer to Figure A.7. For this data set, we observe that most parameter configurations result in earlier stopping (up to two epochs earlier than validation loss), with accuracy increases of up to . This is also shown in the scatterplot A.8. Only a single configuration, viz. and , results in a severe loss of accuracy; we removed it from the scatterplot for reasons of clarity, as its accuracy difference of would skew the display of the remaining configurations too much (this is also why the legends do not include this outlier).
A.4 Neural Persistence for Convolutional Layers
In principle, the proposed filtration process could be applied to any bipartite graph. Hence, we can directly apply our framework to convolutional layers, provided we represent them properly. Specifically, for layer we represent the convolution of its th input feature map with the th filter as one bipartite graph parametrized by a sparse weight matrix , which in each row contains the unrolled values of on the diagonal, with zeros padded in between after each values of . This way, the flattened pre-activation can be described as .
Since flattening does not change the topology of our bipartite graph, we compute the normalized neural persistence on this sparse weight matrix as the unrolled analogue of the fully-connected network’s weight matrix. Averaging over all filters then gives a per-layer measure, similar to the way we derived mean normalized neural persistence in the main paper.
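The unrolling described above can be illustrated with a small NumPy sketch that builds the weight matrix for one input feature map and one filter, and checks that multiplying it with the flattened input reproduces the convolution. The helper names, the stride of one without padding, and the use of cross-correlation (the convention in most deep learning frameworks) are assumptions made for illustration.

```python
import numpy as np


def unroll_filter(kernel, in_shape):
    """Build the (h_out * w_out) x (h_in * w_in) matrix of a 2D convolution."""
    h_in, w_in = in_shape
    p, q = kernel.shape
    h_out, w_out = h_in - p + 1, w_in - q + 1
    W = np.zeros((h_out * w_out, h_in * w_in))
    for r in range(h_out):
        for c in range(w_out):
            row = r * w_out + c
            for u in range(p):
                # One kernel row, followed by zeros up to the next input row.
                start = (r + u) * w_in + c
                W[row, start:start + q] = kernel[u]
    return W


def correlate_valid(x, kernel):
    """Direct sliding-window reference implementation (no padding, stride 1)."""
    p, q = kernel.shape
    h_out, w_out = x.shape[0] - p + 1, x.shape[1] - q + 1
    out = np.empty((h_out, w_out))
    for r in range(h_out):
        for c in range(w_out):
            out[r, c] = np.sum(x[r:r + p, c:c + q] * kernel)
    return out


rng = np.random.default_rng(1)
x = rng.normal(size=(5, 5))
kernel = rng.normal(size=(3, 3))
W = unroll_filter(kernel, x.shape)
print(W.shape, np.allclose(W @ x.ravel(), correlate_valid(x, kernel).ravel()))
```

Every row of `W` holds the same filter values at shifted positions, which is why each output neuron shares the same set of edge weights.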
When studying the unrolled adjacency matrix , it becomes clear that the edge filtration process can be approximated in a closed form. Specifically, for and input and output neurons we initialize connected components. When using zero padding, the additional dummy input neurons have to be included in . For all tuples in the persistence diagram the creation event . Notably, each output neuron shares the same set of edge weights.
Due to this, the destruction events—except for a few special cases—simplify to a list of length containing the largest filter values (each value is contained times) in descending order until the list is filled. This simplification of neural persistence of a convolution with one filter is shown as a closed expression in Equations 7–11, and our implementation is sketched in Algorithm 3. We thus obtain
[TABLE]
where . Following this notation, Equation 7 expresses neural persistence of the bipartite graph , with denoting the vector of selected weights (i.e. the destruction events) when calculating the persistence diagram. We use to denote the flattened and sorted weight values (in descending order) of the convolutional filter , while represents the vector of all weights that are located in a corner of , whereas is the vector of all weights which do not originate from the corner of the filter while still belonging to the first (and thus largest) weights in , which we denote by .
For the subsequent experiments (see below), we use a simple CNN that employs filters. Hence, by using the shortcut described above, we do not have to unroll 2080 weight matrices explicitly, thereby gaining both in memory efficiency and run time, as compared to the naive approach: on average, a naive exact computation based on unrolling required per convolutional filter and evaluation step, whereas the approximation only took about while showing very similar behaviour up to a constant offset.
For our experiments, we used an off-the-shelf ‘LeNet-like’ CNN model architecture (two convolutional layers, each with max pooling and ReLU, followed by one fully-connected layer and softmax) as described in Abadi et al. (2015). We trained the model on ‘Fashion-MNIST’ and included this setup in the early stopping experiments (100 runs of 20 epochs). In Figure A.9, we observe that stopping based on the neural persistence of a convolutional layer typically incurs a considerable loss of accuracy: given a final test accuracy of , stopping with this naive extension of our measure reduces accuracy by up to 4%. Furthermore, in contrast to early stopping on a fully-connected architecture, we do not observe any parameter combinations that stop early and increase accuracy. In fact, there is no configuration that results in an increased accuracy. This empirically confirms our theoretical scepticism towards naively applying our edge-focused filtration scheme to CNNs.
A.5 Relationship between neural persistence and validation accuracy
Motivated by Figure 2, which shows the different ‘regimes’ of neural persistence for a perceptron network, we investigate a possible correlation of (high) neural persistence with (high) predictive accuracy. For deeper networks, we find that neural persistence measures structural properties that arise from different parameters (such as training procedures or initializations), and no correlation can be observed.
For our experiments, we constructed neural networks with a high neural persistence prior to training. More precisely, following the theorems in this paper, we initialized most weights of each layer with very low values and reserved high values for very few weights. This was achieved by sampling the weights from a beta distribution with and . Using this procedure, we are able to initialize [20,20,20] networks with compared to the same networks that have when initialized by Xavier initialization. The mean validation accuracy of these untrained networks on the ‘Fashion-MNIST’ data set is and , respectively.
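The effect of such an initialization can be sketched for a single layer by comparing beta-distributed weights with Xavier (Glorot) uniform weights. The shape parameters `0.005` and `0.5` below are illustrative values chosen for this sketch and should not be read as the values used in the experiments. The persistence routine is a simplified, unnormalized union-find version that pairs the surviving component with the smallest transformed weight, which is an assumption.

```python
import numpy as np


def neural_persistence(W, p=2):
    """Unnormalized neural persistence of one layer (union-find sketch)."""
    W = np.abs(np.asarray(W, dtype=float))
    W = W / W.max()
    n_in, n_out = W.shape
    parent = list(range(n_in + n_out))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Assumption: the surviving component is paired with the smallest weight.
    selected = [W.min()]
    for flat in np.argsort(-W, axis=None):
        i, j = divmod(int(flat), n_out)
        a, b = find(i), find(n_in + j)
        if a != b:
            parent[a] = b
            selected.append(W[i, j])
    return float(np.linalg.norm(1.0 - np.array(selected), ord=p))


rng = np.random.default_rng(0)
shape = (20, 20)

# Hypothetical shape parameters: most weights near zero, very few large ones.
W_beta = rng.beta(0.005, 0.5, size=shape)

# Xavier (Glorot) uniform initialization for the same layer shape.
limit = np.sqrt(6.0 / (shape[0] + shape[1]))
W_xavier = rng.uniform(-limit, limit, size=shape)

np_beta = neural_persistence(W_beta)
np_xavier = neural_persistence(W_xavier)
print(np_beta, np_xavier)
```

Because both layers have the same shape, the unnormalized values can be compared directly; the beta-initialized layer should yield the larger value.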
Figure A.10 depicts how both types of networks converge to similar regimes of validation accuracy, while the mean normalized neural persistence achieved at the end of the training varies. For networks initialized with high (Figure A.10, left) the validation accuracy of networks with final ranges from (not shown) to . For Xavier initialization (Figure A.10, right), the lack of correlation can also be observed. Furthermore, comparing the two plots, there are no clear advantages in initializing networks with high . This observation further motivates the proposed early stopping criterion, which checks for changes in the value, and considers stagnating values to be indicative of a trained network.
A.6 Neural persistence for different data distributions and deeper FCN architectures
Neural persistence captures information about different data distributions during training. The weights tuned via backpropagation are directly influenced by the input data (as well as their labels) and neural persistence tracks those changes. To demonstrate this, we trained the same architecture , i.e. , on two data sets with the same dimensions but different properties: MNIST and ‘Fashion-MNIST’. Both data sets have the same image size ($28 \times 28$ pixels, one channel) but lie on different manifolds. Figure A.11 (left) shows a histogram of the mean normalized neural persistence () after epochs of training over different runs. The distributions have a similar shape but are shifted, indicating that the two data sets lead the network to different topological regimes.
We also investigated the effect of depth on neural persistence. We selected a fixed layer size (20 hidden units) and increased the number of hidden layers. Figure A.11 (right) depicts the boxplots of mean for multiple architectures after 15 epochs of training on MNIST. Adding layers initially increases the variability of by enabling the network to converge to different regimes (essentially, there are many more valid configurations in which a trained neural network might end up). However, this effect is reduced after a certain depth: networks with deeper architectures exhibit less variability in .
A.7 Early stopping in data scarcity scenarios
Labelled data is expensive in most domains of interest, which results in small data sets or low quality of the labels. We investigate the following experimental set-ups:
(1) Reducing the training data set size and
(2) Permuting a fraction of the training labels.
We train a fully connected network ( architecture) on ‘MNIST’ and ‘Fashion-MNIST’. In the experiments, we compare the following measures for stopping the training:
i) Stopping at the optimal test accuracy.
ii) Fixed stopping after the burn-in period.
iii) Neural persistence patience criterion.
iv) Training loss patience criterion.
v) Validation loss patience criterion.
For a description of the patience criterion, see Algorithm 2. All measures, except validation loss, include the validation datasets () in the training process to simulate a larger data set when no cross-validation is required. We report the accuracy on the non-reduced, non-permuted test sets. The batch size is training instances. The stopping measures are evaluated every quarter epoch.
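The two experimental set-ups can be sketched as simple data transformations. The helper names and the toy arrays below are hypothetical. In particular, 'permuting a fraction of the training labels' is interpreted here as shuffling the labels within a randomly chosen subset of training instances, which is one plausible reading rather than a confirmed description of the procedure.

```python
import numpy as np


def reduce_training_set(X, y, fraction, rng):
    """Keep a random subset containing the given fraction of instances."""
    n_keep = max(1, int(round(fraction * len(X))))
    idx = rng.choice(len(X), size=n_keep, replace=False)
    return X[idx], y[idx]


def permute_labels(y, fraction, rng):
    """Shuffle the labels among a random fraction of the instances."""
    y = y.copy()
    n_perm = int(round(fraction * len(y)))
    idx = rng.choice(len(y), size=n_perm, replace=False)
    # The right-hand side is a copy, so the assignment is a true permutation.
    y[idx] = y[rng.permutation(idx)]
    return y


rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))        # toy features standing in for images
y = rng.integers(0, 10, size=100)    # toy labels for ten classes

X_small, y_small = reduce_training_set(X, y, fraction=0.25, rng=rng)
y_noisy = permute_labels(y, fraction=0.5, rng=rng)
print(X_small.shape, int(np.sum(y_noisy != y)))
```

Under this reading, permuting labels keeps the overall class frequencies unchanged, so only the pairing between instances and labels becomes noisy.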
Figure A.12 shows the results averaged over runs (the error is the standard deviation). The difference between the top and the bottom panel is the data set and the patience parameters. The $x$-axis depicts the fraction of the data set, which is warped for better accessibility. In each panel, the left-hand side subplots depict the results of the reduced data set experiment, whereas the right-hand side subplots depict the results of the permutation experiments. The $y$-axis of the top subplot shows the accuracy on the non-reduced, non-permuted test set. The $y$-axis of the bottom subplot shows when the stopping criterion was triggered.
We note the following observations, which hold for both panels: More, non-permuted data yields higher test accuracy. Also, as expected, the optimal stopping gives the highest test accuracy. The fixed early stopping results in inferior test accuracy when only a fraction of the data is available. The neural persistence based stopping is triggered late when only a fraction of the data is available, which results in a slightly better test accuracy compared to training and validation loss. The training loss stopping achieves similar test accuracies compared to the persistence based stopping (for all regimes except the very small data set) with shorter training, on average. We note that it is generally not advisable to use training loss as a measure for stopping because the stability of this criterion also depends on the batch size. When only a fraction of the data is available, the validation loss based stopping stops on average after the same number of training epochs as the training loss, which results in inferior test accuracy because the network has seen fewer training samples in total. Most strikingly, validation loss based stopping is triggered later (sometimes never) when most training and validation labels are randomly permuted, which results in overfitting and poor test accuracy.
To conclude, the neural persistence based stopping achieves good performance without being affected by the batch size and noisy labels. We also note that the result is consistent for multiple architectures and most patience parameters.
A.8 Testing accuracy of differently regularized models
We showed in the main text that neural persistence is capable of distinguishing between networks trained with/without batch normalization and/or dropout. Figure A.13 additionally shows test set accuracies.
References
- [1] Abadi et al. (2015). Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhou
- [2] Achille & Soatto (2018). Alessandro Achille and Stefano Soatto. Emergence of invariance and disentanglement in deep representations. Journal of Machine Learning Research, 18:1–34, 2018.
- [3] Bahdanau et al. (2015). Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representations (ICLR), 2015.
- [4] Bengio (2012). Yoshua Bengio. Practical recommendations for gradient-based training of deep architectures. In Grégoire Montavon, Geneviève B. Orr, and Klaus-Robert Müller (eds.), Neural Networks: Tricks of the Trade, volume 7700 of Lecture Notes in Computer Science, pp. 437–478. Springer, Heidelberg, Germany, 2012.
- [5] Bianchini & Scarselli (2014). Monica Bianchini and Franco Scarselli. On the complexity of neural network classifiers: A comparison between shallow and deep architectures. IEEE Transactions on Neural Networks and Learning Systems, 25(8):1553–1565, 2014.
- [6] Bubenik (2015). Peter Bubenik. Statistical topological data analysis using persistence landscapes. Journal of Machine Learning Research, 16:77–102, 2015.
- [7] Carlsson & Mémoli (2010). Gunnar Carlsson and Facundo Mémoli. Characterization, stability and convergence of hierarchical clustering methods. Journal of Machine Learning Research, 11:1425–1470, 2010.
- [8] Carlsson et al. (2008). Gunnar Carlsson, Tigran Ishkhanov, Vin de Silva, and Afra Zomorodian. On the local behavior of spaces of natural images. International Journal of Computer Vision, 76(1):1–12, 2008.
