The Probabilistic Fault Tolerance of Neural Networks in the Continuous Limit

El-Mahdi El-Mhamdi; Rachid Guerraoui; Andrei Kucharavy; Sergei Volodin

arXiv:1902.01686·stat.ML·September 26, 2019

TL;DR

This paper investigates the probabilistic fault tolerance of neural networks in the continuous limit, providing theoretical bounds on output errors due to random neuron failures, with implications for hardware design and robustness.

Contribution

It introduces a novel probabilistic analysis of neural network robustness to neuron crashes using Taylor expansion in the continuous limit, and offers methods to improve fault tolerance.

Findings

01

Theoretical bounds on output error under neuron crashes.

02

Comparison of fault tolerance across different architectures.

03

Algorithm for enhancing fault tolerance with minimal neurons.

Tables

Table 1. Correspondence between discrete and continuous quantities. Whereas a regular (discrete) NN is a function mapping vectors to vectors, a continuous NN is an operator mapping functions to functions.

Quantity	Discrete		Continuous
Input	$x : [n_{0}] \to \mathbb{R}$	$\longmapsto$	$x : [0, 1] \to \mathbb{R}$
Weights	$W_{l} : [n_{l}] \times [n_{l - 1}] \to \mathbb{R}$		$W_{l} : {[0, 1]}^{2} \to \mathbb{R}$
Pre-activations	$z_{l}^{i} = \sum_{j} W_{l}^{i j} y_{l - 1}^{j} + b_{l}^{i}$		$z_{l} (t) = \int_{0}^{1} W_{l} (t, t^{\prime}) y_{l - 1} (t^{\prime}) \, dt^{\prime} + b_{l} (t)$

Table 2. (a) Experimental metrics

Quantity	Train rank loss	Test rank loss
Crashing, MAE	5.6%	5.6%
Crashing, Accuracy	19.8%	17.7%
Correct, MAE	23.3%	22.0%
Correct, Accuracy	31.7%	39.9%

Table 4. (b) Theoretical bounds

Quantity	Rank Loss
T1 $\mathrm{Var}\,\Delta$	3.6%
P2 $\mathbb{E}\,\Delta$	24.8%
P2 $\mathrm{Var}\,\Delta$	31.7%
T1 $\mathbb{E}\,\Delta$	40.8%

Equations

\frac{\partial^{k} y_{L}}{\partial y_{l}^{i^{1}} \cdots \partial y_{l}^{i^{k}}} = \frac{1}{n_{l}^{k}} \frac{\delta^{k} y_{L}}{\delta y_{l}(i^{1}) \cdots \delta y_{l}(i^{k})} + o(1), \quad n_{l} \to \infty

\mathbb{E}\Delta_{L} = p_{l}\sum_{i=1}^{n_{l}} \frac{\partial y_{L}}{\partial\xi^{i}}\Big|_{\xi=0} + \Theta_{\pm}(1) D_{2} r^{2}, \quad \mathrm{Var}\Delta_{L} = p_{l}\sum_{i=1}^{n_{l}} \left(\frac{\partial y_{L}}{\partial\xi^{i}}\Big|_{\xi=0}\right)^{2} + \Theta_{\pm}(1) D_{12} r^{3}

\hat{\mathcal{L}}(W) = \mathcal{L}(W) + \lambda \sum_{i=1}^{n_{l}} \left(\frac{\partial \mathcal{L}}{\partial y^{i}} \cdot y^{i}\right)^{2} + \mu \left(\frac{\max_{i} W_{l}^{i}}{\min_{i} W_{l}^{i}}\right)^{2} + \psi \cdot \mbox{smooth}(W_{l}) + \nu \|W\|_{\infty}

Full text

Distributed Computing Laboratory

Swiss Federal Institute of Technology in Lausanne (EPFL)

1015-Lausanne, Switzerland

{elmahdi.elmhamdi, rachid.guerraoui, andrei.kucharavy, sergei.volodin}@epfl.ch

Abstract

The loss of a few neurons in a brain rarely results in any visible loss of function. However, the insight into what “few” means in this context is unclear. How many random neuron failures will it take to lead to a visible loss of function? In this paper, we address the fundamental question of the impact of the crash of a random subset of neurons on the overall computation of a neural network and the error in the output it produces. We study fault tolerance of neural networks subject to small random neuron/weight crash failures in a probabilistic setting. We give provable guarantees on the robustness of the network to these crashes. Our main contribution is a bound on the error in the output of a network under small random Bernoulli crashes proved by using a Taylor expansion in the continuous limit, where close-by neurons at a layer are similar. The failure mode we adopt in our model is characteristic of neuromorphic hardware, a promising technology to speed up artificial neural networks, as well as of biological networks. We show that our theoretical bounds can be used to compare the fault tolerance of different architectures and to design a regularizer improving the fault tolerance of a given architecture. We design an algorithm achieving fault tolerance using a reasonable number of neurons. In addition to the theoretical proof, we also provide experimental validation of our results and suggest a connection to the generalization capacity problem.

1 Introduction

Understanding the inner workings of artificial neural networks (NNs) is currently one of the most pressing questions [20] in learning theory. As of now, neural networks are the backbone of the most successful machine learning solutions [37, 18]. They are deployed in safety-critical tasks in which there is little room for mistakes [10, 40]. Nevertheless, mistakes have been regularly reported since attention was brought to the vulnerabilities of NNs over the past few years [37, 5, 24, 8].

Fault tolerance as a part of theoretical NNs research. Understanding complex systems requires understanding how they can tolerate failures of their components. This has been a particularly fruitful method in systems biology, where the mapping of the full network of metabolite molecules is a computationally quixotic venture. Instead of fully mapping the network, biologists improved their understanding of biological networks by studying the effect of deleting some of their components, one or a few perturbations at a time [7, 12]. Biological systems in general are found to be fault tolerant [28], which is thus an important criterion for biological plausibility of mathematical models.

Neuromorphic hardware (NH). Current Machine Learning systems are bottlenecked by the underlying computational power [1]. One significant improvement over the now prevailing CPU/GPUs is neuromorphic hardware. In this paradigm of computation, each neuron is a physical entity [9], and the forward pass is done (theoretically) at the speed of light. However, components of such hardware are small and unreliable, leading to small random perturbations of the weights of the model [41]. Thus, robustness to weight faults is an overlooked concrete Artificial Intelligence (AI) safety problem [2]. Since we ground the assumptions of our model in the properties of NH and of biological networks, our fundamental theoretical results can be directly applied in these computing paradigms.

Research on NN fault tolerance. In the 2000s, the fault tolerance of NNs was a major motivation for studying them [14, 16, 4]. In the 1990s, the exploration of microscopic failures was fueled by the hopes of developing neuromorphic hardware (NH) [22, 6, 34]. Taylor expansion was one of the tools used for the study of fault tolerance [13, 26]. Another line of research proposes sufficient conditions for robustness [33]. However, most of these studies are either empirical or are limited to simple architectures [41]. In addition, those studies address the worst case [5], which is known to be more severe than a random perturbation. Recently, fault tolerance was studied experimentally as well. DeepMind proposes to focus on neuron removal [25] to understand NNs. NVIDIA [21] studies error propagation caused by micro-failures in hardware [3]. In addition, mathematically similar problems are raised in the study of generalization [29, 30] and robustness [42].

The quest for guarantees. Existing NN approaches do not guarantee fault tolerance: they only provide heuristics and evaluate them experimentally. Theoretical papers, in turn, focus on the worst case and not on errors in a probabilistic sense. It is known that there exists a set of small worst-case perturbations, adversarial examples [5], leading to pessimistic bounds not suitable for the average case of random failures, which is the most realistic case for hardware faults. Another branch of theoretical research studies robustness and arrives at error bounds which, unfortunately, scale exponentially with the depth of the network [29]. We define the goal of this paper as guaranteeing that the probability of the loss exceeding a threshold is lower than a pre-determined small value. This condition is sensible. For example, self-driving cars are deemed to be safe once their probability of a crash is several orders of magnitude lower than that of human drivers [40, 15, 36]. In addition, current fault-tolerant architectures use the mean as the aggregation of copies of networks to achieve redundancy. This is known to require exponentially more redundancy, and thus hardware cost, than the median approach. In order to apply this powerful technique and reduce costs, certain conditions need to be satisfied, which we will evaluate for neural networks.

Contributions. Our main contribution is a theoretical bound on the error in the output of an NN in the case of random neuron crashes obtained in the continuous limit, where close-by neurons compute similar functions. We show that, while the general problem of fault tolerance is NP-hard, realistic assumptions with regard to neuromorphic hardware, and a probabilistic approach to the problem, allow us to apply a Taylor expansion for the vast majority of the cases, as the weight perturbation is small with high probability. In order for the Taylor expansion to work, we assume that a network is smooth enough, introducing the continuous limit [39] to prove the properties of NNs: it requires neighboring neurons at each layer to be similar. This makes the moments of the error linear-time computable. To our knowledge, the tightness of the bounds we obtain is a novel result. In turn, the bound allows us to build an algorithm that enhances fault tolerance of neural networks. Our algorithm uses median aggregation which results in only a logarithmic extra cost – a drastic improvement on the initial NP-hardness of the problem. Finally, we show how to apply the bounds to specific architectures and evaluate them experimentally on real-world networks, notably the widely used VGG [38].

Outline. In Sections 2-4, we set the formalism, then state our bounds. In Section 5, we present applications of our bounds on characterizing the fault tolerance of different architectures. In Section 6 we present our algorithm for certifying fault tolerance. In Section 7, we present our experimental evaluation. Finally, in Section 8, we discuss the consequences of our findings. Full proofs are available in the supplementary material. Code is provided at the address github.com/LPD-EPFL/ProbabilisticFaultToleranceNNs. We abbreviate Assumption 1 $\to$ A1, Proposition $1$ $\to$ P1, Theorem 1 $\to$ T1, Definition 1 $\to$ D1.

2 Definitions of Probabilistic Fault Tolerance

In this section, we define a fully-connected network and fault tolerance formally.

Notations. For any two vectors $x,y\in\mathbb{R}^{n}$ we use the notation $(x,y)=\sum_{i=1}^{n}x_{i}y_{i}$ for the standard scalar product. The matrix $\gamma$-norm for $\gamma\in(0,+\infty]$ is defined as $\|A\|_{\gamma}=\sup_{x\neq 0}\|Ax\|_{\gamma}/\|x\|_{\gamma}$. We use the infinity norm $\|x\|_{\infty}=\max|x_{i}|$ and the corresponding operator matrix norm. We call a vector $0\neq x\in\mathbb{R}^{n}$ $q$-balanced if $\min|x_{i}|\geq q\max|x_{i}|$. We denote $[n]=\{1,2,...,n\}$. We define the Hessian $H_{ij}={\partial^{2}y(x)}/{\partial x^{i}\partial x^{j}}$ as a matrix of second derivatives. We write layer indices down and element indices up: $W_{l}^{ij}$. For the input, we write $x_{i}\equiv x^{i}$. If the layer is fixed, we omit its index. We use the element-wise Hadamard product $(x\odot y)_{i}=x_{i}y_{i}$.

Definition 1.

*(Neural network)

A neural network with $L$ layers is a function $y_{L}\colon\mathbb{R}^{n_{0}}\to\mathbb{R}^{n_{L}}$ defined by a tuple $(L,W,B,\varphi)$ with a tuple of weight matrices $W=(W_{1},...,W_{L})$ (or their distributions) of size $W_{l}\colon n_{l}\times n_{l-1}$, biases $B=(b_{1},...,b_{L})$ (or their distributions) of size $b_{l}\in\mathbb{R}^{n_{l}}$ by the expression $y_{l}=\varphi(z_{l})$ with pre-activations $z_{l}=W_{l}y_{l-1}+b_{l},\,l\in[L]$, $y_{0}=x$ and $y_{L}=z_{L}$. Note that the last layer is linear. We additionally require $\varphi$ to be 1-Lipschitz, that is, $|\varphi(x)-\varphi(y)|\leqslant|x-y|$. If $\varphi$ is $K$-Lipschitz, we rescale the weights to make $K=1$: $W^{ij}_{l}\to W^{ij}_{l}/K$. This is the general case. Indeed, if we rescale $\varphi(x)\to K\varphi(x)$, then $y_{l-1}\to Ky_{l-1}^{\prime}$, and in the sum $z_{l}^{\prime}=\sum W^{ij}/K\cdot Ky_{l-1}\equiv z_{l}$. We assume that the network was trained using input-output pairs $x,y^{*}\sim X\times Y$ using ERM (Empirical Risk Minimization, the standard task $1/m\sum_{k=1}^{m}\omega(y_{L}(x_{k}),y^{*}_{k})\to\min$) for a loss $\omega$. The loss layer for input $x$ and the true label $y^{*}(x)$ is defined as $y_{L+1}(x)=\mathbb{E}_{y^{*}\sim Y|x}\omega(y_{L}(x),y^{*})$ with $\omega\in[-1,1]$ (the loss is bounded for the proof of Algorithm 1’s running time to work).*

Definition 2.

*(Weight failure)

Network $(L,W,B,\varphi)$ with weight failures $U$ of distribution $U\sim D|(x,W)$ is the network $(L,W+U,B,\varphi)$ for $U\sim D|(x,W)$ . We denote a (random) output of this network as $y^{W+U}(x)=\hat{y}_{L}(x)$ with activations $\hat{y}_{l}$ and pre-activations $\hat{z}_{l}$ , as in D1.*

Definition 3.

*(Bernoulli neuron failures)

Bernoulli neuron crash distribution is the distribution with i.i.d. $\xi_{l}^{i}\sim{\rm{Be}}(p_{l})$ , $U_{l}^{ij}=-\xi_{l}^{i}\cdot W_{l}^{ij}$ . For each possible crashing neuron $i$ at layer $l$ we define $U_{l}^{i}=\sum_{j}|U_{l}^{ij}|$ and $W_{l}^{i}=\sum_{j}|W_{l}^{ij}|$ , the crashed incoming weights and total incoming weights. We note that we see neuron failure as a sub-type of weight failure.*

This definition means that neurons crash independently, and they start to output $0$ when they do. We use this model because it mimics essential properties of NH [41]. Components fail relatively independently, as we model faults as random [41]. In terms of [41], we consider stuck-at-0 crashes, and passive fault tolerance in terms of reliability.
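The crash model of D3 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the toy weights, the sigmoid activation and the helper name `forward` are assumptions, and a crash is modeled as a stuck-at-0 activation, following the stuck-at-0 reading in the text.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(weights, biases, x, crash_probs=None, rng=None):
    """Forward pass of a fully connected network in the notation of D1.

    weights[l][i][j] plays the role of W_{l+1}^{ij}; the last layer is linear.
    crash_probs[l] is the crash probability of the hidden neurons produced by
    weights[l]; crash_probs=None means a fault-free pass.
    """
    y = list(x)
    last = len(weights) - 1
    for l, (W, b) in enumerate(zip(weights, biases)):
        z = [sum(w * v for w, v in zip(row, y)) + b_i for row, b_i in zip(W, b)]
        if l == last:
            return z
        y = [sigmoid(v) for v in z]
        if crash_probs is not None:
            p = crash_probs[l]
            # xi ~ Be(p), drawn independently per neuron: a crashed neuron outputs 0.
            y = [0.0 if rng.random() < p else v for v in y]
    return y


# Hypothetical toy network: 2 inputs, 3 sigmoid neurons, 1 linear output.
weights = [[[0.5, -0.2], [0.1, 0.3], [0.7, 0.7]], [[1.0, -1.0, 0.5]]]
biases = [[0.0, 0.1, -0.1], [0.2]]
x = [1.0, 2.0]

clean = forward(weights, biases, x)
faulty = forward(weights, biases, x, crash_probs=[0.05], rng=random.Random(0))
error = [f - c for f, c in zip(faulty, clean)]  # one draw of the output error
print(clean, faulty, error)
```

Repeating the faulty pass many times gives samples of the output error, from which its moments can be estimated empirically.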

Definition 4.

*(Output error for a weight distribution)

The error in case of weight failure with distribution $D|(x,W)$ is $\Delta_{l}(x)=y^{W+U}_{l}(x)-y^{W}_{l}(x)$ for layers $l\in[L+1]$ *

We extend the definition of $\varepsilon$ -fault tolerance from [23] to the probabilistic case:

Definition 5.

*(Probabilistic fault tolerance)

A network $(L,W,B,\varphi)$ is said to be $(\varepsilon,\delta)$ -fault tolerant over an input distribution $(x,y^{*})\sim X\times Y$ and a crash distribution $U\sim D|(x,W)$ if $\mathbb{P}_{(x,y^{*})\sim X\times Y,\,U\sim D|(x,W)}\{\Delta_{L+1}(x)\geq\varepsilon\}\leq\delta$ . For such network, we write $(W,B)\in\operatorname{FT}(L,\varphi,p,\varepsilon,\delta)$ .*

Interpretation. To evaluate the fault tolerance of a network, we compute the first moments of $\Delta_{L+1}$. Next, we use tail bounds to guarantee $(\varepsilon,\delta)$-FT. This definition means that, with high probability $1-\delta$, the additional loss due to faults does not exceed $\varepsilon$. Expectation over the crashes $U\sim D|x$ can be interpreted in two ways. First, for a large number of neural networks, each having permanent crashes, $\mathbb{E}\Delta$ is the expectation over all instances of a network implemented in the hardware multiple times. Second, for a single network with intermittent crashes, $\mathbb{E}\Delta$ is the expected output error of this one network over repetitions. The recent review study [41] identifies three types of faults: permanent, transient, and intermittent. Our Definition 2 thus covers all these cases.
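As a hedged illustration of this interpretation, the probability in D5 can be estimated by plain Monte Carlo sampling. The function names and the toy loss model below are assumptions made for the sketch; in practice `sample_delta` would run a faulty and a fault-free forward pass and return the difference of the losses.

```python
import random
import statistics


def empirical_fault_tolerance(sample_delta, eps, trials, rng):
    """Estimate E[Delta], Var[Delta] and P{Delta >= eps} from random draws."""
    deltas = [sample_delta(rng) for _ in range(trials)]
    mean = statistics.fmean(deltas)
    var = statistics.pvariance(deltas, mu=mean)
    delta_hat = sum(1 for d in deltas if d >= eps) / trials
    return mean, var, delta_hat


def toy_loss_increase(rng, n=100, p=0.01):
    # Hypothetical stand-in for Delta_{L+1}: the fraction of n neurons that crash.
    return sum(1 for _ in range(n) if rng.random() < p) / n


mean, var, delta_hat = empirical_fault_tolerance(
    toy_loss_increase, eps=0.05, trials=10_000, rng=random.Random(0)
)
print(mean, var, delta_hat)
```

The network is empirically $(\varepsilon,\delta)$-fault tolerant when `delta_hat` is at most the target $\delta$; such an estimate is a sanity check, not a certificate.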

Now that we have a definition of fault tolerance, we show in the next section that the task of certifying or even computing it is hard.

3 The Hardness of Fault Tolerance

In this section, we show why fault tolerance is a hard problem. Not only is it NP-hard in the most general setting but, even for small perturbations, the error in the output of the network can be unacceptable.

3.1 NP-Hardness

A precise assessment of an NN’s fault tolerance should ideally diagnose a network by looking at the outcome of every possible failure, i.e. at the Forward Propagated Error [23] resulting from removing every possible subset of neurons. This would lead to an exact assessment, but would be impractical in the face of an exponential explosion of possibilities, as shown by Proposition 1 (proof in the supplementary material).

Proposition 1.

The task of evaluating $\mathbb{E}\Delta^{k}$ for any $k=1,2,...$ with constant additive or multiplicative error for a neural network with $\varphi\in C^{\infty}$ , Bernoulli neuron crashes and a constant number of layers is NP-hard.

We provide a theoretical alternative for the practical case of neuromorphic hardware. We overcome NP-hardness in Section 4 by providing an approximation dependent on the network, and not a constant factor one: for weights $W$ we give $\overline{\Delta}$ and $\underline{\Delta}$ dependent on $W$ such that $\underline{\Delta}(W)\leq\mathbb{E}\Delta\leq\overline{\Delta}(W)$ . In addition, we only consider some subclass of all networks.

3.2 Pessimistic Spectral Bounds

By Definition 4, the fault tolerance assessment requires considering a weight perturbation $W+U$ given current weights $W$ and the loss change $y_{L+1}(W+U)-y_{L+1}(W)$ caused by it. Mathematically, this means calculating a local Lipschitz coefficient $K$ [43] such that $|y_{L+1}(W+U)-y_{L+1}(W)|\leq K|U|$. In the literature, there are known spectral bounds on the Lipschitz coefficient for the case of input perturbations. These bounds use the spectral norm of the matrix $\|\cdot\|_{2}$ and give a global result, valid for any input. This estimate is loose due to its exponential growth in the number of layers, as $\|W\|_{2}$ is rarely $<1$. See Proposition 2 for the statement:

Proposition 2 ( $K$ using spectral properties).

$\|y_{L}(x_{2})-y_{L}(x_{1})\|_{2}\leqslant\|x_{2}-x_{1}\|_{2}\cdot\prod_{l=1}^{L}\|W_{l}\|_{2}$

The proof can be found in [29] or in the supplementary material. It is also known that high perturbations under small input changes are attainable. Adversarial examples [5] are small changes to the input resulting in a high change in the output. This bound is equal to the one of [23], which is tight in the case where the network has the fewest neurons. In contrast, in Section 4, we derive our bound in the limit $n\to\infty$.
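To make the exponential growth of the spectral bound concrete, the product of layer norms from Proposition 2 can be computed directly. The sketch below is a pure-Python illustration under stated assumptions: `spectral_norm` uses power iteration from a random start (it can underestimate if the start vector happens to be orthogonal to the top singular direction), and the diagonal toy layers are hypothetical.

```python
import math
import random


def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def transpose(W):
    return [list(col) for col in zip(*W)]


def spectral_norm(W, iters=500, seed=0):
    """Estimate ||W||_2 by power iteration on W^T W."""
    rng = random.Random(seed)
    Wt = transpose(W)
    v = [rng.gauss(0.0, 1.0) for _ in range(len(W[0]))]
    for _ in range(iters):
        v = matvec(Wt, matvec(W, v))
        norm = math.sqrt(sum(x * x for x in v))
        if norm == 0.0:
            return 0.0
        v = [x / norm for x in v]
    # v is now (approximately) the top right singular vector, so ||W v|| = ||W||_2.
    return math.sqrt(sum(x * x for x in matvec(W, v)))


def spectral_lipschitz_bound(weights):
    """Product of the layer spectral norms, as in Proposition 2."""
    bound = 1.0
    for W in weights:
        bound *= spectral_norm(W)
    return bound


# Hypothetical stack of identical layers, each with spectral norm 2.
layer = [[2.0, 0.0], [0.0, 1.0]]
for depth in (1, 5, 10):
    print(depth, spectral_lipschitz_bound([layer] * depth))
```

With every layer norm equal to 2, the bound doubles with each added layer, which is the depth-exponential looseness discussed above.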

We have now shown that even evaluating fault tolerance of a given network can be a hard problem. In order to make the analysis practical, we use additional assumptions based on the properties of neuromorphic hardware.

4 Realistic Simplifying Assumptions for Neuromorphic Hardware

In this section, we introduce realistic simplifying assumptions grounded in neuromorphic hardware characteristics. We first show that if faults are not too frequent, the weight perturbation would be small. Inspired by this, we then apply a Taylor expansion to the study of the most probable case. (The inspiration for splitting the loss calculation into favorable and unfavorable cases comes from [27].)

Assumption 1.

The probability of failure $p=\max\{p_{l}\,|\,l\in[L]\}$ is small: $p\lesssim 10^{-4}\ldots 10^{-3}$

This assumption is based on the properties of neuromorphic hardware [35]. Next, we use the internal structure of neural networks.

Assumption 2.

The number of neurons at each layer $n_{l}$ is sufficiently big, $n_{l}\gtrsim 10^{2}$

This assumption comes from the properties of state-of-the-art networks [1].

The best and the worst fault tolerance. Consider a 1-layer NN with $n=n_{0}$ and $n_{L}=n_{1}=1$ at input $x_{i}=1$: $y(x)=\sum x_{i}/n$. We must divide by $n$ to preserve $y(x)$ as $n$ grows. This is the most robust network, as all neurons are interchangeable. Here $\mathbb{E}\Delta=-p$ and $\mathrm{Var}\Delta\approx p/n$, so the variance decays with $n$. In contrast, the worst case $y(x)=x_{1}$ has all but one neuron unused. Therefore $\mathbb{E}\Delta=-p$ and $\mathrm{Var}\Delta\approx p$, so the variance does not decay with $n$.
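The two extreme cases can be checked exactly with a short computation. This is a sketch under the sign convention of D4 (faulty output minus fault-free output) and with crashed inputs replaced by 0; the function names are illustrative. The exact variances carry a factor $(1-p)$, which is negligible for small $p$.

```python
import math


def averaging_network_moments(n, p):
    """Exact mean and variance of Delta for y(x) = sum(x_i) / n at x_i = 1.

    If k inputs crash (k ~ Binomial(n, p)), the output changes by -k / n.
    """
    probs = [math.comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]
    mean = sum(pr * (-k / n) for k, pr in enumerate(probs))
    var = sum(pr * (-k / n - mean) ** 2 for k, pr in enumerate(probs))
    return mean, var


def single_neuron_moments(p):
    """Exact mean and variance of Delta for y(x) = x_1 at x_1 = 1."""
    return -p, p * (1 - p)


p = 0.01
for n in (10, 100, 1000):
    print(n, averaging_network_moments(n, p), single_neuron_moments(p))
```

The variance of the averaging network shrinks like $1/n$, while that of the single-neuron network is independent of $n$.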

The next proposition shows that under a mild additional regularity assumption on the network, Assumptions 1 and 2 are sufficient to show that the perturbation of the norm of the weights is small.

Proposition 3.

Under A1,2 and if $\{W_{l}^{i}\}_{i=1}^{n_{l}}$ are $q$-balanced, for $\alpha>p$, the norm of the weight perturbation $U_{l}^{i}$ at layer $l$ is probabilistically bounded as $\delta_{0}=\mathbb{P}\{\|U_{l}^{\cdot}\|_{1}\geq\alpha\|W_{l}^{\cdot}\|_{1}\}\leq\exp\left(-n_{l}\cdot q\cdot d_{KL}(\alpha\|p_{l})\right)$, with the KL-divergence between numbers $a,b\in(0,1)$ defined as $d_{KL}(a\|b)=a\log(a/b)+(1-a)\log\left((1-a)/(1-b)\right)$, and $W_{l}^{i}$ from D3.
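The bound of Proposition 3 is easy to evaluate numerically. The sketch below simply transcribes the formula; the function names and the example values of $n_{l}$, $q$, $\alpha$ and $p_{l}$ are assumptions chosen for illustration.

```python
import math


def kl_bernoulli(a, b):
    """KL divergence between Bernoulli(a) and Bernoulli(b), for a, b in (0, 1)."""
    return a * math.log(a / b) + (1 - a) * math.log((1 - a) / (1 - b))


def weight_perturbation_tail(n_l, q, alpha, p_l):
    """Upper bound exp(-n_l * q * d_KL(alpha || p_l)) from Proposition 3."""
    if not 0 < p_l < alpha < 1:
        raise ValueError("the bound requires 0 < p_l < alpha < 1")
    return math.exp(-n_l * q * kl_bernoulli(alpha, p_l))


# Hypothetical layer: 1000 neurons, 0.5-balanced weights, crash probability 1e-3.
for n_l in (100, 1000, 10000):
    print(n_l, weight_perturbation_tail(n_l, q=0.5, alpha=0.01, p_l=0.001))
```

The bound decays exponentially in the width $n_{l}$, so for wide layers a large relative weight perturbation is very unlikely.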

Inspired by this result, next, we compute the error $\Delta$ given a small weight perturbation $U$ using a Taylor expansion. In order to certify fault tolerance, we need precise bounds on the remainder of the Taylor approximation. For example, for ReLU functions, the Taylor approximation fails. The supplementary material contains another counter-example to the Taylor expansion of an NN. Instead, we give sufficient conditions under which the Taylor approximation indeed holds.

Assumption 3.

As the width $n$ increases, networks $NN_{n}$ have a continuous limit [39] $NN_{n}\to NN_{c}$, where $NN_{c}$ is a continuous neural network [19], and $n=\min\{n_{l}\}$. That network $NN_{c}$ has globally bounded operator derivatives $D_{k}$ for orders $k=1,2$. We define $D_{12}=\max\{D_{1},D_{2}\}$. (A necessary condition for $D_{k}$ to be bounded is to have a reasonable bound on the derivatives of the ground truth function $y^{*}(x)$. We assume that this function is sufficiently smooth.)

See Figure 1 for a visualization of A3 and Table 1 for the description of A3. The assumption means that with the increase of $n$ , the network uses the same internal structure which just becomes more fine-grained. The continuous limit holds in the case of explicit duplication, convolutional networks and corresponding explicit regularization. The supplementary material contains a more complete explanation.
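The discrete-to-continuous correspondence of Table 1 can be illustrated numerically. In the sketch below, the kernel $W(t,t^{\prime})=t\,t^{\prime}$ and the activation profile $y(t^{\prime})=t^{\prime}$ are hypothetical choices, and the discrete weights are assumed to be samples of the kernel scaled by $1/n$, so that the sum is a Riemann sum of the integral $\int_{0}^{1}W(t,t^{\prime})y(t^{\prime})\,dt^{\prime}=t/3$.

```python
def discrete_preactivation(t, n, kernel, activation):
    """(1/n) * sum_j W(t, t_j) * y(t_j) on the midpoint grid t_j = (j + 0.5) / n."""
    grid = [(j + 0.5) / n for j in range(n)]
    return sum(kernel(t, s) * activation(s) for s in grid) / n


def kernel(t, s):
    return t * s  # hypothetical smooth weight kernel W(t, t')


def activation(s):
    return s  # hypothetical smooth activation profile y_{l-1}(t')


for n in (10, 100, 1000):
    print(n, discrete_preactivation(0.6, n, kernel, activation))
```

As $n$ grows, the pre-activation at $t=0.6$ approaches the continuous value $0.6/3=0.2$: widening the layer refines the same structure rather than changing the computed function.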

The derivative bound for order $2$ is in contrast to the worst-case spectral bound, which would be exponential in depth as in Proposition 2. This is consistent with experimental studies [11] and can be connected to generalization properties via minima sharpness [17].

Proposition 4.

Under A3, derivatives are equal to the operator derivatives of the continuous limit:

$$\frac{\partial^{k}y_{L}}{\partial y_{l}^{i^{1}}\cdots\partial y_{l}^{i^{k}}}=\frac{1}{n_{l}^{k}}\frac{\delta^{k}y_{L}}{\delta y_{l}(i^{1})\cdots\delta y_{l}(i^{k})}+o(1),\quad n_{l}\to\infty$$

For example, consider $y(x)=1/n_{1}\sum_{i_{1}=1}^{n_{1}}\varphi\left(\sum_{i_{0}=1}^{n_{0}}x^{i_{0}}/n_{0}\right)$ at $x_{i}\equiv 1$. Factors $1/n_{0}$ and $1/n_{1}$ appear because the network must represent the same $y^{*}$ as $n_{0},n_{1}\to\infty$. Then, $\partial y/\partial x_{i}=\varphi^{\prime}(1)/n_{0}$ and $\partial^{2}y/\partial x_{i}\partial x_{j}=\varphi^{\prime\prime}(1)/n_{0}^{2}$. (The proposition is illustrated in proof-of-concept experiments with explicit regularization in the supplementary material. There are networks for which the conclusion of P4 would not hold, for example, a network with $w^{ij}=1$. However, such a network does not approximate the same function as $n$ increases since $y(x)\to\infty$, violating A3.)

Theorem 1.

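For crashes at layer $l$ and output error $\Delta_{L}$ at layer $L$ under A1-3 with $q=1/n_{l}$ and $r=p+q$, the mean and variance of the error can be approximated as

$$\mathbb{E}\Delta_{L}=p_{l}\sum_{i=1}^{n_{l}}\frac{\partial y_{L}}{\partial\xi^{i}}\Big|_{\xi=0}+\Theta_{\pm}(1)D_{2}r^{2},\quad\mathrm{Var}\Delta_{L}=p_{l}\sum_{i=1}^{n_{l}}\left(\frac{\partial y_{L}}{\partial\xi^{i}}\Big|_{\xi=0}\right)^{2}+\Theta_{\pm}(1)D_{12}r^{3}$$

By $\Theta_{\pm}(1)$ we denote any function taking values in $[-1,1]$. (The derivative $\partial y_{L}/\partial\xi^{i}(\xi)\equiv-\partial y_{L}(y_{l}-\xi\odot y_{l})/\partial y_{l}^{i}\cdot y_{l}^{i}$ is interpreted as if $\xi^{i}$ were a real variable.)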

The full proof of the theorem is in the supplementary material. The remainder terms are small as both $p$ and $q={1}/{n_{l}}$ are small quantities under A1-2. In addition, P4 implies $\partial y_{L}/\partial\xi^{i}\sim 1/n_{l}$ and thus, when $n_{l}\to\infty$, $\mathbb{E}\Delta=\mathcal{O}(1)$ remains constant, and $\mathrm{Var}\Delta_{L}=\mathcal{O}(1/n_{l})$. This is the standard rate obtained when estimating the mean of a random variable by averaging over $n_{l}$ independent samples, and our previous example at the beginning of this section shows that it is the best possible rate. Our result shows sufficient conditions under which neural networks allow for such a simplification. However, the dependency $\mathrm{Var}\Delta\sim 1/n_{l}$ is only valid if $n<p^{-2}\sim 10^{8}$, which guarantees that the first-order term dominates, $p/n>r^{3}$. If this is not true, we can still render the network more robust by aggregating multiple copies with a mean, instead of adding more neurons. Our current guarantees thus work when $p^{2}\leq n^{-1}\leq p$. In the supplementary material, we show that a tighter remainder, depending only on $p/n$ and hence decreasing with $n$, is possible. However, it complicates the equation as it requires $D_{3}$. In the next sections we use the obtained theoretical evaluation to develop a regularizer increasing fault tolerance, and to determine which architectures are more fault-tolerant.
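The leading terms of Theorem 1 can be compared with brute-force enumeration on a tiny example. This is a sketch and not the authors' code: `tail` stands for the map from layer-$l$ activations to the scalar output, derivatives are taken by central finite differences, crashed activations are set to 0, and the weights and activations are hypothetical. A linear read-out is used so that the exact values are known in closed form.

```python
import itertools


def first_order_moments(tail, y_l, p_l, h=1e-5):
    """Leading terms of T1: p * sum(d_i) and p * sum(d_i^2), d_i = dy_L/dxi^i at xi=0."""
    grads = []
    for i in range(len(y_l)):
        up, down = list(y_l), list(y_l)
        up[i] += h
        down[i] -= h
        grads.append((tail(up) - tail(down)) / (2 * h))
    # dy_L/dxi^i at xi = 0 equals -(dy_L/dy_l^i) * y_l^i.
    d_xi = [-g * y for g, y in zip(grads, y_l)]
    return p_l * sum(d_xi), p_l * sum(d * d for d in d_xi)


def exact_moments(tail, y_l, p_l):
    """Exact mean and variance of Delta by enumerating all 2^n crash patterns."""
    base = tail(y_l)
    mean = second = 0.0
    for xi in itertools.product((0, 1), repeat=len(y_l)):
        prob = 1.0
        for bit in xi:
            prob *= p_l if bit else 1.0 - p_l
        delta = tail([0.0 if bit else y for bit, y in zip(xi, y_l)]) - base
        mean += prob * delta
        second += prob * delta * delta
    return mean, second - mean * mean


# Hypothetical layer of 8 neurons followed by a linear read-out.
w = [0.3, -0.1, 0.2, 0.25, -0.05, 0.1, 0.15, 0.05]
y_l = [0.9, 0.4, 0.7, 0.2, 0.8, 0.5, 0.6, 0.3]


def tail(y):
    return sum(wi * yi for wi, yi in zip(w, y))


print(first_order_moments(tail, y_l, 0.01))
print(exact_moments(tail, y_l, 0.01))
```

The first-order estimate needs a number of evaluations linear in the width, whereas the enumeration needs $2^{n}$ of them; for the linear read-out the means coincide and the variances differ only by the factor $(1-p)$.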

5 Probabilistic Guarantees on Fault Tolerance Using Tail Bounds

In this section, we apply the results from the previous sections to obtain a probabilistic guarantee on fault tolerance. We identify which kinds of architectures are more fault-tolerant.

Under the assumptions of previous sections, the variance of the error decays as $\mathrm{Var}\Delta\sim\sum C_{l}p_{l}/n_{l}$ as the error superposition is linear (see supplementary material for a proof), with $C_{l}$ not dependent on $n_{l}$ . Given a fixed budget of neurons, the most fault-tolerant NN has its layers balanced: one layer with too few neurons becomes a single point of failure. Specifically, an optimal architecture with a fixed sum $N=\sum n_{l}$ has $n_{l}\sim\sqrt{p_{l}C_{l}}$
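The balanced allocation follows from minimizing $\sum_{l}C_{l}p_{l}/n_{l}$ under the constraint $\sum_{l}n_{l}=N$: setting the derivative of the Lagrangian to zero gives $C_{l}p_{l}/n_{l}^{2}=\mathrm{const}$, i.e. $n_{l}\propto\sqrt{p_{l}C_{l}}$. A minimal sketch, with illustrative function names, hypothetical constants $C_{l}$ and real-valued (unrounded) widths:

```python
import math


def balanced_widths(C, p, N):
    """Widths n_l proportional to sqrt(p_l * C_l) that sum to the budget N."""
    scores = [math.sqrt(c * q) for c, q in zip(C, p)]
    total = sum(scores)
    return [N * s / total for s in scores]


def variance_proxy(C, p, n):
    """The quantity sum_l C_l * p_l / n_l that the allocation minimizes."""
    return sum(c * q / w for c, q, w in zip(C, p, n))


# Hypothetical three-layer network with a budget of 600 neurons.
C = [4.0, 1.0, 9.0]
p = [0.01, 0.01, 0.01]
widths = balanced_widths(C, p, N=600)
print(widths, variance_proxy(C, p, widths))
print(variance_proxy(C, p, [200.0, 200.0, 200.0]))
```

In this toy setting the balanced allocation gives a smaller variance proxy than splitting the budget uniformly across layers.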

Given the previous results, certifying $(\varepsilon,\delta)$ -fault tolerance is trivial via a Chebyshev tail bound (proof in the supplementary material):

Proposition 5.

A neural network under assumptions 1-3 is $(\varepsilon,\delta)$ -fault tolerant for $t=\varepsilon-\mathbb{E}\Delta_{L}>0$ with $\delta=t^{-2}\mathrm{Var}\Delta_{L}$ for $\mathbb{E}\Delta$ and $\mathrm{Var}\Delta$ calculated by Theorem 1.

Evaluation of $\mathbb{E}\Delta$ or $\mathrm{Var}\Delta$ using Theorem 1 would take the same amount of time as one forward pass. However, the exact assessment would need $\mathcal{O}(2^{n})$ forward passes by Proposition 1.
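Given estimates of the two moments, the Chebyshev certificate of Proposition 5 is a one-line computation. The sketch below is illustrative: the function name is an assumption, and capping the result at 1 is an addition made here because a probability bound above 1 is vacuous.

```python
def chebyshev_delta(mean_delta, var_delta, eps):
    """delta = Var / t^2 with t = eps - E[Delta] > 0, as in Proposition 5."""
    t = eps - mean_delta
    if t <= 0:
        raise ValueError("Proposition 5 requires eps > E[Delta]")
    return min(1.0, var_delta / (t * t))


# Hypothetical moments, e.g. obtained from the first-order terms of Theorem 1.
print(chebyshev_delta(mean_delta=0.0, var_delta=0.01, eps=0.5))
```

Since the moments cost about one forward pass to evaluate, the certificate is cheap compared with enumerating crash patterns.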

In order to make the networks more fault tolerant, we now want to solve the problem of loss minimization under fault tolerance rather than ERM (as previously formulated in [41]): $\inf_{(W,B)\in\operatorname{FT}}\mathcal{L}(W,B)$ where $\operatorname{FT}=\operatorname{FT}(L,\varphi,p,\varepsilon,\delta)$ from Definition 5. Regularizing with Equation 1 can be seen as an approximate solution to the problem above (we note that the gradient of $\mathrm{Var}\Delta$ is linear-time computable since it is a Hessian-vector product). Indeed, $\mathrm{Var}\Delta\approx p_{l}\sum_{i}\left(\frac{\partial\mathcal{L}}{\partial y_{l}^{i}}\cdot y_{l}^{i}\right)^{2}$ (from T1) is connected to the target probability (P5). Moreover, the network is required to be continuous by A3, which is achieved by making nearby neurons’ weights close using a smoothing regularizing function $\mbox{smooth}(W)\approx\int|W^{\prime}_{t}(t,t^{\prime})|dtdt^{\prime}$. The $\mu$ term for $q$-balancedness comes from P3 as it is a necessary condition for A3. See the supplementary material for complete details. Here $\hat{\mathcal{L}}$ is the regularized loss, $\mathcal{L}$ the original one, and $\lambda,\,\mu,\,\nu,\,\psi$ are the parameters:

$$\hat{\mathcal{L}}(W)=\mathcal{L}(W)+\lambda\sum_{i=1}^{n_{l}}\left(\frac{\partial\mathcal{L}}{\partial y_{l}^{i}}\cdot y_{l}^{i}\right)^{2}+\mu\left(\frac{\max_{i}W_{l}^{i}}{\min_{i}W_{l}^{i}}\right)^{2}+\psi\cdot\mbox{smooth}(W_{l})+\nu\|W\|_{\infty}\qquad(1)$$

We define the terms corresponding to $\lambda,\mu,\psi$ as $R_{1}\approx\mathrm{Var}\Delta/p_{l}$ , $R_{2}=q^{2}$ , $R_{3}=\mbox{smooth}(W_{l})$ . If we have achieved $\delta<1/3$ by P5, we can apply the well-known median trick technique [31], drastically increasing fault tolerance. We only use $R$ repetitions of the network with component-wise median aggregation to obtain $(\varepsilon,\delta\cdot\exp(-R))$ -fault tolerance guarantee. See supplementary material for the calculations.
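The median trick can be sketched as follows. The code takes the stated rate $\delta\cdot\exp(-R)$ at face value to size the number of copies; the function names are assumptions, and the toy outputs are hypothetical.

```python
import math
import statistics


def repetitions_needed(delta, delta_target):
    """Smallest R with delta * exp(-R) <= delta_target, given delta < 1/3."""
    if not 0 < delta < 1 / 3:
        raise ValueError("the median trick is applied once delta < 1/3")
    if delta_target >= delta:
        return 1
    return math.ceil(math.log(delta / delta_target))


def median_aggregate(outputs):
    """Component-wise median over the outputs of R independent copies."""
    return [statistics.median(column) for column in zip(*outputs)]


# Three hypothetical copies of a 2-output network; the third one is badly off.
print(median_aggregate([[1.0, 10.0], [2.0, 20.0], [100.0, -5.0]]))
print(repetitions_needed(delta=0.3, delta_target=1e-6))
```

The number of copies grows only logarithmically in $1/\delta$, and a single faulty copy cannot move the median far, unlike the mean.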

In addition, we show that after training, when $\mathbb{E}_{x}\nabla_{W}y_{L+1}(x)=0$, then $\mathbb{E}_{x}\mathbb{E}_{\xi}\Delta_{L+1}=0+\mathcal{O}(r^{2})$ (proof in the supplementary material). This result sheds some light on why neural networks are inherently fault-tolerant, in the sense that the mean $\Delta_{L+1}$ is zero up to $\mathcal{O}(r^{2})$ terms. Convolutional networks of architecture Conv-Activation-Pool can be seen as a sub-type of fully connected ones, as they just have locally-connected matrices $W_{l}$, and therefore our techniques still apply. Using large kernel sizes (see supplementary material for discussion), smooth pooling and activations lead to a better approximation.

We developed techniques to assess fault tolerance and to improve it. Now we combine all the results into a single algorithm to certify fault tolerance.

6 An Algorithm for Certifying Fault Tolerance

We are now in a position to provide an algorithm (Algorithm 1) that reaches the desired $(\varepsilon,\delta)$-fault tolerance via training with our regularizer and then physically duplicating the network a logarithmic number of times in hardware, assuming independent faults. We note that our algorithm works for a single input $x$ but is easily extensible if the expressions in the Propositions are replaced with expectations over inputs (see supplementary material).

To estimate the required number of neurons, we use the bounds from T1 and P5, which require $n\sim p/\varepsilon^{2}$ . However, using the median approach allows for a fast exponential decrease in failure probability. Once the threshold of failing with probability $1/3$ is reached by P5, it becomes easy to reach any required guarantee. The time complexity of the algorithm (relative to that of training) is $\mathcal{O}(D_{12}+C_{l}p_{l}/\varepsilon^{2})$ and its space complexity is equal to that of one training call. See supplementary material for the proofs of resource requirements and correctness.
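To illustrate why the median approach makes the failure probability drop so quickly, here is a minimal sketch that computes how many independent copies are needed for a scalar output. It is not the paper's calculation (that is in the supplementary material): it assumes each copy independently exceeds the error $\varepsilon$ with probability at most $q=1/3$, and it uses the fact that the median of an odd number $R$ of outputs can be off by more than $\varepsilon$ only if at least $(R+1)/2$ copies are. The function names are hypothetical.

```python
from math import comb

def majority_failure_prob(R, q):
    """P(at least (R+1)//2 of R independent copies fail), each failing w.p. q."""
    k_min = (R + 1) // 2
    return sum(comb(R, k) * q**k * (1 - q) ** (R - k) for k in range(k_min, R + 1))

def repetitions_needed(delta, q=1 / 3):
    """Smallest odd R whose majority-failure probability is at most delta."""
    if not 0 < q < 0.5:
        raise ValueError("the median trick needs a per-copy failure rate below 1/2")
    if delta <= 0:
        raise ValueError("delta must be positive")
    R = 1
    while majority_failure_prob(R, q) > delta:
        R += 2   # keep R odd so the median is a single copy's output
    return R

R = repetitions_needed(1e-5)
print(R, majority_failure_prob(R, 1 / 3))
```

For a vector-valued output with component-wise medians, a union bound over the components would be needed on top of this scalar calculation.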

7 Experimental Evaluation

In this section, we test the theory developed in previous sections in proof-of-concept experiments. We first show that we can correctly estimate the first moments of the fault tolerance using T1 for small (10-50 neurons) and larger networks (VGG). We test the predictions of our theory such as decay of $\mathrm{Var}\Delta$ , the effect of our regularizer and the guarantee from Algorithm 1. See the supplementary material for the technical details where we validate the assumption of derivative decay (A3) explicitly. Our code is provided at the address github.com/LPD-EPFL/ProbabilisticFaultToleranceNNs.

Increasing training dropout. We train sigmoid networks with $N\sim 100$ on MNIST (see ComparisonIncreasingDropoutMNIST.ipynb). We use failure probabilities at the inference and training stages of $p_{i}=0.05$ at the first layer and $10$ values of $p_{t}\in[0,1.2p_{i}]$ . The experiment is repeated $10$ times. When estimating the error experimentally, we use $6$ repetitions of the training dataset to ensure that the variance of the estimate is low. The results are in Table 12. The experiments show that the crashing MAE (Mean Absolute Error of the network with crashes at inference) is the quantity most dramatically affected by dropout. Specifically, training with $p_{t}\sim p_{i}$ makes the network more robust at inference, which was already well established. Moreover, the bound from T1 correctly orders the networks by the dropout parameter they were trained with, with only $4\%$ rank loss, that is, the fraction of incorrectly ordered pairs. None of the other quantities, including norms of the weights, is able to order the networks correctly. See supplementary material for a complete list of metrics in the experiment.
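The rank loss used above, the fraction of incorrectly ordered pairs, can be sketched as follows. The paper only gives the verbal definition; the handling of ties (pairs tied in the reference are skipped, pairs tied in the prediction count as errors) is an assumption of this sketch, and the numbers in the usage example are made up for illustration.

```python
from itertools import combinations

def rank_loss(predicted, actual):
    """Fraction of pairs that `predicted` orders differently from `actual`.

    Pairs tied in `actual` are skipped; pairs tied in `predicted`
    count as incorrectly ordered (both choices are assumptions).
    """
    pairs = [(i, j) for i, j in combinations(range(len(actual)), 2)
             if actual[i] != actual[j]]
    if not pairs:
        return 0.0
    wrong = sum(
        (predicted[i] - predicted[j]) * (actual[i] - actual[j]) <= 0
        for i, j in pairs
    )
    return wrong / len(pairs)

# Hypothetical values: a bound evaluated on four networks vs. their measured error.
bound_values = [0.9, 0.7, 0.75, 0.4]
measured_error = [0.30, 0.22, 0.20, 0.10]
print(rank_loss(bound_values, measured_error))
```

In this made-up example only the pair formed by the second and third networks is ordered incorrectly, so one of the six pairs counts as an error.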

Regularization for fault tolerance. The previous experiment shows that our bound correctly predicts which network is more resilient. We therefore use it as a regularizer, as suggested by Eq. 1 (see Regularization.ipynb). We establish that the resilience of a network regularized with Dropout is similar to that of a network regularized with the bound.

Testing the bound on larger networks. We test the bound on VGG16 and on a smaller convnet (see ConvNetTest-MNIST.ipynb and ConvNetTest-VGG16.ipynb) and verify that it correctly predicts the magnitude of the error.

Architecture and fault tolerance. Comparing different architectures (VGG16, VGG19, MobileNet) on a single image with $p=0.01$ (see ConvNetTest-ft.ipynb) shows that the larger the mean layer width (approximated by the number of parameters), the better the fault tolerance, as predicted in Section 5. In addition, training networks on the MNIST dataset (see FaultTolerance-Continuity-FC-MNIST.ipynb) shows a decrease in variance with $n_{l}$ as predicted by Theorem 1, see Figure 2: the variance decays as $1/n_{l}$ . We regularize with $\psi=(10^{-4},10^{-2})$ for derivatives and smoothing respectively (see supplementary material for an explanation of the coefficients) and $\lambda=0.001$ .
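The $1/n_{l}$ decay can be reproduced in closed form on a toy continuous-limit layer. This is a minimal sketch under stated assumptions, not the experiment in the notebook: neurons sit at positions $t\in[0,1]$, their activations and outgoing weights are samples of smooth profiles chosen arbitrarily here, the weights are scaled by $1/n$, the readout is linear, and crashes are independent Bernoulli events that zero a neuron's output.

```python
import numpy as np

def crash_variance(n, p):
    """Exact Var(Delta) of a linear readout over n neurons that each
    crash (output 0) independently with probability p."""
    t = (np.arange(n) + 0.5) / n            # neuron positions in [0, 1]
    y = 1.0 / (1.0 + np.exp(-3.0 * t))      # assumed smooth activation profile
    w = np.cos(2.0 * np.pi * t) / n         # assumed smooth weights, scaled by 1/n
    return p * (1 - p) * np.sum((w * y) ** 2)

for n in (100, 200, 400, 800):
    v = crash_variance(n, p=0.01)
    print(f"n={n:4d}  Var={v:.3e}  n*Var={n * v:.3e}")
```

Under these assumptions $n\cdot\mathrm{Var}\Delta$ approaches the constant $p(1-p)\int (w(t)\,y(t))^{2}dt$, so doubling the width halves the variance.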

Testing the algorithm. We test Algorithm 1 on the MNIST dataset for $\varepsilon=9\cdot 10^{-3}$ , $\delta=10^{-5}$ and obtain $R=20$ , $n_{1}=500$ , $\lambda=10^{-6}$ , $\mu=10^{-10}$ , $\psi=(10^{-4},10^{-2})$ . We evaluate the tail bound experimentally. Our experiment demonstrates the guarantee given by Proposition 5 and can be seen as an experimental confirmation of the algorithm’s correctness. See TheAlgorithm.ipynb.

We hence conclude that our proof-of-concept experiments show an overall validity of our assumptions and of our approach.

8 Conclusion

Fault tolerance is an important but overlooked concrete AI safety issue [2]. This paper describes a probabilistic fault tolerance framework for NNs that makes it possible to get around the NP-hardness of the problem. Since the crash probability in neuromorphic hardware is low, we can simplify the problem to allow for a polynomial computation time. We use tail bounds to motivate the assumption that the weight perturbation is small. This allows us to use a Taylor expansion to compute the error. To bound the remainder, we require sufficient smoothness of the network, for which we use the continuous limit: nearby neurons compute similar things. We then transform the expansion into a tail bound on the loss of the network, which gives a probabilistic guarantee of fault tolerance. Using the framework, we are able to guarantee sufficient fault tolerance of a neural network given the parameters of the crash distribution. We then analyze the obtained expressions to compare fault tolerance between architectures and to optimize the fault tolerance of a single architecture. We test our findings experimentally on small networks (MNIST) as well as on larger ones (VGG-16, MobileNet). Using our framework, one is able to deploy safer networks on neuromorphic hardware.

Mathematically, the problem that we consider is connected to the problem of generalization [29, 27] since the latter also considers the expected loss change under a small random perturbation $\mathbb{E}_{W+U}\mathcal{L}(W+U)-\mathcal{L}(W)$ , except that these papers consider Gaussian noise and we consider Bernoulli noise. Evidence [32], however, shows that sometimes networks that generalize well are not necessarily fault-tolerant. Since the tools we develop for the study of fault tolerance could as well be applied in the context of generalization, they could be used to clarify this matter.

Acknowledgements

We thank Michael Kapralov, Martin Jaggi, Manuel Le Gallo, Arthur Jacot, Clement Hongler for helpful discussions. We thank the anonymous ICML and NeurIPS reviewers for constructive criticism of earlier versions of this work.

Bibliography

[1] D. Amodei and D. Hernandez. AI and compute. Downloaded from https://blog.openai.com/ai-and-compute , 2018.
[2] D. Amodei, C. Olah, J. Steinhardt, P. Christiano, J. Schulman, and D. Mané. Concrete problems in AI safety. arXiv preprint arXiv:1606.06565 , 2016.
[3] A. P. Arechiga and A. J. Michaels. The robustness of modern deep learning architectures against single event upset errors. In 2018 IEEE High Performance extreme Computing Conference (HPEC) , pages 1–6. IEEE, 2018.
[4] Y. Bengio and R. De Mori. Use of neural networks for the recognition of place of articulation. In Acoustics, Speech, and Signal Processing, 1988. ICASSP-88., 1988 International Conference on , pages 103–106. IEEE, 1988.
[5] B. Biggio, I. Corona, D. Maiorca, B. Nelson, N. Šrndić, P. Laskov, G. Giacinto, and F. Roli. Evasion attacks against machine learning at test time. In Joint European conference on machine learning and knowledge discovery in databases , pages 387–402. Springer, 2013.
[6] C.-T. Chiu et al. Robustness of feedforward neural networks. In IEEE International Conference on Neural Networks , pages 783–788. IEEE, 1993.
[7] M. Costanzo, B. Vander Sluis, E. N. Koch, A. Baryshnikova, C. Pons, G. Tan, W. Wang, M. Usaj, J. Hanchard, S. D. Lee, et al. A global genetic interaction network maps a wiring diagram of cellular function. Science , 353(6306):aaf1420, 2016.
[8] E. El Mhamdi, R. Guerraoui, and S. Rouault. The hidden vulnerability of distributed learning in Byzantium. In Proceedings of the 35th International Conference on Machine Learning , volume 80 of Proceedings of Machine Learning Research , pages 3521–3530, Stockholmsmässan, Stockholm Sweden, 10–15 Jul 2018. PMLR.
