Accurate deep neural network inference using computational phase-change memory
Vinay Joshi, Manuel Le Gallo, Simon Haefeli, Irem Boybat, S.R. Nandakumar, Christophe Piveteau, Martino Dazzi, Bipin Rajendran, Abu Sebastian, Evangelos Eleftheriou

TL;DR
This paper presents a training methodology for deep neural networks that ensures minimal accuracy loss when deploying on phase-change memory-based in-memory computing hardware, enabling energy-efficient inference.
Contribution
It introduces a novel training approach and compensation technique for PCM-based in-memory computing, achieving high accuracy retention on CIFAR-10 and ImageNet datasets.
Findings
Achieved 93.7% accuracy on CIFAR-10 after mapping to PCM
Attained 71.6% top-1 accuracy on ImageNet with PCM hardware
Maintained over 93.5% accuracy on CIFAR-10 over one day
Accurate deep neural network inference using computational phase-change memory
Vinay Joshi
IBM Research - Zurich, Säumerstrasse 4, 8803 Rüschlikon, Switzerland
King’s College London, Strand, London WC2R 2LS, United Kingdom
Manuel Le Gallo
IBM Research - Zurich, Säumerstrasse 4, 8803 Rüschlikon, Switzerland
Simon Haefeli
IBM Research - Zurich, Säumerstrasse 4, 8803 Rüschlikon, Switzerland
ETH Zurich, Rämistrasse 101, 8092 Zurich, Switzerland
Irem Boybat
IBM Research - Zurich, Säumerstrasse 4, 8803 Rüschlikon, Switzerland
Ecole Polytechnique Federale de Lausanne (EPFL), 1015 Lausanne, Switzerland
S.R. Nandakumar
IBM Research - Zurich, Säumerstrasse 4, 8803 Rüschlikon, Switzerland
Christophe Piveteau
IBM Research - Zurich, Säumerstrasse 4, 8803 Rüschlikon, Switzerland
ETH Zurich, Rämistrasse 101, 8092 Zurich, Switzerland
Martino Dazzi
IBM Research - Zurich, Säumerstrasse 4, 8803 Rüschlikon, Switzerland
ETH Zurich, Rämistrasse 101, 8092 Zurich, Switzerland
Bipin Rajendran
King’s College London, Strand, London WC2R 2LS, United Kingdom
Abu Sebastian
IBM Research - Zurich, Säumerstrasse 4, 8803 Rüschlikon, Switzerland
Evangelos Eleftheriou
IBM Research - Zurich, Säumerstrasse 4, 8803 Rüschlikon, Switzerland
Abstract
In-memory computing is a promising non-von Neumann approach for making energy-efficient deep learning inference hardware. Crossbar arrays of resistive memory devices can be used to encode the network weights and perform efficient analog matrix-vector multiplications without intermediate movements of data. However, due to device variability and noise, the network needs to be trained in a specific way so that transferring the digitally trained weights to the analog resistive memory devices will not result in significant loss of accuracy. Here, we introduce a methodology to train ResNet-type convolutional neural networks that results in no appreciable accuracy loss when transferring weights to in-memory computing hardware based on phase-change memory (PCM). We also propose a compensation technique that exploits the batch normalization parameters to improve the accuracy retention over time. We achieve a classification accuracy of 93.7% on the CIFAR-10 dataset and a top-1 accuracy on the ImageNet benchmark of 71.6% after mapping the trained weights to PCM. Our hardware results on CIFAR-10 with ResNet-32 demonstrate an accuracy above 93.5% retained over a one day period, where each of the 361,722 synaptic weights of the network is programmed on just two PCM devices organized in a differential configuration.
I Introduction
Deep neural networks (DNNs) have revolutionized the field of artificial intelligence and have achieved unprecedented success in cognitive tasks such as image and speech recognition. Platforms for deploying the trained model of such networks and performing inference in an energy-efficient manner are highly attractive for edge computing applications. In particular, internet-of-things battery-powered devices and autonomous cars could especially benefit from fast, low-power, and reliably accurate DNN inference engines. Significant progress in this direction has been made with the introduction of specialized hardware for inference operating at reduced digital precision (4 to 8-bit), such as Google’s tensor processing unit (TPU) Jouppi et al. (2017) and low-power graphical processing units (GPUs) such as NVIDIA T4 Jia et al. (2019). While these platforms are very flexible, they are based on architectures where there is a physical separation between memory and processing units. The models are typically stored in off-chip memory, leading to constant shuttling of data between memory and processing units, which limits the maximum achievable energy efficiency.
In order to reduce the data transfers to a minimum in inference accelerators, a promising avenue is to employ in-memory computing using non-volatile memory devices Shafiee et al. (2016). Both charge-based storage devices, such as Flash memory Merrikh-Bayat et al. (2018), and resistance-based (memristive) storage devices, such as metal-oxide resistive random-access memory (ReRAM) Chen et al. (2019); Hu et al. (2018) and phase-change memory (PCM) Le Gallo et al. (2018); Boybat et al. (2018); Ambrogio et al. (2018) are being investigated for this. In this approach, the network weights are encoded as the analog charge state or conductance state of these devices organized in crossbar arrays, and the matrix-vector multiplications during inference can be performed in-situ in a single time step by exploiting Kirchhoff’s circuit laws. The fact that these devices are non-volatile (the weights will be retained when the power supply is turned off) and have multi-level storage capability (a single device can encode an analog range of values as opposed to 1 bit) is very attractive for inference applications. However, due to the analog nature of the weights programmed in these devices, only limited precision can be achieved in the matrix-vector multiplications and this could limit the achievable inference accuracy of the accelerator.
One potential solution to this problem is to train the network fully on hardware Nandakumar et al. (2018); Ambrogio et al. (2018), such that all hardware non-idealities would be de facto included as constraints during training. Another similar approach is to perform partial optimizations of the hardware weights after transferring a trained model to the chip Mohanty et al. (2017); Gonugondla, Kang, and Shanbhag (2018). The drawback of these approaches is that every neural network would have to be trained on each individual chip before deployment. Off-line variation-aware training schemes have also been proposed, where hardware non-idealities such as device-to-device variations Liu et al. (2015); Chen et al. (2017), defective devices Chen et al. (2017), or IR drop Liu et al. (2015) are first characterized and then fed into the training algorithm running in software. However, these approaches would require characterizing and training the neural network from scratch for every chip. A more practical approach would be to have a single custom generic training algorithm that is run entirely in software which would make the network immune to most of the hardware non-idealities, but at the same time would require only very little knowledge about the specific hardware it will be deployed on. In this way, the model would have to be trained only once and could be deployed on a multitude of different chips. To this end, several works have proposed to inject noise in the training algorithm to the layer inputs Moon, Shin, and Jeon (2019), synaptic weights Miyashita et al. (2017), and pre-activations Klachko, Mahmoodi, and Strukov (2019); Rekhi et al. (2019). However, previous demonstrations have generally been limited to rather simple and shallow networks, and experimental validations of the effectiveness of the various approaches have been missing. We are aware of one recent work that analyzed more complex problems such as ImageNet classification Rekhi et al. (2019); however, the hardware model used was rather abstract and no experimental validation was presented.
In this work, we explore injecting noise to the synaptic weights during the training of DNNs in software as a generic method to improve the network resilience against analog in-memory computing hardware non-idealities. We focus on the ResNet convolutional neural network (CNN) architecture, and introduce a number of techniques that allow us to achieve a classification accuracy of 93.7% on the CIFAR-10 dataset and a top-1 accuracy of 71.6% on the ImageNet benchmark after mapping the trained weights to PCM synapses. In contrast to previous approaches, the noise injected during training is crudely estimated from a one-time all-around hardware characterization, and captures the combined effect of read and write noise without introducing additional noise-related training hyperparameters. We validate the training approach through hardware/software experiments, where each of the 361,722 weights of ResNet-32 is programmed on two PCM devices of a prototype chip, and the rest of the network functionality is simulated in software. We achieve an experimental accuracy of after programming, which stays above over a period of 1 day. To improve the accuracy retention further, we develop a method to periodically calibrate the batch normalization parameters to correct the activation distributions during inference. We demonstrate a significant improvement in the accuracy retention with this method (up to on hardware for CIFAR-10) compared with a simple global scaling of the layers’ outputs, at the cost of additional digital computations during calibration. Finally, we discuss our training approach with respect to other methods and quantify the tradeoffs in terms of accuracy and ease of training.
II Results
II.1 Problem statement
For our experiments, we consider two residual networks on two different datasets: ResNet-32 on the CIFAR-10 dataset, and ResNet-34 on the ImageNet dataset He et al. (2016). As shown in Fig. 1a, ResNet-32 consists of 3 different ResNet blocks with ten kernels each, and is used to classify 32×32-pixel RGB images that belong to one of 10 classes (see Methods). The network contains 361,722 synaptic weights. The ResNet-34 network used for the 1000-class ImageNet dataset is shown in Supplementary Fig. 1. The main differences compared to ResNet-32 are the number and size of the ResNet blocks, and a larger number of input/output channels (see Methods).
The weights of all convolution layers along with the fully connected layer of ResNet-32 can be mapped on memristive crossbar arrays as explained in Fig. 1b. Each synaptic weight can be mapped on a differential pair of memristive devices that are located on two different columns. For a given layer $l$, the synaptic weight $W^l_{ij}$ of the $(i,j)$th synaptic element is represented by the effective synaptic conductance $G^l_{ij}$ given by
$$G^l_{ij} = G^{+,l}_{ij} - G^{-,l}_{ij}, \tag{1}$$
where $G^{+,l}_{ij}$ and $G^{-,l}_{ij}$ are the conductance values of the two devices forming the differential pair. Those device conductance values are defined as the effective conductance perceived in the operation of a non-ideal memristive crossbar array, and therefore include all the circuit non-idealities from the crossbar and peripheral circuitry.
The mapping between the synaptic weight $W^l_{ij}$ obtained after software training and the corresponding synaptic conductance is given by
$$G^l_{ij} = \frac{W^l_{ij}\, G_{\max}}{W^l_{\max}} + \delta G^l_{ij} = G^l_{T,ij} + \delta G^l_{ij}, \tag{2}$$
where $G_{\max}$ is the maximum reliably programmable device conductance and $W^l_{\max}$ is the maximum absolute synaptic weight value of layer $l$. $\delta G^l_{ij}$ represents the synaptic conductance error from the ideal target conductance value $G^l_{T,ij}$. $\delta G^l_{ij}$ is a time-varying random variable that describes the effects of non-ideal device programming (inaccuracies associated with write) and conductance fluctuations over time (inaccuracies associated with read). Possible factors leading to such conductance errors include inaccuracies in programming the synaptic conductance to $G^l_{T,ij}$, noise from memristive devices and circuits, temporal conductance drift, device-to-device variations, defective (stuck) devices, and circuit non-idealities (e.g. IR drop).
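The mapping of Eqs. (1) and (2) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the value `G_MAX = 25.0` and the choice of adding one Gaussian error term per synapse are assumptions made only for illustration.

```python
import random

G_MAX = 25.0  # hypothetical maximum programmable conductance (arbitrary units)

def weight_to_pair(w, w_max, g_max=G_MAX):
    """Target conductances (g_plus, g_minus) of the differential pair for one weight."""
    g_target = w * g_max / w_max
    if g_target >= 0:
        return g_target, 0.0
    return 0.0, -g_target

def program_layer(weights, sigma_g, rng, g_max=G_MAX):
    """Map a layer's weights to effective conductances with a Gaussian error term."""
    w_max = max(abs(w) for w in weights)
    effective = []
    for w in weights:
        g_plus, g_minus = weight_to_pair(w, w_max, g_max)
        # Assumption: one zero-mean Gaussian error per synapse, as in Eq. (2).
        effective.append(g_plus - g_minus + rng.gauss(0.0, sigma_g))
    return effective, w_max

def conductance_to_weight(g, w_max, g_max=G_MAX):
    """Scale an effective conductance back to the weight domain."""
    return g * w_max / g_max

rng = random.Random(0)
conductances, w_max = program_layer([0.5, -1.0, 0.25], sigma_g=1.0, rng=rng)
hardware_weights = [conductance_to_weight(g, w_max) for g in conductances]
```

With `sigma_g=0.0` the round trip returns the original weights exactly, which is a quick way to check the scaling before adding noise.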
Clearly, a direct mapping of the synaptic weights of a DNN trained with 32-bit floating point (FP32) precision to the same DNN with memristive synapses is expected to degrade the network accuracy due to the added error in the weights arising from $\delta G^l_{ij}$. For existing memristive technologies, the magnitude of $\delta G^l_{ij}$ may range from % of the magnitude of $G^l_{T,ij}$ Le Gallo et al. (2018), which in general is not tolerable by DNNs trained with FP32 without any constraints. Imposing such errors as constraints during training can be beneficial in improving the network accuracy. In fact, quantization of the weights or activations Merolla et al. (2016), and injecting noise on the weights Blundell et al. (2015), activations Gulcehre et al. (2016) or gradients Neelakantan et al. (2015) have been widely used as DNN regularizers during training to reduce overfitting on the training dataset An (1996); Jim, Horne, and Giles (1994). These techniques can improve the accuracy of DNN inference when it is performed with the same model precision as during training. However, achieving baseline accuracy while performing DNN inference on a model which is inevitably different from the one obtained after training, as is the case for any analog in-memory computing hardware, is a more difficult problem and requires additional investigation.
Although a large body of efficient techniques to train DNNs with reduced digital precision has been reported Gupta et al. (2015); McKinstry et al. (2018), it is unlikely that such procedures can generally be applied as-is to analog in-memory computing hardware due to the random nature of $\delta G^l_{ij}$. Since quantization errors coming from rounding to reduced fixed-point precision are not random, DNNs trained in this way are not a priori expected to be suitable for deployment on analog in-memory computing hardware. Techniques that inject random Gaussian noise during training are a much more natural fit to make the network robust to errors from analog in-memory computing hardware. As early as 1994, it was shown that injecting noise on the synaptic weights during training enhances the tolerance to weight perturbations of multi-layer perceptrons, and the application of this technique to analog neural hardware was discussed Murray and Edwards (1994). Recent works have also proposed to apply noise to the layer inputs or pre-activations in order to improve the network tolerance to hardware noise Moon, Shin, and Jeon (2019); Rekhi et al. (2019). In this work, we follow the original approach of Murray and Edwards (1994) of injecting Gaussian noise to the synaptic weights during training. Next, we discuss different techniques that we employed together with synaptic weight noise in order to improve the accuracy of inference on ResNet and achieve close to software-equivalent accuracy after transferring the weights to PCM hardware.
II.2 Training procedure
When performing inference with analog in-memory computing hardware, the DNN experiences errors primarily due to (i) inaccurate programming of the network weights onto the devices (write noise) and (ii) temporal fluctuations of the hardware weights (read noise). We can cast the effect of these errors into a single error term $\delta G^l_{ij}$ that distorts each synaptic weight when performing forward propagation during inference. Hence, we propose to add random noise that corresponds to the error induced by $\delta G^l_{ij}$ to the synaptic weights at each forward pass during training (see Fig. 1b). The backward pass and weight updates are performed with weights that did not experience this noise. We found that adding noise to the weights only in the forward propagation is sufficient to achieve close to baseline accuracy for a noise magnitude comparable to that of our hardware, and adding noise during the backward propagation did not improve the results further. For simplicity, we assume that $\delta G^l_{ij}$ is Gaussian distributed, which is usually the case for analog memristive hardware. Weights are linearly mapped to the entire conductance range of the hardware, hence the standard deviation $\sigma^l_{\delta W_{tr}}$ of the Gaussian noise on weights to be applied during training, for a layer $l$, can be computed as
$$\eta_{tr} \equiv \frac{\sigma^l_{\delta W_{tr}}}{W^l_{\max}} = \frac{\sigma_G}{G_{\max}}, \tag{3}$$
where $\sigma_G$ is a representative standard deviation of the combined read and write noise measured from hardware. During training, the weight distribution of every layer and hence $W^l_{\max}$ changes; therefore $\sigma^l_{\delta W_{tr}}$ is recomputed after every weight update so that $\eta_{tr}$ stays constant throughout training. We found this to be especially important in achieving good training convergence with this method.
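As a rough illustration of this procedure, the sketch below trains a single linear neuron with noise added to the weights in the forward pass only. The toy model, the target function, the learning rate and the relative noise `eta_tr=0.04` are hypothetical choices, not values from the paper. Only the structure follows the text: the noise standard deviation is recomputed from the largest absolute weight at every step, and the update is applied to the stored noise-free weights.

```python
import random

def noise_std(weights, eta_tr):
    """Per-layer noise standard deviation: eta_tr times the largest |weight|."""
    return eta_tr * max(abs(w) for w in weights)

def train_step(weights, x, y, eta_tr, lr, rng):
    """One SGD step on the squared error of a linear neuron y_hat = w . x."""
    sigma = noise_std(weights, eta_tr)  # recomputed at every update
    noisy = [w + rng.gauss(0.0, sigma) for w in weights]
    y_hat = sum(wn * xi for wn, xi in zip(noisy, x))  # noisy forward pass
    err = y_hat - y
    # The update is applied to the stored, noise-free weights.
    return [w - lr * err * xi for w, xi in zip(weights, x)]

rng = random.Random(0)
weights = [0.5, -0.3]
for _ in range(200):
    x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
    y = 1.0 * x[0] - 2.0 * x[1]  # hypothetical target function
    weights = train_step(weights, x, y, eta_tr=0.04, lr=0.1, rng=rng)
```

In a deep network the same idea would apply per layer, with the framework's automatic differentiation computing gradients through the noisy forward pass.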
Weight initialization can have a significant effect on DNN training He et al. (2015). Two different weight initializations can lead to completely different minima when optimizing the network objective function. The network optimum when training with additive noise could be closer to the FP32 training optimum than to a completely random initialization. So it can be beneficial to initialize weights from a pretrained baseline network and then retrain this network by injecting noise. A similar observation was reported for training ResNet with reduced digital precision McKinstry et al. (2018). For achieving high classification accuracy in our experiments, we found this strategy more helpful than random initialization.
The noise injected during training according to Eq. (3) is closely related to the maximum weight of a layer, and can thus grow uncontrollably with outlier weight values. Controlling the weight distribution in a desirable range can improve the network training convergence and makes the mapping of weights to hardware with limited conductance range easier. We therefore clip the synaptic weights at layer $l$ after every weight update in the range $[-\alpha\,\sigma_{W^l},\ \alpha\,\sigma_{W^l}]$, where $\sigma_{W^l}$ is the standard deviation of weights in layer $l$ and $\alpha$ is a tunable hyperparameter. In our studies, and worked the best for ResNet-32 and ResNet-34, respectively.
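The clipping step can be sketched in a few lines of Python. The function name and the example value of `alpha` are hypothetical, and the population standard deviation is an assumption, since the text does not say which estimator is used.

```python
import statistics

def clip_weights(weights, alpha):
    """Clip a layer's weights to [-alpha * std, +alpha * std]."""
    sigma_w = statistics.pstdev(weights)  # assumption: population std of the layer
    bound = alpha * sigma_w
    return [max(-bound, min(bound, w)) for w in weights]

# Example: the two outliers are pulled in, the inner weights are unchanged.
clipped = clip_weights([-3.0, -1.0, 1.0, 3.0], alpha=1.0)
```

Because the noise of Eq. (3) scales with the largest absolute weight, clipping outliers this way also bounds the absolute noise injected into the rest of the layer.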
DNN convergence accuracy, in general, is sensitive to the learning rate used during training. Since we initialize the network parameters from a baseline network, using the same learning rate scheduling as that of the baseline network does not guarantee accurate convergence. To choose appropriate learning rate scheduling for ResNet-32, we first forward propagate the training set on the pretrained baseline network with injected synaptic weight noise and note the resulting accuracy. We note the learning rate evolution starting from this accuracy in the baseline network training curve until convergence, and use the same learning rate evolution while retraining the network by injecting noise.
We performed simulations to characterize the inference performance after training incorporating the injection of Gaussian noise in conjunction with the techniques presented above. We computed the classification accuracy for different amounts of injected noise during training. We also show how the accuracy is affected when the inference weights are perturbed by a certain amount of relative noise $\eta_{inf} = \sigma^l_{\delta W_{inf}} / W^l_{\max}$, where $\sigma^l_{\delta W_{inf}}$ is the standard deviation of the noise injected to the weights of layer $l$ before performing inference on the test dataset.
The test accuracy of ResNet-32 on CIFAR-10 obtained for different amounts of noise injected during training, without inducing any perturbation during inference (), is plotted in Fig. 2a. It can be seen that the training algorithm is able to achieve a test accuracy close to the software baseline of with up to approximately . The tolerance of the networks trained with different amounts of to weight perturbations during inference, , is shown in Fig. 2b. For a given value of , in general, the highest test accuracy can be obtained for the network that has been trained with a comparable amount of synaptic weight noise, i.e. for . The test accuracy for is shown in Fig. 2c. It can be seen that for up to , an accuracy within of the software baseline is achievable. The impact of the weight initialization, clipping, and learning rate scheduling on the accuracy is shown in Supplementary Fig. 2. Not incorporating any of those three techniques results in at least 1% drop in test accuracy for .
The top-1 accuracy of ResNet-34 on ImageNet for is shown in Fig. 2d. Consistent with previous observations Rekhi et al. (2019); McKinstry et al. (2018), we found that the network recovers high accuracy extremely quickly when retraining with additive noise due to quick updates of the batch normalization parameters (see Supplementary Note 1), and obtained satisfactory convergence after only 8 epochs. The accuracy on ImageNet is much more sensitive to the noise injected during training than for CIFAR-10, and when noise is injected on all layers, there is more than accuracy drop from the baseline even down to relative noise. In the literature, many network compression techniques allow higher precision for the first and last layers, which are more sensitive to noise Rastegari et al. (2016); McKinstry et al. (2018). We applied the same simplification to our problem: we removed the noise during training on the first convolutional layer and the last dense layer, and performed inference with the first and last layers without noise. The obtained accuracy after training, by injecting the same training and inference noise as previously, can be increased by more than 1% with this technique (see Fig. 2d).
II.3 Weight transfer to PCM-based synapses
In order to experimentally validate the effectiveness of the above training methodology, we performed experiments on a prototype multi-level PCM chip comprising 1 million PCM devices fabricated in 90 nm CMOS baseline technology Close et al. (2010). PCM is a memristive technology which records data in a nanometric volume of phase-change material sandwiched between two electrodes Burr et al. (2016). The phase-change material is in the low-resistive crystalline phase in an as-fabricated device. By applying a current pulse of sufficient amplitude (typically referred to as the RESET pulse) an amorphous region around the narrow bottom electrode is created via a melt-quench process. The device will be in a low conductance state if the high-resistive amorphous region blocks the current path between the two electrodes. The size of the amorphous region can be modulated in an almost completely analog manner by the application of suitable electrical pulses. Hence, a continuum of conductance values can be programmed in a single PCM device over a range of more than two orders of magnitude.
An optimized iterative programming algorithm was developed to program the conductance values in the PCM devices with high accuracy (see Methods). The experimental cumulative distributions of conductance values for 11 representative programmed levels, measured approximately 25 seconds after programming, are shown in Fig. 3a. The standard deviation of these distributions is extracted and fitted with a polynomial function of the target conductance (dashed lines in Fig. 3a) as shown in Fig. 3b. For all levels, we achieve a standard deviation less than 1.2 μS, which is more than 2 times lower than that reported in previous works on nanoscale PCM arrays for a similar conductance range Le Gallo et al. (2018); Tsai et al. (2019).
To study the effect of weight transfer to PCM synapses, Eq. (2) is computed using the conductance standard deviation measured from hardware. is modeled as a Gaussian distributed random variable with 0 mean and standard deviation given by the fitted curve of Fig. 3b for the corresponding target conductance, , computed with S. The resulting test accuracy obtained after software training and after weight transfer to PCM synapses for ResNet-32 on CIFAR-10 is shown in Fig. 3c for different training procedures. It can be seen that standard FP32 training without constraints performs the worst after transfer to PCM synapses. Training with 4-bit precision weights (using the method described in Ref. McKinstry et al., 2018), which is roughly the effective precision of our PCM devices Le Gallo et al. (2018), improves the performance after transfer with respect to FP32, but nevertheless the accuracy decreases by more than 1% after transferring the 4-bit weights to PCM. Training ternary digital weights Li, Zhang, and Liu (2016) leads to a lower performance drop () when transferring weights to PCM, although we were not able to reach the FP32 baseline with ternary weights on this network. Therefore the accuracy after transfer is worse than for the 4-bit weights. When performing training by injecting Gaussian noise as described in Section II.2 with , corresponding to S (median of the 11 values reported in Fig. 3b), the best overall performance after transfer to PCM is obtained. The resulting accuracy of 93.7% is less than below the FP32 baseline. A rather broad range of values of lead to a similar resulting accuracy (see Supplementary Fig. 3), demonstrating that does not have to be very precisely determined for obtaining satisfactory results on PCM. The accuracy obtained without perturbing the weights after training by injecting noise is slightly higher than the FP32 baseline, which could be attributed to improved generalization resulting from the additive noise training.
The top-1 accuracy for ResNet-34 on ImageNet after transfer to PCM synapses for different training procedures is shown in Fig. 3d. Training with additive noise increases the accuracy by approximately 6% on PCM compared with FP32 and 4-bit McKinstry et al. (2018) training. The accuracy of 71.6% achieved with additive noise training on PCM is significantly higher than that reported in Fig. 2d with , which could be attributed to a high percentage of network weights mapped to low conductance values with lower standard deviation than the median of S.
II.4 Hardware/software inference experiment on CIFAR-10
Although we could achieve good test accuracy after weight transfer to PCM synapses as shown in the previous section, an important challenge for any analog in-memory computing hardware is to be able to retain this accuracy over time. This is especially true for PCM due to the high noise experienced in these devices as well as temporal conductance drift. The conductance values in PCM drift over time according to the relation $G(t) = G(t_0)\,(t/t_0)^{-\nu}$, where $G(t_0)$ is the conductance measured at time $t_0$ after programming and $\nu$ is the drift exponent, which depends on the device, phase-change material, and phase configuration of the PCM ($\nu$ is higher for the amorphous than the crystalline phase) Le Gallo et al. (2018). In our PCM devices, on average. Therefore, it is essential to measure experimentally how the test accuracy evolves over time during inference with PCM.
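To give a feeling for the size of the effect, the power-law drift relation can be evaluated directly. The sketch below is illustrative only: the drift exponent `nu=0.05` and the conductance value are hypothetical placeholders, not measured values from this work, and `t0 = 25` s simply mirrors the first read time mentioned in the text.

```python
def drifted_conductance(g_t0, t, t0, nu):
    """Conductance at time t given its value g_t0 measured at time t0."""
    return g_t0 * (t / t0) ** (-nu)

# Hypothetical device: 10 units of conductance at t0 = 25 s, nu = 0.05.
one_day = 24 * 3600.0
g_after_one_day = drifted_conductance(10.0, one_day, 25.0, 0.05)
```

Because the decay is a power law, most of the conductance loss happens in the first decades of time after programming, which is consistent with the rapid accuracy drop reported when no compensation is applied.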
Here, we present experiments where all 361,722 synaptic weights of ResNet-32 trained with are programmed individually on two PCM devices of the chip. Depending on the sign of $W^l_{ij}$, either $G^{+,l}_{ij}$ or $G^{-,l}_{ij}$ is iteratively programmed to $|G^l_{T,ij}|$, and the other device is RESET close to 0 μS with a single pulse of 450 μA amplitude and 50 ns width. The iterative programming algorithm converged on of the devices programmed to nonzero conductance, and no screening for defective devices on the chip was performed prior to the experiments. The scatter plot of the PCM weights measured approximately 25 seconds after programming versus the target weights is shown in Fig. 4a. After programming, the PCM analog conductance values were periodically read from hardware over a period of 1 day, scaled to the network weights, and reported to the software that performed inference on the test dataset (see Methods).
In addition to the experiment, we developed an advanced behavioral model of the hardware in order to precisely capture the conductance evolution over time during inference (see Supplementary Note 2). The model is built based on an extensive experimental characterization of the array-level statistics of hardware noise and drift. Conductance drift is modeled using a Gaussian distributed drift exponent across devices, whose mean and standard deviation both depend on the target conductance state . Conductance noise with the experimentally observed frequency dependence is also incorporated with a magnitude that depends on the target conductance state and time. The model is able to accurately reproduce both the array-level statistics (see Fig. 4b) and individual device behavior (see Fig. 4c) observed over the duration of the experiment. Accurate modeling of all the complex dependencies of noise and drift as a function of time and conductance state was found to be very critical in being able to reproduce the experimental evolution of the accuracy on ResNet.
The resulting accuracy on CIFAR-10 over time is shown in Fig. 4d. The test accuracy measured 25 seconds after programming is , which is very similar to the result obtained in Fig. 3c. However, if nothing is done to compensate for conductance drift, the accuracy quickly decreases down to 10% (random guessing) within approximately 1000 seconds. This is because the magnitude of the PCM weights gradually reduces over time due to drift and this prevents the activations from properly propagating throughout the network. A simple global scaling calibration procedure can be used to compensate for the effect of drift on the matrix-vector multiplications performed with PCM crossbars. As proposed in Ref. Le Gallo et al., 2018, the summed current of a subset of the columns in the array can be periodically read over time at a constant voltage. The resulting total current is then divided by the summed current of the same columns but read at time $t_0$. This results in a single scaling factor that can be applied to the output of the entire crossbar in order to compensate for a global conductance shift (see Methods and Supplementary Fig. 4). Since this factor can be combined with the batch normalization parameters, it does not incur any additional overhead when performing inference. This simple global drift compensation (GDC) procedure was implemented for every layer before carrying out inference on the test set, and the results are shown in Fig. 4d. It can be seen that GDC allows the retention of a test accuracy above for 1 day on the PCM chip, and effectively prevents the effect of global weight decay over time as illustrated in Supplementary Fig. 4. A good agreement of the accuracy evolution between model and experiment is obtained, hence validating its use for extrapolating results over a longer period of time and for assessing the accuracy of larger networks that cannot fit on our current hardware.
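The GDC procedure described above reduces to one ratio of summed currents per crossbar. The following Python sketch is an illustration under stated assumptions: the function names and the read voltage `v_read=0.2` are hypothetical, and dividing the crossbar outputs by the ratio is one plausible way to apply the scaling factor, which the text does not spell out.

```python
def gdc_factor(g_now, g_ref, v_read=0.2):
    """Ratio of the summed column current now to that measured at t0."""
    i_now = v_read * sum(g_now)
    i_ref = v_read * sum(g_ref)
    return i_now / i_ref

def compensate(outputs, factor):
    """Rescale crossbar outputs to undo a global conductance shift."""
    return [y / factor for y in outputs]

# Example: every device has lost 20% of its conductance since t0.
factor = gdc_factor([8.0, 16.0, 4.0], [10.0, 20.0, 5.0])
restored = compensate([0.8, -1.6], factor)
```

A single factor of this kind corrects a uniform shift only. Device-to-device differences in drift are left uncorrected, which is the limitation the next section addresses.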
II.5 Adapting batch normalization statistics to improve the accuracy retention
Although GDC can compensate for a global conductance shift across the array, it cannot mitigate the effect of noise and drift variability across devices. From the model, we observe that noise is responsible for the random accuracy fluctuations, whereas drift variability and its dependence on the target conductance state cause the monotonic accuracy decrease over time (see Supplementary Fig. 5). In order to improve the accuracy retention further, we propose to leverage the batch normalization parameters to correct the activation distributions during inference such that their mean and variance match those that were optimally learned during training. During inference, batch normalization is performed by normalizing the preactivations by their corresponding running mean $\mu$ and variance $\sigma^2$ computed during training. Then, scale and shift factors ($\gamma$ and $\beta$) that were learned through backpropagation are applied to the normalized preactivations. Since $\gamma$ and $\beta$ are learnable parameters, it is not desirable to change them since it would require retraining the model on the PCM devices. However, updating $\mu$ and $\sigma^2$ is more intuitive, since the mean and variance of the preactivations are affected by noise and drift. Leveraging this idea, we introduce a new compensation technique called adaptive batch normalization statistics update (AdaBS), which improves the accuracy retention beyond GDC at the cost of additional computations during the calibration phase.
As described in Fig. 5a, the calibration phase consists of sending multiple mini-batches from a set of calibration images that come from the same distribution as the images seen during inference. In this study, we use the images from the training dataset as calibration images. The running mean and variance of preactivations are computed across the entire calibration dataset. The new values of $\mu$ and $\sigma^2$ computed during calibration are then used for subsequent inference. The main advantage of this technique is that it incurs neither additional digital computations nor weight programming during inference, since we are only updating the batch normalization parameters $\mu$ and $\sigma^2$ when the calibration is performed. However, injecting the entire training dataset to compute $\mu$ and $\sigma^2$ in the calibration phase would bring significant overhead. When reducing the number of injected images, the number of updates of the running statistics becomes smaller, and if the momentum used for computing $\mu$ and $\sigma^2$ is not properly tuned to account for this, the network accuracy drops sharply. To tackle this issue, we developed a procedure to obtain the optimal momentum as a function of the number of mini-batches used for calibration (see Methods and Supplementary Note 3). With this method, we were able to reduce the number of calibration images down to 5.2% of the CIFAR-10 training dataset (2,600 images) without affecting the accuracy. With that number of images, the overhead in terms of digital computations of the AdaBS calibration is about 52% of performing batch normalization during inference on the whole CIFAR-10 test set (see Supplementary Note 3). It may appear cumbersome to send so many images to the device to perform the calibration; however, since calibration is only performed periodically when the device is idle, and not every time an image is inferred by the network, the high calibration cost can be amortized.
The calibration overhead can be further reduced by using more efficient variants of batch normalization such as the $L_1$-norm version (see Supplementary Note 3). Moreover, although we used AdaBS (and GDC) to compensate solely for the drift of the PCM devices, the same procedure can be applied to mitigate conductance changes due to ambient temperature variations, a critical issue for any analog in-memory computing hardware. The resulting accuracy when performing AdaBS on ResNet-32 with hardware weights before carrying out inference on the test set is shown in Fig. 5b. AdaBS allows a test accuracy above to be retained over one day, an improvement of compared with GDC. This improvement becomes for one year when extrapolating the results using the PCM model.
We also applied AdaBS on the ImageNet classification task with ResNet-34, trained with , using the PCM model to simulate the weight evolution for one year. By applying the same AdaBS method as for CIFAR-10 using only 0.1% of the ImageNet training dataset for calibration (1,300 images), the accuracy after one year is increased by 7% compared with GDC when all layers are implemented with PCM synapses (see Fig. 5c). When the first and last layers are implemented in digital FP32, the initial accuracy increases to and the retention is significantly improved. This technique, combined with AdaBS, allows the retention of an accuracy above for one year. Performing inference on hardware in this way comes with drawbacks in efficiency, but they stay limited given the small number of parameters and input size of the first and last layers (the first layer's input is a large image, but it has only 3 channels; for the last layer, the input is flattened to a single 512-dimensional vector, assuming a batch size of 1; the first and last layers contain less than 3% of the network weights).
III Discussion
Combined, the strategies developed in this study allow us to achieve the highest accuracies reported so far with analog resistive memory on the CIFAR-10 and ImageNet benchmarks with residual networks close to their original implementation He et al. (2016). Although there is still room for improvement, especially on ImageNet, those accuracies are already comparable to or higher than those reported on ternary weight networks Li, Zhang, and Liu (2016), for example top-1 accuracy of ResNet-34 on ImageNet with first layer in FP32 Venkatesh, Nurvitadhi, and Marr (2017). Importantly, the accuracies we report are achieved with just a single nanoscale PCM device encoding the absolute value of a weight. A common approach that could improve the accuracy further is to use multiple devices to encode different bits of a weight Shafiee et al. (2016); Tsai et al. (2019), at the expense of area and energy penalties and of additional support required from the peripheral circuitry. In line with previous observations Rekhi et al. (2019); McKinstry et al. (2018), we notice that retraining ResNet with additive noise results mainly in adapting the batch normalization parameters, whereas the weights stay close to the full-precision weights trained without noise. Hence, retraining by injecting noise from a pretrained baseline network rather than from scratch is very effective, since the network recovers high accuracy very quickly, especially for ImageNet. Although our experiments are not done on a fully integrated chip that supports all functions of deep learning inference, the most critical effects of array-level variability, noise, and drift are fully accounted for, because each weight of the network is programmed on individual PCM devices of our array.
Aspects of a fully-integrated chip that are not entirely captured in our experiments such as IR drop and additional circuit nonidealities such as offsets and noise have been studied in previous works and could be mitigated by additional retraining methods Moon, Shin, and Jeon (2019); Liu et al. (2015). Additional errors due to quantization coming from the crossbar data converters are analyzed further below.
There exist many different methods of training a neural network with noise that aim to improve the resilience of the model to analog mixed-signal hardware. These include injecting additive noise on the inputs of every layer Moon, Shin, and Jeon (2019), on the preactivations Klachko, Mahmoodi, and Strukov (2019); Rekhi et al. (2019), or just adding noise on the input data Bishop (1995). Moreover, injecting multiplicative Gaussian noise to the weights Murray and Edwards (1994) is also defensible given the observed noise on the hardware. We analyzed the four aforementioned methods, attempting to reach the same accuracy demonstrated previously after weight transfer to PCM devices, to identify their possible benefits and drawbacks (see Supplementary Note 4). We found that it is possible to adjust the training procedure of all four methods to achieve a similar accuracy on CIFAR-10 after transferring the weights to PCM synapses. Somewhat surprisingly, even adding noise on the input data during training, which is just a simple form of data augmentation, leads to a model that is more resilient to weight perturbations during inference. This shows that it is not necessary to train a model with very complicated noise models that imitate the observed hardware noise precisely. As long as the data propagated through the network is corrupted by Gaussian noise of the right magnitude, the model is expected to be robust to mapping on PCM devices. However, all four methods require tuning one or multiple noise scaling factor hyperparameters in order to reach satisfactory accuracy after transfer to PCM. In contrast, our proposed methodology estimates the magnitude of the additive noise to inject on the weights from a simple hardware characterization, avoiding any hyperparameter search for noise scaling factors. This noise magnitude does not have to be very precise either, because there is a range of values that lead to similar accuracy after transfer to PCM (see Supplementary Fig. 3).
Moreover, we found that injecting noise on weights achieves better accuracy retention over time (see Supplementary Note 4), which suggests that weight noise mimics the behavior of the PCM hardware better.
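The weight-noise injection discussed above can be illustrated with a minimal NumPy sketch of a single dense layer. It is not the authors' implementation: the function names, the `eta` and `clip_scale` parameters, the rule that the noise standard deviation scales with the largest absolute weight, and the rule that weights are clipped at a multiple of their standard deviation are all assumptions made for illustration.

```python
import numpy as np

def clip_weights(w, clip_scale):
    """Clip weights to +/- clip_scale * std(w) (assumed clipping rule)."""
    bound = clip_scale * np.std(w)
    return np.clip(w, -bound, bound)

def noisy_forward(x, w, eta, rng):
    """Dense-layer forward pass with additive Gaussian noise on the weights.

    The noise standard deviation is eta * max|w| (an assumption). The clean
    weights w are not modified, so gradient updates would still apply to them.
    """
    sigma = eta * np.max(np.abs(w))
    w_noisy = w + rng.normal(0.0, sigma, size=w.shape)
    return x @ w_noisy

rng = np.random.default_rng(0)
w_raw = rng.normal(0.0, 0.1, size=(64, 10))
w = clip_weights(w_raw, clip_scale=2.0)            # hypothetical clip scale
x = rng.normal(size=(8, 64))
y_clean = x @ w
y_noisy = noisy_forward(x, w, eta=0.04, rng=rng)   # fresh noise on each call
```

Drawing a new noise sample on every forward pass, while keeping the stored weights clean, is what distinguishes this kind of training from simply perturbing the weights once before training starts.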
A critical issue for in-memory computing hardware is the need for digital-to-analog (analog-to-digital) conversion every time data goes in (out) of the crossbar arrays. These data conversions lead to quantization of the activations and preactivations, respectively, which introduce additional errors in the forward propagation. Based on a recent ADC survey Rekhi et al. (2019), 8-bit data conversion is a good tradeoff between precision and energy consumption. Hence, we analyzed the effect of quantizing the input and output of every layer of ResNet-32 and ResNet-34 to 8-bit on the inference accuracy. We set the input/output quantization ranges to the -th percentile of the activation/preactivation distributions that are obtained when forward propagating 10k randomly sampled images from the training dataset through the baseline network. As shown in Supplementary Fig. 6, even though the 8-bit quantization is not included in our training algorithm, the quantization has a minimal effect on the mean accuracy of ResNet-32 on CIFAR-10 ( drop) and ResNet-34 on ImageNet ( drop) after weight transfer to PCM synapses. The accuracy evolution over time, retaining the same quantization ranges, does not degrade significantly further and stays well within one standard deviation of that obtained without quantization. The small accuracy deviations could be potentially overcome by including the quantization in the retraining process, which will likely be necessary if less than 8-bit resolution is desired for higher energy efficiency.
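The data-converter quantization described above can be sketched as a uniform quantizer whose range is set from a percentile of sampled values. This is only an illustration under stated assumptions: the function names are hypothetical, the percentile used here (99.9) is a placeholder rather than the value used in the study, and taking the percentile of the absolute values to obtain a symmetric range is an assumed choice.

```python
import numpy as np

def calibrate_range(samples, percentile):
    """Clipping range taken from a percentile of |samples| (assumed rule)."""
    return np.percentile(np.abs(samples), percentile)

def quantize(x, max_val, bits=8, signed=True):
    """Uniformly quantize x to `bits` bits within the calibrated range."""
    if signed:
        levels = 2 ** (bits - 1) - 1      # 127 integer steps per side for 8 bits
        low = -levels
    else:
        levels = 2 ** bits - 1            # 255 integer steps for 8 bits
        low = 0
    step = max_val / levels
    q = np.clip(np.round(x / step), low, levels)
    return q * step

rng = np.random.default_rng(0)
pre = rng.normal(size=10_000)             # stand-in for sampled preactivations
r = calibrate_range(pre, 99.9)            # placeholder percentile
pre_q = quantize(pre, r, bits=8, signed=True)
```

Values beyond the calibrated range saturate at the range limit, which is the price paid for a finer step size over the bulk of the distribution.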
Although a computational memory accelerates the matrix-vector multiplication operations in a DNN, communicating activations between computational memory cores executing different layers can become a bottleneck. This bottleneck depends upon two factors: (i) the way different layers are connected to each other and (ii) the latency of the hardware implementation to transfer activations from one core to another. Designing optimal interconnectivity between the cores for state-of-the-art deep CNNs is an open research problem. Indeed, keeping the network weights stationary during execution in a computational memory limits which portions of the computation can be forwarded to different cores. As a result, long-established hardware communication fabrics are ill-suited to the task. One topology for communication fabrics that is well-suited for computational memory is proposed by Dazzi et al. (2019). It is based on a 5 parallel prism (5PP) graph topology and facilitates inter-layer pipelined execution of CNNs Shafiee et al. (2016). The proposed 5PP topology allows the mapping of all the primary connectivities of state-of-the-art neural networks, including ResNet, DenseNet and Inception-style networks Dazzi et al. (2019). As discussed in Ref. Dazzi et al., 2019, the ResNet-32 implementation with 5PP can result in potentially improvement in pipeline stage latency with similar bandwidth requirements compared with a standard 2D-mesh. Assuming 8-bit activations, communication links with a data rate of 5 Gbps Sacco et al. (2017), and a crossbar computational cycle time of 100 ns, a single image inference latency of 52 μs and frame rate of frames per second (FPS) for ResNet-32 on CIFAR-10 is estimated. As an approximate comparison, YodaNN Andri et al. (2017), a digital DNN inference accelerator for binary weight networks with ultra-low power budget, achieves 434.8 FPS in high throughput mode for a 9-layer CNN (BinaryConnect Courbariaux, Bengio, and David (2015)) on CIFAR-10. Although not a direct comparison, the proposed topology and pipelined execution of ResNet-32 could result in speedup, with a deeper network than the digital solution.
In summary, we introduced strategies for training ResNet-type CNNs for deployment on analog in-memory computing hardware, as well as for improving the accuracy retention on such hardware. We proposed to inject noise to the synaptic weights during the forward pass of training, with a magnitude proportional to the combined read and write conductance noise of the hardware. This approach, combined with judicious weight initialization, clipping, and learning rate scheduling, allowed us to achieve an accuracy of 93.7% on the CIFAR-10 dataset and a top-1 accuracy on the ImageNet benchmark of 71.6% after mapping the trained weights to analog PCM synapses. Our methods introduce only a single additional hyperparameter during training, the weight clip scale, since the magnitude of the injected noise can be easily deduced from a one-time hardware characterization. After programming the trained weights of ResNet-32 on 723,444 PCM devices of a prototype chip, the accuracy computed from the measured hardware weights stayed above over a period of 1 day, which is to the best of our knowledge the highest accuracy experimentally reported to date on the CIFAR-10 dataset by any analog resistive memory hardware. A global scaling procedure was used to compensate for the conductance drift of the PCM devices, which was found to be critical in improving the accuracy retention. However, global scaling could not mitigate the effect of noise and drift variability across devices, which led to accuracy fluctuations and a monotonic accuracy decrease over time, respectively. Periodically calibrating the batch normalization parameters before inference alleviated those issues at the cost of additional digital computations, increasing the 1-day accuracy to on hardware. These results demonstrate the feasibility of realizing accurate inference on complex DNNs through analog in-memory computing using existing PCM devices.
Methods
III.1 Experiments on PCM hardware platform
The experimental platform is built around a prototype PCM chip that comprises 3 million PCM devices. The PCM array is organized as a matrix of word lines (WL) and bit lines (BL). In addition to the PCM devices, the prototype chip integrates the circuitry for device addressing and for write and read operations. The PCM chip is interfaced to a hardware platform comprising two field programmable gate array (FPGA) boards and an analog-front-end (AFE) board. The AFE board provides the power supplies as well as the voltage and current reference sources for the PCM chip. The FPGA boards are used to implement overall system control and data management as well as the interface with the data processing unit. The experimental platform is operated from a host computer, and a Matlab environment is used to coordinate the experiments. The PCM devices were integrated into the chip in 90-nm CMOS technology using the key-hole process described in Ref. Breitwisch et al., 2007. The phase-change material is doped Ge2Sb2Te5. The bottom electrode has a radius of nm and a length of nm. The phase-change material is nm thick and extends to the top electrode, whose radius is nm. All experiments performed in this work were done on an array containing 1 million devices accessed via transistors, which is organized as a matrix of 512 WL and 2048 BL.
A PCM device is selected by serially addressing a WL and a BL. To read a PCM device, the selected BL is biased to a constant voltage ( mV) by a voltage regulator via a voltage generated off chip. The sensed current is integrated by a capacitor, and the resulting voltage is then digitized by the on-chip 8-bit cyclic analog-to-digital converter (ADC). The total duration of applying the read pulse and converting the data with the ADC is s. The readout characteristic is calibrated via on-chip reference polysilicon resistors. To program a PCM device, a voltage generated off chip is converted on chip into a programming current. This current is then mirrored into the selected BL for the desired duration of the programming pulse. Iterative programming involving a sequence of program-and-verify steps is used to program the PCM devices to the desired conductance values Papandreou et al. (2011). The devices are initialized to a high-conductance state via a staircase-pulse sequence. The sequence starts with a RESET pulse of amplitude 450 μA and width 50 ns, followed by 6 pulses of amplitude decreasing regularly from 160 μA to 60 μA and with a constant width of 1000 ns. After initialization, each device is set to a desired conductance value through a program-and-verify scheme. The conductance of all devices in the array is read 5 times consecutively at a voltage of 0.3 V, and the mean conductance of these reads is used for verification. If the read conductance of a specific device does not fall within 0.25 μS of its target conductance, it receives a programming pulse whose amplitude is incremented or decremented proportionally to the difference between the read and target conductance. The pulse amplitude ranges between 80 μA and 400 μA. This program-and-verify scheme is repeated for a maximum of 55 iterations.
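The program-and-verify loop described above can be sketched in Python as follows. The acceptance window (0.25 μS), amplitude limits (80 μA to 400 μA), number of reads (5), and iteration cap (55) come from the text; everything else is an assumption for illustration, in particular the toy device model in which conductance falls linearly with pulse amplitude, the noise levels, the `gain` constant, and the starting amplitude. The initialization staircase is not modeled.

```python
import numpy as np

I_MIN, I_MAX = 80.0, 400.0   # pulse amplitude limits (uA), from the text
TOL = 0.25                   # acceptance window around the target (uS)
MAX_ITERS = 55               # maximum number of program-and-verify iterations
G_MAX = 25.0                 # assumed maximum conductance of the toy device (uS)

def toy_device(amplitude, rng):
    """Hypothetical device: conductance falls linearly with pulse amplitude."""
    g = G_MAX * (I_MAX - amplitude) / (I_MAX - I_MIN)
    return g + rng.normal(0.0, 0.1)            # assumed programming noise (uS)

def read_mean(g, rng, n_reads=5):
    """Mean of n_reads noisy reads (the read noise level is an assumption)."""
    return np.mean(g + rng.normal(0.0, 0.05, size=n_reads))

def program_and_verify(target, rng, gain=5.0):
    """Adjust the pulse amplitude until the read conductance is within TOL
    of the target or MAX_ITERS verify steps have been used."""
    amplitude = 0.5 * (I_MIN + I_MAX)          # assumed starting amplitude
    g = toy_device(amplitude, rng)
    for it in range(1, MAX_ITERS + 1):
        g_read = read_mean(g, rng)
        error = g_read - target
        if abs(error) <= TOL:
            return g_read, it
        # Too conductive: stronger pulse. Too resistive: weaker pulse.
        amplitude = np.clip(amplitude + gain * error, I_MIN, I_MAX)
        g = toy_device(amplitude, rng)
    return read_mean(g, rng), MAX_ITERS

rng = np.random.default_rng(0)
g_final, n_iters = program_and_verify(target=10.0, rng=rng)
```

Because the amplitude correction is proportional to the remaining error, the loop behaves like a simple proportional controller; the iteration cap bounds the programming time for devices that do not settle.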
In the hardware/software inference experiments, the analog conductance values of the PCM devices encoding the network weights are serially read individually with the 8-bit on-chip ADC at predefined timestamps spaced over a period of one day. The conductance values read at every timestamp are passed to TensorFlow-based software, which performs the forward propagation of the CIFAR-10 test set on the weights read from hardware and computes the resulting classification accuracy. The drift compensation techniques, GDC and AdaBS, are performed entirely in software at every timestamp based on the conductance values read from hardware.
III.2 PCM-based deep learning inference simulator
We developed a simulation framework to test the efficacy of DNN inference using PCM devices. We chose Google's TensorFlow (Abadi et al., 2015) deep learning framework for the simulator development. The large library of algorithms in TensorFlow enables us to use native implementations of the required activation functions and batch normalization. Moreover, any regular TensorFlow code of a DNN can be easily ported to our simulator. As shown in Supplementary Fig. 7, custom-made TensorFlow operations generate PCM conductance values from the behavioral model of hardware PCM devices that was developed (see Supplementary Note 2). All the nonidealities, including conductance range, programming noise, read noise, and conductance drift, are implemented in TensorFlow following the equations shown in Supplementary Note 2. The simulator can also take the PCM conductance data measured from hardware as input, in order to perform inference on the hardware data. Data converters that simulate digital quantization of data at the input and output of crossbars are also implemented with tunable quantization ranges and precision. In this study, the data converters were turned off for all simulations except those presented in Supplementary Fig. 6. The drift correction techniques are applied after quantization of the crossbar output.
III.3 Training implementation of ResNet-32 on CIFAR-10
ResNet-32 has 31 convolution layers with $3\times3$ kernels, 2 convolution layers with $1\times1$ kernels, and a final fully-connected layer. The network contains 361,722 synaptic weights. It consists of 3 different ResNet blocks with 10 $3\times3$ kernels each. After the first convolution layer, there is a unity residual feed-forward connection after every two convolution layers, except where a residual convolution connection is used to make the output channels compatible between two layers. Each convolution layer is followed by batch normalization Ioffe and Szegedy (2015). ReLU activation is used after every batch normalization except in the case of residual connections, where the ReLU activation is computed after summation. The output of the last convolution layer is then downsampled using global average pooling Zhou et al. (2015), which is followed by a single fully-connected layer. For the last fully-connected layer, no batch normalization is performed. The architecture of ResNet-32 used in this study is a slightly modified version of the original implementation He et al. (2016) with fewer input and output channels in ResNet blocks 2 and 3. This network is trained on the well-known CIFAR-10 classification dataset Krizhevsky, Nair, and Hinton. The dataset consists of $32\times32$-pixel RGB images, each belonging to one of 10 classes.
The network is trained on the 50,000 images of the training set, and evaluation is performed on the 10,000 images of the test set. The training is performed using stochastic gradient descent with a momentum of 0.9. The network objective is the categorical cross-entropy function over the 10 classes of the input image. Learning rate scheduling is performed to reduce the learning rate by 90% at every 50th training epoch. The initial learning rate for the baseline network is 0.1, and training converges in 200 epochs with a mini-batch size of 128. Weights of all convolution and fully-connected layers of the baseline network are initialized using He Normal He et al. (2015) initialization. The baseline network is retrained by injecting Gaussian noise for up to 150 epochs with weight clip scale . We preprocess the training images by randomly cropping a $32\times32$ patch after padding 2 pixels along the height and width of the image. We also apply a random horizontal flip on the images from the training set. Additionally, we apply cutout Devries and Taylor (2017) on the training set images. For both the training and test sets, we apply channel-wise normalization to zero mean and unit standard deviation.
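The learning rate schedule and the crop-and-flip preprocessing described above can be sketched in a few lines of NumPy. The function names are hypothetical, zero-valued padding is an assumed choice, and cutout and channel-wise normalization are omitted for brevity.

```python
import numpy as np

def learning_rate(epoch, base_lr=0.1, drop_every=50, factor=0.1):
    """Step schedule: the rate is cut by 90% at every 50th epoch."""
    return base_lr * factor ** (epoch // drop_every)

def augment(img, rng, pad=2):
    """Pad, take a random crop of the original size, then flip at random."""
    h, w, _ = img.shape
    padded = np.pad(img, ((pad, pad), (pad, pad), (0, 0)), mode="constant")
    top = rng.integers(0, 2 * pad + 1)
    left = rng.integers(0, 2 * pad + 1)
    out = padded[top:top + h, left:left + w]
    if rng.random() < 0.5:
        out = out[:, ::-1]                 # horizontal flip
    return out

rng = np.random.default_rng(0)
img = rng.random((32, 32, 3))              # stand-in for a CIFAR-10 image
aug = augment(img, rng)
schedule = [learning_rate(e) for e in (0, 49, 50, 100, 150)]
```

With an initial rate of 0.1, this schedule steps through 0.1, 0.01, 0.001, and 0.0001 over the 200 epochs of baseline training.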
III.4 Training implementation of ResNet-34 on ImageNet
The architecture of the ResNet-34 network for ImageNet classification is derived from Ref. He et al., 2016. It has 32 convolution layers with $3\times3$ kernels, 3 convolution layers with $1\times1$ kernels, a first convolution layer with $7\times7$ kernels, and a final fully-connected layer. The network has 21,797,672 synaptic weights. The first convolution layer downsamples the input by using a stride of 2 pixels, followed by a max-pooling layer with a kernel size of $3\times3$ and a stride of 2 to downsample the feature maps to a resolution of $56\times56$ pixels. Each residual connection with convolution and the first layer of ResNet blocks 2, 3, and 4 downsample the input by using a stride of 2 pixels. A global average pooling layer before the final fully-connected layer downsamples the input to $1\times1$ resolution. The final fully-connected layer computes the output prediction corresponding to 1,000 classes.
We trained ResNet-34 on the ImageNet Russakovsky et al. (2015) dataset. The ImageNet dataset has 1.3M images in the training set and 50k images in the test set. Images in the ImageNet dataset are preprocessed by following the same preprocessing steps as those of the PyTorch baseline model. Training images are randomly cropped to a $224\times224$ patch, and a random horizontal flip is then applied. Channel-wise normalization to zero mean and unit standard deviation is performed on the images in both the training and test sets. For the test set only, images are first resized to $256\times256$ using bilinear interpolation, and a center crop is then performed to obtain the $224\times224$ image patch.
The network objective function is softmax cross entropy on the network output and the corresponding 1,000 labels. The network objective is minimized using the stochastic gradient descent algorithm with a momentum of 0.9. We obtained our baseline network architecture and its parameters from the PyTorch model zoo. We use this network to perform additive noise training by injecting Gaussian noise for a total of 10 training epochs. In contrast to ResNet-32 on CIFAR-10, no learning rate scheduling was performed, since the network was trained for only 10 epochs with additive noise. We use a mini-batch size of 400 and a learning rate of 0.001 for the additive noise training simulations. We also use L2 weight decay of 0.0001 and weight clip scale of for the additive noise training.
III.5 Global drift compensation (GDC) method
The GDC calibration phase consists of computing the summed current of $L$ columns in each array encoding a network layer (see Supplementary Fig. 4). Those columns contain devices initially programmed to known conductance values. By reading those column currents, $I_i(t)$, periodically with a constant voltage applied on all rows, we can compensate for a global conductance shift in the array during inference. When input data is processed by the crossbar during inference, the crossbar output can be scaled by $1/\hat{\alpha}$, where

$$\hat{\alpha} = \frac{\sum_{i=1}^{L} I_i(t)}{\sum_{i=1}^{L} I_i(t_0)}$$

and $t_0$ is the initial time at which the reference currents are read. This procedure is especially simple because $L$ can be chosen to be small, as long as it is large enough to get sufficient statistics. Moreover, $\hat{\alpha}$ is computed from the device data itself, without resorting to any assumption on how the conductance changes and without requiring extra timing information. The term $\sum_{i=1}^{L} I_i(t_0)$ needs to be computed only once, is stored in the digital memory of the chip, and is reused for all calibrations. Reading the subset of columns of the crossbar can be done while the PCM array is idle, i.e., when there are no incoming images to be processed by the device. The current summations can be implemented either with on-chip digital circuitry or in the control unit of the chip. At the end of the calibration phase, $\hat{\alpha}$ is computed and stored locally in the digital unit of the crossbar. The output scaling by $1/\hat{\alpha}$ during inference can be combined with batch normalization because it is a linear operation. In our experiments, the calibration procedure was performed using all columns of each layer (e.g. $L$ is equal to two times the number of output channels) every time before inference is performed on the whole test set.
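As an illustration of the GDC procedure, the NumPy sketch below computes a single scaling factor from summed column currents and shows how it can be folded into the batch normalization statistics. It is a sketch under stated assumptions, not the authors' code: the function names, the calibration voltage, the array size, and the power-law drift model with a uniform exponent used to generate the toy data are all hypothetical.

```python
import numpy as np

def gdc_factor(currents_now, currents_ref):
    """Ratio of summed column currents now to those at the reference time."""
    return np.sum(currents_now) / np.sum(currents_ref)

def fold_into_batchnorm(alpha, mean, var, eps=1e-5):
    """Return a BN mean and standard deviation such that normalizing the raw
    (drifted) output equals normalizing the output divided by alpha."""
    return alpha * mean, alpha * np.sqrt(var + eps)

# Toy crossbar with power-law drift (assumed model and exponent).
rng = np.random.default_rng(0)
G0 = rng.uniform(1.0, 25.0, size=(128, 16))    # conductances at t0 (uS)
t0, t, nu = 25.0, 86400.0, 0.05                # seconds, seconds, exponent
G_t = G0 * (t / t0) ** (-nu)                   # drifted conductances

v_cal = np.full(128, 0.2)                      # assumed calibration voltage (V)
alpha = gdc_factor(v_cal @ G_t, v_cal @ G0)

x = rng.uniform(0.0, 0.2, size=128)            # an inference input
y_compensated = (x @ G_t) / alpha
```

In this toy case every device drifts with the same exponent, so the compensation is exact. With device-to-device drift variability the single factor only corrects the mean shift, which is the limitation that motivates a per-layer statistics update.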
III.6 Adaptive batch normalization statistics update (AdaBS) technique
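Batch normalization is performed differently in the training and inference phases of a DNN. During the training of a DNN, batch normalization normalizes the input to zero mean and unit variance by computing the mean ($\mu_B$) and variance ($\sigma_B^2$) over the mini-batch of $m$ images

$$\mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i, \qquad \sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m} \left(x_i - \mu_B\right)^2,$$

where $x_i$ is the input for the $i$-th image of the mini-batch. The normalized input is then scaled and shifted by $\gamma$ and $\beta$. During the training phase, $\gamma$ and $\beta$ are learned through backpropagation. In parallel, a global running mean ($\mu$) and variance ($\sigma^2$) are computed by exponentially averaging $\mu_B$ and $\sigma_B^2$, respectively, over all the training batches

$$\mu \leftarrow p\,\mu + (1-p)\,\mu_B, \qquad \sigma^2 \leftarrow p\,\sigma^2 + (1-p)\,\sigma_B^2,$$

where $p$ is the momentum. After training, the estimates of the global mean and variance, $\mu$ and $\sigma^2$, are then used during the inference phase. When performing forward propagation during inference, the batch normalization coefficients $\mu$, $\sigma^2$, $\gamma$, and $\beta$ are used for normalization, scale, and shift.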
The calibration phase of AdaBS consists of recomputing and updating $\mu$ and $\sigma^2$ for every layer where batch normalization is present. We recompute $\mu$ and $\sigma^2$ by feeding a randomly sampled set of mini-batches from the training dataset. In recomputing $\mu$ and $\sigma^2$, hyper-parameters such as mini-batch size ($m$) and momentum ($p$) need to be carefully tuned to achieve the best network accuracy.
For AdaBS calibration, we observed that using an optimal value of the momentum is necessary to achieve good inference accuracy evolution over time. For this, we have developed an algorithm to estimate the optimal value of momentum by an empirical analysis, which is explained in Supplementary Note 3. Based on this analysis, the formula we used to compute the optimal momentum as a function of the number of injected mini-batches is
[TABLE]
Using Eq. (8) to compute the momentum, we found that with a fixed mini-batch size of images, it is sufficient to inject mini-batches for the AdaBS calibration of the ResNet-32 network, that is approximately 5.2% of the CIFAR-10 training set (2,600 images). The sensitivity of the accuracy to the number of images used for AdaBS calibration is shown in Supplementary Note 3. For ResNet-34 on ImageNet, we used mini-batch size of and mini-batches, that is 0.1% of the ImageNet training set (1,300 images). In the experiments presented in Fig. 5, AdaBS calibration was performed for every layer before performing inference on the test set, except the last layer because it does not have batch normalization.
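The AdaBS calibration can be sketched as a loop that refreshes the running batch normalization statistics from calibration mini-batches while leaving the learned scale and shift untouched. The sketch below is illustrative only: the function names are hypothetical, the exponential-average update follows the TensorFlow-style convention (an assumption), the momentum of 0.5 is a placeholder and not the value given by Eq. (8), and the split of 2,600 calibration images into 13 mini-batches of 200 is an assumed example.

```python
import numpy as np

def adabs_calibrate(preacts, run_mean, run_var, momentum):
    """Update running BN statistics from calibration mini-batches.

    preacts: iterable of arrays of shape (batch, channels), the
    preactivations produced by the drifted weights for each mini-batch.
    """
    for batch in preacts:
        batch_mean = batch.mean(axis=0)
        batch_var = batch.var(axis=0)
        run_mean = momentum * run_mean + (1.0 - momentum) * batch_mean
        run_var = momentum * run_var + (1.0 - momentum) * batch_var
    return run_mean, run_var

def batchnorm_inference(x, run_mean, run_var, gamma, beta, eps=1e-5):
    """Inference-mode batch normalization with fixed running statistics."""
    return gamma * (x - run_mean) / np.sqrt(run_var + eps) + beta

rng = np.random.default_rng(0)
channels, batch_size, n_batches = 4, 200, 13   # assumed split of 2,600 images

def drifted(n):
    """Hypothetical drifted preactivations: scaled down and shifted."""
    return 0.7 * rng.normal(size=(n, channels)) + 0.3

stale_mean, stale_var = np.zeros(channels), np.ones(channels)
batches = [drifted(batch_size) for _ in range(n_batches)]
new_mean, new_var = adabs_calibrate(batches, stale_mean, stale_var, momentum=0.5)
out = batchnorm_inference(drifted(5000), new_mean, new_var, gamma=1.0, beta=0.0)
```

With few calibration mini-batches, a momentum close to 1 would leave the stale statistics largely in place, which is why the momentum has to be chosen as a function of the number of mini-batches.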
Acknowledgments
We thank O. Hilliges for discussions, and our colleagues at IBM TJ Watson Research Center, in particular M. BrightSky, for help with fabricating the PCM prototype chip used in this work. This work was partially funded by the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement number 682675).
Author Contributions
V.J., M.L., S.H., C.P. and A.S. conceived the training methodology. V.J., M.L., S.H., I.B. and A.S. conceived the drift correction techniques. V.J. and S.H. performed the software training and inference simulations under the guidance of M.L. I.B. performed the PCM hardware experiments with the support of V.J. S.R.N. and V.J. developed the PCM model. V.J. and C.P. developed the PCM deep learning inference TensorFlow-based software. M.D. provided critical in-memory computing hardware insights and performed the ResNet-32 performance estimation. M.L. wrote the manuscript with input from all authors. M.L., A.S., B.R. and E.E. supervised the project.
References
- Jouppi et al. (2017) N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers, et al., “In-datacenter performance analysis of a tensor processing unit,” in 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA) (IEEE, 2017) pp. 1–12.
- Jia et al. (2019) Z. Jia, M. Maggioni, J. Smith, and D. P. Scarpazza, “Dissecting the NVidia turing T4 GPU via microbenchmarking,” arXiv preprint arXiv:1903.07486 (2019).
- Shafiee et al. (2016) A. Shafiee, A. Nag, N. Muralimanohar, R. Balasubramonian, J. P. Strachan, M. Hu, R. S. Williams, and V. Srikumar, “ISAAC: A convolutional neural network accelerator with in-situ analog arithmetic in crossbars,” in 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA) (2016) pp. 14–26.
- Merrikh-Bayat et al. (2018) F. Merrikh-Bayat, X. Guo, M. Klachko, M. Prezioso, K. K. Likharev, and D. B. Strukov, “High-performance mixed-signal neurocomputing with nanoscale floating-gate memory cell arrays,” IEEE Transactions on Neural Networks and Learning Systems 29, 4782–4790 (2018).
- Chen et al. (2019) W.-H. Chen, C. Dou, K.-X. Li, W.-Y. Lin, P.-Y. Li, J.-H. Huang, J.-H. Wang, W.-C. Wei, C.-X. Xue, Y.-C. Chiu, et al., “CMOS-integrated memristive non-volatile computing-in-memory for AI edge processors,” Nature Electronics 2, 420–428 (2019).
- Hu et al. (2018) M. Hu, C. E. Graves, C. Li, Y. Li, N. Ge, E. Montgomery, N. Davila, H. Jiang, R. S. Williams, J. J. Yang, et al., “Memristor-based analog computation and neural network classification with a dot product engine,” Advanced Materials 30, 1705914 (2018).
- Le Gallo et al. (2018) M. Le Gallo, A. Sebastian, R. Mathis, M. Manica, H. Giefers, T. Tuma, C. Bekas, A. Curioni, and E. Eleftheriou, “Mixed-precision in-memory computing,” Nature Electronics 1, 246 (2018).
- Boybat et al. (2018) I. Boybat, M. Le Gallo, S. Nandakumar, T. Moraitis, T. Parnell, T. Tuma, B. Rajendran, Y. Leblebici, A. Sebastian, and E. Eleftheriou, “Neuromorphic computing with multi-memristive synapses,” Nature Communications 9, 2514 (2018).
- Ambrogio et al. (2018) S. Ambrogio, P. Narayanan, H. Tsai, R. M. Shelby, I. Boybat, C. Nolfo, S. Sidler, M. Giordano, M. Bodini, N. C. Farinha, et al., “Equivalent-accuracy accelerated neural-network training using analogue memory,” Nature 558, 60 (2018).
- Nandakumar et al. (2018) S. Nandakumar, M. Le Gallo, I. Boybat, B. Rajendran, A. Sebastian, and E. Eleftheriou, “Mixed-precision architecture based on computational memory for training deep neural networks,” in International Symposium on Circuits and Systems (ISCAS) (IEEE, 2018) pp. 1–5.
- Mohanty et al. (2017) A. Mohanty, X. Du, P.-Y. Chen, J.-s. Seo, S. Yu, and Y. Cao, “Random sparse adaptation for accurate inference with inaccurate multi-level rram arrays,” in 2017 IEEE International Electron Devices Meeting (IEDM) (IEEE, 2017) pp. 6–3.
- Gonugondla, Kang, and Shanbhag (2018) S. K. Gonugondla, M. Kang, and N. R. Shanbhag, “A variation-tolerant in-memory machine learning classifier via on-chip training,” IEEE Journal of Solid-State Circuits 53, 3163–3173 (2018).
- Liu et al. (2015) B. Liu, H. Li, Y. Chen, X. Li, Q. Wu, and T. Huang, “Vortex: variation-aware training for memristor x-bar,” in Proceedings of the 52nd Annual Design Automation Conference (ACM, 2015) p. 15.
- Chen et al. (2017) L. Chen, J. Li, Y. Chen, Q. Deng, J. Shen, X. Liang, and L. Jiang, “Accelerator-friendly neural-network training: Learning variations and defects in RRAM crossbar,” in Proceedings of the Conference on Design, Automation & Test in Europe (European Design and Automation Association, 2017) pp. 19–24.
- Moon, Shin, and Jeon (2019) S. Moon, K. Shin, and D. Jeon, “Enhancing reliability of analog neural network processors,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems 27, 1455–1459 (2019).
- Miyashita et al. (2017) D. Miyashita, S. Kousai, T. Suzuki, and J. Deguchi, “A neuromorphic chip optimized for deep learning and CMOS technology with time-domain analog and digital mixed-signal processing,” IEEE Journal of Solid-State Circuits 52, 2679–2689 (2017).
- Klachko, Mahmoodi, and Strukov (2019) M. Klachko, M. R. Mahmoodi, and D. B. Strukov, “Improving noise tolerance of mixed-signal neural networks,” arXiv preprint arXiv:1904.01705 (2019).
- Rekhi et al. (2019) A. S. Rekhi, B. Zimmer, N. Nedovic, N. Liu, R. Venkatesan, M. Wang, B. Khailany, W. J. Dally, and C. T. Gray, “Analog/mixed-signal hardware error modeling for deep learning inference,” in Proceedings of the 56th Annual Design Automation Conference (ACM, 2019) pp. 81:1–81:6.
- He et al. (2016) K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016) pp. 770–778.
- Gokmen, Onen, and Haensch (2017) T. Gokmen, M. Onen, and W. Haensch, “Training Deep Convolutional Neural Networks with Resistive Cross-Point Devices,” Frontiers in Neuroscience 11, 1–22 (2017), arXiv:1705.08014.
- Merolla et al. (2016) P. Merolla, R. Appuswamy, J. Arthur, S. K. Esser, and D. Modha, “Deep neural networks are robust to weight binarization and other non-linear distortions,” arXiv preprint arXiv:1606.01981 (2016).
- Blundell et al. (2015) C. Blundell, J. Cornebise, K. Kavukcuoglu, and D. Wierstra, “Weight uncertainty in neural networks,” arXiv preprint arXiv:1505.05424 (2015).
- Gulcehre et al. (2016) C. Gulcehre, M. Moczulski, M. Denil, and Y. Bengio, “Noisy activation functions,” in International Conference on Machine Learning (2016) pp. 3059–3068.
- Neelakantan et al. (2015) A. Neelakantan, L. Vilnis, Q. V. Le, I. Sutskever, L. Kaiser, K. Kurach, and J. Martens, “Adding gradient noise improves learning for very deep networks,” arXiv preprint arXiv:1511.06807 (2015).
- An (1996) G. An, “The effects of adding noise during backpropagation training on a generalization performance,” Neural Computation 8, 643–674 (1996).
- Jim, Horne, and Giles (1994) K. Jim, B. G. Horne, and C. L. Giles, “Effects of noise on convergence and generalization in recurrent networks,” in Proceedings of the 7th International Conference on Neural Information Processing Systems, NIPS’94 (MIT Press, Cambridge, MA, USA, 1994) pp. 649–656.
- Gupta et al. (2015) S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan, “Deep learning with limited numerical precision,” in Proceedings of the 32nd International Conference on Machine Learning (ICML-15) (2015) pp. 1737–1746.
- McKinstry et al. (2018) J. L. McKinstry, S. K. Esser, R. Appuswamy, D. Bablani, J. V. Arthur, I. B. Yildiz, and D. S. Modha, “Discovering low-precision networks close to full-precision networks for efficient embedded inference,” CoRR abs/1809.04191 (2018), arXiv:1809.04191.
- Murray and Edwards (1994) A. F. Murray and P. J. Edwards, “Enhanced MLP performance and fault tolerance resulting from synaptic weight noise during training,” IEEE Transactions on Neural Networks 5, 792–802 (1994).
- He et al. (2015) K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification,” CoRR abs/1502.01852 (2015), arXiv:1502.01852.
- Rastegari et al. (2016) M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, “XNOR-Net: ImageNet classification using binary convolutional neural networks,” in European Conference on Computer Vision (Springer, 2016) pp. 525–542.
- Close et al. (2010) G. Close, U. Frey, M. Breitwisch, H. Lung, C. Lam, C. Hagleitner, and E. Eleftheriou, “Device, circuit and system-level analysis of noise in multi-bit phase-change memory,” in 2010 IEEE International Electron Devices Meeting (IEDM) (IEEE, 2010) pp. 29–5.
- Burr et al. (2016) G. W. Burr, M. J. Brightsky, A. Sebastian, H.-Y. Cheng, J.-Y. Wu, S. Kim, N. E. Sosa, N. Papandreou, H.-L. Lung, H. Pozidis, et al., “Recent progress in phase-change memory technology,” IEEE Journal on Emerging and Selected Topics in Circuits and Systems 6, 146–162 (2016).
- Le Gallo et al. (2018) M. Le Gallo, A. Sebastian, G. Cherubini, H. Giefers, and E. Eleftheriou, “Compressed sensing with approximate message passing using in-memory computing,” IEEE Transactions on Electron Devices 65, 4304–4312 (2018).
- Tsai et al. (2019) H. Tsai, S. Ambrogio, C. Mackin, P. Narayanan, R. M. Shelby, K. Rocki, A. Chen, and G. W. Burr, “Inference of long-short term memory networks at software-equivalent accuracy using 2.5M analog phase change memory devices,” in 2019 Symposium on VLSI Technology (2019) pp. T82–T83.
- Li, Zhang, and Liu (2016) F. Li, B. Zhang, and B. Liu, “Ternary weight networks,” arXiv preprint arXiv:1605.04711 (2016).
- Le Gallo et al. (2018) M. Le Gallo, D. Krebs, F. Zipoli, M. Salinga, and A. Sebastian, “Collective structural relaxation in phase-change memory devices,” Advanced Electronic Materials 4, 1700627 (2018).
- Venkatesh, Nurvitadhi, and Marr (2017) G. Venkatesh, E. Nurvitadhi, and D. Marr, “Accelerating deep convolutional networks using low-precision and sparsity,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2017) pp. 2861–2865.
- Bishop (1995) C. M. Bishop, “Training with noise is equivalent to Tikhonov regularization,” Neural Computation 7, 108–116 (1995).
- Dazzi et al. (2019) M. Dazzi, A. Sebastian, P. A. Francese, T. Parnell, L. Benini, and E. Eleftheriou, “5 parallel prism: A topology for pipelined implementations of convolutional neural networks using computational memory,” arXiv preprint arXiv:1906.03474 (2019).
- Sacco et al. (2017) E. Sacco, P. A. Francese, M. Brändli, C. Menolfi, T. Morf, A. Cevrero, I. Ozkaya, M. Kossel, L. Kull, D. Luu, H. Yueksel, G. Gielen, and T. Toifl, “A 5Gb/s 7.1fJ/b/mm 8x multi-drop on-chip 10mm data link in 14nm FinFET CMOS SOI at 0.5V,” in 2017 Symposium on VLSI Circuits (2017) pp. C54–C55.
- Andri et al. (2017) R. Andri, L. Cavigelli, D. Rossi, and L. Benini, “YodaNN: An architecture for ultralow power binary-weight CNN acceleration,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 37, 48–60 (2017).
- Courbariaux, Bengio, and David (2015) M. Courbariaux, Y. Bengio, and J.-P. David, “BinaryConnect: Training deep neural networks with binary weights during propagations,” in Advances in Neural Information Processing Systems (2015) pp. 3123–3131.
- Breitwisch et al. (2007) M. Breitwisch, T. Nirschl, C. Chen, Y. Zhu, M. Lee, M. Lamorey, G. Burr, E. Joseph, A. Schrott, J. Philipp, et al., “Novel lithography-independent pore phase change memory,” in Proc. IEEE Symposium on VLSI Technology (2007) pp. 100–101.
- Papandreou et al. (2011) N. Papandreou, H. Pozidis, A. Pantazi, A. Sebastian, M. Breitwisch, C. Lam, and E. Eleftheriou, “Programming algorithms for multilevel phase-change memory,” in Proc. International Symposium on Circuits and Systems (ISCAS) (2011) pp. 329–332.
- Abadi et al. (2015) M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, “TensorFlow: Large-scale machine learning on heterogeneous systems,” (2015), software available from tensorflow.org.
- Ioffe and Szegedy (2015) S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” CoRR abs/1502.03167 (2015), arXiv:1502.03167.
- Zhou et al. (2015) B. Zhou, A. Khosla, À. Lapedriza, A. Oliva, and A. Torralba, “Learning deep features for discriminative localization,” CoRR abs/1512.04150 (2015), arXiv:1512.04150.
- (49) A. Krizhevsky, V. Nair, and G. Hinton, “CIFAR-10 (Canadian Institute for Advanced Research).”
- Devries and Taylor (2017) T. Devries and G. W. Taylor, “Improved regularization of convolutional neural networks with cutout,” CoRR abs/1708.04552 (2017), arXiv:1708.04552.
- Russakovsky et al. (2015) O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, “ImageNet large scale visual recognition challenge,” Int. J. Comput. Vision 115, 211–252 (2015).
- (52) “Torchvision.models,” https://pytorch.org/docs/stable/torchvision/models.html, accessed: 2019-10-24.
