Cheetah: Mixed Low-Precision Hardware & Software Co-Design Framework for DNNs on the Edge
Hamed F. Langroudi, Zachariah Carmichael, David Pastuch, Dhireesha Kudithipudi

TL;DR
Cheetah is a co-design framework enabling low-precision DNN training and inference on edge devices using mixed numerical formats, notably posits, to improve efficiency and performance.
Contribution
It introduces a versatile framework supporting posit-based DNN training and inference, with mixed-precision formats, for edge computing applications.
Findings
16-bit posits outperform 16-bit floating point in training.
Inference with 5-8 bit posits improves performance-energy trade-offs.
Framework supports various quantization approaches and formats.
Abstract
Low-precision DNNs have been extensively explored in order to reduce the size of DNN models for edge devices. Recently, the posit numerical format has shown promise for DNN data representation and compute with ultra-low precision in [5..8]-bits. However, previous studies were limited to studying posit for DNN inference only. In this paper, we propose the Cheetah framework, which supports both DNN training and inference using posits, as well as other commonly used formats. Additionally, the framework is amenable for different quantization approaches and supports mixed-precision floating point and fixed-point numerical formats. Cheetah is evaluated on three datasets: MNIST, Fashion MNIST, and CIFAR-10. Results indicate that 16-bit posits outperform 16-bit floating point in DNN training. Furthermore, performing inference with [5..8]-bit posits improves the trade-off between performance and…
| Format | Dynamic Range 8-bit |
|---|---|
| Posit (es=0) | 94.12% |
| Posit (es=1) | 81.57% |
| Posit (es=2) | 69.02% |
| Float (=4) | 66.66% |
| Float (=3) | 85.71% |
| Fixed-point (=4) | 100.0% |
| Dataset | Layers | # Parameters | # EMAC Ops | Memory | Accuracy |
|---|---|---|---|---|---|
| MNIST | 4 FC | 0.34 M | 0.78 k | 1.34 MB | 98.46% |
| MNIST | 2 Conv, 2 FC, 1 PL | 1.40 M | 58.7 k | 5.84 MB | 99.32% |
| Fashion-MNIST | 4 FC | 0.34 M | 0.78 k | 1.34 MB | 89.51% |
| Fashion-MNIST | 2 Conv, 3 FC, 2 PL, 1 BN | 1.88 M | 69.8 k | 7.77 MB | 92.54% |
| CIFAR-10 | 7 Conv, 1 FC, 3 PL | 0.95 M | 312.6 k | 6.23 MB | 81.37% |
| Dataset | DNN | Posit | | | | Float | | | | Fixed | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | | 8-bit | 7-bit | 6-bit | 5-bit | 8-bit | 7-bit | 6-bit | 5-bit | 8-bit | 7-bit | 6-bit | 5-bit |
| MNIST | FC | 98.45% | 98.39% | 98.37% | 98.30% | 98.42% | 98.39% | 98.33% | 93.91% | 98.31% | 97.95% | 97.87% | 97.88% |
| MNIST | Conv | 99.35% | 99.33% | 99.20% | 98.94% | 99.34% | 99.25% | 99.12% | 92.27% | 99.18% | 97.14% | 97.08% | 96.96% |
| Fashion MNIST | FC | 89.59% | 89.44% | 89.24% | 88.14% | 89.56% | 89.36% | 88.92% | 83.00% | 89.16% | 87.27% | 85.20% | 83.97% |
| Fashion MNIST | Conv | 92.70% | 92.60% | 91.64% | 88.92% | 92.63% | 92.22% | 89.58% | 68.21% | 89.59% | 88.63% | 85.31% | 83.46% |
| CIFAR-10 | Conv | 80.40% | 76.90% | 68.51% | 41.33% | 79.75% | 76.09% | 53.68% | 12.83% | 24.27% | 17.43% | 12.54% | 9.71% |
| Numerical Format | Rounding Quantization | | | | Linear-Quantization with Multiplication | | | | Linear-Quantization with Shift | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | 8-bit | 7-bit | 6-bit | 5-bit | 8-bit | 7-bit | 6-bit | 5-bit | 8-bit | 7-bit | 6-bit | 5-bit |
| Posit () | 98.42% | 98.37% | 98.30% | 91.05% | 98.46% | 98.48% | 98.46% | 98.19% | 98.48% | 98.46% | 98.39% | 98.28% |
| Posit () | 98.45% | 98.39% | 98.34% | 98.30% | 98.49% | 98.47% | 98.42% | 98.34% | 98.48% | 98.42% | 98.38% | 98.42% |
| Posit () | 98.44% | 98.39% | 98.37% | 98.16% | 98.45% | 98.49% | 98.38% | 97.96% | 98.46% | 98.41% | 98.41% | 98.13% |
| Fixed-point | 98.31% | 97.95% | 97.87% | 97.88% | 98.47% | 98.32% | 98.11% | 96.41% | 98.42% | 98.29% | 98.16% | 97.17% |
| Floating point | 98.42% | 98.39% | 98.33% | 93.91% | 98.46% | 98.42% | 98.36% | 98.02% | 98.46% | 98.45% | 98.38% | 98.06% |
| 32-bit Floating point | 98.46% | | | | 98.46% | | | | 98.46% | | | |
| Numerical Format | Rounding Quantization | | | | Linear-Quantization with Multiplication | | | | Linear-Quantization with Shift | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | 8-bit | 7-bit | 6-bit | 5-bit | 8-bit | 7-bit | 6-bit | 5-bit | 8-bit | 7-bit | 6-bit | 5-bit |
| Posit () | 89.57% | 89.21% | 88.46% | 76.87% | 89.64% | 89.58% | 89.36% | 88.17% | 89.59% | 89.61% | 88.31% | 88.10% |
| Posit () | 89.59% | 89.44% | 89.22% | 88.14% | 89.58% | 89.52% | 89.35% | 88.98% | 89.58% | 89.45% | 89.48% | 89.07% |
| Posit () | 89.56% | 89.33% | 89.24% | 87.07% | 89.53% | 89.55% | 88.98% | 87.06% | 89.49% | 89.52% | 89.18% | 87.06% |
| Fixed-point | 89.16% | 87.27% | 85.20% | 83.97% | 89.52% | 88.83% | 87.46% | 76.58% | 89.40% | 88.93% | 87.10% | 82.10% |
| Floating point | 89.56% | 89.36% | 88.92% | 83.00% | 89.59% | 89.45% | 89.00% | 87.25% | 89.73% | 89.32% | 88.86% | 87.37% |
| 32-bit Floating point | 89.51% | | | | 89.51% | | | | 89.51% | | | |
| Task | Format | Accuracy |
|---|---|---|
| MNIST | Posit-32 | 98.131% |
| MNIST | Float-32 | 98.087% |
| MNIST | Posit-16 | 96.535% |
| MNIST | Float-16 | 90.646% |
| Fashion MNIST | Posit-32 | 89.263% |
| Fashion MNIST | Float-32 | 89.105% |
| Fashion MNIST | Posit-16 | 87.400% |
| Fashion MNIST | Float-16 | 81.725% |
| | Courbariaux et al. [45] | Gysel et al. [16] | Hashemi et al. [15] | Carmichael et al. [20] | Wang et al. [14] | Johnson et al. [22] | This Work |
|---|---|---|---|---|---|---|---|
| Dataset | MNIST, CIFAR-10, SVHN | ImageNet | MNIST, CIFAR-10, SVHN | WI BC, Iris, Mushroom, MNIST, FMNIST | ImageNet | ImageNet | MNIST, FMNIST, CIFAR-10 |
| Numerical Format | FP, FX, BFP | FP, FX, BFP | FP, FX, Binary | FP, FX, PS | FP | FX, FP, PS | FX, FP, PS |
| Bit-precision | 12 | 8 | All | [5..8] | All | 8 | [5..8] |
| Utility | Training | Inference | Inference | Inference | Training | Inference | Inference & Training |
| Inference Quantization | - | Rounding | Rounding | Rounding | - | Log | Rounding & Linear |
| Implementation | SW | SW & HW | SW & HW | SW & HW | SW & HW | SW & HW | SW & HW |
| DNN library | Theano | Caffe | Caffe | Keras/TensorFlow | Home Suite | PyTorch | Keras/TensorFlow |
| Device | - | ASIC | ASIC | Virtex-7 FPGA | ASIC | ASIC | Virtex-7 FPGA |
| Technology Node | - | 65 nm | 65 nm | 28 nm | 14 nm | 28 nm | 28 nm |
Hamed F. Langroudi, Zachariah Carmichael, David Pastuch, and Dhireesha Kudithipudi are with the Department of Computer Engineering, Rochester Institute of Technology, Rochester, NY, USA.
Abstract
Low-precision DNNs have been extensively explored in order to reduce the size of DNN models for edge devices. Recently, the posit numerical format has shown promise for DNN data representation and compute with ultra-low precision in [5..8]-bits. However, previous studies were limited to studying posit for DNN inference only. In this paper, we propose the Cheetah framework, which supports both DNN training and inference using posits, as well as other commonly used formats. Additionally, the framework is amenable to different quantization approaches and supports mixed-precision floating point and fixed-point numerical formats. Cheetah is evaluated on three datasets: MNIST, Fashion MNIST, and CIFAR-10. Results indicate that 16-bit posits outperform 16-bit floating point in DNN training. Furthermore, performing inference with [5..8]-bit posits improves the trade-off between performance and energy-delay-product over both [5..8]-bit float and fixed-point.
Index Terms:
Deep neural networks, low-precision arithmetic, posit numerical format
I Introduction
Edge computing is an emerging design paradigm that offers intelligence-at-the-edge of mobile networks, while addressing some of the shortcomings of cloud datacenters [1]. The nodes of the edges host the computing, storage, and communication capabilities, which provide on-demand learning for several applications, such as intelligent transportation, smart cities, and industrial robotics. Inherent characteristics of edge devices include low latency, reduced data movement cost, low communication bandwidth, and decentralized real-time processing [2, 3]. However, deploying intelligence-at-the-edge is a formidable challenge for several of the deep neural network (DNN) models. For instance, DNN inference with AlexNet requires 61 M parameters and 1.4 gigaFLOPS [4]. Moreover, the cost of the multiply-and-accumulate (MAC) units, a fundamental DNN operation, is non-trivial. In a 45 nm CMOS process, energy consumption doubles from 16-bit floats to 32-bit floats for addition and it increases by 4x for multiplication [5]. Memory access cost increases by 10x from 8 k to 1 M memory size with 64-bit cache [5]. In general, there is a gap between memory storage, bandwidth, compute requirements, and energy consumption of today’s DNN models and hardware resources available on edge devices [6, 7].
An apparent solution to address this gap is to compress the size of the networks and reduce the computation requirements to match putative edge resources. Several groups have proposed compressed DNN models with new compute- and memory-efficient neural networks [8, 9, 10] and parameter-efficient neural networks, such as DNN pruning [11], distillation [12], and low-precision arithmetic [13, 14].
Among these approaches to compress DNN models, low-precision arithmetic is noted for its ability to reduce the memory capacity, bandwidth, latency, and energy consumption associated with MAC units in DNNs, and to increase the level of data parallelism [13, 15, 16]. For instance, DNN inference with compressed models, such as MobileNet with 8-bit fixed-point parameters, utilizes only 4.2 M parameters and 1.1 megaFLOPS [8]. While this alleviates some of the design constraints for the edge, DNN models must still run quickly with high accuracy for complex visual or video recognition tasks on-device. Therefore, a conflicting design constraint here is that the network's precision cannot compromise a DNN's overall performance. For instance, there is a gap between the performance of low-precision DNN models (e.g., MobileNet with 8-bit fixed-point DNN parameters) and high-precision DNN models (e.g., MobileNet with 32-bit floating point DNN parameters) for real-time (30 FPS) classification on ImageNet data with a Snapdragon 835 LITTLE core [13].
The ultimate goal of designing the low-precision DNN is reducing the hardware complexity of the high-precision DNN model such that it can be ported on to edge devices with performance similar to the high-precision DNN. The hardware complexity and performance in low-precision DNNs rely heavily on the quantization approach and the numerical format. Prevailing techniques, such as complex vector quantization or hardware-friendly numerical formats, lead to undesirable hardware complexity or performance penalties [17, 18].
To understand the correlation between hardware complexity and performance of low-precision neural networks for the edge, a hardware and software co-design framework is required. Previous studies have addressed this by proposing low-precision frameworks [16, 13, 14, 15, 19, 20, 21, 22]. However, the scope of these studies is limited, as highlighted below:
1. None of the previous works explore the suitability of the posit numerical format for both DNN training and inference through a comprehensive comparison with the fixed-point and floating point formats [19, 20, 21, 22].
2. There is a lack of comparison between the efficacy of quantization approaches, numerical formats, and the associated hardware complexity.
3. In most of the previous works, the comparisons across numerical formats are conducted for varying bit-widths (e.g. 32-bit floating point compared to 8-bit fixed-point [15]). Such comparisons do not offer insights on the viability of utilizing the same bit-precision across numerical formats for a particular task.
To address the gaps in previous studies, we are motivated to propose Cheetah as a comprehensive hardware and software co-design framework to explore the advantage of low-precision for both DNN training and inference. The current version of Cheetah supports three numerical formats (fixed-point, floating point, and posit), two quantization approaches (rounding and linear), and two DNN models (feedforward neural networks and convolutional neural networks).
II Background
II-A Deep Neural Network
Deep neural networks (DNNs) [23] are artificial neural networks that are used for various tasks, such as classification, regression and prediction, by learning the correlation between examples from a corpus of data called training sets [24]. These networks are capable of learning a non-linear input-to-output mapping in either a supervised, unsupervised, or semi-supervised manner. The DNN models contain a sequence of layers, each comprising a set of nodes. The connectivity between layers depends on the DNN architecture (e.g. globally connected in feedforward neural network or locally connected in convolutional neural network).
A major computation in a DNN node is the MAC operation. Specifically, a node in a feedforward or convolutional neural network computes (1), where $b$ indicates the bias vector, $W$ is the weight tensor with numerical values that are associated with each connection, $a$ represents the activation vector as input values to each node, $y$ is the feature vector at the output of each node, and $N$ equals either the number of nodes for a feedforward neural network or the product of the filter parameters (the number of filter channels, the filter height, and the filter width) for a convolutional neural network.

$$y_j = b_j + \sum_{i=1}^{N} W_{ij}\, a_i \tag{1}$$
In a supervised learning scenario for all of these networks, the correctness of classifications is given by the distance between $y$ and the desired output $t$ as calculated by $E(W)$, a cost function with respect to the weights. Then, during training, the weights are learned through stochastic gradient descent (SGD) to minimize $E$ as given by (2), where $\eta$ is the learning rate.

$$W \leftarrow W - \eta\, \frac{\partial E}{\partial W} \tag{2}$$
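The MAC-based node computation and the SGD weight update described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the squared-error cost, the toy values, and the names `layer_forward` and `sgd_step` are assumptions introduced here.

```python
import numpy as np

def layer_forward(W, a, b):
    """Each output node j computes b[j] + sum_i W[i, j] * a[i] (one MAC per term)."""
    return b + a @ W

def sgd_step(W, grad_W, lr):
    """One SGD update: move the weights against the gradient of the cost."""
    return W - lr * grad_W

# Toy layer: 3 inputs, 2 output nodes (all values are illustrative).
W = np.array([[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2]])
a = np.array([1.0, 2.0, 3.0])
b = np.array([0.5, -0.5])
t = np.array([0.0, 1.0])              # desired output

y = layer_forward(W, a, b)
# Assumed cost: E = 0.5 * ||y - t||^2, whose gradient w.r.t. W is outer(a, y - t).
grad_W = np.outer(a, y - t)
W_new = sgd_step(W, grad_W, lr=0.01)
y_new = layer_forward(W_new, a, b)
```

After one step the output moves toward the target, which is the behavior the update rule is meant to produce.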
II-B Posit Numerical Format
The posit, a Type III unum, is a new numerical format with a tapered-precision characteristic that was proposed as an alternative to the IEEE-754 floating point format for representing real numbers [25]. Posits revamp the IEEE-754 floating point format and address complaints about Type I and Type II unums [26]. Posits provide better accuracy, dynamic range, and program reproducibility than IEEE floating point. The essential advantage of posits is their capability to represent non-linearly distributed numbers in a specific dynamic range around 1 with maximum accuracy. The value of a posit number is represented by (3), where $s$ represents the sign, $es$ and $fs$ represent the maximum number of bits allocated for the exponent and fraction, respectively, $e$ and $f$ indicate the exponent and fraction values, respectively (with $0 \le f < 1$), and $k$, as computed by (4), represents the regime value.

$$x = (-1)^{s} \times \left(2^{2^{es}}\right)^{k} \times 2^{e} \times (1 + f) \tag{3}$$
The regime bit-field is encoded based on the runlength $m$ of identical bits terminated by either a regime terminating bit or the end of the $n$-bit value. Note that there is no requirement to distinguish between negative and positive zero since only a single bit pattern represents zero. Furthermore, instead of defining a NaN for exceptional values and infinity by various bit patterns, a single bit pattern ($10\ldots0$), "Not-a-Real" (NaR), represents exception values and infinity. More details about the posit number format can be found in [25].

$$k = \begin{cases} -m, & \text{if the regime bits are 0s} \\ m - 1, & \text{if the regime bits are 1s} \end{cases} \tag{4}$$
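To make the sign, regime, exponent, and fraction fields concrete, the following is a small software decoder for an n-bit posit. It is a sketch that follows the standard posit definition in [25], not code from the Cheetah framework; the function name `decode_posit` is hypothetical, and NaR is mapped to a floating point NaN only because Python has no NaR value.

```python
def decode_posit(bits: int, n: int = 8, es: int = 0) -> float:
    """Decode an n-bit posit (given as an unsigned integer) to a Python float."""
    mask = (1 << n) - 1
    bits &= mask
    if bits == 0:
        return 0.0
    if bits == 1 << (n - 1):
        return float("nan")          # NaR; NaN is only a stand-in here

    sign = bits >> (n - 1)
    if sign:
        bits = (-bits) & mask        # negative posits are stored in two's complement

    # Regime: run of identical bits right after the sign bit.
    body = [(bits >> i) & 1 for i in range(n - 2, -1, -1)]
    first = body[0]
    m = 1
    while m < len(body) and body[m] == first:
        m += 1
    k = m - 1 if first == 1 else -m

    rest = body[m + 1:]              # skip the regime terminating bit, if present
    exp_bits, frac_bits = rest[:es], rest[es:]
    e = 0
    for bit in exp_bits:
        e = (e << 1) | bit
    e <<= es - len(exp_bits)         # exponent bits cut off by the word end count as 0
    f = sum(bit * 2.0 ** -(i + 1) for i, bit in enumerate(frac_bits))

    value = 2.0 ** (k * (1 << es) + e) * (1.0 + f)
    return -value if sign else value
```

For example, with `n=8` and `es=0` the pattern `0b01000000` decodes to 1.0 and `0b01111111` decodes to 64, the largest positive value of that configuration.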
III Related Work
As early as the 1980s, low-precision arithmetic was studied for shallow neural networks to reduce compute and memory complexity for training and inference without sacrificing performance [27, 28, 29, 30]. In some scenarios, it also improves the performance of training and inference since the quantization noise generated from the use of low-precision parameters in shallow neural networks acts as a regularization method [30, 31]. The outcomes of these studies indicate that 16- and 8-bit precision DNN parameters are sufficient for training and inference on shallow networks [28, 29, 30]. The capability of low-precision arithmetic has been reevaluated in the deep learning era to reduce memory footprint and energy consumption during training and inference [32, 33, 34, 35, 14, 36, 37, 16, 15, 38, 20, 21, 19, 22].
III-A Low-Precision DNN Training
Several of the previous studies have shown that to perform DNN training, either variants of low-precision block floating point (BFP), where a block of floating point DNN parameters used a shared exponent [39], such as Flexpoint [35] (16-bit fraction with 5-bit shared exponent for DNN parameters), or mixed-precision floating point (16-bit weights, activations, and gradients and 32-bit accumulators in the SGD weight update process) are sufficient to maintain similar performance as 32-bit high-precision floating point. For instance, Courbariaux et al. trained a low-precision DNN on the MNIST, CIFAR-10, and SVHN datasets with the floating point, fixed-point, and BFP numerical formats [32]. They demonstrate that BFP is the most suitable choice for low-precision training due to variability between the dynamic range and precision of DNN parameters [32]. Following this work, Koster et al. proposed the Flexpoint numerical format and a new algorithm called Autoflex to automatically predict the optimal shared exponents for DNN parameters in each iteration of SGD by statistically analyzing the values of DNN parameters in previous iterations [35].
Aside from managing the shared exponent in the BFP numerical format, Narang et al. used mixed-precision floating point [34]. They used a 16-bit floating point to represent weights, activations, and gradients to perform forward and backward passes. To prevent accuracy loss caused by underflow in the product of learning rate and gradients with (2) in 16-bit floating point, the weights are updated in 32-bit floating point. Additionally, to prevent gradients with very small magnitude from becoming zero when represented by 16-bit float, a new loss scaling approach is proposed [34].
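The two failure modes mentioned here, a weight update that vanishes in 16-bit floating point and a gradient that underflows to zero, can be reproduced with NumPy's `float16`. This is a simplified sketch: the scale factor of 1024 and the toy values are assumptions, and the gradient is scaled directly rather than through a scaled loss and backpropagation as in [34].

```python
import numpy as np

# 1) A small update vanishes when the weight itself is kept in float16.
w16 = np.float16(1.0)
update = np.float16(1e-4)                 # stands in for learning_rate * gradient
w16_next = w16 - update                   # rounds back to 1.0 in float16
w32_next = np.float32(w16) - np.float32(update)   # a float32 copy keeps the update

# 2) A tiny gradient flushes to zero in float16 unless it is scaled first.
grad = np.float32(1e-8)
loss_scale = np.float32(1024.0)           # assumed scale factor
grad16_unscaled = np.float16(grad)                # underflows to 0
grad16_scaled = np.float16(grad * loss_scale)     # survives in float16
grad_recovered = np.float32(grad16_scaled) / loss_scale   # unscale in float32
```

The recovered gradient is close to, but not exactly, the original value because the scaled value still lands in the subnormal range of `float16`.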
Recently, Wang et al. and Mellempudi et al. reduced the bit-precision required to represent weights, activations, and gradients to 8 bits by exhaustively analyzing DNN training parameters [14, 36]. In [36], a new chunk-based addition is also presented to solve the truncation issue caused by the addition of large- and small-magnitude numbers, which reduces the number of bits demanded for accumulation and weight updates to 16 bits. To avoid the need for loss scaling in mixed-precision floating point, Kalamkar et al. [37] proposed the brain floating point (BFLOAT-16) half-precision format, which has the same dynamic range as 32-bit floating point (8-bit exponent) but less precision (7-bit fraction). The shared dynamic range between BFLOAT-16 and 32-bit floating point reduces the conversion complexity between these two formats in DNN training. In training a ResNet model on the ImageNet dataset, BFLOAT-16 achieves the same performance as 32-bit floating point.
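The low conversion cost between BFLOAT-16 and 32-bit floating point follows from the bit layout: a BFLOAT-16 value is the upper 16 bits of a float32 (1 sign bit, 8 exponent bits, 7 fraction bits). The sketch below converts by plain truncation, which is the simplest option and an assumption made here; production converters typically round to nearest instead. The function names are hypothetical.

```python
import numpy as np

def float32_to_bfloat16_bits(x):
    """Keep the upper 16 bits of each float32 value (truncation, no rounding)."""
    x = np.asarray(x, dtype=np.float32)
    return (x.view(np.uint32) >> 16).astype(np.uint16)

def bfloat16_bits_to_float32(b):
    """Widen bfloat16 bit patterns to float32 by appending 16 zero bits."""
    b = np.asarray(b, dtype=np.uint16)
    return (b.astype(np.uint32) << 16).view(np.float32)

x = np.array([1.0, -2.5, 3.1415927, 1e38], dtype=np.float32)
bits = float32_to_bfloat16_bits(x)
back = bfloat16_bits_to_float32(bits)
```

Values such as `1e38` stay finite after the round trip, whereas they would overflow in IEEE 16-bit floating point; the price is a coarser fraction.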
III-B Low-Precision DNN Inference
The performance of DNN inference without retraining is more robust to the noise generated by low-precision DNN parameters, since the DNN parameters are static during inference. Several groups have demonstrated that either 8-bit BFP or 8-bit fixed-point, coupled with linear quantization, is adequate to represent weights and activations without significantly degrading the performance achieved with 32-bit floating point. Note that the accumulation bit-width is selected to be 32 bits to preserve accuracy when performing the thousands of additions that MAC operations generally require. For instance, Gysel et al. demonstrate that 8-bit block floating point for representing weights and activations, 8-bit multipliers, and 32-bit accumulation results in 1% accuracy loss on AlexNet with the ImageNet corpus [16]. Following this work, Hashemi et al. introduce low-precision DNN inference networks to better understand the impact of numerical formats on the energy consumption and performance of DNNs [15, 16]. For instance, performing inference on AlexNet with the 8-bit fixed-point format yields an improvement in energy consumption over 32-bit fixed-point for the CIFAR-10 dataset [15]. Chung et al. proposed the Brainwave accelerator, which uses 8-bit block floating point with a 5-bit exponent to classify the ImageNet dataset on ResNet-50 with 2% accuracy loss [38]. However, the scaling factor parameter in the block floating point numerical format needs to be updated according to the DNN parameter statistics, thus increasing the computational complexity of inference.
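The dependence of block floating point on parameter statistics can be seen in a small sketch: the shared exponent is derived from the largest magnitude in the block, so it must be recomputed whenever the value distribution changes. The exponent selection rule and the names `bfp_quantize` and `bfp_dequantize` are assumptions for illustration and are not taken from [16] or [38].

```python
import numpy as np

def bfp_quantize(block, mantissa_bits=8):
    """Quantize a block to signed integer mantissas that share one exponent."""
    block = np.asarray(block, dtype=np.float64)
    max_abs = np.max(np.abs(block))
    if max_abs == 0:
        return np.zeros(block.shape, dtype=np.int64), 0
    limit = 2 ** (mantissa_bits - 1) - 1
    # Smallest exponent for which the largest value still fits in the mantissa.
    shared_exp = int(np.ceil(np.log2(max_abs / limit)))
    mant = np.clip(np.rint(block / 2.0 ** shared_exp), -limit, limit)
    return mant.astype(np.int64), shared_exp

def bfp_dequantize(mant, shared_exp):
    """Map integer mantissas back to real values using the shared exponent."""
    return mant * 2.0 ** shared_exp

values = np.array([0.5, -0.25, 0.01])
mant, shared_exp = bfp_quantize(values)
approx = bfp_dequantize(mant, shared_exp)
```

Small entries in a block dominated by a large one lose relative precision, here `0.01` becomes `0.0078125`, which is one reason the choice of block and exponent matters.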
To alleviate this problem, researchers have used posits in DNNs [20, 21, 19, 22]. Posits represent numbers more accurately around 1 and less accurately for very small and large numbers, unlike the uniform precision of the floating point numerical format [40]. This characteristic of posits arises from their tapered precision and suits the distribution of DNN parameters well [25, 19]. For instance, Langroudi et al. explored the efficacy of posits for representing DNN weights and showed that it is possible to keep the loss in accuracy within 1% for AlexNet on the ImageNet corpus with 7-bit weight representation [19]. They also demonstrate that posits have a 30% smaller memory footprint than fixed-point for multiple DNNs while maintaining a 1% drop in accuracy. However, in that work, the 7-bit posit quantized weights are converted to 32-bit floats, limiting the posit numerical format to memory storage only.
To take full advantage of the posit numerical format, Carmichael et al. proposed the Deep Positron DNN accelerator, which employs the posit numerical format to represent weights and activations combined with an FPGA soft core for 8-bit precision exact-MAC operations [20, 21]. They demonstrate that 8-bit posits outperform 8-bit fixed-point and floating point on low-dimensional datasets, such as Iris [41]. Following these works, Johnson most recently proposed a log float format that combines the posit numerical format with exact log-linear multiply-add (ELMA), the logarithmic version of the exact MAC operation. This work shows that it is possible to classify ImageNet with the ResNet DNN architecture with 1% accuracy degradation [22].
This research builds on these earlier studies [20, 21, 19, 22] and extends low-precision arithmetic to both DNN training and DNN inference with different quantization approaches for both feedforward and convolutional neural networks on various datasets.
IV Proposed Framework
The Cheetah framework, shown in Fig. 1, comprises a two-level software component and a single-level hardware component. The software framework is used to evaluate the performance of various numerical formats and quantization approaches by emulating low-precision DNN training and inference. The hardware framework is a soft core implemented on an FPGA and is used for evaluating the hardware characteristics of the MAC (multiply-and-accumulate) operation, the fundamental computation in DNN models, coupled with various quantization techniques. For each level, two optimization stages are considered to convert the baseline DNN model, which uses 32-bit high-precision floating point with soft-core MACs, to a low-precision DNN model with either posit, floating point, or fixed-point arithmetic soft-core exact-MACs (EMACs). This optimization is performed iteratively, reducing the bit-precision by one at each step; the performance degradation and hardware complexity reduction achieved by a numerical format in both DNN training and inference are computed and compared with the specified design constraints (e.g. a 3× EDP reduction with similar performance). Once one of the design constraints is violated, the iterative process is repeated for the next numerical format. Essentially, Cheetah approximates the optimal bit-width for each numerical format based on the performance and hardware complexity constraints. Note that there is a priority between optimization approaches; the numerical format parameter has a higher precedence in the optimization process. This design decision is made to limit the search space and the hardware complexity overhead of the quantization approaches. In performing DNN inference, the current version of Cheetah supports three low-precision numerical formats (fixed-point, floating point, and posit), two quantization approaches (rounding and linear), and two DNN models (feedforward and convolutional neural networks).
To perform DNN training on feedforward neural networks, Cheetah supports two numerical formats (floating point and posit) with 32-bit and 16-bit precision. For brevity, the architecture explained here is based on single hidden layer feedforward neural network training and inference with the posit numerical format for both rounding and linear quantization approaches, as shown in Fig. 2.
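The iterative bit-width reduction can be illustrated with a simplified search loop. This sketch checks only an accuracy constraint and ignores the hardware (EDP) constraint; the 0.2-point threshold and the function name `smallest_bit_width` are assumptions, while the accuracies are the MNIST feedforward results reported in this paper for rounding quantization.

```python
# MNIST, feedforward network, rounding quantization (accuracies in percent).
ACCURACY = {
    "posit": {8: 98.45, 7: 98.39, 6: 98.37, 5: 98.30},
    "float": {8: 98.42, 7: 98.39, 6: 98.33, 5: 93.91},
    "fixed": {8: 98.31, 7: 97.95, 6: 97.87, 5: 97.88},
}
BASELINE = 98.46  # 32-bit floating point

def smallest_bit_width(fmt, max_drop):
    """Lower the bit-width one step at a time until the accuracy constraint fails."""
    chosen = None
    for bits in sorted(ACCURACY[fmt], reverse=True):
        if BASELINE - ACCURACY[fmt][bits] > max_drop:
            break                    # constraint violated: stop for this format
        chosen = bits
    return chosen

# Assumed constraint: at most 0.2 percentage points below the baseline.
result = {fmt: smallest_bit_width(fmt, max_drop=0.2) for fmt in ACCURACY}
```

Under this assumed threshold the search settles on 5 bits for posit, 6 bits for floating point, and 8 bits for fixed-point.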
IV-A Software Design and Exploration
In emulating feedforward and convolutional DNNs, the output of each layer is calculated as in (5)
[TABLE]
where and are scale factors, is the bias term, is the activation vector, is the weight matrix, indicates the number of MAC operations, and is the quantization function. First, the feedforward or convolutional neural network is trained by either 32- or 16-bit floating point or posit numbers as shown by Fig. 5. To perform DNN inference, the 32-bit floating point high-precision learned weights and 32-bit floating point high-precision activations are quantized to either -bit low-precision fixed-point, floating point, or posit numbers ().
In the quantization procedure, the values of and are dependent on the quantization approach. To perform rounding quantization, and are both set to 1 and the 32-bit high-precision floating point values that lie outside dynamic range of one of the low-precision posit numerical formats (e.g. 8-bit posit) are clipped appropriately to either the format’s maximum or minimum. During quantization by rounding, a value that is interleaved between two arbitrary numbers is rounded to the nearest number. To perform linear quantization, the activations and weights are quantized to the range by calculating and setting .
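The difference between the two quantization approaches can be sketched on a small set of representable values. The example uses a 5-bit fixed-point grid for brevity; the scale factor formula in `linear_quantize` (mapping the largest input magnitude to the largest representable magnitude) is an assumption made for illustration, as are both function names.

```python
import numpy as np

def round_to_levels(x, levels):
    """Rounding quantization: clip to the format's range, then pick the nearest level."""
    levels = np.sort(np.asarray(levels, dtype=np.float64))
    x = np.clip(np.asarray(x, dtype=np.float64), levels[0], levels[-1])
    idx = np.clip(np.searchsorted(levels, x), 1, len(levels) - 1)
    lo, hi = levels[idx - 1], levels[idx]
    return np.where(x - lo <= hi - x, lo, hi)

def linear_quantize(x, levels):
    """Linear quantization (assumed form): scale into range, round, then descale."""
    x = np.asarray(x, dtype=np.float64)
    scale = np.max(np.abs(levels)) / np.max(np.abs(x))
    return round_to_levels(scale * x, levels) / scale

# 5-bit fixed-point with 4 fraction bits: -1.0, -0.9375, ..., 0.9375.
levels = np.arange(-16, 16) / 16.0
x = np.array([0.013, -0.4, 0.26, 2.5])

rounded = round_to_levels(x, levels)
linear = linear_quantize(x, levels)
```

Rounding clips the out-of-range value 2.5 to 0.9375, while the scaled variant keeps it near 2.34 at the cost of coarser resolution for the small values.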
In the next step, the MAC operation is employed to calculate . To minimize arithmetic error, the MAC operation in this paper is calculated using the EMAC algorithm [20]. In the EMAC, to preserve precision in computing the products, the posit weights and activations are multiplied in a posit format without truncation or rounding at the end of multiplications. To avoid rounding during accumulation, the products are stored in a wide register, or quire in the posit literature, with a width given by (6). The products are then converted to the fixed-point format , where is the exponent bit-width and is the fraction bit-width. Finally, the fixed-point products are accumulated and the result is descaled in linear quantization, again using and , and converted back to posit.
[TABLE]
Algorithm 1 Posit DOT operation for n-bit inputs each with es exponent bits [20]
1: procedure PositDOT()
2:
3:
4:   sf_w ← {reg_w, exp_w}  ▷ Gather scale factors
5:
Multiplication
6:
7:
8:   ovf_mult ← frac_mult[MSB]  ▷ Adjust for overflow
9:
10:
Accumulation
11:   fracs_mult ← sign_mult ? −frac_mult : frac_mult
12:   sf_biased ← sf_mult + bias  ▷ Bias the scale factor
13:   fracs_fixed ← fracs_mult ≪ sf_biased  ▷ Shift to fixed
14:   sum_quire ← fracs_fixed + sum_quire  ▷ Accumulate
Fraction & SF Extraction
15:
16:   mag_quire ← sign_quire ? −sum_quire : sum_quire
17:
18:   frac_quire ← mag_quire[2×(n−2−es)−1+zc : zc]
19:   sf_quire ← zc − bias
Convergent Rounding & Encoding
20:
21:
22:   exp ← sf_quire[es−1 : 0]  ▷ Unpack scale factor
23:   reg_tmp ← sf_quire[MSB−1 : es]
24:   reg ← sign_sf ? −reg_tmp : reg_tmp
25:   ovf_reg ← reg[MSB]  ▷ Check for overflow
26:   reg_f ← ovf_reg ? {{⌈log2(n)⌉−2{1}}, 0} : reg
27:
28:   tmp1 ← {nzero, 0, exp_f, frac_quire[MSB−1 : 0], {n−1{0}}}
29:   tmp2 ← {0, nzero, exp_f, frac_quire[MSB−1 : 0], {n−1{0}}}
30:
31:   if then
32:
33:
34:   else
35:
36:
37:   end if
38:
39:   lsb, guard ← tmp[MSB−(n−2) : MSB−(n−1)]
40:   (guard & (lsb | (|tmp[MSB−n : 0]))) : 0
41:   result_tmp ← tmp[MSB : MSB−n+1] + round
42:   result ← sign_quire ? −result_tmp : result_tmp
43:   return
44: end procedure
IV-B Hardware Framework
The MAC operation, introduced above as the fundamental DNN operation, calculates the weighted sum of a set of inputs. In many implementations, this operation is inexact, i.e. arithmetic error grows due to iterative rounding and truncation. The EMAC mitigates this concern by adapting the concept of the Kulisch accumulator [42]. The error due to rounding is deferred until after the accumulation of all products, which is particularly beneficial for low-precision arithmetic. In the EMAC, as mentioned beforehand, the fixed-point values of the products are accumulated in a wide register sized as given by (6). The posit EMAC, illustrated by Fig. 3, is parameterized by n, the bit-width, and es, the number of exponent bits. "NaR" is not considered as posits do not overflow or underflow and all DNN parameters and data are real numbers. Algorithm 1 describes the bitwise operation of the EMAC dot product. Each EMAC is pipelined into three stages: multiplication, accumulation, and rounding. For further details on EMACs and the exact dot product, we suggest reviewing [42, 20, 21].
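The benefit of deferring rounding can be shown with exact rational arithmetic. The sketch below contrasts a dot product that rounds after every multiply and add with one that accumulates exact products and rounds once at the end; it models the idea behind the EMAC on a fixed-point grid and is not the hardware design itself. The function names and the 4-fraction-bit grid are assumptions.

```python
from fractions import Fraction

def round_to_grid(x, frac_bits):
    """Round a rational number to the nearest multiple of 2**-frac_bits."""
    step = Fraction(1, 2 ** frac_bits)
    return round(x / step) * step

def mac_inexact(weights, activations, frac_bits):
    """Round after every multiply and every add, as a narrow datapath would."""
    acc = Fraction(0)
    for w, a in zip(weights, activations):
        acc = round_to_grid(acc + round_to_grid(w * a, frac_bits), frac_bits)
    return acc

def emac(weights, activations, frac_bits):
    """Accumulate exact products in a wide accumulator and round once at the end."""
    acc = sum((w * a for w, a in zip(weights, activations)), Fraction(0))
    return round_to_grid(acc, frac_bits)

weights = [Fraction(3, 16)] * 4
activations = [Fraction(3, 16)] * 4
exact = sum(w * a for w, a in zip(weights, activations))   # 9/64
inexact_result = mac_inexact(weights, activations, frac_bits=4)
emac_result = emac(weights, activations, frac_bits=4)
```

With these inputs the exact sum is 9/64; the per-step rounding drifts to 1/4 while the deferred rounding returns 1/8, the nearest value on the grid.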
V Simulation Results & Analysis
The Cheetah software is implemented in the Keras [43] and TensorFlow [44] frameworks. Rounding quantization, linear quantization, and the EMAC operations are added to these frameworks via software emulation, covering [5,32]-bit precision fixed-point, floating point, and posit numbers for DNN inference and {16, 32}-bit floating point and posit numbers for DNN training. To reduce the search space, the format parameters are restricted to the settings listed in Table I, which still provide, on average, a wide coverage (82%) of the dynamic range of each numerical format.
V-A Exploiting Numerical Formats for DNN Inference
To evaluate Cheetah performance on DNN inference, a feedforward neural network and different convolutional neural networks are trained on three benchmarks with 32-bit floating point. The specification of these tasks and inference performance are summarized in Table II. The accuracies of performing DNN inference on these tasks with the [5..8]-bit precision version of Cheetah are presented in Table III. The results show that posit with [5..8]-bit precision outperforms the fixed-point and floating point formats. For instance, the accuracy of DNN inference on Fashion-MNIST is higher with 5-bit posits than with either 5-bit floating point or 5-bit fixed-point. On the CIFAR-10 dataset, the gains of 5-bit posits over floating point and fixed-point are even more noticeable. The benefits of the posit numerical format are intuitively explained by the nonlinear distribution of its values, which resembles that of DNN inference parameters. This hypothesis is explored empirically by calculating the distortion rate of DNN inference parameters with respect to each numerical format. The distortion rate, described by (7), compares the high-precision parameters with their quantized counterparts. The results, as shown in Fig. 4, support the hypothesis, especially at 5-bit precision, where the distortion rate of posit is significantly lower than that of the other numerical formats.
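A distortion measurement of this kind can be sketched in Python as follows. The sketch rests on several assumptions: the distortion is taken to be the mean squared error between the original and the quantized parameters, which may differ from the exact form of (7); quantization picks the nearest representable value, a simplification of true posit rounding; and the helper names `decode_posit`, `posit_values`, `quantize_nearest`, and `distortion` are hypothetical.

```python
from bisect import bisect_left
from fractions import Fraction


def decode_posit(bits: int, n: int, es: int):
    """Decode an n-bit posit bit pattern with es exponent bits (None for NaR)."""
    if bits == 0:
        return Fraction(0)
    if bits == 1 << (n - 1):
        return None                                   # NaR pattern
    negative = (bits >> (n - 1)) & 1
    if negative:
        bits = (-bits) & ((1 << n) - 1)               # two's complement magnitude
    body = [(bits >> i) & 1 for i in range(n - 2, -1, -1)]  # bits after the sign
    run = 1
    while run < len(body) and body[run] == body[0]:
        run += 1                                      # length of the regime run
    k = run - 1 if body[0] else -run
    rest = body[run + 1:]                             # skip the regime terminator
    exp_bits = (rest[:es] + [0] * es)[:es]            # missing exponent bits are 0
    frac_bits = rest[es:]
    e = sum(b << i for i, b in enumerate(reversed(exp_bits)))
    f = sum(b << i for i, b in enumerate(reversed(frac_bits)))
    value = (1 + Fraction(f, 1 << len(frac_bits))) * Fraction(2) ** (k * (1 << es) + e)
    return -value if negative else value


def posit_values(n: int, es: int):
    """Sorted list of all real values representable by an (n, es) posit."""
    decoded = (decode_posit(b, n, es) for b in range(1 << n))
    return sorted(float(v) for v in decoded if v is not None)


def quantize_nearest(x: float, grid):
    """Nearest representable value to x (simplified rounding)."""
    i = bisect_left(grid, x)
    return min(grid[max(i - 1, 0): i + 1], key=lambda g: abs(g - x))


def distortion(params, grid):
    """Assumed distortion measure: mean squared quantization error."""
    return sum((p - quantize_nearest(p, grid)) ** 2 for p in params) / len(params)


params = [0.013, -0.4, 0.27, 1.9, -3.3]               # made-up example parameters
grid5, grid8 = posit_values(5, 0), posit_values(8, 0)
d5, d8 = distortion(params, grid5), distortion(params, grid8)
```

Because every 5-bit posit value is also an 8-bit posit value for the same es, `d8` can never exceed `d5`. Comparing formats at equal bit-width would require analogous value grids for fixed-point and floating point.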
[TABLE]
V-B Exploiting Numerical Formats with Quantization Approaches for DNN Inference
As mentioned before, quantization with rounding has less overhead than the other quantization approaches, but with it 5-bit posit DNN inference cannot match the performance of 32-bit floating point DNN inference. To improve the performance of DNN inference, the [5..8]-bit posit numerical format is combined with linear quantization and evaluated for a 4-layer feedforward neural network on the MNIST and Fashion-MNIST datasets. The two parameters in (5) can be implemented either by constant multiplication or by a shift operation in which their values are approximated by a power of two. The results, as shown in Table IV, show that 5-bit low-precision DNN inference achieves similar performance to 32-bit floating point DNN inference on the MNIST dataset. Essentially, by deploying this approach, the quantization error produced by the values that lie outside of posit’s dynamic range is zeroed out. Notably, the accuracy of DNN inference with posits is significantly enhanced by using linear quantization in comparison to quantization with rounding. The linear quantization approach also plays a key role in reducing the hardware complexity of posit EMACs used for DNN inference: the overhead of adding linear quantization is offset by carrying out the posit EMAC operation with a less costly es setting, which is explained in depth in Section V-D.
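The shift-based variant can be illustrated with a short Python sketch. It assumes a scale-only mapping in which the scale factor is rounded down to a power of two so that the largest magnitude fits a chosen target range; the actual form of (5) and its two parameters are not reproduced here. The names `power_of_two_shift`, `scale_by_shift`, and `target_max` are hypothetical.

```python
import math


def power_of_two_shift(values, target_max: float) -> int:
    """Largest shift s such that max|v| * 2**s does not exceed target_max."""
    peak = max(abs(v) for v in values)
    if peak == 0:
        return 0                                   # nothing to scale
    return math.floor(math.log2(target_max / peak))


def scale_by_shift(values, shift: int):
    """Multiply by 2**shift; in hardware this is a shift, not a multiplier."""
    return [math.ldexp(v, shift) for v in values]


weights = [0.02, -0.05, 0.01]                      # made-up example weights
shift = power_of_two_shift(weights, target_max=1.0)
scaled = scale_by_shift(weights, shift)            # values moved into [-1, 1]
restored = scale_by_shift(scaled, -shift)          # undo the scaling after the EMAC
```

Scaling by a power of two only changes the exponent of a binary floating point value, so `restored` equals `weights` exactly, and the only quantization error left is that of the low-precision format applied to `scaled`.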
V-C Exploiting Posit and Floating Point for DNN Training
To compare the efficacy of the posit numerical format with that of floating point for training, a 4-layer feedforward neural network is trained with each number system on the MNIST and Fashion-MNIST datasets. The results, shown in Table V, indicate that posits achieve slightly better accuracy than floating point; in particular, 16-bit posits outperform 16-bit floats. Although Cheetah is evaluated on small datasets, it has two advantages compared to [14, 36]. Mellempudi et al. [36] use 32-bit numbers for accumulation to reduce the hardware cost of stochastic rounding. Wang et al. [14] reduce the accumulation bit-precision to 16 by using stochastic rounding. In this paper, by contrast, we show the potential of using 16-bit posits for all DNN parameters with a simple and hardware-friendly round-to-nearest algorithm, and we show less than 1% accuracy degradation without exhaustively analyzing DNN training parameters.
V-D EMAC Soft-Core FPGA Implementation
To show the effectiveness of the posit numerical format over floating point and fixed-point, we evaluate the trade-off between the energy-delay-product and latency of the EMAC operation vs. the average accuracy degradation from 32-bit floating point per bit-width across the three datasets (two for the linear-quantization experiment) with the Cheetah framework, as shown in Figs. 5, 6, 7, 8, and 9. The energy-delay-product is a combined measure of the latency and resource cost of the EMAC operation. The EMAC operation coupled with quantization with rounding [20] and the EMAC operation coupled with linear quantization are selected for all numerical formats and measured on a Virtex-7 FPGA (xc7vx485t-2ffg1761c) with synthesis through Vivado 2017.2. Note that the average accuracy degradation per bit-width is computed using the accuracy results in Table IV.
The results, as shown by Fig. 5, indicate that posit coupled with rounding quantization achieves up to 23% average accuracy improvement over fixed-point. However, this accuracy enhancement is gained at the cost of an increase in the energy-delay-product of the EMAC unit. Posit also consistently shows better performance than the floating point number system at a comparable energy-delay-product, especially at 5-bit precision. The posit EMAC operation achieves lower latencies, as shown in Fig. 6, due to a lack of subnormal detection and other exception cases, but exhibits resource-hungry encoding and decoding due to the variable-length regime of the posit numerical format, as shown in Fig. 7. Overall, the 6-bit posit shows the best trade-off between energy-delay-product and average accuracy degradation from 32-bit floating point on the two benchmarks (when analyzed across the [5..8]-bit range). Within the posit numerical format, the choice of es also affects the trade-off between classification performance and EMAC energy-delay-product. At [5..7]-bit precision, the average performance of DNN inference with the best-performing es setting among the three datasets is 2% and 4% better than with the two alternative settings. These accuracy benefits are coupled with 2.1× less energy-delay-product than the first alternative and 1.4× more energy-delay-product than the second. These results are measured when rounding quantization is used. Linear quantization with the shift operation requires similar hardware overhead across all of the numerical formats, as shown in Figs. 8 and 9. With linear quantization, however, the accuracy of posit DNN inference is similar for the two es settings compared. Therefore, it is possible to use EMACs with the less costly es setting and thereby achieve 18% energy-delay-product savings.
A summary of previous studies that propose low-precision frameworks is shown in Table VI. Several research groups have explored the efficacy of floats and fixed-point on the performance and hardware complexity of DNNs with multiple image classification tasks [32, 15, 16, 14, 35, 34]. However, none of these works analyze the appropriateness of the posit numerical format for both DNN training and inference. Additionally, existing work does not offer insight into the impact of the quantization approach vs. the numerical format on both accuracy and hardware complexity, as investigated in this paper.
VI Conclusions
A low-precision DNN framework, Cheetah, for edge devices is proposed in this work. We explored the capacity of various numerical formats, including floating point, fixed-point and posit, for both DNN training and inference. We show that the recent posit numerical format has high efficacy for DNN training at {16, 32}-bit precision and inference at 8-bit precision. Moreover, we show that it is possible to achieve better performance and reduce energy consumption by using linear quantization with the posit numerical format. The success of low-precision posits in reducing DNN hardware complexity with negligible accuracy degradation motivates us to evaluate ultra-low precision training in future work.
References
- [1] E. Li, Z. Zhou, and X. Chen, “Edge intelligence: On-demand deep learning model co-inference with device-edge synergy,” in Proceedings of the 2018 Workshop on Mobile Edge Communications. ACM, 2018, pp. 31–36.
- [2] W. Shi, J. Cao, Q. Zhang, Y. Li, and L. Xu, “Edge computing: Vision and challenges,” IEEE Internet of Things Journal, vol. 3, no. 5, pp. 637–646, 2016.
- [3] M. Satyanarayanan, “The emergence of edge computing,” Computer, vol. 50, no. 1, pp. 30–39, 2017.
- [4] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems 25: 26th Annual Conference on Neural Information Processing Systems, NeurIPS, P. L. Bartlett, F. C. N. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, Eds., Lake Tahoe, Nevada, USA, Dec. 2012, pp. 1106–1114. [Online]. Available: http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-ne
- [5] M. Horowitz, “1.1 computing’s energy problem (and what we can do about it),” in 2014 IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC). IEEE, 2014, pp. 10–14.
- [6] X. Xu, Y. Ding, S. X. Hu, M. Niemier, J. Cong et al., “Scaling for edge inference of deep neural networks,” Nature Electronics, vol. 1, no. 4, p. 216, 2018.
- [7] C.-J. Wu, D. Brooks, K. Chen, D. Chen, S. Choudhury et al., “Machine learning at facebook: Understanding inference at the edge,” in 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 2019, pp. 331–344.
- [8] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang et al., “Mobilenets: Efficient convolutional neural networks for mobile vision applications,” arXiv preprint arXiv:1704.04861, 2017.
