Mixed-Signal Charge-Domain Acceleration of Deep Neural Networks through Interleaved Bit-Partitioned Arithmetic

Soroush Ghodrati; Hardik Sharma; Sean Kinzer; Amir Yazdanbakhsh; Kambiz Samadi; Nam Sung Kim; Doug Burger; Hadi Esmaeilzadeh

arXiv:1906.11915·cs.AR·July 15, 2019

TL;DR

This paper introduces a mixed-signal, charge-domain accelerator for deep neural networks that uses bit-partitioned arithmetic and switched-capacitor circuitry to improve power efficiency and reduce A/D conversion overhead.

Contribution

It proposes a novel charge-domain architecture with bit-partitioned operations and switched-capacitor design, enabling low-power, high-parallelism DNN acceleration.

Findings

01 Reduces A/D conversion overhead in DNN accelerators

02 Improves noise mitigation through low-bitwidth operations

03 Achieves efficient charge-domain computation with switched-capacitor circuits

Abstract

Low-power potential of mixed-signal design makes it an alluring option to accelerate Deep Neural Networks (DNNs). However, mixed-signal circuitry suffers from limited range for information encoding, susceptibility to noise, and Analog to Digital (A/D) conversion overheads. This paper aims to address these challenges by offering and leveraging the insight that a vector dot-product (the basic operation in DNNs) can be bit-partitioned into groups of spatially parallel low-bitwidth operations, and interleaved across multiple elements of the vectors. As such, the building blocks of our accelerator become a group of wide, yet low-bitwidth multiply-accumulate units that operate in the analog domain and share a single A/D converter. The low-bitwidth operation tackles the encoding range limitation and facilitates noise mitigation. Moreover, we utilize the switched-capacitor design for our…

Tables

Table 1: Energy breakdown for $\mathcal{MS}$ -BPMacc

Units	Energy (femto Joule)
1 MACC	5.1 fJ
256 MACCs	1,305.6 fJ
SAR ADC (for 256 MACCs)	1,660.0 fJ
Total Energy	2,965.6 fJ
Total Energy per 2b-2b MACC	11.6 fJ
Total Energy per 8b-8b MACC	185.3 fJ
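
The rows of Table 1 are related by simple arithmetic, which the following sketch reproduces. It assumes that an 8b-8b MACC with 2-bit partitioning costs 16 2b-2b MACCs (the 4 x 4 partition pairs described in Section 3.2) and takes the 1 pJ digital MACC energy quoted in Section 3.3; both constants are inputs to this illustration rather than entries of the table. Summing the 256-MACC row and the SAR ADC row gives 2,965.6 fJ, which is the total that reproduces the two per-MACC rows.

```python
# Values copied from Table 1 (femtojoules).
MACC_FJ = 5.1          # one 2b-2b switched-capacitor MACC
MACCS_PER_ADC = 256    # MACC operations sharing one SAR ADC conversion
ADC_FJ = 1660.0        # one SAR ADC conversion

# Assumptions of this sketch, not Table 1 entries.
PARTIAL_PRODUCTS_8B = (8 // 2) ** 2   # 16 partition pairs per 8b-8b MACC
DIGITAL_8B_MACC_FJ = 1000.0           # 1 pJ digital MACC (Section 3.3)

maccs_fj = MACC_FJ * MACCS_PER_ADC            # energy of 256 MACCs
total_fj = maccs_fj + ADC_FJ                  # MACCs plus one conversion
per_2b_fj = total_fj / MACCS_PER_ADC          # amortized 2b-2b energy
per_8b_fj = per_2b_fj * PARTIAL_PRODUCTS_8B   # amortized 8b-8b energy
ratio = DIGITAL_8B_MACC_FJ / per_8b_fj        # digital / mixed-signal

print(f"256 MACCs:        {maccs_fj:.1f} fJ")
print(f"total:            {total_fj:.1f} fJ")
print(f"per 2b-2b MACC:   {per_2b_fj:.2f} fJ")
print(f"per 8b-8b MACC:   {per_8b_fj:.2f} fJ")
print(f"digital / analog: {ratio:.2f}x")
```

Under these assumptions the amortized energy is roughly 11.58 fJ per 2b-2b MACC and 185.35 fJ per 8b-8b MACC, and the ratio against 1 pJ is close to the 5.4 $\times$ figure quoted in Section 3.3.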

Table 2: Evaluated benchmarked DNNs

DNN	Type	Domain	Dataset	Multiply-Adds	Model Weights
AlexNet [56]	CNN	Image Classification	Imagenet [57]	2,678 MOps	56.1 MBytes
CIFAR-10 [58, 59]	CNN	Image Classification	CIFAR-10 [60]	617 MOps	13.4 MBytes
GoogLeNet [61]	CNN	Image Classification	Imagenet	1,502 MOps	13.5 MBytes
ResNet-18 [62]	CNN	Image Classification	Imagenet	4,269 MOps	11.1 MBytes
ResNet-50 [62]	CNN	Image Classification	Imagenet	8,030 MOps	24.4 MBytes
VGG-16 [58]	CNN	Object Recognition	Imagenet	31 GOps	131.6 MBytes
VGG-19 [58]	CNN	Object Recognition	Imagenet	39 GOps	137.3 MBytes
YOLOv3 [63]	CNN	Object Recognition	Imagenet	19 GOps	39.8 MBytes
PTB-RNN [59]	RNN	Language Modeling	Penn TreeBank [64]	17 MOps	16 MBytes
PTB-LSTM [65]	RNN	Language Modeling	Penn TreeBank	13 MOps	12.3 MBytes

Table 3: BiHiwe and baseline platforms

Parameters	ASIC		Parameters	GPU	
Chip	BiHiwe	Tetris	Chip	RTX 2080 TI	Titan Xp
MACCs	16,384	3,136	Tensor Cores	544	-
On-chip Memory	9216 KB	3698 KB	Memory	11 GB (GDDR6)	12 GB (GDDR5X)
Chip Area ( $mm^{2}$ )	122.3	56	Chip Area ( $mm^{2}$ )	754	471
			Total Dissipation Power	250 W	250 W
Frequency	500 MHz	500 MHz	Frequency	1545 MHz	1531 MHz
Technology	45 nm	45 nm	Technology	12 nm	16 nm

Table 4: Accuracy before and after fine-tuning.

DNN Model	Dataset	Top-1 Accuracy (With non-idealities)	Top-1 Accuracy (After fine-tuning)	Top-1 Accuracy (Ideal)	Accuracy Loss
AlexNet	Imagenet	53.12%	56.64%	57.11%	0.47%
CIFAR-10	CIFAR-10	90.82%	91.01%	91.03%	0.02%
GoogLeNet	Imagenet	67.15%	68.39%	68.72%	0.33%
ResNet-18	Imagenet	66.91%	68.96%	68.98%	0.02%
ResNet-50	Imagenet	74.5%	75.21%	75.25%	0.04%
VGG-16	Imagenet	70.31%	71.28%	71.46%	0.18%
VGG-19	Imagenet	73.24%	74.20%	74.52%	0.32%
YOLOv3	Imagenet	75.92%	77.1%	77.22%	0.21%
PTB-RNN	Penn TreeBank	1.1 BPC	1.6 BPC	1.1 BPC	0.0 BPC
PTB-LSTM	Penn TreeBank	97 PPW	170 PPW	97 PPW	0.0 PPW

Equations

$Q_{sx} = V_{DD} \times (|x| C_{x})$

$V_{s} = \frac{Q_{sx}}{C_{eq}} = \frac{V_{DD} \times (|x| C_{x})}{3 C_{x} + |w| C_{w}}$

$Q_{sw} = V_{s} \times |w| C_{w} = |x| \times |w| \left(\frac{C_{w} C_{x} V_{DD}}{3 C_{x} + |w| C_{w}}\right)$

$V_{ACC} = |x||w| \left(\frac{C_{w} V_{DD}}{3 \times C_{ACC}}\right)$

$\sigma_{ACC} = \frac{kT(\alpha |W_{m-1}| + 3\alpha + 3)}{9\alpha(\alpha + 1)^{2} C_{w}} \left(\sum_{i=0}^{m-1} \left(\frac{\alpha}{1 + \alpha}\right)^{2i}\right) \times n$

$N(\mu = 0, \sigma^{2} = (\sigma_{ACC} \times r \times 85)^{2})$

$V_{ACC,Ideal}[m] = \sum_{i=1}^{m} \frac{V_{DD}}{9\alpha} W_{i} X_{i}$

$\frac{3\alpha}{3\alpha + |W_{m}|} V_{ACC,R}[m-1] + \frac{W_{m} X_{m} \beta}{(3\alpha + |W_{m}|)(3\beta + |W_{m}|)} V_{DD}$

$W_{i}^{'} = \frac{W_{i}}{3\alpha + |W_{i}|} \frac{\beta V_{DD}}{3\beta + |W_{i}|} \prod_{j=i+1}^{m-1} \frac{3\alpha}{3\alpha + |W_{j}|} \quad \forall\, 0 \leq i \leq m-1$

Full text

Mixed-Signal Charge-Domain Acceleration of Deep Neural Networks through Interleaved Bit-Partitioned Arithmetic

Soroush Ghodrati, Hardik Sharma†, Sean Kinzer, Amir Yazdanbakhsh‡, Kambiz Samadi♭, Nam Sung Kim¶, Doug Burger§, Hadi Esmaeilzadeh

Alternative Computing Technologies (ACT) Lab, University of California, San Diego

†Georgia Institute of Technology ‡Google Research ♭Qualcomm Technologies ¶Samsung Electronics §Microsoft

This work has been done when the author was a PhD student at Georgia Institute of Technology.

Abstract

Low-power potential of mixed-signal design makes it an alluring option to accelerate Deep Neural Networks (DNNs). However, mixed-signal circuitry suffers from limited range for information encoding, susceptibility to noise, and Analog to Digital (A/D) conversion overheads. This paper aims to address these challenges by offering and leveraging the insight that a vector dot-product (the basic operation in DNNs) can be bit-partitioned into groups of spatially parallel low-bitwidth operations, and interleaved across multiple elements of the vectors. As such, the building blocks of our accelerator become a group of wide, yet low-bitwidth multiply-accumulate units that operate in the analog domain and share a single A/D converter. The low-bitwidth operation tackles the encoding range limitation and facilitates noise mitigation. Moreover, we utilize the switched-capacitor design for our bit-level reformulation of DNN operations. The proposed switched-capacitor circuitry performs the group multiplications in the charge domain and accumulates the results of the group in its capacitors over multiple cycles. The capacitive accumulation combined with wide bit-partitioned operations alleviates the need for A/D conversion per operation. With such mathematical reformulation and its switched-capacitor implementation, we define a 3D-stacked microarchitecture, dubbed BiHiwe (Bit-Partitioned and Interleaved Hierarchy of Wide Acceleration through Electrical Charge, pronounced Bee Hive), that leverages clustering and hierarchical design to best utilize power-efficiency of the mixed-signal domain and 3D stacking. For ten DNN benchmarks, BiHiwe delivers 4.9 $\times$ speedup over a leading purely-digital 3D-stacked accelerator Tetris, with less than 0.5% accuracy loss achieved by careful treatment of noise, computation error, and various forms of variation. Compared to RTX 2080 TI with tensor cores and Titan Xp GPUs, all with 8-bit execution, BiHiwe offers 33.1 $\times$ and 66.5 $\times$ higher Performance-per-Watt, respectively. BiHiwe also outperforms other leading digital and analog accelerators in power efficiency. The results suggest that BiHiwe is an effective initial step in a road that combines mathematics, circuits, and architecture.

1 Introduction

Deep Neural Networks (DNNs) are revolutionizing a wide range of services and applications such as language translation [1], transportation [2], intelligent search [3], e-commerce [4], and medical diagnosis [5]. These benefits are predicated upon hardware platforms delivering performance and energy efficiency. With the diminishing benefits from general-purpose processors [6, 7, 8, 9], there is an explosion of digital accelerators for DNNs [10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]. Mixed-signal acceleration [32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42] is also gaining traction. Albeit low-power, mixed-signal circuitry suffers from a limited range of information encoding, is susceptible to noise, imposes Analog to Digital (A/D) and Digital to Analog (D/A) conversion overheads, and lacks a fine-grained control mechanism. Realizing the full potential of mixed-signal technology requires a balanced design that brings mathematics, architecture, and circuits together.

This paper sets out to explore this conjunction of areas by inspecting the mathematical foundation of deep neural networks. Across a wide range of models, the large majority of DNN operations belong to convolution and fully-connected layers [23, 28, 32]. Consequently, based on Amdahl’s Law, our architecture executes these two types of layers in the mixed-signal domain. Nevertheless, to maintain generality for the ever-expanding roster of other layers required by modern DNNs, the architecture handles the other layers digitally. Normally, the convolution and fully-connected layers are broken down into a series of vector dot-products, each of which generates a scalar and comprises a set of Multiply-Accumulate (MACC) operations. State-of-the-art digital [10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31] and mixed-signal [32, 33, 35, 36, 37, 38, 39, 40, 34, 43, 41, 42] accelerators use a large array of stand-alone MACC units to perform the necessary computations. When moving to the mixed-signal domain, this stand-alone arrangement of MACC operations imposes significant overhead in the form of A/D and D/A conversions for each operation. The root cause is the high cost of converting the operands and outputs of each MACC to and from the analog domain, respectively.

This paper aims to address the aforementioned list of challenges by making the following three contributions.

(1) This work offers and leverages the insight that the set of MACC operations within a vector dot-product can be partitioned, rearranged, and interleaved at the bit level without affecting the mathematical integrity of the vector dot-product. Unlike prior work [33, 42, 44], this work does not rely on changing the mathematics of the computation to enable mixed-signal acceleration. Instead, it only rearranges the bit-wise arithmetic calculations to utilize lower bitwidth analog units for higher bitwidth operations. The key insight is that a binary value can be expressed as a sum of products, just as a dot-product is a sum of multiplications ( $a=\vec{X}\bullet\vec{W}=\sum_{i}x_{i}\times w_{i}$ ). A value $b$ can be expressed as $b=\sum_{i}(2^{i}\times b_{i})$ , where the $b_{i}$ are the individual bits, or as $b=\sum_{i}(2^{4i}\times bp_{i})$ , where the $bp_{i}$ are, for instance, 4-bit partitions. Our interleaved bit-partitioned arithmetic effectively utilizes the distributive and associative properties of multiplication and addition at the bit granularity.

The proposed model first bit-partitions all elements of the two vectors, and then distributes the MACC operations of the dot-product over the bit partitions. Therefore, the lower bitwidth MACC becomes the basic operator that is applied to each bit-partition. Then, our mathematical formulation exploits the associative property of the multiply and add to group bit-partitions that are at the same significance position. This significance-based rearrangement enables factoring out the power-of-two multiplicand that signifies the position of the bit-partitions. The factoring enables performing the wide group-based low-bitwidth MACC operations simultaneously as a spatially parallel operation in the analog domain, while the group shares a single A/D converter. The power-of-two multiplicand is applied later in the digital domain to the accumulated result of the group operation. To this end, we rearchitect vector dot-product as a series of wide (across multiple elements of the two vectors), interleaved and bit-partitioned arithmetic and re-aggregation. Therefore, our reformulation significantly reduces the rate of costly A/D conversion by rearranging the bit-level operations across the elements of the vector dot-product. Using low-bitwidth operands for analog MACCs provides a larger headroom between the value encoding levels in the analog domain. The headroom tackles the limited range of encoding and offers higher robustness to noise, an inherent non-ideality in the analog mode. Additionally, using lower bitwidth operands reduces the energy/area overhead imposed by A/D and D/A converters, which roughly scales exponentially with the bitwidth of operands.

(2) At the circuit level, the accelerator is designed using switched-capacitor circuitry that stores the partial results as electric charge over time without conversion to the digital domain at each cycle. The low-bitwidth MACCs are performed in the charge domain with a set of charge-sharing capacitors. This design choice lowers the rate of A/D conversion as it implements accumulation as a gradual storage of charge in a set of parallel capacitors. These capacitors not only aggregate the result of a group of low-bitwidth MACCs, but also enable accumulating results over time. As such, the architecture enables dividing the longer vectors into shorter sub-vectors that are multiply-accumulated over time with a single group of low-bitwidth MACCs. The results are accumulated over multiple cycles in the group’s capacitors. Because the capacitors can hold the charge from cycle to cycle, the A/D conversion is not necessary in each cycle. This reduction in rate of A/D conversion is in addition to the amortized cost of A/D converters across the bit-partitioned analog MACCs of the group.

(3) Based on these insights, we devise a clustered 3D-stacked microarchitecture, dubbed BiHiwe, that integrates a copious number of low-bitwidth switched-capacitor MACC units, which enable the interleaved bit-partitioned arithmetic. The lower energy of mixed-signal computations offers the possibility of integrating a larger number of these units compared to their digital counterpart. To efficiently utilize the more sizable number of compute units, a higher bandwidth memory subsystem is needed. Moreover, one of the large sources of energy consumption in DNN acceleration is off-chip DRAM accesses [30, 28, 23]. For these reasons, the clustered architecture of BiHiwe leverages 3D-stacking for its higher bandwidth and lower data transfer energy.

Evaluating the carefully balanced design of BiHiwe with ten DNN benchmarks shows that BiHiwe delivers 4.9 $\times$ speedup over the leading purely digital 3D-stacked DNN accelerator, Tetris [12], with less than 0.5% loss in accuracy achieved after mitigating noise, computation error, and Process-Voltage-Temperature (PVT) variations. With 8-bit execution, BiHiwe offers 33.1 $\times$ and 66.5 $\times$ higher Performance-per-Watt compared to RTX 2080 TI and Titan Xp, respectively. With these benefits, this paper marks an initial effort that paves the way for a new shift in DNN acceleration.

2 Wide, Interleaved, and Bit-Partitioned Arithmetic

A key idea of this work is the mathematical insight that enables utilizing low bitwidth mixed-signal units in spatially parallel groups. This section demonstrates this insight.

Bit-Level partitioning and interleaving of MACCs. To further detail the proposed mathematical reformulation, Figure 1(a) delves into the bit-level operations of dot-product on vectors with two elements containing 4-bit values. As illustrated with different colors, each 4-bit element can be written as a sum of 2-bit partitions multiplied by powers of 2 (shift). As discussed, vector dot-product is also a sum of multiplications. Therefore, by utilizing the distributive property of addition and multiplication, we can rewrite the vector dot-product in terms of the bit partitions. However, we also leverage the associativity of the addition and multiplication to group the bit-partitions in the same positions together. For instance, in Figure 1, the black partitions that represent the Most Significant Bits (MSBs) of the $\vec{W}$ vector are multiplied in parallel with the teal partitions (teal in Figure 1 is the darkest gray in black and white prints), representing the MSBs of the $\vec{X}$ . Because of the distributivity of multiplication, the shift amount of (2+2) can be postponed until after the bit-partitions are multiply-accumulated. The different colors of the boxes in Figure 1 illustrate the interleaved grouping of the bit-partitions. Each group is a set of spatially parallel bit-partitioned MACC operations that are drawn from different elements of the two vectors. The low-bitwidth nature of these operations enables execution in the analog domain without the need for A/D conversion for each individual bit-partitioned operation. As such, our proposed reformulation amortizes the cost of A/D conversion across the bit-partitions of different elements of the vectors as elaborated below.

Wide, interleaved, and bit-partitioned vector dot-product. Figure 1(b) illustrates the proposed vector dot-product operation with 4-bit elements that are bit partitioned to 2-bit sub-elements. For instance, as illustrated, the elements of vector $X$ , denoted as $x_{i}$ , are first bit partitioned to $x_{i}^{L}$ and $x_{i}^{M}$ . The former represents the two Least Significant Bits (LSBs) and the latter represents the two Most Significant Bits (MSBs). Similarly, the elements of vector $W$ are also bit partitioned to the $w_{i}^{L}$ and $w_{i}^{M}$ sub-elements. Then, each vector (e.g., $W$ ) is rearranged into two bit-partitioned sub-vectors, $W^{LSBs}$ and $W^{MSBs}$ . In the current implementations of BiHiwe architecture, the size of bit-partition is fixed across the entire architecture. Therefore, the rearrangement is just a rewiring of the bits to the compute units, which imposes minimal overhead (less than 1%). Figure 1 is merely an illustration and there is no need for extra storage or movement of elements. As depicted with color coding, after the rewiring, $W^{LSBs}$ represents all the least significant bit-partitions from different elements of vector $W$ , while the MSBs are rewired in $W^{MSBs}$ . The same rewiring is repeated for the vector $X$ . This rearrangement puts all the bit-partitions with the same significance, from all the elements of the vectors, in one group, denoted as $W^{LSBs}$ , $W^{MSBs}$ , $X^{LSBs}$ , $X^{MSBs}$ . Therefore, when a pair of the groups (e.g., $X^{MSBs}$ and $W^{MSBs}$ in Figure 1(c)) are multiplied to generate the partial products, (1) the shift amount (“ $\ll 4$ ” in this case) is the same for all the bit-partitions and (2) the shift can be done after partial products from different sub-elements are accumulated together.
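
Written out for the 4-bit elements and 2-bit partitions of this example, the regrouping amounts to the identity below. This is a worked restatement under the simplifying assumption of unsigned operands, not an equation taken from the paper; each of the four sums is a dot-product between two of the sub-vectors $X^{MSBs}$ , $X^{LSBs}$ , $W^{MSBs}$ , $W^{LSBs}$ , and its power-of-two factor is applied once, after the accumulation.

```latex
\begin{align*}
x_i &= 2^{2}x_i^{M} + x_i^{L}, \qquad w_i = 2^{2}w_i^{M} + w_i^{L} \\
\vec{X}\bullet\vec{W} &= \sum_i x_i w_i
  = \sum_i \left(2^{2}x_i^{M} + x_i^{L}\right)\left(2^{2}w_i^{M} + w_i^{L}\right) \\
  &= 2^{4}\sum_i x_i^{M}w_i^{M}
   + 2^{2}\sum_i x_i^{M}w_i^{L}
   + 2^{2}\sum_i x_i^{L}w_i^{M}
   + \sum_i x_i^{L}w_i^{L}
\end{align*}
```

The first term is the MSB-by-MSB group whose deferred shift is the $\ll 4$ mentioned above.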

As shown in Figure 1(c), the low-bitwidth elements are multiplied together and accumulated in the analog domain. Accumulation in the digital domain would require an adder tree which is costly compared to the analog accumulation that merely requires connectivity between the multiplier outputs. It is only after several analog multiply-accumulations that the results are converted back to digital for shift and aggregation with partial products from the other groups. The size of the vectors usually exceeds the number of parallel low-bitwidth MACCs, in which case the results need to be accumulated over multiple iterations. As will be discussed in the next section, the accumulations are performed in two steps. The first step accumulates the results in the analog domain through charge accumulation in capacitors before the A/D converters (see Figure 1(c)). In the second step, these converted accumulations are added up in the digital domain using a register. For this pattern of computation, we are effectively utilizing the distributive and associative properties of multiplication and addition for dot-product but at the bit granularity. This rearrangement and spatially parallel (i.e., wide) bit-partitioned computation is in contrast with temporally bit-serial digital [17, 13, 31, 45] and analog [32] DNN accelerators.
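
The reformulation in this section can be checked numerically with a few lines of Python. The sketch below is only a functional model: it assumes unsigned operands, ignores all analog effects, and the names `partition` and `bit_partitioned_dot` are illustrative rather than taken from the paper. Each inner `group_sum` plays the role of one wide group of low-bitwidth MACCs, and the shift is applied once per group after accumulation.

```python
def partition(value, part_bits, num_parts):
    """Split an unsigned integer into num_parts fields of part_bits bits, LSB first."""
    mask = (1 << part_bits) - 1
    return [(value >> (part_bits * k)) & mask for k in range(num_parts)]


def bit_partitioned_dot(xs, ws, bitwidth=8, part_bits=2):
    """Dot product built from groups of low-bitwidth MACCs with deferred shifts."""
    if bitwidth % part_bits != 0:
        raise ValueError("bitwidth must be a multiple of part_bits")
    num_parts = bitwidth // part_bits
    x_parts = [partition(x, part_bits, num_parts) for x in xs]
    w_parts = [partition(w, part_bits, num_parts) for w in ws]
    total = 0
    for j in range(num_parts):        # significance of the X partition
        for k in range(num_parts):    # significance of the W partition
            # One group: low-bitwidth MACCs across all vector elements.
            group_sum = sum(xp[j] * wp[k] for xp, wp in zip(x_parts, w_parts))
            # The power-of-two factor is applied once, after accumulation.
            total += group_sum << (part_bits * (j + k))
    return total


xs = [200, 17, 255, 3]
ws = [9, 130, 255, 64]
reference = sum(x * w for x, w in zip(xs, ws))
print(bit_partitioned_dot(xs, ws) == reference)                            # True
print(bit_partitioned_dot(xs, ws, bitwidth=8, part_bits=4) == reference)   # True
```

The result matches the ordinary dot-product for any partition size that divides the bitwidth, which is the sense in which the rearrangement leaves the mathematics of the computation unchanged.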

The next section describes the architecture of the mixed-signal accelerator that leverages our mathematical reformulation. This architecture is essentially a collection of the structure that is depicted in Figure 1(c). The structure is the Mixed-Signal Wide Aggregator ( $\mathcal{MS}$ -WAgg) that spatially aggregates the results from its four units as illustrated. Each of these four units, which are also wide, is a Mixed-Signal Bit-Partitioned MACC ( $\mathcal{MS}$ -BPMacc). Note that the number of $\mathcal{MS}$ -BPMaccs in a $\mathcal{MS}$ -WAgg is a function of the bitwidth of the vector elements and the value of bit-partitioning.

3 Mixed-Signal Architecture Design for Wide Bit-Partitioning

To exploit the aforementioned arithmetic, BiHiwe comes with a mixed-signal building block that performs wide bit-partitioned vector dot-product. BiHiwe then organizes these building blocks in a clustered hierarchical design to efficiently make use of its copious number of parallel low-bitwidth mixed-signal MACC units. The clustered design is crucial as the mixed-signal paradigm enables integrating a larger number of parallel operators than the digital counterpart.

3.1 Wide Bit-Partitioned Mixed-Signal MACC

As Figure 2(a) shows, the building block of BiHiwe is a collection of low-bitwidth analog MACCs that operate in parallel on sub-elements from the two vectors under dot-product. This wide structure is dubbed $\mathcal{MS}$ -BPMacc. We design the low-bitwidth MACCs using switched-capacitor circuitry for the following reason. This design choice lowers the rate of A/D conversion as it implements accumulation as a gradual storage of charge in a set of parallel capacitors. These capacitors not only aggregate the results of low-bitwidth MACCs, but also enable accumulating results over time. As such, longer vectors are divided into shorter sub-vectors that are multiply-accumulated over time without the need to convert the intermediate results back to the digital domain. It is only after processing multiple sub-vectors that the accumulated result is converted to digital, significantly reducing the rate of costly A/D conversions. As shown in Figure 2(a), each low-bitwidth MACC unit is equipped with its own pair of local capacitors, which perform the accumulation over time across multiple sub-vectors. As will be discussed in Section 4, the pair is used to handle positive and negative values by accumulating them separately on one or the other capacitor. After a pre-determined number of private accumulations in the analog domain, the partial results need to be accumulated across the low-bitwidth MACCs. In that cycle, the transmission gates between the capacitors (Figure 2(a)) connect them and a simple charge sharing between the capacitors yields the accumulated result for the $\mathcal{MS}$ -BPMacc. That is when a single A/D conversion is performed, the cost of which is not only amortized across the parallel MACC units but also over time across multiple sub-vectors.
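
The two sources of amortization described above (sharing one converter across parallel MACCs, and accumulating charge over several cycles before converting) can be illustrated by counting conversions. In the sketch below, `macc_width`, `cycles_per_conversion`, and the vector length are hypothetical values chosen for illustration; only their product of 256 MACCs per conversion is suggested by Table 1, and the split between width and cycles is an assumption.

```python
import math


def adc_conversions(vector_len, macc_width, cycles_per_conversion):
    """A/D conversions one bit-partition group needs for one dot-product."""
    elements_per_conversion = macc_width * cycles_per_conversion
    return math.ceil(vector_len / elements_per_conversion)


vector_len = 4096  # hypothetical dot-product length

per_operation = adc_conversions(vector_len, macc_width=1, cycles_per_conversion=1)
shared_only = adc_conversions(vector_len, macc_width=8, cycles_per_conversion=1)
shared_and_accumulated = adc_conversions(vector_len, macc_width=8, cycles_per_conversion=32)

print(per_operation, shared_only, shared_and_accumulated)  # 4096 512 16
```

In this toy count, spatial sharing alone cuts the number of conversions by the group width, and charge accumulation over time cuts it further by the number of cycles between conversions.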

3.2 Mixed-Signal Wide Aggregator

$\mathcal{MS}$ -BPMaccs only process low-bitwidth operands and cannot, on their own, combine these operations to enable higher bitwidth dot-products. A collection of $\mathcal{MS}$ -BPMaccs can provide this capability as discussed with Figure 1 in Section 2. This structure is named $\mathcal{MS}$ -WAgg as it is a Mixed-Signal Wide Aggregator. Figure 2(b) depicts a 2D array of a possible $\mathcal{MS}$ -WAgg design, comprising 16 $\mathcal{MS}$ -BPMaccs that are necessary to perform 8-bit by 8-bit vector dot-product with 2-bit partitioning. In this case, the number 16 comes from the fact that each of the two 8-bit operands can be partitioned to four 2-bit values. Each of the four 2-bit partitions of the multiplicand needs to be multiply-accumulated with all four 2-bit partitions of the multiplier. As discussed in Section 2, each $\mathcal{MS}$ -WAgg also performs the necessary shift operations to combine the low-bitwidth results from its 16 $\mathcal{MS}$ -BPMaccs. By aggregating the partial results of each $\mathcal{MS}$ -BPMacc, the $\mathcal{MS}$ -WAgg unit generates a scalar output which is stored on its output register. As illustrated in Figure 2, a collection of these $\mathcal{MS}$ -WAggs constitutes an accelerator core from which the clustered architecture of BiHiwe is designed.
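
The count of 16 units and the shifts that the aggregator applies follow directly from the operand bitwidth and the partition size. The short sketch below enumerates them; `wagg_shift_table` is a hypothetical helper name, and the row and column ordering (least significant partition first) is an assumption made for illustration, not the physical layout of Figure 2(b).

```python
def wagg_shift_table(bitwidth=8, part_bits=2):
    """Left-shift amount for each (x-partition, w-partition) pair, LSB first."""
    parts = bitwidth // part_bits
    return [[part_bits * (j + k) for k in range(parts)] for j in range(parts)]


table = wagg_shift_table()
num_units = sum(len(row) for row in table)
print(num_units)  # 16
for row in table:
    print(row)
# [0, 2, 4, 6]
# [2, 4, 6, 8]
# [4, 6, 8, 10]
# [6, 8, 10, 12]
```

Calling `wagg_shift_table(4, 2)` gives four units with shifts 0, 2, 2, and 4, which matches the 4-bit example of Section 2 where the MSB-by-MSB group is shifted by 4.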

3.3 Hierarchically Clustered Architecture

As discussed in Section 4, the proposed $\mathcal{MS}$ -WAgg consumes $5.4\times$ less energy for a single 8-bit MACC in comparison with digital logic (1 pJ taken from the Tetris simulator [46], which is commensurate with other reports [47, 48]). As such, it is possible to integrate a larger number of mixed-signal compute units on a chip with a given power budget compared to a digital architecture. To efficiently utilize the larger number of available compute units, a high bandwidth memory substrate is required. Moreover, one of the large sources of energy consumption in DNN acceleration is off-chip DRAM accesses [30, 28, 23]. To maximize the benefits of the mixed-signal computation, 3D-stacked memory is an attractive option since it reduces the cost of data accesses and provides a higher bandwidth for data transfer between the on-chip compute and off-chip memory [12, 25]. Based on these insights, we devise a clustered architecture for BiHiwe with a 3D-stacked memory substrate as shown in Figure 2(c). The mixed-signal logic die of BiHiwe is stacked over the DRAM dies with multiple vaults, each of which is connected to the logic die with several through-silicon vias (TSVs). The 3D memory substrate of BiHiwe is modeled using Micron’s Hybrid Memory Cube (HMC) [49, 50], which has been shown to be a promising technology for DNN acceleration [12]. As the results in Section 8.2 (Figure 15) show, a flat systolic design would result in significant underutilization of the compute resources and bandwidth from 3D stacking.

Therefore, BiHiwe is a hierarchically clustered architecture that allocates multiple accelerator cores as a cluster to each vault. Figure 2(b) depicts a single core. As shown in Figure 2(b), each core is self-sufficient and packs a mixed-signal systolic array of $\mathcal{MS}$ -WAggs as well as the digital units that perform pooling, activation, normalization, etc. The mixed-signal array is responsible for the convolutional and fully connected layers. Generally, wide and interleaved bit-partitioned execution within $\mathcal{MS}$ -WAggs is orthogonal to the organization of the accelerator architecture. This paper explores how to embed them, together with the proposed compute model, within a systolic design, enabling end-to-end programmable mixed-signal acceleration for a variety of DNNs.

Accelerator core. As Figure 2(b) depicts, the first level of hierarchy is the accelerator core and its 2D systolic array that utilizes the $\mathcal{MS}$ -WAggs. As depicted, the Input Buffers and Output Buffers are shared across the columns and rows, respectively. Each $\mathcal{MS}$ -WAgg has its own Weight Buffer. This organization is commensurate with other designs and reduces the cost of on-chip data accesses as inputs are reused with multiple filters [26]. What makes our design different is that each buffer needs to supply a sub-vector, not a scalar, in each cycle to the $\mathcal{MS}$ -WAggs. The $\mathcal{MS}$ -WAgg, however, generates only a scalar since dot-product generates a scalar output. The rewiring of the inputs and weights is already done inside the $\mathcal{MS}$ -WAggs since the size of bit-partitions is fixed. As such, there is no need to reformat any of the inputs, activations, or weights. As the outputs of $\mathcal{MS}$ -WAggs flow down the columns, they get accumulated to generate the output activations that are fed to each column's dedicated Normalization/Activation/Pooling Units. To preserve the accuracy of the DNN model, the intermediate results are stored as 32-bit digital values and intra-column aggregations are performed in the digital mode.

On-chip data delivery for accelerator cores. To minimize data movement energy and maximally exploit the large degrees of data-reuse offered by DNNs, BiHiwe uses a statically-scheduled bus that is capable of multicasting/broadcasting data across accelerator cores. Compared to complex interconnections, the choice of statically-scheduled bus significantly simplifies the hardware by alleviating the need for complicated arbitration logic and FIFOs/buffers required for dynamic routing. Moreover, the static schedule enables the BiHiwe compiler stack to cut the DNN layers across cores while maximizing inter- and intra-core data-reuse. The static schedule is encoded in the form of data communication instructions (Section 7) that are responsible for (1) fetching data tiles from the 3D-stacked memory and distributing them across cores or (2) writing output tiles back from the cores to the memory.

Parallelizing computations across accelerator cores. Data-movement energy is a significant portion of the overall energy consumption both for digital designs [12, 23, 24, 28, 30, 51] and analog designs [33, 35]. As such, the BiHiwe clustered architecture (1) divides the computations into tiles that fit within the limited on-chip capacity of the scratchpads that are private for each accelerator core, and (2) cuts the tiles of computations across cores to minimize DRAM accesses by maximally utilizing the multicast/broadcasting capabilities of the BiHiwe on-chip data delivery network. To simplify the design of the accelerator cores, the scratchpad buffers are private to each core and the shared data is replicated across multiple cores. Thus, a single tile of data can be read once from the 3D-stacked memory and then be broadcast/multicast across cores to reduce DRAM accesses. The cores use double-buffering to hide the latency of memory accesses for subsequent tiles. The accelerator cores use output-stationary dataflow that minimizes the number of A/D conversions by accumulating results in the charge domain. Section 6 discusses the BiHiwe compiler stack that optimizes the cuts and tile sizes for individual DNN layers.

4 Switched-Capacitor Circuit Design for Bit-Partitioning

BiHiwe exploits switched-capacitor circuitry [36, 34, 43, 42, 41] for $\mathcal{MS}$ -BPMacc by implementing MACC operations in the charge domain rather than using resistive ladders to compute in the current domain [32, 40, 44]. Compared to the current-domain approach, switched capacitors (1) enable result accumulation in the analog domain by storing results as electric charge, eliminating the need for A/D conversion at every cycle, and (2) make multiplications depend only on the ratio of the capacitor sizes rather than their absolute capacitances. The second property enables a reduction of capacitor sizes, improving the energy and area of MACC units as well as making them more resilient to process variation. The following discusses the details of the $\mathcal{MS}$ -BPMacc circuitry.

4.1 Low-Bitwidth Switched-Capacitor MACC

Figure 3 depicts the design of a single 3-bit sign-magnitude MACC. Here, $x_{s}x_{1}x_{0}$ and $w_{s}w_{1}w_{0}$ denote the bit-partitioned operands. The result of each MACC operation is retained as electric charge in the accumulating capacitor (CACC). In addition to CACC, the MACC unit contains two capacitive Digital-to-Analog Converters, one for inputs (C-DACx) and one for weights (C-DACw). C-DACx and C-DACw convert the 2-bit magnitudes of the input and weight to the analog domain as electric charges proportional to $|x|$ and $|w|$ , respectively. C-DACx and C-DACw are each composed of two capacitors ((Cx, $2$ Cx) and (Cw, $2$ Cw)) which operate in parallel and are combined to convert the operands to the analog domain. Each of these capacitors is controlled by a pair of transmission gates which determine whether the capacitor is active or inactive. Another set of transmission gates connects the two C-DACs and shares charge when partitions of $x$ and $w$ are multiplied. The resulting shared charge is stored on either CACC+ or CACC- depending on the “sign” control signal produced by $x_{s}\oplus w_{s}$ . During multiplication, the transmission gates are coordinated by a pair of complementary non-overlapping clock signals, $Clk$ and $\overline{Clk}$ .

Charge-domain MACC. Figure 4 shows the phase-by-phase process of a MACC and its corresponding active circuits, the phases of which are described below.

$Clk_{\phi(1)}$ : The first phase (Figure 4(a)) consists of the input capacitive DAC converting digital input ( $x$ ) to a charge proportional to the magnitude of the input $|x|\textsc{C\textsubscript{x}}$ . As a result, the sampled charge ( $Q_{sx}$ ) in C-DACx in the first phase is equal to:

$$Q_{sx}=v_{DD}\times|x|\,\textsc{C\textsubscript{x}}\qquad(1)$$

$\overline{Clk}_{\phi(2)}$ : In the second phase (Figure 4(b)), the multiplication happens via a charge-sharing process between C-DACx and C-DACw. C-DACw converts the $|w|$ to the charge domain. At the same time, the C-DACx redistributes its sampled charge ( $Q_{sx}$ ) over all of its capacitors ( $3\times\textsc{C\textsubscript{x}}$ ) as well as the equivalent capacitor of C-DACw. The voltage ( $V_{s}$ ) at the junction of C-DACx and C-DACw is as follows:

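$$V_{s}=\frac{v_{DD}\times|x|\,\textsc{C\textsubscript{x}}}{3\textsc{C\textsubscript{x}}+|w|\,\textsc{C\textsubscript{w}}}\qquad(2)$$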

Because the sampled charge is shared with the weight capacitors, the stored charge ( $Q_{sw}$ ) on C-DACw is equal to:

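$$Q_{sw}=v_{DD}\times\frac{|x|\,\textsc{C\textsubscript{x}}\times|w|\,\textsc{C\textsubscript{w}}}{3\textsc{C\textsubscript{x}}+|w|\,\textsc{C\textsubscript{w}}}\qquad(3)$$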

Equation 3 shows that the stored charge on C-DACw is proportional to $|x|\times|w|$ , but includes a non-linearity due to the $|w|$ term in the denominator. To suppress this non-linearity, Cx and Cw must be chosen such that $3\textsc{C\textsubscript{x}}\gg|w|\textsc{C\textsubscript{w}}$ . This design choice does not completely suppress the non-linearity, but the residual error can be mitigated as discussed in Section 5. With this choice, $Q_{sw}$ becomes approximately $Q_{sw}=|x|\times|w|\frac{\textsc{C\textsubscript{w}}v_{DD}}{3}$ .

$Clk_{\phi(3)}$ : In the last phase (Figure 4(c)), the charge from multiplication is shared with CACC for accumulation. The sign bits ( $x_{s}$ and $w_{s}$ ) determine which of CACC+ or CACC- is selected for accumulation. The charge sampled by $|w|\textsc{C\textsubscript{w}}$ is then redistributed over the selected CACC as well as all the capacitors of C-DACw ( $=3\textsc{C\textsubscript{w}}$ ). Theoretically, CACC must be infinitely larger than $3\textsc{C\textsubscript{w}}$ to completely absorb the charge from multiplication. In reality, however, some charge remains unabsorbed, leading to a pattern of computational error, which is mitigated as discussed in Section 5. Ideally, the $V_{ACC}$ voltage on CACC is:

[TABLE]

While the charge sharing and accumulation happen on CACC, a new input is fed into C-DACx, starting a new MACC process in a pipelined fashion. This process repeats for all low-bitwidth MACC units over multiple cycles before one A/D conversion.
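The non-linearity introduced in the second phase can be illustrated numerically. The sketch below is a minimal model, assuming that C-DACx samples a charge |x|·Cx·vDD, that this charge then spreads over 3·Cx plus the active weight capacitance |w|·Cw, and that the product is the charge left on |w|·Cw. The capacitor values, the supply voltage, and all function names are illustrative assumptions and are not taken from the design.

```python
def shared_charge(x_mag, w_mag, c_x, c_w, v_dd):
    """Charge left on the active weight capacitors after charge sharing."""
    q_sx = x_mag * c_x * v_dd               # phase 1: charge sampled by C-DACx
    v_s = q_sx / (3 * c_x + w_mag * c_w)    # phase 2: voltage at the shared node
    return w_mag * c_w * v_s                # charge held by the |w|*Cw capacitors


def ideal_charge(x_mag, w_mag, c_w, v_dd):
    """Linearised product, valid when 3*Cx is much larger than |w|*Cw."""
    return x_mag * w_mag * c_w * v_dd / 3


def worst_relative_error(beta, c_w=1e-15, v_dd=1.0):
    """Largest relative deviation over all non-zero 2-bit magnitudes.

    beta is the capacitor ratio Cx/Cw; c_w and v_dd are arbitrary
    illustrative values (assumptions).
    """
    c_x = beta * c_w
    worst = 0.0
    for x_mag in range(1, 4):
        for w_mag in range(1, 4):
            real = shared_charge(x_mag, w_mag, c_x, c_w, v_dd)
            ideal = ideal_charge(x_mag, w_mag, c_w, v_dd)
            worst = max(worst, abs(ideal - real) / ideal)
    return worst


for beta in (1, 4, 16, 64):
    print(f"beta={beta:3d}  worst relative error={worst_relative_error(beta):.4f}")
```

In this simplified model the worst case is |w| = 3, where the relative error equals 1/(β+1), so increasing the ratio Cx/Cw shrinks the error but never removes it entirely.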

4.2 Wide Mixed-Signal Bit-Partitioned MACC

Figure 5(a) depicts an array of $n$ switched-capacitor MACCs, constituting the $\mathcal{MS}$ -BPMacc unit, which perform operations for $m$ cycles in the analog domain and store the results locally on their CACCs. Figure 5(b) depicts the control signals and cycles of operations. For the BiHiwe microarchitecture, $m$ and $n$ are set to $32$ and $8$ , respectively, based on design space exploration (see Figure 14). Over $m$ cycles, the results of $m\times n$ low-bitwidth MACC operations are accumulated in the CACCs, which are private to each MACC unit. In cycle $m+1$ , the private results are aggregated across all the MACC units within the $\mathcal{MS}$ -BPMacc. The single A/D converter in the $\mathcal{MS}$ -BPMacc converts the aggregated result, and this conversion also starts at cycle $m+1$ .

In the first phase of cycle $m+1$ , all the $n$ accumulating capacitors that store the positive values (CACC+) are connected together through a set of transmission gates to share their charge. Simultaneously, the same process happens for the CACC- capacitors. $Clk_{ACC}$ in Figure 5 is the control signal that connects the CACCs. The accumulating capacitors (CACCs) are also connected to a Successive Approximation Register (SAR) ADC and share their stored charge with the Sample and Hold (S&H) block of the ADC. The S&H block has differential inputs that sample the positive and negative results separately, subtract them, and hold the difference for the A/D conversion. In the second phase of cycle $m+1$ , $Clk_{rst}$ connects all the CACCs to ground to clear them for the next iteration of wide, bit-interleaved calculations.
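The sequence of private accumulation, aggregation, and differential subtraction can be summarized with an idealized functional model. The sketch below ignores all analog effects (noise, incomplete charge transfer, ADC quantization) and represents charge as plain integers; the function name and the data layout (one row of operands per cycle) are assumptions made for illustration.

```python
import random


def ms_bpmacc_ideal(inputs, weights):
    """Ideal functional model of one MS-BPMacc pass.

    inputs, weights: m rows (cycles) of n signed values in [-3, 3],
    i.e. 3-bit sign-magnitude operands, one column per MACC unit.
    Returns the value the ADC would convert after cycle m+1.
    """
    n = len(inputs[0])
    acc_pos = [0] * n  # stands in for CACC+ of each MACC unit
    acc_neg = [0] * n  # stands in for CACC- of each MACC unit
    for x_row, w_row in zip(inputs, weights):          # m accumulation cycles
        for unit, (x, w) in enumerate(zip(x_row, w_row)):
            product = abs(x) * abs(w)
            if (x < 0) != (w < 0):                     # sign = x_s XOR w_s
                acc_neg[unit] += product
            else:
                acc_pos[unit] += product
    # Cycle m+1: charge is shared across the n units, then the
    # differential S&H subtracts the negative from the positive total.
    return sum(acc_pos) - sum(acc_neg)


rng = random.Random(0)
m, n = 32, 8
xs = [[rng.randint(-3, 3) for _ in range(n)] for _ in range(m)]
ws = [[rng.randint(-3, 3) for _ in range(n)] for _ in range(m)]
reference = sum(x * w for xr, wr in zip(xs, ws) for x, w in zip(xr, wr))
print(ms_bpmacc_ideal(xs, ws) == reference)
```

With $m=32$ and $n=8$, one A/D conversion in this model covers $m\times n=256$ low-bitwidth MACC operations.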

There is a tradeoff between the resolution and the sampling rate of an ADC, which also determines its topology. A SAR ADC is a better choice for medium resolution (8-12 bits) and medium sampling rates (1-500 Mega-Samples/sec). We choose a 10-bit, 15 Mega-Samples/sec SAR ADC as it strikes a better balance between speed and resolution for $\mathcal{MS}$ -BPMaccs. The design space exploration in Figure 14 shows that this choice makes the grouping of 8 low-bitwidth MACCs optimal for $m=32$ cycles of operation. The A/D conversion takes $m+1$ cycles and is pipelined with the sub-vector dot-product. Table 1 shows the energy breakdown within an $\mathcal{MS}$ -BPMacc that uses 2-bit partitioning. As shown, performing an 8-bit MACC using the interleaved bit-partitioned arithmetic requires $5.4\times$ less energy than a digital MACC, which consumes around 1 pJ [12].

5 Mixed-Signal Non-Idealities and Their Mitigation

Although analog circuitry offers a significant reduction in energy, it might lead to accuracy degradation. Thus, its error needs to be properly modeled and accounted for. Specifically, $\mathcal{MS}$ -BPMaccs, the main analog component, can be susceptible to (1) thermal noise, (2) computational error caused by incomplete charge transfer, and (3) PVT variations. Traditionally, analog circuit designers mitigate sources of error by configuring hardware parameters to values that are robust to non-idealities. Such hardware parameter adjustments require significant energy/area overheads that scale linearly with the number of modules. The overheads are acceptable in conventional analog designs since modules are few in number. However, due to the repetitive and scaled-up nature of our design, we need to mitigate these non-idealities at a higher, algorithmic level. We leverage the training algorithm’s inherent mechanism to reduce error (loss) and use mathematical models to represent these non-idealities. We then apply these models during the forward pass to adjust and fine-tune pre-trained neural models with just a few more epochs across the chips within a technology node. The rest of this section details the non-idealities and their modeling. It then elaborates on how PVT variations are considered in the formulations.

5.1 Thermal Noise

Thermal noise is an inherent perturbation in analog circuits caused by the thermal agitation of electrons, distorting the main signal. This noise can be modeled as a normal distribution around the ideal voltage, whose standard deviation is determined by the working temperature ( $T$ ), the Boltzmann constant ( $k$ ), and the capacitor size ( $C$ ): $\sigma=\sqrt{kT/C}$ . Within BiHiwe, switched-capacitor MACC units are mainly affected by the combined thermal noise resulting from the weight and accumulator capacitors (Cw and CACC, respectively). The noise from these capacitors is accumulated during the $m$ cycles of computation for each individual MACC unit and then aggregated across the $n$ MACC units in the $\mathcal{MS}$ -BPMacc. By applying the thermal noise equation used for similar MACC units [42] to an $\mathcal{MS}$ -BPMacc unit, the standard deviation at the output is described by Equation 5:

[TABLE]

In the above equation, $\alpha$ is equal to $\frac{\textsc{C\textsubscript{ACC}}}{3\textsc{C\textsubscript{w}}}$ . We apply the effect of thermal noise in the forward propagation of DNN by adding an error tensor to the output of convolutional and fully connected layers. Having computed the standard deviation of noise for a single $\mathcal{MS}$ -BPMacc ( $\sigma_{ACC}$ ), each element of the error tensor is sampled from a normal distribution as follows:

[TABLE]

In the above equation, $\sigma_{ACC}$ is scaled by $r$ , the number of $\mathcal{MS}$ -BPMacc operations required to generate one element of the output feature map, and by $85$ , the total amount of bit-shifts applied to each result by the $\mathcal{MS}$ -WAgg unit.
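As a rough illustration of the noise-injection step, the sketch below computes the basic $\sqrt{kT/C}$ deviation for a single capacitor and adds a zero-mean Gaussian error to a list of layer outputs. It does not reproduce Equations 5 and 6: the `sigma` passed to `add_noise` stands for whatever already-scaled deviation those equations yield, and the 1 fF capacitance, the output values, and the function names are assumptions chosen only for the example.

```python
import math
import random

BOLTZMANN = 1.380649e-23  # Boltzmann constant in J/K


def ktc_sigma(capacitance_f, temperature_k=300.0):
    """Standard deviation (in volts) of kT/C thermal noise on one capacitor."""
    return math.sqrt(BOLTZMANN * temperature_k / capacitance_f)


def add_noise(outputs, sigma, rng):
    """Add an error term drawn from N(0, sigma^2) to every output element."""
    return [y + rng.gauss(0.0, sigma) for y in outputs]


rng = random.Random(0)
sigma_1ff = ktc_sigma(1e-15)                  # hypothetical 1 fF capacitor
noisy = add_noise([0.10, 0.25, 0.40], sigma_1ff, rng)
print(f"sigma for 1 fF at 300 K: {sigma_1ff * 1e3:.2f} mV")
print(noisy)
```

Because $\sigma$ scales with $1/\sqrt{C}$, quadrupling the capacitance only halves the noise, which is one reason why purely hardware-side mitigation becomes costly.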

5.2 Computational Error

Another source of error in BiHiwe’s charge-domain computations arises when charge is shared between capacitors during the multiplication and accumulation. Within each MACC unit, the input capacitors (C-DACx) transfer a sampled charge to the weight capacitors (C-DACw) to produce charge proportional to the multiplication result. But the resulting charge is subject to error dependent on the ratio of weight and input capacitor sizes ( $\beta=C_{x}/C_{w}$ ) as shown in Equation 3. This shared charge in the weight capacitors introduces more error when it is redistributed to the accumulating capacitor (CACC) which cannot absorb all of the charge, leaving a small portion remaining on the weight capacitors in subsequent cycles. The ideal voltage ( $V_{ACC,Ideal}$ ) produced after $m$ cycles of multiplication can be derived from Equation 4 as follows:

[TABLE]

By considering the computational error from incomplete charge sharing, the actual voltage at the accumulating capacitor after $m$ cycles of MACC operations ( $V_{ACC,R}[m]$ ) becomes:

[TABLE]

Computational error is accounted for in the fine-tuning pass by including the multiplicative factors shown in Equation 8 in weights. During the forward pass, the fine-tuning algorithm decomposes weight tensors in convolutional and fully-connected layers into groups corresponding to $\mathcal{MS}$ -WAgg configuration and updates the individual weight values ( $W_{i}$ ) to new values ( $W_{i}^{\prime}$ ) with the computational error in Equation 9:

[TABLE]

5.3 Process-Voltage-Temperature Variations

Process variations. Switched-capacitor circuits are generally robust to process variations, and we use the sizing of the capacitors to provision for and mitigate them. The robustness and the mitigation are effective because the capacitors are implemented using a number of smaller unit capacitors with the common-centroid layout technique [52]. Specifically, we use metal-fringe capacitors for the MACCs, with a mismatch of just 1% standard deviation [53] and a maximum variation of 6% ( $6\sigma$ ), which is well below the error margins considered for the computational correctness of $\mathcal{MS}$ -BPMaccs.

Temperature variations. We model the temperature variations by adding a perturbation term to $T$ in Equation 5, drawn from a Gaussian distribution $\mathcal{N}_{T}(\mu,\sigma^{2})$ . We consider the maximum value of the temperature to be 358 K, which is commensurate with existing practices [54], and the minimum value to be 300 K. This is the peak-to-peak range of the Gaussian distribution ( $6\sigma$ ).
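The parameters of such a temperature distribution follow from the stated range. The sketch below assumes that the distribution is centred on the midpoint of the 300 K to 358 K range (the text gives only the range and the $6\sigma$ convention, so the centring is an assumption) and checks empirically how many samples fall inside the range.

```python
import random

T_MIN, T_MAX = 300.0, 358.0        # kelvin, range stated in the text
T_MEAN = (T_MIN + T_MAX) / 2       # assumed: distribution centred on the range
T_SIGMA = (T_MAX - T_MIN) / 6      # peak-to-peak range taken as 6 sigma


def sample_temperature(rng):
    """Draw one operating temperature from N(T_MEAN, T_SIGMA^2)."""
    return rng.gauss(T_MEAN, T_SIGMA)


rng = random.Random(0)
samples = [sample_temperature(rng) for _ in range(100_000)]
inside = sum(T_MIN <= t <= T_MAX for t in samples) / len(samples)
print(f"mean={T_MEAN} K, sigma={T_SIGMA:.2f} K, fraction inside range={inside:.4f}")
```

Under these assumptions the mean is 329 K and the standard deviation is about 9.67 K, so nearly all sampled temperatures stay within the stated limits.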

Voltage variations. We also model the voltage variation by adding a Gaussian-distributed term to the $V_{DD}$ term in Equation 9. Our experiments show that voltage variations of up to 20% can be mitigated. The extensive number of vector dot-product operations in DNNs allows the minimum and maximum values of the distributions to be sampled a sufficient number of times, leading to coverage of the corner cases.

On top of all these considerations, we use differential signaling for the ADCs, which attenuates common-mode fluctuations such as PVT variations. To show the effectiveness of our techniques, Figure 6 plots the result of the fine-tuning process for two benchmarks, ResNet-50 and VGG-16, over ten epochs. Table 4 summarizes the accuracy trends for all the benchmarks, which achieve less than 0.5% loss. As Figure 6 shows, for ResNet-50 the fine-tuning pass reduces the initial loss (0.73% for top-1 and 2.41% for top-5) to only 0.04% for top-1 and 0.02% for top-5. VGG-16 is slightly different: fine-tuning reduces the initial loss (1.16% for top-1 and 2.24% for top-5) to less than 0.18% for top-1 and 0.13% for top-5 validation accuracy. The trends are similar for the other benchmarks and are omitted due to space constraints.

6 BiHiwe Compiler Stack

As Figure 7 shows, DNNs are compiled to BiHiwe through a multi-stage process beginning with a Caffe2 [55] DNN specification file. The high-level specification provided in the Caffe2 file is translated to a layer DataFlow Graph (DFG) that preserves the structure of the network. The DFG goes through an algorithm that cuts the DFG and tiles the data to map the DNN computations to the accelerator clusters and cores. The tiling also aims to minimize the transfer of model parameters from the 3D-stacked DRAM to the limited on-chip scratchpads on the logic die, while maximizing the utilization of the compute resources. In addition to the DFG, the cutting/tiling algorithm takes in the architectural specification of BiHiwe. These specifications include the organizations and configurations (# rows, # columns) of the clusters, vaults, and cores as well as details of the $\mathcal{MS}$ -BPMaccs. To identify the best cuts and tilings, the cutting/tiling algorithm exhaustively searches the space of possibilities, which is enabled through an estimation tool. The tool estimates the total energy consumption and runtime for each cuts/tiles pair, which represents the data movement and resource utilization in BiHiwe. Estimation is viable because the DFG does not change, there is no hardware-managed cache, and the accelerator architecture is fixed during execution. Thus, there are no irregularities that can hinder estimation. Algorithm 1 depicts the cutting/tiling procedure. Once the cuts and tiles are determined, the compiler generates the binary code that contains the communication and computation instruction blocks. As in state-of-the-art accelerators [12, 28, 23, 25, 18], all the instructions are statically scheduled. We extend the static scheduling to cluster coordination, data communication, and transfer.
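The overall shape of an estimator-driven exhaustive search can be sketched as follows. This is not Algorithm 1: the cost model in `estimate`, the energy-times-runtime objective, and every field name and number in `layer` and `spec` are hypothetical stand-ins chosen only to make the example runnable.

```python
from itertools import product
from math import ceil


def estimate(layer, cut, tile, spec):
    """Toy stand-in for the estimation tool; returns (energy, runtime) or None."""
    weights_per_tile = tile * layer["weights_per_output"]
    if weights_per_tile > spec["scratchpad_words"]:
        return None  # tile does not fit in a core's private scratchpad
    outputs_per_core = ceil(layer["outputs"] / cut)
    rounds = ceil(outputs_per_core / tile)
    # Assumed: weights are read once; inputs are fetched once per round
    # and broadcast to all cores.
    dram_words = (layer["outputs"] * layer["weights_per_output"]
                  + rounds * layer["input_words"])
    energy = dram_words * spec["dram_energy_per_word"]
    runtime = rounds * weights_per_tile / spec["maccs_per_cycle"]
    return energy, runtime


def search(layer, spec):
    """Exhaustively try every (cut, tile) pair and keep the cheapest one."""
    best = None
    for cut, tile in product(spec["cut_options"], spec["tile_options"]):
        cost = estimate(layer, cut, tile, spec)
        if cost is None:
            continue
        score = cost[0] * cost[1]  # assumed objective: energy x runtime
        if best is None or score < best["score"]:
            best = {"score": score, "cut": cut, "tile": tile}
    return best


layer = {"outputs": 256, "weights_per_output": 1152, "input_words": 50176}
spec = {"scratchpad_words": 32768, "maccs_per_cycle": 256,
        "dram_energy_per_word": 1.0,
        "cut_options": [1, 2, 4], "tile_options": [4, 8, 16, 32]}
choice = search(layer, spec)
print(choice)
```

Exhaustive search is practical here for the reason given above: with a fixed graph, no hardware-managed cache, and a fixed architecture, each candidate can be costed analytically rather than simulated.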

7 BiHiwe Instruction Set

The BiHiwe ISA exposes the following unique properties of its architecture to the software: (1) efficient mixed-signal execution using bit-partitioned $\mathcal{MS}$ -WAggs and capacitive accumulation, and (2) a clustered architecture that takes advantage of the power efficiency of mixed-signal acceleration to scale up the number of $\mathcal{MS}$ -WAggs in BiHiwe. As such, BiHiwe uses a block-structured ISA that segregates the execution of the DNN into (1) data communication instruction blocks that access tiles of data from the 3D-stacked memory and populate the on-chip scratchpads (Input Buffer/Weight Buffer/Output Buffer in Figure 2), and (2) compute instruction blocks, each of which consumes the tile of data produced by a corresponding communication instruction block and produces an output tile. The BiHiwe compiler stack statically assigns communication and compute instruction blocks to accelerator clusters, shifting the complexity from hardware to the compiler. By splitting the data transfer and on-chip data processing into separate instructions, the BiHiwe ISA enables software pipelining between clusters and allows the memory accesses to run ahead and fetch data for the next tile while the current tile is being processed.

Compute instruction block. A block of compute instructions expresses the entire computation to produce a single tile in an accelerator core. Further, the compute block governs how the input data for a DNN layer is bit-partitioned and distributed across wide aggregators within a single core. As such, the compiler has complete control over the read/write accesses to on-chip scratchpads, A/D and D/A conversion, and execution using the $\mathcal{MS}$ -WAggs and digital blocks in an accelerator core. The granularity of bit-partitioning and charge-based accumulation is determined for each microarchitectural implementation based on the technology node and circuit design paradigm. As such, to support different technology nodes and design styles and allow extensions to the architecture, the BiHiwe ISA encodes the bit-partitioning and accumulation cycles. However, we need to explore the design space to find the optimal design choice for each combination of technology node and circuits (Section 8).

Communication instruction block. The key challenge when scaling up the design is to minimize data movement while parallelizing the execution of the DNN across the on-chip compute resources. To simplify the hardware, the BiHiwe instruction set captures the static schedule of data movement as a series of communication instruction blocks. Static scheduling is possible as the topology of the DNN does not change during inference and the order of layers and neurons is known statically. The BiHiwe compiler stack assigns the communication blocks to the cores according to the order of the layers. This static ordering enables BiHiwe to use a simple statically-scheduled bus instead of a more complex interconnection.

To maximize energy efficiency, it is imperative to exploit the high degree of data-reuse offered by DNNs. To exploit data-reuse when parallelizing computations across cores of the BiHiwe architecture, the communication instructions support broadcasting/multicasting to distribute the same data across multiple cores, minimizing off-chip memory accesses. Once a communication block writes a tile of data to the on-chip scratchpads, it can be reused over multiple compute blocks to exploit temporal data locality within a single accelerator core.
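The run-ahead behaviour of communication blocks relative to compute blocks can be pictured as a simple static schedule. The sketch below is purely illustrative: the `Block` type, its fields, and the step-based schedule are hypothetical and do not reflect the actual BiHiwe instruction encoding.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    kind: str  # "comm" (fetch a tile) or "compute" (process a tile)
    tile: int  # index of the tile the block works on


def static_schedule(num_tiles):
    """Return a list of steps; blocks in the same step run concurrently.

    The fetch of tile i+1 overlaps the computation of tile i, mirroring
    double-buffered scratchpads.
    """
    if num_tiles == 0:
        return []
    steps = [[Block("comm", 0)]]                   # prologue: fetch first tile
    for tile in range(num_tiles):
        step = [Block("compute", tile)]
        if tile + 1 < num_tiles:
            step.append(Block("comm", tile + 1))   # run-ahead fetch
        steps.append(step)
    return steps


for index, step in enumerate(static_schedule(3)):
    print(index, [(block.kind, block.tile) for block in step])
```

Because the schedule is fixed before execution, no arbitration is needed at run time: every step already names which tile is fetched and which is computed.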

8 Evaluation

8.1 Methodology

Benchmarks. We use ten diverse CNN and RNN models, described in Table 2, to evaluate BiHiwe. They perform image classification, real-time object detection (YOLOv3), and character-level (PTB-RNN) and word-level (PTB-LSTM) language modeling. This set of benchmarks includes medium to large scale models (from 11.1 MBytes to 137.3 MBytes) and a wide range of multiply-add operation counts (from 13 Million to 39 Billion).

Simulation infrastructure. We develop a cycle-accurate simulator and a compiler for BiHiwe that takes in a Caffe2 specification of the DNN, finds the optimum tiling and cutting for each layer, and maps it to the BiHiwe architecture. The simulator executes each optimized network using the BiHiwe architecture model and reports the total runtime and energy.

Tetris comparison. We compare BiHiwe with Tetris, a state-of-the-art fully-digital 3D-stacked dataflow accelerator. We match the on-chip power dissipation of BiHiwe and Tetris and compare the total runtime and energy, including energy for DRAM accesses. We also perform an iso-area comparison and scale up the original Tetris from 16 vaults to 36 vaults to match its area to BiHiwe’s. The baseline Tetris supports 16-bit execution while BiHiwe supports 8-bit. For fairness, we modify the open-source Tetris simulator [46] and proportionally scale its runtime and energy. BiHiwe supports 8-bit operands since this representation by itself has virtually no impact on the final accuracy of the DNNs [59, 66, 67, 68, 69].

GPU comparison. We also compare BiHiwe to two Nvidia GPUs, RTX 2080 TI and Titan Xp, which are based on the Turing and Pascal architectures, respectively, and are listed in Table 3. RTX 2080 TI’s Turing architecture provides tensor cores, which are specialized hardware for deep learning inference. We use 8-bit execution on the GPUs using Nvidia’s own TensorRT 5.1 [70] library compiled with the optimized cuDNN 7.5 and CUDA 10.1. For each DNN benchmark, we perform 1,000 warmup iterations and report the average runtime across 10,000 iterations.

Comparison with other recent accelerators. We also compare BiHiwe to Google TPU [26], mixed-signal CMOS RedEye [35], and two analog memristive accelerators. All the comparisons are in 8 bits. The original memristive designs [32, 71] use 16 bits. Scaling from 16-bit to 8-bit execution for memristive designs would optimistically provide a $4\times$ increase in efficiency.

Energy and area measurement. All hardware modeling is performed using the FreePDK 45-nm standard cell library [72]. We implement the switched-capacitor MACCs in Cadence Analog Design Environment V6.1.3 and use Spectre SPICE V6.1.3 to model the system. We then use Cadence Layout XL to lay out the MACC units and extract the energy/area. The ADC’s energy/area are obtained from [73]. Based on the $\mathcal{MS}$ -BPMacc configuration, we use the ADC architecture from [74].

We implement all digital blocks of BiHiwe, including adders, shifters, interconnection, and accumulators, in Verilog RTL and use Synopsys Design Compiler (L-2016.03-SP5) to synthesize them and measure their energy and area. For on-chip SRAM buffers, we use CACTI-P [75] to measure the energy and area of the memory blocks. The 3D-stacked DRAM architecture is based on the HMC stack [49, 50], the same as Tetris, and the bandwidth and access energy are adopted from that work.

Error modeling. For error modeling, we use Spectre SPICE V6.1.3 to extract the noise behavior of MACCs via circuit simulations. Thermal noise, computational error, and PVT variations are considered based on the details in Section 5. We implement the extracted hardware error models and the corresponding mathematical models using PyTorch v1.0.1 [76] and integrate them into the Neural Network Distiller v0.3 framework [77] for a fine-tuning pass over the evaluated benchmarks.

8.2 Experimental Results

8.2.1 Comparison with Tetris

Iso-power performance and energy comparison. Figure 8 shows the performance and energy reduction of BiHiwe over Tetris under the same on-chip power budget. On average, BiHiwe delivers a 4.9 $\times$ speedup over Tetris. This significant improvement is attributed to the use of wide mixed-signal $\mathcal{MS}$ -BPMaccs in BiHiwe as opposed to PEs in Tetris. The wide bit-partitioned mixed-signal design of $\mathcal{MS}$ -BPMacc in BiHiwe enables us to cram $\approx$ 5 $\times$ more compute units within the same power budget as Tetris. The highest speedup is observed in YOLOv3 and PTB-RNN, where their networks’ configurations favor the wide vectorized execution in BiHiwe by better utilizing compute resources. The lowest speedup is observed in ResNet-18, since its relatively small size leads to under-utilization of compute resources in BiHiwe.

Figure 8 also shows the total energy reduction for BiHiwe across the evaluated benchmarks as compared to Tetris. On average, BiHiwe yields a 2.4 $\times$ energy reduction over Tetris, including energy for DRAM accesses, while consuming the same on-chip power as Tetris. CIFAR-10 enjoys the highest energy reduction, since BiHiwe is able to take advantage of CIFAR-10’s smaller memory footprint to maximize on-chip data reuse and reduce DRAM accesses. The lowest energy reduction is observed in the RNN benchmarks, PTB-RNN and PTB-LSTM, since the matrix-vector operations in these benchmarks require a significant number of memory accesses, diminishing the benefits from mixed-signal computations.

Energy breakdown. Figure 9 shows the energy breakdown normalized to Tetris. The breakdown is reported across four major architectural components: (1) on-chip compute units, (2) on-chip memory (buffers and register file), (3) interconnect, and (4) 3D-stacked DRAM. DRAM accesses account for the highest portion of the energy in BiHiwe, since BiHiwe significantly reduces the on-chip compute energy. While BiHiwe has a significantly larger number of compute resources compared to Tetris, the number of DRAM accesses remains almost the same. This is because the statically-scheduled bus allows data to be multicast or broadcast across multiple cores in BiHiwe without significantly increasing the number of DRAM accesses. Furthermore, the statically-scheduled bus offers the BiHiwe compiler stack the freedom to optimize the partitioning of computations across cores. Most layers in the benchmarks benefit from partitioning the different inputs in a single batch (batch size is 16) across BiHiwe cores and broadcasting weights, an option that is not explored in Tetris. As a result, these networks have lower DRAM accesses. The breakdown of energy consumption varies with the type of computations required by the DNN as well as the degree of data reuse. The benchmarks PTB-RNN and PTB-LSTM are recurrent neural networks that perform large matrix-vector operations and require significant DRAM accesses for weights. Therefore, PTB-RNN and PTB-LSTM use more energy for DRAM accesses compared to the other benchmarks.

Unlike the fully-digital PEs in Tetris that perform a single operation in a cycle, BiHiwe uses $\mathcal{MS}$ -WAggs, which perform wide vectorized operations that are crucial in BiHiwe to amortize the high cost of ADCs. As shown in Table 1, each MACC operation in BiHiwe consumes 5.4 $\times$ less energy compared to Tetris. The output-stationary dataflow enabled by capacitive accumulation, together with the systolic organization of $\mathcal{MS}$ -WAggs in each core of BiHiwe, which eliminates the need for the register files used in Tetris, leads to a 4.4 $\times$ reduction in on-chip data movement on average.

Iso-area comparison with Tetris. We compare the total runtime and energy of BiHiwe with a scaled-up version of Tetris that matches BiHiwe’s area. Figure 10 shows the results for the workloads. Scaling up the compute resources in Tetris by 2.25 $\times$ to match the chip area of BiHiwe results in a sub-linear increase in performance of $\approx 60\%$ . This improvement in performance comes at the cost of reduced energy efficiency due to an increase in memory accesses to feed the additional compute resources. The trends in speedup and energy reduction remain the same as in the iso-power comparison, with the exception of ResNet-18, which now sees resource underutilization in Tetris after the number of compute resources is scaled up.

8.2.2 Comparison to GPUs

Figure 11 compares the performance of BiHiwe with Titan Xp and RTX 2080 TI. RTX 2080 TI is based on Nvidia’s latest architecture, Turing. For a fair comparison, we enable vectorized 8-bit operations and optimized GPU compilation. The results are normalized to Titan Xp. On average, BiHiwe yields a 70% speedup over the Titan Xp GPU and performs 15% slower than RTX 2080 TI. Convolutional networks require a large number of matrix-matrix multiplications that are well-suited for tensor cores, which is why RTX 2080 TI outperforms both BiHiwe and Titan Xp on them. VGG-16 and VGG-19 see the maximum benefits. However, BiHiwe outperforms the RTX 2080 TI GPU in PTB-RNN and PTB-LSTM by 11.2 $\times$ and 10.6 $\times$ , respectively. These RNN networks require matrix-vector multiplications, which are particularly suitable for the wide vectorized operations supported in BiHiwe’s $\mathcal{MS}$ -WAggs but are not the best case for tensor cores. In terms of performance-per-Watt, BiHiwe outperforms both Titan Xp and RTX 2080 TI GPUs by a large margin, 66.5 $\times$ and 33.1 $\times$ , respectively.

8.2.3 Comparison with Other Accelerators

We also compare the power efficiency (GOPS/s/Watt) and area efficiency (GOPS/s/ $mm^{2}$ ) of BiHiwe with other recent digital and analog accelerators. Due to the lack of available raw performance/energy numbers for specific workloads, we use these metrics, which is consistent with the comparisons made for recent designs [21, 71, 78]. Figure 12 depicts the peak power and area efficiency results.

On average for the evaluated benchmarks, BiHiwe achieves 72% of its peak efficiency. This information is not available in the publications for the other designs.

Digital systolic: Google TPU [26]. In comparison with TPU, which also uses a systolic design, BiHiwe delivers 4.5 $\times$ more peak power efficiency and almost the same area efficiency. Leveraging the wide, interleaved, and bit-partitioned arithmetic with its switched-capacitor implementation in the BiHiwe architecture reduces the cost of MACC operations significantly compared with TPU, which uses 8-bit digital logic, leading to a significant improvement in power efficiency.

Mixed-signal CMOS: RedEye [35]. RedEye is an in-sensor CNN accelerator based on mixed-signal CMOS technology which also uses switched-capacitor circuitry for MACC operations. Compared to RedEye, BiHiwe offers 5.5 $\times$ better power efficiency and 167 $\times$ better area efficiency. The proposed wide, interleaved, and bit-partitioned arithmetic amortizes the cost of the ADC in BiHiwe by reducing its required resolution and sampling rate, leading to a significant reduction in ADC power and area, in contrast to RedEye.

Analog Memristive designs [32, 71]. Prior works ISAAC [32] and PipeLayer [71] have explored analog memristive technology for DNN acceleration, which integrates both compute and storage within the same die and offers higher compute density compared to traditional analog CMOS technology. However, this increase in compute density comes at the cost of reduced power efficiency. Generally, memristive designs perform computations in the current domain, requiring costly ADCs to sample the current-domain signals at the same rate as the compute/storage for memristors. PipeLayer significantly reduces this cost. Overall, compared to ISAAC and PipeLayer, BiHiwe improves the power efficiency by 3.6 $\times$ and 9.6 $\times$ , respectively.

8.2.4 Design Space Explorations

Design space exploration for bit-partitioning. To evaluate the effectiveness of bit-partitioning, we perform a design space exploration with various bit-partitioning options. Figure 13 shows the reduction in energy and area compared to an 8-bit $\times$ 8-bit design when a dot-product is performed on two vectors with 32 elements. The other design points also perform 8-bit $\times$ 8-bit MACC operations while utilizing our wide and interleaved bit-partitioned arithmetic. As depicted, the design with 2-bit partitioning strikes the best balance in energy and area with the switched-capacitor design of MACC units at the 45 nm CMOS node. The difference between 2-bit and 1-bit partitioning is that single-bit partitioning quadratically increases the number of low-bitwidth MACCs from 16 (2-bit partitioning) to 64 (1-bit partitioning) to support 8-bit operations. This imposes a disproportionate overhead that outweighs the benefit of decreasing each MACC unit’s area and energy.
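The quadratic growth in the number of low-bitwidth operations follows directly from how a dot-product is bit-partitioned. The sketch below is a purely digital illustration that assumes unsigned 8-bit operands (the hardware uses sign-magnitude partitions, which are omitted here for brevity); the function names are hypothetical.

```python
def partitions(value, bits=8, width=2):
    """Split an unsigned value into `width`-bit partitions, least significant first."""
    mask = (1 << width) - 1
    return [(value >> shift) & mask for shift in range(0, bits, width)]


def num_partial_products(bits=8, width=2):
    """Number of low-bitwidth dot-products needed for one full-width dot-product."""
    return (bits // width) ** 2


def bit_partitioned_dot(xs, ws, bits=8, width=2):
    """Dot-product computed from low-bitwidth partial dot-products."""
    parts = bits // width
    xp = [partitions(x, bits, width) for x in xs]
    wp = [partitions(w, bits, width) for w in ws]
    total = 0
    for j in range(parts):          # partition index of the inputs
        for k in range(parts):      # partition index of the weights
            # One low-bitwidth dot-product, interleaved across all elements.
            partial = sum(xp[i][j] * wp[i][k] for i in range(len(xs)))
            # Shift by the combined significance of the two partitions.
            total += partial << (width * (j + k))
    return total


xs = [200, 17, 255, 3]
ws = [3, 255, 128, 64]
print(bit_partitioned_dot(xs, ws) == sum(x * w for x, w in zip(xs, ws)))
print(num_partial_products(width=2), num_partial_products(width=1))
```

Each of the `parts * parts` partition pairs needs its own low-bitwidth unit, which gives 16 units for 2-bit partitions and 64 for 1-bit partitions of 8-bit operands.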

Design space exploration for $\mathcal{MS}$ -BPMacc configuration.

The number of accumulation cycles ( $m$ ) before the A/D conversion and the number of MACC units ( $n$ ) are the two main parameters of $\mathcal{MS}$ -BPMacc. Together they define the resolution and the sample rate of the ADC, and thus determine its power. Figure 14 shows the design space exploration for different configurations of the $\mathcal{MS}$ -BPMacc. Under a fixed power budget of $2$ W for compute units, we measure the total runtime and energy of BiHiwe over the evaluated workloads, normalized to Tetris. As shown in Figure 14, increasing the number of MACCs limits the number of accumulation cycles, which in turn requires ADCs with higher sample rates. High sample-rate ADCs significantly increase power, making the design less efficient. On the other hand, increasing the number of accumulation cycles limits the number of MACCs, which restricts the number of $\mathcal{MS}$ -WAggs that can be integrated into the design under the given power budget. Overall, the design point that delivers the best performance and energy uses eight MACC units and 32 accumulation cycles.
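To make the interplay between $n$ and $m$ concrete, the sketch below estimates what an ADC would need under two simplifying assumptions that are ours rather than the paper's: the ADC must represent the worst-case accumulated sum of unsigned 2-bit $\times$ 2-bit products without loss, and it converts once every $m$ cycles. The function name `adc_requirements` is hypothetical, and the paper's actual ADC model may differ.

```python
from fractions import Fraction

def adc_requirements(n_maccs, m_cycles, slice_bits=2):
    """Worst-case ADC resolution and conversion rate for one accumulation group.

    Assumptions (illustrative only):
      * each MACC multiplies two unsigned slice_bits-wide values,
      * n_maccs products are summed per cycle for m_cycles cycles,
      * the ADC converts once per m_cycles and must be lossless.
    """
    max_product = ((1 << slice_bits) - 1) ** 2
    max_sum = n_maccs * m_cycles * max_product
    resolution_bits = max_sum.bit_length()         # bits needed to hold max_sum
    conversions_per_cycle = Fraction(1, m_cycles)  # ADC rate relative to the clock
    return resolution_bits, conversions_per_cycle

for n, m in [(8, 32), (16, 16), (32, 8)]:
    bits, rate = adc_requirements(n, m)
    print(f"n={n:2d}, m={m:2d}: {bits} bits, {rate} conversions per cycle")
```

In this simplified model the three configurations need the same resolution because the product $n \cdot m$ is constant, while the conversion rate rises as $m$ shrinks, which is the trade-off the paragraph above describes.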

Design space exploration for clustered architecture.

BiHiwe uses a hierarchical architecture with multiple cores in each vault. Having a larger number of small cores in each vault increases the utilization of compute resources, but requires data transfer across cores. We explore the design space with 1, 2, 4, and 8 cores per cluster. As Figure 15 shows, BiHiwe with four cores per vault (the default configuration) strikes the best balance between speedup and energy reduction. Performance increases as the number of cores per vault grows from 1 to 8. However, the 8-core configuration results in a higher number of data accesses. Therefore, the 4-core design point provides the best balance.

8.2.5 Evaluation of Circuitry Non-Idealities

Table 4 shows the Top-1 accuracy with non-idealities considered, the accuracy after fine-tuning, the ideal accuracy, and the final loss in accuracy.

As shown in Table 4, some of the networks, namely AlexNet and ResNet-18, are more sensitive to the non-idealities, leading to a higher initial accuracy degradation. To recover the accuracy loss due to the circuitry non-idealities, we perform a fine-tuning step for a few epochs. With this fine-tuning step, the accuracy loss of the CIFAR-10, ResNet-18, and ResNet-50 networks is fully recovered (the loss is less than 0.04%); among these, CIFAR-10 and ResNet-50 are more robust to non-idealities. The accuracy loss for the other networks is below 0.5%, with AlexNet showing the maximum loss. The final two networks, PTB-RNN and PTB-LSTM, perform character-level and word-level language modeling, respectively. The accuracy of these two networks is measured in Bits-Per-Character (BPC) and Perplexity-per-Word (PPW), respectively. Both PTB-RNN and PTB-LSTM recover all of the loss after fine-tuning. The final results after the fine-tuning step show the effectiveness of this approach in recovering the accuracy loss due to the non-idealities pertinent to analog computation.

9. Related Work

There is a large body of work on digital accelerators for DNNs [10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]. Mixed-signal acceleration has also been explored previously for neural networks [34, 40] and is gaining traction again for deep models [32, 33, 35, 36, 37, 38, 39, 41, 42]. This paper fundamentally differs from these inspiring efforts in that it delves into the mathematics of the basic operations in DNNs and reformulates them, defining the wide, interleaved, and bit-partitioned approach to overcome the challenges of mixed-signal acceleration. By partitioning and re-aggregating the low-bitwidth MACC operations, this paper addresses the limited range of encoding and reduces the cost of cross-domain conversions. Additionally, it combines the proposed mathematical reformulation with switched-capacitor circuitry to share and delay A/D conversions, which amortizes their cost and reduces their rate, respectively. Below, we discuss the most closely related works.

Switched-capacitor design. Switched-capacitor circuits [43] have a long history, having been used mainly for designing amplifiers [79], A/D and D/A converters [80], and filters [81]. Similar to resistive circuits, they have been used even for the previous generation of neural networks [34]. More recently, they have also been used for matrix multiplication [82, 42], which can benefit DNNs. This work takes inspiration from these efforts but differs from them in that it defines and leverages a wide, interleaved, and bit-partitioned reformulation of DNN operations. Additionally, it offers a comprehensive architecture that can accelerate a wide variety of DNNs.

Programmable mixed-signal accelerators. PROMISE [33] offers a mixed-signal architecture that integrates analog units within the SRAM memory blocks. RedEye [35] is a low-power near-sensor mixed-signal accelerator that uses charge-domain computations. These works do not offer wide interleavings of bit-partitioned basic operations as described in this paper.

Fixed-function mixed-signal accelerators. These accelerators are designed for a specific DNN. Some focus on handwritten digit classification [82, 83] or binarized mixed-signal acceleration of CIFAR-10 images [38]. Another work focuses on the acceleration of spiking neural networks [39]. In contrast, our design is programmable and supports interleaved bit-partitioning.

Resistive memory accelerators. There is a large body of work using resistive memory [32, 71, 78, 84, 85, 86, 87, 88]. We provided a direct comparison to ISAAC [32] and PipeLayer [71]. ISAAC [32] most notably introduces the concept of temporally bit-serial operations, also explored in PRIME [44], which PipeLayer [71] augments with a spike-based data scheme. BiHiwe, in contrast, formulates a partitioning that spatially groups lower-bitwidth MACCs across different vector elements and performs them in parallel. PRIME does not provide absolute measurements, and its simulated baseline is not available for a head-to-head comparison. PRIME also uses multiple truncations that change the mathematics. Conversely, our formulation does not induce truncation or mathematical changes.

10. Conclusion

This work proposes wide, interleaved, and bit-partitioned arithmetic to overcome two key challenges in mixed-signal acceleration of DNNs: limited encoding range and costly A/D conversions. This bit-partitioned arithmetic enables rearranging the highly parallel MACC operations in modern DNNs into wide low-bitwidth computations that map efficiently to mixed-signal units. Further, these units operate in the charge domain using switched-capacitor circuitry and reduce the rate of A/D conversions by accumulating partial results in the charge domain. The resulting microarchitecture, named BiHiwe, offers significant benefits over its state-of-the-art analog and digital counterparts. These encouraging results suggest that combining mathematical insights with architectural innovations can enable new avenues in DNN acceleration.

Bibliography

1Niehues et al. [2018] J. Niehues, N.-Q. Pham, T.-L. Ha, M. Sperber, and A. Waibel. Low-Latency Neural Speech Translation. arXiv e-prints, August 2018.
2Mo and Sattar [2018] J. Mo and J. Sattar. SafeDrive: Enhancing Lane Appearance for Autonomous and Assisted Driving Under Limited Visibility. arXiv e-prints, July 2018.
3Li et al. [2018] R. Li, Y. Shu, J. Su, H. Feng, and J. Wang. Using deep Residual Network to search for galaxy-Ly$\alpha$ emitter lens candidates based on spectroscopic-selection. arXiv e-prints, July 2018.
4Rohde et al. [2018] D. Rohde, S. Bonner, T. Dunlop, F. Vasile, and A. Karatzoglou. RecoGym: A Reinforcement Learning Environment for the problem of Product Recommendation in Online Advertising. arXiv e-prints, August 2018.
5Grabec et al. [2018] I. Grabec, E. Švegl, and M. Sok. Development of a sensory-neural network for medical diagnosing. arXiv e-prints, July 2018.
6Esmaeilzadeh et al. [2011] Hadi Esmaeilzadeh, Emily Blem, Renee St. Amant, Karthikeyan Sankaralingam, and Doug Burger. Dark silicon and the end of multicore scaling. In ISCA, 2011.
7Hardavellas et al. [2011] N. Hardavellas, M. Ferdman, B. Falsafi, and A. Ailamaki. Toward dark silicon in servers. IEEE Micro, 31(4):6–15, July–Aug. 2011.
8Venkatesh et al. [2010] Ganesh Venkatesh, Jack Sampson, Nathan Goulding, Saturnino Garcia, Vladyslav Bryksin, Jose Lugo-Martinez, Steven Swanson, and Michael Bedford Taylor. Conservation cores: Reducing the energy of mature computations. In ASPLOS, 2010.