Ultra-low Precision Multiplication-free Training for Deep Neural Networks

Chang Liu; Rui Zhang; Xishan Zhang; Yifan Hao; Zidong Du; Xing Hu; Ling Li; Qi Guo

arXiv:2302.14458·cs.LG·March 1, 2023


TL;DR

This paper introduces an ultra-low precision, multiplication-free training method for deep neural networks that drastically reduces energy consumption while maintaining high accuracy.

Contribution

It proposes a novel multiplication-free training scheme using INT4 additions and 1-bit XOR operations, achieving up to 95.8% energy savings with minimal accuracy loss.

Findings

01. Reduces energy consumption by up to 95.8% in linear layers.

02. Maintains less than 1% accuracy degradation on ImageNet and WMT En-De tasks.

03. Outperforms existing energy-efficient training methods in both efficiency and accuracy.


Tables

Table 1: Energy consumption of different operations (unit energy consumption in pJ).

Multiplier format	FP32	INT32	FP8	INT8	INT4
Multiplier energy (pJ)	3.7	3.1	0.23	0.19	0.048
Adder format	FP32	INT32	INT16	INT8	INT4
Adder energy (pJ)	0.9	0.14	0.05	0.03	0.015
Shift format	INT32-4	INT32-3	INT4-3
Shift energy (pJ)	0.96	0.72	0.081

Table 2: “W”, “A”, and “G” refer to weights, activations, and activation gradients. “Training From Scratch” indicates whether the method trains the models from scratch or fine-tunes pre-trained models. “Multiplication” lists the operations used to replace the multiplication in MAC. “FW” and “BW” refer to forward and backward propagation. “Energy” refers to the energy consumption of MACs for training ResNet50 on ImageNet at one iteration. “*” means ignoring the energy consumption of the multiplications in the quantization process.

Method	W	A	G	Training From Scratch	Large Dataset	Multiplication FW	Multiplication BW	Energy FW (J)	Energy BW (J)	Energy Total (J)
Original	FP32	FP32	FP32	-	-	FP32 Mul	FP32 Mul	4.84	9.69	14.53
INQ	PoT5	FP32	FP32	×	✓	FP32 Mul (INT32-4 Shift)	FP32 Mul	4.84 (1.97)	9.69	14.53
LogNN	PoT4	PoT4	FP32	×	×	FP32 Mul (INT3 Add)	FP32 Mul (INT32-3 Shift)	4.84 (0.95)	9.69 (1.92)	14.53 (2.87)
ShiftCNN	PoT4	FP32	FP32	×	✓	FP32 Mul (INT32-4 Shift)	FP32 Mul	4.84 (1.70)	9.69	14.53
ShiftAddNet	PoT5	INT32	INT32	✓	×	INT32-4 Shift, INT32 Add	INT32 Mul, INT32-4 Shift	2.45	6.63	9.08
AdderNet	FP32	FP32	FP32	✓	✓	FP32 Add	FP32 Add	1.90	3.80	5.70
DeepShift-Q	PoT5	INT32	FP32	✓	✓	INT32-4 Shift	FP32 Mul, INT8 Add	1.97	5.84	7.81
DeepShift-PS	PoT5	INT32	FP32	✓	✓	INT32-4 Shift	FP32 Mul, INT8 Add	1.97	5.84	7.81
S2FP8	FP8	FP8	FP8	✓	✓	FP8 Mul	FP8 Mul	1.19*	2.38*	3.57*
LUQ	INT4	INT4	PoT5	✓	✓	INT4 Mul	Shift4-3	1.00*	2.06*	3.07*
Ours	PoT5	PoT5	PoT5	✓	✓	INT4 Add	INT4 Add	0.16	0.33	0.49

Table 3: CNN accuracy results on ImageNet. “Bit-width” refers to the bit-width used to represent data. “Accuracy” refers to the accuracy results of different methods. $\Delta$ refers to the accuracy degradation compared with FP32 training.

Model	Method	Bit-width (W/A/G)	Accuracy (%)	Δ (%)
AlexNet	Original	32/32/32	58.00	-
AlexNet	INQ	5/32/32	56.13	-1.87
AlexNet	Ultra-low	4/4/4	56.38	-1.62
AlexNet	Ours	5/5/5	57.22	-0.78
ResNet18	Original	32/32/32	70.10	-
ResNet18	INQ	5/32/32	68.98	-1.12
ResNet18	ShiftCNN	4/32/32	64.24	-5.86
ResNet18	AdderNet	32/32/32	67.00	-3.10
ResNet18	DeepShift-Q	5/32/32	65.32	-4.78
ResNet18	DeepShift-PS	5/32/32	65.34	-4.76
ResNet18	S2FP8	8/8/8	69.6	-0.50
ResNet18	Ultra-low	4/4/4	68.27	-1.83
ResNet18	LUQ	4/4/4	69.0	-1.10
ResNet18	Ours	5/5/5	69.52	-0.58
ResNet50	Original	32/32/32	76.32	-
ResNet50	INQ	5/32/32	74.81	-1.51
ResNet50	ShiftCNN	4/32/32	72.58	-3.74
ResNet50	AdderNet	32/32/32	74.9	-1.42
ResNet50	DeepShift-Q	5/32/32	70.73	-5.59
ResNet50	DeepShift-PS	5/32/32	71.90	-4.42
ResNet50	S2FP8	8/8/8	75.2	-1.12
ResNet50	Ultra-low	4/4/4	74.01	-2.31
ResNet50	LUQ	4/4/4	75.32	-1.00
ResNet50	Ours	5/5/5	75.36	-0.96

Table 4: BLEU results on the WMT En-De task. “Bit-width” refers to the bit-width used to represent data. $\Delta$ refers to the BLEU degradation compared with FP32 training.

Model	Method	Bit-width (W/A/G)	BLEU (%)	Δ (%)
Transformer-base	Original	32/32/32	27.5	-
Transformer-base	Ultra-low	4/4/4	25.4	-2.1
Transformer-base	LUQ	4/4/4	27.2	-0.3
Transformer-base	Ours	5/5/5	27.2	-0.3

Table 5: Comparison of Adaptive Layer-wise Scaling (ALS), Weight Bias Correction (WBC), and Parameterized Ratio Clipping (PRC) for ResNet-50 on ImageNet.

ALS	×	✓	✓	✓	✓
WBC	×	×	✓	×	✓
PRC	×	×	×	✓	✓
Accuracy (%)	0.0	12.0/74.2	74.1	13.6	75.4

Table 6: CNN accuracy results on ImageNet. “Bit-width” refers to the bit-width used to represent data. “Accuracy” refers to the accuracy results of different methods. $\Delta$ refers to the accuracy degradation compared with FP32 training.

Model	Method	Bit-width (W/A/G)	Accuracy (%)	Δ (%)
ResNet101	Original	32/32/32	78.05	-
ResNet101	Ours	5/5/5	77.21	-0.84

Equations

$$\{0, \pm 2^{-2^{b-2}+1}, \pm 2^{-2^{b-2}+2}, \dots, \pm 2^{2^{b-2}-1}\},$$

$$e = \mathrm{Round}(\log_{2}(|f|))$$

$$p=\begin{cases}0, & \text{if } e<-2^{b-2}+1,\\ \mathrm{sign}(f)\cdot 2^{2^{b-2}-1}, & \text{if } e\geq 2^{b-2}-1,\\ \mathrm{sign}(f)\cdot 2^{e}, & \text{else},\end{cases}$$

$$2^{k} \cdot 2^{m} = 2^{k+m},$$

$$\mathrm{flip}(s_{1}, s_{2}) = s_{1} \oplus s_{2},$$

$$2^{k}\cdot x=\begin{cases}x\ll k, & \text{if } k>0,\\ x\gg k, & \text{if } k<0,\\ x, & \text{if } k=0.\end{cases}$$

$$\alpha = \frac{\max(|F|)}{2^{2^{b-2}-1}}$$

$$F_{scaled} = \frac{F}{\alpha}$$

$$\hat{F} = \alpha \cdot P.$$

$$\beta = \mathrm{Round}(\log_{2}(\alpha)),$$

$$\tilde{W} = W - \mathrm{mean}(W)$$

$$\bar{a_{i}}=\begin{cases}-\max(|A|)\cdot\gamma, & \text{if } a_{i}<-\max(|A|)\cdot\gamma,\\ \max(|A|)\cdot\gamma, & \text{if } a_{i}>\max(|A|)\cdot\gamma,\\ a_{i}, & \text{else}.\end{cases}$$



Full text


Chang Liu (1,2,3), Rui Zhang (1,3), Xishan Zhang (1,3), Yifan Hao (1,3), Zidong Du (1), Xing Hu (1), Ling Li (2,4), Qi Guo (1)

(1) SKL of Processor, Institute of Computing Technology, CAS, Beijing, China

(2) University of Chinese Academy of Sciences, Beijing, China

(3) Cambricon Technologies, Beijing, China

(4) SKL of Computer Science, Institute of Software, CAS

{liuchang18s, zhangrui, zhangxishan, haoyifan, duzidong, huxing, guoqi}@ict.ac.cn

{liling}@iscas.ac.cn

Abstract

Training deep neural networks (DNNs) consumes immense energy, which restricts the development of deep learning and increases carbon emissions. Thus, the study of energy-efficient training for DNNs is essential. In training, the linear layers consume the most energy because of the intense use of energy-consuming full-precision (FP32) multiplication in multiply-accumulate (MAC) operations. Existing energy-efficient works either decrease the precision of multiplication or replace multiplication with energy-efficient operations such as addition or bitwise shift, in order to reduce the energy consumption of FP32 multiplications. However, these works cannot replace all of the FP32 multiplications during both forward and backward propagation with low-precision energy-efficient operations. In this work, we propose an Adaptive Layer-wise Scaling PoT Quantization (ALS-PoTQ) method and a Multiplication-Free MAC (MF-MAC) to replace all of the FP32 multiplications with INT4 additions and 1-bit XOR operations. In addition, we propose Weight Bias Correction and Parameterized Ratio Clipping techniques for stable training and improved accuracy. None of these methods introduces extra multiplications, so our training scheme reduces the energy consumption in linear layers during training by up to 95.8%. Experimentally, we achieve an accuracy degradation of less than 1% for CNN models on ImageNet and for the Transformer model on the WMT En-De task. In summary, we significantly outperform the existing methods in both energy efficiency and accuracy.

1 Introduction

In recent years, deep neural networks (DNNs) have achieved remarkable success in many AI applications. However, this success comes at the cost of training the models, which consumes substantial energy. For example, training a BERT-base model consumes 948 $\mathrm{kW\cdot h}$ of energy, which is more than the global average household electricity consumption per capita per year (731 $\mathrm{kW\cdot h}$) [28, 36]. Furthermore, the immense energy consumption of DNN training significantly increases carbon emissions, with a negative impact on the global climate [3, 32, 2]. Thus, reducing the energy consumption of DNN training is urgently needed.

In DNN training, the linear layers (including convolutional and fully-connected layers) consume the most energy because of the intense use of energy-consuming FP32 multiplication in multiply-accumulate (MAC) operations. FP32 multiplication is energy-consuming partly because of its high precision. For example, the energy consumption of an FP32 multiplication is approximately 4x higher than that of an FP16 multiplication. Thus, to obtain energy-efficient DNNs, many quantization methods replace the full-precision multiplications with low-precision multiplications. Some quantization methods start with a full-precision (FP32) pre-trained model [14, 27], hence they cannot reduce the energy consumption of training. The other methods train models from scratch by quantizing the weights ( $W$ ), activations ( $A$ ), and activation gradients ( $G$ ) to 16-bit [11, 20] or 8-bit [34, 40, 29, 6].

Besides, multiplication itself consumes significantly more energy than operations such as addition and bitwise shift. For example, the energy consumption of an INT32 multiplication is approximately 22x higher than that of an INT32 addition. Thus, multiplication-less methods directly replace the multiplication with energy-efficient operations such as addition and bitwise shift [38, 24, 15, 13, 7, 37, 8]. Most of these works, such as INQ [38], ShiftCNN [15], and LogNN [24], also start with FP32 pre-trained models rather than training from scratch, so they cannot reduce the energy consumption of training. Among the methods that can train from scratch [7, 13, 8], AdderNet [7] replaces all of the FP32 multiplications in the linear layer with FP32 additions, whose energy consumption is still higher than that of fixed-point operations. The other works [13, 8] apply low-precision Power-of-Two (PoT) numbers, whose values are zero or powers of two, to replace a part of the multiplications in training with bitwise shifts and sign flip operations. However, they cannot replace all of the multiplications during forward or backward propagation. For example, DeepShift [13] only converts $W$ to 5-bit PoT numbers because only the value range of $W$ can be limited to the representation range of their 5-bit PoT numbers. Similarly, LUQ [8] only converts $G$ to PoT numbers because their method cannot approximate the distributions of $W$ and $A$ well. These methods cannot use one data format to represent each of $W$ , $A$ , and $G$ , whose data ranges and distributions vary widely from each other, so they keep one-third of the multiplications in training.

In summary, the existing quantization methods and multiplication-less methods cannot replace all of the FP32 multiplications during both forward and backward propagation with low-precision energy-efficient operations. Thus the training energy consumption savings of all these works are limited, as shown in Figure 1.

In this work, we propose a Multiplication-Free MAC (MF-MAC) that replaces the FP32 multiplication with an INT4 addition and a 1-bit XOR operation, which are significantly more energy-efficient than the operations used by the existing methods. To support the MF-MAC, we propose an Adaptive Layer-wise Scaling PoT Quantization (ALS-PoTQ) method that accommodates the data ranges of $W$ , $A$ , and $G$ and converts them to unified 5-bit PoT numbers. It is also important to note that the proposed ALS-PoTQ method does not introduce extra multiplications, while the existing methods [34, 40, 29, 6, 8] introduce multiplications when quantizing data. Thus, all of the operations in the proposed MF-MAC and ALS-PoTQ are energy-efficient.

In addition, to keep training stable and improve accuracy, we propose a Weight Bias Correction (WBC) technique to correct the bias of $W$ , and a Parameterized Ratio Clipping (PRC) technique to avoid the rigid resolution problem of $A$ . These two techniques also do not introduce extra multiplications. Finally, by combining all of the above techniques, the complete multiplication-free training scheme replaces all of the multiplications during forward and backward propagation with unified energy-efficient operations, which the existing methods cannot achieve.

Applying our multiplication-free training scheme, we conduct experiments for AlexNet [21], ResNet18 [16], ResNet50 [16] models on ImageNet [12] and Transformer model on the WMT En-De task, achieving an accuracy degradation of less than 1%. Compared with FP32 training, we reduce up to 95.8% of the energy consumption in linear layers during training with negligible accuracy degradation. To sum up, our method significantly outperforms the existing methods for both energy efficiency and accuracy, as shown in Figure 1. To the best of our knowledge, it is the first work to replace all of the multiplications with low-precision additions during both forward and backward propagation in DNN training.

Our contributions can be listed as follows:

• We propose a Multiplication-Free MAC (MF-MAC) with an Adaptive Layer-wise Scaling PoT Quantization (ALS-PoTQ) method to replace the FP32 multiplication in MAC with an INT4 addition and an XOR operation.

• We propose two multiplication-free techniques, Weight Bias Correction (WBC) and Parameterized Ratio Clipping (PRC), that keep training stable and improve accuracy.

• Our method reduces the energy consumption in linear layers during training by up to 95.8% compared with FP32 training, with an accuracy degradation of less than 1%. It significantly outperforms other methods in both energy efficiency and accuracy.

2 Related Work

Recently, to avoid the high energy consumption of multiplications in deep learning, many low-precision works quantize $W$ , $A$ , or $G$ to low precision before multiplication [39, 11, 20, 34, 40, 27, 31, 4, 5, 25, 19, 26, 10, 17, 29, 6]. Among these works, some quantization methods start with a full-precision (FP32) pre-trained model [14, 27], hence they cannot reduce the energy consumption of training. The other methods train models from scratch by quantizing the weights $W$ , activations $A$ , and activation gradients $G$ to 16-bit numbers [11, 20], 8-bit numbers [34, 40, 29, 6], or other formats such as radix-4 numbers [30].

Besides, some multiplication-less works directly replace multiplication with energy-efficient operations such as additions [7], bitwise shifts [38, 15, 24, 13], or a combination of them [37]. AdderNet [7] takes the $\ell_{1}$-norm distance between filters and input features as the output response and replaces the multiplications in the linear layers with additions. The network is still computed and stored with FP32 numbers. This work incurs approximately 3% accuracy degradation for the ResNet models on ImageNet. The methods that replace multiplications with bitwise shifts are based on logarithmic quantization, which quantizes full-precision data to zeros and powers of a radix number. For example, Ultra-low training [30] uses a radix-4 logarithmic format to represent gradients, which needs specialized hardware support. When the radix of the logarithmic function is 2, the method is called power-of-two quantization and the multiplications can be replaced with bitwise shifts. Incremental Network Quantization (INQ) [38] partitions the pre-trained weights into two sets, one of which is PoT quantized while the other is retrained to compensate for accuracy degradation. ShiftCNN [15] quantizes $W$ of a pre-trained model to a PoT representation. LogNN [24] quantizes pre-trained $W$ and $A$ to 4-bit PoT numbers. All of the above methods are applied to pre-trained FP32 models rather than training from scratch. Hence, they cannot be used to reduce the energy consumption of training. Besides, some works can reduce the energy consumption of training: DeepShift [13] converts all $W$ to PoT numbers with two training methods, DeepShift-Q and DeepShift-PS, with an accuracy degradation of 4.42% $\sim$ 5.59%. Logarithmic Unbiased Quantization (LUQ) [8] applies a logarithmic unbiased quantization with a pruning operation to quantize gradients during training, with an accuracy degradation of less than 2%. These works replace a part of the multiplications in training with bitwise shifts. However, they keep at least one-third of the multiplications during forward or backward propagation.

In summary, the existing works cannot replace all of the multiplications with the most energy-efficient low-precision fixed-point additions, during both forward and backward propagation.

3 Preliminaries

In this section, we define Power-of-Two (PoT) quantization and describe how the multiplication of numbers after PoT quantization is calculated. The value of a Power-of-Two number $p$ is either a power of two or zero, i.e., $p$ belongs to the set:

$$\{0, \pm 2^{-2^{b-2}+1}, \pm 2^{-2^{b-2}+2}, \dots, \pm 2^{2^{b-2}-1}\},$$

where $b$ is the bit-width used to represent the number. The $b$ bits comprise 1 sign bit and $b-1$ exponent bits.
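As a worked instance of this definition (our own illustration, using the bit-width $b=5$ that the paper adopts later), the exponent bound and the resulting value set are:

$$2^{b-2}-1 = 2^{3}-1 = 7, \qquad p \in \{0, \pm 2^{-7}, \pm 2^{-6}, \dots, \pm 2^{6}, \pm 2^{7}\}.$$

This gives 15 exponents with two signs each, plus zero, i.e. 31 distinct values, which fit within the $2^{5}=32$ available code words. How the zero value is encoded is not specified here and is left open.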

To convert an FP32 number $f$ to a $b$ -bit PoT number $p$ , a basic PoT quantization method is as follows:

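$$e = \mathrm{Round}(\log_{2}(|f|)),$$

$$p=\begin{cases}0, & \text{if } e<-2^{b-2}+1,\\ \mathrm{sign}(f)\cdot 2^{2^{b-2}-1}, & \text{if } e\geq 2^{b-2}-1,\\ \mathrm{sign}(f)\cdot 2^{e}, & \text{else},\end{cases}$$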

where $e$ is the exponent of a PoT number $p=2^{e}$ and $Round$ refers to round-to-nearest.
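The basic PoT quantization rule described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `pot_quantize` is hypothetical, ties in rounding are broken upward (the text only says round-to-nearest), and an input of exactly zero is mapped to zero because its logarithm is undefined.

```python
import math


def pot_quantize(f: float, b: int = 5) -> float:
    """Quantize a real number to a b-bit power-of-two (PoT) value.

    Hypothetical helper: follows the basic rule in the text
    (round the base-2 log, flush small values to zero, saturate large ones).
    """
    e_min = -2 ** (b - 2) + 1  # smallest representable exponent
    e_max = 2 ** (b - 2) - 1   # largest representable exponent
    if f == 0.0:
        return 0.0             # assumption: log2(0) is undefined, so 0 -> 0
    # Round-to-nearest on the exponent; ties go upward (an assumption).
    e = math.floor(math.log2(abs(f)) + 0.5)
    sign = 1.0 if f > 0 else -1.0
    if e < e_min:
        return 0.0                   # underflow: flush to zero
    if e >= e_max:
        return sign * 2.0 ** e_max   # overflow: saturate at the largest value
    return sign * 2.0 ** e


# Small demonstration with the 5-bit format (exponents in [-7, 7]).
quantized = [pot_quantize(x) for x in (0.3, -3.0, 1000.0, 1e-4, 0.0)]
```

With $b=5$, the value 0.3 maps to $2^{-2}$, the value -3.0 maps to $-2^{2}$, 1000.0 saturates at $2^{7}$, and $10^{-4}$ underflows to zero.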

After converting the FP32 numbers to $b$ -bit PoT numbers, the multiplication between two $b$ -bit PoT numbers $2^{k}$ and $2^{m}$ can be replaced with a $(b-1)$ -bit fixed-point addition in the logarithm domain and a 1-bit sign flip operation:

$$2^{k} \cdot 2^{m} = 2^{k+m},$$

where $k,m\in[-2^{b-2}+1,2^{b-2}-1]$ . The 1-bit sign flip operation can be formulated as,

$$\mathrm{flip}(s_{1}, s_{2}) = s_{1} \oplus s_{2},$$

where $s_{1}$ and $s_{2}$ are the sign bits of the two numbers. If a PoT number’s sign bit is 1, the PoT number is negative. If the sign bit is 0, the PoT number is positive.

In addition, the multiplication between a $b$ -bit PoT number $s\cdot 2^{k}$ and a fixed-point number $x$ can be replaced with a bitwise shift and a 1-bit sign flip. The bitwise shift can be formulated as,

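$$2^{k}\cdot x=\begin{cases}x\ll k, & \text{if } k>0,\\ x\gg k, & \text{if } k<0,\\ x, & \text{if } k=0.\end{cases}$$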
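The exponent-addition, sign-flip, and shift rules described in this section can be checked with a short Python sketch. The representation of a PoT number as a `(sign_bit, exponent)` pair and the function names are our own illustrative choices. The sketch assumes nonzero PoT operands, and for a negative exponent it shifts right by the magnitude of $k$.

```python
def pot_mul(s1: int, k: int, s2: int, m: int) -> tuple:
    """Multiply two nonzero PoT numbers given as (sign bit, exponent).

    The sign bits are combined with XOR and the exponents are added,
    so no multiplier is needed.
    """
    return s1 ^ s2, k + m


def pot_value(s: int, e: int) -> float:
    """Decode a (sign bit, exponent) pair to its real value."""
    return (-1.0 if s else 1.0) * 2.0 ** e


def pot_times_fixed(x: int, s: int, k: int) -> int:
    """Multiply a fixed-point integer x by the PoT number (-1)^s * 2^k.

    Uses a left shift for k > 0 and a right shift by |k| for k < 0.
    A right shift discards low-order bits, so the result is truncated.
    """
    if k > 0:
        magnitude = x << k
    elif k < 0:
        magnitude = x >> (-k)
    else:
        magnitude = x
    return -magnitude if s else magnitude


# (+2^3) * (-2^-5) = -2^-2
sign_bit, exponent = pot_mul(0, 3, 1, -5)
product = pot_value(sign_bit, exponent)
```

The decoded product agrees with ordinary multiplication of the two decoded operands, which is the property the MF-MAC relies on.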

4 Methodology

To replace all the multiplications with energy-efficient operations, we aim to convert all of the FP32 $W$ , $A$ , and $G$ in the DNN training to PoT numbers with negligible accuracy degradation.

4.1 Adaptive Layer-wise Scaling PoT Quantization

We first examine the distributions of $W$ , $A$ , and $G$ across many layers and networks at different training steps. We find that the distributions of $W$ , $A$ , and $G$ in DNNs are all spiky and long-tailed near-lognormal distributions, as shown in Figures 2(a)-2(c). (For more distribution figures, please refer to Appendix A.) In other words, most of the numbers concentrate around zero (peak area) and a few numbers are of relatively high magnitude. Consistently, the resolution of PoT numbers is dense in the region near zero and sparse in the regions far from zero. This consistency provides the foundation for applying PoT quantization to DNNs.

However, the basic PoT quantization method cannot be applied to different $W$ , $A$ , and $G$ directly. The representation range of $b$ -bit PoT numbers is $[-2^{2^{b-2}-1},2^{2^{b-2}-1}]$ , while the data ranges of $W$ , $A$ , and $G$ are different and change along with the layer and training step as shown in Figure 2. For a practical hardware implementation, the bit-width should be fixed. Thus, to accommodate the representation range of fixed bit-width PoT numbers, we apply an adaptive scaling factor $\alpha$ to scale the $W$ , $A$ , and $G$ before the PoT quantization.

Consider a set of FP32 (i.e., 32-bit floating-point) numbers $F=\{f_{1},f_{2},\cdots,f_{n}\}$ . ( $F$ can be the $W$ , $A$ , or $G$ in a linear layer.) We limit the data range of $F$ to the $b$ -bit representation range of the PoT format, i.e., $[-2^{2^{b-2}-1},2^{2^{b-2}-1}]$ . The layer-wise scaling factor $\alpha$ and the scaled data $F_{scaled}$ are:

$$\alpha = \frac{\max(|F|)}{2^{2^{b-2}-1}}, \qquad F_{scaled} = \frac{F}{\alpha}.$$

Then, we quantize $F_{scaled}$ to the PoT numbers $P$ as described in Section 3. Thus, the real value of quantized $\hat{F}$ is

$$\hat{F} = \alpha \cdot P.$$

The scaling operation introduces extra multiplications as shown in Equation (8), which is contrary to our purpose. Thus, we further round $\alpha$ to the nearest PoT number:

$$\beta = \mathrm{Round}(\log_{2}(\alpha)),$$

where $\beta$ is an integer that differs for $W$ , $A$ , and $G$ across layers and training steps. Empirically, its value is approximately in the range $[-5,-2]$ for $W$ and $A$ , and $[-20,-10]$ for $G$ .

Then, the multiplications in Equation (8) can be replaced with additions of $\beta$ to the exponent part of $F$ . Moreover, the cost of storing $\alpha$ and of the scalar multiplication in Equation (7) can be ignored because $\alpha$ is layer-wise: there is only one $\alpha$ for tens of thousands of $W$ (or $A$ , $G$ ) in a layer.

In summary, we can convert each $W$ , $A$ , and $G$ to $b$ -bit PoT numbers along with the Adaptive Layer-wise Scaling PoT Quantization (ALS-PoTQ) method without introducing extra multiplications. In this work, we choose $b=5$ , so each FP32 multiplication in MAC can be replaced with an INT4 addition and a 1-bit sign flipping. The distributions of $W$ , $A$ , and $G$ , and corresponding ALS-PoTQ quantized data are shown in Figure 2. We can observe that although the distributions of $W$ , $A$ , and $G$ vary widely, the ALS-PoTQ quantized data can fit each of them well.
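The ALS-PoTQ procedure summarized above can be sketched as follows. This is an illustrative sketch, not the authors' code: the names `als_potq` and `dequantize` are hypothetical, zero is encoded as an exponent of `None`, and floating-point `log2` stands in for the integer manipulation of the FP32 exponent field that a hardware implementation would use. Rounding ties are broken upward, which is an assumption.

```python
import math


def round_half_up(x: float) -> int:
    """Round to the nearest integer; ties go upward (an assumption)."""
    return math.floor(x + 0.5)


def als_potq(values, b: int = 5):
    """Quantize a block of numbers to b-bit PoT codes with one shared scale.

    Returns (beta, codes), where the scale is alpha = 2**beta and each code
    is (sign_bit, exponent); exponent None encodes the value zero.
    """
    e_max = 2 ** (b - 2) - 1
    e_min = -e_max
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return 0, [(0, None) for _ in values]
    # beta = Round(log2(alpha)) with alpha = max|F| / 2**e_max.
    beta = round_half_up(math.log2(max_abs) - e_max)
    codes = []
    for v in values:
        if v == 0.0:
            codes.append((0, None))
            continue
        # Dividing by alpha = 2**beta is a subtraction in the log domain.
        e = round_half_up(math.log2(abs(v)) - beta)
        if e < e_min:
            codes.append((0, None))  # underflow: flush to zero
        else:
            codes.append((1 if v < 0 else 0, min(e, e_max)))  # saturate
    return beta, codes


def dequantize(beta: int, codes):
    """Recover real values: sign * 2**(exponent + beta)."""
    return [0.0 if e is None else (-1.0 if s else 1.0) * 2.0 ** (e + beta)
            for s, e in codes]


beta, codes = als_potq([6.0, -1.5, 0.02, 0.0])
restored = dequantize(beta, codes)
```

For this small block, the shared exponent is $\beta=-4$, and every stored exponent lies in $[-7, 7]$, so it fits the 4 exponent bits of the 5-bit format.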

4.2 Weight Bias Correction

In practical experiments, we find that the distribution of $W$ changes frequently and its mean deviates during training, as shown in Figure 3. Biased weights are not consistent with the symmetry of PoT quantization, as shown in Figure 2(d). Thus, the mean-square error between the quantized $\hat{W}$ and the original $W$ becomes larger. In addition, the bias of $W$ can be accumulated into the activation gradients ( $G$ ) in backward propagation, and the bias on $G$ is theoretically and empirically proven to impact training convergence [23]. Thus, we propose a weight bias correction technique that effectively eliminates the bias.

Given a set of FP32 weights $W=\{w_{1},w_{2},\cdots,w_{n}\}$ , we use the weight bias correction technique to obtain unbiased weights $\tilde{W}$ :

$$\tilde{W} = W - \mathrm{mean}(W).$$

We apply the Weight Bias Correction (WBC) technique to $W$ before ALS-PoTQ. With this technique, the training converges normally. Unlike the weight normalization technique used in previous work [22], the weight bias correction technique does not introduce any extra multiplication.
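As a small illustration of Weight Bias Correction, the sketch below subtracts the mean from a list of weights. The function name is hypothetical, and the single division used to compute the mean is a property of this sketch rather than a statement about how the paper's hardware computes it.

```python
def weight_bias_correction(weights):
    """Return zero-mean weights by subtracting the layer-wise mean.

    Only one scalar (the mean) is computed per layer; each element then
    needs a single subtraction.
    """
    mean = sum(weights) / len(weights)
    return [w - mean for w in weights]


corrected = weight_bias_correction([0.75, 0.25, 0.5, 1.5])
```

After the correction, the weights sum to zero, so the distribution is centered before ALS-PoTQ is applied.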

4.3 Parameterized Ratio Clipping

When applying the ALS-PoTQ method to the activations, we observe an accuracy degradation of approximately 1% $\sim$ 3% on different models. This is because of the PoT quantization’s rigid resolution problem [22]. The rigid resolution problem means that the resolution in the long-tail area is sparse and fixed. As shown in Figure 4, the 4-bit PoT quantization only has higher resolution than 3-bit PoT quantization in the small area close to zero. Since the format of PoT numbers cannot be changed, we can change the quantization range to improve the resolution of the long-tail area, which is the “clipping” technique in quantization works.
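The rigid resolution problem can be made concrete by listing the positive PoT levels of a 3-bit and a 4-bit format. In the sketch below both grids are normalized so that their largest level is 1. That normalization is our assumption about how the two formats are compared, and the function name is hypothetical.

```python
def normalized_pot_levels(b: int) -> list:
    """Positive PoT levels of a b-bit format, scaled so the maximum is 1."""
    e_max = 2 ** (b - 2) - 1
    return [2.0 ** (e - e_max) for e in range(-e_max, e_max + 1)]


levels_3bit = normalized_pot_levels(3)
levels_4bit = normalized_pot_levels(4)
# Levels that the extra bit adds on top of the 3-bit grid.
extra_levels = [v for v in levels_4bit if v not in levels_3bit]
```

Under this normalization the 3-bit grid has levels at 0.25, 0.5, and 1, and every level added by the fourth bit lies below 0.25. The spacing in the long-tail region between 0.25 and 1 is unchanged, which is why changing the quantization range is the available lever.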

Thus, we change the data range of $A$ , inspired by PACT [9]. The clipping threshold should be adaptive for different layers because the data distributions are different. To apply our technique to $A$ of all linear layers, we propose a Parameterized Ratio Clipping (PRC) technique with a clipping ratio factor $\gamma$ . Given a set of FP32 activations $A=\{a_{1},a_{2},\cdots,a_{n}\}$ and a clipping ratio factor $\gamma$ , the clipped activations $\bar{A}=\{\bar{a_{1}},\bar{a_{2}},\cdots,\bar{a_{n}}\}$ are defined as:

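$$\bar{a_{i}}=\begin{cases}-\max(|A|)\cdot\gamma, & \text{if } a_{i}<-\max(|A|)\cdot\gamma,\\ \max(|A|)\cdot\gamma, & \text{if } a_{i}>\max(|A|)\cdot\gamma,\\ a_{i}, & \text{else}.\end{cases}$$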
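The clipping rule can be sketched in Python as below. The function name `prc_clip` is hypothetical, and $\gamma$ is simply passed in as a given value, since how it is chosen or updated is not covered in this section. The threshold is computed once per layer, so the sketch uses one scalar product per block rather than one per element.

```python
def prc_clip(activations, gamma: float):
    """Clip activations to [-max|A| * gamma, +max|A| * gamma].

    gamma is the clipping ratio factor; gamma = 1 leaves the data unchanged.
    """
    threshold = max(abs(a) for a in activations) * gamma  # one scalar per layer
    return [max(-threshold, min(threshold, a)) for a in activations]


clipped = prc_clip([-4.0, 0.5, 2.0, 8.0], gamma=0.5)
```

In this example the threshold is 4.0, so only the outlier 8.0 is changed. The subsequent ALS-PoTQ step then spreads its levels over a narrower range.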

5 Multiplication-Free Training Scheme

In Section 4.1, we propose the ALS-PoTQ method to obtain 5-bit PoT numbers. In this section, we further describe how we implement the Multiplication-Free MAC (MF-MAC) and ALS-PoTQ.

As shown in Figure 5, we replace the FP32 MAC (consisting of FP32 multiplication and FP32 accumulation) with the ALS-PoTQ and MF-MAC. The inputs of the MAC are two sets of FP32 numbers, $\{x_{11},x_{12},\cdots,x_{ij},\cdots,x_{mn}\}$ (abbreviated as $\{x_{ij}\}$ and called a data block) and $\{x^{\prime}_{ij}\}$ . In our method, we first use the ALS-PoTQ to convert both FP32 inputs. In the ALS-PoTQ, the data block $\{x_{ij}\}$ (or $\{x^{\prime}_{ij}\}$ ) is first scaled by the layer-wise scaling factor $\alpha=2^{\beta}$ , which is implemented as an INT8 addition between the integer $\beta$ and the exponent part of each FP32 number to obtain scaled numbers $\{y_{ij}\}$ . Then, we round the scaled numbers to the nearest PoT numbers $p=2^{e}$ to obtain an INT4 block $\{e_{ij}\}$ with a 1-bit sign block $\{s_{ij}\}$ .

After the ALS-PoTQ, the FP32 data blocks $\{x_{ij}\}$ and $\{x^{\prime}_{ij}\}$ are converted to two INT4 data blocks $\{e_{ij}\}$ and $\{e^{\prime}_{ij}\}$ , two 1-bit sign data blocks $\{s_{ij}\}$ and $\{s^{\prime}_{ij}\}$ , and two integers $\beta$ and $\beta^{\prime}$ , which are the inputs of the MF-MAC. In the MF-MAC, we apply an INT4 adder to compute the INT4 addition between $\{e_{ij}\}$ and $\{e^{\prime}_{ij}\}$ . Meanwhile, we apply an XOR gate to perform the sign flip operation in Equation (5). Then, we combine the output of the INT4 adder with the output of the XOR gate and accumulate the signed numbers into an INT32 number $z$ . Finally, we shift the INT32 number $z$ by $\beta+\beta^{\prime}$ bits to obtain the output of the MF-MAC.
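The MF-MAC data flow described above can be imitated in software for a single dot product. This is a sketch under several assumptions of ours, not the hardware design: each element is a `(sign_bit, exponent)` pair with `None` marking zero, the accumulator uses a fixed-point offset of `FRAC_BITS` so that every term $2^{e+e^{\prime}}$ is an integer, and the final shift is expressed with `math.ldexp` so the result can be compared with ordinary arithmetic.

```python
import math

# Assumption: with 5-bit PoT operands, e + e' >= -14, so offsetting by
# 14 bits makes every term 2**(e + e') an integer in the accumulator.
FRAC_BITS = 14


def mf_mac(a_codes, beta_a: int, b_codes, beta_b: int) -> float:
    """Dot product of two PoT-coded blocks without any multiplication.

    a_codes and b_codes are lists of (sign_bit, exponent) pairs;
    the real value of an element is (-1)**sign * 2**(exponent + beta).
    """
    z = 0  # integer accumulator (INT32 in the paper; unbounded int here)
    for (sa, ea), (sb, eb) in zip(a_codes, b_codes):
        if ea is None or eb is None:
            continue                       # a zero operand contributes nothing
        exp_sum = ea + eb                  # exponent addition
        sign = sa ^ sb                     # 1-bit XOR sign flip
        term = 1 << (exp_sum + FRAC_BITS)  # 2**exp_sum in fixed point
        z += -term if sign else term
    # One shift per output: apply both layer-wise scales and undo the offset.
    return math.ldexp(z, beta_a + beta_b - FRAC_BITS)


# Block a decodes to [8.0, -2.0, 0.015625]; block b to [2.0, 0.25, -0.125].
a_codes, beta_a = [(0, 7), (1, 5), (0, -2)], -4
b_codes, beta_b = [(0, 3), (0, 0), (1, -1)], -2
result = mf_mac(a_codes, beta_a, b_codes, beta_b)
```

The result equals the dot product of the decoded blocks, $8\cdot 2 + (-2)\cdot 0.25 + 0.015625\cdot(-0.125)$. The layer-wise exponents $\beta$ and $\beta^{\prime}$ are applied only once, after accumulation.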

Combining the ALS-PoTQ and MF-MAC with the proposed WBC and PRC techniques, we give a complete multiplication-free training scheme as shown in Algorithm 1. In the forward propagation, we first correct the bias of $W^{l}$ and clip $A^{l}$ to obtain $W_{unbias}^{l}$ and $A_{clipped}^{l}$ . Then, we use the ALS-PoTQ method to convert $W_{unbias}^{l}$ and $A_{clipped}^{l}$ to 5-bit PoT numbers $W_{q}^{l}$ and $A_{q}^{l}$ . After that, we apply the MF-MAC to obtain $A^{l+1}$ in the next layer. During backward propagation, we use the ALS-PoTQ to convert $G^{l}$ to 5-bit PoT numbers $G_{q}^{l}$ and apply the MF-MAC to obtain $G^{l-1}$ and $\Delta W^{l}$ to update $W^{l}$ . We repeat the above processes until the network converges.

6 Energy Consumption Analysis

In this section, we analyze the energy consumption of our method and compare it with the related works. First, Table 1 shows the unit energy consumption values of different operations implemented in 45nm CMOS technology, following [35, 37].

In our work, we replace each FP32 multiplication in the MAC with an INT4 addition and an XOR gate. As shown in Table 1, the energy consumption of an INT4 addition is only approximately 0.4% of that of an FP32 multiplication, and the energy consumption of an XOR gate is less than 0.01 pJ [35]. In addition, we can replace the FP32 accumulator in the MAC with an INT32 accumulator, which reduces the accumulation energy by 84%. Thus, our multiplication-free MAC reduces energy by approximately 96.6% compared with the FP32 MAC. Moreover, we take the extra energy consumption introduced by our PoT quantization into account. We introduce INT8 additions, rounding operations in the ALS-PoTQ, and a scalar INT32 bitwise shift after the accumulation in the MF-MAC. These operations consume approximately 0.04 pJ per number. A detailed analysis of energy consumption is in Appendix B. In summary, our multiplication-free MAC with the PoT quantizer reduces energy by approximately 95.8% compared with the FP32 MAC.
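The percentages quoted above can be reproduced with a back-of-the-envelope calculation from the Table 1 unit energies. The sketch below is our own arithmetic, not the accounting of Appendix B. It assumes one multiplication and one accumulation per MAC, treats the XOR gate as negligible, and adds the stated 0.04 pJ quantization overhead once per MAC.

```python
# Unit energies in pJ, copied from Table 1.
FP32_MUL = 3.7
FP32_ADD = 0.9
INT32_ADD = 0.14
INT4_ADD = 0.015
# Quantization overhead in pJ per number, as stated in the text.
QUANT_OVERHEAD = 0.04

# Baseline: one FP32 multiplication plus one FP32 accumulation.
fp32_mac = FP32_MUL + FP32_ADD

# MF-MAC: one INT4 addition plus one INT32 accumulation.
# Assumption: the XOR gate (< 0.01 pJ) is treated as negligible.
mf_mac = INT4_ADD + INT32_ADD

int4_add_vs_fp32_mul = INT4_ADD / FP32_MUL                 # ratio of unit costs
accumulator_saving = 1 - INT32_ADD / FP32_ADD              # INT32 vs FP32 adder
mac_saving = 1 - mf_mac / fp32_mac                         # MF-MAC alone
total_saving = 1 - (mf_mac + QUANT_OVERHEAD) / fp32_mac    # with ALS-PoTQ

summary = {
    "INT4 add / FP32 mul (%)": round(int4_add_vs_fp32_mul * 100, 1),
    "accumulator saving (%)": round(accumulator_saving * 100),
    "MF-MAC saving (%)": round(mac_saving * 100, 1),
    "MF-MAC + ALS-PoTQ saving (%)": round(total_saving * 100, 1),
}
```

Under these assumptions the four rounded figures come out as 0.4, 84, 96.6, and 95.8, matching the values in the text. Counting the XOR gate at its 0.01 pJ upper bound would lower the savings slightly.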

We make a comprehensive comparison with related works, including the novel low-precision quantization method [6] as well as the multiplication-free networks [38, 24, 15, 7, 37, 13, 8]. In Table 2, we list the operations each method uses to replace the FP32 multiplication in MAC during forward and backward propagation. According to the energy consumption of these operations in Table 1, we compute the energy consumption of different methods for training ResNet50 on ImageNet at one iteration; the details are described in Appendix C. The comparison shows that our method consumes significantly less energy for DNN training than the existing methods.

In addition, the existing methods have limitations that our method avoids: INQ [38], LogNN [24], and ShiftCNN [15] use FP32 pre-trained models instead of training from scratch, so they cannot reduce the energy consumption of training. LogNN [24] and ShiftAddNet [37] do not conduct experiments on large-scale datasets such as ImageNet. S2FP8 [6] and LUQ [8] introduce extra multiplications in the quantization process, which increase the energy consumption as stated in [18].

7 Experiments

7.1 Training Accuracy Results

7.1.1 CNN models on ImageNet

We train the official models provided by TensorFlow [1] and TensorPack. The detailed hyperparameter settings are in Appendix D. It is important to note that the weights should be initialized from an untruncated normal distribution instead of a truncated normal distribution. For a comprehensive comparison, we evaluate our method on the image classification task, which most quantization works choose to evaluate performance. We run experiments with AlexNet [21], ResNet18 [16], and ResNet50 [16] on the ILSVRC12 ImageNet classification dataset [12].

Table 3 gives a comprehensive accuracy comparison with related works, including INQ [38], ShiftCNN [15], AdderNet [7], DeepShift [13], Ultra-low [30], and LUQ [8]. We do not compare with LogNN [24] and ShiftAddNet [37] because they do not apply their methods to training on large-scale datasets such as ImageNet. For DeepShift and LUQ, we compare with their training-from-scratch results instead of their fine-tuning results. INQ and ShiftCNN, however, start from pre-trained FP32 models instead of training from scratch, so we report their inference accuracy here.

7.1.2 Transformer model on WMT En-De task

Looking beyond CNNs, we also apply our training scheme to the Transformer-base model [33] on the WMT En-De dataset for machine translation. We use the official model provided by TensorFlow and leave the FP32 training hyperparameters unchanged. Compared with Ultra-low [30] and LUQ [8], we achieve the highest BLEU score, with less than 0.3% BLEU score degradation, as shown in Table 4.

7.2 Accuracy-Energy Joint Comparison

Combining the previous experimental results, we give an energy-accuracy joint comparison in Fig. 1. The x-axis shows the accuracy and the y-axis shows the energy consumption of forward and backward propagation in one iteration. The joint comparison shows that our method has both the lowest energy consumption and the highest accuracy among the methods that aim to reduce the energy consumption of training.

7.3 Ablation Study

In this section, we conduct an ablation study of the proposed techniques: Adaptive Layer-wise Scaling PoT Quantization, Weight Bias Correction, and Parameterized Ratio Clipping. The training accuracy results of ResNet18 in Table 5 demonstrate the effect of each technique. Without layer-wise scaling, the training collapses and the accuracy drops to 0%, because the representation range of PoT quantization cannot accommodate the data range, especially for the gradients. Without weight bias correction, the training is unstable, which is consistent with the analysis in Section 4.2. Moreover, the Parameterized Ratio Clipping technique improves the accuracy by 1.3% for ResNet50.
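The range problem behind the collapse can be illustrated with a small Python sketch. It assumes a PoT quantizer with a fixed exponent window that flushes values below the window to zero, and a per-layer exponent shift derived from the largest magnitude in the layer. The window bounds, the flush-to-zero behavior, and the helpers `pot_quantize` and `layer_shift` are hypothetical choices made for illustration; the paper's ALS-PoTQ may select its scale differently.

```python
import math

EXP_MIN, EXP_MAX = -8, 7  # assumed exponent window (illustrative only)

def pot_quantize(x, shift=0):
    """Quantize x to a signed power of two after shifting its exponent.

    Shifted exponents below EXP_MIN flush to zero; exponents above
    EXP_MAX are clamped. The result is mapped back to the original scale.
    """
    if x == 0.0:
        return 0.0
    exp = round(math.log2(abs(x))) + shift
    if exp < EXP_MIN:
        return 0.0
    exp = min(exp, EXP_MAX)
    return math.copysign(math.ldexp(1.0, exp - shift), x)

def layer_shift(values):
    """Pick one exponent shift so the largest magnitude maps to EXP_MAX.

    Assumes at least one value in the layer is nonzero.
    """
    max_abs = max(abs(v) for v in values)
    return EXP_MAX - round(math.log2(max_abs))

grads = [3e-6, -1e-6, 5e-7]                    # typical small gradient magnitudes
unscaled = [pot_quantize(g) for g in grads]    # every entry underflows to zero
shift = layer_shift(grads)
scaled = [pot_quantize(g, shift) for g in grads]
```

In this toy setting every unscaled gradient becomes zero, so no learning signal survives, while the shifted version keeps each gradient as a nearby power of two.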

8 Conclusion

In this paper, we propose an Adaptive Layer-wise Scaling PoT Quantization (ALS-PoTQ) method and a Multiplication-Free MAC (MF-MAC) to replace the FP32 multiplication in the original MAC with an INT4 addition and an XOR operation. We reduce the energy consumption of linear layers during training by up to 95.8%, with an accuracy degradation of less than 1%. In summary, we significantly outperform the existing methods in both energy efficiency and accuracy.

APPENDIX

A Data Distribution

As shown in Figure 6, we give more distributions of $W$, $A$, and $G$ to support the observation in Section 4.1.

B Energy Consumption of ALS-PoTQ

We give a detailed analysis of the proposed ALS-PoTQ method’s energy consumption.

In the proposed ALS-PoTQ method, we use an INT8 addition to scale each item $x_{ij}$ in a data block of size $m\cdot n$, so the energy consumption of scaling is $(0.03\cdot m\cdot n)\,\mathrm{pJ}$. Then, we round each scaled FP32 value $y_{ij}$ to a PoT number. After scaling, the exponent part of the FP32 number $y_{ij}$ actually occupies only 4 bits, so the round operation is a carry operation on an INT4 value, which requires 4 half adders. Moreover, a carry occurs with a probability of 50%, so the round operation can be bypassed half of the time. The dynamic power consumption of the round operation is therefore approximately 1/4 of that of a 4-bit addition, which requires 3 full adders and 1 half adder. Thus, the energy consumption of a round operation is approximately $0.004\,\mathrm{pJ}$, and the total energy consumption of rounding the data block is $(0.004\cdot m\cdot n)\,\mathrm{pJ}$. In summary, the energy consumption of ALS-PoTQ is $(0.034\cdot m\cdot n)\,\mathrm{pJ}$.
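As a quick check of this arithmetic, the per-block quantizer energy can be written as a small Python helper. The two per-element constants are the figures stated above; the function name `als_potq_energy_pj` and the example block size are assumptions chosen only for illustration.

```python
SCALE_PJ = 0.03    # INT8 addition per element, as stated in the text
ROUND_PJ = 0.004   # rounding per element, as stated in the text

def als_potq_energy_pj(m, n):
    """Energy in pJ to scale and round every element of an m x n block."""
    elements = m * n
    return (SCALE_PJ + ROUND_PJ) * elements

# Hypothetical 64 x 64 block: 4096 elements at 0.034 pJ each.
block_energy = als_potq_energy_pj(64, 64)
```

The cost grows linearly with the block size, which is why the analysis reports it per number rather than per block.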

In addition, after the MF-MAC, we apply an INT32 shift to dequantize the result. Because only one INT32 shift is needed for a data block of size $m\cdot n$, the energy consumption of the INT32 shift is less than 0.05 pJ per number. In our training scheme, we apply both ALS-PoTQ and MF-MAC three times during forward and backward propagation, so ALS-PoTQ consumes approximately $0.04\,\mathrm{pJ}$ per number on average. Thus, the total energy consumption of one ALS-PoTQ and one MF-MAC is approximately $0.195\,\mathrm{pJ}$.
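The relative savings can be reproduced with a short Python calculation. The 0.195 pJ and 0.04 pJ figures come from the text; the FP32 multiplication and addition energies (3.7 pJ and 0.9 pJ) are an assumption based on commonly cited 45 nm estimates, and Table 1 of the paper remains the authoritative source for the baseline.

```python
FP32_MUL_PJ = 3.7              # assumption: commonly cited 45 nm estimate
FP32_ADD_PJ = 0.9              # assumption: commonly cited 45 nm estimate
MF_MAC_WITH_QUANT_PJ = 0.195   # ALS-PoTQ plus MF-MAC, from the text
QUANT_OVERHEAD_PJ = 0.04       # ALS-PoTQ overhead per number, from the text

def saving(new_pj, base_pj):
    """Fraction of energy saved relative to the baseline."""
    return 1.0 - new_pj / base_pj

fp32_mac_pj = FP32_MUL_PJ + FP32_ADD_PJ
mf_mac_only_pj = MF_MAC_WITH_QUANT_PJ - QUANT_OVERHEAD_PJ

saving_mac_only = saving(mf_mac_only_pj, fp32_mac_pj)
saving_total = saving(MF_MAC_WITH_QUANT_PJ, fp32_mac_pj)
```

Under these assumed baseline values, the two results land close to the approximately 96.6% and 95.8% reductions quoted in the main text.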

C Energy Consumption of Existing Methods

We compute the energy consumption of existing methods, including INQ [38], LogNN [24], ShiftCNN [15], AdderNet [7], DeepShift [13], ShiftAddNet [37], S2FP8 [6], and LUQ [8]. First, we identify which operations each method uses in its MACs for forward and backward propagation. Then, we compute the energy consumption of the MACs for training ResNet50 on ImageNet with the data in Table 1. Training ResNet50 on ImageNet takes $12.36\mathrm{G}$ MACs per iteration. We obtain the total energy consumption by multiplying the energy consumption of the operations in one MAC by the number of MACs.
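The per-iteration total is a single product, sketched below in Python. The MAC count is the figure stated above; the function name `training_energy_mj` is hypothetical, and the 0.195 pJ input is the per-MAC figure from Appendix B, used here only as an example argument.

```python
MACS_PER_ITERATION = 12.36e9  # ResNet50 on ImageNet, as stated in the text

def training_energy_mj(per_mac_pj, macs=MACS_PER_ITERATION):
    """Total MAC energy in millijoules for one training iteration."""
    joules = per_mac_pj * 1e-12 * macs   # pJ -> J, times the number of MACs
    return joules * 1e3                  # J -> mJ

# Example input: the 0.195 pJ per-MAC figure from Appendix B.
example_mj = training_energy_mj(0.195)
```

Any other method's total follows by passing the sum of its per-MAC operation energies from Table 1 in place of the example value.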

INQ [38] and ShiftCNN [15] convert $W$ to 5-bit or 4-bit PoT numbers by fine-tuning pre-trained models. Thus, during training, the MACs in forward and backward propagation consist of FP32 multiplications and FP32 additions, while at inference the FP32 multiplications in the forward MACs are replaced with INT32-4 bitwise shifts (shifting INT32 numbers by up to 4 bits). LogNN [24] converts $W$ and $A$ to 4-bit PoT numbers by fine-tuning pre-trained models, which is similar to INQ. It also tries to train from scratch, but it does not conduct experiments on large-scale datasets such as ImageNet. AdderNet [7] replaces the FP32 multiplications in MACs with FP32 additions, so there are two FP32 additions in one MAC operation. DeepShift [13] converts $W$ to 5-bit PoT numbers and trains the models from scratch. Thus, the multiplications during forward propagation are replaced with INT32-4 bitwise shifts, and half of the multiplications ($WG$) during backward propagation can be replaced with INT8 additions on the exponent part of the FP32 $G$. ShiftAddNet [37] combines the bitwise shift operations of DeepShift and the additions of AdderNet, so it replaces the multiplications during forward propagation with INT32-4 bitwise shifts and FP32 additions, and replaces half of the multiplications during backward propagation with INT32-4 bitwise shifts.

Moreover, S2FP8 [6] quantizes $W$, $A$, and $G$ to FP8 numbers, and the multiplications in the MAC for forward and backward propagation are replaced with FP8 multiplications. Ultra-low [30] uses radix-4 floating-point numbers, which are not supported by radix-2 hardware, so we do not compute its energy consumption here. LUQ [8] quantizes $W$ and $A$ to INT4 numbers and converts $G$ to 4-bit PoT numbers. Thus, it replaces the multiplications during forward propagation with INT4 multiplications and the multiplications during backward propagation with INT4-3 bitwise shifts. In addition, these three quantization training works introduce extra FP32 multiplications in their methods. To avoid ambiguity, we ignore these extra multiplications when computing their energy consumption.

In total, we compute the MAC energy consumption in these works based on the energy consumption data in Table 1 and the analyses of operations.

D Training Settings

We train AlexNet [21] on 4 GPUs with standard hyperparameters: 100 epochs, a batch size of 256, SGD with a momentum of 0.9, and an initial learning rate of 0.02 that is decreased by a factor of 10 after epochs 30, 60, and 90. We train ResNet18 and ResNet50 [16] on 8 GPUs with standard hyperparameters: 105 epochs, a batch size of 256, SGD with a momentum of 0.9, and an initial learning rate of 0.1 that is decreased by a factor of 10 after epochs 30, 60, and 90. In addition, all weights are initialized from untruncated normal distributions, and we convert $G$ in the last layer to 6-bit PoT numbers instead of 5-bit PoT numbers.
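The step schedule described above can be sketched as a small Python function. The function name `step_lr` is hypothetical, and the sketch assumes 0-indexed epochs with each drop taking effect from the milestone epoch onward, which is one possible reading of "after epochs 30, 60, and 90".

```python
def step_lr(epoch, base_lr, milestones=(30, 60, 90), factor=0.1):
    """Learning rate at a given epoch under step decay.

    The rate is multiplied by `factor` once for every milestone reached.
    """
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * factor ** drops

# ResNet18/ResNet50 setting: initial rate 0.1 over 105 epochs.
resnet_schedule = [step_lr(e, 0.1) for e in range(105)]
# AlexNet setting: initial rate 0.02 over 100 epochs.
alexnet_schedule = [step_lr(e, 0.02) for e in range(100)]
```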

E Accuracy result on ResNet101

We also apply our method to the deeper ResNet101 network and again achieve high accuracy, as shown in Table 6.

Bibliography

[1] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. TensorFlow: a system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 265–283, 2016.
[2] OpenAI. AI and compute. https://openai.com/blog/ai-and-compute/.
[3] Lasse F Wolff Anthony, Benjamin Kanding, and Raghavendra Selvan. Carbontracker: Tracking and predicting the carbon footprint of training deep learning models. arXiv preprint arXiv:2007.03051, 2020.
[4] M. V. Baalen, C. Louizos, M. Nagel, R. A. Amjad, and M. Welling. Bayesian bits: Unifying quantization and pruning. In Conference and Workshop on Neural Information Processing Systems (NeurIPS), 2020.
[5] Z. Cai and N. Vasconcelos. Rethinking differentiable search for mixed-precision neural networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[6] Leopold Cambier, Anahita Bhiwandiwalla, Ting Gong, Oguz H Elibol, Mehran Nekuii, and Hanlin Tang. Shifted and squeezed 8-bit floating point format for low-precision training of deep neural networks. In International Conference on Learning Representations (ICLR), 2020.
[7] Hanting Chen, Yunhe Wang, Chunjing Xu, Boxin Shi, Chao Xu, Qi Tian, and Chang Xu. AdderNet: Do we really need multiplications in deep learning? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1468–1477, 2020.
[8] Brian Chmiel, Ron Banner, Elad Hoffer, Hilla Ben Yaacov, and Daniel Soudry. Logarithmic unbiased quantization: Practical 4-bit training in deep learning. arXiv preprint arXiv:2112.10769, 2021.