An Optimized Recurrent Unit for Ultra-Low-Power Keyword Spotting

Justice Amoh; Kofi Odame

arXiv:1902.05026·cs.LG·February 14, 2019

TL;DR

This paper introduces eGRU, an optimized recurrent neural network unit designed for ultra-low-power devices, enabling efficient keyword spotting and acoustic event detection on micro-controllers with minimal accuracy loss.

Contribution

The paper presents the eGRU architecture, a highly efficient variant of GRU tailored for resource-constrained edge devices, with significant improvements in speed and size while maintaining accuracy.

Findings

1. eGRU is 60x faster than standard GRU
2. eGRU is 10x smaller than standard GRU
3. Achieves 95.3% accuracy on embedded hardware

Abstract

There is growing interest in being able to run neural networks on sensors, wearables and internet-of-things (IoT) devices. However, the computational demands of neural networks make them difficult to deploy on resource-constrained edge devices. To meet this need, our work introduces a new recurrent unit architecture that is specifically adapted for on-device low power acoustic event detection (AED). The proposed architecture is based on the gated recurrent unit ('GRU') but features optimizations that make it implementable on ultra-low power micro-controllers such as the Arm Cortex M0+. Our new architecture, the Embedded Gated Recurrent Unit (eGRU), is demonstrated to be highly efficient and suitable for short-duration AED and keyword spotting tasks. A single eGRU cell is 60x faster and 10x smaller than a GRU cell. Despite its optimizations, eGRU compares well with GRU across tasks of…

Tables (3)

Table 1. (a) State equations of traditional GRU and proposed eGRU cells.

GRU	eGRU (this work)
$\begin{aligned} z_{t} &= \sigma (W_{z} \odot [h_{t-1}, x_{t}]) \\ r_{t} &= \sigma (W_{r} \odot [h_{t-1}, x_{t}]) \\ \tilde{h}_{t} &= \tanh (W_{h} \odot [r_{t} * h_{t-1}, x_{t}]) \\ h_{t} &= (1 - z_{t}) * h_{t-1} + z_{t} * \tilde{h}_{t} \end{aligned}$	$\begin{aligned} z_{t} &= (\varsigma (W_{z} \odot [h_{t-1}, x_{t}]) + 1) / 2 \\ \tilde{h}_{t} &= \varsigma (W_{h} \odot [h_{t-1}, x_{t}]) \\ h_{t} &= (1 - z_{t}) * h_{t-1} + z_{t} * \tilde{h}_{t} \end{aligned}$

Table 2. Encoding of quantized weights. The encoding scheme is chosen for fast computation of weight transformations using only bitwise operations in Algorithm 1. No additional look-up table is required at run-time.

Weight	Binary	Decimal
+1.00	000	0
+0.50	001	1
+0.25	010	2
0.00	111	7
-0.25	110	6
-0.50	101	5
-1.00	100	4

Table 3. Comparison of time cost of integer vs floating point arithmetic on M0+. For basic maths, integer operations are more than $10\times$ faster than floating point ones. Also, our fixed point approximation of the softsign is $5\times$ faster than the floating point implementation.

Operation	int32 (ns)	float32 (ns)
$\pm$	252	4,853
$<<, >>$	273	-
$\times$	253	6,578
$\div$	1,923	12,942
softsign	4,740	24,910

Taxonomy

Methods: Gated Recurrent Unit

Full text

Justice Amoh

Thayer School of Engineering

Dartmouth College

Hanover, NH 03755

Kofi M. Odame

Thayer School of Engineering

Dartmouth College

Hanover, NH 03755

Abstract

There is growing interest in being able to run neural networks on sensors, wearables and internet-of-things (IoT) devices. However, the computational demands of neural networks make them difficult to deploy on resource-constrained edge devices.

To meet this need, our work introduces a new recurrent unit architecture that is specifically adapted for on-device low power acoustic event detection (AED). The proposed architecture is based on the gated recurrent unit (‘GRU’ – introduced by Cho2014) but features optimizations that make it implementable on ultra-low power micro-controllers such as the Arm Cortex M0+.

Our new architecture, the Embedded Gated Recurrent Unit (eGRU), is demonstrated to be highly efficient and suitable for short-duration AED and keyword spotting tasks. A single eGRU cell is $60\times$ faster and $10\times$ smaller than a GRU cell. Despite its optimizations, eGRU compares well with GRU across tasks of varying complexities.

The practicality of eGRU is investigated in a wearable acoustic event detection application. An eGRU model is implemented and tested on the Arm Cortex M0+-based Atmel ATSAMD21E18 processor. The Arm M0+ implementation of the eGRU model compares favorably with a full precision GRU running on a workstation. The embedded eGRU model achieves a classification accuracy of 95.3%, which is only 2% less than the full precision GRU.

1 Introduction

Deep neural networks are a powerful way to extract information from noisy, raw sensor data that is collected in an unconstrained real-world environment. This approach has been used successfully in computer vision [howard2017mobilenets], computer audition [hannun2014deep] and physiological monitoring [ravi2017deep]. Since deep neural networks are generally memory- and computationally-intensive, they are typically implemented on high-end cloud compute servers. However, transmitting raw data from the sensor to the cloud has negative implications for battery life [miettinen2010energy], real-time responsiveness [dillon2010cloud] and data privacy [takabi2010security]. It also demands a reliable communications connection to the cloud which is not always possible. These challenges can be avoided by implementing the deep neural networks directly onto sensors.

Two approaches to realizing neural networks on sensors, or “edge” devices [teerapittayanon2017distributed], are (1) customized hardware processors like the Nvidia Drive Px2 and the A11 Bionic Chip [nvidia_drive, apple_2017], and (2) light-weight software libraries like uTensor and CMSIS-NN [Tan2017, Lai2018]. Unfortunately, these approaches are inadequate for wearable applications like the recently introduced cough-detection device (see Fig. 1) [amoh2016deep, amoh2015deepcough, amoh2013technologies, amohmobile]. To promote long-term use, wearable devices have stringent size requirements that translate to a small battery (often the single largest component) and a correspondingly small power budget, on the order of 10 mW. Custom neural network processors like the Jetson Tx2 would drain such a battery in less than a minute. An ultra-low-power microcontroller unit (MCU) [ti:msp430, stm:STM32L010C6, siliconlabs:C8051F96x, microchip:PIC16LF1509] could stretch the limited power budget further, but it provides only a limited amount of memory and computational resources; even light-weight libraries have some minimum hardware requirements that cannot be met by the sparse resources of an ultra-low-power MCU.

To address these challenges, we developed a novel recurrent neural network that is specifically adapted for implementation on the Arm M0+ [Armltd] class of ultra-low-power MCUs. In particular, we introduce a new architecture for the recurrent unit which is optimized for keyword spotting tasks like wearable cough detection. This paper presents details on the architecture, training scheme and hardware implementation of the proposed recurrent cell. We also present experimental results that show our architecture requires $12\times$ less memory and $60\times$ less computational time than comparable conventional recurrent units.

2 Related Work

Neural network optimizations similar to those proposed in this work have previously been considered in isolation by other researchers. For instance, recent efforts to optimize the GRU architecture include the removal of one of its gates. Zhou2016 achieved this single gate architecture by coupling the update and reset gates into a single forget gate. The resulting ‘minimal’ gated unit (MGU) directly effects both cell update and reset using the same gate. Ravanelli2017 discarded the reset gate altogether, highlighting that it is potentially redundant in speech recognition where input signals evolve slowly. Our work investigates this further in AED tasks where events are sudden and infrequent, in sharp contrast to speech recognition. Furthermore, our proposed cell architecture replaces the sigmoid and hyperbolic tangent activation functions with more efficient softsign variants.

Another common area of optimization is weight quantization. In previous works [Han2015, Wu2016], 8-bit weight quantization was shown to drastically reduce network size without significant loss in accuracy. However, these approaches focused solely on convolutional and fully-connected networks. They also required the storage of a codebook for decoding weights at run-time. With regard to recurrent networks, [Ott2016] demonstrated that 2-bit weight quantization is possible in GRU, with only a slight loss in accuracy. However, the combined effect of such low precision quantization and other optimizations like a single gate architecture remained unknown.

Using solely integer arithmetic in neural networks has also been studied in prior works [Han2015, Gupta2015, Chen2015]. Courbariaux2014 found that in a fully connected network, a dynamic fixed point numeric format yields much better results than 20-bit fixed point. Another work demonstrates that 32-bit fixed point formats are effective in convolutional networks [Chen2015]. Once again, all these efforts consider fully-connected and convolutional architectures. Since recurrent architectures are fundamentally different and much more difficult to train than other architectures [pascanu2013difficulty], it is worthwhile to investigate fixed point arithmetic in recurrent neural networks.

To date, no one has ever simultaneously applied all of these modifications to a GRU and validated that they yield a usable network that is implementable on a low power, low resource MCU like the Arm Cortex M0+. This work bridges that gap.

3 Embedded Gated Recurrent Unit

Previous work showed that a network of gated recurrent units (GRU) [Cho2014] outperforms classical approaches like Hidden Markov Models on an acoustic event detection task when evaluated in a noisy, non-ideal environment [amoh2016deep]. But the GRU network described in [amoh2016deep] is too memory- and computationally-expensive to be implemented on a low-resource MCU like the Arm M0+.

To meet the resource constraints of a low power MCU like the M0+, we propose a new recurrent architecture: the embedded Gated Recurrent Unit (eGRU). It is based on the traditional GRU, but with four major modifications: (1) a single gate mechanism, (2) faster activation functions, (3) 3-bit exponential weight quantization and (4) fixed point arithmetic in all network operations. These modifications lead to massive reductions in memory and computations in the recurrent cell, making it feasible to run eGRU networks on our target device. Below, we discuss the features of the eGRU in detail.

3.1 Single Gate Mechanism

A practical idea for optimizing recurrent units is the removal of gates. Modern recurrent cell architectures like the Long Short-Term Memory (LSTM) unit [Hochreiter1997] and GRU are characterized by gating mechanisms that regulate information flow in and out of the cell’s memory. For instance, the LSTM cell has 3 gates: two for controlling the cell’s input and output, and a third for forgetting or resetting the cell’s internal state. Since gates are implemented using weights and activations, omitting a gate reduces the required memory and computations in a recurrent cell. Accordingly, GRU was introduced as an optimized form of LSTM by reducing the number of gates from 3 to 2: the update and reset gates.

In the same vein, GRU can be optimized further by discarding yet another gate. Zhou2016 accomplished this very ‘minimal’ gated unit by using a single gate for both resetting and updating the cell’s internal state. Ravanelli2017 extended that work further by highlighting a redundancy between the two gates. They deduced that in applications like speech recognition where signals change slowly, reset gates are unnecessary and can be omitted altogether. However, in applications where events of interest are abrupt and isolated (e.g., detecting cough sounds), the assumption by Ravanelli2017 that state resets are irrelevant does not hold. In fact, we found that without state resets, recurrent units in our application are unable to recover from large impulse signals. Thus, for keyword spotting applications, we can eliminate the reset gate and rely on only the update gate.
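As a rough illustration of what removing a gate saves, the weights implied by the state equations in Table 1 can be counted directly. This is only a sketch: it assumes each weight matrix acts on the concatenation $[h_{t-1}, x_{t}]$ with no bias terms (none appear in Table 1), and the layer sizes and the name `cell_params` are hypothetical, chosen for illustration.

```python
def cell_params(hidden: int, inputs: int, matrices: int) -> int:
    """Number of weights in a recurrent cell whose state equations use
    `matrices` weight matrices, each of shape hidden x (hidden + inputs)."""
    return matrices * hidden * (hidden + inputs)


# Illustrative sizes only; they are not taken from the paper.
hidden, inputs = 20, 13

gru_params = cell_params(hidden, inputs, matrices=3)   # W_z, W_r, W_h
egru_params = cell_params(hidden, inputs, matrices=2)  # W_z, W_h

print(gru_params, egru_params, gru_params / egru_params)  # 1980 1320 1.5
```

Under these assumptions, dropping the reset gate removes one of three weight matrices, so the weight count shrinks by a factor of 1.5 before any quantization is applied.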

3.2 Activation Functions

The original GRU cell utilized two kinds of activations: sigmoid functions ($\sigma$) in the update and reset gate equations, and the hyperbolic tangent function ($\tanh$) in the state update equation. However, since the M0+ lacks a dedicated floating point unit and DSP instruction set, executing either the sigmoid or the $\tanh$ function is quite slow (see Table 3(b)). For a more efficient recurrent unit, it was suggested in [Ravanelli2017] that the $\tanh$ be replaced with a rectified linear unit (ReLU). Unfortunately, when combined with heavily quantized weights, recurrent cells with ReLU activations are too lossy to learn well. And since quantization is an essential part of our proposed eGRU architecture, ReLU activations are not an option.

A desirable alternative to the activation functions above is the softsign function. In [glorot2010understanding], softsign was shown to perform comparably with $\tanh$ and sigmoid in feed-forward networks. Although not as cheap (computationally) as ReLU, softsign is much cheaper than sigmoid or $\tanh$. A floating point implementation of the softsign function is more than $10\times$ faster than either sigmoid or $\tanh$ functions on the M0+ (Table 3(b)). Furthermore, the simplicity of the softsign function permits an even faster fixed point implementation. For these reasons, we adopt the softsign and a shifted version of it as activation functions for our proposed eGRU recurrent architecture.
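The eGRU state equations in Table 1 can be sketched in a few lines of floating point Python. This is an illustrative reading, not the authors' implementation: it takes softsign to be $x / (1 + |x|)$, interprets the product of a weight matrix with $[h_{t-1}, x_{t}]$ as an ordinary matrix-vector product, and uses hypothetical names (`egru_step`, `shifted_softsign`).

```python
from typing import List, Sequence


def softsign(x: float) -> float:
    """Softsign activation, bounded in (-1, 1)."""
    return x / (1.0 + abs(x))


def shifted_softsign(x: float) -> float:
    """Softsign shifted and scaled into (0, 1), used for the update gate."""
    return (softsign(x) + 1.0) / 2.0


def egru_step(h_prev: Sequence[float], x: Sequence[float],
              W_z: Sequence[Sequence[float]],
              W_h: Sequence[Sequence[float]]) -> List[float]:
    """One eGRU time step following the equations in Table 1.
    W_z and W_h each have one row per hidden unit and one column per
    entry of the concatenation [h_prev, x]."""
    v = list(h_prev) + list(x)  # concatenation [h_{t-1}, x_t]
    h_new = []
    for j, h_j in enumerate(h_prev):
        z = shifted_softsign(sum(w * a for w, a in zip(W_z[j], v)))
        candidate = softsign(sum(w * a for w, a in zip(W_h[j], v)))
        h_new.append((1.0 - z) * h_j + z * candidate)
    return h_new
```

Because the new state is a convex combination of the previous state and a softsign output, it stays inside (-1, 1) whenever the previous state does, which is what later allows the state to be held in a bounded fixed point format.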

3.3 Weight Quantization

Efforts to reduce neural network memory footprint often involve weight quantization. As networks are purely defined by learned parameters or weights, a reduction in memory for each weight through quantization results in tremendous shrinkage in the overall network size. Several studies have shown that neural networks are still effective even after weights are quantized to only 8 bits, leading to $20$ to $49\times$ memory reduction [Han2015, Wu2016]. Specifically in GRU architectures, ternarization (2-bit weights) has been proposed as feasible for recurrent neural networks at a small cost in performance [Ott2016].

However, we discovered that 2-bit quantization does not work well for our application. When combined with the single gate and activation function optimizations, ternarized weights in eGRU lead to poor performance. On the other hand, 3-bit quantization with septenary weights (7 levels) proved to be quite effective. Thus, 3-bit quantization was adopted for eGRU.

Besides the reduction in bits, our quantization scheme also ensures that the nonzero quantized levels are powers of two with non-positive integer exponents ($\pm 2^{0}$, $\pm 2^{-1}$, $\pm 2^{-2}$), similar to an approach in [Ott2016]. This exponential quantization enables the replacement of weight multiplications with bit shifting, which in turn drastically reduces the computation time of an eGRU cell. The septenary weights used are $[0,\pm 0.25,\pm 0.5,\pm 1]$, and they are encoded using the mapping in Table 2 such that no external look-up tables are required at run-time. To multiply an input by a quantized weight, a simple, fast procedure featuring only bitwise operations is employed (Algorithm 1).
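Algorithm 1 is not reproduced in this text, so the following is only one plausible reading of the Table 2 encoding, not the paper's procedure. It assumes the two low bits of a code hold the right-shift amount, the high bit holds the sign, and the shift pattern `11` marks a zero weight. Rounding a trained weight to the nearest septenary level is likewise an assumption, and the names `quantize` and `shift_multiply` are hypothetical.

```python
# Septenary levels and their 3-bit codes, copied from Table 2.
LEVELS = {1.0: 0b000, 0.5: 0b001, 0.25: 0b010, 0.0: 0b111,
          -0.25: 0b110, -0.5: 0b101, -1.0: 0b100}


def quantize(w: float) -> int:
    """Return the 3-bit code of the level nearest to w (assumed rule)."""
    nearest = min(LEVELS, key=lambda level: abs(level - w))
    return LEVELS[nearest]


def shift_multiply(x: int, code: int) -> int:
    """Multiply integer x by the weight encoded in `code` using only
    bitwise operations and a negation."""
    shift = code & 0b011          # low two bits: shift amount 0, 1 or 2
    if shift == 0b011:            # pattern 11 encodes a zero weight
        return 0
    y = x >> shift                # arithmetic shift, i.e. x * 2**(-shift)
    return -y if code & 0b100 else y  # high bit carries the sign


x = 1 << 14                                   # 0.5 in Q15
print(shift_multiply(x, quantize(0.3)))       # 0.5 * 0.25 -> 4096
print(shift_multiply(x, quantize(-0.9)))      # 0.5 * -1.0 -> -16384
```

With this layout no decoding table is needed: the code itself supplies the shift amount and the sign.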

3.4 Fixed Point Arithmetic

The final area of optimization in eGRU is the numeric format used for all math operations in the neural network. On better equipped processing units, single or double precision floating point formats are typically used. Unfortunately, since the M0+ lacks an FPU, floating point operations are very costly (see Table 3). Hence, using solely integer operations in eGRU is desirable.
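The cost gap can be read directly off the timings in Table 3. The short calculation below uses only the figures from that table; the dictionary layout and names are illustrative.

```python
# (int32 ns, float32 ns) per operation, copied from Table 3.
timings_ns = {
    "add/sub": (252, 4853),
    "multiply": (253, 6578),
    "divide": (1923, 12942),
    "softsign": (4740, 24910),
}

# How many times slower the float32 version is than the int32 version.
speedup = {op: f32 / i32 for op, (i32, f32) in timings_ns.items()}

for op, ratio in speedup.items():
    print(f"{op}: {ratio:.1f}x")
# add/sub: 19.3x
# multiply: 26.0x
# divide: 6.7x
# softsign: 5.3x
```

Bit shifts have no floating point counterpart in the table, so they are left out of the ratio.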

Considering the M0+ has a 32-bit architecture, optimal execution of operations is achieved when operands are contained within 32 bits. Hence, we adopt the Q15 16-bit fixed point format for all arithmetic within eGRU. In Q15 format, 16-bit signed integers from -32,768 to 32,767 are used to represent decimals in the range of [-1,1) at intervals of $2^{-15}$. Necessary precautions ought to be taken during basic operations to prevent overflow and ensure closure under Q15. For instance, to divide two Q15 numbers, it is necessary to left-shift the dividend by 15 bits (yielding a 32-bit value) before undertaking the division to avoid losing precision. Hence, all eGRU operations (including activation functions) need to be translated to Q15 versions.
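A minimal sketch of Q15 conversion, multiplication and division is given below, using Python integers to stand in for the 16-bit and 32-bit registers. The helper names are hypothetical. The division shifts the dividend left by the 15 fractional bits so that the quotient lands back in Q15, and it assumes the caller keeps the dividend smaller in magnitude than the divisor so the result stays in range.

```python
Q = 15
ONE = 1 << Q  # 32768; the value 1.0 itself is just outside the Q15 range


def to_q15(x: float) -> int:
    """Convert a real number to Q15, saturating to [-32768, 32767]."""
    return max(-ONE, min(ONE - 1, round(x * ONE)))


def from_q15(q: int) -> float:
    """Convert a Q15 integer back to a real number."""
    return q / ONE


def q15_mul(a: int, b: int) -> int:
    """Multiply two Q15 numbers: the 32-bit product is Q30, so shift
    right by 15 to return to Q15."""
    return (a * b) >> Q


def q15_div(a: int, b: int) -> int:
    """Divide two Q15 numbers: widen the dividend by 15 bits first so
    the fractional bits survive, then truncate toward zero like C."""
    num = a << Q
    q = abs(num) // abs(b)
    return q if (num < 0) == (b < 0) else -q


print(from_q15(q15_mul(to_q15(0.5), to_q15(0.5))))    # 0.25
print(from_q15(q15_div(to_q15(0.25), to_q15(0.5))))   # 0.5
```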

Weight multiplication is the most frequent operation in a neural network. Fortunately, by virtue of our exponential quantization, affine transformations for all layers in the network can be implemented by right-shift operations, which remain the same in Q15 format. The summation of all transformed inputs can exceed 16 bits and is thus accumulated in a 32-bit register. However, since the activation function is bounded by [-1,1), the output of the recurrent node remains a 16-bit Q15 number which can then be fed into yet another node. From simulations, we discovered that all inputs to an eGRU network will flow through the entire model in Q15 format and result in an output that is precise to at least 2 decimal places compared to those from an equivalent full precision (floating point) network.
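To make the need for a wider accumulator concrete, the sketch below sums shift-weighted Q15 inputs for a single node. It is an illustration under the same assumed reading of the Table 2 codes (low two bits as shift amount, high bit as sign, shift pattern `11` as zero); the function name `preactivation` is hypothetical.

```python
INT16_MAX = (1 << 15) - 1
INT32_MIN, INT32_MAX = -(1 << 31), (1 << 31) - 1


def preactivation(inputs_q15, codes):
    """Sum of inputs weighted by 3-bit codes, using shifts only.
    The running total plays the role of the 32-bit accumulator."""
    acc = 0
    for x, code in zip(inputs_q15, codes):
        shift = code & 0b011
        if shift == 0b011:        # zero weight contributes nothing
            continue
        term = x >> shift
        acc += -term if code & 0b100 else term
    # Sanity check that the total would fit a 32-bit register.
    assert INT32_MIN <= acc <= INT32_MAX
    return acc


# Four near-full-scale inputs with weight +1 already overflow 16 bits.
acc = preactivation([32767] * 4, [0b000] * 4)
print(acc, acc > INT16_MAX)  # 131068 True
```

Each term is at most 16 bits wide, so a 32-bit accumulator has room for tens of thousands of such terms before it could overflow.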

An interesting modification worth mentioning pertains to the Q15 implementation of the softsign activation function. The input to an activation, in the scheme described above, is a 32-bit accumulation of all transformed inputs. Since such inputs are already 32 bits wide, undertaking a Q15 division would be impractical, as it would require left-shifting, resulting in an overflow. One way to circumvent this is to clip the accumulated value to a certain domain to prevent overflow. We found that clipping to the domain [-64,64) resulted in a fast and accurate integer softsign approximation (see Table 3). A simple C++ definition of our softsign function is provided in the softsign listing.
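The authors' C++ listing is not included in this text. As a stand-in, here is one way an integer softsign with clipping could be built, written in Python rather than C++. It is a guess at the general idea, not the paper's code, and the particular split of shifts (10 bits on the numerator, 5 bits off the denominator) is an assumption chosen so that every intermediate fits in 32 bits once the input is clipped to [-64, 64).

```python
Q = 15
ONE = 1 << Q        # 1.0 in Q15
LIMIT = 64 * ONE    # 64.0 in Q15, i.e. 2**21


def softsign_q15(acc: int) -> int:
    """Approximate softsign(x) = x / (1 + |x|) for a 32-bit Q15
    accumulator value, returning a 16-bit Q15 result."""
    # Clip to [-64, 64) so the shifted numerator stays within 32 bits.
    acc = max(-LIMIT, min(LIMIT - 1, acc))
    num = acc << 10               # Q25, magnitude at most 2**31
    den = (ONE + abs(acc)) >> 5   # Q10, between 1024 and 66560
    q = abs(num) // den           # Q25 / Q10 = Q15, truncated toward zero
    return -q if acc < 0 else q


print(softsign_q15(ONE) / ONE)        # softsign(1.0)  -> 0.5
print(softsign_q15(-3 * ONE) / ONE)   # softsign(-3.0) -> -0.75
```

Dropping five bits from the denominator costs a little accuracy, but the result stays within about 0.001 of the floating point softsign across the clipped domain, in line with the two-decimal-place precision mentioned above.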