Reducing state updates via Gaussian-gated LSTMs

Matthew Thornton; Jithendar Anumula; Shih-Chii Liu

arXiv:1901.07334·cs.LG·January 23, 2019

TL;DR

This paper introduces the Gaussian-gated LSTM (g-LSTM), a novel RNN architecture that improves long-term dependency learning, reduces computation, and accelerates convergence through timing gates and curriculum learning.

Contribution

The paper proposes a timing-gated LSTM model with learnable Gaussian time gates that enhance long-term memory, reduce computation, and improve training efficiency over standard LSTMs.

Findings

01. g-LSTM captures long-term dependencies better than LSTM

02. The model reduces computation by at least 10x

03. Curriculum learning accelerates convergence on long sequences

Abstract

Recurrent neural networks can be difficult to train on long sequence data due to the well-known vanishing gradient problem. Some architectures incorporate methods to reduce RNN state updates, therefore allowing the network to preserve memory over long temporal intervals. To address these problems of convergence, this paper proposes a timing-gated LSTM RNN model, called the Gaussian-gated LSTM (g-LSTM). The time gate controls when a neuron can be updated during training, enabling longer memory persistence and better error-gradient flow. This model captures long-temporal dependencies better than an LSTM and the time gate parameters can be learned even from non-optimal initialization values. Because the time gate limits the updates of the neuron state, the number of computes needed for the network update is also reduced. By adding a computational budget term to the training loss, we can…

Tables

Table 1: Network architectures and performance for the convergence experiments in subsection 4.1. The performance metric is the final mean squared error (MSE) loss for the adding task, and the label error rate for both sMNIST and sCIFAR-10.

Dataset	# units	$\mu$ (initialization)	$\sigma$ (initialization)	g-LSTM (performance)	LSTM (performance)
Adding (N=1000)	110	$\sim U(300, 700)$	40	$3.8 \cdot 10^{-5}$	$1.4 \cdot 10^{-3}$
Adding (N=2000)	110	$\sim U(500, 1500)$	40	$1.3 \cdot 10^{-3}$	$1.6 \cdot 10^{-1}$
sMNIST	110	$\sim U(1, 784)$	250	$1.3\%$	$1.8\%$
sCIFAR-10	128	$\sim U(1, 1024)$	650	$41.1\%$	$41.8\%$

Table 2: Comparison of label error rates across different networks.

Network	sMNIST	pMNIST	sCIFAR-10
g-LSTM (ours)	$1.3\%$	$7.5\%$	$41.1\%$
LSTM (ours)	$1.8\%$	$8.4\%$	$41.8\%$
r-LSTM (Trinh et al., 2018)	$1.6\%$	$4.8\%$	$27.8\%$
Zoneout (Krueger et al., 2016)	$1.3\%$	$6.9\%$	-
IndRNN (6 layers) (Li et al., 2018)	$1.0\%$	$4.0\%$	-
BN-LSTM (Cooijmans et al., 2016)	$1.0\%$	$4.6\%$	-
Skip LSTM (Campos et al., 2017)	$2.7\%$	-	-

Table 3: Adding task (T=1000): 110 unit g-LSTM network initializations and performances.

Experiment ID	$\mu$ (initialization)	$\sigma$ (initialization)	Final MSE Loss
A1	$\sim U(300, 700)$	1	$4.4 \cdot 10^{-4}$
A2	$\sim U(0, 400)$	40	$2.0 \cdot 10^{-5}$
A3	$\sim U(600, 1000)$	40	$4.0 \cdot 10^{-4}$

Table 4: Adding task (T=1000): Comparing 110 unit g-LSTM and PLSTM networks with similar initializations, MSE computed after training for 500 epochs.

Network	Initialization	Final MSE Loss
g-LSTM	$\mu \sim U(300, 700)$, $\sigma = 40$	$7.7 \cdot 10^{-5}$
PLSTM	$\tau = 1000$, $s \sim U(250, 650)$, $r = 0.10$	$2.4 \cdot 10^{-4}$

Equations

$i_{n}=\sigma(x_{n}W_{xi}+h_{n-1}W_{hi}+b_{i}),\quad f_{n}=\sigma(x_{n}W_{xf}+h_{n-1}W_{hf}+b_{f})$ (1)

$\tilde{c}_{n}=f_{n}\odot c_{n-1}+i_{n}\odot\sigma(x_{n}W_{xg}+h_{n-1}W_{hg}+b_{g})$ (2)

$o_{n}=\sigma(x_{n}W_{xo}+h_{n-1}W_{ho}+b_{o}),\quad \tilde{h}_{n}=o_{n}\odot\tanh(\tilde{c}_{n})$ (3)

$c_{n}=k_{n}\odot\tilde{c}_{n}+(1-k_{n})\odot c_{n-1}$ (4)

$h_{n}=k_{n}\odot\tilde{h}_{n}+(1-k_{n})\odot h_{n-1}$ (5)

$\frac{\partial L}{\partial W_{h}}=\frac{\partial L}{\partial h_{N}}\frac{\partial h_{N}}{\partial W_{h}}$ (6)

$L=L_{data}+\lambda L_{budget}$

$L_{budget}=E[k_{n}]\approx\sum_{n=1}^{N}\sum_{j=1}^{J}k_{n}^{(j)}$

$N_{LSTM}=T\,H\,(8D+8H+29)$

$N_{g\text{-}LSTM}=N_{LSTM}+N_{gate}$

$h_{n}=k_{n}\cdot\tilde{h}_{n}+(1-k_{n})\cdot h_{n-1}$ (7)

$\tilde{h}_{n}=f(W_{x}x_{n}+W_{h}h_{n-1})$ (8)

$\frac{\partial h_{N}}{\partial W_{h}}=\frac{\partial\tilde{h}_{0}}{\partial W_{h}}\prod_{n=1}^{N}\left(k_{n}W_{h}f_{n}^{\prime}+(1-k_{n})\right)+\sum_{n=1}^{N}\left(k_{n}f_{n}^{\prime}h_{n-1}\right)\prod_{s=n+1}^{N}\left(k_{s}W_{h}f_{s}^{\prime}+(1-k_{s})\right)$ (9)

$k_{5}=1,\quad k_{n}=0\;\;\forall n\in\{1,...,N\}\setminus\{5\}$

$\frac{\partial h_{N}}{\partial W_{h}}=f_{5}^{\prime}W_{h}+f_{5}^{\prime}W_{h}f_{5}^{\prime}h_{4}=f_{5}^{\prime}W_{h}(1+f_{5}^{\prime}h_{4})$

$k_{2}=k_{3}=k_{4}=k_{5}=k_{6}=1,\quad k_{n}=0\;\;\forall n\in\{1,...,N\}\setminus\{2,3,4,5,6\}$

$\frac{\partial h_{N}}{\partial W_{h}}=f_{2}^{\prime}W_{h}\,f_{3}^{\prime}W_{h}\,f_{4}^{\prime}W_{h}\,f_{5}^{\prime}W_{h}\,f_{6}^{\prime}W_{h}\,(1+f_{2}^{\prime}h_{1}+f_{3}^{\prime}h_{2}+f_{4}^{\prime}h_{3}+f_{5}^{\prime}h_{4}+f_{6}^{\prime}h_{5})$

$\Gamma\in\mathbb{R}_{+}^{N}$

$\Gamma=\frac{1}{K}\,\mathbb{E}\sum_{k}\frac{\partial L}{\partial h_{n}^{(k)}}\approx\frac{1}{LK}\sum_{(l,k)}\frac{\partial L}{\partial h_{n}^{(k)}}$

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications · Neural Networks and Applications

Methods: Sigmoid Activation · Tanh Activation · Long Short-Term Memory

Full text

Reducing state updates via Gaussian-gated LSTMs

Matthew Thornton, Jithendar Anumula, and Shih-Chii Liu

Institute of Neuroinformatics

University of Zurich and ETH Zurich

Zurich, Switzerland

Abstract

Recurrent neural networks can be difficult to train on long sequence data due to the well-known vanishing gradient problem. Some architectures incorporate methods to reduce RNN state updates, thereby allowing the network to preserve memory over long temporal intervals. We propose a timing-gated LSTM RNN model, called the Gaussian-gated LSTM (g-LSTM) for reducing state updates. The time gate controls when a neuron can be updated during training, enabling longer memory persistence and better error-gradient flow. This model captures long temporal dependencies better than an LSTM on very long sequence tasks and the time gate parameters can be learned even from a non-optimal initialization. Because the time gate limits the updates of the neuron state, the number of computes needed for the network update is also reduced. By adding a computational budget term to the training loss, we obtain a network which further reduces the number of computes by at least $10\times$ . Finally, we propose a temporal curriculum learning schedule for the g-LSTM that helps speed up the convergence time of the equivalent LSTM on long sequences.

1 Introduction

Numerous methods and architectures have been proposed to mitigate the vanishing gradient problem in RNNs, with LSTMs (Hochreiter & Schmidhuber, 1997) as one of the first prominent solutions doing so by including gating structures in the computation. Although the LSTM has excelled at handling many tasks (Schmidhuber, 2015; Lipton, 2015), it still has difficulties in learning complex and long time dependencies (Neil et al., 2016; Chang et al., 2017; Trinh et al., 2018).

In the last few years, various methods which reduce the state updates of an RNN (LSTM) have been explored to better learn long time dependencies from data. Clockwork RNNs (Koutnik et al., 2014) group the hidden units of the RNN into “modules,” where each module is executed at pre-specified time steps, thereby skipping time steps, which helps learn longer time dependencies. Recently, various other methods have been proposed which can be characterized by the use of additional “time gates,” $k_{n}$, that control the information flow from one time step to the next (Krueger et al., 2016; Campos et al., 2017). Phased LSTM (PLSTM) (Neil et al., 2016) learns a parameterized function, $k_{n}$, from the time input of the current state and was shown to be successful at learning over long sequences.

The PLSTM time gate, parameterized by period, phase, and ratio parameters for each hidden unit, is defined through a modulo function with an ill-defined gradient. Furthermore, with periodic functions being hard to learn using gradient-based methods (Shamir, 2016) and with $k_{n}$ being periodic, the PLSTM was unable to learn the time gate parameters and hence relied on careful initialization. In order to offset these difficulties, this work proposes a new LSTM variant called the Gaussian-gated LSTM (g-LSTM). Similar to the PLSTM, it is an LSTM model with a parameterized $k_{n}$ but with only two parameters per hidden unit. Unlike the PLSTM which uses a periodic formulation for $k_{n}$ , the g-LSTM uses a Gaussian function.

We show in this work that the g-LSTM can provide a number of possible advantages over the LSTM, in particular, on long sequence tasks that pose convergence problems during training:

• The g-LSTM network can process very long sequences by reducing the time over which the neurons can be updated. It converges faster than the LSTM, especially on sequences that are over 500 steps.

• The “openness” of the neuron for an update can be adapted according to the task during training, even for extreme, non-optimal initializations of the time gate parameters.

• By introducing a computational budget term into the loss function during training, the “openness” of the neuron can be optimized for a reduced computational budget. This reduction can be achieved with little or no degradation to the network performance and is useful for network pruning.

• A “temporal curriculum” training schedule can be set up for the g-LSTM so that it helps to speed up the convergence of a normal LSTM.

The paper is structured as follows: In section 2, we discuss briefly the related work. Then, in section 3, we present the formulation of the g-LSTM, the datasets used in this work and details about the experimental hyperparameters. In section 4, we present experiments demonstrating the usefulness of the g-LSTM with respect to the four claims listed above. We provide gradient analysis in section 5 to further explain the faster convergence results of the g-LSTM. Finally in section 6, we conclude with a brief discussion of the results.

2 Related Work

There have been a multitude of proposed methods to improve the training of RNNs, especially for long sequences. Apart from incorporating additional gating structures, for example the LSTM and the GRU (Cho et al., 2014), more recently various techniques were proposed to further increase the capabilities of recurrent networks to learn on sequences of length over 1000. Proposed initialization techniques such as the orthogonal initialization of kernel matrices (Cooijmans et al., 2016), chrono initialization of the biases (Tallec & Ollivier, 2018), and diagonal recurrent kernel matrices (e.g. Li et al. (2018)) have demonstrated success. Trinh et al. (2018) propose using truncated backpropagation with an additional auxiliary loss to reconstruct previous events.

Methods that enable more efficient learning on long temporal sequences use solutions that preserve memory over longer timescales. Such solutions were first explored by Koutnik et al. (2014) in the Clockwork RNN (CW-RNN). This network skips state updates by allowing different neurons to be “activated” on different, modulated clock cycles. More recently proposed models for skipping updates include the Phased LSTM (PLSTM) (Neil et al., 2016) which uses a modulo-periodic timing gate to limit state updates; the Zoneout network (Krueger et al., 2016) which skips state updates in a random manner; and the Skip RNN (Campos et al., 2017) which learns a state skipping scheme from the data to shorten the effective sequence length for the task. Additionally, the LSTM-Jump (Yu et al., 2017) uses a reinforcement learning algorithm to learn when to skip state updates, showing a method to more quickly process (long) sequential data with an RNN while maintaining an accuracy comparable to a baseline LSTM.

It has been suggested but not yet demonstrated in the literature that the parameters of the CW-RNN clock cycle and PLSTM timing gate could be learned in training. Currently, the implementation of these networks requires a careful initialization of these parameters. With the Gaussian-gated LSTM (g-LSTM) in this work we present a time gated RNN network that converges on long sequence tasks and also has the ability to learn its time gate parameters even when initialized in a non-optimal way.

3 Methods

3.1 g-LSTM

The g-LSTM is an LSTM model with an additional time gate (Fig. 1). This time gate is used to regulate the information flow in time. Equations 1 - 3 describe the update equations for the hidden and cell states of the LSTM. Equations 4 and 5 describe the gating mechanism of the time gate, $k_{n}$ .

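$i_{n}=\sigma(x_{n}W_{xi}+h_{n-1}W_{hi}+b_{i}),\quad f_{n}=\sigma(x_{n}W_{xf}+h_{n-1}W_{hf}+b_{f})$ (1)

$\tilde{c}_{n}=f_{n}\odot c_{n-1}+i_{n}\odot\sigma(x_{n}W_{xg}+h_{n-1}W_{hg}+b_{g})$ (2)

$o_{n}=\sigma(x_{n}W_{xo}+h_{n-1}W_{ho}+b_{o}),\quad \tilde{h}_{n}=o_{n}\odot\tanh(\tilde{c}_{n})$ (3)

$c_{n}=k_{n}\odot\tilde{c}_{n}+(1-k_{n})\odot c_{n-1}$ (4)

$h_{n}=k_{n}\odot\tilde{h}_{n}+(1-k_{n})\odot h_{n-1}$ (5)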

In a standard LSTM, the gating functions $i_{n}$ , $f_{n}$ , $o_{n}$ , represent the input, forget, and output gates respectively at sequence index $n$ . $c_{n}$ is the cell activation vector, and $x_{n}$ and $h_{n}$ represent the input feature vector and the hidden output vector, respectively. The cell state $c_{n}$ is updated with a fraction of the previous cell state that is controlled by $f_{n}$ , and a new input state created from the element-wise (Hadamard) product, denoted by $\odot$ , of $i_{n}$ and the candidate cell activation as in Eq. 2.

In the g-LSTM, we further control the cell state and the output hidden state through the $k_{n}$ gate which is independent of the input data and hidden states, and is purely dependent on the time input corresponding to the sequence index $n$ . The use of the Hadamard product ensures that each hidden unit is independently controlled by the corresponding time gate unit, thus enabling the different units in the layer to process the input at different time scales.

The time gate $k_{n}$ is defined based on a Gaussian function as: $k_{n}=e^{-(t_{n}-\mu)^{2}/\sigma^{2}}$ where the mean parameter, $\mu$ , defines the time when the hidden unit is “open” and the standard deviation, $\sigma$ , controls the openness of the time gate for each unit around its corresponding $\mu$ . The time inputs $\textbf{t}=\{t_{1},t_{2},...,t_{n},...,t_{N}\}$ for the sequence $\textbf{x}=\{x_{1},x_{2},...,x_{n},...,x_{N}\}$ can correspond to the physical notion of time at the respective sequence input. In the absence of a standard notion of time, we use the sequence indices as the time input, i.e. $\textbf{t}=\{1,2,...,n,...,N\}$ . In this work, we assume this notion of time by default. The “openness” of $k_{n}$ for a neuron is defined by the parameterization of its Gaussian function.
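As an illustration, the Gaussian time gate and a single g-LSTM update can be sketched in NumPy. This is a minimal sketch rather than the authors' implementation: the function names, the stacked weight layout (`W_x`, `W_h`, `b` holding the input, forget, output and candidate blocks in that order) and the `params` dictionary are assumptions made for the example. The candidate activation uses the logistic function, as written in the update equation for $\tilde{c}_{n}$; many LSTM implementations use tanh there instead.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gaussian_time_gate(t, mu, sigma):
    """k_n = exp(-(t_n - mu)^2 / sigma^2), one value per hidden unit."""
    return np.exp(-((t - mu) ** 2) / sigma ** 2)

def glstm_step(x, h_prev, c_prev, t, params):
    """One g-LSTM update for a single sequence element (hypothetical layout)."""
    W_x, W_h, b = params["W_x"], params["W_h"], params["b"]
    H = h_prev.shape[-1]
    z = x @ W_x + h_prev @ W_h + b           # stacked pre-activations, shape (4H,)
    i = sigmoid(z[:H])                       # input gate
    f = sigmoid(z[H:2 * H])                  # forget gate
    o = sigmoid(z[2 * H:3 * H])              # output gate
    g = sigmoid(z[3 * H:])                   # candidate activation (logistic, see lead-in)
    c_tilde = f * c_prev + i * g             # proposed cell state
    h_tilde = o * np.tanh(c_tilde)           # proposed hidden state
    k = gaussian_time_gate(t, params["mu"], params["sigma"])
    c = k * c_tilde + (1.0 - k) * c_prev     # time-gated cell state
    h = k * h_tilde + (1.0 - k) * h_prev     # time-gated hidden state
    return h, c, k
```

When $t_{n}$ is far from $\mu$ relative to $\sigma$, $k_{n}$ is close to 0 and the unit simply carries its previous state forward; at $t_{n}=\mu$ the gate equals 1 and the unit behaves like a regular LSTM unit for that step.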

3.2 Back Propagation for g-LSTM

An important characteristic of the g-LSTM is reduced gradient flow in back propagation training methods. By having the gating structure as in Eqs. 4 and 5 there are fewer gradient product terms, which reduces the likelihood of vanishing or exploding gradients. In a gradient descent learning scheme for a given loss function, $L$ , when training the recurrent parameters, $W_{h}$ (from Eqs. 1 - 3), the gradient as in Eq. 6 is used.

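$\frac{\partial L}{\partial W_{h}}=\frac{\partial L}{\partial h_{N}}\frac{\partial h_{N}}{\partial W_{h}}$ (6)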

By the chain rule $\frac{\partial h_{N}}{\partial W_{h}}$ expands for all time steps of the sequence, $n\in\{1,...,N\}$ . Because each output state is gated by the time gate, $k_{n}$ , the gradient terms in the expansion of $\frac{\partial h_{N}}{\partial W_{h}}$ are scaled by $k_{n}$ . When the time gate is open less often, i.e. with a small $\sigma$ value, then there are fewer influential gradient terms. More details are given in Appendix A.

3.3 Datasets

The experiments described in the paper are carried out on the adding task and two standard long sequence datasets: the sequential MNIST and the sequential CIFAR-10 datasets.

Adding task: In order to test the long sequence learning capability of the g-LSTM, we use the adding task (Hochreiter & Schmidhuber, 1997). In this task, the network is presented with two sequences of length $N$: $\textbf{x}=\{x_{1},...,x_{N}\}$ with $x_{n}\sim U(0,1)$, and $\textbf{m}=\{m_{1},...,m_{N}\}$ with $m_{n}\in\{0,1\}$ and $\sum_{n=1}^{N}m_{n}=2$. The sequence m has exactly two values of $1$ and the remaining values of the sequence are $0$. The indices of the “1” values are chosen at random. For each sequence pair $(\textbf{x},\textbf{m})$, the associated label value, $y$, is the sum of the two values in x corresponding to the “1” values of m. The objective of this task is to minimize the mean squared error between the predicted sum from the network, $\hat{y}$, and the labeled sum, $y$. A new training set of 5000 sequence samples is presented in every epoch during training in order to avoid overfitting. The test set consists of a separate 5000 samples. For $N>1000$, it is known that LSTMs have difficulty learning the task and hence we focus on values of $N>1000$ in this work.
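A data generator for this task is short enough to sketch. The function name `make_adding_batch` and the choice to stack x and m as two input features per step are assumptions for the example, not details taken from the paper.

```python
import numpy as np

def make_adding_batch(num_samples, seq_len, rng):
    """Return inputs of shape (num_samples, seq_len, 2) and targets of shape (num_samples,)."""
    x = rng.uniform(0.0, 1.0, size=(num_samples, seq_len))   # values x_n ~ U(0, 1)
    m = np.zeros((num_samples, seq_len))                     # marker sequence
    for s in range(num_samples):
        # Two distinct positions are marked with 1, chosen at random.
        idx = rng.choice(seq_len, size=2, replace=False)
        m[s, idx] = 1.0
    y = np.sum(x * m, axis=1)                                # sum of the two marked values
    inputs = np.stack([x, m], axis=-1)
    return inputs, y
```

Calling `make_adding_batch(5000, 1000, np.random.default_rng())` once per epoch would mirror the setup described above, in which a fresh set of 5000 sequences is drawn every epoch.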

sMNIST: The sequential MNIST dataset is widely used to analyze the performance of a recurrent model. This dataset consists of 60,000 training samples and 10,000 test samples, each a single vector sequence of length 784 corresponding to the 28 $\times$ 28 pixel images in the MNIST dataset (LeCun et al., 1998). We also use permuted MNIST (pMNIST), a permuted variant of the sMNIST dataset where the sequences are processed with a fixed random permutation, making the task harder.

sCIFAR-10: The sequential CIFAR-10 dataset is another long sequence dataset based on CIFAR-10 (Krizhevsky et al., 2014) with 10 classes. The $32\times 32$ RGB pixel images are reshaped into sequences of length $1024$ with 3 dimensional features corresponding to the RGB channels at every time step. Like in the sMNIST dataset, the dataset consists of 60,000 training samples and 10,000 test samples.
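The conversion from images to pixel sequences for sMNIST, pMNIST and sCIFAR-10 can be sketched as below. The row-major scan order and the helper name `image_to_sequence` are assumptions; the text only specifies the resulting sequence lengths and feature dimensions.

```python
import numpy as np

def image_to_sequence(image, permutation=None):
    """Flatten an (H, W) or (H, W, C) image into a (H*W, C) pixel sequence."""
    image = np.asarray(image)
    if image.ndim == 2:
        image = image[..., None]                 # grayscale: one feature per step
    seq = image.reshape(-1, image.shape[-1])     # row-major scan (assumed order)
    if permutation is not None:
        seq = seq[permutation]                   # same fixed permutation for every sample
    return seq

# A fixed permutation for pMNIST would be drawn once and reused for all samples.
pmnist_permutation = np.random.default_rng(0).permutation(28 * 28)
```

A $28\times 28$ MNIST image yields a sequence of shape (784, 1), and a $32\times 32$ RGB image yields a sequence of shape (1024, 3).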

3.4 Experimental Hyperparameters

For the adding task, a mean squared error (MSE) loss was used with the Adam optimizer (Kingma & Ba, 2014) with a learning rate of $10^{-3}$ . The g-LSTM time gate parameters were trained using a learning rate of $10^{0}$ . For both sMNIST and sCIFAR-10 datasets, the cross-entropy loss function was used along with the RMSProp optimizer (Tieleman & Hinton, 2012) with a learning rate of $10^{-3}$ . Decay parameters of $0.5$ and $0.9$ were used for sMNIST and sCIFAR-10, respectively. The bias of the forget gate is initialized to $1$ following (Jozefowicz et al., 2015).

4 Results

Section 4.1 presents results that demonstrate the faster convergence properties of the g-LSTM on long sequence tasks. Section 4.2 shows the trainability of the time gate parameters of the g-LSTM even when the parameters are initialized in a non-optimal way. Section 4.3 presents a modified loss function used during training to reduce the number of computes for the network update and Section 4.4 presents a new “temporal curriculum” learning schedule that allows g-LSTMs to help LSTMs converge faster.

4.1 Fast convergence properties of g-LSTM

First, we look at the convergence properties of the g-LSTM on the long-sequence adding task, the sMNIST task and the sCIFAR-10 task. Table 1, above, details the network architectures used in the experiments in this section. Similar to the architecture from Trinh et al. (2018), the recurrent layer of the sCIFAR-10 network is followed by two $256$ unit fully-connected (FC) layers, where Drop-Connect (Wan et al., 2013) with $p=0.5$ is applied to the second FC layer. The kernel matrices in the LSTM networks were initialized in an orthogonal manner as described in Cooijmans et al. (2016).

The test performances of these networks during the course of the training on different datasets are shown in Fig. 2, while the corresponding final performance metrics at the end of training are shown in Table 1. From Fig. 2, it is evident that the test loss of the g-LSTM decreases faster in training than the LSTM across all datasets. Further experiments show that this trend is maintained with different training optimizers, LSTM initializations including the bias initialization following Tallec & Ollivier (2018), and network sizes as shown in Appendix D.

Table 2 compares the performance of various networks including the g-LSTM and the baseline LSTM on sMNIST and sCIFAR-10 (from Table 1). The results show that the g-LSTM consistently performs better than the LSTM and has a similar performance to other state-of-the-art networks. Different network sizes were also investigated for the sMNIST task, see Appendix D.

4.2 Trainability of the time gate parameters of g-LSTM

To demonstrate that the g-LSTM can be trained even with non-optimal initializations, we look at the performance of the g-LSTM on the adding task with different time gate parameter initializations. We concern ourselves with sequences of length $1000$ that are difficult for the LSTM. The time gate parameters are initialized in a way to temporally constrain the network so that it can only process for a short period of time. For example, a network with time gate parameters initialized with $\mu=500$ and $\sigma=40$ as in Figure 3 (a) can only process a short period of time around the middle of the sequence. It follows that the network would be unable to learn with these parameters because in the adding task the input data is distributed equally across the sequence length ($T=1000$). Therefore, in order to learn the task from this initialization, the time gate parameters must learn a distribution such that the gates over all hidden units are open across the entirety of the sequence.

We observe that the time gate parameters do learn, as shown in Figure 3 (b), thereby enabling the network to solve the task. Independent of the time gate initialization, the network reaches an MSE of around $3.9\times 10^{-5}$ at the end of 700 epochs; details can be found in Appendix B. The ability of the network to learn the time gate parameters necessary to cover the entire sequence is especially significant: even with this narrow time window initialization, which requires learning of the time gates, the g-LSTM learns the task, whereas the PLSTM does not learn the task as well. An example of this is shown in Appendix C.

4.3 Reduction in computation

Although the formulation of the g-LSTM appears to require more computes, it offers substantial speedup as a large proportion of the neurons can be skipped in a timestep at runtime. We can set a threshold on the time gate so that we skip all corresponding computations for time steps where $k_{n}$ is below this threshold. To further reduce the number of operations, it is preferred that the $\sigma$ of the $k_{n}$ for different neurons should be small but the network performance should not be significantly degraded. To achieve this goal, we included a “computational budget” loss term during the optimization of the gate parameters, $\mu$ and $\sigma$ . The loss equation for updating the $k_{n}$ parameters is given by:

$L=L_{data}+\lambda L_{budget}$

Similar to the Skip RNN network (Campos et al., 2017), a budget loss term which minimizes the average openness of the time gate over time is applied:

$L_{budget}=E[k_{n}]\approx\sum_{n=1}^{N}\sum_{j=1}^{J}k_{n}^{(j)}$

for every neuron $j$ of the g-LSTM. The study was carried out on sMNIST using a network with 110 units, $\sigma$ initialized to 50, $\mu$ initialized uniformly at random between 1 and 784, and a $\lambda$ value of 1. The network’s performance of $2.2\%$ LER was comparable to the network’s performance of $1.3\%$ when no additional budget constraint was imposed. The final $\sigma$ range for the budgeted g-LSTM is much smaller compared to that of the g-LSTM as shown in Fig. 4. There is only a slight increase in LER for the budgeted g-LSTM versus the g-LSTM (see Table 2), even though there is a significant decrease in the average time gate openness across all hidden units.
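The budget term can be sketched as the openness of the Gaussian gates averaged over all time steps and units, added to the data loss with weight $\lambda$. This is a sketch under assumptions: the function names are hypothetical, and normalizing by the number of steps and units (taking the mean rather than the raw double sum) is a choice made here so that the term stays in $[0,1]$.

```python
import numpy as np

def budget_loss(mu, sigma, seq_len):
    """Average time gate openness over all steps n and units j."""
    t = np.arange(1, seq_len + 1, dtype=float)[:, None]            # time inputs, shape (N, 1)
    k = np.exp(-((t - mu[None, :]) ** 2) / sigma[None, :] ** 2)    # gates, shape (N, J)
    return float(k.mean())

def total_loss(data_loss, mu, sigma, seq_len, lam=1.0):
    """L = L_data + lambda * L_budget."""
    return data_loss + lam * budget_loss(mu, sigma, seq_len)
```

Because the budget term shrinks as $\sigma$ shrinks, minimizing the combined loss pushes the gates toward narrower windows as long as the data loss does not suffer, which matches the smaller final $\sigma$ range reported for the budgeted network.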

In order to reduce the number of computes, we set a threshold, $v_{T}$, for $k_{n}$ so that the update steps are carried out only if $k_{n}>v_{T}$; if $k_{n}<v_{T}$, the previous neuron state is copied over to the current state. By increasing $v_{T}$, the number of computes decreases as shown in Fig. 6. In the case of $v_{T}=0.01$, only $8.2\%$ of the time gates are open on average across all hidden units and all time steps. Furthermore, the LER increased only slightly to $2.3\%$ from $2.2\%$.
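The thresholding rule can be sketched as follows. The helper names are hypothetical, and the sketch only shows which state entries are updated; an implementation that actually saves computation would also avoid evaluating the proposed state for the skipped units.

```python
import numpy as np

def time_gate(seq_len, mu, sigma):
    """Gaussian gate values for all steps (rows) and units (columns)."""
    t = np.arange(1, seq_len + 1, dtype=float)[:, None]
    return np.exp(-((t - mu[None, :]) ** 2) / sigma[None, :] ** 2)

def open_fraction(k, v_T):
    """Fraction of (step, unit) pairs whose gate exceeds the threshold."""
    return float(np.mean(k > v_T))

def thresholded_update(k, v_T, h_prev, h_tilde):
    """Blend in the proposed state where k > v_T, otherwise copy the old state."""
    update = k > v_T
    return np.where(update, k * h_tilde + (1.0 - k) * h_prev, h_prev)
```

For a single unit with $\mu=50$ and $\sigma=5$ on a sequence of length 100, the gate exceeds $v_{T}=0.01$ only for $|t_{n}-\mu|<\sigma\sqrt{\ln 100}\approx 10.7$, that is, for 21 of the 100 steps.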

We give a quantitative estimate for the number of operations (Ops) corresponding to the number of update equations for a g-LSTM. In the estimate, we count a multiply and an add operation as 1 Op and non-linear functions as 5 Ops. For an LSTM, the number of operations is given by

$N_{LSTM}=T\,H\,(8D+8H+29)$

where $T$ is the number of time steps, $H$ is the number of hidden units, and $D$ is the dimension of the input data. For a g-LSTM, the number of operations is given by

$N_{g\text{-}LSTM}=N_{LSTM}+N_{gate}$

where $N_{gate}=13\,T\,H$ is the total number of operations for computing the time gate. The total number of operations for the g-LSTM network on the sMNIST dataset is around 80 MOps for $H=110$ and $T=784$; after thresholding the budgeted g-LSTM, this number is reduced to 7.6 MOps. Additional $\lambda$ hyperparameters were also investigated for the sMNIST task, see Appendix D.
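The operation counts can be checked with a few lines of arithmetic. The function names are hypothetical, and $D=1$ for sMNIST is an assumption that follows from feeding one pixel per time step.

```python
def lstm_ops(T, H, D):
    """N_LSTM = T * H * (8D + 8H + 29)."""
    return T * H * (8 * D + 8 * H + 29)

def gate_ops(T, H):
    """N_gate = 13 * T * H."""
    return 13 * T * H

def glstm_ops(T, H, D):
    """N_g-LSTM = N_LSTM + N_gate."""
    return lstm_ops(T, H, D) + gate_ops(T, H)

# sMNIST setting: 784 time steps, 110 hidden units, one pixel per step (assumed D = 1).
smnist_total = glstm_ops(T=784, H=110, D=1)
```

With these values the LSTM part contributes 79,082,080 Ops and the time gate 1,121,120 Ops, for a total of 80,203,200 Ops, in line with the quoted figure of around 80 MOps. Relative to that total, the reported 7.6 MOps after thresholding is a reduction of a little over $10\times$.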

4.4 Temporal curriculum training schedule for LSTMs

We demonstrate that it is possible to train an LSTM network to converge faster on a difficult task by using a “temporal curriculum” training schedule for the equivalent g-LSTM network. According to this schedule, the initial $\sigma$ values of the g-LSTM network are increased continuously throughout the training period ending up with high values by the end of training. With such high values, the time gates are essentially open, resulting in an LSTM network. At every training epoch, the lowest $\rho\%$ of the $\sigma$ values, $\boldsymbol{\hat{\sigma}}$ in the layer are updated as: $\boldsymbol{\hat{\sigma}}\longrightarrow(1+\alpha)\cdot\boldsymbol{\hat{\sigma}}$ .

We analyze the impact of this training schedule for training an LSTM network on sCIFAR-10. For the equivalent g-LSTM network with 110 units, $\mu$ is initialized uniformly at random between 1 and 1024 and $\sigma$ is initialized to 50. An $\alpha$ value of 1/6 and $\rho$ value of $15\%$ are chosen. To ensure that the time gate is fully open by the end of training, $\sigma$ is set to $5000$ across all units during the last $10$ epochs of training. The learning rate of the time gate parameters is set to $0$, i.e. $\mu$ and $\sigma$ are no longer updated. Figure 6 shows that the temporal curriculum training schedule allows for faster convergence of an LSTM network. The final weights of the trained g-LSTM network can then be copied over to an LSTM network for inference.
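The per-epoch update of the schedule described above can be sketched as follows. The function name and the rounding used to turn $\rho$ into a number of units are assumptions; the paper only states that the lowest $\rho\%$ of the $\sigma$ values are scaled by $(1+\alpha)$ each epoch.

```python
import numpy as np

def curriculum_step(sigma, rho=0.15, alpha=1.0 / 6.0):
    """Scale the lowest rho fraction of sigma values by (1 + alpha)."""
    sigma = np.array(sigma, dtype=float)                    # work on a copy
    n_update = max(1, int(round(rho * sigma.size)))         # rounding is an assumption
    lowest = np.argsort(sigma)[:n_update]                   # indices of the smallest sigmas
    sigma[lowest] *= 1.0 + alpha
    return sigma

def open_all_gates(sigma, value=5000.0):
    """Final phase: set every sigma to a large value so the gates stay open."""
    return np.full_like(np.asarray(sigma, dtype=float), value)
```

Applying `curriculum_step` once per epoch gradually widens the narrowest gates, and `open_all_gates` corresponds to fixing $\sigma=5000$ for the last epochs, after which the gated network behaves like the equivalent LSTM.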

5 Gradient flow

We present results regarding backpropagation flow through the LSTM and g-LSTM networks. Following the hypothesis presented in Section 3.2 on the reduced likelihood of vanishing or exploding gradients in the g-LSTM, we investigate the average gradient norms across time steps, similar to the work in (Krueger et al., 2016). We compute the gradient norms of the loss with respect to the hidden activations; the exact definition is given in Appendix E. Comparing the error propagation of the g-LSTM and LSTM networks on the sMNIST task (as in Sec. 4.1), Figure 7 shows the gradient norms at each time step after training for two different epochs.

Interpreting the gradient flow from higher to lower time steps (right to left), the gradients of the g-LSTM shown in Fig. 7 show higher gradient values in earlier time steps than the LSTM. It is possible that one of the reasons the g-LSTM converges more quickly (as in Fig. 2 (c)) is that this back-propagated gradient information is more consistent across time steps and does not vanish at early time steps.

6 Conclusion

This work proposes a novel RNN variant with a time gate which is parameterized by the input in time. The convergence speeds of the g-LSTM and LSTM are similar for short sequence tasks, but the g-LSTM shows faster convergence and produces higher accuracies than LSTM networks on long sequence tasks, as demonstrated on adding task sequences which are longer than 1000 timesteps and on the sMNIST and sCIFAR-10 datasets. We also demonstrate that the time gate parameters of the g-LSTM (unlike those of the PLSTM) are learnable even when the time gates are initialized in an extreme non-optimal manner for the adding task. The time gate of the g-LSTM can reduce the number of computes that is needed for the updates of the LSTM equations. With an additional loss term to reduce the compute budget, the $\sigma$ values of the time gate are reduced, leading to a $10\times$ decrease in the number of actual computes with little loss in network accuracy for the sMNIST dataset. The observation that the budgeted g-LSTM has neurons which are closed by the timing gate suggests that this method can be used to prune a network. We also show that our proposed temporal curriculum training schedule for the g-LSTM can help a corresponding LSTM network to converge during training on long sequence tasks. For future work, it will be of interest to investigate whether these properties carry over to larger or domain-specific datasets.

Appendix A Back propagation in Gaussian-gated RNN

For ease of illustration we analyze the gradient of a plain RNN with a Gaussian time gate (Eqs. 7 and 8).

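$h_{n}=k_{n}\cdot\tilde{h}_{n}+(1-k_{n})\cdot h_{n-1}$ (7)

$\tilde{h}_{n}=f(W_{x}x_{n}+W_{h}h_{n-1})$ (8)

$\frac{\partial h_{N}}{\partial W_{h}}=\frac{\partial\tilde{h}_{0}}{\partial W_{h}}\prod_{n=1}^{N}\left(k_{n}W_{h}f_{n}^{\prime}+(1-k_{n})\right)+\sum_{n=1}^{N}\left(k_{n}f_{n}^{\prime}h_{n-1}\right)\prod_{s=n+1}^{N}\left(k_{s}W_{h}f_{s}^{\prime}+(1-k_{s})\right)$ (9)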

where $\frac{\partial\tilde{h}_{0}}{\partial W_{h}}=1,h_{0}=\tilde{h}_{0}=0$ .

From Eq. 9 we can deduce some information about the advantages of the Gaussian time gate in gradient flow for two simple cases of the function $k_{n}$ .

In Case 1 we choose a timing gate openness which corresponds to a very small $\sigma$ for the Gaussian gate, i.e. the gate is only open for $1$ time step.

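$k_{5}=1,\quad k_{n}=0\;\;\forall n\in\{1,...,N\}\setminus\{5\}$

$\frac{\partial h_{N}}{\partial W_{h}}=f_{5}^{\prime}W_{h}+f_{5}^{\prime}W_{h}f_{5}^{\prime}h_{4}=f_{5}^{\prime}W_{h}(1+f_{5}^{\prime}h_{4})$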

In Case 2 we choose a timing gate openness which corresponds to a slightly larger $\sigma$ for the Gaussian gate, i.e. it is open for $5$ time steps.

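$k_{2}=k_{3}=k_{4}=k_{5}=k_{6}=1,\quad k_{n}=0\;\;\forall n\in\{1,...,N\}\setminus\{2,3,4,5,6\}$

$\frac{\partial h_{N}}{\partial W_{h}}=f_{2}^{\prime}W_{h}\,f_{3}^{\prime}W_{h}\,f_{4}^{\prime}W_{h}\,f_{5}^{\prime}W_{h}\,f_{6}^{\prime}W_{h}\,(1+f_{2}^{\prime}h_{1}+f_{3}^{\prime}h_{2}+f_{4}^{\prime}h_{3}+f_{5}^{\prime}h_{4}+f_{6}^{\prime}h_{5})$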

These cases show that there are fewer terms in the gradient for a timing gate that is opened for only a small fraction of the sequence.
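The product-and-sum structure of this gradient can be checked numerically for a scalar gated RNN with $f=\tanh$. Differentiating $h_{n}=k_{n}f(W_{x}x_{n}+W_{h}h_{n-1})+(1-k_{n})h_{n-1}$ gives the recursion $\frac{\partial h_{n}}{\partial W_{h}}=\left(k_{n}W_{h}f_{n}^{\prime}+(1-k_{n})\right)\frac{\partial h_{n-1}}{\partial W_{h}}+k_{n}f_{n}^{\prime}h_{n-1}$, which unrolls into a product over all steps plus a sum of per-step terms. The sketch below is an illustration with hypothetical function names; the derivative of the initial state is exposed as the parameter `dh0`, and it is set to 0 (a constant initial state) when comparing against finite differences.

```python
import numpy as np

def forward(w_h, w_x, x, k, h0=0.0):
    """Run the gated scalar RNN; return states h_0..h_N and pre-activations a_1..a_N."""
    h, a = [h0], []
    for n in range(len(x)):
        a_n = w_x * x[n] + w_h * h[-1]
        a.append(a_n)
        h.append(k[n] * np.tanh(a_n) + (1.0 - k[n]) * h[-1])
    return np.array(h), np.array(a)

def grad_closed_form(w_h, w_x, x, k, dh0=0.0):
    """Unrolled gradient of h_N with respect to w_h."""
    h, a = forward(w_h, w_x, x, k)
    fp = 1.0 - np.tanh(a) ** 2                   # f'_n for n = 1..N
    A = k * w_h * fp + (1.0 - k)                 # per-step factor k_n W_h f'_n + (1 - k_n)
    total = dh0 * np.prod(A)
    for n in range(len(x)):
        # Term for step n+1: k f' h_{previous} times the factors of all later steps.
        total += k[n] * fp[n] * h[n] * np.prod(A[n + 1:])
    return total

def grad_finite_difference(w_h, w_x, x, k, eps=1e-6):
    """Central finite-difference estimate of the same derivative."""
    hi = forward(w_h + eps, w_x, x, k)[0][-1]
    lo = forward(w_h - eps, w_x, x, k)[0][-1]
    return (hi - lo) / (2.0 * eps)
```

Steps with $k_{n}=0$ contribute a factor of exactly 1 to the products and nothing to the sum, which is the sense in which a narrow gate leaves fewer influential terms in the gradient.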

Appendix B Comparing various g-LSTM initializations

Appendix C Comparing time gate parameter trainability in g-LSTM and PLSTM

Appendix D Hyperparameter Investigation

We look at the network performance for different hyperparameter values, focusing on the sMNIST task.

Network Initialization and Optimizer

In Fig. 2(c) of Section 4.1, we show that the g-LSTM network converges faster than the LSTM for the sMNIST task using the RMSProp optimizer and an orthogonal initialization of the LSTM kernels of both networks, as in (Cooijmans et al., 2016). In addition to this initializer and optimizer, we include results using the ADAM optimizer (learning rate of $10^{-3}$ ) and a random “Xavier” weight initialization, as in (Glorot & Bengio, 2010). Across all of these training configurations we consistently observe that the g-LSTM converges more quickly than the LSTM.

We ran further experiments to compare against the chrono initialization of the LSTM forget and input biases from (Tallec & Ollivier, 2018). The forget and input biases are set as $b_{f}\sim\log(\mathcal{U}([1,T_{max}-1]))$ and $b_{i}=-b_{f}$ , where $T_{max}=784$ for the sMNIST task. The use of the time gate with the g-LSTM shortens the effective sequence length for each unit; to account for this, we also provide the results of using a smaller $T_{max}$ value for the chrono initialization, $T_{max}=\sigma=250$ . The comparison of both g-LSTM and LSTM with the chrono initialization and with the “constant initialization” ( $b_{f}=1$ ) in Fig. 12 shows that the g-LSTM with the constant initialization converges the fastest. We hypothesize that the g-LSTM converges faster with the constant initialization than with the chrono initialization because the sequence-length-shortening effect of the time gate reduces the need for the long memory that chrono initialization seeks to provide. When we reduce the maximum temporal dependency for the chrono initialization (to $T_{max}=250$ , “chrono-g-LSTM-250”), this g-LSTM network converges more quickly, similar to the g-LSTM with a constant bias initialization. This suggests that these two techniques, chrono initialization and a Gaussian time gate, could be used together to improve convergence in LSTM networks.
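
The chrono initialization described above can be sketched in a few lines of Python. This is a minimal illustration of the sampling rule $b_{f}\sim\log(\mathcal{U}([1,T_{max}-1]))$ , $b_{i}=-b_{f}$ only; the function name `chrono_init`, the fixed seed, and the use of plain Python lists are assumptions made for this example and do not reflect the training code used in the experiments.

```python
import math
import random


def chrono_init(num_units, t_max, seed=0):
    """Sample chrono-style forget and input gate biases for num_units hidden units."""
    rng = random.Random(seed)  # fixed seed only to make the example reproducible
    # Forget bias: log of a uniform sample from [1, t_max - 1].
    b_f = [math.log(rng.uniform(1.0, t_max - 1.0)) for _ in range(num_units)]
    # Input bias: the negative of the forget bias.
    b_i = [-b for b in b_f]
    return b_f, b_i


# T_max = 784 for sMNIST, and the shortened T_max = 250 variant.
b_f, b_i = chrono_init(num_units=110, t_max=784)
b_f_short, b_i_short = chrono_init(num_units=110, t_max=250)

print("largest forget bias, T_max=784:", max(b_f))
print("largest forget bias, T_max=250:", max(b_f_short))
```

A smaller $T_{max}$ caps the forget biases at a lower value, which corresponds to initializing the units for shorter memory time scales.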

Network Size

Aside from the network size of $110$ hidden units, we investigated the training convergence for two additional network sizes: $25$ and $220$ hidden units. Note that the LSTM for network size $25$ was trained for $100$ additional epochs, until convergence was observed. Across all network sizes the g-LSTM converges much faster than the LSTM network. With fewer hidden units, as seen in Fig. 13 (a), the speed-up in convergence of the g-LSTM over the LSTM is even more dramatic. The final LERs (g-LSTM, LSTM) for each network size are: 25 units ( $2.79\%$ , $3.58\%$ ), 110 units ( $1.35\%$ , $1.81\%$ ), 220 units ( $1.10\%$ , $1.34\%$ ).
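
To put the final LERs listed above on a common scale, the short Python sketch below computes the relative LER reduction of the g-LSTM with respect to the LSTM at each network size. The relative-reduction measure and the names used here are an illustrative choice for this example, not a metric reported in the experiments.

```python
# Final LERs in percent, as (g-LSTM, LSTM) pairs keyed by number of hidden units.
final_lers = {25: (2.79, 3.58), 110: (1.35, 1.81), 220: (1.10, 1.34)}


def relative_reduction(g_lstm_ler, lstm_ler):
    """Relative LER reduction of the g-LSTM over the LSTM, in percent."""
    return 100.0 * (lstm_ler - g_lstm_ler) / lstm_ler


reductions = {
    units: relative_reduction(g, l) for units, (g, l) in final_lers.items()
}

for units, reduction in reductions.items():
    print(f"{units} units: g-LSTM LER is {reduction:.1f}% lower than LSTM")
```

Under this measure the g-LSTM's final LER is roughly 18% to 25% lower than the LSTM's across the three network sizes.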

Budgeted g-LSTM

We provide additional results for the budgeted network (subsection 4.3) with two additional $\lambda$ values, $\lambda=0.1$ and $\lambda=10$ , and compare them with the original result for $\lambda=1$ . The final LERs are $2.4\%$ ( $\lambda=0.1$ ), $2.2\%$ ( $\lambda=1$ ), and $2.8\%$ ( $\lambda=10$ ). The number of computes used by the network trained with $\lambda=10$ is significantly lower than that used by the networks trained with $\lambda=0.1$ and $\lambda=1$ .

Appendix E Average Gradient Norm Definition

The average gradient norm in Section 5 is defined as:

[TABLE]

where $N$ is the number of time steps of the sequence (for sMNIST, $N=784$ ).

[TABLE]

summing over all $L$ samples of the training set and all $K$ hidden units.

Bibliography

1. Campos et al. (2017) Victor Campos, Brendan Jou, Xavier Giro i Nieto, Jordi Torres, and Shih-Fu Chang. Skip RNN: learning to skip state updates in recurrent neural networks. CoRR, abs/1708.06834, 2017. URL http://arxiv.org/abs/1708.06834.
2. Chang et al. (2017) Shiyu Chang, Yang Zhang, Wei Han, Mo Yu, Xiaoxiao Guo, Wei Tan, Xiaodong Cui, Michael J. Witbrock, Mark Hasegawa-Johnson, and Thomas S. Huang. Dilated recurrent neural networks. CoRR, abs/1710.02224, 2017. URL http://arxiv.org/abs/1710.02224.
3. Cho et al. (2014) Kyunghyun Cho, Bart van Merriënboer, Çağlar Gülçehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN Encoder–Decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1724–1734, Doha, Qatar, October 2014. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/D14-1179.
4. Cooijmans et al. (2016) Tim Cooijmans, Nicolas Ballas, Cesar Laurent, and Aaron C. Courville. Recurrent batch normalization. CoRR, abs/1603.09025, 2016. URL http://arxiv.org/abs/1603.09025.
5. Glorot & Bengio (2010) Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Yee Whye Teh and Mike Titterington (eds.), Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, volume 9 of Proceedings of Machine Learning Research, pp. 249–256, Chia Laguna Resort, Sardinia, Italy, 13–15 May 2010. PMLR. URL http://proceedings.mlr.press/v9/glorot10a.html.
6. Hochreiter & Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Comput., 9(8):1735–1780, November 1997. ISSN 0899-7667. doi: 10.1162/neco.1997.9.8.1735. URL http://dx.doi.org/10.1162/neco.1997.9.8.1735.
7. Jozefowicz et al. (2015) Rafal Jozefowicz, Wojciech Zaremba, and Ilya Sutskever. An empirical exploration of recurrent network architectures. In International Conference on Machine Learning, pp. 2342–2350, 2015.
8. Kingma & Ba (2014) Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014. URL http://arxiv.org/abs/1412.6980.