Dynamic Cell Structure via Recursive-Recurrent Neural Networks

Xin Qian, Matthew Kennedy, and Diego Klabjan

arXiv:1905.10540·cs.LG·May 28, 2019

TL;DR

This paper introduces a dynamic neural architecture search method that constructs customized cell structures for each data sample and time step, improving efficiency and accuracy in recurrent neural networks.

Contribution

It presents a novel recursive-recurrent neural network algorithm that dynamically searches for optimal cell structures tailored to individual data samples.

Findings

1. Achieves higher prediction accuracy than GRU in language modeling.

2. Discovers high-performance cell architectures through experiments.

3. Demonstrates efficiency in architecture search for recurrent models.

Abstract

In a recurrent setting, conventional approaches to neural architecture search find and fix a general model for all data samples and time steps. We propose a novel algorithm that can dynamically search for the structure of cells in a recurrent neural network model. Based on a combination of recurrent and recursive neural networks, our algorithm is able to construct customized cell structures for each data sample and time step, allowing for a more efficient architecture search than existing models. Experiments on three common datasets show that the algorithm discovers high-performance cell architectures and achieves better prediction accuracy compared to the GRU structure for language modelling and sentiment analysis.

Tables

Table 1: Performance of models on three datasets

Model	Wiki-5k Val (BPC)	Wiki-5k Test (BPC)	Wiki-10k Val (BPC)	Wiki-10k Test (BPC)
RRNN	2.58 (-5.5%)	2.63 (-1.9%)	–	–
RRNN-GRU	–	–	2.43 (-5.8%)	2.42 (-5.8%)
GRU	2.73	2.68	2.58	2.57

Model	SST Val (Accuracy)	SST Test (Accuracy)	PTB Val (Perplexity)	PTB Test (Perplexity)
RRNN-GRU	65.1% (-0.8%)	68.7% (-5.2%)	281	239
GRU	64.6%	65.3%	247 (-12.1%)	239

Table 2: Hyperparameter search range for RRNN-GRU

Hyperparameter	Wiki-10k	SST	PTB
Batch size	$[1, 316]$	$[1, 316]$	$[1, 316]$
Learning rate	$[10^{- 5}, 10^{- 2}]$	$[10^{- 5}, 10^{- 2}]$	$[10^{- 5}, 10^{- 2}]$
$λ_{2}$	$[10^{- 2}, 1]$	$[10^{- 5}, 10^{- 2}]$	$[3 \times 10^{- 2}, 3]$
$λ_{3}$	$[10^{- 16}, 1]$	$[10^{- 5}, 10^{- 2}]$	$[3 \times 10^{- 16}, 3 \times 10^{- 2}]$
$λ_{4}$	$[10^{- 6}, 10^{- 2}]$	$[10^{- 8}, 10^{- 4}]$	$[3 \times 10^{- 6}, 3 \times 10^{- 2}]$
Scoring margin $M$	$[0.1, 10]$	$[0.1, 10]$	$[0.1, 10]$
Gradient clipping threshold	$[0.1, 100]$	$[0.1, 100]$	$[0.1, 100]$
Alternate frequency	$[1, 10]$	$[1, 10]$	$[1, 10]$

Table 3: Hyperparameter search range for GRU

Hyperparameter	Wiki-5k/Wiki-10k	SST	PTB
Batch size	$[8, 256]$	$[4, 128]$	$[8, 256]$
Learning rate	$[10^{- 5}, 10^{- 1}]$	$[10^{- 6}, 10^{1}]$	$[10^{- 4}, 10^{- 1}]$
$ℓ_{2}$ weight decay coefficient	$[10^{- 16}, 1]$	$[10^{- 16}, 10^{- 2}]$	$[10^{- 16}, 1]$

Equations

r_{t}, \quad z_{t}, \quad \tilde{h}_{t}, \quad h_{t}

f = f_{N} \circ f_{N-1} \circ \dots \circ f_{2} \circ f_{1},

(T_{t}^{\mathrm{pred}}, h_{t}) = f(\mathcal{N}_{0}^{t}(x_{t}, h_{t-1})), \quad q_{t} = g(x_{t}, h_{t}; \Gamma).

L(\Phi) = \mathbb{E}_{(X,Y)}\Bigg[\sum_{t=1}^{T}\bigg\{\lambda_{1} l(y_{t}, q_{t}) \;\cdots

\cdots\; + \lambda_{4} \sum_{\phi \in \Phi} \|\phi\|^{2},

\mathrm{VD}(T_{1}, T_{2}) = \sum_{i \in \mathcal{I}(T_{1}) \cap \mathcal{I}(T_{2})} \|v_{i,T_{1}} - v_{i,T_{2}}\|^{2} + \sum_{i \in \mathcal{I}(T_{1}) \setminus \mathcal{I}(T_{2})} \|v_{i,T_{1}}\|^{2} + \sum_{i \in \mathcal{I}(T_{2}) \setminus \mathcal{I}(T_{1})} \|v_{i,T_{2}}\|^{2}

\mathrm{TD}(T^{\mathrm{pd}}, T^{\mathrm{tgt}}) = \sum_{n_{1} \in V(T^{\mathrm{pd}})} \min_{n_{2} \in V(T^{\mathrm{tgt}})} \Big\{\mathrm{VD}\big(\mathrm{Subtree}(T^{\mathrm{pd}}, n_{1}), \mathrm{Subtree}(T^{\mathrm{tgt}}, n_{2})\big)\Big\}.

(T_{t,i}^{\mathrm{pred}}, h_{t,i}) = f^{i}(\mathcal{N}_{0,i}^{t}), \quad i = 1, 2, \dots, M,

q_{t} = g(x_{t}, h_{t,M}; \Gamma),

(T_{t,1}^{\mathrm{pred}}, c_{t}) = f^{1}(\mathcal{N}_{0,1}^{t}(x_{t}, c_{t-1}, h_{t-1})),

(T_{t,2}^{\mathrm{pred}}, h_{t}) = f^{2}(\mathcal{N}_{0,2}^{t}(x_{t}, c_{t-1}, h_{t-1}, c_{t})),

q_{t} = g(x_{t}, h_{t}; \Gamma),

f_{t}, \quad i_{t}, \quad o_{t}, \quad c_{t}, \quad h_{t}

\alpha(v; \Theta) = \sum_{k=1}^{n} k \exp\left(-\frac{\|v - v_{k}\|^{2}}{2\sigma_{0}^{2}}\right)

\alpha(v_{i+1}; \Theta) = \sum_{k=1}^{n} k \exp\left(-\frac{\|v_{i+1} - v_{k}\|^{2}}{2\sigma_{0}^{2}}\right) \leq \sum_{1 \leq k \leq n,\ k \neq i+1} k = \frac{n(n+1)}{2} - (i+1).

\alpha(v_{i}; \Theta) \;\cdots\; = \frac{n(n+1)}{2} - i - \frac{n-1}{n} > \frac{n(n+1)}{2} - (i+1).

\alpha(v_{i+1}; \Theta) \leq \frac{n(n+1)}{2} - (i+1) < \alpha(v_{i}; \Theta),

\|L\| \leq C_{1}, \ \forall L \in \mathcal{L}, \quad \|R\| \leq C_{1}, \ \forall R \in \mathcal{R},

\|u'\|_{\infty} \leq C_{2}, \ \forall u \in \mathcal{U},

\frac{\partial o(L_{i} v_{1}, R_{i} v_{2})}{\partial (L_{i} v_{1})} \leq C_{3}, \quad \frac{\partial o(L_{i} v_{1}, R_{i} v_{2})}{\partial (R_{i} v_{2})} \leq C_{3}, \quad 1 \leq i \leq n_{l}, \ \forall v_{1}, v_{2} \in \mathcal{V}_{t}, \ v_{1} \neq v_{2}, \ \forall o \in \mathcal{O},

\frac{\partial \mathcal{E}_{t}}{\partial h_{t}} \leq C_{4}, \quad t = 0, 1, \dots,

C_{1} C_{2} C_{3} < \frac{1}{2}.

\frac{\partial \mathcal{E}_{T}}{\partial \phi} = \sum_{t'=1}^{T} \frac{\partial \mathcal{E}_{T}}{\partial h_{T}} \frac{\partial h_{T}}{\partial h_{t'}} \frac{\partial^{+} h_{t'}}{\partial \phi} = \sum_{t'=1}^{T} \frac{\partial \mathcal{E}_{T}}{\partial h_{T}} \left(\prod_{t' < t \leq T} \frac{\partial h_{t}}{\partial h_{t-1}}\right) \frac{\partial^{+} h_{t'}}{\partial \phi}.

\frac{\partial v}{\partial v_{1}} = \mathrm{diag}\{u'(o(L_{i} v_{1}, R_{i} v_{2}) + b_{i})\} \frac{\partial o(L_{i} v_{1}, R_{i} v_{2})}{\partial (L_{i} v_{1})} L_{i} \leq \|\mathrm{diag}\{u'(o(L_{i} v_{1}, R_{i} v_{2}) + b_{i})\}\| \frac{\partial o(L_{i} v_{1}, R_{i} v_{2})}{\partial (L_{i} v_{1})} \|L_{i}\| \leq C_{1} C_{2} C_{3},

\left\|\frac{\partial h_{t}}{\partial h_{t-1}}\right\| = \sum_{k=1}^{N+1} \left\|\prod_{j=1}^{l_{k}} \frac{\partial P_{j}^{k}}{\partial P_{j-1}^{k}}\right\| \leq \sum_{k=1}^{N+1} \prod_{j=1}^{l_{k}} \left\|\frac{\partial P_{j}^{k}}{\partial P_{j-1}^{k}}\right\| \leq \sum_{k=1}^{N+1} C_{0}^{l_{k}},

\sum_{k=1}^{N+1} C_{0}^{l_{k}} \leq 1 - \epsilon.

\sum_{k=1}^{N+1} C_{0}^{l_{k}} \leq C_{0}^{N} + \sum_{k=1}^{N} C_{0}^{k}.

Taxonomy

Topics: Neural Networks and Applications · Model Reduction and Neural Networks · Neural Networks and Reservoir Computing

Methods: Sigmoid Activation · Tanh Activation · Softmax · Long Short-Term Memory · Gated Recurrent Unit

Full text

Xin Qian

Matthew Kennedy

Weinberg College of Arts and Sciences, Northwestern University

Diego Klabjan

1 Introduction

First proposed by Hopfield [12], Recurrent Neural Network (RNN) models excel at machine learning tasks that involve sequential data, such as natural language processing. Researchers soon noted that a major obstacle for RNN models lies in computing gradients by backpropagation. Since RNNs are trained by backpropagation through time, the recurrent structure is unfolded into a huge feed-forward network with many layers, and gradients tend to grow or vanish exponentially in the same way as in very deep feed-forward neural networks [20]. Many extensions of RNN models, such as Long Short-Term Memory (LSTM) [11] and Gated Recurrent Units (GRU) [5], have been proposed to address this problem. These models achieve state-of-the-art results in many machine learning tasks such as language modeling [6] and speech recognition [15, 22].

However, the cell structure of these hand-crafted RNN models, like LSTM and GRU, is fixed across all time steps and data samples. Finding a suitable cell structure through trial and error is also time-consuming and tedious [19]. Lastly, there is no universal answer to which cell structure to use for different types of data and different problems. Therefore, a more flexible model that can automatically determine the cell structures based on a finite set of trainable parameters is needed to deal with increasingly complicated and diversified data sources and problems.

There is another line of research about Recursive Neural Network (RecNN) models [24]. A RecNN model is defined over recursive tree structures – each node of the tree corresponds to a vector computed from its child nodes, and the information passes from the leaf nodes and internal nodes to the root node in a bottom-up manner. The model produces a structured prediction such as a tree by applying the same set of trainable parameters recursively. Derivatives of errors are computed with back-propagation over the tree structures [8]. RecNN has shown great success in learning tree structures of certain natural language processing tasks [25] because the structures it dynamically produces are customized for each data sample.

We consider how to make the cell structure in RNN models time-variant and sample-dependent. We note that the equations governing a cell can be represented as a computational tree where each non-leaf node corresponds to a vector that is computed from the vectors on its two child nodes. The initial multiset of vectors is composed of the current feature vector at time $t$ and all vectors produced by the previous cell (hidden state representation). If we augment this multiset with constant vectors, such as the zero vector, we can then express the mathematical equations behind a cell as a tree on this multiset. RecNN is an appropriate model to capture such a tree by means of a finite set of trainable parameters. In summary, our proposed model uses a RecNN in each time step as a replacement for a fixed set of equations. In this way we obtain an architecture with cells that depend on the time step and on each individual sample. In addition to this flexibility, the approach does not require hand-crafting of cells.

Our model shows strong results on a series of language modeling and sentiment prediction tasks. In the experiments we show that RRNN is able to design sample-dependent tree structures on the Wikipedia dataset and achieves a 5.5% improvement in Bits per Character (BPC) compared to GRU. The performance on the datasets also shows the advantage of dynamically designing cell structures for each sample.

The major contribution of this paper is a novel architecture that dynamically searches for the structure of cells in an RNN. Our model, called a Recursive-Recurrent Neural Network (RRNN), recursively designs the cell structure with the help of a scoring function and allows us to build different cell structures under a fixed set of parameters. The proposed model can generate the cell structure of some traditional RNN models, such as GRU and LSTM, which we establish theoretically. Most importantly, the output tree structures of hidden cells in RRNN are customized for each data sample, and therefore they are time-variant and data-dependent. In addition, we define a new tree distance metric that measures the difference between trees with vectors on each of their nodes. We also state and prove sufficient and necessary conditions for avoiding the exploding and vanishing gradient problems that usually appear in recurrent neural network models. While such results are known for RNNs, they have not yet been established for RecNNs. Furthermore, our result applies to RRNNs, which are a combination of RNN and RecNN.

The rest of the manuscript is structured as follows. In Section 2 we review the literature, while in Section 3 we present the RRNN model, including an algorithm to construct trees, the design of the loss function, and other extensions. Section 4 presents some properties of the RRNN model. In Section 5 we introduce the datasets and discuss all experimental results. We defer the proofs of the theorems and other technical details to the Appendix.

2 Literature Review

A recurrently connected structure in an RNN can improve the performance of a model through its ability to infer sequential dependencies [16]. Despite their success, vanilla RNN models are still limited in the training algorithms they can employ, owing to the exploding or vanishing gradients that may appear in the training phase [2]. LSTM [11] is one of the most popular ways to address this problem. Many variants have since been proposed to improve the performance of LSTM [10, 14]. RNN models often work well when a hand-crafted cell structure is well designed, but this requires time and expertise and leads to a fragile setting that works only on a particular problem or, worse, on a single dataset. This is clearly less general and less flexible than the method proposed in this paper, where the cells are algorithmically designed.

Recursion is the division of a problem into subproblems of the same type and the application of an algorithm to each subproblem. It can help with augmenting neural architectures and improving the generalization ability of a model [3]. RecNN greedily searches hierarchical tree structures and achieves state-of-the-art performances on tasks like semantic analysis in natural language processing and image segmentation [24, 25].

To provide better flexibility and robustness, automatically searching a neural network architecture is thus a logical next step. Neural Architecture Search (NAS), a subfield of AutoML, is a method which algorithmically finds an architecture; it has significant overlap with hyper-parameter optimization and meta-learning [7]. A simple approach to NAS is to build a layer-chained neural network where layers are differentiated by their choices of operations (pooling, convolution, etc.), activation functions (ReLU, Sigmoid, etc.), width, etc. [4, 26, 1]. Despite its impressive empirical performance, NAS is computationally expensive and time consuming [29].

Various methods of producing novel cell structures for RNNs have been recently proposed. [28] introduce a reinforcement learning approach that utilizes policy gradient to search for convolutional and recurrent neural architectures. However, the reinforcement learning approach is computationally expensive in the sense that obtaining an architecture with state-of-the-art performance on CIFAR-10 and ImageNet requires 1,800 GPU days [29]. [21] accelerate the search process by sharing parameters among potential architectures. [23] introduce a more flexible algorithm that searches for novel RNNs of arbitrary depth and width. [17] relax the discrete architecture space by continuous probability vectors and utilize a gradient based optimization method to derive an optimal architecture. All these methods are extremely computationally demanding and they yield a fixed network architecture for all times and samples. Some exceptions are in [9] and [27] where the proposed models automatically adjust the number of layers of the LSTM model based on time and sample but the cell structures are static. Our RRNN model further extends this property such that the predicted cell structures are time-variant and sample-dependent.

3 Recursive-Recurrent Neural Network Model

Generally, RNNs consist of two parts: a hidden cell (recurrent cell) and an output layer. A single sample input of an RNN is a sequence of vectors $\left\{x_{t}\in\mathbb{R}^{p}:t=1,2,\ldots,T\right\}$, indexed by time step $t$. Given a hidden state $h_{t-1}$, the $t$-th recurrent cell defines the next hidden state $h_{t}$ by $h_{t}=f(x_{t},h_{t-1})$. The output layer is usually a simple network that takes $x_{t}$ and $h_{t}$ as input and returns $q_{t}=g(x_{t},h_{t};\Gamma)$ as output. These two equations are applied for $t=1,2,\ldots,T$.

Function $f$ defined above is time-invariant and thus remains the same in all time steps and for all samples. To address this shortcoming, we propose a new model that can dynamically design the recurrent cell structure (i.e. generate different functions $f$) with respect to the argument vectors. This is inspired by the idea of RecNNs, and we therefore call it the Recursive-Recurrent Neural Network model. A dynamic architecture has two advantages: (1) there is no need to hand-craft a cell, and (2) it automatically adjusts to each time step and sample.

A simple RecNN model starts with a set of input nodes $\left\{p_{1},\ldots,p_{n}\right\}$ with corresponding embedding vectors $\left\{c_{1},\ldots,c_{n}\right\}$. Two nodes are merged into a parent node using a pair of weight matrices $L$ and $R$, a bias vector $b$, and an activation function $\sigma$ that provides non-linearity. For two nodes $p_{i}$ and $p_{j}$, their parent, denoted by $p_{i,j}$, is also a node, with embedding vector $c_{i,j}=\sigma\left(Lc_{i}+Rc_{j}+b\right)$. In each iteration, we compute the scores $s_{i,j}=W^{\mathrm{score}}c_{i,j}$ for all pairs of nodes $(p_{i},p_{j})$ and select the pair $(p_{i_{1}},p_{j_{1}})$ with the highest score. We then merge nodes $p_{i_{1}}$ and $p_{j_{1}}$ into the parent node $p_{i_{1},j_{1}}$ and remove the two child nodes from further consideration. This procedure repeats until all nodes are merged and only one parent node $p_{\mathrm{out}}$ remains. The parameters and activation function $\{L,R,b,\sigma,W^{\mathrm{score}}\}$ are shared across the whole network. The RecNN model returns $p_{\mathrm{out}}$ and the binary tree rooted at $p_{\mathrm{out}}$ as the model output.
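The greedy merging procedure described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function name `recnn_merge`, the choice of `tanh` as the activation, the random parameters, and the decision to score ordered pairs (so that $L$ and $R$ can treat the two children differently) are all assumptions made for the example.

```python
import numpy as np

def recnn_merge(vectors, L, R, b, w_score, act=np.tanh):
    """Greedily merge nodes until one root remains; returns (root_vector, tree)."""
    # Each entry pairs a vector with its tree: a leaf index or a (left, right) tuple.
    nodes = [(np.asarray(v, dtype=float), i) for i, v in enumerate(vectors)]
    while len(nodes) > 1:
        best = None
        # Assumption: every ordered pair (i, j) with i != j is a merge candidate.
        for i in range(len(nodes)):
            for j in range(len(nodes)):
                if i == j:
                    continue
                parent = act(L @ nodes[i][0] + R @ nodes[j][0] + b)
                score = float(w_score @ parent)
                if best is None or score > best[0]:
                    best = (score, i, j, parent)
        _, i, j, parent = best
        merged = (parent, (nodes[i][1], nodes[j][1]))
        # Remove the two children from further consideration and add their parent.
        nodes = [n for k, n in enumerate(nodes) if k not in (i, j)] + [merged]
    return nodes[0]

# Hypothetical usage with random parameters of dimension p = 4.
rng = np.random.default_rng(0)
p = 4
leaves = [rng.standard_normal(p) for _ in range(3)]
L, R = rng.standard_normal((p, p)), rng.standard_normal((p, p))
b, w = rng.standard_normal(p), rng.standard_normal(p)
root_vec, tree = recnn_merge(leaves, L, R, b, w)
```

Because the same parameters are reused at every merge, different input vectors can lead to different nested tuples in `tree`, which is the property the RRNN model builds on.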

The RRNN model replaces the fixed hidden cell of an RNN by a recursive tree, dynamically determined by an algorithm similar to RecNN. Note that, even with the fixed set of parameters and activation function $\left\{L,R,b,\sigma,W^{\mathrm{score}}\right\}$, the RecNN model can dynamically produce different tree structures based on the input nodes (vectors). Therefore, in RRNN, the recurrent cell can differ across time steps and data points. We further discuss the RRNN model in the following sections.

3.1 Recursive-Recurrent Neural Network Model Framework

We start with an example of how to represent the hidden cell structure of GRU as a binary tree annotated with computational information. In the following we assume $X$ is a given sample, where $X=(x_{1},\ldots,x_{T})$ with $x_{i}\in\mathbb{R}^{p}$ is a sequence of input vectors. Recall that the GRU equations are:

[TABLE]

where $\{W_{r},W_{r}^{\prime},W_{z},W_{z}^{\prime},W_{h},W_{h}^{\prime}\}$ and $\{b_{r},b_{z},b_{h}\}$ are parameter matrices and bias vectors of GRU, respectively. Equations (1)–(4) jointly define the function $f$ of the $t$-th hidden cell of GRU. As shown in Figure 1, the above equations can also be regarded as a binary tree where each node of the tree corresponds to a 3-tuple (binary operator, activation function, bias vector), and each edge is associated with a trainable matrix or identity matrix.
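For reference, one standard GRU formulation written with the parameter names above is sketched below. The two gate equations use the same form as the merge steps discussed in this section. The candidate state and the interpolation convention in the last line are assumptions, since conventions differ across references (some swap the roles of $z_{t}$ and $1-z_{t}$), so this should be read as an illustration and not necessarily as the exact equations (1)–(4).

```latex
\begin{aligned}
r_{t} &= \sigma\left(W_{r}x_{t}+W_{r}^{\prime}h_{t-1}+b_{r}\right), \\
z_{t} &= \sigma\left(W_{z}x_{t}+W_{z}^{\prime}h_{t-1}+b_{z}\right), \\
\tilde{h}_{t} &= \tanh\left(W_{h}x_{t}+W_{h}^{\prime}(r_{t}\odot h_{t-1})+b_{h}\right), \\
h_{t} &= (1-z_{t})\odot h_{t-1}+z_{t}\odot\tilde{h}_{t}.
\end{aligned}
```

Under this form, expanding $h_{t}$ into a tree uses $x_{t}$ four times (once in $r_{t}$, twice through the two occurrences of $z_{t}$, once in $\tilde{h}_{t}$), $h_{t-1}$ five times, and one constant vector to form $1-z_{t}$, which gives ten leaves and nine binary merges.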

The tree structure can be obtained in the RecNN scheme by giving a multiset of initial nodes $\mathcal{N}_{0}=\left\{x_{t},x_{t},x_{t},x_{t},h_{t-1},h_{t-1},h_{t-1},h_{t-1},h_{t-1},\mathbf{0}\right\}$. Assuming an appropriate scoring function, in the first iteration we find that the parent node that combines $x_{t}$ and $h_{t-1}$ with parameters and operations $(W_{r},W_{r}^{\prime},b_{r},+,\sigma)$ has the highest score, so we merge the two nodes $x_{t}$ and $h_{t-1}$ to obtain node $r_{t}$. In the second iteration, we find that the parent node $z_{t}=\sigma\left(W_{z}x_{t}+W_{z}^{\prime}h_{t-1}+b_{z}\right)$ has the highest score among all potential parent nodes, so we again take two nodes $x_{t}$ and $h_{t-1}$ from the node set and merge them into $z_{t}$. After $9$ iterations, we end up with one node $h_{t}$, which is exactly the output of the $t$-th hidden cell of GRU. We can prove that, with an appropriate choice of the scoring function, a RecNN can find the tree in Figure 1 and thus can produce GRU. The statement is given in Section 4.1 and the proof is in the Appendix.

Next, we present the RRNN model. The RRNN model has the same recurrent structure as RNN and an algorithm for building the hidden cell. We start with the case where only one hidden state needs to be transferred between two consecutive hidden cells (GRU falls in this case, but in LSTM we have two states, i.e. the hidden state $h_{t}$ and the memory state). We extend the model to be compatible with multiple hidden states in Appendix B.

For each hidden cell of RRNN, we build a binary tree from a multiset of initial nodes with corresponding vectors on each node. A set of trainable parameter matrices and bias vectors, activation functions, binary operations, and a scoring function is given prior to the construction of the tree. We denote by $\mathcal{L}=\{L_{1},\ldots,L_{n_{l}}\}$ and $\mathcal{R}=\{R_{1},\ldots,R_{n_{l}}\}$, with $L_{i},R_{i}\in\mathbb{R}^{p\times p}$, the sets of trainable weight matrices, by $\mathcal{B}=\{b_{1},\ldots,b_{n_{l}}\}$, with $b_{i}\in\mathbb{R}^{p}$, the set of trainable bias vectors, and by $\mathcal{U}=\{u_{1},u_{2},\ldots,u_{n_{u}}\}$ and $\mathcal{O}=\{o_{1},o_{2},\ldots,o_{n_{o}}\}$ the sets of available activation (unary) functions and binary operations, respectively. In addition, a scoring function $\alpha(\cdot;\Theta)$, depending on a set of trainable parameters $\Theta$, is given. A multiset of initial nodes $\mathcal{N}_{0}=\{c_{1},c_{2},\ldots,c_{N}\}$ with $c_{i}\in\mathbb{R}^{p}$ is also given as input to the hidden cell. We do not distinguish between a node and its corresponding vector in the following discussion; however, we note that nodes in the tree are unique while the corresponding vectors form a multiset. The tree shown in Figure 1 is the equivalent RRNN representation of the GRU cell.

Formally, the RRNN hidden cell can be understood as a function $f:\mathcal{N}_{0}\rightarrow(\mathcal{T},\mathbb{R}^{p})$, where $\mathcal{T}$ is the set of all possible (binary) computational trees such that each node of the tree corresponds to a vector and a 3-tuple $(u,o,b)$ with $u\in\mathcal{U}$, $o\in\mathcal{O}$, $b\in\mathcal{B}$, and each directed edge from a child node to its parent node is associated with a weight matrix.

Function $f$ can be recursively defined as

$$f=f_{N}\circ f_{N-1}\circ\dots\circ f_{2}\circ f_{1},$$

where $f_{k}:\mathcal{N}_{k-1}\mapsto\mathcal{N}_{k}$ for $k=1,2,\ldots,N-1$ and $f_{N}:\mathcal{N}_{N-1}\rightarrow(\mathcal{T},\mathbb{R}^{p})$. For $k=1,\ldots,N-1$, function $f_{k}$ maps multiset $\mathcal{N}_{k-1}$ to multiset $\mathcal{N}_{k}$ in the following three steps: (i) $C_{k}=\left\{c:c=u(o(Lc_{i},Rc_{j})+b),c_{i},c_{j}\in\mathcal{N}_{k-1},i<j,L\in\mathcal{L},R\in\mathcal{R},b\in\mathcal{B},u\in\mathcal{U},o\in\mathcal{O}\right\}$, (ii) $c_{k}^{*}=\arg\max\limits_{c}\left\{\alpha(c;\Theta):c\in C_{k}\right\}$, (iii) $\mathcal{N}_{k}=\{c_{k}^{*}\}\cup\mathcal{N}_{k-1}\setminus\{c_{i}^{*},c_{j}^{*}\}$, where $c_{i}^{*},c_{j}^{*}$ are the two child nodes combined to obtain $c_{k}^{*}$.

Note that $\mathcal{N}_{N-1}=f_{N-1}\circ\cdots\circ f_{1}(\mathcal{N}_{0})$ contains only one node, i.e. $\mathcal{N}_{N-1}=\left\{c_{N-1}^{*}\right\}$. Function $f_{N}$ then takes $c_{N-1}^{*}$ and returns the tree rooted at $c_{N-1}^{*}$ (we can recover it by unfolding the merge decisions and tracing each parent node down to its child nodes until all initial nodes appear) together with the corresponding vector $c_{N-1}^{*}\in\mathbb{R}^{p}$ as the output. We point out that by definition the produced binary tree is full, i.e. each node has exactly 2 or 0 child nodes.

We next specify the recursive relationship of our cells. To this end, let multiset $\mathcal{N}_{0}^{t}=\mathcal{N}_{0}^{t}(x_{t},h_{t-1})$ consist of several copies of $x_{t}$, several copies of $h_{t-1}$, and other constant vectors such as the vector of all zeros, the vector of all ones, or unit vectors. The number of copies of each can vary with $t$. The transition equations and cell output are as follows:

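$$(T_{t}^{\mathrm{pred}},h_{t})=f(\mathcal{N}_{0}^{t}(x_{t},h_{t-1})),\qquad q_{t}=g(x_{t},h_{t};\Gamma).$$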

It remains to specify the loss function. The generic function is as follows, with further details provided in Section 3.3. We assume that a sample consists of $(X,Y)$, where $Y=(y_{1},\ldots,y_{T})$ is a sequence of ground truth labels. We also assume that we are given a ground truth binary tree $T_{t}^{\mathrm{target}}$, which is specified as in Figure 1 but without the trainable matrices and bias vectors. The target tree usually does not depend on $t$. Ideally this target tree should not need to be specified, but we leave this as future research work.

One further complication is that the ground truth tree does not have a unique representation. Indeed, since the leaf nodes corresponding to $\mathcal{N}_{0}$ are unordered, there are several isomorphisms of a given tree that yield the same underlying ground truth transition function, i.e. mathematically equivalent expressions. To this end, let $\mathrm{Iso}(T_{t}^{\mathrm{target}})$ be the set of all trees isomorphic to $T_{t}^{\mathrm{target}}$. Note that we do not need to consider the isomorphisms when leaf nodes are ordered, as is the case in [24].

The set of all trainable parameters in RRNN is denoted by $\Phi=\{\mathcal{L},\mathcal{R},\mathcal{B},\Theta,\Gamma\}$. The loss function is specified by

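$$L(\Phi)=\mathbb{E}_{(X,Y)}\Bigg[\sum_{t=1}^{T}\bigg\{\lambda_{1}l(y_{t},q_{t})\;\cdots$$

$$\cdots\;+\lambda_{4}\sum_{\phi\in\Phi}\|\phi\|^{2},$$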

where function $l$ is the standard loss function, $\mathrm{TD}$ measures the difference between two trees, and $m$ is the margin function. The latter two are described in detail in Section 3.3. The minimum operation over isomorphic target trees can also be replaced by an expectation.

3.2 Cell Tree Construction

Several changes to the cell tree construction are made for practical reasons. Functions $f_{k}$, $k=1,\ldots,N-1$, can be regarded as $N-1$ iterations of merging two nodes (vectors). Multisets $\mathcal{N}_{k}$ can have multiple copies of a vector, but in practice we keep a single copy that is reused. The new set $\mathcal{N}_{k-1}$ consists of three fixed sets of vectors, namely $\mathcal{S}_{t}^{\text{data}}=\{x_{t}\}$, the set of vectors from the data samples, $\mathcal{S}_{t}^{\text{prev}}$, the set of vectors from the previous hidden cell, and $\mathcal{S}_{t}^{\text{aux}}$, the set of auxiliary vectors such as the zero vector, together with the set $\mathcal{P}_{k-1}$ of generated parent nodes. The model takes the new set $\mathcal{N}_{k-1}=\mathcal{S}_{t}^{\text{data}}\cup\mathcal{S}_{t}^{\text{prev}}\cup\mathcal{S}_{t}^{\text{aux}}\cup\mathcal{P}_{k-1}$ as the set of all potential choices of child nodes to build the $k$-th parent node $c_{k}^{*}$. Then we set $\mathcal{P}_{k}=\mathcal{P}_{k-1}\cup\{c_{k}^{*}\}$ and proceed to the $(k+1)$-th iteration. We further need a hyper-parameter $\bar{N}$ corresponding to the number of iterations of the tree construction steps. The practical algorithm for constructing the computational tree for the $t$-th hidden cell of RRNN is given in Algorithm 1. It is worth mentioning that the number of iterations $\bar{N}$ in Algorithm 1 might differ from the number of nodes $N$ in the predicted tree. A vector might be chosen several times to serve as a child node in Algorithm 1. In this case, the number of nodes $N$ in the predicted tree is larger than the number of iterations $\bar{N}$.
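The practical construction described above can be sketched as follows. This is not a reproduction of Algorithm 1: the function name `build_cell`, the use of unordered pairs of distinct pool entries, the toy scoring function, and the choice to return the last generated parent as the cell output are all assumptions made for illustration.

```python
import itertools
import numpy as np

def build_cell(x_t, h_prev, aux, params, unary, binary, score, n_iter):
    """Run n_iter greedy merge iterations; return the last parent vector and a merge log."""
    # Fixed sets: data vector, previous hidden state, auxiliary vectors (single copies, reusable).
    fixed = [("x", x_t), ("h", h_prev)] + [(f"aux{i}", a) for i, a in enumerate(aux)]
    parents = []  # generated parent nodes, kept in the pool for later iterations
    log = []
    for k in range(n_iter):
        pool = fixed + parents
        best = None
        # Candidate set: every pair (i < j) combined with every (L, R, b), u, and o.
        for (ni, ci), (nj, cj) in itertools.combinations(pool, 2):
            for (pi, (L, R, b)), (un, u), (on, o) in itertools.product(
                    enumerate(params), unary.items(), binary.items()):
                c = u(o(L @ ci, R @ cj) + b)
                s = score(c)
                if best is None or s > best[0]:
                    best = (s, c, (ni, nj, pi, un, on))
        _, c, choice = best
        parents.append((f"p{k}", c))  # children stay in the pool; only the parent is added
        log.append(choice)
    return parents[-1][1], log

# Hypothetical usage: two parameter triples, two activations, two binary operations.
rng = np.random.default_rng(0)
p = 3
params = [(rng.standard_normal((p, p)), rng.standard_normal((p, p)), np.zeros(p))
          for _ in range(2)]
unary = {"tanh": np.tanh, "id": lambda v: v}
binary = {"add": np.add, "mul": np.multiply}
h_t, log = build_cell(rng.standard_normal(p), np.zeros(p), [np.ones(p)],
                      params, unary, binary,
                      score=lambda c: float(c.sum()), n_iter=3)
```

Since children are never removed from the pool, the same vector can appear under several parents, which is why the unfolded tree can contain more nodes than the number of iterations.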

3.3 Loss Function

We discuss the definition of the tree distance (TD) and the scoring margin $m$ in this section.

Score Margin

To give the scoring function more partitioning power, we incentivize it to leave a significant margin between the scores of the highest-scoring and second-highest-scoring vectors for each node. Recall the definitions of $C_{k}$ and $c_{k}^{*}$ from Section 3.1. We further define $c_{k}^{**}$ to be the vector with the second-highest score among the vectors in $C_{k}$. In Algorithm 1, the analogue of $C_{k}$ is $V_{t}^{k}$. The scoring margin function is then defined as $m(\mathcal{N}_{k})=-\frac{1}{M}\min\{M,\alpha(c_{k}^{*};\Theta)-\alpha(c_{k}^{**};\Theta)\}$, where $M$ is a hyper-parameter. Intuitively, the margin function incentivizes the scoring function to increase the gap between the scores of the highest and second-highest vectors to at least $M$. We divide by $M$ so that the overall scale of this loss term is not affected by the choice of $M$.
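As a small worked example of the margin formula above, the following sketch computes $m$ from a list of candidate scores for one merge step. The function name `score_margin` and the plain-list input are illustrative assumptions.

```python
def score_margin(scores, M):
    """m = -(1/M) * min(M, top1 - top2) for the candidate scores of one merge step."""
    if len(scores) < 2:
        raise ValueError("need at least two candidate scores")
    top1, top2 = sorted(scores, reverse=True)[:2]
    return -min(M, top1 - top2) / M

wide_gap = score_margin([3.0, 1.0, 0.5], M=1.0)   # gap 2.0 is capped at M, giving -1.0
narrow_gap = score_margin([3.0, 2.5, 0.5], M=1.0)  # gap 0.5 is below M, giving -0.5
```

The term ranges from $-1$ (gap of at least $M$) to $0$ (tied top scores), so minimizing it pushes the gap up only until it reaches $M$.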

Tree Distance

For convenience, in this part we use $T^{\mathrm{pd}}$ and $T^{\mathrm{tgt}}$ to denote the predicted tree and the target (ground truth) tree, respectively. For any binary tree $\bar{T}$, we use $\mathrm{Int}(\bar{T})$ to denote the set of all internal (non-leaf) nodes of $\bar{T}$. We use $\mathcal{I}(\bar{T})$ to denote the labeling of $\mathrm{Int}(\bar{T})$ such that the root node of $\bar{T}$ has index 1, and if a node has index $i$, then its left and right child nodes have indices $2i$ and $2i+1$, respectively. For a node $n\in\mathrm{Int}(\bar{T})$, we use $\mathrm{Subtree}(\bar{T},n)$ to denote the subtree of $\bar{T}$ rooted at node $n$. In addition, we use $n_{i,\bar{T}}$ and $v_{i,\bar{T}}$ to denote the node and the corresponding vector with index $i$ in tree $\bar{T}$, respectively.

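Given two binary trees $T_{1}$ and $T_{2}$, we define

$$\mathrm{VD}(T_{1},T_{2})=\sum_{i\in\mathcal{I}(T_{1})\cap\mathcal{I}(T_{2})}\left\|v_{i,T_{1}}-v_{i,T_{2}}\right\|^{2}+\sum_{i\in\mathcal{I}(T_{1})\setminus\mathcal{I}(T_{2})}\left\|v_{i,T_{1}}\right\|^{2}+\sum_{i\in\mathcal{I}(T_{2})\setminus\mathcal{I}(T_{1})}\left\|v_{i,T_{2}}\right\|^{2}$$

to be the vector difference (VD) of these two trees. The tree distance between $T^{\mathrm{pd}}$ and $T^{\mathrm{tgt}}$ is the sum, over all subtrees of $T^{\mathrm{pd}}$, of the minimum VD value between that subtree and all subtrees of $T^{\mathrm{tgt}}$:

$$\mathrm{TD}(T^{\mathrm{pd}},T^{\mathrm{tgt}})=\sum_{n_{1}\in V(T^{\mathrm{pd}})}\min_{n_{2}\in V(T^{\mathrm{tgt}})}\Big\{\mathrm{VD}\big(\mathrm{Subtree}(T^{\mathrm{pd}},n_{1}),\mathrm{Subtree}(T^{\mathrm{tgt}},n_{2})\big)\Big\}.$$

This expression matches each subtree in $T^{\mathrm{pd}}$ with the closest subtree in $T^{\mathrm{tgt}}$ with respect to VD, and therefore the TD measures the difference of vectors on all of the nodes of the two trees.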
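The heap-style indexing (root 1, children $2i$ and $2i+1$) makes the two distances straightforward to compute. The sketch below is an illustration under stated assumptions: a tree is represented as a dictionary from internal-node index to vector, subtrees are re-indexed so that their root has index 1, and the sums run over internal nodes only. The function names `subtree`, `vd`, and `td` are hypothetical.

```python
import numpy as np

def subtree(tree, root):
    """Re-index the subtree rooted at heap index `root` so that its root gets index 1."""
    out = {}
    stack = [(root, 1)]
    while stack:
        old, new = stack.pop()
        if old in tree:
            out[new] = tree[old]
            stack.append((2 * old, 2 * new))
            stack.append((2 * old + 1, 2 * new + 1))
    return out

def vd(t1, t2):
    """Vector difference: squared distances on shared indices, squared norms elsewhere."""
    total = 0.0
    for i in t1.keys() | t2.keys():
        if i in t1 and i in t2:
            total += float(np.sum((t1[i] - t2[i]) ** 2))
        else:
            v = t1[i] if i in t1 else t2[i]
            total += float(np.sum(v ** 2))
    return total

def td(pred, target):
    """Tree distance: each predicted subtree is matched to its closest target subtree."""
    return sum(min(vd(subtree(pred, n1), subtree(target, n2)) for n2 in target)
               for n1 in pred)

# Toy trees: `a` has a root and a left internal child, `b` has only a root.
a = {1: np.array([1.0, 0.0]), 2: np.array([0.0, 1.0])}
b = {1: np.array([1.0, 0.0])}
```

In this toy case `vd(a, b)` is 1.0 (the unmatched node contributes its squared norm), `td(a, a)` is 0.0, and `td(a, b)` differs from `td(b, a)`, which reflects that the definition is not symmetric in its two arguments.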

4 Properties of RRNN and Gradient Control

In this section, we state some properties of the RRNN model and show how to avoid gradient exploding and vanishing during training of RRNN. We give theorems in this section and defer the proofs to the appendix.

4.1 Expressibility of RRNN

We argue that if we carefully choose sets $\calL,\calR,\calB,\calU,\calO,\mathcal{S}_{t}^{data},\mathcal{S}_{t}^{prev},\mathcal{S}_{t}^{aux}$ , the quantity $N$ , and the scoring function $\alpha$ , then Algorithm 1 can replicate the GRU and LSTM equations. We give the formal statements in this section and defer the choice of the sets and the proof to Appendix C.

Theorem 1.

There exists a scoring function $\alpha$ such that Algorithm 1 generates GRU equations (1) – (4) with an appropriate choice of $\calL,\calR,\calB,\calU,\calO,\mathcal{S}_{t}^{data},\mathcal{S}_{t}^{prev}$ , and $\mathcal{S}_{t}^{aux}$ .

Theorem 2.

There exists a scoring function $\alpha$ such that Algorithm 1 (applied twice) generates the LSTM equations with an appropriate choice of $\calL,\calR,\calB,\calU,\calO,\mathcal{S}_{t}^{data},\mathcal{S}_{t}^{prev}$ , and $\mathcal{S}_{t}^{aux}$ .

4.2 Controlling Gradient

As introduced in [2], the exploding gradient problem refers to a large increase in the norm of the gradient during training. This is due to the fact that the gradient of long-term dependencies grows exponentially faster than that of short-term dependencies. The vanishing gradient problem, on the other hand, refers to the gradients of long-term dependencies going to zero exponentially fast. [20] introduce a sufficient condition for vanishing gradients and a necessary condition for exploding gradients in a simple RNN. In this section, we extend their results to a more general case: we provide these two conditions for our RRNN model. We note that, as a special case, our result applies to RecNN, where such conditions have not yet been established.

We consider the case where only one hidden state $h_{t}$ is returned by the $t$ -th hidden cell of the RRNN model. The loss function (6) can be written as $L(\Phi)=\sum_{t=1}^{T}\mathcal{E}_{t}$ where each $\calE_{t}$ is a function of all parameters in $\Phi$ . For $1\leq t\leq T$ , the gradient of $\calE_{t}$ with respect to $\phi\in\Phi$ comes from $t$ cells, namely $\frac{\partial\calE_{t}}{\partial\phi}=\sum_{t^{\prime}=1}^{t}\frac{\partial\calE_{t}}{\partial h_{t}}\frac{\partial h_{t}}{\partial h_{t^{\prime}}}\frac{\partial^{+}h_{t^{\prime}}}{\partial\phi}$ , where $\frac{\partial^{+}h_{t^{\prime}}}{\partial\phi}$ refers to the direct gradient of $h_{t^{\prime}}$ with respect to $\phi$ as it appears within the $t^{\prime}$ -th hidden cell. If $\phi$ is a matrix, then we mean $\frac{\partial^{+}h_{t^{\prime}}}{\partial\phi}=\frac{\partial^{+}h_{t^{\prime}}}{\partial\mathrm{vec}(\phi)}$ , where $\mathrm{vec}(\phi)$ is an appropriate matrix vectorization. The exploding (vanishing) gradient problem is defined by $\left\|\frac{\partial\calE_{t}}{\partial h_{t}}\frac{\partial h_{t}}{\partial h_{t^{\prime}}}\frac{\partial^{+}h_{t^{\prime}}}{\partial\phi}\right\|$ going to $+\infty$ ( $0$ ) exponentially fast as $t$ goes to $+\infty$ while $t^{\prime}$ is held fixed. For simplicity, we consider the case where $t=T$ and $t^{\prime}=1$ .
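To make the role of the factor $\frac{\partial h_{T}}{\partial h_{1}}$ concrete, the following sketch tracks it for a scalar recurrence $h_{t}=\tanh(wh_{t-1}+x_{t})$ . This toy cell is an assumption chosen for illustration and is much simpler than an RRNN cell; the function name `state_jacobian` is hypothetical.

```python
import math


def state_jacobian(w, xs, h0=0.0):
    """Return d h_T / d h_1 for the scalar RNN h_t = tanh(w * h_{t-1} + x_t)."""
    h = math.tanh(w * h0 + xs[0])  # h_1
    jac = 1.0
    for x in xs[1:]:
        h = math.tanh(w * h + x)
        jac *= w * (1.0 - h * h)  # chain rule factor d h_t / d h_{t-1}
    return jac


w = 0.5
xs = [0.1 * ((-1) ** t) for t in range(50)]
short_run = abs(state_jacobian(w, xs[:5]))  # T = 5
long_run = abs(state_jacobian(w, xs))       # T = 50
```

Since $|\tanh^{\prime}|\leq 1$ , every factor has magnitude at most $|w|$ , so the product is bounded by $|w|^{T-1}$ and shrinks exponentially in $T$ whenever $|w|<1$ .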

We state a simplified version of the theorems here and defer the full version to Appendix D. We argue that most of the time these conditions are met in practice and we elaborate on them one by one in Appendix D.2.

Theorem 3 (Sufficient condition of gradient vanishing).

Under certain conditions given in Theorem 7, we have $\left\|\frac{\partial\calE_{T}}{\partial h_{T}}\frac{\partial h_{T}}{\partial h_{1}}\frac{\partial^{+}h_{1}}{\partial\phi}\right\|\rightarrow 0$ as $T\rightarrow+\infty$ , i.e., the vanishing gradient problem occurs.

Theorem 4 (Necessary condition of gradient exploding).

If we observe the exploding gradient problem, then at least one of the conditions listed in Theorem 8 holds.

5 Experimental Results

In this section, we present numerical results by comparing our algorithm with a GRU baseline model. The experiments are conducted on three datasets, and the source code is available at http://after_accepted.

We test two versions of the RRNN algorithm. The first is the full algorithm presented in Algorithm 1. The second, which we call RRNN-GRU, is a simplified version of the RRNN model in which we restrict the tree structure to be exactly that of GRU. This model has a limited tree search space, and its only dynamic component is the choice of the tuple $(L_{i},R_{i},b_{i})$ to use on each pair of parent-child nodes, so the positioning of weights in the cell is flexible. Therefore, RRNN-GRU is still time-variant and data-dependent. In addition, we alternate between training the $L,R,b$ parameters and training the scoring neural network $\alpha$ , which is a 2-layer fully connected neural network, while continuously training the output layer. The frequency (in epochs) at which we switch training phases is a hyperparameter of the RRNN-GRU model. Due to the model architecture, training can sometimes be unstable because of exploding gradients, which we clip. The baseline model is a single-layer GRU with 100-dimensional hidden states.
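The alternating schedule and the clipping step can be sketched as follows. This is an illustration under stated assumptions rather than our training code: the phase names, the choice to start with the cell weights, and clipping by the global norm of a flat gradient list are all assumptions.

```python
import math


def training_phase(epoch, switch_every):
    """Return which parameter group is trained in a given 0-based epoch."""
    if switch_every <= 0:
        raise ValueError("switch_every must be positive")
    block = epoch // switch_every
    # Assumption: training starts with the L, R, b weights.
    return "cell_weights" if block % 2 == 0 else "scoring_network"


def clip_by_global_norm(grads, max_norm):
    """Rescale a flat list of gradients so its Euclidean norm is <= max_norm."""
    total = math.sqrt(sum(g * g for g in grads))
    if total <= max_norm:
        return list(grads)
    scale = max_norm / total
    return [g * scale for g in grads]


schedule = [training_phase(e, switch_every=5) for e in range(12)]
clipped = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
```

In either phase the output layer would be updated as well, since it is trained continuously.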

For both 100-dimensional character and word embeddings, we use the pre-trained embedding vectors from GloVe (https://nlp.stanford.edu/projects/glove/). The Adam optimizer is used for all experiments and random initial weights are selected. A random search on hyperparameters is used for all RRNN-GRU models and GRU models. We train the model parameters on the training set and select the optimal parameters and hyperparameters based on the performance measure on the validation set. Then we use this set of hyperparameters and the optimized model parameters to predict on the test set. We test the RRNN model only on the Wikipedia dataset. We report the performance on both validation and test sets for all datasets in Table 1 and list the optimal hyperparameters in Appendix E.2. Further details about the implementations are given in Appendix E.1.

5.1 Datasets and Settings

The Wikipedia task is to predict the next character on text drawn from the Hutter prize Wikipedia dataset (https://cs.fit.edu/~mmahoney/compression/textdata.html) [13]. We remove all numbers, punctuation, XML tags, and markup characters so that 26 English characters and space are left in the raw text. Performance is measured using BPC (the smaller the better). For RRNN-GRU, we randomly select 10,000 20-character sequences for the training set, along with 1,000 sequences for validation and 2,000 for testing, such that no sequences overlap. For RRNN, the training set has 5,000 sequences while the validation and test sets remain of the same size.
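A possible version of this preprocessing is sketched below. The lowercasing, the collapsing of repeated spaces, and the use of chunk-aligned start positions to guarantee disjoint sequences are assumptions made for the sketch; the helper names are hypothetical.

```python
import random
import re


def clean_text(raw):
    """Keep only the 26 lowercase letters and single spaces."""
    text = re.sub(r"<[^>]*>", " ", raw)           # drop XML tags
    text = re.sub(r"[^a-z ]", " ", text.lower())  # drop digits, punctuation, markup
    return re.sub(r" +", " ", text).strip()


def sample_sequences(text, seq_len, sizes, seed=0):
    """Draw disjoint fixed-length chunks and split them into len(sizes) sets."""
    starts = list(range(0, len(text) - seq_len + 1, seq_len))
    if sum(sizes) > len(starts):
        raise ValueError("not enough text for the requested splits")
    random.Random(seed).shuffle(starts)
    splits, pos = [], 0
    for n in sizes:
        splits.append([text[s:s + seq_len] for s in starts[pos:pos + n]])
        pos += n
    return splits


demo = clean_text("<doc>Hello, World 42!</doc>")
```

With the sizes used for RRNN-GRU, the call would look like `sample_sequences(clean_text(raw), 20, (10000, 1000, 2000))`.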

The Stanford Sentiment Treebank (SST) dataset (https://nlp.stanford.edu/sentiment/treebank.html) [25] is a sentiment analysis task involving classifying one-sentence movie reviews as positive, negative, or neutral. We obtain the dataset from the torchtext package and use the full 8,544-sample training set, along with 1,000 randomly chosen samples for validation and 2,000 for testing. Since the training data has variable length, we prepend zeros to each sample so that all samples have the same length. The performance is measured by the accuracy of the predicted sentiments (the higher the better).
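The padding step can be sketched as below. For simplicity the sketch pads lists of integer token ids; padding at the level of embedding vectors would work the same way. The function name `left_pad` is hypothetical.

```python
def left_pad(sequences, pad_value=0):
    """Prepend pad_value so every sequence matches the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [[pad_value] * (max_len - len(seq)) + list(seq) for seq in sequences]


batch = left_pad([[5, 2, 9], [7], [3, 4]])
```

Padding on the left keeps the actual tokens adjacent to the final time step, whose hidden state feeds the classifier.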

We also perform word-level language modeling using the Penn Treebank (PTB) dataset (https://catalog.ldc.upenn.edu/LDC99T42) [18], a corpus containing articles from the Wall Street Journal. We obtain this dataset from the torchtext package and randomly select a 10,000 sample subset of 20 words each, along with 1,000 samples for validation and 2,000 for testing. We predict over all 10,001 unique words in our subset without eliminating uncommon words. The performance is measured in perplexity (the smaller the better).
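For reference, the two language-modeling metrics can be computed from the probabilities a model assigns to the correct next symbols. The sketch uses the standard definitions (BPC as the mean negative log-likelihood in bits, perplexity as its exponential in nats), which we assume here; the function names are illustrative.

```python
import math


def mean_nll(probs_of_targets):
    """Average negative log-likelihood (in nats) of the correct next symbols."""
    return -sum(math.log(p) for p in probs_of_targets) / len(probs_of_targets)


def bits_per_character(probs_of_targets):
    """Mean negative log-likelihood converted from nats to bits."""
    return mean_nll(probs_of_targets) / math.log(2)


def perplexity(probs_of_targets):
    """Exponential of the mean negative log-likelihood."""
    return math.exp(mean_nll(probs_of_targets))


# A model that guesses uniformly over 27 symbols (26 letters and space).
uniform = [1 / 27] * 20
uniform_bpc = bits_per_character(uniform)
uniform_ppl = perplexity(uniform)
```

Under these definitions a uniform guess over the 27 characters gives a BPC of $\log_{2}27\approx 4.75$ , which puts the values in Table 1 into perspective.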

5.2 Discussion

Table 1 shows that RRNN-GRU outperforms GRU by 5.8% on the Wikipedia dataset, while RRNN improves on GRU by 5.5% on the validation set and 1.9% on the test set with a simple set of hyperparameters. On SST, RRNN-GRU also beats GRU by 0.8% and 5.2% on the validation and test sets, respectively. These experiments suggest that data-dependent structures help improve the predictive power of the model. Meanwhile, RRNN-GRU matches GRU on the PTB test set (perplexity 239) but trails it on the validation set (281 versus 247). We expect that its performance on PTB could be improved by a more dedicated hyperparameter search.

One interesting observation of the full RRNN model is the evolution of the predicted tree structures. Figure 3 of Appendix A shows the common tree structures we find at the beginning epochs while Figure 4 of Appendix A shows the common tree structures at later epochs (near the point where optimal performance is achieved on the validation set). The tree structures tend to be balanced in the beginning epochs since the structure of the ground-truth tree plays a significant role. In later epochs, the output layer dominates the predicting ability and therefore the model tends to feed simple $h_{t}$ to the output layer.

Another interesting observation lies in the dynamics of the RRNN-GRU model. Let $\calI_{e,i,t,j}$ denote the index of the parameter tuple $(L,R,b)$ that the $j$ -th internal node of the $i$ -th sample uses at the $t$ -th time step in the $e$ -th epoch, and set $N_{e}\triangleq\sum_{i,t,j}\mathbbm{1}\left\{\calI_{e,i,t,j}\neq\calI_{e-1,i,t,j}\right\}$ to measure the number of changes in the choice of parameter tuples between the $(e-1)$ -th epoch and the $e$ -th epoch. We expect $N_{e}$ to decrease as $e$ increases, since the model should become more stable as training goes on and the choice of parameter tuple indices should stabilize with it. Figure 2 in Appendix A plots $N_{e}$ against epochs and supports this hypothesis.
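The quantity $N_{e}$ is a simple count, as the sketch below illustrates. It assumes the chosen tuple indices of an epoch are stored in a dictionary keyed by `(i, t, j)`; this storage format and the name `count_changes` are assumptions.

```python
def count_changes(prev_choice, curr_choice):
    """N_e: number of (sample, step, node) slots whose tuple index changed."""
    if prev_choice.keys() != curr_choice.keys():
        raise ValueError("both epochs must cover the same (i, t, j) slots")
    return sum(prev_choice[k] != curr_choice[k] for k in prev_choice)


# Keys are (sample i, time step t, internal node j); values are tuple indices.
epoch0 = {(0, 0, 0): 1, (0, 0, 1): 3, (0, 1, 0): 2}
epoch1 = {(0, 0, 0): 1, (0, 0, 1): 4, (0, 1, 0): 0}
n_1 = count_changes(epoch0, epoch1)
```

Here two of the three slots switch to a different tuple, so `n_1` equals 2.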

Acknowledgments

The authors would like to acknowledge and thank Intel for providing access to Intel’s Computing environment.

Appendix A Figures

Appendix B Extensions of RRNN model

The exposition so far handles the case where only one state vector transfers between hidden cells of the RRNN model, and it can capture the structure of GRU. However, LSTM, for example, has two state vectors $h_{t}$ and $c_{t}$ to transfer between cells. In this section we extend the RRNN model to be compatible with transferring multiple state vectors. Suppose that a total of $M$ vectors, $h_{t-1,1},\ldots,h_{t-1,M}$ , are the output of the $(t-1)$ -th hidden cell. The transition equations and cell output are thereby

[TABLE]

where each $f^{i}\triangleq f_{N}^{i}\circ\cdots\circ f_{1}^{i}$ has the same definition as the function $f$ defined in (5), and $\calN_{0,i}^{t}$ is the multiset consisting of multiple copies of $x_{t}$ , $h_{t-1,j},j=1,\ldots,M$ , $h_{t,j},j=1,\ldots,i-1$ , and possibly other constant vectors. In practice, we use Algorithm 1 $M$ times to build functions $f^{i},i=1,2,\ldots,M$ .

The loss function is redefined as

[TABLE]

where $\calN_{k,i}^{t}=f_{k}^{i}\circ\cdots\circ f_{1}^{i}(\calN_{0,i}^{t}),k=1,2,\ldots,N-1,i=1,2,\ldots,M$ , and $T_{t,i}^{\target}$ is the ground truth binary tree.

As an example, we show how to transfer two state vectors $c_{t}$ and $h_{t}$ between hidden cells of RRNN and mimic the structure of LSTM. To adhere to the notation of prior works, we use $c_{t}$ and $h_{t}$ to replace $h_{t,1}$ and $h_{t,2}$ in the above general definition. The transition equations and cell output are therefore

[TABLE]

where $\calN_{0,1}^{t}(x_{t},c_{t-1},h_{t-1})$ consists of several copies of $x_{t},h_{t-1},c_{t-1}$ and possibly other constant vectors, and $\calN_{0,2}^{t}(x_{t},c_{t-1},h_{t-1},c_{t})$ consists of several copies of $x_{t},c_{t-1},h_{t-1},c_{t}$ and possibly other constant vectors.

Appendix C Expressibility of RRNN

We extend the content of Section 4.1 here. We first show that for a given set of vectors, there always exists a scoring function that can rank the scores of these vectors in any order we want. Formally, we have the following lemma.

Lemma 1.

Given $n$ vectors $v_{1},\ldots,v_{n}\in\mathbb{R}^{p}$ , there exists a function $\alpha$ with a set of parameters $\Theta$ such that $\alpha(v_{1};\Theta)>\cdots>\alpha(v_{n};\Theta)$ .

We next show that if we carefully choose sets $\calL,\calR,\calB,\calU,\calO,\mathcal{S}_{t}^{data},\mathcal{S}_{t}^{prev},\mathcal{S}_{t}^{aux}$ , the quantity $N$ , and the scoring function $\alpha$ , then Algorithm 1 can replicate the GRU and LSTM equations.

To replicate GRU, we should have $N=8$ , $\mathcal{S}_{t}^{data}=\{x_{t}\}$ , $\mathcal{S}_{t}^{prev}=\{h_{t-1}\}$ , $\mathcal{S}_{t}^{aux}=\{\bf{0}\}$ , where $\bf{0}$ is the zero vector. We further set

• $\calL=\{L_{1},L_{2},L_{3},L_{4}\}$ , where $L_{1}=W_{r},L_{2}=W_{z},L_{3}=W_{h}$ , and $L_{4}=I$ ,

• $\calR=\{R_{1},R_{2},R_{3},R_{4}\}$ , where $R_{1}=W_{r}^{\prime},R_{2}=W_{z}^{\prime},R_{3}=W_{h}^{\prime}$ , and $R_{4}=I$ ,

• $\calB=\{b_{1},b_{2},b_{3},b_{4}\}$ , where $b_{1}=b_{r},b_{2}=b_{z},b_{3}=b_{h}$ , and $b_{4}=\bf{0}$ ,

• $\mathcal{U}=\left\{\sigma(\cdot),\tanh(\cdot),\mathbf{1}-\cdot,\mathrm{id}(\cdot)\right\}$ , where $\mathbf{1}$ stands for the all-ones vector and $\mathrm{id}$ stands for the identity mapping,

• $\mathcal{O}=\left\{+,\odot\right\}$ , where $\odot$ is the entry-wise multiplication.

Theorem 1 therefore becomes the following:

Theorem 5.

There exists a scoring function $\alpha$ such that Algorithm 1 generates GRU equations (1) – (4) for the choice of $\calL,\calR,\calB,\calU,\calO,\mathcal{S}_{t}^{data},\mathcal{S}_{t}^{prev}$ , and $\mathcal{S}_{t}^{aux}$ specified above.

For LSTM, note that there are two state vectors $h_{t}$ and $c_{t}$ . Therefore, to replicate LSTM (see equations (7) – (11) below), we run Algorithm 1 twice. In the first run, we should have $N=7$ , $\mathcal{S}_{t}^{data}=\{x_{t}\}$ , $\mathcal{S}_{t}^{prev}=\{h_{t-1},c_{t-1}\}$ , $\mathcal{S}_{t}^{aux}=\{\bf{0}\}$ . We further set

• $\calL=\{L_{1},L_{2},L_{3},L_{4},L_{5}\}$ , where $L_{1}=W_{f},L_{2}=W_{i},L_{3}=W_{o},L_{4}=W_{c}$ , and $L_{5}=I$ ,

• $\calR=\{R_{1},R_{2},R_{3},R_{4},R_{5}\}$ , where $R_{1}=W_{f}^{\prime},R_{2}=W_{i}^{\prime},R_{3}=W_{o}^{\prime},R_{4}=W_{c}^{\prime}$ , and $R_{5}=I$ ,

• $\calB=\{b_{1},b_{2},b_{3},b_{4},b_{5}\}$ , where $b_{1}=b_{f},b_{2}=b_{i},b_{3}=b_{o},b_{4}=b_{c}$ , and $b_{5}=\bf{0}$ ,

• $\mathcal{U}=\left\{\sigma(\cdot),\tanh(\cdot),\mathrm{id}(\cdot)\right\}$ ,

• $\mathcal{O}=\left\{+,\odot\right\}$ .

In the second run, we should have $N=2$ , $\mathcal{S}_{t}^{data}=\{x_{t}\}$ , $\mathcal{S}_{t}^{prev}=\{h_{t-1},c_{t-1},c_{t}\}$ , $\mathcal{S}_{t}^{aux}=\{\bf{0}\}$ . We further set $\calL=\{L_{6}\},\calR=\{R_{6}\}$ , where $L_{6}=R_{6}=I$ , $\calB=\{b_{6}\}$ , where $b_{6}=\bf{0}$ , $\calU=\{\tanh(\cdot),\mathrm{id}(\cdot)\}$ , and $\calO=\{+,\odot\}$ . Theorem 2 therefore becomes the following:

Theorem 6.

There exists a scoring function $\alpha$ such that Algorithm 1 (applied twice) generates the following LSTM equations

[TABLE]

for the choice of $\calL,\calR,\calB,\calU,\calO,\mathcal{S}_{t}^{data},\mathcal{S}_{t}^{prev}$ , and $\mathcal{S}_{t}^{aux}$ specified above.

C.1 Proof of Lemma 1

Consider function

[TABLE]

with $\Theta=\{\sigma_{0}\}$ , where $\sigma_{0}$ is a large enough constant such that $\sigma_{0}^{2}\geq\frac{\Delta}{\log(n^{2})-\log(n^{2}-1)}$ , and $\Delta=\frac{1}{2}\max_{j\neq k}\|v_{j}-v_{k}\|^{2}$ .

For $0\leq i\leq n-1$ , since $-\frac{\|v_{i+1}-v_{k}\|^{2}}{2\sigma_{0}^{2}}\leq 0$ , we have

[TABLE]

On the other hand, for a fixed $1\leq i\leq n$ and all $1\leq k\leq n$ , we have $\exp\left(-\frac{\|v_{i}-v_{k}\|^{2}}{2\sigma_{0}^{2}}\right)\geq\exp\left(-\frac{\Delta}{\sigma_{0}^{2}}\right)\geq\exp\left(\log(n^{2}-1)-\log(n^{2})\right)=1-\frac{1}{n^{2}}\geq 1-\frac{1}{nk}$ , and therefore

[TABLE]

In conclusion, for $1\leq i\leq n-1$ , we have

[TABLE]

and thus $\alpha(v_{1};\Theta)>\cdots>\alpha(v_{n};\Theta)$ .

C.2 Proof of Theorems 5 and 6

We start with the proof of Theorem 5. In Algorithm 1, we set $N=8$ , $\mathcal{S}_{t}^{data}=\{x_{t}\}$ , $\mathcal{S}_{t}^{prev}=\{h_{t-1}\}$ , $\mathcal{S}_{t}^{aux}=\{\bf{0}\}$ , where $\bf{0}$ is the zero vector. In the following, “the algorithm” refers to Algorithm 1. The scoring function $\alpha$ has a set of parameters $\Theta$ and is capable of sorting the scores of different vectors. We show the existence of this function at the end of the proof by relying on Lemma 1.

We start with $\calP_{0}=\emptyset$ . For $k=1$ , the algorithm generates the vector set $V_{t}^{1}$ and one of its elements is $r_{t}\triangleq\sigma\left(L_{1}x_{t}+R_{1}h_{t-1}+b_{1}\right)=\sigma\left(W_{r}x_{t}+W_{r}^{\prime}h_{t-1}+b_{r}\right)$ . The scoring function $\alpha$ guarantees that $\alpha(r_{t};\Theta)>\alpha(v;\Theta),\forall v\in V_{t}^{1},v\neq r_{t}$ . Therefore, we have $c_{1}^{*}=r_{t}$ and $\calP_{1}=\{r_{t}\}$ . Similarly, the algorithm finds $z_{t}$ to be the vector with the highest score in the set $V_{t}^{2}\setminus\calP_{1}$ . We have $c_{2}^{*}=z_{t}$ and $\calP_{2}=\{r_{t},z_{t}\}$ .

For $k=3$ , the algorithm generates the vector set $V_{t}^{3}$ and one of its elements is $\widetilde{r}_{t}\triangleq\mathrm{id}[(L_{4}h_{t-1})\odot(R_{4}r_{t})+b_{4}]=r_{t}\odot h_{t-1}$ . The scoring function $\alpha$ guarantees that $\alpha(\widetilde{r}_{t};\Theta)>\alpha(v;\Theta),\forall v\in V_{t}^{3}\setminus\calP_{2},v\neq\widetilde{r}_{t}$ . Therefore, we have $c_{3}^{*}=\widetilde{r}_{t}$ and $\calP_{3}=\{r_{t},z_{t},\widetilde{r}_{t}\}$ .

For $k=4$ , the algorithm generates the vector set $V_{t}^{4}$ and one of its elements is $\mathbf{1}-z_{t}=\mathbf{1}-(L_{4}\mathbf{0}+R_{4}z_{t}+b_{4})$ . The scoring function $\alpha$ guarantees that $\alpha(\mathbf{1}-z_{t};\Theta)>\alpha(v;\Theta),\forall v\in V_{t}^{4}\setminus\calP_{3},v\neq\mathbf{1}-z_{t}$ . Therefore, we have $c_{4}^{*}=\mathbf{1}-z_{t}$ and $\calP_{4}=\{r_{t},z_{t},\widetilde{r}_{t},\mathbf{1}-z_{t}\}$ .

For $k=5$ , the algorithm generates the vector set $V_{t}^{5}$ and one of its elements is $\widetilde{h}_{t}\triangleq\tanh(L_{3}x_{t}+R_{3}\widetilde{r}_{t}+b_{3})=\tanh(W_{h}x_{t}+W_{h}^{\prime}(r_{t}\odot h_{t-1})+b_{h})$ . The scoring function $\alpha$ guarantees that $\alpha(\widetilde{h}_{t};\Theta)>\alpha(v;\Theta),\forall v\in V_{t}^{5}\setminus\calP_{4},v\neq\widetilde{h}_{t}$ . Therefore, we have $c_{5}^{*}=\widetilde{h}_{t}$ and $\calP_{5}=\{r_{t},z_{t},\widetilde{r}_{t},\mathbf{1}-z_{t},\widetilde{h}_{t}\}$ .

For $k=6$ , the algorithm generates the vector set $V_{t}^{6}$ and one of its elements is $z_{t}\odot h_{t-1}=\mathrm{id}[(L_{4}h_{t-1})\odot(R_{4}z_{t})+b_{4}]$ . The scoring function $\alpha$ guarantees that $\alpha(z_{t}\odot h_{t-1};\Theta)>\alpha(v;\Theta),\forall v\in V_{t}^{6}\setminus\calP_{5},v\neq z_{t}\odot h_{t-1}$ . Therefore, we have $c_{6}^{*}=z_{t}\odot h_{t-1}$ and $\calP_{6}=\{r_{t},z_{t},\widetilde{r}_{t},\mathbf{1}-z_{t},\widetilde{h}_{t},z_{t}\odot h_{t-1}\}$ . Similarly, the algorithm finds $(\mathbf{1}-z_{t})\odot\widetilde{h}_{t}$ to be the vector with the highest score in the set $V_{t}^{7}\setminus\calP_{6}$ . Thus we have $c_{7}^{*}=(\mathbf{1}-z_{t})\odot\widetilde{h}_{t}$ and $\calP_{7}=\{r_{t},z_{t},\widetilde{r}_{t},\mathbf{1}-z_{t},\widetilde{h}_{t},z_{t}\odot h_{t-1},(\mathbf{1}-z_{t})\odot\widetilde{h}_{t}\}$ .

Finally, for $k=8$ , the algorithm generates the vector set $V_{t}^{8}$ and one of its elements is $h_{t}\triangleq\mathrm{id}[(L_{4}(z_{t}\odot h_{t-1}))+(R_{4}((\mathbf{1}-z_{t})\odot\widetilde{h}_{t}))+b_{4}]=z_{t}\odot h_{t-1}+(\mathbf{1}-z_{t})\odot\widetilde{h}_{t}$ . The scoring function $\alpha$ guarantees that $\alpha(h_{t};\Theta)>\alpha(v;\Theta),\forall v\in V_{t}^{8}\setminus\calP_{7},v\neq h_{t}$ . Therefore, we have $c_{8}^{*}=h_{t}$ and $\calP_{8}=\{r_{t},z_{t},\widetilde{r}_{t},\mathbf{1}-z_{t},\widetilde{h}_{t},z_{t}\odot h_{t-1},(\mathbf{1}-z_{t})\odot\widetilde{h}_{t},h_{t}\}$ .

It remains to specify the scoring function $\alpha$ . Note that $\alpha$ should satisfy $\alpha(c_{i}^{*};\Theta)>\alpha(v;\Theta),\forall v\in V_{t}^{i}\setminus\calP_{i-1},v\neq c_{i}^{*}$ for $1\leq i\leq 8$ . Since each set $V_{t}^{i}$ contains a finite number of vectors, Lemma 1 guarantees that such a scoring function $\alpha$ exists. Therefore, the algorithm returns vector $h_{t}$ and the binary tree rooted at $h_{t}$ as the output, and thus it replicates the GRU equations (1) – (4). ∎
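The eight steps of the proof can be checked numerically. The sketch below builds $h_{t}$ from eight nodes of the form $u(o(Lv_{1},Rv_{2})+b)$ and compares it with the GRU update written directly. It assumes that $x_{t}$ and $h_{t-1}$ share the same dimension and uses random weights; all function names are illustrative, and the scoring function is bypassed by hard-coding the choices made in the proof.

```python
import math
import random


def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def mul(a, b):
    return [x * y for x, y in zip(a, b)]

def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-x)) for x in v]

def tanh(v):
    return [math.tanh(x) for x in v]

def one_minus(v):
    return [1.0 - x for x in v]

def identity(v):
    return list(v)

def node(u, o, L, v1, R, v2, b):
    """One internal node: u(o(L v1, R v2) + b)."""
    return u(add(o(matvec(L, v1), matvec(R, v2)), b))

def gru_by_nodes(x, h, W, Wp, bias):
    p = len(h)
    I = [[float(i == j) for j in range(p)] for i in range(p)]
    zero = [0.0] * p
    r = node(sigmoid, add, W["r"], x, Wp["r"], h, bias["r"])           # k = 1
    z = node(sigmoid, add, W["z"], x, Wp["z"], h, bias["z"])           # k = 2
    r_tilde = node(identity, mul, I, h, I, r, zero)                    # k = 3
    one_minus_z = node(one_minus, add, I, zero, I, z, zero)            # k = 4
    h_tilde = node(tanh, add, W["h"], x, Wp["h"], r_tilde, bias["h"])  # k = 5
    keep = node(identity, mul, I, h, I, z, zero)                       # k = 6
    update = node(identity, mul, I, one_minus_z, I, h_tilde, zero)     # k = 7
    return node(identity, add, I, keep, I, update, zero)               # k = 8

def gru_direct(x, h, W, Wp, bias):
    r = sigmoid(add(add(matvec(W["r"], x), matvec(Wp["r"], h)), bias["r"]))
    z = sigmoid(add(add(matvec(W["z"], x), matvec(Wp["z"], h)), bias["z"]))
    h_tilde = tanh(add(add(matvec(W["h"], x), matvec(Wp["h"], mul(r, h))), bias["h"]))
    return add(mul(z, h), mul(one_minus(z), h_tilde))


rng = random.Random(0)
p = 3
rand_vec = lambda: [rng.uniform(-1, 1) for _ in range(p)]
rand_mat = lambda: [rand_vec() for _ in range(p)]
W = {k: rand_mat() for k in "rzh"}
Wp = {k: rand_mat() for k in "rzh"}
bias = {k: rand_vec() for k in "rzh"}
x, h = rand_vec(), rand_vec()
h_nodes = gru_by_nodes(x, h, W, Wp, bias)
h_ref = gru_direct(x, h, W, Wp, bias)
```

The two results agree up to floating-point error, which mirrors the statement that a suitable sequence of node choices reproduces the GRU cell.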

The proof of Theorem 6 is almost the same as the proof above. The only difference is that in the first run of Algorithm 1, we generate equations (7) – (10), while in the second run of Algorithm 1, we generate equation (11).

Appendix D Gradient Control

In this section, we first give the full statements of Theorems 3 and 4. Then we give proofs for these two theorems and discuss them.

Theorem 7 (Sufficient condition of gradient vanishing).

Let $\calV_{t}$ be the set of vectors on the nodes of predicted tree $T_{t}^{\pred}$ . Assume that there exist constants $C_{1},C_{2},C_{3},C_{4}$ such that for all $1\leq t\leq T$ ,

[TABLE]

Under conditions (12) – (16), we have $\left\|\frac{\partial\calE_{T}}{\partial h_{T}}\frac{\partial h_{T}}{\partial h_{1}}\frac{\partial^{+}h_{1}}{\partial\phi}\right\|\rightarrow 0$ as $T\rightarrow+\infty$ , i.e., the vanishing gradient problem occurs.

Theorem 8 (Necessary condition of gradient exploding, restated).

Let $l_{\min}\triangleq\lfloor\log_{2}(N+1)\rfloor+1$ be the minimum possible depth of all full binary trees $T_{t}^{\pred},1\leq t\leq T$ . If the exploding gradient problem occurs, then at least one of the following conditions holds:

• there exists an activation function $u\in\calU$ such that $\|u^{\prime}\|\geq(N+1)^{-\frac{1}{3l_{\min}}}$ ,

• there exists a parameter matrix $P\in\calL\cup\calR$ such that $\|P\|\geq(N+1)^{-\frac{1}{3l_{\min}}}$ ,

• for infinitely many $t$ , there exists a pair of parent-child nodes $(v,v_{1})$ in the tree $T_{t}^{\pred}$ such that $\left\|\frac{\partial o(L_{i}v_{1},R_{i}v_{2})}{\partial(L_{i}v_{1})}\right\|\geq(N+1)^{-\frac{1}{3l_{\min}}}$ , where $v=u(o(L_{i}v_{1},R_{i}v_{2})+b_{i})$ ,

• for infinitely many $t$ , there exists a pair of parent-child nodes $(v,v_{2})$ in the tree $T_{t}^{\pred}$ such that $\left\|\frac{\partial o(L_{i}v_{1},R_{i}v_{2})}{\partial(R_{i}v_{2})}\right\|\geq(N+1)^{-\frac{1}{3l_{\min}}}$ , where $v=u(o(L_{i}v_{1},R_{i}v_{2})+b_{i})$ .

D.1 Proof of Theorem 7

Recall that Algorithm 1 builds a full binary tree $T_{t}^{\pred}$ with $N$ internal nodes and $N+1$ leaf nodes for the $t$ -th hidden cell of the RRNN model. We use $n_{1,t}^{\mathrm{int}},\ldots,n_{N,t}^{\mathrm{int}}$ and $n_{1,t}^{\mathrm{leaf}},\ldots,n_{N+1,t}^{\mathrm{leaf}}$ to denote the internal nodes and leaf nodes of the tree $T_{t}^{\pred}$ , respectively. We denote $\calV_{t}=\{v_{n_{1,t}^{\mathrm{int}}},\ldots,v_{n_{N,t}^{\mathrm{int}}},v_{n_{1,t}^{\mathrm{leaf}}},\ldots,v_{n_{N+1,t}^{\mathrm{leaf}}}\}$ to be the set of vectors on the predicted tree $T_{t}^{\pred}$ . We use $\|A\|$ and $\|v\|_{\infty}$ to denote the spectral norm of matrix $A$ and the infinity norm of vector $v$ , respectively. We use $\diag\{v\}$ to denote the diagonalization of vector $v$ . For an activation function $u\in\calU$ , we use $u^{\prime}$ to denote the derivative of $u$ .

Note that

[TABLE]

Intuitively, the vanishing gradient problem appears when the norm of $\frac{\partial h_{t}}{\partial h_{t-1}}$ is smaller than 1. We first provide some lemmas that facilitate proving Theorem 3. For simplicity, we remove the subscript $t$ in $n_{k,t}^{\mathrm{int}}$ and $n_{k,t}^{\mathrm{leaf}}$ , since the following derivation applies to all $1\leq t\leq T$ in the same way.

We define the path between the root node $n_{N}^{\mathrm{int}}$ and a leaf node $n_{k}^{\mathrm{leaf}}$ by $P^{k}=\big[P_{0}^{k},P_{1}^{k},\ldots,P_{l_{k}}^{k}\big]$ , where $l_{k}$ is the length of this path, $P_{0}^{k}=n_{k}^{\mathrm{leaf}}$ , and $P_{l_{k}}^{k}=n_{N}^{\mathrm{int}}$ . Lemma 2 gives an upper bound for the norm of the gradient of a node with respect to one of its child nodes in the binary tree $T_{t}^{\pred}$ .

Lemma 2.

Under conditions (12) – (15), there exists a constant $C_{0}<\frac{1}{2}$ such that $\left\|\frac{\partial P_{j}^{k}}{\partial P_{j-1}^{k}}\right\|\leq C_{0}$ for all $1\leq k\leq N+1$ and $1\leq j\leq l_{k}$ .

Proof.

For simplicity, we write $v=u(o(L_{i}v_{1},R_{i}v_{2})+b_{i})$ , where $i$ is the index in $\calL,\calR$ and $\calB$ , $v=P_{j}^{k}$ , $v_{1}=P_{j-1}^{k}$ , and $v_{2}$ is the other child node of $P_{j}^{k}$ . The case where $v_{2}=P_{j-1}^{k}$ is similar.

By the chain rule, we have

[TABLE]

where the last inequality follows by $\|\diag\{u^{\prime}\}\|=\|u^{\prime}\|_{\infty}\leq C_{3}$ by condition (14) together with conditions (12) and (13). By setting $C_{0}=C_{1}C_{2}C_{3}$ , the statement holds from condition (16). ∎

We have for all $1\leq t\leq T$ ,

[TABLE]

where the last inequality follows by Lemma 2.

Note that $l_{1},\ldots,l_{N+1}$ are the lengths of all the paths from the root node to the leaf nodes of a full binary tree. Lemma 3 gives an upper bound on the sum of $C_{0}^{l_{k}}$ over these paths.

Lemma 3.

Suppose $l_{k},1\leq k\leq N+1$ are the lengths of the $N+1$ paths of the full binary tree $T_{t}^{\pred}$ . Then for any $0<C_{0}<\frac{1}{2}$ , there exists a constant $\epsilon=\epsilon(C_{0}),0<\epsilon<1$ , such that

[TABLE]

Proof.

We prove by induction on $N$ that

[TABLE]

For $N=1$ , the tree $T_{t}^{\pred}$ has exactly one internal node and two leaf nodes, thus there is only one tree structure for $T_{t}^{\pred}$ up to isomorphism. We have $\{l_{1},l_{2}\}=\{1,1\}$ and $\sum_{k=1}^{N+1}C_{0}^{l_{k}}=2C_{0}=C_{0}^{N}+\sum_{k=1}^{N}C_{0}^{k}$ .

Suppose that equation (18) holds for $N\geq 1$ , and we consider the case of $N+1$ . For a tree $T$ , recall the definition of $\calI(T)$ in Section 3.3. Given the full binary tree $T_{t}^{\pred}$ , we denote $n_{0}$ to be the node that has the largest index in $\calI(T_{t}^{\pred})$ among all $N+1$ internal nodes of $T_{t}^{\pred}$ . It is obvious that both child nodes of $n_{0}$ are leaf nodes (if not, say the left child of $n_{0}$ is also an internal node, then it should have a larger index than $n_{0}$ , which leads to a contradiction). We use $T_{0}$ to denote the tree obtained by removing the two child nodes of $n_{0}$ from the tree $T_{t}^{\pred}$ ( $n_{0}$ is a leaf node of $T_{0}$ ). Then $T_{0}$ has exactly $N$ internal nodes. We use $l_{1},\ldots,l_{N+2}$ and $l_{1}^{\prime},\ldots,l_{N+1}^{\prime}$ to denote the lengths of all paths of $T_{t}^{\pred}$ and $T_{0}$ , respectively. Without loss of generality, we denote the length of the path that ends at $n_{0}$ in $T_{0}$ as $l_{N+1}^{\prime}$ , and the lengths of the two paths that pass through $n_{0}$ in $T_{t}^{\pred}$ as $l_{N+1}$ and $l_{N+2}$ , respectively.

Note that $l_{i}=l_{i}^{\prime},1\leq i\leq N$ and $l_{N+1}=l_{N+2}=l_{N+1}^{\prime}+1$ . We have

[TABLE]

where the inequality follows from the induction hypothesis $\sum_{k=1}^{N+1}C_{0}^{l_{k}^{\prime}}\leq C_{0}^{N}+\sum_{k=1}^{N}C_{0}^{k}$ and the facts that $2C_{0}-1<0$ and $l_{N+1}^{\prime}\leq N$ . This completes the induction step.

It remains to define the constant $\epsilon$ . Since $C_{0}<\frac{1}{2}$ , we have

[TABLE]

Taking $\epsilon=\frac{1}{2}-C_{0}>0$ finishes the proof for Lemma 3. ∎
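The induction bound used in this proof, $\sum_{k=1}^{N+1}C_{0}^{l_{k}}\leq C_{0}^{N}+\sum_{k=1}^{N}C_{0}^{k}$ , can be checked numerically on randomly grown full binary trees. The sketch below is only a sanity check, not a substitute for the proof; the tree generator and the function names are assumptions.

```python
import random


def random_leaf_depths(n_internal, rng):
    """Depths of the N + 1 leaves of a random full binary tree with N internal nodes."""
    depths = [0]  # start from a single root that is still a leaf
    for _ in range(n_internal):
        d = depths.pop(rng.randrange(len(depths)))  # turn a leaf into an internal node
        depths += [d + 1, d + 1]                    # and give it two leaf children
    return depths


def lhs(depths, c0):
    """Sum of C0 ** l_k over all root-to-leaf path lengths."""
    return sum(c0 ** l for l in depths)


def rhs(n_internal, c0):
    """C0 ** N + sum_{k=1}^{N} C0 ** k."""
    return c0 ** n_internal + sum(c0 ** k for k in range(1, n_internal + 1))


rng = random.Random(0)
c0 = 0.4  # any value in (0, 1/2)
worst_gap = min(
    rhs(n, c0) - lhs(random_leaf_depths(n, rng), c0)
    for n in range(1, 12)
    for _ in range(200)
)
```

A non-negative `worst_gap` (up to rounding) is consistent with the bound; equality is attained by the most unbalanced tree, whose leaf depths are $1,2,\ldots,N,N$ .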

By (17) and Lemma 3, for every $t,1\leq t\leq T$ we have

[TABLE]

Combining with condition (15), we have

[TABLE]

As $\eta<1$ , the quantity $\norm{\frac{\partial\calE_{T}}{\partial h_{T}}\left(\prod_{1<t\leq T}\frac{\partial h_{t}}{\partial h_{t-1}}\right)\frac{\partial^{+}h_{1}}{\partial\phi}}$ goes to 0 exponentially as $T\rightarrow\infty$ . ∎

D.2 Discussion of Theorem 3

We next discuss the feasibility of conditions (12) – (16). Condition (12) requires all the weight matrices in $\calL\cup\calR$ to have spectral norm no larger than $C_{1}$ . For sigmoid, condition (13) holds for $C_{2}=\frac{1}{4}$ while for tanh and ReLU it holds for $C_{2}=1$ . Condition (14) bounds the spectral norm of the gradient of each binary function. If the binary operation $o$ is addition, then the Jacobian matrix $\frac{\partial o(L_{i}v_{1},R_{i}v_{2})}{\partial(L_{i}v_{1})}$ is simply the identity matrix and its spectral norm equals 1. For vector entry-wise multiplication, note that

[TABLE]

where $C_{5}\triangleq\max_{v\in\calV_{t},1\leq t\leq T}\|v\|_{\infty}$ is the upper bound of the infinity norm of all vectors on the predicted trees. Therefore, if $\odot\in\calO$ , we should have $C_{3}\geq\sqrt{p}C_{1}C_{5}$ . Condition (16) holds when the scale of the weight matrices or of the vectors on nodes of the predicted trees is small. In the experiments we have $C_{2}=1$ and $p=100$ . Note also that $C_{5}=1$ if each $v$ is the outcome of sigmoid or tanh, or an entry-wise product of such vectors; in the presence of addition this no longer holds in general, but computational experiments have established that $C_{5}\leq 1$ even if addition is a candidate binary operation. Then condition (16) holds for $C_{1}\approx 0.223$ , which has been observed in our experiments.
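The value $C_{1}\approx 0.223$ can be recovered with a short calculation. The sketch assumes that condition (16) reads $C_{1}C_{2}C_{3}<\frac{1}{2}$ , as suggested by the choice $C_{0}=C_{1}C_{2}C_{3}$ in the proof of Lemma 2, and that $C_{3}$ takes its smallest admissible value $\sqrt{p}C_{1}C_{5}$ ; the function name is hypothetical.

```python
import math


def max_weight_norm(p, c2, c5, c0_limit=0.5):
    """Largest C1 with C1 * C2 * C3 < c0_limit when C3 = sqrt(p) * C1 * C5."""
    # C1 * C2 * sqrt(p) * C1 * C5 < c0_limit  <=>  C1 < sqrt(c0_limit / (C2 * sqrt(p) * C5))
    return math.sqrt(c0_limit / (c2 * math.sqrt(p) * c5))


c1_bound = max_weight_norm(p=100, c2=1.0, c5=1.0)
```

With $p=100$ and $C_{2}=C_{5}=1$ this gives $\sqrt{0.05}\approx 0.2236$ , in line with the value quoted above.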

It is worth mentioning that condition (15) is mild since we only require the norm to be bounded by a sufficiently large constant. By (6), we have

[TABLE]

There are three terms in the expression of $\calE_{t}$ that are related to $h_{t}$ , namely, the standard loss term, the tree distance term, and the scoring margin term. We bound them one by one. Since the number of samples is finite, we only consider each term for one data point in the following (i.e. we ignore the expectation).

Assume that $\norm{\frac{\partial l(y_{t},q)}{\partial q}}\leq C_{6},\norm{\frac{\partial g(x_{t},h;\Gamma)}{\partial h}}\leq C_{7}$ hold. Then the loss term is bounded by

[TABLE]

We use $h_{t}^{\target}$ to denote the vector of the root node of the ground truth tree $T_{t}^{\target}$ and assume that $\norm{h_{t}^{\target}}\leq C_{8}$ for all $t$ . Note that $h_{t}$ only appears at the root node of the predicted tree $T_{t}^{\pred}$ , and the definition of TD can be regarded as a summation over many norms of vector differences. Then the only term in TD that includes $h_{t}$ is $\norm{h_{t}-h_{t}^{\target}}^{2}$ . Therefore, the tree distance term is bounded by

[TABLE]

Again, note that $h_{t}$ only appears in the term

[TABLE]

If we assume that $\norm{\frac{\partial\alpha(h;\Theta)}{\partial h}}\leq C_{9}$ holds for any vector $h\in\mathbb{R}^{p}$ , then the scoring margin term is bounded by

[TABLE]

In conclusion, if we assume the existence of constants $C_{5},C_{6},C_{7},C_{8}$ , and $C_{9}$ , then the gradient of loss function $\calE_{t}$ with respect to the hidden state $h_{t}$ is bounded. Note that in practice $C_{5}$ and $C_{8}$ are about the norm of a finite set of vectors, and $C_{6},C_{7},C_{9}$ bound the norm of some simple functions or networks which in practice are all bounded. Therefore, we can easily argue that these constants do exist and thus the condition (15) is mild.

In summary, gradient vanishing frequently appears in practice.

D.3 Discussion of Theorem 8

In this section, we discuss the conditions appearing in Theorem 8. As the proof is similar to the proof of Theorem 3, we omit it here.

The conditions listed in Theorem 8 are common in practice since the quantity $(N+1)^{-\frac{1}{3l_{\min}}}$ is smaller than 1. If we have $\tanh\in\calU$ or $\text{ReLU}\in\calU$ , then the first condition above is automatically satisfied. Besides, if the addition operation belongs to $\calO$ , then the third and the fourth conditions are both fulfilled. In summary, gradient exploding is frequent in practice.
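To see how close to 1 the threshold $(N+1)^{-\frac{1}{3l_{\min}}}$ is, the sketch below evaluates it for a few values of $N$ . The function names are illustrative; $l_{\min}$ follows the definition in Theorem 8.

```python
def min_depth(n_internal):
    """l_min = floor(log2(N + 1)) + 1, as defined in Theorem 8."""
    # For a positive integer m, floor(log2(m)) equals m.bit_length() - 1.
    return (n_internal + 1).bit_length() - 1 + 1


def exploding_threshold(n_internal):
    """The quantity (N + 1) ** (-1 / (3 * l_min))."""
    return (n_internal + 1) ** (-1.0 / (3 * min_depth(n_internal)))


thresholds = {n: exploding_threshold(n) for n in (1, 8, 100)}
tanh_max_slope = 1.0      # maximum of tanh', attained at 0
sigmoid_max_slope = 0.25  # maximum of sigmoid', attained at 0
```

For $N=8$ the threshold is $9^{-1/12}\approx 0.83$ , so the maximum slope of tanh exceeds it while the maximum slope of sigmoid does not.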

Appendix E Details of Experimental Study

E.1 Implementation Details

For our implementation of RRNN-GRU, we use PyTorch and train on Nvidia 1080 Ti GPUs or Intel Skylake CPUs. In order for our choices of cell structures to be differentiable with respect to the parameters of the scoring network, we evaluate softmax over the scores of all potential vectors at each node in the cell. Gradient clipping is used for RRNN-GRU, but random hyperparameter search often allows large gradient magnitudes. With the optimal hyperparameters, training of RRNN-GRU takes approximately ten hours for the Wikipedia dataset, one hour for PTB, and eight hours for SST, which is longer than the GRU training time since the RRNN-GRU weights must adapt to multiple placements within the cell structure.
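The role of the softmax can be illustrated with a minimal sketch in plain Python (not the PyTorch implementation). It assumes the softmax probabilities are used to form a weighted combination of the candidate vectors at a node, which is one common way to keep a discrete choice differentiable; how the probabilities are consumed is not stated here, and `soft_select` is a hypothetical helper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def soft_select(candidates, scores):
    """Softmax-weighted average of candidate vectors (hypothetical helper).

    A hard argmax over the scores has zero gradient with respect to them
    almost everywhere; the weighted average varies smoothly with each score.
    """
    weights = softmax(scores)
    dim = len(candidates[0])
    mixed = [
        sum(w * vec[i] for w, vec in zip(weights, candidates))
        for i in range(dim)
    ]
    return mixed, weights


# Three illustrative candidate vectors at one node, with made-up scores.
candidates = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
scores = [2.0, 0.5, -1.0]

mixed, weights = soft_select(candidates, scores)
best = max(range(len(scores)), key=lambda i: scores[i])  # hard choice, for comparison
```

The hard choice `best` picks a single candidate, while `mixed` leans toward the highest-scoring candidate and still depends on every score.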

For the RRNN model, we use batch normalization to stabilize training. The RRNN training time on the Wikipedia dataset with 5,000 samples is 40 hours on the CPU of a 12-core server. We find RRNN to be faster on a CPU than on a GPU due to its structure searching algorithm, but RRNN-GRU's algorithm runs faster on GPUs.

E.2 Hyperparameters

1. GRU on Wiki-5k: batch size of 18, learning rate of $1.71\times 10^{-3}$, and $\ell_{2}$-regularization coefficient of $3.60\times 10^{-7}$.

2. RRNN on Wiki-5k: batch size of 16, learning rate of $10^{-3}$, and scoring network hidden size of 256. $\lambda_{1}=1,\lambda_{2}=10^{-3},\lambda_{3}=10^{-3},\lambda_{4}=10^{-5}$.

3. GRU on Wiki-10k: batch size of 41, learning rate of $1.4231\times 10^{-3}$, and $\ell_{2}$-regularization coefficient of $1.2124\times 10^{-11}$.

4. RRNN-GRU on Wiki-10k: batch size of 128, learning rate of $10^{-3}$, and scoring network hidden size of 64. Training alternates between the $L,R$, and $b$ weights and the scoring network every five epochs. $\lambda_{1}=1,\lambda_{2}=0.1,\lambda_{3}=10^{-8},\lambda_{4}=0.003$. Gradients are clipped to the maximum norm of 1.

5. GRU on SST: learning rate of $4.85\times 10^{-4}$, batch size of 3, and $\ell_{2}$ weight decay coefficient of $2.11\times 10^{-12}$.

6. RRNN-GRU on SST: learning rate of $1.06\times 10^{-5}$, $\lambda_{1}=1,\lambda_{2}=1.76\times 10^{-6},\lambda_{3}=2.67\times 10^{-12},\lambda_{4}=5.47\times 10^{-5}$, max-margin of $5.47\times 10^{-5}$, scoring network hidden size of 10 nodes, gradients clipped to the norm of 46.3, alternating training every epoch.

7. GRU on PTB: learning rate of $5.29\times 10^{-4}$, batch size of 6, and $\ell_{2}$ weight decay coefficient of $2.71\times 10^{-15}$.

8. RRNN-GRU on PTB: batch size of 116, learning rate of $3.03\times 10^{-4}$, $\lambda_{1}=1,\lambda_{2}=4.16\times 10^{-3},\lambda_{3}=1.22\times 10^{-13},\lambda_{4}=1.36\times 10^{-3}$, max scoring margin of 1.74, maximum gradient magnitude of 1.64, scoring hidden size of 137, and alternating training every epoch.
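Two of the settings above, alternating training and gradient clipping to a maximum norm, can be sketched in plain Python. This is an illustration under stated assumptions, not the PyTorch training code: the names `active_group` and `clip_by_global_norm` are hypothetical, epochs are 0-indexed, and the cell weights are assumed to be trained first (the text does not say which group starts).

```python
import math


def active_group(epoch, period=5):
    """Return which parameter group is updated in a 0-indexed epoch.

    Assumption: the cell weights (L, R, b) are trained first, then the
    scoring network, switching every `period` epochs.
    """
    return "cell_weights" if (epoch // period) % 2 == 0 else "scoring_network"


def clip_by_global_norm(grads, max_norm):
    """Rescale a flat list of gradient values so its L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradients and the norm before clipping.
    """
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm or norm == 0.0:
        return list(grads), norm
    scale = max_norm / norm
    return [g * scale for g in grads], norm


# Period of five epochs, as in the Wiki-10k setting.
schedule = [active_group(epoch, period=5) for epoch in range(12)]

# A made-up gradient of norm 5, clipped to the maximum norm of 1.
clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
```

With `period=1` the same schedule gives the every-epoch alternation used for SST and PTB. Clipping rescales the whole gradient, so its direction is preserved and only its magnitude changes.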

We next list the hyperparameter search ranges for RRNN-GRU in Table 2 and GRU in Table 3.

