Learning in Gated Neural Networks

Ashok Vardhan Makkuva; Sewoong Oh; Sreeram Kannan; Pramod Viswanath

arXiv:1906.02777·cs.LG·June 19, 2020


TL;DR

This paper analyzes the optimization landscape of gated neural networks, showing that with specially designed loss functions, gradient descent can accurately recover parameters, supported by new sample complexity results and improved numerical performance.

Contribution

It introduces two distinct loss functions for better parameter recovery in mixture-of-experts models and provides the first sample complexity analysis for this problem.

Findings

01

Gradient descent can learn parameters accurately with proper loss functions.

02

Two specialized loss functions improve parameter recovery.

03

First sample complexity results for mixture-of-experts models.

Abstract

Gating is a key feature in modern neural networks including LSTMs, GRUs and sparsely-gated deep neural networks. The backbone of such gated networks is a mixture-of-experts layer, where several experts make regression decisions and gating controls how to weigh the decisions in an input-dependent manner. Despite having such a prominent role in both modern and classical machine learning, very little is understood about parameter recovery of mixture-of-experts since gradient descent and EM algorithms are known to be stuck in local optima in such models. In this paper, we perform a careful analysis of the optimization landscape and show that with appropriately designed loss functions, gradient descent can indeed learn the parameters accurately. A key idea underpinning our results is the design of two distinct loss functions, one for recovering the expert parameters and another for…

Equations

$$y=\sum_{i=1}^{k}z_{i}\cdot g(\langle a_{i}^{\ast},x\rangle)+\xi,\qquad\xi\sim\mathcal{N}(0,\sigma^{2}),$$

$$\mathbb{P}\left[z_{i}=1\mid x\right]=\frac{e^{\langle w_{i}^{\ast},x\rangle}}{\sum_{j=1}^{k}e^{\langle w_{j}^{\ast},x\rangle}}.$$

$$y=(1-z)\,g(a_{1}^{\top}x)+z\,g(a_{2}^{\top}x)+\xi,$$

$$y(x)=\sigma(w^{\top}x)\,g(a_{1}^{\top}x)+(1-\sigma(w^{\top}x))\,g(a_{2}^{\top}x)\in\mathbb{R}.$$

$$y(x)=(1-z(x))\odot g(A_{1}x)+z(x)\odot g(A_{2}x),$$

$$h_{t}=(1-z_{t})\odot h_{t-1}+z_{t}\odot\tilde{h}_{t},\qquad\tilde{h}_{t}=g(U_{h}x_{t}+W_{h}(r_{t}\odot h_{t-1})),$$

$$z_{t}=\cdots$$

$$h_{t}=(1-z_{t})\odot h_{t-1}+z_{t}\odot\big((1-r_{t})\odot g(U_{h}x_{t})+r_{t}\odot g(U_{h}x_{t}+W_{h}h_{t-1})\big).$$

$$\ell_{2}(\{a_{i}\},\{w_{i}\})=\mathbb{E}_{(x,y)}\left\|\hat{y}(x)-y\right\|^{2},$$

$$\mathcal{Q}_{4}(y)\triangleq y^{4}+\alpha y^{3}+\beta y^{2}+\gamma y,\qquad\mathcal{Q}_{2}(y)\triangleq y^{2}+\delta y,$$

$$t_{3}(u,x)=\cdots,\qquad t_{2}(u,x)=\cdots,$$

$$t_{1}(u,v,x)=\big(\cdots-4(u^{\top}x)(v^{\top}x)(u^{\top}v)-\|v\|^{2}(u^{\top}x)^{2}+\|u\|^{2}\|v\|^{2}+2(u^{\top}v)^{2}\big)/c_{g,\sigma},$$

$$L_{4}(A)\triangleq\sum_{\substack{i,j\in[k]\\ i\neq j}}\mathbb{E}\left[\mathcal{Q}_{4}(y)\,t_{1}(a_{i},a_{j},x)\right]-\mu\sum_{i\in[k]}\mathbb{E}\left[\mathcal{Q}_{4}(y)\,t_{2}(a_{i},x)\right]+\lambda\sum_{i\in[k]}\left(\mathbb{E}\left[\mathcal{Q}_{2}(y)\,t_{3}(a_{i},x)\right]-1\right)^{2}+\frac{\delta}{2}\|A\|_{F}^{2},$$

$$\left\|\nabla L_{4}(A)\right\|_{2}\leq\varepsilon,\qquad\nabla^{2}L_{4}(A)\succcurlyeq-\tau/2,$$

$$L_{4}(A)=\sum_{m\in[k]}\mathbb{E}[p_{m}^{\ast}(x)]\sum_{\substack{i,j\in[k]\\ i\neq j}}\langle a_{m}^{\ast},a_{i}\rangle^{2}\langle a_{m}^{\ast},a_{j}\rangle^{2}-\mu\sum_{m,i\in[k]}\mathbb{E}[p_{m}^{\ast}(x)]\langle a_{m}^{\ast},a_{i}\rangle^{4}+\lambda\sum_{i\in[k]}\Big(\sum_{m\in[k]}\mathbb{E}[p_{m}^{\ast}(x)]\langle a_{m}^{\ast},a_{i}\rangle^{2}-1\Big)^{2}+\frac{\delta}{2}\|A\|_{F}^{2},$$

$$L_{\log}(W,A)\triangleq-\mathbb{E}_{(x,y)}\left[\log P_{y\mid x}\right]=-\mathbb{E}\log\sum_{i\in[k]}\frac{e^{\langle w_{i},x\rangle}}{\sum_{j\in[k]}e^{\langle w_{j},x\rangle}}\cdot\mathcal{N}(y\mid g(\langle a_{i},x\rangle),\sigma^{2}),$$

$$W\in\Omega\triangleq\{W\in\mathbb{R}^{(k-1)\times d}:\|w_{i}\|_{2}\leq R,\ \forall i\in[k-1]\},$$

$$W_{t+1}=\Pi_{\Omega}\left(W_{t}-\alpha\nabla_{W}L_{\log}(W_{t},A)\right),$$

$$W_{t+1}=G(W_{t},A),$$

$$G_{n}(W,A)\triangleq\Pi_{\Omega}\left(W-\alpha\nabla_{W}L_{\log}^{(n)}(W,A)\right).$$

$$\left\|W_{t}-W^{\ast}\right\|\leq(\rho_{\sigma})^{t}\left\|W_{0}-W^{\ast}\right\|+\kappa\,\varepsilon_{\mathrm{reg}}\sum_{\tau=0}^{t-1}(\rho_{\sigma})^{\tau},$$

$$\left\|W^{t}-W^{\ast}\right\|\leq\cdots+\frac{1}{1-\rho_{\sigma}}\left(\kappa\,\varepsilon_{\mathrm{reg}}+c_{2}\,\frac{dT\log(Tk/\delta)}{n}\right)$$

$$\mathcal{E}_{\mathrm{reg}}\triangleq1-\max_{\pi\in S_{k}}\min_{i\in[k]}\left|\langle a_{i},a_{\pi(i)}^{\ast}\rangle\right|,$$

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications

Full text

Learning in Gated Neural Networks

Ashok Vardhan Makkuva∗, Sreeram Kannan†, Sewoong Oh†, Pramod Viswanath∗

∗University of Illinois at Urbana-Champaign

†University of Washington

Abstract

Gating is a key feature in modern neural networks including LSTMs, GRUs and sparsely-gated deep neural networks. The backbone of such gated networks is a mixture-of-experts layer, where several experts make regression decisions and gating controls how to weigh the decisions in an input-dependent manner. Despite having such a prominent role in both modern and classical machine learning, very little is understood about parameter recovery of mixture-of-experts since gradient descent and EM algorithms are known to be stuck in local optima in such models.

In this paper, we perform a careful analysis of the optimization landscape and show that with appropriately designed loss functions, gradient descent can indeed learn the parameters of a MoE accurately. A key idea underpinning our results is the design of two distinct loss functions, one for recovering the expert parameters and another for recovering the gating parameters. We demonstrate the first sample complexity results for parameter recovery in this model for any algorithm and demonstrate significant performance gains over standard loss functions in numerical experiments.

1 Introduction

In recent years, gated recurrent neural networks (RNNs) such as LSTMs and GRUs have shown remarkable successes in a variety of challenging machine learning tasks such as machine translation, image captioning, image generation, handwriting generation, and speech recognition (Sutskever et al., 2014; Vinyals et al., 2014; Graves et al., 2013; Gregor et al., 2015; Graves, 2013). An important reason behind the success of these architectures is the presence of a gating mechanism that dynamically controls the flow of past information to the current state at each time instant. In addition, it is well known that these gates prevent the vanishing (and exploding) gradient problem inherent to traditional RNNs (Hochreiter and Schmidhuber, 1997).

Surprisingly, despite their widespread popularity, there is very little theoretical understanding of these gated models. In fact, basic questions such as learnability of the parameters still remain open. Even for the simplest vanilla RNN architecture, this question was open until the very recent works of Allen-Zhu et al. (2018) and Allen-Zhu and Li (2019), which provided the first theoretical guarantees of SGD for vanilla RNN models in the presence of non-linear activations. While this demonstrates that the theoretical analysis of these simpler models has itself been a challenging task, gated RNNs have an additional level of complexity in the form of gating mechanisms, which further increases the difficulty of the problem. This motivates us to ask the following question:

Question 1.

Given the complicated architectures of LSTMs/GRUs, can we find analytically tractable sub-structures of these models?

We believe that addressing the above question can provide new insights into a principled understanding of gated RNNs. In this paper, we make progress towards this and provide a positive answer to the question. In particular, we make a non-trivial connection that a GRU (gated recurrent unit) can be viewed as a time-series extension of a basic building block, known as Mixture-of-Experts (MoE) (Jacobs et al., 1991; Jordan and Jacobs, 1994). In fact, much like LSTMs/GRUs, MoE is itself a widely popular gated neural network architecture and has found success in a wide range of applications (Tresp, 2001; Collobert et al., 2002; Rasmussen and Ghahramani, 2002; Yuksel et al., 2012; Masoudnia and Ebrahimpour, 2014; Ng and Deisenroth, 2014; Eigen et al., 2014). In recent years, there has also been growing interest in the fields of natural language processing and computer vision in building complex neural networks that incorporate MoE models to address challenging tasks such as machine translation (Gross et al., 2017; Shazeer et al., 2017). Hence the main goal of this paper is to study MoE in close detail, especially with regard to the learnability of its parameters.

The canonical MoE model is the following: Let $k\in\mathbb{N}$ denote the number of mixture components (or equivalently neurons). Let $x\in\mathbb{R}^{d}$ be the input vector and $y\in\mathbb{R}$ be the corresponding output. Then the relationship between $x$ and $y$ is given by:

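$$y=\sum_{i=1}^{k}z_{i}\cdot g(\langle a_{i}^{\ast},x\rangle)+\xi,\qquad\xi\sim\mathcal{N}(0,\sigma^{2}), \tag{1}$$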

where $g:\mathbb{R}\to\mathbb{R}$ is a non-linear activation function, $\xi$ is a Gaussian noise independent of $x$ and the latent Bernoulli random variable $z_{i}\in\{0,1\}$ indicates which expert has been chosen. In particular, only a single expert is active at any time, i.e. $\sum_{i=1}^{k}z_{i}=1$ , and their probabilities are modeled by a soft-max function:

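$$\mathbb{P}\left[z_{i}=1\mid x\right]=\frac{e^{\langle w_{i}^{\ast},x\rangle}}{\sum_{j=1}^{k}e^{\langle w_{j}^{\ast},x\rangle}}.$$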

Following the standard convention (Makkuva et al., 2019; Jacobs et al., 1991), we refer to the vectors $a_{i}^{\ast}$ as regressors, the vectors $w_{i}^{\ast}$ as either classifiers or gating parameters, and without loss of generality, we assume that $w_{k}^{\ast}=0$.
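The generative model above can be made concrete with a short simulation. The following is a minimal sketch, not the authors' code: the function name `sample_moe`, the choice of `tanh` as the activation $g$, and the standard Gaussian input distribution are illustrative assumptions (the Gaussian input assumption is the one the paper adopts later for its loss design).

```python
import numpy as np

def sample_moe(A_star, W_star, n, sigma=0.1, g=np.tanh, seed=0):
    """Draw n samples (x, y, z) from a k-expert MoE (illustrative sketch).

    A_star: (k, d) regressors a_i^*.
    W_star: (k, d) gating vectors w_i^*, with the last row equal to zero.
    """
    rng = np.random.default_rng(seed)
    k, d = A_star.shape
    x = rng.standard_normal((n, d))              # assumed input law: N(0, I_d)
    logits = x @ W_star.T                        # (n, k) gating scores <w_i^*, x>
    logits -= logits.max(axis=1, keepdims=True)  # stabilise the softmax
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    # Exactly one expert is active per sample, chosen with the softmax probabilities.
    z = np.array([rng.choice(k, p=p) for p in probs])
    expert_out = g(x @ A_star.T)                 # (n, k) values g(<a_i^*, x>)
    y = expert_out[np.arange(n), z] + sigma * rng.standard_normal(n)
    return x, y, z
```

Here `z` stores the index of the active expert rather than a one-hot vector, which is equivalent to the constraint $\sum_{i}z_{i}=1$.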

Despite the canonical nature of the MoE model and the significant research effort devoted to it, the problem of learning MoE parameters is poorly understood theoretically. In fact, the task of learning the parameters of a MoE, i.e. $a_{i}^{\ast}$ and $w_{i}^{\ast}$, with provable guarantees has been a long-standing open problem for more than two decades (Sedghi et al., 2014). One of the key technical difficulties is that in a MoE, there is an inherent coupling between the regressors $a_{i}^{\ast}$ and the gating parameters $w_{i}^{\ast}$, as can be seen from Eq. (1), which makes the problem challenging (Ho et al., 2019). In a recent work (Makkuva et al., 2019), the authors provided the first consistent algorithms for learning MoE parameters with theoretical guarantees. In order to tackle the aforementioned coupling issue, they proposed a clever scheme that first estimates the regressor parameters $a_{i}^{\ast}$ and then estimates the gating parameters $w_{i}^{\ast}$ using a combination of spectral methods and the EM algorithm. However, a major drawback is that this approach requires specially crafted algorithms for learning each of these two sets of parameters. In addition, these algorithms lack finite sample guarantees. Since SGD and its variants remain the de facto algorithms for training neural networks because of their practical advantages, and inspired by the successes of these gradient-descent based algorithms in finding global minima in a variety of non-convex problems, we ask the following question:

Question 2.

How do we design objective functions amenable to efficient optimization techniques, such as SGD, with provable learning guarantees for MoE?

In this paper, we address this question in a principled manner and propose two non-trivial non-convex loss functions $L_{4}(\cdot)$ and $L_{\log}(\cdot)$ to learn the regressors and the gating parameters respectively. In particular, our loss functions possess nice landscape properties such as local minima being global and the global minima corresponding to the ground truth parameters. We also show that gradient descent on our losses can recover the true parameters with global/random initializations. To the best of our knowledge, ours is the first GD based approach with finite sample guarantees to learn the parameters of MoE. While our procedure to learn $\{a_{i}^{\ast}\}$ and $\{w_{i}^{\ast}\}$ separately and the technical assumptions are similar in spirit to Makkuva et al. (2019), our loss function based approach with provable guarantees for SGD is significantly different from that of Makkuva et al. (2019). We summarize our main contributions below:

•

MoE as a building block for GRU: We provide the first connection that the well-known GRU models are composed of basic building blocks, known as MoE. This link provides important insights into theoretical understanding of GRUs and further highlights the importance of MoE.

•

Optimization landscape design with desirable properties: We design two non-trivial loss functions $L_{4}(\cdot)$ and $L_{\log}(\cdot)$ to learn the regressors and the gating parameters of a MoE separately. We show that our loss functions have nice landscape properties and are amenable to simple local-search algorithms. In particular, we show that SGD on our novel loss functions recovers the parameters with global/random initializations.

•

First sample complexity results: We also provide the first sample complexity results for MoE. We show that our algorithms can recover the true parameters with accuracy $\varepsilon$ and with high probability, when provided with samples polynomial in the dimension $d$ and $1/\varepsilon$ .

Related work. Linear dynamical systems can be thought of as the linear version of RNNs. There is a huge literature on the topic of learning these linear systems (Alaeddini et al., 2018; Arora et al., 2018; Dean et al., 2017, 2018; Marecek and Tchrakian, 2018; Oymak and Ozay, 2018; Simchowitz et al., 2018; Hardt et al., 2018). However, these works are very specific to the linear setting and do not extend to non-linear RNNs. Allen-Zhu et al. (2018) and Allen-Zhu and Li (2019) are two recent works that provide the first theoretical guarantees for learning RNNs with the ReLU activation function. However, it is unclear how these techniques generalize to the gated architectures. In this paper, we focus on the learnability of MoE, which are the building blocks for these gated models.

While there is a huge body of work on MoEs (see Yuksel et al. (2012) and Masoudnia and Ebrahimpour (2014) for detailed surveys), the topic of learning MoE parameters is theoretically less understood, with very few works on it. Jordan and Xu (1995) is one of the early works that showed the local convergence of EM. In a recent work, Makkuva et al. (2019) provided the first consistent algorithms for MoE in the population setting using a combination of spectral methods and the EM algorithm. However, they do not provide any finite sample complexity bounds. In this work, we provide a unified approach using GD to learn the parameters with finite sample guarantees. To the best of our knowledge, we give the first gradient-descent based method with consistent learning guarantees, as well as the first finite-sample guarantee for any algorithm. Designing loss functions and analyzing their landscapes is an active research topic in a wide variety of machine learning problems: neural networks (Hardt and Ma, 2017; Kawaguchi, 2016; Li and Yuan, 2017; Panigrahy et al., 2017; Zhong et al., 2017; Ge et al., 2018; Gao et al., 2019), matrix completion (Bhojanapalli et al., 2016), community detection (Bandeira et al., 2016), and orthogonal tensor decomposition (Ge et al., 2015). In this work, we present the first objective function design and landscape analysis for MoE.

Notation. We denote the Euclidean ($\ell_{2}$) norm by $\left\|\cdot\right\|$. $[d]\triangleq\{1,2,\ldots,d\}$. $\{e_{i}\}_{i=1}^{d}$ denotes the standard basis vectors in $\mathbb{R}^{d}$. We denote matrices by capital letters like $A,W$, etc. For any two vectors $x,y\in\mathbb{R}^{d}$, we denote their Hadamard product by $x\odot y$. $\sigma(\cdot)$ denotes the sigmoid function $\sigma(z)=1/(1+e^{-z}),z\in\mathbb{R}$. For any $z=(z_{1},\ldots,z_{k})\in\mathbb{R}^{k}$, $\mathrm{softmax}_{i}(z)=\exp(z_{i})/(\sum_{j}\exp(z_{j}))$. $\mathcal{N}(\mu,\Sigma)$ denotes the Gaussian distribution with mean $\mu\in\mathbb{R}^{d}$ and covariance $\Sigma\in\mathbb{R}^{d\times d}$. Throughout the paper, we interchangeably denote regressors as $\{a_{i}\}$ or $A$, and gating parameters as $\{w_{i}\}$ or $W$.

Overview. The rest of the paper is organized as follows: In Section 2, we establish the precise mathematical connection between the well known GRU model and the MoE model. Building upon this correspondence, which highlights the importance of MoE, in Section 3 we design two novel loss functions to learn the respective regressors and gating parameters of a MoE and present our theoretical guarantees. In Section 4, we empirically validate that our proposed losses perform much better than the current approaches on a variety of settings.

2 GRU as a hierarchical MoE

In this section, we show that the recurrent update equations for GRU can be obtained from that of MoE, described in Eq. (1). In particular, we show that GRU can be viewed as a hierarchical MoE with depth-2. To see this, we restrict to the setting of a $2$ -MoE, i.e. let $k=2$ and $(a_{1}^{\ast},a_{2}^{\ast})=(a_{1},a_{2})$ , and $(w_{1}^{\ast},w_{2}^{\ast})=(w,0)$ in Eq. (1). Then we obtain that

$$y=(1-z)\,g(a_{1}^{\top}x)+z\,g(a_{2}^{\top}x)+\xi, \tag{2}$$

where $z\in\{0,1\}$ and $\mathbb{P}\left[z=0|x\right]=\sigma(w^{\top}x)$ . Since $\xi$ is a zero mean random variable independent of $x$ , taking conditional expectation on both sides of Eq. (2) yields that

$$y(x)=\sigma(w^{\top}x)\,g(a_{1}^{\top}x)+(1-\sigma(w^{\top}x))\,g(a_{2}^{\top}x)\in\mathbb{R}.$$

Now letting the output $y(x)\in\mathbb{R}^{m}$ be a vector and allowing for different gating parameters $\{w_{i}\}$ and regressors $\{(a_{1i},a_{2i})\}$ along each dimension $i=1,\ldots,m$, we obtain

$$y(x)=(1-z(x))\odot g(A_{1}x)+z(x)\odot g(A_{2}x), \tag{3}$$

where $z(x)=(z_{1}(x),\ldots,z_{m}(x))^{\top}$ with $z_{i}(x)=\sigma(w_{i}^{\top}x)$ , and $A_{1},A_{2}\in\mathbb{R}^{m\times d}$ denote the matrix of regressors corresponding to first and second experts respectively.

We now show that Eq. (3) is the basic equation behind the updates in GRU. Recall that in a GRU, given a time series $\{(x_{t},y_{t})\}_{t=1}^{T}$ of sequence length $T$, the goal is to produce a sequence of hidden states $\{h_{t}\}$ such that the output time series $\hat{y}_{t}=f(Ch_{t})$ is close to $\{y_{t}\}$ in some well-defined loss metric, where $f$ denotes the non-linear activation of the last layer. The equations governing the transition dynamics between $\{x_{t}\}$ and $\{h_{t}\}$ at any time $t\in[T]$ are given by (Cho et al., 2014):

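$$h_{t}=(1-z_{t})\odot h_{t-1}+z_{t}\odot\tilde{h}_{t},\qquad\tilde{h}_{t}=g(U_{h}x_{t}+W_{h}(r_{t}\odot h_{t-1})),$$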

where $z_{t}$ and $r_{t}$ denote the update and reset gates, which are given by

[TABLE]

where the matrices $U$ and $W$ with appropriate subscripts are parameters to be learnt. While the gating activation function $\sigma$ is modeled as a sigmoid for ease of obtaining gradients during training, the gates were intended to operate as binary-valued gates taking values in $\{0,1\}$. Indeed, in a recent work, Li et al. (2018) show that binary-valued gates enhance robustness and interpretability and also give better performance than their continuous-valued counterparts. In view of this, letting $\sigma$ be the binary threshold function $\mathbb{1}\{x\geq 0\}$, we obtain that

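$$h_{t}=(1-z_{t})\odot h_{t-1}+z_{t}\odot\big((1-r_{t})\odot g(U_{h}x_{t})+r_{t}\odot g(U_{h}x_{t}+W_{h}h_{t-1})\big). \tag{4}$$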

Letting $x=(x_{t},h_{t-1})$ and $y(x)=h_{t}$ in Eq. (3) with second expert $g(A_{2}x)$ replaced by a $2$ -MoE, we can see from Eq. (4) that GRU is a depth- $2$ hierarchical MoE. This is also illustrated in Figure 2.

Note that in Figure 2, NN-1 models the mapping $(x_{t},h_{t-1})\mapsto h_{t-1}$, NN-2 represents $(x_{t},h_{t-1})\mapsto g(U_{h}x_{t})$, and NN-3 models $(x_{t},h_{t-1})\mapsto g(U_{h}x_{t}+W_{h}h_{t-1})$. Hence, this is slightly different from the traditional MoE setting in Eq. (1), where the same activation $g(\cdot)$ is used for all the nodes. Nonetheless, we believe that studying this canonical model is a crucial first step that can yield important insights for the general setting.
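The correspondence between the GRU update and the depth-2 hierarchical MoE can be checked numerically. The sketch below is illustrative only: the function names are hypothetical, the gates are passed in directly instead of being computed from learned parameters, and `tanh` stands in for $g$. The two forms are compared for a reset gate that is uniformly 0 or uniformly 1, where the agreement is exact; for a reset gate that mixes 0 and 1 across coordinates, $W_{h}$ mixes coordinates inside $g$, so this sketch makes no claim there.

```python
import numpy as np

def gru_state(x_t, h_prev, z_t, r_t, U_h, W_h, g=np.tanh):
    """GRU hidden-state update with the gates z_t and r_t supplied directly."""
    h_tilde = g(U_h @ x_t + W_h @ (r_t * h_prev))
    return (1 - z_t) * h_prev + z_t * h_tilde

def hierarchical_moe_state(x_t, h_prev, z_t, r_t, U_h, W_h, g=np.tanh):
    """Depth-2 MoE form: z_t selects between h_prev (NN-1) and an inner
    2-MoE, gated by r_t, over the experts NN-2 and NN-3."""
    nn2 = g(U_h @ x_t)                   # expert that ignores the past state
    nn3 = g(U_h @ x_t + W_h @ h_prev)    # expert that uses the past state
    inner = (1 - r_t) * nn2 + r_t * nn3
    return (1 - z_t) * h_prev + z_t * inner
```

Calling both functions with the same binary `z_t` and with `r_t` set to all zeros or all ones returns identical hidden states.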

3 Optimization landscape design for MoE

In the previous section, we presented the mathematical connection between the GRU and the MoE. In this section, we focus on the learnability of the MoE model and design two novel loss functions for learning the regressors and the gating parameters separately.

3.1 Loss function for regressors: $L_{4}$

To motivate the need for loss function design in a MoE, first we take a moment to highlight the issues with the traditional approach of using the mean square loss $\ell_{2}$ . If $(x,y)$ are generated according to the ground-truth MoE model in Eq. (1), $\ell_{2}(\cdot)$ computes the quadratic cost between the expected predictions $\hat{y}$ and the ground-truth $y$ , i.e.

$$\ell_{2}(\{a_{i}\},\{w_{i}\})=\mathbb{E}_{(x,y)}\left\|\hat{y}(x)-y\right\|^{2},$$

where $\hat{y}(x)=\sum_{i}\mathrm{softmax}_{i}(w_{1}^{\top}x,\ldots,w_{k-1}^{\top}x,0)\,g(a_{i}^{\top}x)$ is the predicted output, and $\{a_{i}\},\{w_{i}\}$ denote the respective regressors and gating parameters. It is well known that this mean square loss is prone to bad local minima, as demonstrated empirically in the earliest work of Jacobs et al. (1991) (we verify this in Section 4 too), which also emphasized the importance of the right objective function to learn the parameters. Note that the bad landscape of $\ell_{2}$ is not unique to MoE, but is also widely observed in the context of training neural network parameters (Livni et al., 2014). In the one-hidden-layer NN setting, some recent works (Ge et al., 2018; Gao et al., 2019) addressed this issue by designing new loss functions with good landscape properties so that standard algorithms like SGD can provably learn the parameters. However, these methods do not generalize to the MoE setting since they crucially rely on the fact that the coefficients $z_{i}$ appearing in front of the activation terms $g(\langle{a_{i}^{\ast}},{x}\rangle)$ in Eq. (1), which correspond to the linear layer weights in NN, are constant. Such an assumption does not hold in the context of MoEs because the gating probabilities depend on $x$ in a parametric way through the softmax function, which introduces the coupling between $w_{i}^{\ast}$ and $a_{i}^{\ast}$ (a similar observation was noted in Makkuva et al. (2019) in the context of spectral methods).
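For reference, the predicted output $\hat{y}(x)$ and an empirical version of the mean square loss can be written in a few lines. This is a minimal sketch under stated assumptions: the names `moe_predict` and `l2_loss` are hypothetical, `tanh` stands in for $g$, and the expectation is replaced by a sample mean.

```python
import numpy as np

def moe_predict(x, A, W, g=np.tanh):
    """Gate-weighted prediction y_hat(x) = sum_i softmax_i(Wx, 0) * g(<a_i, x>).

    x: (n, d) inputs; A: (k, d) regressors; W: (k-1, d) gating vectors
    (the k-th gating vector is fixed to zero).
    """
    logits = np.hstack([x @ W.T, np.zeros((x.shape[0], 1))])
    logits -= logits.max(axis=1, keepdims=True)   # stabilise the softmax
    p = np.exp(logits)
    p /= p.sum(axis=1, keepdims=True)
    return (p * g(x @ A.T)).sum(axis=1)

def l2_loss(x, y, A, W, g=np.tanh):
    """Sample-mean version of the mean square loss ell_2."""
    return np.mean((moe_predict(x, A, W, g) - y) ** 2)
```

Both the regressors `A` and the gating parameters `W` enter `moe_predict` jointly, which is the coupling discussed above.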

In order to address the aforementioned issues, inspired by the works of Ge et al. (2018) and Gao et al. (2019), we design a novel loss function $L_{4}(\cdot)$ to learn the regressors first. Our loss function depends on two distinct special transformations on both the input $x\in\mathbb{R}^{d}$ and the output $y\in\mathbb{R}$. For the output, we consider the following transformations:

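$$\mathcal{Q}_{4}(y)\triangleq y^{4}+\alpha y^{3}+\beta y^{2}+\gamma y,\qquad\mathcal{Q}_{2}(y)\triangleq y^{2}+\delta y,$$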

where the coefficients $(\alpha,\beta,\gamma,\delta)$ depend on the choice of non-linearity $g$ and noise variance $\sigma^{2}$. These are obtained by solving a simple linear system (see Appendix B). For the special case $g=\mathrm{Id}$, which corresponds to linear activations, the Quartic transform is $\mathcal{Q}_{4}(y)=y^{4}-6y^{2}(1+\sigma^{2})+3+3\sigma^{4}-6\sigma^{2}$ and the Quadratic transform is $\mathcal{Q}_{2}(y)=y^{2}-(1+\sigma^{2})$. For the input $x$, we assume that $x\sim\mathcal{N}(0,I_{d})$, and for any two fixed $u,v\in\mathbb{R}^{d}$, we consider the projections of multivariate Hermite polynomials (Grad, 1949; Holmquist, 1996; Janzamin et al., 2014) along these two vectors, i.e.

[TABLE]

where $c_{g,\sigma}$ and $c^{\prime}_{g,\sigma}$ are two non-zero constants depending on $g$ and $\sigma$ . These transformations $(t_{1},t_{2},t_{3})$ on the input $x$ and $(\mathcal{Q}_{4},\mathcal{Q}_{2})$ on the output $y$ can be viewed as extractors of higher order information from the data. The utility of these transformations is concretized in Theorem 1 through the loss function defined below. Denoting the set of our regression parameters by the matrix $A^{\top}=[a_{1}|a_{2}|\ldots|a_{k}]\in\mathbb{R}^{d\times k}$ , we now define our objective function $L_{4}(A)$ as

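$$L_{4}(A)\triangleq\sum_{\substack{i,j\in[k]\\ i\neq j}}\mathbb{E}\left[\mathcal{Q}_{4}(y)\,t_{1}(a_{i},a_{j},x)\right]-\mu\sum_{i\in[k]}\mathbb{E}\left[\mathcal{Q}_{4}(y)\,t_{2}(a_{i},x)\right]+\lambda\sum_{i\in[k]}\left(\mathbb{E}\left[\mathcal{Q}_{2}(y)\,t_{3}(a_{i},x)\right]-1\right)^{2}+\frac{\delta}{2}\|A\|_{F}^{2}, \tag{6}$$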

where $\mu,\lambda,\delta>0$ are positive regularization constants. Notice that $L_{4}$ is defined as an expectation of terms involving the data transformations $\mathcal{Q}_{4},\mathcal{Q}_{2},t_{1},t_{2},$ and $t_{3}$. Hence its gradients can be readily estimated from finite samples, and the loss is amenable to standard optimization methods such as SGD for learning the parameters. Moreover, the following theorem highlights that the landscape of $L_{4}$ does not have any spurious local minima.
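To make the input transformations more tangible, the sketch below writes out the standard second- and fourth-order multivariate Hermite contractions for a Gaussian input. It is an illustration under assumptions, not the paper's definition: the function names are hypothetical, the normalizing constants $c_{g,\sigma}$ and $c^{\prime}_{g,\sigma}$ are deliberately left out because they depend on $g$ and $\sigma$, and identifying $t_{3}$, $t_{2}$ and $t_{1}$ with `h2`, `h4_diag` and `h4_cross` up to those constants is an assumption based on the description of the $t$'s as Hermite projections.

```python
import numpy as np

def h2(u, x):
    """Unnormalised second-order Hermite contraction H_2(x)(u, u)."""
    return (u @ x) ** 2 - u @ u

def h4_diag(u, x):
    """Unnormalised fourth-order Hermite contraction H_4(x)(u, u, u, u)."""
    ux, uu = u @ x, u @ u
    return ux ** 4 - 6 * uu * ux ** 2 + 3 * uu ** 2

def h4_cross(u, v, x):
    """Unnormalised fourth-order Hermite contraction H_4(x)(u, u, v, v)."""
    ux, vx = u @ x, v @ x
    uu, vv, uv = u @ u, v @ v, u @ v
    return (ux ** 2 * vx ** 2 - uu * vx ** 2 - 4 * ux * vx * uv
            - vv * ux ** 2 + uu * vv + 2 * uv ** 2)
```

Setting `v = u` in `h4_cross` recovers `h4_diag`, which is a quick consistency check on the coefficients.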

Theorem 1 (Landscape analysis for learning regressors).

Under the mild technical assumptions of Makkuva et al. (2019), the loss function $L_{4}$ does not have any spurious local minima. More concretely, let $\varepsilon>0$ be a given error tolerance. Then we can choose the regularization constants $\mu,\lambda$ and the parameters $\varepsilon,\tau$ such that if $A$ satisfies

$$\left\|\nabla L_{4}(A)\right\|_{2}\leq\varepsilon,\qquad\nabla^{2}L_{4}(A)\succcurlyeq-\tau/2,$$

then $(A^{\dagger})^{\top}=PD\Gamma A^{\ast}+E$ , where $D$ is a diagonal matrix with entries close to $1$ , $\Gamma$ is a diagonal matrix with $\Gamma_{ii}=\sqrt{\mathbb{E}[p_{i}^{\ast}(x)]}$ , $P$ is a permutation matrix and $\left\|E\right\|\leq\varepsilon_{0}$ . Hence every approximate local minimum is $\varepsilon$ -close to the global minimum.

Intuitions behind the theorem and the special transforms: While the transformations and the loss $L_{4}$ defined above may appear non-intuitive at first, the key observation is that $L_{4}$ can be viewed as a fourth-order polynomial loss in the parameter space, i.e.

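$$L_{4}(A)=\sum_{m\in[k]}\mathbb{E}[p_{m}^{\ast}(x)]\sum_{\substack{i,j\in[k]\\ i\neq j}}\langle a_{m}^{\ast},a_{i}\rangle^{2}\langle a_{m}^{\ast},a_{j}\rangle^{2}-\mu\sum_{m,i\in[k]}\mathbb{E}[p_{m}^{\ast}(x)]\langle a_{m}^{\ast},a_{i}\rangle^{4}+\lambda\sum_{i\in[k]}\Big(\sum_{m\in[k]}\mathbb{E}[p_{m}^{\ast}(x)]\langle a_{m}^{\ast},a_{i}\rangle^{2}-1\Big)^{2}+\frac{\delta}{2}\|A\|_{F}^{2}, \tag{7}$$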

where $p_{i}^{\ast}$ refers to the softmax probability for the $i^{th}$ label with true gating parameters, i.e. $p_{i}^{\ast}(x)=\mathrm{softmax}_{i}(\langle{w_{1}^{\ast}},{x}\rangle,\ldots,\langle{w_{k-1}^{\ast}},{x}\rangle,0)$. This alternate characterization of $L_{4}(\cdot)$ in Eq. (7) is the crucial step towards proving Theorem 1, and the specially designed transformations on the data $(x,y)$ are what make it possible. Given this viewpoint, we utilize tools from Ge et al. (2018), where a similar loss involving fourth-order polynomials was analyzed in the context of a $1$-layer ReLU network, to prove the desired landscape properties for $L_{4}$. The full details behind the proof are provided in Appendix C. Moreover, in Section 4 we empirically verify that the technical assumptions are only needed for the theoretical results and that our algorithms work equally well even when we relax them.
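The polynomial form in Eq. (7) is easy to evaluate directly when the true regressors and the average gating probabilities are known, which is only the case in synthetic experiments. The following sketch is illustrative: `l4_polynomial` and the argument `pi` (standing for the vector of $\mathbb{E}[p_{m}^{\ast}(x)]$) are hypothetical names introduced here.

```python
import numpy as np

def l4_polynomial(A, A_star, pi, mu, lam, delta):
    """Evaluate the fourth-order polynomial form of L_4.

    A, A_star: (k, d) candidate and true regressors (rows a_i and a_m^*).
    pi: (k,) vector with pi[m] = E[p_m^*(x)].
    """
    S = (A_star @ A.T) ** 2                  # S[m, i] = <a_m^*, a_i>^2
    row_sum = S.sum(axis=1)                  # sum_i S[m, i]
    # sum over i != j of S[m, i] S[m, j] = (sum_i S[m, i])^2 - sum_i S[m, i]^2
    cross = pi @ (row_sum ** 2 - (S ** 2).sum(axis=1))
    quartic = pi @ (S ** 2).sum(axis=1)      # sum_{m, i} pi_m <a_m^*, a_i>^4
    norm_pen = ((pi @ S - 1.0) ** 2).sum()   # sum_i (sum_m pi_m S[m, i] - 1)^2
    return cross - mu * quartic + lam * norm_pen + 0.5 * delta * (A ** 2).sum()
```

As a sanity check, take orthonormal true regressors and set each row of `A` to $a_{i}^{\ast}/\sqrt{\mathbb{E}[p_{i}^{\ast}(x)]}$, the scaling suggested by Theorem 1: the cross term and the $\lambda$ penalty both vanish, leaving only the $-\mu$ and $\delta$ terms.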

In the finite sample regime, we replace the population expectations in Eq. (6) with sample averages to obtain the empirical loss $\hat{L}$. The following theorem establishes that $\hat{L}$ inherits the landscape properties of $L$ when provided with enough samples.

Theorem 2 (Finite sample landscape).

There exists a polynomial $\mathrm{poly}(d,1/\varepsilon)$ such that whenever $n\geq\mathrm{poly}(d,1/\varepsilon)$, $\hat{L}$ inherits the same landscape properties as that of $L$ established in Theorem 1 with high probability. Hence stochastic gradient descent on $\hat{L}$ converges to an approximate local minimum, which is also close to a global minimum, in time polynomial in $d,1/\varepsilon$.

Remark 1.

Notice that the parameters $\{a_{i}\}$ learnt through SGD are some permutation of the true parameters $a_{i}^{\ast}$ up to sign flips. This sign ambiguity can be resolved using existing standard procedures such as Algorithm 1 in Ge et al. (2018). In the remainder of the paper, we assume that we know the regressors up to some error $\varepsilon_{\mathrm{reg}}>0$ in the following sense: $\max_{i\in[k]}\left\|a_{i}-a_{i}^{\ast}\right\|=\sigma^{2}\varepsilon_{\mathrm{reg}}$.
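Because the regressors are only identified up to permutation and sign, any error measure should be invariant to both. One such measure is the quantity $\mathcal{E}_{\mathrm{reg}}$ listed among the paper's equations, one minus the best worst-case absolute inner product over all matchings. The sketch below is a brute-force illustration: the function name is hypothetical, the rows are assumed to be unit-norm, and the search over all $k!$ permutations is only practical for small $k$.

```python
import numpy as np
from itertools import permutations

def regressor_error(A, A_star):
    """1 - max over permutations of the min over i of |<a_i, a_{pi(i)}^*>|.

    Rows of A and A_star are assumed to be unit-norm regressors.
    """
    k = A.shape[0]
    G = np.abs(A @ A_star.T)        # G[i, j] = |<a_i, a_j^*>|
    best = max(min(G[i, perm[i]] for i in range(k))
               for perm in permutations(range(k)))
    return 1.0 - best
```

A permuted and sign-flipped copy of the true regressors yields an error of zero under this measure.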

3.2 Loss function for gating parameters: $L_{\log}$

In the previous section, we have established that we can learn the regressors $a_{i}^{\ast}$ up to small error using SGD on the loss function $L_{4}$. Now we are interested in answering the following question: Can we design a loss function amenable to efficient optimization algorithms such as SGD with recovery guarantees to learn the gating parameters?

In order to gain some intuition towards addressing this question, consider the simplified setting of $\sigma=0$ and $A=A^{\ast}$. In this setting, we can see from Eq. (1) that the output $y$ equals one of the activation values $g(\langle{a_{i}^{\ast}},{x}\rangle)$, for $i\in[k]$, with probability $1$. Since we already have access to the true parameters, i.e. $A=A^{\ast}$, we can exactly recover the hidden latent variable $z$, which corresponds to the chosen hidden expert for each sample $(x,y)$. Thus the problem of learning the classifiers $w_{1}^{\ast},\ldots,w_{k-1}^{\ast}$ reduces to a multi-class classification problem with label $z$ for each input $x$ and hence can be efficiently solved by traditional methods such as logistic regression. It turns out that these observations can be formalized to deal with more general settings (where we only know the regressors approximately and the noise variance is not zero) and that gradient descent on the log-likelihood loss achieves the same objective. Hence we use the negative log-likelihood function to learn the classifiers, i.e.

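$$L_{\log}(W,A)\triangleq-\mathbb{E}_{(x,y)}\left[\log P_{y\mid x}\right]=-\mathbb{E}\log\sum_{i\in[k]}\frac{e^{\langle w_{i},x\rangle}}{\sum_{j\in[k]}e^{\langle w_{j},x\rangle}}\cdot\mathcal{N}(y\mid g(\langle a_{i},x\rangle),\sigma^{2}), \tag{8}$$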

where $W^{\top}=\begin{bmatrix}w_{1}|w_{2}|\ldots|w_{k-1}\end{bmatrix}$. Note that the objective in Eq. (8) is not convex in the gating parameters $W$ whenever $\sigma\neq 0$. We omit the input distribution $P_{x}$ from the above negative log-likelihood since it does not depend on any of the parameters. We now define the domain of the gating parameters $\Omega$ as

$$W\in\Omega\triangleq\{W\in\mathbb{R}^{(k-1)\times d}:\|w_{i}\|_{2}\leq R,\ \forall i\in[k-1]\},$$

for some fixed $R>0$ . Without loss of generality, we assume that $w_{k}=0$ . Since we know the regressors approximately from the previous stage, i.e. $A\approx A^{\ast}$ , we run gradient descent only for the classifier parameters keeping the regressors fixed, i.e.

$$W_{t+1}=\Pi_{\Omega}\left(W_{t}-\alpha\nabla_{W}L_{\log}(W_{t},A)\right),$$

where $\alpha>0$ is a suitably chosen learning-rate, $\Pi_{\Omega}(W)$ denotes the projection operator which maps each row of its input matrix onto the ball of radius $R$ , and $t>0$ denotes the iteration step. In a more succinct way, we write

$$W_{t+1}=G(W_{t},A),\qquad G(W,A)\triangleq\Pi_{\Omega}\left(W-\alpha\nabla_{W}L_{\log}(W,A)\right).$$

Note that $G(W,A)$ denotes the projected gradient descent operator on $W$ for fixed $A$ . In the finite sample regime, we define our loss $L_{\log}^{(n)}(W,A)$ as the finite sample counterpart of Eq. (8) by taking empirical expectations. Accordingly, we define the gradient operator $G_{n}(W,A)$ as

$$G_{n}(W,A)\triangleq\Pi_{\Omega}\left(W-\alpha\nabla_{W}L_{\log}^{(n)}(W,A)\right).$$

In this paper, we analyze a sample-splitting version of gradient descent: given the number of samples $n$ and the number of iterations $T$, we first split the data into $T$ subsets of size $\lfloor n/T\rfloor$ and perform each iteration on a fresh batch of samples, i.e. $W_{t+1}=G_{n/T}(W_{t},A)$. We use the norm $\left\|W-W^{\ast}\right\|=\max_{i\in[k-1]}\left\|w_{i}-w_{i}^{\ast}\right\|_{2}$ for our theoretical results. The following theorem establishes the almost geometric convergence of the population-gradient iterates under some high SNR conditions. The following results are stated for $R=1$ for simplicity and also hold for any general $R>0$.
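The procedure described in this section (empirical negative log-likelihood in $W$ with $A$ held fixed, row-wise projection onto the ball of radius $R$, and one fresh batch per iteration) can be sketched as follows. This is a minimal illustration rather than the authors' implementation: all function names are hypothetical, `tanh` stands in for $g$, and the gradient uses the standard mixture identity $\nabla_{w_{i}}L=\mathbb{E}[(p_{i}-q_{i})x]$, where $q_{i}$ is the posterior probability of expert $i$ given $(x,y)$.

```python
import numpy as np

def gate_probs(x, W):
    """Softmax over (w_1.x, ..., w_{k-1}.x, 0); returns an (n, k) array."""
    logits = np.hstack([x @ W.T, np.zeros((x.shape[0], 1))])
    logits -= logits.max(axis=1, keepdims=True)
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

def nll_and_grad(x, y, W, A, sigma, g=np.tanh):
    """Empirical negative log-likelihood in W (A fixed) and its gradient."""
    p = gate_probs(x, W)                                   # (n, k) gating probabilities
    resid = y[:, None] - g(x @ A.T)                        # (n, k) residual per expert
    log_lik = -0.5 * (resid / sigma) ** 2 - 0.5 * np.log(2 * np.pi * sigma ** 2)
    joint = np.log(p) + log_lik                            # log p_i + log N(y | g(a_i.x), sigma^2)
    m = joint.max(axis=1, keepdims=True)
    log_mix = m[:, 0] + np.log(np.exp(joint - m).sum(axis=1))
    q = np.exp(joint - log_mix[:, None])                   # posterior over experts
    grad = ((p - q)[:, :-1].T @ x) / x.shape[0]            # (k-1, d); w_k is fixed at 0
    return -log_mix.mean(), grad

def project_rows(W, R=1.0):
    """Pi_Omega: rescale any row whose norm exceeds R back onto the ball."""
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    return W * np.minimum(1.0, R / np.maximum(norms, 1e-12))

def sample_split_gd(x, y, W0, A, sigma, alpha, T, R=1.0):
    """T projected gradient steps, each on a fresh batch of floor(n/T) samples."""
    b = x.shape[0] // T
    W = project_rows(W0, R)
    for t in range(T):
        xb, yb = x[t * b:(t + 1) * b], y[t * b:(t + 1) * b]
        _, grad = nll_and_grad(xb, yb, W, A, sigma)
        W = project_rows(W - alpha * grad, R)
    return W
```

Using disjoint batches makes each iterate independent of the samples used in its own update, which is what the sample-splitting analysis relies on; in practice one would often reuse the data across iterations.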

Theorem 3 (GD convergence for classifiers).

Assume that $\max_{i\in[k]}\left\|a_{i}-a_{i}^{\ast}\right\|_{2}=\sigma^{2}\varepsilon_{\mathrm{reg}}$. Then there exist two positive constants $\alpha_{0}$ and $\sigma_{0}$ such that for any step size $0<\alpha\leq\alpha_{0}$ and noise variance $\sigma^{2}<\sigma_{0}^{2}$, the population gradient descent iterates $\{W_{t}\}_{t\geq 0}$ converge almost geometrically to the true parameter $W^{\ast}$ for any randomly initialized $W_{0}\in\Omega$, i.e.

[TABLE]

where $(\rho_{\sigma},\kappa)\in(0,1)\times(0,\infty)$ are dimension-independent constants depending on $g,k$ and $\sigma$ such that $\rho_{\sigma}=o_{\sigma}(1)$ and $\kappa=O_{k,\sigma}(1)$ .

Proof.

(Sketch) For simplicity, let $\varepsilon_{\mathrm{reg}}=0$ . Then we can show that $G(W^{\ast},A^{\ast})=W^{\ast}$ since $\nabla_{W}L_{\mathrm{\log}}(W=W^{\ast},A^{\ast})=0$ . Then we capitalize on the fact that $G(\cdot,A^{\ast})$ is strongly convex with minimizer at $W=W^{\ast}$ to show the geometric convergence rate. The more general case of $\varepsilon_{\mathrm{reg}}>0$ is handled through perturbation analysis. ∎

We conclude our theoretical discussion on MoE with the following finite sample complexity guarantee for learning the classifiers using gradient descent, which can be viewed as a finite sample version of Theorem 3.

Theorem 4 (Finite sample complexity and convergence rates for GD).

In addition to the assumptions of Theorem 3, assume that the sample size $n$ is lower bounded as $n\geq c_{1}Td\log(\frac{T}{\delta})$. Then the sample-gradient iterates $\{W_{t}\}_{t=1}^{T}$ based on $n/T$ samples per iteration satisfy the bound

[TABLE]

with probability at least $1-\delta$ .

4 Experiments

In this section, we empirically validate the fact that running SGD on our novel loss functions $L_{4}$ and $L_{\mathrm{\log}}$ achieves superior performance compared to the existing approaches. Moreover, we empirically show that our algorithms are robust to the technical assumptions made in Theorem 1 and that they achieve equally good results even when the assumptions are relaxed.

Data generation. For our experiments, we choose $d=10$, $k\in\{2,3\}$, $a_{i}^{\ast}=e_{i}$ for $i\in[k]$ and $w_{i}^{\ast}=e_{k+i}$ for $i\in[k-1]$, and $g=\mathrm{Id}$. We generate the data $\{(x_{i},y_{i})\}_{i=1}^{n}$ according to Eq. (1) using these ground-truth parameters. We choose $\sigma=0.05$ for all of our experiments.
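The data generation above can be sketched in NumPy as follows. This is a hedged illustration rather than the authors' script: the function name `generate_moe_data`, the seed handling, and the use of standard Gaussian inputs $x\sim\mathcal{N}(0,I_{d})$ (taken from the assumptions of Section 3) are our choices, and the latent expert index $z$ is returned only for inspection.

```python
import numpy as np

def generate_moe_data(n, d=10, k=3, sigma=0.05, seed=0):
    """Sample (x, y) from a k-MoE with g = Id, a_i^* = e_i, w_i^* = e_{k+i}."""
    rng = np.random.default_rng(seed)
    eye = np.eye(d)
    A_star = eye[:k]             # rows e_1, ..., e_k
    W_star = eye[k:2 * k - 1]    # rows e_{k+1}, ..., e_{2k-1}; w_k^* = 0
    X = rng.standard_normal((n, d))

    # Soft-max gating probabilities, with a zero logit for the k-th expert.
    logits = np.hstack([X @ W_star.T, np.zeros((n, 1))])
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)

    # Draw the latent expert z for each sample by inverting the CDF.
    u = rng.random(n)
    z = (u[:, None] > np.cumsum(probs, axis=1)).sum(axis=1)
    z = np.minimum(z, k - 1)     # guard against floating-point round-off

    expert_outputs = X @ A_star.T                      # g = Id
    y = expert_outputs[np.arange(n), z] + sigma * rng.standard_normal(n)
    return X, y, z, A_star, W_star
```

With these parameters the regressors and the gating vectors occupy disjoint coordinates, so the orthogonality assumption between the two sets holds by construction.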

Error metric. If $A\in\mathbb{R}^{k\times d}$ denotes the matrix of regressors where each row is of norm $1$ , we use the error metric $\mathcal{E}_{\mathrm{reg}}$ to gauge the closeness of $A$ to the ground-truth $A^{\ast}$ :

[TABLE]

where $S_{k}$ denotes the set of all permutations on $[k]$. Note that $\mathcal{E}_{\mathrm{reg}}\leq\varepsilon$ if and only if the learnt regressors have a minimum correlation of $1-\varepsilon$ with the ground-truth parameters, up to a permutation. The error metric $\mathcal{E}_{\mathrm{gating}}$ is defined similarly.
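One way to compute a metric matching this description is sketched below. We assume the form $1-\max_{\pi\in S_{k}}\min_{i\in[k]}|\langle a_{i},a_{\pi(i)}^{\ast}\rangle|$ for unit-norm rows; the absolute value and the function name `regressor_error` are our assumptions. The brute-force search over permutations is adequate for the small values of $k$ used in the experiments.

```python
import itertools
import numpy as np

def regressor_error(A, A_star):
    """1 - max over permutations of the minimum row-wise |correlation|.

    Assumes the rows of A and A_star have unit norm. The absolute value
    (sign-insensitive correlation) is an assumption of this sketch.
    """
    k = A.shape[0]
    corr = np.abs(A @ A_star.T)  # corr[i, j] = |<a_i, a_j^*>|
    best = max(
        min(corr[i, perm[i]] for i in range(k))
        for perm in itertools.permutations(range(k))
    )
    return 1.0 - float(best)
```

For example, a row-permuted copy of $A^{\ast}$ gives an error of $0$, while an estimate containing a row orthogonal to every ground-truth regressor gives an error of $1$.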

Results. In Figure 1, we choose $k=3$ and compare the performance of our algorithm against existing approaches. In particular, we consider three methods: 1) EM algorithm, 2) SGD on the classical $\ell_{2}$-loss from Eq. (3.1), and 3) SGD on our losses $L_{4}$ and $L_{\mathrm{log}}$. For all the methods, we ran $5$ independent trials and plotted the mean error. Figure 1(a) highlights the fact that minimizing our loss function $L_{4}$ by SGD recovers the ground-truth regressors, whereas SGD on the $\ell_{2}$-loss as well as EM get stuck in local optima. For learning the gating parameters $W$ using our approach, we first fix the regressors $A$ at the values learnt using $L_{4}$, i.e. $A=\hat{A}$, where $\hat{A}$ is the converged solution for $L_{4}$. For $\ell_{2}$ and the EM algorithm, the gating parameters $W$ are learnt jointly with regressors $A$. Figure 1(b) illustrates the phenomenon that our loss $L_{\mathrm{log}}$ for learning the gating parameters performs considerably better than the standard approaches, as indicated by the significant gaps between the respective error values. Finally, in Figure 1(c) we plot the regressor error for $L_{4}$ over $5$ random initializations. We can see that we recover the ground truth parameters in all the trials, thus empirically corroborating our technical results in Section 3.

4.1 Robustness to technical assumptions

In this section, we verify numerically that our algorithms work equally well in the absence of the technical assumptions made in Section 3.

Relaxing orthogonality in Theorem 1. A key assumption in proving Theorem 1, adapted from Makkuva et al., (2019), is that the set of regressors $\{a_{i}^{\ast}\}$ and the set of gating parameters $\{w_{i}^{\ast}\}$ are orthogonal to each other. While this assumption is needed for the technical proofs, we now empirically verify that our conclusions still hold when we relax it. For this experiment, we choose $k=2$ and let $(a_{1}^{\ast},a_{2}^{\ast})=(e_{1},e_{2})$. We generate the gating parameter $w^{\ast}\triangleq w_{1}^{\ast}$ randomly from the uniform distribution on the $d$-dimensional unit sphere. In Figure 3(a) and Figure 3(b), we plot the individual parameter estimation error for $5$ different runs for both of our losses $L_{4}$ and $L_{\mathrm{log}}$, for learning the regressors and the gating parameter respectively. We can see that our algorithms are still able to learn the true parameters even when the orthogonality assumption is relaxed.

Relaxing Gaussianity of the input. To demonstrate the robustness of our approach to the assumption that the input $x$ is standard Gaussian, i.e. $x\sim\mathcal{N}(0,I_{d})$ , we generated $x$ according to a mixture of two symmetric Gaussians each with identity covariances, i.e. $x\sim p\mathcal{N}(\mu,I_{d})+(1-p)\mathcal{N}(-\mu,I_{d})$ , where $p\in[0,1]$ is the mixing probability and $\mu\in\mathbb{R}^{d}$ is a fixed but randomly chosen vector. For various mixing proportions $p\in\{0.1,0.2,0.3,0.4,0.5\}$ , we ran SGD on our loss $L_{4}$ to learn the regressors. Figure 3(c) highlights that we learn these ground truth parameters in all the settings.
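The non-Gaussian input distribution used here can be sampled as in the sketch below. The helper name `sample_symmetric_gaussian_mixture` and the seeding are ours; the experiment's actual $\mu$ is not specified beyond being fixed and randomly chosen, so it is left as an argument.

```python
import numpy as np

def sample_symmetric_gaussian_mixture(n, mu, p, seed=0):
    """Draw n samples from p * N(mu, I_d) + (1 - p) * N(-mu, I_d)."""
    rng = np.random.default_rng(seed)
    mu = np.asarray(mu, dtype=float)
    # Each sample is centred at +mu with probability p and at -mu otherwise.
    signs = np.where(rng.random(n) < p, 1.0, -1.0)
    return signs[:, None] * mu + rng.standard_normal((n, mu.shape[0]))
```

Note that the mean of this mixture is $(2p-1)\mu$, so the inputs are centred only when $p=0.5$; the other mixing proportions also probe robustness to a non-zero input mean.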

Finally we note that in all our experiments, the loss $L_{4}$ seems to require a larger batch size ($1024$) for its gradient estimation while running SGD. However, with smaller batch sizes such as $128$ we are still able to achieve similar performance but with more variance (see Appendix E).

5 Discussion

In this paper we established the first mathematical connection between two popular gated neural networks: GRU and MoE. Inspired by this connection and the success of SGD based algorithms in finding global minima in a variety of non-convex problems in deep learning, we provided the first gradient descent based approach for learning the parameters in a MoE. While the canonical MoE does not involve any time series, extension of our methods to the recurrent setting is an important future direction. Similarly, extensions to deep MoE comprised of multiple gated as well as non-gated layers are also a fruitful direction of further research. We believe that the theme of using different loss functions for distinct parameters in NN models can potentially yield new theoretical insights as well as practical methodologies for complex neural models.

Acknowledgements

We would like to thank the anonymous reviewers for their suggestions. This work is supported by NSF grants 1927712 and 1929955.

Appendix A Connection between $k$ -MoE and other popular models

Relation to other mixture models.

Notice that if we let $w_{i}^{\ast}=0$ in Eq. (1) for all $i\in[k]$, we recover the well-known uniform mixtures of generalized linear models (GLMs). Similarly, allowing for bias parameters in Eq. (1), we can recover the generic mixtures of GLMs. Moreover, if we let $g$ be the linear function, we get the popular mixtures of linear regressions model. These observations highlight that MoE models are a far stricter generalization of mixtures of GLMs since they allow the mixing probability $p_{i}^{\ast}(x)$ to depend on each input $x$ in a parametric way. This makes the learning of the parameters far more challenging since the gating and expert parameters are inherently coupled.

Relation to feed-forward neural networks.

Note that if we let $w_{i}^{\ast}=0$ and allow for bias parameters in the soft-max probabilities in Eq. (1), taking conditional expectation on both sides yields

[TABLE]

Thus the mapping $x\mapsto\hat{y}(x)$ is exactly the same as that of a $1$-hidden-layer neural network with activation function $g$ if we restrict the output layer to positive weights. Thus $k$-MoE can also be viewed as a probabilistic model for gated feed-forward networks.
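This correspondence can be checked numerically. The sketch below assumes zero gating weights with soft-max biases $b$ (a name we introduce), so that the mixing probabilities are the constants $\mathrm{softmax}(b)$, and compares a Monte Carlo estimate of $\mathbb{E}[y|x]$ at a fixed input with the output of a one-hidden-layer network whose output weights are $\mathrm{softmax}(b)$. The function names and the choice $g=\tanh$ are illustrative.

```python
import numpy as np

def softmax_vec(b):
    """Soft-max of a bias vector; entries are positive and sum to one."""
    e = np.exp(b - np.max(b))
    return e / e.sum()

def network_output(x, A, b, g=np.tanh):
    """One-hidden-layer network with hidden weights A and output weights softmax(b)."""
    return float(softmax_vec(b) @ g(A @ x))

def monte_carlo_moe_mean(x, A, b, sigma, n, g=np.tanh, seed=0):
    """Estimate E[y | x] by sampling the expert z and the Gaussian noise n times."""
    rng = np.random.default_rng(seed)
    z = rng.choice(len(b), size=n, p=softmax_vec(b))   # input-independent gating
    y = g(A @ x)[z] + sigma * rng.standard_normal(n)
    return float(y.mean())
```

The two quantities agree up to Monte Carlo error, and the output weights are positive by construction, which is the restriction mentioned above.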

Appendix B Valid class of non-linearities

We slightly modify the class of non-linearities from Makkuva et al., (2019) for our theoretical results. The only key modification is that we use fourth-order derivative based conditions, as opposed to the third-order derivatives used in the above work. Following their notation, let $Z\sim\mathcal{N}(0,1)$ and $Y|Z\sim\mathcal{N}(g(Z),\sigma^{2})$, where $g:\mathbb{R}\to\mathbb{R}$. For $(\alpha,\beta,\gamma,\delta)\in\mathbb{R}^{4}$, define

[TABLE]

where

[TABLE]

Similarly, define

[TABLE]

Condition 1.

$\mathbb{E}[\mathcal{S}_{4}^{\prime}(Z)]=\mathbb{E}[\mathcal{S}_{4}^{\prime\prime}(Z)]=\mathbb{E}[\mathcal{S}_{4}^{\prime\prime\prime}(Z)]=0$ and $\mathbb{E}[\mathcal{S}_{4}^{\prime\prime\prime\prime}(Z)]\neq 0$. Or equivalently, in view of Stein’s lemma (Stein, 1972),

[TABLE]

Condition 2.

$\mathbb{E}[\mathcal{S}_{2}^{\prime}(Z)]=0$ and $\mathbb{E}[\mathcal{S}_{2}^{\prime\prime}(Z)]\neq 0$ . Or equivalently,

[TABLE]

Definition 1.

We say that the non-linearity $g$ is $(\alpha,\beta,\gamma,\delta)-\mathrm{valid}$ if there exists a tuple $(\alpha,\beta,\gamma,\delta)\in\mathbb{R}^{4}$ such that both Condition 1 and Condition 2 are satisfied.

While these conditions might seem restrictive at first, all the widely used non-linearities such as $\mathrm{Id}$, ReLU, leaky-ReLU, sigmoid, etc. belong to this class. For some of these non-linear activations, we provide the pre-computed transformations below:

Example 1.

If $g=\mathrm{Id}$, then $\mathcal{Q}_{4}(y)=y^{4}-6y^{2}(1+\sigma^{2})$ and $\mathcal{Q}_{2}(y)=y^{2}$.

Example 2.

If $g=$ ReLU, i.e. $g(z)=\mathrm{max}\{0,z\}$ , we have that for any $p,q\in\mathbb{N}$ ,

[TABLE]

Substituting these moments in the linear set of equations $\mathbb{E}[\mathcal{S}_{4}(Z)Z]=\mathbb{E}[\mathcal{S}_{4}(Z)(Z^{2}-1)]=\mathbb{E}[\mathcal{S}_{4}(Z)(Z^{3}-3Z)]=0$ , we obtain

[TABLE]

Solving for $(\alpha,\beta,\gamma)$ will yield $\mathcal{S}_{4}(Z)$ . Finally, we have that $\delta=-2\sqrt{\frac{2}{\pi}}$ .

Appendix C Proofs of Section 3.1

Remark 2.

To choose the parameters in Theorem 1, we follow the parameter choices from Ge et al., (2018). Let $c$ be a sufficiently small universal constant (e.g. $c=0.01$ ). Assume $\mu\leq c/\kappa^{\ast}$ , and $\lambda\geq 1/(ca_{\min}^{\ast})$ . Let $\tau_{0}=c\min\left\{\mu/(\kappa da_{\max}^{\ast}),\lambda\right\}\sigma_{\min}(M)$ . Let $\delta\leq\min\left\{\frac{c\varepsilon_{0}}{a_{\max}^{\ast}\cdot m\sqrt{d}\kappa^{1/2}(M)},\tau_{0}/2\right\}$ and $\varepsilon=\min\left\{\lambda\sigma_{\min}(M)^{1/2},c\delta/\sqrt{\left\|M\right\|},c\varepsilon_{0}\delta\sigma_{\min}(M)\right\}$ .

For any $k\times d$ matrix $A$ , let $A^{\dagger}$ be its pseudo inverse such that $AA^{\dagger}=I_{k\times k}$ and $A^{\dagger}A$ is the projection matrix to the row span of $A$ . Let $\alpha_{i}^{\ast}\triangleq\mathbb{E}[p_{i}^{\ast}(x)],a_{i}^{\ast}=\frac{1}{\alpha_{i}^{\ast}}$ and $\kappa^{\ast}=\frac{\alpha_{\max}^{\ast}}{\alpha_{\min}^{\ast}}$ . Let $M=\sum_{i\in[k]}\alpha_{i}^{\ast}a_{i}^{\ast}(a_{i}^{\ast})^{\top}$ , $\kappa(M)=\frac{\left\|M\right\|}{\sigma_{\min}(M)}$ .
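The two pseudo-inverse properties used here hold whenever $A$ has full row rank, and can be verified numerically. In this sketch the helper name `row_space_projector` is ours, and `numpy.linalg.pinv` computes the Moore-Penrose pseudo-inverse.

```python
import numpy as np

def row_space_projector(A):
    """Return the pseudo-inverse of A and the projector onto the row span of A."""
    A_pinv = np.linalg.pinv(A)   # shape (d, k) for a (k, d) input
    P = A_pinv @ A               # (d, d) orthogonal projector onto the row span
    return A_pinv, P

# Example with a random full-row-rank matrix (k = 3, d = 10).
A_example = np.random.default_rng(0).standard_normal((3, 10))
A_pinv_example, P_example = row_space_projector(A_example)
print(np.allclose(A_example @ A_pinv_example, np.eye(3)))  # A A^+ = I_k
print(np.allclose(P_example @ P_example, P_example))       # P is idempotent
```

If the rows of $A$ were linearly dependent, $AA^{\dagger}$ would instead be a projector of rank smaller than $k$, which is why linear independence of the regressors is part of the assumptions.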

For the sake of clarity, we now formally state our main assumptions, adapted from Makkuva et al., (2019):

1. $x$ follows a standard Gaussian distribution, i.e. $x\sim\mathcal{N}(0,I_{d})$.

2. $\left\|a_{i}^{\ast}\right\|=1$ for all $i\in[k]$ and $\left\|w_{i}^{\ast}\right\|\leq R$ for all $i\in[k-1]$.

3. The regressors $a_{1}^{\ast},\ldots,a_{k}^{\ast}$ are linearly independent and the classifiers $\{w_{i}^{\ast}\}_{i\in[k-1]}$ are orthogonal to the span $\mathcal{S}=\mathrm{span}\left\{a_{1}^{\ast},\ldots,a_{k}^{\ast}\right\}$, and $2k-1<d$.

4. The non-linearity $g:\mathbb{R}\to\mathbb{R}$ is $(\alpha,\beta,\gamma,\delta)-\mathrm{valid}$, which we define in Appendix B.

Note that while the first three assumptions are the same as those of Makkuva et al., (2019), the fourth assumption is slightly different from theirs. Under these assumptions, we first give an alternative characterization of $L_{4}(\cdot)$ in the following theorem, which is crucial for the proof of Theorem 1.

Theorem 5.

The function $L_{4}(\cdot)$ defined in Eq. (6) satisfies

[TABLE]

C.1 Proof of Theorem 5

Proof.

For the proof of Theorem 5, we use the notion of score functions defined as in Janzamin et al., (2014):

[TABLE]

In this paper we focus on $m=2,4$ . When $x\sim\mathcal{N}(0,I_{d})$ , we know that $\mathcal{S}_{2}(x)=x\otimes x-I$ and

[TABLE]

The score transformations $\mathcal{S}_{4}(x)$ and $\mathcal{S}_{2}(x)$ can be viewed as multi-variate polynomials in $x$ of degrees $4$ and $2$ respectively. For the output $y$ , recall the transforms $\mathcal{Q}_{4}(y)$ and $\mathcal{Q}_{2}(y)$ defined in Section 3.1. The following lemma shows that one can construct a fourth-order super symmetric tensor using these special transforms.

Lemma 1 (Super symmetric tensor construction).

Let $(x,y)$ be generated according to Eq. (1) and Assumptions $(1)$ - $(4)$ hold. Then

[TABLE]

where $p_{i}^{\ast}(x)=\mathbb{P}\left[z_{i}=1|x\right]$ , $c_{g,\sigma}$ and $c^{\prime}_{g,\sigma}$ are two non-zero constants depending on $g$ and $\sigma$ .

Now the proof of the theorem immediately follows from Lemma 1. Recall from Eq. (6) that

[TABLE]

Fix $i,j\in[k]$ . Notice that we have $t_{1}(a_{i},a_{j},x)=\mathcal{S}_{4}(x)(a_{i},a_{i},a_{j},a_{j})/c_{g,\sigma}$ . Hence we obtain

[TABLE]

The simplification for the remaining terms is similar and follows directly from definitions of $t_{2}(\cdot,x)$ and $t_{3}(\cdot,x)$ . ∎

C.2 Proof of Theorem 1

Proof.

The proof is an immediate consequence of Theorem 5 and Theorem C.5 of Ge et al., (2018). ∎

C.3 Proof of Theorem 2

Proof.

Note that our loss function $L_{4}(A)$ can be written as $\mathbb{E}[\ell(x,y,A)]$ where $\ell$ is at most a fourth degree polynomial in $x$ , $y$ and $A$ . Hence our finite sample guarantees directly follow from Theorem 1 and Theorem E.1 of Ge et al., (2018). ∎

C.4 Proof of Lemma 1

Proof.

The proof of this lemma essentially follows the same arguments as that of (Makkuva et al., 2019, Theorem 1), where we replace $(\mathcal{S}_{3}(x),\mathcal{S}_{2}(x),\mathcal{P}_{3}(y),\mathcal{P}_{2}(y))$ with $(\mathcal{S}_{4}(x),\mathcal{S}_{2}(x),\mathcal{Q}_{4}(y),\mathcal{P}_{2}(y))$ respectively, and replace $\mathcal{T}_{3}$ defined there with our $\mathcal{T}_{4}$ defined above.

∎

Appendix D Proofs of Section 3.2

For the convergence analysis of SGD on $L_{\mathrm{\log}}$, we use techniques from Balakrishnan et al., (2017) and Makkuva et al., (2019). In particular, we adapt (Makkuva et al., 2019, Lemma 3) and (Makkuva et al., 2019, Lemma 4) to our setting through Lemma 2 and Lemma 3, which are central to the proof of Theorem 3 and Theorem 4. We now state our lemmas.

Lemma 2.

Under the assumptions of Theorem 3, it holds that

[TABLE]

In addition, $W=W^{\ast}$ is a fixed point for $G(W,A^{\ast})$ .

Lemma 3.

Let the matrix of regressors $A$ be such that $\max_{i\in[k]}\|A_{i}^{\top}-(A^{\ast}_{i})^{\top}\|_{2}=\sigma^{2}\varepsilon$ . Then for any $W\in\Omega$ , we have that

[TABLE]

where $\kappa$ is a constant depending on $g,k$ and $\sigma$ . In particular, $\kappa\leq(k-1)\frac{\sqrt{6(2+\sigma^{2})}}{2}$ for $g=$ linear, sigmoid and ReLU.

Lemma 4 (Deviation of finite sample gradient operator).

For some universal constant $c_{1}$ , let the number of samples $n$ be such that $n\geq c_{1}d\log(1/\delta)$ . Then for any fixed set of regressors $A\in\mathbb{R}^{k\times d}$ , and a fixed $W\in\Omega$ , the bound

[TABLE]

holds with probability at least $1-\delta$ .

D.1 Proof of Theorem 3

Proof.

The proof directly follows from Lemma 2 and Lemma 3. ∎

D.2 Proof of Theorem 4

Proof.

Let the set of regressors $A$ be such that $\max_{i\in[k]}\|A_{i}^{\top}-(A^{\ast}_{i})^{\top}\|_{2}=\sigma^{2}\varepsilon_{1}$ . Fix $A$ . For any iteration $t\in[T]$ , from Lemma 4 we have the bound

[TABLE]

with probability at least $1-\delta/T$. Using a union bound argument, Eq. (11) holds with probability at least $1-\delta$ for all $t\in[T]$. Now we show that the following bound holds:

[TABLE]

Indeed, for any $t\in\{0,\ldots,T-1\}$ , we have that

[TABLE]

where we used Lemma 2, Lemma 3 and Lemma 4 in the last inequality to bound each of the terms. From Eq. (11), we obtain that

[TABLE]

∎

D.3 Proof of Lemma 2

Proof.

Recall that the loss function for the population setting, $L_{\log}(W,A)$ , is given by

[TABLE]

where $p_{i}(x)\triangleq\frac{e^{\langle{w_{i}},{x}\rangle}}{\sum_{j\in[k]}e^{\langle{w_{j}},{x}\rangle}}$ and $N_{i}\triangleq\mathcal{N}(y|g(\langle{a_{i}},{x}\rangle),\sigma^{2})$ . Hence for any $i\in[k-1]$ , we have

[TABLE]

Moreover,

[TABLE]

Hence we obtain that

[TABLE]

Notice that if $z\in[k]$ denotes the latent variable corresponding to which expert is chosen, we have that the posterior probability of choosing the $i$ th expert is given by

[TABLE]

whereas,

[TABLE]

Hence, when $A=A^{\ast}$ and $W=W^{\ast}$ , we get that

[TABLE]

Thus $W=W^{\ast}$ is a fixed point for $G(W,A^{\ast})$ since

[TABLE]

Now we make the observation that the population-gradient updates $W_{t+1}=G(W_{t},A)$ are the same as the gradient-EM updates. Thus the contraction of the population-gradient operator $G(\cdot,A^{\ast})$ follows from the contraction property of the gradient EM algorithm (Makkuva et al., 2019, Lemma 3). To see this, recall that for $k$-MoE, the gradient-EM algorithm involves computing the function $Q(W|W_{t})$ for the current iterate $W_{t}$, defined as:

[TABLE]

where $p_{W_{t}}^{(i)}=\mathbb{P}\left[z=i|x,y,W_{t}\right]$ corresponds to the posterior probability for the $i^{\mathrm{th}}$ expert, given by

[TABLE]

Then the next iterate of the gradient-EM algorithm is given by $W_{t+1}=\Pi_{\Omega}(W_{t}+\alpha\nabla_{W}Q(W|W_{t})|_{W=W_{t}})$. We have that

[TABLE]

Hence if we use the same step size $\alpha$, our population-gradient iterates on the log-likelihood are the same as the gradient-EM iterates. This finishes the proof. ∎
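The equivalence between the two updates can be sanity-checked on a finite sample with numerical gradients. The sketch below is an illustration under assumptions: we take $g=\mathrm{Id}$, keep only the gating part of the EM surrogate, $Q(W|W_{t})=\mathbb{E}[\sum_{i\in[k]}p_{W_{t}}^{(i)}\log p_{i}(x)]$ (the expert term does not depend on $W$), replace expectations by sample means, and use function names of our own choosing.

```python
import numpy as np

def _logsumexp(M):
    m = M.max(axis=1, keepdims=True)
    return m[:, 0] + np.log(np.exp(M - m).sum(axis=1))

def _logits(W, X):
    return np.hstack([X @ W.T, np.zeros((X.shape[0], 1))])  # w_k = 0

def _log_expert(A, X, y, sigma):
    # log N(y | <a_i, x>, sigma^2) with g = Id, additive constants dropped
    return -((y[:, None] - X @ A.T) ** 2) / (2.0 * sigma ** 2)

def nll(W, A, X, y, sigma):
    """Empirical negative log-likelihood as a function of W."""
    lg = _logits(W, X)
    return -np.mean(_logsumexp(lg + _log_expert(A, X, y, sigma)) - _logsumexp(lg))

def em_surrogate(W, W_t, A, X, y, sigma):
    """Gating part of Q(W | W_t): mean of sum_i posterior_i(W_t) * log p_i(x; W)."""
    joint_t = _logits(W_t, X) + _log_expert(A, X, y, sigma)
    post_t = np.exp(joint_t - _logsumexp(joint_t)[:, None])
    lg = _logits(W, X)
    log_p = lg - _logsumexp(lg)[:, None]
    return np.mean((post_t * log_p).sum(axis=1))

def finite_diff_grad(f, W, eps=1e-6):
    """Central finite-difference gradient of a scalar function f at W."""
    grad = np.zeros_like(W)
    for idx in np.ndindex(*W.shape):
        step = np.zeros_like(W)
        step[idx] = eps
        grad[idx] = (f(W + step) - f(W - step)) / (2.0 * eps)
    return grad
```

Evaluating both gradients at $W=W_{t}$ should give $\nabla_{W}Q(W|W_{t})=-\nabla_{W}L_{\log}^{(n)}(W,A)$ up to finite-difference error, so an ascent step on $Q$ coincides with a descent step on the negative log-likelihood.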

D.4 Proof of Lemma 3

Proof.

Fix any $W\in\Omega$ and let $A=\begin{bmatrix}a_{1}^{\top}\\ \ldots\\ a_{k}^{\top}\end{bmatrix}\in\mathbb{R}^{k\times d}$ be such that $\max_{i\in[k]}\left\|a_{i}-a_{i}^{\ast}\right\|_{2}=\sigma^{2}\varepsilon_{1}$ for some $\varepsilon_{1}>0$ . Let

[TABLE]

Denoting the $i^{\mathrm{th}}$ row of $W^{\prime}\in\mathbb{R}^{(k-1)\times d}$ by $w^{\prime}_{i}$ and that of $(W^{\prime})^{\ast}$ by $(w^{\prime}_{i})^{\ast}$ for any $i\in[k-1]$ , we have that

[TABLE]

Thus it suffices to bound $\left\|\nabla_{w_{i}}L_{\log}(W,A)-\nabla_{w_{i}}L_{\log}(W,A^{\ast})\right\|_{2}$ . From Eq. (13), we have that

[TABLE]

where,

[TABLE]

Thus we have

[TABLE]

where $p^{(i)}(A,W)\triangleq\frac{p_{i}(x)N_{i}}{\sum_{j\in[k]}p_{j}(x)N_{j}}$ denotes the posterior probability of choosing the $i^{\mathrm{th}}$ expert. Now we observe that Eq. (14) reduces to the setting of (Makkuva et al., 2019, Lemma 4) and hence the conclusion follows.

∎

D.5 Proof of Lemma 4

Proof.

We first prove the lemma for $k=2$ . For $2$ -MoE, we have that the posterior probability is given by

[TABLE]

where $f(\cdot)=\frac{1}{1+e^{-(\cdot)}},N_{1}=\mathcal{N}(y|g(a_{1}^{\top}x),\sigma^{2})$ and $N_{2}=\mathcal{N}(y|g(a_{2}^{\top}x),\sigma^{2})$ for fixed $a_{1},a_{2}\in\mathbb{R}^{d}$ . Then we have that

[TABLE]

Hence

[TABLE]

Since $0<\alpha<1$ , we have that

[TABLE]

We now bound $T_{1}$ and $T_{2}$ .

Bounding $T_{2}$ : We prove that the random variable $\sum_{i\in[n]}\frac{f(w^{\top}x_{i})x_{i}}{n}-\mathbb{E}[f(w^{\top}x)x]$ is sub-gaussian with parameter $L/\sqrt{n}$ for some constant $L>1$ and thus its squared norm is sub-exponential. We then bound $T_{2}$ using standard sub-exponential concentration bounds. Towards the same, we first show that the random variable $f(w^{\top}x)x-\mathbb{E}[f(w^{\top}x)x]$ is sub-gaussian with parameter $L$ . Or equivalently, that $f(w^{\top}x)\langle{x},{u}\rangle-\mathbb{E}[f(w^{\top}x)\langle{x},{u}\rangle]$ is sub-gaussian for all $u\in\mathbb{S}^{d}$ .

Without loss of generality, assume that $w\neq 0$ . First let $u=\vec{w}\triangleq\frac{w}{\left\|w\right\|}$ . Thus $Z\triangleq\langle{\vec{w}},{x}\rangle\sim\mathcal{N}(0,1)$ . We have

[TABLE]

It follows that $g(\cdot)$ is Lipschitz since

[TABLE]

From the Talagrand concentration of Gaussian measure for Lipschitz functions (Ledoux and Talagrand, 1991), it follows that $g(Z)$ is sub-gaussian with parameter $L$. Now consider any $u\in\mathbb{S}^{d}$ such that $u\perp w$. Then we have that $Y\triangleq\langle{u},{x}\rangle\sim\mathcal{N}(0,1)$ and $Z\triangleq\langle{\vec{w}},{x}\rangle\sim\mathcal{N}(0,1)$ are independent. Thus,

[TABLE]

is sub-gaussian with parameter $1$ since $f\in[0,1]$ and $Y,Z$ are independent standard gaussians. Since any $u\in\mathbb{S}^{d}$ can be written as

[TABLE]

where $P_{S}$ denotes the projection operator onto the sub-space $S$ , we have that $f(w^{\top}x)\langle{x},{u}\rangle-\mathbb{E}[f(w^{\top}x)\langle{x},{u}\rangle]$ is sub-gaussian with parameter $L$ for all $u\in\mathbb{S}^{d}$ . Thus it follows that $\sum_{i\in[n]}\frac{f(w^{\top}x_{i})x_{i}}{n}-\mathbb{E}[f(w^{\top}x)x]$ is zero-mean and sub-gaussian with parameter $L/\sqrt{n}$ which further implies that

[TABLE]

with probability at least $1-\delta/2$ .

Bounding $T_{1}$ : Let $Z\triangleq\|\sum_{i\in[n]}\frac{p_{w}(x_{i},y_{i})x_{i}}{n}-\mathbb{E}[p_{w}(x,y)x]\|_{2}=\sup_{u\in\mathbb{S}^{d}}Z(u)$ , where

[TABLE]

Let $\{u_{1},\ldots,u_{M}\}$ be a $1/2$ -cover of the unit sphere $\mathbb{S}^{d}$ . Hence for any $v\in\mathbb{S}^{d}$ , there exists a $j\in[M]$ such that $\left\|v-u_{j}\right\|_{2}\leq 1/2$ . Thus,

[TABLE]

where we used the fact that $|Z(u)-Z(v)|\leq Z\left\|u-v\right\|_{2}$ for any $u,v\in\mathbb{S}^{d}$. Now taking supremum over all $v\in\mathbb{S}^{d}$ yields that $Z\leq 2\max_{j\in[M]}Z(u_{j})$. Now we bound $Z(u)$ for a fixed $u\in\mathbb{S}^{d}$. By the symmetrization trick (Vaart and Wellner, 1996), we have

[TABLE]

where $\varepsilon_{1},\ldots,\varepsilon_{n}$ are i.i.d. Rademacher variables. Define the event $E\triangleq\{\frac{1}{n}\sum_{i\in[n]}\langle{x_{i}},{u}\rangle^{2}\leq 2\}$ . Since $\langle{x_{i}},{u}\rangle\sim\mathcal{N}(0,1)$ , standard tail bounds imply that $\mathbb{P}\left[E^{c}\right]\leq e^{-n/32}$ . Thus we have that

[TABLE]

Considering the first term, for any $\lambda>0$ , we have

[TABLE]

where we used the Ledoux-Talagrand contraction for Rademacher processes (Ledoux and Talagrand, 1991), since $|p_{w}(x_{i},y_{i})|\leq 1$ for all $(x_{i},y_{i})$. The sub-gaussianity of the Rademacher sequence $\{\varepsilon_{i}\}$ implies that

[TABLE]

using the definition of the event $E$ . Thus the above bound on the moment generating function implies the following tail bound:

[TABLE]

Combining all the bounds together, we obtain that

[TABLE]

Since $M\leq 2^{d}$ , using the union bound we obtain that

[TABLE]

Since $n\geq c_{1}d\log(1/\delta)$ , we have that $T_{1}=Z\leq c\sqrt{\frac{d\log(1/\delta)}{n}}$ with probability at least $1-\delta/2$ . Combining these bounds on $T_{1}$ and $T_{2}$ yields the final bound on $\varepsilon_{G}(n,\delta)$ .

Now consider any $k\geq 2$ . From Eq. (13), defining $N_{i}\triangleq\mathcal{N}(y|g(a_{i}^{\top}x),\sigma^{2})$ and $p_{i}(x)=\frac{e^{w_{i}^{\top}x}}{1+\sum_{j\in[k-1]}e^{w_{j}^{\top}x}}$ , we have that

[TABLE]

Similarly,

[TABLE]

Since $\left\|G_{n}(W,A)-G(W,A)\right\|=\max_{i\in[k-1]}\left\|G_{n}(W,A)_{i}-G(W,A)_{i}\right\|_{2}$, without loss of generality, we let $i=1$. The proof for the other cases is similar. Thus we have

[TABLE]

where $p^{(1)}(x,y)\triangleq\frac{p_{1}(x)N_{1}}{\sum_{i\in[k]}p_{i}(x)N_{i}}$. Since $|p^{(1)}(x,y)|\leq 1$ and $|p_{1}(x)|\leq 1$, we can use the same argument as in the bound on $T_{1}$ in the proof for $2$-MoE above to get the parametric bound. This finishes the proof. ∎

Appendix E Additional experiments

E.1 Reduced batch size

In Figure 4 we ran SGD on our loss $L_{4}(\cdot)$ for $5$ different runs with a batch size of $128$ and a learning rate of $0.001$ for $d=10$ and $k=3$. We can see that our algorithm still converges to zero but with more variance, because of noisy gradient estimation and because the number of samples is smaller than the required sample complexity.

Bibliography

Alaeddini, A., Alemzadeh, S., Mesbahit, A., and Mesbahi, M. (2018). Linear model regression on time-series data: Non-asymptotic error bounds and applications. In 2018 IEEE Conference on Decision and Control (CDC), pages 2259–2264. IEEE.
Allen-Zhu, Z. and Li, Y. (2019). Can SGD learn recurrent neural networks with provable generalization? arXiv preprint arXiv:1902.01028.
Allen-Zhu, Z., Li, Y., and Song, Z. (2018). On the convergence rate of training recurrent neural networks. arXiv preprint arXiv:1810.12065.
Arora, S., Hazan, E., Lee, H., Singh, K., Zhang, C., and Zhang, Y. (2018). Towards provable control for unknown linear dynamical systems.
Balakrishnan, S., Wainwright, M. J., and Yu, B. (2017). Statistical guarantees for the EM algorithm: From population to sample-based analysis. The Annals of Statistics, 45(1):77–120.
Bandeira, A. S., Boumal, N., and Voroninski, V. (2016). On the low-rank approach for semidefinite programs arising in synchronization and community detection. arXiv preprint arXiv:1602.04426.
Bhojanapalli, S., Neyshabur, B., and Srebro, N. (2016). Global optimality of local search for low rank matrix recovery. arXiv preprint arXiv:1605.07221.
Cho, K., van Merrienboer, B., Bahdanau, D., and Bengio, Y. (2014). On the properties of neural machine translation: Encoder-decoder approaches. abs/1409.1259.
