Learning in Gated Neural Networks
Ashok Vardhan Makkuva, Sewoong Oh, Sreeram Kannan, Pramod Viswanath

TL;DR
This paper analyzes the optimization landscape of gated neural networks, showing that with specially designed loss functions, gradient descent can accurately recover parameters, supported by new sample complexity results and improved numerical performance.
Contribution
It introduces two distinct loss functions for better parameter recovery in mixture-of-experts models and provides the first sample complexity analysis for this problem.
Findings
Gradient descent can learn parameters accurately with proper loss functions.
Two specialized loss functions improve parameter recovery.
First sample complexity results for mixture-of-experts models.
Ashok Vardhan Makkuva∗
Sreeram Kannan†
Sewoong Oh†
Pramod Viswanath∗
∗University of Illinois at Urbana-Champaign
†University of Washington
Abstract
Gating is a key feature in modern neural networks including LSTMs, GRUs and sparsely-gated deep neural networks. The backbone of such gated networks is a mixture-of-experts layer, where several experts make regression decisions and gating controls how to weigh the decisions in an input-dependent manner. Despite having such a prominent role in both modern and classical machine learning, very little is understood about parameter recovery of mixture-of-experts since gradient descent and EM algorithms are known to be stuck in local optima in such models.
In this paper, we perform a careful analysis of the optimization landscape and show that with appropriately designed loss functions, gradient descent can indeed learn the parameters of a MoE accurately. A key idea underpinning our results is the design of two distinct loss functions, one for recovering the expert parameters and another for recovering the gating parameters. We demonstrate the first sample complexity results for parameter recovery in this model for any algorithm and demonstrate significant performance gains over standard loss functions in numerical experiments.
1 Introduction
In recent years, gated recurrent neural networks (RNNs) such as LSTMs and GRUs have shown remarkable successes in a variety of challenging machine learning tasks such as machine translation, image captioning, image generation, handwriting generation, and speech recognition (Sutskever et al., 2014; Vinyals et al., 2014; Graves et al., 2013; Gregor et al., 2015; Graves, 2013). A key aspect of these architectures, and an important reason behind their success, is the presence of a gating mechanism that dynamically controls the flow of past information to the current state at each time instant. In addition, it is well known that these gates prevent the vanishing (and exploding) gradient problem inherent to traditional RNNs (Hochreiter and Schmidhuber, 1997).
Surprisingly, despite their widespread popularity, there is very little theoretical understanding of these gated models. In fact, basic questions such as learnability of the parameters still remain open. Even for the simplest vanilla RNN architecture, this question was open until the very recent works of Allen-Zhu et al. (2018) and Allen-Zhu and Li (2019), which provided the first theoretical guarantees of SGD for vanilla RNN models in the presence of non-linear activations. While this demonstrates that the theoretical analysis of these simpler models has itself been a challenging task, gated RNNs have an additional level of complexity in the form of gating mechanisms, which further increases the difficulty of the problem. This motivates us to ask the following question:
Question 1.
Given the complicated architectures of LSTMs/GRUs, can we find analytically tractable sub-structures of these models?
We believe that addressing the above question can provide new insights into a principled understanding of gated RNNs. In this paper, we make progress towards this and provide a positive answer to the question. In particular, we make a non-trivial connection that a GRU (gated recurrent unit) can be viewed as a time-series extension of a basic building block, known as Mixture-of-Experts (MoE) (Jacobs et al., 1991; Jordan and Jacobs, 1994). In fact, much like LSTMs/GRUs, MoE is itself a widely popular gated neural network architecture and has found success in a wide range of applications (Tresp, 2001; Collobert et al., 2002; Rasmussen and Ghahramani, 2002; Yuksel et al., 2012; Masoudnia and Ebrahimpour, 2014; Ng and Deisenroth, 2014; Eigen et al., 2014). In recent years, there has also been growing interest in the fields of natural language processing and computer vision in building complex neural networks that incorporate MoE models to address challenging tasks such as machine translation (Gross et al., 2017; Shazeer et al., 2017). Hence the main goal of this paper is to study MoE in close detail, especially with regard to the learnability of its parameters.
The canonical MoE model is the following: Let $k \in \mathbb{N}$ denote the number of mixture components (or equivalently neurons). Let $x \in \mathbb{R}^d$ be the input vector and $y \in \mathbb{R}$ be the corresponding output. Then the relationship between $x$ and $y$ is given by:
$$y = \sum_{i=1}^{k} z_i \, g(\langle a_i^*, x \rangle) + \xi, \qquad \xi \sim \mathcal{N}(0, \sigma^2), \tag{1}$$
where $g$ is a non-linear activation function, $\xi$ is a Gaussian noise independent of $x$ and the latent Bernoulli random variable $z_i \in \{0, 1\}$ indicates which expert has been chosen. In particular, only a single expert is active at any time, i.e. $\sum_{i=1}^{k} z_i = 1$, and their probabilities are modeled by a soft-max function:
$$\mathbb{P}[z_i = 1 \mid x] = \frac{e^{\langle w_i^*, x \rangle}}{\sum_{j=1}^{k} e^{\langle w_j^*, x \rangle}}.$$
Following the standard convention (Makkuva et al., 2019; Jacobs et al., 1991), we refer to the vectors $a_i^*$ as regressors, the vectors $w_i^*$ as either classifiers or gating parameters, and without loss of generality, we assume that $w_k^* = 0$.
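As a rough illustration of the generative model described above, the following sketch draws one output for a given input: the gating soft-max picks a single expert, and that expert's activation plus Gaussian noise is returned. The function names, the choice of `tanh` as the activation, and the default noise level are assumptions made for this example, not details fixed by the paper.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(ui * vi for ui, vi in zip(u, v))

def sample_moe(x, regressors, gates, g=math.tanh, sigma=0.1, rng=random):
    """Draw (y, expert index) for one input x.

    regressors: k vectors, one per expert.
    gates: k gating vectors; by convention the last one may be all zeros.
    g, sigma: illustrative activation and noise level (assumptions).
    """
    probs = softmax([dot(w, x) for w in gates])
    # Exactly one expert is active: sample its index from the soft-max.
    i = rng.choices(range(len(regressors)), weights=probs, k=1)[0]
    y = g(dot(regressors[i], x)) + rng.gauss(0.0, sigma)
    return y, i
```

With `sigma=0.0` the returned `y` coincides with the chosen expert's activation, which is the noise-free setting used for intuition later in the paper.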
Despite the canonical nature of the MoE model and the significant research effort devoted to it, the problem of learning MoE parameters is poorly understood theoretically. In fact, the task of learning the parameters of a MoE, i.e. the regressors and the gating parameters, with provable guarantees has been a long-standing open problem for more than two decades (Sedghi et al., 2014). One of the key technical difficulties is that in a MoE, there is an inherent coupling between the regressors and the gating parameters, as can be seen from Eq. (1), which makes the problem challenging (Ho et al., 2019). In a recent work, Makkuva et al. (2019) provided the first consistent algorithms for learning MoE parameters with theoretical guarantees. In order to tackle the aforementioned coupling issue, they proposed a clever scheme that first estimates the regressor parameters and then estimates the gating parameters, using a combination of spectral methods and the EM algorithm. However, a major drawback is that this approach requires specially crafted algorithms for learning each of these two sets of parameters. In addition, it lacks finite sample guarantees. Since SGD and its variants remain the de facto algorithms for training neural networks because of their practical advantages, and inspired by the successes of these gradient-descent based algorithms in finding global minima in a variety of non-convex problems, we ask the following question:
Question 2.
How do we design objective functions amenable to efficient optimization techniques, such as SGD, with provable learning guarantees for MoE?
In this paper, we address this question in a principled manner and propose two non-trivial non-convex loss functions, one to learn the regressors and the other to learn the gating parameters. In particular, our loss functions possess nice landscape properties such as local minima being global and the global minima corresponding to the ground truth parameters. We also show that gradient descent on our losses can recover the true parameters with global/random initializations. To the best of our knowledge, ours is the first GD based approach with finite sample guarantees to learn the parameters of MoE. While our procedure of learning the regressors and the gating parameters separately and the technical assumptions are similar in spirit to Makkuva et al. (2019), our loss function based approach with provable guarantees for SGD is significantly different from theirs. We summarize our main contributions below:
- MoE as a building block for GRU: We provide the first connection that the well-known GRU models are composed of basic building blocks, known as MoE. This link provides important insights into theoretical understanding of GRUs and further highlights the importance of MoE.
- Optimization landscape design with desirable properties: We design two non-trivial loss functions to learn the regressors and the gating parameters of a MoE separately. We show that our loss functions have nice landscape properties and are amenable to simple local-search algorithms. In particular, we show that SGD on our novel loss functions recovers the parameters with global/random initializations.
- First sample complexity results: We also provide the first sample complexity results for MoE. We show that our algorithms can recover the true parameters with accuracy $\varepsilon$ and with high probability, when provided with samples polynomial in the dimension $d$ and $1/\varepsilon$.
Related work. Linear dynamical systems can be thought of as the linear version of RNNs. There is a huge literature on the topic of learning these linear systems (Alaeddini et al., 2018; Arora et al., 2018; Dean et al., 2017, 2018; Marecek and Tchrakian, 2018; Oymak and Ozay, 2018; Simchowitz et al., 2018; Hardt et al., 2018). However, these works are very specific to the linear setting and do not extend to non-linear RNNs. Allen-Zhu et al. (2018) and Allen-Zhu and Li (2019) are two recent works that provide the first theoretical guarantees for learning RNNs with the ReLU activation function. However, it is unclear how these techniques generalize to gated architectures. In this paper, we focus on the learnability of MoE, which are the building blocks for these gated models.
While there is a huge body of work on MoEs (see Yuksel et al. (2012) and Masoudnia and Ebrahimpour (2014) for detailed surveys), the problem of learning MoE parameters is theoretically less understood, with very few works on it. Jordan and Xu (1995) is one of the early works that showed the local convergence of EM. In a recent work, Makkuva et al. (2019) provided the first consistent algorithms for MoE in the population setting using a combination of spectral methods and the EM algorithm. However, they do not provide any finite sample complexity bounds. In this work, we provide a unified approach using GD to learn the parameters with finite sample guarantees. To the best of our knowledge, we give the first gradient-descent based method with consistent learning guarantees, as well as the first finite-sample guarantee for any algorithm. Designing loss functions and analyzing their landscapes is an active research topic in a wide variety of machine learning problems: neural networks (Hardt and Ma, 2017; Kawaguchi, 2016; Li and Yuan, 2017; Panigrahy et al., 2017; Zhong et al., 2017; Ge et al., 2018; Gao et al., 2019), matrix completion (Bhojanapalli et al., 2016), community detection (Bandeira et al., 2016), and orthogonal tensor decomposition (Ge et al., 2015). In this work, we present the first objective function design and landscape analysis for MoE.
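Notation. We denote the $\ell_2$-Euclidean norm by $\|\cdot\|$. $[d] \triangleq \{1, \ldots, d\}$. $\{e_i\}_{i=1}^{d}$ denotes the standard basis vectors in $\mathbb{R}^d$. We denote matrices by capital letters like $A$, $W$, etc. For any two vectors $x, y \in \mathbb{R}^d$, we denote their Hadamard product by $x \odot y$. $\sigma(\cdot)$ denotes the sigmoid function $\sigma(z) = 1/(1 + e^{-z})$. For any $z \in \mathbb{R}^k$, $\mathrm{softmax}_i(z) = e^{z_i} / \sum_{j=1}^{k} e^{z_j}$. $\mathcal{N}(\mu, \Sigma)$ denotes the Gaussian distribution with mean $\mu$ and covariance $\Sigma$. Throughout the paper, we interchangeably denote the regressors as $\{a_i\}$ or $A$, and the gating parameters as $\{w_i\}$ or $W$.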
Overview. The rest of the paper is organized as follows: In Section 2, we establish the precise mathematical connection between the well known GRU model and the MoE model. Building upon this correspondence, which highlights the importance of MoE, in Section 3 we design two novel loss functions to learn the respective regressors and gating parameters of a MoE and present our theoretical guarantees. In Section 4, we empirically validate that our proposed losses perform much better than the current approaches on a variety of settings.
2 GRU as a hierarchical MoE
In this section, we show that the recurrent update equations for GRU can be obtained from that of MoE, described in Eq. (1). In particular, we show that GRU can be viewed as a hierarchical MoE with depth-2. To see this, we restrict to the setting of a -MoE, i.e. let and , and in Eq. (1). Then we obtain that
[TABLE]
where and . Since is a zero mean random variable independent of , taking conditional expectation on both sides of Eq. (2) yields that
[TABLE]
Now letting the output to be a vector and allowing for different gating parameters and regressors along each dimension , we obtain
[TABLE]
where with , and denote the matrix of regressors corresponding to first and second experts respectively.
We now show that Eq. (3) is the basic equation behind the updates in GRU. Recall that in a GRU, given a time series of sequence length , the goal is to produce a sequence of hidden states such that the output time series is close to in some well-defined loss metric, where denotes the non-linear activation of the last layer. The equations governing the transition dynamics between and at any time are given by (Cho et al.,, 2014):
[TABLE]
where and denote the update and reset gates, which are given by
[TABLE]
where the weight matrices, with appropriate subscripts, are parameters to be learnt. While the gating activation function is modeled as a sigmoid for ease of obtaining gradients during training, the intended purpose of the gates was to operate as binary valued gates taking values in $\{0, 1\}$. Indeed, in a recent work, Li et al. (2018) show that binary valued gates enhance robustness, offer more interpretability, and also give better performance compared to their continuous valued counterparts. In view of this, letting the gating activation be the binary threshold function, we obtain that
[TABLE]
Letting and in Eq. (3) with the second expert replaced by a 2-MoE, we can see from Eq. (4) that GRU is a depth-2 hierarchical MoE. This is also illustrated in Figure 2.
Note that in Figure 2, NN-1 models the mapping , NN-2 represents , and NN-3 models . Hence, this is slightly different from the traditional MoE setting in Eq. (1) where the same activation is used for all the nodes. Nonetheless, we believe that studying this canonical model is a crucial first step which can shed important insights for a general setting.
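To make the gating structure of this section concrete, here is a minimal sketch of a single GRU step in plain Python. It follows one common convention for the update rule (new state = (1 - z) * previous state + z * candidate); the paper's own equations may assign the gate the other way round, and the parameter names `U_z`, `W_z`, `U_r`, `W_r`, `U_h`, `W_h` are assumptions chosen for this example.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def matvec(M, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def add(u, v):
    return [a + b for a, b in zip(u, v)]

def gru_step(x, h_prev, U_z, W_z, U_r, W_r, U_h, W_h, g=math.tanh):
    """One GRU update under the convention stated in the lead-in."""
    # Update gate and reset gate, both sigmoid-valued in (0, 1).
    z = [sigmoid(v) for v in add(matvec(U_z, x), matvec(W_z, h_prev))]
    r = [sigmoid(v) for v in add(matvec(U_r, x), matvec(W_r, h_prev))]
    # The reset gate masks the previous state inside the candidate.
    gated_h = [ri * hi for ri, hi in zip(r, h_prev)]
    cand = [g(v) for v in add(matvec(U_h, x), matvec(W_h, gated_h))]
    # Per coordinate, the update gate mixes two "experts":
    # the previous state and the candidate state.
    return [(1.0 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h_prev, cand)]
```

Replacing the sigmoid gates by hard 0/1 thresholds turns each coordinate into a choice between two experts, which is the mixture-of-experts reading described above.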
3 Optimization landscape design for MoE
In the previous section, we presented the mathematical connection between the GRU and the MoE. In this section, we focus on the learnability of the MoE model and design two novel loss functions for learning the regressors and the gating parameters separately.
3.1 Loss function for regressors:
To motivate the need for loss function design in a MoE, first we take a moment to highlight the issues with the traditional approach of using the mean square loss . If are generated according to the ground-truth MoE model in Eq. (1), computes the quadratic cost between the expected predictions and the ground-truth , i.e.
[TABLE]
Here the predicted output is computed from the candidate regressors and gating parameters. It is well known that this mean square loss is prone to bad local minima, as demonstrated empirically in the earliest work of Jacobs et al. (1991) (we verify this in Section 4 too), which also emphasized the importance of the right objective function to learn the parameters. Note that the bad landscape of the mean square loss is not unique to MoE, but is also widely observed in the context of training neural network parameters (Livni et al., 2014). In the one-hidden-layer NN setting, some recent works (Ge et al., 2018; Gao et al., 2019) addressed this issue by designing new loss functions with good landscape properties so that standard algorithms like SGD can provably learn the parameters. However, these methods do not generalize to the MoE setting since they crucially rely on the fact that the coefficients appearing in front of the activation terms in Eq. (1), which correspond to the linear layer weights in NN, are constant. Such an assumption does not hold in the context of MoEs because the gating probabilities depend on the input in a parametric way through the softmax function, which introduces the coupling between the regressors and the gating parameters (a similar observation was noted in Makkuva et al. (2019) in the context of spectral methods).
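For reference, the mean square objective discussed above can be sketched as follows: the prediction is the gating-weighted average of the expert activations, and the loss is the average squared error over the samples. The helper names and the `tanh` activation are illustrative assumptions; the point is only to show how the gating probabilities multiply the activation terms, which is the source of the coupling.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def predict_mean(x, regressors, gates, g=math.tanh):
    """Expected output: sum_i softmax_i(<w_i, x>) * g(<a_i, x>)."""
    probs = softmax([dot(w, x) for w in gates])
    return sum(p * g(dot(a, x)) for p, a in zip(probs, regressors))

def l2_loss(samples, regressors, gates, g=math.tanh):
    """Empirical mean square loss over a list of (x, y) pairs."""
    return sum((predict_mean(x, regressors, gates, g) - y) ** 2
               for x, y in samples) / len(samples)
```

Because `probs` depends on `x`, the coefficients in front of the activations are not constant, unlike the output-layer weights of a one-hidden-layer network.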
In order to address the aforementioned issues, inspired by the works of Ge et al., (2018) and Gao et al., (2019), we design a novel loss function to learn the regressors first. Our loss function depends on two distinct special transformations on both the input and the output . For the output, we consider the following transformations:
[TABLE]
where the set of coefficients are dependent on the choice of non-linearity and noise variance . These are obtained by solving a simple linear system (see Appendix B). For the special case , which corresponds to linear activations, the Quartic transform is and the Quadratic transform is . For the input , we assume that , and for any two fixed , we consider the projections of multivariate-Hermite polynomials Grad, (1949); Holmquist, (1996); Janzamin et al., (2014) along these two vectors, i.e.
[TABLE]
where and are two non-zero constants depending on and . These transformations on the input and on the output can be viewed as extractors of higher order information from the data. The utility of these transformations is concretized in Theorem 1 through the loss function defined below. Denoting the set of our regression parameters by the matrix , we now define our objective function as
[TABLE]
where are some positive regularization constants. Notice that is defined as an expectation of terms involving the data transformations: and . Hence its gradients can be readily computed from finite samples and is amenable to standard optimization methods such as SGD for learning the parameters. Moreover, the following theorem highlights that the landscape of does not have any spurious local minima.
Theorem 1 (Landscape analysis for learning regressors).
Under the mild technical assumptions of Makkuva et al., (2019), the loss function does not have any spurious local minima. More concretely, let be a given error tolerance. Then we can choose the regularization constants and the parameters such that if satisfies
[TABLE]
then , where is a diagonal matrix with entries close to , is a diagonal matrix with , is a permutation matrix and . Hence every approximate local minimum is -close to the global minimum.
Intuitions behind the theorem and the special transforms: While the transformations and the loss defined above may appear non-intuitive at first, the key observation is that can be viewed as a fourth-order polynomial loss in the parameter space, i.e.
[TABLE]
where refers to the softmax probability for the label with true gating parameters, i.e. . This alternate characterization of in Eq. (7) is the crucial step towards proving Theorem 1. Hence these specially designed transformations on the data help us to achieve this objective. Given this viewpoint, we utilize tools from Ge et al., (2018), where a similar loss involving fourth-order polynomials were analyzed in the context of -layer ReLU network, to prove the desired landscape properties for . The full details behind the proof are provided in Appendix C. Moreover, in Section 4 we empirically verify that the technical assumptions are only needed for the theoretical results and that our algorithms are robust to these assumptions, and work equally well even when we relax them.
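The input transformations in this section are built from Hermite polynomials. As a simplified illustration (a single unit direction `u` rather than the paper's two-vector projections, whose exact constants are not reproduced here), the sketch below evaluates the probabilists' Hermite polynomials by their standard three-term recurrence. For a unit vector `u` and standard Gaussian input, the degree-n Hermite polynomial of the projection is what one obtains by contracting the multivariate Hermite tensor along `u`.

```python
def hermite(n, t):
    """Probabilists' Hermite polynomial He_n(t).

    Uses He_0 = 1, He_1 = t and the recurrence
    He_{m+1}(t) = t * He_m(t) - m * He_{m-1}(t).
    """
    prev, cur = 1.0, t
    if n == 0:
        return prev
    for m in range(1, n):
        prev, cur = cur, t * cur - m * prev
    return cur

def projected_hermite(n, x, u):
    """He_n(<u, x>) for a unit vector u (illustrative helper)."""
    return hermite(n, sum(ui * xi for ui, xi in zip(u, x)))
```

In particular `hermite(2, t)` equals `t**2 - 1` and `hermite(4, t)` equals `t**4 - 6*t**2 + 3`, the degree-two and degree-four polynomials that extract second- and fourth-order information from Gaussian data.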
In the finite sample regime, we replace the population expectations in Eq. (6) with sample average to obtain the empirical loss . The following theorem establishes that too inherits the same landscape properties of when provided enough samples.
Theorem 2 (Finite sample landscape).
There exists a polynomial such that whenever the number of samples exceeds it, the empirical loss inherits the same landscape properties as the population loss established in Theorem 1, with high probability. Hence stochastic gradient descent on the empirical loss converges to an approximate local minimum, which is also close to a global minimum, in polynomial time.
Remark 1.
Notice that the parameters learnt through SGD are some permutation of the true parameters up to sign flips. This sign ambiguity can be resolved using existing standard procedures such as Algorithm 1 in Ge et al. (2018). In the remainder of the paper, we assume that we know the regressors up to some error in the following sense: .
3.2 Loss function for gating parameters:
In the previous section, we established that we can learn the regressors up to a small error using SGD on the loss function designed there. Now we are interested in answering the following question: Can we design a loss function amenable to efficient optimization algorithms such as SGD, with recovery guarantees, to learn the gating parameters?
In order to gain some intuition towards addressing this question, consider the simplified setting of and . In this setting, we can see from Eq. (1) that the output equals one of the activation values , for , with probability . Since we already have access to the true parameters, i.e. , we can see that we can exactly recover the hidden latent variable , which corresponds to the chosen hidden expert for each sample . Thus the problem of learning the classifiers reduces to a multi-class classification problem with label for each input and hence can be efficiently solved by traditional methods such as logistic regression. It turns out that these observations can be formalized to deal with more general settings (where we only know the regressors approximately and the noise variance is not zero) and that the gradient descent on the log-likelihood loss achieves the same objective. Hence we use the negative log-likelihood function to learn the classifiers, i.e.
[TABLE]
where . Note that the objective in Eq. (8) is not convex in the gating parameters whenever . We omit the input distribution from the above negative log-likelihood since it does not depend on any of the parameters. We now define the domain of the gating parameters as
[TABLE]
for some fixed . Without loss of generality, we assume that . Since we know the regressors approximately from the previous stage, i.e. , we run gradient descent only for the classifier parameters keeping the regressors fixed, i.e.
[TABLE]
where is a suitably chosen learning-rate, denotes the projection operator which maps each row of its input matrix onto the ball of radius , and denotes the iteration step. In a more succinct way, we write
[TABLE]
Note that denotes the projected gradient descent operator on for fixed . In the finite sample regime, we define our loss as the finite sample counterpart of Eq. (8) by taking empirical expectations. Accordingly, we define the gradient operator as
[TABLE]
In this paper, we analyze a sample-splitting version of the gradient descent, where given the number of samples and the iterations , we first split the data into subsets of size , and perform each iteration on a fresh batch of samples, i.e. . We use the norm for our theoretical results. The following theorem establishes the almost geometric convergence of the population-gradient iterates under some high SNR conditions. The following results are stated for for simplicity and also hold for any general .
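The procedure described above (gradient steps on the negative log-likelihood in the gating parameters only, a projection of each row onto a ball, and a fresh batch per iteration) can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the `tanh` activation, the default noise level and the equal-size split are all choices made for this example, and the last gating vector is fixed to zero as in the model definition.

```python
import math

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def gate_probs(W, x):
    """Softmax over k logits; the k-th gating vector is fixed to zero."""
    logits = [dot(w, x) for w in W] + [0.0]
    m = max(logits)
    e = [math.exp(v - m) for v in logits]
    s = sum(e)
    return [v / s for v in e]

def expert_likelihoods(x, y, A, sigma, g):
    """Gaussian density of y under each expert, regressors A held fixed."""
    c = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return [c * math.exp(-0.5 * ((y - g(dot(a, x))) / sigma) ** 2) for a in A]

def nll(W, A, batch, sigma=0.5, g=math.tanh):
    """Average negative log-likelihood of the batch as a function of W."""
    total = 0.0
    for x, y in batch:
        p = gate_probs(W, x)
        lik = expert_likelihoods(x, y, A, sigma, g)
        total -= math.log(sum(pi * li for pi, li in zip(p, lik)))
    return total / len(batch)

def nll_grad(W, A, batch, sigma=0.5, g=math.tanh):
    """Gradient of nll in W: row i averages (prior_i - posterior_i) * x."""
    grad = [[0.0] * len(w) for w in W]
    for x, y in batch:
        p = gate_probs(W, x)
        lik = expert_likelihoods(x, y, A, sigma, g)
        z = sum(pi * li for pi, li in zip(p, lik))
        for i in range(len(W)):
            coef = p[i] - p[i] * lik[i] / z
            for j, xj in enumerate(x):
                grad[i][j] += coef * xj / len(batch)
    return grad

def project_rows(W, radius):
    """Project each row onto the Euclidean ball of the given radius."""
    out = []
    for w in W:
        n = math.sqrt(dot(w, w))
        out.append(list(w) if n <= radius else [radius * v / n for v in w])
    return out

def split_gd(W0, A, samples, steps, lr, radius, sigma=0.5, g=math.tanh):
    """Projected GD in which step t uses its own disjoint batch."""
    size = len(samples) // steps
    W = W0
    for t in range(steps):
        batch = samples[t * size:(t + 1) * size]
        grad = nll_grad(W, A, batch, sigma, g)
        W = project_rows([[wj - lr * gj for wj, gj in zip(w, gr)]
                          for w, gr in zip(W, grad)], radius)
    return W
```

Using a disjoint batch at every step keeps each gradient estimate independent of the current iterate, which is the usual reason sample splitting simplifies a finite-sample analysis.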
Theorem 3 (GD convergence for classifiers).
Assume that . Then there exist two positive constants and such that for any step size and noise variance , the population gradient descent iterates converge almost geometrically to the true parameter for any randomly initialized , i.e.
[TABLE]
where are dimension-independent constants depending on and such that and .
Proof.
(Sketch) For simplicity, let . Then we can show that since . Then we capitalize on the fact that is strongly convex with minimizer at to show the geometric convergence rate. The more general case of is handled through perturbation analysis. ∎
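The strong-convexity step in the sketch rests on a textbook contraction argument, reproduced here for a generic objective rather than with the paper's constants. Assume (for illustration only) that $f$ is $\mu$-strongly convex and $L$-smooth with minimizer $w^*$, and that the step size satisfies $0 < \alpha \le 1/L$. For the plain update $w_{t+1} = w_t - \alpha \nabla f(w_t)$:

```latex
% Generic fact; \mu, L, \alpha are illustrative symbols, not the paper's constants.
% Uses: <\nabla f(w), w - w^*> >= f(w) - f(w^*) + (\mu/2) ||w - w^*||^2  (strong convexity)
%       ||\nabla f(w)||^2 <= 2L (f(w) - f(w^*))                          (smoothness)
\begin{align*}
\|w_{t+1} - w^*\|^2
  &= \|w_t - w^*\|^2 - 2\alpha \langle \nabla f(w_t), w_t - w^* \rangle
     + \alpha^2 \|\nabla f(w_t)\|^2 \\
  &\le (1 - \alpha\mu)\,\|w_t - w^*\|^2
     - 2\alpha (1 - \alpha L)\,\bigl(f(w_t) - f(w^*)\bigr) \\
  &\le (1 - \alpha\mu)\,\|w_t - w^*\|^2 .
\end{align*}
```

Projecting onto a convex set that contains $w^*$ cannot increase the distance to $w^*$, so the same bound holds for the projected iterates. The "almost" in almost geometric convergence then comes from the additional error incurred when the regressors are only known approximately.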
We conclude our theoretical discussion on MoE by providing the following finite sample complexity guarantees for learning the classifiers using the gradient descent in the following theorem, which can be viewed as a finite sample version of Theorem 3.
Theorem 4 (Finite sample complexity and convergence rates for GD).
In addition to the assumptions of Theorem 3, assume that the sample size is lower bounded as . Then the sample-gradient iterates based on samples per iteration satisfy the bound
[TABLE]
with probability at least .
4 Experiments
In this section, we empirically validate the fact that running SGD on our novel loss functions and achieves superior performance compared to the existing approaches. Moreover, we empirically show that our algorithms are robust to the technical assumptions made in Theorem 1 and that they achieve equally good results even when the assumptions are relaxed.
Data generation. For our experiments, we choose , , for and for , and . We generate the data according to Eq. (1) and using these ground-truth parameters. We chose for all of our experiments.
Error metric. If denotes the matrix of regressors where each row is of norm , we use the error metric to gauge the closeness of to the ground-truth :
[TABLE]
where denotes the set of all permutations on . Note that if and only if the learnt regressors have a minimum correlation of with the ground-truth parameters, up to a permutation. The error metric is defined similarly.
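A small sketch of an error metric consistent with the description above: one minus the best, over all permutations, of the smallest correlation between a learnt regressor and its matched ground-truth regressor. Taking the absolute value of the correlation (to ignore sign flips) and the function name are assumptions made for this example; the brute-force search over permutations is only practical for a small number of experts.

```python
import itertools

def regressor_error(A, A_star):
    """1 - max over permutations of min_i |<a_i, a*_perm(i)>|.

    Rows of A and A_star are assumed to have unit norm.
    """
    k = len(A)
    best = 0.0
    for perm in itertools.permutations(range(k)):
        worst = min(abs(sum(u * v for u, v in zip(A[i], A_star[perm[i]])))
                    for i in range(k))
        best = max(best, worst)
    return 1.0 - best
```

The metric is zero exactly when the learnt rows match the ground-truth rows up to a reordering (and, under the absolute-value assumption, up to sign).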
Results. In Figure 1, we choose and compare the performance of our algorithm against existing approaches. In particular, we consider three methods: 1) EM algorithm, 2) SGD on the classical -loss from Eq. (3.1), and 3) SGD on our losses and . For all the methods, we ran independent trials and plotted the mean error. Figure 1(a) highlights the fact that minimizing our loss function by SGD recovers the ground-truth regressors, whereas SGD on -loss as well as EM get stuck in local optima. For learning the gating parameters using our approach, we first fix the regressors at the values learnt using , i.e. , where is the converged solution for . For and the EM algorithm, the gating parameters are learnt jointly with regressors . Figure 1(b) illustrates the phenomenon that our loss for learning the gating parameters performs considerably better than the standard approaches, as indicated by the significant gaps between the respective error values. Finally, in Figure 1(c) we plot the regressor error for over random initializations. We can see that we recover the ground truth parameters in all the trials, thus empirically corroborating our technical results in Section 3.
4.1 Robustness to technical assumptions
In this section, we verify numerically the fact that our algorithms work equally well in the absence of technical assumptions made in Section 3.
Relaxing orthogonality in Theorem 1. A key assumption in proving Theorem 1, adapted from Makkuva et al., (2019), is that the set of regressors and set of gating parameters are orthogonal to each other. While this assumption is needed for the technical proofs, we now empirically verify that our conclusions still hold when we relax this. For this experiment, we choose and let . For the gating parameter , we randomly generate it from uniform distribution on the -dimensional unit sphere. In Figure 3(a) and Figure 3(b), we plotted the individual parameter estimation error for different runs for both of our losses and for learning the regressors and the gating parameter respectively. We can see that our algorithms are still able to learn the true parameters even when the orthogonality assumption is relaxed.
Relaxing Gaussianity of the input. To demonstrate the robustness of our approach to the assumption that the input is standard Gaussian, i.e. , we generated according to a mixture of two symmetric Gaussians each with identity covariances, i.e. , where is the mixing probability and is a fixed but randomly chosen vector. For various mixing proportions , we ran SGD on our loss to learn the regressors. Figure 3(c) highlights that we learn these ground truth parameters in all the settings.
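The non-Gaussian input distribution used in this robustness check, a mixture of two symmetric Gaussians with identity covariance, can be sampled as in the sketch below. The function name and argument names are hypothetical; `mu` stands for the fixed mean vector and `p` for the mixing probability.

```python
import random

def sample_symmetric_mixture(mu, p, rng=random):
    """Draw x from p * N(mu, I) + (1 - p) * N(-mu, I)."""
    # Pick the component first, then add independent unit-variance noise.
    sign = 1.0 if rng.random() < p else -1.0
    return [sign * m + rng.gauss(0.0, 1.0) for m in mu]
```

Setting `p = 0.5` gives a zero-mean input whose distribution is still not Gaussian whenever `mu` is non-zero.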
Finally, we note that in all our experiments, the loss seems to require a larger batch size () for its gradient estimation while running SGD. However, with smaller batch sizes such as we are still able to achieve similar performance but with more variance (see Appendix E).
5 Discussion
In this paper we established the first mathematical connection between two popular gated neural networks: GRU and MoE. Inspired by this connection and the success of SGD based algorithms in finding global minima in a variety of non-convex problems in deep learning, we provided the first gradient descent based approach for learning the parameters in a MoE. While the canonical MoE does not involve any time series, extending our methods to the recurrent setting is an important future direction. Similarly, extending them to deep MoEs comprised of multiple gated as well as non-gated layers is also a fruitful direction for further research. We believe that the theme of using different loss functions for distinct parameters in NN models can potentially yield new theoretical insights as well as practical methodologies for complex neural models.
Acknowledgements
We would like to thank the anonymous reviewers for their suggestions. This work is supported by NSF grants 1927712 and 1929955.
Appendix A Connection between -MoE and other popular models
Relation to other mixture models.
Notice that if we let in Eq. (1) for all , we recover the well-known uniform mixtures of generalized linear models (GLMs). Similarly, allowing for bias parameters in Eq. (1), we can recover the generic mixtures of GLMs. Moreover, if we let to be the linear function, we get the popular mixtures of linear regressions model. These observations highlight that MoE models are a strict generalization of mixtures of GLMs since they allow the mixing probability to depend on each input in a parametric way. This makes the learning of the parameters far more challenging since the gating and expert parameters are inherently coupled.
Relation to feed-forward neural networks.
Note that if we let and allow for bias parameters in the soft-max probabilities in Eq. (1), taking conditional expectation on both sides yields
[TABLE]
Thus the mapping is exactly the same as that of a -hidden -layer neural network with activation function if we restrict the output layer to positive weights. Thus -MoE can also be viewed as a probabilistic model for gated feed-forward networks.
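The two reductions in this appendix can be checked numerically with a small sketch. It assumes (as an interpretation of the text, since the original symbols are not shown) that the gating weight vectors are set to zero while soft-max biases are kept: the gating probabilities then no longer depend on the input, so the conditional mean coincides with a one-hidden-layer network whose output weights are positive and sum to one. All function names are hypothetical.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def moe_mean_with_bias(x, A, W, b, g=math.tanh):
    """Conditional mean: sum_i softmax_i(<w_i, x> + b_i) * g(<a_i, x>)."""
    probs = softmax([dot(w, x) + bi for w, bi in zip(W, b)])
    return sum(p * g(dot(a, x)) for p, a in zip(probs, A))

def one_hidden_layer(x, A, c, g=math.tanh):
    """sum_i c_i * g(<a_i, x>) with fixed output weights c."""
    return sum(ci * g(dot(a, x)) for ci, a in zip(c, A))
```

With zero gating weights and zero biases the output weights become uniform, which corresponds to the uniform mixture case mentioned earlier in this appendix.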
Appendix B Valid class of non-linearities
We slightly modify the class of non-linearities from Makkuva et al. (2019) for our theoretical results. The only key modification is that we use fourth-order derivative based conditions, as opposed to the third-order derivatives used in that work. Following their notation, let and , where . For , define
[TABLE]
where
[TABLE]
Similarly, define
[TABLE]
Condition 1.
and . Or equivalently, in view of Stein's lemma (Stein, 1972),
[TABLE]
Condition 2.
and . Or equivalently,
[TABLE]
Definition 1.
We say that the non-linearity is if there exists a tuple such that both Condition 1 and Condition 2 are satisfied.
While these conditions might seem restrictive at first, all the widely used non-linearities such as , ReLU, leaky-ReLU, sigmoid, etc. belong to this class. For some of these non-linear activations, we provide the pre-computed transformations below:
Example 1.
If , then and .
Example 2.
If ReLU, i.e. , we have that for any ,
[TABLE]
Substituting these moments in the linear set of equations , we obtain
[TABLE]
Solving for will yield . Finally, we have that .
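The equivalences in Condition 1 and Condition 2 rest on Stein's lemma, which for a standard Gaussian Z and a weakly differentiable g states that E[Z g(Z)] = E[g'(Z)]. The following is a minimal numerical sanity check of that first-order identity for ReLU and sigmoid. It is not a reproduction of the fourth-order conditions above; the helper name `gaussian_expectation` and the choice of 100 quadrature nodes are illustrative assumptions.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

# Gauss-Hermite rule for the weight exp(-z^2 / 2). An even node count keeps
# every node away from the ReLU kink at z = 0.
NODES, WEIGHTS = hermegauss(100)

def gaussian_expectation(f):
    """Approximate E[f(Z)] for Z ~ N(0, 1) by quadrature."""
    return float(np.sum(WEIGHTS * f(NODES)) / np.sqrt(2.0 * np.pi))

def relu(z):
    return np.maximum(z, 0.0)

def relu_grad(z):
    # Weak derivative of ReLU (the value at z = 0 is irrelevant here).
    return (z > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# Left- and right-hand sides of Stein's lemma for each activation.
relu_lhs = gaussian_expectation(lambda z: z * relu(z))
relu_rhs = gaussian_expectation(relu_grad)
sig_lhs = gaussian_expectation(lambda z: z * sigmoid(z))
sig_rhs = gaussian_expectation(sigmoid_grad)
```

For ReLU both sides equal 1/2, since E[Z^2 1{Z > 0}] = P(Z > 0) = 1/2. The same integration-by-parts mechanism, applied repeatedly, is what turns moment conditions into derivative conditions.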
Appendix C Proofs of Section 3.1
Remark 2.
To choose the parameters in Theorem 1, we follow the parameter choices from Ge et al. (2018). Let be a sufficiently small universal constant (e.g. ). Assume , and . Let . Let and .
For any matrix , let be its pseudo inverse such that and is the projection matrix to the row span of . Let and . Let , .
For the sake of clarity, we now formally state our main assumptions, adapted from Makkuva et al. (2019):
1. follows a standard Gaussian distribution, i.e. .
2. for all and for all .
3. The regressors are linearly independent and the classifiers are orthogonal to the span , and .
4. The non-linearity is , which we define in Appendix B.
Note that while the first three assumptions are the same as those of Makkuva et al. (2019), the fourth assumption is slightly different from theirs. Under these assumptions, we first give an alternative characterization of in the following theorem, which is crucial for the proof of Theorem 1.
Theorem 5.
The function defined in Eq. (6) satisfies that
[TABLE]
C.1 Proof of Theorem 5
Proof.
For the proof of Theorem 5, we use the notion of score functions defined as in Janzamin et al. (2014):
[TABLE]
In this paper we focus on . When , we know that and
[TABLE]
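As a one-dimensional illustration of why score functions are useful, the sketch below assumes the orders of interest are two and four. For a scalar standard Gaussian these scores reduce to the probabilists' Hermite polynomials z^2 - 1 and z^4 - 6z^2 + 3, and they satisfy the higher-order Stein identity E[f(Z) S_m(Z)] = E[f^(m)(Z)]. The test function f(z) = z^4 and the names `score_2`, `score_4` and `expect` are chosen here for illustration only.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

# Quadrature that is exact for polynomial integrands of this low degree.
nodes, weights = hermegauss(60)
weights = weights / np.sqrt(2.0 * np.pi)  # normalise to the N(0, 1) density

def expect(values):
    """E[h(Z)] for Z ~ N(0, 1), given h evaluated at the quadrature nodes."""
    return float(np.sum(weights * values))

def score_2(z):
    """Second-order score of the scalar standard Gaussian."""
    return z**2 - 1.0

def score_4(z):
    """Fourth-order score of the scalar standard Gaussian."""
    return z**4 - 6.0 * z**2 + 3.0

def f(z):
    return z**4

# Second order: E[f(Z) S_2(Z)] should match E[f''(Z)] = E[12 Z^2].
lhs_2 = expect(f(nodes) * score_2(nodes))
rhs_2 = expect(12.0 * nodes**2)

# Fourth order: E[f(Z) S_4(Z)] should match the constant f''''(z) = 24.
lhs_4 = expect(f(nodes) * score_4(nodes))
rhs_4 = 24.0
```

In the multivariate setting the same identity produces derivative tensors of the regression function, which is the mechanism behind tensor constructions of this kind.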
The score transformations and can be viewed as multi-variate polynomials in of degrees and respectively. For the output , recall the transforms and defined in Section 3.1. The following lemma shows that one can construct a fourth-order super symmetric tensor using these special transforms.
Lemma 1 (Super symmetric tensor construction).
Let be generated according to Eq. (1) and Assumptions - hold. Then
[TABLE]
where , and are two non-zero constants depending on and .
Now the proof of the theorem immediately follows from Lemma 1. Recall from Eq. (6) that
[TABLE]
Fix . Notice that we have . Hence we obtain
[TABLE]
The simplification for the remaining terms is similar and follows directly from definitions of and . ∎
C.2 Proof of Theorem 1
Proof.
The proof is an immediate consequence of Theorem 5 and Theorem C.5 of Ge et al. (2018). ∎
C.3 Proof of Theorem 2
Proof.
Note that our loss function can be written as where is at most a fourth-degree polynomial in , and . Hence our finite sample guarantees directly follow from Theorem 1 and Theorem E.1 of Ge et al. (2018). ∎
C.4 Proof of Lemma 1
Proof.
The proof of this lemma follows essentially the same arguments as that of (Makkuva et al., 2019, Theorem 1), where we replace with respectively, and replace defined there with our defined above.
∎
Appendix D Proofs of Section 3.2
For the convergence analysis of SGD on , we use techniques from Balakrishnan et al. (2017) and Makkuva et al. (2019). In particular, we adapt (Makkuva et al., 2019, Lemma 3) and (Makkuva et al., 2019, Lemma 4) to our setting through Lemma 2 and Lemma 3, which are central to the proofs of Theorem 3 and Theorem 4. We now state our lemmas.
Lemma 2.
Under the assumptions of Theorem 3, it holds that
[TABLE]
In addition, is a fixed point for .
Lemma 3.
Let the matrix of regressors be such that . Then for any , we have that
[TABLE]
where is a constant depending on and . In particular, for linear, sigmoid and ReLU.
Lemma 4 (Deviation of finite sample gradient operator).
For some universal constant , let the number of samples be such that . Then for any fixed set of regressors , and a fixed , the bound
[TABLE]
holds with probability at least .
D.1 Proof of Theorem 3
Proof.
The proof directly follows from Lemma 2 and Lemma 3. ∎
D.2 Proof of Theorem 4
Proof.
Let the set of regressors be such that . Fix . For any iteration , from Lemma 4 we have the bound
[TABLE]
with probability at least . Using a union bound argument, Eq. (11) holds with probability at least for all . Now we show that the following bound holds:
[TABLE]
Indeed, for any , we have that
[TABLE]
where we used Lemma 2, Lemma 3 and Lemma 4 in the last inequality to bound each of the terms. From Eq. (11), we obtain that
[TABLE]
∎
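The last step of the proof above is the standard unrolling of a contraction with an additive perturbation. As a sketch in generic notation (the symbols $e_t$ for the parameter error at iteration $t$, $\rho$ for the contraction factor and $\varepsilon$ for the combined per-step perturbation are placeholders introduced here, not the paper's notation), a one-step bound turns into a bound at every iteration as follows:

```latex
\begin{align*}
e_{t+1} &\le \rho\, e_t + \varepsilon, \qquad 0 \le \rho < 1, \\
e_t &\le \rho^{t} e_0 + \varepsilon \sum_{s=0}^{t-1} \rho^{s}
     = \rho^{t} e_0 + \varepsilon\, \frac{1 - \rho^{t}}{1 - \rho}
     \le \rho^{t} e_0 + \frac{\varepsilon}{1 - \rho}.
\end{align*}
```

The first term decays geometrically with the number of iterations, while the second is a floor set by the perturbation, which here would collect the terms controlled through Lemma 3 and Lemma 4.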
D.3 Proof of Lemma 2
Proof.
Recall that the loss function for the population setting, , is given by
[TABLE]
where and . Hence for any , we have
[TABLE]
Moreover,
[TABLE]
Hence we obtain that
[TABLE]
Notice that if denotes the latent variable corresponding to which expert is chosen, we have that the posterior probability of choosing the th expert is given by
[TABLE]
whereas,
[TABLE]
Hence, when and , we get that
[TABLE]
Thus is a fixed point for since
[TABLE]
Now we make the observation that the population-gradient updates are the same as the gradient-EM updates. Thus the contraction of the population-gradient operator follows from the contraction property of the gradient EM algorithm (Makkuva et al., 2019, Lemma 3). To see this, recall that for -MoE, the gradient-EM algorithm involves computing the function for the current iterate and defined as:
[TABLE]
where corresponds to the posterior probability for the expert, given by
[TABLE]
Then the next iterate of the gradient-EM algorithm is given by . We have that
[TABLE]
Hence if we use the same step size , our population-gradient iterates on the log-likelihood are the same as the gradient-EM iterates. This finishes the proof. ∎
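The equivalence used in this proof can be checked numerically on a finite sample. The sketch below assumes a soft-max gate over linear scores and Gaussian experts with known regressors, and it parameterizes every expert with its own gating vector for simplicity; all function names are illustrative. It compares the gradient-EM direction, the sample average of (posterior minus gating probability) times the input, with a finite-difference gradient of the log-likelihood in the gating parameters.

```python
import numpy as np

def softmax(scores):
    shifted = scores - scores.max(axis=1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)

def expert_likelihoods(X, y, A, g, sigma):
    """Gaussian density of y under each expert (rows: samples, columns: experts)."""
    resid = y[:, None] - g(X @ A.T)
    return np.exp(-0.5 * (resid / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def log_likelihood(W, X, y, A, g, sigma):
    gate = softmax(X @ W.T)
    lik = expert_likelihoods(X, y, A, g, sigma)
    return float(np.mean(np.log(np.sum(gate * lik, axis=1))))

def gradient_em_direction(W, X, y, A, g, sigma):
    """Gradient of the EM surrogate in W at the current iterate."""
    gate = softmax(X @ W.T)
    joint = gate * expert_likelihoods(X, y, A, g, sigma)
    posterior = joint / joint.sum(axis=1, keepdims=True)
    return (posterior - gate).T @ X / X.shape[0]

def finite_difference_gradient(W, X, y, A, g, sigma, eps=1e-6):
    """Central-difference gradient of the log-likelihood with respect to W."""
    grad = np.zeros_like(W)
    for idx in np.ndindex(*W.shape):
        step = np.zeros_like(W)
        step[idx] = eps
        up = log_likelihood(W + step, X, y, A, g, sigma)
        down = log_likelihood(W - step, X, y, A, g, sigma)
        grad[idx] = (up - down) / (2.0 * eps)
    return grad

# Synthetic data from the assumed model with two experts.
rng = np.random.default_rng(1)
n, d, k, sigma = 200, 3, 2, 0.5
X = rng.standard_normal((n, d))
A = rng.standard_normal((k, d))
W_true = rng.standard_normal((k, d))
chosen = (rng.random(n) > softmax(X @ W_true.T)[:, 0]).astype(int)
y = np.tanh(X @ A.T)[np.arange(n), chosen] + sigma * rng.standard_normal(n)

W = np.zeros((k, d))
grad_em = gradient_em_direction(W, X, y, A, np.tanh, sigma)
grad_fd = finite_difference_gradient(W, X, y, A, np.tanh, sigma)
W_next = W + 0.1 * grad_em  # one gradient-EM step with an arbitrary step size
```

The two gradients agree up to finite-difference error, so a gradient-EM step and a gradient-ascent step on the log-likelihood with the same step size coincide.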
D.4 Proof of Lemma 3
Proof.
Fix any and let be such that for some . Let
[TABLE]
Denoting the row of by and that of by for any , we have that
[TABLE]
Thus it suffices to bound . From Eq. (13), we have that
[TABLE]
where,
[TABLE]
Thus we have
[TABLE]
where denotes the posterior probability of choosing the expert. Now we observe that Eq. (14) reduces to the setting of (Makkuva et al., 2019, Lemma 4) and hence the conclusion follows.
∎
D.5 Proof of Lemma 4
Proof.
We first prove the lemma for . For -MoE, we have that the posterior probability is given by
[TABLE]
where and for fixed . Then we have that
[TABLE]
Hence
[TABLE]
Since , we have that
[TABLE]
We now bound and .
Bounding : We prove that the random variable is sub-gaussian with parameter for some constant and thus its squared norm is sub-exponential. We then bound using standard sub-exponential concentration bounds. To this end, we first show that the random variable is sub-gaussian with parameter . Or equivalently, that is sub-gaussian for all .
Without loss of generality, assume that . First let . Thus . We have
[TABLE]
It follows that is Lipschitz since
[TABLE]
From the Talagrand concentration of Gaussian measure for Lipschitz functions (Ledoux and Talagrand, 1991), it follows that is sub-gaussian with parameter . Now consider any such that . Then we have that and are independent. Thus,
[TABLE]
is sub-gaussian with parameter since and are independent standard gaussians. Since any can be written as
[TABLE]
where denotes the projection operator onto the sub-space , we have that is sub-gaussian with parameter for all . Thus it follows that is zero-mean and sub-gaussian with parameter which further implies that
[TABLE]
with probability at least .
Bounding : Let , where
[TABLE]
Let be a -cover of the unit sphere . Hence for any , there exists a such that . Thus,
[TABLE]
where we used the fact that for any . Now taking supremum over all yields that . Now we bound for a fixed . By the symmetrization trick (Vaart and Wellner, 1996), we have
[TABLE]
where are i.i.d. Rademacher variables. Define the event . Since , standard tail bounds imply that . Thus we have that
[TABLE]
Considering the first term, for any , we have
[TABLE]
where we used the Ledoux-Talagrand contraction for Rademacher processes (Ledoux and Talagrand, 1991), since for all . The sub-gaussianity of the Rademacher sequence implies that
[TABLE]
using the definition of the event . Thus the above bound on the moment generating function implies the following tail bound:
[TABLE]
Combining all the bounds together, we obtain that
[TABLE]
Since , using the union bound we obtain that
[TABLE]
Since , we have that with probability at least . Combining these bounds on and yields the final bound on .
Now consider any . From Eq. (13), defining and , we have that
[TABLE]
Similarly,
[TABLE]
Since , without loss of generality, we let . The proof for the other cases is similar. Thus we have
[TABLE]
where . Since and , we can use the same argument as in the proof for -MoE above to obtain the parametric bound. This finishes the proof. ∎
Appendix E Additional experiments
E.1 Reduced batch size
In Figure 4 we ran SGD on our loss with different runs with a batch size of and a learning rate of for and . We can see that our algorithm still converges to zero but with more variance, both because of the noisier gradient estimates and because the number of samples is smaller than the required sample complexity.
References
- Alaeddini, A., Alemzadeh, S., Mesbahi, A., and Mesbahi, M. (2018). Linear model regression on time-series data: Non-asymptotic error bounds and applications. In 2018 IEEE Conference on Decision and Control (CDC), pages 2259–2264. IEEE.
- Allen-Zhu, Z. and Li, Y. (2019). Can SGD learn recurrent neural networks with provable generalization? arXiv preprint arXiv:1902.01028.
- Allen-Zhu, Z., Li, Y., and Song, Z. (2018). On the convergence rate of training recurrent neural networks. arXiv preprint arXiv:1810.12065.
- Arora, S., Hazan, E., Lee, H., Singh, K., Zhang, C., and Zhang, Y. (2018). Towards provable control for unknown linear dynamical systems.
- Balakrishnan, S., Wainwright, M. J., and Yu, B. (2017). Statistical guarantees for the EM algorithm: From population to sample-based analysis. The Annals of Statistics, 45(1):77–120.
- Bandeira, A. S., Boumal, N., and Voroninski, V. (2016). On the low-rank approach for semidefinite programs arising in synchronization and community detection. arXiv preprint arXiv:1602.04426.
- Bhojanapalli, S., Neyshabur, B., and Srebro, N. (2016). Global optimality of local search for low rank matrix recovery. arXiv preprint arXiv:1605.07221.
- Cho, K., van Merrienboer, B., Bahdanau, D., and Bengio, Y. (2014). On the properties of neural machine translation: Encoder-decoder approaches. CoRR, abs/1409.1259.
