RNN Training along Locally Optimal Trajectories via Frank-Wolfe Algorithm
Yun Yue, Ming Li, Venkatesh Saligrama, Ziming Zhang

TL;DR
This paper introduces a novel RNN training method using the Frank-Wolfe algorithm, achieving lower training costs and improved performance on benchmarks, especially with long-term dependencies and noisy data.
Contribution
It develops a new RNN training approach based on Frank-Wolfe, providing theoretical convergence guarantees and demonstrating empirical advantages over traditional back-propagation.
Findings
Empirically lower overall training cost than back-propagation
Significant performance improvements on benchmark datasets
Effective training of deep RNN architectures and robustness to noise
Abstract
We propose a novel and efficient training method for RNNs that iteratively seeks a local minimum on the loss surface within a small region and uses the resulting directional vector for the update in an outer loop. We propose to utilize the Frank-Wolfe (FW) algorithm in this context. Although FW implicitly involves normalized gradients, which can lead to a slow convergence rate, we develop a novel RNN training method whose overall training cost, surprisingly, is empirically observed to be lower than that of back-propagation, even with the additional cost. Our method leads to a new Frank-Wolfe method that is in essence an SGD algorithm with a restart scheme. We prove that under certain conditions our algorithm has a sublinear convergence rate of O(1/ε) for ε error. We then conduct empirical experiments on several benchmark datasets including those that exhibit long-term…
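The inner/outer structure described in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm or code: it assumes a toy quadratic loss in place of an RNN loss, an L2 ball as the "small region", and the classic 2/(k+2) Frank-Wolfe step size. The names `fw_local`, `train`, `radius`, `inner_steps`, and `outer_steps` are hypothetical and chosen only for illustration.

```python
import math

def loss(w, target):
    # Toy quadratic loss standing in for an RNN training loss (assumption).
    return 0.5 * sum((wi - ti) ** 2 for wi, ti in zip(w, target))

def grad(w, target):
    # Gradient of the toy quadratic loss.
    return [wi - ti for wi, ti in zip(w, target)]

def fw_local(center, target, radius, inner_steps):
    """Frank-Wolfe restricted to the L2 ball of given radius around `center`."""
    w = list(center)
    for k in range(inner_steps):
        g = grad(w, target)
        norm = math.sqrt(sum(gi * gi for gi in g))
        if norm == 0.0:
            break  # stationary point reached
        # Linear minimization oracle over the ball: the boundary point along
        # the negative *normalized* gradient, measured from the ball's center.
        s = [ci - radius * gi / norm for ci, gi in zip(center, g)]
        gamma = 2.0 / (k + 2.0)  # classic Frank-Wolfe step size
        w = [wi + gamma * (si - wi) for wi, si in zip(w, s)]
    return w

def train(w0, target, radius=0.5, inner_steps=5, outer_steps=40):
    """Outer loop: re-center (restart) the local region at each inner result."""
    w = list(w0)
    for _ in range(outer_steps):
        w = fw_local(w, target, radius, inner_steps)
    return w

if __name__ == "__main__":
    start, target = [3.0, -2.0], [1.0, 1.0]
    final = train(start, target)
    print(loss(start, target), loss(final, target))
```

The linear minimization step depends only on the gradient direction, which is where the normalized gradient mentioned in the abstract enters. With a fixed radius, the iterate in this toy problem only settles within a fraction of the radius of the minimizer, so a practical variant would likely shrink the region over time.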
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Machine Learning and ELM · Stochastic Gradient Optimization Techniques
Methods: Stochastic Gradient Descent
