Parallelizing non-linear sequential models over the sequence length
Yi Heng Lim, Qi Zhu, Joshua Selfridge, Muhammad Firmansyah Kasim

TL;DR
This paper introduces a parallel algorithm that significantly accelerates GPU evaluation of non-linear sequential models like RNNs and Neural ODEs, enabling faster training without accuracy loss, thus broadening their applicability to long sequence tasks.
Contribution
The authors present a novel parallel evaluation method for sequential models that does not require special architecture modifications, achieving up to 1000x speedup and enabling practical training of long sequence models.
Findings
GPU evaluation speed increased by up to 3 orders of magnitude.
Training time reduced by more than 10 times without accuracy loss.
Demonstrated effectiveness of GRUs on long time series classification.
Abstract
Sequential models, such as Recurrent Neural Networks and Neural Ordinary Differential Equations, have long suffered from slow training due to their inherent sequential nature. This bottleneck has persisted for many years because sequential models were widely thought to be impossible to parallelize. We challenge this long-held belief with a parallel algorithm that accelerates GPU evaluation of sequential models by up to 3 orders of magnitude without compromising output accuracy. The algorithm does not need any special structure in the sequential models' architecture, making it applicable to a wide range of architectures. Using our method, training sequential models can be more than 10 times faster than the common sequential method without any meaningful difference in the training results. Leveraging this accelerated training, we discovered the efficacy of the Gated Recurrent Unit in a long time…
Peer Reviews
Decision: ICLR 2024 poster
* The method is clearly presented. The authors do a good job contextualizing the approach with respect to direct multiple shooting, including recent work on adapting it to Neural ODEs (Appendix A.1, A.2)
* The efficiency evaluation is quite thorough, investigating the effect of batch size, state dimension and sequence length.

* It is not clear whether this approach improves over direct multiple shooting (which is also applicable to GRUs). Is there some drawback to the linearization you introduce? Do you require more steps to converge?
* The benchmarking is quite limited, both tasks showcased are small scale. The tasks do a good job at showing relative performance (end-to-end time) improvements, but they do not provide any insight on the method itself. Are there important hyperparameters, methods that could impact the

+ The method clearly has large speedups in certain training regimes, particularly for long sequences and small batch sizes
+ The theory of the method is clearly described, with proofs provided in the appendix.
+ The proofs in the appendix are clearly described.
+ The method is generally applicable to any sequential method, unlike previous methods which require specific architectures or structural assumptions.
+ Convergence is quicker than previous methods which did not incorporate Jacobians
+ P

+ The practical importance of this method is somewhat unclear. From a practical point of view, the DEER method is a mechanism to use more memory in order to speed up the forward and backwards pass. Therefore, many of the experiments, particularly figure 2, are not really a fair comparison, as it's very common to increase the batch size until the memory is fully utilized. In other words, the fairer comparison would be to fix the throughput (i.e. FLOP/s or memory usage) of DEER and the seque

- The proposed DEER method is well-motivated and presented clearly
- The theoretical results seem sound, though I only skimmed the proofs to follow the arguments, rather than check every detail line-by-line
- The method is general and can be applicable to a host of nonlinear differential equation methods such as neural ODEs and any nonlinear RNN (e.g. LSTM, GRU, etc) of broad interest to the sequence modeling community.
- The method has the advantage of theoretically having quadratic converge

- The biggest weakness of the paper is the empirical results, particularly related to performance. Only two tasks are considered, a synthetic physics system where HNNs are trained and the EigenWorms task where a GRU is trained.
- For the EigenWorms task, the GRU does not significantly perform that much better than several of the baselines and is outperformed by others. It is claimed that the DEER method enables faster training and thus experimentation to identify optimal GRU architectures. B
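
The reviews above refer to a linearization, to Jacobians, and to quadratic convergence. One way to read those remarks is as a Newton-style iteration: guess the whole state trajectory, linearize the recurrence around that guess, solve the resulting linear recurrence with a prefix scan, and repeat until the guess stops changing. The following is a minimal sketch under that reading, not the authors' implementation. The scalar `tanh` cell, the weight `w`, and the names `f`, `df_dx`, `combine`, `prefix_scan` and `deer_style_eval` are all hypothetical, and the scan is written as plain Python loops where a GPU version would run each pass in parallel.

```python
import math

def f(x_prev, u, w=0.9):
    """Toy non-linear cell (an assumption): x_t = tanh(w * x_{t-1} + u_t)."""
    return math.tanh(w * x_prev + u)

def df_dx(x_prev, u, w=0.9):
    """Jacobian of f with respect to the previous state (a scalar here)."""
    return w * (1.0 - math.tanh(w * x_prev + u) ** 2)

def sequential_eval(x0, inputs):
    """Reference: ordinary step-by-step evaluation."""
    xs, x = [], x0
    for u in inputs:
        x = f(x, u)
        xs.append(x)
    return xs

def combine(first, second):
    """Compose affine maps x -> a*x + b, applying `first` and then `second`."""
    a1, b1 = first
    a2, b2 = second
    return (a2 * a1, a2 * b1 + b2)

def prefix_scan(elems):
    """Inclusive scan in O(log T) passes; each pass is independent per index."""
    out = list(elems)
    step = 1
    while step < len(out):
        out = [out[i] if i < step else combine(out[i - step], out[i])
               for i in range(len(out))]
        step *= 2
    return out

def deer_style_eval(x0, inputs, max_iters=100, tol=1e-12):
    """Iteratively refine a guess for x_1..x_T via linearized recurrences."""
    guess = [0.0] * len(inputs)
    for it in range(1, max_iters + 1):
        prev = [x0] + guess[:-1]  # current guess for each x_{t-1}
        # Linearize: x_t ~= J_t * x_{t-1} + (f(prev_t) - J_t * prev_t)
        jac = [df_dx(p, u) for p, u in zip(prev, inputs)]
        off = [f(p, u) - j * p for p, u, j in zip(prev, inputs, jac)]
        # After the scan, entry t maps x0 directly to the new x_t.
        maps = prefix_scan(list(zip(jac, off)))
        new = [a * x0 + b for a, b in maps]
        err = max(abs(n - g) for n, g in zip(new, guess))
        guess = new
        if err < tol:
            return guess, it
    return guess, max_iters

inputs = [math.sin(0.3 * t) for t in range(32)]
reference = sequential_eval(0.1, inputs)
parallel, iters = deer_style_eval(0.1, inputs)
print(max(abs(p - r) for p, r in zip(parallel, reference)), iters)
```

The sketch also makes the memory trade-off raised in the reviews visible: every iteration holds the full trajectory, the Jacobians and the offsets at once, whereas the sequential loop only keeps the current state.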
Taxonomy
Topics: Time Series Analysis and Forecasting · Neural Networks and Applications · Stock Market Forecasting Methods
