Parallel Training of GRU Networks with a Multi-Grid Solver for Long Sequences
Gordon Euhyun Moon, Eric C. Cyr

TL;DR
This paper introduces a parallel-in-time training method for GRU networks based on a multigrid reduction in time (MGRIT) solver. Hierarchical correction of the hidden state yields up to 6.5x speedup over serial training on long sequences.
Contribution
The paper presents a new parallel-in-time training scheme for GRUs based on multigrid reduction in time (MGRIT): a long sequence is partitioned into shorter sub-sequences that are trained on different processors in parallel.
Findings
Achieves up to 6.5x speedup over serial training.
Performance improves with increasing sequence length.
Hierarchical correction of the hidden state accelerates end-to-end communication in both forward and backward propagation.
Abstract
Parallelizing Gated Recurrent Unit (GRU) networks is a challenging task, as the training procedure of GRU is inherently sequential. Prior efforts to parallelize GRU have largely focused on conventional parallelization strategies such as data-parallel and model-parallel training algorithms. However, when the given sequences are very long, existing approaches are still inevitably performance limited in terms of training time. In this paper, we present a novel parallel training scheme (called parallel-in-time) for GRU based on a multigrid reduction in time (MGRIT) solver. MGRIT partitions a sequence into multiple shorter sub-sequences and trains the sub-sequences on different processors in parallel. The key to achieving speedup is a hierarchical correction of the hidden state to accelerate end-to-end communication in both the forward and backward propagation phases of gradient descent.…
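The partitioning idea in the abstract can be illustrated with a small NumPy sketch. This is not the paper's MGRIT solver: it is a plain chunk-wise fixed-point iteration (block Jacobi in time) that splits a sequence into sub-sequences, processes every sub-sequence from a guessed initial hidden state, and then passes the resulting end states to the next sub-sequence. All names (`gru_step`, `chunked_forward`, the weight dictionary `params`) are hypothetical, the GRU cell omits biases, and the chunks run in a Python loop rather than on separate processors. Without a coarse-level correction this iteration needs as many sweeps as there are chunks to match the serial result, which is the kind of slow end-to-end communication that a hierarchical correction is meant to accelerate.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(h, x, p):
    """One standard GRU cell update (biases omitted for brevity)."""
    z = sigmoid(p["Wz"] @ x + p["Uz"] @ h)        # update gate
    r = sigmoid(p["Wr"] @ x + p["Ur"] @ h)        # reset gate
    n = np.tanh(p["Wn"] @ x + p["Un"] @ (r * h))  # candidate state
    return (1.0 - z) * n + z * h

def run_chunk(h_start, xs, p):
    """Propagate the hidden state serially through one sub-sequence."""
    h = h_start
    for x in xs:
        h = gru_step(h, x, p)
    return h

def serial_forward(xs, h0, p):
    """Reference: ordinary serial forward pass over the whole sequence."""
    return run_chunk(h0, xs, p)

def chunked_forward(xs, h0, p, num_chunks, num_iters):
    """Chunk-wise fixed-point iteration (an assumption, not MGRIT itself)."""
    chunks = np.array_split(xs, num_chunks)  # split along the time axis
    # starts[i] is the current guess for the hidden state entering chunk i.
    starts = [h0] + [np.zeros_like(h0) for _ in range(num_chunks - 1)]
    ends = starts
    for _ in range(num_iters):
        # Each chunk depends only on its own start guess, so these calls
        # are independent and could run on different processors.
        ends = [run_chunk(s, c, p) for s, c in zip(starts, chunks)]
        # Communicate: the end state of chunk i seeds chunk i + 1.
        starts = [h0] + ends[:-1]
    return ends[-1]

# Hypothetical toy problem: 32 time steps, 3 input features, 4 hidden units.
rng = np.random.default_rng(0)
hidden, features, steps = 4, 3, 32
params = {
    name: rng.normal(scale=0.5, size=(hidden, features if name[0] == "W" else hidden))
    for name in ("Wz", "Uz", "Wr", "Ur", "Wn", "Un")
}
xs = rng.normal(size=(steps, features))
h0 = np.zeros(hidden)
```

Calling `chunked_forward(xs, h0, params, num_chunks=4, num_iters=4)` reproduces `serial_forward(xs, h0, params)`, because each sweep makes at least one more chunk exact. With fewer sweeps the later chunks still start from stale guesses.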
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Advanced Neural Network Applications · Generative Adversarial Networks and Image Synthesis
Methods: Gated Recurrent Unit
