Ouroboros: On Accelerating Training of Transformer-Based Language Models
Qian Yang, Zhouyuan Huo, Wenlin Wang, Heng Huang, Lawrence Carin

TL;DR
This paper introduces Ouroboros, a model-parallel algorithm that accelerates training of Transformer-based language models. It avoids the backward locking of backpropagation, comes with a convergence guarantee, and delivers speedups while maintaining accuracy.
Contribution
It presents the first model-parallel algorithm for Transformer training that guarantees convergence and significantly improves training speed over existing methods.
Findings
Achieves a training speedup beyond what data parallelism provides.
Maintains comparable or better accuracy.
Is proven to converge to critical points for non-convex problems.
Abstract
Language models are essential for natural language processing (NLP) tasks, such as machine translation and text summarization. Remarkable performance has been demonstrated recently across many NLP domains via a Transformer-based language model with over a billion parameters, verifying the benefits of model size. Model parallelism is required if a model is too large to fit in a single computing device. Current methods for model parallelism either suffer from backward locking in backpropagation or are not applicable to language models. We propose the first model-parallel algorithm that speeds the training of Transformer-based language models. We also prove that our proposed algorithm is guaranteed to converge to critical points for non-convex problems. Extensive experiments on Transformer and Transformer-XL language models demonstrate that the proposed algorithm obtains a much faster…
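To make the "backward locking" problem mentioned in the abstract concrete, here is a minimal sketch of the timeline of one training step under naive model parallelism, where each device holds one slice of the network. It illustrates the problem only and is not the Ouroboros algorithm. The function names and the per-stage timings (1.0 for forward, 2.0 for backward) are hypothetical values chosen for illustration.

```python
def locked_schedule(num_devices, forward_time=1.0, backward_time=2.0):
    """Build the event timeline of one step of naive model parallelism.

    Each event is (device, phase, start, end). Timings are illustrative
    assumptions, not measurements from the paper.
    """
    events = []
    clock = 0.0
    # Forward pass: device k needs the activations produced by device k-1.
    for device in range(num_devices):
        events.append((device, "forward", clock, clock + forward_time))
        clock += forward_time
    # Backward pass: device k must wait for the gradient from device k+1.
    # This wait is the backward locking the abstract refers to.
    for device in reversed(range(num_devices)):
        events.append((device, "backward", clock, clock + backward_time))
        clock += backward_time
    return events


def utilization(events, num_devices):
    """Fraction of total device-time that is spent computing."""
    step_time = max(end for _, _, _, end in events)
    busy = sum(end - start for _, _, start, end in events)
    return busy / (step_time * num_devices)


if __name__ == "__main__":
    for k in (1, 2, 4, 8):
        print(k, utilization(locked_schedule(k), k))
```

Under these assumptions only one device is active at any moment, so utilization falls as 1/K with K devices. This idle time is what model-parallel methods that remove the locking aim to recover.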
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Cosine Annealing · Variational Dropout · Residual Connection · Adaptive Input Representations · Adaptive Softmax · Linear Warmup With Cosine Annealing · Transformer-XL
