Ouroboros: On Accelerating Training of Transformer-Based Language Models

Qian Yang; Zhouyuan Huo; Wenlin Wang; Heng Huang; Lawrence Carin

arXiv:1909.06695 · cs.CL · September 17, 2019

TL;DR

This paper introduces Ouroboros, a model-parallel algorithm that accelerates training of Transformer-based language models. It overcomes the backward locking problem, comes with a convergence guarantee, and delivers speedups while maintaining accuracy.

Contribution

It presents the first model-parallel algorithm for Transformer training that guarantees convergence and significantly improves training speed over existing methods.

Findings

1. Achieves a training speedup beyond what data parallelism provides.

2. Maintains comparable or better accuracy.

3. Proven to converge to critical points for non-convex problems.

Abstract

Language models are essential for natural language processing (NLP) tasks, such as machine translation and text summarization. Remarkable performance has been demonstrated recently across many NLP domains via a Transformer-based language model with over a billion parameters, verifying the benefits of model size. Model parallelism is required if a model is too large to fit in a single computing device. Current methods for model parallelism either suffer from backward locking in backpropagation or are not applicable to language models. We propose the first model-parallel algorithm that speeds the training of Transformer-based language models. We also prove that our proposed algorithm is guaranteed to converge to critical points for non-convex problems. Extensive experiments on Transformer and Transformer-XL language models demonstrate that the proposed algorithm obtains a much faster…
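The backward locking problem mentioned in the abstract can be illustrated with a toy schedule. The sketch below is not the Ouroboros algorithm; it only models naive model parallelism, where each device holds one module and a module cannot start its backward pass until the module after it has finished. The function names and the assumption that every forward or backward pass takes exactly one time step are illustrative choices, not details from the paper.

```python
def naive_model_parallel_schedule(num_modules):
    """Return (device, phase, time_step) tuples for one mini-batch.

    Assumption: one module per device, one time step per pass.
    """
    schedule = []
    step = 0
    # Forward pass: module k needs the activations of module k - 1.
    for k in range(num_modules):
        schedule.append((k, "forward", step))
        step += 1
    # Backward pass: module k is locked until module k + 1 has
    # produced the gradient with respect to its input.
    for k in reversed(range(num_modules)):
        schedule.append((k, "backward", step))
        step += 1
    return schedule


def device_utilization(num_modules):
    """Fraction of time steps during which any single device is busy."""
    total_steps = len(naive_model_parallel_schedule(num_modules))
    busy_steps_per_device = 2  # one forward step and one backward step
    return busy_steps_per_device / total_steps


if __name__ == "__main__":
    for device, phase, step in naive_model_parallel_schedule(4):
        print(f"t={step}: device {device} runs {phase}")
    print("utilization per device:", device_utilization(4))
```

Under these assumptions each of K devices is busy for only 2 of the 2K time steps, so adding devices makes the model fit in memory but does not by itself speed up training. Model-parallel methods that address backward locking aim to let modules work concurrently instead.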

Code & Models

Repositories

LaraQianYang/Ouroboros (official)

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Cosine Annealing · Variational Dropout · Residual Connection · Adaptive Input Representations · Adaptive Softmax · Linear Warmup With Cosine Annealing · Transformer-XL