ParaDySe: A Parallel-Strategy Switching Framework for Dynamic Sequence Lengths in Transformer
Zhixin Ou, Peng Liang, Jianchen Han, Baihui Liu, Linbo Qiao

TL;DR
ParaDySe is an adaptive framework that dynamically switches parallel strategies during Transformer training to optimize performance and memory usage for sequences of varying lengths.
Contribution
It introduces a novel on-the-fly strategy switching mechanism with cost models and heuristic algorithms for efficient Transformer training on dynamic sequence lengths.
Findings
Avoids out-of-memory failures on long sequences
Mitigates communication-parallelization cancellation on short sequences
Improves training efficiency for large language models
Abstract
Dynamic sequences with varying lengths are widely used in the training of Transformer-based large language models (LLMs). However, current training frameworks adopt a pre-defined static parallel strategy for these sequences, which causes either communication-parallelization cancellation on short sequences or out-of-memory failures on long sequences. To mitigate these issues, we propose ParaDySe, a novel adaptive Parallel strategy switching framework for Dynamic Sequences. ParaDySe enables on-the-fly adoption of the optimal strategy according to the immediate input sequence. It first implements modular function libraries for parallel strategies with unified tensor layout specifications, and then builds sequence-aware memory and time cost models with hybrid methods. Guided by these cost models, ParaDySe selects optimal layer-wise strategies for dynamic sequences via an efficient heuristic algorithm. By…
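The idea of cost-model-guided, layer-wise strategy selection can be illustrated with a minimal sketch. This is not the paper's algorithm or cost model: the two strategy names, the linear cost coefficients, and the greedy rule below are all assumptions made up for illustration. The sketch starts every layer on the fastest strategy for the current sequence length and, while the estimated memory exceeds the budget, swaps the layer whose change costs the least extra time per unit of memory saved.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass(frozen=True)
class Strategy:
    """Hypothetical parallel strategy with toy linear cost models."""
    name: str
    time_fixed: float      # assumed fixed per-layer overhead (e.g. communication)
    time_per_token: float  # assumed per-token compute cost
    mem_per_token: float   # assumed per-token memory cost

    def time(self, seq_len: int) -> float:
        return self.time_fixed + self.time_per_token * seq_len

    def memory(self, seq_len: int) -> float:
        return self.mem_per_token * seq_len


def select_strategies(
    strategies: Sequence[Strategy],
    num_layers: int,
    seq_len: int,
    mem_budget: float,
) -> Optional[List[str]]:
    """Greedy layer-wise selection; returns None if no plan fits the budget."""
    # Start with the fastest strategy on every layer.
    fastest = min(strategies, key=lambda s: s.time(seq_len))
    plan = [fastest] * num_layers
    total_mem = sum(s.memory(seq_len) for s in plan)

    while total_mem > mem_budget:
        # Collect every swap that strictly reduces one layer's memory.
        swaps = []
        for i, cur in enumerate(plan):
            for cand in strategies:
                saved = cur.memory(seq_len) - cand.memory(seq_len)
                if saved > 0:
                    extra = cand.time(seq_len) - cur.time(seq_len)
                    swaps.append((extra / saved, i, cand))
        if not swaps:
            return None  # every layer already uses its lowest-memory option
        # Apply the swap with the least extra time per unit of memory saved.
        _, i, cand = min(swaps, key=lambda t: t[0])
        total_mem -= plan[i].memory(seq_len) - cand.memory(seq_len)
        plan[i] = cand
    return [s.name for s in plan]


# Illustrative strategies only; names and numbers are not from the paper.
STRATEGIES = [
    Strategy("low_comm", time_fixed=1.0, time_per_token=0.004, mem_per_token=2.0),
    Strategy("mem_efficient", time_fixed=5.0, time_per_token=0.005, mem_per_token=1.0),
]

short_plan = select_strategies(STRATEGIES, num_layers=4, seq_len=512, mem_budget=100_000)
long_plan = select_strategies(STRATEGIES, num_layers=4, seq_len=16384, mem_budget=100_000)
```

Under these assumed numbers, the short sequence keeps the low-overhead strategy on every layer, while the long sequence forces some layers onto the memory-efficient strategy so the plan fits the budget. A real system would replace the toy linear models with measured or fitted cost models.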
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Parallel Computing and Optimization Techniques
