ParaDySe: A Parallel-Strategy Switching Framework for Dynamic Sequence Lengths in Transformer

Zhixin Ou; Peng Liang; Jianchen Han; Baihui Liu; Linbo Qiao

arXiv:2511.13198·cs.LG·November 18, 2025

TL;DR

ParaDySe is an adaptive framework that dynamically switches parallel strategies during Transformer training to optimize performance and memory usage for sequences of varying lengths.

Contribution

It introduces a novel on-the-fly strategy switching mechanism with cost models and heuristic algorithms for efficient Transformer training on dynamic sequence lengths.

Findings

1. Addresses out-of-memory failures on long sequences

2. Mitigates communication-parallelization cancellation on short sequences

3. Improves training efficiency for large language models

Abstract

Dynamic sequences with varying lengths have been widely used in the training of Transformer-based large language models (LLMs). However, current training frameworks adopt a pre-defined static parallel strategy for these sequences, causing either communication-parallelization cancellation on short sequences or out-of-memory errors on long sequences. To mitigate these issues, we propose ParaDySe, a novel adaptive Parallel strategy switching framework for Dynamic Sequences. ParaDySe enables on-the-fly adoption of the optimal strategy according to the immediate input sequence. It first implements modular function libraries for parallel strategies with unified tensor layout specifications, and then builds sequence-aware memory and time cost models with hybrid methods. Guided by these cost models, ParaDySe selects optimal layer-wise strategies for dynamic sequences via an efficient heuristic algorithm. By…
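The abstract does not spell out the cost models or the heuristic, so the following is only a minimal sketch of the general idea: given per-strategy time and memory cost functions of sequence length, pick a strategy for each layer that keeps the total memory under a budget while favoring the faster option. Everything here is an assumption made for illustration, including the names `Strategy` and `select_layer_strategies`, the two toy strategies, their linear cost formulas, and the greedy swap rule. It is not ParaDySe's actual algorithm or API.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Strategy:
    """A hypothetical parallel strategy with toy per-layer cost models."""
    name: str
    time_cost: Callable[[int], float]    # seq_len -> time per layer (arbitrary units)
    memory_cost: Callable[[int], float]  # seq_len -> memory per layer (arbitrary units)


def select_layer_strategies(
    seq_len: int,
    num_layers: int,
    strategies: List[Strategy],
    memory_budget: float,
) -> List[str]:
    """Greedy sketch: start with the fastest strategy on every layer, then
    switch layers one by one to the most memory-efficient strategy until
    the estimated total memory fits within the budget."""
    fastest = min(strategies, key=lambda s: s.time_cost(seq_len))
    leanest = min(strategies, key=lambda s: s.memory_cost(seq_len))

    plan = [fastest] * num_layers
    total_mem = num_layers * fastest.memory_cost(seq_len)
    saving = fastest.memory_cost(seq_len) - leanest.memory_cost(seq_len)

    for layer in range(num_layers):
        if total_mem <= memory_budget:
            break
        plan[layer] = leanest
        total_mem -= saving

    if total_mem > memory_budget:
        raise MemoryError(
            f"no layer-wise plan fits seq_len={seq_len} within the budget"
        )
    return [s.name for s in plan]


# Toy strategies (assumed numbers): one avoids communication overhead but
# uses more memory, the other pays a fixed communication cost to save memory.
STRATEGIES = [
    Strategy("comm_light", lambda n: n / 1000, lambda n: n / 250),
    Strategy("mem_lean", lambda n: 2.0 + n / 1000, lambda n: n / 1000),
]

short_plan = select_layer_strategies(1000, 4, STRATEGIES, memory_budget=45.0)
long_plan = select_layer_strategies(4000, 4, STRATEGIES, memory_budget=45.0)
print(short_plan)  # all layers use "comm_light"
print(long_plan)   # first two layers switch to "mem_lean"
```

With these toy numbers, the short sequence keeps the communication-light strategy on every layer, while the longer sequence forces two layers onto the memory-lean strategy to stay within the budget. A real system would replace the linear cost lambdas with fitted or profiled cost models.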

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Parallel Computing and Optimization Techniques