PackMamba: Efficient Processing of Variable-Length Sequences in Mamba Training

Haoran Xu, Ziqian Liu, Rong Fu, Zhongling Su, Zerui Wang, Zheng Cai, Zhilin Pei, and Xingcheng Zhang

arXiv:2408.03865 · cs.LG · August 22, 2024

Open Access

TL;DR

PackMamba is a training framework that improves the efficiency of processing variable-length sequences in Mamba models, achieving a 3.06x speedup on a 1.4B model and a 2.62x speedup on a 2.8B model.

Contribution

It introduces PackMamba, a high-throughput variant of Mamba that effectively handles variable-length sequences by modifying state-space model operators for better performance.
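To make the idea of an operator that "handles variable-length sequences" concrete, here is a minimal sketch of a linear recurrence run over a packed buffer. It is not the paper's kernel: the function names (`packed_scan`, `per_sequence_scan`), the scalar recurrence `h[t] = a[t] * h[t-1] + bx[t]`, and the choice to reset the state to zero at each sequence start are all assumptions made for illustration. Real state-space operators run as parallel GPU kernels over multi-dimensional states.

```python
def packed_scan(a, bx, cu_seqlens):
    """Run h[t] = a[t] * h[t-1] + bx[t] over a packed buffer.

    cu_seqlens holds cumulative sequence offsets, e.g. [0, 2, 5] for two
    sequences of lengths 2 and 3. The state is reset to zero at every
    sequence start (an assumption of this sketch), so no information
    crosses from one packed sequence into the next.
    """
    starts = set(cu_seqlens[:-1])
    h = 0.0
    out = []
    for t in range(len(a)):
        if t in starts:
            h = 0.0  # new sequence begins: drop the previous sequence's state
        h = a[t] * h + bx[t]
        out.append(h)
    return out


def per_sequence_scan(a, bx, cu_seqlens):
    """Reference: scan each sequence separately, then concatenate the results."""
    out = []
    for start, end in zip(cu_seqlens[:-1], cu_seqlens[1:]):
        out.extend(packed_scan(a[start:end], bx[start:end], [0, end - start]))
    return out


a = [0.5, 0.5, 0.5, 0.5, 0.5]
bx = [1.0, 2.0, 3.0, 4.0, 5.0]
cu_seqlens = [0, 2, 5]

packed_result = packed_scan(a, bx, cu_seqlens)
reference_result = per_sequence_scan(a, bx, cu_seqlens)
print(packed_result)
```

In this sketch the packed scan produces the same values as scanning each sequence on its own, which is the property a packing-aware operator needs to preserve.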

Findings

01. Achieves 3.06x speedup on the 1.4B model

02. Achieves 2.62x speedup on the 2.8B model

03. Demonstrates improved GPU utilization and efficiency

Abstract

With the evolution of large language models, traditional Transformer models become computationally demanding for lengthy sequences due to the quadratic growth in computation with respect to the sequence length. Mamba, emerging as a groundbreaking architecture in the field of generative AI, demonstrates remarkable proficiency in handling elongated sequences with reduced computational and memory complexity. Nevertheless, the existing training framework of Mamba is inefficient with variable-length sequence inputs. Either single-sequence training results in low GPU utilization, or batched processing of variable-length sequences padded to a maximum length incurs considerable memory and computational overhead. To address this problem, we analyze the performance of bottleneck operators in Mamba under diverse tensor shapes and propose PackMamba, a high-throughput Mamba that efficiently handles…
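The overhead of padding to a maximum length can be illustrated with a small token count. The sketch below is an illustration only: the helper names (`padded_token_count`, `pack`), the `cu_seqlens` boundary list, and the toy sequences are assumptions and do not come from the paper.

```python
from itertools import accumulate


def padded_token_count(lengths):
    """Token slots processed when every sequence is padded to the batch maximum."""
    return len(lengths) * max(lengths)


def pack(sequences):
    """Concatenate sequences into one flat buffer and record their boundaries.

    Returns (packed, cu_seqlens), where cu_seqlens[i] is the start offset of
    sequence i and cu_seqlens[-1] is the total number of real tokens.
    """
    packed = [tok for seq in sequences for tok in seq]
    cu_seqlens = [0] + list(accumulate(len(seq) for seq in sequences))
    return packed, cu_seqlens


# Toy batch with lengths 8, 2, and 3 (hypothetical token ids).
sequences = [[1, 2, 3, 4, 5, 6, 7, 8], [9, 10], [11, 12, 13]]
lengths = [len(seq) for seq in sequences]

packed, cu_seqlens = pack(sequences)
padded_slots = padded_token_count(lengths)
wasted_fraction = 1 - len(packed) / padded_slots

print(cu_seqlens)
print(padded_slots, len(packed))
```

For this toy batch, padding processes 24 token slots while the packed buffer holds only the 13 real tokens, so nearly half of the padded computation is spent on filler.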

Taxonomy

Topics: Video Analysis and Summarization

Methods: Linear Layer · Residual Connection · Multi-Head Attention · Attention Is All You Need · Position-Wise Feed-Forward Layer · Adam · Byte Pair Encoding · Softmax · Absolute Position Encodings · Dense Connections