Structured Recurrent Mixers for Massively Parallelized Sequence Generation
Benjamin L. Badger

TL;DR
The paper introduces the Structured Recurrent Mixer (SRM), an architecture with two interchangeable sequence representations: a sequence-parallel form for efficient training and a recurrent form for high-throughput inference. It reports improvements over existing linear-complexity models.
Contribution
It presents a new architecture that converts between parallel and recurrent representations without specialized kernels, enhancing training efficiency and inference throughput.
Findings
Greater training efficiency and input information capacity than other linear-complexity models.
12x higher throughput and 170x higher concurrency than Transformers.
SRMs can be trained effectively with reinforcement learning.
Abstract
Over the last two decades, language modeling has experienced a shift from the use of predominantly recurrent architectures that process tokens sequentially during training and inference to non-recurrent models that process sequence elements in parallel during training, which results in greater training efficiency and stability at the expense of lower inference throughput. Here we introduce the Structured Recurrent Mixer, an architecture that allows for algebraic conversion between a sequence parallel representation at train time and a recurrent representation at inference, notably without the need for specialized kernels or device-specific memory management. We show experimentally that this dual representation allows for greater training efficiency, higher input information capacity, and larger inference throughput and concurrency when compared to other linear complexity models. We…
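To make the idea of a dual representation concrete, here is a minimal sketch of a scalar linear recurrence computed two ways: token by token with a running state, and all at once through a lower-triangular mixing matrix. This is a generic toy, not the SRM architecture described in the paper; the function names, the `decay` parameter, and the input values are all assumptions chosen for illustration.

```python
# Toy illustration only: a scalar linear recurrence, NOT the SRM architecture.
# Function names, `decay`, and the inputs are hypothetical.

def recurrent_form(xs, decay):
    """Process tokens one at a time, carrying a single running state."""
    state = 0.0
    outputs = []
    for x in xs:
        state = decay * state + x
        outputs.append(state)
    return outputs


def parallel_form(xs, decay):
    """Compute every output independently via a lower-triangular matrix."""
    n = len(xs)
    # weights[t][s] = decay ** (t - s) for s <= t, and 0 above the diagonal
    weights = [
        [decay ** (t - s) if s <= t else 0.0 for s in range(n)]
        for t in range(n)
    ]
    return [sum(w * x for w, x in zip(row, xs)) for row in weights]


xs = [1.0, 2.0, 3.0, 4.0]
decay = 0.5
seq_out = recurrent_form(xs, decay)
par_out = parallel_form(xs, decay)
max_gap = max(abs(a - b) for a, b in zip(seq_out, par_out))
print(seq_out, par_out, max_gap)
```

In this toy, the recurrent form needs only constant state per step, which suits generation, while each row of the parallel form can be evaluated independently, which suits training over a whole sequence. Whether and how the paper's conversion resembles this is not stated in the text above.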
