Expressivity-Efficiency Tradeoffs for Hybrid Sequence Models
John Cooper, Ilias Diakonikolas, Mingchen Ma, and Frederic Sala

TL;DR
This paper investigates the expressivity-efficiency tradeoffs of hybrid sequence models that combine Transformer and state-space model layers, demonstrating their advantages in efficiency and generalization through theoretical proofs and empirical validation on synthetic tasks.
Contribution
The paper provides the first theoretical analysis of the benefits and limitations of hybrid models, and shows empirically that learned hybrids outperform non-hybrid models in parameter efficiency, length generalization, and robustness.
Findings
Hybrid models solve the synthetic tasks with fewer parameters and less working memory than non-hybrid models.
Learned hybrid models outperform non-hybrid models that have up to 6x as many parameters.
Hybrid models show better length generalization and robustness than non-hybrid models.
Abstract
Hybrid sequence models--combining Transformer and state-space model layers--seek to gain the expressive versatility of attention as well as the computational efficiency of state-space model layers. Despite burgeoning interest in hybrid models, we lack a basic understanding of the settings where--and underlying mechanisms through which--they offer benefits over their constituent models. In this paper, we study this question, focusing on a broad family of core synthetic tasks. For this family of tasks, we prove the existence of fundamental limitations for non-hybrid models. Specifically, any Transformer or state-space model that solves the underlying task requires either a large number of parameters or a large working memory. On the other hand, for two prototypical tasks within this family--namely selective copying and associative recall--we construct hybrid models of small size and…
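For readers unfamiliar with the two tasks named in the abstract, the sketch below generates toy instances of selective copying and associative recall. It follows the way these tasks are commonly set up in the sequence-modeling literature and is an assumption, not the paper's exact formulation; the function names, the noise token, and all parameters are hypothetical and chosen only for illustration.

```python
import random


def selective_copy_example(num_tokens, seq_len, vocab, noise="_", rng=None):
    """Scatter content tokens among noise tokens.

    The target is the content tokens in their original order, so a model
    must skip the noise and remember only the relevant positions.
    (Illustrative formulation; not taken from the paper.)
    """
    rng = rng or random.Random()
    positions = sorted(rng.sample(range(seq_len), num_tokens))
    content = [rng.choice(vocab) for _ in range(num_tokens)]
    seq = [noise] * seq_len
    for pos, tok in zip(positions, content):
        seq[pos] = tok
    return seq, content


def associative_recall_example(num_pairs, keys, values, rng=None):
    """Build a key/value context followed by a single query key.

    The target is the value that was paired with the query key.
    (Illustrative formulation; not taken from the paper.)
    """
    rng = rng or random.Random()
    chosen_keys = rng.sample(keys, num_pairs)  # distinct keys
    pairs = [(k, rng.choice(values)) for k in chosen_keys]
    context = [tok for pair in pairs for tok in pair]
    query, answer = rng.choice(pairs)
    return context + [query], answer


if __name__ == "__main__":
    rng = random.Random(0)
    print(selective_copy_example(3, 10, list("abc"), rng=rng))
    print(associative_recall_example(4, list("ABCDEF"), list("123"), rng=rng))
```

In both tasks the amount of information a model must carry forward grows with the number of content tokens or key/value pairs, which is the kind of working-memory pressure the abstract refers to.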
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Time Series Analysis and Forecasting · Speech Recognition and Synthesis
