Characterizing State Space Model and Hybrid Language Model Performance with Long Context
Saptarshi Mitra, Rachid Karami, Haocheng Xu, Sitao Huang, Hyoukjun Kwon

TL;DR
This paper benchmarks Transformer, State Space Model (SSM), and hybrid models on long-context inference on consumer GPUs. It finds that SSMs have the advantage at very long sequences because of their linear complexity and smaller memory footprint.
Contribution
The paper provides the first comprehensive benchmark of these models on consumer GPUs and highlights the suitability of SSMs for on-device long-context AI applications.
Findings
SSMs outperform Transformers at very long sequences (~57K tokens).
Transformers are faster at short sequences (<8K tokens).
Custom SSM kernels dominate inference runtime on edge platforms.
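The scaling argument behind these findings can be illustrated with a rough back-of-the-envelope sketch. It is not the paper's measurement methodology: the formulas are standard first-order estimates, and every model dimension below (32 layers, 8 KV heads, head size 128, SSM inner width 4096, state size 16, 2-byte elements) is a hypothetical value chosen only for illustration. The SSM estimate counts the recurrent state only and ignores smaller buffers such as convolution state.

```python
def attention_flops(seq_len: int, d_model: int) -> int:
    """Approximate FLOPs for QK^T plus the weighted sum over V in one
    attention layer during prefill: about 4 * L^2 * d (quadratic in L)."""
    return 4 * seq_len * seq_len * d_model


def kv_cache_bytes(seq_len: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """KV cache size: keys and values (factor 2) for every layer, KV head,
    and cached token. Grows linearly with sequence length."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


def ssm_state_bytes(n_layers: int, d_inner: int, d_state: int,
                    bytes_per_elem: int = 2) -> int:
    """Recurrent SSM state size: one (d_inner x d_state) state per layer.
    Independent of sequence length."""
    return n_layers * d_inner * d_state * bytes_per_elem


# Hypothetical model dimensions (assumptions, not taken from the paper).
N_LAYERS, N_KV_HEADS, HEAD_DIM = 32, 8, 128
D_INNER, D_STATE = 4096, 16

for seq_len in (8_192, 57_344):
    kv_mib = kv_cache_bytes(seq_len, N_LAYERS, N_KV_HEADS, HEAD_DIM) / 2**20
    ssm_mib = ssm_state_bytes(N_LAYERS, D_INNER, D_STATE) / 2**20
    print(f"{seq_len:>6} tokens: KV cache ~{kv_mib:.0f} MiB, "
          f"SSM state ~{ssm_mib:.0f} MiB")
```

Under these assumed dimensions, the KV cache grows in proportion to the number of tokens while the SSM state stays fixed, which is the kind of gap that matters on memory-limited consumer GPUs. Real models differ in precision, head layout, and kernel efficiency, so the absolute numbers here should not be read as results.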
Abstract
Emerging applications such as AR are driving demand for machine intelligence capable of processing continuous and/or long-context inputs on local devices. However, the currently dominant models based on the Transformer architecture suffer from quadratic computational and memory overhead, which hinders applications that must process long contexts. This has spurred a paradigm shift towards new architectures such as State Space Models (SSMs) and SSM-Transformer hybrid models, which provide near-linear scaling. In recent studies, this near-linear scaling has enabled efficient handling of millions of tokens while delivering high performance. Although such works present promising results, their workload characteristics in terms of computational performance and hardware resource requirements are not yet thoroughly explored, which limits our understanding of their implications at the system level…
Taxonomy
Topics: Topic Modeling
