How Many Heads Make an SSM? A Unified Framework for Attention and State Space Models

Ali Ghodsi

arXiv:2512.15115·cs.LG·December 18, 2025

TL;DR

This paper introduces a unified theoretical framework for understanding the expressivity and trainability of sequence models like Transformers and state space models, revealing fundamental trade-offs and equivalences.

Contribution

The paper presents a single framework covering attention-style and state-space sequence architectures, and derives theoretical results on their expressivity and gradient behavior.

Findings

01. Single-head attention is limited to a low-dimensional operator span.

02. Representing a linear SSM whose lag operators span a k-dimensional subspace requires k heads.

03. Attention layers allow distance-independent gradient paths, unlike stable linear dynamics.
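The first two findings can be illustrated numerically with a small NumPy sketch. It is not the paper's construction. It assumes the textbook linear SSM h_t = A h_{t-1} + B x_t, y_t = C h_t, whose lag operators are C A^tau B, and it models a head as a scalar coefficient times a fixed value map W_V. All names here (`ssm_lag_operators`, `span_dimension`, the specific matrices) are hypothetical choices for illustration. The sketch only counts span dimensions; it does not prove that k heads are sufficient.

```python
import numpy as np

def ssm_lag_operators(A, B, C, n):
    """Return the lag operators C @ A^tau @ B for tau = 0, ..., n-1."""
    ops, A_pow = [], np.eye(A.shape[0])
    for _ in range(n):
        ops.append(C @ A_pow @ B)
        A_pow = A_pow @ A
    return ops

def span_dimension(operators, tol=1e-10):
    """Dimension of the linear span of a list of equally shaped matrices."""
    stacked = np.stack([op.ravel() for op in operators])
    return int(np.linalg.matrix_rank(stacked, tol=tol))

rng = np.random.default_rng(0)
d, n = 3, 8  # token dimension and sequence length (illustrative values)

# Assumed stable SSM: diagonal A with three distinct nonzero eigenvalues.
A = np.diag([0.9, 0.5, -0.3])
B = rng.standard_normal((3, d))
C = rng.standard_normal((d, 3))
ssm_span = span_dimension(ssm_lag_operators(A, B, C, n))

# One head: every interaction operator is a scalar multiple of one W_V.
W_V = rng.standard_normal((d, d))
alpha = rng.random((n, n))
one_head_ops = [alpha[i, j] * W_V for i in range(n) for j in range(n)]
one_head_span = span_dimension(one_head_ops)

# H heads: each operator is a combination of H shared value maps,
# so the span dimension is at most H.
H = 3
alphas = rng.random((H, n, n))
W_Vs = rng.standard_normal((H, d, d))
multi_head_ops = [np.tensordot(alphas[:, i, j], W_Vs, axes=1)
                  for i in range(n) for j in range(n)]
three_head_span = span_dimension(multi_head_ops)

print(ssm_span, one_head_span, three_head_span)
```

Under these assumptions the SSM's lag operators span a 3-dimensional subspace while a single head spans only one dimension, so one head cannot reproduce them. Three heads reach the required dimension, which is the counting argument behind the head-count statement.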

Abstract

Sequence modeling has produced diverse architectures -- from classical recurrent neural networks to modern Transformers and state space models (SSMs) -- yet a unified theoretical understanding of expressivity and trainability trade-offs remains limited. We introduce a unified framework that represents a broad class of sequence maps via an input-dependent effective interaction operator $W_{ij}(X)$, making explicit two recurring construction patterns: (i) the Unified Factorized Framework (Explicit) (attention-style mixing), in which $W_{ij}(X)$ varies through scalar coefficients applied to shared value maps, and (ii) Structured Dynamics (Implicit) (state-space recurrences), in which $W_{ij}$ is induced by a latent dynamical system. Using this framework, we derive three theoretical results. First, we establish the Interaction Rank Gap: models in the Unified Factorized Framework, such as…
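As a rough illustration of the two construction patterns, the sketch below builds an interaction operator for each and applies it as y_i = sum_j W_ij x_j. That output form, the causal mask, the softmax coefficients, and every function name are assumptions made for this example, not details taken from the paper. For the state-space case it checks the standard fact that unrolling h_t = A h_{t-1} + B x_t, y_t = C h_t gives W_ij = C A^(i-j) B for j <= i.

```python
import numpy as np

def ssm_recurrence(A, B, C, X):
    """Run h_t = A h_{t-1} + B x_t, y_t = C h_t from a zero initial state."""
    h, Y = np.zeros(A.shape[0]), []
    for x in X:
        h = A @ h + B @ x
        Y.append(C @ h)
    return np.array(Y)

def ssm_interaction(A, B, C, n):
    """Implicit pattern: W[i, j] = C A^(i-j) B for j <= i, zero otherwise."""
    W = np.zeros((n, n, C.shape[0], B.shape[1]))
    for i in range(n):
        for j in range(i + 1):
            W[i, j] = C @ np.linalg.matrix_power(A, i - j) @ B
    return W

def attention_interaction(X, W_Q, W_K, W_V):
    """Explicit pattern: W[i, j] = alpha_ij(X) * W_V with causal softmax alpha."""
    scores = (X @ W_Q) @ (X @ W_K).T / np.sqrt(W_Q.shape[1])
    causal = np.tril(np.ones_like(scores, dtype=bool))
    scores = np.where(causal, scores, -np.inf)
    alpha = np.exp(scores - scores.max(axis=1, keepdims=True))
    alpha /= alpha.sum(axis=1, keepdims=True)
    return alpha[:, :, None, None] * W_V

def apply_interaction(W, X):
    """Compute y_i = sum_j W[i, j] @ x_j for all positions i."""
    return np.einsum("ijab,jb->ia", W, X)

rng = np.random.default_rng(1)
n, d = 6, 4                      # illustrative sequence length and width
X = rng.standard_normal((n, d))

A = np.diag([0.9, 0.5, -0.3])    # assumed stable dynamics, spectral radius 0.9
B = rng.standard_normal((3, d))
C = rng.standard_normal((d, 3))
W_ssm = ssm_interaction(A, B, C, n)
Y_recurrent = ssm_recurrence(A, B, C, X)
Y_operator = apply_interaction(W_ssm, X)

W_Q, W_K, W_V = (rng.standard_normal((d, d)) for _ in range(3))
W_attn = attention_interaction(X, W_Q, W_K, W_V)
Y_attn = apply_interaction(W_attn, X)

# Lag operators of a stable SSM are bounded by a geometrically decaying term.
lag_norms = [np.linalg.norm(W_ssm[i, 0], 2) for i in range(n)]
decay_bound = [np.linalg.norm(C, 2) * np.linalg.norm(B, 2) * 0.9**i
               for i in range(n)]
```

In this sketch `Y_recurrent` and `Y_operator` agree, which shows the recurrence and the operator view describe the same map. The attention operator depends on position only through the scalar `alpha`, whereas the SSM operator at lag i is bounded by a term that shrinks like 0.9**i.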

Taxonomy

Topics: Ferroelectric and Negative Capacitance Devices · Generative Adversarial Networks and Image Synthesis · Neural Networks and Reservoir Computing