TL;DR
Sessa is a decoder architecture that places attention inside a recurrent feedback path, aiming to retain and selectively retrieve long-range information better than Transformers and state-space models.
Contribution
The paper presents Sessa, a new model that combines attention and recurrence, achieving power-law memory decay and flexible long-range information retrieval.
Findings
Sessa achieves power-law memory tails with decay rate $O(\ell^{-\beta})$
Sessa outperforms baselines on long-context benchmarks
Sessa maintains competitive performance on short-context tasks
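
To give a feel for why a power-law tail matters, the sketch below compares a tail of the form $\ell^{-\beta}$ with an exponential tail at increasing lags $\ell$. The values of `BETA` and `RATE` are illustrative assumptions, not numbers from the paper, and the function names are hypothetical.

```python
import math

# Illustrative constants only; neither value is taken from the paper.
BETA = 0.5   # assumed power-law exponent
RATE = 0.01  # assumed exponential decay rate per step


def power_law_tail(lag: int, beta: float) -> float:
    """Influence remaining after `lag` steps under a power-law tail lag^(-beta)."""
    return lag ** (-beta)


def exponential_tail(lag: int, rate: float) -> float:
    """Influence remaining after `lag` steps under exponential decay exp(-rate * lag)."""
    return math.exp(-rate * lag)


if __name__ == "__main__":
    print(f"{'lag':>6}  {'power law':>12}  {'exponential':>12}")
    for lag in (10, 100, 1000, 10000):
        p = power_law_tail(lag, BETA)
        e = exponential_tail(lag, RATE)
        print(f"{lag:>6}  {p:>12.3e}  {e:>12.3e}")
```

At short lags the exponential tail can be the larger of the two, but at long lags it collapses far faster than any power law, which is the regime that long-context retention is concerned with.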
Abstract
Modern sequence modeling is dominated by two families: Transformers, whose self-attention can access arbitrary elements of the visible sequence, and structured state-space models, which propagate information through an explicit recurrent state. These mechanisms face different limitations on long contexts: when attention is diffuse, the influence of individual tokens is diluted across the effective support, while recurrent state propagation can lose long-range sensitivity unless information is actively preserved. As a result, both mechanisms face challenges in preserving and selectively retrieving information over long contexts. We propose Sessa, a decoder that places attention inside a recurrent feedback path. This creates many attention-based paths through which past tokens can influence future states, rather than relying on a single attention read or a single recurrent chain. We prove…
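
The idea of many attention-based paths feeding back into the state can be illustrated with a toy scalar recurrence. This is a sketch under strong assumptions and is not the paper's architecture: each state is taken to be $s_t = x_t + \gamma \sum_{j<t} a_{t,j} s_j$ with uniform causal attention $a_{t,j} = 1/t$ and a hypothetical feedback gain $\gamma \in (0, 1)$. Under these assumptions the sensitivity of $s_t$ to the first input $x_0$ sums over every chain of attention reads from position 0 to position $t$.

```python
def influence_of_first_token(num_steps: int, gain: float) -> list[float]:
    """Sensitivity d s_t / d x_0 for the toy recurrence
        s_t = x_t + gain * sum_{j < t} (1 / t) * s_j
    with uniform causal attention. This is a toy model, not the Sessa layer.
    """
    influence = [1.0]   # d s_0 / d x_0 = 1
    running_sum = 1.0   # sum of influences over all earlier positions
    for t in range(1, num_steps):
        # Uniform attention gives weight 1/t to each of the t earlier states.
        current = gain * running_sum / t
        influence.append(current)
        running_sum += current
    return influence


if __name__ == "__main__":
    gain = 0.5  # hypothetical feedback gain
    infl = influence_of_first_token(4001, gain)
    for t in (10, 100, 1000, 4000):
        print(f"t={t:>5}  influence={infl[t]:.4e}")
    # Doubling the lag scales the influence by roughly 2 ** (gain - 1),
    # i.e. the tail of this toy model behaves like t ** (gain - 1).
    print(infl[2000] / infl[1000], 2 ** (gain - 1))
```

In this toy setting a single uniform attention read would weight the first token by $1/t$, whereas the feedback loop yields a slower decay of roughly $t^{\gamma - 1}$. Whether Sessa's proven tail arises in the same way depends on details that the truncated abstract does not show.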
