Provably Shorter Scratchpads in Hybrid DeltaNet-Attention Decoders
Tomasz Steifer

TL;DR
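This paper analyzes hybrid recurrent-attention decoders in language models and proves a formal expressivity separation: under a constant-precision assumption, a hybrid of Gated DeltaNet and Gated Attention solves a parity-conditioned retrieval task with a constant-size scratchpad, which neither pure architecture matches.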
Contribution
It provides a theoretical comparison showing that a hybrid architecture can solve parity-conditioned retrieval with a constant-size scratchpad, whereas pure Gated DeltaNet and pure Gated Attention models cannot.
Findings
Hybrid decoders solve parity-conditioned retrieval with an O(1) scratchpad under a constant-precision assumption.
Pure Gated DeltaNet models admit no comparable constant-scratchpad solution.
Pure Gated Attention models require a scratchpad of at least polynomial size.
Abstract
We investigate the expressive power of hybrid recurrent-attention decoders, a class of architectures used in recent open-source language models such as Qwen3-Next and its successors. These models combine Gated Attention heads with recurrent Gated DeltaNet heads. Is there a formal advantage, in terms of model expressivity or efficiency, to such a hybrid architecture? We show that there is. We define a parity-conditioned retrieval task and show that, under a constant-precision assumption, a Qwen-style hybrid of Gated DeltaNet and Gated Attention solves this task with a constant-size scratchpad, or equivalently a constant number of chain-of-thought steps. In contrast, no similar solution exists for pure Gated DeltaNet models, while pure Gated Attention requires a scratchpad of at least polynomial size.
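For readers unfamiliar with the recurrent half of the hybrid, the following is a minimal sketch of one gated delta-rule state update, in the form commonly given in the Gated DeltaNet literature: S_t = alpha_t * S_{t-1} (I - beta_t k_t k_t^T) + beta_t v_t k_t^T. It is an illustration under stated assumptions, not the construction used in the paper's proofs. The function names `gated_delta_step` and `read`, the plain-list matrix layout, and the toy inputs are all hypothetical.

```python
def gated_delta_step(S, k, v, alpha, beta):
    """One gated delta-rule update of a d_v x d_k state matrix S (list of rows).

    Assumed form: S_new = alpha * S (I - beta k k^T) + beta v k^T,
    where alpha is a decay gate and beta a write strength, both in [0, 1].
    """
    d_v, d_k = len(S), len(k)
    # S k: the value the current state returns for key k
    Sk = [sum(S[i][j] * k[j] for j in range(d_k)) for i in range(d_v)]
    return [
        [alpha * (S[i][j] - beta * Sk[i] * k[j]) + beta * v[i] * k[j]
         for j in range(d_k)]
        for i in range(d_v)
    ]


def read(S, q):
    """Read out the state with query q, i.e. compute S q."""
    return [sum(row[j] * q[j] for j in range(len(q))) for row in S]


# Toy example (hypothetical values): write a value under a unit key,
# then overwrite it under the same key with alpha = beta = 1.
S = [[0.0, 0.0], [0.0, 0.0]]
S = gated_delta_step(S, k=[1.0, 0.0], v=[3.0, 4.0], alpha=1.0, beta=1.0)
first = read(S, [1.0, 0.0])
S = gated_delta_step(S, k=[1.0, 0.0], v=[5.0, 6.0], alpha=1.0, beta=1.0)
second = read(S, [1.0, 0.0])
```

With a unit-norm key and beta = 1, the second write replaces the value stored under that key rather than adding to it, which is the fixed-size, overwrite-style memory that the attention heads in the hybrid complement.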
