Provably Shorter Scratchpads in Hybrid DeltaNet-Attention Decoders

Tomasz Steifer

arXiv:2605.16640·cs.LG·May 19, 2026

TL;DR

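This paper analyzes the expressive power of hybrid recurrent-attention decoders used in recent language models. It proves that, on a parity-conditioned retrieval task, a hybrid of Gated DeltaNet and Gated Attention needs only a constant-length scratchpad, while pure models of either kind do not match it.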

Contribution

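It gives a formal separation under a constant-precision assumption: a hybrid architecture solves parity-conditioned retrieval with a constant-length scratchpad, whereas pure Gated DeltaNet models have no similar solution and pure Gated Attention models need at least a polynomial scratchpad.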

Findings

01

Hybrid decoders solve parity-conditioned retrieval with an O(1) scratchpad, i.e. O(1) chain-of-thought steps.

02

Pure Gated DeltaNet models admit no similar constant-scratchpad solution.

03

Pure Gated Attention models need at least a polynomial-size scratchpad.

Abstract

We investigate the expressive power of hybrid recurrent-attention decoders, a class of architectures used in recent open-source language models such as Qwen3-Next and its successors. These models combine Gated Attention heads with recurrent Gated DeltaNet heads. Is there a formal advantage, in terms of model expressivity or efficiency, to such a hybrid architecture? We show that there is. We define a parity-conditioned retrieval task and show that, under a constant-precision assumption, a Qwen-style hybrid of Gated DeltaNet and Gated Attention solves this task with a constant scratchpad, or equivalently $O(1)$ chain-of-thought steps. In contrast, no similar solution exists for pure Gated DeltaNet models, while pure Gated Attention requires at least a polynomial scratchpad.
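For rough intuition only, the sketch below illustrates one plausible reading of "parity-conditioned retrieval": the parity of a bit string selects which stored value to return for a query key. This is not the paper's formal task definition, which is not stated on this page, and it is not the paper's construction. The function names, the table layout, and the example data are all hypothetical. The point is only that the task has two ingredients: a running parity, which fits in a constant-size recurrent state, and a key lookup, which resembles what attention does.

```python
from typing import Dict, List, Tuple


def running_parity(bits: List[int]) -> int:
    """XOR of all bits, computed with a single constant-size state.

    Assumption: this stands in for the kind of state tracking a
    recurrent head could carry; it is not the paper's construction.
    """
    state = 0
    for b in bits:
        state ^= b  # one-bit state update per input symbol
    return state


def parity_conditioned_lookup(
    bits: List[int],
    table: Dict[Tuple[str, int], str],
    query: str,
) -> str:
    """Return the value stored under (query, parity of bits).

    Hypothetical toy format: each key has one value for even parity (0)
    and one for odd parity (1). The lookup mimics content-based retrieval.
    """
    return table[(query, running_parity(bits))]


# Toy data, invented purely for illustration.
table = {
    ("a", 0): "apple",
    ("a", 1): "avocado",
    ("b", 0): "banana",
    ("b", 1): "blueberry",
}

answer = parity_conditioned_lookup([1, 0, 1, 1], table, "a")
```

Here the bit string `[1, 0, 1, 1]` contains three ones, so its parity is odd and the lookup for key `"a"` returns the odd-parity entry.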
