Depth-Recurrent Attention Mixtures: Giving Latent Reasoning the Attention it Deserves

Jonas Knupp; Jan Hendrik Metzen; Jeremias Bohn; Georg Groh; Kristian Kersting

arXiv:2601.21582·cs.AI·January 30, 2026

Open Access

TL;DR

This paper introduces Dreamer, a modular framework of depth-recurrent attention mixtures for latent reasoning. On language reasoning benchmarks it reduces the number of training tokens needed to reach a given accuracy and increases expert selection diversity across depths.

Contribution

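The paper proposes a modular framework of depth-recurrent attention mixtures that combines sequence attention, depth attention, and sparse expert attention. The framework alleviates the hidden-size bottleneck and lets depth-recurrent reasoning models scale efficiently.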

Findings

01

Models require 2 to 8x fewer training tokens to reach the same accuracy as FLOP-, parameter-, and memory-matched SOTA models.

02

Models outperform ca. 2x larger SOTA models trained on the same number of training tokens.

03

Shows increased expert selection diversity across depths.

Abstract

Depth-recurrence facilitates latent reasoning by sharing parameters across depths. However, prior work lacks combined FLOP-, parameter-, and memory-matched baselines, underutilizes depth-recurrence due to partially fixed layer stacks, and ignores the bottleneck of constant hidden-sizes that restricts many-step latent reasoning. To address this, we introduce a modular framework of depth-recurrent attention mixtures (Dreamer), combining sequence attention, depth attention, and sparse expert attention. It alleviates the hidden-size bottleneck through attention along depth, decouples scaling dimensions, and allows depth-recurrent models to scale efficiently and effectively. Across language reasoning benchmarks, our models require 2 to 8x fewer training tokens for the same accuracy as FLOP-, parameter-, and memory-matched SOTA, and outperform ca. 2x larger SOTA models with the same training…
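As a rough illustration of what "attention along depth" could look like, the NumPy sketch below applies one shared set of weights repeatedly and lets each token attend over its own hidden states from all earlier depth steps, so information is not forced through a single fixed-size state. This is an assumption-laden toy, not the paper's architecture: the function names (`depth_attention`, `depth_recurrent_forward`), the single-head dot-product form, and the residual update are all hypothetical, and sequence attention and sparse expert attention are omitted.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def depth_attention(history, w_q, w_k, w_v):
    """Hypothetical single-head attention along the depth axis.

    history: array of shape (depth, seq, d) holding each token's hidden
             state at every depth step computed so far.
    Returns the mixed states (seq, d) and the attention weights (seq, depth).
    """
    q = history[-1] @ w_q  # queries from the latest depth, (seq, d)
    k = history @ w_k      # keys from every depth, (depth, seq, d)
    v = history @ w_v      # values from every depth, (depth, seq, d)
    # Each token scores only its own states at earlier depths.
    scores = np.einsum("sd,tsd->st", q, k) / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)
    mixed = np.einsum("st,tsd->sd", weights, v)
    return mixed, weights


def depth_recurrent_forward(x, params, num_steps):
    """Apply the same parameters num_steps times (assumed residual update)."""
    history = x[None]  # depth 0 is the input, shape (1, seq, d)
    for _ in range(num_steps):
        mixed, _ = depth_attention(history, *params)
        h = history[-1] + mixed
        history = np.concatenate([history, h[None]], axis=0)
    return history


rng = np.random.default_rng(0)
d_model, seq_len = 8, 5
params = tuple(rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(3))
x = rng.normal(size=(seq_len, d_model))
history = depth_recurrent_forward(x, params, num_steps=4)
```

Because the parameters are shared across steps, the parameter count stays constant as `num_steps` grows, while the set of states available to attend over grows with depth.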

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling