Olmo Hybrid: From Theory to Practice and Back

William Merrill; Yanhong Li; Tyler Romero; Anej Svete; Caia Costello; Pradeep Dasigi; Dirk Groeneveld; David Heineman; Bailey Kuehl; Nathan Lambert; Chuan Li; Kyle Lo; Saumya Malik; DJ Matusz; Benjamin Minixhofer; Jacob Morrison; Luca Soldaini; Finbarr Timbers; Pete Walsh; Noah A. Smith; Hannaneh Hajishirzi; Ashish Sabharwal

arXiv:2604.03444 · cs.LG · April 20, 2026

TL;DR

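This paper introduces Olmo Hybrid, a 7B-parameter hybrid language model that combines recurrence and attention: it is largely comparable to Olmo 3 7B, but with the sliding window layers replaced by Gated DeltaNet layers. The authors argue for hybrids on three fronts: greater theoretical expressivity, better results on pretraining and mid-training evaluations, and more efficient scaling than pure transformers.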

Contribution

The paper provides theoretical evidence that hybrid models can express tasks beyond both transformers and linear RNNs (code execution is the example given), and empirical evidence that they scale more efficiently than pure transformers, validated in practice by training Olmo Hybrid.

Findings

01. Olmo Hybrid outperforms Olmo 3 in pretraining and mid-training evaluations.

02. Hybrid models scale more efficiently than pure transformers.

03. Theoretical analysis shows hybrid models can express tasks beyond both transformers and linear RNNs.

Abstract

Recent work has demonstrated the potential of non-transformer language models, especially linear recurrent neural networks (RNNs) and hybrid models that mix recurrence and attention. Yet there is no consensus on whether the potential benefits of these new architectures justify the risk and effort of scaling them up. To address this, we provide evidence for the advantages of hybrid models over pure transformers on several fronts. First, theoretically, we show that hybrid models do not merely inherit the expressivity of transformers and linear RNNs, but can express tasks beyond both, such as code execution. Putting this theory to practice, we train Olmo Hybrid, a 7B-parameter model largely comparable to Olmo 3 7B but with the sliding window layers replaced by Gated DeltaNet layers. We show that Olmo Hybrid outperforms Olmo 3 across standard pretraining and mid-training evaluations,…
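For readers unfamiliar with the recurrent half of the hybrid, the following toy sketch illustrates the gated delta rule as it is commonly written for Gated DeltaNet layers: S_t = alpha_t * S_{t-1} (I - beta_t k_t k_t^T) + beta_t v_t k_t^T, with output o_t = S_t q_t. It is not taken from the paper or from the Olmo Hybrid code. The function name `gated_delta_step`, the scalar gates `alpha` and `beta`, and the pure-Python list representation are assumptions chosen for illustration, and real layers also add projections, normalization, and chunked parallel kernels that are omitted here.

```python
def gated_delta_step(S, q, k, v, alpha, beta):
    """One step of a toy gated delta rule (illustrative, not the paper's code).

    S     : d_v x d_k memory matrix as a list of rows
    q, k  : query and key vectors of length d_k (k assumed unit-norm)
    v     : value vector of length d_v
    alpha : decay gate in [0, 1] applied to the old memory
    beta  : write strength in [0, 1]
    """
    d_v, d_k = len(S), len(k)
    # What the current memory returns for key k: pred = S k
    pred = [sum(S[i][j] * k[j] for j in range(d_k)) for i in range(d_v)]
    # S_new = alpha * (S - beta * pred k^T) + beta * v k^T
    S_new = [
        [alpha * (S[i][j] - beta * pred[i] * k[j]) + beta * v[i] * k[j]
         for j in range(d_k)]
        for i in range(d_v)
    ]
    # Read out with the query: o = S_new q
    out = [sum(S_new[i][j] * q[j] for j in range(d_k)) for i in range(d_v)]
    return S_new, out


# Usage: write a value under a key, then overwrite it under the same key.
S = [[0.0, 0.0], [0.0, 0.0]]
S, o1 = gated_delta_step(S, q=[1.0, 0.0], k=[1.0, 0.0], v=[2.0, 3.0], alpha=1.0, beta=1.0)
S, o2 = gated_delta_step(S, q=[1.0, 0.0], k=[1.0, 0.0], v=[5.0, 7.0], alpha=1.0, beta=1.0)
```

With a unit-norm key and `beta = 1`, the second write replaces the stored value rather than adding to it, which is the behavior that distinguishes the delta rule from plain additive linear attention. Setting `alpha` below 1 additionally decays the existing memory before the write.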
