A Systematic Analysis of Hybrid Linear Attention

Dustin Wang; Rui-Jie Zhu; Steven Abreu; Yong Shan; Taylor Kergan; Yuqi Pan; Yuhong Chou; Zheng Li; Ge Zhang; Wenhao Huang; Jason Eshraghian

arXiv:2507.06457·cs.CL·July 10, 2025

TL;DR

This paper systematically evaluates various linear attention mechanisms and hybrid architectures in Transformers, revealing how different configurations impact language modeling and recall performance, and providing open-source models for future research.

Contribution

The paper offers a comprehensive analysis of linear and hybrid attention models, backed by 72 trained and released models that span six linear attention variants and five hybridization ratios at two scales (340M and 1.3B parameters), for benchmarking and further study.

Findings

01. Superior standalone linear models do not always perform best in hybrids.

02. Recall improves significantly with more full attention layers, especially below a 3:1 linear-to-full ratio.

03. Gating mechanisms and hierarchical recurrence are key for effective hybrid models.
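
To make the hybridization ratio in the recall finding concrete, here is a minimal sketch, assuming the ratio is read as linear-to-full and that a ratio of n:1 means n linear-attention layers followed by one full-attention layer, repeated through the stack. The function name `hybrid_layout`, the placement rule, and the 24-layer depth are illustrative assumptions, not the paper's configuration.

```python
def hybrid_layout(num_layers: int, linear_per_full: int) -> list[str]:
    """Return a list of layer types for a hybrid stack.

    Hypothetical placement rule: every block of `linear_per_full` linear
    layers is followed by one full-attention layer.
    """
    if num_layers <= 0 or linear_per_full < 0:
        raise ValueError("num_layers must be positive and linear_per_full non-negative")
    period = linear_per_full + 1
    return [
        "full" if (i + 1) % period == 0 else "linear"
        for i in range(num_layers)
    ]


# Assumed depth of 24 layers, used only for illustration.
layout = hybrid_layout(24, 3)          # 3:1 linear-to-full
full_fraction = layout.count("full") / len(layout)
```

Under these assumptions a 3:1 ratio gives one full-attention layer in every four, so lowering the ratio raises the share of layers that keep a full key-value cache.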

Abstract

Transformers face quadratic complexity and memory issues with long sequences, prompting the adoption of linear attention mechanisms using fixed-size hidden states. However, linear models often suffer from limited recall performance, leading to hybrid architectures that combine linear and full attention layers. Despite extensive hybrid architecture research, the choice of linear attention component has not been deeply explored. We systematically evaluate various linear attention models across generations, from vector recurrences to advanced gating mechanisms, both standalone and hybridized. To enable this comprehensive analysis, we trained and open-sourced 72 models: 36 at 340M parameters (20B tokens) and 36 at 1.3B parameters (100B tokens), covering six linear attention variants across five hybridization ratios. Benchmarking on standard language modeling and recall tasks reveals that…
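
The "fixed-size hidden state" mentioned in the abstract can be illustrated with the simplest ungated, unnormalized form of causal linear attention. This is a minimal NumPy sketch of that textbook recurrence, not any of the six variants evaluated in the paper; the function names, shapes, and random inputs are assumptions chosen for illustration.

```python
import numpy as np


def linear_attention_recurrent(q, k, v):
    """Causal linear attention as a recurrence over a (d_k, d_v) state.

    The state size does not depend on the sequence length T.
    """
    T, d_k = q.shape
    d_v = v.shape[1]
    state = np.zeros((d_k, d_v))
    out = np.empty((T, d_v))
    for t in range(T):
        state += np.outer(k[t], v[t])   # S_t = S_{t-1} + k_t v_t^T
        out[t] = q[t] @ state           # o_t = q_t^T S_t
    return out, state


def linear_attention_parallel(q, k, v):
    """Same computation in quadratic form: causally masked (Q K^T) V."""
    scores = np.tril(q @ k.T)           # keep only positions s <= t
    return scores @ v


rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((16, 4)) for _ in range(3))
out_rec, final_state = linear_attention_recurrent(q, k, v)
out_par = linear_attention_parallel(q, k, v)
```

Both functions compute the same outputs, but the recurrent form carries only a `d_k × d_v` matrix between steps, whereas full softmax attention must keep every past key and value. Compressing the whole history into that one matrix is the usual explanation for the weaker recall that motivates hybrid stacks.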

Taxonomy

Topics: Advanced Neural Network Applications · Topic Modeling · Big Data and Digital Economy