Mixture of Layers with Hybrid Attention

Ivan Ternovtsii; Yurii Bilak

arXiv:2605.09516 · cs.LG · May 12, 2026

TL;DR

The paper proposes Mixture of Layers (MoL), a transformer architecture that replaces each full-width block with K parallel thin blocks selected by top-k routing, and adds hybrid attention so that sparsely routed blocks still have access to global context.

Contribution

It introduces MoL with hybrid attention, which pairs one shared softmax-attention block for global context with Gated DeltaNet linear attention in the routed blocks, so that sparse block routing can scale to many blocks.
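To make the linear-attention half of the hybrid concrete, here is a minimal NumPy sketch of the gated delta rule as it is commonly written for Gated DeltaNet: the state is decayed by a gate, then corrected toward the new value along the current key. This is an illustration only. The function name, the per-token scalar gates `alpha` and `beta`, the key normalization, and the single-head layout are assumptions, not the paper's parameterization, which this summary does not give.

```python
import numpy as np

def gated_delta_attention(q, k, v, alpha, beta):
    """Sequential (non-chunked) gated delta rule for one head.

    q, k: (T, d_k); v: (T, d_v); alpha, beta: (T,) with values in [0, 1].
    alpha is the decay gate, beta the write strength (both assumed given).
    """
    T, d_k = k.shape
    d_v = v.shape[1]
    S = np.zeros((d_v, d_k))           # fast-weight memory mapping keys to values
    out = np.empty((T, d_v))
    for t in range(T):
        k_t = k[t] / (np.linalg.norm(k[t]) + 1e-6)  # unit-norm key (assumption)
        S = alpha[t] * S                            # gated decay of old memory
        pred = S @ k_t                              # value currently stored for k_t
        S = S + beta[t] * np.outer(v[t] - pred, k_t)  # delta-rule correction
        out[t] = S @ q[t]                           # causal read-out for token t
    return out
```

The state `S` has a fixed size of d_v by d_k regardless of sequence length, which is what makes this kind of attention cheap inside many routed blocks. A shared softmax block would then be responsible for exact token-to-token lookups.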

Findings

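01  MoL routes tokens among parallel thin blocks of reduced dimensionality (d_thin << d_model).

02  Hybrid attention restores global context coverage that is lost when each routed block sees fewer tokens.

03  The approach is reported to improve transformer scalability and performance.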

Abstract

Standard Mixture-of-Experts (MoE) transformers route tokens to expert subnetworks within each layer, but the layer structure itself remains monolithic. We introduce Mixture of Layers (MoL), which replaces full-width transformer blocks (d_model) with K parallel thin blocks at reduced dimensionality (d_thin << d_model), connected via learned down/up projections and composed via top-k block routing. Scaling sparse block routing to many blocks creates an attention coverage problem, as each block sees fewer tokens. We address this by introducing hybrid attention, which pairs one shared softmax block for global context with Gated DeltaNet linear attention in routed blocks.
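The block-routing idea in the abstract can be sketched as follows. This is a hedged NumPy illustration, not the authors' implementation: the function name `mol_layer`, the linear router, the softmax renormalization over the selected blocks, the residual connection, and the use of arbitrary callables as stand-ins for the thin transformer blocks are all assumptions made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def mol_layer(x, w_router, w_down, w_up, thin_blocks, top_k):
    """One Mixture-of-Layers step over T tokens (illustrative sketch).

    x:        (T, d_model) token representations
    w_router: (d_model, K) router weights (hypothetical linear router)
    w_down:   (K, d_model, d_thin) learned down-projections, one per block
    w_up:     (K, d_thin, d_model) learned up-projections, one per block
    thin_blocks: list of K callables mapping (n, d_thin) -> (n, d_thin)
    """
    K = w_router.shape[1]
    logits = x @ w_router                                  # (T, K)
    top_idx = np.argsort(-logits, axis=-1)[:, :top_k]      # chosen blocks per token
    top_logits = np.take_along_axis(logits, top_idx, axis=-1)
    gates = softmax(top_logits, axis=-1)                   # weights over chosen blocks
    out = np.zeros_like(x)
    for b in range(K):
        tok, slot = np.nonzero(top_idx == b)               # tokens routed to block b
        if tok.size == 0:
            continue
        h = x[tok] @ w_down[b]                             # d_model -> d_thin
        h = thin_blocks[b](h)                              # thin block at reduced width
        out[tok] += gates[tok, slot][:, None] * (h @ w_up[b])  # d_thin -> d_model
    return x + out                                         # residual (assumption)
```

Note that each block only processes the tokens routed to it (`x[tok]`). If the thin blocks contained ordinary attention, each would attend over only that subset, which is the coverage problem the abstract describes and the motivation for adding one shared softmax block.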
