Mixture-of-Depths Attention

Lianghui Zhu; Yuxin Fang; Bencheng Liao; Shijie Wang; Tianheng Cheng; Zilong Huang; Chen Chen; Lai Wei; Yutao Zeng; Ya Wang; Yi Lin; Yu Li; Xinggang Wang

arXiv:2603.15619·cs.CL·March 17, 2026

Open Access

TL;DR

Mixture-of-depths attention (MoDA) lets each attention head attend to KV pairs from preceding layers as well as from the current layer. This improves deep language models with minimal computational overhead, and a hardware-efficient algorithm reaches 97.3% of FlashAttention-2's efficiency at a sequence length of 64K.

Contribution

Introduces MoDA, a novel attention mechanism enabling cross-depth attention in deep models, with an efficient algorithm and demonstrated performance gains.

Findings

01. Improves average perplexity by 0.2 across 10 validation benchmarks on 1.5B-parameter models.

02. Increases average downstream task performance by 2.11%.

03. Achieves 97.3% of FlashAttention-2's efficiency at a sequence length of 64K.

Abstract

Scaling depth is a key driver for large language models (LLMs). Yet, as LLMs become deeper, they often suffer from signal degradation: informative features formed in shallow layers are gradually diluted by repeated residual updates, making them harder to recover in deeper layers. We introduce mixture-of-depths attention (MoDA), a mechanism that allows each attention head to attend to sequence KV pairs at the current layer and depth KV pairs from preceding layers. We further describe a hardware-efficient algorithm for MoDA that resolves non-contiguous memory-access patterns, achieving 97.3% of FlashAttention-2's efficiency at a sequence length of 64K. Experiments on 1.5B-parameter models demonstrate that MoDA consistently outperforms strong baselines. Notably, it improves average perplexity by 0.2 across 10 validation benchmarks and increases average performance by 2.11% on 10 downstream…
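The cross-depth idea in the abstract can be illustrated with a minimal single-query sketch. The abstract does not give the exact formulation, so this assumes that the sequence KV pairs (current layer) and the depth KV pairs (preceding layers) are concatenated and normalized with one joint softmax. The names `moda_attend`, `seq_k`, `seq_v`, `depth_k`, and `depth_v` are hypothetical, and nothing here reflects the paper's hardware-efficient algorithm.

```python
import math


def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def moda_attend(q, seq_k, seq_v, depth_k, depth_v):
    """One query attends jointly over sequence KV and depth KV.

    ASSUMPTION: the two KV sets share a single softmax; the paper may
    combine them differently.
    """
    keys = seq_k + depth_k        # current-layer keys, then earlier-layer keys
    values = seq_v + depth_v
    scale = 1.0 / math.sqrt(len(q))
    weights = softmax([dot(q, k) * scale for k in keys])
    dim = len(values[0])
    out = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return out, weights


# Toy example: two sequence KV pairs plus one depth KV pair from an earlier layer.
q = [1.0, 0.0]
seq_k = [[1.0, 0.0], [0.0, 1.0]]
seq_v = [[1.0, 0.0], [0.0, 1.0]]
depth_k = [[1.0, 0.0]]
depth_v = [[2.0, 2.0]]

out, weights = moda_attend(q, seq_k, seq_v, depth_k, depth_v)
```

With empty `depth_k` and `depth_v`, the function reduces to ordinary scaled dot-product attention over the current layer, which is one way to see the depth KV pairs as an extension of the standard attention set rather than a separate pathway.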

Taxonomy

Topics: Natural Language Processing Techniques · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning