MonarchAttention: Zero-Shot Conversion to Fast, Hardware-Aware Structured Attention

Can Yaras; Alec S. Xu; Pierre Abillama; Changwoo Lee; Laura Balzano

arXiv:2505.18698 · cs.LG · October 28, 2025

TL;DR

MonarchAttention is a structured, hardware-efficient approximation of softmax attention whose cost is sub-quadratic in sequence length. It replaces the attention layers of existing vision and language Transformers with minimal performance loss and no retraining, which speeds up inference.

Contribution

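The paper approximates softmax attention with Monarch matrices, an expressive class of structured matrices, at sub-quadratic cost. The approximation transfers to pretrained models without additional training and maps efficiently onto tensor-core hardware.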
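To make the sub-quadratic claim concrete, here is a minimal pure-Python sketch of a Monarch-style matrix-vector product. It assumes one common convention, in which the N x N matrix with N = m * b has entries M[i*b + j][k*b + l] = L[j][i][k] * R[k][j][l], built from two sets of small dense blocks. The names `monarch_matvec`, `monarch_dense`, and `rand_blocks` are hypothetical helpers for illustration, and the paper's exact parameterization may differ.

```python
import random

def monarch_matvec(L, R, x):
    """Multiply by a Monarch-style matrix without ever forming it.

    R holds m blocks of size b x b, L holds b blocks of size m x m,
    and x has length N = m * b (assumed convention, see lead-in).
    """
    m, b = len(R), len(L)
    # Step 1: apply block k of R to the k-th contiguous chunk of x.
    Y = [[sum(R[k][j][l] * x[k * b + l] for l in range(b)) for j in range(b)]
         for k in range(m)]
    # Step 2: treat Y as an m x b grid and apply block j of L to column j.
    Z = [[sum(L[j][i][k] * Y[k][j] for k in range(m)) for i in range(m)]
         for j in range(b)]
    # Step 3: undo the transpose and flatten back to length N.
    return [Z[j][i] for i in range(m) for j in range(b)]

def monarch_dense(L, R):
    """Materialize the same matrix densely, for checking only."""
    m, b = len(R), len(L)
    n = m * b
    M = [[0.0] * n for _ in range(n)]
    for i in range(m):
        for j in range(b):
            for k in range(m):
                for l in range(b):
                    M[i * b + j][k * b + l] = L[j][i][k] * R[k][j][l]
    return M

def rand_blocks(count, size, rng):
    """Return `count` random dense blocks of shape size x size."""
    return [[[rng.gauss(0, 1) for _ in range(size)] for _ in range(size)]
            for _ in range(count)]

rng = random.Random(0)
m, b = 4, 4                      # N = 16, so m = b = sqrt(N)
R = rand_blocks(m, b, rng)
L = rand_blocks(b, m, rng)
x = [rng.gauss(0, 1) for _ in range(m * b)]

fast = monarch_matvec(L, R, x)
M = monarch_dense(L, R)
slow = [sum(M[r][c] * x[c] for c in range(m * b)) for r in range(m * b)]
max_err = max(abs(f - s) for f, s in zip(fast, slow))

# Multiplication counts: structured path vs. dense N x N product.
structured_mults = m * b * b + b * m * m
dense_mults = (m * b) ** 2
```

With m = b = sqrt(N), the structured path uses about 2 N sqrt(N) multiplications per vector instead of N^2. Applying it to each of the d columns of a value matrix would give the N sqrt(N) d scaling that the abstract quotes.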

Findings

1. Achieves up to an 8.2x speed-up on long sequences.
2. Incurs minimal performance loss even when every attention layer is replaced.
3. Works across diverse vision and language tasks.

Abstract

Transformers have achieved state-of-the-art performance across various tasks, but suffer from a notable quadratic complexity in sequence length due to the attention mechanism. In this work, we propose MonarchAttention -- a novel approach to sub-quadratic attention approximation via Monarch matrices, an expressive class of structured matrices. Based on the variational form of softmax, we describe an efficient optimization-based algorithm to compute an approximate projection of softmax attention onto the class of Monarch matrices with $\Theta(N\sqrt{N}d)$ computational complexity and $\Theta(Nd)$ memory/IO complexity. Unlike previous approaches, MonarchAttention is both (1) transferable, yielding minimal performance loss with no additional training, even when replacing every attention layer of the Transformer, and (2) hardware-efficient, utilizing the highest-throughput tensor core units…
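The "variational form of softmax" mentioned in the abstract refers to a standard identity: softmax(z) is the point a on the probability simplex that maximizes <a, z> + H(a), where H is the Shannon entropy, and the maximum value is logsumexp(z). The sketch below only checks that identity numerically using the standard library. It is not the paper's algorithm, which (per the abstract) optimizes a related objective restricted to Monarch matrices. The helper names are illustrative.

```python
import math
import random

def softmax(z):
    """Numerically stable softmax of a list of floats."""
    zmax = max(z)
    e = [math.exp(v - zmax) for v in z]
    s = sum(e)
    return [v / s for v in e]

def logsumexp(z):
    """Numerically stable log(sum(exp(z)))."""
    zmax = max(z)
    return zmax + math.log(sum(math.exp(v - zmax) for v in z))

def objective(a, z):
    """<a, z> + H(a), using the convention 0 * log 0 = 0."""
    linear = sum(ai * zi for ai, zi in zip(a, z))
    entropy = -sum(ai * math.log(ai) for ai in a if ai > 0)
    return linear + entropy

def random_simplex(n, rng):
    """Random point on the probability simplex (normalized exponentials)."""
    w = [rng.expovariate(1.0) for _ in range(n)]
    s = sum(w)
    return [v / s for v in w]

rng = random.Random(0)
z = [rng.gauss(0, 1) for _ in range(6)]

p = softmax(z)
best = objective(p, z)

# The gap to any other simplex point equals KL(a || p), so it is never negative.
gaps = [best - objective(random_simplex(6, rng), z) for _ in range(1000)]
```

Here `best` matches `logsumexp(z)` up to floating-point error, and every entry of `gaps` is non-negative. This is the property that lets attention be posed as an optimization problem, which can then be constrained to a structured matrix class.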

Code & Models

Repositories

cjyaras/monarch-attention (PyTorch, official)

Videos

MonarchAttention: Zero-Shot Conversion to Fast, Hardware-Aware Structured Attention · SlidesLive

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Ferroelectric and Negative Capacitance Devices · Advanced Memory and Neural Computing

Methods: Attention Is All You Need · Softmax