Low-latency vision transformers via large-scale multi-head attention

Ronit D. Gross; Tal Halevi; Ella Koresh; Yarden Tzach; and Ido Kanter

arXiv:2506.23832·cs.CV·July 1, 2025

TL;DR

This paper introduces a novel large-scale multi-head attention mechanism in vision transformers that enhances accuracy and reduces latency by exploiting symmetry breaking and label-specific attention clusters, with potential applications in NLP.

Contribution

The paper generalizes the symmetry-breaking phenomenon to large-scale MHA, leading to new ViT architectures that improve accuracy and lower latency by replacing the initial transformer blocks with convolutional layers.

Findings

01 ViT architectures with label-specific attention clusters outperform traditional models.

02 Replacing initial transformer blocks with convolutional layers reduces latency significantly.

03 The proposed mechanisms improve classification accuracy on CIFAR-100.

Abstract

The emergence of spontaneous symmetry breaking among a few heads of multi-head attention (MHA) across transformer blocks in classification tasks was recently demonstrated through the quantification of single-nodal performance (SNP). This finding indicates that each head focuses its attention on a subset of labels through cooperation among its SNPs. This underlying learning mechanism is generalized to large-scale MHA (LS-MHA) using a single matrix value representing single-head performance (SHP), analogous to single-filter performance in convolutional neural networks (CNNs). The results indicate that each SHP matrix comprises multiple unit clusters such that each label is explicitly recognized by a few heads with negligible noise. This leads to an increased signal-to-noise ratio (SNR) along the transformer blocks, thereby improving classification accuracy. These features give rise to…
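The single-head performance idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation. It assumes, hypothetically, that for each input we already have the output-layer vector obtained with every head but one silenced, and it averages those vectors per true label to form a label-by-output matrix. The function names, the threshold, and the rule used to count signal and noise are illustrative assumptions, not definitions taken from the paper.

```python
def single_head_matrix(head_outputs, labels, num_labels):
    """Average one head's output vectors, grouped by the true input label.

    head_outputs: per-sample output vectors of length num_labels, assumed to
                  come from a model with all other heads silenced (hypothetical).
    labels:       true label of each sample.
    Returns M where M[i][j] is the mean output on unit j for inputs of label i.
    """
    sums = [[0.0] * num_labels for _ in range(num_labels)]
    counts = [0] * num_labels
    for vec, lab in zip(head_outputs, labels):
        counts[lab] += 1
        for j, v in enumerate(vec):
            sums[lab][j] += v
    return [
        [s / counts[i] if counts[i] else 0.0 for s in row]
        for i, row in enumerate(sums)
    ]


def recognized_labels(matrix, threshold):
    """Labels whose diagonal entry exceeds the threshold (illustrative criterion)."""
    return [i for i, row in enumerate(matrix) if row[i] > threshold]


def signal_and_noise(matrix, threshold):
    """Count recognized labels (signal) and above-threshold entries that fall
    outside the block spanned by the recognized labels (noise).
    This counting rule is an assumption made for the sketch."""
    cluster = set(recognized_labels(matrix, threshold))
    noise = sum(
        1
        for i, row in enumerate(matrix)
        for j, v in enumerate(row)
        if v > threshold and not (i in cluster and j in cluster)
    )
    return len(cluster), noise


# Toy data: 3 labels, 2 samples per label, outputs of one hypothetical head.
toy_outputs = [
    [0.9, 0.1, 0.0], [0.7, 0.3, 0.0],   # inputs of label 0
    [0.2, 0.8, 0.0], [0.0, 0.6, 0.0],   # inputs of label 1
    [0.1, 0.1, 0.2], [0.3, 0.1, 0.0],   # inputs of label 2
]
toy_labels = [0, 0, 1, 1, 2, 2]

shp = single_head_matrix(toy_outputs, toy_labels, num_labels=3)
cluster = recognized_labels(shp, threshold=0.5)
signal, noise = signal_and_noise(shp, threshold=0.5)
```

In this toy case the head responds strongly to labels 0 and 1 and weakly to label 2, so it "recognizes" a two-label cluster with no above-threshold entries outside it. In a large-scale setting one would build such a matrix per head and compare which labels each head covers.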

Taxonomy

Topics: EEG and Brain-Computer Interfaces · Face Recognition and Perception · Ferroelectric and Negative Capacitance Devices