Elastic Attention Cores for Scalable Vision Transformers

Alan Z. Song; Yinjie Chen; Mu Nan; Rui Zhang; Jiahang Cao; Weijian Mai; Muquan Yu; Hossein Adeli; Deva Ramanan; Michael J. Tarr; Andrew F. Luo

arXiv:2605.12491·cs.CV·May 13, 2026

TL;DR

VECA introduces a scalable vision transformer architecture that uses core-periphery structured attention, enabling linear complexity and competitive performance without direct patch-to-patch interactions.

Contribution

This work proposes VECA, a vision transformer whose attention runs in linear time by routing all patch communication through a small set of learned core tokens, reducing computational cost relative to quadratic self-attention while remaining competitive in accuracy.

Findings

01 VECA achieves competitive performance on classification and dense tasks.

02 VECA reduces computational cost compared to traditional quadratic attention models.

03 VECA demonstrates effective learning without direct patch-to-patch interactions.

Abstract

Vision Transformers (ViTs) achieve strong data-driven scaling by leveraging all-to-all self-attention. However, this flexibility incurs a computational cost that scales quadratically with image resolution, limiting ViTs in high-resolution domains. Underlying this approach is the assumption that pairwise token interactions are necessary for learning rich visual-semantic representations. In this work, we challenge this assumption, demonstrating that effective visual representations can be learned without any direct patch-to-patch interaction. We propose VECA (Visual Elastic Core Attention), a vision transformer architecture that uses efficient linear-time core-periphery structured attention enabled by a small set of learned cores. In VECA, these cores act as a communication interface: patch tokens exchange information exclusively through the core tokens, which are initialized from scratch…
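The core-periphery idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract is truncated and does not spell out the layer design, so the two-step read/write ordering, the single attention head, the absence of learned projections, and the names `attend`, `core_periphery_layer`, and `score_counts` are all assumptions made for illustration. The sketch only shows how patches can exchange information exclusively through a small set of cores, and how the number of attention scores then grows linearly in the number of patches.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys_values):
    """Scaled dot-product attention; keys and values are the same vectors.

    Illustrative simplification: no learned Q/K/V projections, one head.
    """
    d = len(keys_values[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                  for k in keys_values]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, keys_values))
                    for j in range(d)])
    return out


def core_periphery_layer(patches, cores):
    """One hypothetical core-periphery step (assumed read/write order).

    No patch ever attends to another patch directly.
    """
    # Read: each of the K cores gathers from all N patches (K x N scores).
    new_cores = attend(cores, patches)
    # Write: each of the N patches reads from the K cores (N x K scores).
    new_patches = attend(patches, new_cores)
    return new_patches, new_cores


def score_counts(num_patches, num_cores):
    """Attention scores per layer: all-to-all versus the sketch above."""
    full = num_patches * num_patches        # quadratic in N
    core = 2 * num_patches * num_cores      # linear in N for fixed K
    return full, core
```

Under these assumptions, with 4096 patches and 64 cores (both hypothetical values), `score_counts(4096, 64)` gives 16,777,216 scores for all-to-all attention against 524,288 for the core-mediated version, a 32-fold reduction that widens as resolution grows.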
