Enhancing compact convolutional transformers with super attention

Simpenzwe Honore Leandre; Natenaile Asmamaw Shiferaw; Dillip Rout

arXiv:2508.18960·cs.CV·August 27, 2025

TL;DR

This paper proposes a vision model that combines token mixing, sequence pooling, and convolutional tokenizers for fixed context-length tasks. On CIFAR100 it raises top-1 validation accuracy over the baseline from 36.50% to 46.29% and is more efficient than SDPA transformers when the context length is less than the embedding dimension.

Contribution

The proposed model integrates convolutional tokenizers with token mixing and sequence pooling. It improves accuracy and inference efficiency in fixed context-length tasks without relying on data augmentation, positional embeddings, or learning rate scheduling.

Findings

01

Top-1 validation accuracy on CIFAR100 improves from 36.50% to 46.29%, and top-5 from 66.33% to 76.31%

02

More efficient than SDPA transformers when the context length is less than the embedding dimension

03

High training stability without data augmentation or positional embeddings
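
The efficiency finding is conditioned on the context length being smaller than the embedding dimension. A back-of-envelope multiply-accumulate count shows how such a condition can arise. This is an illustration under stated assumptions, not the paper's own analysis: it assumes token mixing is a learnable n×n matrix applied across the n tokens, and compares it with a d×d projection applied across the d channels. The function names are hypothetical.

```python
def token_mixing_macs(n: int, d: int) -> int:
    """MACs for an (n x n) mixing matrix times an (n x d) token matrix.

    Assumption: token mixing is a single dense n x n matrix.
    """
    return n * n * d


def channel_projection_macs(n: int, d: int) -> int:
    """MACs for an (n x d) token matrix times a (d x d) projection."""
    return n * d * d


def mixing_is_cheaper(n: int, d: int) -> bool:
    """True when mixing across tokens costs less than projecting channels."""
    return token_mixing_macs(n, d) < channel_projection_macs(n, d)


if __name__ == "__main__":
    # Illustrative sizes only; these are not taken from the paper.
    for n, d in [(64, 128), (128, 128), (256, 128)]:
        print(n, d, token_mixing_macs(n, d), channel_projection_macs(n, d),
              mixing_is_cheaper(n, d))
```

Under these assumptions the cost ratio is n/d, so the token-mixing product is the cheaper of the two exactly when the context length n is below the embedding dimension d.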

Abstract

In this paper, we propose a vision model that adopts token mixing, sequence-pooling, and convolutional tokenizers to achieve state-of-the-art performance and efficient inference in fixed context-length tasks. On the CIFAR100 benchmark, our model improves the baseline top-1 and top-5 validation accuracy from 36.50% to 46.29% and from 66.33% to 76.31%, respectively, while being more efficient than Scaled Dot Product Attention (SDPA) transformers when the context length is less than the embedding dimension, at only 60% of the size. In addition, the architecture demonstrates high training stability and does not rely on techniques such as data augmentation (for example, mixup), positional embeddings, or learning rate scheduling. We make our code available on GitHub.
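
The abstract names sequence-pooling without defining it. As a minimal sketch, assuming the formulation commonly used in compact convolutional transformers (a learned linear score per token, a softmax over the tokens, then a weighted average in place of a class token), the operation looks like the following. The names `sequence_pool`, `w`, and `b` are hypothetical, and the authors' implementation may differ.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sequence_pool(tokens: List[List[float]], w: List[float], b: float = 0.0) -> List[float]:
    """Pool n token vectors of size d into one vector of size d.

    Assumption: each token gets a scalar score from a linear map (w, b);
    the softmax of those scores weights the average over tokens.
    """
    scores = [sum(wi * xi for wi, xi in zip(w, tok)) + b for tok in tokens]
    weights = softmax(scores)
    d = len(tokens[0])
    return [
        sum(weights[i] * tokens[i][j] for i in range(len(tokens)))
        for j in range(d)
    ]


if __name__ == "__main__":
    tokens = [[1.0, 2.0], [3.0, 4.0]]
    # With zero scoring weights every token is weighted equally,
    # so the pooled vector is the plain mean of the tokens.
    print(sequence_pool(tokens, w=[0.0, 0.0]))
```

With zero scoring weights the pooling reduces to mean pooling; training the scoring weights lets the model emphasize the tokens that matter most for classification.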
