A Lightweight Convolution and Vision Transformer integrated model with Multi-scale Self-attention Mechanism

Yi Zhang; Lingxiao Wei; Bowei Zhang; Ziwei Liu; Kai Yi; Shu Hu

arXiv:2508.16884·cs.CV·September 12, 2025

TL;DR

This paper introduces SAEViT, a lightweight vision transformer model that combines sparse attention, enhanced inter-channel communication, and convolutional features to improve efficiency and accuracy in image classification.

Contribution

The paper proposes a novel sparse attention module, a channel-interactive feed-forward network, and a hierarchical convolutional structure to reduce complexity and enhance local feature modeling in ViT.

Findings

01. Achieves 76.3% Top-1 accuracy on ImageNet-1K with only 0.8 GFLOPs.
02. Outperforms existing lightweight models in accuracy and efficiency.
03. Demonstrates effectiveness of combined sparse attention and convolutional features.

Abstract

Vision Transformer (ViT) has prevailed in computer vision tasks due to its strong long-range dependency modeling ability. However, its large model size and weak local feature modeling ability hinder its application in real scenarios. To balance computational efficiency and performance in downstream vision tasks, we propose an efficient ViT model with sparse attention (dubbed SAEViT) and convolution blocks. Specifically, a Sparsely Aggregated Attention (SAA) module is proposed to perform adaptive sparse sampling and recover the feature map via a deconvolution operation, which significantly reduces the computational complexity of attention operations. In addition, a Channel-Interactive Feed-Forward Network (CIFFN) layer is developed to enhance inter-channel information exchange through feature decomposition and redistribution, which mitigates the redundancy in…
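
To give a rough sense of why running attention on a sparsely sampled feature map reduces cost, the sketch below counts the multiply-accumulates in the two attention matrix products before and after subsampling. It is an illustration under stated assumptions, not the paper's SAA module: the fixed `stride` stands in for the adaptive sampling, the functions `attention_macs` and `sampled_attention_macs` are hypothetical names, the 56 x 56 x 64 feature map is an arbitrary example, and the cost of the sampling and deconvolution steps is ignored.

```python
def attention_macs(num_tokens: int, dim: int) -> int:
    """Multiply-accumulates for Q @ K^T and (attention weights) @ V.

    Each product costs num_tokens * num_tokens * dim; the linear
    projections and the softmax are deliberately left out.
    """
    return 2 * num_tokens * num_tokens * dim


def sampled_attention_macs(height: int, width: int, dim: int, stride: int) -> int:
    """Same count when attention runs on a grid subsampled by `stride`.

    Assumption: a fixed stride replaces the adaptive sparse sampling
    described in the abstract.
    """
    sampled_tokens = (height // stride) * (width // stride)
    return attention_macs(sampled_tokens, dim)


# Illustrative sizes only; these are not figures reported in the paper.
full = attention_macs(56 * 56, 64)
sparse = sampled_attention_macs(56, 56, 64, stride=4)
print(full // sparse)  # 256
```

Because the attention cost grows quadratically with the token count, reducing the tokens by a factor of 16 (stride 4 in each direction) cuts the two matrix products by a factor of 256 in this simplified count.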
