ParFormer: A Vision Transformer with Parallel Mixer and Sparse Channel Attention Patch Embedding

Novendra Setyawan, Ghufron Wahyu Kurniawan, Chi-Chia Sun, Jun-Wei Hsieh, Jing-Ming Guo, and Wen-Kai Kuo

arXiv:2403.15004 · cs.CV · October 3, 2024 · 1 citation

Open Access

TL;DR

ParFormer is a novel vision transformer that combines convolutional and attention mechanisms with sparse channel attention to enhance feature extraction efficiency, reduce redundancy, and improve performance on resource-constrained devices.

Contribution

This paper introduces ParFormer, a vision transformer with a Parallel Mixer and SCAPE module, achieving high accuracy and throughput while reducing computational redundancy for edge device applications.

Findings

1. ParFormer-T achieves 78.9% Top-1 accuracy on ImageNet-1K.

2. ParFormer-T outperforms MobileViT-S and EdgeNeXt-S in throughput and speed.

3. ParFormer-M surpasses ResNet-50 and PoolFormer-S24 in COCO object detection.

Abstract

Convolutional Neural Networks (CNNs) and Transformers have achieved remarkable success in computer vision tasks. However, their deep architectures often lead to high computational redundancy, making them less suitable for resource-constrained environments such as edge devices. This paper introduces ParFormer, a novel vision transformer that addresses this challenge by incorporating a Parallel Mixer and a Sparse Channel Attention Patch Embedding (SCAPE). By combining convolutional and attention mechanisms, ParFormer makes spatial feature extraction more efficient and cuts down on unnecessary computation. The SCAPE module further reduces computational redundancy while preserving essential feature information during down-sampling. Experimental results on the ImageNet-1K dataset show that ParFormer-T achieves 78.9% Top-1 accuracy with a high throughput on…

Taxonomy

Topics: CCD and CMOS Imaging Sensors · Infrared Target Detection Methodologies · Image Processing Techniques and Applications

Methods: Attention Is All You Need · Vision Transformer · Linear Layer · Stochastic Depth · Byte Pair Encoding · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Multi-Head Attention · Softmax