FMViT: A multiple-frequency mixing Vision Transformer

Wei Tan; Yifeng Geng; Xuansong Xie

arXiv:2311.05707 · cs.CV · November 13, 2023 · 2 cites

TL;DR

FMViT is an efficient hybrid Vision Transformer architecture that combines multi-frequency features and deploy-friendly mechanisms, achieving superior accuracy and speed on industrial platforms like TensorRT and CoreML.

Contribution

The paper introduces FMViT, a novel hybrid ViT that enhances expressive power with multi-frequency features and deploy-friendly modules, outperforming existing models in accuracy and inference speed.

Findings

1. FMViT surpasses ResNet101 by 2.5% in top-1 accuracy on ImageNet.

2. FMViT achieves 43% faster inference than EfficientNet-B5.

3. On CoreML, FMViT outperforms MobileOne by 2.6% in accuracy at similar latency.

Abstract

Transformer models have recently been widely adopted in computer vision. However, because the time and memory complexity of self-attention is quadratic in the number of input tokens, most existing Vision Transformers (ViTs) struggle to run efficiently in practical industrial deployment scenarios, such as TensorRT and CoreML, where traditional CNNs excel. Some recent work has designed CNN-Transformer hybrid architectures to address this problem, but their overall performance has not met expectations. To address these challenges, we propose an efficient hybrid ViT architecture named FMViT. It enhances the model's expressive power by blending high-frequency and low-frequency features of varying frequencies, enabling it to capture both local and global information effectively.…
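
The quadratic cost of self-attention mentioned in the abstract can be made concrete with a back-of-the-envelope sketch. The code below is not taken from the paper: the helper names `tokens_for_image` and `attention_cost`, the 16-pixel patch size, and the 64-dimensional head are illustrative assumptions, and only the two matrix products of a single attention head are counted (projections and softmax are omitted).

```python
def tokens_for_image(side: int, patch: int = 16) -> int:
    """Token count for a square image cut into non-overlapping patches.

    The 16-pixel patch size is an assumption chosen for illustration.
    """
    return (side // patch) ** 2


def attention_cost(num_tokens: int, dim: int) -> dict:
    """Rough cost of one single-head self-attention layer.

    Counts only the two matrix products; projections and softmax are omitted.
    """
    # Q @ K^T: (N x d) times (d x N) needs N * N * d multiply-adds.
    score_macs = num_tokens * num_tokens * dim
    # softmax(Q @ K^T) @ V: (N x N) times (N x d) needs N * N * d multiply-adds.
    value_macs = num_tokens * num_tokens * dim
    # The N x N attention map that must be held in memory.
    attn_entries = num_tokens * num_tokens
    return {"macs": score_macs + value_macs, "attn_entries": attn_entries}


if __name__ == "__main__":
    for side in (224, 448):
        n = tokens_for_image(side)
        cost = attention_cost(n, dim=64)
        print(f"{side}px: {n} tokens, {cost['macs']} MACs, "
              f"{cost['attn_entries']} attention entries")
```

Under these assumptions, doubling the image side from 224 to 448 pixels quadruples the token count (196 to 784) and multiplies both the multiply-add count and the attention-map size by 16, which is the scaling behaviour that makes pure attention expensive at deployment time.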

Taxonomy

Topics: Advanced Neural Network Applications · Brain Tumor Detection and Classification · Visual Attention and Saliency Detection