FMViT: A multiple-frequency mixing Vision Transformer
Wei Tan, Yifeng Geng, Xuansong Xie

TL;DR
FMViT is an efficient hybrid Vision Transformer architecture that combines multi-frequency features with deployment-friendly mechanisms, reporting better accuracy and inference speed than existing models on industrial platforms such as TensorRT and CoreML.
Contribution
The paper introduces FMViT, a hybrid ViT that increases expressive power by mixing multi-frequency features and adds deployment-friendly modules; it reports higher accuracy and faster inference than existing models.
Findings
FMViT surpasses ResNet101 by 2.5% top-1 accuracy on ImageNet.
FMViT achieves 43% faster inference than EfficientNet-B5.
On CoreML, FMViT outperforms MobileOne by 2.6% accuracy with similar latency.
Abstract
The transformer model has recently gained widespread adoption in computer vision tasks. However, because the time and memory complexity of self-attention is quadratic in the number of input tokens, most existing Vision Transformers (ViTs) struggle to run efficiently in practical industrial deployment scenarios, such as TensorRT and CoreML, where traditional CNNs excel. Although some recent work has designed CNN-Transformer hybrid architectures to tackle this problem, their overall performance has not met expectations. To address these challenges, we propose an efficient hybrid ViT architecture named FMViT. This approach enhances the model's expressive power by blending high-frequency and low-frequency features of varying frequencies, enabling it to capture both local and global information effectively.…
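To make the quadratic-cost remark in the abstract concrete, here is a minimal sketch that counts the entries of a single head's attention score matrix for a plain patch-based ViT. The patch size of 16 and the two input resolutions are illustrative assumptions, not settings taken from the FMViT paper, and the helper names are hypothetical.

```python
def num_tokens(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping square patches (tokens) in a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    return per_side * per_side


def attention_score_entries(image_size: int, patch_size: int) -> int:
    """Entries in one head's N x N attention score matrix (N = token count)."""
    n = num_tokens(image_size, patch_size)
    return n * n


if __name__ == "__main__":
    # Assumed settings for illustration only: patch size 16, two resolutions.
    for size in (224, 448):
        tokens = num_tokens(size, 16)
        entries = attention_score_entries(size, 16)
        print(f"{size}x{size}: {tokens} tokens, {entries} score entries")
```

Under these assumptions, doubling the side length quadruples the token count (196 to 784) and multiplies the score-matrix size by 16, which is the scaling behaviour that hybrid designs try to mitigate.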
Taxonomy
Topics: Advanced Neural Network Applications · Brain Tumor Detection and Classification · Visual Attention and Saliency Detection
