TL;DR
iFormer is a novel mobile vision network that combines convolution and self-attention to achieve high accuracy and low latency on mobile devices, outperforming existing lightweight models across multiple vision tasks.
Contribution
The paper introduces iFormer, a hybrid mobile vision network that integrates convolution and efficient self-attention, with a new mobile modulation attention mechanism for improved global modeling.
Findings
Achieves 80.4% Top-1 accuracy on ImageNet-1k with 1.10 ms latency on iPhone 13.
Outperforms MobileNetV4 under similar latency constraints.
Shows significant improvements in downstream tasks like object detection and segmentation.
Abstract
We present a new family of mobile hybrid vision networks, called iFormer, with a focus on optimizing latency and accuracy on mobile applications. iFormer effectively integrates the fast local representation capacity of convolution with the efficient global modeling ability of self-attention. The local interactions are derived from transforming a standard convolutional network, i.e., ConvNeXt, to design a more lightweight mobile network. Our newly introduced mobile modulation attention removes memory-intensive operations in MHA and employs an efficient modulation mechanism to boost dynamic global representational capacity. We conduct comprehensive experiments demonstrating that iFormer outperforms existing lightweight networks across various tasks. Notably, iFormer achieves an impressive Top-1 accuracy of 80.4% on ImageNet-1k with a latency of only 1.10 ms on an iPhone 13,…
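The abstract describes an attention block that drops the multi-head machinery of MHA and adds a modulation step. The exact formulation is not reproduced on this page, so the following is only a minimal NumPy sketch of the general idea, assuming a single attention head whose output is multiplied element-wise by a sigmoid gate computed from the input. The function name, the weight names (`w_q`, `w_k`, `w_v`, `w_gate`, `w_out`), and the gating form are all illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def single_head_modulated_attention(x, w_q, w_k, w_v, w_gate, w_out):
    """Hypothetical single-head attention with a multiplicative gate.

    x: (n_tokens, dim) token features.
    All weight matrices are illustrative placeholders.
    """
    q = x @ w_q                      # (n_tokens, d_qk)
    k = x @ w_k                      # (n_tokens, d_qk)
    v = x @ w_v                      # (n_tokens, d_v)

    # One head only: no split into heads and no per-head reshape/concat.
    scores = (q @ k.T) / np.sqrt(q.shape[-1])
    attn = softmax(scores, axis=-1)  # each row sums to 1
    context = attn @ v               # (n_tokens, d_v)

    # Assumed modulation: an input-dependent sigmoid gate scales the
    # attention output element-wise.
    gate = 1.0 / (1.0 + np.exp(-(x @ w_gate)))  # (n_tokens, d_v)
    return (context * gate) @ w_out  # (n_tokens, dim)


rng = np.random.default_rng(0)
n_tokens, dim, d_qk, d_v = 49, 64, 16, 64  # e.g. a 7x7 feature map
x = rng.standard_normal((n_tokens, dim))
w_q = rng.standard_normal((dim, d_qk)) * 0.1
w_k = rng.standard_normal((dim, d_qk)) * 0.1
w_v = rng.standard_normal((dim, d_v)) * 0.1
w_gate = rng.standard_normal((dim, d_v)) * 0.1
w_out = rng.standard_normal((d_v, dim)) * 0.1

out = single_head_modulated_attention(x, w_q, w_k, w_v, w_gate, w_out)
print(out.shape)
```

Using one head avoids the reshape and concatenation steps that multi-head attention needs, which is one plausible reading of "memory-intensive operations"; the paper itself should be consulted for the actual design.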
Peer Reviews
Decision: ICLR 2025 Poster
This paper is well-organized and easy to follow. Detailed design specifications and comprehensive experiments strengthen the paper and demonstrate its contributions. The main contribution, SHMA, provides a new approach to designing efficient attention and Transformer blocks. The resulting iFormer series outperforms state-of-the-art baseline mobile networks with stronger performance and lower latency.
W1: The motivation and necessity of substituting half of the conv blocks at the third stage and all blocks at the last stage into Transformer blocks in ConvNeXt are still not very clear. From Figure 2, changing the conv blocks into SHA blocks gains a 0.4% improvement in performance but is also 0.12 ms (about 10%) slower. I'd like to know further explanation for this design and ablation studies on the choice of stages or different ratios of Conv versus Transformer blocks if possible. W2: Ac
1. The study of model architecture could inspire further exploration in designing more efficient architectures. 2. The paper is well-organized and easy to follow.
1. In Table 1, iFormer-S achieves the same latency as RepViT-M1.0 with slightly fewer parameters, yet in larger variants, iFormer achieves lower latency with substantially more parameters compared to RepViT. What is the reason for this difference? 2. Some studies are not included in the comparison or the related work section, such as [1, 2]. [1] Cmt: Convolutional neural networks meet vision transformers. [2] Learning efficient vision transformers via fine-grained manifold distillation.
I think the logic of exploration in this article, starting with ConvNeXt, first “lightening” the ConvNeXt to create a streamlined lightweight network, then exploring the attention module, is reasonable. I think the analysis about “cosine similarity between multiples” proves that using a single attention is good and worth supporting. I think the experiments reported in this paper are comprehensive (ImageNet, COCO, ADE20K). The paper also reports some knowledge distillation results, which is su
1: Single-head self-attention has been explored in "SHViT: Single-head vision transformer with memory efficient macro design". An alternative to standard self-attention has been explored in GhostNetV2. Modulation in the token mixer module has been explored in Conv2Former. This paper references many related methods, and while that is one approach, I don't think it stands out. Although such research is a decent format, I believe it impacts the novelty of this paper. 2: The process of evo
Taxonomy
Topics: Context-Aware Activity Recognition Systems · Multimedia Communication and Technology
Methods: Softmax · Attention Is All You Need · ConvNeXt · Convolution · Focus
