Efficient Modulation for Vision Networks
Xu Ma, Xiyang Dai, Jianwei Yang, Bin Xiao, Yinpeng Chen, Yun Fu, Lu Yuan

TL;DR
This paper introduces EfficientMod, a novel modulation-based design for vision networks that improves accuracy and efficiency, achieving state-of-the-art results and better trade-offs in various vision tasks.
Contribution
The paper proposes the EfficientMod block, a new modulation mechanism tailored for efficient networks, and demonstrates its effectiveness and versatility in improving vision model performance.
Findings
EfficientMod-s outperforms EfficientFormerV2-s2 by 0.6% top-1 accuracy and is 25% faster on GPU.
EfficientMod-s surpasses MobileViTv2-1.0 by 2.9% accuracy at the same GPU latency.
The method also improves downstream task performance, e.g., by 3.6 mIoU on ADE20K semantic segmentation.
Abstract
In this work, we present efficient modulation, a novel design for efficient vision networks. We revisit the modulation mechanism, which processes the input through convolutional context modeling and feature projection layers and fuses the resulting features via element-wise multiplication and an MLP block. We demonstrate that the modulation mechanism is particularly well suited for efficient networks and further tailor the modulation design by proposing the efficient modulation (EfficientMod) block, which serves as the essential building block of our networks. Benefiting from the prominent representational ability of the modulation mechanism and the proposed efficient design, our networks achieve better trade-offs between accuracy and efficiency and set new state-of-the-art performance among efficient networks. When integrating EfficientMod with the vanilla self-attention block, we obtain…
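The modulation mechanism described in the abstract (a context branch, a projection branch, element-wise multiplication, then a projection) can be sketched in a few lines. This is a minimal NumPy sketch under stated assumptions, not the authors' implementation: features are channel-last with shape (H, W, C), the context branch is assumed to be a linear expansion followed by a single k×k depthwise convolution and GELU, the value branch is a linear expansion, and normalization and the residual connection are omitted. The names `modulation_block`, `w_ctx`, `dw_kernel`, `w_v`, `w_out` and all sizes are hypothetical.

```python
import numpy as np

# Hypothetical sketch of a modulation block; not the authors' implementation.


def gelu(x):
    # tanh approximation of GELU (assumed activation)
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))


def depthwise_conv2d(x, kernel):
    """'Same'-padded depthwise convolution (cross-correlation, as in DL frameworks).

    x: (H, W, C) feature map; kernel: (k, k, C) with odd k, one filter per channel.
    """
    h, w, _ = x.shape
    k = kernel.shape[0]
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))  # zero padding
    out = np.zeros_like(x, dtype=float)
    for i in range(k):
        for j in range(k):
            # Shift the padded map and weight each channel by its own kernel tap
            out += xp[i:i + h, j:j + w, :] * kernel[i, j, :]
    return out


def modulation_block(x, params):
    # Context branch: linear expansion -> depthwise conv -> GELU
    ctx = gelu(depthwise_conv2d(x @ params["w_ctx"], params["dw_kernel"]))
    # Value branch: linear expansion (feature projection)
    v = x @ params["w_v"]
    # Modulation: element-wise product, then project back to C channels
    return (ctx * v) @ params["w_out"]


# Usage with hypothetical sizes: E is the expanded width, K the kernel size
rng = np.random.default_rng(0)
H, W, C, E, K = 8, 8, 16, 64, 7
params = {
    "w_ctx": rng.standard_normal((C, E)) * 0.1,
    "dw_kernel": rng.standard_normal((K, K, E)) * 0.1,
    "w_v": rng.standard_normal((C, E)) * 0.1,
    "w_out": rng.standard_normal((E, C)) * 0.1,
}
x = rng.standard_normal((H, W, C))
y = modulation_block(x, params)
print(y.shape)  # (8, 8, 16)
```

Widening both branches to E channels and projecting back to C lets one block absorb the expansion that a separate MLP would otherwise provide; whether EfficientMod uses exactly this layer order and activation is an assumption of this sketch.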
Peer Reviews
Decision: ICLR 2024 poster
1. The paper has a clear introduction to previous works and explains how the proposed method is motivated by them. This makes the work easy to follow and its mechanism easy to understand. 2. There are extensive experiments on multiple tasks, and the proposed method achieves better performance and latency than previous efficient models.
1. There are limited technical contributions in the work. This paper focuses on improving the latency of previous works. The improvements/changes from previous works are mainly engineering designs, for example, fuse multiple FC layers together, fuse multiple DWConv into a larger one, replace reshape operation with repeat. The guidance is mainly from previous works such as ShuffleNet v2, which is to reduce fragmented operations for improved latency. There are limited new insights. 2. It is not cl
1. The proposed model is simple yet effective. 2. The proposed model shows strong performance on several benchmarks, including ImageNet, COCO, and ADE20K.
1. In Table 2, no latency is reported for the state-of-the-art efficient models. 2. The proposed method seems simple; more analysis and motivation are needed to understand the principles behind the design choices. 3. Important baselines such as ConvNeXt and Swin Transformer are not included in the comparisons.
- The proposed method, EfficientMod, is remarkably simple, which could exert a significant influence when deploying a deep model on a resource-limited device. - The experimental results clearly showcase the effectiveness of EfficientMod in outperforming existing state-of-the-art methods across various tasks (classification, detection, and segmentation).
- Examining Figure 1, and comparing (b) and (c), the proposed EfficientMod block fuses the MLP on the top and the modulation block as one unified block to improve efficiency. It is conceivable that this might limit performance when using the same number of parameters. The authors should elucidate the principles behind this design, not only from the perspective of efficiency but also in terms of representational ability. - Building on the first point, it is imperative to present a comparison bet
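One reviewer lists fusing multiple FC layers together among the latency-oriented changes. Such a fusion is exact because two linear layers applied to the same input are equivalent to one linear layer whose weights and biases are concatenated along the output dimension, followed by a split. The NumPy sketch below checks this equivalence; the sizes are hypothetical, and it does not claim to reproduce which layers EfficientMod actually fuses.

```python
import numpy as np

rng = np.random.default_rng(0)
tokens, c_in, c1, c2 = 196, 64, 128, 128  # hypothetical sizes
x = rng.standard_normal((tokens, c_in))
w1, b1 = rng.standard_normal((c_in, c1)), rng.standard_normal(c1)
w2, b2 = rng.standard_normal((c_in, c2)), rng.standard_normal(c2)

# Unfused: two separate FC layers reading the same input
y1 = x @ w1 + b1
y2 = x @ w2 + b2

# Fused: concatenate weights and biases along the output dim, one matmul, then split
w_fused = np.concatenate([w1, w2], axis=1)
b_fused = np.concatenate([b1, b2])
y_fused = x @ w_fused + b_fused
z1, z2 = np.split(y_fused, [c1], axis=1)

print(np.allclose(y1, z1), np.allclose(y2, z2))  # True True
```

On a GPU, one larger matrix multiplication replaces two smaller ones, which reduces the fragmented operations that the ShuffleNet v2 guidelines cited by the reviewer identify as a latency cost.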
