TL;DR
MogaNet introduces a new convolutional architecture with gated aggregation for improved visual representation, achieving competitive accuracy with fewer parameters and FLOPs across multiple vision tasks.
Contribution
The paper proposes MogaNet, a novel ConvNet architecture that effectively encodes complex interactions using simple convolutions and gating, enhancing efficiency and performance.
Findings
Achieves 80.0% top-1 accuracy on ImageNet-1K with 5.2M parameters.
Outperforms state-of-the-art models like ParC-Net and ConvNeXt-L in accuracy and efficiency.
Demonstrates strong results across various downstream vision benchmarks.
Abstract
By contextualizing the kernel as global as possible, modern ConvNets have shown great potential in computer vision tasks. However, recent progress on multi-order game-theoretic interaction within deep neural networks (DNNs) reveals the representation bottleneck of modern ConvNets, where the expressive interactions have not been effectively encoded with the increased kernel size. To tackle this challenge, we propose a new family of modern ConvNets, dubbed MogaNet, for discriminative visual representation learning in pure ConvNet-based models with favorable complexity-performance trade-offs. MogaNet encapsulates conceptually simple yet effective convolutions and gated aggregation into a compact module, where discriminative features are efficiently gathered and contextualized adaptively. MogaNet exhibits great scalability, impressive efficiency of parameters, and competitive performance…
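As a rough illustration of what "convolutions and gated aggregation in a compact module" could look like, here is a minimal pure-Python sketch on 1D signals. It is not the authors' implementation: the function names (`dwconv1d`, `pointwise`, `gated_aggregation`), the one-dilation-per-channel layout, the SiLU gate, and the toy shapes are all assumptions made for illustration. The released code is the reference for the actual module.

```python
import math


def silu(v):
    """SiLU activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def dwconv1d(channel, kernel, dilation):
    """Depthwise 1D convolution on a single channel.

    Uses zero padding so the output has the same length as the input.
    Assumes an odd-length kernel.
    """
    half = (len(kernel) // 2) * dilation
    n = len(channel)
    out = []
    for i in range(n):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = i - half + j * dilation
            if 0 <= idx < n:  # positions outside the signal count as zero
                acc += w * channel[idx]
        out.append(acc)
    return out


def pointwise(x, weight):
    """1x1 convolution: mixes channels independently at every position.

    x is a list of C_in channels; weight is a C_out x C_in matrix.
    """
    n = len(x[0])
    return [[sum(w * x[c][i] for c, w in enumerate(row)) for i in range(n)]
            for row in weight]


def gated_aggregation(x, kernels, dilations, w_gate, w_value):
    """Hypothetical gated aggregation over multi-dilation depthwise context.

    Channel c is filtered with kernels[c] at dilations[c] (assumed layout),
    then the context branch is modulated elementwise by a gate branch.
    """
    context = [dwconv1d(ch, k, d) for ch, k, d in zip(x, kernels, dilations)]
    gate = pointwise(x, w_gate)
    value = pointwise(context, w_value)
    return [[silu(g) * silu(v) for g, v in zip(g_row, v_row)]
            for g_row, v_row in zip(gate, value)]


# Toy input: 3 channels, 4 positions, one dilation per channel.
x = [[1.0, 2.0, 3.0, 4.0], [0.5, -1.0, 0.0, 2.0], [1.0, 1.0, 1.0, 1.0]]
eye = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
out = gated_aggregation(x, [[0.25, 0.5, 0.25]] * 3, (1, 2, 3), eye, eye)
print(len(out), len(out[0]))  # 3 4
```

Larger dilations let the same small kernel cover a wider context, and the gate lets the input decide, position by position, how much of that context passes through. That is the general idea the abstract describes, shown here under the stated assumptions.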
Peer Reviews
Decision: ICLR 2024 poster
+ The paper is well-written, comprehensively introducing the motivation and method details.
+ The experiments are comprehensive, covering several popular vision tasks as well as a variety of network scales.
+ The experimental and visualized analysis is good, helping the reviewer better understand the method.
+ Code has been released, so reproducibility can be ensured.
- Despite good experiments and visualizations, I think the novelty is limited. As described in the introduction, the low-order interactions model local features, such as edges and textures. The high-order interactions, on the other hand, model high-level semantic features. So multi-order feature aggregation amounts to multiscale aggregation of low- and high-level features. This paper implements it via depth-wise convolutions with different kernel sizes and further adds a gated operation, introducing m
Originality: This paper presents a **novel perspective**: that we should design neural networks so that they can efficiently learn **multi-order** interactions, especially the mid-order ones. Guided by this perspective, the paper proposes a new form of **attention** mechanism (Moga Block) for both spatial and channel aggregation. While the proposed Moga Block is **not** strongly novel in itself, the lens through which the new design is investigated and measured is very **interesting and novel**.
The paper lacks a **theoretical understanding** of why the proposed Moga Block helps facilitate the learning of more mid-order interactions. It also lacks a **theoretical understanding** of why more mid-order interactions are better for computer vision tasks. What should the **best curve** for "interaction strength of order" look like? Should it be a horizontal line across all the interaction orders? (If not, why should we automatically believe that more mid-order interactions will be bette
+: The experiments are conducted on several vision tasks, and the results show the proposed networks are competitive with existing popular architectures. In my opinion, the extensive experiments are the main strength of this work.
+: The overall architecture is clearly described and seems easy to implement.
-: The analysis of multi-order game-theoretic interaction motivates the proposed multi-order gated aggregation network. However, in my opinion, the relationship between Sec. 3 (i.e., analysis) and Sec. 4 (implementation) seems a bit loose. Specifically, I have some doubt about why fine-grained local texture (low-order) and complex global shape (middle-order) can be instantiated by Conv1×1(·) and GAP(·), respectively. And why three different DWConv layers with dilation ratios can capture low, middle,
Code & Models
- 🤗 MogaNet/moganet_xtiny_224_in1k · model · ♡ 1
- 🤗 MogaNet/moganet_tiny_224_in1k · model · ♡ 1
- 🤗 MogaNet/moganet_small_224_in1k · model · ♡ 1
- 🤗 MogaNet/moganet_base_224_in1k · model · ♡ 1
- 🤗 MogaNet/moganet_large_224_in1k · model · ♡ 1
- 🤗 MogaNet/moganet_xlarge_224_in1k · model · ♡ 1
- 🤗 MogaNet/moganet_xtiny_256_in1k · model · ♡ 3
- 🤗 MogaNet/moganet_tiny_256_in1k · model · ♡ 1
- 🤗 birder-project/moganet_s_eu-common · model · 21 downloads
Taxonomy
Topics: Human Pose and Action Recognition · Advanced Neural Network Applications · Video Surveillance and Tracking Methods
Methods: Dense Connections · Attention Is All You Need · Residual Connection · Vision Transformer · Gated Linear Unit · 1x1 Convolution · Convolution · Gated Convolution
