Container: Context Aggregation Network
Peng Gao, Jiasen Lu, Hongsheng Li, Roozbeh Mottaghi, Aniruddha Kembhavi

TL;DR
The paper introduces Container, a unified context aggregation network that combines the strengths of CNNs and Transformers, achieving faster convergence and superior performance in vision tasks like detection and segmentation.
Contribution
It proposes a novel general-purpose block for multi-head context aggregation, unifying CNNs, Transformers, and MLP-Mixers, with an efficient variant for downstream vision tasks.
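One way to read the unification claim is that each architecture mixes spatial positions through an affinity matrix and differs mainly in how that matrix is produced. The NumPy sketch below illustrates this reading and is not the authors' implementation. It assumes a single head, a 1D sequence of tokens, and an aggregation of the form Y = (A V) W_out + X with V = X W_v. All function names, shapes, and the example kernel are hypothetical choices made for illustration.

```python
import numpy as np


def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def aggregate(x, affinity, w_v, w_out):
    """Single-head context aggregation: Y = (A V) W_out + X, with V = X W_v."""
    v = x @ w_v
    return affinity @ v @ w_out + x


def dynamic_affinity(x, w_q, w_k):
    """Input-dependent affinity, as in self-attention (Transformer-style)."""
    q, k = x @ w_q, x @ w_k
    return softmax(q @ k.T / np.sqrt(q.shape[-1]))


def local_static_affinity(n, kernel):
    """Input-independent, banded affinity with shared weights.

    Multiplying by this matrix equals a zero-padded 1D convolution
    (convolution-style mixing). Assumption: one kernel shared by all channels.
    """
    a = np.zeros((n, n))
    r = len(kernel) // 2
    for i in range(n):
        for j, w in enumerate(kernel):
            col = i + j - r
            if 0 <= col < n:
                a[i, col] = w
    return a


def dense_static_affinity(n, rng):
    """Input-independent, dense affinity (MLP-Mixer-style token mixing).

    Random values stand in for learned weights.
    """
    return rng.standard_normal((n, n)) / np.sqrt(n)


rng = np.random.default_rng(0)
n, d = 6, 4  # hypothetical sizes: 6 spatial positions, 4 channels
x = rng.standard_normal((n, d))
w_v, w_out, w_q, w_k = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(4))

affinities = {
    "dynamic": dynamic_affinity(x, w_q, w_k),
    "local_static": local_static_affinity(n, np.array([0.25, 0.5, 0.25])),
    "dense_static": dense_static_affinity(n, rng),
}

# The same aggregation step is reused; only the affinity matrix changes.
outputs = {name: aggregate(x, a, w_v, w_out) for name, a in affinities.items()}
```

In this sketch the aggregation function stays fixed and only the affinity matrix is swapped, which is what makes it possible to combine or interpolate between the three mixing styles inside a single block.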
Findings
Achieves state-of-the-art detection and segmentation results with improved mAP scores.
Converges faster than Transformer-based models, a benefit usually associated with CNNs.
Effective in self-supervised learning frameworks.
Abstract
Convolutional neural networks (CNNs) are ubiquitous in computer vision, with a myriad of effective and efficient variations. Recently, Transformers, originally introduced in natural language processing, have been increasingly adopted in computer vision. While early adopters continue to employ CNN backbones, the latest networks are end-to-end CNN-free Transformer solutions. A recent surprising finding shows that a simple MLP-based solution without any traditional convolutional or Transformer components can produce effective visual representations. While CNNs, Transformers and MLP-Mixers may be considered as completely disparate architectures, we provide a unified view showing that they are in fact special cases of a more general method to aggregate spatial context in a neural network stack. We present Container (CONText AggregatIon NEtwoRk), a general-purpose building block for…
Taxonomy
Topics: Data Stream Mining Techniques · Anomaly Detection Techniques and Applications · Context-Aware Activity Recognition Systems
Methods: Multi-Head Attention · Attention Is All You Need · Vision Transformer · Linear Layer · Feature Pyramid Network · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Attention Dropout · Feedforward Network · 1x1 Convolution
