Focal Modulation Networks
Jianwei Yang, Chunyuan Li, Xiyang Dai, Lu Yuan, Jianfeng Gao

TL;DR
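FocalNets replace self-attention with a focal modulation mechanism that hierarchically encodes context, selectively aggregates it per token, and injects it into each query token, outperforming comparable self-attention models (e.g., Swin and Focal Transformers) on image classification, object detection, and segmentation at similar computational cost.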
Contribution
This paper introduces FocalNets, a family of vision models that replaces self-attention with focal modulation and outperforms state-of-the-art self-attention counterparts such as Swin and Focal Transformers at similar computational cost.
Findings
- FocalNets outperform state-of-the-art self-attention models on image classification.
- FocalNets achieve higher accuracy than their self-attention counterparts on object detection and segmentation.
- FocalNets set new state-of-the-art results on the COCO and ADE20K benchmarks.
Abstract
We propose focal modulation networks (FocalNets in short), where self-attention (SA) is completely replaced by a focal modulation mechanism for modeling token interactions in vision. Focal modulation comprises three components: (i) hierarchical contextualization, implemented using a stack of depth-wise convolutional layers, to encode visual contexts from short to long ranges, (ii) gated aggregation to selectively gather contexts for each query token based on its content, and (iii) element-wise modulation or affine transformation to inject the aggregated context into the query. Extensive experiments show FocalNets outperform the state-of-the-art SA counterparts (e.g., Swin and Focal Transformers) with similar computational costs on the tasks of image classification, object detection, and segmentation. Specifically, FocalNets with tiny and base size achieve 82.3% and 83.9% top-1…
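The three components named in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the linear projections (`w_q`, `w_z`, `w_g`, `w_h`), the GELU non-linearity, the kernel sizes, the extra globally pooled context level, and all function names are hypothetical choices made only to keep the example concrete.

```python
import numpy as np


def gelu(x):
    # Tanh approximation of GELU (the choice of non-linearity is an assumption).
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


def depthwise_conv2d(x, kernels):
    """Depth-wise 2D convolution with zero 'same' padding.

    x: (H, W, C) feature map; kernels: (k, k, C), one k x k filter per channel.
    """
    k = kernels.shape[0]
    pad = k // 2
    padded = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))
    h, w, _ = x.shape
    out = np.zeros_like(x)
    for dy in range(k):
        for dx in range(k):
            # Each channel is filtered independently (no mixing across channels).
            out += padded[dy:dy + h, dx:dx + w, :] * kernels[dy, dx, :]
    return out


def init_params(channels, kernel_sizes, rng):
    # Hypothetical parameter layout; one gate per context level plus one
    # for an assumed globally pooled level.
    scale = 1.0 / np.sqrt(channels)
    levels = len(kernel_sizes)
    return {
        "w_q": rng.standard_normal((channels, channels)) * scale,
        "w_z": rng.standard_normal((channels, channels)) * scale,
        "w_g": rng.standard_normal((channels, levels + 1)) * scale,
        "w_h": rng.standard_normal((channels, channels)) * scale,
        "dw_kernels": [rng.standard_normal((k, k, channels)) / k for k in kernel_sizes],
    }


def focal_modulation(x, params):
    """Apply one focal-modulation step to a feature map x of shape (H, W, C)."""
    q = x @ params["w_q"]        # per-token query
    z = x @ params["w_z"]        # initial context features
    gates = x @ params["w_g"]    # (H, W, L + 1): content-dependent gates

    aggregated = np.zeros_like(z)
    # (i) Hierarchical contextualization: stacked depth-wise convolutions
    # widen the receptive field from short to long range.
    for level, kernels in enumerate(params["dw_kernels"]):
        z = gelu(depthwise_conv2d(z, kernels))
        # (ii) Gated aggregation: each token weights this level by its own gate.
        aggregated += gates[..., level:level + 1] * z

    # Assumed extra level: global average of the last context map.
    global_ctx = z.mean(axis=(0, 1), keepdims=True)
    aggregated += gates[..., -1:] * global_ctx

    # (iii) Element-wise modulation: inject the aggregated context into the query.
    modulator = aggregated @ params["w_h"]
    return q * modulator


rng = np.random.default_rng(0)
params = init_params(channels=8, kernel_sizes=(3, 5), rng=rng)
x = rng.standard_normal((6, 7, 8))
y = focal_modulation(x, params)
print(y.shape)  # (6, 7, 8)
```

Note that no token-to-token attention matrix is ever formed: context is built once by the convolution stack and shared across all positions, while the gates and the query make the result token-specific.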
Code & Models
- 🤗 timm/focalnet_base_lrf.ms_in1k · model · 141 downloads
- 🤗 timm/focalnet_base_srf.ms_in1k · model · 198 downloads
- 🤗 timm/focalnet_huge_fl3.ms_in22k · model · 76 downloads
- 🤗 timm/focalnet_huge_fl4.ms_in22k · model · 40 downloads
- 🤗 timm/focalnet_large_fl3.ms_in22k · model · 32 downloads
- 🤗 timm/focalnet_large_fl4.ms_in22k · model · 330 downloads
- 🤗 timm/focalnet_small_lrf.ms_in1k · model · 188 downloads
- 🤗 timm/focalnet_small_srf.ms_in1k · model · 43 downloads
- 🤗 timm/focalnet_tiny_lrf.ms_in1k · model · 421 downloads
- 🤗 timm/focalnet_tiny_srf.ms_in1k · model · 424 downloads
Taxonomy
Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Visual Attention and Saliency Detection
Methods: Multi-Head Attention · Attention Is All You Need · Layer Normalization · Linear Layer · Dense Connections · Residual Connection · Vision Transformer · Region Proposal Network · Balanced Selection · Convolution
