Focal Modulation Networks

Jianwei Yang; Chunyuan Li; Xiyang Dai; Lu Yuan; Jianfeng Gao

arXiv:2203.11926 · cs.CV · November 8, 2022 · 148 citations

TL;DR

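FocalNets replace self-attention with a focal modulation mechanism that hierarchically encodes context, selectively aggregates it for each query token, and injects it into that token. They outperform comparable self-attention models (e.g., Swin and Focal Transformers) on image classification, object detection, and segmentation at similar computational cost.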

Contribution

This paper introduces FocalNets, a vision architecture that replaces self-attention entirely with focal modulation and outperforms state-of-the-art self-attention counterparts at similar computational cost.

Findings

01 FocalNets outperform state-of-the-art self-attention models (e.g., Swin and Focal Transformers) on image classification.

02 They achieve higher accuracy than those counterparts on object detection and segmentation.

03 They set new state-of-the-art results on the COCO and ADE20K benchmarks.

Abstract

We propose focal modulation networks (FocalNets in short), where self-attention (SA) is completely replaced by a focal modulation mechanism for modeling token interactions in vision. Focal modulation comprises three components: (i) hierarchical contextualization, implemented using a stack of depth-wise convolutional layers, to encode visual contexts from short to long ranges, (ii) gated aggregation to selectively gather contexts for each query token based on its content, and (iii) element-wise modulation or affine transformation to inject the aggregated context into the query. Extensive experiments show FocalNets outperform the state-of-the-art SA counterparts (e.g., Swin and Focal Transformers) with similar computational costs on the tasks of image classification, object detection, and segmentation. Specifically, FocalNets with tiny and base size achieve 82.3% and 83.9% top-1…
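The three components named in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it is a toy, pure-Python version that works on a 1D token sequence instead of a 2D feature map and uses random, untrained weights. The GELU activation, the growing kernel sizes, the extra globally averaged context level, and every function and parameter name (`focal_modulation`, `init_params`, `w_q`, `w_z`, `w_g`, `w_h`) are assumptions made for illustration.

```python
import math
import random


def linear(x, w):
    """Multiply vector x by weight matrix w (one row per output channel)."""
    return [sum(wv * xv for wv, xv in zip(row, x)) for row in w]


def gelu(v):
    """Exact GELU via the error function (activation choice is an assumption)."""
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))


def depthwise_conv1d(seq, kernels):
    """Depth-wise 1D convolution with zero padding.

    seq: N tokens, each a list of C channel values.
    kernels: C kernels of equal odd length, one per channel.
    """
    n, c = len(seq), len(seq[0])
    k = len(kernels[0])
    half = k // 2
    out = [[0.0] * c for _ in range(n)]
    for i in range(n):
        for ch in range(c):
            for j in range(k):
                p = i + j - half
                if 0 <= p < n:
                    out[i][ch] += kernels[ch][j] * seq[p][ch]
    return out


def init_params(dim, kernel_sizes, seed=0):
    """Random, untrained weights; all parameter names are hypothetical."""
    rng = random.Random(seed)

    def mat(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

    return {
        "w_q": mat(dim, dim),                    # query projection
        "w_z": mat(dim, dim),                    # context projection
        "w_g": mat(len(kernel_sizes) + 1, dim),  # one gate per level, plus global
        "w_h": mat(dim, dim),                    # projection of aggregated context
        "kernels": [mat(dim, k) for k in kernel_sizes],
    }


def focal_modulation(tokens, params):
    """Toy focal modulation over a 1D sequence of tokens."""
    n, dim = len(tokens), len(tokens[0])
    levels = len(params["kernels"])
    q = [linear(x, params["w_q"]) for x in tokens]
    z = [linear(x, params["w_z"]) for x in tokens]
    gates = [linear(x, params["w_g"]) for x in tokens]

    ctx = [[0.0] * dim for _ in range(n)]
    for level in range(levels):
        # (i) Hierarchical contextualization: each depth-wise conv widens the
        # receptive field of the context features.
        conv = depthwise_conv1d(z, params["kernels"][level])
        z = [[gelu(v) for v in tok] for tok in conv]
        # (ii) Gated aggregation: each token weights this level by its own gate.
        for i in range(n):
            for ch in range(dim):
                ctx[i][ch] += gates[i][level] * z[i][ch]

    # Assumed extra level: a global context from averaging the last level.
    pooled = [sum(tok[ch] for tok in z) / n for ch in range(dim)]
    for i in range(n):
        for ch in range(dim):
            ctx[i][ch] += gates[i][levels] * pooled[ch]

    # (iii) Element-wise modulation: inject the aggregated context into the query.
    out = []
    for i in range(n):
        modulator = linear(ctx[i], params["w_h"])
        out.append([qv * mv for qv, mv in zip(q[i], modulator)])
    return out


rng = random.Random(1)
tokens = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(6)]
params = init_params(dim=4, kernel_sizes=[3, 5])
output = focal_modulation(tokens, params)
print(len(output), len(output[0]))  # 6 tokens, 4 channels each
```

Unlike self-attention, nothing here compares tokens pairwise. Each token's context comes from convolutions over its neighborhood, so the cost of this sketch grows linearly with the number of tokens for fixed kernel sizes.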

Taxonomy

Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Visual Attention and Saliency Detection

Methods: Multi-Head Attention · Attention Is All You Need · Layer Normalization · Linear Layer · Dense Connections · Residual Connection · Vision Transformer · Region Proposal Network · Balanced Selection · Convolution