Masked-attention Mask Transformer for Universal Image Segmentation
Bowen Cheng, Ishan Misra, Alexander G. Schwing, Alexander Kirillov, Rohit Girdhar

TL;DR
Mask2Former is a versatile image segmentation architecture that unifies multiple tasks with a single model, outperforming specialized methods and setting new state-of-the-art results across various datasets.
Contribution
Introduces Mask2Former, a unified transformer-based architecture capable of handling panoptic, instance, and semantic segmentation tasks with improved efficiency and accuracy.
Findings
Outperforms specialized architectures on four datasets.
Sets new state-of-the-art for panoptic, instance, and semantic segmentation.
Cuts research effort by at least a factor of three, since one architecture covers panoptic, instance, and semantic segmentation.
Abstract
Image segmentation is about grouping pixels with different semantics, e.g., category or instance membership, where each choice of semantics defines a task. While only the semantics of each task differ, current research focuses on designing specialized architectures for each task. We present Masked-attention Mask Transformer (Mask2Former), a new architecture capable of addressing any image segmentation task (panoptic, instance or semantic). Its key components include masked attention, which extracts localized features by constraining cross-attention within predicted mask regions. In addition to reducing the research effort by at least three times, it outperforms the best specialized architectures by a significant margin on four popular datasets. Most notably, Mask2Former sets a new state-of-the-art for panoptic segmentation (57.8 PQ on COCO), instance segmentation (50.1 AP on COCO) and…
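The masked attention described in the abstract can be illustrated with a small NumPy sketch. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `masked_cross_attention`, the 0.5 threshold, the scaling by the square root of the feature dimension, and the fallback to full attention when a predicted mask is empty are all choices made here for the example.

```python
import numpy as np

def masked_cross_attention(queries, keys, values, mask_probs, threshold=0.5):
    """Cross-attention restricted to each query's predicted mask region.

    queries:    (N, d) one feature vector per object query
    keys:       (P, d) one feature vector per pixel location
    values:     (P, d) one feature vector per pixel location
    mask_probs: (N, P) per-query mask probabilities in [0, 1]
    """
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)  # (N, P) attention logits

    # A query may only attend to pixels inside its predicted mask.
    keep = mask_probs >= threshold

    # Assumption: if a query's mask is empty, let it attend everywhere,
    # otherwise the softmax below would have no valid entries.
    empty = ~keep.any(axis=1, keepdims=True)
    keep = keep | empty

    # Disallowed positions get -inf, so they receive zero weight.
    scores = np.where(keep, scores, -np.inf)
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ values, weights

# Toy usage: 2 queries, 6 pixel locations, 4 feature channels.
rng = np.random.default_rng(0)
q = rng.normal(size=(2, 4))
k = rng.normal(size=(6, 4))
v = rng.normal(size=(6, 4))
m = np.zeros((2, 6))
m[0, :3] = 0.9  # query 0 is confined to the first three pixels; query 1 has an empty mask
out, w = masked_cross_attention(q, k, v, m)
```

In this toy run, query 0 puts all of its attention weight on the first three locations, while query 1 (empty mask) falls back to ordinary attention over all six.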
Code & Models
- 🤗 facebook/mask2former-swin-base-coco-instance · model · 2.3k dl · ♡ 4
- 🤗 facebook/mask2former-swin-tiny-coco-instance · model · 94k dl · ♡ 13
- 🤗 facebook/mask2former-swin-small-coco-instance · model · 16k dl · ♡ 9
- 🤗 facebook/mask2former-swin-large-coco-instance · model · 18k dl · ♡ 6
- 🤗 facebook/mask2former-swin-base-coco-panoptic · model · 21k dl · ♡ 16
- 🤗 facebook/mask2former-swin-large-coco-panoptic · model · 522k dl · ♡ 32
- 🤗 facebook/mask2former-swin-small-coco-panoptic · model · 622 dl · ♡ 1
- 🤗 facebook/mask2former-swin-tiny-coco-panoptic · model · 11k dl · ♡ 10
- 🤗 facebook/mask2former-swin-tiny-cityscapes-panoptic · model · 1.1k dl
- 🤗 facebook/mask2former-swin-large-cityscapes-panoptic · model · 1.2k dl · ♡ 3
Taxonomy
Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Medical Image Segmentation Techniques
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Layer Normalization · Dense Connections · Byte Pair Encoding · Label Smoothing · Absolute Position Encodings · Residual Connection · Softmax
