Per-Pixel Classification is Not All You Need for Semantic Segmentation
Bowen Cheng, Alexander G. Schwing, Alexander Kirillov

TL;DR
This paper introduces MaskFormer, a unified mask classification approach for semantic and panoptic segmentation that simplifies existing methods and achieves state-of-the-art results by predicting sets of binary masks with associated class labels.
Contribution
The paper proposes MaskFormer, a novel unified mask classification model that handles both semantic and instance segmentation with the same framework, loss, and training procedure.
Findings
MaskFormer outperforms per-pixel classification baselines when the number of classes is large.
Achieves 55.6 mIoU on ADE20K for semantic segmentation.
Achieves 52.7 PQ on COCO for panoptic segmentation.
Abstract
Modern approaches typically formulate semantic segmentation as a per-pixel classification task, while instance-level segmentation is handled with an alternative mask classification. Our key insight: mask classification is sufficiently general to solve both semantic- and instance-level segmentation tasks in a unified manner using the exact same model, loss, and training procedure. Following this observation, we propose MaskFormer, a simple mask classification model which predicts a set of binary masks, each associated with a single global class label prediction. Overall, the proposed mask classification-based method simplifies the landscape of effective approaches to semantic and panoptic segmentation tasks and shows excellent empirical results. In particular, we observe that MaskFormer outperforms per-pixel classification baselines when the number of classes is large. Our mask…
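To make the mask classification output concrete, here is a minimal pure-Python sketch of how a set of binary mask predictions with one class distribution each could be turned into a per-pixel semantic label map. The abstract does not spell out the inference rule, so the combination used here (sum over predictions of class probability times mask probability, then argmax over classes) is an assumption, and the name `semantic_inference` and the toy inputs are hypothetical.

```python
def semantic_inference(class_probs, mask_probs):
    """Combine N (class distribution, soft mask) pairs into a label map.

    class_probs: N lists of K class probabilities (one list per predicted mask).
    mask_probs:  N masks, each an H x W nested list of probabilities in [0, 1].
    Returns an H x W nested list of class indices.

    Assumed rule (not stated in the abstract): for every pixel, score each
    class c as sum_i class_probs[i][c] * mask_probs[i][y][x], then take argmax.
    """
    num_preds = len(class_probs)
    num_classes = len(class_probs[0])
    height = len(mask_probs[0])
    width = len(mask_probs[0][0])

    labels = []
    for y in range(height):
        row = []
        for x in range(width):
            scores = [
                sum(class_probs[i][c] * mask_probs[i][y][x] for i in range(num_preds))
                for c in range(num_classes)
            ]
            # argmax over classes for this pixel
            row.append(max(range(num_classes), key=scores.__getitem__))
        labels.append(row)
    return labels


# Toy example: 2 predicted masks, 2 classes, a 1 x 2 image.
class_probs = [[0.9, 0.1], [0.2, 0.8]]
mask_probs = [
    [[0.9, 0.1]],  # mask 0 covers mostly the left pixel
    [[0.2, 0.7]],  # mask 1 covers mostly the right pixel
]
print(semantic_inference(class_probs, mask_probs))  # [[0, 1]]
```

Each mask carries a single global class distribution, so a pixel's label depends on which masks cover it and on how confident those masks are about their class, not on a class score predicted at that pixel.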
Code & Models
- 🤗 facebook/maskformer-swin-base-ade · model · 1.4k dl · ♡ 13
- 🤗 facebook/maskformer-swin-base-coco · model · 1.5k dl · ♡ 26
- 🤗 facebook/maskformer-swin-large-ade · model · 418 dl · ♡ 58
- 🤗 facebook/maskformer-swin-large-coco · model · 368 dl · ♡ 27
- 🤗 facebook/maskformer-swin-small-ade · model · 46 dl · ♡ 2
- 🤗 facebook/maskformer-swin-small-coco · model · 1.3k dl · ♡ 4
- 🤗 facebook/maskformer-swin-tiny-ade · model · 557 dl · ♡ 5
- 🤗 facebook/maskformer-swin-tiny-coco · model · 130 dl · ♡ 6
- 🤗 facebook/mask2former-swin-base-coco-instance · model · 2.3k dl · ♡ 4
- 🤗 facebook/mask2former-swin-tiny-coco-instance · model · 94k dl · ♡ 13
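As a rough usage sketch, one of the MaskFormer checkpoints listed above could be loaded through the Hugging Face `transformers` library roughly as follows. This assumes `transformers`, `torch`, and `Pillow` are installed. The helper name `segment` and its `image_path` argument are hypothetical, and the sketch covers only the `maskformer-*` checkpoints, since the `mask2former-*` entries use different model classes.

```python
CHECKPOINT = "facebook/maskformer-swin-base-ade"  # taken from the list above


def segment(image_path, checkpoint=CHECKPOINT):
    """Return an (H, W) tensor of predicted class ids for one image.

    Hypothetical helper: a sketch, not an official recipe.
    """
    # Imported here so the module itself loads without the heavy dependencies.
    import torch
    from PIL import Image
    from transformers import (
        MaskFormerForInstanceSegmentation,
        MaskFormerImageProcessor,
    )

    image = Image.open(image_path).convert("RGB")
    processor = MaskFormerImageProcessor.from_pretrained(checkpoint)
    model = MaskFormerForInstanceSegmentation.from_pretrained(checkpoint)
    model.eval()

    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # PIL reports (width, height); the post-processor expects (height, width).
    target_size = image.size[::-1]
    return processor.post_process_semantic_segmentation(
        outputs, target_sizes=[target_size]
    )[0]
```

Calling `segment("some_image.jpg")` downloads the checkpoint on first use. The returned class ids can be mapped to names through the loaded model's `config.id2label`.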
Taxonomy
Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
