Vision Transformers Are Good Mask Auto-Labelers
Shiyi Lan, Xitong Yang, Zhiding Yu, Zuxuan Wu, Jose M. Alvarez, Anima Anandkumar

TL;DR
This paper introduces Mask Auto-Labeler (MAL), a Transformer-based framework that generates high-quality mask pseudo-labels from box annotations alone, so that instance segmentation models trained on those masks come close to fully supervised performance.
Contribution
The paper demonstrates that Vision Transformers can effectively auto-label masks from box annotations, significantly narrowing the gap with human annotations in instance segmentation.
Findings
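The best model achieves 44.1% mAP on COCO instance segmentation (test-dev 2017), outperforming state-of-the-art box-supervised methods.
Qualitative results indicate that masks generated by MAL are, in some cases, better than human annotations.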
Instance segmentation models trained with MAL masks reach up to 97.4% of fully-supervised performance.
Abstract
We propose Mask Auto-Labeler (MAL), a high-quality Transformer-based mask auto-labeling framework for instance segmentation using only box annotations. MAL takes box-cropped images as inputs and conditionally generates their mask pseudo-labels. We show that Vision Transformers are good mask auto-labelers. Our method significantly reduces the gap between auto-labeling and human annotation regarding mask quality. Instance segmentation models trained using the MAL-generated masks can nearly match the performance of their fully-supervised counterparts, retaining up to 97.4% performance of fully supervised models. The best model achieves 44.1% mAP on COCO instance segmentation (test-dev 2017), outperforming state-of-the-art box-supervised methods by significant margins. Qualitative results indicate that masks produced by MAL are, in some cases, even better than human annotations.
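To make the "box-cropped images as inputs" step concrete, here is a minimal sketch of how per-instance crops might be cut from an image given box annotations. It is an illustration only, not the authors' implementation: the function names, the `(x_min, y_min, x_max, y_max)` box convention with exclusive maxima, and the optional `pad_ratio` context margin are all assumptions that the abstract does not specify.

```python
from typing import List, Tuple

# Assumed box convention: (x_min, y_min, x_max, y_max), max values exclusive.
Box = Tuple[int, int, int, int]
Image = List[List[int]]  # toy single-channel image as rows of pixel values


def expand_and_clip(box: Box, width: int, height: int, pad_ratio: float = 0.0) -> Box:
    """Grow a box by pad_ratio of its size on each side, then clip to the image.

    pad_ratio is a hypothetical knob for adding context around the object.
    """
    x0, y0, x1, y1 = box
    pad_x = int(round((x1 - x0) * pad_ratio))
    pad_y = int(round((y1 - y0) * pad_ratio))
    return (
        max(0, x0 - pad_x),
        max(0, y0 - pad_y),
        min(width, x1 + pad_x),
        min(height, y1 + pad_y),
    )


def crop(image: Image, box: Box) -> Image:
    """Return the sub-image covered by the box."""
    x0, y0, x1, y1 = box
    return [row[x0:x1] for row in image[y0:y1]]


def box_crops(image: Image, boxes: List[Box], pad_ratio: float = 0.0) -> List[Image]:
    """Produce one crop per annotated box; each crop would be one auto-labeler input."""
    height = len(image)
    width = len(image[0]) if height else 0
    return [crop(image, expand_and_clip(b, width, height, pad_ratio)) for b in boxes]
```

In a pipeline of this shape, each crop would be passed to the mask auto-labeler, and the predicted mask would be pasted back into the full image at the crop's coordinates to serve as a pseudo-label for training a downstream instance segmentation model.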
Taxonomy
Topics: Advanced Neural Network Applications · Multimodal Machine Learning Applications · Infrastructure Maintenance and Monitoring
