ISTR: End-to-End Instance Segmentation with Transformers
Jie Hu, Liujuan Cao, Yao Lu, ShengChuan Zhang, Yan Wang, Ke Li, Feiyue Huang, Ling Shao, Rongrong Ji

TL;DR
ISTR introduces the first end-to-end transformer-based framework for instance segmentation, predicting mask embeddings and using recurrent refinement to achieve state-of-the-art results on MS COCO.
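The idea of predicting low-dimensional mask embeddings instead of full-resolution masks can be sketched as follows. This is only an illustration under stated assumptions. ISTR's own mask encoder is described in the paper. Here a truncated 2D DCT (keeping the lowest `keep` x `keep` frequencies) is swapped in purely as a stand-in, and `dct_basis`, `encode_mask`, and `decode_mask` are hypothetical names. The sketch shows how a 16x16 mask (256 pixels) can be carried as a short vector and decoded back.

```python
import math
from typing import List

Matrix = List[List[float]]


def dct_basis(n: int) -> Matrix:
    """Orthonormal DCT-II basis; row k holds the k-th cosine vector of length n."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n)])
    return rows


def encode_mask(mask: Matrix, keep: int) -> List[float]:
    """Stand-in encoder: compress an n x n mask into keep*keep low-frequency DCT coefficients."""
    n = len(mask)
    c = dct_basis(n)
    # Flattened in (u, v) order: index u * keep + v.
    return [
        sum(c[u][i] * mask[i][j] * c[v][j] for i in range(n) for j in range(n))
        for u in range(keep)
        for v in range(keep)
    ]


def decode_mask(embedding: List[float], n: int, threshold: float = 0.5) -> List[List[int]]:
    """Stand-in decoder: invert the truncated DCT (dropped coefficients are zero) and binarize."""
    keep = math.isqrt(len(embedding))
    c = dct_basis(n)
    mask = []
    for i in range(n):
        row = []
        for j in range(n):
            value = sum(
                c[u][i] * embedding[u * keep + v] * c[v][j]
                for u in range(keep)
                for v in range(keep)
            )
            row.append(1 if value >= threshold else 0)
        mask.append(row)
    return mask


# A 16x16 mask containing an 8x8 square (rows and columns 4..11).
mask = [[1 if 4 <= i < 12 and 4 <= j < 12 else 0 for j in range(16)] for i in range(16)]
emb = encode_mask(mask, keep=6)  # 36 numbers instead of 256 pixels
print(len(emb))                  # 36
```

With `keep` equal to the mask size the transform is lossless. Smaller `keep` trades boundary detail for a much shorter vector, which is what makes a set loss over masks tractable.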
Contribution
The paper proposes ISTR, an end-to-end instance segmentation Transformer that predicts low-dimensional mask embeddings and performs detection and segmentation jointly with a recurrent refinement strategy, removing the need for non-end-to-end components such as non-maximum suppression.
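The control flow of a recurrent refinement loop can be sketched as below. This is a toy under stated assumptions. "Recurrent" is read here as reusing one and the same step at every stage, with the box and the mask embedding updated together. ISTR's actual refinement heads, which operate on image features, are not reproduced. `recurrent_refine` and `toy_step` are hypothetical names, and `toy_step` simply moves each prediction halfway toward a fixed target in place of a learned head.

```python
from typing import Callable, List, Tuple

Vector = Tuple[float, ...]
# Hypothetical refinement step: maps (box, mask embedding) to (box delta, embedding delta).
StepFn = Callable[[Vector, Vector], Tuple[Vector, Vector]]


def recurrent_refine(
    box: Vector, mask_emb: Vector, step: StepFn, num_stages: int
) -> List[Tuple[Vector, Vector]]:
    """Apply the same step num_stages times, updating box and mask embedding together.
    Every stage's prediction is returned so that each one could be supervised."""
    outputs = []
    for _ in range(num_stages):
        box_delta, emb_delta = step(box, mask_emb)
        box = tuple(b + d for b, d in zip(box, box_delta))
        mask_emb = tuple(m + d for m, d in zip(mask_emb, emb_delta))
        outputs.append((box, mask_emb))
    return outputs


def toy_step(target_box: Vector, target_emb: Vector) -> StepFn:
    """Stand-in for a learned head: moves each prediction halfway toward a fixed target."""
    def step(box: Vector, emb: Vector) -> Tuple[Vector, Vector]:
        return (
            tuple(0.5 * (t - b) for t, b in zip(target_box, box)),
            tuple(0.5 * (t - e) for t, e in zip(target_emb, emb)),
        )
    return step


stages = recurrent_refine(
    (0.0, 0.0, 10.0, 10.0), (0.0, 0.0),
    toy_step((2.0, 2.0, 12.0, 12.0), (1.0, -1.0)),
    num_stages=3,
)
for k, (box, emb) in enumerate(stages, 1):
    print(k, box, emb)
```

Because the same `step` is reused, adding stages adds computation but no new parameters. A cascade design would instead give each stage its own head.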
Findings
Achieves 46.8/38.6 box/mask AP with ResNet50-FPN on COCO
Achieves 48.1/39.9 box/mask AP with ResNet101-FPN on COCO
Demonstrates strong potential as a baseline for instance-level recognition
Abstract
End-to-end paradigms significantly improve the accuracy of various deep-learning-based computer vision models. To this end, tasks like object detection have been upgraded by replacing non-end-to-end components, such as removing non-maximum suppression by training with a set loss based on bipartite matching. However, such an upgrade is not applicable to instance segmentation, due to its significantly higher output dimensions compared to object detection. In this paper, we propose an instance segmentation Transformer, termed ISTR, which is the first end-to-end framework of its kind. ISTR predicts low-dimensional mask embeddings, and matches them with ground truth mask embeddings for the set loss. Besides, ISTR concurrently conducts detection and segmentation with a recurrent refinement strategy, which provides a new way to achieve instance segmentation compared to the existing top-down…
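Matching predicted mask embeddings to ground-truth mask embeddings for a set loss can be sketched like this. It is a minimal illustration, and several assumptions apply. The matching cost here is only the L2 distance between embeddings, whereas the paper's full cost may include further terms (for example classification and box costs, as in DETR-style detectors). The optimal one-to-one assignment is found by brute force, which only works for toy sizes. Practical code typically uses the Hungarian algorithm, for example `scipy.optimize.linear_sum_assignment`. `match` and `set_embedding_loss` are hypothetical names.

```python
import itertools
import math
from typing import List, Sequence, Tuple

Embedding = Sequence[float]


def l2(a: Embedding, b: Embedding) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def match(preds: List[Embedding], gts: List[Embedding]) -> List[Tuple[int, int]]:
    """Optimal one-to-one assignment of each ground truth to a distinct prediction.
    Assumes len(preds) >= len(gts); brute force is exponential, so toy sizes only."""
    best_cost, best_perm = math.inf, ()
    for perm in itertools.permutations(range(len(preds)), len(gts)):
        cost = sum(l2(preds[p], gt) for p, gt in zip(perm, gts))
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    return list(zip(best_perm, range(len(gts))))  # (prediction index, ground-truth index)


def set_embedding_loss(preds: List[Embedding], gts: List[Embedding]) -> float:
    """Mean squared embedding error over matched pairs only. Unmatched predictions
    would usually be pushed toward "no object" by a classification term (not shown)."""
    pairs = match(preds, gts)
    if not pairs:
        return 0.0
    total = sum(sum((x - y) ** 2 for x, y in zip(preds[p], gts[g])) for p, g in pairs)
    return total / len(pairs)


preds = [(0.0, 0.0), (5.0, 5.0), (1.0, 1.0)]
gts = [(1.0, 1.1), (5.0, 4.9)]
print(match(preds, gts))  # [(2, 0), (1, 1)]
print(set_embedding_loss(preds, gts))
```

Because the assignment is one-to-one, duplicate predictions of the same object cannot all be rewarded. This is why a set loss of this kind removes the need for non-maximum suppression at inference.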
Taxonomy
Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Machine Learning and Data Classification
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Multi-Head Attention · Adam · Layer Normalization · Residual Connection · Label Smoothing · Byte Pair Encoding · Dropout
