SOIT: Segmenting Objects with Instance-Aware Transformers
Xiaodong Yu, Dahu Shi, Xing Wei, Ye Ren, Tingqun Ye, Wenming Tan

TL;DR
SOIT is an end-to-end instance segmentation framework built on instance-aware transformers. It removes hand-crafted components such as RoI cropping and NMS and is reported to outperform prior state-of-the-art methods on MS COCO.
Contribution
It introduces a single-stage, RoI- and NMS-free instance segmentation method based on set prediction with transformers, improving efficiency and accuracy.
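As a rough illustration of what set prediction implies for training, the sketch below pairs each ground-truth object with exactly one prediction by minimizing a total matching cost, which is why duplicate suppression such as NMS becomes unnecessary. This follows the DETR-style formulation in general terms and is not taken from the SOIT code. The function name, the cost values, and the brute-force search (a stand-in for the Hungarian algorithm) are all assumptions made for illustration.

```python
from itertools import permutations


def match_one_to_one(cost):
    """Find the lowest-cost one-to-one assignment of ground truths to predictions.

    cost[g][p] is a hypothetical matching cost between ground truth g and
    prediction p. Requires at least as many predictions as ground truths.
    Brute force over permutations is only suitable for toy sizes; it stands
    in for the Hungarian algorithm used in practice.
    """
    if not cost:
        return [], 0.0
    num_gt = len(cost)
    num_pred = len(cost[0])
    best_total, best_assignment = float("inf"), None
    # Each candidate picks a distinct prediction index for every ground truth.
    for chosen in permutations(range(num_pred), num_gt):
        total = sum(cost[g][p] for g, p in enumerate(chosen))
        if total < best_total:
            best_total, best_assignment = total, chosen
    return list(best_assignment), best_total


# Two ground-truth objects, four predictions (made-up costs).
cost = [
    [0.9, 0.1, 0.8, 0.7],
    [0.2, 0.85, 0.9, 0.6],
]
assignment, total = match_one_to_one(cost)
print(assignment)  # index of the prediction matched to each ground truth
```

Predictions left unmatched would be trained toward a "no object" class in a DETR-style setup, so no two predictions are encouraged to claim the same object.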
Findings
Outperforms state-of-the-art on MS COCO
Eliminates need for RoI and NMS components
Joint learning improves detection performance
Abstract
This paper presents an end-to-end instance segmentation framework, termed SOIT, that Segments Objects with Instance-aware Transformers. Inspired by DETR (Carion et al., 2020), our method views instance segmentation as a direct set prediction problem and effectively removes the need for many hand-crafted components like RoI cropping, one-to-many label assignment, and non-maximum suppression (NMS). In SOIT, multiple queries are learned to directly reason about a set of object embeddings of semantic category, bounding-box location, and pixel-wise mask in parallel under the global image context. The class and bounding box can be easily embedded by a fixed-length vector. The pixel-wise mask, especially, is embedded by a group of parameters to construct a lightweight instance-aware transformer. Afterward, a full-resolution mask is produced by the instance-aware transformer without involving any…
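The idea of embedding a mask as a group of parameters can be sketched in a few lines. In the toy example below, each instance is represented by a flat parameter vector that is applied to every pixel of a shared feature map to produce that instance's mask. This is a deliberate simplification: a single per-pixel linear layer with a sigmoid stands in for the instance-aware transformer described in the abstract, and the function names, feature values, and parameter layout are all hypothetical.

```python
import math


def split_params(params, channels):
    """Split a flat per-instance parameter vector into (weights, bias)."""
    assert len(params) == channels + 1
    return params[:channels], params[channels]


def predict_mask(features, params):
    """Apply one instance's parameters to every pixel of a shared feature map.

    features: H x W x C nested lists shared by all instances.
    params:   flat list of C + 1 numbers predicted for one instance
              (assumed layout: C weights followed by one bias).
    Returns an H x W map of probabilities in (0, 1).
    """
    channels = len(features[0][0])
    weights, bias = split_params(params, channels)
    mask = []
    for row in features:
        out_row = []
        for pixel in row:
            logit = sum(w * f for w, f in zip(weights, pixel)) + bias
            out_row.append(1.0 / (1.0 + math.exp(-logit)))
        mask.append(out_row)
    return mask


# A 2 x 2 feature map with 2 channels (made-up values).
features = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 1.0], [0.0, 0.0]],
]
# Two instances, each described only by its own parameter vector.
instance_a = [4.0, -4.0, 0.0]
instance_b = [-4.0, 4.0, 0.0]
mask_a = predict_mask(features, instance_a)
mask_b = predict_mask(features, instance_b)
```

The same feature map yields different masks for the two instances because only the per-instance parameters change, and no region cropping is involved at any point.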
Taxonomy
Topics: Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Dense Connections · Position-Wise Feed-Forward Layer · Adam · Layer Normalization · Convolution · Feedforward Network · Absolute Position Encodings
