Mask Frozen-DETR: High Quality Instance Segmentation with One GPU
Zhanhao Liang, Yuhui Yuan

TL;DR
Mask Frozen-DETR offers a fast, GPU-efficient approach to high-quality instance segmentation by converting existing DETR models with minimal additional training, outperforming state-of-the-art methods on COCO.
Contribution
Introduces Mask Frozen-DETR, a simple framework that transforms DETR-based detectors into instance segmenters with minimal training and resource requirements.
Findings
Outperforms Mask DINO on COCO test-dev (55.3% vs. 54.7%)
Over 10X faster training than comparable methods
Can be trained on a single GPU with 16 GB memory
Abstract
In this paper, we study how to build a strong instance segmenter with minimal training time and GPUs, as opposed to the majority of current approaches, which pursue more accurate instance segmenters by building more advanced frameworks at the cost of longer training time and higher GPU requirements. To achieve this, we introduce a simple and general framework, termed Mask Frozen-DETR, which can convert any existing DETR-based object detection model into a powerful instance segmentation model. Our method only requires training an additional lightweight mask network that predicts instance masks within the bounding boxes given by a frozen DETR-based object detector. Remarkably, our method outperforms the state-of-the-art instance segmentation method Mask DINO on the COCO test-dev split (55.3% vs. 54.7%) while being over 10X faster to train. Furthermore,…
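The abstract's idea of predicting masks only inside the boxes of a frozen detector can be illustrated with a minimal, standard-library sketch. It is not the paper's mask network: the learned mask head is replaced here by a hand-written probability grid, and the function name `paste_box_mask`, the nearest-neighbour resizing, and the 0.5 threshold are all assumptions made for illustration. The sketch only shows how a low-resolution, box-local mask could be mapped back onto the full image.

```python
from typing import List, Tuple

# (x0, y0, x1, y1) in pixel coordinates; x1 and y1 are exclusive.
Box = Tuple[int, int, int, int]


def paste_box_mask(box_probs: List[List[float]], box: Box,
                   image_h: int, image_w: int,
                   threshold: float = 0.5) -> List[List[int]]:
    """Hypothetical helper: paste a box-local mask into a full-image canvas.

    box_probs is a small grid of foreground probabilities covering the box
    (a stand-in for a mask head's output). It is resized to the box with
    nearest-neighbour lookup, thresholded, and written into an all-zero
    image-sized mask. Pixels outside the box always stay 0.
    """
    canvas = [[0] * image_w for _ in range(image_h)]
    bx0, by0, bx1, by1 = box
    box_h, box_w = by1 - by0, bx1 - bx0
    if box_h <= 0 or box_w <= 0:
        return canvas  # degenerate box: nothing to paste

    grid_h, grid_w = len(box_probs), len(box_probs[0])
    # Iterate only over the part of the box that lies inside the image.
    for y in range(max(0, by0), min(image_h, by1)):
        gy = (y - by0) * grid_h // box_h  # nearest grid row
        for x in range(max(0, bx0), min(image_w, bx1)):
            gx = (x - bx0) * grid_w // box_w  # nearest grid column
            if box_probs[gy][gx] >= threshold:
                canvas[y][x] = 1
    return canvas


# Usage: a 2x2 probability grid pasted into a 4x4 box of a 6x6 image.
demo_mask = paste_box_mask([[0.9, 0.1], [0.2, 0.8]], (1, 1, 5, 5), 6, 6)
for row in demo_mask:
    print("".join(str(v) for v in row))
```

Because the mask is defined only inside the detector's box, the cost of mask prediction scales with the box-local grid rather than with the full image, which is consistent with the lightweight design the abstract describes.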
Peer Reviews
Decision: ICLR 2024 Conference Withdrawn Submission
- The idea of adapting DETR-based detectors for instance segmentation is interesting.
- The main benefit of the proposed approach is an adapter architecture for instance segmentation (from DETR-based detectors) that can reach a state-of-the-art level of Average Precision after 6 training epochs, thus reducing the training time by 10x (compared with Mask DINO).
- Various design choices and ablations give readers good insights into the proposed architecture.
1. The proposed adapter works only with DETR-based detectors.
2. Written presentation could be improved:
  - (This may be subjective) Figs. 3, 5, 6, and 7 have many things in common; they could be grouped into one large figure to simplify the presentation and save space. Doing so may also help to simplify technical Section 3, making it shorter and simpler.
  - Figure 4 is never referenced in the text.
  - If more space is needed, ablation Tables 8, 9, and 10 can report only the main metric, AP^{mask}.
* The proposed approach leverages the strengths of existing object detection models and introduces an efficient instance segmentation head design.
* The experiments conducted in the paper showcase the quality of the proposed approach, achieving state-of-the-art results on the COCO dataset while significantly reducing training time.
* The significance of the paper lies in its ability to improve the efficiency of instance segmentation models by utilizing frozen object detector weights and introduc
The paper lacks clarity in explaining the proposed approach, making it difficult to understand the specific details of the Mask Frozen-DETR framework.
1. The paper has a straightforward motivation and idea. With the pre-trained DETR, training only the mask network is much faster and more efficient, and the GPU hours in Table 6 support the claim.
2. The paper is well written and easy to understand.
3. The image feature encoder design is carefully studied in Table 2.
1. The paper has limited technical novelty. The mask head takes as input object features from RoI pooling with the bounding boxes, together with object queries, which has been studied in QueryInst. It is also similar to the mask head design in Mask R-CNN.
2. When comparing the GPU hours in Table 6, a fairer comparison should sum the training time for the frozen DETR and for the mask network. The training time of the object bounding box detector should be added.
3. Can the paper also provide infere
This is a clearly presented study of a kind of "transfer learning" with a frozen pretrained model. It differs a little from the most common transfer learning settings in that the training task (instance masks) is meant to supplement the original output (boxes), rather than replace it. The ability of the system to do this with smaller amounts of compute is promising (see below).
This is a nice study of this sort of frozen model adaptation, but currently of limited use. The setting uses COCO with all boxes and masks available, so there is little need to separate box and mask training: the data is fixed, and both box and mask labels come from the same source. At best, one might say that optimizing box detection first, followed by masks later, could make efficient use of training resources by allowing a (human or machine) model developer to focus on optimizing one task at a time
Taxonomy
Topics: Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques · Image and Object Detection Techniques
Methods: Multi-Head Attention · Attention Is All You Need · Softmax · Layer Normalization · Linear Layer · Dense Connections · Residual Connection · Vision Transformer
