DI-MaskDINO: A Joint Object Detection and Instance Segmentation Model
Zhixiong Nan, Xianghong Li, Tao Xiang, Jifeng Dai

TL;DR
DI-MaskDINO jointly improves object detection and instance segmentation by alleviating the detection-segmentation performance imbalance that appears at the first transformer decoder layer, leading to state-of-the-art results on COCO and BDD100K.
Contribution
The paper proposes DI-MaskDINO, which adds De-Imbalance (DI) and Balance-Aware Tokens Optimization (BATO) modules to MaskDINO to alleviate the detection-segmentation imbalance and improve joint detection and segmentation performance.
Findings
Outperforms existing models on COCO and BDD100K benchmarks.
Achieves +1.2 AP^{box} and +0.9 AP^{mask} over MaskDINO.
Improves detection and segmentation metrics over SOTA models.
Abstract
This paper is motivated by an interesting phenomenon: in the intermediate results from the first transformer decoder layer of MaskDINO (the SOTA model for joint detection and segmentation), the performance of object detection lags behind that of instance segmentation (i.e., a performance imbalance). This phenomenon raises a question: will the performance imbalance at the first layer of the transformer decoder constrain the upper bound of the final performance? With this question in mind, we further conduct qualitative and quantitative pre-experiments, which validate the negative impact of the detection-segmentation imbalance on model performance. To address this issue, this paper proposes the DI-MaskDINO model, whose core idea is to improve the final performance by alleviating the detection-segmentation imbalance. DI-MaskDINO is…
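As a rough illustration of what a detection-segmentation imbalance can look like for a single object, the sketch below compares the box IoU and the mask IoU of one toy prediction. It is a minimal example under stated assumptions, not the paper's procedure: the paper examines intermediate decoder outputs of MaskDINO, whereas the per-object IoU gap, the helper names (`box_iou`, `mask_iou`, `rect_pixels`), and the toy data here are all hypothetical.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def mask_iou(a, b):
    """IoU of two binary masks given as sets of (row, col) pixels."""
    union = len(a | b)
    return len(a & b) / union if union > 0 else 0.0


def rect_pixels(x1, y1, x2, y2):
    """Pixels of a filled rectangle; x2 and y2 are exclusive."""
    return {(r, c) for r in range(y1, y2) for c in range(x1, x2)}


# Toy ground truth (assumed data): a 4x4 square object.
gt_box = (2, 2, 6, 6)
gt_mask = rect_pixels(2, 2, 6, 6)

# Toy early-layer prediction (assumed data): the mask misses one row,
# while the box is far too large.
pred_mask = rect_pixels(2, 2, 6, 5)
pred_box = (0, 0, 8, 8)

mask_score = mask_iou(pred_mask, gt_mask)
box_score = box_iou(pred_box, gt_box)

# A positive gap means segmentation is ahead of detection for this object.
imbalance = mask_score - box_score
print(f"mask IoU={mask_score:.2f}, box IoU={box_score:.2f}, gap={imbalance:.2f}")
```

In this toy case the mask is much closer to the ground truth than the box, which is the direction of imbalance the abstract describes for the first decoder layer.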
Taxonomy
Topics: Image Processing and 3D Reconstruction · Industrial Vision Systems and Defect Detection · Handwritten Text Recognition Techniques
Methods: Attention Is All You Need · Linear Layer · Layer Normalization · Residual Connection · Dense Connections · Softmax · Multi-Head Attention · Vision Transformer · self-DIstillation with NO labels
