DI-MaskDINO: A Joint Object Detection and Instance Segmentation Model

Zhixiong Nan; Xianghong Li; Tao Xiang; Jifeng Dai

arXiv:2410.16707 · cs.CV · October 23, 2024

TL;DR

DI-MaskDINO improves joint object detection and instance segmentation by alleviating the detection-segmentation performance imbalance at the first transformer decoder layer, and reports state-of-the-art results on COCO and BDD100K.

Contribution

The paper proposes DI-MaskDINO, which adds a De-Imbalance (DI) module and a Balance-Aware Tokens Optimization (BATO) module to MaskDINO to alleviate the detection-segmentation imbalance and improve joint detection and segmentation performance.

Findings

01 Outperforms existing models on the COCO and BDD100K benchmarks.

02 Achieves +1.2 AP^{box} and +0.9 AP^{mask} over MaskDINO.

03 Improves detection and segmentation metrics over SOTA models.

Abstract

This paper is motivated by an interesting phenomenon: when examining the intermediate results from the first transformer decoder layer of MaskDINO (the SOTA model for joint detection and segmentation), the performance of object detection lags behind that of instance segmentation (i.e., a performance imbalance). This phenomenon raises a question: does the performance imbalance at the first layer of the transformer decoder constrain the upper bound of the final performance? With this question in mind, we conduct qualitative and quantitative pre-experiments, which validate the negative impact of the detection-segmentation imbalance on model performance. To address this issue, this paper proposes the DI-MaskDINO model, whose core idea is to improve the final performance by alleviating the detection-segmentation imbalance. DI-MaskDINO is…

Code & Models

Repositories

CQU-ADHRI-Lab/MI-DETR (PyTorch)

Taxonomy

Topics: Image Processing and 3D Reconstruction · Industrial Vision Systems and Defect Detection · Handwritten Text Recognition Techniques

Methods: Attention Is All You Need · Linear Layer · Layer Normalization · Residual Connection · Dense Connections · Softmax · Multi-Head Attention · Vision Transformer · self-DIstillation with NO labels