Comprehensive Multi-Modal Prototypes are Simple and Effective Classifiers for Vast-Vocabulary Object Detection

Yitong Chen; Wenhao Yao; Lingchen Meng; Sihong Wu; Zuxuan Wu; Yu-Gang Jiang

arXiv:2412.17800 · cs.CV · December 24, 2024

TL;DR

This paper introduces Prova, a multi-modal prototype classifier that improves vast-vocabulary object detection across one-stage, two-stage, and DETR-based detectors, including in open-vocabulary detection.

Contribution

Prova is a simple yet effective multi-modal prototype classifier that enhances recognition performance in vast-vocabulary object detection, addressing the limitations of previous classifiers.

Findings

01

On V3Det, Prova improves the AP of Faster R-CNN, FCOS, and DINO by 3.3, 6.2, and 2.9 points, respectively.

02

Prova achieves 32.8 base AP and 11.0 novel AP in open-vocabulary detection.

03

Prova outperforms previous methods by 2.6 base AP and 4.3 novel AP.

Abstract

Enabling models to recognize vast open-world categories has been a longstanding pursuit in object detection. By leveraging the generalization capabilities of vision-language models, current open-world detectors can recognize a broader range of vocabularies, despite being trained on limited categories. However, when the scale of the category vocabularies during training expands to a real-world level, previous classifiers aligned with coarse class names significantly reduce the recognition performance of these detectors. In this paper, we introduce Prova, a multi-modal prototype classifier for vast-vocabulary object detection. Prova extracts comprehensive multi-modal prototypes as initialization of alignment classifiers to tackle the vast-vocabulary object recognition failure problem. On V3Det, this simple method greatly enhances the performance among one-stage, two-stage, and DETR-based…
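The idea of using multi-modal prototypes to initialize an alignment classifier can be illustrated with a minimal sketch. This is not the authors' implementation: the fusion rule (summing normalized text and mean visual embeddings), the temperature value, the random stand-in embeddings, and the helper names `l2_normalize`, `build_prototypes`, and `classify` are all assumptions made for illustration. In practice the embeddings would come from a vision-language model.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along the given axis."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def build_prototypes(text_emb, visual_embs):
    """Fuse per-class text and visual embeddings into one prototype each.

    text_emb:    (C, D) array, one text embedding per class.
    visual_embs: list of C arrays of shape (n_c, D), example image
                 embeddings for each class.
    The sum-then-normalize fusion is an assumption of this sketch.
    """
    visual_mean = np.stack(
        [l2_normalize(v).mean(axis=0) for v in visual_embs]
    )
    fused = l2_normalize(text_emb) + l2_normalize(visual_mean)
    return l2_normalize(fused)

def classify(region_feats, weights, temperature=0.07):
    """Cosine-similarity logits between region features and class weights."""
    return l2_normalize(region_feats) @ weights.T / temperature

# Hypothetical stand-in data: random vectors instead of real embeddings.
rng = np.random.default_rng(0)
C, D = 5, 16
text_emb = rng.normal(size=(C, D))
visual_embs = [text_emb[c] + 0.1 * rng.normal(size=(4, D)) for c in range(C)]

# The prototypes serve as the initial classifier weights; a detector
# would then fine-tune them together with its projection layers.
weights = build_prototypes(text_emb, visual_embs)

region_feats = text_emb + 0.1 * rng.normal(size=(C, D))
pred = classify(region_feats, weights).argmax(axis=1)
```

Here each region feature is assigned to the class whose prototype it is most similar to. Initializing from prototypes rather than from class-name embeddings alone is the part that mirrors the description above; everything else is a simplification.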

Code & Models

Repositories

row11n/prova
PyTorch · Official

Models

Row11n/Prova
Hugging Face model

Taxonomy

Topics: Natural Language Processing Techniques

Methods: Residual Connection · Layer Normalization · Linear Layer · Softmax · Attention Is All You Need · Non Maximum Suppression · Dense Connections · Multi-Head Attention · RoIPool · Vision Transformer