Open-World Object Detection via Discriminative Class Prototype Learning

Jinan Yu; Liyan Ma; Zhenglin Li; Yan Peng; Shaorong Xie

arXiv:2302.11757·cs.CV·February 24, 2023

TL;DR

This paper introduces OCPL, a novel approach for open-world object detection that learns discriminative class prototypes to effectively identify known and unknown objects, improving detection and incremental learning.

Contribution

The paper proposes a prototype-based framework with modules like PEA, ESC, and CSC to enhance discriminative embeddings for open-world object detection.

Findings

1. Effective differentiation of known and unknown classes.
2. Improved detection performance on PASCAL VOC and MS-COCO.
3. Enhanced incremental learning capabilities.

Tables

Table 1. Performance of the proposed method on open-world object detection. (↑) means higher is better, and (↓) means lower is better. "previous", "current" and "both" denote the mAP of previously known classes, currently known classes and all known classes, respectively. ORE* stands for the ORE model without the EBUI module (energy-based unknown identifier). T1 to T4 denote Task 1 to Task 4.

| Method | T1 WI (↓) | T1 A-OSE (↓) | T1 mAP current (↑) | T1 UR (↑) | T2 WI (↓) | T2 A-OSE (↓) | T2 mAP previous (↑) | T2 mAP current (↑) | T2 mAP both (↑) | T2 UR (↑) | T3 WI (↓) | T3 A-OSE (↓) | T3 mAP previous (↑) | T3 mAP current (↑) | T3 mAP both (↑) | T3 UR (↑) | T4 mAP previous (↑) | T4 mAP current (↑) | T4 mAP both (↑) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Faster-RCNN | 0.0699 | 13396 | 56.16 | - | 0.0371 | 12291 | 4.076 | 25.74 | 14.91 | - | 0.0213 | 9174 | 6.96 | 13.48 | 9.138 | - | 2.04 | 13.68 | 4.95 |
| Faster-RCNN + Finetuning | Not applicable as incremental | | | | 0.0375 | 12497 | 51.09 | 23.84 | 37.47 | - | 0.0279 | 9622 | 35.69 | 11.53 | 27.64 | - | 29.53 | 12.78 | 25.34 |
| ORE* | 0.0531 | 12226 | 56.09 | 5.48 | 0.0319 | 10229 | 51.80 | 26.32 | 39.06 | 3.14 | 0.0192 | 8579 | 38.16 | 13.24 | 29.85 | 3.38 | 29.94 | 13.18 | 25.75 |
| OCPL (ours) | 0.0423 | 5670 | 56.64 | 8.26 | 0.0220 | 5690 | 50.65 | 27.54 | 39.10 | 7.65 | 0.0162 | 5166 | 38.63 | 14.74 | 30.67 | 11.88 | 30.75 | 14.42 | 26.67 |

Table 2. Ablation study between different modules.

| Row ID | Prototype | PEA | ESC | CSC | WI | A-OSE | mAP | UR |
|---|---|---|---|---|---|---|---|---|
| 1 | learnable | ✓ | ✗ | ✗ | 0.0442 | 5840 | 55.55 | 7.52 |
| 2 | learnable | ✓ | ✓ | ✗ | 0.0457 | 5866 | 54.78 | 7.74 |
| 3 | learnable | ✓ | ✗ | ✓ | 0.0442 | 6140 | 55.84 | 7.71 |
| 4 | fixed+finetuning | ✓ | ✗ | ✗ | 0.0478 | 6155 | 56.12 | 6.65 |
| 5 | fixed+finetuning | ✓ | ✓ | ✗ | 0.0431 | 5588 | 56.29 | 7.84 |
| 6 | fixed+finetuning | ✓ | ✓ | ✓ | 0.0423 | 5670 | 56.64 | 8.26 |

Equations

$$\begin{aligned}
\mathcal{D}(\mathcal{F}(x_{i})_{p},\mathcal{C}^{j}) &= \mathcal{D}_{e}(\mathcal{F}(x_{i})_{p},\mathcal{C}^{j}) - \mathcal{D}_{d}(\mathcal{F}(x_{i})_{p},\mathcal{C}^{j}),\\
\mathcal{D}_{e}(\mathcal{F}(x_{i})_{p},\mathcal{C}^{j}) &= \frac{1}{d}\left\|\mathcal{F}(x_{i})_{p}-\mathcal{C}^{j}\right\|_{2}^{2},\\
\mathcal{D}_{d}(\mathcal{F}(x_{i})_{p},\mathcal{C}^{j}) &= \mathcal{F}(x_{i})_{p}\cdot\mathcal{C}^{j},
\end{aligned}\tag{1}$$

$$p_{ij}(y=j\mid x_{i},\mathcal{F},\mathcal{C}) = \frac{e^{-\mathcal{D}(\mathcal{F}(x_{i})_{p},\mathcal{C}^{j})}}{\sum_{k=1}^{K} e^{-\mathcal{D}(\mathcal{F}(x_{i})_{p},\mathcal{C}^{k})}},\tag{2}$$

$$L_{dce}(x_{i},\theta,\mathcal{C}) = -\log p_{ij}(y=j\mid x_{i},\mathcal{F},\mathcal{C}),\tag{3}$$

$$L_{osr}(x_{i},\theta,\mathcal{C},R) = \max\{0,\ \mathcal{D}_{e}(\mathcal{F}(x_{i})_{p},\mathcal{C}^{j}) - R\},\tag{4}$$

$$L_{proto} = L_{dce} + \lambda L_{osr},\tag{5}$$

$$logits_{i} = \alpha\,\frac{W^{T} f_{i}}{\|W^{T}\|\,\|f_{i}\|},\tag{6}$$

$$class_{i} = \begin{cases} j, & \text{if } p_{ij}\geqslant\xi,\\ \text{unknown}, & \text{otherwise,} \end{cases}\tag{7}$$

Full text

Abstract

Open-world object detection (OWOD) is a challenging problem that combines object detection with incremental learning and open-set learning. Compared to standard object detection, the OWOD setting requires a model to: 1) detect objects seen during training while identifying unseen classes, and 2) incrementally learn the identified unknown objects once the corresponding annotations are available. We propose a novel and efficient OWOD solution from a prototype perspective, which we call OCPL: Open-world object detection via discriminative Class Prototype Learning. It consists of a Proposal Embedding Aggregator (PEA), an Embedding Space Compressor (ESC) and a Cosine Similarity-based Classifier (CSC). All proposed modules aim to learn discriminative embeddings of known classes in the feature space so as to minimize the overlap between the distributions of known and unknown classes, which helps to differentiate them. Extensive experiments on the PASCAL VOC and MS-COCO benchmarks demonstrate the effectiveness of the proposed method.

**Index Terms:** Open-world object detection, Prototype Learning, Proposal Embedding Aggregator, Embedding Space Compressor

1 Introduction

The rapid development of deep learning in recent years has significantly improved the performance of object detection, whose goal is to identify and localize regions of interest in an image. However, almost all existing object detectors rely on the closed-world assumption that the test and training sets contain the same categories [1, 2, 3, 4]. In practical applications, the test set may contain classes not seen in the training set. Open-world object detection was therefore proposed to address the challenge that a model must not only detect known classes correctly, but also recognize unknown classes [5]. Furthermore, the detector should progressively learn new knowledge once the identified unknown categories are annotated.

Identifying unknown classes in a traditional object detection pipeline is difficult. Dhamija et al. were the first to formalize the open-set object detection problem [6]. They showed that even state-of-the-art object detectors produce false positive detections in an open-set environment. Complementing the benchmark for open-set object detection, Joseph et al. proposed the first open-world object detection model, named ORE [5], based on Faster-RCNN [3]. Since annotations of unknown objects are unavailable, ORE introduces an auto-labelling module for unknowns to obtain a weakly supervised set of unknown objects.

Although ORE was the first to introduce and explore the challenging OWOD paradigm, it suffers from several defects. 1) The effect of the auto-labelling mechanism is negligible. The model can hardly obtain supervision for unknown samples from this module, and the module may cause confusion between background and unknown classes. 2) Training the energy-based classifier in ORE requires a fully annotated dataset that contains label information for both known and unknown classes. ORE therefore violates the principle of open-world object detection, where only annotations of known classes are available during training.

Motivated by the above observations, we intend to build an open-world object detection pipeline that requires neither pseudo-supervision for unknown classes nor extra datasets annotated with unknown classes. In fact, the Region Proposal Network (RPN) can extract proposals with high objectness scores, which contain both known and unknown classes. Therefore, if the overlap between the distributions of unknown and known classes can be largely avoided, we can differentiate them well. Inspired by work on open set recognition [7, 8, 9, 10, 11, 12], we introduce a prototype branch to cluster known classes. A Proposal Embedding Aggregator (PEA) and an Embedding Space Compressor (ESC) are proposed to make the clusters compact. Furthermore, we introduce a Cosine Similarity-based Classifier (CSC) to help form tighter clusters [13]. We summarize our contributions as follows:

We are the first to explore prototype learning in open-world object detection.
We introduce a prototype branch with a Proposal Embedding Aggregator and an Embedding Space Compressor, which respectively separate and squeeze the feature distribution of each known class. To form more compact class clusters, we also introduce a Cosine Similarity-based Classifier.
Extensive experiments on the OWOD benchmark confirm the effectiveness of the proposed OCPL. Specifically, the proposed method outperforms the recently proposed ORE on most metrics.

2 METHOD

2.1 Overall Architecture

To integrate prototype learning with object detection frameworks, we choose the classic two-stage Faster-RCNN framework. Through RoI pooling, Faster-RCNN maps proposals of different sizes to features of the same shape, which suits prototype learning. Fig.1 shows the overall structure of the proposed OCPL. In the first stage, the feature maps output by the backbone are fed into a class-agnostic Region Proposal Network (RPN) to propose regions that may contain an object. In the second stage, the model classifies, regresses and learns a representation for each proposed region through three parallel fully connected layers. In the prototype branch, we learn the distribution of known classes via the Proposal Embedding Aggregator and the Embedding Space Compressor. The distance between the features of a sample and the prototype centers can then be used to measure the probability that the sample belongs to a known or unknown class. To form even tighter clusters, we replace the original classifier with a Cosine Similarity-based Classifier. The regression head and the corresponding loss $L_{reg}$ remain unchanged.

2.2 Proposal Embedding Aggregator

Some researchers [9] have shown that setting multiple prototypes for each cluster may hinder the formation of a tighter distribution of samples in the feature space. Therefore, only one prototype center is chosen for each class, and it is initialized by one-hot encoding. We denote the prototype centers as $\mathcal{C}=\{\mathcal{C}^{k}\in\mathbb{R}^{d},k=1,2,...,K\}$ , where $K$ is the number of categories, and we denote $\mathcal{F}$ as the function of the RoI Head. For any instance $x_{i}\in\mathbb{R}^{C\times H\times W}$ from RoI pooling, we adopt the setting of ARPL [10] to measure the distance between $x_{i}$ and the prototype center of category $j$ :

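$$\begin{aligned}
\mathcal{D}(\mathcal{F}(x_{i})_{p},\mathcal{C}^{j}) &= \mathcal{D}_{e}(\mathcal{F}(x_{i})_{p},\mathcal{C}^{j}) - \mathcal{D}_{d}(\mathcal{F}(x_{i})_{p},\mathcal{C}^{j}),\\
\mathcal{D}_{e}(\mathcal{F}(x_{i})_{p},\mathcal{C}^{j}) &= \frac{1}{d}\left\|\mathcal{F}(x_{i})_{p}-\mathcal{C}^{j}\right\|_{2}^{2},\\
\mathcal{D}_{d}(\mathcal{F}(x_{i})_{p},\mathcal{C}^{j}) &= \mathcal{F}(x_{i})_{p}\cdot\mathcal{C}^{j},
\end{aligned}\tag{1}$$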

where $\mathcal{F}\left({x_{i}}\right)_{p}\in\mathbb{R}^{d}$ is the embedding feature produced by the prototype branch, and $\mathcal{D}_{e}$ and $\mathcal{D}_{d}$ represent the Euclidean distance and the dot product, respectively. To optimize the prototypes more conveniently, we imitate the cross entropy mechanism and use a distance-based cross entropy loss. The probability that instance $x_{i}$ belongs to class $j$ can therefore be written as

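$$p_{ij}(y=j\mid x_{i},\mathcal{F},\mathcal{C}) = \frac{e^{-\mathcal{D}(\mathcal{F}(x_{i})_{p},\mathcal{C}^{j})}}{\sum_{k=1}^{K} e^{-\mathcal{D}(\mathcal{F}(x_{i})_{p},\mathcal{C}^{k})}},\tag{2}$$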

where the corresponding objective function is

$$L_{dce}(x_{i},\theta,\mathcal{C}) = -\log p_{ij}(y=j\mid x_{i},\mathcal{F},\mathcal{C}),\tag{3}$$

where $\theta$ denotes the model parameters. The optimization of $L_{dce}$ tends to converge only when $\mathcal{F}\left({x_{i}}\right)_{p}$ and $\mathcal{C}^{j}$ point in almost the same direction and lie very close to each other, as shown in Fig.1. By optimizing Eq.(3), the embedding features of the same category are attracted to their prototype center, while features of different categories are pushed apart.
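The distance, probability and loss described in this subsection can be sketched in a few lines of plain Python. This is a minimal illustration rather than the authors' implementation: the function names, the 3-dimensional toy embedding and the choice of three classes are assumptions made for the example, and only the formulas (mean squared Euclidean distance minus dot product, softmax over negative distances, negative log-probability) follow the text.

```python
import math

def distance(f, c):
    """D = D_e - D_d: mean squared Euclidean distance minus dot product."""
    d = len(f)
    d_e = sum((fi - ci) ** 2 for fi, ci in zip(f, c)) / d
    d_d = sum(fi * ci for fi, ci in zip(f, c))
    return d_e - d_d

def class_probabilities(f, prototypes):
    """Softmax over negative distances to all K prototype centers (Eq.(2))."""
    neg = [-distance(f, c) for c in prototypes]
    m = max(neg)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in neg]
    total = sum(exps)
    return [e / total for e in exps]

def dce_loss(f, prototypes, label):
    """Distance-based cross entropy for a sample with ground-truth class `label` (Eq.(3))."""
    return -math.log(class_probabilities(f, prototypes)[label])

# Hypothetical toy setup: K = 3 classes, d = 3, one-hot prototype centers.
prototypes = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
embedding = [0.9, 0.1, 0.0]  # made-up embedding lying close to the class-0 prototype
probs = class_probabilities(embedding, prototypes)
print(probs, dce_loss(embedding, prototypes, 0))
```

In this toy case the embedding is both close to and aligned with the class-0 prototype, so class 0 receives the largest probability and the smallest loss.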

2.3 Embedding Space Compressor

In addition, we want to further squeeze the distribution range of known classes in the feature space to reduce the open space risk, which is the degree of overlap between the distributions of unknown and known classes [14]. Specifically, the open space risk can be constrained by a radius around the prototype centers. The bottom right of Fig.1 illustrates the principle of $L_{osr}$ , which can be expressed as

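$$L_{osr}(x_{i},\theta,\mathcal{C},R) = \max\{0,\ \mathcal{D}_{e}(\mathcal{F}(x_{i})_{p},\mathcal{C}^{j}) - R\},\tag{4}$$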

where $x_{i}$ represents a training sample with label $j$ from RoI pooling, and $R$ is a learnable parameter with an initial value of 0. Because $R$ starts at 0 and the squared Euclidean distance is non-negative, the value of $L_{osr}$ is greater than zero in the initial stage of model training. As the network is optimized, $R$ gradually increases and converges to a value while $\mathcal{D}_{e}({\mathcal{F}\left({x_{i}}\right)_{p},\mathcal{C}^{j})}$ decreases accordingly. Therefore, $L_{osr}$ can assist $L_{dce}$ in strengthening the discriminative ability of the model. Finally, the embedding feature of each positive proposal sample falls into a corresponding hypersphere with $\mathcal{C}^{k}$ as the center and $R$ as the radius.

The overall training objective function of the prototype branch can be expressed as:

$$L_{proto} = L_{dce} + \lambda L_{osr},\tag{5}$$

where $\lambda$ is the balance factor. Our model is insensitive to $\lambda$ and we set $\lambda$ to 0.1 in all experiments.
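The hinge on the radius and the weighted sum of the two prototype losses can be sketched as follows. This is an illustrative sketch only: the function names and the toy numbers are assumptions, $R$ is treated here as a plain number rather than a learnable parameter, and only the formulas and the value $\lambda=0.1$ come from the text.

```python
def euclidean_term(f, c):
    """D_e: squared Euclidean distance averaged over the embedding dimension d."""
    return sum((fi - ci) ** 2 for fi, ci in zip(f, c)) / len(f)

def osr_loss(f, c, radius):
    """Open space risk loss: penalize only the part of D_e that exceeds the radius R."""
    return max(0.0, euclidean_term(f, c) - radius)

def proto_loss(dce, osr, lam=0.1):
    """Overall prototype-branch objective: L_dce + lambda * L_osr."""
    return dce + lam * osr

# Hypothetical toy values: an embedding near its class prototype.
f = [0.9, 0.1, 0.0]
c = [1.0, 0.0, 0.0]
print(osr_loss(f, c, radius=0.0))   # positive: with R = 0 any offset is penalized
print(osr_loss(f, c, radius=0.01))  # zero: the sample already lies inside the radius
```

With $R=0$ every sample that does not coincide with its prototype is penalized, which matches the remark that the loss is positive early in training; once a sample lies inside the radius the term vanishes.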

2.4 Cosine Similarity-based Classifier

Features generated by a model trained with the softmax loss are not discriminative enough: intra-class compactness is not considered, so the intra-class distance may become greater than the inter-class distance [9]. We therefore use a cosine similarity-based classifier, in which the logits fed to the softmax function of the classification loss are calculated as the scaled cosine similarity between instance features $f_{i}$ and class weights $W$ ,

$$logits_{i} = \alpha\,\frac{W^{T} f_{i}}{\|W^{T}\|\,\|f_{i}\|},\tag{6}$$

where $f_{i}\in\mathbb{R}^{C}$ is obtained by global average pooling of $x_{i}$ , and $\alpha$ is a scaling factor that strengthens the gradient. Explicitly modeling similarity with a cosine classifier [13] helps to form tighter clusters of instances.
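A scaled cosine classifier of this kind can be sketched as below. The sketch makes two assumptions that the text does not state: the norm of $W$ is taken per class (one weight vector per class), and $\alpha=20$ is a placeholder value chosen only for illustration. The function name and toy inputs are likewise hypothetical.

```python
import math

def cosine_logits(f, weights, alpha=20.0):
    """Scaled cosine similarity between feature f and each class weight vector.

    `weights` holds one weight vector per class; alpha=20.0 is an assumed value.
    """
    f_norm = math.sqrt(sum(v * v for v in f))
    logits = []
    for w in weights:
        w_norm = math.sqrt(sum(v * v for v in w))
        dot = sum(wi * fi for wi, fi in zip(w, f))
        # Small epsilon guards against division by zero for all-zero vectors.
        logits.append(alpha * dot / (w_norm * f_norm + 1e-12))
    return logits

# Hypothetical toy weights for two classes and two features that differ only in scale.
weights = [[1.0, 0.0], [0.0, 1.0]]
print(cosine_logits([1.0, 2.0], weights))
print(cosine_logits([10.0, 20.0], weights))
```

Because both vectors are normalized, rescaling a feature leaves its logits essentially unchanged, so only the direction of the feature matters; this is the property that encourages tighter angular clusters.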

During inference, a threshold $\gamma$ , usually set to 0.05, is used to filter out detections with extremely low classification scores. The output of the prototype layer is then used to calculate the probability in Eq.(2) for the remaining samples. Finally, the class of instance $x_{i}$ is determined by the following formula:

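$$class_{i} = \begin{cases} j, & \text{if } p_{ij}\geqslant\xi,\\ \text{unknown}, & \text{otherwise,} \end{cases}\tag{7}$$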

where $p_{ij}$ is the probability that instance $x_{i}$ belongs to class $j$ , and $\xi$ is the threshold that decides whether $x_{i}$ is an unknown instance.
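The two-step inference rule can be sketched as a small function. This is a sketch under stated assumptions: $j$ is taken to be the class with the highest prototype probability, the value `xi=0.5` is a placeholder because the text gives no value for $\xi$, and the function name and return conventions (`None` for a discarded detection) are hypothetical. Only $\gamma=0.05$ comes from the text.

```python
def classify(cls_score, proto_probs, gamma=0.05, xi=0.5):
    """Return a class index, "unknown", or None for a filtered-out detection.

    cls_score:   classification score of the detection.
    proto_probs: probabilities p_ij from the prototype branch (Eq.(2)).
    xi=0.5 is an assumed placeholder threshold.
    """
    if cls_score < gamma:
        return None  # detection discarded for an extremely low classification score
    # Assumption: j is the most probable known class under the prototype branch.
    j = max(range(len(proto_probs)), key=proto_probs.__getitem__)
    return j if proto_probs[j] >= xi else "unknown"

print(classify(0.01, [0.9, 0.1]))         # filtered out by gamma
print(classify(0.80, [0.9, 0.1]))         # confident known class
print(classify(0.80, [0.4, 0.35, 0.25]))  # no class reaches xi
```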

3 EXPERIMENT

3.1 Experiment Setup

Following the data split in ORE, the 80 categories in the PASCAL VOC [15] and MSCOCO [16] training sets are divided equally into 4 groups of 20, one group per task. We use the Pascal VOC test split and the MSCOCO val split for evaluation. For known classes, mAP still applies. For unknown classes, we adopt the Unknown Recall (UR) [17] rate to represent the detection capability of the model for unknown objects. Wilderness Impact (WI) [6] is used to measure the degree of confusion between unknown and known classes, and we use the absolute Open-Set Error (A-OSE) [18] to report the number of unknown objects detected as any known class.
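As a rough illustration of the two unknown-related counts, the sketch below computes Unknown Recall and A-OSE from already-matched pairs. It is a simplification and not the benchmark's evaluation code: the matching of detections to ground-truth boxes (for example by IoU) is assumed to have been done elsewhere, the `(gt_label, pred_label)` input format and the function name are hypothetical, and WI is not covered.

```python
def unknown_recall_and_aose(matches):
    """Compute (UR, A-OSE) from matched ground-truth/prediction label pairs.

    matches: list of (gt_label, pred_label), one entry per ground-truth object;
             pred_label is None when the object was not detected at all.
    """
    unknown_gt = [(g, p) for g, p in matches if g == "unknown"]
    # Unknown objects correctly reported as "unknown".
    recalled = sum(1 for _, p in unknown_gt if p == "unknown")
    # Unknown objects wrongly reported as some known class.
    aose = sum(1 for _, p in unknown_gt if p not in (None, "unknown"))
    ur = recalled / len(unknown_gt) if unknown_gt else 0.0
    return ur, aose

# Made-up example: three unknown objects and one known object.
matches = [("unknown", "unknown"), ("unknown", "dog"), ("unknown", None), ("cat", "cat")]
ur, aose = unknown_recall_and_aose(matches)
print(ur, aose)
```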

In each incremental learning task $T_{i}$ , the classes of tasks after $T_{i}$ are treated as unknown, and only the training data for the current task is present in $T_{i}$ . Training without the previous classes causes catastrophic forgetting, which can be mitigated by fine-tuning the network parameters with representative samples from previously known and currently known classes. Our model is based on the Faster-RCNN algorithm with a ResNet-50 [19] backbone. The code is implemented in PyTorch using Detectron2 [20].
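The task protocol can be made concrete with a small helper that returns which classes are previously known, currently known and unknown at a given task. This is a sketch under an explicit assumption: classes are identified by hypothetical integer indices 0 to 79 grouped in index order, whereas the actual benchmark assigns specific categories to each task.

```python
def task_split(task_id, num_classes=80, num_tasks=4):
    """Return (previous, current, unknown) class indices for task T_i (1-based).

    Assumes classes are indexed 0..num_classes-1 and grouped in index order,
    which is an illustrative convention, not the benchmark's actual grouping.
    """
    if not 1 <= task_id <= num_tasks:
        raise ValueError("task_id must be between 1 and num_tasks")
    per_task = num_classes // num_tasks
    start, end = (task_id - 1) * per_task, task_id * per_task
    previous = list(range(0, start))
    current = list(range(start, end))
    unknown = list(range(end, num_classes))  # classes of later tasks
    return previous, current, unknown

for t in range(1, 5):
    prev, cur, unk = task_split(t)
    print(f"Task {t}: {len(prev)} previous, {len(cur)} current, {len(unk)} unknown")
```

In the last task no unknown classes remain, which is consistent with Table 1 reporting only mAP for Task 4.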

3.2 Experiment result

Table 1 compares the proposed OCPL model with other detectors under the OWOD evaluation protocol. The standard Faster-RCNN has no ability to recall unknown classes. For a fair comparison with ORE, we reproduce the ORE* model, which is an ORE model without the energy-based unknown identifier that relies on annotations of an unknown dataset. The proposed model outperforms ORE* on almost all metrics; the only exception in Table 1 is the mAP of previously known classes in Task 2 (50.65 vs. 51.80).
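To put the A-OSE gap into perspective, the short calculation below derives the relative reduction of OCPL over ORE* from the A-OSE values reported in Table 1. The numbers are copied from the table; the helper name and the choice of relative reduction as a summary are ours, not the paper's.

```python
# A-OSE values copied from Table 1 (lower is better).
ore_star = {"Task 1": 12226, "Task 2": 10229, "Task 3": 8579}
ocpl = {"Task 1": 5670, "Task 2": 5690, "Task 3": 5166}

def relative_reduction(baseline, ours):
    """Fraction by which `ours` is lower than `baseline`."""
    return (baseline - ours) / baseline

reductions = {task: relative_reduction(ore_star[task], ocpl[task]) for task in ore_star}
for task, r in reductions.items():
    print(f"{task}: A-OSE reduced by {r:.1%}")
```

The reductions come out at roughly 54%, 44% and 40% for Tasks 1 to 3, so OCPL roughly halves the number of unknown objects mistaken for known classes in Task 1.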

The effect of each proposed component is demonstrated in ablation experiments, with results shown in Table 2. Two methods of prototype initialization are available. "learnable" indicates that the prototype is randomly initialized and learnable, whereas "fixed" means that each prototype center is initialized by one-hot encoding, which is more beneficial for modeling the class centers of complex datasets [8]. To make the fixed prototypes more flexible, we periodically fine-tune the prototype centers via the learned class features to accommodate data with large visual variation within each class. The results show that our fixed and fine-tuned prototype centers are more likely to generate compact and stable clusters than learnable prototype centers, and that $R$ in the Embedding Space Compressor (ESC) has difficulty adapting to changing prototype centers: with learnable prototypes, adding ESC (row 2 vs. row 1) worsens WI, A-OSE and mAP.
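The "fixed+finetuning" setting can be illustrated with the sketch below. The one-hot initialization follows the text, but the update rule is an assumption: the paper only says that the centers are periodically fine-tuned via the learned class features, so the moving average toward the per-class mean embedding, the `momentum` parameter and both function names are hypothetical choices made for illustration.

```python
def one_hot_prototypes(num_classes, dim):
    """Fixed initialization: prototype k is the k-th standard basis vector."""
    if dim < num_classes:
        raise ValueError("one-hot prototypes need dim >= num_classes")
    return [[1.0 if i == k else 0.0 for i in range(dim)] for k in range(num_classes)]

def finetune_prototypes(prototypes, embeddings, labels, momentum=0.9):
    """Assumed update rule: move each center toward the mean embedding of its class."""
    dim = len(prototypes[0])
    updated = []
    for k, proto in enumerate(prototypes):
        members = [e for e, y in zip(embeddings, labels) if y == k]
        if not members:
            updated.append(list(proto))  # no samples of class k: keep the center
            continue
        mean = [sum(e[i] for e in members) / len(members) for i in range(dim)]
        updated.append([momentum * p + (1.0 - momentum) * m for p, m in zip(proto, mean)])
    return updated

# Hypothetical toy data: two classes in a 3-dimensional space, samples only for class 0.
protos = one_hot_prototypes(2, 3)
new_protos = finetune_prototypes(protos, [[0.8, 0.2, 0.0], [0.6, 0.0, 0.0]], [0, 0], momentum=0.5)
print(new_protos)
```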

3.3 Visualization and Analysis

We visualize the quality of the clusters of known classes trained in Task 1 with t-SNE [21], as shown in Fig.2. Our method forms more compact class clusters, which is beneficial for identifying unknown classes. Fig.3 compares the detection results. Note that 'cake', 'pizza' and 'elephant' are not among the classes trained in $T_{1}$ . Our method successfully identifies these unseen categories as 'unknown', while ORE* is more prone to false detections.

4 CONCLUSION

In this paper, we propose a novel and efficient solution for Open World Object Detection (OWOD). We are the first to incorporate prototype learning into the OWOD problem, and we achieve better performance than the recently proposed ORE on various metrics. Two constraints and a cosine classifier are introduced into the proposed method to avoid overlap between known and unknown class distributions as much as possible. Comprehensive experiments on Pascal VOC and MSCOCO demonstrate the effectiveness of the proposed method.

Bibliography

[1] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi, “You only look once: Unified, real-time object detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 779–788.
[2] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár, “Focal loss for dense object detection,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2980–2988.
[3] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” Advances in Neural Information Processing Systems, vol. 28, 2015.
[4] Zhi Tian, Chunhua Shen, Hao Chen, and Tong He, “FCOS: Fully convolutional one-stage object detection,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 9627–9636.
[5] KJ Joseph, Salman Khan, Fahad Shahbaz Khan, and Vineeth N Balasubramanian, “Towards open world object detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 5830–5840.
[6] Akshay Dhamija, Manuel Gunther, Jonathan Ventura, and Terrance Boult, “The overlooked elephant of object detection: Open set,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2020, pp. 1021–1030.
[7] Hong-Ming Yang, Xu-Yao Zhang, Fei Yin, and Cheng-Lin Liu, “Robust classification with convolutional prototype learning,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 3474–3482.
[8] Dimity Miller, Niko Sunderhauf, Michael Milford, and Feras Dayoub, “Class anchor clustering: A loss for distance-based open set recognition,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2021, pp. 3570–3578.