Taming Self-Training for Open-Vocabulary Object Detection
Shiyu Zhao, Samuel Schulter, Long Zhao, Zhixing Zhang, Vijay Kumar B.G, Yumin Suh, Manmohan Chandraker, Dimitris N. Metaxas

TL;DR
This paper introduces SAS-Det, a self-training approach for open-vocabulary object detection. It addresses noisy pseudo labels and frequent changes in the pseudo-label distribution by splitting the detection head into an open branch and a closed branch and by updating the teacher periodically.
Contribution
SAS-Det proposes a split-and-fusion detection head and a periodic teacher update strategy to enhance self-training for open-vocabulary detection, reducing noise and stabilizing training.
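The periodic teacher update can be illustrated with a minimal sketch. It assumes that a refresh is a plain copy of the student weights into the teacher and that the teacher stays frozen between refreshes; the summary does not state these details. The names `periodic_update` and `run`, the `period` parameter, and the toy "gradient step" are all hypothetical and are not taken from the authors' code.

```python
from typing import Dict, List, Tuple

Params = Dict[str, float]  # toy stand-in for model weights


def periodic_update(teacher: Params, student: Params, step: int, period: int) -> Params:
    """Return refreshed teacher weights every `period` steps, else the old teacher.

    Assumption: a refresh is a plain copy of the student weights.
    """
    if step > 0 and step % period == 0:
        return dict(student)
    return teacher


def run(num_steps: int, period: int) -> Tuple[Params, List[int]]:
    """Simulate training and record the steps at which the teacher changes."""
    student: Params = {"w": 0.0}
    teacher: Params = dict(student)
    refresh_steps: List[int] = []
    for step in range(1, num_steps + 1):
        # Hypothetical stand-in for one student gradient step on pseudo labels.
        student = {"w": student["w"] + 1.0}
        new_teacher = periodic_update(teacher, student, step, period)
        if new_teacher is not teacher:
            refresh_steps.append(step)
        teacher = new_teacher
    return teacher, refresh_steps
```

Under these assumptions the pseudo labels produced by the teacher can only change at refresh steps, so the student sees a fixed label distribution for `period` iterations at a time.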
Findings
Outperforms recent models on COCO and LVIS benchmarks.
Achieves 37.4 AP50 and 29.1 APr on novel categories of COCO and LVIS, respectively.
Demonstrates efficiency and effectiveness of the proposed methods.
Abstract
Recent studies have shown promising performance in open-vocabulary object detection (OVD) by utilizing pseudo labels (PLs) from pretrained vision and language models (VLMs). However, teacher-student self-training, a powerful and widely used paradigm to leverage PLs, is rarely explored for OVD. This work identifies two challenges of using self-training in OVD: noisy PLs from VLMs and frequent distribution changes of PLs. To address these challenges, we propose SAS-Det that tames self-training for OVD from two key perspectives. First, we present a split-and-fusion (SAF) head that splits a standard detection head into an open-branch and a closed-branch. This design can reduce noisy supervision from pseudo boxes. Moreover, the two branches learn complementary knowledge from different training data, significantly enhancing performance when fused together. Second, in our view, unlike in closed-set…
Peer Reviews
Decision: ICLR 2024 Conference Withdrawn Submission
* The intention of this paper is clear, focusing on how to achieve effective self-training in Open Vocabulary Object Detection (OVOD) and achieving promising results on the COCO and LVIS datasets.
Overall, this paper addresses the challenging problem of open vocabulary from the perspective of pseudo label. I believe the main issues with this paper are as follows: * Teacher-student self-training is a classic framework in semi-supervised object detection. However, the proposed periodic update strategy in this paper actually only modifies the frequency of teacher network updates, which lacks any innovation. In addition, the author claims that reducing the frequency of teacher network updates
- This paper is well-written and well organized. - Extensive experiments are conducted to validate the effectiveness.
- I acknowledge the importance of studying the self-training problem in open-vocabulary object detection. My main concern is the technical novelty of the split-and-fusion head for self-training and the periodic update strategy for the teacher model. As far as I know, the proposed split-and-fusion head is quite similar to the noise-bypass head in [1] for semi-supervised object detection, and the proposed periodic update strategy is quite similar to the periodic update strategy in [2] for source-f
The paper makes self-training work for open-vocabulary object detection and obtains state-of-the-art results, which are non-trivial given prior attempts that are not so simple. Most of the modifications are verified by experiments, thus providing insights to the community for future research.
1. The hypothesis claimed in this paper is not verified. Most modifications are motivated by the idea that pseudo-labels are noisy and change frequently. However, it is not confirmed by any analysis. The periodic update strategy and the SAF head are indeed effective, but it does not mean that they work because they reduce the noise in pseudo labels. 2. The ablation study does not reveal the entangled effects between components. For example, what if the baseline uses SAF head but does not use ex
1. The paper is well written and easy to follow. 2. The two problems addressed in this paper, i.e., reducing noisy PLs and making teacher model updates stable, are very important in self-training. 3. The experiments on two benchmarks achieve good performance.
1. Although the two problems identified in this paper are key, the methods proposed to address them are trivial, especially the periodic update, which updates the teacher after a set number of iterations. It is just like a trick. 2. Theoretical and experimental analysis of the teacher model updates and the proposed periodic update is missing. 3. The architecture of the teacher model is not clear. 4. It’s better to see the experiments that only train the open branch with novel
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
