Scaling Novel Object Detection with Weakly Supervised Detection Transformers
Tyler LaBonte, Yale Song, Xin Wang, Vibhav Vineet, Neel Joshi

TL;DR
This paper introduces the Weakly Supervised Detection Transformer, which uses large-scale pretraining to detect novel objects from image-level labels. It is more efficient and more accurate than existing weakly supervised object detection (WSOD) methods, particularly at large data scales.
Contribution
The paper presents a novel WSOD framework using transformers that enables effective knowledge transfer from large pretraining datasets to detect many novel objects.
Findings
The proposed method outperforms previous state-of-the-art WSOD models on large-scale datasets
Class quantity is more crucial than image quantity for WSOD pretraining
The proposed method reduces training rounds and refinement steps
Abstract
A critical object detection task is finetuning an existing model to detect novel objects, but the standard workflow requires bounding box annotations which are time-consuming and expensive to collect. Weakly supervised object detection (WSOD) offers an appealing alternative, where object detectors can be trained using image-level labels. However, the practical application of current WSOD models is limited, as they only operate at small data scales and require multiple rounds of training and refinement. To address this, we propose the Weakly Supervised Detection Transformer, which enables efficient knowledge transfer from a large-scale pretraining dataset to WSOD finetuning on hundreds of novel objects. Additionally, we leverage pretrained knowledge to improve the multiple instance learning (MIL) framework often used in WSOD methods. Our experiments show that our approach outperforms…
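For readers unfamiliar with the multiple instance learning (MIL) framework mentioned in the abstract, the following is a minimal sketch of a generic two-stream MIL head of the kind commonly used in WSOD. It is not the paper's implementation: the function names (`mil_image_scores`, `mil_loss`), the toy logits, and the binary cross-entropy objective are all assumptions chosen for illustration. The idea is that each proposal gets a per-class score, the scores are summed over proposals to give an image-level prediction, and only image-level labels are needed to compute the loss.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def mil_image_scores(cls_logits, det_logits):
    """Hypothetical two-stream MIL head (illustrative, not the paper's code).

    Both inputs are nested lists shaped [num_proposals][num_classes].
    Returns (proposal_scores, image_scores).
    """
    num_props = len(cls_logits)
    num_classes = len(cls_logits[0])

    # Classification stream: softmax over classes, one row per proposal.
    cls_probs = [softmax(row) for row in cls_logits]

    # Detection stream: softmax over proposals, one column per class.
    det_probs = [
        softmax([det_logits[p][c] for p in range(num_props)])
        for c in range(num_classes)
    ]

    # Proposal score is the product of the two streams.
    proposal_scores = [
        [cls_probs[p][c] * det_probs[c][p] for c in range(num_classes)]
        for p in range(num_props)
    ]

    # Image-level score: sum over proposals, which stays within [0, 1].
    image_scores = [
        sum(proposal_scores[p][c] for p in range(num_props))
        for c in range(num_classes)
    ]
    return proposal_scores, image_scores


def mil_loss(image_scores, labels, eps=1e-7):
    """Mean binary cross-entropy against image-level labels (0 or 1)."""
    total = 0.0
    for score, label in zip(image_scores, labels):
        score = min(max(score, eps), 1.0 - eps)
        total -= label * math.log(score) + (1 - label) * math.log(1.0 - score)
    return total / len(labels)


# Toy example: 3 proposals, 2 classes, both classes present in the image.
cls_logits = [[4.0, 0.0], [0.0, 4.0], [0.0, 0.0]]
det_logits = [[3.0, 0.0], [0.0, 3.0], [0.0, 0.0]]
proposal_scores, image_scores = mil_image_scores(cls_logits, det_logits)
loss_when_present = mil_loss(image_scores, [1, 1])
loss_when_absent = mil_loss(image_scores, [0, 0])
```

In this toy example the first proposal dominates class 0 and the second dominates class 1, so the loss is lower when both classes are labeled present than when both are labeled absent. No bounding boxes appear anywhere in the objective, which is what makes the supervision weak.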
Taxonomy
Topics: Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Softmax · Dense Connections · Absolute Position Encodings · Dropout · Byte Pair Encoding · Position-Wise Feed-Forward Layer · Adam
