YOLO-UniOW: Efficient Universal Open-World Object Detection
Lihao Liu, Juexiao Feng, Hui Chen, Ao Wang, Lin Song, Jungong Han, Guiguang Ding

TL;DR
YOLO-UniOW is a new efficient model for open-world object detection that can recognize known categories and detect unknown objects, adapting dynamically without retraining.
Contribution
It introduces Adaptive Decision Learning and Wildcard Learning strategies, enabling efficient, versatile, and open-vocabulary object detection in real-time environments.
Findings
Achieves 34.6 AP on LVIS at 69.6 FPS
Sets new benchmarks on multiple open-world datasets
Effectively detects unknown objects and adapts to new categories
Abstract
Traditional object detection models are constrained by the limitations of closed-set datasets, detecting only categories encountered during training. While multimodal models have extended category recognition by aligning text and image modalities, they introduce significant inference overhead due to cross-modality fusion and still remain restricted by predefined vocabulary, leaving them ineffective at handling unknown objects in open-world scenarios. In this work, we introduce Universal Open-World Object Detection (Uni-OWD), a new paradigm that unifies open-vocabulary and open-world object detection tasks. To address the challenges of this setting, we propose YOLO-UniOW, a novel model that advances the boundaries of efficiency, versatility, and performance. YOLO-UniOW incorporates Adaptive Decision Learning to replace computationally expensive cross-modality fusion with lightweight…
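To make the idea of recognizing known categories while flagging unknown objects more concrete, here is a minimal sketch and not the paper's implementation. It assumes region embeddings and text embeddings already live in a shared space, scores them by cosine similarity, and uses a single hypothetical "wildcard" embedding to catch objects that match no known class. The function names, the thresholds, and the three-way decision rule are all illustrative assumptions.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def classify_regions(region_emb, class_emb, wildcard_emb, class_names,
                     known_thresh=0.5, unknown_thresh=0.5):
    """Label each region as a known class, "unknown", or "background".

    Hypothetical decision rule (an assumption, not taken from the paper):
      1. If the best known-class similarity clears known_thresh, use that class.
      2. Otherwise, if the wildcard similarity clears unknown_thresh, say "unknown".
      3. Otherwise treat the region as background.
    """
    region = l2_normalize(np.asarray(region_emb, dtype=float))
    classes = l2_normalize(np.asarray(class_emb, dtype=float))
    wildcard = l2_normalize(np.asarray(wildcard_emb, dtype=float))

    known_sim = region @ classes.T   # shape: (num_regions, num_classes)
    wild_sim = region @ wildcard     # shape: (num_regions,)

    labels = []
    for i in range(region.shape[0]):
        best = int(np.argmax(known_sim[i]))
        if known_sim[i, best] >= known_thresh:
            labels.append(class_names[best])
        elif wild_sim[i] >= unknown_thresh:
            labels.append("unknown")
        else:
            labels.append("background")
    return labels


if __name__ == "__main__":
    # Toy 3-D embeddings chosen by hand purely for illustration.
    class_names = ["cat", "dog"]
    class_emb = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
    wildcard_emb = [1.0, 1.0, 1.0]
    regions = [[0.9, 0.1, 0.0], [0.1, 0.1, 1.0], [0.0, 0.0, -1.0]]
    print(classify_regions(regions, class_emb, wildcard_emb, class_names))
```

In this sketch, supporting a new category amounts to appending one more row to `class_emb` and one more entry to `class_names`. No weights change, which mirrors the "adapting dynamically without retraining" claim only at a conceptual level.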
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Advanced Neural Network Applications · Video Surveillance and Tracking Methods
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Contrastive Language-Image Pre-training
