YOLO-World: Real-Time Open-Vocabulary Object Detection

Tianheng Cheng; Lin Song; Yixiao Ge; Wenyu Liu; Xinggang Wang; Ying Shan

arXiv:2401.17270·cs.CV·February 23, 2024·26 cites

TL;DR

YOLO-World extends YOLO detectors to open-vocabulary object detection by integrating vision-language modeling, enabling efficient zero-shot detection and outperforming many state-of-the-art methods on the LVIS benchmark in both accuracy and speed.

Contribution

The paper introduces RepVL-PAN, a re-parameterizable vision-language path aggregation network, and a region-text contrastive loss for open-vocabulary detection, improving YOLO's ability to detect unseen objects in real time.

Findings

01 Achieves 35.4 AP on the LVIS dataset

02 Runs at 52.0 FPS on a V100 GPU

03 Outperforms many state-of-the-art methods in both accuracy and speed

Abstract

The You Only Look Once (YOLO) series of detectors have established themselves as efficient and practical tools. However, their reliance on predefined and trained object categories limits their applicability in open scenarios. Addressing this limitation, we introduce YOLO-World, an innovative approach that enhances YOLO with open-vocabulary detection capabilities through vision-language modeling and pre-training on large-scale datasets. Specifically, we propose a new Re-parameterizable Vision-Language Path Aggregation Network (RepVL-PAN) and region-text contrastive loss to facilitate the interaction between visual and linguistic information. Our method excels in detecting a wide range of objects in a zero-shot manner with high efficiency. On the challenging LVIS dataset, YOLO-World achieves 35.4 AP with 52.0 FPS on V100, which outperforms many state-of-the-art methods in terms of both…
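To make the idea of a region-text contrastive loss more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the function names, the temperature value, and the toy embeddings are all assumptions made for illustration, and the paper's actual formulation (label assignment, scaling, how embeddings are produced) may differ. The sketch only shows the general pattern: compare each region embedding with every text embedding by cosine similarity, then apply cross-entropy against the index of the matching text.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def region_text_contrastive_loss(region_embs, text_embs, targets, temperature=0.07):
    """Mean cross-entropy over region-to-text cosine similarities.

    region_embs: list of region embedding vectors (hypothetical detector output)
    text_embs:   list of text embedding vectors, one per vocabulary entry
    targets:     for each region, the index of its matching text entry
    temperature: assumed scaling factor, not taken from the paper
    """
    regions = [l2_normalize(r) for r in region_embs]
    texts = [l2_normalize(t) for t in text_embs]

    total = 0.0
    for region, target in zip(regions, targets):
        # Cosine similarity with every text entry, sharpened by the temperature.
        logits = [sum(a * b for a, b in zip(region, t)) / temperature for t in texts]
        # Numerically stable log-sum-exp for the softmax normaliser.
        peak = max(logits)
        log_z = peak + math.log(sum(math.exp(v - peak) for v in logits))
        # Cross-entropy: negative log-probability of the matching text.
        total += log_z - logits[target]
    return total / len(targets)


# Toy data: two regions and a two-entry vocabulary, e.g. "dog" and "kite".
region_embs = [[0.9, 0.1, 0.0], [0.0, 0.2, 0.8]]
text_embs = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]

matched = region_text_contrastive_loss(region_embs, text_embs, [0, 1])
swapped = region_text_contrastive_loss(region_embs, text_embs, [1, 0])
print(f"matched={matched:.6f} swapped={swapped:.6f}")
```

With these toy vectors the loss for correctly paired regions and texts comes out far lower than the loss for swapped pairs, which is the behaviour a contrastive objective is meant to reward. Because the vocabulary is just a list of text embeddings, changing the detectable categories in this sketch means changing `text_embs`, not the loss itself.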

Code & Models

Models

D-Robotics/DOSOD (Hugging Face model)

Taxonomy

Topics: Natural Language Processing Techniques