OVLW-DETR: Open-Vocabulary Light-Weighted Detection Transformer
Yu Wang, Xiangbo Su, Qiang Chen, Xinyu Zhang, Teng Xi, Kun Yao, Errui Ding, Gang Zhang, Jingdong Wang

TL;DR
This paper introduces OVLW-DETR, a lightweight open-vocabulary detection transformer that achieves strong zero-shot detection performance at low latency by transferring knowledge from a vision-language model (VLM) through a simple alignment step.
Contribution
It proposes a deployment-friendly open-vocabulary detector and an end-to-end training recipe that aligns the detector with a VLM text encoder: class-name embeddings from the text encoder take the place of the detector's fixed classification-layer weights, with no additional fusion module.
Findings
Outperforms existing real-time open-vocabulary detectors on the LVIS benchmark.
Achieves strong zero-shot detection performance with low latency.
Provides an end-to-end training method with simple alignment.
Abstract
Open-vocabulary object detection focuses on detecting novel categories guided by natural language. In this report, we propose Open-Vocabulary Light-Weighted Detection Transformer (OVLW-DETR), a deployment-friendly open-vocabulary detector with strong performance and low latency. Building upon OVLW-DETR, we provide an end-to-end training recipe that transfers knowledge from a vision-language model (VLM) to the object detector with simple alignment. We align the detector with the text encoder from the VLM by replacing the fixed classification layer weights in the detector with the class-name embeddings extracted from the text encoder. Without an additional fusing module, OVLW-DETR is flexible and deployment friendly, making it easier to implement and modulate and improving the efficiency of interleaved attention computation. Experimental results demonstrate that the proposed approach is superior to…
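The alignment step described in the abstract, where class-name embeddings stand in for fixed classification weights, can be illustrated with a small sketch. This is not the authors' implementation: the function names, the cosine-similarity scoring, the `temperature` value, and the hand-written toy vectors (which stand in for real VLM text-encoder outputs and detector query features) are all assumptions made for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def open_vocab_logits(query_features, class_name_embeddings, temperature=0.07):
    """Score each query feature against every class-name embedding.

    The embeddings play the role of the classification-layer weight matrix,
    so the class vocabulary is whatever dictionary is passed in.
    `temperature` is a hypothetical scaling factor, not a value from the paper.
    """
    names = list(class_name_embeddings)
    text = [l2_normalize(class_name_embeddings[name]) for name in names]
    logits = []
    for query in query_features:
        q = l2_normalize(query)
        row = [sum(a * b for a, b in zip(q, t)) / temperature for t in text]
        logits.append(row)
    return names, logits


def predict_labels(query_features, class_name_embeddings):
    """Return the best-matching class name for each query feature."""
    names, logits = open_vocab_logits(query_features, class_name_embeddings)
    return [names[max(range(len(row)), key=row.__getitem__)] for row in logits]


if __name__ == "__main__":
    # Toy stand-ins for text-encoder outputs (assumed, not real VLM embeddings).
    vocabulary = {"cat": [1.0, 0.0, 0.0], "dog": [0.0, 1.0, 0.0]}
    queries = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1]]
    print(predict_labels(queries, vocabulary))

    # Adding a category only requires a new embedding; no weights are retrained.
    vocabulary["zebra"] = [0.0, 0.0, 1.0]
    print(predict_labels([[0.0, 0.1, 0.9]], vocabulary))
```

Because the classifier weights are just a lookup of text embeddings, extending the vocabulary at inference time amounts to encoding additional class names, which is consistent with the abstract's claim that no extra fusing module is needed.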
Taxonomy
Topics: Image Retrieval and Classification Techniques · Advanced Chemical Sensor Technologies · Advanced Image and Video Retrieval Techniques
Methods: Attention Is All You Need · Byte Pair Encoding · Layer Normalization · ALIGN · Linear Layer · Label Smoothing · Adam · Dropout · Multi-Head Attention · Dense Connections
