Enhancing Open-Vocabulary Object Detection through Multi-Level Fine-Grained Visual-Language Alignment
Tianyi Zhang, Antoine Simoulin, Kai Li, Sana Lakdawala, Shiqing Yu, Arpit Mittal, Hongyu Fu, and Yu Lin

TL;DR
This paper introduces VLDet, a novel framework that enhances open-vocabulary object detection by improving multi-level visual-language alignment, leading to state-of-the-art results on COCO and LVIS datasets.
Contribution
VLDet revamps feature pyramids for fine-grained visual-language alignment and introduces SigRPN with contrastive loss, advancing open-vocabulary object detection capabilities.
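As a rough illustration of region proposal as a binary visual-language alignment task, the sketch below scores each region feature against a single text embedding and trains the score with a sigmoid (binary cross-entropy) loss. This is an assumption-laden toy, not the paper's SigRPN: the function names, the single "foreground" text embedding, and the `scale` and `bias` parameters are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def objectness_scores(region_feats, text_emb, scale=10.0, bias=0.0):
    """Score each region by its sigmoid-squashed similarity to the text embedding.

    `scale` and `bias` are hypothetical calibration parameters for this sketch.
    """
    return [sigmoid(scale * cosine(r, text_emb) + bias) for r in region_feats]


def sigmoid_alignment_loss(region_feats, text_emb, labels, scale=10.0, bias=0.0):
    """Mean binary cross-entropy; label 1 = foreground, label 0 = background."""
    eps = 1e-12  # guards against log(0)
    scores = objectness_scores(region_feats, text_emb, scale, bias)
    total = 0.0
    for p, y in zip(scores, labels):
        total += -(y * math.log(p + eps) + (1 - y) * math.log(1.0 - p + eps))
    return total / len(labels)


# Toy data: one region aligned with the text embedding, one pointing away from it.
text_emb = [1.0, 0.0]
regions = [[0.9, 0.1], [-0.8, 0.2]]
labels = [1, 0]

scores = objectness_scores(regions, text_emb)
loss = sigmoid_alignment_loss(regions, text_emb, labels)
flipped_loss = sigmoid_alignment_loss(regions, text_emb, [0, 1])
```

Because each region is scored independently with a sigmoid rather than a softmax over candidates, any number of regions in an image can be marked as foreground at once, which suits proposal generation.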
Findings
Achieves 58.7 AP on COCO novel classes
Surpasses state-of-the-art on LVIS with 24.8 AP
Demonstrates superior zero-shot detection performance
Abstract
Traditional object detection systems are typically constrained to predefined categories, limiting their applicability in dynamic environments. In contrast, open-vocabulary object detection (OVD) enables the identification of objects from novel classes not present in the training set. Recent advances in visual-language modeling have led to significant progress in OVD. However, prior works face challenges in either adapting the single-scale image backbone from CLIP to the detection framework or ensuring robust visual-language alignment. We propose Visual-Language Detection (VLDet), a novel framework that revamps the feature pyramid for fine-grained visual-language alignment, leading to improved OVD performance. With the VL-PUB module, VLDet effectively exploits the visual-language knowledge from CLIP and adapts the backbone for object detection through a feature pyramid. In addition, we…
Peer Reviews
Decision: Submitted to ICLR 2026
1. The model achieves state-of-the-art novel-class average precision on COCO and LVIS benchmarks, given its specified pretraining data. A series of well-designed ablations (on captions, language-image contrastive learning, visual-language fusion, and multi-scale features) credibly demonstrate the source of its performance gains.
2. The proposed SigRPN module is an elegant design. By framing region proposal as a binary visual-language alignment task for foreground and background se

1. The substantial requirement of 64 A100 GPUs for pretraining will likely hinder widespread adoption. Furthermore, the paper does not profile the inference cost, making it difficult to assess its practical efficiency compared to prior two-stage open-vocabulary detectors.
2. While the method of using a foundation vision-language model to generate captions for the Objects365 dataset is described, the exact model, prompts, and data cleaning procedures are not fully specified. This lack of detail

1. The paper provides visualization examples (Figure 1 and Figure 4) that make the differences between models easy to see.
2. The paper provides detailed ablation studies in Table 3, Table 4, and Table 5, showing that each component of the proposed method is useful and that the improvements are orthogonal.
3. The formulas in the paper are well-written, making the proposed method easy to understand.

1. The paper proposes three independent innovations (Visual-Language Pyramid Upscale Block, Visual-Language Region Proposal Network, and Mini-Batch Image Contrastive Loss) without a unified motivation, so the paper reads as A + B + C.
2. Using a multi-scale feature pyramid is a common practice in object detection. I don't think it is a common problem that needs to be tackled. As ViT-based CLIP produces single-scale features, we can use ViTDet [1,2] to produce a feature pyramid.
3. Vision-language dee

- The paper is clearly written and logically structured, with a well-motivated problem and methodology.
- The proposed VL-PUB and SigRPN modules directly address critical limitations of prior OVD methods.
- Extensive experiments on standard benchmarks show consistent and significant performance improvements.

- The introduction could be better organized to separate the problem definition, motivation, and main contributions.
- The VL-Fuse layer employs bi-directional cross-attention, but its mechanism is not clearly explained or supported with mathematical details.
- The paper lacks comparisons with more recent OVD baselines that use stronger backbones or alignment mechanisms.
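
For readers unfamiliar with the term, bi-directional cross-attention between visual and text tokens is commonly written in the generic form below. This is a textbook-style formulation offered only as background; it is an assumption that VL-Fuse resembles it, and the symbols (visual tokens V, text tokens T, projection matrices W, shared dimension d) are illustrative rather than taken from the paper.

```latex
% Generic bi-directional cross-attention (illustrative; not the paper's VL-Fuse).
% V \in \mathbb{R}^{N \times d}: visual tokens,  T \in \mathbb{R}^{M \times d}: text tokens.
\begin{aligned}
\tilde{V} &= V + \operatorname{softmax}\!\left(\frac{(V W_q^{v})\,(T W_k^{t})^{\top}}{\sqrt{d}}\right) T W_v^{t}, \\
\tilde{T} &= T + \operatorname{softmax}\!\left(\frac{(T W_q^{t})\,(V W_k^{v})^{\top}}{\sqrt{d}}\right) V W_v^{v}.
\end{aligned}
```

In the first line the visual tokens query the text tokens, and in the second the roles are swapped, so each modality is updated with information from the other.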
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning
