CLIP-Joint-Detect: End-to-End Joint Training of Object Detectors with Contrastive Vision-Language Supervision

Behnam Raoufi; Hossein Sharify; Mohamad Mahdee Ramezanee; Khosrow Hajsadeghi; Saeed Bagheri Shouraki

arXiv:2512.22969·cs.CV·December 30, 2025

TL;DR

CLIP-Joint-Detect is an end-to-end training framework that adds CLIP-style contrastive vision-language supervision to object detectors, improving accuracy for both two-stage and one-stage architectures on Pascal VOC and MS COCO while preserving real-time inference.

Contribution

A detector-agnostic joint training method that adds CLIP-style contrastive supervision to standard detection frameworks through a lightweight parallel head, optimized together with the usual detection losses.

Findings

01. Consistent and substantial accuracy improvements on Pascal VOC 2007+2012 and MS COCO 2017.

02. Effective for both two-stage (Faster R-CNN) and one-stage (YOLOv11) detection architectures.

03. Preserves real-time inference speed while improving detection performance.

Abstract

Conventional object detectors rely on cross-entropy classification, which can be vulnerable to class imbalance and label noise. We propose CLIP-Joint-Detect, a simple and detector-agnostic framework that integrates CLIP-style contrastive vision-language supervision through end-to-end joint training. A lightweight parallel head projects region or grid features into the CLIP embedding space and aligns them with learnable class-specific text embeddings via InfoNCE contrastive loss and an auxiliary cross-entropy term, while all standard detection losses are optimized simultaneously. The approach applies seamlessly to both two-stage and one-stage architectures. We validate it on Pascal VOC 2007+2012 using Faster R-CNN and on the large-scale MS COCO 2017 benchmark using modern YOLO detectors (YOLOv11), achieving consistent and substantial improvements while preserving real-time inference…
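The alignment step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes the parallel head is a single linear projection (`proj_weight`, a hypothetical name), that the InfoNCE term takes the common form of a temperature-scaled softmax over cosine similarities between each projected feature and every class text embedding, and that the temperature is 0.07 (an illustrative default, not a value taken from the paper).

```python
import numpy as np

def contrastive_alignment_loss(region_feats, proj_weight, text_embeds, labels,
                               temperature=0.07):
    """InfoNCE-style loss between projected features and class text embeddings.

    region_feats: (N, F) region or grid features from the detector.
    proj_weight:  (F, D) hypothetical linear projection into the embedding space.
    text_embeds:  (C, D) one learnable text embedding per class.
    labels:       (N,) integer class index of each feature.
    """
    # Project detector features into the shared embedding space.
    z = region_feats @ proj_weight
    # L2-normalize both sides so the dot product is a cosine similarity.
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    t = text_embeds / np.linalg.norm(text_embeds, axis=1, keepdims=True)
    # Similarity of every feature to every class embedding, scaled by temperature.
    logits = (z @ t.T) / temperature
    # Numerically stable log-softmax over classes.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Negative log-likelihood of the matching class, averaged over features.
    return float(-log_probs[np.arange(len(labels)), labels].mean())
```

In joint training, a term like this would be added to the detector's standard losses and optimized in the same backward pass. How the terms are weighted against each other is not stated in the excerpt above.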

Taxonomy

Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Adversarial Robustness in Machine Learning