Visual Modality Prompt for Adapting Vision-Language Object Detectors

Heitor R. Medeiros, Atif Belal, Srikanth Muralidharan, Eric Granger, and Marco Pedersoli

arXiv:2412.00622 · cs.CV · March 18, 2025

TL;DR

This paper introduces ModPrompt, a visual prompt method that adapts vision-language object detectors to new modalities such as infrared and depth without degrading zero-shot performance, reaching results comparable to full fine-tuning.

Contribution

The paper proposes ModPrompt, a novel visual prompt strategy with an encoder-decoder design and residuals, enabling effective modality adaptation for vision-language detectors without degrading zero-shot capabilities.
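
To make the contrast concrete, here is a minimal NumPy sketch of the two prompt styles this page mentions: a fixed prompt that applies the same translation to every image, and an input-conditioned prompt built from an encoder-decoder with a residual connection. It is a toy illustration and not the authors' architecture. The function names, the average-pool encoder, the nearest-neighbour decoder, and the `scale` parameter are all assumptions made for this example; a real prompt module would use learned layers.

```python
import numpy as np


def fixed_prompt(image, delta):
    """Input-agnostic prompt: the same offset is added to every image."""
    return image + delta


def encode(image, factor=2):
    """Toy encoder (assumption): average-pool height and width by `factor`."""
    c, h, w = image.shape
    blocks = image.reshape(c, h // factor, factor, w // factor, factor)
    return blocks.mean(axis=(2, 4))


def decode(code, factor=2):
    """Toy decoder (assumption): nearest-neighbour upsampling to input size."""
    return code.repeat(factor, axis=1).repeat(factor, axis=2)


def conditioned_prompt(image, scale=0.1):
    """Input-conditioned prompt: the encoder-decoder output is added back
    to the input through a residual connection, so the change applied
    depends on the image content."""
    return image + scale * decode(encode(image))


# Two different single-channel 4x4 "images".
dark = np.zeros((1, 4, 4))
bright = np.ones((1, 4, 4))
delta = np.full((1, 4, 4), 0.5)

# The fixed prompt changes both images by the same amount.
same_shift = np.allclose(fixed_prompt(dark, delta) - dark,
                         fixed_prompt(bright, delta) - bright)

# The conditioned prompt changes each image differently.
different_shift = not np.allclose(conditioned_prompt(dark) - dark,
                                  conditioned_prompt(bright) - bright)
```

Because the output keeps the input's shape, a module of this kind can sit in front of a frozen detector, which is one plausible reading of how a prompt can adapt the input without changing the detector's weights.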

Findings

01. Achieves comparable performance to full fine-tuning on infrared and depth datasets.

02. Preserves zero-shot detection capabilities after modality adaptation.

03. Effective on YOLO-World and Grounding DINO detectors.

Abstract

The zero-shot performance of object detectors degrades when tested on different modalities, such as infrared and depth. While recent work has explored image translation techniques to adapt detectors to new modalities, these methods are limited to a single modality and apply only to traditional detectors. Recently, vision-language detectors, such as YOLO-World and Grounding DINO, have shown promising zero-shot capabilities; however, they have not yet been adapted for other visual modalities. Traditional fine-tuning approaches compromise the zero-shot capabilities of the detectors. The visual prompt strategies commonly used for classification with vision-language models apply the same linear prompt translation to each image, making them less effective. To address these limitations, we propose ModPrompt, a visual prompt strategy to adapt vision-language detectors to new modalities without…

Code & Models

Repositories

heitorrapela/modprompt (Official)

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications

Methods: Attention Is All You Need · Linear Layer · Softmax · Multi-Head Attention · Dense Connections · Layer Normalization · Residual Connection · Vision Transformer · self-DIstillation with NO labels