Visual Modality Prompt for Adapting Vision-Language Object Detectors
Heitor R. Medeiros, Atif Belal, Srikanth Muralidharan, Eric Granger, and Marco Pedersoli

TL;DR
This paper introduces ModPrompt, a visual prompt method that adapts vision-language object detectors to new modalities such as infrared and depth without losing zero-shot performance, while reaching performance comparable to full fine-tuning.
Contribution
The paper proposes ModPrompt, a novel visual prompt strategy with an encoder-decoder design and residuals, enabling effective modality adaptation for vision-language detectors without degrading zero-shot capabilities.
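As a rough illustration of the idea, the sketch below contrasts an input-agnostic prompt (the same learned offset added to every image) with an input-conditioned prompt of the form x + decoder(encoder(x)). It is a toy NumPy example under stated assumptions, not the authors' implementation: the class name `TinyEncoderDecoderPrompt`, the single linear encoder and decoder layers, the tanh nonlinearity, and the random (untrained) weights are all hypothetical, and the detector itself is not modeled.

```python
import numpy as np

rng = np.random.default_rng(0)


def fixed_prompt(image, delta):
    """Input-agnostic prompt: the same offset is added to every image."""
    return image + delta


class TinyEncoderDecoderPrompt:
    """Hypothetical input-conditioned prompt: image + decoder(encoder(image))."""

    def __init__(self, dim, hidden, rng):
        # Random weights stand in for parameters that would be learned.
        self.w_enc = rng.normal(scale=0.1, size=(dim, hidden))
        self.w_dec = rng.normal(scale=0.1, size=(hidden, dim))

    def __call__(self, image):
        flat = image.reshape(-1)
        code = np.tanh(flat @ self.w_enc)   # encoder: compress the image
        residual = code @ self.w_dec        # decoder: expand back to image size
        return image + residual.reshape(image.shape)  # residual connection


# Two toy single-channel "images" (e.g. stand-ins for infrared frames).
img_a = rng.random((4, 4))
img_b = rng.random((4, 4))

delta = rng.normal(scale=0.1, size=(4, 4))
prompt = TinyEncoderDecoderPrompt(dim=16, hidden=8, rng=rng)

# The fixed prompt changes both images identically; the conditioned one does not.
same_shift = np.allclose(fixed_prompt(img_a, delta) - img_a,
                         fixed_prompt(img_b, delta) - img_b)
conditioned_shift_differs = not np.allclose(prompt(img_a) - img_a,
                                            prompt(img_b) - img_b)
print(same_shift, conditioned_shift_differs)
```

Because the residual is computed from the image itself, each input receives its own translation, and the residual connection keeps the prompted image close to the original when the decoder output is small.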
Findings
Achieves comparable performance to full fine-tuning on infrared and depth datasets.
Preserves zero-shot detection capabilities after modality adaptation.
Effective on YOLO-World and Grounding DINO detectors.
Abstract
The zero-shot performance of object detectors degrades when tested on different modalities, such as infrared and depth. While recent work has explored image translation techniques to adapt detectors to new modalities, these methods are limited to a single modality and apply only to traditional detectors. Recently, vision-language detectors, such as YOLO-World and Grounding DINO, have shown promising zero-shot capabilities; however, they have not yet been adapted for other visual modalities. Traditional fine-tuning approaches compromise the zero-shot capabilities of the detectors. The visual prompt strategies commonly used for classification with vision-language models apply the same linear prompt translation to each image, making them less effective. To address these limitations, we propose ModPrompt, a visual prompt strategy to adapt vision-language detectors to new modalities without…
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications
Methods: Attention Is All You Need · Linear Layer · Softmax · Multi-Head Attention · Dense Connections · Layer Normalization · Residual Connection · Vision Transformer · self-DIstillation with NO labels
