Learning Multi-Modal Prototypes for Cross-Domain Few-Shot Object Detection
Wanqi Wang, Jingcai Guo, Yuxiang Cai, Zhi Chen

TL;DR
This paper introduces LMP, a dual-branch detector that combines textual guidance with visual prototypes for cross-domain few-shot object detection and reports state-of-the-art mAP on six cross-domain benchmarks.
Contribution
The paper proposes a multi-modal prototype learning framework that integrates visual exemplars and text prompts to detect novel classes in unseen domains from only a few labeled examples.
Findings
Achieves state-of-the-art mAP on six cross-domain benchmarks.
Effectively combines visual and textual information for detection.
Performs well across the 1-, 5-, and 10-shot settings.
Abstract
Cross-Domain Few-Shot Object Detection (CD-FSOD) aims to detect novel classes in unseen target domains given only a few labeled examples. While open-vocabulary detectors built on vision-language models (VLMs) transfer well, they depend almost entirely on text prompts, which encode domain-invariant semantics but miss domain-specific visual information needed for precise localization under few-shot supervision. We propose a dual-branch detector that Learns Multi-modal Prototypes, dubbed LMP, by coupling textual guidance with visual exemplars drawn from the target domain. A Visual Prototype Construction module aggregates class-level prototypes from support RoIs and dynamically generates hard-negative prototypes in query images via jittered boxes, capturing distractors and visually similar backgrounds. In the visual-guided branch, we inject these prototypes into the detection pipeline with…
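The prototype construction step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes class prototypes are plain per-class means of support RoI feature vectors, and that hard-negative regions are shifted copies of a reference box that overlap it only slightly. The function names (`class_prototypes`, `iou`, `jittered_boxes`) and the parameters (`max_iou`, `n`, `seed`, `max_tries`) are hypothetical and chosen only for illustration.

```python
import random
from collections import defaultdict


def class_prototypes(features, labels):
    """Average support RoI feature vectors per class.

    Mean pooling is an assumption; the abstract does not say how
    the support features are aggregated.
    """
    grouped = defaultdict(list)
    for feat, label in zip(features, labels):
        grouped[label].append(feat)
    return {
        label: [sum(dim) / len(feats) for dim in zip(*feats)]
        for label, feats in grouped.items()
    }


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def jittered_boxes(box, n=8, max_iou=0.3, seed=0, max_tries=1000):
    """Sample shifted copies of `box` whose IoU with it is below `max_iou`.

    The shift range (up to one box width/height) and the IoU threshold
    are illustrative assumptions, not values from the paper.
    """
    rng = random.Random(seed)
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    negatives = []
    for _ in range(max_tries):
        if len(negatives) == n:
            break
        dx = rng.uniform(-1.0, 1.0) * w
        dy = rng.uniform(-1.0, 1.0) * h
        candidate = (x1 + dx, y1 + dy, x2 + dx, y2 + dy)
        if iou(candidate, box) < max_iou:
            negatives.append(candidate)
    return negatives
```

In a real detector, features pooled from the jittered boxes would presumably be turned into negative prototypes in the same way as the class prototypes. The sketch stops at box sampling because the abstract does not specify the feature extractor, and it does not clip boxes to the image bounds.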
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications
