Cross-domain Few-shot Object Detection with Multi-modal Textual Enrichment

Zeyu Shangguan; Daniel Seita; Mohammad Rostami

arXiv:2502.16469 · cs.CV · February 25, 2025

TL;DR

This paper introduces a meta-learning framework that leverages rich textual semantics to improve cross-domain few-shot object detection, effectively addressing domain shift issues by integrating visual and linguistic features.

Contribution

The paper proposes a novel multi-modal architecture with feature aggregation and semantic rectification modules for better domain adaptation in few-shot object detection.

Findings

01. Significantly outperforms existing few-shot detection methods on benchmarks.

02. Effective alignment of visual and linguistic features across domains.

03. Enhanced understanding of language improves detection accuracy.

Abstract

Advancements in cross-modal feature extraction and integration have significantly enhanced performance in few-shot learning tasks. However, current multi-modal object detection (MM-OD) methods often experience notable performance degradation when encountering substantial domain shifts. We propose that incorporating rich textual information can enable the model to establish a more robust knowledge relationship between visual instances and their corresponding language descriptions, thereby mitigating the challenges of domain shift. Specifically, we focus on the problem of Cross-Domain Multi-Modal Few-Shot Object Detection (CDMM-FSOD) and introduce a meta-learning-based framework designed to leverage rich textual semantics as an auxiliary modality to achieve effective domain adaptation. Our new architecture incorporates two key components: (i) A multi-modal feature aggregation module,…
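
The abstract is cut off before it describes the multi-modal feature aggregation module, so the following is only a generic sketch of one common way to combine visual and textual features: scaled dot-product cross-attention with a residual add. It is not the authors' module. The function names (`softmax`, `dot`, `aggregate`) and the toy 2-dimensional features are assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def aggregate(visual_feats, text_feats):
    """Hypothetical fusion step (not the paper's design).

    Each visual feature acts as a query over the text features,
    and the attention-weighted text summary is added back to it.
    """
    dim = len(text_feats[0])
    scale = math.sqrt(dim)
    fused = []
    for v in visual_feats:
        # Attention weights of this visual feature over all text features.
        weights = softmax([dot(v, t) / scale for t in text_feats])
        # Weighted sum of text features (the textual context vector).
        context = [
            sum(w * t[i] for w, t in zip(weights, text_feats))
            for i in range(dim)
        ]
        # Residual add keeps the original visual signal.
        fused.append([vi + ci for vi, ci in zip(v, context)])
    return fused


# Toy inputs: two visual features and three text-token embeddings.
visual = [[1.0, 0.0], [0.0, 1.0]]
text = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
fused = aggregate(visual, text)
```

In this toy setup each visual feature is pulled toward the text embeddings it is most similar to, which is the general intuition behind aligning visual instances with their language descriptions.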

Code & Models

Repositories

LONGXUANX/CDFormer_code
pytorch

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Natural Language Processing Techniques

Methods: Focus