Towards Domain-Generalized Open-Vocabulary Object Detection: A Progressive Domain-invariant Cross-modal Alignment Method

Xiaoran Xu; Xiaoshan Yang; Jiangang Yang; Yifan Xu; Jian Liu; Changsheng Xu

arXiv:2603.27556 · cs.CV · March 31, 2026

TL;DR

This paper identifies the vulnerability of open-vocabulary object detection to domain shifts and proposes a progressive alignment method to enhance cross-modal invariance and robustness.

Contribution

The paper introduces PICA (Progressive Domain-invariant Cross-modal Alignment), a curriculum-based training approach that improves domain generalization in open-vocabulary object detection by keeping cross-modal alignment stable under domain shift.

Findings

1. Visual shifts cause a collapse of the cross-modal space in OVOD.

2. PICA improves robustness to domain shifts in open-vocabulary detection.

3. The work offers a new benchmark for evaluating domain generalization in OVOD.

Abstract

Open-Vocabulary Object Detection (OVOD) has achieved remarkable success in generalizing to novel categories. However, this success often rests on the implicit assumption of domain stationarity. In this work, we provide a principled revisit of the OVOD paradigm, uncovering a fundamental vulnerability: the fragile coupling between visual manifolds and textual embeddings when distribution shifts occur. We first systematically formalize Domain-Generalized Open-Vocabulary Object Detection (DG-OVOD). Through empirical analysis, we demonstrate that visual shifts do not merely add noise; they cause a collapse of the latent cross-modal space where novel category visual signals detach from their semantic anchors. Motivated by these insights, we propose Progressive Domain-invariant Cross-modal Alignment (PICA). PICA departs from uniform training by introducing a multi-level ambiguity and…
