TL;DR
This paper introduces CoEvo, a test-time framework for zero-shot OOD detection with vision-language models that dynamically adapts textual and visual proxies to improve robustness under distribution shifts.
Contribution
CoEvo is a novel, training- and annotation-free method that co-evolves cross-modal proxies for better OOD detection in open-world vision-language applications.
Findings
Achieves state-of-the-art AUROC on ImageNet-1K.
Reduces FPR95 significantly compared to baseline methods.
Effectively maintains cross-modal alignment under distribution shifts.
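For readers unfamiliar with the two metrics named above, the following is a minimal sketch of how AUROC and FPR95 are commonly computed from detector scores. It assumes that a higher score means "more in-distribution"; the function names `auroc` and `fpr_at_tpr` and the toy scores are illustrative and are not taken from the paper.

```python
import math


def auroc(id_scores, ood_scores):
    """Probability that a random ID sample outscores a random OOD sample.

    Ties count as half a win. Assumes higher score = more in-distribution.
    """
    wins = 0.0
    for s_id in id_scores:
        for s_ood in ood_scores:
            if s_id > s_ood:
                wins += 1.0
            elif s_id == s_ood:
                wins += 0.5
    return wins / (len(id_scores) * len(ood_scores))


def fpr_at_tpr(id_scores, ood_scores, tpr=0.95):
    """Fraction of OOD samples accepted as ID at the threshold that
    keeps at least `tpr` of the ID samples (FPR95 when tpr=0.95)."""
    ranked = sorted(id_scores, reverse=True)
    # Number of ID samples that must score at or above the threshold.
    k = math.ceil(tpr * len(ranked))
    threshold = ranked[k - 1]
    false_positives = sum(1 for s in ood_scores if s >= threshold)
    return false_positives / len(ood_scores)


# Toy scores, invented for illustration only.
id_scores = [0.9, 0.8, 0.7, 0.6]
ood_scores = [0.65, 0.2, 0.1, 0.05]
print(auroc(id_scores, ood_scores))       # 0.9375
print(fpr_at_tpr(id_scores, ood_scores))  # 0.25
```

A higher AUROC and a lower FPR95 both indicate better separation between ID and OOD inputs.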
Abstract
Reliable zero-shot detection of out-of-distribution (OOD) inputs is critical for deploying vision-language models in open-world settings. However, the lack of labeled negatives in zero-shot OOD detection necessitates proxy signals that remain effective under distribution shift. Existing negative-label methods rely on a fixed set of textual proxies, which (i) sparsely sample the semantic space beyond in-distribution (ID) classes and (ii) remain static while only visual features drift, leading to cross-modal misalignment and unstable predictions. In this paper, we propose CoEvo, a training- and annotation-free test-time framework that performs bidirectional, sample-conditioned adaptation of both textual and visual proxies. Specifically, CoEvo introduces a proxy-aligned co-evolution mechanism to maintain two evolving proxy caches, which dynamically mines contextual textual negatives guided…
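To make the "fixed set of textual proxies" baseline concrete, here is a minimal sketch of one common formulation of negative-label scoring: the image embedding is compared against both ID class text embeddings and negative text embeddings, and the score is the softmax mass that falls on the ID labels. This illustrates the static baseline the abstract criticizes, not CoEvo itself. The function names, the temperature value, and the two-dimensional toy embeddings are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def negative_label_score(image_emb, id_text_embs, neg_text_embs, temperature=0.01):
    """Softmax mass assigned to ID labels over ID plus negative labels.

    Returns a value in [0, 1]; higher means "more in-distribution".
    The temperature is a hypothetical default, not a value from the paper.
    """
    id_logits = [cosine(image_emb, t) / temperature for t in id_text_embs]
    neg_logits = [cosine(image_emb, t) / temperature for t in neg_text_embs]
    # Subtract the max logit before exponentiating for numerical stability.
    m = max(id_logits + neg_logits)
    id_mass = sum(math.exp(x - m) for x in id_logits)
    neg_mass = sum(math.exp(x - m) for x in neg_logits)
    return id_mass / (id_mass + neg_mass)


# Toy embeddings, invented for illustration only.
id_text_embs = [[1.0, 0.0], [0.0, 1.0]]   # two ID class prompts
neg_text_embs = [[-1.0, 0.0]]             # one fixed negative label

score_id = negative_label_score([1.0, 0.0], id_text_embs, neg_text_embs)
score_ood = negative_label_score([-1.0, 0.0], id_text_embs, neg_text_embs)
print(score_id > score_ood)  # True
```

Because `neg_text_embs` never changes in this sketch, an input whose visual features drift away from every listed negative would no longer be penalized, which is the kind of misalignment the abstract attributes to static proxies.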