Cross-modal Proxy Evolving for OOD Detection with Vision-Language Models

Hao Tang; Yu Liu; Shuanglin Yan; Fei Shen; Shengfeng He; Jing Qin

arXiv:2601.08476 · cs.CV · April 2, 2026

TL;DR

This paper introduces CoEvo, a test-time framework for zero-shot OOD detection with vision-language models that dynamically adapts textual and visual proxies to improve robustness under distribution shifts.

Contribution

CoEvo is a novel, training- and annotation-free method that co-evolves cross-modal proxies for better OOD detection in open-world vision-language applications.

Findings

01 Reaches state-of-the-art AUROC on ImageNet-1K.

02 Significantly reduces FPR95 relative to baseline methods.

03 Maintains cross-modal alignment under distribution shift.
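
For readers unfamiliar with the two metrics named in the findings, the following is a minimal sketch of how AUROC and FPR95 are commonly computed from detector scores. It assumes that higher scores mean "more in-distribution"; the function names and the toy scores are illustrative and do not come from the paper.

```python
import numpy as np


def auroc(id_scores, ood_scores):
    """Probability that a random ID sample outscores a random OOD sample.

    Ties count as 0.5. Assumes higher score = more in-distribution.
    """
    id_scores = np.asarray(id_scores, dtype=float)
    ood_scores = np.asarray(ood_scores, dtype=float)
    # Pairwise comparison of every ID score against every OOD score.
    diff = id_scores[:, None] - ood_scores[None, :]
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())


def fpr_at_95_tpr(id_scores, ood_scores):
    """Fraction of OOD samples accepted as ID when 95% of ID samples are accepted."""
    id_scores = np.asarray(id_scores, dtype=float)
    ood_scores = np.asarray(ood_scores, dtype=float)
    # Threshold chosen so that 95% of ID scores lie at or above it.
    threshold = np.percentile(id_scores, 5)
    return float((ood_scores >= threshold).mean())


# Toy scores (made up for illustration only).
toy_id = [0.9, 0.8, 0.7, 0.6]
toy_ood = [0.1, 0.2, 0.3, 0.65]
print("AUROC:", auroc(toy_id, toy_ood))
print("FPR95:", fpr_at_95_tpr(toy_id, toy_ood))
```

A higher AUROC and a lower FPR95 both indicate better separation between ID and OOD inputs, which is the direction of improvement the findings describe.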

Abstract

Reliable zero-shot detection of out-of-distribution (OOD) inputs is critical for deploying vision-language models in open-world settings. However, the lack of labeled negatives in zero-shot OOD detection necessitates proxy signals that remain effective under distribution shift. Existing negative-label methods rely on a fixed set of textual proxies, which (i) sparsely sample the semantic space beyond in-distribution (ID) classes and (ii) remain static while only visual features drift, leading to cross-modal misalignment and unstable predictions. In this paper, we propose CoEvo, a training- and annotation-free test-time framework that performs bidirectional, sample-conditioned adaptation of both textual and visual proxies. Specifically, CoEvo introduces a proxy-aligned co-evolution mechanism to maintain two evolving proxy caches, which dynamically mines contextual textual negatives guided…
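
To make the fixed-proxy baseline that the abstract critiques more concrete, here is a hedged sketch of a negative-label style OOD score: the share of softmax mass that an image places on ID class prompts versus a static pool of negative text proxies. This is not CoEvo itself, whose cache-update rules are not given in the excerpt above. The function names, the temperature value, and the 3-dimensional toy vectors (standing in for vision-language embeddings) are all assumptions made for illustration.

```python
import numpy as np


def l2_normalize(x):
    """Scale vectors to unit length along the last axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def negative_label_score(image_feat, id_text_feats, neg_text_feats, temperature=0.01):
    """Softmax mass assigned to ID text proxies; higher means more in-distribution.

    image_feat:      (d,) image embedding
    id_text_feats:   (K, d) embeddings of ID class prompts
    neg_text_feats:  (M, d) embeddings of fixed negative text proxies
    temperature:     assumed value, chosen only for this sketch
    """
    img = l2_normalize(np.asarray(image_feat, dtype=float))
    id_t = l2_normalize(np.asarray(id_text_feats, dtype=float))
    neg_t = l2_normalize(np.asarray(neg_text_feats, dtype=float))
    # Cosine similarities to all proxies, scaled by the temperature.
    logits = np.concatenate([id_t @ img, neg_t @ img]) / temperature
    logits = logits - logits.max()  # subtract the max for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return float(probs[: len(id_t)].sum())


# Toy embeddings: two ID classes and one negative proxy.
id_text = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
neg_text = np.array([[0.0, 0.0, 1.0]])

id_like = negative_label_score([1.0, 0.1, 0.0], id_text, neg_text)
ood_like = negative_label_score([0.0, 0.1, 1.0], id_text, neg_text)
print(id_like, ood_like)
```

Because `neg_text` never changes in this sketch, an image whose features drift away from every stored proxy is scored against stale text anchors, which is the cross-modal misalignment the abstract attributes to static negative-label methods.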
