Mind the Way You Select Negative Texts: Pursuing the Distance Consistency in OOD Detection with VLMs

Zhikang Xu; Qianqian Xu; Zitai Wang; Cong Hua; Sicong Li; Zhiyong Yang; Qingming Huang

arXiv:2603.02618·cs.CV·April 21, 2026

TL;DR

This paper introduces InterNeg, a framework that enhances OOD detection with VLMs by enforcing inter-modal distance consistency through negative text selection and image inversion, achieving state-of-the-art results.

Contribution

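Where prior methods rely on intra-modal distances (negative texts compared with ID labels, or test images compared with image proxies), InterNeg systematically uses the inter-modal distance that CLIP-like VLMs are optimized for, from both the textual and the visual perspective.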

Findings

01. Achieves 3.47% lower FPR95 on ImageNet

02. Improves AUROC by 5.50% on Near-OOD benchmark

03. Outperforms existing methods across multiple benchmarks
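
For readers unfamiliar with the two metrics quoted above, here is a minimal sketch of how FPR95 and AUROC are commonly computed from per-sample OOD scores. It assumes the convention that a higher score means "more ID-like"; the function names and the toy scores are illustrative and are not taken from the paper.

```python
import numpy as np

def fpr_at_95_tpr(id_scores, ood_scores):
    """Fraction of OOD samples accepted as ID at the threshold
    that keeps roughly 95% of ID samples (higher score = more ID-like)."""
    id_scores = np.asarray(id_scores, dtype=float)
    ood_scores = np.asarray(ood_scores, dtype=float)
    # The 5th percentile of ID scores leaves about 95% of ID samples above it.
    threshold = np.percentile(id_scores, 5)
    return float(np.mean(ood_scores >= threshold))

def auroc(id_scores, ood_scores):
    """Probability that a random ID sample outscores a random OOD sample
    (ties count as one half)."""
    id_col = np.asarray(id_scores, dtype=float)[:, None]
    ood_row = np.asarray(ood_scores, dtype=float)[None, :]
    wins = (id_col > ood_row).mean()
    ties = (id_col == ood_row).mean()
    return float(wins + 0.5 * ties)

# Toy scores, invented purely for illustration.
toy_id = [0.9, 0.8, 0.7, 0.6]
toy_ood = [0.1, 0.2, 0.3, 0.65]
toy_fpr95 = fpr_at_95_tpr(toy_id, toy_ood)
toy_auroc = auroc(toy_id, toy_ood)
```

Lower FPR95 and higher AUROC both indicate better separation of ID and OOD samples, which is the direction of the improvements reported in the findings.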

Abstract

Out-of-distribution (OOD) detection seeks to identify samples from unknown classes, a critical capability for deploying machine learning models in open-world scenarios. Recent research has demonstrated that Vision-Language Models (VLMs) can effectively leverage their multi-modal representations for OOD detection. However, current methods often incorporate intra-modal distance during OOD detection, such as comparing negative texts with ID labels or comparing test images with image proxies. This design paradigm is inherently inconsistent with the inter-modal distance that CLIP-like VLMs are optimized for, potentially leading to suboptimal performance. To address this limitation, we propose InterNeg, a simple yet effective framework that systematically utilizes consistent inter-modal distance enhancement from textual and visual perspectives. From the textual perspective, we…
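
The distinction the abstract draws between intra-modal and inter-modal distance can be made concrete with a small sketch. This illustrates only the two kinds of comparison; it is not InterNeg's selection procedure, which the truncated abstract does not specify. Random vectors stand in for CLIP-style embeddings, and all variable names are hypothetical.

```python
import numpy as np

def l2_normalize(x):
    """Scale each row to unit length so dot products equal cosine similarity."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def cosine_similarity(a, b):
    """Pairwise cosine similarity between rows of a and rows of b."""
    return l2_normalize(a) @ l2_normalize(b).T

# Assumption: random stand-ins for embeddings in a shared image-text space.
rng = np.random.default_rng(0)
dim = 8
id_label_emb = rng.normal(size=(3, dim))  # text embeddings of ID class labels
neg_text_emb = rng.normal(size=(5, dim))  # text embeddings of candidate negative texts
image_emb = rng.normal(size=(4, dim))     # image embeddings of test images

# Intra-modal: text compared with text (negative texts vs. ID labels).
intra_modal = cosine_similarity(neg_text_emb, id_label_emb)  # shape (5, 3)

# Inter-modal: image compared with text, the pairing that contrastive
# image-text training of CLIP-like models optimizes directly.
inter_modal = cosine_similarity(image_emb, neg_text_emb)     # shape (4, 5)
```

Both matrices hold cosine similarities, but only the second compares across modalities, which is the kind of distance the abstract argues OOD detection should rely on.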
