Vision Also You Need: Navigating Out-of-Distribution Detection with Multimodal Large Language Model

Haoran Xu; Yanlin Liu; Zizhao Tong; Jiaze Li; Kexue Fu; Yuyang Zhang; Longxiang Gao; Shuaiguang Li; Xingyu Li; Yanran Xu; Changwei Wang

arXiv:2601.14052 · cs.CV · January 21, 2026

Open Access

TL;DR

This paper introduces MM-OOD, a pipeline that uses the multimodal reasoning of multimodal large language models (MLLMs) to improve zero-shot out-of-distribution (OOD) detection for images. It targets a weakness of prior LLM-based methods, which rely heavily on text-space knowledge, and reports results on Food-101 and ImageNet-1K.

Contribution

The paper proposes MM-OOD, a novel multimodal pipeline that leverages multimodal large language models (MLLMs) to improve both near and far OOD detection through multi-round reasoning and a sketch-generate-elaborate framework.

Findings

01. Significant performance improvements on Food-101 dataset

02. Validated scalability on ImageNet-1K

03. Effective detection in both near and far OOD scenarios

Abstract

Out-of-Distribution (OOD) detection is a critical task that has garnered significant attention. The emergence of CLIP has spurred extensive research into zero-shot OOD detection, often employing a training-free approach. Current methods leverage expert knowledge from large language models (LLMs) to identify potential outliers. However, these approaches tend to over-rely on knowledge in the text space, neglecting the inherent challenges involved in detecting out-of-distribution samples in the image space. In this paper, we propose a novel pipeline, MM-OOD, which leverages the multimodal reasoning capabilities of MLLMs and their ability to conduct multi-round conversations for enhanced outlier detection. Our method is designed to improve performance in both near OOD and far OOD tasks. Specifically, (1) for near OOD tasks, we directly feed ID images and corresponding text prompts into…
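As background for the zero-shot, training-free setting the abstract refers to, the following is a minimal sketch of how a CLIP-style detector can score an image against in-distribution (ID) labels plus LLM-proposed outlier labels. It is a generic illustration, not MM-OOD's own scoring rule, which the truncated abstract does not specify. All names (`id_score`, `id_text_embs`, `outlier_text_embs`, `temperature`) and the toy 3-dimensional embeddings are assumptions; in practice the vectors would come from an image encoder and a text encoder.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def id_score(image_emb, id_text_embs, outlier_text_embs, temperature=0.01):
    """Probability mass assigned to ID labels; higher means more likely ID.

    Hypothetical helper: softmax over similarities to ID labels and
    candidate outlier labels, then sum the ID part.
    """
    sims = [cosine(image_emb, t) for t in id_text_embs + outlier_text_embs]
    probs = softmax([s / temperature for s in sims])
    return sum(probs[: len(id_text_embs)])


# Toy embeddings (assumed, for illustration only).
id_text_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # two ID class labels
outlier_text_embs = [[0.0, 0.0, 1.0]]               # one proposed outlier label
id_like_image = [0.9, 0.1, 0.0]
ood_like_image = [0.1, 0.0, 0.9]

score_id = id_score(id_like_image, id_text_embs, outlier_text_embs)
score_ood = id_score(ood_like_image, id_text_embs, outlier_text_embs)

# An image is flagged as OOD when its ID score falls below a chosen threshold.
is_ood = score_ood < 0.5
```

The threshold of 0.5 is arbitrary here. The quality of such a detector depends on how well the candidate outlier labels cover the images actually encountered, which is the gap the abstract attributes to text-only approaches.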

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis