OODBench: Out-of-Distribution Benchmark for Large Vision-Language Models

Ling Lin; Yang Bai; Heng Su; Congcong Zhu; Yaoxing Wang; Yang Zhou; Huazhu Fu; Jingrun Chen

arXiv:2602.18094·cs.CV·February 23, 2026

Open Access · 3 Reviews

TL;DR

This paper introduces OODBench, a comprehensive benchmark for evaluating large vision-language models' robustness to out-of-distribution data, revealing significant performance drops and proposing new assessment metrics.

Contribution

The paper presents OODBench, a new automated benchmark with 40K OOD instances, and a novel evaluation metric to better assess VLMs' OOD robustness.

Findings

01

Current VLMs show notable performance degradation on OOD data.

02

The proposed metric effectively measures OOD impact across question difficulties.

03

OODBench provides a scalable, automated way to evaluate OOD robustness in VLMs.

Abstract

Existing Visual-Language Models (VLMs) have achieved significant progress by being trained on massive-scale datasets, typically under the assumption that data are independent and identically distributed (IID). However, in real-world scenarios, it is often impractical to expect that all data processed by an AI system satisfy this assumption. Furthermore, failure to appropriately handle out-of-distribution (OOD) objects may introduce safety risks in real-world applications (e.g., autonomous driving or medical assistance). Unfortunately, current research has not yet provided valid benchmarks that can comprehensively assess the performance of VLMs in response to OOD data. Therefore, we propose OODBench, a predominantly automated method with minimal human verification, for constructing new benchmarks and evaluating the ability of VLMs to process OOD data. OODBench contains 40K instance-level…

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 3

Strengths

1. The pipeline is clear and easy to understand.
2. The work argues that VLMs have already seen normal patterns and varied instances, and that a new description is therefore needed for VLMs, which is reasonable.
3. The work proposes a new framework for generating data.
4. The poor performance of various VLMs shows how hard OODBench is. The Basic-to-Advanced Progression is reasonable for the current purposes of LLM-based VLMs.

Weaknesses

1. The core idea of an OOD benchmark is to evaluate whether a VLM can understand images inside or outside its training distribution. This work instead aims to find images that make VLMs predict wrongly, which is not reasonable.
2. The work proposes the BAP evaluation metric, but it does not appear in the first table.
3. The OODBench construction pipeline is meant to reduce human cost, yet human labor is still involved, which is not reasonable.
4. For the pipeline in OOD collection, it finds the ob

Reviewer 02 · Rating 4 · Confidence 4

Strengths

1. Timely problem with safety relevance: focuses on covariate shifts common in real-world deployment (especially driving).
2. Mostly automated, reproducible pipeline with reasonable cross-detector validation and public release.
3. BAP design offers layered diagnostics (recognition → counting → logic).
4. Broad, careful experimentation and ablations (detector types/number, threshold T); analysis distinguishing OOD vs. hard samples/hallucination is thoughtful.
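The cross-detector validation and threshold T mentioned above can be pictured with a small sketch. This is not the authors' pipeline: it assumes, purely for illustration, that an instance is flagged as an OOD candidate when every detector in a panel either predicts the wrong label or scores the correct label below T. The names `Prediction` and `is_flagged_ood` are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Prediction:
    """One detector's output for a single object instance (hypothetical type)."""
    label: str
    confidence: float


def is_flagged_ood(true_label: str, predictions: List[Prediction], threshold: float) -> bool:
    """Return True when no detector is both correct and confident.

    Assumed rule, not taken from the paper: a detector "fails" on an
    instance if it predicts the wrong label or its confidence is below
    the threshold T; the instance is flagged only if all detectors fail.
    """
    if not predictions:
        raise ValueError("need at least one detector prediction")
    return all(
        p.label != true_label or p.confidence < threshold
        for p in predictions
    )


# One detector is correct and confident, so the instance is not flagged.
preds_easy = [Prediction("car", 0.95), Prediction("car", 0.40)]
# One detector is wrong and the other is under-confident, so it is flagged.
preds_hard = [Prediction("truck", 0.80), Prediction("car", 0.30)]
```

Requiring every detector to fail is only one possible aggregation choice; a majority vote would flag more instances, and the ablations on detector number and T cited by the reviewer concern exactly this kind of sensitivity.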

Weaknesses

1. Conceptual clarity: The paper claims a focus on covariate shift but operationally defines OOD as “misclassified/low-confidence” by generalized detectors. Are misclassified samples assumed equivalent to covariate-shifted samples? Misclassification can stem from ID hard cases, label noise, ambiguity, or prompt wording. Please clarify the formal relationship and provide quantitative checks (e.g., factors like scale/occlusion/illumination) to support covariate-shift attribution.
2. Task-level dis

Reviewer 03 · Rating 4 · Confidence 4

Strengths

- This study explores an interesting question on OOD definition for VLMs.
- They introduce a new benchmark and show insights into current VLMs.

Weaknesses

### Major
- Definition of OOD data for VLMs. The authors define two categories: (1) Objects in images that are neither main objects nor semantically related to the main semantic object; (2) Variants or anomalous forms of target objects.
- However, it is difficult to assert that these data types are truly “OOD” for current VLMs. These VLMs are highly likely trained on datasets that already contain such cases to improve their robustness.
- Second, the definition of OOD is somehow ambiguou

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications