Expecting The Unexpected: Towards Broad Out-Of-Distribution Detection
Charles Guille-Escuret, Pierre-André Noël, Ioannis Mitliagkas, David Vazquez, Joao Monteiro

TL;DR
This paper evaluates the performance of existing out-of-distribution detection methods across five types of distribution shifts, introduces a comprehensive benchmark called BROAD, and proposes a generative ensemble approach for more reliable broad OOD detection.
Contribution
It categorizes diverse distribution shifts, benchmarks existing methods on them, and introduces a generative ensemble approach to improve broad OOD detection.
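As a rough illustration of benchmarking a detector separately on each shift type, the sketch below computes one AUROC per shift from raw detector scores. It assumes that higher scores mean "more likely OOD" and that AUROC is the metric of interest. The helper names (`auroc`, `evaluate_per_shift`) and the shift labels are hypothetical and do not come from the BROAD code, and the scores are synthetic.

```python
import numpy as np


def auroc(scores_in, scores_out):
    """Probability that a random OOD score exceeds a random in-distribution
    score, counting ties as one half (assumes higher score = more anomalous)."""
    s_in = np.asarray(scores_in, dtype=float)[:, None]
    s_out = np.asarray(scores_out, dtype=float)[None, :]
    greater = (s_out > s_in).mean()
    ties = (s_out == s_in).mean()
    return float(greater + 0.5 * ties)


def evaluate_per_shift(scores_in, scores_by_shift):
    """Return {shift_name: AUROC} for one detector.

    scores_by_shift maps a (hypothetical) shift label to that detector's
    scores on the shifted samples.
    """
    return {name: auroc(scores_in, s) for name, s in scores_by_shift.items()}


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic toy scores, not results from the paper.
    in_dist = rng.normal(0.0, 1.0, size=1000)
    shifted = {
        "shift_a": rng.normal(3.0, 1.0, size=1000),  # well separated
        "shift_b": rng.normal(0.2, 1.0, size=1000),  # barely separated
    }
    per_shift = evaluate_per_shift(in_dist, shifted)
    for name, value in per_shift.items():
        print(f"{name}: AUROC = {value:.3f}")
    # The worst case across shifts is one simple way to summarize consistency.
    print(f"worst case: {min(per_shift.values()):.3f}")
```

Reporting the minimum across shifts alongside the per-shift values is one possible design choice for exposing a detector that does well on some shifts and poorly on others.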
Findings
Existing methods excel at detecting unknown classes but struggle with other distribution shifts.
The BROAD benchmark reveals inconsistent performance of current methods across different shifts.
The proposed Gaussian mixture ensemble improves detection consistency and robustness.
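A minimal sketch of fitting a Gaussian mixture over the scores of several detectors follows. It is not the authors' implementation: the number of components, the plain EM loop, the use of the negative log-likelihood as the combined OOD score, and all function names are assumptions made for illustration, and the data are synthetic.

```python
import numpy as np


def _logsumexp(a, axis):
    m = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(a - m), axis=axis))


def _log_gaussian(x, mean, cov):
    """Log density of N(mean, cov) evaluated at each row of x."""
    d = x.shape[1]
    chol = np.linalg.cholesky(cov)
    z = np.linalg.solve(chol, (x - mean).T)  # whitened residuals, shape (d, n)
    maha = np.sum(z ** 2, axis=0)
    log_det = 2.0 * np.sum(np.log(np.diag(chol)))
    return -0.5 * (d * np.log(2.0 * np.pi) + log_det + maha)


def _component_log_probs(x, weights, means, covs):
    return np.stack(
        [np.log(w) + _log_gaussian(x, m, c) for w, m, c in zip(weights, means, covs)],
        axis=1,
    )


def fit_gmm(x, n_components=2, n_iter=50, reg=1e-6, seed=0):
    """Fit a full-covariance GMM with EM on in-distribution score vectors.

    x has shape (n_samples, n_detectors). All hyperparameters are assumptions.
    """
    rng = np.random.default_rng(seed)
    n, d = x.shape
    means = x[rng.choice(n, size=n_components, replace=False)]
    covs = np.stack([np.cov(x, rowvar=False) + reg * np.eye(d)] * n_components)
    weights = np.full(n_components, 1.0 / n_components)
    for _ in range(n_iter):
        # E-step: responsibilities of each component for each sample.
        log_p = _component_log_probs(x, weights, means, covs)
        resp = np.exp(log_p - _logsumexp(log_p, axis=1)[:, None])
        # M-step: update weights, means, and regularized covariances.
        nk = resp.sum(axis=0) + 1e-12
        weights = nk / n
        means = (resp.T @ x) / nk[:, None]
        for k in range(n_components):
            diff = x - means[k]
            covs[k] = (resp[:, k, None] * diff).T @ diff / nk[k] + reg * np.eye(d)
    return weights, means, covs


def gmm_ood_score(x, weights, means, covs):
    """Negative log-likelihood under the mixture (higher = more anomalous)."""
    return -_logsumexp(_component_log_probs(x, weights, means, covs), axis=1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic stand-in for two detectors whose scores are strongly correlated
    # on in-distribution data.
    cov_in = [[1.0, 0.9], [0.9, 1.0]]
    train_scores = rng.multivariate_normal([0.0, 0.0], cov_in, size=500)
    params = fit_gmm(train_scores)
    # Test points with the same marginals but the opposite correlation.
    odd = rng.multivariate_normal([0.0, 0.0], [[1.0, -0.9], [-0.9, 1.0]], size=200)
    typical = rng.multivariate_normal([0.0, 0.0], cov_in, size=200)
    print("mean score, typical:", gmm_ood_score(typical, *params).mean())
    print("mean score, odd:    ", gmm_ood_score(odd, *params).mean())
```

In practice the per-detector scores usually live on very different scales, so standardizing each score column with statistics from held-out in-distribution data before fitting is a reasonable (assumed) preprocessing step.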
Abstract
Improving the reliability of deployed machine learning systems often involves developing methods to detect out-of-distribution (OOD) inputs. However, existing research often narrowly focuses on samples from classes that are absent from the training set, neglecting other types of plausible distribution shifts. This limitation reduces the applicability of these methods in real-world scenarios, where systems encounter a wide variety of anomalous inputs. In this study, we categorize five distinct types of distribution shifts and critically evaluate the performance of recent OOD detection methods on each of them. We publicly release our benchmark under the name BROAD (Benchmarking Resilience Over Anomaly Diversity). Our findings reveal that while these methods excel in detecting unknown classes, their performance is inconsistent when encountering other types of distribution shifts. In other…
Peer Reviews
Decision · Submitted to ICLR 2024
The paper offers a comprehensive view of why image recognition models may fail on images that were not seen during training and, through a comprehensive set of tests, demonstrates the relative value of existing detection methods when applied to the tasks they were intended for, and to other related OOD tasks that the existing methods do not explicitly target. Of note is the interesting comment on how to build OOD detectors with generative models, by use of a function they designate as "h
There isn't sufficient detail in the paper to reconstruct the Gaussian mixture models (GMMs) proposed by the authors. GMMs are conventionally used to estimate density functions for oddly shaped distributions, e.g. with multiple modes. It is intuitive, and in fact not unexpected, that an ensemble of detectors performs better on average than any individual detector, so the novelty of this finding is limited. However, the results from the paper are not reproducible from the paper's con
- The overall organization of the paper is clear and easy to follow.
- The proposed ensembling method is straightforward and demonstrates good performance.
- The major weakness of the paper is that most OOD detection scores considered in the paper are proposed to only handle novel classes. Expecting such OOD scores to detect adversarial perturbations and corruptions may be **out-of-scope** and unrealistic. In particular, recent work [1] has demonstrated that when OOD samples are not involved during training (the setting considered in this work), it can be **theoretically impossible** to expect common OOD detection methods to work. Despite that det
1). The assessment of broader OOD detection capabilities is interesting and probably important for future OOD detection method development. 2). Extensive experiments have been done to benchmark the recent OOD detection methods. 3). Overall, the paper is clear and well-written.
My concerns are mainly about the proposed method. 1). The ensemble of OOD detection methods seems ad-hoc for this benchmark by evaluating and picking some of the methods that perform relatively well on the benchmark. 2). The proposed method of fitting GMM over scores from different OOD detection methods does not make sense to me. For example, in Sec.3, it says "this approach is adept at identifying atypical realizations of the underlying scores, even in situations where the marginal likelihood
Taxonomy
Topics: Anomaly Detection Techniques and Applications · Data Stream Mining Techniques · Machine Learning and Data Classification
