Out-of-Distribution Detection using Synthetic Data Generation
Momin Abbas, Muneeza Azmat, Raya Horesh, Mikhail Yurochkin

TL;DR
This paper introduces a novel method using Large Language Models to generate synthetic out-of-distribution data, significantly improving OOD detection accuracy without relying on external OOD datasets.
Contribution
The work presents a new approach leveraging LLMs for synthetic OOD data generation, enabling better OOD detection in text classification tasks without external data.
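One plausible way such synthetic proxies could feed a text classifier is sketched below. This summary does not say how the paper consumes the generated data, so everything here is an assumption for illustration: the prompt wording, the `generate` callable (a stand-in for any LLM client), the function name `build_training_set`, and the choice to model OOD as an extra class with index K.

```python
from typing import Callable, List, Tuple

# Hypothetical prompt wording; the paper's actual prompts are not given here.
PROMPT_TEMPLATE = (
    "Write {n} short texts that do NOT belong to this task: {task}. "
    "Return one text per line."
)


def build_training_set(
    ind_examples: List[Tuple[str, int]],
    num_classes: int,
    task_description: str,
    generate: Callable[[str], str],
    n_synthetic: int,
) -> List[Tuple[str, int]]:
    """Append LLM-generated OOD proxies to the InD data under an extra label."""
    prompt = PROMPT_TEMPLATE.format(n=n_synthetic, task=task_description)
    # Keep non-empty lines only, and at most n_synthetic of them.
    lines = [ln.strip() for ln in generate(prompt).splitlines()]
    proxies = [ln for ln in lines if ln][:n_synthetic]
    ood_label = num_classes  # assumption: OOD is modeled as class index K
    return list(ind_examples) + [(text, ood_label) for text in proxies]
```

Passing `generate` as a callable keeps the sketch independent of any particular LLM API; no external OOD dataset is involved, only the task description and the in-distribution examples.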
Findings
Achieves zero false positives in some cases
Outperforms baseline methods significantly
Maintains high in-distribution accuracy
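For readers unfamiliar with how a false positive rate is usually reported in OOD detection, here is a minimal sketch. It assumes the common convention that in-distribution (InD) is the positive class, that higher scores mean "more in-distribution", and that the threshold is chosen to retain a fixed share of InD inputs (95% by default). The helper names `msp_score` and `fpr_at_tpr` are illustrative, and the summary does not state which operating point the paper uses.

```python
import math
from typing import Sequence


def msp_score(logits: Sequence[float]) -> float:
    """Maximum softmax probability, a standard confidence score for one input."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]  # shift by max for numerical stability
    return max(exps) / sum(exps)


def fpr_at_tpr(
    ind_scores: Sequence[float], ood_scores: Sequence[float], tpr: float = 0.95
) -> float:
    """Share of OOD inputs accepted as InD at the threshold that keeps `tpr` of InD inputs."""
    if not ind_scores or not ood_scores:
        raise ValueError("both score lists must be non-empty")
    ranked = sorted(ind_scores, reverse=True)
    # Number of InD inputs that must score at or above the threshold.
    k = max(1, math.ceil(tpr * len(ranked)))
    threshold = ranked[k - 1]
    accepted_ood = sum(1 for s in ood_scores if s >= threshold)
    return accepted_ood / len(ood_scores)
```

Under this convention, a result of 0.0 means that no OOD input scores at or above the chosen threshold, which is what "zero false positives" refers to.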
Abstract
Distinguishing in- and out-of-distribution (OOD) inputs is crucial for reliable deployment of classification systems. However, OOD data is typically unavailable or difficult to collect, posing a significant challenge for accurate OOD detection. In this work, we present a method that harnesses the generative capabilities of Large Language Models (LLMs) to create high-quality synthetic OOD proxies, eliminating the dependency on any external OOD data source. We study the efficacy of our method on classical text classification tasks such as toxicity detection and sentiment classification as well as classification tasks arising in LLM development and deployment, such as training a reward model for RLHF and detecting misaligned generations. Extensive experiments on nine InD-OOD dataset pairs and various model sizes show that our approach dramatically lowers false positive rates (achieving a…
Peer Reviews
Decision: Submitted to ICLR 2025
- Experiments are pretty comprehensive.
- Empirical results are generally positive.
I don't find this paper particularly well-recognized, and I had a hard time finding some relevant experiment details. Can I clarify: 1. Are all the baseline methods (MSP, Energy, DICE) trained with the original real data? And is the synthetic data the same size as the original OOD data? 2. You are using a 70B model to generate the synthetic data but using 13B or 7B models for the OOD detection task. In a way, is this distillation? Have you analyzed the impact of the size of the synthetic data g
1. This paper is really easy to follow, with proper figures and content analysis. Also, the proposed methods are very simple and, as far as I know, new. 2. The selected datasets and metrics seem appropriate to me.
In the last paragraph of Section 4, the authors state that "our synthetic data is nearly as effective as real OOD data, and possibly more diverse, in representing OOD samples," while showing only visualization figures. I believe more statistical analysis is needed to support this claim.
The authors have performed extensive experimental testing, which could be useful if the results were presented more clearly.
Overly Simplistic Definition of OOD: The OOD detection tasks considered in this study, even those termed “near-OOD,” seem simplistic and easy to solve. For instance, distinguishing between tasks like CC and GSM8k does not appear to be particularly challenging. It is not clear what the real-world relevance or difficulty of these OOD tasks is. It would be very surprising if no baseline method performed well on these tasks. Additionally, for the few baselines reported in Table 1, there seems to
Taxonomy
Topics: Anomaly Detection Techniques and Applications
