Out-of-Distribution Detection using Synthetic Data Generation

Momin Abbas; Muneeza Azmat; Raya Horesh; Mikhail Yurochkin

arXiv:2502.03323·cs.CL·October 3, 2025

TL;DR

This paper introduces a novel method using Large Language Models to generate synthetic out-of-distribution data, significantly improving OOD detection accuracy without relying on external OOD datasets.

Contribution

The work presents a new approach leveraging LLMs for synthetic OOD data generation, enabling better OOD detection in text classification tasks without external data.

Findings

01

Achieves zero false positives in some cases

02

Outperforms baseline methods significantly

03

Maintains high in-distribution accuracy

Abstract

Distinguishing in- and out-of-distribution (OOD) inputs is crucial for reliable deployment of classification systems. However, OOD data is typically unavailable or difficult to collect, posing a significant challenge for accurate OOD detection. In this work, we present a method that harnesses the generative capabilities of Large Language Models (LLMs) to create high-quality synthetic OOD proxies, eliminating the dependency on any external OOD data source. We study the efficacy of our method on classical text classification tasks such as toxicity detection and sentiment classification as well as classification tasks arising in LLM development and deployment, such as training a reward model for RLHF and detecting misaligned generations. Extensive experiments on nine InD-OOD dataset pairs and various model sizes show that our approach dramatically lowers false positive rates (achieving a…
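The abstract reports results as false positive rates, and the reviews below name MSP and Energy as baselines. As a rough illustration of how such score-based detectors are commonly evaluated, here is a minimal sketch in plain Python. It is not the authors' code: the function names, the toy logits, and the choice of a 95% true positive rate operating point are assumptions made for illustration, and the paper's exact metric definition is not shown in this excerpt.

```python
import math


def msp_score(logits):
    """Maximum softmax probability; higher suggests in-distribution (InD)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]  # shift by max for stability
    return max(exps) / sum(exps)


def energy_score(logits, temperature=1.0):
    """T * logsumexp(logits / T), the negative free energy; higher suggests InD."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    return temperature * (m + math.log(sum(math.exp(z - m) for z in scaled)))


def fpr_at_tpr(ind_scores, ood_scores, tpr=0.95):
    """Fraction of OOD inputs accepted as InD when `tpr` of InD inputs are accepted."""
    ranked = sorted(ind_scores, reverse=True)
    # Number of InD inputs that must be accepted (rounded to dodge float noise).
    k = max(1, math.ceil(round(tpr * len(ranked), 9)))
    threshold = ranked[k - 1]  # lowest score among the k accepted InD inputs
    accepted = sum(1 for s in ood_scores if s >= threshold)
    return accepted / len(ood_scores)


# Toy logits (hypothetical): confident InD predictions, flatter OOD predictions.
ind_logits = [[6.0, 0.5, 0.1], [5.0, 1.0, 0.2], [4.5, 0.3, 0.4], [7.0, 0.2, 0.1]]
ood_logits = [[1.0, 0.9, 1.1], [0.8, 1.2, 1.0]]

fpr_msp = fpr_at_tpr([msp_score(z) for z in ind_logits],
                     [msp_score(z) for z in ood_logits])
fpr_energy = fpr_at_tpr([energy_score(z) for z in ind_logits],
                        [energy_score(z) for z in ood_logits])
```

Both scores follow the "higher means more in-distribution" convention, so the same thresholding routine serves either one. A detector trained with synthetic OOD proxies, as the abstract describes, could be evaluated the same way by substituting its own score.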

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01 · Rating 5 · Confidence 3

Strengths

- Experiments are pretty comprehensive.
- Empirical results are generally positive.

Weaknesses

I don't find this paper particularly well-recognized, and I had a hard time finding some relevant experiment details. Can I clarify:

1. Are all the baseline methods (MSP, Energy, DICE) trained with the original real data? And are the synthetic data the same size as the original OOD data?
2. You are using a 70B model to generate the synthetic data but using 13B or 7B data for the OOD detection task. In a way this is distillation? Have you analyzed the impact of the size of the synthetic data g

Reviewer 02 · Rating 6 · Confidence 2

Strengths

1. This paper is really easy to follow, with proper figures and content analysis. Also, the methods proposed are very simple, and, as far as I know, new.
2. The selected datasets and metrics are proper to me.

Weaknesses

In the last paragraph of Section 4, the authors state that "our synthetic data is nearly as effective as real OOD data, and possibly more diverse, in representing OOD samples," but support this only with visualization figures. I believe more statistical analysis is needed to make this claim valid.

Reviewer 03 · Rating 5 · Confidence 3

Strengths

The authors have performed extensive experimental testing, which could be useful if the results were presented more clearly.

Weaknesses

Overly Simplistic Definition of OOD: The OOD detection tasks considered in this study, even those termed “near-OOD,” seem simplistic and easy to solve. For instance, distinguishing between tasks like CC and GSM8k does not appear to be particularly challenging. The real-world relevance or difficulty of these OOD tasks is not clear. It would be very surprising if no baseline method performed well on these tasks. Additionally, for the few baselines reported in Table 1, there seems to

Code & Models

Datasets

abbasm2/synthetic_ood
dataset· 33 dl

Taxonomy

Topics: Anomaly Detection Techniques and Applications