Discriminative Span as a Predictor of Synthetic Data Utility via Classifier Reconstruction
Radhika Amar Desai, Modigari Narendra

TL;DR
This paper introduces a geometry-based metric in embedding space to predict the usefulness of synthetic data for improving classifier performance without needing to train models.
Contribution
It proposes a model-agnostic metric that assesses synthetic data quality by checking whether the weight vector of a linear classifier can be expressed within the subspace spanned by difference vectors between samples in embedding space.
Findings
The metric correlates strongly with downstream CNN classification accuracy.
Synthetic data capturing task-relevant variations improves classifier performance.
The approach works across multiple datasets and architectures.
Abstract
In many real-world computer vision applications, including medical imaging and industrial inspection, binary classification tasks are characterized by a severe scarcity of positive samples. A widely adopted solution is to generate synthetic positive data using image-to-image transformations applied to negative samples. However, a fundamental challenge remains: how can we reliably assess whether such synthetic data will improve downstream model performance? In this work, we propose a geometry-driven metric that predicts the utility of synthetic data without requiring model training. Our approach operates in the embedding space of a pre-trained foundation model and represents the dataset through difference vectors between samples. We evaluate whether the weight vector of a linear classifier can be expressed within the subspace spanned by these variations by measuring the relative…
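The reconstruction idea described in the abstract can be illustrated with a minimal NumPy sketch. The abstract is cut off before it names the exact quantity being measured, so the relative residual used here is an assumption, not the authors' confirmed definition. The function name `relative_reconstruction_residual`, its arguments, and the toy vectors are all hypothetical. The sketch assumes the difference vectors have already been computed from foundation-model embeddings and that a linear classifier weight vector is available.

```python
import numpy as np


def relative_reconstruction_residual(diffs, w, rcond=1e-10):
    """Measure how well w is reconstructed from the span of the rows of diffs.

    diffs: (n, d) array, one embedding-space difference vector per row.
    w:     (d,) weight vector of a linear classifier.
    Returns ||w - w_hat|| / ||w||, where w_hat is the least-squares
    projection of w onto span(diffs). A value of 0 means w lies entirely in
    the span; a value of 1 means w is orthogonal to it.
    NOTE: this ratio is an assumed stand-in for the paper's metric.
    """
    diffs = np.asarray(diffs, dtype=float)
    w = np.asarray(w, dtype=float)
    # Solve min_c || diffs.T @ c - w ||_2 for the combination coefficients c.
    coeffs, *_ = np.linalg.lstsq(diffs.T, w, rcond=rcond)
    w_hat = diffs.T @ coeffs
    return float(np.linalg.norm(w - w_hat) / np.linalg.norm(w))


# Toy example in 3 dimensions: the variations only span the x-y plane.
toy_diffs = np.array([[1.0, 0.0, 0.0],
                      [0.0, 2.0, 0.0]])

in_span = relative_reconstruction_residual(toy_diffs, np.array([3.0, -1.0, 0.0]))
orthogonal = relative_reconstruction_residual(toy_diffs, np.array([0.0, 0.0, 5.0]))
mixed = relative_reconstruction_residual(toy_diffs, np.array([1.0, 0.0, 1.0]))
```

Under this assumed definition, a small residual would suggest that the variations present in the data cover the direction the classifier relies on, while a residual near 1 would suggest they miss it.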
