Generative Data Mining with Longtail-Guided Diffusion
David S. Hayden, Mao Ye, Timur Garipov, Gregory P. Meyer, Carl Vondrick, Zhao Chen, Yuning Chai, Eric Wolff, Siddhartha S. Srinivasa

TL;DR
This paper introduces Longtail Guidance (LTG), a proactive data augmentation method that uses model-based longtail signals to guide a latent diffusion model toward generating additional training data. LTG requires no retraining of either the diffusion model or the predictive model; the generated data improves generalization and supports analysis of model gaps.
Contribution
The paper proposes LTG, a process that uses differentiable model-based longtail signals, including a single-forward-pass formulation of epistemic uncertainty, to guide data generation, with the aim of improving model robustness and interpretability.
Findings
Generated data shows meaningful semantic variation.
Training on the generated data yields significant improvements on image classification benchmarks.
LTG enables proactive discovery and explanation of model gaps.
Abstract
It is difficult to anticipate the myriad challenges that a predictive model will encounter once deployed. Common practice entails a reactive, cyclical approach: model deployment, data mining, and retraining. We instead develop a proactive longtail discovery process by imagining additional data during training. In particular, we develop general model-based longtail signals, including a differentiable, single forward pass formulation of epistemic uncertainty that does not impact model parameters or predictive performance but can flag rare or hard inputs. We leverage these signals as guidance to generate additional training data from a latent diffusion model in a process we call Longtail Guidance (LTG). Crucially, we can perform LTG without retraining the diffusion model or the predictive model, and we do not need to expose the predictive model to intermediate diffusion states. Data…
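The core idea in the abstract, steering a sample with the gradient of a differentiable uncertainty signal while the predictive model stays frozen, can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not the paper's method: predictive entropy of a fixed linear softmax classifier stands in for the paper's epistemic uncertainty signal, plain gradient ascent on an input vector stands in for guided latent diffusion sampling, and the names `entropy_and_grad` and `guide` are hypothetical.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over a 1-D logit vector.
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def entropy_and_grad(W, x):
    """Predictive entropy of softmax(W @ x) and its gradient w.r.t. x.

    ASSUMPTION: entropy is only a stand-in for a differentiable
    longtail signal; it is not the paper's epistemic formulation.
    """
    p = softmax(W @ x)
    logp = np.log(p + 1e-12)
    H = -np.sum(p * logp)
    # Closed form: dH/dz_j = -p_j * (log p_j + H), then chain rule through z = W x.
    dH_dz = -p * (logp + H)
    return H, W.T @ dH_dz

def guide(W, x0, scale=0.5, steps=100):
    """Move x toward higher uncertainty; W (the 'model') is never updated."""
    x = x0.copy()
    for _ in range(steps):
        _, g = entropy_and_grad(W, x)
        x = x + scale * g  # gradient ascent on the signal
    return x

# Hypothetical 3-class classifier with fixed weights.
W = np.eye(3)
x0 = np.array([2.0, 0.0, 0.0])  # starts out confidently class 0
x1 = guide(W, x0)

h0, _ = entropy_and_grad(W, x0)
h1, _ = entropy_and_grad(W, x1)
print(f"entropy before guidance: {h0:.4f}, after: {h1:.4f}")
```

In this toy setting the guided input ends up with higher predictive entropy than the starting input, which mirrors, in a very reduced form, how a longtail signal could push generation toward inputs the model finds hard. A real pipeline would apply such a gradient inside the sampling loop of a pretrained latent diffusion model rather than directly on the input.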
Taxonomy
Topics: Data Mining Algorithms and Applications
Methods: Diffusion · Latent Diffusion Model
