ScoreMix: Synthetic Data Generation by Score Composition in Diffusion Models Improves Recognition
Parsa Rahimi, Sebastien Marcel

TL;DR
ScoreMix introduces a novel diffusion-based synthetic data augmentation method that mixes class-conditioned scores to improve recognition accuracy without external resources, demonstrating significant gains across face recognition benchmarks.
Contribution
It presents a self-contained score compositionality approach in diffusion models for generating domain-specific synthetic data for recognition tasks, avoiding reliance on external datasets or models.
Findings
Up to 7% accuracy improvement on face recognition benchmarks.
Mixing distant classes yields larger gains than similar classes.
Method is robust and practical without hyperparameter tuning.
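The second finding above (mixing distant classes helps more than mixing similar ones) can be illustrated with a small sketch. It assumes each class is summarized by the mean of its L2-normalized discriminator embeddings and that "distant" means large cosine distance between those means. The function names, the centroid summary, and the choice of cosine distance are illustrative assumptions, not the paper's confirmed procedure.

```python
import numpy as np

def class_centroids(embeddings, labels):
    """Return the class ids and one unit-norm mean embedding per class."""
    classes = np.unique(labels)
    cents = np.stack([embeddings[labels == c].mean(axis=0) for c in classes])
    # Normalize so the dot product below is a cosine similarity.
    cents = cents / np.linalg.norm(cents, axis=1, keepdims=True)
    return classes, cents

def farthest_partner(embeddings, labels):
    """Map each class to the class whose centroid is farthest in cosine distance."""
    classes, cents = class_centroids(embeddings, labels)
    dist = 1.0 - cents @ cents.T      # pairwise cosine distance
    np.fill_diagonal(dist, -np.inf)   # a class is never paired with itself
    partner = classes[dist.argmax(axis=1)]
    return dict(zip(classes.tolist(), partner.tolist()))

# Hypothetical 2-D embeddings for three identities.
emb = np.array([[1.0, 0.0], [1.0, 0.0], [0.9, 0.1], [-1.0, 0.0]])
lab = np.array([0, 0, 1, 2])
pairs = farthest_partner(emb, lab)
```

In this toy data, classes 0 and 1 are close to each other, so both are paired with class 2. Replacing `argmax` with `argmin` (and `-np.inf` with `np.inf`) would give the proximity-based selection that the findings compare against.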
Abstract
Synthetic data generation is increasingly used in machine learning for training and data augmentation. Yet, current strategies often rely on external foundation models or datasets, whose usage is restricted in many scenarios due to policy or legal constraints. We propose ScoreMix, a self-contained synthetic generation method to produce hard synthetic samples for recognition tasks by leveraging the score compositionality of diffusion models. The approach mixes class-conditioned scores along reverse diffusion trajectories, yielding domain-specific data augmentation without external resources. We systematically study class-selection strategies and find that mixing classes distant in the discriminator's embedding space yields larger gains, providing up to 3% additional average improvement, compared to selection based on proximity. Interestingly, we observe that condition and embedding…
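The score-mixing idea described in the abstract can be sketched on a toy problem. The example below is a minimal illustration, not the authors' implementation: it replaces the trained class-conditional diffusion model with closed-form Gaussian scores (`toy_score`), uses a convex weight `w` in [0, 1], and takes a single Langevin-style update (`reverse_step`) as a stand-in for one step of a reverse diffusion sampler. All function and parameter names are hypothetical.

```python
import numpy as np

def mix_scores(score_a, score_b, w=0.5):
    """Convex combination of two class-conditioned score estimates."""
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must lie in [0, 1] for a convex mix")
    return w * score_a + (1.0 - w) * score_b

def toy_score(x, mean, sigma):
    """Score (gradient of log density) of an isotropic Gaussian N(mean, sigma^2 I).

    Stand-in for a learned class-conditional score network (assumption).
    """
    return (mean - x) / sigma**2

def reverse_step(x, class_means, w, sigma, step_size, rng):
    """One Langevin-style update driven by the mixed score (illustrative only)."""
    s_a = toy_score(x, class_means[0], sigma)
    s_b = toy_score(x, class_means[1], sigma)
    s = mix_scores(s_a, s_b, w)
    noise = rng.standard_normal(x.shape)
    return x + step_size * s + np.sqrt(2.0 * step_size) * noise

# Toy usage: two "classes" centered at a and b, equal mixing weight.
x0 = np.zeros(2)
a, b = np.array([2.0, 0.0]), np.array([0.0, 2.0])
mixed = mix_scores(toy_score(x0, a, 1.0), toy_score(x0, b, 1.0), w=0.5)
x1 = reverse_step(x0, (a, b), 0.5, 1.0, 0.01, np.random.default_rng(0))
```

For Gaussians with equal variance, the mixed score with `w = 0.5` points toward the midpoint of the two class means, which gives some intuition for why a convex mix yields samples lying between two classes.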
Peer Reviews
Decision: Submitted to ICLR 2026
* Clear, simple mechanism with strong intuition. Convex score mixing is well-motivated to preserve score magnitude and remain on-manifold; qualitative grids and discussion illustrate why non-convex weights can fail.
* Self-contained augmentation. Training both generator and discriminator only on the available dataset is practically appealing in sensitive domains.
* Consistent empirical gains. Across FR benchmarks, ScoreMix improves over training on real data alone and beats a larger IR101 base
* The authors note ScoreMix “roughly doubles” sampling cost vs. AugGen (Line 269). Please explicitly quantify: GPU hours for generator training + sampling per 0.2M synthetic images, versus baselines (e.g., AugGen), and the cost of computing embedding distances & m-plet mining. Without cost curves, practicality is hard to judge.
* Class-pair selection uses distances from a trained discriminator. How sensitive are gains to the quality/architecture of that initial model? If we re-select pairs usin
1. This paper develops a self-contained augmentation strategy (that is, one that does not rely on external datasets, commercial APIs, or third-party models) to maximize the performance of state-of-the-art discriminators solely with the available data.
2. This paper demonstrates that convex combinations of class-conditioned scores yield synthetic samples that consistently improve discriminator training.
My main concern regarding this paper's motivation lies with its core premise: leveraging synthetic data for augmentation, particularly within the sensitive domain of facial data. I question the fundamental viability of this approach. Specifically, wouldn't introducing synthetic data risk confusing the model and ultimately compromise its robustness, especially when considering critical applications like face anti-spoofing?
1. Given the widespread availability of high-quality, pre-trained generat
1. This paper proposes a practical and simple recipe once a class-conditional diffusion model is trained, avoiding external data or models.
2. Empirical results show consistent improvements across multiple face benchmarks, with an actionable rule for choosing distant class pairs in embedding space.
3. Useful analysis that clarifies why reproducing samples is not effective for improving downstream recognition performance and why embedding geometry should guide mixing.
1. The paper’s scope is confined to face recognition, leaving transfer to other recognition domains untested. I understand the focus is low-data regimes, but that still includes domains like medical imaging, retail product IDs, species recognition, and industrial parts where manifold structure is different and the current evidence does not generalize.
2. The finding that mixing more than two classes brings little benefit is reported and interesting but there is little analysis on top of it. It
Taxonomy
Topics: Face recognition and analysis · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis
Methods: Diffusion
