# A Combinatorial Approach to Synthetic Data Generation for Machine Learning

**Authors:** Krishna Khadka, Jaganmohan Chandrasekaran, Yu Lei, Raghu Kacker, D. Richard Kuhn

PMC · DOI: 10.1007/s42979-025-04540-x · 2026-01-07

## TL;DR

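This paper introduces a combinatorial method for sampling the latent space of synthetic data generators. It needs fewer synthetic samples than random sampling to reach comparable model performance, and it loses less model performance when combined with a differentially private mechanism.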

## Contribution

A combinatorial (t-way) strategy for sampling the latent space of encoder-decoder synthetic data generators, which reaches comparable downstream model performance with fewer synthetic samples than random sampling.

## Key findings

- Combinatorial sampling achieves comparable model performance with fewer synthetic samples than random sampling.
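- When integrated with a differentially private mechanism, the method incurs a smaller decline in model performance than the existing random sampling approach.
- Most model predictions are largely influenced by interactions among a few features; in some cases, a small subset of features yields better accuracy than the full feature set.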

## Abstract

Datasets used in machine learning often contain sensitive information, including personally identifiable health and financial details. A common challenge faced by organizations and researchers is the risk of privacy breaches when using real-world data. Synthetic data can be used as an alternative to real-world data. In existing synthetic data generation techniques, an encoder processes the real-world data to map it into a lower-dimensional latent space. Random sampling is then performed in this latent space. Subsequently, a decoder network is used to generate synthetic data from these sampled points in the latent space. Such approaches typically require generating a large number of synthetic samples to approximate the performance of real-world data, which in turn slows down downstream machine learning tasks. To address this, we introduce a combinatorial approach to sampling the latent space, motivated by our empirical findings within this study that most model predictions are largely influenced by interactions between a few features. In some cases, using just a small number of features yields better accuracy than using all features. Through this approach, we generate samples that utilize t-way interactions among t of the n latent dimensions. Our experimental results indicate that our approach requires fewer samples than traditional random sampling to achieve comparable model performance for real-world datasets. We also show that when integrated with a differentially private mechanism, our approach incurs a smaller decline in model performance than the existing random sampling approach.
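To make the idea of t-way sampling over a latent space concrete, the following is a minimal sketch and not the authors' implementation. It assumes each of the n latent dimensions is discretized into a small set of candidate values (`levels`), and it uses a simple greedy covering-array construction so that every combination of values for every group of t dimensions appears in at least one sample. The function names `tway_latent_samples` and `covers_all_tway`, the discretization, and the greedy strategy are all illustrative assumptions; the abstract does not specify how the paper constructs its samples.

```python
import itertools
import random


def tway_latent_samples(n_dims, levels, t, candidates=30, seed=0):
    """Greedily build latent points that cover every t-way interaction.

    Hypothetical helper: `levels` is an assumed discretization of each
    latent dimension, not something taken from the paper.
    """
    rng = random.Random(seed)
    k = len(levels)
    dim_subsets = list(itertools.combinations(range(n_dims), t))
    # Every (dimension subset, level-index tuple) pair that must appear once.
    uncovered = {
        (dims, vals)
        for dims in dim_subsets
        for vals in itertools.product(range(k), repeat=t)
    }
    rows = []
    while uncovered:
        # Force one uncovered interaction into each candidate so that
        # every added row makes progress.
        target_dims, target_vals = next(iter(uncovered))
        best_row, best_hits = None, []
        for _ in range(candidates):
            row = [rng.randrange(k) for _ in range(n_dims)]
            for d, v in zip(target_dims, target_vals):
                row[d] = v
            hits = [(dims, tuple(row[d] for d in dims)) for dims in dim_subsets]
            hits = [h for h in hits if h in uncovered]
            if len(hits) > len(best_hits):
                best_row, best_hits = row, hits
        uncovered.difference_update(best_hits)
        rows.append([levels[i] for i in best_row])
    return rows


def covers_all_tway(rows, levels, t):
    """Check that each t-subset of dimensions sees every value combination."""
    needed = len(levels) ** t
    for dims in itertools.combinations(range(len(rows[0])), t):
        seen = {tuple(row[d] for d in dims) for row in rows}
        if len(seen) < needed:
            return False
    return True


# Example: 6 latent dimensions, 3 assumed levels each, pairwise (t = 2) coverage.
levels = [-1.0, 0.0, 1.0]
latent_points = tway_latent_samples(n_dims=6, levels=levels, t=2)
print(len(latent_points), "points; full grid would need", len(levels) ** 6)
print("all 2-way interactions covered:", covers_all_tway(latent_points, levels, 2))
```

In a pipeline like the one the abstract describes, each row of `latent_points` would be passed to the decoder in place of a randomly drawn latent vector. The point of the sketch is only that covering all t-way value combinations takes far fewer points than the full grid over all n dimensions.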

## Full-text entities

- **Genes:** SHROOM4 (shroom family member 4) [NCBI Gene 57477] {aka MRXSSDS, SHAP, shrm4}
- **Diseases:** IPM (MESH:D004195)
- **Chemicals:** AC (MESH:D000186), CopulaGAN (-)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Figures

2 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12779700/full.md

---
Source: https://tomesphere.com/paper/PMC12779700