A Combinatorial Approach to Synthetic Data Generation for Machine Learning

Krishna Khadka; Jaganmohan Chandrasekaran; Yu Lei; Raghu Kacker; D. Richard Kuhn

PMC · DOI: 10.1007/s42979-025-04540-x · January 7, 2026

Open Access

TL;DR

This paper introduces a new method for generating synthetic data that reduces the number of samples needed while maintaining model performance and improving privacy.

Contribution

A combinatorial sampling approach that reduces the number of synthetic samples required while preserving model accuracy.

Findings

01

Combinatorial sampling achieves comparable model performance with fewer synthetic samples than random sampling.

02

When combined with differential privacy, the method maintains better accuracy than traditional approaches.

03

Model predictions are often influenced more by interactions among a few features than by all features taken together.

Abstract

Datasets used in machine learning often contain sensitive information, including personally identifiable health and financial details. A common challenge faced by organizations and researchers is the risk of privacy breaches when using real-world data. Synthetic data can be used as an alternative to real-world data. In existing synthetic data generation techniques, an encoder maps the real-world data into a lower-dimensional latent space. Random sampling is then performed in this latent space, and a decoder network generates synthetic data from the sampled points. Such approaches typically require generating a large number of synthetic samples to approximate the performance of real-world data, which slows down downstream machine learning tasks. Addressing this, we introduce a combinatorial approach to sampling the…
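The abstract is cut off before the method is described, so the following is only an illustrative sketch of the general idea of replacing random latent sampling with combinatorial coverage, not the authors' algorithm. It assumes the latent space is discretized into a few bins per dimension and greedily picks points until every pair of bin values across every pair of dimensions appears at least once (2-way coverage). The function names `pairwise_cover` and `bins_to_latent`, the bin count, and the latent range of [-2, 2] are all hypothetical choices made for illustration.

```python
import itertools
import random


def pairwise_cover(num_dims, num_bins, candidates_per_row=50, seed=0):
    """Greedily build rows of bin indices until every pair of bin values
    across every pair of dimensions appears in at least one row.
    Requires num_dims >= 2."""
    rng = random.Random(seed)
    dim_pairs = list(itertools.combinations(range(num_dims), 2))
    # Every (dim_i, bin_a, dim_j, bin_b) combination that still needs covering.
    uncovered = {
        (i, a, j, b)
        for i, j in dim_pairs
        for a in range(num_bins)
        for b in range(num_bins)
    }
    rows = []
    while uncovered:
        # Force one uncovered pair into each candidate so every row makes progress.
        i, a, j, b = next(iter(uncovered))
        best_row, best_gain = None, -1
        for _ in range(candidates_per_row):
            row = [rng.randrange(num_bins) for _ in range(num_dims)]
            row[i], row[j] = a, b
            gain = sum((p, row[p], q, row[q]) in uncovered for p, q in dim_pairs)
            if gain > best_gain:
                best_row, best_gain = row, gain
        rows.append(best_row)
        for p, q in dim_pairs:
            uncovered.discard((p, best_row[p], q, best_row[q]))
    return rows


def bins_to_latent(rows, num_bins, low=-2.0, high=2.0):
    """Map each bin index to the center of its bin in the assumed range [low, high]."""
    width = (high - low) / num_bins
    return [[low + (b + 0.5) * width for b in row] for row in rows]


# Hypothetical setting: a 6-dimensional latent space with 4 bins per dimension.
rows = pairwise_cover(num_dims=6, num_bins=4)
latent_points = bins_to_latent(rows, num_bins=4)
print(len(rows), "pairwise-covering points versus", 4 ** 6, "in the full grid")
```

In a full pipeline, each entry of `latent_points` would be passed through the trained decoder to produce one synthetic record; the decoder is omitted here because the source does not specify it. The greedy search is a simple stand-in for a dedicated covering-array generator and makes no claim to produce a minimal set.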

Taxonomy

Topics: Privacy-Preserving Technologies in Data · Machine Learning in Healthcare · Big Data and Digital Economy