# Descriptor: Synthetic Genomic Dataset With Diverse Ancestry (SynGen6)

**Authors:** Xinyue Wang, Sitao Min, Jaideep Vaidya

PMC · DOI: 10.1109/ieeedata.2024.3505852 · IEEE Data Descriptions · 2025-04-18

## TL;DR

SynGen6 is a synthetic genomic dataset with balanced representation across six ancestry groups, intended to improve fairness and accuracy in genomic research.

## Contribution

The contribution is a balanced synthetic dataset spanning six ancestry groups, generated with privacy-preserving methods (PCA and ϵ-local differential privacy), together with the Python scripts used to build it.

## Key findings

- SynGen6 includes 34,200 samples across six populations with 7,120 SNPs.
- The dataset incorporates simulated phenotypes and synthetic relatives for kinship studies.
- Privacy is preserved using ϵ-local differential privacy and PCA-based synthesis.
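To make the ϵ-local differential privacy guarantee concrete, the sketch below applies k-ary randomized response to genotypes coded as minor-allele counts in {0, 1, 2}. This summary does not say which LDP mechanism the authors used, so the mechanism, the genotype coding, and the names `keep_probability` and `randomized_response` are all assumptions made for illustration, not the paper's implementation.

```python
import numpy as np


def keep_probability(epsilon, k=3):
    """Probability of reporting the true value under k-ary randomized response."""
    return np.exp(epsilon) / (np.exp(epsilon) + k - 1)


def randomized_response(genotypes, epsilon, rng=None):
    """Perturb genotypes in {0, 1, 2} so that each entry satisfies epsilon-LDP.

    Assumption: genotypes are minor-allele counts; this is an illustrative
    mechanism, not necessarily the one used to build SynGen6.
    """
    rng = np.random.default_rng() if rng is None else rng
    genotypes = np.asarray(genotypes)
    k = 3
    # Keep the true value with probability e^eps / (e^eps + k - 1).
    keep = rng.random(genotypes.shape) < keep_probability(epsilon, k)
    # Otherwise report one of the other two values uniformly at random,
    # implemented as a shift of 1 or 2 modulo 3.
    shift = rng.integers(1, k, size=genotypes.shape)
    flipped = (genotypes + shift) % k
    return np.where(keep, genotypes, flipped)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    sample = rng.integers(0, 3, size=(4, 10))  # toy data: 4 samples x 10 SNPs
    noisy = randomized_response(sample, epsilon=1.0, rng=rng)
    print("fraction unchanged:", (noisy == sample).mean())
```

The ratio between the probability of reporting the true value and the probability of reporting any other value is exactly e^ϵ, which is the ϵ-LDP bound. Smaller ϵ gives stronger privacy and noisier genotypes.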

## Abstract

Advancements in genomic analysis techniques and data-driven research are driving precision medicine. However, in many cases, these advances are not equitable and do not help all subpopulations, since many existing genomic datasets lack diversity, limiting their applicability for studying populations beyond those of European ancestry. Thus, to advance genomic analysis and to allow for fair benchmarking of newly proposed approaches, there is a significant demand for balanced and representative datasets. To address this issue, we developed SynGen6, a synthetic dataset that encompasses six distinct populations, providing balanced representation across various ancestry groups. Using the All of Us dataset as a foundation, we utilized principal component analysis (PCA) and ϵ-local differential privacy (LDP) to generate synthetic samples while preserving genetic diversity and the privacy of individuals. To further enhance the dataset, we simulated phenotype vectors associated with significant single nucleotide polymorphisms (SNPs), mirroring real-world gene-disease associations. We also generated synthetic SNPs to watermark the dataset, enabling verification of the accuracy of cloud-based genomic computations. Finally, synthetic relatives were created to support research on kinship inference and family-based genomic analyses, resulting in a comprehensive dataset of 34,200 samples and 7,120 SNPs across six populations. In this article, we describe the dataset and provide the Python scripts used to generate it, which can be extended to create additional synthetic datasets, aiming to fuel advancements in genomic data analysis.
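As an illustration of how synthetic relatives can be derived from existing samples, the following sketch generates a child genotype from two parents under simple Mendelian inheritance. The abstract does not describe the authors' procedure, so this is only one plausible approach; it assumes genotypes are coded as minor-allele counts (0, 1, 2) and that SNPs are inherited independently (no linkage). The function names `transmit_allele` and `simulate_child` are hypothetical.

```python
import numpy as np


def transmit_allele(parent, rng):
    """Sample the allele (0 or 1) that a parent passes on at each SNP.

    Homozygous genotypes (0 or 2) transmit deterministically; a heterozygous
    genotype (1) transmits the minor allele with probability 0.5.
    """
    parent = np.asarray(parent)
    coin = rng.integers(0, 2, size=parent.shape)
    return np.where(parent == 1, coin, parent // 2)


def simulate_child(mother, father, rng=None):
    """Return a child genotype vector as the sum of one allele from each parent.

    Assumption: SNPs are unlinked, so each site is sampled independently.
    """
    rng = np.random.default_rng() if rng is None else rng
    return transmit_allele(mother, rng) + transmit_allele(father, rng)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    mother = rng.integers(0, 3, size=12)  # toy genotypes, not SynGen6 data
    father = rng.integers(0, 3, size=12)
    print("mother:", mother)
    print("father:", father)
    print("child: ", simulate_child(mother, father, rng))
```

Applying `simulate_child` twice to the same parents yields siblings, and chaining it across generations yields more distant relatives, which is the kind of structure kinship-inference benchmarks need.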

## Full-text entities

- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12007885/full.md

## Figures

4 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12007885/full.md

## References

19 references — full list in the complete paper: https://tomesphere.com/paper/PMC12007885/full.md

---
Source: https://tomesphere.com/paper/PMC12007885