# Creating a general-purpose generative model for healthcare data based on multiple clinical studies

**Authors:** Hiroshi Maruyama, Kotatsu Bito, Yuki Saito, Masanobu Hibi, Shun Katada, Aya Kawakami, Kenta Oono, Nontawat Charoenphakdee, Zhengyan Gao, Hideyoshi Igata, Masashi Yoshikawa, Yoshiaki Ota, Hiroki Okui, Kei Akita, Shoichiro Yamaguchi, Yohei Sugawara, Shin-ichi Maeda, Laura Sbaffi

PMC · DOI: 10.1371/journal.pdig.0001059 · PLOS Digital Health · 2025-11-05

## TL;DR

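This paper introduces a general-purpose generative model for healthcare data, trained on a unified dataset that integrates multiple clinical studies. The model can estimate missing attribute values from known ones and generate synthetic datasets, which may lower the cost and privacy barriers to accessing healthcare data.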

## Contribution

The contribution is a single generative model that captures statistical interactions among more than 2000 human attributes drawn from multiple clinical studies, supporting both synthetic data generation and estimation of unknown attributes from known ones.

## Key findings

- The model captures key statistical properties of the training dataset, including univariate distributions and bivariate relationships.
- The authors demonstrate the model's practical utility through multiple real-world applications, illustrating its potential impact on predictive, preventive, and personalized medicine.
- The model is available as an internet-based software service, lowering barriers to digital healthcare innovation.
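The first finding, agreement on univariate distributions and bivariate relationships, can be illustrated with a small check. The sketch below is not the authors' evaluation procedure: it assumes a simple comparison of per-attribute means and standard deviations plus pairwise Pearson correlations, and the attribute names (`age`, `sbp`) and the toy data generator are hypothetical stand-ins for a real and a synthetic dataset.

```python
import math
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def fidelity_report(real, synth):
    """Compare two datasets given as {attribute_name: list_of_values}.

    Returns absolute gaps in mean and standard deviation per attribute
    (univariate) and in Pearson correlation per attribute pair (bivariate).
    """
    names = sorted(real)
    report = {"univariate": {}, "bivariate": {}}
    for name in names:
        report["univariate"][name] = {
            "mean_gap": abs(statistics.fmean(real[name]) - statistics.fmean(synth[name])),
            "stdev_gap": abs(statistics.stdev(real[name]) - statistics.stdev(synth[name])),
        }
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            gap = abs(pearson(real[a], real[b]) - pearson(synth[a], synth[b]))
            report["bivariate"][(a, b)] = gap
    return report


def toy_dataset(n, seed):
    """Hypothetical stand-in data: age and a correlated blood pressure value."""
    rng = random.Random(seed)
    age = [rng.gauss(50, 12) for _ in range(n)]
    sbp = [100 + 0.5 * a + rng.gauss(0, 8) for a in age]
    return {"age": age, "sbp": sbp}


# Two independent draws play the roles of "training" and "synthetic" data.
real = toy_dataset(5000, seed=0)
synth = toy_dataset(5000, seed=1)
report = fidelity_report(real, synth)
```

Small gaps in `report` indicate that the synthetic data reproduces the marginal and pairwise structure of the reference data. Means, standard deviations, and linear correlations are only a coarse proxy; a fuller check would compare whole distributions and handle categorical attributes.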

## Abstract

Data for healthcare applications are typically customized for specific purposes but are often difficult to access due to high costs and privacy concerns. Rather than prepare separate datasets for individual applications, we propose a novel approach: building a general-purpose generative model applicable to virtually any type of healthcare application. This generative model encompasses a broad range of human attributes, including age, sex, anthropometric measurements, blood components, physical performance metrics, and numerous healthcare-related questionnaire responses. To achieve this goal, we integrated the results of multiple clinical studies into a unified training dataset and developed a generative model to replicate its characteristics. The model can estimate missing attribute values from known attribute values and generate synthetic datasets for various applications. Our analysis confirmed that the model captures key statistical properties of the training dataset, including univariate distributions and bivariate relationships. We demonstrate the model’s practical utility through multiple real-world applications, illustrating its potential impact on predictive, preventive, and personalized medicine.

Digital technologies are expected to revolutionize healthcare, yet digital healthcare has not reached its full potential. A major bottleneck is poor data availability: because of privacy and cost concerns, healthcare data are very difficult to access. Here, our aim was to provide a general-purpose statistical model that can be used in place of actual data. Recent advances in machine-learning technology, especially in generative models, make this challenging goal possible. We built a model that captures complex statistical interactions among more than 2000 human attributes and made it available as a software service on the Internet. The model can be used to estimate unknown attributes from known attributes and to generate synthetic data. We believe that this model significantly lowers the barrier to entry into digital healthcare and will stimulate future innovations.
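The two uses named above, estimating unknown attributes from known ones and generating synthetic data, can be illustrated with a deliberately simple stand-in. The sketch below assumes a two-attribute Gaussian model fitted to toy data. It is not the model described in the paper, whose architecture is not given in this summary, and every name in it (`fit_bivariate_gaussian`, `estimate_missing`, `sample_synthetic`, `age`, `sbp`) is hypothetical.

```python
import math
import random
import statistics


def fit_bivariate_gaussian(xs, ys):
    """Fit means, variances, and covariance of two attributes (toy model)."""
    n = len(xs)
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    var_x = sum((x - mx) ** 2 for x in xs) / (n - 1)
    var_y = sum((y - my) ** 2 for y in ys) / (n - 1)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    return {"mx": mx, "my": my, "var_x": var_x, "var_y": var_y, "cov": cov}


def estimate_missing(model, x_known):
    """Conditional mean and standard deviation of y given a known x."""
    slope = model["cov"] / model["var_x"]
    mean = model["my"] + slope * (x_known - model["mx"])
    var = model["var_y"] - model["cov"] ** 2 / model["var_x"]
    return mean, math.sqrt(max(var, 0.0))


def sample_synthetic(model, n, seed=0):
    """Draw n synthetic (x, y) rows: sample x, then y conditioned on x."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.gauss(model["mx"], math.sqrt(model["var_x"]))
        mean, sd = estimate_missing(model, x)
        rows.append((x, rng.gauss(mean, sd)))
    return rows


# Hypothetical training data: age and a correlated blood pressure value.
rng = random.Random(0)
age = [rng.gauss(50, 12) for _ in range(5000)]
sbp = [100 + 0.5 * a + rng.gauss(0, 8) for a in age]

model = fit_bivariate_gaussian(age, sbp)
est_mean, est_sd = estimate_missing(model, 60.0)  # estimate sbp for a 60-year-old
synthetic_rows = sample_synthetic(model, 1000, seed=1)
```

The same fitted model serves both purposes: conditioning on the known attribute gives an estimate with an uncertainty, and sampling from the joint distribution gives synthetic records. A model covering more than 2000 mixed-type attributes would need a far more expressive generative approach than this two-variable Gaussian.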

## Full-text entities

- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12588491/full.md

## Figures

9 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12588491/full.md

## References

36 references — full list in the complete paper: https://tomesphere.com/paper/PMC12588491/full.md

---
Source: https://tomesphere.com/paper/PMC12588491