Using Synthetic Datasets to Aggregate Data in Aging and Longevity Studies: A Randomized Resampling Approach
Sarah Peskoe, Zachary Kunicki

TL;DR
This paper proposes a randomized resampling method for generating synthetic datasets in aging research, intended to work around data scarcity and privacy constraints while reducing bias and improving accuracy relative to a fixed synthesis order.
Contribution
A novel randomized resampling strategy that improves accuracy and reduces bias in synthetic datasets for aging and longevity studies.
Findings
Fixed synthesis chains can introduce bias in downstream variables.
Varying the synthesis chain improves accuracy for key outcomes.
The proposed randomized resampling strategy reduces bias and improves precision in simulations.
Abstract
The scarcity of harmonized datasets capturing clinical and cognitive outcomes—such as pain, function, biological, and behavioral measures—in older adults presents a major barrier to reproducible aging research. While federated data models offer one solution, they are often constrained by institutional privacy policies and logistical complexity. This study explores an alternative: pooled synthetic datasets generated using the synthpop package in R, which employs Classification and Regression Trees (CART) to model conditional distributions across mixed data types. We evaluate the utility of this approach using data from the Health and Retirement Study (HRS), focusing on how the choice of synthesis order affects bias and inference. Our findings show that fixed synthesis chains can introduce bias in downstream variables, while varying the chain improves accuracy for key outcomes. We propose…
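To make the role of synthesis order concrete, the following is a minimal Python sketch of sequential synthesis with a randomized visit order. It is an illustration under stated assumptions, not the authors' procedure: the abstract is truncated before the proposed method is described, so drawing a fresh random column order for each synthetic replicate is only one plausible reading of "randomized resampling". The sketch also swaps CART for a much cruder conditional sampler (quantile-bin donor pools), does not call synthpop, and uses simulated toy data rather than HRS. All function and column names (`make_binner`, `synthesize`, `synthesize_random_orders`, `age`, `pain`, `function`) are hypothetical.

```python
import random
from collections import defaultdict


def make_binner(values, n_bins):
    """Return a function mapping a value to its quantile bin (0 .. n_bins-1)."""
    ordered = sorted(values)
    edges = [ordered[len(ordered) * k // n_bins] for k in range(1, n_bins)]
    return lambda v: sum(v >= e for e in edges)


def synthesize(rows, order, rng, n_bins=4):
    """Synthesize columns one at a time, in the given order.

    The first column is bootstrapped from its observed marginal. Each later
    column is drawn from observed rows whose earlier columns fall in the same
    quantile bins as the synthetic row. This binning is a crude stand-in for
    the CART leaves used by synthpop, chosen only to keep the sketch short.
    """
    cols = list(rows[0].keys())
    binners = {c: make_binner([r[c] for r in rows], n_bins) for c in cols}
    synth = [{} for _ in rows]
    for i, col in enumerate(order):
        prev = order[:i]
        # Group observed values of `col` by the bins of the earlier columns.
        donors = defaultdict(list)
        for r in rows:
            donors[tuple(binners[p](r[p]) for p in prev)].append(r[col])
        marginal = [r[col] for r in rows]
        for s in synth:
            key = tuple(binners[p](s[p]) for p in prev)
            # Fall back to the marginal if no observed row shares this cell.
            s[col] = rng.choice(donors.get(key) or marginal)
    return synth


def synthesize_random_orders(rows, m, seed=0):
    """Generate m synthetic datasets, each with a freshly shuffled column order."""
    rng = random.Random(seed)
    cols = list(rows[0].keys())
    replicates = []
    for _ in range(m):
        order = cols[:]
        rng.shuffle(order)
        replicates.append((order, synthesize(rows, order, rng)))
    return replicates


# Toy data simulated here (NOT HRS); column names are hypothetical.
data_rng = random.Random(1)
observed = []
for _ in range(300):
    age = data_rng.gauss(70, 8)
    pain = 0.05 * (age - 70) + data_rng.gauss(0, 1)
    function = -0.5 * pain + data_rng.gauss(0, 1)
    observed.append({"age": age, "pain": pain, "function": function})

replicates = synthesize_random_orders(observed, m=5, seed=42)
for order, synth in replicates:
    mean_function = sum(r["function"] for r in synth) / len(synth)
    print(order, round(mean_function, 3))
```

With a fixed order, variables late in the chain always inherit whatever error accumulated upstream; shuffling the order per replicate spreads that position effect across variables. Whether this matches the paper's strategy in detail cannot be confirmed from the abstract alone.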
Taxonomy
Topics: Advanced Causal Inference Techniques · Privacy-Preserving Technologies in Data · demographic modeling and climate adaptation
