Simulation Framework for Realistic Large-scale Individual-level Data Generation with an Application in the Health Domain
Santtu Tikka, Jussi Hakanen, Mirka Saarela, Juha Karvanen

TL;DR
This paper introduces a scalable, open-source simulation framework in R for generating realistic large-scale individual-level health data, enabling complex system modeling, policy evaluation, and statistical research.
Contribution
It presents a mathematically rigorous, scalable, and flexible framework with an open-source implementation supporting detailed health data simulation at population scale.
Findings
Simulated health data for millions of individuals over decades.
Demonstrated impact of non-participation on risk model estimates.
Showcased policy intervention analysis in Finnish health context.
Abstract
We propose a framework for realistic data generation and simulation of complex systems and demonstrate its capabilities in the health domain. The main use cases of the framework are predicting the development of risk factors and disease occurrence, evaluating the impact of interventions and policy decisions, and statistical method development. We present the fundamentals of the framework using rigorous mathematical definitions. The framework supports calibration to a real population as well as various manipulations and data collection processes. The freely available open-source implementation in R embraces efficient data structures, parallel computing and fast random number generation which ensure reproducibility and scalability. With the framework it is possible to run daily-level simulations for populations of millions of individuals for decades of simulated time. An example on the…
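The abstract mentions daily-level simulation and reproducible random number generation. As a rough illustration of those two ideas only, here is a minimal Python sketch. It is not the authors' R implementation: the function name `simulate`, the constant daily risk, and all parameter values are hypothetical choices made for this example.

```python
import random


def simulate(n_individuals, n_days, daily_risk, seed):
    """Toy daily-level simulation (hypothetical, not the paper's model).

    Returns a list with the day index on which each individual first
    develops the disease, or None if it never occurs during the run.
    """
    # A dedicated generator makes the result depend only on the seed.
    rng = random.Random(seed)
    onset_day = [None] * n_individuals
    for day in range(n_days):
        for i in range(n_individuals):
            # Only individuals who are still disease-free are at risk.
            if onset_day[i] is None and rng.random() < daily_risk:
                onset_day[i] = day
    return onset_day


# Two runs with the same seed give identical individual-level histories.
run_a = simulate(n_individuals=1000, n_days=365, daily_risk=0.0005, seed=2024)
run_b = simulate(n_individuals=1000, n_days=365, daily_risk=0.0005, seed=2024)
cases = sum(day is not None for day in run_a)
print(run_a == run_b, cases)
```

A real framework of the kind described would replace the constant risk with models of risk factors and disease occurrence calibrated to a population, and would use parallel computation to reach millions of individuals. The sketch only shows why fixing the random number stream makes a run repeatable.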
Taxonomy
Topics: demographic modeling and climate adaptation · Insurance, Mortality, Demography, Risk Management · Statistical Methods and Inference
