# Machine Learning for the Analysis of Healthy Lifestyle Data: Scoping Review and Guidelines

**Authors:** Tony Estrella, Lluis Capdevila, Carla Alfonso, Josep-Maria Losilla

PMC · DOI: 10.2196/78648 · JMIR Human Factors · 2026-02-27

## TL;DR

This paper reviews how machine learning is used to analyze healthy lifestyle data and offers guidelines to improve future research quality and transparency.

## Contribution

The review maps how supervised machine learning is applied to healthy lifestyle data across 65 studies and provides methodological and reporting guidelines, with a checklist, for health behavior research.

## Key findings

- Most studies (48/65, 74%) integrated multidomain data from physical activity, diet, sleep, and stress.
- Random forest was the most frequently applied algorithm, but the authors recommend a multimodel approach for comprehensive comparison.
- Explainable AI (XAI) methods were used in about a third of the studies (22/65), almost always SHAP values (21 studies).
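
The SHAP values mentioned in the last finding build on Shapley values from cooperative game theory: each feature is credited with its average marginal contribution over all orderings in which features can be added. The sketch below computes exact Shapley values for a tiny toy scoring function. The feature names and the scores in `toy_value` are hypothetical and chosen only for illustration; they are not taken from the reviewed studies, and real SHAP libraries use approximations rather than this brute-force enumeration.

```python
from itertools import permutations
from math import factorial


def shapley_values(value_fn, features):
    """Exact Shapley values: average marginal contribution over all orderings."""
    contrib = {f: 0.0 for f in features}
    for order in permutations(features):
        coalition = frozenset()
        for f in order:
            with_f = coalition | {f}
            # Marginal contribution of f given the features already present
            contrib[f] += value_fn(with_f) - value_fn(coalition)
            coalition = with_f
    n_orderings = factorial(len(features))
    return {f: total / n_orderings for f, total in contrib.items()}


def toy_value(coalition):
    """Hypothetical model output when only the features in `coalition` are known."""
    score = 0.0
    if "physical_activity" in coalition:
        score += 3.0
    if "sleep" in coalition:
        score += 1.0
    if "diet" in coalition:
        score += 2.0
    # Assumed interaction: activity and sleep together add extra benefit
    if "physical_activity" in coalition and "sleep" in coalition:
        score += 2.0
    return score


FEATURES = ["physical_activity", "sleep", "diet"]
phi = shapley_values(toy_value, FEATURES)
print(phi)  # {'physical_activity': 4.0, 'sleep': 2.0, 'diet': 2.0}
```

The interaction bonus of 2.0 is split evenly between the two interacting features, and the values sum to the full-model score of 8.0, which is the additivity property that makes these attributions easy to report per feature.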

## Abstract

Advances in data science and technology have transformed lifestyle research by enabling the integration of multimodal information and the generation of large-scale datasets. Despite the growing interest in machine learning (ML) within health behavior research, significant methodological gaps remain.

The study aims to systematically review the applications of supervised ML algorithms in the analysis of healthy lifestyle data, with a particular focus on the methodological approaches used. The specific objectives are to explore the types and sources of data used for health outcomes, examine the ML processes used, including explainable artificial intelligence (XAI) methods, and review the software tools used. Additionally, this review aims to provide practical guidelines to enhance the quality and transparency of future ML research in health.

Following the PRISMA-ScR (Preferred Reporting Items for Systematic Reviews and Meta-Analyses extension for Scoping Reviews) recommendations, the search was conducted across PubMed, PsycINFO, and Web of Science, yielding 65 studies that met the inclusion criteria.

Most studies (48/65, 74%) integrated multidomain data from physical activity, diet, sleep, and stress. Data sources were split between self-acquired data (33/65, 51%) and health repositories (32/65, 49%). Single-item measurements were common, particularly for physical activity, diet, and sleep. Although 40 of 65 studies (62%) used a multimodel approach, random forest was the most frequently applied algorithm. To improve explainability, 22 of 65 studies (34%) incorporated specific XAI methods, with 21 using Shapley Additive Explanations (SHAP) values and 1 using local interpretable model-agnostic explanations (LIME). R (R Core Team) and Python (Python Software Foundation) were the most widely used software tools, with variation in the libraries used.
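
As a quick arithmetic check, the proportions reported above can be recomputed from the study counts. The sketch below uses only the counts stated in this abstract; the dictionary keys and the `percent` helper are illustrative names, not part of the paper.

```python
TOTAL_STUDIES = 65

# Counts as stated in the abstract
counts = {
    "multidomain_data": 48,
    "self_acquired_data": 33,
    "health_repositories": 32,
    "multimodel_approach": 40,
    "xai_methods": 22,
}


def percent(count, total=TOTAL_STUDIES, digits=0):
    """Share of studies as a percentage, rounded to `digits` decimals."""
    return round(100 * count / total, digits)


for name, count in counts.items():
    print(f"{name}: {count}/{TOTAL_STUDIES} = {percent(count):.0f}%")

# The two data-source categories partition the sample,
# and the two XAI methods (21 SHAP + 1 LIME) account for all XAI studies.
assert counts["self_acquired_data"] + counts["health_repositories"] == TOTAL_STUDIES
assert 21 + 1 == counts["xai_methods"]
```

To two decimals, 22/65 is 33.85%, which rounds to 34% at the whole-number precision used for the other proportions.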

This review highlights methodological gaps in the application of supervised ML to healthy lifestyle data. The ML workflow should span from data acquisition to explainability, using iterative steps to improve methodological rigor. Although multidomain data collection enhances the understanding of health issues related to lifestyle, representativeness remains limited due to methodological shortcomings in data acquisition. While random forest was the most commonly used algorithm, a multimodel approach is recommended for a comprehensive comparison. Lifestyle components consistently ranked among the top features in studies integrating XAI. Incorporating XAI methods into the ML pipeline can support personalized interventions, provided data collection is accurate. The R metapackage tidymodels (Max Kuhn and Hadley Wickham) facilitates process evaluation through a unified syntax, improving replicability. Methodological and reporting guidelines and a checklist are provided to enhance transparency and replicability in multidisciplinary ML research.
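
The multimodel recommendation can be illustrated with a minimal sketch: several candidate models are evaluated on the same cross-validation folds so that their scores are directly comparable. Everything here is an assumption made for illustration: the synthetic data, the feature names, and the two deliberately simple models (a majority-class baseline and a one-split decision stump) stand in for the algorithms compared in the reviewed studies, such as random forest. It is not the authors' pipeline.

```python
import random


def make_toy_data(n=200, seed=0):
    """Synthetic rows: ((activity_hours, sleep_hours), label). Hypothetical data."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        activity, sleep = rng.uniform(0, 10), rng.uniform(4, 10)
        label = 1 if activity + rng.gauss(0, 1.5) > 5 else 0  # driven by activity
        rows.append(((activity, sleep), label))
    return rows


def k_fold_indices(n, k, seed=0):
    """Shuffle indices once and deal them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


class MajorityBaseline:
    def fit(self, X, y):
        self.label = 1 if 2 * sum(y) >= len(y) else 0
        return self

    def predict(self, X):
        return [self.label] * len(X)


class DecisionStump:
    """Single feature/threshold split chosen by training accuracy."""

    def fit(self, X, y):
        best_acc = -1.0
        for f in range(len(X[0])):
            for t in sorted({x[f] for x in X}):
                for above in (0, 1):
                    preds = [above if x[f] > t else 1 - above for x in X]
                    acc = sum(p == yi for p, yi in zip(preds, y)) / len(y)
                    if acc > best_acc:
                        best_acc, self.rule = acc, (f, t, above)
        return self

    def predict(self, X):
        f, t, above = self.rule
        return [above if x[f] > t else 1 - above for x in X]


def cross_validate(model_factory, data, folds):
    """Mean held-out accuracy across the given folds."""
    scores = []
    for held_out in folds:
        held = set(held_out)
        train = [data[i] for i in range(len(data)) if i not in held]
        test = [data[i] for i in held_out]
        model = model_factory().fit([x for x, _ in train], [y for _, y in train])
        preds = model.predict([x for x, _ in test])
        scores.append(sum(p == y for p, (_, y) in zip(preds, test)) / len(test))
    return sum(scores) / len(scores)


data = make_toy_data()
folds = k_fold_indices(len(data), k=5)  # identical folds for every model
results = {
    "baseline": cross_validate(MajorityBaseline, data, folds),
    "stump": cross_validate(DecisionStump, data, folds),
}
print(results)
```

Reusing one set of folds for every candidate means score differences reflect the models rather than the data split, and including a trivial baseline shows whether a more complex model adds anything at all.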

## Full-text entities

- **Diseases:** cardiovascular and metabolic diseases (MESH:D002318), infertility (MESH:D007246), COVID-19 (MESH:D000086382), osteoporosis (MESH:D010024), depression (MESH:D003866), substance abuse (MESH:D019966), cancer (MESH:D009369), diabetes (MESH:D003920), Depression Anxiety (MESH:D001007), cardiometabolic disease (MESH:D024821), Sleep Disturbance (MESH:D012893), psoriasis (MESH:D011565), frailty (MESH:D000073496), obesity (MESH:D009765)
- **Chemicals:** alcohol (MESH:D000438)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12954701/full.md

## Figures

6 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12954701/full.md

## References

155 references — full list in the complete paper: https://tomesphere.com/paper/PMC12954701/full.md

---
Source: https://tomesphere.com/paper/PMC12954701