Development of a Type 2 Diabetes Prediction Model Using Specific Health Checkup Data and Extraction of Predictive Factors
Kenichiro Shimai, Kazuki Ohashi, Teppei Suzuki, Ryota Konno, Ryuichiro Ueda, Masami Mukai, Katsuhiko Ogasawara

TL;DR
This study developed a model to predict type 2 diabetes using non-invasive health checkup data and identified key risk factors in a Japanese population.
Contribution
The study introduces a non-invasive predictive model for type 2 diabetes using health checkup data and identifies specific risk factors.
Findings
The model achieved moderate discrimination with an AUC of 0.680 for those aged 40–74 years and 0.665 for those aged ≥75 years.
Key predictors included male sex, slower walking speed, and not eating within 2 hours before bedtime.
Use of antihypertensive drugs was positively associated with T2DM diagnosis.
Abstract
Background: Specific health checkups in Japan aim to prevent and detect non-communicable diseases (NCDs). Lifestyle information and non-invasive measurements obtained during these checkups are valuable for population health monitoring. This study aimed to develop a predictive model for type 2 diabetes mellitus (T2DM) using only non-invasive measurements and to identify key predictors. Methods: A retrospective observational study was conducted using linked health checkup records and medical claims from a city in Japan. Logistic regression was performed to predict a T2DM diagnosis. Results: A total of 409 of the 1363 participants were diagnosed with T2DM, including 285 of the 950 participants aged 40–74 years and 124 of the 413 participants aged ≥75 years. The model achieved an area under the receiver operating characteristic curve of 0.680 for those aged 40–74 years and 0.665 for those…
1. Introduction
Diabetes mellitus, a representative non-communicable disease (NCD), has been increasing worldwide and remains a major public health concern in Japan [1]. The global prevalence of diabetes among individuals aged 20–79 years was estimated at 11.1% (589 million persons) in 2024, meaning that roughly one in nine adults is affected. Prevalence is projected to rise to 13.0% (853 million persons) by 2050 [1]. The disease accounted for over 3.4 million deaths in 2024 and consumed more than one trillion US dollars in direct health expenditures [1]. According to a survey by the Japanese Ministry of Health, Labour and Welfare, 16.8% of men and 8.9% of women are “persons strongly suspected of having diabetes”, with prevalence beginning to increase around age 40 and continuing to rise with advancing age [2]. The WHO Global Action Plan for the Prevention and Control of NCDs lists diabetes as one of its four priority target diseases [3]. Preventing and detecting diabetes at an early stage is therefore a critical public health priority both globally and in Japan.
Diabetes leads to serious complications, including cardiovascular disease and nephropathy [4], resulting in higher medical costs [5] and reduced quality of life [6]. To prevent such complications and improve community health, both the early detection of type 2 diabetes mellitus (T2DM) and the identification of risk factors that can be linked to lifestyle improvement are essential. In Japan, regular health checkups, such as the specific health checkup program [7], are conducted to prevent NCDs and enable early detection. Regression analysis is frequently used to predict the risk of diseases, including diabetes. In a large-scale multi-institutional study of Japanese workers, Nanri et al. used logistic regression analysis to derive a simple risk score for the three-year incidence of T2DM, and the resulting model had reasonably good predictive ability [8]. Ashizawa et al. highlighted the usefulness of the lifestyle habit questions included in specific health checkups as predictors of metabolic syndrome in the subsequent year [9]. Furthermore, at the national level, robust machine learning models such as XGBoost and CatBoost have been developed to predict T2DM in specific regional populations. Although these models demonstrated very high predictive performance, they required as many as 80 features, including multiple blood test parameters, which may limit their feasibility in broader community settings [10]. Research on predictive models based on specific health checkups has thus progressed, but few studies have focused on specific regional populations. In addition, non-invasive predictive models that do not require blood tests could improve T2DM prediction in targeted regional populations while relying on a limited set of features.
The data acquired during specific health checkups comprise questionnaire responses, measurement results including physical measurements, and blood test results. Blood tests provide the values used as diagnostic criteria for diabetes, including hemoglobin A1C (HbA1c) and fasting blood glucose levels. Questionnaire responses provide non-invasive data. In large-scale screening, non-invasive data are advantageous because they are cheaper and easier to acquire than data obtained by invasive methods [11]. To ensure that health checkups generate cost-effective outcomes, it is crucial to establish a targeting strategy that determines which examinees should receive additional interventions or follow-up [12]. Therefore, the aim of this study was to develop a predictive model for T2DM using non-invasive parameters obtained from health checkups and to identify related predictive factors in a regional Japanese population.
2. Materials and Methods
2.1. Study Population and Database
This study used specific health checkup data and health insurance claims data obtained from a single municipality (population: <80,000 persons) in Hokkaido. To cover different age groups, two datasets were used: the National Health Insurance dataset (NHID), covering individuals aged 40–74 years, and the Medical Care System dataset (MCSD), covering older adults aged ≥75 years. For the NHID, ID linkage was used to connect the specific health checkup data of 6917 individuals in fiscal years (FY) 2017 and 2018 with the health insurance claims data for FY 2019. This process identified 285 individuals diagnosed with T2DM. Similarly, for the MCSD, which is based on the Medical Care System for the Elderly Aged 75 and Over, ID linkage was used to connect the specific health checkup data of 1854 individuals in FYs 2017 and 2018 with the health insurance claims data for FY 2019. This process identified 124 individuals diagnosed with T2DM. The number of individuals not diagnosed with T2DM was 665 in the NHID and 289 in the MCSD (Figures S1 and S2). When an individual had specific health checkup data for multiple years, the oldest record was used; duplicated health insurance claims for the same individual were removed.
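The linkage rules described above (keep the oldest checkup record per person, collapse duplicated claims, then flag a diagnosis) can be illustrated with a minimal sketch. The record layout, the `person_id` key, and the function name `link_cohort` are assumptions made for illustration; the paper does not describe its linkage implementation, and the exclusion steps in Figures S1 and S2 are not reproduced here.

```python
from dataclasses import dataclass, field


@dataclass
class CheckupRecord:
    person_id: str    # hypothetical linkage key shared with the claims data
    fiscal_year: int  # FY 2017 or FY 2018 in the study
    answers: dict = field(default_factory=dict)  # questionnaire responses


def link_cohort(checkups, claim_person_ids):
    """Return {person_id: (oldest checkup record, diagnosed flag)}.

    claim_person_ids lists the person IDs that appear with a T2DM
    diagnosis in the FY 2019 claims; it may contain duplicates.
    """
    oldest = {}
    for rec in checkups:
        kept = oldest.get(rec.person_id)
        # Keep the oldest record when several fiscal years are present.
        if kept is None or rec.fiscal_year < kept.fiscal_year:
            oldest[rec.person_id] = rec
    diagnosed = set(claim_person_ids)  # a set collapses duplicated claims
    return {pid: (rec, pid in diagnosed) for pid, rec in oldest.items()}


records = [
    CheckupRecord("p1", 2018),
    CheckupRecord("p1", 2017),
    CheckupRecord("p2", 2018),
]
cohort = link_cohort(records, ["p1", "p1"])
```

In this toy input, `p1` keeps the FY 2017 record and is flagged as diagnosed, while `p2` is retained as not diagnosed.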
2.2. Statistical Analysis
Logistic regression analysis was performed with the presence or absence of a T2DM diagnosis as the objective variable. Specifically, the “name of illness” recorded for FY 2019 (T2DM diagnosis, yes or no) was used as the objective variable, while the health checkup questionnaire items, sex, body mass index (BMI), and waist circumference were used as the explanatory variables. Table 1 shows the questionnaire items and response options. The questionnaire item “insulin injections and/or taking medicine for reducing blood sugar, yes or no” confirms whether an individual is undergoing diabetes treatment at a medical institution and is actively taking medication [7]; this item was excluded because it by itself determines diabetes status (yes or no). Furthermore, because the aim was to identify risk factors among the questionnaire items related to lifestyle habits, the medical history items “history of stroke, yes or no”; “history of heart disease, yes or no”; “history of kidney failure, yes or no”; and “history of anemia, yes or no” were also excluded from the analysis. The present study aimed to extract risk factors from non-invasive data, especially those available prior to medical intervention. Therefore, blood test results were not used as variables in this study, although they were used in previous studies. In the logistic regression analysis, the significance level was set at 0.05, and 95% confidence intervals were estimated for the partial regression coefficients and the odds ratios. The area under the receiver operating characteristic curve (AUROC) was used to evaluate the model. JMP Pro 17 (SAS Institute Inc., Cary, NC, USA) was used to consolidate the features to be analyzed; a chi-square test was used to analyze categorical variables, while a t-test was used to analyze continuous variables. IBM SPSS Modeler 18.2 (IBM Corp., Armonk, NY, USA) was used for the logistic regression analysis.
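As a minimal sketch of how a partial regression coefficient relates to the reported odds ratio, its 95% confidence interval, and the 0.05 significance threshold, the following uses the standard Wald approximation. This is an illustration under stated assumptions, not necessarily what SPSS Modeler computes internally: the function names are hypothetical, and the standard error is an invented value because the coefficient tables are not reproduced in this text.

```python
import math

Z_95 = 1.959964  # standard normal quantile for a two-sided 95% interval


def odds_ratio_with_ci(beta, se, z=Z_95):
    """Return (odds ratio, lower bound, upper bound) for one coefficient."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)


def wald_p_value(beta, se):
    """Two-sided p-value of the Wald test that the coefficient is zero."""
    return math.erfc(abs(beta / se) / math.sqrt(2.0))


# Coefficient back-calculated from an odds ratio of 1.824 (the value the
# paper reports for male sex in the NHID).
beta_male = math.log(1.824)
se_male = 0.15  # hypothetical standard error, NOT taken from the paper

or_male, ci_low, ci_high = odds_ratio_with_ci(beta_male, se_male)
is_significant = wald_p_value(beta_male, se_male) < 0.05
```

A predictor is conventionally called significant at the 0.05 level when the 95% interval for its odds ratio excludes 1, which is equivalent to the Wald p-value falling below 0.05.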
3. Results
3.1. Participant Characteristics
The characteristics of the participants in the NHID (persons diagnosed with T2DM: 285; persons not diagnosed with T2DM: 665) and in the MCSD (persons diagnosed with T2DM: 124; persons not diagnosed with T2DM: 289) are shown in Table 2. Comparisons between the two groups (persons with and without a T2DM diagnosis) revealed differences within the NHID in the items “BMI”, “waist”, “antihypertensive drug”, and “lipid-lowering drug”, and within the MCSD in the items “sex” and “antihypertensive drug”.
3.2. Logistic Regression Analysis
The AUROC obtained by logistic regression analysis is shown in Figure 1. The AUROC was 0.680 for the NHID and 0.665 for the MCSD.
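For readers less familiar with the metric, the AUROC equals the probability that a randomly chosen diagnosed participant receives a higher predicted score than a randomly chosen non-diagnosed participant, with ties counted as one half. The sketch below computes it by direct pairwise comparison; the function name `auroc` and the toy labels and scores are illustrative assumptions, not data from the study.

```python
def auroc(labels, scores):
    """AUROC by pairwise comparison of positive and negative scores.

    labels: iterable of 0/1 outcomes (1 = T2DM diagnosis)
    scores: iterable of predicted probabilities in the same order
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a concordant pair
    return wins / (len(pos) * len(neg))


toy_auc = auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])  # 3 of 4 pairs concordant
```

On this scale, 0.5 corresponds to chance-level ranking and 1.0 to perfect separation, so values such as 0.680 and 0.665 sit in the range the paper describes as moderate.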
The results of the logistic regression analysis of the NHID and MCSD are shown in Table 3 and Table 4. In the NHID, the significant predictive factors were male sex, antihypertensive drug, lipid-lowering drug, and walking speed. The odds ratios for male sex and slow walking speed were 1.824 and 1.367, respectively, indicating positive associations with a T2DM diagnosis. In contrast, the odds ratios for antihypertensive drug (answer 2: No) and lipid-lowering drug (answer 2: No) were 0.506 and 0.608, respectively, indicating negative associations with a T2DM diagnosis. In the MCSD, the significant predictive factors were sex, antihypertensive drug, and eating within 2 h before bedtime. The odds ratios for male sex and eating within 2 h before bedtime (answer 2: No) were 1.794 and 3.046, respectively, indicating positive associations with a T2DM diagnosis, whereas the odds ratio for antihypertensive drug (answer 2: No) was 0.601, indicating a negative association. In summary, the predictive factors common to both datasets were male sex and antihypertensive drug, while the dataset-specific predictive factors were lipid-lowering drug and walking speed (NHID) and eating within 2 h before bedtime (MCSD).
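Several of the odds ratios above are reported for the answer “No”, which can make the direction of association harder to read. Assuming that each of these compares “No” (answer 2) against “Yes” (answer 1) as the reference category, which the text implies but does not state explicitly, the odds ratio for the opposite comparison is simply the reciprocal. The helper name below is hypothetical.

```python
def flip_reference(odds_ratio):
    """Odds ratio for the reverse comparison (e.g. Yes vs. No from No vs. Yes)."""
    if odds_ratio <= 0:
        raise ValueError("an odds ratio must be positive")
    return 1.0 / odds_ratio


# Odds ratios reported in the text for the answer "No"
reported_no = {
    "NHID antihypertensive drug": 0.506,
    "NHID lipid-lowering drug": 0.608,
    "MCSD antihypertensive drug": 0.601,
    "MCSD eating within 2 h before bedtime": 3.046,
}
as_yes = {name: flip_reference(value) for name, value in reported_no.items()}
```

Under that assumption, drug use corresponds to odds ratios of roughly 1.98, 1.64, and 1.66, while eating within 2 h before bedtime corresponds to roughly 0.33.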
4. Discussion
This study developed predictive models for T2DM using non-invasive parameters obtained from specific health checkups in a regional Japanese population. The models demonstrated moderate predictive ability, with an AUROC of 0.680 for individuals aged 40–74 years and 0.665 for those aged ≥75 years. Male sex and antihypertensive drug use were predictors in both age groups, while slow walking speed and lipid-lowering drug use were specific to participants aged 40–74 years. Among those aged ≥75 years, not eating within two hours before bedtime was an additional significant predictor.
The AUROC in our models was lower than in most previous studies. Heianza et al., using a Japanese cohort of 7654 non-diabetic individuals aged 40–75 years, reported that a non-invasive risk score based on age, sex, family history of diabetes, smoking status, and body mass index demonstrated moderate discriminative ability for five-year incident type 2 diabetes (AUROC 0.708) [13]. Nanri et al. developed a large-scale prediction model for T2DM using non-invasive and invasive parameters [8]. Their non-invasive model achieved an AUROC of 0.717 in the derivation cohort and 0.734 in the validation cohort. Additionally, an invasive model including fasting plasma glucose and HbA1c improved the AUROC to 0.893 and 0.882, respectively. The non-invasive risk model of Nanri et al. included sex, age, body mass index, waist circumference, hypertension, and smoking status. Although the specific set of predictors differed slightly, most of these variables were also incorporated into our models. Importantly, their validation dataset comprised approximately 12,500 participants, which is considerably larger than the sample size in our study. Additionally, Xu et al., using a Japanese population-based cohort of 10,986 individuals, reported that a non-invasive risk model incorporating sex, body mass index, family history of diabetes, and diastolic blood pressure demonstrated moderate discriminative ability for five-year incident type 2 diabetes (AUROC 0.643) [11]. Similarly, Kawasoe et al., using Japanese health checkup data from 31,084 participants, reported that a non-invasive risk prediction equation incorporating age, sex, body mass index, and hypertension demonstrated modest discriminative ability for five-year incident diabetes (AUROC 0.70) [14]. Previous studies and our results are summarized in Table 5.
In summary, our model was developed using a small sample, whereas the previous studies built their models on substantially larger cohorts with a limited set of predictive variables and, in most cases, achieved better predictive performance. Nevertheless, our study demonstrated that even with a relatively small and region-specific sample, a model based solely on easily obtainable health checkup questionnaire items can achieve moderate predictive performance. This finding emphasizes the potential contribution of our approach to the development of predictive models that reflect regional characteristics.
The logistic regression analyses revealed two predictive factors common to both the NHID and the MCSD: sex and antihypertensive drugs. This finding is consistent with previous studies that identified these variables as risk factors for diabetes [8]. Lipid-lowering drug use was identified as a predictor in the NHID but not in the MCSD. Dyslipidemia has been reported to increase the risk of diabetes among individuals aged 40–54, 55–64, and 65–74 years, but it was not a significant factor among those aged 75 years and older [15]. In the MCSD, not eating within two hours before bedtime was identified as a predictor. A previous study revealed that eating four meals per day is associated with a lower risk of type 2 diabetes compared with eating three meals per day [16]. This association was significant for individuals with a BMI < 25 kg/m² but not for those with a BMI ≥ 25 kg/m², and age-specific effects were not examined [17]. Although the present study considered the number of meals, it did not account for meal content. Several studies have explored the relationship between diabetes onset and meal composition, nutrient intake, and related factors. Morimoto et al. analyzed diabetes incidence and eating habits in farming villages in Nagano Prefecture, Japan, and found that higher intakes of vegetables, potatoes, seaweed, fruits, and soybean products were associated with a reduced risk of diabetes [16]. Kimura et al. reported that higher dietary fiber intake in the general Japanese population was linked to a lower risk of type 2 diabetes [18]. Collectively, these findings suggest that more detailed analyses of meal content and nutrient composition may reveal different risk factors across narrower age categories.
Our model was developed using data obtained from a specific region, which may limit its generalizability and model performance. In Japan, all municipalities maintain not only information comparable to that used in this study, but also resident-level administrative and socioeconomic data (e.g., income, occupation, household composition, and residential environment). Federated learning offers a privacy-preserving framework to train models collaboratively across municipalities while keeping resident-level data local, which may improve predictive performance [19,20]. Moreover, multi-region longitudinal data with outcome statuses may enable more precise prediction. For example, DeepTrace is a graph neural network (GNN)-based framework that leverages a contact network structure and transmission trajectories to identify superspreaders, while updating its estimates as new tracing information accumulates [21]. Similarly, modeling resident-level data, including annual health checkups, daily life indicators, and healthcare utilization, as longitudinal trajectories and updating risk estimates as observations accrue may further extend our approach toward more powerful predictive models. Realizing such multi-site, individual-level, trajectory-based modeling, however, requires addressing several practical challenges. Specifically, learning from multi-site, high-dimensional data poses well-recognized challenges, including structural heterogeneity and increased complexity in data processing and modeling [22]. Addressing these issues typically requires robust computing environments and well-designed data platforms.
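To make the federated idea concrete, the following is a minimal sketch of federated averaging applied to logistic regression: each municipality runs a few gradient steps on its own data, and only the resulting coefficients, weighted by local sample size, are shared and averaged. It is an illustration under simplifying assumptions (plain gradient descent, no secure aggregation or differential privacy, toy data), not the method of the cited works, and every function name and value is hypothetical.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def local_update(weights, rows, labels, lr=0.1, epochs=5):
    """A few epochs of batch gradient descent on one site's own data."""
    w = list(weights)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, y in zip(rows, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x))) - y
            for j, xj in enumerate(x):
                grad[j] += err * xj
        w = [wi - lr * g / len(rows) for wi, g in zip(w, grad)]
    return w


def federated_average(site_weights, site_sizes):
    """Average coefficient vectors, weighting each site by its sample size."""
    total = sum(site_sizes)
    dim = len(site_weights[0])
    return [
        sum(w[j] * n for w, n in zip(site_weights, site_sizes)) / total
        for j in range(dim)
    ]


def train_federated(sites, dim, rounds=20):
    """sites: list of (rows, labels); only coefficients leave each site."""
    global_w = [0.0] * dim
    for _ in range(rounds):
        updates = [local_update(global_w, rows, labels) for rows, labels in sites]
        sizes = [len(rows) for rows, _ in sites]
        global_w = federated_average(updates, sizes)
    return global_w


# Toy data: each row is [intercept, binary risk factor]; label 1 = diagnosed.
site_a = ([[1, 0], [1, 1], [1, 1], [1, 0]], [0, 1, 1, 0])
site_b = ([[1, 1], [1, 0]], [1, 0])
global_weights = train_federated([site_a, site_b], dim=2)
```

In this toy setup the coefficient of the binary risk factor becomes positive at the coordinating server even though neither site discloses individual rows.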
The present study has several limitations. First, the data obtained from the specific health checkup questionnaires are self-reported. Certain question items had no clear standards for frequency or quantity; these include “My walking speed is faster compared with persons of the same sex who are roughly my age”; “Compared with others, I eat at a faster speed”; and “When sleeping, I get adequate rest.” This lack of clear response standards could introduce subjectivity and potential bias into the reported data. Second, this study was based on data from a single municipality in Hokkaido, Japan, and the findings may not be generalizable to other regions. Regional variation in seasonal conditions, dietary patterns, and exercise habits may result in different contextual meanings for identical questionnaire responses. For instance, the types of foods typically consumed within two hours before bedtime may vary across regions. Finally, this study developed its predictive models from a relatively small sample and achieved only moderate predictive performance. Incorporating resident information held by each municipality into collaborative healthcare learning frameworks, while preserving privacy, has the potential to enhance model performance and support the development of more robust predictive models.
5. Conclusions
This study developed predictive models for T2DM using community-based health checkup and claims data. Models based on non-invasive parameters achieved moderate performance (AUROC = 0.680 and 0.665). Key predictive factors included sex, antihypertensive drugs, lipid-lowering drugs, walking speed, and eating habits before bedtime. These findings suggest the potential for more advanced utilization of routinely collected health checkup data in regional settings.
References
- 1. International Diabetes Federation. IDF Diabetes Atlas, 11th ed.; International Diabetes Federation: Brussels, Belgium, 2024. Available online: https://diabetesatlas.org (accessed on 5 November 2025).
- 2. Ministry of Health, Labour and Welfare. National Health and Nutrition Survey 2023. Available online: https://www.mhlw.go.jp/bunya/kenkou/kenkou_eiyou_chousa.html (accessed on 5 November 2025).
- 3. World Health Organization. Global Action Plan for the Prevention and Control of Noncommunicable Diseases 2013–2020; WHO: Geneva, Switzerland, 2013.
- 4. Ahmad E., Lim S., Lamptey R., Webb D.R., Davies M.J. Type 2 diabetes. Lancet 2022, 400, 1803–1820. doi: 10.1016/S0140-6736(22)01655-5. PMID: 36332637.
- 5. Bommer C., Heesemann E., Sagalova V., Manne-Goehler J., Atun R., Bärnighausen T., Vollmer S. The global economic burden of diabetes in adults aged 20–79 years: A cost-of-illness study. Lancet Diabetes Endocrinol. 2017, 5, 423–430. doi: 10.1016/S2213-8587(17)30097-9. PMID: 28456416.
- 6. Trikkalinou A., Papazafiropoulou A.K., Melidonis A. Type 2 diabetes and quality of life. World J. Diabetes 2017, 8, 120–129. doi: 10.4239/wjd.v8.i4.120. PMID: 28465788. PMCID: PMC5394731.
- 7. Ministry of Health, Labour and Welfare. About a Specific Medical Checkup and Specific Health Guidance. 2025. Available online: https://www.mhlw.go.jp/stf/seisakunitsuite/bunya/0000161103.html (accessed on 5 November 2025).
- 8. Nanri A., Nakagawa T., Kuwahara K., Yamamoto S., Honda T., Okazaki H., Uehara A., Yamamoto M., Miyamoto T., Kochi T. Development of risk score for predicting 3-year incidence of type 2 diabetes: Japan Epidemiology Collaboration on Occupational Health Study. PLoS ONE 2015, 10, e0142779. doi: 10.1371/journal.pone.0142779. PMID: 26558900. PMCID: PMC4641714.
