Integrated Cross-Platform Analysis Reveals Candidate Variants and Linkage Disequilibrium-Defined Loci Associated with Osteoporosis in Korean Postmenopausal Women
Su Kang Kim, Seoung-Jin Hong, Seung Il Song, Jeong Keun Lee, Gyutae Kim, Byung-Joon Choi, Suyun Seon, Seung Jun Kim, Ju Yeon Ban, Sang Wook Kang

TL;DR
This study combines genetic data from two platforms to find new genetic factors linked to osteoporosis in Korean postmenopausal women.
Contribution
The study introduces an integrated approach using dual exome arrays and machine learning to identify novel osteoporosis-related loci.
Findings
111 overlapping SNPs in 70 genes were identified through cross-platform analysis.
LD-block analysis revealed 10 blocks with significant SNPs, including four on chromosome 12.
Candidate genes like CCDC92, DYNC2H1, and NQO1 were highlighted for further validation.
Abstract
Background: Osteoporosis is highly prevalent in postmenopausal women, yet genome-wide association studies often miss disease-relevant variants because of incomplete single nucleotide polymorphism (SNP) coverage and platform-specific limitations. We aimed to identify genetic contributors to osteoporosis risk by integrating two exome-based genotyping platforms with multilayer analytic approaches. Methods: We analyzed extreme osteoporosis phenotypes in Korean postmenopausal women from the Korean Genome and Epidemiology Study (KoGES) Ansan–Anseong cohorts using the Illumina Infinium HumanExome BeadChip and the Affymetrix Axiom Exome Array. After standard quality control, single-SNP logistic regression, cross-platform overlap analysis, and three machine-learning models were applied. Predicted functional impact was evaluated using multiple in silico algorithms and conservation scores.…
Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.
Click any figure to enlarge with its caption.
Figure 1
Figure 2
Figure 3
Figure 4
Figure 5- —National Research Foundation of Korea (NRF)
- —Korea government (MSIT)
- —National Biobank of Korea, the Centers for Disease Control and Preven-tion, Republic of Korea
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsGenetic Associations and Epidemiology · Bone health and osteoporosis research · Bone Metabolism and Diseases
1. Introduction
Osteoporosis is a disease characterized by weakened bones that become easily fractured due to low bone mineral density (BMD) and microstructural changes. It is particularly prevalent among postmenopausal women and the older adults. The World Health Organization (WHO) defines osteoporosis as a condition where bone mineral density is −2.5 or lower when comparing bone density measurements to the average bone density of young adults [1]. Osteoporosis is a widespread health issue globally, affecting over 200 million people aged 50 and older. According to the Korea Disease Control and Prevention Agency, 37.3% of women and 7.5% of men over the age of 50 in South Korea have been diagnosed with osteoporosis (www.kdca.go.kr). The prevalence of osteoporosis increases rapidly with age, affecting approximately two-thirds of women and one-fifth of men aged 70 and older.
Despite being a common condition, osteoporosis can have severe, even fatal, consequences. The most significant risk associated with osteoporosis is fractures, which can lead to serious complications. Hip fractures, in particular, often result in mobility loss and prolonged bed rest, increasing the risk of complications such as bedsores and pneumonia. With a mortality rate of approximately 20% within one year following a hip fracture, osteoporosis should therefore be regarded as a serious health threat [2].
The development of osteoporosis is influenced by a wide range of factors, including age, menopause, lifestyle habits such as smoking, alcohol consumption, and physical inactivity, as well as nutritional deficiencies and the use of certain medications [3]. Among these risk factors, genetic predisposition has been increasingly recognized as a major contributor. Numerous studies have shown that genetic influences substantially affect both the onset of osteoporosis and the reduction in bone mineral density. Twin studies, in particular, have indicated that up to 60–80% of the variance in bone density can be attributed to genetic factors, underscoring the importance of genetic predisposition in the pathogenesis of osteoporosis risk or low BMD [4]. To date, most genetic research on osteoporosis risk or low BMD has been conducted using genome-wide association studies (GWAS), which are widely used to identify genetic variants associated with complex diseases [5]. Previous GWAS investigating bone mineral density (BMD) in Korean populations have mainly focused on DXA-based phenotypes and used genotyping or multi-ethnic meta-analyses to improve genome coverage, but have typically relied on single genotyping platforms [6,7]. However, incomplete SNP coverage, reduced power to detect rare variants, and the limited coverage of exonic variants and the dependence on a single platform can lead to missed trait-relevant loci [8].
To address these limitations, this study employed two complementary genotyping platforms: the Illumina Exome Chip and the Affymetrix Axiom Exome Array. Using both arrays expands SNP coverage through distinct probe sets, enhances the detection of rare variants, enables cross-validation of significant loci to reduce false positives and negatives, and mitigates platform-specific biases, thereby improving the reliability and reproducibility of GWAS results. The Illumina Exome Chip and the Affymetrix Axiom Exome Array, commonly used platforms for genetic analysis, each possess distinct advantages and limitations. The Illumina Exome Chip is tailored to rare coding variants such as missense and nonsense mutations, but its design—based primarily on European-ancestry exomes—limits representation in non-European populations, and array-based genotyping constraints result in ~20–30% of targeted variants failing probe design or accurate genotyping. Moreover, statistical power for detecting low-frequency and especially ultra-rare variants remains limited [9,10,11]. In contrast, the Affymetrix Axiom Exome Array exhibits high accuracy and reproducibility across diverse variants, yet shows a marked decline in positive predictive value for ultra-rare heterozygous calls, leading to false positives [12,13].
While analyzing two genotyping platforms simultaneously is highly valuable, single-SNP association results can appear inconsistent—even when reflecting the same underlying biological signal—due to differences in variant density, probe design, and tagging efficiency. To address these limitations, this study incorporated LD block–level analysis to group correlated SNPs into coherent genomic loci [14,15]. This approach enables key SNPs from both platforms to be integrated within the same LD-defined regions, thereby improving the robustness of cross-platform validation, particularly for rare or poorly tagged variants.
Therefore, the aim of this study is to simultaneously analyze data obtained from the Illumina Infinium HumanExome BeadChip and the Affymetrix Axiom Exome Array and to cross-validate the findings from each platform in order to more accurately identify genes associated with low BMD and osteoporosis risk. The ultimate goal is to elucidate the genetic underpinnings of low BMD and osteoporosis risk, thereby providing foundational insights that can inform future efforts in disease prediction, diagnosis, and the development of personalized treatment strategies.
2. Materials and Methods
2.1. Study Subjects
This study was conducted using data from the third phase of the Anseong and Ansan cohorts of the Korean Genome and Epidemiology Study (KoGES), which included a total of 7077 subjects. Figure 1 presents a flowchart outlining the subject selection and analysis process. Initially, men and premenopausal women were excluded, and only postmenopausal women were included in the analysis. In this study, we used bone status indices based on Sunlight OmniSense quantitative ultrasound (QUS) measurements to define osteoporosis risk. Previous studies have reported that QUS measurements are significantly correlated with DXA measurements, particularly in postmenopausal women [16]. T- and Z-scores were derived from speed-of-sound (SOS) measurements [17]. Specifically, low BMD was defined as participants with a T-score < −2.5 and a Z-score < −2.0, as measured by QUS at the midshaft tibia or distal radius. This operational definition was used to maximize phenotypic contrast and differs from the WHO/ISCD clinical diagnostic criteria [1], which rely on central DXA measurements at the lumbar spine or hip. Healthy controls were defined as participants with a QUS-derived T-score ≥ −1.0, and to further enhance separation between cases and controls, participants with T-scores between −2.5 and −1.0 were excluded from the analysis. To isolate the genetic influence on low BMD, subjects with known risk factors such as body weight < 58 kg, body mass index (BMI) < 19, alcohol consumption > 10 mL/day, smoking, long-term steroid use, hormone therapy, and a medical history of fracture or arthritis were also excluded [1,18].
The final selected participants were as follows:
- •Illumina Infinium HumanExome BeadChip (Illumina, Inc., San Diego, CA, USA) group: 98 healthy controls and 191 low BMD cases (total n = 289).
- •Affymetrix Axiom Exome Array group: 99 healthy controls and 194 low BMD cases (total n = 293).
From the Illumina Infinium HumanExome BeadChip, 30,538 single nucleotide polymorphisms were analyzed, and a total of 1200 SNPs met the quality control (QC) thresholds of p < 0.05, minor allele frequency (MAF) > 0.05, and Hardy–Weinberg equilibrium (HWE) > 0.01, which were applied to minimize potential false-positive associations [19]. From the Affymetrix Axiom Exome Array (Thermo Fisher Scientific Inc., Waltham, MA, USA), 242,901 SNPs were examined, and 15,703 SNPs met the same criteria. A cross-platform comparison revealed that 111 SNPs across 70 genes were commonly identified as statistically significant in both datasets. This study was approved by the Institutional Review Board of Dankook University (IRB No. 2018-08-004 Date: 10 September 2018). As indicated in the IRB documents, this study was classified as a retrospective cohort/genetic data–based analysis, and therefore written informed consent was waived. The IRB approved this exemption and included a “Written Consent Waiver Statement” in the official documentation. Accordingly, no participant consent form was required, and none was collected.
2.2. Statistical Analysis
Group comparisons between low BMD cases and healthy controls were carried out using independent t-tests implemented in Python 3.11.13. To identify genetic variants potentially linked to low BMD, logistic regression models were constructed using both SNP & Variation Suite (Golden Helix, Bozeman, MT, USA) and Persistent Linked INtegrated Kit version 1.9 (PLINK) [20]. Visualization of genome-wide association signals was performed through Manhattan and quantile-quantile (Q-Q) plots using Python 3.11.13. The chromosomal positions of significant SNPs were mapped using the PhenoGram visualization tool (http://visualization.ritchielab.org [accessed on 02 October 2018]) [21]. To reduce model-specific bias and enhance the confidence of common features, three models—linear discriminant analysis (LDA), random forest, and XGBoost model—were used to capture both linear (additive) effects and non-linear SNP interactions, thereby providing complementary information for variant prioritization. Machine learning analyses were applied as an auxiliary and exploratory approach to prioritize candidate SNP features associated with case status (OP = 1) versus controls (OP = 0). Linear discriminant analysis (LDA) was used to derive a one-dimensional discriminant component and to rank SNPs based on the absolute values of discriminant coefficients. Random forest and XGBoost classifiers were trained using an 80/20 train–test split, and SNPs were ranked by model-based feature importance scores. The top-ranked SNPs (top 10) from each model were visualized using PCA (tree-based models) and LDA projection (LDA), to illustrate separation patterns. To assess the potential functional impact of non-synonymous variants, we obtained functional prediction scores using the dbNSFP database [22], including the Sorting Intolerant from Tolerant (SIFT), Polymorphism Phenotyping v2, HumDiv-trained model/HumVar-trained model (PolyPhen-2 HDIV/HVAR), Protein Variation Effect Analyzer (PROVEAN), Rare Exome Variant Ensemble Learner (REVEL), and Combined Annotation-Dependent Depletion (CADD). Thresholds for predicting deleteriousness were set as follows: SIFT score < 0.05 [23], PolyPhen-2 HDIV/HVAR score > 0.85 for “probably damaging” [24], PROVEAN score < –2.5 [25], REVEL score > 0.5 [26], and CADD phred ≥ 20 [27]. To explore the evolutionary conservation of the overlapping SNPs, conservation scores were obtained using Genomic Evolutionary Rate Profiling (GERP++) [28], Phylogenetic p-value (phyloP) [29], and Phylogenetic Hidden Markov Model Conservation Score (phastCons) [30]. To explore protein–protein interaction networks, we used the Search Tool for the Retrieval of Interacting Genes/Proteins (STRING) v12.0 with a minimum required interaction score of 0.400, and clustering was performed using Markov Cluster Algorithm (MCL) clustering (inflation parameter = 2.0) (https://string-db.org [accessed on 16 August 2025]) [31]. Functional enrichment and pathway analysis of the identified genes were performed using Database for Annotation, Visualization, and Integrated Discovery (DAVID; https://davidbioinformatics.nih.gov [accessed on 16 August 2025]) [32]. Post hoc power analysis was conducted for selected SNPs highlighted in the main association results to evaluate whether the available sample size was sufficient to detect the observed effect sizes under a predefined genome-wide significance threshold (α = 1 × 10^−5^), using Python (version 3.11.13). Selection of SNPs was based on observed effect sizes (OR > 1.3 or < 0.7) and minor allele frequency (MAF) > 0.05. Minimum detectable odds ratios at 80% power were calculated using a Wald test–based normal approximation under an additive logistic regression model, incorporating sample size, case–control ratio, and minor allele frequency.
2.3. Linkage Disequilibrium Block Analysis and SNP Characterization
Genotype data obtained from the Illumina HumanExome BeadChip and the Affymetrix Axiom Exome Array were combined into a single working dataset. Before merging, both datasets underwent standard quality control using PLINK (version 1.9), including filters for call rate, minor allele frequency, and Hardy–Weinberg equilibrium. Because the two platforms use slightly different probe designs and allele encodings, we harmonized allele formats (including strand alignment) and updated SNP identifiers and genomic coordinates to ensure that variants aligned correctly across platforms. After preprocessing, pairwise linkage disequilibrium (LD) values (r^2^) were calculated using PLINK. LD blocks were then defined using a graph-based approach in which SNPs were treated as nodes and pairwise LD relationships (r^2^ ≥ 0.8) were represented as edges. LD blocks were identified as connected components within this SNP network, and SNP-to-block assignments were recorded for downstream block-level analyses. To focus on regions with potential biological relevance, we retained only LD blocks that included at least one SNP with an association p-value below 1 × 10^−4^. Within each selected block, the SNP with the smallest p-value was designated as the lead variant. For visualization and quality assessment, LD heatmaps were generated in R (version 4.3.3) by constructing block-wise r^2^ matrices, ordering SNPs by physical position, and visualizing within-block LD structure. This approach allowed inspection of LD consistency and block integrity across the two genotyping platforms. For each block, we also generated a LocusZoom-style plot to visualize the local association pattern around the lead SNP.
3. Results
3.1. Characteristics of Study Subjects
Age was significantly higher in the low BMD group, although age at menopause showed no significant difference. Alcohol and calcium intake were also comparable between groups. Notably, both body weight and BMI were significantly higher in the low BMD group. Moreover, while distal radius speed of sound (DR-SOS), distal radius T-score (DR-T), and distal radius Z-score (DR-Z) did not differ between groups, midshaft tibia speed of sound (MT-SOS), midshaft tibia T-score (MT-T), and midshaft tibia Z-score (MT-Z) demonstrated significant differences (Table 1).
3.2. Q-Q Plots and Manhattan Plots of Logistic Regression
Figure 2 shows the Q-Q plots and Manhattan plots from genome-wide association analyses using the Illumina Infinium HumanExome BeadChip and the Affymetrix Axiom Exome Array. Neither platform identified any SNPs meeting the genome-wide significance threshold (p < 5 × 10^−8^). In the Illumina Infinium HumanExome BeadChip data, one SNP exceeded the suggestive significance threshold (p < 1 × 10^−5^), and several SNPs displayed slight deviations from the expected line in the upper tail below the suggestive threshold, although the overall distribution closely matched the expected line. In the Affymetrix Axiom Exome Array data, association signals were generally lower, with few SNPs approaching or reaching the suggestive threshold, and distinct clusters were limited. The genomic inflation factors indicated minimal inflation for both platforms (λGC = 1.007 and 1.001, respectively).
3.3. Overlapping SNPs Across Two Genotyping Platforms (111 SNPs Across 70 Genes)
Figure 3a presents a chromosome map showing the locations of genes in which SNPs were commonly detected across both platforms, with each dot representing a distinct gene. Genes with the lowest p-values are highlighted in red. Table 2 lists the number of overlapping SNPs and their corresponding genes for each chromosome. Genes without overlapping SNPs were excluded. Notably, higher numbers of overlapping SNPs and genes were observed on chromosomes 1, 3, 6, 9, 11, and 12.
Figure 3b shows a scatter plot comparing the p-values (−log_10_ transformed) of SNPs analyzed from the same samples on the Illumina Infinium HumanExome BeadChip and the Affymetrix Axiom Exome Array. The red and blue dotted lines represent significance thresholds at p = 0.001 and p = 1 × 10^−5^, respectively. No SNPs reached the genome-wide significance threshold (p < 5 × 10^−8^) or the suggestive threshold (p < 1 × 10^−5^) on either platform.
3.4. Top SNP Selection via Multiple Machine Learning Models
In this study, we applied three machine learning techniques—LDA, random forest, and XGBoost—to assess the impact of SNPs with shared significance across two genome analysis platforms (Illumina Infinium HumanExome BeadChip and Affymetrix Axiom Exome Array) on osteoporosis risk. Table 3 presents the top 10 SNPs identified by each model for both platforms. SNPs with the highest coefficients or importance scores are listed for each model, enabling the identification of genes and SNPs common to multiple platforms and models. LDA results showed a relatively similar top SNP composition between platforms, whereas random forest and XGBoost yielded more divergent top SNP lists. Blank entries in the table indicate SNPs not directly linked to protein-coding genes.
3.5. Predicted Deleterious Non-Synonymous SNPs Identified Across Both Platforms by Multiple in Silico Algorithms
We analyzed overlapping SNPs identified across both platforms to identify SNPs with potential functional impact on proteins. Table 4 shows variants predicted to have deleterious or damaging effects by at least three of the following five algorithms (SIFT, PolyPhen2 HDIV/HVAR, PROVEAN, REVEL, and CADD). The prioritized SNPs were located in genes associated with including metabolic, cytoskeletal, and signal-transduction related genes, including ARMS2, CCDC92, NQO1, ZNF510, PTPRB, and DYNC2H1. Notably, CCDC92, NQO1, ZNF510, PTPRB and DYNC2H1 showed CADD phred scores ≥ 20, suggesting that these variants are among the top 1–5% of deleterious substitutions in the human genome. Several mutations, such as ARMS2 p.Ala69Ser and DYNC2H1 p.Arg2871Pro, were consistently predicted to be damaging by four or more algorithms.
3.6. Conservation Analysis
We performed a conservation analysis of overlapping SNPs identified across both platforms using GERP++, phyloP, and phastCons scores to evaluate their evolutionary conservation. As shown in Table 5, several SNPs, including DYNC2H1 (rs589623), NQO1 (rs1800566), and ZNF510 (rs2289651), exhibited high GERP++ scores (>2.0), indicating strong evolutionary constraint. PhyloP scores (both phyloP1 and phyloP4) were also markedly elevated for NQO1 (9.295 and 8.644, respectively) and DYNC2H1 (4.414 and 0.676), suggesting that these variants occur at highly conserved genomic positions across multiple species. Similarly, phastCons scores for DYNC2H1 and NQO1 approached 1.0, indicating strong probability of being in a conserved element.
3.7. Post Hoc Power and Minimum Detectable Effect Size Analysis
Post hoc power analysis was conducted to assess whether the sample size used in the SNP association analysis was sufficient to detect the observed effect sizes. Under a suggestive significance threshold (α = 1 × 10^−5^), most overlapping SNPs showed post hoc power values exceeding the predefined threshold. However, rs2584021 (PTPRB) exhibited low post hoc power (<0.20), suggesting limited statistical power for these variants. Among the non-redundant SNPs, rs2076212 (PNPLA3) was the only variant showing suggestive significance; however, its post hoc power was low (0.16), indicating that the observed association should be interpreted with caution due to the limited sample size. Nevertheless, although several SNPs showed statistically significant associations, none exceeded the minimum detectable odds ratio required for 80% power, indicating that the study remains underpowered to reliably detect small-to-moderate effect sizes given the modest sample size.
3.8. Protein–Protein Interactions and Functional Enrichment Analysis
We performed protein–protein interaction (PPI) analysis using the overlapping genes identified on both platforms. As shown in Figure 4a, the STRING network consisted of 65 nodes and 16 edges, with an average node degree of 0.492 and an average local clustering coefficient of 0.221. The PPI enrichment p-value was 0.122, indicating that the network did not contain significantly more interactions than would be expected by chance. Although not statistically enriched, several nodes showed multiple connections, which may imply limited functional grouping (Table 6). After MCL clustering, Figure 4b revealed functionally grouped modules. The largest cluster was enriched for genes involved in cytoskeletal organization and cellular projection–related processes. To further explore functional enrichment, subsets of genes were analyzed using STRING Gene Ontology categories. Figure 4c shows a cluster enriched for plasma membrane-bounded cell projection cytoplasm (GO:0032838; FDR = 9.41 × 10^−11^), including several microtubule-associated genes. Figure 4d shows another cluster related to distal axon (GO:0150034; FDR = 2.71 × 10^−10^).
3.9. LD Block Analysis and Cross-Platform Locus Characterization
Across the genome, we found ten LD blocks that included at least one SNP with p < 1 × 10^−4^. These blocks were spread across chromosomes 1, 2, 3, 5, 10, 11, and 12. Their structures were not uniform: some blocks consisted of only two SNPs, whereas others formed broader clusters containing more than 20 correlated variants. The list of SNPs included in each LD block is provided in Table A1. Chromosome 12 was particularly notable, as it contained four independent blocks corresponding to the NAV2, BICD1, CCDC92, and ZNF664 loci. To characterize these blocks in more detail, we visualized pairwise LD patterns using cross-platform LD heatmaps (Figure 5A). These heatmaps highlight how SNPs from the Illumina and Affymetrix exome arrays clustered together within each region, revealing both simple and complex LD structures depending on the locus. In parallel, regional association plots (Figure 5B) were generated to show the distribution of −log_10_(p) values surrounding each lead SNP, allowing us to examine the strength and breadth of association signals within each block.
4. Discussion
This study examined the genetic influence on the development of osteoporosis risk and low BMD in postmenopausal women. Table 1 presents the characteristics of the study subjects. Although the age at menopause was similar between groups, the prevalence of osteoporosis may increase with advancing age after menopause as with a previous study [33]. Underweight is generally associated with lower BMD and a higher risk of fracture; therefore, underweight individuals were excluded from this study [34]. Interestingly, the low BMD group showed higher body weight and BMI, which is contradictory to the general trend. However, both groups were classified as obese according to BMI criteria [35]. This finding may be related to the observation that DR-SOS, DR-T, and DR-Z did not differ significantly between groups, whereas MZ-SOS, MZ-T, and MZ-Z did. The radius is less influenced by walking or daily loading, whereas the tibia supports body weight, making bone loss in the tibia more pronounced. This functional difference may partly explain the variations in body weight and BMI [36]. Overall, postmenopausal women tend to show marked bone loss in the cortical midshaft tibia, whereas differences in the cancellous distal radius are relatively minimal or statistically insignificant.
Logistic regression analysis revealed that no SNPs met the genome-wide significance threshold [37]. However, one SNP exceeded the suggestive significance threshold (p < 1 × 10^−5^) on the Illumina Infinium HumanExome BeadChip. This SNP, rs2076212 of the PNPLA3 gene, is located within the coding region and represents a missense mutation (G115C). It was the only SNP to show statistical significance with an FDR of 0.016 (p < 0.05). PNPLA3 is a gene related to metabolism [38], and a previous study reported its association with osteoporosis [39]. That study also utilized KoGES data but included men, whereas the present study targeted only postmenopausal women. Notably, PNPLA3 is absent from the Affymetrix Axiom Exome Array, meaning it would not have been detected using Affymetrix alone. This highlights the advantage of expanded analysis using both platforms.
A comparatively larger number of overlapping SNPs and genes were observed on chromosomes 1, 3, 6, 9, 11, and 12. Many of these genes are involved in immune response (e.g., HLA-DOA) [40], cell signaling (e.g., DISC1, FGF12) [41,42], and bone metabolism (e.g., COL6A5) [43].
LDA revealed that certain genes, such as ATAD5, DNHD1, TNFSF15, and CCDC92, were recurrent across both platforms, suggesting that these variants may contribute to the discrimination between disease and control groups. In contrast, random forest and XGBoost model showed relatively low rates of common SNPs across platforms, and genes not identified by LDA appeared at the top of their rankings. This discrepancy likely reflects the fact that LDA selects variables contributing to classification based on linear discriminant criteria, whereas random forest and XGBoost consider nonlinear interactions. These findings indicate that the application of diverse models can be beneficial, since linear and non-linear methods may identify independent subsets of variants. Variants consistently detected across models are thus likely to represent particularly robust candidates [44]. SNPs that recur across platforms are more likely to be associated with disease and, therefore, should be prioritized in further analysis. Applying multiple models in parallel to identify overlapping variant candidates can provide more robust and reliable results [45].
To further narrow down the variants with potential biological relevance, we incorporated multiple in silico prediction tools. This approach allowed us to focus on SNPs that consistently showed signs of functional impact rather than relying on any single algorithm. Among the 110 candidates, six variants stood out by receiving deleterious predictions from several independent models, suggesting that they may meaningfully influence bone-related pathways. In particular, CCDC92 p.Ser70Cys and DYNC2H1 p.Arg2871Pro demonstrated strong and concordant signals, including high CADD phred scores (>20), which strengthens the likelihood that these variants play a functional role. Taken together, these findings present how integrating diverse computational assessments can help highlight variants that deserve closer biological or mechanistic investigation.
Conservation analysis provided additional support for prioritizing variants with potential functional significance. Variants with high GERP++, phyloP, and phastCons scores—particularly those in DYNC2H1 and NQO1—showed strong evolutionary constraint, reinforcing the pathogenic signals suggested by other in silico predictions. Among the prioritized SNPs, ZNF510 (rs2289651), ARMS2 (rs10490924), DYNC2H1 (rs589623), CCDC92 (rs11057401), PTPRB (rs2584021), and NQO1 (rs1800566) were consistently predicted to have deleterious effects. Several of these variants also ranked highly in our machine-learning models: CCDC92 in LDA, DYNC2H1 in XGBoost, and PTPRB in random forest. Together, these convergent lines of evidence suggest that these variants may contribute meaningfully to osteoporosis susceptibility in our cohort. In contrast, variants with weak statistical associations or low conservation scores may act through mechanisms less tied to evolutionary constraint or may reflect population-specific genetic variation.
Although the overall PPI enrichment was not statistically significant (PPI enrichment p > 0.05), a small cytoskeleton-related cluster could still be observed. Cluster analysis revealed an over-representation of genes related to cytoskeletal organization and cell projections, which may reflect the importance of mechanosensing and cell architecture regulation in bone remodeling. In particular, significant enrichment in the cytoplasmic region of membrane-bound cell projections and distal axon terminals suggests that osteogenic regulation may be mediated through pathways related to microtubule dynamics and cellular projection processes [46,47,48].
In this study, we investigated the genetic determinants of osteoporosis risk and low BMD in postmenopausal women using two complementary genotyping platforms and a multilayered analytic approach that included statistical testing, machine learning, in silico functional prediction, and protein–protein interaction analysis. Although genome-wide significance was not achieved, the suggestive association of rs2076212 (PNPLA3) highlights the benefit of integrating two platforms, as this variant was present only on the Illumina array and would not have been captured by a single-platform analysis.
Although PNPLA3 was the only variant to exceed the suggestive significance threshold (p < 1 × 10^−5^) in the single-SNP analyses across both platforms, integrating the datasets at the LD-block level revealed several additional loci with similarly strong association patterns. Because the Illumina and Affymetrix arrays differ substantially in variant coverage, many variants present on one platform are entirely absent from the other, making single-SNP testing inherently incomplete. By grouping correlated variants into LD-defined genomic regions using SNPs from both arrays, we were able to recover shared genetic signals that were not detectable from individual p-values alone. This LD-based approach noticeably improved the sensitivity of our analysis and helped reveal additional loci with potential biological relevance. In particular, four distinct regions on chromosome 12—NAV2, BICD1, CCDC92, and ZNF664—became apparent only when LD patterns were examined collectively across the two platforms, even though no individual SNP within these regions met strict significance thresholds on its own. These observations illustrate how LD-block integration can bridge platform-specific gaps, reduce discrepancies between arrays, and uncover extra genetic signals that may play a role in osteoporosis susceptibility.
A number of overlapping SNPs were repeatedly detected across chromosomes enriched in genes related to immune regulation, cell signaling, and bone metabolism. Among the machine-learning models, LDA showed greater cross-platform concordance than random forest and XGBoost model, suggesting that SNPs selected by multiple models and multiple platforms represent high-confidence candidates. Six variants (ARMS2, CCDC92, NQO1, ZNF510, PTPRB, and DYNC2H1) were consistently predicted to be functionally deleterious by ≥3 in silico tools, of which CCDC92 p.Ser70Cys and DYNC2H1 p.Arg2871Pro also showed high CADD scores and strong contributions in the machine-learning models. Furthermore, conservation scores supported the potential pathogenicity of DYNC2H1 and NQO1, reinforcing the likelihood of their involvement in osteoporosis risk. Interestingly, CCDC92 is known to be associated with adipose tissue biology and metabolic traits, whereas DYNC2H1 encodes a motor protein involved in intracellular transport and cytoskeletal regulation, suggesting that these genes may influence bone remodeling indirectly through mechanosensing and metabolic pathways [49,50,51]. In addition, NQO1 p.Pro187Ser, a variant involved in the oxidative stress response, may affect bone homeostasis through redox-mediated mechanisms [52]. PTPRB encodes a receptor-type tyrosine phosphatase that regulates angiogenesis and vascular integrity by modulating VEGFR2 signaling in endothelial cells [53]. These functional interpretations are consistent with our STRING clustering results, which revealed enrichment of cytoskeleton- and membrane-projection–related processes, indicating that mechanosensing and cell architecture could be key mediators of bone loss in postmenopausal women. In addition to the model-based results, the LD-block analysis offered an independent layer of evidence for several of these regions. Genes such as NAV2, BICD1, CCDC92, and ZNF664 appeared as clearly defined LD loci, even though none of the individual SNPs within these regions met genome-wide significance on their own. Notably, several genes highlighted by the machine-learning and in silico analyses—such as CCDC92, NQO1, and DYNC2H1—were also situated within these LD-structured regions. This overlap indicates that different analytic approaches were pointing to the same genomic areas, which strengthens confidence in their biological relevance. By grouping correlated variants into shared LD blocks, this approach also captured regional signals that would have been missed by single-SNP testing alone, allowing variants with modest p-values to be interpreted within a broader genetic context [54,55]. Together, these findings show how LD-block integration complements model-based prioritization and helps reveal additional loci that may contribute to osteoporosis risk.
Although this study was exploratory in nature, the identified candidate genes may have several potential clinical implications. First, when integrated with clinical and demographic factors, these genes may contribute to genetic risk assessment and stratification for low bone mineral density in postmenopausal women. Second, the associated genes are involved in biological pathways related to bone metabolism, mechanosensing, metabolic regulation, and cytoskeletal organization, making them promising candidates for targeted functional studies. Finally, the findings of this study may serve as a basis for hypothesis generation in future research aimed at evaluating biomarkers or therapeutic targets. However, additional replication studies and experimental validation are required before these findings can be translated into clinical practice or drug development.
This study combined two genotyping platforms and multiple analytic approaches—including machine learning, in silico prediction, conservation scoring, and network analysis—to strengthen the identification of biologically meaningful variants. Nevertheless, several limitations should be acknowledged. The final analytic sample size after quality control was modest, which may have limited statistical power to detect variants with modest effect sizes. Post hoc power analyses showed that many overlapping SNPs exceeded the predefined post hoc power threshold; however, several variants exhibited low post hoc power, and some statistically significant signals were observed despite limited power, indicating that these associations should be interpreted with caution. Overall, the modest sample size limited the study’s ability to reliably detect small-to-moderate genetic effects. This was further supported by minimum detectable effect size analyses, which showed that none of the observed associations exceeded the odds ratio required to achieve 80% power, emphasizing the need for validation in larger cohorts. Our study was conducted exclusively in Korean postmenopausal women. While this design inevitably limits the direct generalizability of the findings to other ancestral groups and does not fully exclude population-specific effects, the relative homogeneity of the cohort also represents a practical strength from a genetic epidemiology perspective. By reducing background genetic and environmental variability, this homogeneity can facilitate clearer detection of genetic signals within a defined population.
To address limitations related to the modest sample size and platform-specific SNP coverage, we integrated results from the two genotyping platforms using an LD-block–based framework. By analyzing correlated variants as regional units rather than relying solely on single-SNP tests, we were able to capture broader association patterns that were not apparent at the individual-variant level, particularly in genomic regions where the two arrays provided complementary, non-overlapping coverage. Notably, several regions identified through the LD-block analysis were concordant with loci highlighted by machine-learning and functional prediction approaches, providing convergent support across independent analytic strategies. Nevertheless, these findings should be regarded as hypothesis-generating. Validation in larger, independent cohorts—ideally incorporating multi-ethnic populations—and the use of higher-resolution genomic data will be essential to confirm the implicated loci and to more comprehensively define the genetic architecture of osteoporosis.
5. Conclusions
Taken together, this study highlights the genetic contributors to postmenopausal osteoporosis risk by integrating two genotyping platforms with LD-block analysis, machine learning, and in silico predictions. Although only PNPLA3 reached suggestive significance in single-SNP testing, LD-based refinement revealed additional loci—including NAV2, BICD1, CCDC92, and ZNF664—that were detectable only when cross-platform LD structure was considered. Convergent evidence across analytic layers prioritized variants such as CCDC92, DYNC2H1, and NQO1 as biologically meaningful candidates. These findings underscore the value of combining multi-platform data with structural and functional analyses and provide a focused set of variants for future validation.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1Kanis J.A. Assessment of fracture risk and its application to screening for postmenopausal osteoporosis: Synopsis of a WHO report WHO Study Group. Osteoporos. Int.1994436838110.1007/BF 016222007696835 · doi ↗ · pubmed ↗
- 2Cummings S.R. Melton L.J. Epidemiology and outcomes of osteoporotic fractures Lancet 20023591761176710.1016/S 0140-6736(02)08657-912049882 · doi ↗ · pubmed ↗
- 3Rachner T.D. Khosla S. Hofbauer L.C. Osteoporosis: Now and the future Lancet 20113771276128710.1016/S 0140-6736(10)62349-521450337 PMC 3555696 · doi ↗ · pubmed ↗
- 4Harris M. Nguyen T.V. Howard G.M. Kelly P.J. Eisman J.A. Genetic and environmental correlations between bone formation and bone mineral density: A twin study Bone 19982214114510.1016/S 8756-3282(97)00252-49477237 · doi ↗ · pubmed ↗
- 5Trajanoska K. Morris J.A. Oei L. Zheng H.F. Evans D.M. Kiel D.P. Ohlsson C. Richards J.B. Rivadeneira F. GEFOS/GENOMOS consortium and the 23and Me research team Assessment of the genetic and clinical determinants of fracture risk: Genome wide association and mendelian randomisation study BMJ 2018362 k 322510.1136/bmj.k 322530158200 PMC 6113773 · doi ↗ · pubmed ↗
- 6Choi H.J. Park H. Zhang L. Kim J.H. Kim Y.A. Yang J.Y. Pei Y.F. Tian Q. Shen H. Hwang J.Y. Genome-wide association study in East Asians suggests UHMK 1 as a novel bone mineral density susceptibility gene Bone 20169111312110.1016/j.bone.2016.07.00827424934 · doi ↗ · pubmed ↗
- 7Bae Y.S. Im S.W. Kang M.S. Kim J.H. Lee S.H. Cho B.L. Park J.H. Nam Y.S. Son H.Y. Yang S.D. Genome-Wide Association Study of Bone Mineral Density in Korean Men Genom. Inf.201614626810.5808/GI.2016.14.2.6227445649 PMC 4951402 · doi ↗ · pubmed ↗
- 8Maher B. Personal genomes: The case of the missing heritability Nature 2008456182110.1038/456018 a 18987709 · doi ↗ · pubmed ↗
