A cautionary note on the naive use of general-population biobanks to study pulmonary arterial hypertension, with a focus on Mendelian randomisation

Benjamin Woolf; Eckart De Bie; Vallerie McLaughlin; Stefan Gräf; Mark Toshner; Martin R. Wilkins; Christopher J. Rhodes; Stephen Burgess

PMC · DOI: 10.1183/13993003.00436-2025 · October 16, 2025

Open Access

TL;DR

This paper warns against studying pulmonary arterial hypertension in general-population biobanks that define cases with a single medical record code, because non-random misclassification and low statistical power make the results unreliable.

Contribution

The paper highlights the risks of false findings when using general-population biobanks for PAH research.

Findings

01. Using general-population biobanks for PAH leads to false-positive and false-negative results.

02. Non-random misclassification and low power are major issues in such studies.

Abstract

Pulmonary hypertension (PH) is defined by a mean pulmonary artery pressure >20 mmHg [1]. Patients with PH are assigned to one of five internationally recognised groups. Pulmonary arterial hypertension (PAH), or group 1 PH, is a heterogeneous collection of conditions characterised by increased precapillary pulmonary vascular resistance. Groups 2 to 5 PH comprise PH caused, in turn, by left heart disease, lung diseases (e.g. COPD), chronic thromboembolism, and miscellaneous causes such as haematological diseases. There are two issues with using existing data from general-population biobanks to study PAH: low power and non-random misclassification. Relative to gold standard data, this results in false-positive and false-negative findings. https://bit.ly/4nksm6c

Funding

Economic and Social Research Council (http://dx.doi.org/10.13039/501100000269)

Taxonomy

Topics: Pulmonary Hypertension Research and Treatments

Full text

To the Editor:

Pulmonary hypertension (PH) is defined by a mean pulmonary artery pressure >20 mmHg [1]. Patients with PH are assigned to one of five internationally recognised groups. Pulmonary arterial hypertension (PAH), or group 1 PH, is a heterogeneous collection of conditions characterised by increased precapillary pulmonary vascular resistance. Groups 2 to 5 PH comprise PH caused, in turn, by left heart disease, lung diseases (e.g. COPD), chronic thromboembolism, and miscellaneous causes such as haematological diseases.

PAH is increasingly being investigated using data from biobanks with a general population sampling frame (figure 1a), as opposed to disease-specific cohorts. General population biobanks tend to define PAH with a single medical record code. Many studies use these data to perform Mendelian randomisation (MR; figure 1b). MR is a study design that uses genetic variants specifically associated with an exposure of interest to test causal claims [2, 3]. We demonstrate two issues with existing general population biobank PAH data: low power and non-random misclassification. These result in a failure to replicate findings from gold standard PAH datasets in general population biobanks, and spurious findings from general population biobanks that fail to replicate in gold standard datasets.

Because PAH is rare, with a prevalence below 50 cases per million [1], population-based biobanks have few cases. Power in case–control studies is not substantially improved by increasing the case-to-control ratio beyond 1:4 [4]. Consequently, biobanks are typically less well powered than PAH-specific cohorts, despite having tens or hundreds of times more participants.
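The diminishing return from adding controls can be illustrated with the effective sample size that is commonly used for case–control association studies, 4 / (1/n_cases + 1/n_controls). This is a minimal sketch: the formula is a standard variance-based approximation rather than something taken from this letter, and the case count of 150 is hypothetical.

```python
def effective_sample_size(n_cases: int, n_controls: int) -> float:
    """Effective sample size of a case-control study: 4 / (1/n_cases + 1/n_controls)."""
    return 4.0 / (1.0 / n_cases + 1.0 / n_controls)


def fraction_of_maximum(controls_per_case: float) -> float:
    """Share of the limiting effective sample size (4 * n_cases, reached with
    unlimited controls) achieved at a control-to-case ratio k: k / (1 + k)."""
    return controls_per_case / (1.0 + controls_per_case)


if __name__ == "__main__":
    n_cases = 150  # hypothetical number of cases in a biobank
    for k in (1, 4, 10, 100, 1000):
        n_eff = effective_sample_size(n_cases, n_cases * k)
        share = fraction_of_maximum(k)
        print(f"1:{k:<5} N_eff={n_eff:7.1f}  share of maximum={share:.3f}")
```

Under this approximation, four controls per case already reach 80% of the effective sample size that unlimited controls would give, so the number of cases is the limiting quantity.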

Low power has two implications: true-positive associations are more likely to be missed, and detected associations have lower odds of being true. Indeed, a reported association between variants proxying IL6R signalling and PAH risk observed in an early release of FinnGen failed to replicate in much larger cohorts [5].
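The second implication follows from the usual expression for the probability that a statistically significant finding is true. The sketch below assumes that expression (power × prior / (power × prior + α × (1 − prior))); the prior, significance threshold and power values are hypothetical and chosen only for illustration.

```python
def probability_finding_is_true(power: float, alpha: float, prior: float) -> float:
    """Probability that a significant association is a true positive.

    power: probability of detecting a true association
    alpha: probability of a false-positive result under the null
    prior: proportion of tested hypotheses that are truly non-null
    """
    true_positives = power * prior
    false_positives = alpha * (1.0 - prior)
    return true_positives / (true_positives + false_positives)


if __name__ == "__main__":
    prior, alpha = 0.1, 0.05  # hypothetical values
    for power in (0.8, 0.5, 0.2):
        ppv = probability_finding_is_true(power, alpha, prior)
        print(f"power={power:.1f} -> probability finding is true={ppv:.2f}")
```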

Meta-analyses can address low power. Researchers therefore meta-analysed three general-population biobanks (UK Biobank (UKB), FinnGen release 12, and Million Veteran Program (MVP)) with 3302 apparent PAH cases and 1 205 457 controls (https://mvp-ukbb.finngen.fi/pheno/I9_HYPTENSPUL) [6–8]. The prevalence in this meta-analysis is much higher than expected based on general population surveys. This might imply the presence of misclassification.
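The size of the discrepancy can be checked from the counts quoted above. This sketch treats cases plus controls as the denominator and compares the result with the upper bound of 50 per million cited earlier; it ignores the fact that biobank participants are not a random population sample, so it is only a rough comparison.

```python
N_APPARENT_CASES = 3302          # apparent PAH cases in the meta-analysis
N_CONTROLS = 1_205_457           # controls in the meta-analysis
EXPECTED_MAX_PER_MILLION = 50.0  # upper bound on PAH prevalence cited in the text


def per_million(n_cases: int, n_total: int) -> float:
    """Number of cases per million participants."""
    return 1_000_000 * n_cases / n_total


apparent_per_million = per_million(N_APPARENT_CASES, N_APPARENT_CASES + N_CONTROLS)
fold_excess = apparent_per_million / EXPECTED_MAX_PER_MILLION

if __name__ == "__main__":
    print(f"apparent cases per million: {apparent_per_million:.0f}")
    print(f"fold excess over expected upper bound: {fold_excess:.0f}")
```

The apparent case frequency comes out at roughly 2700 per million, more than 50 times the expected upper bound.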

Misclassification (when people categorised as cases do not have the condition, or people categorised as controls do) reduces power [9]. The Rhodes et al. [10] genome-wide association study (GWAS) contained 2085 PAH cases with gold standard expert centre diagnoses, and 9659 controls. The FinnGen-UKB-MVP meta-analysis is theoretically better powered because it has more “cases” and controls. Failure to detect an association observed in Rhodes et al. [10] in the FinnGen-UKB-MVP meta-analysis would suggest misclassification of cases in the general population biobanks.
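How case misclassification weakens a genetic association can be sketched with a simple mixture model. The assumption here is that individuals wrongly labelled as cases carry the risk allele at the control frequency (random misclassification); the allele frequency, true odds ratio and proportions of true cases are hypothetical.

```python
def case_allele_frequency(control_freq: float, odds_ratio: float) -> float:
    """Risk-allele frequency in true cases implied by an allelic odds ratio."""
    odds = odds_ratio * control_freq / (1.0 - control_freq)
    return odds / (1.0 + odds)


def observed_odds_ratio(control_freq: float, true_or: float, true_case_share: float) -> float:
    """Allelic odds ratio seen when only `true_case_share` of labelled cases
    are true cases and the rest have the control allele frequency."""
    true_freq = case_allele_frequency(control_freq, true_or)
    mixed_freq = true_case_share * true_freq + (1.0 - true_case_share) * control_freq
    mixed_odds = mixed_freq / (1.0 - mixed_freq)
    control_odds = control_freq / (1.0 - control_freq)
    return mixed_odds / control_odds


if __name__ == "__main__":
    for share in (1.0, 0.5, 1 / 3, 0.1):
        print(f"true-case share={share:.2f} -> observed OR="
              f"{observed_odds_ratio(0.3, 1.3, share):.3f}")
```

As the share of true cases falls, the observed odds ratio moves towards 1, so a larger nominal sample can still be less able to detect a real association.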

Rhodes et al. [10] detected associations with three independent variants at genome-wide significance (p<5×10⁻⁸): rs2856830 in the HLA-DPA1/DPB1 gene cluster, and rs13266183 and rs10103692 near the SOX17 gene. The SOX17 gene was then followed up in cell and animal models, which demonstrated the functional relevance of the variants. However, none of these variants are associated with PAH in the FinnGen-UKB-MVP meta-analysis (p=0.506 for rs2856830, p=0.241 for rs13266183, and p=0.318 for rs10103692). This strongly suggests that the general population biobanks do not accurately identify PAH cases.

Although not ideal, when misclassification is random, information can still be gained from large numbers of noisy observations [9]. However, non-random misclassification can introduce bias. Rare conditions like PAH are susceptible to misclassification bias in general population biobanks because a tiny percentage of controls that are non-randomly misclassified can overwhelm signal from the limited number of true cases. (Correctly classifying 99.99% of non-PAH individuals in a general population sample will result in 100 false cases per million, twice the prevalence of true cases.)
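The arithmetic in the parenthesis can be reproduced as follows. This is a sketch that takes the true prevalence as 50 per million (the upper bound quoted earlier) and, as a simplifying assumption not stated in the text, assumes every true case is correctly labelled.

```python
def false_cases_per_million(specificity: float, true_cases_per_million: float) -> float:
    """Non-cases wrongly labelled as cases, per million people."""
    non_cases = 1_000_000 - true_cases_per_million
    return (1.0 - specificity) * non_cases


def share_of_labelled_cases_that_are_true(
    sensitivity: float, specificity: float, true_cases_per_million: float
) -> float:
    """Proportion of labelled cases that truly have the condition."""
    true_labelled = sensitivity * true_cases_per_million
    false_labelled = false_cases_per_million(specificity, true_cases_per_million)
    return true_labelled / (true_labelled + false_labelled)


if __name__ == "__main__":
    false_cases = false_cases_per_million(0.9999, 50)
    share = share_of_labelled_cases_that_are_true(1.0, 0.9999, 50)
    print(f"false cases per million: {false_cases:.1f}")
    print(f"share of labelled cases that are true: {share:.2f}")
```

With these inputs about two in every three labelled cases would be false, even though only 0.01% of non-cases are misclassified.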

One possible source of non-random misclassification is individuals with non-group 1 PH who incorrectly receive a PAH medical record code. We examined this by testing if causes of groups 2 to 5 PH associate with PAH diagnosis in MR analyses. We used GWASs on pulmonary emboli (32 876 cases, 1 508 902 controls) and left heart failure (10 857 cases, 1 463 784 controls) from the FinnGen-UKB-MVP meta-analysis, and COPD (58 559 cases, 937 358 controls) from the Global Biobank Meta-Analysis [11]. Cases were defined using medical record codes. We selected genetic proxies for each phenotype using independent (clumping r²=0.001 and distance=10 Mb) genome-wide significant variants, and meta-analysed MR Wald ratios using an inverse variance weighting. The MR analysis using the FinnGen-UKB-MVP meta-analysis found evidence supporting genetically predicted left heart failure (OR 1.916, 95% CI 1.660 to 2.210), COPD (OR 1.376, 95% CI 1.092 to 1.734) and pulmonary emboli (OR 1.110, 95% CI 1.016 to 1.212) as risk factors for PAH diagnoses. This was not replicated by Rhodes et al. [10] (OR 1.209, 95% CI 0.967 to 1.512; OR 1.144, 95% CI 0.772 to 1.697; and OR 0.997, 95% CI 0.874 to 1.137, respectively).
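The estimation step described above (per-variant Wald ratios combined by inverse variance weighting) can be sketched as follows. This is not the authors' code: the variant effect sizes are hypothetical, the standard error uses a first-order approximation that ignores uncertainty in the variant–exposure association, and the fixed-effect weighting is one common choice.

```python
import math
from typing import Sequence, Tuple


def wald_ratio(beta_exposure: float, beta_outcome: float, se_outcome: float) -> Tuple[float, float]:
    """Per-variant MR estimate and its first-order standard error."""
    return beta_outcome / beta_exposure, se_outcome / abs(beta_exposure)


def ivw_estimate(ratios: Sequence[float], ses: Sequence[float]) -> Tuple[float, float]:
    """Fixed-effect inverse-variance weighted mean of the Wald ratios."""
    weights = [1.0 / se ** 2 for se in ses]
    total = sum(weights)
    estimate = sum(w * r for w, r in zip(weights, ratios)) / total
    return estimate, math.sqrt(1.0 / total)


def odds_ratio_with_ci(log_or: float, se: float, z: float = 1.96) -> Tuple[float, float, float]:
    """Convert a log odds ratio and standard error to OR and 95% CI."""
    return math.exp(log_or), math.exp(log_or - z * se), math.exp(log_or + z * se)


if __name__ == "__main__":
    # Hypothetical variants: (beta on exposure, beta on outcome, SE of outcome beta)
    variants = [(0.10, 0.05, 0.020), (0.20, 0.12, 0.030), (-0.15, -0.06, 0.025)]
    per_variant = [wald_ratio(bx, by, se) for bx, by, se in variants]
    est, se = ivw_estimate([r for r, _ in per_variant], [s for _, s in per_variant])
    or_, lower, upper = odds_ratio_with_ci(est, se)
    print(f"IVW OR {or_:.3f} (95% CI {lower:.3f} to {upper:.3f})")
```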

These differences are most plausibly explained by case misclassification; the alternative explanations are not convincing. One alternative is lower power; however, this would only lead to wider confidence intervals, whereas we also see substantial attenuation in estimates. Another is demographic differences between Rhodes et al. [10] and the FinnGen-UKB-MVP meta-analysis, but the differences between the biobanks themselves are at least as large. Thus, if demographic differences were important, there should also be heterogeneity in estimates between the biobanks. However, substantial heterogeneity in PAH associations was not observed (minimum FDR-adjusted heterogeneity p=0.089).
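The distinction between wider intervals and attenuated estimates can be made concrete by putting the reported odds ratios on the log scale. The sketch below uses the estimates quoted in the text and assumes the confidence intervals are symmetric Wald intervals on the log scale; it is an illustrative reading of the published numbers, not a reanalysis.

```python
import math
from typing import Dict, Tuple

OddsRatio = Tuple[float, float, float]  # (point estimate, lower 95% limit, upper 95% limit)

# (FinnGen-UKB-MVP estimate, Rhodes et al. estimate) as reported in the text
REPORTED: Dict[str, Tuple[OddsRatio, OddsRatio]] = {
    "left heart failure": ((1.916, 1.660, 2.210), (1.209, 0.967, 1.512)),
    "COPD": ((1.376, 1.092, 1.734), (1.144, 0.772, 1.697)),
    "pulmonary emboli": ((1.110, 1.016, 1.212), (0.997, 0.874, 1.137)),
}


def log_or_and_se(estimate: OddsRatio, z: float = 1.96) -> Tuple[float, float]:
    """Log odds ratio and the standard error implied by its 95% CI."""
    point, lower, upper = estimate
    return math.log(point), (math.log(upper) - math.log(lower)) / (2.0 * z)


def compare(biobank: OddsRatio, expert: OddsRatio) -> Tuple[float, float]:
    """Ratio of log odds ratios and ratio of standard errors (expert / biobank)."""
    log_b, se_b = log_or_and_se(biobank)
    log_e, se_e = log_or_and_se(expert)
    return log_e / log_b, se_e / se_b


if __name__ == "__main__":
    for exposure, (biobank, expert) in REPORTED.items():
        shrink, se_ratio = compare(biobank, expert)
        print(f"{exposure}: log-OR ratio={shrink:.2f}, SE ratio={se_ratio:.2f}")
```

A log-OR ratio well below 1 means the point estimate itself is closer to the null in the expert-diagnosed dataset, which lower power alone would not produce; an SE ratio above 1 reflects the wider intervals.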

Non-random misclassification can bias downstream analyses. To illustrate, atrial fibrillation (AF) is associated with left heart failure but does not cause PAH. Misclassification of group 2 PH might create a false-positive AF–PAH association. Selecting variants (using the same parameters described above) from the FinnGen-UKB-MVP meta-analysis of AF (170 643 cases and 1 163 021 controls), we observe an association of genetically predicted AF with PAH in the FinnGen-UKB-MVP PAH meta-analysis (p<0.001) but not in Rhodes et al. [10] (p=0.457).

Non-PH individuals can also be misclassified as having PAH, for example because of a related cardiorespiratory disease. In the All of Us biobank, 21 individuals per million had a PAH ICD-10 code, were on a medication used to treat PAH, and had no group 2 to 5 related ICD-10 codes [12]. However, 129 individuals per million have a PAH ICD-10 code but are not on any PAH medication and do not have a group 2 to 5 PH medical record code (figure 1c); these are likely to be largely non-PH individuals with an incorrect PAH diagnosis.
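The stratification described above can be expressed as a simple rule over three flags per participant. This is a minimal sketch: the `Participant` fields and category labels are hypothetical and do not correspond to the All of Us data schema; only the two per-million counts come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Participant:
    # Hypothetical flags derived from a participant's medical records
    has_pah_code: bool
    on_pah_medication: bool
    has_group_2_to_5_code: bool


def classify(p: Participant) -> str:
    """Assign a participant to one of the strata discussed in the text."""
    if not p.has_pah_code:
        return "no PAH code"
    if p.has_group_2_to_5_code:
        return "PAH code with group 2 to 5 code"
    if p.on_pah_medication:
        return "PAH code, medicated, no group 2 to 5 code"
    return "PAH code, unmedicated, no group 2 to 5 code"


# Per-million counts reported in the text for the last two strata
MEDICATED_PER_MILLION = 21
UNMEDICATED_PER_MILLION = 129
medicated_share = MEDICATED_PER_MILLION / (MEDICATED_PER_MILLION + UNMEDICATED_PER_MILLION)

if __name__ == "__main__":
    print(classify(Participant(True, False, False)))
    print(f"medicated share among coded individuals without group 2 to 5 codes: "
          f"{medicated_share:.0%}")
```

Among individuals with a PAH code and no group 2 to 5 code, only 21 of 150 per million (14%) are on a PAH medication.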

Researchers wishing to use general population biobanks to study rare diseases such as PAH need to ensure they carefully address misclassification. Because PAH cases are unlikely to be unmedicated, the literature suggests supplementing medical records with PAH-related prescriptions and/or requiring elevated pulmonary pressures [13]. However, since non-group 1 PH individuals are prescribed PAH medication, requiring cases to use PAH medication may not address misclassification between PH groups. The blanket exclusion of people with conditions related to group 2 to 5 PH will also exclude true PAH cases due to high co-occurrence of some of these conditions in gold standard-diagnosed PAH patients [14].

To advance the field, a call-to-action was initiated by the Genetics and Genomics Task Force 3 at the 7th World Symposium on Pulmonary Hypertension to assemble a global, diverse, condition-specific genetic registry [15]. We would encourage researchers or clinicians with genotyped PH patients to support the registry.

To conclude, general population biobanks defining PAH with a single medical record code are likely to produce unreliable results due to non-random misclassification and low power. When misclassification is truly random, it can be addressed by increasing sample sizes. Non-random misclassification cannot be addressed so simply. The spurious MR findings presented here highlight that even “robust” designs require good quality data. Inaccurate PAH definitions are thus likely to bias conventional observational analyses. Currently available general population biobank GWASs of PAH should therefore be avoided in downstream analyses such as MR.

Bibliography

1. Kovacs G, Bartolome S, Denton CP, et al. Definition, classification and diagnosis of pulmonary hypertension. Eur Respir J 2024; 64: 2401324. doi:10.1183/13993003.01324-2024. PMID: 39209475. PMCID: PMC11533989.
2. Sanderson E, Glymour MM, Holmes MV, et al. Mendelian randomization. Nat Rev Methods Primers 2022; 2: 6. doi:10.1038/s43586-021-00092-5. PMID: 37325194. PMCID: PMC7614635.
3. Davies NM, Holmes MV, Smith GD. Reading Mendelian randomisation studies: a guide, glossary, and checklist for clinicians. BMJ 2018; 362: k601. doi:10.1136/bmj.k601. PMID: 30002074. PMCID: PMC6041728.
4. Katki HA, Berndt SI, Machiela MJ, et al. Increase in power by obtaining 10 or more controls per case when type-1 error is small in large-scale association studies. BMC Medical Research Methodology 2023; 23: 153. doi:10.1186/s12874-023-01973-x. PMID: 37386403. PMCID: PMC10308790.
5. Woolf B, Perry JA, Hong CC, et al. Multi-biobank summary data Mendelian randomisation does not support a causal effect of IL-6 signalling on risk of pulmonary arterial hypertension. Eur Respir J 2024; 63: 2302031. doi:10.1183/13993003.02031-2023. PMID: 38453257. PMCID: PMC10991834.
6. Kurki MI, Karjalainen J, Palta P, et al. FinnGen provides genetic insights from a well-phenotyped isolated population. Nature 2023; 613: 508–518. doi:10.1038/s41586-022-05473-8. PMID: 36653562. PMCID: PMC9849126.
7. Verma A, Huffman JE, Rodriguez A, et al. Diversity and scale: genetic architecture of 2068 traits in the VA Million Veteran Program. Science 2024; 385: eadj1182. doi:10.1126/science.adj1182. PMID: 39024449. PMCID: PMC12857194.
8. Neale B. Neale Lab. Neale Lab UK Biobank GWASs. Date last accessed: 30 May 2022. www.nealelab.is/uk-biobank