Association between tumour somatic mutations and venous thromboembolism in the 100,000 Genomes Project cancer cohort: a study protocol
Naomi Cornish, Sarah K. Westbury, Matthew T. Warkentin, Chrissie Thirlwell, Andrew D. Mumford, Philip C. Haycock, Benilde Cosmi, Naomi Cornish, Yan Xu, Naomi Cornish

TL;DR
This study aims to find if genetic changes in tumors are linked to blood clots in cancer patients, which could help predict and understand this dangerous condition.
Contribution
The study introduces a novel analysis of tumor somatic mutations and their association with venous thromboembolism in a large cancer cohort.
Findings
Examines the effect of deleterious somatic DNA variants on venous thromboembolism rates.
Investigates the role of tumor mutational burden and signatures in cancer-associated blood clots.
Plans to identify key genes involved in cancer-related thrombosis for potential risk prediction models.
Abstract
Venous thromboembolism (VTE) is a common cause of morbidity and mortality in patients with cancer. There is evidence that specific aberrations in tumour biology contribute to the pathophysiology of this condition. We plan to examine the association between tumour somatic mutations and VTE in an existing cohort of patients with cancer, who were enrolled to the flagship Genomics England 100,000 Genomes Project. Here, we outline an a-priori analysis plan to address this objective, including details on study cohort selection, exposure and outcome definitions, annotation of genetic variants and planned statistical analyses. We will assess the effect of 1) deleterious somatic DNA variants in each gene; 2) tumour mutational burden and 3) tumour mutational signatures on the rate of VTE (outcome) in a pan-cancer cohort. Sensitivity analyses will be performed to examine the robustness of any…
Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.
Click any figure to enlarge with its caption.
Figure 1| Code | Description |
|---|---|
|
| Pulmonary embolism |
|
| Cerebral infarction due to cerebral venous thrombosis, non-pyogenic |
|
| Non-pyogenic thrombosis of intracranial venous system |
|
| Phlebitis and thrombophlebitis of femoral vein |
|
| Phlebitis and thrombophlebitis of other deep vessels of lower extremities |
|
| Portal vein thrombosis |
|
| Budd-Chiari syndrome |
|
| Embolism and thrombosis of vena cava |
|
| Embolism and thrombosis of renal vein |
- —Wellcome Trust
- —Cancer Research UK
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsVenous Thromboembolism Diagnosis and Management · Lymphoma Diagnosis and Treatment · Genetic Associations and Epidemiology
Introduction
Venous thromboembolism (VTE) is a frequent complication of malignancy and a leading cause of cancer-related death ^ 1 ^. Estimates of the absolute risk of VTE in cancer patients vary enormously, from approximately 0.6% to over 20%, depending on a myriad of incompletely-understood risk factors ^ 2 ^. These include patient, cancer-specific and treatment-related variables ^ 3 ^.
One of the most powerful predictive factors for cancer-associated VTE is the anatomical type of tumour ^ 4 ^. Many tumours express elevated levels of prothrombotic proteins, or directly interact with adhesion molecules on platelets and vascular endothelial cells ^ 3, 5 ^. This suggests that VTE may result from specific aberrations of tumour biology.
Several groups have previously explored the association between alterations within the tumour genome and VTE. The majority of these studies are relatively small and have focused on a handful of candidate genes (including KRAS, EGFR, ALK and IDH1), sometimes with conflicting results ^ 6, 7 ^.
The largest study in this field used the Memorial Sloan Kettering (MSK)-IMPACT platform to sequence target genes in 11,695 tumour samples ^ 8 ^. Their pan-cancer analysis was restricted to 53 known oncogenes or tumour-suppressor genes, and found that somatic mutations in 7 genes modulated the risk of VTE. A separate analysis in the same cohort found several of these genes were also implicated in the development of arterial thrombosis ^ 9 ^. Recently, another publication from the MSK group has reported that levels of circulating tumour DNA independently predict VTE risk in a dose-dependent relationship. Interestingly some gene level alterations (including KRAS, STK11 and KEAP1) were also shown to be associated with VTE in the discovery cohort (n = 4,141) ^ 10 ^.
There is scope to attempt to replicate these results in an independent cohort and to further explore associations between VTE and somatic mutations in genes which have not previously been analysed.
Objective
We plan to conduct a pan-cancer analysis examining the association between somatic mutations across the tumour genome (exposure) and VTE (outcome).
We intend to analyse the exposure variable in 3 alternative ways 1) a gene-centric approach, where the effect of somatic mutations is considered separately for each gene; 2) by assessing tumour mutational burden; and 3) by assessing tumour mutational signatures. The effect of these variables on the rate of VTE will be assessed using statistical models described below.
Study population
This analysis uses existing data from the 100,000 Genomes Project Cancer Programme, a Genomics England (GEL) initiative which recruited 17,241 participants with cancer and performed whole genome sequencing (WGS) on matched germline and somatic (tumour) genomes ^ 11 ^. Recruitment occurred between 2015 and 2019. Eligibility for the overall program covered patients aged from birth upwards, diagnosed with a solid organ or haematological malignancy. Previously treated patients presenting with cancer recurrence, progression or undergoing surgery following neoadjuvant chemotherapy were all included. Genomic data is linked to pseudo-anonymised longitudinal electronic health records from secondary care, including Hospital Episode Statistics (HES), the National Cancer Registration and Analysis Service (NCRAS), the Systemic Anti-Cancer Therapy (SACT) dataset, and mortality data from the Office for National Statistics (ONS).
From the 100,000 Genomes Project version 18 data-release (December 2023), we will select a cohort of participants who meet the following inclusion criteria:
1. Ongoing valid study consent and prospectively collected tumour sample (exclude patients with stored samples which were collected prior to study recruitment opening)
2. Paired somatic and germline WGS meeting the following quality control criteria: concordant phenotypic and karyotypic sex; read mapping quality > 30 across 210 Gb for tumour DNA and 85Gb for germline DNA; cross-sample contamination <3% for germline DNA (assessed by VerifyBamID [ https://github.com/statgen/verifyBamID]) and <5% for tumour DNA (assessed by ConPair [ https://github.com/nygenome/Conpair]) ^ 12 ^.
3. Linked NCRAS record with a congruous cancer diagnosis to the tumour type received by GEL.
4. Histology consistent with malignant cancer.
5. No missing or discrepant information for critical covariates (including age, sex, genetically inferred ancestry, cancer type and diagnosis date).
6. No prior history of VTE (patients with a documented VTE prior to study entry will be excluded).
Primary outcome definition
The primary outcome of interest is the first occurrence of VTE. This is a composite phenotype which we will define using the Health Data Research UK coding algorithms for deep vein thrombosis at any anatomical site (PH338) and pulmonary embolism (PH71) ( Table 1) [ https://phenotypes.healthdatagateway.org] ^ 13 ^. VTE diagnoses will be identified from ICD10 codes in linked hospital episode statistics (including inpatient, outpatient and emergency department records). Death from any cause other than VTE will be recorded from linked ONS data.
Covariate assessment
Baseline covariates (including age and sex) will be obtained from participant data submitted at recruitment. Cancer-specific covariate information will be derived from linked NCRAS records: this includes date of cancer diagnosis, tumour site and histological type, date of chemotherapy or surgery and stage of cancer (recorded within 12months of study entry). NCRAS diagnoses are coded using ICD10 version 19 [ https://icd.who.int/browse10/2019/en#/II]. Where applicable, information on the specific anti-cancer drug regimen administered will be obtained from linked SACT data.
Genetic data
For cancer participants enrolled to the 100,000 Genomes Project, tissue collection and DNA extraction was performed in local laboratories linked to the regional NHS Genomics Medicine Centres. Tumour DNA was obtained primarily from fresh-frozen histology specimens (and rarely from formalin-fixed paraffin-embedded samples). Germline DNA was obtained from blood or occasionally saliva. The protocols for sample collection and DNA extraction are described in the GEL Sample Handling Guidance (v.4.0) available from [ Document library | Genomics England].
Samples were prepared with an Illumina TruSeq PCR-free library preparation kit, providing sufficient DNA was available; otherwise PCR-based library preparation was used for a minority of samples. DNA sequencing was performed centrally on an Illumina HiSeq platform to an average coverage of 30x (germline DNA) and 100x (tumour DNA). Sequence data was processed using the Illumina North Star (version 2.6.53.23) pipeline with read alignment against the human reference genome GRCh38-Decoy+EBV using ISAAC (version iSAAC-03.16.02.19) ^ 12 ^.
Somatic variant calls and interpretation
Detection of somatic single nucleotide variants (SNVs) and insertions/deletions < 50bp has been performed using Strelka4 (v2.4.7) [ https://github.com/Illumina/strelka]. Detection of large structural somatic variants (inversions, translocations) and insertions/deletions >50bp has been performed using Manta (v.0.28.0) [ https://github.com/Illumina/manta]. In addition to the default quality filters applied by these pipelines, we will exclude 1) variants with germline allele frequency >1% in the GEL cohort (as these may indicate unsubtracted germline SNVs); 2) variants with somatic allele frequency >5% in the GEL cohort (as these may indicate potential technical artefacts); 3) indels in regions of high sequencing noise (i.e proportion of low quality filtered base calls within 50bp of the variant exceeds 10%) ^ 12 ^.
Variants have been annotated against the canonical transcript using Cellbase ^ 14 ^ (integrating information from ENSEMBL (version 90) ^ 15 ^, COSMIC (version 86) ^ 16 ^ and ClinVar (October 2018 release) ^ 17 ^. We will classify variants as potentially deleterious if they fall into any of the following categories:
- Transcript ablation
- Splice acceptor / splice donor variant
- Stop gain / loss
- Start loss
- In-frame insertion / deletion
- Frameshift
- Missense variant predicted to be deleterious to protein function by in-silico prediction algorithms (e.g. FATHMM-MKL score ^ 18 ^)
- Listed in Clinvar [ https://ncbi.nlm.nik/gov/clinvar] as pathogenic/likely pathogenic
Gene inclusion criteria for gene-centric analysis
1) We will analyse all genes which are listed in either tier 1 or tier 2 of the Cancer Gene Census [ https://cancer.sanger.ac.uk] ^ 16 ^. Tier 1 includes known oncogenic and tumour suppressor genes, while tier 2 includes genes where there is emerging (but less extensive) evidence to implicate their role in cancer pathology. Since it is becoming increasingly routine to perform sequencing of these genes in newly-diagnosed cancer patients ^ 19 ^, identifying whether there are associations between these genes and VTE risk has potential clinical utility in the near-future.
2) We will analyse all genes for which germline polymorphisms have been shown to be associated with VTE in large population genome-wide or exome-wide association studies ^ 20– 22 ^ as we hypothesise that somatic variation in these same genes may contribute to cancer associated VTE.
3) Finally, since this is an exploratory analysis designed to identify new genes which have not previously been implicated in VTE, we will analyse any remaining genes for which at least 5% of the study cohort carry potentially deleterious somatic variants. This cut-off has been chosen based on power calculations indicating that in a sample of ~10,000 participants, with a VTE prevalence ~ 10%, there will be >80% power to detect an association between VTE and mutations which are present in >5% of the study cohort, assuming an effect size of 1.5 and type 1 error rate < 0.05 ^ 23 ^.
For each gene included in the analysis we will dichotomise patients into one of two categories:
1) Gene mutated: if the participant carries one or more potentially deleterious somatic variants in the gene at a variant allele frequency (VAF) >=5% in the tumour sample. VAF is calculated by dividing the number of variant reads by the total number of reads (variant + reference sequence) at that position ^ 12 ^. This VAF threshold is derived from literature which suggests that variants called at a lower VAF are frequently due to sequencing errors ^ 24 ^.
2) Gene unmutated: If the participant does not have any somatic variants in the gene, has a variant which is not predicted to be deleterious or has a variant present with a VAF < 5%.
Global tumour mutational burden
In addition to the gene-centric approach described above, we will consider the association between global tumour mutational burden (TMB, calculated as the total number of small somatic variants and indels per Mb of coding sequence) and risk of VTE.
Mutational signature analysis
We will also report the association between 30 specific mutational signatures [ https://cancer.sanger.ac.uk/cosmic/signatures_v2] and risk of VTE. The contribution of each mutational signature to the overall mutation burden has been computed by GEL using the R package nnls ^ 25 ^.
Statistical analyses
Primary analysis
Our primary objective is to identify whether there are potentially causal relationships between somatic mutations in the tumour of a person with cancer and VTE. We will therefore use Cox proportional hazards regression to assess the effect of 1) deleterious somatic DNA variants in each gene; 2) tumour mutational burden and 3) tumour mutational signatures on the rate (hazard) of VTE (outcome) in the pan-cancer cohort. We have chosen this model over a competing risks analysis, based on consensus in the literature that cause-specific hazard models are more appropriate for research questions focused around etiology, rather than prediction which requires accurate estimates of absolute risk that account for competing risks of mortality ^ 26 ^.
Time under observation will begin from the first date that a participant’s tumour was sampled for sequencing (here-after referred to as study entry). For participants who have had multiple tumour samples submitted to GEL, only DNA samples submitted at initial study entry will be used for analysis. Follow-up time will be defined as the time from study entry until the diagnosis of VTE, death from any cause, the last date when electronic health records were uploaded to the GEL Research environment (July 2022), or administrative study termination after five years, whichever occurs first. Diagnosis of VTE will be considered as the event of interest while all other events will be considered as right censoring of the follow-up time.
In the primary analysis, we will adjust Cox models for baseline patient covariates which we have identified as potential confounders including: age at study entry, sex and the top 4 genetic principal components ( Figure 1) (minimally adjusted model).
Directed acyclic graph illustrating hypothesised relationships between somatic mutations and venous thromboembolism, including potential confounders (grey) and mediators (green).
For each gene, results will be expressed in terms of the hazard ratio (HR) and associated 95% confidence interval (95% CI) for VTE in the ‘mutated’ group compared to the ‘unmutated’ (i.e., the reference) group. For TMB and mutational signatures, we will standardize the exposure variable and report the HR and 95% CI for VTE per standard deviation increase in TMB /mutational signature. Results will be interpreted in the context of false-discovery rate corrected p-values to account for multiple testing.
Sensitivity analyses
** Sensitivity analysis 1: Mediation analyses. ** We hypothesise that somatic mutations arising within a tumour contribute directly to VTE (for example through increased tumour-expression of pro-thrombotic proteins). However, these same mutations may drive tumour growth or proxy aggressive histological subtypes of malignancy. This may lead to an indirect association with VTE via other mechanisms ( Figure 1). For example, large tumours may compress or invade vessels, contributing to venous stasis and subsequent thrombosis. Patients with advanced cancer are also more likely to experience complications such as infection or hospitalization, all of which have been associated with VTE ^ 3, 5 ^. Chemotherapy and surgery are also strongly associated with VTE risk. In the era of molecularly targeted therapies, selection of systemic anti-cancer drugs may be dictated by knowledge of the underlying somatic mutation profile ^ 27 ^. This may lead to associations between genetic variants and VTE which are mediated through drugs, rather than any direct biological phenomenon.
We are interested primarily in estimating the degree to which somatic mutations contribute to VTE through direct biological mechanisms. Therefore, for genes / gene signatures where there is possible evidence from the primary analysis for an association with VTE (nominal P < 0.05), we will conduct mediation analyses to explore the degree to which these associations are dependent on indirect pathways/mediating variables
We will compare estimates from the minimally adjusted (primary analysis) with estimates from Cox regressions where we include the following covariates (presumed mediators of the exposure-outcome relationship): 1) tumour type, grouped by 20 anatomical sites according to mappings published by Cancer Research UK [ https://www.cancerresearchuk.org/health-professional/cancer-statistics/incidence/common-cancers-compared#ref-]; 2) cancer stage and 3) SACT. SACT will be included as a time-dependent covariate. We will also group SACT regimens by type and include these drug groups as distinct covariates. Participants with missing information on any of the above covariates will be excluded from this sensitivity analysis.
** Sensitivity analysis 2: Competing risks analysis. ** For genes/mutational signatures where there is statistical evidence indicating a possible association with the rate of VTE in the primary analysis (nominal p <0.05), we will use Fine and Gray regressions to estimate the risk of VTE in the gene-mutated vs reference groups, treating death from any other cause as a competing risk. We will report the subdistribution hazard ratio (SHR) and associated 95% CI for VTE associated with the gene/mutational signature, as well as the estimated cumulative incidence for VTE over time ^ 26 ^.
** Sensitivity analysis 3: Exclude relapsed and previously treated participants. ** The GEL cancer cohort includes some pre-treated patients with relapsed or progressive disease as well as patients who have undergone neo-adjuvant chemotherapy prior to study entry. Prior chemotherapy may be a powerful risk factor for VTE ^ 5 ^. It may also influence the somatic mutation profile of a tumour through treatment-induced mutagenesis and selection of tumour sub-clones ^ 28 ^. Therefore, we will perform a sensitivity analysis which is limited to only newly diagnosed untreated cancer patients: excluding any patients who have received SACT prior to study entry or where the date between original cancer diagnosis and study entry is > 180 days.
To control for variability in time between cancer diagnosis and tissue biopsy (for somatic DNA sequencing), we will also perform a sensitivity analysis where the outcome is time to VTE from cancer diagnosis, with left truncation at the point of tumour sampling.
** Sensitivity analysis 4: Exclude patients with a documented indication for anticoagulation or anti-platelet therapy. ** Prescription data are unfortunately not available for this cohort. However, we will perform a sensitivity analysis where we exclude patients who have a common medical indication for long-term anticoagulation or antiplatelet therapy documented in their hospital episode statistics prior to study entry: namely atrial fibrillation, acute coronary syndrome, ischaemic stroke/transient ischaemic attack or prosthetic heart valve replacement ^ 29, 30 ^. (Note patients with a prior history of VTE are excluded from the primary cohort).
Discussion
We acknowledge in advance some potential limitations of this analysis: firstly, no primary care data is currently available for GEL participants, therefore VTE diagnoses which are only recorded in GP records will be missed. This will lead to an under-estimation of the true VTE incidence and may limit the power of our study. Secondly, since older patients with co-morbidities may be more likely to be prescribed anticoagulation for reasons other than VTE (for example, for stroke prevention in the context of atrial fibrillation) ^ 31 ^, this is a potential confounding variable and failing to include it directly as a co-variate may introduce bias. We attempt to address this with sensitivity analysis no.4. Thirdly, the large structural variant calls from short-read sequence data may lack sensitivity and some aberrations may be missed ^ 32 ^. Nevertheless, we believe this large pan-cancer cohort, which is richly annotated with phenotypic data through linkage to a variety of longitudinal health records, provides a unique opportunity to study this important research question.
Study status
Prior to finalising this protocol, NC explored the GEL research environment platform to develop familiarity with the data structure and the available files for analysis, in order to inform judgements on the feasibility of the planned study.
To ascertain whether there was likely to be adequate power to perform the analyses described above (page 7), we also generated descriptive statistics on the overall cancer cohort including
- Proportion of the sample with missing covariate data
- Proportion of the population experiencing the outcome of interest (VTE)
No prior analyses examining the associations between somatic mutations and VTE have been performed.
Future amendments to this protocol will be clearly documented and justified in subsequent reports. We plan to submit findings from this study for publication once analyses are complete.
Ethics and consent
This study has been approved by the Genomics England research network, and access to the data will adhere to Genomics England research governance agreements. The 100,000 Genomes Project was approved by the NHS Health Research Authority East of England - Cambridge South Research Ethics Committee (REC ref: 14/EE/1112, 20 ^th^ February 2015). Participants were recruited from across 13 NHS Genomic Medicine Centres and all participants agreed to participate in the 100,000 Genomes Project and provided informed written consent.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1Khorana AA Francis CW Culakova E : Thromboembolism is a leading cause of death in cancer patients receiving outpatient chemotherapy. J Thromb Haemost. 2007;5(3):632–4. 10.1111/j.1538-7836.2007.02374.x 17319909 · doi ↗ · pubmed ↗
- 2Rutjes AW Porreca E Candeloro M : Primary prophylaxis for venous thromboembolism in ambulatory cancer patients receiving chemotherapy. Cochrane Database Syst Rev. 2020;12(12):CD 008500. 10.1002/14651858.CD 008500.pub 5 33337539 PMC 8829903 · doi ↗ · pubmed ↗
- 3Noble S Pasi J : Epidemiology and pathophysiology of cancer-associated thrombosis. Br J Cancer. 2010;102 Suppl 1(Suppl 1):S 2–9. 10.1038/sj.bjc.6605599 20386546 PMC 3315367 · doi ↗ · pubmed ↗
- 4Mulder FI Horváth-PuhóE van Es N : Venous thromboembolism in cancer patients: a population-based cohort study. Blood. 2021;137(14):1959–69. 10.1182/blood.2020007338 33171494 · doi ↗ · pubmed ↗
- 5Guntupalli SR Spinosa D Wethington S : Prevention of venous thromboembolism in patients with cancer. BMJ. 2023;381:e 072715. 10.1136/bmj-2022-072715 37263632 · doi ↗ · pubmed ↗
- 6ÜnlüB Versteeg HH : Cancer-associated thrombosis: The search for the holy grail continues. Res Pract Thromb Haemost. 2018;2(4):622–9. 10.1002/rth 2.12143 30349879 PMC 6178660 · doi ↗ · pubmed ↗
- 7Abufarhaneh M Pandya RK Alkhaja A : Association between genetic mutations and risk of venous thromboembolism in patients with solid tumor malignancies: A systematic review and meta-analysis. Thromb Res. 2022;213:47–56. 10.1016/j.thromres.2022.02.022 35290837 · doi ↗ · pubmed ↗
- 8Dunbar A Bolton KL Devlin SM : Genomic profiling identifies somatic mutations predicting thromboembolic risk in patients with solid tumors. Blood. 2021;137(15):2103–13. 10.1182/blood.2020007488 33270827 PMC 8057259 · doi ↗ · pubmed ↗
