A Swedish Haplotype GWAS in Familial and Sporadic Site-Specific Colorectal Cancer
Litika Vermani, Shabane Barot, Annika Lindblom

TL;DR
This study identifies genetic risk loci specific to different parts of the colon in familial and sporadic colorectal cancer cases.
Contribution
The study uses haplotype-based GWAS to discover site-specific genetic loci in familial and sporadic colorectal cancer.
Findings
29 distinct risk loci were identified for cecal and proximal colon cancer.
14 loci were associated with familial cecal cancer and seven with sporadic cecal cancer.
18 of the 29 loci contained coding genes.
Abstract
Genetic variants specific to anatomical subsites of colorectal cancer are known to play a crucial role in its prognosis and treatment. We undertook a haplotype-based genome-wide association study (GWAS) to identify specific genetic risk loci for three sites: cecum, right colorectum, and left colorectum. Six different haplotype GWAS were performed using familial and sporadic colorectal cancer cases with tumors at three different sites. The studies included 2358 CRC cases and 1642 healthy controls. A logistic regression model using PLINK v.1.07 software was employed, and risk loci with a p-value below 5 × 10^−8^ were considered statistically significant. In total, 29 distinct risk loci were identified in the analyses of familial and sporadic cases of cecal and proximal colon cancer. The results from the analyses of familial and sporadic left-sided colorectal cancer did not meet the strict…
Figure 1
- Karolinska Institute
- Swedish Research Council
- Swedish Cancer Society
- Radiumhemmet’s Cancer Research Funds
- Stockholm County Council and Karolinska Institutet
Taxonomy
Topics: Genetic factors in colorectal cancer · Genetic Associations and Epidemiology · Colorectal Cancer Treatments and Studies
1. Introduction
The incidence of colorectal cancer (CRC) is increasing globally, especially for early-onset disease, while CRC-related mortality is decreasing [1]. Approximately 5–6% of all CRC cases have a germline mutation in a known high-penetrance cancer gene, while most others are hypothesized to arise from multiple genetic, environmental, and lifestyle factors that are characteristic of complex disease. Right- and left-sided colorectal tumors differ in embryologic origin, morphology, histology, and molecular profiles. Right-sided (proximal) colon tumors (RCC) are derived from the midgut, whereas left-sided (distal) tumors (LCC) are derived from the hindgut [2]. Accumulating evidence suggests that proximal and distal tumors have distinct clinical characteristics and differ in prognosis and treatment outcomes [3]. There are also differences in pathology. Serrated adenomas are more commonly found in the right colon, and RCCs tend to be flatter, of a mucinous phenotype, with increased T-cell infiltration, and are more often diploid and microsatellite-unstable. In contrast, distal tumors often demonstrate aneuploidy and chromosomal instability, as well as tubular and villous adenocarcinomas and a polypoid-like morphology [3]. The differences extend to the prognosis, the treatment response, and the preferred metastatic sites of disease, and the risk of death is clearly lower in left-sided than in right-sided CRC, regardless of the disease burden and known mutation status [4]. These divergent patterns may be partly explained by the distinct embryonic origins of the right and left colon, which contribute to their molecular and biological heterogeneity.
Given these differences in pathways and somatic genetics, it is plausible that predisposing genetic variants that increase CRC risk also vary by tumor location. A recent GWAS used a dataset of single variants from 48,214 CRC cases and 64,159 controls to conduct five genome-wide SNP-association scans of case subgroups that were defined by the location of their primary tumors within the colorectum [5]. Thirteen loci, not reported by previous GWAS for overall CRC risk, reached genome-wide significance (p < 5 × 10^−8^) [5]. Distinct loci were found in four of the analyses: three for tumors in the colon, one for tumors in the rectum, three for tumors in the proximal colon, and six for distal tumors. These findings suggested a heterogeneity in risk loci among anatomical tumor subsites [5].
We aimed to investigate whether genetic predisposition varies across anatomical locations within the colorectum using a different method. In our previous studies, sliding-window haplotype GWAS found rarer loci with higher odds ratios than SNP GWAS, which is why we chose this design for the present study. The same microchip was used as in the paper by Huyghe et al. mentioned above [5]. We conducted six site-specific haplotype GWAS, comparing familial and sporadic cases with tumors in the cecum and other right-sided (RCC) or left-sided (LCC) locations, using the same healthy controls.
2. Results
Six haplotype GWAS were performed in patients with familial and sporadic disease and tumors located in the cecum, elsewhere in the right colon (RCC), or in the left colorectum (LCC) (Figure 1). Several, mostly novel, loci were identified, and all but one contained coding genes and/or RNA genes within the haplotype boundaries. Protein-coding genes were considered the most likely targets. All PLINK analyses used GRCh37, as indicated in all Supplementary Tables; positions in the main tables have been converted to GRCh38.
2.1. Analyses in Cases with Tumors in the Cecum
The GWAS of 73 familial cases with cecal cancer resulted in fourteen significant loci with ORs between 4.4 and 22.9 (Supplementary Table S1; Table 1). Eight of the fourteen loci contained genes, five had one or more RNA genes, and one haplotype had no gene. The GWAS of 244 patients with sporadic cecal tumors identified seven significant haplotypes, with ORs ranging from 1.8 to 6.69 (Supplementary Table S2; Table 1). Two of the seven loci had coding genes within their haplotype regions.
2.2. Analyses in Cases with RCC
In the analysis of 134 familial RCC cases versus healthy controls, six haplotypes at six distinct loci were significantly associated with disease (Supplementary Table S3; Table 2). The ORs ranged from 4.29 to 8.68 (Table 2). All the loci contained one or several coding genes. The analysis of 403 sporadic RCC cases versus healthy controls identified two significant loci with coding genes, with ORs ranging from 4.32 to 5.13 (Supplementary Table S4; Table 2).
2.3. Analyses in Cases with LCC
None of the analyses in LCC (340 familial and 1164 sporadic cases) reached statistical significance (p < 5 × 10^−8^) (Supplementary Tables S5 and S6).
3. Discussion
Six haplotype-based GWAS of CRC cases were conducted, with the samples stratified by family history and anatomical subsite. The findings indicate that different sites in familial and sporadic colorectal cancer are associated with distinct predisposing loci and genes. Our results support the growing evidence that proximal and distal CRCs are biologically heterogeneous rather than a uniform disease. The highest ORs were observed at unique loci in the analysis of familial cases with cecal cancer. The differences among sites in the colorectum may reflect the differences in embryological origin and in exposure to environmental and lifestyle factors [2,3,4]. The current haplotype GWAS identified significant loci that were confined to cecal and right-sided colon cancer, whereas no significant risk loci were detected in either familial or sporadic left-sided colon or rectal cancer using the strict p-value for significance. This suggests that germline genetic predisposition plays a greater role in the development of cecal tumors and right-sided colon cancers than in left-sided colorectal cancers. It also supports the general opinion that most sporadic cases develop in the left colon [6]. In addition, the LCC analyses suggested more loci in the sporadic cases than in the familial cases, the opposite of the pattern seen in the analyses of proximal tumors. This is consistent with the finding that left-sided tumors tend to be more associated with environmental influence and genetic modifiers. Many RNA genes were suggested in the analyses of both the cecum and RCC. Regulatory non-coding RNA influences cell physiopathology and modulates cells by regulating gene expression in different ways [7].
One limitation of this study is that the six analyses used small and disproportionate sample sizes. In particular, the loci with the highest ORs came from the analyses with the fewest cases, yet these were still sufficient for statistically significant results. Another possible limitation is that the same, fairly large, control cohort was used in all analyses. Still, many statistically significant loci were also identified in the smaller cohorts, and it is unclear whether this could have influenced the results negatively. The distribution of risk variants might not strictly follow the sites that were chosen for analysis; it could instead follow a gradient from the cecum to the rectum, with a decreasing number of high-risk variants and an increasing number of lower-risk variants. It is difficult to determine which p-value criteria should be used in haplotype analysis, and both stricter criteria (because of the many tests) and looser criteria (because the tests are not independent but rather test the same locus numerous times) have been suggested. Therefore, we have chosen the p-value that is generally accepted for GWAS. Furthermore, the most important result of this paper is the difference in suggested risk genes (loci) between familial and sporadic tumors at different locations, rather than the exact p-value for statistical significance in each analysis.
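The trade-off between stricter and looser criteria can be made concrete with a Bonferroni-style calculation. The sketch below is illustrative only: the family-wise error rate of 0.05 and the test counts are assumptions (one million independent tests is the conventional rationale for the 5 × 10^−8^ GWAS threshold), not values taken from this study's PLINK runs.

```python
def bonferroni_threshold(alpha: float, n_tests: float) -> float:
    """Per-test p-value threshold that keeps the family-wise error rate at alpha."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


# Assumed numbers of effectively independent tests (hypothetical values).
for n_tests in (1e5, 1e6, 1e7):
    threshold = bonferroni_threshold(0.05, n_tests)
    print(f"{n_tests:.0e} tests -> p < {threshold:.1e}")
```

Counting every overlapping window as a separate test pushes the threshold down, while treating correlated windows over the same locus as a single test pulls it back up.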
The genes suggested in this study were compared to genes from previously published SNP-GWAS [8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34]. Only two genes, ERGIC1 and PITX1, have previously been suggested as risk loci [32,34]. The loci identified in haplotype GWAS are often rare, while those identified by SNP GWAS are common; thus, none of the rare loci in the present study overlapped with the loci suggested by Huyghe et al. [5].
In our study, the ORs for the analysis in cases with cecal cancer were the highest, reaching 22.9 for the WWTR1 locus. The WWTR1 gene has been suggested to act as an oncogene, playing a crucial role in the proliferation of colorectal cancer cells and in tumor growth in vivo [35]. The current haplotype GWAS generated numerous loci and suggested many genes. Most of these loci contained a single gene, strongly suggesting that a risk variant at this locus was implicated in CRC, and many have already been reported in relation to CRC. The analysis of the patients with cecal tumors identified 21 loci: 20 of these contained coding or RNA genes, and nine of the ten loci with coding genes contained only one gene. Six of these nine genes were already implicated in cancer. The RYR2 gene is frequently mutated in CRC; in one study, it was among eight CRC-associated genes with mutation rates exceeding 20% [36]. TRIM32, a crucial member of the TRIM family, is highly expressed in numerous human cancers and is associated with a poor prognosis. However, the mechanism of TRIM32 in CRC is still unclear [37]. One study analyzed 54 commonly differentially expressed genes and found that several of them, including ARSJ, were associated with overall survival in CRC [38].
The RCC analysis identified eight significant loci: six in familial patients and two in sporadic patients. All eight had at least one coding gene, and five of these had genes previously associated with CRC. Two genes, KIAA40 and ABCA12, were reported in a previous haplotype-based GWAS. KIAA40 was suggested to act as a modifier gene in a subset of CRC cases selected because they reported smoking as a risk factor [39]. ABCA12 was suggested as a modifying risk locus in CRC cases selected for physical inactivity in the same study [39]. Another published paper reported that the KAT2B gene decreased BRCA2 expression in CRC and suggested that KAT2B acted on the PARPi response by regulating the expression of BRCA2 [40]. The RBM47 gene has been suggested to have an anti-tumor function [41]. The same gene was also found as one of ten risk loci in our previous haplotype GWAS of CRC patients with familial gastric and/or prostate cancer [42]. In that study, the RBM47 locus had a much lower OR (2.4) and a less significant p-value (p < 4.3 × 10^−6^) than in this study (OR = 7.95, p < 3.75 × 10^−8^) [42]. Fas and Fas ligand (FasL) are implicated in programmed cell death (apoptosis) [43]. Cancer stem-like cells (CSCs) are proposed to contribute to tumor growth and relapse and are a target for cancer therapy. Aspirin was suggested to eliminate CSCs through a unique pathway, the p300-Ach3K9-FasL axis, which could explain the therapeutic significance of aspirin [44]. The gene FLI1 has been implicated in CRC. DNA methyltransferase 3b (DNMT3b) was found to be significantly overexpressed in CRC, and low DNMT3b expression was associated with prolonged survival [45]. The inhibition of DNMT3b increased FLI1 expression and inhibited the malignant phenotype of CRC cells. The inhibition of FLI1 reversed the phenotypic modulation caused by DNMT3b depletion in vitro and in vivo. It was suggested that DNMT3b potentiates CRC cell proliferation, migration, and invasion by downregulating FLI1 [45].
In the GWAS comparing LCC with healthy controls, no loci met the strict criteria for statistical significance. However, numerous genes were suggested in both familial and sporadic analyses, and even if none of the loci reach the criteria for statistical significance, it is still possible that some could be of importance as modifier genes, and further studies are warranted before ruling them out.
4. Materials and Methods
4.1. Cases and Controls
Colorectal cancer cases were recruited as part of the Colorectal Cancer Low-risk study [46]. All of the newly diagnosed colorectal cancer cases from 14 hospitals in mid-Sweden were invited to participate. Blood samples were collected from participants between 2004 and 2009. The criteria for case inclusion and exclusion are detailed in a previous paper [47]. In the present study, we used 2358 CRC cases, 547 of which had a family history of CRC in at least one close relative, and 1811 of which were sporadic CRC cases, lacking a family history of CRC. In total, 1642 healthy men and women from the same Swedish geographical area served as controls in all six analyses. The controls consisted of 1106 healthy blood donors and 536 spouses without a family history of cancer. RCC tumors were defined as those located in the ascending colon, the right flexure, and the transverse colon. LCC tumors were defined as those occurring in the splenic flexure, descending colon, sigmoid colon, and rectum. The demographic and clinical features of the CRC cases that were selected for this study are described in detail in Table 3.
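As a quick consistency check, the cohort totals given here can be rebuilt from the per-site case counts reported in the Results. The sketch below uses only numbers stated in the text; the dictionary layout and variable names are illustrative choices.

```python
# Per-site case counts as reported in Sections 2.1-2.3.
cases = {
    "familial": {"cecum": 73, "RCC": 134, "LCC": 340},
    "sporadic": {"cecum": 244, "RCC": 403, "LCC": 1164},
}
# Control groups as reported in Section 4.1.
controls = {"blood_donors": 1106, "spouses": 536}

totals = {group: sum(sites.values()) for group, sites in cases.items()}
total_cases = sum(totals.values())
total_controls = sum(controls.values())

print(totals)          # familial and sporadic subtotals
print(total_cases)     # all CRC cases
print(total_controls)  # all healthy controls
```

The subtotals match the 547 familial and 1811 sporadic cases, the 2358 cases overall, and the 1642 controls stated in this section.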
4.2. Genotyping, Quality Control
Peripheral blood was used for DNA extraction according to the standard procedures. The genotyping for both cases and controls was performed at the Center for Inherited Disease Research at Johns Hopkins University, US, using the Illumina Infinium^®^ OncoArray-500K (Illumina, San Diego, CA, USA) [28]. The first and second quality controls were performed within the CORECT (http://epi.grants.cancer.gov/gameon/ (accessed on 6 March 2026)) consortium and at Karolinska Institutet [28,47].
4.3. Haplotype Analysis and Statistics
Six haplotype association analyses were conducted using the software PLINK v.1.07 [48]. We employed a sliding-window approach, moving windows of predefined lengths from 1 to 25 SNPs across the genotyped loci in the 5′ to 3′ direction [48]. As information on chromosome phasing is lacking in genotype data, PLINK v.1.07 applies the expectation-maximization algorithm to estimate the haplotype frequencies within each window through statistical inference [49]. The population frequency (F) is an estimation based on the number of samples (cases and controls) used for each haplotype. All possible haplotypes within each window are tested. PLINK further investigates associations between the estimated haplotypes and CRC via logistic regression. The default minimum haplotype frequency cutoff of 0.01 was applied, excluding haplotypes with a frequency below 1% from individual testing and grouping them as a single rare category. All estimated haplotypes within each window were tested, with an arbitrarily selected haplotype serving as the reference [50]. PLINK provided ORs, Wald test statistics (squared t), and p-values for each haplotype. All of the analyses used the GRCh37 genome build. We applied the established genome-wide significance threshold for SNP GWAS (p < 5 × 10^−8^) [51]. The analysis involves multiple tests of each SNP. If all SNPs had two possible genotypes, a total of 2^50^ possible haplotypes would be generated. However, typically, the number of generated haplotypes for each locus is less than 25. Using haplotype windows of up to 25 SNPs thus generates several haplotypes representing the same unique region, each varying in length from 1 to 25 SNPs. This means that the number of haplotypes generated per SNP varies with SNP variability across haplotypes. This is described in detail for the first significant locus on chromosome 1 in the analysis of the patients with familial cecal cancer, in Supplementary Table S1 and Table 4a,b. 
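To illustrate why each SNP is tested many times, the sketch below enumerates sliding windows of 1 to 25 consecutive SNPs and counts how many of them cover a given SNP. It is a minimal illustration of the windowing scheme only, not a reproduction of PLINK's internals; the function names and the 100-SNP region are hypothetical.

```python
from typing import Iterator, Tuple

MAX_WINDOW = 25  # maximum window length in SNPs, as stated in the text


def sliding_windows(n_snps: int, max_len: int = MAX_WINDOW) -> Iterator[Tuple[int, int]]:
    """Yield (start, length) for every window of 1..max_len consecutive SNPs."""
    for length in range(1, max_len + 1):
        for start in range(0, n_snps - length + 1):
            yield start, length


def windows_covering(snp_index: int, n_snps: int, max_len: int = MAX_WINDOW) -> int:
    """Count the windows that include the SNP at position snp_index (0-based)."""
    return sum(
        1
        for start, length in sliding_windows(n_snps, max_len)
        if start <= snp_index < start + length
    )


# Hypothetical region of 100 genotyped SNPs.
n = 100
print(sum(1 for _ in sliding_windows(n)))  # total number of windows
print(windows_covering(50, n))             # windows covering an interior SNP
print(windows_covering(0, n))              # windows covering the first SNP
```

An interior SNP falls inside far more windows than a SNP at the edge of the region, and each window can yield several estimated haplotypes, which is why one region ends up represented by many overlapping haplotypes.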
Table 4a shows all haplotypes with p < 2.5 × 10^−6^, and Table 4b shows all suggested haplotypes regardless of the p-value. The same SNP sequence recurs in many of the generated haplotypes, with varying and less stringent p-values; these variants are shown in bold in Table 4a,b. Table 4b gives the more detailed PLINK output for this locus, including haplotypes with even less stringent p-values. For each locus, only the haplotype with the best p-value among all haplotypes representing that locus is selected and presented as one locus under Results. Thus, in the example in Table 4a,b, the haplotype TAGACAG at position Chromosome 1: 237,203,483–237,257,650 (GRCh37) corresponds to this locus in Table 1 (where the positions have been converted to GRCh38). Due to the substantial computational demands, the analyses were run on high-performance computers at the UPPMAX Bianca cluster, which is part of the Uppsala Multidisciplinary Center for Advanced Computational Science.
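The selection of one representative haplotype per locus can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes a "locus" is a run of haplotypes on the same chromosome whose intervals overlap, and the `Hap` record, its fields, and the example values are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Hap:
    chrom: str    # chromosome label
    start: int    # first base-pair position of the haplotype
    end: int      # last base-pair position of the haplotype
    alleles: str  # haplotype allele string
    p: float      # association p-value


def best_per_locus(haps: List[Hap]) -> List[Hap]:
    """Group overlapping haplotypes into loci and keep the lowest p-value in each."""
    best: List[Hap] = []
    locus_chrom: Optional[str] = None
    locus_end = -1
    for h in sorted(haps, key=lambda x: (x.chrom, x.start, x.end)):
        if h.chrom == locus_chrom and h.start <= locus_end:
            # Same locus: extend its span and keep the better haplotype.
            locus_end = max(locus_end, h.end)
            if h.p < best[-1].p:
                best[-1] = h
        else:
            # New locus starts here.
            best.append(h)
            locus_chrom, locus_end = h.chrom, h.end
    return best


# Made-up records for illustration only.
example = [
    Hap("1", 100, 200, "AG", 1e-6),
    Hap("1", 150, 260, "AGT", 3e-9),
    Hap("1", 250, 300, "GT", 2e-7),
    Hap("2", 100, 180, "CC", 4e-8),
]
for hap in best_per_locus(example):
    print(hap.chrom, hap.alleles, hap.p)
```

Defining a locus by interval overlap is one simple choice; a real analysis might instead group haplotypes by shared SNP content or by a fixed distance.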
5. Conclusions
Cancers arising in different sides of the colon are known to have very different biology and outcomes. Our haplotype-based GWAS revealed site-specific genetic risk profiles and genes in both familial and sporadic colorectal cancer, highlighting the importance of anatomical context in understanding the tumor initiation and progression at these sites. No risk loci reached statistical significance in LCC, suggesting that more important genetic risk factors are present in cases of cancer arising in the proximal colon. Our findings of new cancer genes may, after future studies, lead to new knowledge and personalized approaches in CRC risk assessments, treatment strategies, or prognostic predictions.
References
- 1. Torre L.A., Bray F., Siegel R.L., Ferlay J., Lortet-Tieulent J., Jemal A. Global cancer statistics, 2012. CA Cancer J. Clin. 2015, 65, 87–108. doi:10.3322/caac.21262. PMID: 25651787.
- 2. Kostouros A., Koliarakis I., Natsis K., Spandidos D.A., Tsatsakis A., Tsiaoussis J. Large intestine embryogenesis: Molecular pathways and related disorders. Int. J. Mol. Med. 2020, 46, 27–57. doi:10.3892/ijmm.2020.4583. PMID: 32319546. PMCID: PMC7255481.
- 3. Waldstein S., Spengler M., Pinchuk I.V., Yee N.S. Impact of colorectal cancer sidedness and location on therapy and clinical outcomes: Role of blood-based biopsy for personalized treatment. J. Pers. Med. 2023, 13, 1114. doi:10.3390/jpm13071114. PMID: 37511727. PMCID: PMC10381730.
- 4. Petrelli F., Tomasello G., Borgonovo K., Ghidini M., Turati L., Dallera P., Passalacqua R., Sgroi G., Barni S. Prognostic survival associated with left-sided vs right-sided colon cancer: A systematic review and meta-analysis. JAMA Oncol. 2017, 3, 211–219. doi:10.1001/jamaoncol.2016.4227. PMID: 27787550.
- 5. Huyghe J.R., Harrison T.A., Bien S.A., Hampel H., Figueiredo J.C., Schmit S.L., Conti D.V., Chen S., Qu C., Lin Y. Genetic architectures of proximal and distal colorectal cancer are partly distinct. Gut 2021, 70, 877–888. doi:10.1136/gutjnl-2020-321534. PMID: 33632709. PMCID: PMC8223655.
- 6. Janin N. A simple model for carcinogenesis of colorectal cancers with microsatellite instability. Adv. Cancer Res. 2000, 77, 189–221. doi:10.1016/s0065-230x(08)60788-5. PMID: 10549359.
- 7. Chao H.M., Wang T.W., Chern E., Hsu S.H. Regulatory RNAs, microRNA, long-non coding RNA and circular RNA roles in colorectal cancer stem cells. World J. Gastrointest. Oncol. 2022, 14, 748–764. doi:10.4251/wjgo.v14.i4.748. PMID: 35582099. PMCID: PMC9048531.
- 8. Tomlinson I., Webb E., Carvajal-Carmona L., Broderick P., Kemp Z., Spain S., Penegar S., Chandler I., Gorman M., Wood W. A genome-wide association scan of tag SNPs identifies a susceptibility variant for colorectal cancer at 8q24.21. Nat. Genet. 2007, 39, 984–988. doi:10.1038/ng2085. PMID: 17618284.
