A CUSUM approach to the detection of copy-number neutral loss of heterozygosity
Murilo S. Pinheiro, Alu\'isio S. Pinheiro, Denilon S. Carvalho

TL;DR
This paper introduces a CUSUM-based algorithm for detecting copy-number neutral loss of heterozygosity (LOH) in DNA, which is crucial for understanding cancer genetics, demonstrating high reliability and speed even with contaminated data.
Contribution
It formulates LOH detection as change-point identification in a mixture model and proposes a novel CUSUM algorithm for efficient detection.
Findings
Fast and reliable detection method
Effective under mild contamination
Applicable to copy-number neutral LOH
Abstract
Several genetic alterations are involved in the genesis and development of cancers. The determination of whether and how each genetic alterations contributes to cancer development is fundamental for a complete understanding of the human cancer etiology. Loss of heterozygosity (LOH) is one of such genetic phenomenon linked to a variate of diseases and characterized by the change from heterozygosity (the presence of both alleles of a gene) to to homozygosity (presence of only one type of allele) in a particular DNA locus. Thus identification of DNA regions where LOH has taken place is a important issue in the health sciences. In this article we formulate the LOH detection as the identification of change-points in the parameters of a mixture model and present a detection algorithm based on the cumulative sums (CUSUM) method. We found that even under mild contamination our proposal is a…
Click any figure to enlarge with its caption.
Figure 1
Figure 2
Figure 3| m = 10 | m = 25 | m = 50 | ||
| Purity = 1 | l = 25 | 0.97 | 0.69 | 0 |
| l = 50 | 0.98 | 0.98 | 0.64 | |
| l = 100 | 0.99 | 0.99 | 0.99 | |
| Purity = 0.79 | l = 25 | 0.94 | 0.33 | 0.0000 |
| l = 50 | 0.97 | 0.97 | 0.06 | |
| l = 100 | 0.99 | 0.9 | 0.90 | |
| Purity = 0.5 | l = 25 | 0.04 | 0.01 | 0 |
| l = 50 | 0.05 | 0 | 0 | |
| l = 100 | 0.07 | 0 | 0 |
| m = 10 | m = 25 | m = 50 | ||
| purity = 1 | l = 25 | 0.94 | 0.99 | 1 |
| l = 50 | 0.95 | 0.99 | 0.99 | |
| l = 100 | 0.95 | 0.99 | 0.99 | |
| purity = 0.79 | l = 25 | 0.95 | 1 | 1 |
| l = 50 | 0.95 | 1 | 1 | |
| l = 100 | 0.94 | 0.99 | 1 | |
| purity = 0.5 | l = 25 | 0.95 | 1 | 1 |
| l = 50 | 0.95 | 1 | 1 | |
| l = 100 | 0.95 | 1 | 1 |
| m = 25 | m = 50 | m = 100 | m = 150 | |
|---|---|---|---|---|
| Execution time (s) | 46.6212 | 47.2890 | 46.2117 | 44.6742 |
| Sensitivity | 0.9517 | 0.9050 | 0.8269 | 0.7958 |
| Specificity | 0.7939 | 0.8706 | 0.9392 | 0.9573 |
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsGenomic variations and chromosomal abnormalities · Cancer Genomics and Diagnostics · Genomics and Rare Diseases
\Author
Murilo S. Pinheiro, Aluísio Pinheiro and Benilton S. de Carvalho,
\AuthorRunningPinheiro et al. et al. \Affiliations University of Campinas, Brazil \CorrAddressAluísio Pinheiro, Department of Statistics, IMECC University of Campinas, Rua Sérgio Buarque de Holanda, 651 13083-855, Sokolovská 83, D. Barão Geraldo, Campinas, SP, Brazil. \[email protected] \CorrPhone(+55)(19)3521-6080 \CorrFax(+55)(19)3521-6080 \TitleA CUSUM approach to the detection of copy-number neutral loss of heterozygosity \TitleRunningTemplate paper \Abstract Several genetic alterations are involved in the genesis and development of cancers. The determination of whether and how each genetic alterations contributes to cancer development is fundamental for a complete understanding of the human cancer etiology. Loss of heterozygosity (LOH) is one of such genetic phenomenon linked to a variate of diseases and characterized by the change from heterozygosity (the presence of both alleles of a gene) to to homozygosity (presence of only one type of allele) in a particular DNA locus. Thus identification of DNA regions where LOH has taken place is a important issue in the health sciences. In this article we formulate the LOH detection as the identification of change-points in the parameters of a mixture model and present a detection algorithm based on the cumulative sums (CUSUM) method. We found that even under mild contamination our proposal is a fast and reliable method. \KeywordsBeta mixture; EM algorithm; Microarray data; change-point analysis; CUSUM
1 Background
Several genetic alterations such as single base substitution, translocation, copy-number alteration (CNA) and loss of heterozygosity (LOH) are involved in the genesis and development of cancers (Albertson et al., 2003; Beroukhim et al., 2010; Stratton et al., 2009). Determining whether and how genetic alterations contribute to cancer development is paramount for human cancer etiology. One of the most common genotyping tool for the identification of those altered regions are single nucleotide polymorphism arrays (SNP Arrays). With resolution up to one marker for every 100 bp SNP Arrays greatly increased the ability of geneticists to explore the structure of DNA and its effect on health and disease. Although recent technology offers even greater resolution the cost of SNP-A makes it a widely used technology to this day.
Copy-number alteration is a type of structural variation in which a particular region of DNA has a number of copies that differs from the expected two in the diploid genome. These alterations can be of inherited origin or the consequence of somatic mutations occurring during the development of a tumor. Although copy-number variations not always malignant, they are a key event in the development of a variety of human cancers. Their role in tumor development is not clear, but increased copy-number of regions containing oncogenes and the deletion of tumor suppressor genes have been widely documented (Beroukhim et al., 2010).
The primary interest of this text relates to loss of heterozygosity (LOH). This alteration refers to a change from heterozygosity (the presence of both alleles of a gene) to to homozygosity (presence of only one type of allele) in a particular DNA region. LOH can result from the a deletion in a heterozygous DNA region or more intricate phenomena such as mitotic recombination or non-disjunction in somatic tumor cells (Teh et al., 2005). This latter case of LOH is called copy-number neutral loss of heterozygosity (CNNLOH) and gained increased attention from geneticists since SNP-A made possible their detection in unpaired tumor samples (Bignell et al., 2004; Huang et al., 2004). LOH in tumor development is often associated to the inactivation of tumor suppressor genes (Jänne et al., 2004).
The development of SNP Arrays technology made it possible for geneticists to look for CNAs and LOH regions in paired and unpaired tumor samples as described below. Two measurements are obtained from such experiments: the first one, called log R ratio (LRR), is given by
[TABLE]
where is the sum of observed measures for each possible allele and is the expected sum value. The second one, B allele frequency (BAF)
[TABLE]
where is a measure of relative allele frequency. It is clear from those definitions that LRR is related to the copy-number of a locus, while BAF is relative to one of the alleles proportion.
We present a new method for detecting CNNLOH regions based exclusively in the BAF sequence. Segments of the DNA where LOH has taken place are detected in the BAF plot as the absence of a central band (or bands in the case of copy-number four), as is illustrated in Figure 1.
Several approaches have been proposed for the identification of CNA and LOH in paired and unpaired comparative genomic hybridization and SNP Arrays data, such as agglomerative clustering (Wang et al., 2005), penalized likelihood (Picard et al., 2005), circular binary segmentation (Olshen et al., 2004), piecewise linear models (Muggeo and Adelfio, 2010) and hidden Markov models (HMM) (Yau, 2013; Wang et al., 2007; Beroukhim et al., 2006; Ha et al., 2012; Lin et al., 2004). The HMM-based approach is by far the most widely adopted. Our approach to LOH will be one of identifying changes in the proportion parameters of a mixture model by means of statistical process control (SPC) tools and, thus, without imposing any restriction on the statistical distribution of change-points.
2 Method
2.1 Statistical Model
We assume that is the BAF data resulting from a SNP Arrays study and that we already know the copy number in this sequence is two. We are interested in detecting regions of this sample where the BAF data appears to be lacking the central band. This means that differences within the upper and lower strips are irrelevant for our purposes. For this reason we perform the following transformation on the data: . This transformed sample will be called transformed BAF (tBAF).
The tBAF data follows a pattern illustrated in Figure 2. It can be seen that for non-LOH regions there are clearly two bands in the tBAF plot, one running near and another near [math]. We will adopt the convention of calling upper band the band that runs near and lower band any band running below the upper one. The presence of LOH can be identified in the tBAF plot by the absence of an identifiable lower band.
We propose adopting the mixture of two distributions as a model for the tBAF distribution. The first distribution describes the the upper band stochastic behavior whilst the second component should yield the lower band behavior. Take as the density associated to observations in the lower band and for the observations on the upper band. Our model for both non-LOH and LOH regions will be of the form:
[TABLE]
where is the tBAF vector of observations, is a vector containing all the necessary parameters for characterizing our distribution and is the probability of drawing a observation from . The parameter can be interpreted as the probability of observing a heterozygous DNA locus and will be called here the homozygosity level. In some methods it is assumed known (Chen et al., 2013) but its availability for a particular combination of platform and population is not always warranted. For this reason we propose an estimation method that requires only the available sample.
We have argued that what characterizes a LOH region is the lack of any lower band. This can be rephrased as a difference in the parameter in our statistical model, i.e., we expect that the density function describing LOH regions have a much smaller coefficient associated to observing a value drawn from . From this observation we can formulate two such models:
[TABLE]
[TABLE]
where is associated to non-LOH regions, to LOH regions and . In fact, since the LOH is supposed to lack it’s lower band, we may assume that with .
We adopt the one inflated beta (OIB) density function (Ospina and Ferrari, 2010), which has the form:
[TABLE]
where is the probability of observing a and .
The choice of , which describes the lower band, is the zero inflated beta (ZIB) distribution (Ospina and Ferrari, 2010), whose density is given by:
[TABLE]
where is the probability of observing a [math] and .
The estimation of the parameters in the proposed model is performed by Expectation Maximization (EM) algorithm (givens) application to a set of observations where CNNLOH is absent. This set can be obtained from the application of the microarray technology to somatic tissue or by visually inspecting the tBAF sequence and identifying regions without LOH.
2.2 LOH Calling
As discussed in subsection 2.1 the detection of LOH regions in a tBAF sequence can be formulated in the framework of statistical process control (SPC), i.e., the change from non-LOR, in control, to LOH, out of control, or vice-versa. Despite this connection there has been little work on the application of this perspective to the problem of segmenting SNP Arrays data, the work of Li et al. (Li et al., 2009) being the only example known to the authors.
There are a number of possible methods to perform SPC in a sequence of observations (Basseville et al., 1993). The CUSUM method, which was first proposed by Page in Page (1954a), is one of the less commonly adopted (HAWKINS, 1993). The reason is that the CUSUM plots are especially designed to identify a change in a parameter value (or a distribution), to a known value (or distribution) after a change. This assumption of knowledge regarding prior and after change parameters values is usually not reasonable in real applications (Basseville et al., 1993).
Although usually seen as a disadvantage this characteristic of the CUSUM plot is taken here as a advantage. As was shown in subsection 2.1, we are able to provide good approximations for in-control and out-of-control distributions previous to the application of any SPC tool. The CUSUM is not only specially adapted to the situation of a change in known distributions but was shown to be optimal (Moustakides, 1986) and also asymptotically optimal Lorden (1971) if one sets the average delay to detection minimization as the optimality criterion, i.e., it is (in some sense) the fastest method to identify a change-point.
2.3 CUSUM Change-point Detection Procedure
We now provide a detailed account of the CUSUM algorithm for our application. For notational convenience we will always assume that a transition of to is to be detected.
We suppose known whether the actual segment is of the non-LOH type. This means that we are working under the assumption that the observations are independent realizations of . We say that is the assumed model. To detect a transition from to the CUSUM algorithm sequentially computes the instantaneous log-likelihood: and its cumulative sums: ; .
The presence of a change-point is flagged, at time , if is the first cumulative sum to be greater than the alarm threshold .
When the CUSUM detects the presence of a change-point the next step is to estimate its location. We adopt a maximum likelihood approach to solve this problem. Let the supposed model be , that we began the cumulative sum at and that the alarm time was . Our estimate of the change-point is:
[TABLE]
Once the change-point is estimated, all observations between the previous change-point and the newly discovered one are said to be realizations of (non-LOH) and the CUSUM algorithm restarts at the last change-point now considering as the assumed model. When the assumed model is the CUSUM statistic changes to and the threshold to . A similar procedure is then carried out until the next change-point comes by and we return to after identifying a LOH region.
The main issue with the CUSUM algorithm is the correct choice of the two alarm thresholds. Usually those thresholds are found based on the average run length functions for a threshold value , for . The index is that of the assumed model and that of the observed model, i.e., if and the assumed model is and the observations are been generated from . It follows that, for example, is the expected time to call a change-points from to and is the amount of time to a false alarm when the assumed model is .
Page (1954a) shows that the ARL function is the solution for an integral equation of the Fredholm’s type but an analytical solution does not always exist. To overcame this difficulty numerical approximations have been suggested (Goel and Wu, 1971; Page, 1954b; Siegmund, 2013; Brook and Evans, 1972; Lucas, 1982). In the next subsection we propose a novel approach to the threshold selection.
2.4 Threshold Approximation
Here we do not base threshold values on the previous approximations of the ARL function. The choice of a particular value of average run length is difficult and it has no theoretical meaning to geneticists performing the analysis. We propose a criterion for selecting a threshold based on the a definition of segments with small length and a level of tolerance for segments with smaller lengths as follows.
We assume that a index can only be called a change-point if the following segment has length at least . This restriction of a minimum length can be integrated in the CUSUM method by imposing a condition on the probability of raising an alarm only after changes that persist for at least observations. Take as our supposed model and let be a sequence of independent realizations of . Consider , , the sequence of sums resulting from an application of the CUSUM algorithm to . Given a threshold and a probability , the previous restriction on the minimum length of a segment can be formulated as:
[TABLE]
where and is the predefined level of tolerance. This restriction can be interpreted as imposing the condition of only raising an early alarm when there is abundant evidence of a change-point.
For any two given values of and there are infinitely many values of such that Equation 1 is valid. One of these values is the th quantile of the statistical distribution. This value of can be estimated by bootstrapping the distribution of from simulations of the estimated distribution . The threshold for the assumed model can be found in a similar way.
3 Computational Studies
In this section we present a set of computational simulations where we evaluated the ability of the presented segmentation method to correctly identify regions with and without LOH along a BAF sequence. In order for the study to be as close as possible to the situations encountered in practice we will use the BAF samples available at the ”acnr” R package Pierre-Jean and Neuvial (2016).
We simulate a set of sequences of the form by randomly sampling with replacement observations available in the ”acnr” package. The structure of the sequences are always as follows: we have selected from the population with neutral number of copies and without LOH, then are selected from the population with CNNLOH and finally are sampled from the population with copy-number neutral and without LOH. We use in our study the values of . In addition, we consider the cases where the purity of the sample can assume the values .
In our algorithm we always take and as the parameter for LOH model and threshold calculation, respectively. We also consider .
In order to evaluate the segmentation quality, we note that the problem of detecting LOH regions can be seen as a classification problem with only two categories. In our case we call negative (0) observations in regions without LOH and positive (1) the observations in regions with LOH. Then for each simulated sequence we have a base sequence of zeros and ones describing to which class each observation belongs. If is the sequence of zeros and ones indicating the classification using the proposed method, we define
[TABLE]
[TABLE]
We use the measures of sensitivity and of specificity to evaluate the performance of our method. Table 1 and Table 2 present the means of specificity and sensitivity for 100 replicates of each sequence , respectively.
Note that in Table 1 that our method correctively detected regions with LOH in cases with purity equal to one and . When the segmentation quality is noticeably lower. As the purity decreases the segmentation quality also becomes worse. For the case of purity 0.5 the method was unable to correctly identify the LOH regions in the vast majority of cases. For 0.79 purity the situation is not as severe with the case , which yields reasonably good results.
Table 2 shows that our proposed method does not overestimate the regions with LOH but miss some of then when the value of is close or bigger than the region length or when the sample purity is close to 0.5.
4 Real Data Application
We will now apply the proposed method to a real data set. The data set we utilize consists of 482 benign and tumor samples from 259 men with prostate cancer studied in Ross-Adams et al. (2015).
We apply the oncoSNP segmentation procedure (Yau, 2013) to all tumoral samples available in ”https://www.ncbi.nlm.nih.gov/geo/” and separate all the segments where the method accused a neutral number of copies. In each of these segments we apply our segmentation method and compare the detection of regions with LOH to the one made by oncoSNP again using sensitivity and specificity assuming that the result of oncoSNP is the gold standard.
We visually select a segment with 1800 observations within chromosome 1 and estimate our models using this segment. The other parameters of our method were chosen as , and we used 10000 simulations to estimate the two threshold values. Table 3 presents the sensitivity and specificity results.
It is clear that for all values of our method correctly detected most of the LOH regions pointed out by oncoSNP. As one would expect smaller values of detect more oncoSNP segments than do greater values of . In terms of specificity the opposite is true: greater values of result in greater specificity. This is also not surprising.
In terms of execution time our procedure shows a great advantage in comparison to oncoSNP. The worst mean execution time of our procedure is 47.29 seconds. Also if we remove the time necessary for data loading this times reduces to,a maximum of, 9.72 seconds. This is a very small execution time when compared to the minuties mean execution time for oncoSNP.
To explore the segmentation features we choose a small region of one of the segments and look at the regions of LOH detected by oncoSNP and by our method. We look at the segmentation made by three different parameter choices for our method: and , and , and . In all cases we adopt and 10000 simulations in the thresholds estimation process. The segmentation results are presented in Figure 3 where the order of annotation of the graphs (a), (b), (c) and (d) follows the order in which we presented the algorithms.
The first thing we note in Figure 3 is that the oncoSNP identifies only two regions with LOH and that the regions detected by oncoSNP are also detected by the proposed method. The largest of these regions coincides with that identified by oncoSNP and the largest number found by our methods justifies specificity values between 0.70 and 0.90.
The effect of choosing is not difficult to interpret and is clearly justified by the Figure 3. Smaller values of provide a segmentation with smaller identified regions. The effect of is more subtle and can be seen in Figure 3 by a reduction on the first identified segment in panels (c) and (d). This happens because decreases the method’s tolerance for the existence of points close to 0.5 within regions declared to contain LOH. Note that in the largest region with identified LOH there is a near-half observation for all segmentations built by our methods. This is because this observation is isolated within the segment and therefore its influence is not as strong as that in the first segment identified by the last two methods considered.
5 Discussion
The segmentation method we propose is conceptually simple in addition to being easily implemented. A mixture of inflated betas models the BAF data allowing a fast model estimation procedure with the help of the EM algorithm. We segment the data set with the CUSUM technical which originated in the statistical process control and has optimal characteristics, resulting in a fast and accurate segmentation.
The great advantages of the method are that the estimation procedure can be performed in a small portion of the data set, result in a faster execution time in comparison to the methods that use HMM-based approaches. Also the built segmentation features can be adjusted by correctly selecting the parameters , associated to the minimal length of a segment, and , related to the level of tolerance to near-half observations inside LOH regions. The method is robust to mild levels of contaminations.
The fast performance characteristic of the proposed method is of great importance to the analysis of most recent array technology given the large amount of data resulting from application of said technologies. For example most recent SNP arrays can produce more than 2millions observations and whole-genome shotgun sequencing produce approximately 3billions observations, one for each base in the DNA chain. The sheer size of those numbers make clear the need for fast methods in the analysis of genetic related experiments. We believe that our proposed method is a advancement in this direction.
Acknowledgements
We want to thank FAPESP (grant 2013/00506-1) and CAPES for providing the resources for the realization of this project.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1Albertson et al. (2003) Albertson, D. G., Collins, C., Mc Cormick, F., and Gray, J. W. (2003). Chromosome aberrations in solid tumors. Nature genetics , 34 (4), 369–376.
- 2Basseville et al. (1993) Basseville, M., Nikiforov, I. V., et al. (1993). Detection of abrupt changes: theory and application , volume 104. Prentice Hall Englewood Cliffs.
- 3Beroukhim et al. (2006) Beroukhim, R., Lin, M., Park, Y., Hao, K., Zhao, X., Garraway, L. A., Fox, E. A., Hochberg, E. P., Mellinghoff, I. K., Hofer, M. D., et al. (2006). Inferring loss-of-heterozygosity from unpaired tumors using high-density oligonucleotide snp arrays. P Lo S Comput Biol , 2 (5), e 41.
- 4Beroukhim et al. (2010) Beroukhim, R., Mermel, C. H., Porter, D., Wei, G., Raychaudhuri, S., Donovan, J., Barretina, J., Boehm, J. S., Dobson, J., Urashima, M., et al. (2010). The landscape of somatic copy-number alteration across human cancers. Nature , 463 (7283), 899–905.
- 5Bignell et al. (2004) Bignell, G. R., Huang, J., Greshock, J., Watt, S., Butler, A., West, S., Grigorova, M., Jones, K. W., Wei, W., Stratton, M. R., et al. (2004). High-resolution analysis of dna copy number using oligonucleotide microarrays. Genome research , 14 (2), 287–295.
- 6Brook and Evans (1972) Brook, D. and Evans, D. (1972). An approach to the probability distribution of cusum run length. Biometrika , 59 (3), 539–549.
- 7Chen et al. (2013) Chen, G. K., Chang, X., Curtis, C., and Wang, K. (2013). Precise inference of copy number alterations in tumor samples from snp arrays. Bioinformatics , 29 (23), 2964–2970.
- 8Goel and Wu (1971) Goel, A. L. and Wu, S. (1971). Determination of arl and a contour nomogram for cusum charts to control normal mean. Technometrics , 13 (2), 221–230.
