Riemannian Trust Region Method for Haplotype Assembly
Mohamad Mahdi Mohades, Mohammad Hossein Kahaei

TL;DR
This paper introduces a Riemannian trust region optimization method to accurately solve the haplotype assembly problem by modeling it as a sphere-constrained maximization, effectively escaping local maxima and saddle points.
Contribution
It presents a novel manifold optimization approach using trust region methods for haplotype assembly, addressing nonconvexity challenges.
Findings
High accuracy in haplotype estimation demonstrated
Effective escape from local maxima and saddle points
Outperforms existing methods in simulation results
Abstract
In this letter we model the Haplotype assembly problem (HAP) as a maximization problem over an -dimensional sphere. Due to nonconvexity of the feasible set, we propose a manifold optimization approach to solve the mentioned maximization problem. To escape local maxima as well as saddle points we utilize trust region method. Simulation results show that our proposed method is with high accuracy in estimation of Haplotype.
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsRNA Research and Splicing · Molecular Biology Techniques and Applications · RNA regulation and disease
Riemannian Trust Region Method for Haplotype Assembly
Mohamad Mahdi Mohades and Mohammad Hossein Kahaei The authors are with the School of Electrical Engineering, Iran University of Science & Technology, Tehran 16846-13114, Iran (e-mail: [email protected]; [email protected]).
Abstract
In this letter we model the Haplotype assembly problem (HAP) as a maximization problem over an -dimensional sphere. Due to nonconvexity of the feasible set, we propose a manifold optimization approach to solve the mentioned maximization problem. To escape local maxima as well as saddle points we utilize trust region method. Simulation results show that our proposed method is with high accuracy in estimation of Haplotype.
Index Terms:
Haplotype assembly, Manifold optimization, Riemannian trust region.
I Introduction
Haplotype is a string of single nucleotide polymorphisms (SNPs) of chromosomes [1]. Haplotypes are useful in studying human evolutionary history and drug discovery and development. Haplotypes can be assembled utilizing sequenced reads, where each read is a fragment of the two chromosomes [2]. Such assembly can be performed by solving mathematical models of sequenced reads. For diploid organisms, each SNP site is considered to be either or . This means that the result of haplotype assembly from the sequenced reads ought to be a vector of elements. Moreover, the obtained haplotype, say , is corresponding to one of the two chromosomes and is corresponding to the other one.
One mathematical approach to find the haplotype is matrix completion. In this approach a read matrix, say , containing sequenced reads is created. The elements of this matrix are either corresponding to the reads or where there is no read. Then, it is required to replace the elements of the matrix with or so that the rank of the newly generated matrix, say , be one. Then the factorization gives the haplotype through applying sign function over . When the reads are affected by noise, the sign of some of the entries of the matrix is changed. In this case using the following optimization problem the completed matrix is obtained. Please note that in HAP, whether to estimate or , the estimation is accurate.
[TABLE]
where is the sampling operator acting as follows
[TABLE]
and is the set of the observations. There exist some approaches to solve the problem (1). Based on [3], the problem (1) can be directly solved over the manifold of rank-one matrices. Also, a variation of the formulation of the problem (1) has been solved in [4] over the Cartesian product of Grassmann manifolds. In fact, instead optimizing over the variable , the problem is solved over the variables , and where is the singular value decomposition of . Consider the following equivalent optimization problem for the problem (1).
[TABLE]
where is a given value. In [5] the nonconvex objective function has been replaced by the nuclear norm and the minimization problem turns into a convex one [6]. Moreover, the problem (1) can be modeled as the following optimization problem.
[TABLE]
Alternating minimization is utilized to solve the above nonconvex problem [7]. In [8], the authors have given an example to show that the proposed objective function of the problem (1) is not a reliable cost function in haplotype assembly. They instead have proposed an objective function to truly model the haplotype assembly problem. Using simulation results they have shown their approach is more accurate than previously proposed algorithms in haplotype assembly .
Apart from the aforementioned optimization approaches, it has been proposed to minimize the following minimum error correction (MEC) function to estimate the haplotype.
[TABLE]
where is the -th row of and
[TABLE]
where equals [math] for the same inputs and otherwise. It is NP-Hard to obtain the optimal solution of the MEC problem (5). Therefore, heuristic methods have been proposed to solve the problem (5). For example, in [9], a heuristic combinatorial approach is proposed to find the haplotype. However, no provable guarantee for a good solution is proposed therein. Another heuristic approaches can be found in [10] and [11]. In [12], the authors have firstly offered the following optimization problem to find the haplotype,
[TABLE]
where is the adjacency matrix defined to evaluate the similarity of each pair of rows of the read matrix and is the th entry of . Then, they proposed a semidefinite program and solved it rapidly and accurately.
In this paper, we talk about some facts about the haplotype assembly problem (HAP). Then, based on such facts we propose a maximization problem to estimate haplotypes. Note that unlike the maximization problem 7, our formulation directly uses the reads and moreover is not a quadratic form. The proposed problem is defined over the -dimensional sphere. We use an algorithm to solve the mentioned problem and prove the convergence of the algorithm. Simulation results illustrate that for a large observation error probability, our method outweighs some of the previously proposed methods in the sense of haplotype estimation accuracy.
The remainder of the paper is organized as follows. In Section II some preliminaries are introduced to state our main result. Section III talks about the main result of the paper. Simulation results are presented in Section IV. Conclusion is presented in Section V.
II Preliminaries
In Section III we formulate HAP as a manifold optimization problem. To do so, we require some related concepts as discussed in the following.
Definition 1**.**
[13]** Let be a manifold. The set of all tangent vectors at point is called the tangent space at point and denoted by . Moreover, the disjoint collection of the tangent spaces is called tangent bundle and denoted by .
Definition 2**.**
A given subset of the manifold along with a bijective mapping between and an open subset of consist a pair which is called a -dimensional chart of .
Definition 3**.**
[13]** A differentiable bijective mapping between two manifolds is called diffeomorphism provided that its inverse is differentiable too.
Definition 4**.**
[13]** A tangent vector field on is a smooth function which assigns to each point of the manifold a tangent vector belonging to tangent bundle. The gradient of a real valued smooth function over a manifold is an example of vector field which is denoted by . For example, tangent vector field over sphere is specified as:
[TABLE]
where is the gradient over the Euclidean space and is the identity matrix [13].
Definition 5**.**
[13]** Let the manifold be a subset of the manifold . When the manifold topology of coincides the induced topology of , is called embedded submanifold.
Definition 6**.**
[13]** A manifold endowed with a smoothly varying inner product, is called Riemannian manifold. The inner product over the manifold is denoted by either or . Moreover, illustrates the restriction of the inner product to the tangent space . Also, the metric induced by this norm is called Riemannian distance and denoted by .
Definition 7**.**
[13]** A locally distance minimizing curve over a manifold is called geodesic. Specifically, straight lines over Euclidean spaces are geodesics.
Definition 8**.**
[13]** A globally distance minimizing curve over a manifold is called minimizing geodesic. For example, is the minimizing geodesic between points over sphere as:
[TABLE]
where , , and . For simplicity we consider .
Definition 9**.**
[13]** Let be a Riemannian manifold whose tangent bundle is . Exponential map is a mapping from to so that for , is equal to at time ; where is the unique geodesic starting from the base point of with velocity at time [math].
Definition 10**.**
[13]** Injectivity radius of the manifold is defined as follows,
[TABLE]
where shows the restriction of the mapping to the ball . Moreover, is a normal neighborhood.
Definition 11**.**
[13]** Let own a positive injectivity radius of . Then the real valued function on is () provided that
* is differentiable,* 2. 2.
* with , there exists for which*
[TABLE]
where is the unique minimizing geodesic satisfying and . Moreover, is an isometry operator; which translates the tangent vector to the tangent space making it possible to differentiate the tangent vectors and , and is called parallel translation. Also, is obtained by taking infimum over the length of all curves joining to (see formula (3-30) of **[13]**). For example, for the compact embedded submanifold we have
[TABLE]
where .
Proposition 1**.**
(Lemma 7.4.7 of [13]) Consider as a normal neighborhood of and as a continuously differentiable tangent vector field over . Also, let be the unique minimizing geodesic with , , and . Then :
[TABLE]
where is Riemannian connection which generalizes the concept of directional derivative of a vector field. Specifically, for Sphere at point is as follows:
[TABLE]
where is the conventional directional derivative on a Euclidean space.
Definition 12**.**
[13]** The Riemannian Hessian of the real valued function on the Riemannian manifold at is defined as follows:
[TABLE]
To solve a manifold optimization problem, Riemannian Trust Region (RTR) method can be utilized. RTR methods utilize second order geometry of the cost function which lets them escape saddle points and obtain better results compared to first order line search methods [13]. RTR algorithm to minimize given cost function over the manifold is presented in Table 1. Before presenting RTR algorithm, let us define the trust region subproblem.
Definition 13**.**
[13]** Let be a real valued cost function over the manifold , the following quadratic optimization problem is called trust region subproblem,
[TABLE]
where is some symmetric operator on .
[TABLE]
Proposition 2**.**
[13]** Let be a sequence generated by Alg. 1. Also, the following conditions are satisfied,
Mapping is and bounded below at the level set , 2. 2.
Mapping is , 3. 3.
There exist and such that the retraction function (see Definition 4.1.1 of **[13]**) satisfies
[TABLE] 4. 4.
Mapping is radially , i.e., such that
[TABLE]
for all , and with . 5. 5.
There is a constant such that for all , where is the symmetric operator defined in Definition 16 at iteration . 6. 6.
Any obtained in Step 1 of Alg. 1 satisfies inequality
[TABLE]
for some constant , where is the operator norm of . Then, the following holds:
[TABLE]
III Main result
In this section we fistly present some facts in regard of HAP. Then, we propose an optimization problem to estimate haplotypes. Finally, we discuss the convergence of an algorithm for solving the proposed optimization problem.
Let be the noiseless read matrix and be the completion of . Then, the following statements hold.
is a rank one matrix for which there exists the factorization . 2. 2.
There is no difference for HAP to estimate as or . In other words, is equivalent to . 3. 3.
There exists a unique maximizer for the problem,
[TABLE]
up to the equivalence of and . 4. 4.
It can be verified that (21) is equivalent to the following optimization problem over Sphere ,
[TABLE]
We therefore propose the next optimization problem to estimate the haplotype for the noisy HAP,
[TABLE]
Please note that, even though objective function of Problem 21 is convex, we intend to maximize the objective function. Therefore, the solution of the optimization problem cannot be trivially obtained through convex optimization approaches. Moreover, due to Alg. 1 is a descent algorithm, we rewrite Problem (23) in the following form:
[TABLE]
Let us make the objective function of (24) differentiable to easily use smooth optimization approaches for finding the solution. For this purpose, we propose the subsequent problem,
[TABLE]
where is the th row of the matrix and is a very small positive value. Please note that for , the objective function of (25) turns into the objective function of (24).
Theorem 1**.**
For the optimization problem (25), Conditions of Proposition 2 are satisfied and consequently Alg. 1 converges.
Proof.
We require to prove that the conditions of Proposition 2 are satisfied for in problem (25).
Condition 1; It is easy to see that is continuous. Let be a chart for , then is differentiable over if is differentiable [13]. Let , then, a straightforward choice for is to choose so that , when . Accordingly, it is easy to see that is differentiable and subsequently is differentiable. Therefore, is . Moreover, for it is obvious that is lower bounded.
Condition 2; Let be the geodesic defined in Definition (8) with , and distance function be as Equation (12). Now, let . Being aware of the fact that , specifically for , it is easy to verify that,
[TABLE]
Then, by considering as the gradient vector field in equality (13), we have:
[TABLE]
where we used the isometry property of . Using Equation (14), we have:
[TABLE]
By calculating the Euclidean gradient for the cost function of Problem (25) as
[TABLE]
it is not difficult to verify that
[TABLE]
Then, using Equality (8), Cauchy-Shwarz inequality and considering that and , we will have:
[TABLE]
The aforementioned argument along with Equality (26) show that is with .
Condition 3; Let be the restriction of retraction function to the tangent space which can be defined as [13]. It is not difficult to verify that for a given the geodesic , defined in Formula (9), satisfies , when . Then, using Equation (12) we have:
[TABLE]
Now, let us find and for sake of inequality (17). It is easy to numerically verify that for and any arbitrary , inequality (17) holds.
Condition 4; Let us evaluate radially property of . We have:
[TABLE]
where . Now, using the fact that is orthogonal to , and some simple calculus, for any arbitrary we obtain that:
[TABLE]
Meaning that radially condition has been satisfied with .
**Condition 5;**Let us consider the symmetric operator be Hessian of function at point . Then, based on Definition 12, the fact that and also an argument discussed in verifying Condition 2, we just require to prove the boundedness of (see Inequality (31)). Based on Equation (8) and using the Eulidean gradient of function , as defined in Equation (29), we will have:
[TABLE]
This results in
[TABLE]
Condition 6; It is shown in [13] that using the truncated conjugate gradient method (see Alg. 11 of [13]), Inequality (19) is satisfied with .
The proof is complete. ∎
IV Simulation Results
Hamming distance of the original haplotype and its estimation, denoted by , is our criterion to evaluate the performance of different methods. Please note that in HAP, it does not matter that the original haplotype is either or . Accordingly, to obtain we calculate the Hamming distance of the estimated haplotype with both and and choose the minimum one. Simulations are performed using synthetic data created by random generation of bipolar vectors and to construct . The observation set is randomly produced with the probability of observation . Also, some observed samples are erroneous after changing their original sign; we show the set of erroneous samples by , where . Fig. 1 is depicted for , , and pd. Moreover, is set to , where denotes the cardinality of a set. As seen, our method outperforms the other approaches by generating lower .
V Conclusion
In this letter we proposed a new method for haplotype estimation. We properly modeled HAP over an (n-1)-dimensional Sphere. We also discussed the convergence of a Riemannian trust region method. Simulation results confirmed our method outperforms some of the other haplotype assembly methods.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1[1] R. Schwartz et al. , “Theory and algorithms for the haplotype assembly problem,” Communications in Information & Systems , vol. 10, no. 1, pp. 23–38, 2010.
- 2[2] V. Bansal, A. L. Halpern, N. Axelrod, and V. Bafna, “An mcmc algorithm for haplotype assembly from whole-genome sequence data,” Genome research , vol. 18, no. 8, pp. 1336–1346, 2008.
- 3[3] B. Vandereycken, “Low-rank matrix completion by riemannian optimization,” SIAM Journal on Optimization , vol. 23, no. 2, pp. 1214–1236, 2013.
- 4[4] R. H. Keshavan, A. Montanari, and S. Oh, “Matrix completion from a few entries,” IEEE transactions on information theory , vol. 56, no. 6, pp. 2980–2998, 2010.
- 5[5] J.-F. Cai, E. J. Candès, and Z. Shen, “A singular value thresholding algorithm for matrix completion,” SIAM Journal on Optimization , vol. 20, no. 4, pp. 1956–1982, 2010.
- 6[6] E. J. Candès and B. Recht, “Exact matrix completion via convex optimization,” Foundations of Computational mathematics , vol. 9, no. 6, p. 717, 2009.
- 7[7] C. Cai, S. Sanghavi, and H. Vikalo, “Structured low-rank matrix factorization for haplotype assembly,” IEEE Journal of Selected Topics in Signal Processing , vol. 10, no. 4, pp. 647–657, 2016.
- 8[8] M. M. Mohades, S. Majidian, and M. H. Kahaei, “Haplotype assembly using manifold optimization and error correction mechanism,” IEEE Signal Processing Letters , vol. 26, pp. 868–872, June 2019.
