TL;DR
This paper introduces a data-driven approach to infer chemical reaction networks from trajectory data by modeling them as continuous-time Markov chains and employing sparse regularization to learn propensity functions.
Contribution
The paper presents a novel method for learning reaction networks from trajectory data using likelihood maximization and $l^1$ regularization, with theoretical analysis and numerical validation.
Findings
Effective learning of propensity functions from synthetic data
Asymptotic analysis confirms method's consistency in infinite-data limit
Demonstrated applicability to fully observed reaction systems
Abstract
We develop a data-driven method to learn chemical reaction networks from trajectory data. Modeling the reaction system as a continuous-time Markov chain and assuming the system is fully observed, our method learns the propensity functions of the system with predetermined basis functions by maximizing the likelihood function of the trajectory data under sparse regularization. We demonstrate our method with numerical examples using synthetic data and carry out an asymptotic analysis of the proposed learning procedure in the infinite-data limit.
| No. | Reaction | |
|---|---|---|
| \ce -¿[κ] products | ||
| \ce -¿[κ] products | ||
| \ce2 -¿[κ] products | ||
| \ce + -¿[κ] products |
| No. | Reaction | Channel | ||
|---|---|---|---|---|
| \ce -¿[κ_1] | ||||
| \ce + -¿[κ_2] | ||||
| \ce -¿[κ_3] | ||||
| \ce -¿[κ_4] 2 |
| Channel | ||||
|---|---|---|---|---|
| Vector | ||||
| No. of occurrences |
| True | ||||
|---|---|---|---|---|
| Estimated |
| Channel | |||||||
|---|---|---|---|---|---|---|---|
| No. | Reaction | Channel | ||
|---|---|---|---|---|
| \ce -¿[κ_1] | ||||
| \ce -¿[κ_2] | ||||
| \ce -¿[κ_3] | ||||
| \ce -¿[κ_4] | ||||
| \ce + -¿[κ_5] |
| Channel | ||||
|---|---|---|---|---|
| Vector | ||||
| No. of occurrences |
| True | |||||
|---|---|---|---|---|---|
| Estimated |
| Channel | Channel | Channel | Channel | ||||||
|---|---|---|---|---|---|---|---|---|---|
| Channel | |||||||
|---|---|---|---|---|---|---|---|
| No. | Reaction | Channel | ||
|---|---|---|---|---|
| \ce -¿[κ_1] | ||||
| \ce + -¿[κ_2] | ||||
| \ce -¿[κ_3] | ||||
| \ce -¿[κ_4] + | ||||
| \ce -¿[κ_5] + | ||||
| \ce -¿[κ_6] |
| Channel | ||||||
|---|---|---|---|---|---|---|
| Vector | ||||||
| No. of occurrences |
| True | ||||||
|---|---|---|---|---|---|---|
| Estimated |
| Ch. | Ch. | Ch. | Ch. | Ch. | Ch. | |||
|---|---|---|---|---|---|---|---|---|
| Ch. | Ch. | Ch. | Ch. | Ch. | Ch. | ||
|---|---|---|---|---|---|---|---|
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Learning chemical reaction networks from trajectory data
Wei Zhang Zuse Institute Berlin, D-14195 Berlin, Germany.
Stefan Klus Department of Mathematics and Computer Science, Freie Universität Berlin, D-14195 Berlin, Germany.
Tim Conrad11footnotemark: 1 22footnotemark: 2
Christof Schütte11footnotemark: 1 22footnotemark: 2
Abstract
We develop a data-driven method to learn chemical reaction networks from trajectory data. Modeling the reaction system as a continuous-time Markov chain and assuming the system is fully observed, our method learns the propensity functions of the system with predetermined basis functions by maximizing the likelihood function of the trajectory data under sparse regularization. We demonstrate our method with numerical examples using synthetic data and carry out an asymptotic analysis of the proposed learning procedure in the infinite-data limit.
**Keywords ** chemical reaction, inverse problem, data-driven method, sparse optimization, asymptotic analysis
**AMS ** 92C42, 62M86
1 Introduction
Chemical reaction networks [23, 1] have been shown to be very useful in studying dynamical processes in chemistry and biology, where systems under investigation typically contain many different reactants that interact with each other. In in-silico biology, for instance, the cellular processes are often modeled as chemical reaction networks, which take the relevant biological/chemical components as well as their interactions into account [24, 6, 44, 35]. Modeling cellular processes, or finding the kinetic structure of the underlying reaction networks [54, 11, 14, 40, 51, 32], is one of the most prominent fields of in-silico biology due to the important role of such models in understanding the cellular behavior. This task is particularly challenging for realistic reaction networks that are characterized by a large number of elements and interactions (reactions). At the same time, more and more trajectory data of cellular processes is becoming available due to state-of-the-art single-cell based laboratory techniques [12, 42].
The aim of this work is to develop data-driven methods [33] that allow us to learn chemical reaction networks from trajectory data and to apply the new methods to the modeling of cellular processes. Given trajectory data of a stochastic chemical reaction process, we propose a numerical approach to reconstruct the underlying reaction network by maximizing the likelihood function of the trajectory with sparsity regularization. Roughly speaking, our approach consists of three steps. In the first step, preliminary information of the reaction network such as the number of different elements (reactant, products) and the total number of reaction channels is extracted from trajectory data by counting and enumerating. Based on this information, the second step is to define several basis functions which will be used in learning the propensity functions of the reaction network. The theory of chemical reactions suggests that we can choose each basis function as the product of copy-numbers of at most two different reactants [2, 16], i.e., polynomial functions of degree up to . In the third step, the propensity function of each reaction channel is represented using linear combinations of the basis functions involving unknown coefficients, which are then determined by maximizing the log-likelihood function of the trajectory data along with sparse regularization techniques using the -norm [27].
In contrast to Lasso [47, 48], the optimization problem that needs to be solved in our learning approach is a nonlinear sparse optimization problem, due to the nonlinearity of the log-likelihood function of the reaction network. In our study, we find that FISTA (Fast Iterative Shrinkage-Thresholding Algorithm) proposed in [7] is a suitable algorithm for solving our problem. We also propose a simple preconditioning technique which can significantly improve the performance of the numerical algorithm by allowing larger step-sizes in FISTA. This preconditioning technique turns out to be particularly useful when the basis functions take values at different orders of magnitudes for the given trajectory data. Furthermore, we provide an asymptotic analysis of our learning approach in the infinite-data limit. Under certain technical assumptions, by applying large sample theory [19, 34, 50] and limit theorems for stochastic processes [17], we establish the asymptotic consistency and the asymptotic normality of the estimators in our learning procedure, which therefore provides a solid theoretical basis for the data-driven method proposed in this paper.
Let us first review related work and summarize the contributions of this paper. The reconstruction of the governing equations from data using sparsity constraints is getting more and more attention, see [53, 10, 37, 14] for methods pertaining to ordinary differential equations (ODEs) and [8] for learning stochastic differential equations (SDEs). For chemical and biological reaction systems, the problem of estimating unknown parameters has been well studied when the systems are modeled both as ODEs [28, 3] and as continuous-time Markov chain processes [1, 41, 9, 54], while the reconstruction of the entire chemical reaction networks, i.e., finding parsimonious models, has only been considered when the systems are modeled as ODE systems [51, 37, 14]. We refer the readers to the nice review [51] for recent developments on the reverse engineering in systems biology. Compared to the aforementioned existing results, our work is new in the following three aspects. Firstly, we study sparse reconstruction of chemical reaction networks as continuous-time Markov chains, which, to the best of our knowledge, has not been considered in the literature. In contrast to ODE models, a continuous-time Markov chain as a stochastic model has the ability to provide more details of the reaction systems by capturing stochastic effects, which are known to be important for cellular processes [46, 45, 30]. Secondly, we have developed numerical codes in which we implemented the FISTA method [7] to solve a nonlinear sparse optimization problem in order to learn the reaction networks from trajectory data. Our numerical approaches, in particular the preconditioning technique, may be useful in other sparse optimization problems as well. Thirdly, we provide a theoretical justification of the proposed data-driven method. Note that, although different data-driven methods using sparsity [53, 10, 8] have been developed in the literature for different types of dynamical systems, the theoretical analysis of these methods is largely incomplete (see [49]). We expect the theoretical analysis presented in the current work to shed light on the characteristic properties of other data-driven methods as well.
Before concluding this introduction, we discuss several issues that will not be studied in detail in the current paper. Most importantly, we assume that the dynamics of the system is fully observed. In applications, it may be the case that either only certain “important” species in the system are observed or the full dynamics is only discretely observed at a fixed observation frequency [9]. In the former case, one can still apply the sparse learning approach proposed here and the output will be an “effective” model for the observed “important” species. However, the theoretical asymptotic analysis does not carry over directly, and it is therefore important to assess the quality of the effective model provided by the learning approach. In the latter case, where the full dynamics is observed discretely, learning the parsimonious model becomes more challenging. First of all, since not all reactions are observed, the reaction channels of the system need to be identified by other means. Supposing that this can be done, the likelihood function of the given trajectory data can be obtained by summing up the likelihood of all possible underlying trajectories that are consistent with the observation data. One can formulate the learning approach again as a sparse minimization problem, but it will be necessary to sample the underlying trajectories of the system in order to evaluate both the likelihood function and its derivatives. This results in some difficulties when solving the sparse minimization problem. We will address these issues in future work.
The remainder of the paper is organized as follows. In Section 2, we introduce chemical reaction networks and the required notation. Learning chemical reaction networks from trajectory data and its formulation as an optimization problem will be considered in Section 3. In Section 4, we demonstrate the efficiency of the numerical algorithm for solving the (sparse) optimization problems with three concrete numerical examples. In Section 5, we analyze the learning tasks when the length of the trajectory data goes to infinity and study the asymptotic behavior of the solutions of the optimization problems. Appendix A summarizes the main steps of the algorithm FISTA. Appendix B contains properties of an elementary function used in the current work. Two useful limit lemmas of counting processes are summarized in Appendix C. Finally, the proofs of results in Section 5 are collected in Appendix D.
The code used for producing the numerical results in Section 4 is available at: https://github.com/zwpku/sparse-learning-CRN.
2 Chemical reaction networks as continuous-time Markov chains: forward problem
Chemical reaction networks consist of different chemical species that can interact with each other through independent chemical reactions. Suppose the system has different chemical species, denoted by . Each species , , has copies, where the copy-number may change whenever a reaction involving the species has occurred. The state of the system can be represented as the vector
[TABLE]
where and is the set of all possible states of the system.
The evolution of the system’s state can be modeled as a state-dependent continuous-time Markov chain [1, 23]. Let denote a reaction in the system. The state change vector of , , is defined such that, starting in state , the state of the system will change to when the reaction occurs. The waiting time of the system before the reaction occurs satisfies an exponential distribution with the rate parameter (propensity function), which in turn depends on both the state and the structure of . Specifically, the probability density function of is given by
[TABLE]
In Table 1, we list the propensity functions of reactions which consume at most two molecules (see [5, 29] for further details). In particular, note that the propensity functions for the reactions in Table 1 are polynomial functions whose degrees are less or equal to .
In many reaction systems, different chemical reactions may have the same state change vector (see the second example in Remark 1). Assume that chemical reactions , , , are involved in the evolution of the system and these reactions have in total different state change vectors , where . For each , , we introduce the terminology chemical channel . We say the reaction belongs to the channel , or contains the reaction , if the state change vector of is . For each , we also define the index set
[TABLE]
and let be the number of chemical reactions belonging to , i.e., . Clearly, these index sets satisfy \bigcup_{i=1}^{K}\mathcal{I}_{i}=\big{\{}1,2,\dots,N\big{\}}, , if , and therefore .
A reaction channel is said to be activated when a certain reaction belonging to occurs. For each , is the waiting time at a state before the activation of the channel , while is the waiting time before any of the chemical reactions in the system occurs. Assuming the chemical reactions are independent of each other and the waiting times follow exponential distributions, we know that the waiting times and also follow exponential distributions, with the propensity functions
[TABLE]
respectively. In particular, let be the probability density function of and the probability that is the first channel which becomes activated at state , then
[TABLE]
We point out that the evolution equation of the dynamics described above (continuous-time Markov chains) can be expressed in a simple form. In fact, denoting the state of the system at time , from [1] we know that satisfies the dynamical equation
[TABLE]
where , , are independent unit Poisson processes.
Remark 1**.**
As concrete examples, let us consider two simple reaction networks.
Reactions \ce{A+B->[\kappa_{1}]2B}\,,\,\ce{B->[\kappa_{2}]A}, with rate constants , . In this case, we have two different reactions () and two different reaction channels (), with state change vectors and respectively. According to Table 1, the propensity functions of these two channels (assuming ) are , . 2. 2.
Reactions \ce{ABB}\,,\,\ce{A\emptyset}, with rate constants , . In this case, we have , , since the state change vector of both reactions is . The propensity functions of the two reactions , (assuming ) are and , while the propensity function of the channel is .
3 Learning chemical reaction networks: inverse problem
In this section, we study the problem of learning chemical reaction networks from trajectory data. Depending on the information known about the chemical reaction networks, we consider two different learning tasks in Subsection 3.2 and Subsection 3.3, where the second task is the main focus of this paper. In both tasks, the propensity functions in (1) are determined by maximizing the log-likelihood function among the parameterized propensity functions which depend on both a set of basis functions and several parameters. To emphasize the dependence on parameters, let the parameterized propensity functions be denoted by and , respectively, where and is the vector consisting of all parameters. Similar to (2), we define the probability (density) functions corresponding to
[TABLE]
In the first learning task (Subsection 3.2), we assume that the structure of the chemical reactions is known and the goal is to determine the reaction rate constant of each reaction, i.e., the constants in Table 1. In this case, each basis function in the parameterized propensity functions corresponds to an actual chemical reaction that is indeed involved in the evolution of the system (no redundancy), while the task is to determine the value of each parameter (parameter estimation) by maximizing the log-likelihood function. This is indeed a standard problem and has been widely studied in the literature. We include it in this section due to its connections to the sparse learning task considered in Subsection 3.3.
In the second learning task (Subsection 3.3), on the other hand, we assume that the structure of the chemical reactions in the system is also unknown. In this case, candidate basis functions are chosen to parameterize the propensity functions, and sparsity regularization is used to remove the redundancy in the basis functions.
Before introducing the two learning tasks, we briefly discuss the trajectory of the system and derive the likelihood function of a given trajectory.
3.1 Space of trajectories and the likelihood function
Given , there are two different ways to represent the trajectories of the system in the interval . The first representation relies on the total number of reactions occurred within , the waiting time of each reaction, and the new state of the system after each of the reactions. Specifically, starting from a state at time , each trajectory in the time can be represented as a sequence
[TABLE]
which means that, starting from , the state of the system changes from to after waiting for a period of time of length , where . The final time in (5) is the amount of time that the system spends at the final state before time . Clearly, we have . In the second representation, the indices of the reaction channels are used instead of the new state after each reaction. That is, we represent the same trajectory , , as
[TABLE]
where, for each , denotes the index of the reaction channel and is the waiting time before the -th reaction occurs, respectively. The two representations (5) and (6) can be converted from one to the other, using the relation , which holds for .
In this work, we assume that a trajectory of the system, represented either as described in (5) or (6), is available up to time . In other words, we assume that both the change of the state and the length of the waiting time are known for each occurrence of the chemical reactions. From the trajectory data, we can deduce the total number of different reaction channels , as well as the state change vector for each channel , . (Note, however, that when a certain channel contains more than one reaction, from the data alone we will not be able to tell which reaction belonging to has actually occurred when is activated.) For each , we denote by
[TABLE]
the indices such that in (6), where is the total number of times that the channel has been activated within time , and therefore the relation
[TABLE]
is satisfied. For brevity, let us introduce the notation
[TABLE]
to describe the trajectory of the system within the time interval . The space consisting of all trajectories of the system on will be denoted by . Note that, as a random variable, contains both continuous and discrete components. Given a parameter vector , we consider the chemical reaction system determined by the (parameterized) probability density functions , in (4), and define
[TABLE]
for the trajectory in (9). Let denote the mathematical expectation with respect to the trajectories of the system. Then, for any bounded measurable function , we have
[TABLE]
from which we can view the function as the probability density (distribution) of on the space (we can indeed verify that ). To simplify the notation, we will formally write
[TABLE]
as the integration on the right-hand side of (11). Using (10) and (12), we can write down the likelihood function of the trajectory data as
[TABLE]
where
[TABLE]
can be considered as the likelihood function along the reaction channel .
3.2 Learning task 1: determine rate constants by maximizing the log-likelihood
Assuming that the structure of the chemical reactions of the system is known, we now consider the problem of determining the reaction rate constant of each reaction. Note that the propensity function of each reaction in Table 1 can be written as , where is a polynomial of the system’s state whose specific form depends on the structure of , and is the rate constant. Therefore, in the current learning task we assume that the propensity function of the th chemical reaction in the system is given by
[TABLE]
where the nonnegative function is known from the structure of , and is the unknown rate constant which we want to determine from trajectory data.
Let be the vector
[TABLE]
consisting of all the unknown rate constants, where for all . For each channel , , we also define the vector
[TABLE]
which consists of the rate constants of reactions belonging to . Corresponding to (15), the parameterized propensity functions in (1) are
[TABLE]
while the optimal value of is determined by maximizing the (logarithmic) likelihood functions in (13), or equivalently, by solving the minimization problem
[TABLE]
With the trajectory data as defined in (5) and using the propensity functions in (17), the objective function above can be computed explicitly and we have
[TABLE]
In the above, we recall that the indices are defined in (7), the logarithmic likelihood function
[TABLE]
only depends on and should be compared to (14). Note that the expressions above also imply that the minimization problem (18) can be decomposed into minimization problems
[TABLE]
which can be solved separately.
For each index , , such that for some , the corresponding Euler–Lagrange equation of (18) is
[TABLE]
Differentiating one more time, we get the Hessian matrix of the objective function in (18)
[TABLE]
where .
In order to study the optimization problem (18)–(19), let us introduce the matrix
[TABLE]
for each , where we have assumed that the index set \mathcal{I}_{i}=\big{\{}j_{1},j_{2},\dots,j_{N_{i}}\big{\}}. We define to be the th column vector of for and thus obtain the following result concerning the solution of the optimization problem (18)–(19).
Proposition 1**.**
The following three conditions are equivalent.
For each , the vectors are linearly independent. 2. 2.
The function in (19) is strictly convex. 3. 3.
The optimization problem (18)–(19) has a unique solution.
Proof.
(2) (3) is obvious. To show that (1) implies (2), it is sufficient to verify that the Hessian matrix of is positive definite. Using (22), for any vector , we have
[TABLE]
Since the columns of are linearly independent for each , we conclude that (24) is zero if and only if is a zero vector. This implies that is strictly convex.
Finally, let us prove that (3) implies (1) by contradiction. Define to be the unique solution of the optimization problem (18). Assume that there is , , such that the vectors are linearly dependent. As a result, we can find a vector , such that
[TABLE]
Since satisfies (21), the property (25) implies that satisfies (21) as well. Multiplying by (or ) on both sides of (21) and summing up the indices, we get
[TABLE]
Combining (25), (26), as well as the expressions in (19), we obtain , which contradicts the uniqueness of . ∎
To distinguish the parameters obtained from solving the optimization problem (18) and the true parameters of the system, we will define to be the maximizer of for fixed time in what follows, and to be the vector consisting of the true parameters such that (15) holds. In particular, when and , i.e., the channel only contains one reaction , the Euler–Lagrange equation (21) can be solved analytically and we have
[TABLE]
3.3 Learning task 2: determine the rate constants and the structure of chemical reactions using sparsity
In this subsection, we study the problem of learning the propensity functions of the chemical reaction networks from trajectory data when neither the structure of the chemical reactions nor their rate constants is known.
First of all, we can figure out the total number of the reaction channels from the trajectory data, as discussed in Subsection 3.1. Now suppose that we are given candidate basis functions
[TABLE]
together with index sets , , such that ,
[TABLE]
Accordingly, we introduce the vectors
[TABLE]
For each channel , the propensity function in (1) will be approximated using the basis functions , , and the coefficients in . More precisely, we define
[TABLE]
where , and the function
[TABLE]
is introduced (see Figure 1), in order to guarantee the non-negativity of for all vectors . Corresponding to (31), the total propensity function is given by
[TABLE]
Since the propensity functions of reactions in many applications typically have a simple form (Table 1), there is likely redundancy in the basis functions and therefore we can assume that the unknown vector only has a few nonzero entries (and is thus sparse). With this observation in mind, we propose to determine by maximizing the (logarithmic) likelihood function under the sparsity assumption, or, equivalently, by solving the nonlinear sparse minimization problem
[TABLE]
where is the likelihood function (13) with the propensity functions in (31) and (33). Explicitly, we have
[TABLE]
If we quantify the sparsity of using the norm (denoted by ), then (34) results in
[TABLE]
In (36), the log-likelihood function is rescaled by (this scaling is suggested by the analysis in Section 5), and the constant , which measures the strength of the sparsity regularization, can be chosen depending on .
Similar to the problem (18) in the previous subsection, the minimizer of (36) can be computed by solving sparse minimization problems
[TABLE]
separately, where
[TABLE]
In practice, we find that (36), or equivalently (37), can be efficiently solved by FISTA proposed in [7], especially when preconditioning is applied (see Remark 4 below and examples in Section 4). The main algorithmic steps of FISTA are provided in Algorithm 1 in Appendix A.
We obtain the following result concerning the minimization problems (36) and (37).
Proposition 2**.**
Suppose . The objective functions of the optimization problems (36) and (37) are strictly convex.
Proof.
It is sufficient to consider the objective function in (37). By straightforward calculations (for instance, see (77) and (78) in Appendix B), we can verify that both and are strictly convex functions. Therefore, the function in (38) is strictly convex. Since the norm is convex as well, we conclude that the objective function in (37) is strictly convex. ∎
Let denote the unique minimizer of the problem (36). Similar to the Euler–Lagrange equation (21), in the current case satisfies the inclusion relation [4, 13]
[TABLE]
where
[TABLE]
for , and is the subdifferential of the absolute value function , defined by
[TABLE]
Finally, let be the vector in whose components are defined in (40) and define the set \partial|\bm{\omega}|=\big{\{}\bm{v}\in\mathbb{R}^{N}~{}\big{|}~{}\bm{v}=(v_{1},v_{2},\dots,v_{N})^{\top}\,,~{}v_{j}\in\partial|\omega_{j}|\,,~{}1\leq j\leq N\big{\}} . We can express the condition (39) in vector form as
[TABLE]
The characterization above of the minimizers will be used in the analysis in Section 5.
We conclude this section with the following remarks.
Remark 2** (Role of the function ).**
In principle, we would like to allow both the basis functions and the unknown coefficients to be either positive or negative. By introducing the function in (35), we avoid imposing many inequality constraints which would be otherwise needed in order to guarantee that the log-likelihood function in (35) is well-defined. The properties of in (32) are discussed in Appendix B. In particular, we have , uniformly . For this reason, we define .
Remark 3** (Choice of basis functions).**
In the sparse minimization problem (36), the vector contains all the coefficients , and the corresponding basis functions in (28) are involved. This formulation makes the notations simpler and is also convenient for analysis, particularly in Section 5. Numerically, on the other hand, the coefficient vectors in (30) can be computed separately by solving the minimization problems (37), , with the same set of basis functions , , for all the channels. In this case, corresponding to the formulation adopted at the beginning of this subsection where all coefficients are put together, we define the index sets \mathcal{I}_{i}=\big{\{}(i-1)L+1\,,~{}(i-1)L+2\,,\,\dots\,,~{}iL\big{\}}, , and for each , we define the function
[TABLE]
Accordingly, we have \bm{\omega}^{(i)}=\big{(}\omega_{(i-1)L+1},\,\omega_{(i-1)L+2},\,\dots,\omega_{(i-1)L+L}\big{)}^{\top}, and the propensity function in (31) can be written more transparently as
[TABLE] 2. 2.
While we are mainly interested in chemical reaction systems, the same learning approach can be applied to other types of continuous-time Markov chains whose jump distributions are state-dependent. In particular, for chemical reaction systems that obey law of mass-action, according to Table 1 we may choose from the polynomials
[TABLE]
*where denotes the *th component of the state , based on the knowledge about the potential chemical reactions that are possibly involved in the system.
Remark 4** (Preconditioning).**
In concrete applications, due to the complexity of the trajectory data, different basis functions may take values that are of different orders of magnitude. As a result, the objective functions in (37), or equivalently in (36), may become inhomogeneous along different components . This leads to numerical difficulties in solving (37) since a small step-size has to be used as a result of the strong dependence of the objective function on the change of along certain directions (i.e., large gradient, ill-conditioned). A simple way to alleviate this numerical issue is to precondition the problems (37) by rescaling the basis functions. Equivalently, let denote the rescaling constants, where , . Instead of (37), we can compute the minimizer of the rescaled sparse minimization problem
[TABLE]
where the vector consists of , . Then it is easy to verify that the minimizer of (37) can be recovered from , for . By properly choosing the constants based on analyzing the trajectory data, we can expect that minimizing (44) will be easier compared to (37). Readers are referred to Section 4 for further discussions on this issue and concrete examples.
Remark 5** (Possible extensions).**
Below we discuss several possible generalizations.
So far, we have assumed that the evolution of the system is fully observed. In concrete applications, sometimes a small subset of species in the system is supposed to be able to describe the system’s dynamics **[15]**. Correspondingly, it may happen that the trajectory data is only partially observed for these “important” species. In this case, one can still apply the second learning approach in this subsection to learn the system and the outcome of the optimization problem (36) will be an effective dynamics for these selected “important” species. However, we point out that, since the effective reactions among these “important” species do not necessarily obey the law of mass-action any more, it may be important to include other types of basis functions (e.g., rational functions for Michaelis–Menten type kinetics **[39]**) together with the polynomial basis in (43) in order to obtain a good approximation of the effective dynamics. 2. 2.
It is straightforward to generalize the analysis to the case where multiple trajectories of the system are available. We refer the readers to the numerical examples in Section 4 for details. 3. 3.
In this work, in particular in Section 5, we are mainly interested in the theoretical justification of the two learning approaches in the infinite-data limit, i.e., . The numerical examples in Section 4 also mainly serve this purpose. Regarding the choice of the sparsity parameter , one can expect that a large will increase the sparsity of the solution, but at the same time will also introduce bias in the prediction. Therefore, in the numerical experiments in Section 4, we empirically choose in such a way that the sparsity and accuracy of the solution are balanced. In practice, instead of choosing a fixed in (36) for coefficients in front of all basis functions, it is helpful to consider different values of for different coefficients and to tune the parameter(s) carefully using the cross-validation technique **[20, 26]**. See **[8]** for more details.
4 Examples
In this section, we study the learning tasks discussed in Section 3 with three concrete numerical examples.
4.1 Example 1
In the first example, we study the chemical reaction system given by Table 2, where two different species are involved in chemical reactions. The propensity functions of these reactions depend on both the state of the system, i.e., the copy-numbers of the species and , and the rate constants , .
To study the two learning tasks discussed in Section 3, we fix the parameters
[TABLE]
and trajectories of the system are generated using the stochastic simulation algorithm (SSA) [21, 22, 23]. Each trajectory starts from the same initial state at time and is simulated until time ( of the trajectories are shown in Figure 2 for illustration). From Table 2, it is clear that different reactions belong to different reaction channels and therefore there are in total reaction channels in the reaction network. For the quantities introduced in Section 2, we obtain and . After processing the trajectory data, we find that the activation numbers of the reaction channels within these trajectories are , , , and , respectively, as shown in Table 3.
With the prepared trajectory data, let us first consider the problem of learning the rate constants , , assuming that the types of these reactions are known. For this purpose, we consider the negative log-likelihood function
[TABLE]
which is similar to (19), except that in (46) we have taken all the trajectories into account. Specifically, in (46) denotes the index of the trajectory, while the notation , , , , , has the same meaning (for the th trajectory) as the corresponding notations , , , , , and in (19), respectively. Following the setting in Subsection 3.2, in this example we have the parameter set , the index set , , as well as the functions given by
[TABLE]
Since each reaction channel only contains one single reaction, the minimizer of the objective function (46) can be computed explicitly using an expression similar to (27), and we get , which is indeed close to the true parameters (see Table 4).
Let us now study the second learning task in Subsection 3.3 with the same trajectory data, where we assume that the structure of the chemical reactions involved in the system is unknown as well. Notice that, by analyzing the trajectory data, in this case we can still figure out that there are in total species and different reaction channels in the reaction network (see Table 3). In order to determine the propensity function of each reaction channel, based on Table 1 and the discussions in Remark 3, we choose polynomials of degree at most for , i.e.,
[TABLE]
as basis functions. The propensity functions of the reaction channels are approximated by
[TABLE]
where is defined in (32) and we set . In (48), the function depends on the parameters \bm{\omega}^{(i)}=\big{(}\omega_{6(i-1)+1}, , , \omega_{6(i-1)+6}\big{)}^{\top}, and the same set of basis functions in (47) is used for each of the channels.
To determine the value of , which consists of all the unknown parameters, we follow the discussions in Remark 3 of Subsection 3.3 and solve the sparse minimization problems
[TABLE]
for each channel separately, by applying Algorithm 1 in Appendix A. We choose the parameter empirically, such that the sparsity and accuracy of the solution are balanced. In each iteration step, evaluating the objective function in (49) as well as its derivative requires traversing every reaction along the trajectories. This part of the calculation is performed in parallel using MPI in our code. The iteration procedure continues until the relative difference between the minimal and the maximal values of the objective function in the last iteration steps is smaller than . In this example, we run the code using processors in parallel and it takes only a few seconds to meet the convergence criterion.
The final results are summarized in Table 5. To make a comparison with the true parameters in (45), we notice that, with the basis functions in (47), the true propensity functions of the reaction channels in the system (see Table 2 and Table 4) can be expressed as
[TABLE]
where . From the expressions above, we see that the propensity functions in (48), with the estimated parameters in Table 5 (for or ), indeed provide reasonable approximations of the true propensity functions in (50). Comparing the results for different , we can observe that while the solution is sparser for (e.g., coefficients corresponding to the basis in Table 5), the approximation of the true coefficients is better when is smaller (i.e., , underlined coefficients in Table 5).
4.2 Example 2: predator-prey system
In the second example, we consider the predator-prey type reaction system in Table 6, which has two different species and chemical reactions [55]. The system models the birth and death of two different species and is widely used as building block of more complicate chemical or biological systems. In contrast to the previous example where different reactions have different state change vectors, in the current case both the reaction \ce -¿[κ_2] and the reaction \ce + -¿[κ_5] have the same state change vector .
In the first step, we generate the trajectory data of the system with the parameters
[TABLE]
Starting from the state at time , trajectories are simulated using SSA until the final time , and of these trajectories are shown in Figure 3. After analyzing the trajectory data, we can identify the different reaction channels in the system as well as the numbers of occurrences of activations for each channel within the trajectories (see Table 7).
With the prepared trajectory data, we study the estimation of the parameters , , assuming that the structure of the reactions in Table 6 is known (learning task 1). In the same way as in the previous example, we consider the minimization of the same negative log-likelihood function (46). The parameters can be computed explicitly from the expression which is similar to (27) since the corresponding reaction channel contains only one single reaction, while the parameters , both of which are involved in the same channel , can be found using a standard gradient descent method [43]. In the latter case, we choose the time step-size and the initial values are set to . In both cases, it only takes several seconds to run the code and the estimated parameters are indeed very close to the true parameters (see Table 8).
Next, we study the second learning task described in Subsection 3.3, where our aim is to learn the propensity functions of the identified reaction channels without knowing the structure of the chemical reactions. The propensity functions are approximated in the same way as in (48), with the same set of basis functions in (47) and . For each channel and each , the sparse minimization problem (49) is solved separately by “FISTA with backtracking” (Algorithm 1 in Appendix A), using the same number of processors (i.e., ) and the same convergence criterion as in the previous example.
However, as shown in Figure 3, the trajectory data in the current example exhibits further complexities, as the copy-number of the species in the system varies significantly (from to nearly ) within the time interval , unlike the trajectory data in the previous example, where the copy-numbers of the both species stay below (Figure 2). As a result, in Table 9 we see that the different basis functions in (47) are of vastly different orders of magnitude when they are evaluated at the states contained in the trajectories. At the same time, in the numerical experiment we find that direct minimization of (49) using FISTA does not converge at all for any of the reaction channels, due to the extremely small step-size between and (the step-size is determined by the algorithm itself; see Algorithm 1 in Appendix A and [7]).
To overcome this difficulty, we apply the preconditioning idea discussed in Remark 4. Let denote the basis functions, where , for , . For each index belonging to the th channel , we record the maximal values of among all the states in the trajectory data at which has been activated. These maximal values are then used to (empirically) determine the rescaling constants , shown in Table 9 such that the functions after rescaling are roughly of the same order of magnitude. As discussed in Remark 4, we solve the rescaled sparse minimization problem, which is similar to (44), for each channel separately, and restore the parameters in the propensity functions. It turns out that the problems after rescaling become much easier to solve, because in this case the step-size is increased to on average, which is to orders of magnitude larger than the step-size in the unrescaled problem. It takes less than minutes in total to meet the convergence criterion for all reaction channels and the results are summarized in Table 10.
To compare with the true parameters in (51), notice that the true propensity functions of the channels in Table 7 can be expressed as
[TABLE]
where . From the expressions above, we can conclude that the propensity functions in (48), together with the parameters given in Table 10, indeed approximate the true propensity functions in (52) quite well. Comparing the results for , we observe that the solutions are slightly sparser for and (e.g., coefficients corresponding to the basis in Table 10), while the approximation of the true coefficients is better when is smaller (i.e., , underlined coefficients in Table 10). Finally, we point out that the solution could be further improved if necessary, by using thresholding techniques (i.e., removing unimportant basis functions) [10] or cross-validation techniques (i.e., tuning ) [8].
4.3 Example 3: reaction network modeling intracellular viral infection
In the third example, we consider the reaction network in [44], which models intracellular viral infection. We refer the readers to [44] for the biological background and to [25, 5] for further details. As shown in Table 11, the system consists of different species, i.e., the viral template (T), the viral genome (G), the viral structure protein (S), and the virus (V). These species are involved in chemical reactions.
First of all, starting from the state at time , trajectories of the system are generated using SSA until , with the parameters
[TABLE]
in Table 11. For illustration purposes, of these trajectories are shown in Figure 4. It can be observed that the copy-numbers , of may increase to –, while the copy-numbers , of , remain relatively small (less than ) within the time interval . After analyzing the trajectory data, we can identify the reaction channels of the system. The numbers of occurrences of activations for each channel within the trajectories can be counted as well (see Table 12).
With these trajectory data, we study the estimation of the parameters , , assuming that the structure of the reactions in Table 11 is known (learning task 1). In the same way as we did in the previous two examples, the parameters are estimated by minimizing the same negative log-likelihood function (46). Since each reaction channel contains only one reaction, the parameters can be directly computed (see (27)) and are indeed very close to the true parameters in (53), as shown in Table 13.
In what follows, we continue to study the second learning task in Subsection 3.3, where we want to learn the propensity functions of the identified reaction channels in the system without knowing the structure of the chemical reactions. As discussed in Table 1 and Remark 3, since there are different species in the system, we construct the following basis functions (i.e., polynomials of degree at most )
[TABLE]
where , to learn the propensity function of each reaction channel. Similar to (48) in the first example, the propensity functions of the reaction channels are approximated by
[TABLE]
with . For each , the same sparse minimization problem in (49) is solved in order to determine the coefficients \bm{\omega}^{(i)}=\big{(}\omega_{15(i-1)+1}, , , \omega_{15(i-1)+15}\big{)}^{\top}. From Table 14, we can again observe that the maximal values of the different basis functions in (54), evaluated on the trajectory data, are of different orders of magnitude. Therefore, the same rescaling strategy discussed in Remark 4 and in the previous example is applied to precondition the problem, using the rescaling constants in Table 14 which are determined empirically based on the maximal values of basis functions. Notice that, since for different channels the basis functions attain similar maximal values, the same set of rescaling constants is used for all the channels. For each reaction channel, the rescaled minimization problem is solved in parallel using processors, since the trajectory data only contains trajectories, and the iteration procedure continues until the relative difference between the minimal and the maximal values of the objective function in the last iteration steps is smaller than . The estimated coefficients are summarized in Table 15. For each channel except channel , it takes around minutes to meet the convergence criterion, while for channel it takes roughly two hours, because the corresponding solution of channel has a large coefficient (i.e., the underlined coefficient in Table 15) which is very different from the zero initial guess.
To compare with the true propensity functions of the channels in Table 12 with the true parameters in (53), let us write the true propensity functions as
[TABLE]
where . From the expressions above, we can conclude that the propensity functions in (55), together with the estimated parameters in Table 15, indeed provide good approximation of the true propensity functions in (56). Note that in this numerical experiment we have empirically chosen different values of for different channels since we only want to demonstrate that the true parameters can indeed be estimated with properly chosen . A more systematic way of choosing is cross-validation [8, 20, 26]. See Remark 5. Finally, we point out that the solution could be further refined if necessary, by applying thresholding techniques (i.e., removing unimportant basis functions and then solving the minimization problem again) [10].
5 Asymptotic analysis of the two learning tasks
In this section, we consider the two learning tasks introduced in Section 3 when . Although we are mainly interested in the second learning task and the corresponding minimization problem (36), in Subsection 5.1 we start with the first learning task, because it is highly relevant to the second learning task. The analysis in Subsection 5.1 will be useful when we study the second learning task in Subsection 5.2. The proofs of the various results will be given in Appendix D.
Assume the true propensity functions of the underlying chemical reaction system are in (1), and recall that the system’s state satisfies the dynamical equation (3), where , , are independent unit Poisson processes. For most of the results in this section, we will make the following assumptions about the system. Readers are referred to [38] for the study of the ergodicity of stochastic systems.
Assumption 1**.**
The state space is a finite set.
Assumption 2**.**
* is ergodic on . It has a unique invariant distribution , such that .*
Remark 6**.**
Assumption 1 simplifies the analysis in this section. In particular, it implies that any function on , e.g., the basis function , is bounded. For many systems in chemical reaction applications, the state spaces, which although can be large, are indeed finite. This is especially the case when there are conservation relations in the reactions of the system. At the same time, we also expect the analysis presented below can be extended to systems whose state space is an infinite set, after taking into account additional technical issues.
Our asymptotic analysis of the limit combines both techniques from the large sample theory [19, 34] in statistics and the limit theorems for stochastic processes [17]. In particular, we rely on the important fact that the log-likelihood functions in (19) and (35), as well as their derivatives, can be expressed as integrations with respect to the counting processes
[TABLE]
and the corresponding compensated Poisson processes (martingales)
[TABLE]
where and . As an example, it is apparent that the process is related to in (8), i.e., the total activation number of the channel within the time , since
[TABLE]
We refer the readers to Appendix C for two limit results concerning integrations with respect to the processes and when .
5.1 Learning task 1: analysis of the log-likelihood maximizer
In this subsection, we consider the first learning task in Subsection 3.2. Recall that is the true parameter vector such that (15) holds and that
[TABLE]
For fixed , denotes the solution of the minimization problem (18). We will study the asymptotic convergence of to , as . It should be pointed out that the consistency of maximum likelihood estimation has been well studied in the statistics community [52, 19, 50]. We refer the readers to [18, 31] for the asymptotic study of maximum likelihood estimation for continuous-time stochastic processes.
Let us first express the log-likelihood function in (19) and its derivatives using the processes in (57) and (58). For the log-likelihood function, since the trajectory of the system is piecewise constant, we have
[TABLE]
while for its first order derivatives in (21), we obtain
[TABLE]
where and is the index of channel such that . Similarly, the second order derivatives in (22) can be expressed as
[TABLE]
for two indices when there is a common channel index , , such that , and otherwise
[TABLE]
when and where are two different channel indices.
In particular, (5.1) and (62) become simpler when , and we have
[TABLE]
Let us first recall the law of large numbers (LLN) for the unit Poisson processes , , which states that [1]
[TABLE]
It allows us to study the simple case when the reaction channel contains a single reaction.
Proposition 3**.**
*Given , suppose and , for some . Assume that *
[TABLE]
Then , almost surely.
Note that Assumption 1 and Assumption 2 are actually not necessary in Proposition 3. In what follows, we study the case when , i.e., when more than one reactions belong to the same reaction channel . We need to further make the following two assumptions.
Assumption 3**.**
There is a unique vector , such that (60) is satisfied.
Assumption 4**.**
The basis functions , , are nonnegative on .
As a consequence of Assumption 3, we have the following lemma which concerns the uniqueness of , when is sufficiently large.
Lemma 1**.**
Suppose that Assumptions 1, 2, 3, 4 hold. With probability one, the minimization problem (18)–(19) has a unique solution , when is sufficiently large.
To proceed, we will need the Kullback–Leibler divergence between two probability distributions [36]. It is known that the Kullback–Leibler divergence is nonnegative and it equals zero if and only if the two distributions are identical. In particular, for the probability distributions whose density functions are and in (4), the Kullback–Leibler divergences can be computed as
[TABLE]
respectively, where and , are two parameter vectors in (16).
The convergence of towards as is established in the following result.
Proposition 4**.**
Suppose that Assumptions 1, 2, 3, 4 hold.
For any vector in (16), we have
[TABLE] 2. 2.
Let be the unique minimizer of the problem (18), such that for each . With probability one, it holds that .
We now study the asymptotic normality of the sequence as . We have the following result.
Proposition 5**.**
Suppose that Assumptions 1, 2, 3, 4 hold. Let be the matrix whose entries are
[TABLE]
for . Then, as , \sqrt{T}\big{(}\bm{\omega}^{(T)}-\bm{\omega}^{*}\big{)} converges in distribution to , i.e., is a Gaussian random variable whose mean equals zero and whose covariance matrix is .
5.2 Learning task 2: asymptotic analysis of the sparse optimization problem
Based on the analysis in Subsection 5.1, in this subsection we study the minimizer of the sparse minimization problem (36) as , where both and depend on .
Recall that are the probability densities (distributions) in (2). With the convention , for , we will denote
[TABLE]
and, correspondingly,
[TABLE]
Instead of Assumption 3, here we assume that the set of basis functions is chosen such that the underlying (true) system can be uniquely parameterized.
Assumption 5**.**
*There is a unique vector , such that *
[TABLE]
We also need the following assumption in order to guarantee the boundedness of .
Assumption 6**.**
*For each , assume that the index set is \mathcal{I}_{i}=\big{\{}j_{1},j_{2},\dots,j_{N_{i}}\big{\}}. , , is a sequence of vectors satisfying . Then such that and *
[TABLE]
Remark 7**.**
In fact, under Assumption 5 (uniqueness of ), one can argue by contradiction and show that there exists such that (70) holds. Therefore, Assumption 6 simply further asserts that is positive. In particular, Assumption 6 is not needed if is true for all and .
Similar to (62) and (63), it will be helpful to express the derivatives of the log-likelihood function in (35) using the processes , in (57) and (58). For the first order derivative (40), we have
[TABLE]
for . For the second order derivatives, we have
[TABLE]
when there is an index , , such that , and otherwise
[TABLE]
when , for two different indices .
The following technical lemma addresses the boundedness of the minimizers of the minimization problem (36).
Lemma 2**.**
Suppose that Assumptions 1, 2, and 5 hold. The parameter satisfies . Let be the minimizer of the minimization problem (36) and be the likelihood function in (35). Then, for each index , , and , such that , we have
[TABLE]
Assuming furthermore Assumption 6 holds, then the sequence is bounded for .
Now we are ready to state the asymptotic results for the sequence , as . Readers are referred to Appendix D for their proofs.
Theorem 1**.**
Suppose that Assumptions 1, 2, 5, and 6 hold. The parameters , in the minimization problem (36) satisfy
[TABLE]
Then we have , a.s.
Theorem 2**.**
Suppose that Assumptions 1, 2, 5, and 6 hold. Let be the matrix whose entries are given in (68) and let be the minimizer of the problem (36). Further assume that the following conditions are met.
The parameters , in (36) satisfy
[TABLE]
for some . 2. 2.
There exists , such that for all and satisfying , we have either for all , or .
Then, as , \sqrt{T}\big{(}\bm{\omega}^{(T,\epsilon,\lambda)}-\bm{\omega}^{*}\big{)} converges in distribution to a Gaussian random variable with mean zero and covariance matrix .
Appendix A Pseudocode of FISTA with backtracking
We summarize the main algorithmic steps of FISTA with backtracking [7] for the optimization problem
[TABLE]
in Algorithm 1, where and . The optimization problems (37) and (44) are in the form of (76), with being the (negative) logarithmic likelihood function. We refer the readers to the original paper [7], where FISTA is developed for optimization problems which are more general than (76).
Appendix B Properties of the function
We now summarize some asymptotic properties of the function in (32). Given , recall that G_{\epsilon}(x)=\epsilon\ln\big{(}1+e^{x/\epsilon}\big{)}, for all , whose first and second derivatives are
[TABLE]
respectively. The following lemma can be easily proved and therefore its proof is omitted.
Lemma 3**.**
Given , we have the following estimates.
, . 2. 2.
, if , and , if . 3. 3.
**
In particular, Lemma 3 implies , uniformly for , and
[TABLE]
We also need to study the function \ln G_{\epsilon}(x)=\ln\big{[}\epsilon\ln(1+e^{x/\epsilon})\big{]}, whose first and second derivatives are
[TABLE]
Lemma 4**.**
Given , we have the following estimates.
For all , it holds that
[TABLE] 2. 2.
For , we have
[TABLE] 3. 3.
For all , we have
[TABLE]
Proof.
We will only prove the inequalities concerning .
When , using (78) and the fact , we have . For the upper bound, using Lemma 3, we have
[TABLE]
and therefore (79) is obtained. 2. 2.
When , using the fact that , for all , we have Therefore,
[TABLE]
∎
Summarizing the estimates in Lemma 4, we can conclude that
[TABLE]
Appendix C Two limit lemmas on integrations with respect to counting processes
In this section, we summarize two useful results pertaining to integrations with respect to the processes , in (57) and (58), respectively. These results play an important role in the asymptotic analysis in Section 5. The first result is a type of law of large numbers (LLN) for Poisson processes.
Lemma 5**.**
*Suppose that Assumptions 1-2 hold. Functions satisfy , . For each , we have *
[TABLE]
Proof.
Since is a finite set (Assumption 1), the convergence of to is in fact uniform on . Using the LLN of Poisson processes in (65) and the uniform convergence of , we have
[TABLE]
Therefore, it is sufficient to prove (80) for the case . Note that we have
[TABLE]
where denotes the indicator function at state . For each , \int_{0}^{T}\mathbf{1}_{x}\big{(}X(s)\big{)}\,dR_{i}(s) can be interpreted as the total number of times that the th channel becomes active within time when the state of the system is . Similarly, \int_{0}^{T}\mathbf{1}_{x}\big{(}X(s)\big{)}\,ds is the total time that the system spends at state within time . Since the waiting times at state before the channel becomes activated are independent and follow exponential distributions with mean value \big{(}a_{i}^{*}(x)\big{)}^{-1}, the LLN of exponential distributions implies that
[TABLE]
Since the system is ergodic (Assumption 2), Birkhoff’s ergodic theorem implies
[TABLE]
Combining (82)–(84), we obtain \lim\limits_{T\rightarrow+\infty}\frac{1}{T}\int_{0}^{T}f\big{(}X(s)\big{)}\,dR_{i}(s)=\sum\limits_{x\in\mathbb{X}}f(x)a_{i}^{*}(x)\,\pi(x) , a.s. The conclusion (81) follows as a consequence, using the definition of in (58) and the ergodicity of the system. ∎
The second result is a corollary of the martingale central limit theorem [17, Theorem 7.1.4].
Lemma 6**.**
Suppose that Assumptions 1-2 hold. For each , functions satisfy , . Let denote the -dimensional process whose components are \mathcal{W}^{(T)}_{j}(u)=\frac{1}{\sqrt{T}}\int_{0}^{Tu}f^{(T)}_{j}\big{(}X(s)\big{)}\,d\widetilde{R}_{i}(s) , where , , and the index satisfies , . Moreover, is the matrix whose entries are given by
[TABLE]
for . We define the matrix-valued (linear) process , .
As , converges in distribution to , where is an -dimensional process with independent Gaussian increments whose quadratic variation process is . In particular, converges in distribution to a Gaussian random variable whose mean is zero and whose covariance matrix is in (85).
Proof.
For each , we define the matrix-valued process , , whose entries are given by
[TABLE]
for . Let us verify the conditions required by the martingale central limit theorem [17, Theorem 7.1.4].
Firstly, from (57) and (58), applying Ito’s formula, we know the process
[TABLE]
is a martingale. Using the expression (86) and the ergodicity of the system, we have
[TABLE]
Furthermore, since converge to and is a finite set (Assumption 1), it is clear that are uniformly bounded. This implies that
[TABLE]
Secondly, because the processes in (86) have continuous paths, the limit
[TABLE]
holds trivially. Therefore, we can apply the martingale central limit theorem [17, Theorem 7.1.4] and the conclusion follows readily. ∎
Appendix D Proofs of results in Section 5
In this section, we prove the results presented in Section 5.
We start with the results in Subsection 5.1.
Proof of Proposition 3.
As already pointed out in Subsection 3.2, the Euler–Lagrange equation (21) can be explicitly solved when and the solution is given in (27). Using the representations in (57) and (59), we can rewrite (27) as
[TABLE]
Applying (65) together with (66), we conclude that , almost surely. ∎
Proof of Lemma 1.
From (19) and (20), it is not difficult to see that, with probability one, there is at least one minimizer for large enough . We show the uniqueness by contradiction. Suppose that, with positive probability, the solution of (18)–(19) is not unique for an increasing subsequence , where . According to Proposition 1, we can find an index , , such that the column vectors , , of the matrix in (23) are linearly dependent for , where . Let us order the states in such that , where is positive. The ergodicity of the system (Assumption 2) implies that with probability one the states will be visited by the system within some large finite time. Since there is a positive probability that the column vectors are linearly dependent for all where , we can find a nonzero vector , such that , , where \mathcal{I}_{i}=\big{\{}j_{1},j_{2},\dots,j_{N_{i}}\big{\}}. This contradicts Assumption 3. ∎
Proof of Proposition 4.
Under Assumption 2, using expressions (5.1), (64), and applying Lemma 5 in Appendix C, we can compute
[TABLE]
where we have used (67) in the last equality. Therefore, the first conclusion is obtained. 2. 2.
Firstly, let us show that the sequence \big{(}\omega^{(T)}_{j}\big{)}_{T>0} is almost surely bounded for each . From the Euler–Lagrange equation (21), we can obtain the relation
[TABLE]
which implies
[TABLE]
where , , is the index such that . Note that both the numerator and the denominator on the right-hand side of (88) converge, as consequences of Lemma 5 in Appendix C and the ergodicity of the system (Assumption 2), respectively. Taking the limit in (88) and using (17), we have
[TABLE]
which implies that the sequence \big{(}\omega_{j}^{(T)}\big{)}_{T>0} is almost surely bounded.
Secondly, from (62) we know that the minimizer satisfies the identity
[TABLE]
In particular, for each state , it implies
[TABLE]
where denotes the indicator function at . Therefore, applying Lemma 5 in Appendix C and using the ergodicity of the system, we have
[TABLE]
Note that whenever there is an increment for the counting process when , we know and we can find an index such that the lower bound in (89) is positive.
Finally, let be a limit point of as . Using a similar derivation as in (1) and taking the lower bound (89) into account, we obtain
[TABLE]
On the other hand, since is the minimizer of (18), we also have
[TABLE]
Therefore, the Kullback–Leibler divergences in (90) must be equal to zero at each state . The expressions (67) then imply a_{i}\big{(}x\,;\,\bar{\bm{\omega}}\big{)}=a_{i}\big{(}x\,;\,\bm{\omega}^{*}\big{)}, and . Using (17) and Assumption 3, we conclude and therefore .
∎
Proof of Proposition 5.
First of all, under Assumption 3, it is straightforward to verify that the matrix is positive definite and therefore invertible. Given , expanding the function in (21), we have
[TABLE]
Since \mathcal{M}^{(T)}_{j}\big{(}\bm{\omega}^{(T)}\big{)}=0, dividing both sides of the equality above by , using (63) and (64), we have
[TABLE]
where , , is the index such that , and we have introduced
[TABLE]
if , for some index , , and , otherwise.
Let denote the matrix whose entries are , and let denote the -dimensional vector whose component equals the left-hand side of (92). With these notations, (92) can be written as
[TABLE]
Applying Lemma 6 in Appendix C, we know that, as , the vector converges in distribution to a Gaussian random variable whose mean equals zero and whose covariance matrix is given by . At the same time, since almost surely according to Proposition 4, Lemma 5 in Appendix C implies Therefore, applying Slutsky’s Theorem [19], we can conclude
[TABLE]
∎
We continue to prove the results in Subsection 5.2.
Proof of Lemma 2.
Lemma 3 in Appendix B implies that
[TABLE]
Since is the minimizer, we can derive
[TABLE]
Taking the limit , using the fact , as well as Lemma 5 in Appendix C, we obtain
[TABLE]
Now we show (73) by contradiction. Suppose it does not hold, applying Lemma 3 in Appendix B, we can find an index , , and a state with , such that by extracting a subsequence, which will be again denoted by , we have either
[TABLE]
Using (35), we can estimate
[TABLE]
where we have used Lemma 7 below, as well as the convention . Since , applying Lemma 5 in Appendix C, we have
[TABLE]
Therefore, Lemma 7 below implies that , almost surely in the both cases in (95). For the same reason, applying Lemma 5 in Appendix C, we know that
[TABLE]
Taking the limit in (96), we obtain , which contradicts (94). Therefore, (73) has been proved. The boundedness of the sequence follows directly from (73) and Assumption 6. ∎
The following elementary facts have been used in the proof above.
Lemma 7**.**
Consider the function , where are two constants. We have
* is convex on .* 2. 2.
, and . 3. 3.
When , then .
Finally, we briefly present the proofs of Theorem 1 and Theorem 2, since the argument is similar to the one in Proposition 4 and Proposition 5.
Proof of Theorem 1.
Lemma 2 implies that the sequence , , is bounded. Let be a limit point of as . Similar to (35), let us define
[TABLE]
For any , using a similar derivation as (1), we can obtain
[TABLE]
as well as
[TABLE]
Since is the minimizer of the problem (36), we have
[TABLE]
Taking the limit in the inequality above, using (74), (97) and (98), we obtain
[TABLE]
In particular, choosing , we get
[TABLE]
which implies a_{i}^{(0)}\big{(}x\,;\,\bar{\bm{\omega}}\big{)}=a_{i}^{(0)}\big{(}x\,;\,\bm{\omega}^{*}\big{)} , and . From the uniqueness of (Assumption 5), we know and therefore the convergence is obtained. ∎
Proof of Theorem 2.
First of all, the assumption (75) implies . Therefore, Theorem 1 assures the almost sure convergence of the sequence to .
The same identity (91) still holds for and in the current setting. Similar to (93), using the relation (41), in the current case we can obtain
[TABLE]
where the vector is bounded, is given by
[TABLE]
for , and is given by
[TABLE]
if there is an index , , such that , and otherwise . See (71) and (72).
Due to the second assumption in Theorem 2, we can find another constant such that , for all and , unless for all . Applying Lemma 3 in Appendix B, we have
[TABLE]
Since and the functions are bounded, the second terms in the expressions of both in (100) and in (101) converge to zero as . At the same time, Lemma 4 in Appendix B implies that and , uniformly on . Applying Lemma 6 in Appendix C, we know that the vector converges in distribution to a Gaussian random variable with zero mean and covariance matrix given by . Since almost surely, Lemma 5 in Appendix C implies , a.s. Applying Slutsky’s Theorem [19] and using the assumption (75), we can conclude
[TABLE]
∎
Remark 8**.**
The second assumption in Theorem 2 is used to handle the second terms of in (100) and in (101). It is not needed if for all and .
Acknowledgements
This work is funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany’s Excellence Strategy — The Berlin Mathematics Research Center MATH+ (EXC-2046/1, project ID: 390685689). The authors also acknowledge financial support from the Einstein Center of Mathematics (ECMath) through project CH21.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1[1] D. F. Anderson and T. G. Kurtz. Continuous time Markov chain models for chemical reaction networks. In H. Koeppl, G. Setti, M. di Bernardo, and D. Densmore, editors, Design and Analysis of Biomolecular Circuits: Engineering Approaches to Systems and Synthetic Biology , pages 3–42. Springer New York, New York, NY, 2011.
- 2[2] D. Angeli. A tutorial on chemical reaction network dynamics. Eur. J. Control , 15(3):398 – 406, 2009.
- 3[3] M. Ashyraliyev, Y. Fomekong-Nanfack, J. A. Kaandorp, and J. G. Blom. Systems biology: parameter estimation for biochemical models. FEBS J. , 276(4):886–902, 2009.
- 4[4] A. Bagirov, N. Karmitsa, and M. M. Mäkelä. Introduction to Nonsmooth Optimization: Theory, Practice and Software . Springer Publishing Company, Incorporated, 2014.
- 5[5] K. Ball, T. G. Kurtz, L. Popovic, and G. Rempala. Asymptotic analysis of multiscale approximations to reaction networks. Ann. Appl. Probab. , 16(4):1925–1961, 2006.
- 6[6] A.-L. Barabási and Z. N Oltvai. Network biology: Understanding the cell’s functional organization. Nat. Rev. Genet. , 5:101–113, 2004.
- 7[7] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci. , 2(1):183–202, 2009.
- 8[8] L. Boninsegna, F. Nüske, and C. Clementi. Sparse learning of stochastic dynamical equations. J. Chem. Phys. , 148(24):241723, 2018.
