Learning chemical reaction networks from trajectory data

Wei Zhang; Stefan Klus; Tim Conrad; Christof Schütte

arXiv:1902.04920 · math.OC · November 25, 2019 · SIAM J. Appl. Dyn. Syst.

TL;DR

This paper introduces a data-driven approach to infer chemical reaction networks from trajectory data by modeling them as continuous-time Markov chains and employing sparse regularization to learn propensity functions.

Contribution

The paper presents a novel method for learning reaction networks from trajectory data using likelihood maximization and $l^1$ regularization, with theoretical analysis and numerical validation.

Findings

1. Effective learning of propensity functions from synthetic data
2. Asymptotic analysis confirms method's consistency in infinite-data limit
3. Demonstrated applicability to fully observed reaction systems

Abstract

We develop a data-driven method to learn chemical reaction networks from trajectory data. Modeling the reaction system as a continuous-time Markov chain and assuming the system is fully observed, our method learns the propensity functions of the system with predetermined basis functions by maximizing the likelihood function of the trajectory data under $l^{1}$ sparse regularization. We demonstrate our method with numerical examples using synthetic data and carry out an asymptotic analysis of the proposed learning procedure in the infinite-data limit.

Tables

Table 1: For different types of chemical reactions, the propensity function $a_{\mathcal{R}}^{*}(x)$, as a function of the system's state $x=(x^{(1)},\dots,x^{(n)})^{\top}$, is given by the law of mass action. $V$ is a constant related to either the volume or the total number of molecules in the system, and $\kappa$ denotes the rate constant of the chemical reaction.

No.	Reaction $\mathcal{R}$	$a_{\mathcal{R}}^{*}(x)$
$1$	$\emptyset \xrightarrow{\kappa} \text{products}$	$\kappa V$
$2$	$S_{i} \xrightarrow{\kappa} \text{products}$	$\kappa x^{(i)}$
$3$	$2 S_{i} \xrightarrow{\kappa} \text{products}$	$\frac{\kappa}{V} x^{(i)} (x^{(i)} - 1)$
$4$	$S_{i} + S_{j} \xrightarrow{\kappa} \text{products}$	$\frac{\kappa}{V} x^{(i)} x^{(j)}$
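The four mass-action rules in Table 1 can be written as one small function. This is a minimal sketch: the function name, the string labels for the reaction types, and the 0-based species indices `i` and `j` are illustrative choices, not notation from the paper.

```python
from typing import Sequence


def mass_action_propensity(kind: str, kappa: float, x: Sequence[int],
                           V: float = 1.0, i: int = 0, j: int = 1) -> float:
    """Propensity a*_R(x) for the four reaction types listed in Table 1.

    `kind`, `i`, and `j` are hypothetical names; i and j are 0-based
    species indices into the state vector x.
    """
    if kind == "source":          # emptyset -> products
        return kappa * V
    if kind == "unimolecular":    # S_i -> products
        return kappa * x[i]
    if kind == "dimerization":    # 2 S_i -> products
        return kappa / V * x[i] * (x[i] - 1)
    if kind == "bimolecular":     # S_i + S_j -> products
        return kappa / V * x[i] * x[j]
    raise ValueError(f"unknown reaction type: {kind}")


# Toy state with 10 copies of the first species and 4 of the second.
x = [10, 4]
a_source = mass_action_propensity("source", 2.0, x, V=5.0)
a_uni = mass_action_propensity("unimolecular", 0.5, x, i=0)
a_dim = mass_action_propensity("dimerization", 1.0, x, V=5.0, i=0)
a_bi = mass_action_propensity("bimolecular", 1.0, x, V=5.0, i=0, j=1)
print(a_source, a_uni, a_dim, a_bi)
```

Note that the dimerization rule uses $x^{(i)}(x^{(i)}-1)$ rather than $(x^{(i)})^{2}$, so the propensity vanishes when only one molecule is present.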

Table 2: Example 1. The chemical reaction system consists of two species $A$ and $B$ and $4$ chemical reactions. The copy-numbers of these two species are denoted by $x=(x^{(1)},x^{(2)})^{\top}$. Here, $\kappa_{i}$, $v$, and $a^{*}_{\mathcal{R}}(x)$ are the rate constant, the state change vector, and the propensity function of the reactions, respectively.

No.	Reaction	$v^{\top}$	Channel	$a_{\mathcal{R}}^{*}(x)$
$1$	$A \xrightarrow{\kappa_{1}} \emptyset$	$(-1, 0)$	$1$	$\kappa_{1} x^{(1)}$
$2$	$A + B \xrightarrow{\kappa_{2}} 2B$	$(-1, 1)$	$2$	$\kappa_{2} x^{(1)} x^{(2)}$
$3$	$B \xrightarrow{\kappa_{3}} \emptyset$	$(0, -1)$	$3$	$\kappa_{3} x^{(2)}$
$4$	$A \xrightarrow{\kappa_{4}} 2A$	$(1, 0)$	$4$	$\kappa_{4} x^{(1)}$

Table 3: Example 1. The state change vectors $v$ of the $4$ reaction channels in the system and the numbers of occurrences of their activations within the $100$ trajectories are obtained by analyzing the trajectory data.

Channel	$1$	$2$	$3$	$4$
Vector $v^{⊤}$	$(- 1, 0)$	$(- 1, 1)$	$(0, - 1)$	$(1, 0)$
No. of occurrences	$2296$	$1778$	$2777$	$2135$
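Reading off channels from fully observed data amounts to differencing consecutive states and counting each distinct jump. The sketch below illustrates this on a hand-made toy data set; the function name and the toy states are assumptions for illustration and are not the trajectories behind Table 3.

```python
from collections import Counter
from typing import Dict, Sequence, Tuple

State = Tuple[int, ...]


def count_channels(trajectories: Sequence[Sequence[State]]) -> Dict[State, int]:
    """Map each observed state change vector v to its number of occurrences."""
    counts: Counter = Counter()
    for states in trajectories:
        # Each pair of consecutive states corresponds to one reaction event.
        for prev, nxt in zip(states, states[1:]):
            v = tuple(b - a for a, b in zip(prev, nxt))
            counts[v] += 1
    return dict(counts)


# Two short, hypothetical state sequences for a two-species system.
toy = [
    [(3, 1), (2, 1), (1, 2), (1, 1), (2, 1)],
    [(2, 2), (1, 3), (1, 2)],
]
counts = count_channels(toy)
for v, m in sorted(counts.items()):
    print(v, m)
```

Reactions that share a state change vector are indistinguishable at this step and end up in the same channel.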

Table 4: The first learning task in Example 1. The row labeled “True” shows the parameters in (45) used to generate the $100$ trajectories of the reaction system. The row labeled “Estimated” shows the parameters obtained by minimizing the negative log-likelihood function (46).

	$κ_{1}$	$κ_{2}$	$κ_{3}$	$κ_{4}$
True	$1.0$	$0.1$	$1.0$	$0.9$
Estimated	$0.98$	$0.10$	$0.97$	$0.91$

Table 5: The second learning task in Example 1. The parameters in the propensity functions (48) of the $4$ channels are estimated by solving the sparse minimization problems (49), with $\epsilon=0.1$ and $\lambda=0.2,\,0.1,\,0.01$, respectively. For each channel $\mathcal{C}_{i}$, $1\leq i\leq 4$, the same set of basis functions in (47) is used in the estimation. Each row shows the estimated parameters $\bm{\omega}^{(i)}=(\omega_{6(i-1)+1},\,\omega_{6(i-1)+2},\,\dots,\,\omega_{6(i-1)+6})^{\top}$, which appear in (48) in front of the basis functions $1$, $x^{(1)}$, $x^{(2)}$, $(x^{(1)})^{2}$, $x^{(1)}x^{(2)}$, and $(x^{(2)})^{2}$. The parameter that has the largest absolute value within the same row is underlined.

Channel	$λ$	$1$	$x^{(1)}$	$x^{(2)}$	${(x^{(1)})}^{2}$	$x^{(1)} x^{(2)}$	${(x^{(2)})}^{2}$
$1$	$0.2$	$- 1.7 \cdot 10^{- 2}$	$0.66$	$0$	$1.7 \cdot 10^{- 2}$	$1.1 \cdot 10^{- 2}$	$1.7 \cdot 10^{- 4}$
	$0.1$	$- 1.2 \cdot 10^{- 1}$	$0.84$	$0$	$6.6 \cdot 10^{- 3}$	$6.7 \cdot 10^{- 3}$	$1.6 \cdot 10^{- 4}$
	$0.01$	$- 0.24$	$1.02$	$2.6 \cdot 10^{- 3}$	$- 2.4 \cdot 10^{- 3}$	$1.4 \cdot 10^{- 3}$	$1.0 \cdot 10^{- 4}$
$2$	$0.2$	$- 7.6 \cdot 10^{- 2}$	$0$	$0$	$- 3.8 \cdot 10^{- 4}$	$0.10$	$- 2.4 \cdot 10^{- 4}$
	$0.1$	$- 0.14$	$0$	$0$	$- 1.5 \cdot 10^{- 4}$	$0.10$	$0$
	$0.01$	$- 0.24$	$1.8 \cdot 10^{- 2}$	$2.1 \cdot 10^{- 2}$	$- 1.2 \cdot 10^{- 3}$	$0.10$	$- 1.1 \cdot 10^{- 3}$
$3$	$0.2$	$0$	$0$	$0.73$	$- 2.0 \cdot 10^{- 3}$	$0$	$2.0 \cdot 10^{- 2}$
	$0.1$	$- 0.11$	$- 8.4 \cdot 10^{- 6}$	$0.90$	$- 2.6 \cdot 10^{- 3}$	$0$	$7.5 \cdot 10^{- 3}$
	$0.01$	$- 0.25$	$3.5 \cdot 10^{- 5}$	$1.12$	$- 3.3 \cdot 10^{- 3}$	$- 1.5 \cdot 10^{- 3}$	$- 6.7 \cdot 10^{- 3}$
$4$	$0.2$	$- 1.7 \cdot 10^{- 2}$	$0.62$	$0$	$1.6 \cdot 10^{- 2}$	$8.0 \cdot 10^{- 3}$	$4.8 \cdot 10^{- 4}$
	$0.1$	$- 0.11$	$0.79$	$9.9 \cdot 10^{- 6}$	$6.0 \cdot 10^{- 3}$	$4.5 \cdot 10^{- 3}$	$4.4 \cdot 10^{- 4}$
	$0.01$	$- 0.25$	$0.96$	$1.7 \cdot 10^{- 6}$	$- 2.3 \cdot 10^{- 3}$	$3.5 \cdot 10^{- 4}$	$6.7 \cdot 10^{- 4}$

Table 6: Example 2. Chemical reaction system of predator-prey type. Two species $A$ (prey) and $B$ (predator) are involved in $5$ chemical reactions. The copy-numbers of $A,B$ are denoted by $x=(x^{(1)},x^{(2)})^{\top}$. The $1$st and the $3$rd reactions model the replication (birth) of $A$ and $B$, respectively. The $2$nd and the $4$th reactions model the depopulation (death) of $A$ and $B$, respectively. The $5$th reaction models the preying process of $B$ on $A$. Here, $\kappa_{i}$, $v$, and $a^{*}_{\mathcal{R}}(x)$ are the rate constant, the state change vector, and the propensity function of the reactions, respectively. The $2$nd and the $5$th reactions have the same state change vector $v=(-1,0)^{\top}$ and belong to the same reaction channel $\mathcal{C}_{1}$.

No.	Reaction	$v^{\top}$	Channel	$a_{\mathcal{R}}^{*}(x)$
$1$	$A \xrightarrow{\kappa_{1}} 2A$	$(1, 0)$	$4$	$\kappa_{1} x^{(1)}$
$2$	$A \xrightarrow{\kappa_{2}} \emptyset$	$(-1, 0)$	$1$	$\kappa_{2} x^{(1)}$
$3$	$B \xrightarrow{\kappa_{3}} 2B$	$(0, 1)$	$3$	$\kappa_{3} x^{(2)}$
$4$	$B \xrightarrow{\kappa_{4}} \emptyset$	$(0, -1)$	$2$	$\kappa_{4} x^{(2)}$
$5$	$A + B \xrightarrow{\kappa_{5}} B$	$(-1, 0)$	$1$	$\kappa_{5} x^{(1)} x^{(2)}$

Table 7: Example 2. Both the state change vectors $v$ of the $4$ reaction channels in the system and the numbers of occurrences of their activations within the $100$ trajectories can be obtained by analyzing the trajectory data.

Channel	$1$	$2$	$3$	$4$
Vector $v^{⊤}$	$(- 1, 0)$	$(0, - 1)$	$(0, 1)$	$(1, 0)$
No. of occurrences	$22828$	$14065$	$14840$	$42837$

Table 8: The first learning task in Example 2. The row labeled “True” shows the parameters in (51) used to generate the $100$ trajectories of the system. The row labeled “Estimated” shows the parameters obtained by minimizing the negative log-likelihood function (46).

	$κ_{1}$	$κ_{2}$	$κ_{3}$	$κ_{4}$	$κ_{5}$
True	$1.2$	$0.3$	$0.8$	$0.75$	$0.1$
Estimated	$1.20$	$0.30$	$0.80$	$0.76$	$0.10$

Table 9: Example 2. As discussed in Remark 3, the index $k$, $1\leq k\leq 6$, counts the different basis functions $\phi_{k}$, while the index $j$, $1\leq j\leq 24$, counts the basis functions $\varphi_{j}$ for all $4$ channels. The same set of basis functions $\phi_{k}$ in (47) is used for each of the $4$ channels. For each $j$ belonging to channel $\mathcal{C}_{i}$, i.e., $6(i-1)<j\leq 6i$, we have the correspondence $\varphi_{j}=\phi_{k}$ if $j=6(i-1)+k$; see (42). For each channel $\mathcal{C}_{i}$, the column labeled “$\max\varphi_{j}$” shows the maximal values of the $6$ basis functions $\phi_{k}$ (in different rows) evaluated on the trajectory data. The maximal values are computed among all the states in the trajectory data at which $\mathcal{C}_{i}$ has been activated. The rescaling constants $c_{j}$ are determined empirically depending on these maximal values such that the functions $\varphi_{j}/c_{j}$ are roughly of the same order of magnitude.

		Channel $1$		Channel $2$		Channel $3$		Channel $4$
$k$	$ϕ_{k}$	$\max φ_{j}$	$c_{j}$	$\max φ_{j}$	$c_{j}$	$\max φ_{j}$	$c_{j}$	$\max φ_{j}$	$c_{j}$
$1$	$1$	$1$	$1$	$1$	$1$	$1$	$1$	$1$	$1$
$2$	$x^{(1)}$	$5.3 \cdot 10^{3}$	$10$	$2.2 \cdot 10^{3}$	$10$	$2.1 \cdot 10^{3}$	$10$	$5.3 \cdot 10^{3}$	$50$
$3$	$x^{(2)}$	$41$	$1$	$104$	$1$	$103$	$1$	$38$	$1$
$4$	${(x^{(1)})}^{2}$	$2.8 \cdot 10^{7}$	$50000$	$4.8 \cdot 10^{6}$	$10000$	$4.4 \cdot 10^{6}$	$20000$	$2.8 \cdot 10^{7}$	$100000$
$5$	$x^{(1)} x^{(2)}$	$1.1 \cdot 10^{4}$	$100$	$1.2 \cdot 10^{4}$	$100$	$8.3 \cdot 10^{3}$	$20$	$1.2 \cdot 10^{4}$	$100$
$6$	${(x^{(2)})}^{2}$	$1.7 \cdot 10^{3}$	$5$	$1.1 \cdot 10^{4}$	$100$	$1.1 \cdot 10^{4}$	$50$	$1.4 \cdot 10^{3}$	$10$
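Rescaling a basis function by $c_{j}$ does not change the family of propensities that can be represented, because $\omega_{j}\varphi_{j}(x) = (c_{j}\omega_{j})\,(\varphi_{j}(x)/c_{j})$; only the size of the coefficient changes. The sketch below applies the Channel $1$ constants from Table 9 to the listed maxima and checks this identity at one state. The state `x` and the coefficient vector `omega` are illustrative assumptions (the nonzero entries mirror the true rates $\kappa_{2}=0.3$ and $\kappa_{5}=0.1$ of the two reactions in channel $\mathcal{C}_{1}$).

```python
from typing import List, Sequence


def basis(x: Sequence[float]) -> List[float]:
    """The six basis functions 1, x1, x2, x1^2, x1*x2, x2^2."""
    return [1.0, x[0], x[1], x[0] ** 2, x[0] * x[1], x[1] ** 2]


# Channel 1 columns of Table 9: maxima of the basis functions and constants c_j.
max_phi = [1.0, 5.3e3, 41.0, 2.8e7, 1.1e4, 1.7e3]
c = [1.0, 10.0, 1.0, 50000.0, 100.0, 5.0]
rescaled_max = [m / cj for m, cj in zip(max_phi, c)]
print(rescaled_max)

# Hypothetical coefficients and state, used only to check the identity.
omega = [0.0, 0.3, 0.0, 0.0, 0.1, 0.0]
x = (100, 20)

a_original = sum(w * p for w, p in zip(omega, basis(x)))
omega_rescaled = [cj * w for cj, w in zip(c, omega)]
a_rescaled = sum(wt * p / cj for wt, p, cj in zip(omega_rescaled, basis(x), c))
print(a_original, a_rescaled)
```

Coefficients estimated in rescaled units are converted back by dividing by $c_{j}$.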

Table 10: The second learning task in Example 2. The parameters in the propensity functions (48) of the $4$ channels in Table 7 are estimated, with $\epsilon=0.1$ and $\lambda=0.2,\,0.1,\,0.01$, respectively. As discussed in Remark 3, for each channel $i$, $1\leq i\leq 4$, the same set of basis functions in (47) is used and the rescaled version of the sparse minimization problem (49) is solved, by rescaling the basis functions using the constants $c_{j}$ in Table 9. Each row shows, for different $\lambda$, the estimated parameters $\bm{\omega}^{(i)}=(\omega_{6(i-1)+1},\,\omega_{6(i-1)+2},\,\dots,\,\omega_{6(i-1)+6})^{\top}$, which appear in (48) in front of the basis functions $1$, $x^{(1)}$, $x^{(2)}$, $(x^{(1)})^{2}$, $x^{(1)}x^{(2)}$, and $(x^{(2)})^{2}$. The parameters that have relatively significant absolute values within the same row are underlined.

Channel	$λ$	$1$	$x^{(1)}$	$x^{(2)}$	${(x^{(1)})}^{2}$	$x^{(1)} x^{(2)}$	${(x^{(2)})}^{2}$
$1$	$0.2$	$0$	$0.30$	$- 1.1 \cdot 10^{- 2}$	$8.4 \cdot 10^{- 7}$	$0.10$	$- 2.8 \cdot 10^{- 4}$
	$0.1$	$0$	$0.30$	$- 2.2 \cdot 10^{- 2}$	$- 6.2 \cdot 10^{- 7}$	$0.10$	$1.9 \cdot 10^{- 4}$
	$0.01$	$- 1.7 \cdot 10^{- 2}$	$0.30$	$- 2.4 \cdot 10^{- 2}$	$- 2.3 \cdot 10^{- 6}$	$0.10$	$2.4 \cdot 10^{- 4}$
$2$	$0.2$	$0$	$- 1.1 \cdot 10^{- 3}$	$0.71$	$2.8 \cdot 10^{- 7}$	$4.1 \cdot 10^{- 4}$	$1.3 \cdot 10^{- 3}$
	$0.1$	$0$	$- 1.1 \cdot 10^{- 3}$	$0.73$	$2.8 \cdot 10^{- 7}$	$3.6 \cdot 10^{- 4}$	$8.4 \cdot 10^{- 4}$
	$0.01$	$- 1.1 \cdot 10^{- 3}$	$- 1.1 \cdot 10^{- 3}$	$0.75$	$2.7 \cdot 10^{- 7}$	$3.0 \cdot 10^{- 4}$	$3.9 \cdot 10^{- 4}$
$3$	$0.2$	$0$	$- 2.3 \cdot 10^{- 4}$	$0.76$	$- 3.3 \cdot 10^{- 7}$	$1.8 \cdot 10^{- 4}$	$1.1 \cdot 10^{- 3}$
	$0.1$	$0$	$- 3.3 \cdot 10^{- 4}$	$0.78$	$- 1.6 \cdot 10^{- 7}$	$9.6 \cdot 10^{- 5}$	$5.8 \cdot 10^{- 4}$
	$0.01$	$- 5.1 \cdot 10^{- 2}$	$- 1.5 \cdot 10^{- 4}$	$0.80$	$- 1.6 \cdot 10^{- 7}$	$9.4 \cdot 10^{- 6}$	$4.1 \cdot 10^{- 5}$
$4$	$0.2$	$0$	$1.16$	$- 1.2 \cdot 10^{- 2}$	$1.9 \cdot 10^{- 5}$	$4.8 \cdot 10^{- 3}$	$- 4.0 \cdot 10^{- 5}$
	$0.1$	$0$	$1.17$	$- 1.7 \cdot 10^{- 2}$	$1.4 \cdot 10^{- 5}$	$4.0 \cdot 10^{- 3}$	$1.4 \cdot 10^{- 4}$
	$0.01$	$- 0.13$	$1.18$	$- 1.0 \cdot 10^{- 2}$	$1.0 \cdot 10^{- 5}$	$3.5 \cdot 10^{- 3}$	$7.6 \cdot 10^{- 5}$

Table 11: Example 3. The reaction network models a type of intracellular viral infection [44]. There are $4$ different species in the system, i.e., the viral template (T), the viral genome (G), the viral structure protein (S), and the virus (V), which are involved in $6$ chemical reactions. The copy-numbers of $T$, $G$, $S$, and $V$ are denoted by the state vector $x=(x^{(1)},x^{(2)},x^{(3)},x^{(4)})^{\top}$.

No.	Reaction	$v^{\top}$	Channel	$a_{\mathcal{R}}^{*}(x)$
$1$	$T \xrightarrow{\kappa_{1}} \emptyset$	$(-1, 0, 0, 0)$	$1$	$\kappa_{1} x^{(1)}$
$2$	$G + S \xrightarrow{\kappa_{2}} V$	$(0, -1, -1, 1)$	$2$	$\kappa_{2} x^{(2)} x^{(3)}$
$3$	$S \xrightarrow{\kappa_{3}} \emptyset$	$(0, 0, -1, 0)$	$3$	$\kappa_{3} x^{(3)}$
$4$	$T \xrightarrow{\kappa_{4}} T + S$	$(0, 0, 1, 0)$	$4$	$\kappa_{4} x^{(1)}$
$5$	$T \xrightarrow{\kappa_{5}} T + G$	$(0, 1, 0, 0)$	$5$	$\kappa_{5} x^{(1)}$
$6$	$G \xrightarrow{\kappa_{6}} T$	$(1, -1, 0, 0)$	$6$	$\kappa_{6} x^{(2)}$

Table 12: Example 3. The state change vectors $v$ of the $6$ reaction channels in the system and the numbers of occurrences of their activations within the $10$ trajectories can be obtained by analyzing the trajectory data.

Channel	$1$	$2$	$3$	$4$	$5$	$6$
Vector $v^{⊤}$	$(- 1, 0, 0, 0)$	$(0, - 1, - 1, 1)$	$(0, 0, - 1, 0)$	$(0, 0, 1, 0)$	$(0, 1, 0, 0)$	$(1, - 1, 0, 0)$
No. of occurrences	$214$	$1534$	$87942$	$90130$	$1743$	$206$

Table 13: The first learning task in Example 3. The row labeled “True” shows the parameters in (53) used to generate the $10$ trajectories of the system. The row labeled “Estimated” shows the parameters obtained by minimizing the negative log-likelihood function (46).

	$κ_{1}$	$κ_{2}$	$κ_{3}$	$κ_{4}$	$κ_{5}$	$κ_{6}$
True	$0.25$	$0.001$	$0.3$	$100.0$	$2.0$	$0.1$
Estimated	$0.24$	$0.001$	$0.30$	$99.3$	$1.92$	$0.10$
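To read Table 13 quantitatively, the relative error of each estimate can be set against the number of activations of its channel from Table 12 (in Example 3, reaction $k$ is the only reaction in channel $k$). The short calculation below does this. The quantity $1/\sqrt{M_{i}}$ is a standard asymptotic rule of thumb for the relative standard error of a one-parameter maximum likelihood estimate (for a channel with a single basis function, the second derivative of $-\ln\mathcal{L}$ with respect to $\omega$ is $M_{i}/\omega^{2}$); it is added here as an assumption for orientation and is not a result quoted from the paper. The tabulated estimates are rounded, so the errors are approximate.

```python
import math

# "True" and "Estimated" rows of Table 13 (kappa_1 ... kappa_6).
true = [0.25, 0.001, 0.3, 100.0, 2.0, 0.1]
estimated = [0.24, 0.001, 0.30, 99.3, 1.92, 0.10]
# Numbers of occurrences of channels 1 ... 6 from Table 12.
occurrences = [214, 1534, 87942, 90130, 1743, 206]

rel_err = [abs(e - t) / t for e, t in zip(estimated, true)]
# Rule-of-thumb relative standard error 1/sqrt(M_i); an assumption, see lead-in.
rule_of_thumb = [1.0 / math.sqrt(m) for m in occurrences]

for k, (r, s, m) in enumerate(zip(rel_err, rule_of_thumb, occurrences), start=1):
    print(f"kappa_{k}: M_i = {m:6d}, relative error = {r:.4f}, 1/sqrt(M_i) = {s:.4f}")
```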

Table 14: Example 3. For the reaction channels $\mathcal{C}_{1},\mathcal{C}_{2},\dots,\mathcal{C}_{6}$ in the system, the maximal values of the $15$ basis functions $\phi_{k}$ in (54) are shown in the columns labeled “Ch. $1$”, “Ch. $2$”, $\dots$, and “Ch. $6$”, respectively. The same set of basis functions $\phi_{k}$, $1\leq k\leq 15$, is used for each of the $6$ channels. As discussed in Remark 3, the index $k$ counts different basis functions $\phi_{k}$, while the index $j$, $1\leq j\leq 6\cdot 15$, counts basis functions $\varphi_{j}$ for all the $6$ channels. For each $j$ belonging to channel $\mathcal{C}_{i}$, i.e., $15(i-1)<j\leq 15i$, we have the correspondence $\varphi_{j}=\phi_{k}$ if $j=15(i-1)+k$; see (42). For each channel $\mathcal{C}_{i}$, the column labeled “Ch. $i$” shows the maximal values of the $15$ basis functions $\phi_{k}$ (in different rows) evaluated on the trajectory data. The maximal values are computed among all the states in the $10$ trajectories at which $\mathcal{C}_{i}$ has been activated. The rescaling constants $c_{j}$ are determined empirically, such that after rescaling the basis functions are roughly of the same order of magnitude. Since the basis functions have similar maximal values in different channels, the same rescaling constants are used for all $6$ channels.

		$\max φ_{j}$
$k$	$ϕ_{k}$	Ch. $1$	Ch. $2$	Ch. $3$	Ch. $4$	Ch. $5$	Ch. $6$	$c_{j}$
$1$	$1$	$1$	$1$	$1$	$1$	$1$	$1$	$1$
$2$	$x^{(1)}$	$9$	$9$	$9$	$9$	$9$	$8$	$1$
$3$	$x^{(2)}$	$17$	$18$	$18$	$18$	$17$	$17$	$1$
$4$	$x^{(3)}$	$1857$	$1865$	$1868$	$1868$	$1855$	$1737$	$10$
$5$	$x^{(4)}$	$363$	$364$	$365$	$365$	$362$	$362$	$3$
$6$	${(x^{(1)})}^{2}$	$81$	$81$	$81$	$81$	$81$	$64$	$1$
$7$	$x^{(1)} x^{(2)}$	$102$	$119$	$119$	$119$	$112$	$98$	$1$
$8$	$x^{(1)} x^{(3)}$	$15786$	$15795$	$15804$	$15804$	$15723$	$12992$	$100$
$9$	$x^{(1)} x^{(4)}$	$2254$	$2247$	$2254$	$2254$	$2254$	$1890$	$20$
$10$	${(x^{(2)})}^{2}$	$289$	$324$	$324$	$324$	$289$	$289$	$2$
$11$	$x^{(2)} x^{(3)}$	$20434$	$26608$	$26640$	$26640$	$24825$	$20434$	$200$
$12$	$x^{(2)} x^{(4)}$	$2997$	$3320$	$3320$	$3320$	$2988$	$3150$	$30$
$13$	${(x^{(3)})}^{2}$	$3.4 \cdot 10^{6}$	$3.5 \cdot 10^{6}$	$3.5 \cdot 10^{6}$	$3.5 \cdot 10^{6}$	$3.4 \cdot 10^{6}$	$3.0 \cdot 10^{6}$	$30000$
$14$	$x^{(3)} x^{(4)}$	$514485$	$515264$	$516483$	$516483$	$514272$	$509288$	$5000$
$15$	${(x^{(4)})}^{2}$	$131769$	$132496$	$133225$	$133225$	$131044$	$131044$	$1000$

Table 15: The second learning task in Example 3. The parameters in the propensity functions (55) of the $6$ channels in Table 12 are estimated with $\epsilon=0.1$. In this example, different $\lambda$ have been chosen for different reaction channels. For each channel $\mathcal{C}_{i}$, $1\leq i\leq 6$, the rescaled version of the sparse minimization problem (49) is solved by rescaling the basis functions using the constants $c_{j}$ in Table 14. The same set of basis functions in (54) and the same set of rescaling constants are used in estimating the parameters for all the channels. Each column shows the estimated parameters $\bm{\omega}^{(i)}=(\omega_{15(i-1)+1},\,\omega_{15(i-1)+2},\,\dots,\,\omega_{15(i-1)+15})^{\top}$, which appear in (55) in front of the basis functions $\phi_{k}$. The parameters that have relatively significant absolute values within the same column are underlined.

		Ch. $1$	Ch. $2$	Ch. $3$	Ch. $4$	Ch. $5$	Ch. $6$
$k$	$ϕ_{k}$	$λ = 0.01$	$λ = 10$	$λ = 0.1$	$λ = 0.005$	$λ = 0.005$	$λ = 0.01$
$1$	$1$	$- 2.6 \cdot 10^{- 2}$	$0$	$0$	$- 8.4 \cdot 10^{- 1}$	$- 9.0 \cdot 10^{- 2}$	$- 1.1 \cdot 10^{- 2}$
$2$	$x^{(1)}$	$0.28$	$0$	$0$	$92.1$	$1.81$	$0$
$3$	$x^{(2)}$	$1.8 \cdot 10^{- 2}$	$0$	$0$	$- 3.3 \cdot 10^{- 3}$	$1.4 \cdot 10^{- 2}$	$0.11$
$4$	$x^{(3)}$	$- 4.3 \cdot 10^{- 4}$	$0$	$0.30$	$8.8 \cdot 10^{- 3}$	$1.0 \cdot 10^{- 3}$	$2.6 \cdot 10^{- 5}$
$5$	$x^{(4)}$	$- 5.0 \cdot 10^{- 4}$	$0$	$- 4.1 \cdot 10^{- 4}$	$- 5.2 \cdot 10^{- 3}$	$- 2.2 \cdot 10^{- 3}$	$- 3.2 \cdot 10^{- 4}$
$6$	${(x^{(1)})}^{2}$	$0$	$0$	$0$	$- 2.2 \cdot 10^{- 4}$	$- 3.7 \cdot 10^{- 2}$	$- 2.0 \cdot 10^{- 2}$
$7$	$x^{(1)} x^{(2)}$	$1.0 \cdot 10^{- 2}$	$0$	$0$	$3.3 \cdot 10^{- 1}$	$1.2 \cdot 10^{- 3}$	$9.4 \cdot 10^{- 3}$
$8$	$x^{(1)} x^{(3)}$	$- 1.3 \cdot 10^{- 4}$	$4.5 \cdot 10^{- 5}$	$- 1.6 \cdot 10^{- 3}$	$5.2 \cdot 10^{- 3}$	$3.6 \cdot 10^{- 4}$	$4.1 \cdot 10^{- 6}$
$9$	$x^{(1)} x^{(4)}$	$2.3 \cdot 10^{- 4}$	$0$	$- 6.7 \cdot 10^{- 4}$	$1.2 \cdot 10^{- 2}$	$7.4 \cdot 10^{- 4}$	$- 2.3 \cdot 10^{- 4}$
$10$	${(x^{(2)})}^{2}$	$- 3.7 \cdot 10^{- 3}$	$0$	$0$	$- 8.8 \cdot 10^{- 3}$	$- 3.7 \cdot 10^{- 3}$	$- 7.4 \cdot 10^{- 3}$
$11$	$x^{(2)} x^{(3)}$	$- 2.1 \cdot 10^{- 5}$	$9.5 \cdot 10^{- 4}$	$6.1 \cdot 10^{- 4}$	$8.3 \cdot 10^{- 5}$	$7.4 \cdot 10^{- 5}$	$8.5 \cdot 10^{- 5}$
$12$	$x^{(2)} x^{(4)}$	$2.2 \cdot 10^{- 4}$	$9.5 \cdot 10^{- 5}$	$6.5 \cdot 10^{- 4}$	$1.4 \cdot 10^{- 3}$	$- 5.4 \cdot 10^{- 4}$	$- 2.4 \cdot 10^{- 5}$
$13$	${(x^{(3)})}^{2}$	$6.9 \cdot 10^{- 7}$	$- 1.7 \cdot 10^{- 7}$	$1.2 \cdot 10^{- 7}$	$- 1.0 \cdot 10^{- 5}$	$- 1.8 \cdot 10^{- 6}$	$- 1.9 \cdot 10^{- 7}$
$14$	$x^{(3)} x^{(4)}$	$- 1.5 \cdot 10^{- 6}$	$6.4 \cdot 10^{- 7}$	$- 1.3 \cdot 10^{- 6}$	$- 3.0 \cdot 10^{- 5}$	$4.6 \cdot 10^{- 6}$	$7.5 \cdot 10^{- 7}$
$15$	${(x^{(4)})}^{2}$	$1.5 \cdot 10^{- 6}$	$- 4.3 \cdot 10^{- 7}$	$4.0 \cdot 10^{- 7}$	$8.9 \cdot 10^{- 6}$	$4.1 \cdot 10^{- 7}$	$5.5 \cdot 10^{- 7}$

Equations

x = (x^{(1)}, x^{(2)}, \dots, x^{(n)})^{\top} \in \mathbb{X} \subseteq \mathbb{N}^{n},

\psi_{\mathcal{R}}^{*}(t \mid x) = a_{\mathcal{R}}^{*}(x) \exp\big(-a_{\mathcal{R}}^{*}(x)\,t\big), \quad t \geq 0.

\mathcal{I}_{i} = \Big\{ j \,\Big|\, 1 \leq j \leq N,~ \mathcal{R}_{j}~\text{belongs to the channel}~\mathcal{C}_{i} \Big\},

a_{i}^{*}(x) = \sum_{j \in \mathcal{I}_{i}} a_{\mathcal{R}_{j}}^{*}(x), \qquad a^{*}(x) = \sum_{i=1}^{K} a_{i}^{*}(x) = \sum_{j=1}^{N} a_{\mathcal{R}_{j}}^{*}(x),

\begin{split} \psi^{*}(t\,;\,x) &= a^{*}(x) \exp\big(-a^{*}(x)\,t\big), \quad t \geq 0, \\ p^{*}(i\,;\,x) &= \frac{a^{*}_{i}(x)}{a^{*}(x)}, \quad 1 \leq i \leq K. \end{split}
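The waiting-time density $\psi^{*}$ and the channel probabilities $p^{*}$ together describe one step of a stochastic simulation: draw an exponential waiting time with rate $a^{*}(x)$, then pick channel $i$ with probability $a_{i}^{*}(x)/a^{*}(x)$. The following is a minimal sketch of such a simulator for the four channels of Example 1, using the "True" rate constants of Table 4. The function name, the record layout (state, waiting time, channel index), the initial state, and the time horizon are assumptions made for illustration, not the paper's setup.

```python
import random
from typing import Callable, List, Optional, Sequence, Tuple

Record = Tuple[Tuple[int, ...], float, Optional[int]]


def simulate(x0: Sequence[int],
             propensities: Sequence[Callable[[Sequence[int]], float]],
             changes: Sequence[Sequence[int]],
             T: float,
             rng: random.Random) -> List[Record]:
    """Simulate one trajectory on [0, T].

    Returns records (y_l, t_l, i_l); the last record has channel None and
    holds the remaining time spent in the final state.
    """
    x = list(x0)
    t = 0.0
    records: List[Record] = []
    while True:
        a = [f(x) for f in propensities]
        a_total = sum(a)
        # Waiting time ~ Exp(a_total); a_total == 0 means the state is absorbing.
        tau = rng.expovariate(a_total) if a_total > 0.0 else float("inf")
        if t + tau >= T:
            records.append((tuple(x), T - t, None))
            return records
        # Choose channel i with probability a_i / a_total.
        u = rng.random() * a_total
        i = max(k for k, ak in enumerate(a) if ak > 0.0)  # round-off fallback
        acc = 0.0
        for k, ak in enumerate(a):
            acc += ak
            if u < acc:
                i = k
                break
        records.append((tuple(x), tau, i))
        x = [xi + vi for xi, vi in zip(x, changes[i])]
        t += tau


kappa = (1.0, 0.1, 1.0, 0.9)  # "True" row of Table 4
propensities = [
    lambda x: kappa[0] * x[0],         # channel 1: A -> emptyset
    lambda x: kappa[1] * x[0] * x[1],  # channel 2: A + B -> 2B
    lambda x: kappa[2] * x[1],         # channel 3: B -> emptyset
    lambda x: kappa[3] * x[0],         # channel 4: A -> 2A
]
changes = [(-1, 0), (-1, 1), (0, -1), (1, 0)]

# Hypothetical initial state and horizon.
records = simulate((20, 10), propensities, changes, T=2.0, rng=random.Random(0))
print(len(records) - 1, "reaction events; final state", records[-1][0])
```

Channel indices are 0-based in the code, so index `0` corresponds to channel $1$ in the tables.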

X(t) = X(0) + \sum_{i=1}^{K} \mathcal{P}_{i}\Big(\int_{0}^{t} a^{*}_{i}(X(s))\,ds\Big) v_{i}, \quad t \geq 0,

\begin{split} \psi(t\,;\,x,\bm{\omega}) &= a(x\,;\,\bm{\omega}) \exp\big(-a(x\,;\,\bm{\omega})\,t\big), \quad t \geq 0, \\ p(i\,;\,x,\bm{\omega}) &= \frac{a_{i}(x\,;\,\bm{\omega})}{a(x\,;\,\bm{\omega})}, \quad 1 \leq i \leq K. \end{split}

(y_{0}, t_{0}), (y_{1}, t_{1}), (y_{2}, t_{2}), \dots, (y_{M}, t_{M}),

(i_{0}, t_{0}), (i_{1}, t_{1}), (i_{2}, t_{2}), \dots, (i_{M-1}, t_{M-1}),

0 \leq l_{1}^{(i)} < l_{2}^{(i)} < \dots < l_{M_{i}}^{(i)} < M,

\sum_{i=1}^{K} M_{i} = M

\mathbf{X} = \Big(M, (y_{l}, t_{l})_{l=0,1,\dots,M}\Big)

\rho^{(T)}(\mathbf{X} \mid \bm{\omega}) = \bigg[\prod_{l=0}^{M-1} \psi(t_{l}\,;\,y_{l},\bm{\omega})\, p(i_{l}\,;\,y_{l},\bm{\omega})\bigg] \exp\Big(-a(y_{M}\,;\,\bm{\omega})\,t_{M}\Big),

\mathbf{E}\,g(\mathbf{X}) = \sum_{M=0}^{+\infty} \sum_{i_{0}=1}^{K} \sum_{i_{1}=1}^{K} \cdots \sum_{i_{M-1}=1}^{K} \int_{\{t_{0}+t_{1}+\cdots+t_{M}=T\}} g(\mathbf{X})\, \rho^{(T)}(\mathbf{X} \mid \bm{\omega})\; dt_{0} \cdots dt_{M-1},

\mathbf{E}\,g(\mathbf{X}) = \int_{D_{T}} g(\mathbf{X})\, \rho^{(T)}(\mathbf{X} \mid \bm{\omega})\, d\mathbf{X}

\begin{split} \mathcal{L}^{(T)}(\bm{\omega}) &= \mathcal{L}^{(T)}\big(\bm{\omega} \,\big|\, \mathbf{X}\big) \\ &= \rho^{(T)}(\mathbf{X} \mid \bm{\omega}) \\ &= \bigg[\prod_{l=0}^{M-1} \psi(t_{l}\,;\,y_{l},\bm{\omega})\, p(i_{l}\,;\,y_{l},\bm{\omega})\bigg] \exp\Big(-a(y_{M}\,;\,\bm{\omega})\,t_{M}\Big) \\ &= \bigg[\prod_{l=0}^{M} \exp\Big(-a(y_{l}\,;\,\bm{\omega})\,t_{l}\Big)\bigg] \prod_{l=0}^{M-1} a_{i_{l}}(y_{l}\,;\,\bm{\omega}) \\ &= \prod_{i=1}^{K} \mathcal{L}_{i}^{(T)}(\bm{\omega}), \end{split}

\mathcal{L}_{i}^{(T)}(\bm{\omega}) = \bigg[\prod_{l=0}^{M} \exp\Big(-a_{i}(y_{l}\,;\,\bm{\omega})\,t_{l}\Big)\bigg] \prod_{k=1}^{M_{i}} a_{i}(y_{l_{k}^{(i)}}\,;\,\bm{\omega}), \qquad 1 \leq i \leq K,

a_{\mathcal{R}_{j}}^{*}(x) = \omega_{j}\,\varphi_{j}(x), \quad 1 \leq j \leq N,

\bm{\omega} = (\omega_{1}, \omega_{2}, \dots, \omega_{N})^{\top} \in \mathbb{R}^{N},

\bm{\omega}^{(i)} = (\omega_{j_{1}}, \omega_{j_{2}}, \dots, \omega_{j_{N_{i}}})^{\top}, \quad \text{where}~ \mathcal{I}_{i} = \{j_{1}, j_{2}, \dots, j_{N_{i}}\},

\begin{split} a_{i}\big(x\,;\,\bm{\omega}\big) &= a_{i}\big(x\,;\,\bm{\omega}^{(i)}\big) = \sum_{j \in \mathcal{I}_{i}} \omega_{j}\,\varphi_{j}(x), \quad 1 \leq i \leq K, \\ \text{and} \quad a\big(x\,;\,\bm{\omega}\big) &= \sum_{j=1}^{N} \omega_{j}\,\varphi_{j}(x), \end{split}

\min_{\bm{\omega}} \Big[-\ln \mathcal{L}^{(T)}(\bm{\omega})\Big].

\begin{split} -\ln \mathcal{L}^{(T)}(\bm{\omega}) &= -\sum_{l=0}^{M-1} \ln\bigg[\sum_{j \in \mathcal{I}_{i_{l}}} \omega_{j}\,\varphi_{j}(y_{l})\bigg] + \sum_{l=0}^{M} t_{l} \bigg[\sum_{j=1}^{N} \omega_{j}\,\varphi_{j}(y_{l})\bigg] \\ &= -\sum_{i=1}^{K} \sum_{k=1}^{M_{i}} \ln\bigg[\sum_{j \in \mathcal{I}_{i}} \omega_{j}\,\varphi_{j}(y_{l^{(i)}_{k}})\bigg] + \sum_{l=0}^{M} t_{l} \bigg[\sum_{j=1}^{N} \omega_{j}\,\varphi_{j}(y_{l})\bigg] \\ &= -\sum_{i=1}^{K} \ln \mathcal{L}_{i}^{(T)}(\bm{\omega}^{(i)}). \end{split}

\ln \mathcal{L}_{i}^{(T)}(\bm{\omega}^{(i)}) = \sum_{k=1}^{M_{i}} \ln\bigg[\sum_{j \in \mathcal{I}_{i}} \omega_{j}\,\varphi_{j}(y_{l^{(i)}_{k}})\bigg] - \sum_{l=0}^{M} t_{l} \bigg[\sum_{j \in \mathcal{I}_{i}} \omega_{j}\,\varphi_{j}(y_{l})\bigg]
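The per-channel log-likelihood has two parts: a sum of log-propensities over the states at which the channel fired, and a time-weighted sum of the propensity over all visited states. A minimal sketch of the negative of this quantity follows. The function name and the record layout (state, waiting time, index of the channel that fired, with `None` for the final segment) are assumptions for illustration; the toy data describe a single species decaying one molecule at a time.

```python
import math
from typing import Callable, List, Optional, Sequence, Tuple

Record = Tuple[Tuple[int, ...], float, Optional[int]]
Basis = Callable[[Sequence[int]], float]


def neg_log_likelihood_channel(records: Sequence[Record], channel: int,
                               omega: Sequence[float],
                               basis: Sequence[Basis]) -> float:
    """Return -ln L_i for one channel, given its coefficients omega."""
    total = 0.0
    for y, t, fired in records:
        # a_i(y; omega) = sum_j omega_j * phi_j(y)
        a_i = sum(w * phi(y) for w, phi in zip(omega, basis))
        total += t * a_i              # time-weighted propensity term
        if fired == channel:
            # The propensity must be positive wherever the channel fired.
            total -= math.log(a_i)
    return total


# Toy data: states 2 -> 1 -> 0 with waiting times 0.5, 1.0 and a final 0.5.
records: List[Record] = [((2,), 0.5, 0), ((1,), 1.0, 0), ((0,), 0.5, None)]
basis: List[Basis] = [lambda y: float(y[0])]  # single basis function phi(x) = x

nll_at_1 = neg_log_likelihood_channel(records, 0, [1.0], basis)
nll_low = neg_log_likelihood_channel(records, 0, [0.8], basis)
nll_high = neg_log_likelihood_channel(records, 0, [1.2], basis)
print(nll_at_1, nll_low, nll_high)
```

For these toy data the function reduces to $2\omega - \ln(2\omega^{2})$, which is smallest at $\omega = 1$.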

\min_{\bm{\omega}^{(i)}} \Big[-\ln \mathcal{L}_{i}^{(T)}(\bm{\omega}^{(i)})\Big], \qquad 1 \leq i \leq K,

\mathcal{M}^{(T)}_{j}(\bm{\omega}) = \frac{\partial\big(-\ln \mathcal{L}^{(T)}\big)}{\partial \omega_{j}}(\bm{\omega}) = -\sum_{k=1}^{M_{i}} \frac{\varphi_{j}(y_{l^{(i)}_{k}})}{\sum\limits_{j' \in \mathcal{I}_{i}} \omega_{j'}\,\varphi_{j'}(y_{l^{(i)}_{k}})} + \sum_{l=0}^{M} t_{l}\,\varphi_{j}(y_{l}) = 0.
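When a channel carries a single basis function ($\mathcal{I}_{i}=\{j\}$), the stationarity condition above can be solved by hand: the first sum collapses to $M_{i}/\omega_{j}$, so $\omega_{j} = M_{i} \big/ \sum_{l=0}^{M} t_{l}\,\varphi_{j}(y_{l})$. The sketch below evaluates this closed form on hand-made toy data and checks that the gradient vanishes there. The function names, record layout, and data are illustrative assumptions.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Record = Tuple[Tuple[int, ...], float, Optional[int]]


def mle_single_basis(records: Sequence[Record], channel: int,
                     phi: Callable[[Sequence[int]], float]) -> float:
    """omega = M_i / sum_l t_l * phi(y_l) for a channel with one basis function."""
    m_i = sum(1 for _, _, fired in records if fired == channel)
    exposure = sum(t * phi(y) for y, t, _ in records)
    return m_i / exposure


def gradient_single_basis(records: Sequence[Record], channel: int,
                          phi: Callable[[Sequence[int]], float],
                          omega: float) -> float:
    """Derivative of -ln L_i with respect to omega in the single-basis case."""
    m_i = sum(1 for _, _, fired in records if fired == channel)
    exposure = sum(t * phi(y) for y, t, _ in records)
    return -m_i / omega + exposure


# Toy data for two species: channel 0 removes one A, channel 1 removes one B.
records: List[Record] = [
    ((3, 1), 0.2, 0),
    ((2, 1), 0.3, 1),
    ((2, 0), 0.5, 0),
    ((1, 0), 1.0, None),
]
omega_a = mle_single_basis(records, 0, lambda y: float(y[0]))
omega_b = mle_single_basis(records, 1, lambda y: float(y[1]))
grad_a = gradient_single_basis(records, 0, lambda y: float(y[0]), omega_a)
print(omega_a, omega_b, grad_a)
```

The denominator is the time-integrated basis function along the trajectory, so the estimate is an event count divided by an exposure.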

\displaystyle\frac{\partial^{2}\big{(}-\ln\mathcal{L}^{(T)}\big{)}}{\partial\omega_{j}\partial\omega_{j^{\prime}}}(\bm{\omega})=\frac{\partial\mathcal{M}^{(T)}_{j}}{\partial\omega_{j^{\prime}}}(\bm{\omega})=\begin{cases}\sum\limits_{k=1}^{M_{i}}\frac{\varphi_{j}(y_{l^{(i)}_{k}})\,\varphi_{j^{\prime}}(y_{l^{(i)}_{k}})}{\big{(}\sum\limits_{r\in\mathcal{I}_{i}}\omega_{r}\varphi_{r}(y_{l^{(i)}_{k}})\big{)}^{2}}\,,&\quad\mbox{if}~{}j,\,j^{\prime}\in\mathcal{I}_{i}\,,\\ \,0\,,&\quad\mbox{otherwise}\,,\end{cases}

\displaystyle\frac{\partial^{2}\big{(}-\ln\mathcal{L}^{(T)}\big{)}}{\partial\omega_{j}\partial\omega_{j^{\prime}}}(\bm{\omega})=\frac{\partial\mathcal{M}^{(T)}_{j}}{\partial\omega_{j^{\prime}}}(\bm{\omega})=\begin{cases}\sum\limits_{k=1}^{M_{i}}\frac{\varphi_{j}(y_{l^{(i)}_{k}})\,\varphi_{j^{\prime}}(y_{l^{(i)}_{k}})}{\big{(}\sum\limits_{r\in\mathcal{I}_{i}}\omega_{r}\varphi_{r}(y_{l^{(i)}_{k}})\big{)}^{2}}\,,&\quad\mbox{if}~{}j,\,j^{\prime}\in\mathcal{I}_{i}\,,\\ \,0\,,&\quad\mbox{otherwise}\,,\end{cases}

Φ_{i} = φ_{j_{1}} (y_{l_{1}^{(i)}}) φ_{j_{1}} (y_{l_{2}^{(i)}}) φ_{j_{1}} (y_{l_{3}^{(i)}}) ⋮ φ_{j_{1}} (y_{l_{M_{i}}^{(i)}}) φ_{j_{2}} (y_{l_{1}^{(i)}}) φ_{j_{2}} (y_{l_{2}^{(i)}}) φ_{j_{2}} (y_{l_{3}^{(i)}}) ⋮ φ_{j_{2}} (y_{l_{M_{i}}^{(i)}}) \dots \dots \dots ⋱ \dots φ_{j_{N_{i}}} (y_{l_{1}^{(i)}}) φ_{j_{N_{i}}} (y_{l_{2}^{(i)}}) φ_{j_{N_{i}}} (y_{l_{3}^{(i)}}) ⋮ φ_{j_{N_{i}}} (y_{l_{M_{i}}^{(i)}}) \in R^{M_{i} \times N_{i}},

Φ_{i} = φ_{j_{1}} (y_{l_{1}^{(i)}}) φ_{j_{1}} (y_{l_{2}^{(i)}}) φ_{j_{1}} (y_{l_{3}^{(i)}}) ⋮ φ_{j_{1}} (y_{l_{M_{i}}^{(i)}}) φ_{j_{2}} (y_{l_{1}^{(i)}}) φ_{j_{2}} (y_{l_{2}^{(i)}}) φ_{j_{2}} (y_{l_{3}^{(i)}}) ⋮ φ_{j_{2}} (y_{l_{M_{i}}^{(i)}}) \dots \dots \dots ⋱ \dots φ_{j_{N_{i}}} (y_{l_{1}^{(i)}}) φ_{j_{N_{i}}} (y_{l_{2}^{(i)}}) φ_{j_{N_{i}}} (y_{l_{3}^{(i)}}) ⋮ φ_{j_{N_{i}}} (y_{l_{M_{i}}^{(i)}}) \in R^{M_{i} \times N_{i}},

\displaystyle\sum_{j=1}^{N}\sum_{j^{\prime}=1}^{N}\frac{\partial^{2}\big{(}-\ln\mathcal{L}^{(T)}\big{)}}{\partial\omega_{j}\partial\omega_{j^{\prime}}}\eta_{j}\eta_{j^{\prime}}=\sum_{i=1}^{K}\sum\limits_{k=1}^{M_{i}}\frac{\Big{(}\sum\limits_{j\in\mathcal{I}_{i}}\eta_{j}\,\varphi_{j}(y_{l_{k}^{(i)}})\Big{)}^{2}}{\Big{(}\sum\limits_{j\in\mathcal{I}_{i}}\omega_{j}\,\varphi_{j}(y_{l^{(i)}_{k}})\Big{)}^{2}}\geq 0\,.

\displaystyle\widetilde{\omega}_{j}=\omega_{j}\,,\quad\forall\,j\in\mathcal{I}_{i^{\prime}}\,,~i^{\prime}\neq i\,,\qquad\mbox{and}\qquad\sum_{j\in\mathcal{I}_{i}}\widetilde{\omega}_{j}\,\varphi_{j}(y_{l_{k}^{(i)}})=\sum_{j\in\mathcal{I}_{i}}\omega_{j}\,\varphi_{j}(y_{l_{k}^{(i)}})\,,\quad\forall\,1\leq k\leq M_{i}\,.

Code & Models

Repositories

zwpku/sparse-learning-CRN (official)

Full text

Learning chemical reaction networks from trajectory data

Wei Zhang (1), Stefan Klus (2), Tim Conrad (1, 2), Christof Schütte (1, 2)

(1) Zuse Institute Berlin, D-14195 Berlin, Germany.

(2) Department of Mathematics and Computer Science, Freie Universität Berlin, D-14195 Berlin, Germany.

Abstract

We develop a data-driven method to learn chemical reaction networks from trajectory data. Modeling the reaction system as a continuous-time Markov chain and assuming the system is fully observed, our method learns the propensity functions of the system with predetermined basis functions by maximizing the likelihood function of the trajectory data under $l^{1}$ sparse regularization. We demonstrate our method with numerical examples using synthetic data and carry out an asymptotic analysis of the proposed learning procedure in the infinite-data limit.

**Keywords** chemical reaction, inverse problem, data-driven method, $l^{1}$ sparse optimization, asymptotic analysis

**AMS** 92C42, 62M86

1 Introduction

Chemical reaction networks [23, 1] have been shown to be very useful in studying dynamical processes in chemistry and biology, where systems under investigation typically contain many different reactants that interact with each other. In in-silico biology, for instance, the cellular processes are often modeled as chemical reaction networks, which take the relevant biological/chemical components as well as their interactions into account [24, 6, 44, 35]. Modeling cellular processes, or finding the kinetic structure of the underlying reaction networks [54, 11, 14, 40, 51, 32], is one of the most prominent fields of in-silico biology due to the important role of such models in understanding the cellular behavior. This task is particularly challenging for realistic reaction networks that are characterized by a large number of elements and interactions (reactions). At the same time, more and more trajectory data of cellular processes is becoming available due to state-of-the-art single-cell based laboratory techniques [12, 42].

The aim of this work is to develop data-driven methods [33] that allow us to learn chemical reaction networks from trajectory data and to apply the new methods to the modeling of cellular processes. Given trajectory data of a stochastic chemical reaction process, we propose a numerical approach to reconstruct the underlying reaction network by maximizing the likelihood function of the trajectory with sparsity regularization. Roughly speaking, our approach consists of three steps. In the first step, preliminary information about the reaction network, such as the number of different elements (reactants, products) and the total number of reaction channels, is extracted from trajectory data by counting and enumerating. Based on this information, the second step is to define several basis functions which will be used in learning the propensity functions of the reaction network. The theory of chemical reactions suggests that we can choose each basis function as the product of copy-numbers of at most two different reactants [2, 16], i.e., polynomial functions of degree up to $2$. In the third step, the propensity function of each reaction channel is represented as a linear combination of the basis functions with unknown coefficients, which are then determined by maximizing the log-likelihood function of the trajectory data along with sparse regularization techniques using the $l^{1}$-norm [27].

In contrast to Lasso [47, 48], the optimization problem that needs to be solved in our learning approach is a nonlinear sparse optimization problem, due to the nonlinearity of the log-likelihood function of the reaction network. In our study, we find that FISTA (Fast Iterative Shrinkage-Thresholding Algorithm) proposed in [7] is a suitable algorithm for solving our problem. We also propose a simple preconditioning technique which can significantly improve the performance of the numerical algorithm by allowing larger step-sizes in FISTA. This preconditioning technique turns out to be particularly useful when the basis functions take values of different orders of magnitude for the given trajectory data. Furthermore, we provide an asymptotic analysis of our learning approach in the infinite-data limit. Under certain technical assumptions, by applying large sample theory [19, 34, 50] and limit theorems for stochastic processes [17], we establish the asymptotic consistency and the asymptotic normality of the estimators in our learning procedure, which therefore provides a solid theoretical basis for the data-driven method proposed in this paper.

Let us first review related work and summarize the contributions of this paper. The reconstruction of governing equations from data using sparsity constraints is receiving increasing attention, see [53, 10, 37, 14] for methods pertaining to ordinary differential equations (ODEs) and [8] for learning stochastic differential equations (SDEs). For chemical and biological reaction systems, the problem of estimating unknown parameters has been well studied when the systems are modeled both as ODEs [28, 3] and as continuous-time Markov chain processes [1, 41, 9, 54], while the reconstruction of entire chemical reaction networks, i.e., finding parsimonious models, has only been considered when the systems are modeled as ODE systems [51, 37, 14]. We refer the reader to the review [51] for recent developments in reverse engineering in systems biology. Compared to the aforementioned existing results, our work is new in the following three aspects. Firstly, we study sparse reconstruction of chemical reaction networks as continuous-time Markov chains, which, to the best of our knowledge, has not been considered in the literature. In contrast to ODE models, a continuous-time Markov chain as a stochastic model can provide more details of the reaction systems by capturing stochastic effects, which are known to be important for cellular processes [46, 45, 30]. Secondly, we have developed numerical codes in which we implemented the FISTA method [7] to solve a nonlinear sparse optimization problem in order to learn the reaction networks from trajectory data. Our numerical approaches, in particular the preconditioning technique, may be useful in other sparse optimization problems as well. Thirdly, we provide a theoretical justification of the proposed data-driven method.
Note that, although different data-driven methods using sparsity [53, 10, 8] have been developed in the literature for different types of dynamical systems, the theoretical analysis of these methods is largely incomplete (see [49]). We expect the theoretical analysis presented in the current work to shed light on the characteristic properties of other data-driven methods as well.

Before concluding this introduction, we discuss several issues that will not be studied in detail in the current paper. Most importantly, we assume that the dynamics of the system is fully observed. In applications, it may be the case that either only certain “important” species in the system are observed or the full dynamics is only discretely observed at a fixed observation frequency [9]. In the former case, one can still apply the sparse learning approach proposed here and the output will be an “effective” model for the observed “important” species. However, the theoretical asymptotic analysis does not carry over directly, and it is therefore important to assess the quality of the effective model provided by the learning approach. In the latter case, where the full dynamics is observed discretely, learning the parsimonious model becomes more challenging. First of all, since not all reactions are observed, the reaction channels of the system need to be identified by other means. Supposing that this can be done, the likelihood function of the given trajectory data can be obtained by summing up the likelihood of all possible underlying trajectories that are consistent with the observation data. One can formulate the learning approach again as a sparse minimization problem, but it will be necessary to sample the underlying trajectories of the system in order to evaluate both the likelihood function and its derivatives. This results in some difficulties when solving the sparse minimization problem. We will address these issues in future work.

The remainder of the paper is organized as follows. In Section 2, we introduce chemical reaction networks and the required notation. Learning chemical reaction networks from trajectory data and its formulation as an optimization problem will be considered in Section 3. In Section 4, we demonstrate the efficiency of the numerical algorithm for solving the (sparse) optimization problems with three concrete numerical examples. In Section 5, we analyze the learning tasks when the length of the trajectory data goes to infinity and study the asymptotic behavior of the solutions of the optimization problems. Appendix A summarizes the main steps of the algorithm FISTA. Appendix B contains properties of an elementary function used in the current work. Two useful limit lemmas of counting processes are summarized in Appendix C. Finally, the proofs of results in Section 5 are collected in Appendix D.

The code used for producing the numerical results in Section 4 is available at: https://github.com/zwpku/sparse-learning-CRN.

2 Chemical reaction networks as continuous-time Markov chains: forward problem

Chemical reaction networks consist of different chemical species that can interact with each other through independent chemical reactions. Suppose the system has $n$ different chemical species, denoted by $S_{1},S_{2},\dots,S_{n}$ . Each species $S_{i}$ , $1\leq i\leq n$ , has $x^{(i)}$ copies, where the copy-number $x^{(i)}\geq 0$ may change whenever a reaction involving the species $S_{i}$ has occurred. The state of the system can be represented as the vector

[TABLE]

where $\mathbb{N}=\{0,1,2,\dots\}$ and $\mathbb{X}$ is the set of all possible states of the system.

The evolution of the system’s state $x$ can be modeled as a state-dependent continuous-time Markov chain [1, 23]. Let $\mathcal{R}$ denote a reaction in the system. The state change vector $v\in\mathbb{Z}^{n}$ of $\mathcal{R}$ is defined such that, starting in state $x$, the state of the system will change to $x+v$ when the reaction $\mathcal{R}$ occurs (the entries of $v$ may be negative, since reactions can consume molecules). The waiting time $\tau_{\mathcal{R}}$ of the system before the reaction $\mathcal{R}$ occurs follows an exponential distribution with the rate parameter $a^{*}_{\mathcal{R}}(x)$ (propensity function), which in turn depends on both the state $x$ and the structure of $\mathcal{R}$. Specifically, the probability density function of $\tau_{\mathcal{R}}$ is given by

[TABLE]

In Table 1, we list the propensity functions of reactions which consume at most two molecules (see [5, 29] for further details). In particular, note that the propensity functions for the reactions in Table 1 are polynomial functions whose degrees are less than or equal to $2$.

In many reaction systems, different chemical reactions may have the same state change vector $v$ (see the second example in Remark 1). Assume that $N$ chemical reactions $\mathcal{R}_{1}$ , $\mathcal{R}_{2}$ , $\dots$ , $\mathcal{R}_{N}$ are involved in the evolution of the system and these $N$ reactions have in total $K$ different state change vectors $v_{1},v_{2},\dots,v_{K}$ , where $K\leq N$ . For each $v_{i}$ , $1\leq i\leq K$ , we introduce the terminology chemical channel $\mathcal{C}_{i}$ . We say the reaction $\mathcal{R}$ belongs to the channel $\mathcal{C}_{i}$ , or $\mathcal{C}_{i}$ contains the reaction $\mathcal{R}$ , if the state change vector of $\mathcal{R}$ is $v_{i}$ . For each $\mathcal{C}_{i}$ , we also define the index set

[TABLE]

and let $N_{i}$ be the number of chemical reactions belonging to $\mathcal{C}_{i}$ , i.e., $N_{i}=|\mathcal{I}_{i}|$ . Clearly, these index sets satisfy $\bigcup_{i=1}^{K}\mathcal{I}_{i}=\big{\{}1,2,\dots,N\big{\}}$ , $\mathcal{I}_{i}\bigcap\mathcal{I}_{i^{\prime}}=\emptyset$ , if $i\neq i^{\prime}$ , and therefore $\sum\limits_{i=1}^{K}N_{i}=N$ .

A reaction channel $\mathcal{C}_{i}$ is said to be activated when a certain reaction $\mathcal{R}$ belonging to $\mathcal{C}_{i}$ occurs. For each $1\leq i\leq K$ , $\tau_{i}=\min\limits_{j\in\mathcal{I}_{i}}\tau_{\mathcal{R}_{j}}$ is the waiting time at a state $x$ before the activation of the channel $\mathcal{C}_{i}$ , while $\tau=\min\limits_{1\leq j\leq N}\tau_{\mathcal{R}_{j}}$ is the waiting time before any of the chemical reactions in the system occurs. Assuming the chemical reactions are independent of each other and the waiting times $\tau_{\mathcal{R}_{j}}$ follow exponential distributions, we know that the waiting times $\tau_{i}$ and $\tau$ also follow exponential distributions, with the propensity functions

[TABLE]

respectively. In particular, let $\psi^{*}(t\,;\,x)$ be the probability density function of $\tau$ and $p^{*}(i\,;\,x)$ the probability that $\mathcal{C}_{i}$ is the first channel which becomes activated at state $x$ , then

[TABLE]
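The exponential form of $\tau_{i}$ and $\tau$ can be checked in one line. The following sketch uses only the independence of the reactions and the exponential waiting times assumed above; the symbol $\mathbb{P}$ for probability is notation introduced here and does not appear in the original text.

```latex
\begin{align*}
\mathbb{P}(\tau_{i}>t)
  &=\prod_{j\in\mathcal{I}_{i}}\mathbb{P}(\tau_{\mathcal{R}_{j}}>t)
   =\prod_{j\in\mathcal{I}_{i}}e^{-a^{*}_{\mathcal{R}_{j}}(x)\,t}
   =\exp\Big(-t\sum_{j\in\mathcal{I}_{i}}a^{*}_{\mathcal{R}_{j}}(x)\Big)\,,\\
\mathbb{P}(\tau>t)
  &=\prod_{i=1}^{K}\mathbb{P}(\tau_{i}>t)
   =\exp\Big(-t\sum_{i=1}^{K}\sum_{j\in\mathcal{I}_{i}}a^{*}_{\mathcal{R}_{j}}(x)\Big)\,.
\end{align*}
```

Hence $\tau_{i}$ and $\tau$ are exponentially distributed with rates equal to the sum of the reaction propensities over $\mathcal{I}_{i}$ and over all $N$ reactions, respectively, and the probability that $\mathcal{C}_{i}$ is activated first is the ratio of these two rates.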

We point out that the evolution equation of the dynamics described above (continuous-time Markov chains) can be expressed in a simple form. In fact, denoting by $X(t)\in\mathbb{N}^{n}$ the state of the system at time $t\geq 0$, from [1] we know that $X(t)$ satisfies the dynamical equation

[TABLE]

where $\mathcal{P}_{i}$ , $i=1,\dots,K$ , are independent unit Poisson processes.

Remark 1.

As concrete examples, let us consider two simple reaction networks.

1. Reactions $\ce{A + B ->[\kappa_{1}] 2B}$, $\ce{B ->[\kappa_{2}] A}$, with rate constants $\kappa_{1}$, $\kappa_{2}$. In this case, we have two different reactions ($N=2$) and two different reaction channels ($K=2$), with state change vectors $v_{1}=(-1,1)^{\top}$ and $v_{2}=(1,-1)^{\top}$, respectively. According to Table 1, the propensity functions of these two channels (assuming $V=1$) are $a_{1}^{*}(x)=\kappa_{1}\,x^{(1)}x^{(2)}$, $a_{2}^{*}(x)=\kappa_{2}\,x^{(2)}$.

2. Reactions $\ce{A + B ->[\kappa_{1}] B}$, $\ce{A ->[\kappa_{2}] \emptyset}$, with rate constants $\kappa_{1}$, $\kappa_{2}$. In this case, we have $N=2$, $K=1$, since the state change vector of both reactions is $v=(-1,0)^{\top}$. The propensity functions of the two reactions $\mathcal{R}_{1}$, $\mathcal{R}_{2}$ (assuming $V=1$) are $a_{\mathcal{R}_{1}}^{*}(x)=\kappa_{1}\,x^{(1)}x^{(2)}$ and $a_{\mathcal{R}_{2}}^{*}(x)=\kappa_{2}\,x^{(1)}$, while the propensity function of the single channel with state change vector $v$ is $a_{1}^{*}(x)=a_{\mathcal{R}_{1}}^{*}(x)+a_{\mathcal{R}_{2}}^{*}(x)=\kappa_{1}\,x^{(1)}x^{(2)}+\kappa_{2}\,x^{(1)}$.
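To make the forward model concrete, the following is a minimal sketch of how a trajectory of the first network in Remark 1 could be generated with the standard stochastic simulation (Gillespie direct) method: draw an exponential waiting time with the total propensity, then pick a channel with probability proportional to its propensity. The function name `simulate_ctmc`, the rate constants, and the initial state are illustrative assumptions and are not taken from the paper or its repository.

```python
import random

def simulate_ctmc(x0, channels, T, rng):
    """Simulate a trajectory on [0, T] with the Gillespie direct method.

    channels: list of (state_change_vector, propensity_function) pairs.
    Returns (states, waits, fired): states y_0..y_M, waiting times
    t_0..t_M (summing to T), and the fired channel indices i_0..i_{M-1}.
    """
    states, waits, fired = [tuple(x0)], [], []
    x, t = tuple(x0), 0.0
    while True:
        rates = [a(x) for _, a in channels]
        total = sum(rates)
        # No reaction can fire: the system stays in x until time T.
        if total <= 0.0:
            waits.append(T - t)
            break
        tau = rng.expovariate(total)
        # The next reaction would occur after T: record the final holding time.
        if t + tau >= T:
            waits.append(T - t)
            break
        # Choose channel i with probability a_i(x) / a(x).
        i = rng.choices(range(len(channels)), weights=rates)[0]
        v = channels[i][0]
        x = tuple(xi + vi for xi, vi in zip(x, v))
        t += tau
        states.append(x)
        waits.append(tau)
        fired.append(i)
    return states, waits, fired

# Network 1 of Remark 1 with V = 1; the kappa values are illustrative only.
kappa1, kappa2 = 0.01, 1.0
channels = [
    ((-1, 1), lambda x: kappa1 * x[0] * x[1]),  # A + B -> 2B
    ((1, -1), lambda x: kappa2 * x[1]),         # B -> A
]
states, waits, fired = simulate_ctmc((50, 50), channels, T=5.0, rng=random.Random(0))
```

Both reactions conserve the total number of molecules, so every visited state satisfies $x^{(1)}+x^{(2)}=100$ in this run, which is a quick sanity check on the output.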

3 Learning chemical reaction networks: inverse problem

In this section, we study the problem of learning chemical reaction networks from trajectory data. Depending on the information known about the chemical reaction networks, we consider two different learning tasks in Subsection 3.2 and Subsection 3.3, where the second task is the main focus of this paper. In both tasks, the propensity functions in (1) are determined by maximizing the log-likelihood function among the parameterized propensity functions which depend on both a set of basis functions and several parameters. To emphasize the dependence on parameters, let the parameterized propensity functions be denoted by $a_{i}(x\,;\,\bm{\omega})$ and $a(x\,;\,\bm{\omega})$ , respectively, where $x\in\mathbb{X}$ and $\bm{\omega}$ is the vector consisting of all parameters. Similar to (2), we define the probability (density) functions corresponding to $\bm{\omega}$

[TABLE]

In the first learning task (Subsection 3.2), we assume that the structure of the chemical reactions is known and the goal is to determine the reaction rate constant of each reaction, i.e., the constants $\kappa$ in Table 1. In this case, each basis function in the parameterized propensity functions corresponds to an actual chemical reaction that is indeed involved in the evolution of the system (no redundancy), while the task is to determine the value of each parameter (parameter estimation) by maximizing the log-likelihood function. This is indeed a standard problem and has been widely studied in the literature. We include it in this section due to its connections to the sparse learning task considered in Subsection 3.3.

In the second learning task (Subsection 3.3), on the other hand, we assume that the structure of the chemical reactions in the system is also unknown. In this case, candidate basis functions are chosen to parameterize the propensity functions, and $l^{1}$ sparsity regularization is used to remove the redundancy in the basis functions.

Before introducing the two learning tasks, we briefly discuss the trajectory of the system and derive the likelihood function of a given trajectory.

3.1 Space of trajectories and the likelihood function

Given $T>0$, there are two different ways to represent the trajectories of the system in the interval $[0,T]$. The first representation relies on the total number $M$ of reactions that occurred within $[0,T]$, the waiting time $\tau$ of each reaction, and the new state of the system after each of the $M$ reactions. Specifically, starting from a state $y_{0}\in\mathbb{X}$ at time $s=0$, each trajectory $X(s)$ in the time interval $[0,T]$ can be represented as a sequence

[TABLE]

which means that, starting from $y_{0}$ , the state of the system changes from $y_{l}$ to $y_{l+1}$ after waiting for a period of time of length $t_{l}$ , where $0\leq l<M$ . The final time $t_{M}$ in (5) is the amount of time that the system spends at the final state $y_{M}$ before time $s=T$ . Clearly, we have $\sum\limits_{l=0}^{M}t_{l}=T$ . In the second representation, the indices of the reaction channels are used instead of the new state after each reaction. That is, we represent the same trajectory $X(s)$ , $s\in[0,T]$ , as

[TABLE]

where, for each $0\leq l<M$, $i_{l}\in\{1,2,\dots,K\}$ denotes the index of the reaction channel activated by the $(l+1)$th reaction and $t_{l}>0$ is the waiting time before the $(l+1)$th reaction occurs. The two representations (5) and (6) can be converted from one to the other, using the relation $v_{i_{l}}=y_{l+1}-y_{l}$, which holds for $0\leq l<M$.
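The conversion from the state-based representation to the channel-based one amounts to differencing consecutive states and enumerating the distinct differences. The sketch below illustrates this under the assumption that states are stored as tuples of copy numbers; the function name `states_to_channels` and the 0-based channel numbering (by order of first appearance) are choices made here for illustration.

```python
def states_to_channels(states):
    """Convert the visited states y_0..y_M into channel indices.

    Returns (vectors, fired, jump_indices), where vectors[i] is the state
    change vector of channel i (0-based, in order of first appearance),
    fired[l] is the channel activated by the jump from y_l to y_{l+1},
    and jump_indices[i] lists the indices l with fired[l] == i.
    """
    vectors, fired = [], []
    for y, y_next in zip(states, states[1:]):
        v = tuple(b - a for a, b in zip(y, y_next))
        if v not in vectors:
            vectors.append(v)
        fired.append(vectors.index(v))
    jump_indices = [[l for l, i in enumerate(fired) if i == c]
                    for c in range(len(vectors))]
    return vectors, fired, jump_indices

# Hand-made sequence of states with two species.
states = [(3, 1), (2, 2), (1, 3), (2, 2), (1, 3)]
vectors, fired, jump_indices = states_to_channels(states)
```

The number of detected vectors plays the role of $K$, and the lengths of the lists in `jump_indices` are the activation counts of the individual channels, which add up to the number of jumps $M$. Channels that never fire within $[0,T]$ cannot be detected this way.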

In this work, we assume that a trajectory $X(s)$ of the system, represented either as described in (5) or (6), is available up to time $T$. In other words, we assume that both the change of the state and the length of the waiting time are known for each occurrence of the $M$ chemical reactions. From the trajectory data, we can deduce the total number of different reaction channels $K$, as well as the state change vector $v_{i}\in\mathbb{Z}^{n}$ for each channel $\mathcal{C}_{i}$, $1\leq i\leq K$. (Note, however, that when a certain channel $\mathcal{C}$ contains more than one reaction, from the data alone we will not be able to tell which reaction $\mathcal{R}$ belonging to $\mathcal{C}$ has actually occurred when $\mathcal{C}$ is activated.) For each $1\leq i\leq K$, we denote by

[TABLE]

the indices $l$ such that $i_{l}=i$ in (6), where $M_{i}\geq 0$ is the total number of times that the channel $\mathcal{C}_{i}$ has been activated within time $[0,T]$ , and therefore the relation

[TABLE]

is satisfied. For brevity, let us introduce the notation

[TABLE]

to describe the trajectory of the system within the time interval $[0,T]$ . The space consisting of all trajectories of the system on $[0,T]$ will be denoted by $\mathcal{D}_{T}$ . Note that, as a random variable, $\mathbf{X}$ contains both continuous and discrete components. Given a parameter vector $\bm{\omega}$ , we consider the chemical reaction system determined by the (parameterized) probability density functions $\psi$ , $p$ in (4), and define

[TABLE]

for the trajectory $\mathbf{X}$ in (9). Let $\mathbf{E}$ denote the mathematical expectation with respect to the trajectories of the system. Then, for any bounded measurable function $g\colon\mathcal{D}_{T}\rightarrow\mathbb{R}$ , we have

[TABLE]

from which we can view the function $\rho^{(T)}(\mathbf{X}\,|\,\bm{\omega})$ as the probability density (distribution) of $\mathbf{X}$ on the space $\mathcal{D}_{T}$ (we can indeed verify that $\mathbf{E}1=1$ ). To simplify the notation, we will formally write

[TABLE]

as the integration on the right-hand side of (11). Using (10) and (12), we can write down the likelihood function of the trajectory data as

[TABLE]

where

[TABLE]

can be considered as the likelihood function along the reaction channel $\mathcal{C}_{i}$ .

3.2 Learning task 1: determine rate constants by maximizing the log-likelihood

Assuming that the structure of the chemical reactions of the system is known, we now consider the problem of determining the reaction rate constant of each reaction. Note that the propensity function of each reaction $\mathcal{R}$ in Table 1 can be written as $\omega\varphi(x)$ , where $\varphi(x)$ is a polynomial of the system’s state whose specific form depends on the structure of $\mathcal{R}$ , and $\omega$ is the rate constant. Therefore, in the current learning task we assume that the propensity function of the $j$ th chemical reaction $\mathcal{R}_{j}$ in the system is given by

[TABLE]

where the nonnegative function $\varphi_{j}$ is known from the structure of $\mathcal{R}_{j}$ , and $\omega_{j}$ is the unknown rate constant which we want to determine from trajectory data.

Let $\bm{\omega}$ be the vector

[TABLE]

consisting of all the unknown rate constants, where $\omega_{j}\geq 0$ for all $1\leq j\leq N$ . For each channel $\mathcal{C}_{i}$ , $1\leq i\leq K$ , we also define the vector

[TABLE]

which consists of the rate constants of reactions belonging to $\mathcal{C}_{i}$ . Corresponding to (15), the parameterized propensity functions in (1) are

[TABLE]

while the optimal value of $\bm{\omega}$ is determined by maximizing the (logarithmic) likelihood functions in (13), or equivalently, by solving the minimization problem

[TABLE]

With the trajectory data as defined in (5) and using the propensity functions in (17), the objective function above can be computed explicitly and we have

[TABLE]

In the above, we recall that the indices $l^{(i)}_{k}$ are defined in (7), and we note that the logarithmic likelihood function

[TABLE]

only depends on $\bm{\omega}^{(i)}$ and should be compared to (14). Note that the expressions above also imply that the minimization problem (18) can be decomposed into $K$ minimization problems

[TABLE]

which can be solved separately.

For each index $j$ , $1\leq j\leq N$ , such that $j\in\mathcal{I}_{i}$ for some $1\leq i\leq K$ , the corresponding Euler–Lagrange equation of (18) is

[TABLE]

Differentiating one more time, we get the Hessian matrix of the objective function in (18)

[TABLE]

where $1\leq j,j^{\prime}\leq N$ .

In order to study the optimization problem (18)–(19), let us introduce the matrix

[TABLE]

for each $1\leq i\leq K$ , where we have assumed that the index set $\mathcal{I}_{i}=\big{\{}j_{1},j_{2},\dots,j_{N_{i}}\big{\}}$ . We define $\Phi_{i,k}\in\mathbb{R}^{M_{i}}$ to be the $k$ th column vector of $\Phi_{i}$ for $1\leq k\leq N_{i}$ and thus obtain the following result concerning the solution of the optimization problem (18)–(19).

Proposition 1.

The following three conditions are equivalent.

1. For each $1\leq i\leq K$, the vectors $\Phi_{i,1},\Phi_{i,2},\dots,\Phi_{i,N_{i}}$ are linearly independent.

2. The function $-\ln\mathcal{L}^{(T)}(\bm{\omega})$ in (19) is strictly convex.

3. The optimization problem (18)–(19) has a unique solution.

Proof.

(2) $\Rightarrow$ (3) is obvious. To show that (1) implies (2), it is sufficient to verify that the Hessian matrix of $-\ln\mathcal{L}^{(T)}$ is positive definite. Using (22), for any vector $\bm{\eta}=(\eta_{1},\eta_{2},\dots,\eta_{N})^{\top}\in\mathbb{R}^{N}$ , we have

[TABLE]

Since the columns of $\Phi_{i}$ are linearly independent for each $i$ , we conclude that (24) is zero if and only if $\bm{\eta}$ is a zero vector. This implies that $-\ln\mathcal{L}^{(T)}$ is strictly convex.

Finally, let us prove that (3) implies (1) by contradiction. Define $\bm{\omega}$ to be the unique solution of the optimization problem (18). Assume that there is $i$ , $1\leq i\leq K$ , such that the vectors $\Phi_{i,1},\Phi_{i,2},\dots,\Phi_{i,N_{i}}$ are linearly dependent. As a result, we can find a vector $\widetilde{\bm{\omega}}=(\widetilde{\omega}_{1},\widetilde{\omega}_{2},\dots,\widetilde{\omega}_{N})^{\top}\neq\bm{\omega}$ , such that

[TABLE]

Since $\bm{\omega}$ satisfies (21), the property (25) implies that $\widetilde{\bm{\omega}}$ satisfies (21) as well. Multiplying by $\omega_{j}$ (or $\widetilde{\omega}_{j}$ ) on both sides of (21) and summing up the indices, we get

[TABLE]

Combining (25), (26), as well as the expressions in (19), we obtain $-\ln\mathcal{L}^{(T)}(\bm{\omega})=-\ln\mathcal{L}^{(T)}(\widetilde{\bm{\omega}})$ , which contradicts the uniqueness of $\bm{\omega}$ . ∎
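Condition (1) of Proposition 1 can be checked numerically by computing the rank of each matrix $\Phi_{i}$. The sketch below does so with exact rational arithmetic for the second network of Remark 1, where neither reaction changes the copy number of B: along any trajectory $x^{(2)}$ is constant, so the two columns of $\Phi_{1}$ are proportional and the two rate constants cannot be determined uniquely. The helper names `rank` and `design_matrix` and the listed states are hypothetical and chosen for illustration only.

```python
from fractions import Fraction

def rank(matrix):
    """Rank of a matrix (given as a list of rows) by exact Gaussian elimination."""
    rows = [[Fraction(v) for v in row] for row in matrix]
    r = 0
    n_cols = len(rows[0]) if rows else 0
    for c in range(n_cols):
        pivot = next((i for i in range(r, len(rows)) if rows[i][c] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        # Eliminate column c from all rows below the pivot row.
        for i in range(r + 1, len(rows)):
            factor = rows[i][c] / rows[r][c]
            rows[i] = [a - factor * b for a, b in zip(rows[i], rows[r])]
        r += 1
    return r

def design_matrix(basis, states_at_jumps):
    """Phi_i: one row per activation of the channel, one column per basis function."""
    return [[phi(y) for phi in basis] for y in states_at_jumps]

basis = [lambda x: x[0] * x[1], lambda x: x[0]]  # A + B -> B and A -> 0
# Both reactions leave B unchanged, so x[1] is constant along the trajectory.
stuck = design_matrix(basis, [(5, 3), (4, 3), (3, 3)])
# Hypothetical data in which B also varies (e.g. through an additional channel).
varied = design_matrix(basis, [(5, 3), (4, 2), (3, 3)])
identifiable = [rank(m) == len(basis) for m in (stuck, varied)]
```

In the first case the columns are linearly dependent, so by Proposition 1 the minimizer of (18) is not unique; in the second, hypothetical case the columns are independent and the minimizer is unique.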

To distinguish the parameters obtained from solving the optimization problem (18) from the true parameters of the system, we will define $\bm{\omega}^{(T)}$ to be the minimizer of $-\ln\mathcal{L}^{(T)}$ for fixed time $T>0$ in what follows, and $\bm{\omega}^{*}$ to be the vector consisting of the true parameters such that (15) holds. In particular, when $N_{i}=1$ and $\mathcal{I}_{i}=\{j\}$, i.e., the channel $\mathcal{C}_{i}$ only contains one reaction $\mathcal{R}_{j}$, the Euler–Lagrange equation (21) can be solved analytically and we have

[TABLE]
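For a channel with a single reaction, the estimator can be evaluated directly from the data. The sketch below assumes the standard form of the log-likelihood of a continuous-time Markov chain (a sum of log-propensities at the jumps minus the time-integrated propensity), for which setting the derivative with respect to $\omega_{j}$ to zero gives the number of activations $M_{i}$ divided by $\sum_{l=0}^{M}t_{l}\,\varphi_{j}(y_{l})$. The function name `mle_single_reaction` and the trajectory are made up for illustration.

```python
def mle_single_reaction(states, waits, jump_indices, phi):
    """Closed-form rate estimate for a channel containing one reaction.

    states, waits: y_0..y_M and t_0..t_M of the trajectory.
    jump_indices: indices l at which this channel fired (M_i entries).
    phi: the basis function of the reaction.
    """
    # Time-integrated basis function along the whole trajectory.
    exposure = sum(t * phi(y) for y, t in zip(states, waits))
    return len(jump_indices) / exposure

# Hand-made trajectory for B -> A with phi(x) equal to the copy number of B.
states = [(0, 4), (1, 3), (2, 2)]
waits = [0.25, 0.5, 0.25]
omega_hat = mle_single_reaction(states, waits, [0, 1], lambda x: x[1])
```

Here the integrated basis function is $0.25\cdot 4+0.5\cdot 3+0.25\cdot 2=3$ and the channel fired twice, so the estimate is $2/3$.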

3.3 Learning task 2: determine the rate constants and the structure of chemical reactions using sparsity

In this subsection, we study the problem of learning the propensity functions of the chemical reaction networks from trajectory data when neither the structure of the chemical reactions nor their rate constants are known.

First of all, we can figure out the total number $K$ of the reaction channels from the trajectory data, as discussed in Subsection 3.1. Now suppose that we are given $N$ candidate basis functions

[TABLE]

together with $K$ index sets $\mathcal{I}_{i}=\{j_{1},j_{2},\dots,j_{N_{i}}\}$ , $1\leq i\leq K$ , such that $N_{i}=|\mathcal{I}_{i}|$ ,

[TABLE]

Accordingly, we introduce the vectors

[TABLE]

For each channel $\mathcal{C}_{i}$ , the propensity function $a_{i}^{*}$ in (1) will be approximated using the basis functions $\varphi_{j}$ , $j\in\mathcal{I}_{i}$ , and the coefficients in $\bm{\omega}^{(i)}$ . More precisely, we define

[TABLE]

where $\epsilon>0$ , and the function

[TABLE]

is introduced (see Figure 1), in order to guarantee the non-negativity of $a^{(\epsilon)}_{i}$ for all vectors $\bm{\omega}\in\mathbb{R}^{N}$ . Corresponding to (31), the total propensity function is given by

[TABLE]

Since the propensity functions of reactions in many applications typically have a simple form (Table 1), there is likely redundancy in the basis functions and therefore we can assume that the unknown vector $\bm{\omega}$ only has a few nonzero entries (and is thus sparse). With this observation in mind, we propose to determine $\bm{\omega}$ by maximizing the (logarithmic) likelihood function under the sparsity assumption, or, equivalently, by solving the nonlinear sparse minimization problem

[TABLE]

where $\mathcal{L}^{(T,\epsilon)}(\bm{\omega})$ is the likelihood function (13) with the propensity functions $a_{i}=a_{i}^{(\epsilon)},a=a^{(\epsilon)}$ in (31) and (33). Explicitly, we have

[TABLE]

If we quantify the sparsity of $\bm{\omega}$ using the $l^{1}$ norm (denoted by $\|\cdot\|_{1}$ ), then (34) results in

[TABLE]

In (36), the log-likelihood function is rescaled by $1/T$ (this scaling is suggested by the analysis in Section 5), and the constant $\lambda=\lambda(T)>0$ , which measures the strength of the sparsity regularization, can be chosen depending on $T$ .

Similar to the problem (18) in the previous subsection, the minimizer of (36) can be computed by solving $K$ sparse minimization problems

[TABLE]

separately, where

[TABLE]

In practice, we find that (36), or equivalently (37), can be efficiently solved by FISTA proposed in [7], especially when preconditioning is applied (see Remark 4 below and examples in Section 4). The main algorithmic steps of FISTA are provided in Algorithm 1 in Appendix A.
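For orientation, the following is a minimal sketch of the textbook constant-step FISTA iteration of [7] for a problem of the form $f(\bm{\omega})+\lambda\|\bm{\omega}\|_{1}$ with smooth convex $f$. It is not the paper's Algorithm 1, which may differ, for example in how the step-size is chosen. The names `fista` and `soft_threshold`, and the toy quadratic standing in for the rescaled negative log-likelihood, are assumptions made here.

```python
import math

def soft_threshold(z, kappa):
    """Proximal map of kappa * |.| applied to a scalar z."""
    return math.copysign(max(abs(z) - kappa, 0.0), z)

def fista(grad, w0, lam, step, n_iter=200):
    """Minimize f(w) + lam * ||w||_1 with a fixed step size.

    grad: gradient of the smooth convex part f; step should not exceed
    1 / L, where L is a Lipschitz constant of grad.
    """
    w_prev = list(w0)
    y = list(w0)
    t = 1.0
    for _ in range(n_iter):
        g = grad(y)
        # Proximal gradient step taken from the extrapolated point y.
        w = [soft_threshold(yj - step * gj, step * lam) for yj, gj in zip(y, g)]
        t_next = 0.5 * (1.0 + math.sqrt(1.0 + 4.0 * t * t))
        # Momentum (extrapolation) step.
        y = [wj + (t - 1.0) / t_next * (wj - pj) for wj, pj in zip(w, w_prev)]
        w_prev, t = w, t_next
    return w_prev

# Toy smooth part f(w) = 0.5 * ||w - a||^2, with gradient w - a and L = 1.
a = [3.0, 0.5, -2.0]
w_hat = fista(lambda w: [wj - aj for wj, aj in zip(w, a)], [0.0] * 3, lam=1.0, step=1.0)
```

For this toy problem the minimizer is the soft-thresholded vector $(2,0,-1)^{\top}$, and the middle coefficient is set exactly to zero, which is the mechanism by which redundant basis functions are removed. Rescaling the basis functions changes the Lipschitz constant of the gradient and hence the admissible step size.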

We obtain the following result concerning the minimization problems (36) and (37).

Proposition 2.

Suppose $\epsilon,\lambda>0$ . The objective functions of the optimization problems (36) and (37) are strictly convex.

Proof.

It is sufficient to consider the objective function in (37). By straightforward calculations (for instance, see (77) and (78) in Appendix B), we can verify that both $-\ln G_{\epsilon}$ and $G_{\epsilon}$ are strictly convex functions. Therefore, the function $-\ln\mathcal{L}^{(T,\epsilon)}_{i}$ in (38) is strictly convex. Since the norm $\|\cdot\|_{1}$ is convex as well, we conclude that the objective function in (37) is strictly convex. ∎

Let $\bm{\omega}^{(T,\epsilon,\lambda)}$ denote the unique minimizer of the problem (36). Similar to the Euler–Lagrange equation (21), in the current case $\bm{\omega}^{(T,\epsilon,\lambda)}$ satisfies the inclusion relation [4, 13]

[TABLE]

where

[TABLE]

for $j\in\mathcal{I}_{i}$ , and $\partial|\omega_{j}|$ is the subdifferential of the absolute value function $|\omega_{j}|$ , defined by

[TABLE]

Finally, let $\mathcal{M}^{(T,\epsilon)}$ be the vector in $\mathbb{R}^{N}$ whose components are defined in (40) and define the set $\partial|\bm{\omega}|=\big{\{}\bm{v}\in\mathbb{R}^{N}~{}\big{|}~{}\bm{v}=(v_{1},v_{2},\dots,v_{N})^{\top}\,,~{}v_{j}\in\partial|\omega_{j}|\,,~{}1\leq j\leq N\big{\}}$ . We can express the condition (39) in vector form as

[TABLE]

The characterization above of the minimizers will be used in the analysis in Section 5.

We conclude this section with the following remarks.

Remark 2 (Role of the function $G_{\epsilon}$ ).

In principle, we would like to allow both the basis functions $\varphi_{j}$ and the unknown coefficients $\omega_{j}$ to be either positive or negative. By introducing the function $G_{\epsilon}$ in (35), we avoid imposing many inequality constraints which would be otherwise needed in order to guarantee that the log-likelihood function in (35) is well-defined. The properties of $G_{\epsilon}$ in (32) are discussed in Appendix B. In particular, we have $\lim\limits_{\epsilon\rightarrow 0+}G_{\epsilon}(x)=\max(x,0)$ , uniformly $\forall~{}x\in\mathbb{R}$ . For this reason, we define $G_{0}(x)=\max(x,0)$ .
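The definition (32) of $G_{\epsilon}$ is not reproduced in this excerpt. As an illustration only, the sketch below uses a softplus-type smoothing $\epsilon\ln(1+e^{x/\epsilon})$, which is one positive function that converges to $\max(x,0)$ uniformly as $\epsilon\rightarrow 0+$; whether it coincides with the paper's $G_{\epsilon}$ is an assumption. The name `g_eps` is hypothetical.

```python
import math

def g_eps(x, eps):
    """Softplus smoothing eps * ln(1 + exp(x / eps)) of max(x, 0).

    Written as max(x, 0) + eps * log1p(exp(-|x| / eps)) to avoid overflow.
    """
    return max(x, 0.0) + eps * math.log1p(math.exp(-abs(x) / eps))

eps = 0.1
grid = [k / 10.0 for k in range(-50, 51)]
# Largest deviation from max(x, 0) on the grid; it is attained at x = 0.
gap = max(g_eps(x, eps) - max(x, 0.0) for x in grid)
```

For this choice the deviation from $\max(x,0)$ is at most $\epsilon\ln 2$, attained at $x=0$, which makes the uniform convergence explicit.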

Remark 3 (Choice of basis functions).

1. In the sparse minimization problem (36), the vector $\bm{\omega}$ contains all $N$ coefficients $\omega_{j}$, and the corresponding $N$ basis functions $\varphi_{j}$ in (28) are involved. This formulation simplifies the notation and is convenient for analysis, particularly in Section 5. Numerically, on the other hand, the coefficient vectors $\bm{\omega}^{(i)}$ in (30) can be computed separately by solving the minimization problems (37), $1\leq i\leq K$, with the same set of basis functions $\phi_{1},\,\phi_{2},\dots,\phi_{L}$, $L>0$, for all $K$ channels. In this case, corresponding to the formulation adopted at the beginning of this subsection where all $N$ coefficients are put together, we define the index sets $\mathcal{I}_{i}=\big{\{}(i-1)L+1\,,~{}(i-1)L+2\,,\,\dots\,,~{}iL\big{\}}$, $1\leq i\leq K$, and for each $j\in\mathcal{I}_{i}$, we define the function

[TABLE]

Accordingly, we have $\bm{\omega}^{(i)}=\big{(}\omega_{(i-1)L+1},\,\omega_{(i-1)L+2},\,\dots,\omega_{(i-1)L+L}\big{)}^{\top}$, and the propensity function in (31) can be written more transparently as

[TABLE]

2. While we are mainly interested in chemical reaction systems, the same learning approach can be applied to other types of continuous-time Markov chains whose jump distributions are state-dependent. In particular, for chemical reaction systems that obey the law of mass action, according to Table 1 we may choose $\varphi_{j}$ from the polynomials

[TABLE]

where $x^{(k)}$ denotes the $k$th component of the state $x=(x^{(1)},x^{(2)},\dots,x^{(n)})^{\top}$, based on knowledge of the chemical reactions that are possibly involved in the system.

Remark 4 (Preconditioning).

In concrete applications, due to the complexity of the trajectory data, different basis functions may take values of different orders of magnitude. As a result, the objective functions in (37), or equivalently in (36), may scale very differently along different components $\omega_{j}$. This leads to numerical difficulties in solving (37), since a small step-size has to be used as a result of the strong dependence of the objective function on changes of $\bm{\omega}$ along certain directions (i.e., large gradient, ill-conditioning). A simple way to alleviate this numerical issue is to precondition the problems (37) by rescaling the basis functions. Specifically, let $c_{j}>0$, $1\leq j\leq N$, denote the rescaling constants. Instead of (37), we can compute the minimizer $\overline{\bm{\omega}}^{(i)}$ of the rescaled sparse minimization problem

[TABLE]

where the vector $\overline{\bm{\omega}}^{(i)}$ consists of $\overline{\omega}_{j}$ , $j\in\mathcal{I}_{i}$ . Then it is easy to verify that the minimizer $\bm{\omega}^{(i)}$ of (37) can be recovered from $\omega_{j}=\frac{\overline{\omega}_{j}}{c_{j}}$ , for $j\in\mathcal{I}_{i}$ . By properly choosing the constants $c_{j}$ based on analyzing the trajectory data, we can expect that minimizing (44) will be easier compared to (37). Readers are referred to Section 4 for further discussions on this issue and concrete examples.
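The bookkeeping behind the rescaling can be sketched as follows. The sketch only verifies that replacing $\varphi_{j}$ by $\varphi_{j}/c_{j}$ and $\omega_{j}$ by $\overline{\omega}_{j}=c_{j}\omega_{j}$ leaves the linear combination of basis functions unchanged, and that $\omega_{j}=\overline{\omega}_{j}/c_{j}$ recovers the original coefficients. The sample states, the coefficient values, and the choice of $c_{j}$ as column-wise maxima are hypothetical and do not reproduce (44) or the constants used in Section 4.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical basis values at 5 visited states; the columns 1, x, x^2
# differ by several orders of magnitude.
x = rng.integers(1, 1000, size=5).astype(float)
Phi = np.column_stack([np.ones_like(x), x, x**2])
omega = np.array([0.5, 2.0, 1e-3])   # hypothetical coefficients

c = Phi.max(axis=0)        # one simple choice of rescaling constants c_j
Phi_bar = Phi / c          # rescaled basis functions phi_j / c_j
omega_bar = c * omega      # coefficients of the rescaled problem

# The linear combination entering G_eps is unchanged by the rescaling ...
same_model = np.allclose(Phi @ omega, Phi_bar @ omega_bar)
# ... and the original coefficients are recovered as omega_bar_j / c_j.
recovered = omega_bar / c
```

After rescaling, every column of `Phi_bar` is bounded by one, so the objective no longer changes much faster along some coordinate directions than along others.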

Remark 5 (Possible extensions).

Below we discuss several possible generalizations.

1. So far, we have assumed that the evolution of the system is fully observed. In concrete applications, it is sometimes assumed that a small subset of the species suffices to describe the system’s dynamics [15]. Correspondingly, it may happen that the trajectory data is only observed for these “important” species. In this case, one can still apply the second learning approach in this subsection, and the outcome of the optimization problem (36) will be an effective dynamics for the selected “important” species. However, since the effective reactions among these species no longer necessarily obey the law of mass action, it may be important to include other types of basis functions (e.g., rational functions for Michaelis–Menten type kinetics [39]) together with the polynomial basis in (43) in order to obtain a good approximation of the effective dynamics.

2. It is straightforward to generalize the analysis to the case where multiple trajectories of the system are available. We refer the readers to the numerical examples in Section 4 for details.

3. In this work, in particular in Section 5, we are mainly interested in the theoretical justification of the two learning approaches in the infinite-data limit, i.e., $T\rightarrow+\infty$. The numerical examples in Section 4 also mainly serve this purpose. Regarding the choice of the sparsity parameter $\lambda$, one can expect that a large $\lambda$ will increase the sparsity of the solution, but at the same time will also introduce bias in the prediction. Therefore, in the numerical experiments in Section 4, we choose $\lambda$ empirically such that the sparsity and accuracy of the solution are balanced. In practice, instead of using a single fixed $\lambda>0$ in (36) for the coefficients of all basis functions, it is helpful to consider different values of $\lambda$ for different coefficients and to tune the parameter(s) $\lambda$ carefully by cross-validation [20, 26]. See [8] for more details.

4 Examples

In this section, we study the learning tasks discussed in Section 3 with three concrete numerical examples.

4.1 Example 1

In the first example, we study the chemical reaction system given by Table 2, where two different species $A,B$ are involved in $4$ chemical reactions. The propensity functions of these $4$ reactions depend on both the state $x=(x^{(1)},x^{(2)})^{\top}$ of the system, i.e., the copy-numbers of the species $A$ and $B$ , and the rate constants $\kappa_{i}$ , $i=1,2,3,4$ .

To study the two learning tasks discussed in Section 3, we fix the parameters

[TABLE]

and $Q=100$ trajectories of the system are generated using the stochastic simulation algorithm (SSA) [21, 22, 23]. Each trajectory starts from the same initial state $x=(20,10)^{\top}$ at time $t=0$ and is simulated until time $T=10$ ( $5$ of the $100$ trajectories are shown in Figure 2 for illustration). From Table 2, it is clear that different reactions belong to different reaction channels and therefore there are in total $4$ reaction channels in the reaction network. For the quantities introduced in Section 2, we obtain $N_{i}=1$ and $K=N=4$ . After processing the trajectory data, we find that the activation numbers of the $4$ reaction channels within these $100$ trajectories are $2296$ , $1778$ , $2777$ , and $2135$ , respectively, as shown in Table 3.
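For readers unfamiliar with SSA, the following is a minimal sketch of Gillespie's direct method that records the jump times, the visited states, and the index of the activated channel at each jump. The three-reaction network used here is hypothetical and is not the system of Table 2; the function name `ssa`, the rate constants, and the random seed are likewise assumptions made for illustration.

```python
import numpy as np

def ssa(x0, stoich, propensities, T, rng):
    """Gillespie's direct method; returns jump times, states, and channel indices."""
    t, x = 0.0, np.array(x0, dtype=int)
    times, states, channels = [t], [x.copy()], []
    while True:
        a = propensities(x)
        a0 = a.sum()
        if a0 <= 0.0:
            break                           # no reaction can fire any more
        t += rng.exponential(1.0 / a0)      # waiting time ~ Exp(a0)
        if t > T:
            break
        i = rng.choice(len(a), p=a / a0)    # channel i fires with probability a_i / a0
        x = x + stoich[i]
        times.append(t)
        states.append(x.copy())
        channels.append(i)
    return np.array(times), np.array(states), np.array(channels, dtype=int)

# Hypothetical network (NOT Table 2): 0 -> A, A -> B, B -> 0.
stoich = np.array([[1, 0], [-1, 1], [0, -1]])
kappa = np.array([5.0, 1.0, 0.5])
propensities = lambda x: kappa * np.array([1.0, x[0], x[1]])

rng = np.random.default_rng(0)
times, states, channels = ssa([20, 10], stoich, propensities, T=10.0, rng=rng)
counts = np.bincount(channels, minlength=len(stoich))  # activations per channel
```

Counting the entries of `channels` per index gives activation numbers of the kind reported in Table 3; for several trajectories the counts are simply summed.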

With the prepared trajectory data, let us first consider the problem of learning the rate constants $\kappa_{i}$ , $1\leq i\leq 4$ , assuming that the types of these $4$ reactions are known. For this purpose, we consider the negative log-likelihood function

[TABLE]

which is similar to (19), except that in (46) we have taken all $100$ trajectories into account. Specifically, $q$ in (46) denotes the index of the trajectory, while the quantities $M^{(q)}$, $M^{(q)}_{i}$, $i^{(q)}_{l}$, $y^{(q)}_{l}$, $t^{(q)}_{l}$, and $l^{(q,i)}_{k}$ have the same meaning (for the $q$th trajectory) as the corresponding quantities $M$, $M_{i}$, $i_{l}$, $y_{l}$, $t_{l}$, and $l^{(i)}_{k}$ in (19), respectively. Following the setting in Subsection 3.2, in this example we have the parameter set $\bm{\omega}=(\kappa_{1},\kappa_{2},\kappa_{3},\kappa_{4})^{\top}$, the index set $\mathcal{I}_{i}=\{i\}$, $1\leq i\leq 4$, as well as the functions given by

[TABLE]

Since each reaction channel only contains one single reaction, the minimizer of the objective function (46) can be computed explicitly using an expression similar to (27), and we get $\bm{\omega}^{(T)}=(0.98,~{}0.10,~{}0.97,~{}0.91)^{\top}$ , which is indeed close to the true parameters (see Table 4).
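To illustrate why an explicit formula is available in this case, consider a channel with propensity $\kappa\,\varphi(x)$. The standard maximum likelihood estimate for a continuous-time Markov chain is then the number of activations divided by the time integral of $\varphi$ along the trajectory. The sketch below assumes that this is the kind of expression meant by (27), which is not reproduced here; the function name `estimate_rate` and the hand-made trajectory are hypothetical.

```python
import numpy as np

def estimate_rate(times, states, channels, T, channel, phi):
    """Closed-form estimate for a(x) = kappa * phi(x): count / integral of phi."""
    times = np.asarray(times, dtype=float)
    # Holding time in each visited state; the last state is held until T.
    holding = np.diff(np.append(times, T))
    integral = sum(phi(x) * dt for x, dt in zip(states, holding))
    count = int(np.sum(np.asarray(channels) == channel))
    return count / integral

# Hand-made piecewise-constant path of one species:
# state 2 on [0, 1), state 1 on [1, 3), state 0 on [3, 4].
times = [0.0, 1.0, 3.0]
states = [2, 1, 0]
channels = [0, 0]   # both jumps are activations of the decay channel x -> x - 1
kappa_hat = estimate_rate(times, states, channels, T=4.0, channel=0, phi=lambda x: x)
```

Here the integral of $\varphi(x)=x$ equals $2\cdot 1+1\cdot 2+0\cdot 1=4$ and the channel fires twice, so `kappa_hat` is $0.5$. With several trajectories, the counts and the integrals are summed over trajectories before taking the ratio.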

Let us now study the second learning task in Subsection 3.3 with the same trajectory data, where we assume that the structure of the chemical reactions involved in the system is unknown as well. Notice that, by analyzing the trajectory data, in this case we can still figure out that there are in total $2$ species and $4$ different reaction channels in the reaction network (see Table 3). In order to determine the propensity function of each reaction channel, based on Table 1 and the discussions in Remark 3, we choose polynomials of degree at most $2$ for $x=(x^{(1)},x^{(2)})^{\top}$ , i.e.,

[TABLE]

as basis functions. The propensity functions of the reaction channels are approximated by

[TABLE]

where $G_{\epsilon}$ is defined in (32) and we set $\epsilon=0.1$ . In (48), the function $a_{i}^{(\epsilon)}$ depends on the $6$ parameters $\bm{\omega}^{(i)}=\big{(}\omega_{6(i-1)+1}$ , $\omega_{6(i-1)+2}$ , $\dots$ , $\omega_{6(i-1)+6}\big{)}^{\top}$ , and the same set of basis functions in (47) is used for each of the $4$ channels.
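The count of $6$ parameters per channel follows from the number of monomials of degree at most $2$ in $2$ variables, which is $\binom{2+2}{2}=6$; for $4$ species the same count gives $\binom{4+2}{2}=15$. The sketch below enumerates such monomials. The helper names and the ordering of the monomials are assumptions and need not coincide with the ordering in (47).

```python
from itertools import combinations_with_replacement
from math import comb, prod

def monomial_exponents(n_species, max_degree=2):
    """Index tuples of all monomials in n_species variables with degree <= max_degree."""
    basis = []
    for degree in range(max_degree + 1):
        basis.extend(combinations_with_replacement(range(n_species), degree))
    return basis

def evaluate_basis(x, basis):
    # The empty tuple corresponds to the constant function 1 (empty product).
    return [prod(x[k] for k in idx) for idx in basis]

basis2 = monomial_exponents(2)            # 1, x1, x2, x1^2, x1*x2, x2^2
values = evaluate_basis((20, 10), basis2)
n_basis = {n: comb(n + 2, 2) for n in (2, 4)}   # expected basis sizes
```

At the state $(20,10)^{\top}$ the six monomials take the values $1, 20, 10, 400, 200, 100$, which already hints at the scale differences addressed by the preconditioning in Remark 4.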

To determine the value of $\bm{\omega}=(\omega_{1},\omega_{2},\dots,\omega_{24})^{\top}$ , which consists of all the unknown parameters, we follow the discussions in Remark 3 of Subsection 3.3 and solve the sparse minimization problems

[TABLE]

for each channel $\mathcal{C}_{i}$ separately, by applying Algorithm 1 in Appendix A. We consider the values $\lambda=0.2,~{}0.1,~{}0.01$, chosen empirically such that the sparsity and accuracy of the solution are balanced. In each iteration step, evaluating the objective function in (49) as well as its derivative requires traversing every reaction along the $100$ trajectories. This part of the calculation is performed in parallel using MPI in our code. The iteration procedure continues until the relative difference between the minimal and the maximal values of the objective function in the last $20$ iteration steps is smaller than $5\cdot 10^{-8}$. In this example, we run the code using $20$ processors in parallel and it takes only a few seconds to meet the convergence criterion.
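The stopping rule described above can be sketched as a small helper. The precise normalization of the "relative difference" is not specified in the text; dividing by the largest absolute objective value in the window is an assumption, as are the function name `has_converged` and the synthetic objective histories.

```python
def has_converged(history, window=20, tol=5e-8):
    """True if the objective varied by less than tol (relatively) over the last `window` steps."""
    if len(history) < window:
        return False
    recent = history[-window:]
    lo, hi = min(recent), max(recent)
    # Assumed normalization; the tiny floor guards against division by zero.
    scale = max(abs(lo), abs(hi), 1e-300)
    return (hi - lo) / scale < tol

# Synthetic objective values decaying geometrically towards 1.
short_run = [1.0 + 0.5**k for k in range(30)]
long_run = [1.0 + 0.5**k for k in range(60)]
```

For these synthetic histories the criterion is not yet met after $30$ steps but is met after $60$ steps, since the variation over the last $20$ values has dropped to roughly $10^{-12}$.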

The final results are summarized in Table 5. To make a comparison with the true parameters in (45), we notice that, with the basis functions in (47), the true propensity functions of the $4$ reaction channels in the system (see Table 2 and Table 4) can be expressed as

[TABLE]

where $G_{0}(x)=\max(x,0)$ . From the expressions above, we see that the propensity functions in (48), with the estimated parameters in Table 5 (for $\lambda=0.1$ or $0.01$ ), indeed provide reasonable approximations of the true propensity functions in (50). Comparing the results for different $\lambda$ , we can observe that while the solution is sparser for $\lambda=0.2$ (e.g., coefficients corresponding to the basis $\phi_{1}\equiv 1$ in Table 5), the approximation of the true coefficients is better when $\lambda$ is smaller (i.e., $\lambda=0.01$ , underlined coefficients in Table 5).

4.2 Example 2: predator-prey system

In the second example, we consider the predator-prey type reaction system in Table 6, which has two different species and $5$ chemical reactions [55]. The system models the birth and death of two different species and is widely used as a building block of more complicated chemical or biological systems. In contrast to the previous example, where different reactions have different state change vectors, in the current case both the reaction $A\xrightarrow{\kappa_{2}}\emptyset$ and the reaction $A+B\xrightarrow{\kappa_{5}}B$ have the same state change vector $v=(-1,0)^{\top}$.

In the first step, we generate the trajectory data of the system with the parameters

[TABLE]

Starting from the state $x=(25,15)^{\top}$ at time $t=0$ , $Q=100$ trajectories are simulated using SSA until the final time $T=10$ , and $5$ of these $100$ trajectories are shown in Figure 3. After analyzing the trajectory data, we can identify the $4$ different reaction channels in the system as well as the numbers of occurrences of activations for each channel within the $100$ trajectories (see Table 7).

With the prepared trajectory data, we study the estimation of the parameters $\kappa_{i}$, $1\leq i\leq 5$, assuming that the structure of the $5$ reactions in Table 6 is known (learning task 1). In the same way as in the previous example, we consider the minimization of the same negative log-likelihood function (46). The parameters $\kappa_{1},\kappa_{3},\kappa_{4}$ can be computed explicitly from an expression similar to (27), since each of the corresponding reaction channels contains only a single reaction, while the parameters $\kappa_{2},\kappa_{5}$, both of which are involved in the same channel $v^{\top}=(-1,0)$, can be found using a standard gradient descent method [43]. In the latter case, we choose the time step-size $\Delta t=10^{-3}$ and the initial values are set to $1.0$. In both cases, it only takes several seconds to run the code and the estimated parameters $\kappa_{i}$ are indeed very close to the true parameters (see Table 8).

Next, we study the second learning task described in Subsection 3.3, where our aim is to learn the propensity functions of the $4$ identified reaction channels without knowing the structure of the chemical reactions. The propensity functions are approximated in the same way as in (48), with the same set of basis functions in (47) and $\epsilon=0.1$ . For each channel $\mathcal{C}_{i}$ and each $\lambda=0.2,\,0.1,\,0.01$ , the sparse minimization problem (49) is solved separately by “FISTA with backtracking” (Algorithm 1 in Appendix A), using the same number of processors (i.e., $20$ ) and the same convergence criterion as in the previous example.

However, as shown in Figure 3, the trajectory data in the current example exhibits further complexities, as the copy-number $x^{(1)}$ of the species $A$ in the system varies significantly (from $25$ to nearly $10^{4}$) within the time interval $[0,10]$, unlike the trajectory data in the previous example, where the copy-numbers of both species stay below $30$ (Figure 2). As a result, in Table 9 we see that the different basis functions in (47) are of vastly different orders of magnitude when they are evaluated at the states contained in the $100$ trajectories. At the same time, in the numerical experiment we find that direct minimization of (49) using FISTA does not converge at all for any of the $4$ reaction channels, due to the extremely small step-size between $10^{-11}$ and $10^{-8}$ (the step-size is determined by the algorithm itself; see Algorithm 1 in Appendix A and [7]).

To overcome this difficulty, we apply the preconditioning idea discussed in Remark 4. Let $\varphi_{j}$ denote the basis functions, where $\varphi_{j}=\phi_{k}$, for $j=6(i-1)+k$, $1\leq k\leq 6$. For each index $j$ belonging to the $i$th channel $\mathcal{C}_{i}$, we record the maximal value of $\varphi_{j}$ among all the states in the trajectory data at which $\mathcal{C}_{i}$ has been activated. These maximal values are then used to (empirically) determine the rescaling constants $c_{j}$, shown in Table 9, such that the functions $\varphi_{j}/c_{j}$ after rescaling are roughly of the same order of magnitude. As discussed in Remark 4, we solve the rescaled sparse minimization problem, which is similar to (44), for each channel separately, and then recover the parameters $\bm{\omega}$ in the propensity functions. It turns out that the problems after rescaling become much easier to solve, because in this case the step-size is increased to $10^{-5}$ on average, which is $3$ to $6$ orders of magnitude larger than the step-size in the unrescaled problem. It takes less than $10$ minutes in total to meet the convergence criterion for all $4$ reaction channels and the results are summarized in Table 10.

To compare with the true parameters in (51), notice that the true propensity functions of the $4$ channels in Table 7 can be expressed as

[TABLE]

where $G_{0}(x)=\max(x,0)$. From the expressions above, we can conclude that the propensity functions in (48), together with the parameters given in Table 10, indeed approximate the true propensity functions in (52) quite well. Compared with the results for $\lambda=0.01$, the solutions are slightly sparser for $\lambda=0.2$ and $\lambda=0.1$ (e.g., coefficients corresponding to the basis $\phi_{1}\equiv 1$ in Table 10), while the approximation of the true coefficients is better when $\lambda$ is smaller (i.e., $\lambda=0.01$, underlined coefficients in Table 10). Finally, we point out that the solution could be further improved if necessary, by using thresholding techniques (i.e., removing unimportant basis functions) [10] or cross-validation techniques (i.e., tuning $\lambda$) [8].

4.3 Example 3: reaction network modeling intracellular viral infection

In the third example, we consider the reaction network in [44], which models intracellular viral infection. We refer the readers to [44] for the biological background and to [25, 5] for further details. As shown in Table 11, the system consists of $4$ different species, i.e., the viral template (T), the viral genome (G), the viral structure protein (S), and the virus (V). These species are involved in $6$ chemical reactions.

First of all, starting from the state $x=(1,0,0,0)^{\top}$ at time $t=0$ , $Q=10$ trajectories of the system are generated using SSA until $T=100$ , with the parameters

[TABLE]

in Table 11. For illustration purposes, $5$ of these $10$ trajectories are shown in Figure 4. It can be observed that the copy-numbers $x^{(3)}$ , $x^{(4)}$ of $S,V$ may increase to $10^{2}$ – $10^{3}$ , while the copy-numbers $x^{(1)}$ , $x^{(2)}$ of $T$ , $G$ remain relatively small (less than $20$ ) within the time interval $[0,100]$ . After analyzing the trajectory data, we can identify the $6$ reaction channels of the system. The numbers of occurrences of activations for each channel within the $10$ trajectories can be counted as well (see Table 12).

With these trajectory data, we study the estimation of the parameters $\kappa_{i}$ , $1\leq i\leq 6$ , assuming that the structure of the $6$ reactions in Table 11 is known (learning task 1). In the same way as we did in the previous two examples, the parameters are estimated by minimizing the same negative log-likelihood function (46). Since each reaction channel contains only one reaction, the parameters $\kappa_{i}$ can be directly computed (see (27)) and are indeed very close to the true parameters in (53), as shown in Table 13.

In what follows, we continue to study the second learning task in Subsection 3.3, where we want to learn the propensity functions of the $6$ identified reaction channels in the system without knowing the structure of the chemical reactions. As discussed in Table 1 and Remark 3, since there are $4$ different species in the system, we construct the following basis functions (i.e., polynomials of degree at most $2$ )

[TABLE]

where $x=(x^{(1)},x^{(2)},x^{(3)},x^{(4)})^{\top}$ , to learn the propensity function of each reaction channel. Similar to (48) in the first example, the propensity functions of the $6$ reaction channels are approximated by

[TABLE]

with $\epsilon=0.1$ . For each $1\leq i\leq 6$ , the same sparse minimization problem in (49) is solved in order to determine the coefficients $\bm{\omega}^{(i)}=\big{(}\omega_{15(i-1)+1}$ , $\omega_{15(i-1)+2}$ , $\dots$ , $\omega_{15(i-1)+15}\big{)}^{\top}$ . From Table 14, we can again observe that the maximal values of the different basis functions in (54), evaluated on the trajectory data, are of different orders of magnitude. Therefore, the same rescaling strategy discussed in Remark 4 and in the previous example is applied to precondition the problem, using the rescaling constants $c_{j}$ in Table 14 which are determined empirically based on the maximal values of basis functions. Notice that, since for different channels the basis functions attain similar maximal values, the same set of rescaling constants is used for all the $6$ channels. For each reaction channel, the rescaled minimization problem is solved in parallel using $10$ processors, since the trajectory data only contains $10$ trajectories, and the iteration procedure continues until the relative difference between the minimal and the maximal values of the objective function in the last $20$ iteration steps is smaller than $1.0\cdot 10^{-7}$ . The estimated coefficients are summarized in Table 15. For each channel except channel $\mathcal{C}_{4}$ , it takes around $10$ minutes to meet the convergence criterion, while for channel $\mathcal{C}_{4}$ it takes roughly two hours, because the corresponding solution of channel $\mathcal{C}_{4}$ has a large coefficient (i.e., the underlined coefficient $92.1$ in Table 15) which is very different from the zero initial guess.

To compare with the true propensity functions of the $6$ channels in Table 12 with the true parameters in (53), let us write the true propensity functions as

[TABLE]

where $G_{0}(x)=\max(x,0)$. From the expressions above, we can conclude that the propensity functions in (55), together with the estimated parameters in Table 15, indeed provide a good approximation of the true propensity functions in (56). Note that in this numerical experiment we have empirically chosen different values of $\lambda$ for different channels, since we only want to demonstrate that the true parameters can indeed be estimated with properly chosen $\lambda$. A more systematic way of choosing $\lambda$ is cross-validation [8, 20, 26]; see Remark 5. Finally, we point out that the solution could be further refined if necessary, by applying thresholding techniques (i.e., removing unimportant basis functions and then solving the minimization problem again) [10].

5 Asymptotic analysis of the two learning tasks

In this section, we consider the two learning tasks introduced in Section 3 when $T\rightarrow+\infty$ . Although we are mainly interested in the second learning task and the corresponding minimization problem (36), in Subsection 5.1 we start with the first learning task, because it is highly relevant to the second learning task. The analysis in Subsection 5.1 will be useful when we study the second learning task in Subsection 5.2. The proofs of the various results will be given in Appendix D.

Assume the true propensity functions of the underlying chemical reaction system are $a^{*}_{i},a^{*}$ in (1), and recall that the system’s state $X(t)$ satisfies the dynamical equation (3), where $\mathcal{P}_{i}$ , $1\leq i\leq K$ , are independent unit Poisson processes. For most of the results in this section, we will make the following assumptions about the system. Readers are referred to [38] for the study of the ergodicity of stochastic systems.

Assumption 1.

The state space $\mathbb{X}$ is a finite set.

Assumption 2.

$X(t)$ is ergodic on $\mathbb{X}$. It has a unique invariant distribution $\pi$, such that $\pi(x)>0,~{}\forall x\in\mathbb{X}$.

Remark 6.

Assumption 1 simplifies the analysis in this section. In particular, it implies that any function on $\mathbb{X}$, e.g., the basis function $\varphi_{j}$, is bounded. For many systems in chemical reaction applications, the state spaces, although possibly large, are indeed finite. This is especially the case when there are conservation relations in the reactions of the system. At the same time, we expect that the analysis presented below can be extended to systems whose state space is an infinite set, after taking into account additional technical issues.

Our asymptotic analysis of the limit $T\rightarrow+\infty$ combines techniques from large sample theory in statistics [19, 34] with limit theorems for stochastic processes [17]. In particular, we rely on the important fact that the log-likelihood functions in (19) and (35), as well as their derivatives, can be expressed as integrals with respect to the counting processes

[TABLE]

and the corresponding compensated Poisson processes (martingales)

[TABLE]

where $1\leq i\leq K$ and $t\geq 0$ . As an example, it is apparent that the process $R_{i}$ is related to $M_{i}$ in (8), i.e., the total activation number of the channel $\mathcal{C}_{i}$ within the time $[0,T]$ , since

[TABLE]

We refer the readers to Appendix C for two limit results concerning integrals with respect to the processes $R_{i}$ and $\widetilde{R}_{i}$ when $T\rightarrow+\infty$.

5.1 Learning task 1: analysis of the log-likelihood maximizer

In this subsection, we consider the first learning task in Subsection 3.2. Recall that $\bm{\omega}^{*}=(\omega^{*}_{1},\omega^{*}_{2},\dots,\omega^{*}_{N})^{\top}$ is the true parameter vector such that (15) holds and that

[TABLE]

For fixed $T>0$ , $\bm{\omega}^{(T)}$ denotes the solution of the minimization problem (18). We will study the asymptotic convergence of $\bm{\omega}^{(T)}$ to $\bm{\omega}^{*}$ , as $T\rightarrow+\infty$ . It should be pointed out that the consistency of maximum likelihood estimation has been well studied in the statistics community [52, 19, 50]. We refer the readers to [18, 31] for the asymptotic study of maximum likelihood estimation for continuous-time stochastic processes.

Let us first express the log-likelihood function $\ln\mathcal{L}^{(T)}$ in (19) and its derivatives using the processes in (57) and (58). For the log-likelihood function, since the trajectory of the system is piecewise constant, we have

[TABLE]

while for its first order derivatives in (21), we obtain

[TABLE]

where $1\leq j\leq N$ and $i$ is the index of channel such that $j\in\mathcal{I}_{i}$ . Similarly, the second order derivatives in (22) can be expressed as

[TABLE]

for two indices $1\leq j,j^{\prime}\leq N$ when there is a common channel index $i$ , $1\leq i\leq K$ , such that $j,j^{\prime}\in\mathcal{I}_{i}$ , and otherwise

[TABLE]

when $j\in\mathcal{I}_{i}$ and $j^{\prime}\in\mathcal{I}_{i^{\prime}}$ where $1\leq i\neq i^{\prime}\leq K$ are two different channel indices.

In particular, (5.1) and (62) become simpler when $\bm{\omega}=\bm{\omega}^{*}$ , and we have

[TABLE]

Let us first recall the law of large numbers (LLN) for the unit Poisson processes $\mathcal{P}_{i}$ , $1\leq i\leq K$ , which states that [1]

[TABLE]

It allows us to study the simple case when the reaction channel $\mathcal{C}_{i}$ contains a single reaction.
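The statement of the LLN for a unit Poisson process, namely that $\mathcal{P}_{i}(t)/t$ approaches $1$ as $t$ grows, can be illustrated numerically. The following sketch generates the process from i.i.d. unit-rate exponential interarrival times; the function name `unit_poisson_count`, the time horizons, and the random seed are choices made for this illustration.

```python
import numpy as np

def unit_poisson_count(t, rng):
    """Number of jumps of a unit-rate Poisson process in [0, t]."""
    # Interarrival times are i.i.d. Exp(1); draw about twice as many as needed.
    n_draws = int(2 * t) + 100
    arrivals = np.cumsum(rng.exponential(1.0, size=n_draws))
    assert arrivals[-1] > t, "not enough draws to cover [0, t]"
    # Count the arrival times that are <= t.
    return int(np.searchsorted(arrivals, t, side="right"))

rng = np.random.default_rng(0)
ratios = {t: unit_poisson_count(t, rng) / t for t in (10.0, 1e3, 1e5)}
```

Since the count at time $t$ has mean $t$ and standard deviation $\sqrt{t}$, the ratio deviates from $1$ by an amount of order $t^{-1/2}$, which is the scaling that reappears in the asymptotic normality results of this section.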

Proposition 3.

Given $1\leq i\leq K$, suppose $N_{i}=1$ and $\mathcal{I}_{i}=\{j\}$, for some $1\leq j\leq N$. Assume that

[TABLE]

Then $\lim\limits_{T\rightarrow+\infty}\omega^{(T)}_{j}=\omega^{*}_{j}$ , almost surely.

Note that Assumption 1 and Assumption 2 are actually not necessary in Proposition 3. In what follows, we study the case when $N_{i}>1$, i.e., when more than one reaction belongs to the same reaction channel $\mathcal{C}_{i}$. We need to make the following two further assumptions.

Assumption 3.

There is a unique vector $\bm{\omega}^{*}\in\mathbb{R}^{N}$ , such that (60) is satisfied.

Assumption 4.

The basis functions $\varphi_{j}$ , $1\leq j\leq N$ , are nonnegative on $\mathbb{X}$ .

As a consequence of Assumption 3, we have the following lemma which concerns the uniqueness of $\bm{\omega}^{(T)}$ , when $T$ is sufficiently large.

Lemma 1.

Suppose that Assumptions 1, 2, 3, 4 hold. With probability one, the minimization problem (18)–(19) has a unique solution $\bm{\omega}^{(T)}$ , when $T$ is sufficiently large.

To proceed, we will need the Kullback–Leibler divergence between two probability distributions [36]. It is known that the Kullback–Leibler divergence is nonnegative and it equals zero if and only if the two distributions are identical. In particular, for the probability distributions whose density functions are $\psi$ and $p$ in (4), the Kullback–Leibler divergences can be computed as

[TABLE]

respectively, where $x\in\mathbb{X}$ and $\bm{\omega}$ , $\bm{\omega}^{\prime}$ are two parameter vectors in (16).

The convergence of $\bm{\omega}^{(T)}$ towards $\bm{\omega}^{*}$ as $T\rightarrow+\infty$ is established in the following result.

Proposition 4.

Suppose that Assumptions 1, 2, 3, 4 hold.

1. For any vector $\bm{\omega}$ in (16), we have

[TABLE]

2. Let $\bm{\omega}^{(T)}=(\omega^{(T)}_{1},\omega^{(T)}_{2},\dots,\omega^{(T)}_{N})^{\top}$ be the unique minimizer of the problem (18), such that $\omega^{(T)}_{j}\geq 0$ for each $1\leq j\leq N$. With probability one, it holds that $\lim\limits_{T\rightarrow+\infty}\bm{\omega}^{(T)}=\bm{\omega}^{*}$.

We now study the asymptotic normality of the sequence $\bm{\omega}^{(T)}$ as $T\rightarrow+\infty$ . We have the following result.

Proposition 5.

Suppose that Assumptions 1, 2, 3, 4 hold. Let $\mathcal{F}$ be the $N\times N$ matrix whose entries are

[TABLE]

for $1\leq j,\,j^{\prime}\leq N$ . Then, as $T\rightarrow+\infty$ , $\sqrt{T}\big{(}\bm{\omega}^{(T)}-\bm{\omega}^{*}\big{)}$ converges in distribution to $\mathcal{Z}\sim\mathcal{N}(\bm{0},\mathcal{F}^{-1})$ , i.e., $\mathcal{Z}$ is a Gaussian random variable whose mean equals zero and whose covariance matrix is $\mathcal{F}^{-1}$ .

5.2 Learning task 2: asymptotic analysis of the sparse optimization problem

Based on the analysis in Subsection 5.1, in this subsection we study the minimizer $\bm{\omega}^{(T,\epsilon,\lambda)}$ of the sparse minimization problem (36) as $T\rightarrow+\infty$ , where both $\epsilon=\epsilon(T)$ and $\lambda=\lambda(T)$ depend on $T$ .

Recall that $\psi^{*},p^{*}$ are the probability densities (distributions) in (2). With the convention $G_{0}(z)=\lim\limits_{\epsilon\rightarrow 0+}G_{\epsilon}(z)=\max(z,0)$ , for $z\in\mathbb{R}$ , we will denote

[TABLE]

and, correspondingly,

[TABLE]

Instead of Assumption 3, here we assume that the set of basis functions is chosen such that the underlying (true) system can be uniquely parameterized.

Assumption 5.

There is a unique vector $\bm{\omega}^{*}\in\mathbb{R}^{N}$, such that

[TABLE]

We also need the following assumption in order to guarantee the boundedness of $\bm{\omega}^{(T,\epsilon,\lambda)}$ .

Assumption 6.

For each $1\leq i\leq K$, write the index set as $\mathcal{I}_{i}=\big{\{}j_{1},j_{2},\dots,j_{N_{i}}\big{\}}$, and let $\bm{\eta}^{(k)}=(\eta^{(k)}_{1},\eta^{(k)}_{2},\dots,\eta^{(k)}_{N_{i}})^{\top}\in\mathbb{R}^{N_{i}}$, $k\geq 1$, be a sequence of vectors satisfying $\lim\limits_{k\rightarrow+\infty}\|\bm{\eta}^{(k)}\|_{2}=+\infty$. Then there exists $x\in\mathbb{X}$ such that $a^{*}_{i}(x)>0$ and

[TABLE]

Remark 7.

In fact, under Assumption 5 (uniqueness of $\bm{\omega}^{*}$ ), one can argue by contradiction and show that there exists $x\in\mathbb{X}$ such that (70) holds. Therefore, Assumption 6 simply further asserts that $a^{*}_{i}(x)$ is positive. In particular, Assumption 6 is not needed if $a^{*}_{i}(x)>0$ is true for all $x\in\mathbb{X}$ and $1\leq i\leq K$ .

Similar to (62) and (63), it will be helpful to express the derivatives of the log-likelihood function in (35) using the processes $R_{i}$ , $\widetilde{R}_{i}$ in (57) and (58). For the first order derivative (40), we have

[TABLE]

for $j\in\mathcal{I}_{i}$ . For the second order derivatives, we have

[TABLE]

when there is an index $i$ , $1\leq i\leq K$ , such that $j,j^{\prime}\in\mathcal{I}_{i}$ , and otherwise

[TABLE]

when $j\in\mathcal{I}_{i},\,j^{\prime}\in\mathcal{I}_{i^{\prime}}$ , for two different indices $1\leq i\neq i^{\prime}\leq K$ .

The following technical lemma addresses the boundedness of the minimizers of the minimization problem (36).

Lemma 2.

Suppose that Assumptions 1, 2, and 5 hold and that the parameter $\epsilon=\epsilon(T)$ satisfies $\lim\limits_{T\rightarrow+\infty}\epsilon(T)=0$. Let $\bm{\omega}^{(T,\epsilon,\lambda)}$ be the minimizer of the minimization problem (36) and $\mathcal{L}^{(T,\epsilon)}$ be the likelihood function in (35). Then, for each index $i$, $1\leq i\leq K$, and each $x\in\mathbb{X}$ such that $a^{*}_{i}(x)>0$, we have

[TABLE]

If, furthermore, Assumption 6 holds, then the sequence $\bm{\omega}^{(T,\epsilon,\lambda)}$, $T>0$, is bounded.

Now we are ready to state the asymptotic results for the sequence $(\bm{\omega}^{(T,\epsilon,\lambda)})_{T>0}$ , as $T\rightarrow+\infty$ . Readers are referred to Appendix D for their proofs.

Theorem 1.

Suppose that Assumptions 1, 2, 5, and 6 hold. The parameters $\lambda=\lambda(T)$ , $\epsilon=\epsilon(T)$ in the minimization problem (36) satisfy

[TABLE]

Then we have $\lim\limits_{T\rightarrow+\infty}\bm{\omega}^{(T,\epsilon,\lambda)}=\bm{\omega}^{*}$ , a.s.

Theorem 2.

Suppose that Assumptions 1, 2, 5, and 6 hold. Let $\mathcal{F}$ be the $N\times N$ matrix whose entries are given in (68) and let $\bm{\omega}^{(T,\epsilon,\lambda)}$ be the minimizer of the problem (36). Further assume that the following conditions are met.

1. The parameters $\lambda=\lambda(T)$, $\epsilon=\epsilon(T)$ in (36) satisfy

[TABLE]

for some $\alpha>0$.

2. There exists $c>0$, such that for all $x\in\mathbb{X}$ and $1\leq i\leq K$ satisfying $a^{*}_{i}(x)=0$, we have either $\varphi_{j}(x)=0$ for all $j\in\mathcal{I}_{i}$, or $\sum_{j\in\mathcal{I}_{i}}\omega_{j}^{*}\varphi_{j}(x)\leq-c<0$.

Then, as $T\rightarrow+\infty$ , $\sqrt{T}\big{(}\bm{\omega}^{(T,\epsilon,\lambda)}-\bm{\omega}^{*}\big{)}$ converges in distribution to a Gaussian random variable with mean zero and covariance matrix $\mathcal{F}^{-1}$ .

Appendix A Pseudocode of FISTA with backtracking

We summarize the main algorithmic steps of FISTA with backtracking [7] for the optimization problem

[TABLE]

in Algorithm 1, where $\lambda>0$ and $c_{j}>0$ . The optimization problems (37) and (44) are in the form of (76), with $f$ being the (negative) logarithmic likelihood function. We refer the readers to the original paper [7], where FISTA is developed for optimization problems which are more general than (76).
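As a rough illustration of the steps summarized in Algorithm 1, the following Python sketch implements FISTA with backtracking for an objective of the form $f(x)+\lambda\sum_{j}c_{j}|x_{j}|$. That form is an assumption made here for illustration, since (76) is not reproduced in this text; the names `fista`, `soft_threshold`, `L0`, and `eta` are hypothetical, and the quadratic `f` at the end is a toy stand-in rather than the negative log-likelihood of the paper.

```python
import math


def soft_threshold(v, tau):
    """Proximal map of tau*|.|: shrink v toward zero by tau."""
    return math.copysign(max(abs(v) - tau, 0.0), v)


def fista(f, grad_f, c, lam, x0, L0=1.0, eta=2.0, n_iter=100):
    """Minimize f(x) + lam * sum_j c[j]*|x[j]| by FISTA with backtracking.

    f, grad_f: smooth part and its gradient (list in, float / list out).
    L0, eta:   initial step-size parameter and its growth factor (assumed defaults).
    """
    x_prev, y = list(x0), list(x0)
    t, L = 1.0, L0
    for _ in range(n_iter):
        fy, gy = f(y), grad_f(y)
        while True:
            # Proximal gradient step from y with step size 1/L.
            x = [soft_threshold(yj - gj / L, lam * cj / L)
                 for yj, gj, cj in zip(y, gy, c)]
            d = [xj - yj for xj, yj in zip(x, y)]
            quad = (fy + sum(gj * dj for gj, dj in zip(gy, d))
                    + 0.5 * L * sum(dj * dj for dj in d))
            # Accept once the quadratic model at y majorizes f at x.
            if f(x) <= quad + 1e-12:
                break
            L *= eta
        t_next = 0.5 * (1.0 + math.sqrt(1.0 + 4.0 * t * t))
        # Momentum (extrapolation) step.
        y = [xj + (t - 1.0) / t_next * (xj - xpj) for xj, xpj in zip(x, x_prev)]
        x_prev, t = x, t_next
    return x_prev


# Toy check: f(x) = 0.5*||x - b||^2, whose regularized minimizer is the
# soft-thresholding of b with thresholds lam*c[j].
b = [3.0, -0.5, 1.0]
f = lambda x: 0.5 * sum((xj - bj) ** 2 for xj, bj in zip(x, b))
grad_f = lambda x: [xj - bj for xj, bj in zip(x, b)]
x_hat = fista(f, grad_f, c=[1.0, 1.0, 2.0], lam=1.0, x0=[0.0, 0.0, 0.0], L0=0.5)
```

Starting from `L0=0.5` forces one backtracking step in the first iteration, so the toy problem exercises the line search as well as the proximal update.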

Appendix B Properties of the function $G_{\epsilon}$

We now summarize some asymptotic properties of the function $G_{\epsilon}$ in (32). Given $\epsilon>0$ , recall that $G_{\epsilon}(x)=\epsilon\ln\big{(}1+e^{x/\epsilon}\big{)}$ , for all $x\in\mathbb{R}$ , whose first and second derivatives are

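$$G_{\epsilon}^{\prime}(x)=\frac{e^{x/\epsilon}}{1+e^{x/\epsilon}}\,,\qquad G_{\epsilon}^{\prime\prime}(x)=\frac{1}{\epsilon}\,\frac{e^{x/\epsilon}}{\big(1+e^{x/\epsilon}\big)^{2}}\,,$$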

respectively. The following lemma can be easily proved and therefore its proof is omitted.

Lemma 3.

Given $\epsilon>0$ , we have the following estimates.

1. $\max(x,0)<G_{\epsilon}(x)\leq\max(x,0)+\epsilon\ln 2$, $\quad\forall\,x\in\mathbb{R}$.

2. $1-e^{-x/\epsilon}<G_{\epsilon}^{\prime}(x)<1$, if $x\geq 0$, and $0<G_{\epsilon}^{\prime}(x)<e^{x/\epsilon}$, if $x<0$.

3. $0<G^{\prime\prime}_{\epsilon}(x)<\frac{1}{\epsilon}e^{-|x|/\epsilon}$, $\quad\forall\,x\in\mathbb{R}$.

In particular, Lemma 3 implies $\lim\limits_{\epsilon\rightarrow 0+}G_{\epsilon}(x)=\max(x,0)$ , uniformly for $x\in\mathbb{R}$ , and

[TABLE]
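The estimates of Lemma 3 can be checked numerically. The sketch below is only a sanity check on a small grid, not a proof; the function names `G`, `dG`, `d2G`, and `lemma3_holds` are hypothetical, and the derivatives are obtained by differentiating $G_{\epsilon}(x)=\epsilon\ln(1+e^{x/\epsilon})$ directly. The grid is kept to $|x/\epsilon|\leq 10$ because for much larger arguments the strict inequalities are lost to floating-point rounding.

```python
import math


def G(x, eps):
    """G_eps(x) = eps*ln(1 + exp(x/eps)), written to avoid overflow."""
    z = x / eps
    return eps * (max(z, 0.0) + math.log1p(math.exp(-abs(z))))


def dG(x, eps):
    """First derivative: exp(x/eps) / (1 + exp(x/eps))."""
    z = x / eps
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def d2G(x, eps):
    """Second derivative: exp(x/eps) / (eps*(1 + exp(x/eps))**2), even in x."""
    e = math.exp(-abs(x / eps))
    return e / (eps * (1.0 + e) ** 2)


def lemma3_holds(x, eps, slack=1e-12):
    """Check the three estimates of Lemma 3 at a single point."""
    relu = max(x, 0.0)
    # slack absorbs rounding at x = 0, where the upper bound is attained.
    ok_G = relu < G(x, eps) <= relu + eps * math.log(2.0) + slack
    if x >= 0:
        ok_dG = 1.0 - math.exp(-x / eps) < dG(x, eps) < 1.0
    else:
        ok_dG = 0.0 < dG(x, eps) < math.exp(x / eps)
    ok_d2G = 0.0 < d2G(x, eps) < math.exp(-abs(x) / eps) / eps
    return ok_G and ok_dG and ok_d2G


eps = 0.2
grid = [k / 10.0 - 2.0 for k in range(41)]         # x in [-2, 2]
all_ok = all(lemma3_holds(x, eps) for x in grid)
gap = max(G(x, eps) - max(x, 0.0) for x in grid)   # largest at x = 0
```

On this grid the largest deviation `gap` of $G_{\epsilon}$ from $\max(x,0)$ occurs at $x=0$ and equals $\epsilon\ln 2$, which is the uniform bound behind the limit $\epsilon\rightarrow 0+$.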

We also need to study the function $\ln G_{\epsilon}(x)=\ln\big{[}\epsilon\ln(1+e^{x/\epsilon})\big{]}$ , whose first and second derivatives are

[TABLE]

Lemma 4.

Given $\epsilon>0$ , we have the following estimates.

1. For all $x>0$, it holds that

[TABLE]

2. For $x=0$, we have

[TABLE]

3. For all $x<0$, we have

[TABLE]

Proof.

We will only prove the inequalities concerning $(\ln G_{\epsilon})^{\prime\prime}$ .

1. When $x>0$, using (78) and the fact $\epsilon\ln(1+e^{x/\epsilon})>x$, we have $(\ln G_{\epsilon})^{\prime\prime}(x)>-\frac{1}{x^{2}}$. For the upper bound, using Lemma 3, we have

[TABLE]

and therefore (79) is obtained.

2. When $x<0$, using the fact that $-\frac{u^{2}}{2}<\ln(1+u)-u<0$, for all $u>0$, we have $\ln(1+e^{x/\epsilon})>e^{x/\epsilon}-\frac{1}{2}e^{2x/\epsilon}>\frac{1}{2}e^{x/\epsilon}.$ Therefore,

[TABLE]

∎

Summarizing the estimates in Lemma 4, we can conclude that

[TABLE]
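To make the behaviour of $\ln G_{\epsilon}$ concrete, the following sketch evaluates its first two derivatives through the identities $(\ln G_{\epsilon})^{\prime}=G_{\epsilon}^{\prime}/G_{\epsilon}$ and $(\ln G_{\epsilon})^{\prime\prime}=G_{\epsilon}^{\prime\prime}/G_{\epsilon}-(G_{\epsilon}^{\prime}/G_{\epsilon})^{2}$, which follow from the chain rule. The function name `log_G_derivatives` is hypothetical, and the loop only illustrates numerically that, for fixed $x>0$, the values approach $1/x$ and $-1/x^{2}$ as $\epsilon\rightarrow 0+$.

```python
import math


def log_G_derivatives(x, eps):
    """Return ((ln G_eps)'(x), (ln G_eps)''(x)) computed from G, G', G''."""
    z = x / eps
    e = math.exp(-abs(z))
    G = eps * (max(z, 0.0) + math.log1p(e))              # G_eps(x)
    dG = 1.0 / (1.0 + e) if z >= 0 else e / (1.0 + e)    # G_eps'(x)
    d2G = e / (eps * (1.0 + e) ** 2)                     # G_eps''(x)
    first = dG / G
    second = d2G / G - first ** 2
    return first, second


x = 0.5
for eps in (0.5, 0.1, 0.02):
    d1, d2 = log_G_derivatives(x, eps)
    print(f"eps={eps}: (ln G)'={d1:.6f}  (ln G)''={d2:.6f}")
# Limits as eps -> 0+ for this x: 1/x = 2 and -1/x**2 = -4.
```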

Appendix C Two limit lemmas on integrations with respect to counting processes

In this section, we summarize two useful results pertaining to integrations with respect to the processes $R_{i}$ , $\widetilde{R}_{i}$ in (57) and (58), respectively. These results play an important role in the asymptotic analysis in Section 5. The first result is a type of law of large numbers (LLN) for Poisson processes.

Lemma 5.

Suppose that Assumptions 1-2 hold. Functions $f^{(T)}:\mathbb{X}\rightarrow\mathbb{R}$ satisfy $\lim\limits_{T\rightarrow+\infty}f^{(T)}(x)=f(x)$, $\forall~{}x\in\mathbb{X}$. For each $1\leq i\leq K$, we have

[TABLE]

Proof.

Since $\mathbb{X}$ is a finite set (Assumption 1), the convergence of $f^{(T)}$ to $f$ is in fact uniform on $\mathbb{X}$ . Using the LLN of Poisson processes in (65) and the uniform convergence of $f^{(T)}$ , we have

[TABLE]

Therefore, it is sufficient to prove (80) for the case $f^{(T)}\equiv f$ . Note that we have

[TABLE]

where $\mathbf{1}_{x}$ denotes the indicator function at state $x$ . For each $x\in\mathbb{X}$ , $\int_{0}^{T}\mathbf{1}_{x}\big{(}X(s)\big{)}\,dR_{i}(s)$ can be interpreted as the total number of times that the $i$ th channel $\mathcal{C}_{i}$ becomes active within time $[0,T]$ when the state of the system is $x$ . Similarly, $\int_{0}^{T}\mathbf{1}_{x}\big{(}X(s)\big{)}\,ds$ is the total time that the system spends at state $x$ within time $[0,T]$ . Since the waiting times at state $x$ before the channel $\mathcal{C}_{i}$ becomes activated are independent and follow exponential distributions with mean value $\big{(}a_{i}^{*}(x)\big{)}^{-1}$ , the LLN of exponential distributions implies that

[TABLE]

Since the system is ergodic (Assumption 2), Birkhoff’s ergodic theorem implies

[TABLE]

Combining (82)–(84), we obtain $\lim\limits_{T\rightarrow+\infty}\frac{1}{T}\int_{0}^{T}f\big{(}X(s)\big{)}\,dR_{i}(s)=\sum\limits_{x\in\mathbb{X}}f(x)a_{i}^{*}(x)\,\pi(x)$ , a.s. The conclusion (81) follows as a consequence, using the definition of $\widetilde{R}_{i}$ in (58) and the ergodicity of the system. ∎
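The limit obtained at the end of this proof can be observed in a small simulation. The sketch below uses a hypothetical two-state system that is not taken from the paper: state 0 jumps to state 1 through channel 1 at rate `a`, and state 1 jumps back through channel 2 at rate `b`, so that $\pi(0)=b/(a+b)$. With $f\equiv 1$, the quantity $\frac{1}{T}\int_{0}^{T}f(X(s))\,dR_{1}(s)$ is the number of channel-1 firings divided by $T$, and the predicted limit is $a\,\pi(0)=ab/(a+b)$.

```python
import random


def simulate_two_state(a, b, T, seed=0):
    """Simulate the two-state chain on [0, T], starting in state 0.

    Returns (number of channel-1 firings, total time spent in state 0).
    """
    rng = random.Random(seed)
    t, x = 0.0, 0
    firings, time_in_0 = 0, 0.0
    while True:
        rate = a if x == 0 else b
        tau = rng.expovariate(rate)      # exponential waiting time
        if t + tau > T:
            if x == 0:
                time_in_0 += T - t       # truncate the last sojourn at T
            break
        t += tau
        if x == 0:
            time_in_0 += tau
            firings += 1                 # channel 1 fires: 0 -> 1
            x = 1
        else:
            x = 0                        # channel 2 fires: 1 -> 0
    return firings, time_in_0


a, b, T = 2.0, 3.0, 20000.0
firings, time_in_0 = simulate_two_state(a, b, T, seed=1)
lhs = firings / T            # (1/T) * integral of dR_1 with f = 1
rhs = a * b / (a + b)        # sum over x of a_1(x) * pi(x)
occupation = time_in_0 / T   # ergodic average, close to pi(0) = b/(a+b)
```

For a long trajectory `lhs` and `rhs` should agree up to statistical fluctuations, and `occupation` should be close to $\pi(0)$, which is the ergodic-theorem step of the proof.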

The second result is a corollary of the martingale central limit theorem [17, Theorem 7.1.4].

Lemma 6.

Suppose that Assumptions 1-2 hold. For each $1\leq j\leq N$ , functions $f_{j},f^{(T)}_{j}\colon\mathbb{X}\rightarrow\mathbb{R}$ satisfy $\lim\limits_{T\rightarrow+\infty}f^{(T)}_{j}(x)=f_{j}(x)$ , $\forall~{}x\in\mathbb{X}$ . Let $\mathcal{W}^{(T)}(u)\in\mathbb{R}^{N}$ denote the $N$ -dimensional process whose components are $\mathcal{W}^{(T)}_{j}(u)=\frac{1}{\sqrt{T}}\int_{0}^{Tu}f^{(T)}_{j}\big{(}X(s)\big{)}\,d\widetilde{R}_{i}(s)$ , where $u\geq 0$ , $1\leq j\leq N$ , and the index $i$ satisfies $j\in\mathcal{I}_{i}$ , $1\leq i\leq K$ . Moreover, $\mathcal{F}$ is the $N\times N$ matrix whose entries are given by

[TABLE]

for $1\leq j,j^{\prime}\leq N$ . We define the matrix-valued (linear) process $\mathcal{A}(u)=u\,\mathcal{F}$ , $u\geq 0$ .

As $T\rightarrow\infty$ , $\mathcal{W}^{(T)}$ converges in distribution to $\mathcal{W}$ , where $\mathcal{W}$ is an $N$ -dimensional process with independent Gaussian increments whose quadratic variation process is $\mathcal{A}$ . In particular, $\mathcal{W}^{(T)}(1)$ converges in distribution to a Gaussian random variable whose mean is zero and whose covariance matrix is $\mathcal{F}$ in (85).

Proof.

For each $T>0$ , we define the matrix-valued process $\mathcal{A}^{(T)}(u)$ , $u\geq 0$ , whose entries are given by

[TABLE]

for $1\leq j,\,j^{\prime}\leq N$ . Let us verify the conditions required by the martingale central limit theorem [17, Theorem 7.1.4].

Firstly, from (57) and (58), applying Itô’s formula, we know the process

[TABLE]

is a martingale. Using the expression (86) and the ergodicity of the system, we have

[TABLE]

Furthermore, since $f^{(T)}_{j}$ converge to $f_{j}$ and $\mathbb{X}$ is a finite set (Assumption 1), it is clear that $f^{(T)}_{j}$ are uniformly bounded. This implies that

[TABLE]

Secondly, because the processes $\mathcal{A}_{j,j^{\prime}}^{(T)}(u)$ in (86) have continuous paths, the limit

[TABLE]

holds trivially. Therefore, we can apply the martingale central limit theorem [17, Theorem 7.1.4] and the conclusion follows readily. ∎

Appendix D Proofs of results in Section 5

In this section, we prove the results presented in Section 5.

We start with the results in Subsection 5.1.

Proof of Proposition 3.

As already pointed out in Subsection 3.2, the Euler–Lagrange equation (21) can be explicitly solved when $N_{i}=1$ and the solution is given in (27). Using the representations in (57) and (59), we can rewrite (27) as

[TABLE]

Applying (65) together with (66), we conclude that $\lim\limits_{T\rightarrow+\infty}\omega^{(T)}_{j}=\omega^{*}_{j}$ , almost surely. ∎
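The convergence stated in Proposition 3 can be illustrated on simulated data. The sketch below assumes that, when each channel has a single basis function, the maximum-likelihood estimate takes the standard form "number of firings of channel $i$ divided by $\int_{0}^{T}\varphi_{j}(X(s))\,ds$"; this is the usual estimator for such likelihoods, but formula (27) itself is not reproduced in this text, so the form is an assumption. The reversible isomerization A ⇌ B with `n` molecules, the rate constants, and all function names are hypothetical choices for illustration.

```python
import random


def simulate_isomerization(k1, k2, n, T, seed=0):
    """Simulate A -> B (propensity k1*x) and B -> A (propensity k2*(n-x)) on [0, T].

    x is the number of A molecules. Returns (counts, integrals), where
    counts[i] is the number of firings of channel i and integrals[i] is
    the time integral of its basis function (x and n - x, respectively).
    """
    rng = random.Random(seed)
    t, x = 0.0, n
    counts = [0, 0]
    integrals = [0.0, 0.0]
    while True:
        a1, a2 = k1 * x, k2 * (n - x)
        total = a1 + a2                  # positive because n >= 1
        tau = rng.expovariate(total)
        dt = min(tau, T - t)             # truncate the last sojourn at T
        integrals[0] += x * dt
        integrals[1] += (n - x) * dt
        if t + tau >= T:
            break
        t += tau
        if rng.random() * total < a1:
            counts[0] += 1
            x -= 1
        else:
            counts[1] += 1
            x += 1
    return counts, integrals


k1, k2, n, T = 1.0, 0.5, 10, 2000.0
counts, integrals = simulate_isomerization(k1, k2, n, T, seed=1)
k1_hat = counts[0] / integrals[0]   # estimate of k1
k2_hat = counts[1] / integrals[1]   # estimate of k2
```

The state space $\{0,1,\dots,n\}$ is finite and the chain is ergodic for positive rate constants, so this toy system is consistent with Assumptions 1 and 2.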

Proof of Lemma 1.

From (19) and (20), it is not difficult to see that, with probability one, there is at least one minimizer for large enough $T$. We show the uniqueness by contradiction. Suppose that, with positive probability, the solution of (18)–(19) is not unique for an increasing subsequence $T_{k}$, where $\lim\limits_{k\rightarrow+\infty}T_{k}=+\infty$. According to Proposition 1, we can find an index $i$, $1\leq i\leq K$, such that the column vectors $\Phi_{i,l}$, $1\leq l\leq N_{i}$, of the matrix $\Phi_{i}$ in (23) are linearly dependent for $T_{k}$, where $k=1,2,\dots$. Let us order the states in $\mathbb{X}$ such that $\mathbb{X}=\{x_{1},x_{2},\dots,x_{m}\}$, where $m=|\mathbb{X}|$ is positive. The ergodicity of the system (Assumption 2) implies that with probability one the states $x_{1},x_{2},\dots,x_{m}$ will be visited by the system within some large finite time. Since there is a positive probability that the column vectors $\Phi_{i,l}$ are linearly dependent for all $T_{k}$ where $\lim\limits_{k\rightarrow+\infty}T_{k}=+\infty$, we can find a nonzero vector $\bm{\eta}=(\eta_{1},\eta_{2},\dots,\eta_{N_{i}})^{\top}\in\mathbb{R}^{N_{i}}$, such that $\sum\limits_{k=1}^{N_{i}}\eta_{k}\varphi_{j_{k}}(x_{l})=0$, $\forall\,1\leq l\leq m$, where $\mathcal{I}_{i}=\big{\{}j_{1},j_{2},\dots,j_{N_{i}}\big{\}}$. This contradicts Assumption 3. ∎

Proof of Proposition 4.

1. Under Assumption 2, using expressions (5.1), (64), and applying Lemma 5 in Appendix C, we can compute

[TABLE]

where we have used (67) in the last equality. Therefore, the first conclusion is obtained.

2. Firstly, let us show that the sequence $\big{(}\omega^{(T)}_{j}\big{)}_{T>0}$ is almost surely bounded for each $1\leq j\leq N$. From the Euler–Lagrange equation (21), we can obtain the relation

[TABLE]

which implies

[TABLE]

where $i$ , $1\leq i\leq K$ , is the index such that $j\in\mathcal{I}_{i}$ . Note that both the numerator and the denominator on the right-hand side of (88) converge, as consequences of Lemma 5 in Appendix C and the ergodicity of the system (Assumption 2), respectively. Taking the limit $T\rightarrow+\infty$ in (88) and using (17), we have

[TABLE]

which implies that the sequence $\big{(}\omega_{j}^{(T)}\big{)}_{T>0}$ is almost surely bounded.

Secondly, from (62) we know that the minimizer $\bm{\omega}^{(T)}$ satisfies the identity

[TABLE]

In particular, for each state $x\in\mathbb{X}$ , it implies

[TABLE]

where $\mathbf{1}_{x}$ denotes the indicator function at $x$ . Therefore, applying Lemma 5 in Appendix C and using the ergodicity of the system, we have

[TABLE]

Note that whenever there is an increment for the counting process $R_{i}(s)$ when $X(s)=x$ , we know $a_{i}(x\,;\,\bm{\omega}^{*})>0$ and we can find an index $j\in\mathcal{I}_{i}$ such that the lower bound in (89) is positive.

Finally, let $\bar{\bm{\omega}}$ be a limit point of $\bm{\omega}^{(T)}$ as $T\rightarrow+\infty$ . Using a similar derivation as in (1) and taking the lower bound (89) into account, we obtain

[TABLE]

On the other hand, since $\bm{\omega}^{(T)}$ is the minimizer of (18), we also have

[TABLE]

Therefore, the Kullback–Leibler divergences in (90) must be equal to zero at each state $x$ . The expressions (67) then imply $a_{i}\big{(}x\,;\,\bar{\bm{\omega}}\big{)}=a_{i}\big{(}x\,;\,\bm{\omega}^{*}\big{)}$ , $\forall\,1\leq i\leq K$ and $\forall\,x\in\mathbb{X}$ . Using (17) and Assumption 3, we conclude $\bar{\bm{\omega}}=\bm{\omega}^{*}$ and therefore $\lim\limits_{T\rightarrow+\infty}\bm{\omega}^{(T)}=\bm{\omega}^{*}$ .

∎

Proof of Proposition 5.

First of all, under Assumption 3, it is straightforward to verify that the matrix $\mathcal{F}$ is positive definite and therefore invertible. Given $1\leq j\leq N$ , expanding the function $\mathcal{M}^{(T)}_{j}(\bm{\omega})$ in (21), we have

[TABLE]

Since $\mathcal{M}^{(T)}_{j}\big{(}\bm{\omega}^{(T)}\big{)}=0$ , dividing both sides of the equality above by $\sqrt{T}$ , using (63) and (64), we have

[TABLE]

where $i$ , $1\leq i\leq K$ , is the index such that $j\in\mathcal{I}_{i}$ , and we have introduced

[TABLE]

if $j,j^{\prime}\in\mathcal{I}_{i}$ , for some index $i$ , $1\leq i\leq K$ , and $\mathcal{B}_{j,j^{\prime}}^{(T)}=0$ , otherwise.

Let $\mathcal{B}^{(T)}$ denote the $N\times N$ matrix whose entries are $\mathcal{B}^{(T)}_{j,j^{\prime}}$, and let $\mathcal{W}^{(T)}$ denote the $N$-dimensional vector whose component $\mathcal{W}^{(T)}_{j}$ equals the left-hand side of (92). With this notation, (92) can be written as

[TABLE]

Applying Lemma 6 in Appendix C, we know that, as $T\rightarrow+\infty$, the vector $\mathcal{W}^{(T)}$ converges in distribution to a Gaussian random variable whose mean equals zero and whose covariance matrix is given by $\mathcal{F}$. At the same time, since $\lim\limits_{T\rightarrow+\infty}\bm{\omega}^{(T)}=\bm{\omega}^{*}$ almost surely according to Proposition 4, Lemma 5 in Appendix C implies $\lim\limits_{T\rightarrow+\infty}\mathcal{B}^{(T)}=\mathcal{F}$, a.s. Therefore, applying Slutsky’s Theorem [19], we can conclude

[TABLE]

∎

We continue to prove the results in Subsection 5.2.

Proof of Lemma 2.

Lemma 3 in Appendix B implies that

[TABLE]

Since $\bm{\omega}^{(T,\epsilon,\lambda)}$ is the minimizer, we can derive

[TABLE]

Taking the limit $T\rightarrow+\infty$ , using the fact $\lim\limits_{\epsilon\rightarrow 0}G_{\epsilon}=G_{0}$ , as well as Lemma 5 in Appendix C, we obtain

[TABLE]

Now we show (73) by contradiction. Suppose that it does not hold. Applying Lemma 3 in Appendix B, we can find an index $i$, $1\leq i\leq K$, and a state $x\in\mathbb{X}$ with $a_{i}^{*}(x)>0$, such that, after extracting a subsequence, which will again be denoted by $\bm{\omega}^{(T,\epsilon,\lambda)}$, we have either

[TABLE]

Using (35), we can estimate

[TABLE]

where we have used Lemma 7 below, as well as the convention $0\ln 0=0$ . Since $a^{*}_{i}(x)>0$ , applying Lemma 5 in Appendix C, we have

[TABLE]

Therefore, Lemma 7 below implies that $\lim\limits_{T\rightarrow+\infty}J_{3}=+\infty$, almost surely, in both cases in (95). For the same reason, applying Lemma 5 in Appendix C, we know that

[TABLE]

Taking the limit $T\rightarrow+\infty$ in (96), we obtain $\limsup\limits_{T\rightarrow+\infty}-\frac{1}{T}\ln\mathcal{L}^{(T,\epsilon)}(\bm{\omega}^{(T,\epsilon,\lambda)})=+\infty$ , which contradicts (94). Therefore, (73) has been proved. The boundedness of the sequence $\bm{\omega}^{(T,\epsilon,\lambda)}$ follows directly from (73) and Assumption 6. ∎

The following elementary facts have been used in the proof above.

Lemma 7.

Consider the function $f(x)=-c_{1}\ln x+c_{2}\,x$ , where $c_{1}\geq 0,c_{2}>0$ are two constants. We have

1. $f(x)$ is convex on $(0,+\infty)$.

2. $f(x)\geq-c_{1}\ln\frac{c_{1}}{c_{2}}+c_{1},~{}\forall x\in(0,+\infty)$, and $\lim\limits_{x\rightarrow+\infty}f(x)=+\infty$.

3. When $c_{1}>0$, then $\lim\limits_{x\rightarrow 0+}f(x)=+\infty$.
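The proof of Lemma 7 is not spelled out in the text. One way to verify the three claims is the following short calculation, given here as a sketch using only the definition of $f$ and the convention $0\ln 0=0$ adopted in the proof of Lemma 2.

```latex
% Derivatives of f(x) = -c_1 \ln x + c_2 x on (0, +\infty):
f'(x) = -\frac{c_1}{x} + c_2, \qquad f''(x) = \frac{c_1}{x^2} \ge 0 ,
% so f is convex (claim 1).
% If c_1 > 0, then f'(x) = 0 only at x = c_1 / c_2, and by convexity
f(x) \ge f\Big(\frac{c_1}{c_2}\Big) = -c_1 \ln\frac{c_1}{c_2} + c_1 .
% If c_1 = 0, then f(x) = c_2 x > 0, which equals the bound under 0 \ln 0 = 0.
% Since c_2 > 0, the linear term dominates the logarithm as x \to +\infty,
% so f(x) \to +\infty (claim 2); if c_1 > 0, then -c_1 \ln x \to +\infty
% as x \to 0+ while c_2 x \to 0, so f(x) \to +\infty (claim 3).
```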

Finally, we briefly present the proofs of Theorem 1 and Theorem 2, since the argument is similar to the one in Proposition 4 and Proposition 5.

Proof of Theorem 1.

Lemma 2 implies that the sequence $\bm{\omega}^{(T,\epsilon,\lambda)}$ , $T>0$ , is bounded. Let $\bar{\bm{\omega}}$ be a limit point of $\bm{\omega}^{(T,\epsilon,\lambda)}$ as $T\rightarrow+\infty$ . Similar to (35), let us define

[TABLE]

For any $\bm{\omega}\in\mathbb{R}^{N}$ , using a similar derivation as (1), we can obtain

[TABLE]

as well as

[TABLE]

Since $\bm{\omega}^{(T,\epsilon,\lambda)}$ is the minimizer of the problem (36), we have

[TABLE]

Taking the limit $T\rightarrow+\infty$ in the inequality above, using (74), (97) and (98), we obtain

[TABLE]

In particular, choosing $\bm{\omega}=\bm{\omega}^{*}$ , we get

[TABLE]

which implies $a_{i}^{(0)}\big{(}x\,;\,\bar{\bm{\omega}}\big{)}=a_{i}^{(0)}\big{(}x\,;\,\bm{\omega}^{*}\big{)}$ , $\forall~{}1\leq i\leq K$ and $\forall~{}x\in\mathbb{X}$ . From the uniqueness of $\bm{\omega}^{*}$ (Assumption 5), we know $\bar{\bm{\omega}}=\bm{\omega}^{*}$ and therefore the convergence $\lim\limits_{T\rightarrow+\infty}\bm{\omega}^{(T,\epsilon,\lambda)}=\bm{\omega}^{*}$ is obtained. ∎

Proof of Theorem 2.

First of all, the assumption (75) implies $\lim\limits_{T\rightarrow+\infty}\lambda(T)=0$ . Therefore, Theorem 1 assures the almost sure convergence of the sequence $\bm{\omega}^{(T,\epsilon,\lambda)}$ to $\bm{\omega}^{*}$ .

The same identity (91) still holds for $\mathcal{M}^{(T,\epsilon)}_{j}$ and $\bm{\omega}^{(T,\epsilon,\lambda)}$ in the current setting. Similar to (93), using the relation (41), in the current case we can obtain

[TABLE]

where the vector $\bm{v}^{(T)}\in-\partial|\bm{\omega}|(\bm{\omega}^{(T,\epsilon,\lambda)})$ is bounded, $\mathcal{W}^{(T)}\in\mathbb{R}^{N}$ is given by

[TABLE]

for $1\leq j\leq N$ , and $\mathcal{B}^{(T)}\in\mathbb{R}^{N\times N}$ is given by

[TABLE]

if there is an index $i$ , $1\leq i\leq K$ , such that $j,j^{\prime}\in\mathcal{I}_{i}$ , and otherwise $\mathcal{B}_{j,j^{\prime}}^{(T)}=0$ . See (71) and (72).

Due to the second assumption in Theorem 2, we can find another constant $c^{\prime}>0$ such that $\big|\sum\limits_{j\in\mathcal{I}_{i}}\omega_{j}^{*}\varphi_{j}(x)\big|\geq c^{\prime}$, for all $x\in\mathbb{X}$ and $1\leq i\leq K$, unless $\varphi_{j}(x)=0$ for all $j\in\mathcal{I}_{i}$. Applying Lemma 3 in Appendix B, we have

[TABLE]

Since $\epsilon=\mathcal{O}(T^{-\alpha})$ and the functions $\varphi_{j}$ are bounded, the second terms in the expressions of both $\mathcal{W}_{j}^{(T)}$ in (100) and $\mathcal{B}_{j,j^{\prime}}^{(T)}$ in (101) converge to zero as $T\rightarrow+\infty$ . At the same time, Lemma 4 in Appendix B implies that $\lim\limits_{\epsilon\rightarrow 0+}(\ln G_{\epsilon})^{\prime}(x)=\frac{1}{x}$ and $\lim\limits_{\epsilon\rightarrow 0+}(\ln G_{\epsilon})^{\prime\prime}(x)=-\frac{1}{x^{2}}$ , uniformly on $x\geq c^{\prime}>0$ . Applying Lemma 6 in Appendix C, we know that the vector $\mathcal{W}^{(T)}$ converges in distribution to a Gaussian random variable with zero mean and covariance matrix given by $\mathcal{F}$ . Since $\lim\limits_{T\rightarrow+\infty}\bm{\omega}^{(T,\epsilon,\lambda)}=\bm{\omega}^{*}$ almost surely, Lemma 5 in Appendix C implies $\lim\limits_{T\rightarrow\infty}\mathcal{B}^{(T)}=\mathcal{F}$ , a.s. Applying Slutsky’s Theorem [19] and using the assumption (75), we can conclude

[TABLE]

∎

Remark 8.

The second assumption in Theorem 2 is used to handle the second terms of $\mathcal{W}_{j}^{(T)}$ in (100) and $\mathcal{B}_{j,j^{\prime}}^{(T)}$ in (101). It is not needed if $a^{*}_{i}(x)>0$ for all $x\in\mathbb{X}$ and $1\leq i\leq K$ .

Acknowledgements

This work is funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany’s Excellence Strategy — The Berlin Mathematics Research Center MATH+ (EXC-2046/1, project ID: 390685689). The authors also acknowledge financial support from the Einstein Center of Mathematics (ECMath) through project CH21.

Bibliography

[1] D. F. Anderson and T. G. Kurtz. Continuous time Markov chain models for chemical reaction networks. In H. Koeppl, G. Setti, M. di Bernardo, and D. Densmore, editors, Design and Analysis of Biomolecular Circuits: Engineering Approaches to Systems and Synthetic Biology, pages 3–42. Springer New York, New York, NY, 2011.
[2] D. Angeli. A tutorial on chemical reaction network dynamics. Eur. J. Control, 15(3):398–406, 2009.
[3] M. Ashyraliyev, Y. Fomekong-Nanfack, J. A. Kaandorp, and J. G. Blom. Systems biology: parameter estimation for biochemical models. FEBS J., 276(4):886–902, 2009.
[4] A. Bagirov, N. Karmitsa, and M. M. Mäkelä. Introduction to Nonsmooth Optimization: Theory, Practice and Software. Springer Publishing Company, Incorporated, 2014.
[5] K. Ball, T. G. Kurtz, L. Popovic, and G. Rempala. Asymptotic analysis of multiscale approximations to reaction networks. Ann. Appl. Probab., 16(4):1925–1961, 2006.
[6] A.-L. Barabási and Z. N. Oltvai. Network biology: Understanding the cell’s functional organization. Nat. Rev. Genet., 5:101–113, 2004.
[7] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci., 2(1):183–202, 2009.
[8] L. Boninsegna, F. Nüske, and C. Clementi. Sparse learning of stochastic dynamical equations. J. Chem. Phys., 148(24):241723, 2018.
