Using biochemical circuits to approximately compute log-likelihood ratio for detecting persistent signals
Chun Tung Chou

TL;DR
This paper explores how biochemical circuits can be designed to approximate the computation of log-likelihood ratios for detecting persistent signals, using stochastic modeling and gene circuit implementations.
Contribution
It introduces a method to approximate log-likelihood ratio computation for persistent signal detection using biochemical circuits, based on stochastic modeling and detection theory.
Findings
Biochemical circuits can approximate log-likelihood ratios for signal detection.
Time-scale separation allows simplified computation of detection statistics.
Coherent feedforward gene circuits can implement the detection algorithm.
Abstract
Given that biochemical circuits can process information by using analog computation, a question is: What can biochemical circuits compute? This paper considers the problem of using biochemical circuits to distinguish persistent signals from transient ones. We define a statistical detection problem over a reaction pathway consisting of three species: an inducer, a transcription factor (TF) and a gene promoter, where the inducer can activate the TF and an active TF can bind to the gene promoter. We model the pathway using the chemical master equation so the counts of bound promoters over time is a stochastic signal. We consider the problem of using the continuous-time stochastic signal of the counts of bound promoters to infer whether the inducer signal is persistent or not. We use statistical detection theory to derive the solution to this detection problem, which is to compute the…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Using biochemical circuits to approximately compute log-likelihood ratio for detecting persistent signals
Chun Tung Chou
School of Computer Science and Engineering, University of New South Wales, Sydney, New South Wales 2052, Australia.
E-mail: [email protected]
Abstract
Given that biochemical circuits can process information by using analog computation, a question is: What can biochemical circuits compute? This paper considers the problem of using biochemical circuits to distinguish persistent signals from transient ones. We define a statistical detection problem over a reaction pathway consisting of three species: an inducer, a transcription factor (TF) and a gene promoter, where the inducer can activate the TF and an active TF can bind to the gene promoter. We model the pathway using the chemical master equation so the counts of bound promoters over time is a stochastic signal. We consider the problem of using the continuous-time stochastic signal of the counts of bound promoters to infer whether the inducer signal is persistent or not. We use statistical detection theory to derive the solution to this detection problem, which is to compute the log-likelihood ratio of observing a persistent signal to a transient one. We then show, using time-scale separation and other assumptions, that this log-likelihood ratio can be approximately computed by using the continuous-time signals of the number of active TF molecules and the number of bound promoters when the input is persistent. Finally, we show that the coherent feedforward gene circuits can be used to approximately compute this log-likelihood ratio when the inducer signal is persistent.
Keywords:
Statistical signal processing, signal detection, molecular computing, analog computation, biochemical circuits.
I Introduction
The fact that living organisms use biomolecular circuits for information processing has provided much inspiration for engineers to design new biomolecular circuits. Firstly, engineers have been inspired to use engineering theory to design synthetic circuits. E.g., control theory has been used to design circuits for homeostatic control [Teo:2019bk], concentration regulation [Pasotti:2019fi] and to counter the effect of loading [McBride:2019is]; signal processing theory has been used to design molecular circuits that can be used for communications [Kuscu:2019ey, Chou:2019gf, Awan:2018ig] and filtering [Zechner:2016fb]. Secondly, one can view the information processing carried out by biomolecular circuits as analog computation [Sarpeshkar:2014dg, Teo:2015kva]. This view point has helped engineers to design molecular circuits that can perform logarithmic sensing [Daniel:2013ke], parity check [Marcone:2018kp] and integral control [Briat:2016ds]. This paper considers the problem of persistence detection, i.e. the problem of deciding whether a particular chemical species has been present in sufficient quantity for a long enough time. In the natural world, the bacteria Escherichia coli (E. coli) perform persistence detection [Mangan:2003ia]. In the synthetic world, cell-based therapy [Mimee:2015jo] can make use of persistence detection on biomarkers. In this paper, we will solve the persistence detection problem by using statistical detection theory [Kay_v2] and show that a gene circuit (which is specific type of biochemical circuit) can be used to approximately compute the solution of this detection problem.
A gene circuit that can perform persistence detection in E. coli is the Coherent Type-1 Feedforward Loop with an AND logic at the output [ShenOrr:2002jo] or the C1-FFL for short. The C1-FFL is a network motif and is a frequently found circuit in both E. coli and Saccharomyces cerevisiae (yeast) [Milo:2002cg, ShenOrr:2002jo]. This means that the C1-FFL carries out important functions in cells. The authors in [ShenOrr:2002jo] showed that the C1-FFL can act as a persistence detector. They did this by modelling the gene expression in the C1-FFL by using ordinary differential equations (ODEs) and show that a persistent (resp. transient) input to the C1-FFL will result in a high (zero) output.
The paper [ShenOrr:2002jo] took a deterministic approach to understand persistence detection. Given that the biochemical environment is stochastic, it is therefore necessary to understand how cells can infer information on the environment from a stochastic point of view [Libby:2007fw]. This paper considers a reaction pathway consisting of three chemical species: an inducer, a transcription factor (TF) and a gene promoter. In this reaction pathway, the inducer can activate the TF and the activated TF can bind with the gene promoter. In order to model a stochastic biochemical environment, we model the reaction pathway using the chemical master equation [Gardiner]. We consider a detection problem whose aim is to infer whether the inducer signal is persistent or not by using the signal of the number of bound promoters over time. According to detection theory, the solution to this detection problem is to compute a log-likelihood ratio and we derive an ODE which describes the evolution of this log-likelihood ratio over time. In order to connect this ODE to the C1-FFL, we use time-scale separation and other assumptions to derive an intermediate approximation which is an ODE that can approximately compute the log-likelihood ratio for persistent signals. We then show that this intermediate approximation can be realised by using a C1-FFL. The key contribution of this paper is to show that it is possible to find the parameters of a C1-FFL so that its output is approximately equal to the log-likelihood ratio of statistical detection problem when the inducer signal is persistent. More specifically, the aim of this statistical detection problem is to detect whether the inducer signal is persistent in an inducer-TF-gene pathway. In addition, the methodology in this paper can be useful for designing synthetic molecular circuits for performing other signal processing tasks.
This paper makes advances compared to our previous work [Chou:2018jh]. In comparison to [Chou:2018jh], this paper makes two different assumptions: (i) This paper considers an inducer-TF-gene pathway but [Chou:2018jh] considered only an inducer-TF pathway; (ii) This paper assumes that the inducer signal is stochastic while [Chou:2018jh] assumed that the inducer signal is deterministic. These two different assumptions mean that new methodologies are needed to show that the C1-FFL can be used to approximately compute the log-likelihood ratio of a persistence detection problem. First, we need to show how the log-likelihood ratio can be computed exactly (Sec. III-B) in an inducer-TF-gene pathway. Second, we need to derive a new method to approximately compute the log-likelihood ratio (Sec. LABEL:sec:approx_llr). In particular, this paper needs to approximate the solution to a Bayesian filtering problem [Bronstein:2018eh] but this is not required in [Chou:2018jh] as it considered a deterministic inducer signal. Third, we need to derive a method to show how the parameters of the approximate log-likelihood ratio computation can be mapped to the C1-FFL parameters (Sec. LABEL:sec:c1ffl). These three aspects are the new elements of this paper in comparison to [Chou:2018jh]. Furthermore, in comparison with our earlier conference paper [Chou:2019ur], this paper explains why a C1-FFL can be used to approximately compute the approximate log-likelihood ratio (Sec. LABEL:sec:c1ffl) and provide full derivation on the computation of exact and approximate log-likelihood ratios (Appendices LABEL:app:sol:dp and LABEL:app:ia).
The rest of this paper is organised as follows. Sec. II presents background information on the C1-FFL. We then define the detection problem and present its solution in Sec. III. After that in Sec. LABEL:sec:approx_llr, we present a method to approximately compute the log-likelihood ratio and use this approximation in Sec. LABEL:sec:c1ffl to show that the C1-FFL can be used to approximately compute the the log-likelihood ratio when it is positive. Finally, Sec. LABEL:sec:final presents a discussion and concludes the paper.
II Background on C1-FFL
The C1-FFL can be depicted as a network where each link is associated with a signal and each node transforms the input signal(s) into an output signal. Fig. 1 shows the network of the C1-FFL. The input signal is and output signal is . Both and are intermediate signals, and is an external signal.
The C1-FFL in Fig. 1 is an abstraction of the molecular interactions which are depicted in Fig. 2. In the figure, both S and \ceeS_y are inducers. Both X and Y are TFs, which are expressed by their corresponding gene. The inducer S (resp. \ceeS_y) turns the inactive form X (Y) into its active form \ceeX_* (\ceeY_). The activation of gene Z requires the binding of both \ceeX_ and \ceeY_* to the promoter of Z, i.e. the AND gate in Fig. 1.
Note that there is a one-to-one correspondence between the chemical species in Fig. 2 with their corresponding time signals in Fig. 1, e.g. is the concentration of \ceeX_* at time and so on. In this paper, we will assume that the inducer \ceeS_y is always present and its concentration is always above the threshold needed to activate Y. Furthermore, we assume the activation of Y by \ceeS_y is fast, this allows us to write and we will use for from now on. By using Hill function to model the gene expression, [Mangan:2003ia] presents an ODE model for the C1-FFL, as follows:
[TABLE]
where , , and are reaction rate constants; , , , , , , , and are coefficients for Hill functions , and . Lastly, is the constant . The multiplication of and on the right-hand side (RHS) of (1c) implements the AND gate in Fig. 1. With suitably chosen parameter values, the C1-FFL in (1) acts as a persistence detector in the sense that if the input signal is a persistent (resp. transient), then the output has a high (low) value.
III Statistical detection on a reaction pathway
Our aim is to consider a statistical detection problem to determine whether the input signal is persistent or not. However, in this section, we will consider a more general detection problem because it can readily be solved and we will specialise it to persistence detection in Sec. LABEL:sec:approx_llr. This section is divided into two parts. We define the detection problem in Sec. III-A and present its solution in Sec. III-B.
Convention: In this paper, we use upper case letters to denote a chemical species, e.g. S, \ceeX_* etc. For each chemical species, there are two corresponding continuous-time signals based on its concentration and molecular counts. E.g. for the chemical species \ceeX_*, we denote its concentration over time as (note: lower case ) and its molecular counts over time is (note: upper case ).
III-A Detection problem
In order that we can connect the detection problem to the C1-FFL later on, we will define the detection problem using a reaction pathway which is a subset of the C1-FFL species and reactions in Fig. 2. We have depicted the reaction pathway used in the detection problem in Fig. 3. The reaction pathway consists of five chemical species: S, inactive \ceeX and its corresponding active form \ceeX_, as well as inactive and the complex which is formed by the binding of \ceeX_ to . These five species take part in the following four chemical reactions:
[TABLE]
where , , and are reaction propensity constants. For the time being, we will make the simplifying assumption that the volume scaling needed to convert between propensity and reaction rate constants is 1. This simplification allows us to equate propensity constants with reaction rate constants. We will explain how non-unit volume can be dealt with in Remark LABEL:re:v. With this assumption, note that and in (2a) and (2b) are equal to those in (1a).
In terms of molecular biology, \ceeS is an inducer and \ceeX is a TF. In Reaction (2a), the species \ceeS activates \ceeX to produce \ceeX_*. Reaction (2b) is a deactivation reaction. The reactions (2a) and (2b) are depicted in both Figs. 2 and 3.
The species is a gene. In fact, in Fig. 3 is the same as \ceeZ in Fig. 2. Note that Fig. 2 follows the standard convention in molecular biology where a gene and the protein that it expresses are given the same symbol \ceeZ. However, in this paper, we need different symbols for the gene and the protein that the gene expresses so that we can clearly distinguish their corresponding time signals. Therefore, we have chosen to use to denote the gene and use \ceeZ to denote the protein expressed by . In Reaction (2c), an active \ceeX_* binds with the promoter of to produce the complex . Lastly, Reaction (LABEL:cr:off2) is an unbinding reaction.
Let , , , and denote, respectively, the number of \ceeS, \ceeX, \ceeX_*, and molecules at time . Note these signals are piecewise constant because they are molecular counts. We assume that (resp. ) is a constant for all and we denote this constant by ().
We will refer to as the input signal. The goal of the detection is to use the signal to determine whether the input is persistent or not. We assume that the signal is generated by some species and chemical reactions upstream of (2). We will model these upstream reactions and (2) using the chemical master equation [Gardiner] which is a specific type of continuous-time Markov chain. We further assume that the upstream species do not react with \ceeX, \ceeX_, and . Intuitively, this means we can predict the behaviour of \ceeX, \ceeX_, and from that of \ceeS.
We have now defined the reaction pathway and its model. The next task is to specify the measured data and the hypotheses for the detection problem. The measured datum at time is . However, in the formulation of the detection problem, we will assume that at time , the data available to the detection problem are for all ; in other words, the data are continuous in time and are the history of the counts of up to time inclusively. We will use to denote the continuous-time history of up to time inclusively.
We now specify the hypotheses for the detection problem. Later on, we will identify and with, respectively, transient and persistent signals. However, at this stage, we want to solve the detection problem in a general way. We assume that and are two distinct subsets of the set of all possible . Intuitively, the aim of the detection problem is to decide which signal class or is more likely to have produced the observed history.
We remark that in the definition of the detection problem, the input signal is not directly observable. Since \ceeS reacts with the molecules in the reaction pathway in Fig. 3, the downstream signal contains information on . The aim of the detection problem is to infer the information on from this downstream signal. Given that we model the chemical system with the chemical master equation, both signals and are noisy.
III-B Solution to the detection problem
The aim of the detection problem is to decide which hypothesis is more likely to have generated the observed history .
We assume that the two hypotheses are equally likely, therefore the log-ratio of posteriori probabilities is equal to the log-likelihood ratio :
[TABLE]
where is the conditional probability of observing the history given hypothesis .
In Appendix LABEL:app:sol:dp, we show that the time evolution of is given by the following ODE:
[TABLE]
where , denotes the expectation and is the conditional expectation of given the history and . Note that in deriving (2d), we assume that the hypotheses have been properly chosen so that (for and ) which in turn implies that is well defined.
Since is a piecewise constant function counting the number of molecules, its derivative is a sequence of Dirac deltas at the time instants that forms or unbinds. Note that the Dirac deltas corresponding to the formation of carries a positive sign and the operator keeps only these. Fig. 4 shows an example and its corresponding .
At the time instant that is formed or unbind, the expectation also has a jump in value at time . The term in (2d) refers to the value of just before the jump. Lastly, we assume that the two hypotheses are a priori equally likely, so .
We next present a numerical example to illustrate the properties of (2d) and to explain what information is important for persistence detection. This example will also provide some intuition on how we will approximately compute the log-likelihood in Sec. LABEL:sec:approx_llr.
III-B1 Numerical example
The aim of this example is to show that we can use the log-likelihood ratio computed by (2d) to distinguish a persistent signal from a transient one. In order to conduct the numerical study, we will assume that the inducer is produced and degraded by the following chemical reactions:
[TABLE]
(Step 1) The aim of this step is to approximate in (2d). The computation of this expectation requires the solution of a Bayesian filtering problem which is computationally intensive. We will argue that if and are small, then we can use the approximation:
[TABLE]
where , and and are defined in (LABEL:eq:ia:Xi). In other words, we approximate by a rectangular pulse.
In order to argue for (2fat), we start by deriving the solution of the Bayesian filtering problem of using the history to determine the posteriori probability distribution . The typical temporal behaviour of is depicted in Fig. 11. We can see that at the time instants where binds or unbinds, the probability has a discrete jump; at other times, i.e. between two consecutive jumps, the probability varies continuously. By using [Bronstein:2018eh, Eq. (21)], we can show that the time evolution of the continuously varying part of is governed by the following ODE:
[TABLE]
where
[TABLE]
If is small, then the last term in (2fau) can be viewed as a perturbation to the ODE:
[TABLE]
This ODE is in fact the chemical master equation which governs the reactions (2f), (2) with . For small , we have . In addition, if both and are small, then the time between two jumps in is long; this further implies that is mostly around steady state, e.g. see Fig. 11 in the time interval [320,386]. Overall, this means that we can approximate by using the steady state mean number of \ceeX_* molecules assuming . When , the steady state mean number of \ceeX_* molecules is approximately given by the expression in (LABEL:eq:ia:Xi). Hence (2fat).
After using the approximation in Step 1, (2d) becomes:
[TABLE]
where is defined in (LABEL:eq:pi).
We now demonstrate the accuracy of (2fat) and (2fax) using numerical examples. We use the same parameter values as in Sec. LABEL:sec:int_approx:numer. Fig. 12(a) plots the two sides of (2fat) for the two hypotheses. It can seen that, other than the transients, the approximation is fairly accurate. Next, we use Fig. 12(b) to demonstrate the accuracy of (2fax). By using one realisation of , we calculate the exact log-likelihood ratio (2d) and the approximation (2fax). The figure shows that the approximation is accurate. In addition, the figure also shows the RMS error between for 100 realisations of . It can be seem that the RMS error is small. In order to show that the approximation holds for different parameter settings, we use a different value and plot the result in Fig. 12(c).
(Step 2) The aim of this step is to replace and in (2fax) by alternative expressions. Since in (2fax) is zero for , we only have to consider input signals whose duration . Recall that we assume in Sec. LABEL:sec:assumptions that the pathway (2) reaches steady state by if the input has a duration of at least . This means that the probability distributions of \ceeX_* and are in steady state in the time interval
Note that the RHS of (2fax) is the sum of two terms that do not depend on . We can therefore consider the contribution of each term to separately.
First, we consider the contribution of the first term on the RHS of (2fax) to , which we will call :
[TABLE]
By using a method similar to that in [Chou:2019gf], which is based on the renewal theorem [Grimmett], we can show that (2fay) can be approximated by:
[TABLE]
Next, we consider the contribution of the second term on the RHS of (2fax) to , which we will call :
[TABLE]
By integrating the above equation, we have for ; for , we have:
[TABLE]
Since the reaction pathway is in steady state in the time interval , we can replace the time average in (2fbb) by its ensemble average. In this part, we will overload the symbol to use it to refer to the random variable of the number of molecules at steady state. This should not cause any confusion because the meaning should be clear from the context. In addition, we will overload the symbol in the same way. With this overloading, the mean number of \ceeX_* and molecules at steady state are denoted by and respectively. We can now rewrite (2fbb) as:
[TABLE]
In order to be able to connect to the C1-FFL, we will need to replace the expression by a different expression. The derivation of this replacement expression requires a few auxiliary results.
(Auxiliary Result 1) By considering the global balance of the steady state of the reaction pathway (2), we have:
[TABLE]
(Auxiliary Result 2) If the amplitude of the input is sufficiently high, then the number of \ceeX_* molecules can be approximately modelled as a binomial distribution with trials and a success probability of .
Consider a binomial distribution with parameters (number of trials) and (success probability), then for sufficiently large and , we have
[TABLE]
where
[TABLE]
This result essentially says that the mean of the reciprocal of a binomial random variable (with excluded) is approximately equal to the reciprocal of the mean of the binomial random variable. If and , the binomial distribution has a single outcome with a non-zero probability so (2fbe) is exact. Intuitively, if a probability has a single modal distribution with a narrow spread, then (2fbe) holds approximately. For , the relative error of using (2fbe) is 3.21% for and drops to 1.87% for . In general, the approximation is better for large and .
(Auxiliary Result 3) Since the inducer-TF reactions are faster than the TF-gene reactions and , we can show that:
[TABLE]
We will argue that the above approximation holds by using time average to compute . Let , , be a sequence of time instants at which changes its value. Since the continuous-time Markov chain associated with the chemical system is ergodic, we have:
[TABLE]
Since the TF-gene reactions are slow in comparison, is a slow species while \ceeX_* is a fast species. This means the time interval (during which the count of is a constant) is likely to be long compared to the time-scale of the fast species \ceeX_*. This allows us to approximate the integral on the RHS of (2fbj) by . Hence (2fbi). Note that the above argument is identical to the one used in [Cao:2005gj] to derive the slow-scale tau-leaping simulation algorithm.
(Auxiliary Result 4) By using the same argument as in Auxiliary Result 3, we can show that:
[TABLE]
We will now use the above auxiliary results and (2fbc) to derive the replacement expression. By using Auxiliary Results 1 and 3, we have:
[TABLE]
We then apply Auxiliary Result 2 to the RHS of (2fbl) to obtain:
[TABLE]
By applying Auxiliary Result 4 to the RHS of (2fbm), we have:
[TABLE]
By substituting (2fbn) into (2fbc), we have:
[TABLE]
By turning the above equation into the differential form, we have:
[TABLE]
Next, by combining (2faz) and (2fbp), we have:
[TABLE]
(Step 3) Since a set of chemical reactions can be modelled by a set of ODEs, we want to turn the ODE in (2fbq) into a form that can be implemented by a set of chemical reactions. However, (2fbq) cannot be directly implemented by chemical reactions because log-likelihood ratio can take both positive and negative values but chemical concentration is always non-negative. Although [Oishi:2011ig] has derived a chemical computation system that can have both positive and negative numbers, it requires double the number of species and reactions. As in our previous work [Chou:2019gf, Chou:2018jh], we choose to compute only the log-likelihood ratio when it is positive. We do that by applying to the RHS of (2fbq); we have:
[TABLE]
We now replace in (2fbr) by to obtain:
[TABLE]
The removal of will not make much difference because the probability of having equals to 0 is small when the input signal is persistent. Note that (2fbs) is the same as (LABEL:eq:Lfinal). This completes the derivation for (LABEL:eq:Lfinal).
In order to derive (LABEL:eq:Lfinal_mean), we start from (2fbs) and take expectation on both sides. If the amplitude is sufficiently high, then there is a high probability that is large. This means we can take the expectation operator to the inside of the operator. After that we apply Auxiliary Results 2, 3 and 4 to obtain (LABEL:eq:Lfinal_mean).
Appendix C Utility maximisation
In this section, we will show that the utility maximisation problem (LABEL:eq:max_util) leads to a detection criterion based on the likelihood ratio. The proof here uses the same method as Appendix 3A in [Kay_v2] for proving the Neyman-Pearson lemma.
By using Lagrangian multiplier , we rewrite (LABEL:eq:max_util) as the maximisation of:
[TABLE]
Let be the observations that are available to the detection problem. Also let and be two disjoint sets which are to be used as the decision regions for the detection problem. In particular, the detection problem will decide for hypothesis (resp. ) if (). Given these definitions, we can write and as:
[TABLE]
By substituting (2fbv) and (2fbw) into (2fbu), we have the objective to be maximised is:
[TABLE]
In order to maximise (2fbx), we should include in if the integrand in (2fbx) is positive. In other words, should be the set of all ’s such that:
[TABLE]
If then (2fby) is equivalent to:
[TABLE]
This shows that we can use a criterion based on the likelihood ratio to maximise the utility. Note that if the requirement holds, then a non-empty may be found because the benefit gained (or utility) from deciding for outweighs the cost required. Otherwise, if , (2fby) suggests that should be an empty set.
Note that (2fbx) can be interpreted as the maximisation of subject to an upper bound on , which is the same class of optimisation formulation that the Neyman-Pearson lemma considers.
We remark that we can see from the above derivation that if both the mean utility and mean cost are linear in the probabilities , , and , then the utility maximisation problem can be solved by using a criterion based on the likelihood ratio. We further remark that it is straightforward to generalise the above proof to the case with non-zero utility and cost.
Appendix D Necessity of delay
We use the method of contradiction to argue that there must be a delay in the cloud if . Let us assume that there is a “memoryless” function which can carry out the computation in the cloud, i.e.
[TABLE]
Since we assume that the pathway (2) reaches the steady state by , we can find time instants and where such that is at steady state at both times and , i.e. . Let us consider (2fca) at time instants and , we have:
[TABLE]
which is a contradiction because . This establishes that, if , there must be a delay in the cloud in Fig. LABEL:fig:compute:Lhat.
