# Stochastic simulation of biochemical reactions

Conventional study of chemical reactions tracks reactant concentrations. This is not a problem if the system of interest contains large number of molecules because all the information needed to describe it is effectively condensed in concentration values. For systems with small number of molecules, concentration values are not descriptive enough. This issue is of particular importance in biochemistry because the number of molecules inside a cell is certainly small.

In these cases, chemical reactions cannot be interpreted as differential equations followed by reactant concentrations. Instead, chemical reactions are interpreted as what they actually are, namely the conversion of a molecule or groups of molecules into different ones. Since occurrence of molecular reactions is largely determined by chance, e.g., a fortuitous encounter of reactants, stochastic modeling is warranted. Further note that it is reasonable to expect the probability of a particular reaction to depend only on the current numbers of molecules for each reactant. Therefore, if we regard numbers of molecules as the state of a stochastic system, chemical reactions can be reasonably modeled as a continuous time Markov chain (CTMC).

This idea is widely used to simulate chemical reactions. The rest of this page covers the following topics

- Lotka-Volterra predator-prey model. A simple example of how chemical reactions are modeled using differential equations.
- Gillespie's algorithm. Discuss stochastic modeling of Lotka-Volterra's equations and introduce the workhorse Gillespie's algorithm.
- Dimerization kinetics. Back and forth conversion between the a molecule and its dimer.
- Enzymatic reactions. Enzymes mediate the production of molecules from reactant components.
- A simple auto-regulatory gene network. An auto-regulatory process to control the amount of protein transcribed from a particular gene.
- Lac operon. The mechanism to balance the digestion of glucose and lactose.

Matlab code for the examples discussed below is in this compressed folder. In the following we make references to files in this folder. Slides containing figures and brief explanations are here. Matlab code and slides were developed by Cameron Finucane during the summer of 2009.

## Lotka-Volterra predator-prey model

Lotka and Volterra's model involves a prey species $Y$ and a predator species $X$. The reactions that define the model describe the birth and death of prey and predator elements. As shorthand notation for the model we use

$R_1$ : | $Y$ | $\longrightarrow$ | $2Y$ |

$R_2$ : | $X + Y$ | $\longrightarrow$ | $2X$ |

$R_3$ : | $X$ | $\longrightarrow$ | $\emptyset$ |

The first reaction $R_1$ denotes spontaneous reproduction of prey $Y$. Reaction $R_2$ denotes consumption of a prey molecule $Y$ to generate a copy of the predator molecule $X$. Reaction $R_3$ denotes spontaneous death of predators $X$. This is a very simple model but suffices for illustration purposes. Notice that the reactions as summarized in $R_1$-$R_3$ are not enough to describe the dynamic behavior of the system for which we need to specify the rate at which the reactions occur. Therefore, we need to specify rates $r_1(x,y)$, $r_2(x,y)$, and $r_3(x,y)$ that determine how likely a reaction is to occur.

Classical chemodynamics starts rewriting the above equations to describe the rate of change of predator and prey concentrations. Let $y(t)$ denote the concentration of prey and $x(t)$ the concentration of predator at time $t$. The rates of change in predator and prey concentration are given by the time derivatives $dx(t)/dt$ and $dy(t)/dt$. The concentration of prey $y(t)$ increases when reaction $R_1$ occurs and decreases when reaction $R_2$ occurs. Considering that $R_1$ happens at a rate $r_1(x,y)$ and $R_2$ at a rate $r_2(x,y)$ the rate of change $dy(t)/dt$ in prey concentration is

$dy(t)/dt$ | $=$ | $r_1(x,y)$ | $-$ | $r_2(x,y)$ |

Likewise, predator numbers increase upon reaction $R_2$ and decrease with reaction $R_3$. Because these reactions happen at rates $r_2(x,y)$ and $r_3(x,y)$ it follows that

$dx(t)/dt$ | $=$ | $r_2(x,y)$ | $-$ | $r_3(x,y)$ |

There are reasons to expect reaction rates $r_i$ to be proportional to the concentration of reactants involved in the reaction. That is, we expect $r_1(x,y)=c_1 y$ to be proportional to the prey's concentration $y$, $r_2(x,y)=c_2 xy$ proportional to the product of both concentrations $x$ and $y$ and $r_3(x,y)=c_3 y$ to be proportional to the predator's concentration $y$. Regardless of what is reasonable to expect there is abundant empirical evidence that this is true. Substituting these expressions into the rate change equations above we obtain the pair of differential equations

$dy(t)/dt$ | $=$ | $c_1 y$ | $-$ | $c_2 xy$ |

$dx(t)/dt$ | $=$ | $c_2 xy$ | $-$ | $c_3 x$ |

The above system of ordinary differential equations (ODE) are the Lotka-Volterra's ODEs. To simulate a deterministic Lotka-Volterra system use the code in the file ssas_lotka_volterra_deterministic.m.

## Gillespie's algorithm

The differential equations in the previous section are accurate if predator and prey numbers can be accurately inferred from respective concentrations. As expressed before this is not true if the number of reactants is small. In such case randomness might yield distinctive behaviors that can only be captured through a stochastic model. To introduce such stochastic models reconsider the prey-predator model

$R_1$ : | $Y$ | $\longrightarrow$ | $2Y$ |

$R_2$ : | $X + Y$ | $\longrightarrow$ | $2X$ |

$R_3$ : | $X$ | $\longrightarrow$ | $\emptyset$ |

where Reaction $R_1$ denotes spontaneous reproduction of prey $Y$, $R_2$ consumption of a prey molecule $Y$ to generate a copy of the predator molecule $X$ and $R_3$ spontaneous death of predators $X$. To complete the description we interpret $X$ and $Y$ as the number of molecules of each species and introduce hazard functions $h_1(X,Y)$, $h_2(X,Y)$, and $h_3(X,Y)$.

The definition of the hazard function $h_i(X,Y)$ is as follows. Given that the number of molecules are $X$ and $Y$ at time $t$, the probability of reaction $R_i$ occurring in an infinitesimal time interval $(t, t+dt]$ is given by $h_i(X,Y)dt$. As a consequence of this definition, if no other reaction occurs, it follows that the time $s_i$ that elapses until the next occurrence of reaction $R_i$ is exponentially distributed with parameter $h_i(X,Y)$. This is a basic property of continuous times Markov chains. Notice, though, that other reactions are likely to occur and that as a consequence, time $s_i$ is not exponentially distributed. However, note that the probability of any reaction occurring in the infinitesimal time interval $(t, t+dt]$ is given by $h_1(X,Y)dt + h_2(X,Y)dt + h_3(X,Y)dt$. Thus, upon defining the aggregate hazard $h(X,Y) = h_1(X,Y) + h_2(X,Y) + h_3(X,Y)$ we can model the time $s$ until the next reaction as exponentially distributed with parameter $h(X,Y)$. Therefore, in a stochastic model of the above predator-prey relations, times $s$ between reactions are exponentially distributed with parameter parameters $h(X,Y)$. Once the inter-reaction time $s$ is determined the reaction type $R_i$ is determined by drawing a decision variable with probabilities proportional to the ratio $h_1(X,Y)/h(X,Y)$, $h_2(X,Y)/h(X,Y)$ and $h_3(X,Y)/h(X,Y)$.

To complete the model we need to specify the reaction hazards $h_i(X,Y)$. Reaction $R_1$ happens spontaneously when a prey molecule reproduces. Assuming that prey molecules act independently it is reasonable to assume that the hazard $h_1(X,Y)=c_1Y$ is proportional to the number of prey molecules $Y$. Similarly, Reaction $R_2$ happens upon the chance encounter between a prey and a predator molecule. Notice that there are $XY$ possible pairs of prey and predator molecules. Thus, hazard $h_2(X,Y)=c_2XY$ is reasonably modeled as proportional to the product of number of molecules $X$ and $Y$. Similarly to $R_1$, reaction $R_3$ happens spontaneously, its hazard therefore modeled as $h_3(X,Y)=c_3X$ proportional to the number of predator molecules.

Implementation of this model yields Gillespie's algorithm for simulation of chemical reactions. The algorithm's steps are the following

- Initialize $t = 0$, $X = X(0)$, $Y = Y(0)$
- Calculate all hazards $h_i(X,Y)$
- Find the total hazard, $h(X,Y) = h_1(X,Y)+h_2(X,Y)+h_3(X,Y)$
- Find the time to the next reaction: $s \sim$ Exp[h(X,Y)]$
- Set $t = t + s$
- Find the index of the current reaction, where Pr[$R_i$] = h_i(X,Y)/h(X,Y) $
- Update the state vector to account for this reaction. If reaction $R_1$ is drawn, update $Y=Y+1$. If reaction 2 occurs update $X=X+1$ and $Y=Y-1$. If $R_3$ is drawn update $X=X-1$.
- Repeat from 2

The core steps in the above algorithm are: The random drawing of the next reaction time $s$ (step 4). Determination of the reaction type $R_i$ (step 6). Update of the reactant numbers $X$ and $Y$ (step 7). The remaining steps compute values that are needed to implement these steps. To simulate an stochastic version of the Predator-prey model discussed in the previous subsection use the file ssas_lotka_volterra_stochastic.m. To run this code you need the file ssas_gillespie.m that implements Gillespie's algorithm for a generically defined chemical equation.

## Dimerization kinetics

A dimer is a molecule composed of two identical (or similar) chemical structures. Each of the pieces of the dimer is called a monomer and can exist in free form. The dimer can separate into its two component monomers and two monomers may combine to form a dimer. The schematic representation of this process is

$R_1$ : | $D$ | $\longrightarrow$ | $2M$ |

$R_2$ : | $2M$ | $\longrightarrow$ | $D$ |

To complete the model we need to specify either reaction rates $r_1(M,D)$ and $r_2(M,D)$ for a deterministic model or hazards $h_1(M,D)$ and $h_2(M,D)$ for a stochastic framework. Reaction $R_1$ happens spontaneously on dimers. Its hazard is therefore $h_1(M,D)=c_1D$ proportional to the number of dimer molecules. For reaction $R_2$ we have to notice that it happens upon encounter of a pair of monomers. To form a monomer pair we have $M$ choices for the first member of the pair and $M-1$ for the second member. The number of ordered pairs is then $M(M-1)$ and the number of unordered pairs is $M(M-1)/2$ (there are 2 ordered pairs for each unordered pair). The hazard for reaction $R_2$ is then modeled as $h_2(M,D) = c_2M(M-1)/2$ proportional to the number of unordered monomer pairs. To simulate dimerization kinetics in a deterministic setting use the code in the file ssas_dimerization_deterministic.m. The counterpart code in a stochastic setting is implemented in the file ssas_dimerization_stochastic.m.

## Enzymatic reactions

Enzymes are molecules that mediate chemical reactions. A certain set of reactants can be converted into a certain set of products with some energy expenditure. The enzyme reduces the energy threshold for the reaction to occur. The simplest enzymatic reaction involves the conversion of a substrate $S$ into a product $P$. Enzymes $E$ react with the substrate to generate an intermediate product $I$. The intermediate product $I$ then liberates the enzyme $E$ to yield the final product $P$. A simple model of this process is given by the following set of reactions

$R_1$ : | $S + E$ | $\longrightarrow$ | $I$ |

$R_2$ : | $I$ | $\longrightarrow$ | $S+E$ |

$R_3$ : | $I$ | $\longrightarrow$ | $P+E$ |

Reaction $R_1$ models the combination of a substrate molecule $S$ and an enzyme $E$ to yield an intermediate specimen $I$. The latter may disassociate from the enzyme yielding either the original substrate $S$ (Reaction $R_2$) or the the final product $P$ (Reaction $R_3$). Hazards can be determined using the ideas in previous examples. It is ready to realize that hazards can be written as $h_1(S,E,I,P)=c_1 S E$, $h_2(S,E,I,P)=c_2 I$ and $h_3(S,E,I,P)=c_3 I$. Deterministic simulation of enzymatic reactions is in the file ssas_enzyme_deterministic. Stochastic simulation of enzymatic reactions is in ssas_enzyme_stochastic.m.

## Auto-regulatory gene network

The three cases studied so far, namely the predator prey model, dimerization kinetics and enzymatic reactions serve as introductory examples, but there are not really interesting differences between the deterministic and stochastic models. Stochastic effects can only be appreciated in full in systems that can be reasonably expected to involve just a handful of specimens.

A good example of this occurs in gene expression. Protein synthesis inside a prokaryotic cell starts with transcription of the corresponding gene into a messenger ribonucleic acid (mRNA). The mRNA then passes through a ribosome to assembly the protein. The number of mRNA molecules synthesized may well be just a few units. The mRNA is an intermediate molecule, the ultimate outcome of this process being the generation of protein. An interesting question is how is the production of protein started and halted at convenient concentrations. The answer to this question varies but in general there are self-regulating feedback loops controlling the production of protein. Roughly speaking, presence of an external stimulus triggers transcription of mRNA and subsequent protein synthesis. As the number of protein molecules increases a byproduct is generated. This byproduct has affinity to bind with the encoding gene thus repressing future transcriptions and halting protein synthesis. A simplified version of an auto-regulatory gene network is the following

Transcription: | $G$ | $\longrightarrow$ | $G+mRNA$ |

Assembly: | $mRNA$ | $\longrightarrow$ | $mRNA+P$ |

Dimerization: | $2P$ | $\longleftrightarrow$ | $D$ |

Repression: | $G+D$ | $\longleftrightarrow$ | $DG$ |

mRNA degradation: | $mRNA$ | $\longrightarrow$ | $\emptyset$ |

Protein degradation: | $P$ | $\longrightarrow$ | $\emptyset$ |

Transcription and assembly reactions model the two step protein synthesis. The gene $G$ is transcribed into $mRNA$ and the $mRNA$ is then used to synthesize protein $P$. Notice that $G$ is not altered upon transcription and that neither is $mRNA$ upon protein synthesis. Auto-regulation is through a dimer $D$ of the protein $P$. This dimer generates spontaneously from two protein molecules and may also dissociate spontaneously. This back and forth conversion is denoted by a double arrow in the schematic above. The dimer $D$ may attach to the gene $G$. When a dimer is attached to $G$, transcription is not possible and we say that the gene is repressed. We denote this as the generation of the molecule $DG$. The dimer may spontaneously separate from the gene, making transcription possible once again. This is modeled as the conversion of $DG$ into $D$ and $G$. In the repression reaction the arrow points in both directions to signify binding and unbinding of the dimer $D$ to the gene $G$. The model is completed by the degradation of mRNA and protein.

This mechanism auto-regulates production of protein $P$. As the number of protein molecules increases, so does the number of dimers. As the number of dimers increases the likelihood of gene repression increases. We then expect the transcription of mRNA to be halted most of the time. As mRNA molecules degrade the synthesis of protein slows and eventually stops. When this happens, protein degradation reduces the number of dimers. Gene repression becomes less likely, mRNA transcription resumes and consequently so does protein synthesis. The code to simulate this simple auto-regulatory gene network is in the file ssas_autoregulatory_gene_network.m.

## The lac operon

Auto-regulation in biological systems typically involves group of genes, the proteins they encode for and the byproducts of the reactions of this proteins with other reactants present in the medium. These mechanisms are called auto-regulatory gene networks. One of such networks is the lac operon formed by three adjacent genes that control the metabolism of lactose in some bacteria. Cells can only use glucose to generate energy, but they can reduce lactose to glucose if the latter is unavailable. Such digestion of lactose into glucose occurs in presence of the enzyme $\beta$-galactosidase. A simplified model of lactose digestion and glucose consumption is therefore

Lactose digestion: | $\beta G + L$ | $\longrightarrow$ | $\beta G + G$ |

Glucose consumption: | $G$ | $\longrightarrow$ | $\emptyset$ |

where we have use a simplified model of enzymatic reactions (compare with the model in enzymatic reactions. The first reaction models the reduction of lactose $L$ to glucose $G$ mediated by the enzyme $\beta G$. The second reaction models the consumption of glucose molecules $G$.

To enable lactose digestion as per the reactions above, cells have to produce the enzyme $\beta$-galactosidase, which in itself requires some energy expenditure. Thus, production of $\beta$-galactosidase is only justified when lactose is abundant and glucose scarce. This is controlled by the lac operon.

The constitutive genes on the lac operon are the promoter, the operator and the gene that encodes for $\beta$-galactosidase itself. These three genes are adjacent to each other. We note that there are in fact three different kinds of $\beta$-galactosidase, and three corresponding encoding genes, but for the discussion here we can regard them as a single gene. Promoters are integral parts of Deoxyribonucleic acid (DNA) transcription. Transcription of DNA into mRNA necessitates mediation of the enzyme RNA polymerase (RNAP). This enzyme binds to the promoter to initiate transcription of DNA into mRNA. If RNAP does not bind to the promoter, no transcription occurs. The operator is a gene located between the promoter and the lactose encoding sequence. If there is a molecule attached to the operator, it interferes with the action of the enzyme RNAP thereby halting transcription.

A model of this process is then to consider three forms of the the lac operon, the regular operon $Op$, the repressed operon $ROp$ and the activated operon $AOp$. At different rates, transcription of DNA is possible from each of these states, whereby we have the following set of reactions

Repressed transcription: | $ROp$ | $\longrightarrow$ | $ROp + mRNA$ |

Regular transcription: | $Op$ | $\longrightarrow$ | $Op + mRNA$ |

Activated transcription: | $AOp$ | $\longrightarrow$ | $AOp + mRNA$ |

The difference in the above reactions is their hazard. Transcription from the repressed operon $ROp$ is much slower than transcription from the regular operon $Op$ resulting in a much smaller hazard. Similartly, for transcription from the activated operon $AOp$ is much larger than the hazard for transcription from the regular operon $Op$. Notice that to complete the modeling of $\beta$-galactosidase production we need to add reactions to model the synthesis of protein from mRNA molecules. These reactions are

Protein synthesis: | $mRNA$ | $\longrightarrow$ | $mRNA + \beta G$ |

mRNA degradation: | $mRNA$ | $\longrightarrow$ | $\emptyset$ |

Enzyme degradation: | $\beta G$ | $\longrightarrow$ | $\emptyset$ |

The first reaction models enzyme synthesis while the other two model degradation of $mRNA$ and $\beta G$.

Besides promoter, operator and encoding genes, the control of $\beta$-galactosidase production involves a promoter protein and a repressor protein. Upstream of the lac operon is a gene coding for a repressor protein $R$ with affinity for the operator gene. This protein is always expressed. When no lactose is present, the repressor binds to the operator thus hindering mRNA transcription and resulting in low $\beta$-galactosidase production. When lactose is present, however, the repressor binds preferentially to lactose therefore not interfering with transcription leading to increased production of $\beta$-galactosidase. The reactions modeling repression of the lac operon are then

Operon repression: | $R + Op$ | $\longleftrightarrow$ | $ROp$ |

Repressor neutralization: | $R + L$ | $\longleftrightarrow$ | $RL$ |

The first reaction models the adherence of the repressor protein $R$ to the operon $OP$ to yield a repressed operon $ROp$. This reaction can be undone as signified by the double arrow above. The second reaction models the binding and unbinding of repressor specimens $R$ to lactose molecules $L$. When there is lactose present we expect the second reaction to occur more frequently than the first. This is not necessarily because the hazard is larger but due to the much larger number of lactose proteins $L$ with respect to the number of genes $Op$.

The second part of the control involves the catabolite activator protein (CAP) that when bound to the operon facilitates mRNA transcription of $\beta$-galactosidase. The amount of CAP present is inversely proportional to the amount of glucose. Hence, when glucose level decreases, the amount of CAP increases and the operon is likely to switch to its activated version. With the operon activated, the rate of mRNA transcription increases and as a consequence so does the production of $\beta$-galactosidase. The reactions to model this interaction are

Operon activation: | $CAP + Op$ | $\longleftrightarrow$ | $AOp$ |

CAP repression: | $CAP + G$ | $\longleftrightarrow$ | $CAPG$ |

The first reaction models the activation of the operon. The second reaction models binding of CAP to glucose $G$ so that the number of free CAP molecules changes in opposite direction to the number of glucose molecules present. We note that CAP does not actually bind to glucose, but for a preliminary model the reactions above suffice.

When lactose and glucose are present this control mechanism results in a distinctive diauxie pattern with glucose consumed first and lactose processed after glucose is depleted. Simulation of the lac operon is in the file ssas_lac_operon.m.