vignettes/stat0002-ch1c-stochastic-simulation-vignette-2.Rmd
This document provides further technical information to supplement the Stochastic simulation vignette.
The main purpose of this worksheet is to introduce you to a computational technique, called stochastic simulation, that your lecturers at UCL will use to illustrate statistical ideas and results. Our aims are to consider how stochastic simulation works, to provide some examples of its many uses in Statistics and to show you how to perform stochastic simulation using R.
One of our examples, A simple model for an epidemic, is included owing to its relevance to recent events, but involves some challenging material. What matters most here is the important role played by stochastic simulation. Therefore, we concentrate on this in the text and provide technical details in an appendix.
If you have already studied Statistics then you may be familiar with some of the terminology used below, such as “probability distribution” and “probability density function”. If not, then please do not worry: what matters at this stage is to appreciate what stochastic simulation is and why it is useful.
Stochastic means random. Simulate means to imitate or mimic.
Stochastic simulation uses computer-generated pseudo-random numbers to mimic a stochastic real event or dataset.
Pseudo-random numbers are not random, because they are produced by a deterministic algorithm. However, a good pseudo-random number generator will produce numbers that appear to us to be random and this may be sufficient for our purposes. Pseudo-random numbers are useful because we can produce a lot of them quickly and we can repeat the generation of the same numbers if we need to reproduce our results.
A common task is to simulate numbers that behave approximately like a random sample from a particular probability distribution, that is, a set of numbers that have the characteristic statistical properties of that distribution and are approximately unrelated to each other.
If you have studied Statistics before then you may have performed hypothesis tests and calculated confidence intervals. These are examples of statistical inference: drawing conclusions by analysing data. In some examples exact mathematical results can be derived with which to perform these tasks. Simple teaching examples are often based on random samples from a normal distribution. However, these examples are often too simplistic for real-world problems: they involve unrealistically strong assumptions. When we relax these assumptions to increase realism we tend to find that doing the maths becomes much more difficult. Often we need to be satisfied with mathematics that produces results that are not exact and only provide a good approximation to the truth if we have a lot of data.
Stochastic simulation can enable us to solve problems when we can't do the maths. When we can do the maths, it can also be used to check our maths.
Examples of this are: studying the distribution of a statistic calculated from data, its sampling distribution; checking that a statistical procedure has the properties that we expect; estimating the value of an integral; investigating the properties of a distribution, model or process by simulating samples from it.
For our purposes, knowing exactly how pseudo-random numbers are produced is far less important than understanding how these numbers can be useful. However, we provide information that conveys the general ideas.
The building block of stochastic simulation methods is the ability to simulate numbers (pseudo-)randomly from the interval (0,1), using a pseudo-random number generator (PRNG).
Early approaches used a linear congruential generator (LCG), based on the recurrence relation
$$X_{n+1} = (a X_n + c) \bmod m,$$
where $m$ is a positive integer and $a$ and $c$ are integers satisfying $0 < a < m$ and $0 \leq c < m$. $X_0$ is a known seed. $X_{n+1}$ is the remainder when $a X_n + c$ is divided by $m$ and so is an integer in $\{0, 1, \ldots, m - 1\}$ for all $n$. If we let $U_n = X_n / m$ then $U_n \in [0, 1)$ for all $n$.
Once the seed is set the numbers are clearly not random: each number is determined from the previous one. Also, the maximum number of unique values for $X_n$ is $m$ and therefore the sequence of values will repeat after a period of at most $m$. However, if the values of $a$, $c$ and $m$ are set carefully and $m$ is a very large number then the values $U_1, U_2, \ldots$ may appear initially to us to be unrelated to each other and uniformly spread in the interval $(0, 1)$.
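For example, the following code implements a small LCG. The values $m = 16$, $a = 5$, $c = 3$ and seed $X_0 = 7$ are our own choices, picked purely for illustration: with such a small $m$ we can see the sequence of values repeat after $m = 16$ steps.
lcg <- function(n, m, a, c, X0) {
  # Generates n values, divided by m, from the LCG X[i+1] = (a * X[i] + c) mod m
  x <- numeric(n)
  x[1] <- (a * X0 + c) %% m
  for (i in 2:n) {
    x[i] <- (a * x[i - 1] + c) %% m
  }
  return(x / m)
}
lcg(20, m = 16, a = 5, c = 3, X0 = 7)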
Appearances can be deceptive. Consider the RANDU generator, which used $a = 65539$, $c = 0$ and $m = 2^{31}$. A good PRNG should produce numbers that are spread close to randomly over the interval $(0, 1)$, successive pairs of numbers that are spread close to randomly in the unit square, successive triplets that are spread close to randomly in the unit cube, and so on. However, RANDU has the property that all consecutive triplets of numbers fall on one of only 15 two-dimensional planes, so that there are large spaces between these planes where no numbers can ever be produced. All LCGs have some non-random structure like this, but RANDU is a particularly bad example and is therefore no longer used.
The following R function randu produces successive triplets $(u_i, u_{i+1}, u_{i+2})$ from the RANDU generator, using the seed $X_0 = 1$. You need not worry about how the code works, but you could find out about the functions involved using, for example, ?numeric or help(numeric).
randu <- function(N, X0 = 1) {
#
# Generates N successive triplets from the RANDU PRNG
#
# Arguments:
# N - an integer. The number of successive triplets required.
# X0 - an integer. The value of the seed, the default value is 1.
# Returns:
# An N by 3 numeric matrix. Row i contains the ith triplet.
#
res <- numeric(N + 2)
res[1] <- X0
m <- 2 ^ 31
for (i in 2 + (0:N)) {
res[i] <- (65539 * res[i - 1]) %% m
}
res <- res / m
return(embed(res, 3))
}
We call randu() with the argument N = 10000 to simulate 10000 triplets.
randu <- randu(10000)
The triplets are then plotted in a 3D scatter plot, using the plot3d function from the rgl R package, which is loaded using library(rgl). rgl is one of many packages contributed by individuals to supplement base R. At the time of writing there are 15707 such packages. To install a package use, for example, install.packages("rgl").
library(rgl)
par(mar = c(0, 0, 0, 0))
plot3d(randu[, 1], randu[, 2], randu[, 3], xlab = expression(u[i]),
ylab = expression(u[i + 1]), zlab = expression(u[i + 2]))
3D scatter plot of successive triplets from the RANDU generator. When the plot is rotated we can see that the points lie on 2D planes in 3D space.
You should be able to left click on the plot above and move your mouse in order to rotate the plot in 3D. If you orientate the plot correctly you will see RANDU’s very non-random behaviour. (If this doesn’t work for you then please look at the plot at RANDU.)
RANDU is a particularly bad PRNG. No PRNG can be perfect but we want to avoid very obvious non-random behaviour. Various statistical tests have been designed to detect certain types of non-random behaviour in PRNG output. The vignette of the RDieHarder R package provides some examples. Improvements have been made to the basic setup described above, resulting in improved performance. The general idea is the same but the implementation is more complicated. If you are interested then please see the Wikipedia PRNG page.
We mention briefly an alternative approach, which aims to produce true random numbers, based on measurements of phenomena that behave randomly. The R packages random and qrandom provide functions to do this.
To generate pseudo-random numbers in the interval (0, 1) we may use the runif function. The unif part of the name relates to a uniform distribution. We view numbers generated using runif as behaving approximately like a random sample from a standard uniform U(0, 1) distribution. We often refer to this as simulating (a random sample) from a uniform distribution. By default R uses the Mersenne-Twister generator. See ?Random for details. We repeat the 3D plot that exposed the big problem with RANDU, using numbers generated from runif.
x <- embed(runif(10000 + 2), 3)
par(mar = c(0, 0, 0, 0))
plot3d(x[, 1], x[, 2], x[, 3], xlab = expression(u[i]),
ylab = expression(u[i + 1]), zlab = expression(u[i + 2]))
3D scatter plot of successive triplets generated using runif. When the plot is rotated we cannot see any obvious structure.
There is some non-random structure hidden in here but I can’t identify it!
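One simple check of the kind mentioned earlier (a very basic test of the uniformity of individual values, not one of the more searching tests in the RDieHarder vignette) is to bin a sample generated by runif into 10 equal-width intervals and use a chi-squared goodness-of-fit test to assess whether the counts are consistent with a uniform distribution. This sketch is ours, for illustration only.
u <- runif(10000)
# Count how many values fall in each of 10 equal-width bins
observed <- table(cut(u, breaks = seq(0, 1, by = 0.1)))
# Chi-squared goodness-of-fit test against equal expected counts in each bin
chisq.test(observed)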
We may use the set.seed
function to set the seed. If
this is not set by us then R sets a seed automatically based on the
current time. The following code simulates a sample of size 6 from a
uniform distribution.
If we do this again then we get a different set of numbers.
If we set the seed, simulate the sample, reset the seed and simulate again, then we obtain exactly the same numbers.
# Set a random number `seed'.
set.seed(29052020)
runif(6)
[1] 0.09185512 0.83476898 0.99637843 0.28297779 0.98253045 0.42240114
set.seed(29052020)
runif(6)
[1] 0.09185512 0.83476898 0.99637843 0.28297779 0.98253045 0.42240114
This may seem like a strange thing to do, but there are instances where we would like to use again exactly the same set of pseudo-random numbers, perhaps to replicate previous results or facilitate a comparison between two different approaches.
We have seen how we can simulate from a U(0, 1) distribution, to produce numbers $u_1, u_2, \ldots$. We can use these numbers to simulate from other distributions.
Consider an experiment where only two outcomes are possible: failure and success. We call such an experiment a Bernoulli trial. Suppose that we conduct $n$ Bernoulli trials and record $W$, the total number of successes. We suppose that: (a) the trials are independent, that is, the outcome of each trial is completely unrelated to the outcome of all other trials, and (b) that the probability of a success is equal to $p$, where $0 \leq p \leq 1$, for each trial. Under these assumptions $W$ has a binomial distribution with parameters $n$ and $p$, or $W \sim \mbox{binomial}(n, p)$ for short.
A direct, but naive, way to simulate from a binomial distribution is to use $n$ numbers $u_1, \ldots, u_n$ generated from runif. Each of these numbers has the property that it is less than $p$ with probability $p$. Therefore, if $u_1 < p$ then we conclude that the result of the first trial is a success, and similarly for the other trials. This is like tossing a coin that is biased so that 'heads' (a success) is obtained with probability $p$. Then we simply count the number of successes. The function rbinomial below simulates one value from a binomial distribution with parameters size ($n$) and prob ($p$).
rbinomial <- function(size, prob) {
#
# Simulates one value from a binomial(size, prob) distribution in a direct,
# but inefficient, way.
#
# Arguments:
# size - an integer. The number of independent Bernoulli trials.
# prob - a number in [0, 1]. The probability of success.
# Returns:
# An integer: the total number of successes in size trials.
#
# Simulate size values (pseudo-)randomly between 0 and 1.
u <- runif(size)
# Find out whether (TRUE) or not (FALSE) each value of u is less than p.
outcome <- u < prob
# Count the number of TRUEs, i.e. the number of successes.
n_successes <- sum(outcome)
# Return the number of successes.
return(n_successes)
}
We simulate one value from a binomial(6, 0.2) distribution, stepping through the calculations that happen inside rbinomial. Can you see what is going on? The following should help.
set.seed(2020)
size <- 6
prob <- 0.2
u <- runif(size)
u
[1] 0.64690284 0.39422576 0.61850181 0.47689114 0.13609719 0.06738439
outcome <- u < prob
outcome
[1] FALSE FALSE FALSE FALSE TRUE TRUE
n_successes <- sum(outcome)
n_successes
[1] 2
The behaviour of sum(outcome) is quite subtle. Can you see what is happening? Check your answer using ?sum. A related point is that if we turn the logical vector outcome (which contains TRUEs and FALSEs) into a numeric vector, then we get …
Our function rbinomial will be slow if $n$ is large. Can you see why?
The in-built function in R that simulates from a binomial distribution is rbinom. Use ?rbinom to find out what the arguments to rbinom mean. Unlike our simple function rbinomial, the function rbinom can simulate more than one value from a binomial distribution.
rbinom(n = 1, size = 6, prob = 0.2)
[1] 0
rbinom(n = 10, size = 6, prob = 0.2)
[1] 1 0 1 2 2 2 1 1 1 3
Underlying rbinom is an algorithm that works quickly even in cases where the number of trials (size) is large.
Consider a non-negative continuous random variable $X$ with the property that $P(X > x) = e^{-\lambda x}$, for all $x \geq 0$ and for some $\lambda > 0$. We say that $X$ has an exponential distribution with rate parameter $\lambda$ and write $X \sim \mbox{exp}(\lambda)$ for short. This distribution will appear later in A simple model for an epidemic.
We simulate a sample of size 1000 from an exponential distribution with parameter equal to 2.
lambda <- 2
exp_sim <- rexp(n = 1000, rate = lambda)
As a crude check we produce a histogram of these simulated values and superimpose the probability density function (p.d.f.) of the exponential distribution from which these values have been simulated.
par(mar = c(4, 4, 1, 1))
hist(exp_sim, probability = TRUE, ylim = c(0, lambda), main = "", xlab = "t", ylab = "density")
x <- seq(0, max(exp_sim), len = 500)
lines(x, dexp(x, rate = lambda))
The p.d.f. of an exp($\lambda$) distribution is given by $f(x) = \lambda e^{-\lambda x}$ for $x \geq 0$ and $f(x) = 0$ for $x < 0$.
A p.d.f. has the property that it integrates to 1 over the real line and can be used to calculate probabilities of the form $P(a < X \leq b)$. In the exponential case, for any $0 \leq a < b$,
$$P(a < X \leq b) = \int_a^b \lambda e^{-\lambda x} \, {\rm d}x = e^{-\lambda a} - e^{-\lambda b}.$$
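As a quick numerical check (the interval endpoints 0.5 and 1 used here are arbitrary choices of ours), we can compare this exact probability with the proportion of the values in exp_sim that fall in the interval.
a <- 0.5
b <- 1
# Exact probability P(a < X <= b) for an exp(lambda) distribution
exp(-lambda * a) - exp(-lambda * b)
# Proportion of the simulated values that fall in (a, b]
mean(exp_sim > a & exp_sim <= b)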
Can you see what dexp does? Use ?dexp to find out.
It can be shown that if $U \sim U(0, 1)$ then $-\ln(U) / \lambda$ has an exp($\lambda$) distribution. This means that we could use the following R code to simulate from an exponential distribution.
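A minimal sketch of what that code could look like, reusing the rate lambda = 2 defined above (the object name exp_inv is ours, used only for illustration); equivalently, we could use qexp(u, rate = lambda).
u <- runif(1000)
exp_inv <- -log(u) / lambda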
The code immediately above is an example of the inversion method of simulation. The following code implements this for the normal distribution.
u <- runif(1000)
mu <- 0
sigma <- 1
norm_sim <- qnorm(u, mean = mu, sd = sigma)
hist(norm_sim, probability = TRUE, main = "", col = "grey",
xlab = "simulated value")
x <- seq(min(norm_sim), max(norm_sim), len = 500)
lines(x, dnorm(x, mean = mu, sd = sigma))
Use ?qnorm to find out what qnorm does. Note that the help page calls this the quantile function. An alternative name is the inverse cumulative distribution function or inverse c.d.f. The c.d.f. of a random variable $X$ is $F(x) = P(X \leq x)$. We denote the inverse c.d.f. by $F^{-1}$. It can be shown that if $U \sim U(0, 1)$ then $F^{-1}(U)$ has c.d.f. $F$, that is, $F^{-1}(U)$ has the required distribution.
The inversion method requires us to be able to evaluate quickly the quantile function of the variable in question. This is not always the case. Although it may seem easy to use qnorm to simulate from a normal distribution using the inversion method, calculations using qnorm are not fast and R's function rnorm uses a more efficient method that avoids using the quantile function. See ?rnorm.
We have illustrated the inversion method using a one-dimensional continuous random variable, the univariate case. The same approach can be used to simulate from the distribution of a discrete random variable, as sketched below, and in the multivariate case, that is, the (joint) distribution of a set of two or more related random variables.
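A minimal sketch of the discrete case (the values and probabilities here are arbitrary choices of ours): to simulate from a distribution that takes the values 1, 2 and 3 with probabilities 0.2, 0.5 and 0.3 we compare each U(0, 1) value with the cumulative probabilities 0.2, 0.7 and 1.
probs <- c(0.2, 0.5, 0.3)
u <- runif(10000)
# findInterval(u, cumsum(probs)) counts how many cumulative probabilities are <= u,
# so adding 1 gives the simulated value 1, 2 or 3
x <- findInterval(u, cumsum(probs)) + 1
# The sample proportions should be close to (0.2, 0.5, 0.3)
table(x) / length(x)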
Often the most efficient method of simulating values from a probability distribution is one that has been devised for that particular distribution. However, it is useful to have more general methods that can, in principle, be used to simulate from any distribution, although, in practice, there are constraints on the type of distribution to which a method can be applied. Many of these methods work by proposing values and then accepting or rejecting them using a rule, hence they are often called acceptance-rejection or rejection methods. In principle these methods can be used in the $d$-dimensional multivariate case but in practice they quickly become inefficient as $d$ increases.
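To give a flavour of the general idea, the following sketch (the target density, the bound M and the function name rejection_sim are chosen only for illustration) simulates from the density $f(x) = 6x(1 - x)$ on $(0, 1)$ by proposing values from a U(0, 1) distribution and accepting each proposal with probability $f(x) / M$, where $M = 1.5$ is an upper bound for $f$.
rejection_sim <- function(n) {
  # Simulates n values from the density f(x) = 6 * x * (1 - x) on (0, 1) using
  # a rejection method with U(0, 1) proposals and bound M = 1.5
  M <- 1.5
  sims <- numeric(n)
  i <- 0
  while (i < n) {
    x <- runif(1)   # propose a value from U(0, 1)
    u <- runif(1)   # used to decide whether to accept the proposal
    if (u < 6 * x * (1 - x) / M) {
      i <- i + 1
      sims[i] <- x
    }
  }
  return(sims)
}
# A crude check: histogram of simulated values with the target p.d.f. superimposed
z <- rejection_sim(1000)
hist(z, probability = TRUE, main = "", xlab = "x")
curve(6 * x * (1 - x), add = TRUE)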
A commonly-used summary of a random variable is its mean, also known as its expectation. For a random variable $X$, this is denoted $E(X)$. For a continuous random variable $X$ with p.d.f. $f_X$,
$$E(X) = \int_{-\infty}^{\infty} x \, f_X(x) \, {\rm d}x.$$
In simple cases this integral can be evaluated easily, but there are instances where this is not easy. For example, there are cases where the value of the p.d.f. cannot be calculated. If we are able to simulate from the distribution of $X$ then we can use the sample mean of these simulated values to estimate $E(X)$. This technique is called Monte Carlo integration.
Suppose that $X \sim U(0, 1)$. It can be shown that $E(X) = 1/2$.
If we simulate a large sample from a U(0, 1) distribution then we expect its sample mean to be close to $1/2$. The following R code checks this.
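A minimal sketch of such a check (the sample size of 10000 is an arbitrary choice):
u <- runif(10000)
mean(u)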
Hopefully, this seems obvious. There is some theory that supports this idea, the Law of Large Numbers. Loosely speaking, this means that the estimate of the expectation of the variable provided by the sample mean can be made as precise as we like by increasing the size of the simulated sample. In short, the sample mean converges to the expectation as the sample size $n$ tends to infinity. This is illustrated in the following plot.
m <- cumsum(u) / seq_along(u)
plot(seq_along(u), m, type = "l", xlab = "n", ylab = expression(estimate~of~E(X)))
abline(h = 1/2, col = "blue", lty = 2)
The convergence that we see here is not of a non-random quantity getting inexorably closer and closer to the limit $1/2$, because as $n$ increases the estimate can move further from $1/2$ and then move back towards it. For the so-called weak law of large numbers the idea is that however close to $1/2$ we require the estimate to be, as $n$ tends to infinity the probability that our requirement is not satisfied tends to $0$.
This idea can be generalised. Suppose that $X$ has p.d.f. $f_X$ and instead of $E(X)$ we are interested in $E[g(X)]$ for some function $g$. $E[g(X)]$ can be calculated using
$$E[g(X)] = \int_{-\infty}^{\infty} g(x) \, f_X(x) \, {\rm d}x.$$
If we simulate a sample $x_1, \ldots, x_n$ from the distribution of $X$ then the sample mean $\frac{1}{n} \sum_{i=1}^{n} g(x_i)$ can be used as an estimate of $E[g(X)]$.
You will be familiar with the mathematical constant $\pi$. It can be shown that
$$\pi = \int_{-\infty}^{\infty} g(x) \, f_X(x) \, {\rm d}x,$$
where $g(x) = 4 \sqrt{1 - x^2}$ and $f_X(x) = 1$ for $0 < x < 1$ and $f_X(x) = 0$ otherwise.
$f_X$ is the p.d.f. of a U(0, 1) random variable. Therefore, $\pi = E[g(X)]$, where $X \sim U(0, 1)$, and if we simulate a sample $x_1, \ldots, x_n$ from a U(0, 1) distribution then $\frac{1}{n} \sum_{i=1}^{n} 4 \sqrt{1 - x_i^2}$ is an estimate of $\pi$ whose precision increases with $n$.
n <- 1000
x <- runif(n)
y <- sqrt(1 - x ^ 2)
theta <- 4 * cumsum(y) / seq_len(n)
plot(1:n, theta, type = "l", xlab = "n", ylab = expression(estimate~of~pi))
abline(h = pi, col = "blue", lty = 2)
Of course, this isn’t how we would calculate in practice and when presented with a 1-dimensional integral that we cannot do ‘by hand’, it may be more sensible to use a method of numerical integration, for example the trapezium rule. However, in many statistical problems the in is not dimensional but dimensional, where is large. For a sufficiently large , numerical integration techniques may not be effective and Monte Carlo integration may be a better option.
A Monte Carlo estimate of $E[g(X)]$ is a random quantity. Its quality can be measured by its standard deviation, which is of the form $c / \sqrt{n}$. Tricks can be used to make the value of $c$ smaller than that of the simple version described above. If we can simulate effectively from the distribution of $X$ (this might not be easy!) and make $c / \sqrt{n}$ small enough then we can produce a precise estimate.
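One such trick, given here only as an illustration, is to average estimates based on $u$ and $1 - u$ (so-called antithetic variables): the two estimates are negatively correlated, so their average tends to have a smaller standard deviation than a simple estimate based on the same number of uniform values.
u <- runif(1000)
# Simple Monte Carlo estimate of pi
4 * mean(sqrt(1 - u ^ 2))
# Antithetic estimate: the average of the estimates based on u and on 1 - u
2 * mean(sqrt(1 - u ^ 2) + sqrt(1 - (1 - u) ^ 2))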
Unfortunately, we are well aware of the serious consequences of an outbreak of an infectious disease and the importance of measures to mitigate the effects of such an epidemic. Understanding the way that the infection spreads within a population over time, the dynamics of the epidemic, can help governments to decide which measures to use and when. A common approach is to represent the dynamics using a model, based on a set of assumptions. The model isn’t correct, because it is a simplification of reality. However, if it captures important aspects of the behaviour of the epidemic then it can be useful. In particular, it may be used to answer `What if?’ questions, such as
“What may happen if we implement, or relax, particular public health measures?”.
We consider a very simple epidemic model based on a stochastic process, which is a collection of quantities that vary over time while undergoing random fluctuations. We measure time in days. For simplicity, we assume that everybody recovers from the disease and that people in a population are either:
- susceptible: not infected, but able to catch the disease from an infected person,
- infected: currently infected and able to pass the disease to susceptible people, or
- recovered: previously infected, no longer infectious and immune to the disease.
After an infected person recovers from the disease they either become immune to the disease, and enter the recovered category, or they do not develop immunity and return to the susceptible category. A model of this general type is called an SIR model.
In describing the assumptions of our model the concept of a Poisson process is important. Informally, in a Poisson process events of interest occur singly, randomly and independently of each other and at a constant rate, $\lambda$ per day say. We will observe differing numbers of events on different days, but the mean number of events per day is constant over time. A key property of a Poisson process is that if we inspect it at an arbitrarily chosen time then the time until the next event has an exponential($\lambda$) distribution.
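This property gives a simple way to simulate the event times of a Poisson process: the gaps between successive events are independent exponential($\lambda$) variables. A minimal sketch (the rate of 2 events per day and the 10-day window are arbitrary choices of ours):
rate <- 2
# Simulate more inter-event times than we expect to need over 10 days
gaps <- rexp(100, rate = rate)
event_times <- cumsum(gaps)
# Keep only the events that occur in the first 10 days
event_times[event_times <= 10]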
Model assumptions
- New people enter the population in a Poisson process of rate $\alpha$ per day.
- Each new person that arrives has a probability $p_I$ of being infected with the disease.
- Each infected person contacts other people, chosen randomly from the population, in a Poisson process of rate $\beta$ per day.
- If an infected person contacts a susceptible person then the susceptible person is infected immediately.
- An infected person recovers after a time that has an exponential($\gamma$) distribution, with a mean of $1 / \gamma$ days.
- When a person recovers the probability that they become immune is $p_R$.
In addition we assume that each person acts independently of all other people and that the processes governing infection, recovery and the arrival of new people are independent.
It is relatively simple to simulate from this model, so that we can study how an epidemic might evolve over time. We have written an R function rSIR that does this. This function, and technical details that explain how it works, are given in the appendix SIR theory and simulation code.
Even for this simple model important properties may not be obtained easily using mathematics. For example, we may be interested in the probability that demand for treatment for the disease will exceed supply and, if it is exceeded, by how much it will be exceeded and for how long. The answers to the latter questions are not single numbers but probability distributions and these in particular will be difficult to obtain in simple form. The problem will be even more difficult for a more realistic model.
Simulation allows us to answer these questions by simulating from the model, using parameter values that are considered relevant, and calculating quantities of interest based on the simulated data. If we simulate many times then we can estimate the probability that an event happens, using the proportion of simulations in which it happens, and study how important quantities vary between simulations.
The following simulations illustrate the general behaviour of an epidemic under this SIR model, using parameter values that are chosen in a fairly arbitrary way. In all cases we consider an initial population size of 1000 people and simulate for a period of 100 days.
Firstly, we suppose that initially there are 10 infected people and the rest of the population is susceptible. We suppose that:
- new people arrive at a rate of $\alpha = 1/10$ per day and each arrival is infected with probability $p_I = 0.1$,
- each infected person makes contacts at a rate of $\beta = 1$ per day,
- the mean time to recovery is 5 days, that is, $\gamma = 1/5$, and
- the probability of immunity after recovery is $p_R = 0.9$.
set.seed(31052020)
# parameters: (alpha, pI, beta, gamma, pR)
pars1 <- c(1/10, 0.1, 1, 1/5, 0.9)
I0 <- 10
r1 <- rSIR(pars = pars1, I0 = I0)
par(mar = c(4, 4, 1, 1))
leg <- c("susceptible", "infected", "immune", "population")
matplot(r1[, 1], r1[, -1], type = "l", lty = 1, lwd = 2, col = 1:4,
ylab = "number of people", xlab = "time / days")
legend("right", legend = leg, lty = 1, lwd = 2, col = 1:4, bg = "transparent")
After the infection arrives in the population the number of infected people increases quickly. We expect this because the basic reproduction number, $\beta / \gamma = 5$, is greater than 1. Then the number of infected people drops quickly because many of these people become immune. At the end of 100 days there are still approximately 100 susceptible people in the population. This means that the disease will spread again at some point in the future, but the outbreak may be less dramatic if the proportion of susceptible people in the population is low.
Suppose that we only have resources to treat 500 infected people at any one time. We repeat the simulation from this SIR model 100 times. We plot only the evolution of the number of infected people and calculate the proportion of simulations in which the number of infected people exceeds 500 during the 100-day period. This gives an estimate of 69% for the probability that our resources are not sufficient.
set.seed(31052020)
r1 <- rSIR(pars = pars1, I0 = I0)
par(mar = c(4, 4, 1, 1))
matplot(r1[, 1], r1[, 3], type = "l", ylab = "number of infected people",
xlab = "time / days", ylim = c(0, 550), col = 2)
nsim <- 100
gt500 <- numeric(nsim)
gt500[1] <- any(r1[, 3] > 500)
for (i in 2:nsim) {
r2 <- rSIR(pars = pars1, I0 = I0)
matlines(r2[, 1], r2[, 3], col = 2)
gt500[i] <- any(r2[, 3] > 500)
}
abline(h = 500, lty = 2)
mean(gt500)
[1] 0.69
We re-run the first simulation with the probability of post-infection immunity set at 10%, rather than 90%. We also extend the simulation to 200 days.
set.seed(31052020)
# parameters: (alpha, pI, beta, gamma, pR)
pars2 <- c(1/10, 0.1, 1, 1/5, 0.1)
r2 <- rSIR(pars = pars2, I0 = I0, days = 200)
par(mar = c(4, 4, 1, 1))
matplot(r2[, 1], r2[, -1], type = "l", lty = 1, lwd = 2, col = 1:4,
ylab = "number of people", xlab = "time / days")
legend("right", legend = leg, lty = 1, lwd = 2, col = 1:4, bg = "transparent")
Now the behaviour is rather different. The number of people who are immune rises more slowly than before and the number of infected people drops more slowly after the peak. At the end of 200 days approximately 20% of the population are susceptible to the disease when a new infected person arrives.
Our model is clearly not a useful representation of a COVID-19 outbreak, because it doesn’t reflect the reality of the way that the disease is transmitted and the public health measures used to mitigate its effects. Also the parameter values that we have used have not been informed by expert knowledge or estimated from data.
To see some ways in which this kind of model could be extended to increase realism with regard to the COVID-19 outbreak, see Simulating COVID-19 interventions with R. If you read this then you will see that although several simulations are performed only results averaged over these simulations are presented. Studying only the behaviour of averages over time is potentially dangerous, because this may hide important information about how variable the results are over different simulations.
I hope that we have convinced you that stochastic simulation is a useful technique and given you some idea how it works. We have presented two examples, but simulation has many more uses. For example, simulation can be used to fit models in cases where it is not possible to perform the mathematics required for more traditional fitting methods. The general idea is that data simulated from a good model for real data should look like the real data, and therefore we should identify combinations of parameters for which this is the case.
Let $S_t$, $I_t$ and $R_t$ denote the respective numbers of susceptible, infected and recovered people, and $N_t$ the population size, at time $t$. Let $X_t = (S_t, I_t, R_t, N_t)$. We start the process at time $t = 0$, with $N_0$ people in the population and $S_0$, $I_0$ and $R_0$ people in the three disease categories.
The assumptions of our SIR model are convenient mathematically, that is, they make it easier to study certain properties of the epidemic.
In particular, the following property holds.
The Markov property. If we know the value of $X_t$ then the future evolution of the disease does not depend on the numbers of infected, susceptible and recovered people prior to time $t$.
It is easy to simulate from a Markov process, because it breaks down into two separate parts.
- How long is it until something changes?
- What kind of a change is it?
It is common to refer to a change as a “jump”. Here, each jump reflects something happening to one person.
For question 1, it doesn't matter which type of event causes the jump. Suppose that at a time $t$, $X_t = (S_t, I_t, R_t, N_t)$. We need to consider the overall rate at which events happen under these conditions. The rate is $\alpha + \beta S_t I_t / N_t + \gamma I_t$. The rate of new arrivals is $\alpha$. The term $\beta S_t I_t / N_t$ is the overall infection rate, which follows because there are $I_t$ people who could infect others and each of their contacts has probability $S_t / N_t$ of being with a susceptible person. The overall recovery rate from the $I_t$ infected people is $\gamma I_t$, because each infected person is recovering at a rate $\gamma$. Special properties of the exponential distribution mean that the time until the next jump has an exponential distribution, with rate equal to the overall rate.
For question 2, the probability that an event of a particular type is the first event to occur is proportional to its rate of occurrence. We also need to consider whether or not an arrival is infected and whether or not an infected person becomes immune after recovery.
Simulation from our model
Suppose that $X_t = (S_t, I_t, R_t, N_t)$.
- The time until the next jump has an exponential($\lambda_t$) distribution, where $\lambda_t = \alpha + \beta S_t I_t / N_t + \gamma I_t$.
- This jump is to
  - $(S_t, I_t + 1, R_t, N_t + 1)$ with probability $\alpha p_I / \lambda_t$ (arrival of an infected person)
  - $(S_t + 1, I_t, R_t, N_t + 1)$ with probability $\alpha (1 - p_I) / \lambda_t$ (arrival of a susceptible person)
  - $(S_t, I_t - 1, R_t + 1, N_t)$ with probability $\gamma p_R I_t / \lambda_t$ (recovery with immunity)
  - $(S_t + 1, I_t - 1, R_t, N_t)$ with probability $\gamma (1 - p_R) I_t / \lambda_t$ (recovery with no immunity)
  - $(S_t - 1, I_t + 1, R_t, N_t)$ with probability $(\beta S_t I_t / N_t) / \lambda_t$ (infection of a susceptible)
The overall rate of arrival, $\alpha$, splits into a rate of $\alpha p_I$ for infected arrivals and $\alpha (1 - p_I)$ for susceptible arrivals. Similarly, the overall recovery rate from the $I_t$ infected people is $\gamma I_t$, which splits into a rate $\gamma p_R I_t$ for recovery with immunity and $\gamma (1 - p_R) I_t$ for recovery without immunity.
We already know how to simulate from an exponential distribution, and simulation of the outcome of a jump is a simple extension, from 2 outcomes to 5 outcomes, of the idea that we used when simulating from a binomial distribution.
The Markov property means that when the process jumps, at a time $s$ say, we can just start again with parts 1 and 2. Only the value of $X_s$ matters, not the history of how this value was reached.
The simulation is performed using two R functions. rSIR is the function that we call to set up the simulation. SIRjump is called repeatedly within rSIR: each call simulates the next change, or jump, in the state of the process. The arguments (inputs) to the functions are explained in comments at the start. Default values of the arguments are given in the first line of each function. We can change these by specifying different numbers when we call rSIR.
rSIR <- function(N0 = 1000, I0 = 0, S0 = N0 - I0, days = 100,
pars = c(1, 0.1, 2, 1, 0.9)) {
#
# Simulates from a simple stochastic SIR epidemic model
#
# Arguments:
# N0 - an integer. The population size at time 0.
# I0 - an integer. The initial number of infected people.
# S0 - an integer. The initial number of susceptible people.
# Note: R0, the initial number of recovered people, is inferred as N0 - I0 - S0.
# days - an integer. The number of days for which to simulate.
# pars - a numeric vector: (alpha, pI, beta, gamma, pR).
# alpha : rate of arrival of people into the population
# pI : probability that an arrival is infected
# beta : individual infection rate at time t is beta / Nt
# gamma : recovery rate for each infected individual
# pR : probability that an infected person is immune after recovery
#
# Returns:
# A numeric matrix with 5 columns. Row i contains the values of
# (t, S_t, I_t, R_t, N_t) immediately after transition i - 1.
#
# Checks:
if (I0 + S0 > N0) {
stop("There can be at most N people who are susceptible or infected")
}
# Infer the number of recovered people
R0 <- N0 - I0 - S0
# The initial state, at time = 0
x <- c(0, S0, I0, R0, N0)
# A list in which to store the results
res <- list()
i <- 1
# Simulate the next change, or jump, in the process until days have passed
while (x[1] < days) {
res[[i]] <- x
x <- SIRjump(x, pars)
i <- i + 1
}
# Convert the list to a matrix
res <- do.call(rbind, res)
colnames(res) <- c("t", "St", "It", "Rt", "Nt")
return(res)
}
SIRjump <- function (x, pars) {
#
# Simulates one jump from a simple stochastic SIR epidemic model
#
# Arguments:
# x - a numeric vector: (time, S, I, R, N) at the previous jump.
# pars - a numeric vector: (alpha, pI, beta, gamma, pR).
#
# Returns:
# a numeric vector: (time, S, I, R, N) immediately after the next jump
#
# The numbers of susceptible, infected and recovered people and the population size
St <- x[2]
It <- x[3]
Rt <- x[4]
Nt <- x[5]
# The parameter values
alpha <- pars[1]
pI <- pars[2]
beta <- pars[3]
gamma <- pars[4]
pR <- pars[5]
# Simulate the time at which the next transition occurs
total_rate <- alpha + beta * St * It / Nt + gamma * It
x[1] <- x[1] + rexp(1, total_rate)
# Jump probabilities
p1 <- alpha * pI / total_rate
p2 <- alpha * (1 - pI) / total_rate
p3 <- gamma * pR * It / total_rate
p4 <- gamma * (1 - pR) * It / total_rate
u <- runif(1)
if (u < p1) {
# Arrival of a new infected person: It increases by 1, Nt increases by 1
x[3] <- x[3] + 1
x[5] <- x[5] + 1
} else if (u < p1 + p2) {
# Arrival of a new susceptible person: St increases by 1, Nt increases by 1
x[2] <- x[2] + 1
x[5] <- x[5] + 1
} else if (u < p1 + p2 + p3) {
# Infected person becomes immune: It decreases by 1, Rt increases by 1
x[3] <- x[3] - 1
x[4] <- x[4] + 1
} else if (u < p1 + p2 + p3 + p4) {
# Infected person becomes susceptible: It decreases by 1, St increases by 1
x[3] <- x[3] - 1
x[2] <- x[2] + 1
} else {
# Infection of a susceptible: St decreases by 1, It increases by 1
x[2] <- x[2] - 1
x[3] <- x[3] + 1
}
return(x)
}