vignettes/stat0002-ch5-random-variables-vignette.Rmd
stat0002-ch5-random-variables-vignette.Rmd
This vignette provides some R code that is related to some of the content of Chapter 5 of the STAT0002 notes, namely to random variables and their properties.
We look at two example discrete random variables and consider how to find their respective modes, medians and means.
Consider the random variable with support and p.m.f. satisfying
The following code produces plots of the p.m.f. and c.d.f. of , fiddling with the basic plots to produce a slightly prettier ones.
> # Plot the p.m.f.
> x <- 0:2
> px <- c(1/6, 1/2, 1/3)
> plot(x, px, type = "h", axes = FALSE, ylab = "P(X = x)", xlab = "x",
+ ylim = c(0, 1/2), lwd = 4, cex.lab = 1.5, cex.axis = 1.5, las = 1)
> axis(1, at = 0:2, lwd = 2)
> axis(2, at = c(0, sort(px)), labels = c("0", "1/6", "1/3", "1/2"), lwd = 2)
> # Plot c.d.f.
> x0 <- c(-0.5, 0, 0, 1, 1, 2, 2)
> y0 <- c(0, 0, 1/6, 1/6, 2/3, 2/3, 1)
> x1 <- c(0, 0, 1, 1, 2, 2, 2.5)
> y1 <- c(0, 1/6, 1/6, 2/3, 2/3, 1, 1)
> plot(c(x0, x1), c(y0, y1), axes = FALSE, ylab = "", xlab = "x", las = 1, type = "n", cex.lab = 1.5,
+ cex.axis = 1.5, lwd = 4)
> segments(x0, y0, x1, y1, lty = rep(1, 7), lwd = 2, pch = 0)
> axis(1, at = 0:2, labels = 0:2, pos = 0, lwd = 2)
> axis(2, at = cumsum(c(0, px)), labels = c("0", "1/6", "2/3", "1"), las = 1, lwd = 2)
> title(ylab = expression("P(X"<="x)"), cex.lab = 1.5, line = 2.5)
> # Add lines to indicate the median of X
> axis(2, at = 1 / 2, labels = "1/2", las = 1, lwd = 2)
> segments(-0.5, 1 / 2, 1, 1 / 2, lty = 2)
> segments(1, 0, 1, 1 / 2, lty = 2)
The plot of the p.m.f. shows that the mode of is 1 and the plot of the c.d.f. that the median of is 1. This random variable has finite support. Therefore, its mean exists. We could calculate using the following code.
Consider the function
This is a valid p.m.f. of a discrete random variable , because
To illustrate 2. look at the following output and plot.
> m <- 100
> x <- 1:m
> px <- (6 / pi ^ 2) / x ^ 2
> mat <- cbind(c(0, x), cumsum(c(0, px)))
> plot(mat[, 1], mat[, 2], pch = 20, ann = FALSE)
> title(xlab = "x", ylab = expression(P(X <= x)))
> abline(h = 1, lty = 2)
> tail(mat)
[,1] [,2]
[96,] 95 0.9936343
[97,] 96 0.9937003
[98,] 97 0.9937649
[99,] 98 0.9938282
[100,] 99 0.9938902
[101,] 100 0.9939510
Can you find the median, mode and mean of ?
You will have trouble finding the mean because does not converge. Suppose that we try to approximate the value of this sum using for some large value . The following code plots the against for .
> m <- 100
> x <- 1:m
> px <- (6 / pi ^ 2) / x ^ 2
> s <- cumsum(x * px)
> plot(x, s, pch = 20, ann = FALSE)
> title(xlab = "m")
> title(ylab = expression(hat(S)(m)), line = 2.25)
Does it look like is converging as ? You could try some larger values of to investigate. You could also examine algebraically. Does it involve an example series that you have seen before? Is this series convergent or divergent?
This distribution is a special case of a Zeta
distribution, where a parameter
.
This distribution has a (finite) mean only if
.
If
then we can argue either that this distribution has no mean or that it
has an infinite mean. If we simulate samples of finite size from a Zeta
distribution with
then we can calculate the sample means of these samples. However, these
sample means are of little use to us because this distribution does not
have a finite mean. The values of these sample means are liable to be
influenced by the values of the largest sample values which may be very
large. The VGAM
R package has a function rzeta
that simulates samples from a Zeta distribution. If you would like to
explore how these sample means behave then you can use the code below.
You could repeat this a few times and look at how the sample means vary.
When using the VGAM
package setting the argument
shape = 1
corresponds to the distribution that we have
considered.
We look at a continuous random variable that may be used to model the Oxford birth times data and consider how to calculate the mode, median and mean of this random variable.
The plot in the bottom of Figure
5.2 of the notes is of the p.d.f. of a gamma distribution with
parameters chosen so that the shape of the p.d.f. is similar to that of
the histogram of the times in the Oxford Birth Times dataset. To find
out about the gamma distribution type ?GammaDist
. The
following code finds values of the shape parameter
and scale parameter
for which the mean and variance of the gamma distribution are equal,
respectively, to the sample mean and sample variance of the Oxford birth
times data.
Can you see how this works?
> library(stat0002)
> tbar <- mean(ox_births$time)
> vart <- var(ox_births$time)
> # Sample mean and variance
> c(tbar, vart)
[1] 7.723158 12.731320
> scale_par <- vart / tbar
> shape_par <- tbar / scale_par
> # Estimates of gamma shape and scale parameters
> c(shape_par, scale_par)
[1] 4.685073 1.648460
We produce a basic plot of the gamma distribution that we have fitted to these data, to check that it has done what we expected. If is greater than 1 then the mode of a gamma() distribution is at . We add a vertical line at the mode to check this.
If we did not know how to find the mode using the values of
and
then we could try to find it numerically by searching for the point at
which the p.d.f. is maximised. For functions like this it is common to
seek to maximise the log of the p.d.f.. In practice,
optimisation algorithms, like the function optim
used
below, are often set up to minimise a function by
default. Here we set the control option fnscale
to
-1
to multiply the target function by
,
turning our maximisation problem into a minimisation problem.
Note the use of ...
to pass the values of
shape
, scale
and log
to the
function fn
. Also, method = "L-BFGS-B"
chooses
an optimisation method that allows us to set bounds on the solution and
lower = 0
gives the information that we know that the mode
cannot be less than zero.
> # A function to calculate the (log of the) gamma p.d.f.
> fn <- function(x, ...) dgamma(x, ...)
> find_mode <- optim(1, fn, shape = shape_par, scale = scale_par, log = TRUE,
+ method = "L-BFGS-B", lower = 0, control = list(fnscale = -1))
> # Approximate value of the mode
> find_mode$par
[1] 6.074697
> # Value of the p.d.f. at this value
> exp(find_mode$value)
[1] 0.1232569
It can be shown algebraically that the gamma p.d.f. integrates to 1.
(This must be true otherwise it is not a p.d.f..) R has a function
called integrate
that performs numerical integration to
estimate the value of an integral. The code below uses
integrate
to check that this gamma p.d.f. integrates to 1
over
.
> # Check that the p.d.f. integrates to 1
> integrate(dgamma, 0, Inf, shape = shape_par, scale = scale_par)
1 with absolute error < 1.1e-08
We could also use integrate
to check that our fitted
gamma distribution has the mean that we expect. Note that the
...
in the definition of the function
integrand
that calculates the integrand can be used to pass
the arguments shape
and/or scale
to the
function dgamma
that calculates the p.d.f. of a gamma
distribution.
> # Check that the gamma mean is equal to the sample mean, tbar, of the data
> integrand <- function(x, ...) x * dgamma(x, ...)
> integrate(integrand, 0, Inf, shape = shape_par, scale = scale_par)
7.723158 with absolute error < 3e-07
> tbar
[1] 7.723158
The qgamma
function can be used to calculate quantiles
of a given gamma distribution. Suppose that we wish to calculate the
median (the
quantile) of our fitted gamma distribution. The following code does
this. It also shows that we can calculate more than one quantile at a
time.
> tmedian <- qgamma(1 / 2, shape = shape_par, scale = scale_par)
> tmedian
[1] 7.181169
> qgamma(c(0.05, 0.5, 0.95), shape = shape_par, scale = scale_par)
[1] 2.926326 7.181169 14.370700
We could use integrate
to check that qgamma
has returned the correct median, or use the function
pgamma
, which calculates the c.d.f. of a gamma random
variable.