vignettes/stat0002-ch2d-box-cox-transformation-vignette.Rmd
The vignette involves concepts that we won’t cover until Chapters 4 and 5 of the course: the probability density function (p.d.f.) of a continuous random variable and p.d.f.s of the normal and exponential distributions. For the purpose of this vignette we just need to appreciate that skewness (asymmetry) in the p.d.f. will tend to be reflected by skewness in data sampled from these distributions.
Suppose that we are provided with data, say $x = (x_1, \ldots, x_n)$, relating to a variable and a question of interest concerning some aspect of this variable. Our task is to analyse these data in order to try to answer the question. Naively, we may think that we must perform our calculations using $x$, but this is not the case. It can be perfectly acceptable to analyse instead $y = (y_1, \ldots, y_n)$, where $y_i = g(x_i)$, provided that the function $g$ used to transform $x$ into $y$ is such that our results can be translated back to answer the original question. Typically, $g$ is an invertible continuous function. The transformations that we have used in lectures (power and logarithmic transformations) have these properties.
Your initial reaction may be that this seems like cheating: indeed some students have already suggested this to me. However, we do this kind of thing in mathematics all the time to simplify problems. We use transformation of variable (perhaps calling it substitution) to: solve polynomial equations; solve systems of equations; perform differentiation (e.g. the chain rule); transform a difficult integral into one that is easier, or that we have solved before; solve differential equations.
There can be benefits to working with $y$ instead of $x$. The methods that we use to perform statistical inference (i.e. to analyse data to infer answers to questions) are often based on assumptions about the behaviour of the data. For example, we may assume that the data have been sampled from a particular probability distribution, e.g. a normal distribution, or that the mean of one variable of interest is linearly related to the value of another variable. These assumptions can be convenient because they lead to methods of analysis that are simple and well-understood. For more information and examples see Data transformation.
We don’t need to believe that these assumptions are exactly correct, but we should appreciate that if they are very far from being correct then our inferences may be misleading. If, for example, it seems clear from a plot of $x$ that the variable has a distribution that is strongly skewed, then assuming that the data have been sampled randomly from a normal distribution (which is a symmetric distribution) may be a bad idea. However, perhaps $\log x$, for example, is much closer to being normally distributed than $x$, or at least seems closer to being symmetrically distributed. If we use a method that is intended to work well for data sampled randomly from a normal distribution then it is better to analyse $y$ than $x$.
Please note that transformation of data is by no means the only way to solve the problem that data do not behave in a way that is consistent with a convenient simple model. In particular, note that although the normal distribution is used in many areas of Statistics there are many instances where methods based on other distributions are more appropriate.
If we do wish to transform data then we need to consider which transformation to apply. The following family of power transformations can be helpful, provided that the data are positive. If they are not we could first add a constant to the data to achieve this property, or use a similar transformation (Yeo and Johnson (2000)) that does not require the data to be positive.
The transformation that we have considered in the lectures is called a Box-Cox transformation, after the authors of a paper (Box and Cox (1964)) in which this type of transformation was studied extensively. (The names of the authors are a very pleasing coincidence, see Box and Cox.)
In its simplest form a Box-Cox transformation of $x$ to $y$ is given by
$$y = \begin{cases} \dfrac{x^\lambda - 1}{\lambda} & \text{if } \lambda \neq 0, \\ \log x & \text{if } \lambda = 0. \end{cases}$$
The inverse of this transformation is
$$x = \begin{cases} (\lambda y + 1)^{1/\lambda} & \text{if } \lambda \neq 0, \\ e^{y} & \text{if } \lambda = 0. \end{cases}$$
Firstly, we consider how we could use a Box-Cox transformation to transform data relating to one variable so that they appear to be more symmetrically distributed. In Transformation to Approximate Linearity we use a Box-Cox transformation to make the apparent relationship between two variables closer to being linear.
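As a quick check of these formulas, here is a minimal R sketch (the function names `bc` and `bc_inverse` are illustrative only, not part of the stat0002 package) that applies the transformation and then undoes it.
> # Box-Cox transformation and its inverse (illustrative sketch)
> bc <- function(x, lambda) {
+   if (lambda == 0) log(x) else (x ^ lambda - 1) / lambda
+ }
> bc_inverse <- function(y, lambda) {
+   if (lambda == 0) exp(y) else (lambda * y + 1) ^ (1 / lambda)
+ }
> # Recovers the original values 0.5, 1, 2
> bc_inverse(bc(c(0.5, 1, 2), lambda = 1 / 3), lambda = 1 / 3)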
The normal (or Gaussian) distribution is a symmetric probability distribution. If $X$ has a normal distribution then $e^X$ has a log-normal distribution. This is a convenient example to study because if we know that we have a sample from a log-normal distribution then taking logs of the data produces a sample from a normal distribution.
The following code: simulates a sample of size 100 from a log-normal distribution; draws a histogram of these data with the log-normal p.d.f. superimposed; and then produces a similar plot after transforming using a log transformation (`lambda = 0`).
> library(stat0002)
> x <- rlnorm(100)
> boxcox_plot(x, density_fn = dlnorm, main = "data and true density function")
> boxcox_plot(x, density_fn = dlnorm, lambda = 0, main = "after transformation")
You could try other values of `lambda` and look at the result in the plot. In this example we know which value of $\lambda$ should produce approximate symmetry. When confronted with real data we won’t be in this position. However, we could look at a plot of the original data and judge whether we need to make the data less positively skewed (use $\lambda < 1$) or less negatively skewed (use $\lambda > 1$), and then use trial-and-error.
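For example, using the simulated log-normal data above, we could compare a couple of candidate values by eye with the `boxcox_plot` function used earlier (the values 1/2 and 1/3 below are purely illustrative).
> # Trial-and-error: compare two candidate values of lambda by eye
> boxcox_plot(x, density_fn = dlnorm, lambda = 1 / 2, main = "lambda = 1/2")
> boxcox_plot(x, density_fn = dlnorm, lambda = 1 / 3, main = "lambda = 1/3")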
There is also a more automatic way to suggest a suitable value for $\lambda$: we try a number of values of $\lambda$ and summarise how well a normal distribution fits the transformed data using the value of something called a profile log-likelihood. Don’t worry about what this means, just appreciate that larger values of the profile log-likelihood suggest a better fit. This produces a ‘best guess’ at the value of $\lambda$. We can also produce a confidence interval for $\lambda$. We don’t study confidence intervals in detail in STAT0002. The idea is that this interval is a measure of uncertainty about the value of $\lambda$.
The function `boxcox` in the MASS package (Venables and Ripley (2002)) produces a plot of the profile log-likelihood against $\lambda$, with a 95% confidence interval for $\lambda$. The `~ 1` bit in the formula notation `x ~ 1` below specifies that there is no other variable to consider in addition to `x`. In Transformation to Approximate Linearity below we will extend this notation to deal with the case where we consider two variables.
> # Produce the profile log-likelihood plot and save the plotted values in res
> library(MASS)
> res <- boxcox(x ~ 1)
> # Find the value of lambda with the largest profile log-likelihood
> res$x[which.max(res$y)]
[1] 0.1010101
As we expect, the ‘best’ value of $\lambda$ is close to zero and zero lies within the confidence interval for $\lambda$. Typically, we choose a simple value of $\lambda$, e.g. a simple fraction or whole number, that is close to the ‘best’ value and lies within the confidence interval.
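If we would like the endpoints of the 95% confidence interval numerically, rather than reading them from the plot, one option is to apply the usual profile log-likelihood cut-off to the object `res` returned by `boxcox` above (a sketch).
> # Approximate 95% confidence interval for lambda
> cutoff <- max(res$y) - qchisq(0.95, df = 1) / 2
> range(res$x[res$y > cutoff])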
An exponential random variable exhibits strong positive skewness. Therefore, if we wish to Box-Cox transform data that have been sampled from an exponential distribution in order to make them more symmetric then we should use $\lambda < 1$.
We can use the MASS function `boxcox` to help us to decide which value of $\lambda$ we could use. However, an exponential distribution is a special case of a gamma distribution (we don’t study this in STAT0002) and it has long been known (Wilson and Hilferty (1931)) that a cube root transformation tends to make a gamma random variable behave rather like a normal random variable. Therefore, $\lambda = 1/3$ might work well.
> x2 <- rexp(100)
> boxcox_plot(x2, density_fn = dexp, main = "data and true density function")
> boxcox_plot(x2, density_fn = dexp, lambda = 1 / 3, main = "after transformation")
> boxcox(x2 ~ 1)
> abline(v = 1/3, col = "red")
We find that $\lambda = 1/3$ does work well and that the ‘best’ value, calculated by `boxcox`, is close to 1/3.

We consider a simple probability distribution with p.d.f. $f(x) = 2x$ for $0 < x < 1$, and $f(x) = 0$ otherwise, i.e. a triangle with base along (0, 1).
The functions `dtri` and `rtri` in the following code are used to calculate this p.d.f. and to simulate from this distribution, respectively. The latter uses the inversion method of simulation.
> # A function to define the p.d.f.
> dtri <- function(x) {
+ return(ifelse(x > 0 & x < 1, 2 * x, 0))
+ }
> # A function to simulate from this distribution
> rtri <- function(n = 1) {
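+ # Inversion method: the c.d.f. is F(x) = x^2 for 0 < x < 1, so F^{-1}(u) = sqrt(u)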
+ return(sqrt(runif(n)))
+ }
In this example it can be shown that using $\lambda = 2$ will produce a random variable whose p.d.f. is constant (and equal to 2) over the range (-1/2, 0), that is, we produce a uniform U(-1/2, 0) random variable.
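A quick sketch of why this works: the c.d.f. of this distribution is $F(x) = x^2$ for $0 < x < 1$, so $U = X^2$ has a uniform U(0, 1) distribution. Applying a Box-Cox transformation with $\lambda = 2$ gives $Y = (X^2 - 1)/2 = (U - 1)/2$, which is uniform on $(-1/2, 0)$, with constant p.d.f. equal to 2.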
We simulate some data from this triangular distribution and show that using $\lambda = 2$ does indeed produce a density that is flat, and therefore symmetric.
> x3 <- rtri(200)
> boxcox_plot(x3, density_fn = dtri)
> boxcox_plot(x3, density_fn = dtri, lambda = 2)
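The next paragraph refers to the value of $\lambda$ that the `boxcox` function suggests for these data. A sketch of the kind of call that would produce this, assuming that MASS has been loaded as above, with a red line added at $\lambda = 2$ for comparison:
> boxcox(x3 ~ 1)
> abline(v = 2, col = "red")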
Interestingly, the `boxcox` function suggests a value of $\lambda$ that is close to 1.5 and the resulting transformed density is not particularly close to being symmetric. This is because the objective of the `boxcox` function is to seek a Box-Cox transformation for which the transformed density is as close as possible to being normal. It can’t do this very well in this case, but we know that $\lambda = 2$ produces a density that is perfectly symmetric, although it is a uniform density, not a normal density.
In Chapter 8 of the STAT0002 notes we study simple linear regression. In short, the idea is to suppose that the distribution of a response variable $Y$ has a mean that is a linear function of the observed value $x$ of an *explanatory variable* $X$. Alternatively, we could think of $Y$ as being a linear function of $x$ plus some random error that has a zero mean.
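In symbols, using generic intercept and slope parameters $\alpha$ and $\beta$ (the notation in the notes may differ), this is
$$Y_i = \alpha + \beta x_i + \epsilon_i, \qquad i = 1, \ldots, n,$$
where the random errors $\epsilon_1, \ldots, \epsilon_n$ each have mean zero.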
Consider the following plot of the percentage of the vote obtained by the candidate Pat Buchanan ($y$) against the total population (in thousands of people) of the county ($x$) in each of the 67 counties in Florida.
> xlab <- "population (1000s)"
> ylab <- "Buchanan's % vote"
> y <- 100 * USelection$buch / USelection$tvot
> x <- USelection$npop / 1000
> plot(x, y, pch = 16, xlab = xlab, ylab = ylab)
Clearly $y$ is not linearly related to $x$. In Section 9.4 of the notes we consider the use of transformation in the setting of simple linear regression. If we are hoping to use a linear regression model then one of the most important considerations is whether the relationship between $y$ and $x$ is approximately linear. If it is not then one approach is to consider transforming $x$ and/or $y$. In the lectures we saw that we could choose these transformations using trial-and-error.
In Smith (2002) the author uses $\sqrt{y}$ instead of $y$ and $\log x$ instead of $x$. In a lecture we found that this did indeed produce a relationship that looked much closer to being linear. Here we also use $\log x$ instead of $x$ and we use the `boxcox` function to suggest a Box-Cox transformation for $y$.
[It is possible to look simultaneously for the ‘best’ Box-Cox transformations of both $x$ and $y$, but the `boxcox` function doesn’t provide this option.] In the regression setting the `boxcox` function seeks a transformation of the response $y$ for which the distribution of the random error is as close as possible to being normal.
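The profile log-likelihood plot referred to in the next paragraph can be produced by a call of the following kind (a sketch, assuming that MASS is loaded and, following Smith, using $\log x$ as the explanatory variable).
> boxcox(y ~ log(x))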
Here the plot suggests that we might also transform $y$. The plot on the left below is of $y$ against $x$ (we plot both variables on log scales). The plot on the right is of $\sqrt{y}$ against $x$ (we plot $x$ on a log scale).
> xlab <- "population (1000s)"
> ylab <- "Buchanan vote"
> plot(x, y, pch = 16, log = "xy", xlab = xlab, ylab = ylab)
> ylab <- expression(sqrt(Buchanan~vote))
> plot(x, sqrt(y), pch = 16, log = "x", xlab = xlab, ylab = ylab)
In both plots the relationship between the variables is closer to being approximately linear than in the original plot. We might argue that the plot on the left looks slightly better than the other plot and therefore we may question why Smith chose $\sqrt{y}$ rather than $\log y$. However, the relationship between $y$ and $x$ is far from the whole story. Smith’s model was much more complicated than this, and he had to worry about other assumptions, in addition to linearity.