vignettes/stat0002-ch2a-descriptive-statistics-vignette.Rmd
The main purpose of this vignette is to provide R code to calculate the summary statistics that feature in Chapter 2 of the STAT0002 notes (apart from correlation, which we defer until Chapter 9). An important point to appreciate is that usually there is more than one way to estimate from data a particular theoretical property of the distribution from which the data came. For example, we will see that there are many different rules (estimators) that can be used to estimate a quantile of a distribution.
The functions five_number, skew and q_skew can be viewed by typing the name of the function at the R command prompt >.
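As a small illustration of viewing a function, the sketch below uses the base R function sd as a stand-in, on the assumption that the stat0002 package may not be loaded; the same idea applies to five_number, skew and q_skew.

```r
# Typing a function's name without parentheses prints its definition.
# sd (from the stats package) stands in here for five_number, skew and q_skew.
sd

# The argument names can also be extracted programmatically.
sd_args <- names(formals(sd))
sd_args
```

Calling the function requires parentheses, for example sd(c(1, 2, 3)); leaving them off is what asks R to show the code instead.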
The birth times data are available in the data frame ox_births. Use ?ox_births to find out about these data.
We manipulate the data into a matrix that is of the same format as Table 2.1 in the notes. The number of birth times varies between days so we pad the matrix with R’s missing values code NA in order that each column of the matrix has the same number of rows.
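> # Start with a 16 by 7 matrix of NAs: one column per day of the week
> ox_mat <- matrix(NA, ncol = 7, nrow = 16)
> for (i in 1:7) {
+   # Extract the birth times for day i and store them in increasing order
+   day_i_times <- ox_births$time[which(ox_births$day == i)]
+   ox_mat[1:length(day_i_times), i] <- sort(day_i_times)
+ }
> # Name the columns once, after the loop has filled the matrix
> colnames(ox_mat) <- paste("day", 1:7, sep = "")
> ox_mat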
       day1  day2  day3  day4  day5  day6  day7
 [1,]  2.10  4.00  2.60  1.50  2.50  4.00  2.00
 [2,]  3.40  4.10  3.60  4.70  2.50  4.00  2.70
 [3,]  4.25  5.00  3.60  4.70  3.40  5.25  2.75
 [4,]  5.60  5.50  6.40  7.20  4.20  6.10  3.40
 [5,]  6.40  5.70  6.80  7.25  5.90  6.50  4.20
 [6,]  7.30  6.50  7.50  8.10  6.25  6.90  4.30
 [7,]  8.50  7.25  7.50  8.50  7.30  7.00  4.90
 [8,]  8.75  7.30  8.25  9.20  7.50  8.45  6.25
 [9,]  8.90  7.50  8.50  9.50  7.80  9.25  7.00
[10,]  9.50  8.20 10.40 10.70  8.30 10.10  9.00
[11,]  9.75  8.50 10.75 11.50  8.30 10.20  9.25
[12,] 10.00  9.75 14.25    NA 10.25 12.75 10.70
[13,] 10.40 11.00 14.50    NA 12.90 14.60    NA
[14,] 10.40 11.20    NA    NA 14.30    NA    NA
[15,] 16.00 15.00    NA    NA    NA    NA    NA
[16,] 19.00 16.50    NA    NA    NA    NA    NA
> i <- 4
> ox_births$day == i
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE
[49] TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
[61] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[73] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[85] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
> which(ox_births$day == i)
[1] 44 45 46 47 48 49 50 51 52 53 54
> ox_births$time[which(ox_births$day == i)]
[1] 8.10 10.70 11.50 7.20 7.25 9.50 8.50 1.50 4.70 4.70 9.20
> paste("day", 1:7, sep = "")
[1] "day1" "day2" "day3" "day4" "day5" "day6" "day7"
> paste("day", 1:7, sep = " ")
[1] "day 1" "day 2" "day 3" "day 4" "day 5" "day 6" "day 7"
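The padding step in the loop above can also be written without explicit indexing. The following is only a sketch of one alternative, using a small made-up data frame (toy, with hypothetical values) in place of ox_births so that it runs without the stat0002 package.

```r
# Toy data standing in for ox_births (hypothetical values, not the real data)
toy <- data.frame(
  day  = c(1, 1, 1, 2, 2, 3),
  time = c(5.6, 2.1, 3.4, 4.1, 4.0, 2.6)
)

# Split the times by day and sort within each day
times_by_day <- lapply(split(toy$time, toy$day), sort)

# Pad each day's vector with NA up to the largest daily count
n_max <- max(lengths(times_by_day))
toy_mat <- sapply(times_by_day, function(x) c(x, rep(NA, n_max - length(x))))
colnames(toy_mat) <- paste0("day", names(times_by_day))
toy_mat
```

Because every padded vector has the same length, sapply returns a matrix with one column per day. The explicit loop is arguably easier to follow step by step, which may be why it is used in the text.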
We return to this matrix later. Until then we calculate summary statistics of the dataset containing the birth times from all days of the week.
> birth_times <- ox_births[, "time"]
> sort(birth_times)
[1] 1.50 2.00 2.10 2.50 2.50 2.60 2.70 2.75 3.40 3.40 3.40 3.60
[13] 3.60 4.00 4.00 4.00 4.10 4.20 4.20 4.25 4.30 4.70 4.70 4.90
[25] 5.00 5.25 5.50 5.60 5.70 5.90 6.10 6.25 6.25 6.40 6.40 6.50
[37] 6.50 6.80 6.90 7.00 7.00 7.20 7.25 7.25 7.30 7.30 7.30 7.50
[49] 7.50 7.50 7.50 7.80 8.10 8.20 8.25 8.30 8.30 8.45 8.50 8.50
[61] 8.50 8.50 8.75 8.90 9.00 9.20 9.25 9.25 9.50 9.50 9.75 9.75
[73] 10.00 10.10 10.20 10.25 10.40 10.40 10.40 10.70 10.70 10.75 11.00 11.20
[85] 11.50 12.75 12.90 14.25 14.30 14.50 14.60 15.00 16.00 16.50 19.00
The function five_number calculates the five number summary of data, using the particular method for estimating the lower quartile, median and upper quartile described in the STAT0002 notes. The summary function can also be used to calculate a five number summary.
Apart from the sample mean (which summary also calculates), does summary produce the same values as five_number? No, the estimates of the lower quartile differ. This is because the functions summary and five_number use different rules to estimate quantiles: summary calls quantile using type = 7 whereas five_number uses type = 6. If we call five_number with type = 7 we get the same numbers as summary.
In fact the function quantile has 9 different options for type. Use ?quantile for more information.
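To see the effect of type on a small sample, the sketch below applies quantile with type = 6 and type = 7 to the 11 birth times for day 4, copied from the output earlier in this vignette. The helper q_np1 is hypothetical (it is not part of stat0002): it is my own hand-written version of the rule that R documents for type = 6, which interpolates linearly at position (n + 1)p in the ordered sample.

```r
# Day 4 birth times, copied from ox_births$time[which(ox_births$day == 4)] above
day4 <- c(8.10, 10.70, 11.50, 7.20, 7.25, 9.50, 8.50, 1.50, 4.70, 4.70, 9.20)
probs <- c(0.25, 0.5, 0.75)

# The same data, two different estimation rules
q6 <- quantile(day4, probs = probs, type = 6)
q7 <- quantile(day4, probs = probs, type = 7)
rbind(type6 = q6, type7 = q7)

# Hypothetical helper: linear interpolation at position (n + 1) * p
q_np1 <- function(x, p) {
  x <- sort(x)
  n <- length(x)
  h <- pmin(pmax((n + 1) * p, 1), n)  # keep the position within 1, ..., n
  lo <- floor(h)
  hi <- ceiling(h)
  x[lo] + (h - lo) * (x[hi] - x[lo])
}
q_np1(day4, probs)
```

With 11 observations, type = 6 places the lower quartile at position 12 × 0.25 = 3, giving the third smallest value, whereas type = 7 uses position 1 + 10 × 0.25 = 3.5 and averages the third and fourth smallest values. The two sets of quartiles agree with the day4 columns of the five_number and summary outputs shown later in this vignette.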
Until 2017/18 the STAT0002 notes gave -0.063 as the sample quartile skewness. This was because I used the default setting, type = 7, in the quantile function when calculating it …
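The dependence on type can be checked directly. The sketch below assumes that the sample quartile skewness is ((upper quartile - median) - (median - lower quartile)) / (upper quartile - lower quartile); the helper quartile_skewness is hypothetical and is not the q_skew function from stat0002. The data are the 95 birth times copied from the sorted output above.

```r
# All 95 birth times, copied from the output of sort(birth_times) above
birth_times <- c(
  1.50, 2.00, 2.10, 2.50, 2.50, 2.60, 2.70, 2.75, 3.40, 3.40, 3.40, 3.60,
  3.60, 4.00, 4.00, 4.00, 4.10, 4.20, 4.20, 4.25, 4.30, 4.70, 4.70, 4.90,
  5.00, 5.25, 5.50, 5.60, 5.70, 5.90, 6.10, 6.25, 6.25, 6.40, 6.40, 6.50,
  6.50, 6.80, 6.90, 7.00, 7.00, 7.20, 7.25, 7.25, 7.30, 7.30, 7.30, 7.50,
  7.50, 7.50, 7.50, 7.80, 8.10, 8.20, 8.25, 8.30, 8.30, 8.45, 8.50, 8.50,
  8.50, 8.50, 8.75, 8.90, 9.00, 9.20, 9.25, 9.25, 9.50, 9.50, 9.75, 9.75,
  10.00, 10.10, 10.20, 10.25, 10.40, 10.40, 10.40, 10.70, 10.70, 10.75,
  11.00, 11.20, 11.50, 12.75, 12.90, 14.25, 14.30, 14.50, 14.60, 15.00,
  16.00, 16.50, 19.00
)

# Hypothetical helper (not stat0002's q_skew): quartile skewness for a given type
quartile_skewness <- function(x, type) {
  q <- unname(quantile(x, probs = c(0.25, 0.5, 0.75), type = type))
  ((q[3] - q[2]) - (q[2] - q[1])) / (q[3] - q[1])
}

qs7 <- quartile_skewness(birth_times, type = 7)  # R's default rule
qs6 <- quartile_skewness(birth_times, type = 6)  # rule the text attributes to five_number
c(type7 = qs7, type6 = qs6)
```

Under this assumed formula, type = 7 gives approximately -0.0625, consistent with the -0.063 quoted above, while type = 6 gives a slightly more negative value. The only quartile that changes between the two rules is the lower one (4.95 under type = 7, 4.90 under type = 6).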
We can also calculate summary statistics for each of the seven days of the week, i.e. for each of the columns of ox_mat. In the following, colMeans calculates the mean of each column. The apply function applies a given function (here sd) to each row or each column of a matrix. Use ?apply to see what it does.
> five_number(ox_mat, na.rm = TRUE)
      day1    day2   day3 day4    day5   day6    day7
min  2.100  4.0000  2.600  1.5  2.5000  4.000  2.0000
25%  5.800  5.5500  5.000  4.7  4.0000  5.675  2.9125
50%  8.825  7.4000  7.500  8.1  7.4000  7.000  4.6000
75% 10.300 10.6875 10.575  9.5  8.7875 10.150  8.5000
max 19.000 16.5000 14.500 11.5 14.3000 14.600 10.7000
> summary(ox_mat)
      day1             day2             day3            day4
 Min.   : 2.100   Min.   : 4.000   Min.   : 2.60   Min.   : 1.500
 1st Qu.: 6.200   1st Qu.: 5.650   1st Qu.: 6.40   1st Qu.: 5.950
 Median : 8.825   Median : 7.400   Median : 7.50   Median : 8.100
 Mean   : 8.766   Mean   : 8.312   Mean   : 8.05   Mean   : 7.532
 3rd Qu.:10.100   3rd Qu.:10.062   3rd Qu.:10.40   3rd Qu.: 9.350
 Max.   :19.000   Max.   :16.500   Max.   :14.50   Max.   :11.500
                                   NA's   :3       NA's   :5
      day5             day6             day7
 Min.   : 2.500   Min.   : 4.000   Min.   : 2.000
 1st Qu.: 4.625   1st Qu.: 6.100   1st Qu.: 3.237
 Median : 7.400   Median : 7.000   Median : 4.600
 Mean   : 7.243   Mean   : 8.085   Mean   : 5.537
 3rd Qu.: 8.300   3rd Qu.:10.100   3rd Qu.: 7.500
 Max.   :14.300   Max.   :14.600   Max.   :10.700
 NA's   :2        NA's   :3        NA's   :4
> colMeans(ox_mat, na.rm = TRUE)
    day1     day2     day3     day4     day5     day6     day7
8.765625 8.312500 8.050000 7.531818 7.242857 8.084615 5.537500
> apply(ox_mat, 2, sd, na.rm = TRUE)
    day1     day2     day3     day4     day5     day6     day7
4.296654 3.629348 3.733798 2.937880 3.565532 3.223313 2.887286
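To make the role of the second argument of apply more concrete, here is a short sketch on a small piece of ox_mat (rows 11 to 13 of columns day3, day4 and day5, copied from the matrix printed earlier). The object names m, col_means_apply, n_obs and row_max are mine, chosen for illustration.

```r
# Rows 11 to 13 of columns day3, day4 and day5 of ox_mat, copied from the output above
m <- cbind(day3 = c(10.75, 14.25, 14.50),
           day4 = c(11.50, NA, NA),
           day5 = c(8.30, 10.25, 12.90))

# The second argument 2 means "work column by column"; na.rm = TRUE is passed on to mean
col_means_apply <- apply(m, 2, mean, na.rm = TRUE)
col_means_direct <- colMeans(m, na.rm = TRUE)

# apply also accepts a function written on the spot, here counting non-missing values
n_obs <- apply(m, 2, function(x) sum(!is.na(x)))

# Changing the second argument to 1 works row by row instead
row_max <- apply(m, 1, max, na.rm = TRUE)

col_means_apply
n_obs
row_max
```

For column means the two approaches give the same answer, and colMeans is the more direct choice. apply becomes useful for statistics that have no dedicated column-wise function, such as sd in the call above.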