Chapter 2: Descriptive Statistics

The main purpose of this vignette is to provide R code to calculate the summary statistics that feature in Chapter 2 of the STAT0002 notes (apart from correlation, which we defer until Chapter 9). An important point to appreciate is that usually there is more than one way to estimate from data a particular theoretical property of the distribution from which the data came. For example, we will see that there are many different rules (estimators) that can be used to estimate a quantile of a distribution.

The functions five_number, skew and q_skew can be viewed by typing the name of the function at R command prompt >.

The Oxford Birth Times data

These data are available in the data frame ox_births. Use ?ox_births to find out about these data.

> library(stat0002)

We manipulate the data into a matrix that is of the same format as Table 2.1 in the notes. The number of birth times varies between days so we pad the matrix with R’s missing values code NA in order that each column of the matrix has the same number of rows.

> ox_mat <- matrix(NA, ncol = 7, nrow = 16)
> for (i in 1:7) {
+   day_i_times <- ox_births$time[which(ox_births$day == i)]
+   ox_mat[1:length(day_i_times), i] <- sort(day_i_times)
+   colnames(ox_mat) <- paste("day", 1:7, sep = "")
+ }  
> ox_mat
       day1  day2  day3  day4  day5  day6  day7
 [1,]  2.10  4.00  2.60  1.50  2.50  4.00  2.00
 [2,]  3.40  4.10  3.60  4.70  2.50  4.00  2.70
 [3,]  4.25  5.00  3.60  4.70  3.40  5.25  2.75
 [4,]  5.60  5.50  6.40  7.20  4.20  6.10  3.40
 [5,]  6.40  5.70  6.80  7.25  5.90  6.50  4.20
 [6,]  7.30  6.50  7.50  8.10  6.25  6.90  4.30
 [7,]  8.50  7.25  7.50  8.50  7.30  7.00  4.90
 [8,]  8.75  7.30  8.25  9.20  7.50  8.45  6.25
 [9,]  8.90  7.50  8.50  9.50  7.80  9.25  7.00
[10,]  9.50  8.20 10.40 10.70  8.30 10.10  9.00
[11,]  9.75  8.50 10.75 11.50  8.30 10.20  9.25
[12,] 10.00  9.75 14.25    NA 10.25 12.75 10.70
[13,] 10.40 11.00 14.50    NA 12.90 14.60    NA
[14,] 10.40 11.20    NA    NA 14.30    NA    NA
[15,] 16.00 15.00    NA    NA    NA    NA    NA
[16,] 19.00 16.50    NA    NA    NA    NA    NA

Can you see what the following parts of the code do?

> i <- 4
> ox_births$day == i
 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE
[49]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
[61] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[73] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[85] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
> which(ox_births$day == i)
 [1] 44 45 46 47 48 49 50 51 52 53 54
> ox_births$time[which(ox_births$day == i)]
 [1]  8.10 10.70 11.50  7.20  7.25  9.50  8.50  1.50  4.70  4.70  9.20
> paste("day", 1:7, sep = "")
[1] "day1" "day2" "day3" "day4" "day5" "day6" "day7"
> paste("day", 1:7, sep = " ")
[1] "day 1" "day 2" "day 3" "day 4" "day 5" "day 6" "day 7"

We return to this matrix later. Until then we calculate summary statistics of the dataset containing the birth times from all days of the week.

> birth_times <- ox_births[, "time"]
> sort(birth_times)
 [1]  1.50  2.00  2.10  2.50  2.50  2.60  2.70  2.75  3.40  3.40  3.40  3.60
[13]  3.60  4.00  4.00  4.00  4.10  4.20  4.20  4.25  4.30  4.70  4.70  4.90
[25]  5.00  5.25  5.50  5.60  5.70  5.90  6.10  6.25  6.25  6.40  6.40  6.50
[37]  6.50  6.80  6.90  7.00  7.00  7.20  7.25  7.25  7.30  7.30  7.30  7.50
[49]  7.50  7.50  7.50  7.80  8.10  8.20  8.25  8.30  8.30  8.45  8.50  8.50
[61]  8.50  8.50  8.75  8.90  9.00  9.20  9.25  9.25  9.50  9.50  9.75  9.75
[73] 10.00 10.10 10.20 10.25 10.40 10.40 10.40 10.70 10.70 10.75 11.00 11.20
[85] 11.50 12.75 12.90 14.25 14.30 14.50 14.60 15.00 16.00 16.50 19.00

Five number summary

The function five_number calculates the five number summary of data, using the particular method for estimating the lower quartile, median and upper quartile described in the STAT0002 notes.

> five_number(birth_times)
  min   25%   50%   75%   max 
 1.50  4.90  7.50  9.75 19.00

The summary function can also be used to calculate a five number summary.

> summary(birth_times)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.500   4.950   7.500   7.723   9.750  19.000

(If we ignore the fact that summary also calculates the sample mean) does summary produce the same values as five_number?

No, the estimates of the lower quartile differ. This is because the functions summary and five_number use different rules to estimate quantiles: summary calls quantile using type = 7 whereas five_number uses type = 6. If we call five_number with type = 7 we get the same numbers as summary.

> five_number(birth_times, type = 7)
  min   25%   50%   75%   max 
 1.50  4.95  7.50  9.75 19.00

In fact the function quantile has 9 different options for type. Use ?quantile for more information.

Sample mean

> mean(birth_times)
[1] 7.723158

Sample standard deviation and variance

> sd(birth_times)
[1] 3.568097
> var(birth_times)
[1] 12.73132
> sd(birth_times) ^ 2
[1] 12.73132

Measures of skewness

> # Standardized sample skewness
> skew(birth_times)
[1] 0.6254774

> # Sample quartile skewness
> q_skew(birth_times)
[1] -0.07216495

Until 2017/18 the STAT0002 notes gave -0.063 as the sample quartile skewness. This was because I used the default setting, type = 7, in the quantile function when calculating it …

> q_skew(birth_times, type = 7)
[1] -0.0625

Summary statistics for each day

We can also calculate summary statistics for each of the seven days of the week, i.e. for each of the columns of ox_mat. In the following the effect of the colMeans function is fairly obvious. apply is a useful function. Use ?apply to see what it does.

> five_number(ox_mat, na.rm = TRUE)
      day1    day2   day3 day4    day5   day6    day7
min  2.100  4.0000  2.600  1.5  2.5000  4.000  2.0000
25%  5.800  5.5500  5.000  4.7  4.0000  5.675  2.9125
50%  8.825  7.4000  7.500  8.1  7.4000  7.000  4.6000
75% 10.300 10.6875 10.575  9.5  8.7875 10.150  8.5000
max 19.000 16.5000 14.500 11.5 14.3000 14.600 10.7000

> summary(ox_mat)
      day1             day2             day3            day4       
 Min.   : 2.100   Min.   : 4.000   Min.   : 2.60   Min.   : 1.500  
 1st Qu.: 6.200   1st Qu.: 5.650   1st Qu.: 6.40   1st Qu.: 5.950  
 Median : 8.825   Median : 7.400   Median : 7.50   Median : 8.100  
 Mean   : 8.766   Mean   : 8.312   Mean   : 8.05   Mean   : 7.532  
 3rd Qu.:10.100   3rd Qu.:10.062   3rd Qu.:10.40   3rd Qu.: 9.350  
 Max.   :19.000   Max.   :16.500   Max.   :14.50   Max.   :11.500  
                                   NA's   :3       NA's   :5       
      day5             day6             day7       
 Min.   : 2.500   Min.   : 4.000   Min.   : 2.000  
 1st Qu.: 4.625   1st Qu.: 6.100   1st Qu.: 3.237  
 Median : 7.400   Median : 7.000   Median : 4.600  
 Mean   : 7.243   Mean   : 8.085   Mean   : 5.537  
 3rd Qu.: 8.300   3rd Qu.:10.100   3rd Qu.: 7.500  
 Max.   :14.300   Max.   :14.600   Max.   :10.700  
 NA's   :2        NA's   :3        NA's   :4

> colMeans(ox_mat, na.rm = TRUE)
    day1     day2     day3     day4     day5     day6     day7 
8.765625 8.312500 8.050000 7.531818 7.242857 8.084615 5.537500

> apply(ox_mat, 2, sd, na.rm = TRUE)
    day1     day2     day3     day4     day5     day6     day7 
4.296654 3.629348 3.733798 2.937880 3.565532 3.223313 2.887286

> skew(ox_mat, na.rm = TRUE)
[1]  0.6868543  0.8553875  0.2874617 -0.5639013  0.4105886  0.5136015  0.4524077