vignettes/stat0002-ch3-probability-vignette.Rmd
stat0002-ch3-probability-vignette.Rmd
This vignette provides R code to reproduce some of the content of Chapter 3 of the STAT0002 notes.
A summary of these data are available in the data frame
kerrich
. Use ?kerrich
for more
information.
> # This code produces the plot in Figure 3.1 of the STAT0002 notes
> plot(kerrich$throws, kerrich$heads / kerrich$throws,
+ ylab = "proportion of heads",
+ xlab = "number of throws (logarithmic scale)", lwd = 2, type = "l",
+ log = "x", ylim = c(0,1), axes = FALSE)
> abline(h = 0.5, lty = 2)
> axis(1, labels = as.character(c(3, 10, 30, 100, 300, 1000, 3000, 10000)),
+ at=c(3, 10, 30, 100, 300, 1000, 3000, 10000))
> axis(2, labels = c(0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0),
+ at=c(0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0))
We could extract from the data frame kerrich
the number
of trials (the 10,000 throws of the coin) and the number of heads. From
this we can calculate an estimate of the probability
P(H)
that the outcome of a trial is a head.
The object berkeley
is a 3-dimensional array that
contains information about applicants to graduate school at UC Berkeley
in 1973 for the six largest departments. Use ?UCBAdmissions
for more information. The 3 dimensions of the array correspond to the
gender of the applicant (dimension named Gender
), whether
or not they were admitted (named Admit
) and a letter code
for the department to which they applied (named Dept
). A
given entry in berkeley
gives the total number of
applicants in the corresponding (Admit
,
Gender
, Dept
) category. In Chapter 3 we
consider only the dimensions Admit
and
Gender
.
The following code collapses the 3-dimensional array to a 2-dimensional array by summing the frequencies over all the six departments, which gives the central part of Table 3.3 in the notes.
> # 2-way table: sex and outcome
> sex_outcome <- apply(berkeley, 2:1, FUN = sum)
> colnames(sex_outcome) <- c("A", "R")
> rownames(sex_outcome) <- c("M", "F")
> sex_outcome
Admit
Gender A R
M 1198 1493
F 557 1278
Now we add the column and row totals.
> # Add column totals
> sex_outcome <- rbind(sex_outcome, total = colSums(sex_outcome))
> # Add row totals
> sex_outcome <- cbind(sex_outcome, total = rowSums(sex_outcome))
> sex_outcome
A R total
M 1198 1493 2691
F 557 1278 1835
total 1755 2771 4526
We can divide by n (4526) to produce Table 3.5 of the notes.
> # Convert frequencies to probabilities
> pso <- sex_outcome/ sex_outcome[3, 3]
> round(pso, 3)
A R total
M 0.265 0.330 0.595
F 0.123 0.282 0.405
total 0.388 0.612 1.000
Notice that sex_outcome[3, 3]
extracts the [3, 3]
element of the matrix sex_outcome
, that is, 4526.
Can you use R to perform some of the calculations that are performed in Section 3.4 and/or 3.5 of the notes?
A summary of these data are available in the data frame
blood_types
.
> blood_types
ABO Rh percentage
1 O Rh+ 37
2 A Rh+ 35
3 B Rh+ 8
4 AB Rh+ 3
5 O Rh- 7
6 A Rh- 7
7 B Rh- 2
8 AB Rh- 1
One way to estimate the probabilities given on which Table 3.13 is
based is to use the aggregate
function.
> aggregate(percentage ~ Rh, data = blood_types, FUN = sum)
Rh percentage
1 Rh- 17
2 Rh+ 83
> aggregate(percentage ~ ABO, data = blood_types, FUN = sum)
ABO percentage
1 A 42
2 AB 4
3 B 10
4 O 44
To divide by 100 to convert the frequencies to estimates of probability we could do the following.