Chapter 4 More probability

In this chapter, we consider two results that involve conditional probabilities, the law of total probability and Bayes’ theorem.

4.1 Law of total probability

We continue with the Berkeley admissions data. We have already calculated that \(P(A)=0.388\). Now we calculate this in a different way.

From Table 3.2 we can see that \[ n(A) = n(A , M) + n(A , F). \] From Table 3.6 we can see that \[\begin{equation} P(A) = P(A , M) + P(A , F). \tag{4.1} \end{equation}\]

This also follows from equation (3.6) since the events \((A , M)\) and \((A , F)\) are mutually exclusive. That is, \[\begin{eqnarray} P(A) &=& P((A , M) \mbox{ or } (A , F) ), \\ &=& P(A , M) + P(A , F) - P((A , M) , (A , F)), \\ &=& P(A , M) + P(A , F). \end{eqnarray}\]

Applying the multiplication rule (3.7) to (4.1) gives \[\begin{eqnarray} P(A) &=& P(A , M) + P(A , F), \\ &=& P(A \mid M)\,P(M) + P(A \mid F)\,P(F), \\ &=& 0.445 \times 0.595 + 0.304 \times 0.405, \\ &=& 0.388. \end{eqnarray}\] This is an example of the law of total probability. The probability of event \(A\) is expressed as a weighted average of the conditional probability of event \(A\) given that \(M\) has occurred and the conditional probability of event \(A\) given that \(F\) has occurred. Each conditional probability is given a weight equal to the probability of the event on which it is conditioned.
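A quick numerical check of this calculation: the minimal Python sketch below (the variable names are ours, purely for illustration) computes \(P(A)\) as the weighted average described above.

    # Law of total probability for the Berkeley admissions example:
    # P(A) = P(A | M) P(M) + P(A | F) P(F)
    p_A_given_M = 0.445  # P(A | M), as used in the text above
    p_A_given_F = 0.304  # P(A | F)
    p_M = 0.595          # P(M)
    p_F = 0.405          # P(F)

    p_A = p_A_given_M * p_M + p_A_given_F * p_F
    print(round(p_A, 3))  # 0.388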

Note that

  • \(M\) and \(F\) are mutually exclusive events.
  • \(P(M \mbox{ or } F) = 1\), that is, together \(M\) and \(F\) cover the entire sample space of the possible values of sex. This means that \(M\) and \(F\) are exhaustive events. More generally, a set of exhaustive events has the property that at least one of these events must occur.

Definition. The law of total probability. Let events \(B_1, \ldots, B_n\) be

  • possible, i.e. \(P(B_i) > 0\), for \(i = 1, \ldots, n\),
  • (pairwise) mutually exclusive, i.e. \(P(B_i , B_j)=0\) for all \(i \neq j\), and
  • exhaustive, that is, \(P(B_1 \mbox{ or } B_2 \mbox{ or } \cdots \mbox{ or } B_n) = 1\).

Then, for any event \(A\) \[\begin{eqnarray} P(A) &=& P(A \mid B_1)\,P(B_1) + \cdots + P(A \mid B_n)\,P(B_n), \\ &=& \displaystyle\sum_{i=1}^n P(A \mid B_i)\,P(B_i). \\ \end{eqnarray}\]

Alternatively, \[\begin{eqnarray} P(A) &=& P(A, B_1) + \cdots + P(A, B_n), \\ &=& \displaystyle\sum_{i=1}^n P(A, B_i). \\ \end{eqnarray}\]
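As an illustration of the general statement, here is a minimal Python sketch (the function name and inputs are ours, not part of the text) that computes \(P(A)\) from the conditional probabilities \(P(A \mid B_i)\) and the probabilities \(P(B_i)\).

    def total_probability(p_A_given_B, p_B):
        # Law of total probability: P(A) = sum_i P(A | B_i) P(B_i).
        # The B_i are assumed possible, mutually exclusive and exhaustive.
        assert abs(sum(p_B) - 1.0) < 1e-9, "the B_i must be exhaustive"
        return sum(pa * pb for pa, pb in zip(p_A_given_B, p_B))

    # Berkeley example with n = 2: B_1 = M, B_2 = F
    print(total_probability([0.445, 0.304], [0.595, 0.405]))  # ~0.388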

Some books refer to the law of total probability as the partition theorem. This is because events \(B_1, \ldots, B_n\) that are both mutually exclusive and exhaustive are said to partition the sample space, that is, they split the sample space into \(n\) disjoint parts.

4.2 Bayes’ theorem

We continue with the Berkeley admissions data. We have already calculated that \(P(A \mid M) = 0.445\). Now we calculate \(P(M \mid A)\). It is important that you understand the difference between these two probabilities.

\(P(A \mid M)\) is the probability of acceptance given that the applicant is male; in other words, the probability that a male applicant is accepted. We calculated that \(P(A \mid M)=0.445\) using the \(M\) row of Table 3.7 (or Table 3.8), that is, by conditioning on the event \(M\).

\(P(M \mid A)\) is the probability that the applicant is male given that they are accepted; in other words, the probability that an accepted applicant is male.

Exercise. Use Table 4.1 or Table 4.2 to show that \(P(M \mid A) = 0.683\).

Table 4.1: Conditioning on applicant being accepted (frequencies).

                  A       R   total
    M          1198    1493    2691
    F           557    1278    1835
    total      1755    2771    4526
Table 4.2: Conditioning on applicant being accepted (probabilities).

                  A       R   total
    M         0.265   0.330   0.595
    F         0.123   0.282   0.405
    total     0.388   0.612   1.000

Now we calculate \(P(M \mid A)\) in a different way. If you used table 4.2 to calculate \(P(M \mid A)\) you used the equation \[\begin{equation} P(M \mid A) = \frac{P(M , A)}{P(A)}. \tag{4.2} \end{equation}\] The multiplication rule in Section 3.6 gives \[\begin{equation} P(M , A) = P(A~|~M)\,P(M). \tag{4.3} \end{equation}\] Substituting (4.2) into (4.3) gives \[\begin{equation} P(M \mid A) = \frac{P(A \mid M)\,P(M)}{P(A)}. \tag{4.4} \end{equation}\] Equation (4.4) is an example of Bayes’ theorem, which was first derived in a paper presented to the Royal Society in 1763 by Richard Price on behalf of the late Reverend Thomas Bayes.

In this example we can either calculate \(P(A)\) using the law of total probability: \[ P(A) = P(A \mid M)\,P(M) + P(A \mid F)\,P(F), \] or calculate \(P(A)\) directly from the table.
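To make the two routes to \(P(M \mid A)\) concrete, the following Python sketch (variable names are ours) computes the probability both directly from the joint probabilities in Table 4.2 and via Bayes’ theorem, with the law of total probability supplying the denominator.

    # Route 1: directly from the joint probabilities in Table 4.2
    p_M_and_A = 0.265
    p_A = 0.388
    print(p_M_and_A / p_A)          # ~0.683

    # Route 2: Bayes' theorem, with P(A) from the law of total probability
    p_A_given_M, p_M = 0.445, 0.595
    p_A_given_F, p_F = 0.304, 0.405
    p_A = p_A_given_M * p_M + p_A_given_F * p_F
    print(p_A_given_M * p_M / p_A)  # ~0.683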

Bayes’ theorem. Let \(B_1, \ldots, B_n\) be mutually exclusive, exhaustive events, with \(P(B_i) > 0\) for all \(i\). Let \(A\) be an event with \(P(A) > 0\). Then \[\begin{eqnarray} P(B_i \mid A) &=& \frac{P(A \mid B_i)\,P(B_i)}{P(A)}, \\ &=& \frac{P(A \mid B_i)\,P(B_i)}{P(A \mid B_1)\,P(B_1) + \cdots + P(A \mid B_n)\,P(B_n)}, \\ &=& \frac{P(A \mid B_i)\,P(B_i)}{\displaystyle\sum_{j=1}^n P(A \mid B_j)\,P(B_j)}. \end{eqnarray}\] The proof of Bayes’ theorem is a straightforward extension of the case with \(n=2\) considered in the Berkeley admissions example above.
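The general statement translates directly into code. Here is a minimal Python sketch (the function name is ours) that returns all the posterior probabilities \(P(B_i \mid A)\) at once.

    def bayes(p_A_given_B, p_B):
        # Bayes' theorem for mutually exclusive, exhaustive B_1, ..., B_n
        # with P(B_i) > 0: returns [P(B_1 | A), ..., P(B_n | A)].
        joint = [pa * pb for pa, pb in zip(p_A_given_B, p_B)]  # P(A, B_i)
        p_A = sum(joint)                                       # law of total probability
        return [j / p_A for j in joint]

    # Berkeley example: P(M | A) and P(F | A)
    print(bayes([0.445, 0.304], [0.595, 0.405]))  # ~[0.683, 0.317]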

Conditioning on more than one event

\(P(A \mid B)\) is the conditional probability that event \(A\) occurs given that event \(B\) has occurred. We can extend this idea to condition on more than one event.

For example, \(P(A \mid B , C)\), or \(P(A \mid B \mbox{ and } C)\), or \(P(A \mid B \cap C)\) is the conditional probability that event \(A\) occurs given that both events \(B\) and \(C\) have occurred. The general principle is that we have conditioned on all events that are placed on the right hand side of the conditional \(\mid\) symbol. All the results that we have seen can be extended to probabilities conditioned on more than one event.

For example, for \(P(B, C)>0\), \[ P(A \mid B, C) = \frac{P(A , B \mid C)}{P(B \mid C)} \qquad \mbox{(definition of conditional probability)}, \]

and if, in addition, \(P(A, C)>0\) \[ P(A \mid B, C) = \frac{P(B \mid A, C)\,P(A \mid C)}{P(B \mid C)} \qquad \mbox{(Bayes' theorem)}. \]

In each of these equations, if you ignore the event \(C\) you will see a familiar result. The general idea is that the definition of conditional probability and Bayes’ theorem continue to hold if we condition all probabilities on the event \(C\), provided that all the conditional probabilities involved are well defined.

Alternatively, noting that we could reverse the roles of \(B\) and \(C\), if in addition \(P(A, B)>0\), then \[ P(A \mid C, B) = \frac{P(A , C \mid B)}{P(C \mid B)} \qquad \mbox{(definition of conditional probability)}, \] \[ P(A \mid C, B) = \frac{P(C \mid A, B)\,P(A \mid B)}{P(C \mid B)} \qquad \mbox{(Bayes' theorem)}. \] These four expressions give different ways to express the probability \(P(A \mid C, B)\). You will be able to find other ways to express this probability. For example,
\[ P(A \mid B,C) = \frac{P(A, B, C)}{P(B, C)}. \] Exercise. Why is this true?
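These identities can be checked numerically on any joint distribution. The Python sketch below uses a toy joint distribution (the probabilities are invented purely for illustration) over the eight combinations of \(A\), \(B\) and \(C\), and verifies that the expressions for \(P(A \mid B, C)\) agree.

    # Toy joint distribution: each key records whether A, B and C occur.
    # The probabilities are invented and sum to 1.
    joint = {
        (True,  True,  True):  0.10, (True,  True,  False): 0.15,
        (True,  False, True):  0.05, (True,  False, False): 0.10,
        (False, True,  True):  0.20, (False, True,  False): 0.10,
        (False, False, True):  0.05, (False, False, False): 0.25,
    }

    def P(a=None, b=None, c=None):
        # Probability that the specified events take the specified values;
        # unspecified events are summed over.
        return sum(p for (A, B, C), p in joint.items()
                   if (a is None or A == a)
                   and (b is None or B == b)
                   and (c is None or C == c))

    # P(A | B, C) computed three ways
    direct  = P(True, True, True) / P(b=True, c=True)    # P(A, B, C) / P(B, C)
    def_cp  = (P(True, True, True) / P(c=True)) / (P(b=True, c=True) / P(c=True))
    bayes_c = ((P(True, True, True) / P(a=True, c=True)) # P(B | A, C)
               * (P(a=True, c=True) / P(c=True))         # * P(A | C)
               / (P(b=True, c=True) / P(c=True)))        # / P(B | C)
    print(direct, def_cp, bayes_c)  # all three are equal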

Misleading statistical evidence in cot death trials (continued)

We will return to this example and use Bayes’ theorem to calculate the probability that Sally Clark was innocent given (only) the statistical evidence presented at her trial. We will make some assumptions that we know are unrealistic, but the general approach that we take will illustrate the importance of using sound probabilistic reasoning.

4.3 DNA identification evidence

DNA evidence is increasingly used to identify and prosecute criminal suspects. The following example is based on a real criminal case.

In 1996 Denis John Adams was put on trial for rape. Apart from the fact that he lived in the area where the crime was committed, the only evidence against him was that his DNA matched a sample of DNA obtained from the victim. In fact, all the other evidence was in favour of Adams: the victim did not pick him out of an identity parade; the victim said he did not look like her attacker, who she said was in his early 20s (Adams was 37); and Adams had an alibi.

At Adams’s trial the Prosecution said that the match probability, the probability that Adams’s DNA would match the DNA evidence if he were innocent, was 1 in 200 million. The Defence disagreed, saying that 1 in 20 million or even 1 in 2 million was correct.

At the trial it was stated that there were approximately 150,000 males in the local area between 18 and 60 years old who, before any evidence was collected, could have been suspects.

Questions

  • Do you think the evidence against Adams is very strong?
  • If you were on the jury would you have voted ‘guilty’?
  • Would you want to do any calculations first? If so, what would you calculate?
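
If you would like a starting point for the last question, here is a minimal Python sketch of the kind of Bayes’ theorem calculation one might try. The assumptions are ours and are only for illustration: a prior under which each of the 150,000 local males is equally likely to be the attacker, a probability of 1 that the guilty man’s DNA matches, and the match probability for an innocent man taken from the figures quoted at the trial.

    def posterior_source(n_possible, match_prob):
        # P(suspect is the source | DNA match), assuming each of n_possible
        # men is equally likely a priori and an innocent man matches
        # with probability match_prob.
        prior = 1.0 / n_possible
        p_match = 1.0 * prior + match_prob * (1.0 - prior)  # law of total probability
        return prior / p_match                              # Bayes' theorem

    print(posterior_source(150_000, 1 / 2_000_000))    # ~0.93  (Defence's figure)
    print(posterior_source(150_000, 1 / 200_000_000))  # ~0.999 (Prosecution's figure)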