PA2 due Monday
HW5 due Tuesday
MT1 grades up, μ ≈ 68.6, σ ≈ 15.8
MT1 solutions will be posted with some more comments

Review
Recall that the conditional probability of event B given event A is
Pr[B|A] = Pr[A ∩ B]/Pr[A]. Also recall that events A and B are
independent if Pr[B|A] = Pr[B], or Pr[A ∩ B] = Pr[A] Pr[B].

Mutual Independence
Recall the coin flipping example from last time. We argued that the
probability of getting an outcome containing k heads in n flips of a
biased coin with probability p of heads is p^k (1-p)^(n-k). Let's
formalize this argument.

Let A_i be the event that the ith flip comes up heads, and let ¬A_i
denote its complement, the event that the ith flip comes up tails.
It's reasonable to conclude that Pr[A_i] = p. Now suppose we have a
particular outcome with k heads, such as the outcome ω = H^k
T^(n-k) where the first k flips are heads and the rest are tails.
This is the sole outcome in the event E = A_1 ∩ ... ∩ A_k
∩ ¬A_{k+1} ∩ ... ∩ ¬A_n, so Pr[ω] = Pr[E]. We've
already argued that Pr[A_i] = p and Pr[¬A_i] = 1-p. How do
we compute the probability of the intersection Pr[E]?

We know from the definition of conditional probability that Pr[B|A]
= Pr[A ∩ B]/Pr[A], so Pr[A ∩ B] = Pr[B|A] Pr[A]. We can
generalize this to a product rule for n events A_1, ..., A_n:
Pr[A_1 ∩ ... ∩ A_n]
= Pr[A_1] * Pr[A_2|A_1] * Pr[A_3|A_1 ∩ A_2] * ...
* Pr[A_n | A_1 ∩ ... ∩ A_{n-1}].
We prove this using induction over n.
Base case: n = 1. Then Pr[A_1] = Pr[A_1] is trivially true.
Inductive hypothesis: For some n >= 1, the above equality holds.
Inductive step: Then
Pr[A_1 ∩ ... ∩ A_{n+1}]
= Pr[(A_1 ∩ ... ∩ A_n) ∩ A_{n+1}]
= Pr[A_{n+1}|A_1 ∩ ... ∩ A_n]
* Pr[A_1 ∩ ... ∩ A_n]        (by def. of cond. prob.)
= Pr[A_{n+1}|A_1 ∩ ... ∩ A_n]
* Pr[A_1] * Pr[A_2|A_1] * Pr[A_3|A_1 ∩ A_2] * ...
* Pr[A_n|A_1 ∩ ... ∩ A_{n-1}]                (by IH)
as required.
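
As a sanity check, the product rule can be verified numerically on a small sample space. The sketch below uses two fair dice and three events chosen purely for illustration (they are assumptions, not part of the lecture), and exact fractions to avoid rounding.

```python
from fractions import Fraction
from itertools import product

# Sample space: ordered pairs (red, blue) from two fair six-sided dice.
omega = list(product(range(1, 7), repeat=2))

def pr(event):
    """Probability of an event (a set of outcomes) under the uniform measure."""
    return Fraction(len(event), len(omega))

def cond(b, a):
    """Pr[B|A] = Pr[A ∩ B] / Pr[A]; assumes Pr[A] > 0."""
    return pr(a & b) / pr(a)

# Hypothetical events, chosen only for illustration.
a1 = {w for w in omega if w[0] % 2 == 0}     # red die is even
a2 = {w for w in omega if w[0] + w[1] >= 8}  # sum is at least 8
a3 = {w for w in omega if w[1] >= 4}         # blue die is at least 4

# Left side: probability of the intersection, computed directly.
lhs = pr(a1 & a2 & a3)
# Right side: Pr[A_1] * Pr[A_2|A_1] * Pr[A_3|A_1 ∩ A_2].
rhs = pr(a1) * cond(a2, a1) * cond(a3, a1 & a2)
print(lhs, rhs)
```

Both sides agree, as the induction proof says they must.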

We can also define "mutual independence" for events. Events A_1,
..., A_n are mutually independent if for any i in [1, n] and any
subset I ⊆ {1, ..., n}\{i} (i.e. any subset I that does not
contain i), we have
Pr[A_i|∩_{j∈I} A_j] = Pr[A_i].
In other words, knowing that any combination of other events
happened gives us no information on whether or not A_i happened.
Then it follows from the product rule that
Pr[A_1 ∩ ... ∩ A_n] = Pr[A_1] Pr[A_2] ... Pr[A_n].

Note that mutual independence is not the same as pairwise
independence. Consider a roll of a red die and a blue die. Let A be
the event that the red die shows 1, B be the event that the blue die
shows 1, and C be the event that the sum of the two dice is 7. A and
B are clearly independent, and we showed last time that A and C are
independent. Similarly, B and C are independent. (If you don't
believe these claims, calculate each of the conditional
probabilities Pr[A|B], Pr[A|C], Pr[B|C].) But it is impossible for
both dice to show 1 and sum to 7, so Pr[A ∩ B ∩ C] = 0 ≠
Pr[A] Pr[B] Pr[C] = 1/216, and they are not mutually independent.
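
The dice example can be checked by enumerating all 36 outcomes. This is a minimal sketch; the helper name pr and the set-based representation of events are choices made here for illustration.

```python
from fractions import Fraction
from itertools import product

# Sample space: ordered pairs (red, blue) from two fair six-sided dice.
omega = list(product(range(1, 7), repeat=2))

def pr(event):
    """Probability of an event (a set of outcomes) under the uniform measure."""
    return Fraction(len(event), len(omega))

A = {w for w in omega if w[0] == 1}          # red die shows 1
B = {w for w in omega if w[1] == 1}          # blue die shows 1
C = {w for w in omega if w[0] + w[1] == 7}   # sum of the dice is 7

# Pairwise independence: Pr[X ∩ Y] = Pr[X] Pr[Y] for every pair.
pairwise = all(pr(x & y) == pr(x) * pr(y) for x, y in [(A, B), (A, C), (B, C)])

# Mutual independence would additionally require the triple product to match.
mutual = pr(A & B & C) == pr(A) * pr(B) * pr(C)

print(pairwise, mutual)  # True False
```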

Now we can go back to flipping coins. We have that E = A_1 ∩ ...
∩ A_k ∩ ¬A_{k+1} ∩ ... ∩ ¬A_n (¬A_i being the complement of A_i,
the event that the ith flip comes up tails), and the A_i are
mutually independent, so
Pr[E] = Pr[A_1] ... Pr[A_k] Pr[¬A_{k+1}] ... Pr[¬A_n]
= p * ... * p * (1-p) * ... * (1-p)
= p^k (1-p)^(n-k).
Since Pr[ω] = Pr[E], Pr[ω] = p^k (1-p)^(n-k).
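
The calculation above can be sketched in code: multiply the per-flip probabilities, as mutual independence allows, and compare against the closed form. The values n = 5, k = 2, p = 3/10 are hypothetical, picked only to make the check concrete.

```python
from fractions import Fraction
from itertools import product

def outcome_probability(outcome, p):
    """Multiply per-flip probabilities, relying on mutual independence of the flips."""
    prob = Fraction(1)
    for flip in outcome:
        prob *= p if flip == "H" else 1 - p
    return prob

# Hypothetical parameters for illustration.
n, k, p = 5, 2, Fraction(3, 10)

# The outcome with k heads followed by n-k tails.
omega_k = "H" * k + "T" * (n - k)
direct = outcome_probability(omega_k, p)
formula = p**k * (1 - p) ** (n - k)

# Sanity check: the probabilities of all 2^n outcomes sum to 1.
total = sum(outcome_probability(o, p) for o in product("HT", repeat=n))

print(direct == formula, total)
```

Any other ordering of the same k heads and n-k tails gives the same product, since multiplication is commutative.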

Tree Diagrams
Suppose I have two coins in my pocket, one that is a fair coin and
one that has two heads. If I take a random coin out of my pocket and
flip it twice, what are the possible outcomes and their
probabilities?

We can draw a tree diagram to compute this. We can either choose the
fair coin or the biased coin, each with probability 1/2. Let F be
the event that we pick the fair coin, Pr[F] = 1/2. Let H_1 be the
event that we get heads in the first flip, H_2 that we get heads in
the second flip, and write ¬X for the complement of an event X. Then
if we used the fair coin, we get heads with
probability Pr[H_1|F] = 1/2, tails with probability
Pr[¬H_1|F] = 1/2.
Similarly, if we used the fair coin and got heads on the first flip,
we get heads on the second flip with probability Pr[H_2|F ∩ H_1]
= 1/2, tails with probability Pr[¬H_2|F ∩ H_1] = 1/2.
We can continue in this way until we've completed our decision tree.

  Pr[F] = 1/2   Pr[H_1|F] = 1/2   Pr[H_2|F ∩ H_1] = 1/2     1/8
+-------------+-----------------+----------------------- (f, h, h)
|             |                 |
|             |                 | Pr[¬H_2|F ∩ H_1] = 1/2    1/8
|             |                 `----------------------- (f, h, t)
|             |
|             | Pr[¬H_1|F] = 1/2  Pr[H_2|F ∩ ¬H_1] = 1/2    1/8
|             `-----------------+----------------------- (f, t, h)
|                               |
|                               | Pr[¬H_2|F ∩ ¬H_1] = 1/2   1/8
|                               `----------------------- (f, t, t)
|
| Pr[¬F] = 1/2  Pr[H_1|¬F] = 1    Pr[H_2|¬F ∩ H_1] = 1      1/2
`-------------+-----------------+----------------------- (b, h, h)

What is the probability of the outcome (f, h, h) in which we pick
the fair coin and get two heads? This is the sole outcome in the
event F ∩ H_1 ∩ H_2. We can compute its probability by
multiplying the conditional probabilities along the edges from the
root of the tree to the leaf corresponding to that outcome:
Pr[(f, h, h)] = Pr[F ∩ H_1 ∩ H_2]
= Pr[F] Pr[H_1|F] Pr[H_2|F ∩ H_1]
= 1/8.
Note that this follows from the product rule. We can compute the
probabilities of the remaining outcomes in the same way, as given in
the diagram above.
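
The root-to-leaf multiplication can be written out as a short sketch. The dictionary layout and the name path_probability are illustrative choices made here; the edge probabilities are the ones from the two-coin example.

```python
from fractions import Fraction
from math import prod

half = Fraction(1, 2)
one = Fraction(1)

# Each leaf maps to the edge probabilities on its root-to-leaf path:
# (coin choice, first flip given the coin, second flip given coin and first flip).
paths = {
    ("f", "h", "h"): [half, half, half],
    ("f", "h", "t"): [half, half, half],
    ("f", "t", "h"): [half, half, half],
    ("f", "t", "t"): [half, half, half],
    ("b", "h", "h"): [half, one, one],  # the two-headed coin always shows heads
}

def path_probability(edges):
    """Product rule: multiply the conditional probabilities along the path."""
    return prod(edges)

leaf_probs = {leaf: path_probability(edges) for leaf, edges in paths.items()}

for leaf, p in leaf_probs.items():
    print(leaf, p)
```

The leaf probabilities sum to 1, which is a quick way to catch a mislabeled edge.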

Tree diagrams are not necessarily unique; there may be more than one
that adequately represents the sample space. In this example, we
could have defined events HH, HT, TH, and TT, which together cover
all possible results from flipping the coins twice (i.e. they
partition the sample space; more on that later). Then the following
two-level tree also represents the sample space:

  Pr[F] = 1/2   Pr[HH|F] = 1/4     1/8
+-------------+---------------- (f, h, h)
|             |
|             | Pr[HT|F] = 1/4     1/8
|             +---------------- (f, h, t)
|             |
|             | Pr[TH|F] = 1/4     1/8
|             +---------------- (f, t, h)
|             |
|             | Pr[TT|F] = 1/4     1/8
|             `---------------- (f, t, t)
|
| Pr[¬F] = 1/2  Pr[HH|¬F] = 1      1/2
`-------------+---------------- (b, h, h)

Probabilities of outcomes are again computed using the product rule.
(We will formalize later, when we talk about conditional
independence, how we came up with Pr[HH|F] = 1/4. For now, it should
be obvious that the probability of getting two heads once we've
chosen the fair coin is 1/4.)

Along with the coin flipping example above, this illustrates how
probability models are constructed. We reduce an experiment to a
sequence of simple choices and then use the product rule, computing
conditional probabilities or relying on independence, to determine
the probabilities of each outcome.

Bayes' Rule
Recall our motivating example from last time. A pharmaceutical
company is marketing a new test for HIV that it claims is 99%
effective, meaning that it will report positive for 99% of people
who have HIV and negative for 99% of those who don't have HIV.
Suppose a random person takes the test and gets a positive test
result. What is the probability that the person has HIV?

Let A be the event that the person has HIV, B be the event that he
tests positive, and let ¬A and ¬B denote their complements. We know
that if he has HIV, he will test positive
with probability 0.99, so Pr[B|A] = 0.99. Similarly, he tests
negative with probability 0.99 if he doesn't have HIV, so
Pr[¬B|¬A] = 0.99.
We can also compute
Pr[¬B|A]
= 1 - Pr[B|A] = 0.01,
and similarly, Pr[B|¬A] = 0.01.

Now we want to compute Pr[A|B]. How can we do so given the
information we have? We can do the following:
Pr[A|B] = Pr[A ∩ B] / Pr[B]        (by def. of cond. prob.)
= Pr[B|A] * Pr[A] / Pr[B].           (by def. of cond. prob.)
This is called Bayes' Rule.

A "partition" of an event B is a set of mutually disjoint events
A_1, ..., A_n such that B = A_1 ∪ ... ∪ A_n. Then we get
Pr[B] = Pr[A_1 ∪ ... ∪ A_n] = Pr[A_1] + ... + Pr[A_n] since
A_1, ..., A_n are mutually disjoint.

Now suppose that A_1, ..., A_n partition Ω, the sample space
as a whole. Then Pr[A_1 ∪ ... ∪ A_n] = Pr[A_1] + ... +
Pr[A_n] = 1 ≠ Pr[B]. How can we get an expression for Pr[B] from
these events? From a Venn diagram, we can see that A_1 ∩ B, A_2
∩ B, ..., A_n ∩ B partition B. So Pr[B] = Pr[(A_1 ∩ B)
∪ ... ∪ (A_n ∩ B)] = Pr[A_1 ∩ B] + ... + Pr[A_n
∩ B].

Finally consider a single event A and its complement ¬A. Then the
events A ∩ B and ¬A ∩ B are a partition of B. A Venn diagram shows
that this
is the case, but intuitively, any outcome in B is either in A and
therefore in A ∩ B or is in ¬A and therefore in ¬A
∩ B. Then it follows that
Pr[B] = Pr[A ∩ B] + Pr[¬A ∩ B].
Equivalently, by using the definition of conditional probability,
Pr[B] = Pr[B|A] Pr[A] + Pr[B|¬A] Pr[¬A]
= Pr[B|A] Pr[A] + Pr[B|¬A] (1 - Pr[A]).
Both of the above are known as the Total Probability Rule.

Combining Bayes' Rule and the Total Probability Rule, we get
Pr[A|B] = Pr[B|A]Pr[A] / (Pr[B|A]Pr[A]+Pr[B|¬A](1-Pr[A])).

Now we have almost everything we need, except that we don't have
Pr[A], the probability that a random person has HIV. This turns out
to be (in the US) 250 out of every million people, so Pr[A] =
0.00025. Plugging into the above, we get
Pr[A|B] = 0.99 * 0.00025 / (0.99 * 0.00025 + 0.01 * 0.99975)
≈ 0.024.
So the person only has a 2.4% chance of having HIV! This is much
smaller than the claimed 99% accuracy.
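
The computation above is easy to reproduce. In this sketch the function name posterior and the parameter names sensitivity (for Pr[B|A]) and specificity (for the probability of a negative test given no HIV) are labels introduced here, not terms from the lecture.

```python
def posterior(prior, sensitivity, specificity):
    """Pr[A|B]: probability of having the condition given a positive test.

    Combines Bayes' Rule with the Total Probability Rule.
    """
    true_pos = sensitivity * prior                # Pr[B|A] Pr[A]
    false_pos = (1 - specificity) * (1 - prior)   # Pr[B|not A] (1 - Pr[A])
    return true_pos / (true_pos + false_pos)

p = posterior(prior=0.00025, sensitivity=0.99, specificity=0.99)
print(f"{p:.3f}")  # 0.024
```

Changing only the prior argument shows how strongly the answer depends on the base rate.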

This demonstrates that Pr[A|B], which is what we care about, can be
very different from Pr[B|A], which is what the manufacturer is
telling us. Confusing the two is known as a "base rate fallacy."

Here's some intuition on the result. Suppose 4000 random people come
in to get tested. Around 1 of the 4000 people will actually have HIV
and will most likely test positive. Around 3999 people won't have
HIV, but around 40 of them will test positive. So of the 41 people
who test positive, only 1 actually has HIV, so a random person who
tests positive has about a 1/41 chance of having HIV.
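
The same head count can be done with expected values instead of rounded whole people. This is a rough sketch; the variable names are illustrative, and the numbers are the ones used in the example.

```python
population = 4000
base_rate = 0.00025   # Pr[A]: 250 per million
accuracy = 0.99       # correct result for 99% of both groups

# Expected number of people in each group.
infected = population * base_rate                            # about 1
true_positives = infected * accuracy                         # about 1
false_positives = (population - infected) * (1 - accuracy)   # about 40

# Fraction of positive tests that come from people who actually have HIV.
share = true_positives / (true_positives + false_positives)
print(f"{share:.3f}")  # 0.024
```

Without rounding the counts, the share agrees with the Bayes' Rule result; the 1/41 figure is the rounded version of the same ratio.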

Note, however, that this doesn't mean the test is useless. If a
particular person goes in to be tested whose specific risk factors
substantially increase Pr[A], then Pr[A|B] would be much higher.
Suppose the person is a member of a subpopulation in which 1 in 5
people have HIV.
Then
Pr[A|B] = 0.99 * 0.2 / (0.99 * 0.2 + 0.01 * 0.8)
≈ 0.96.
So if the base rate is much higher, the test is far more effective
at detecting HIV.

The takeaway here is that we can't ignore the base rate when
evaluating the effectiveness of a test. While it doesn't make sense
to blanket test the entire population, since its base rate is quite
low, it does make sense to test subpopulations with much higher base
rates.