Administrative info
  PA2 due Monday
  HW5 due Tuesday
  MT1 grades up, μ ≈ 68.6, σ ≈ 15.8
  MT1 solutions will be posted with some more comments

  Recall that the conditional probability of event B given event A is
  Pr[B|A] = Pr[A ∩ B]/Pr[A]. Also recall that events A and B are
  independent if Pr[B|A] = Pr[B], or Pr[A ∩ B] = Pr[A] Pr[B].

Mutual Independence
  Recall the coin flipping example from last time. We argued that the
  probability of getting an outcome containing k heads in n flips of a
  biased coin with probability p of heads is p^k (1-p)^(n-k). Let's
  formalize this argument.

  Let A_i be the event that the ith flip comes up heads, and let ¬A_i
  denote its complement, the event that the ith flip comes up tails.
  It's reasonable to conclude that Pr[A_i] = p. Now suppose we have a
  particular outcome with k heads, such as the outcome ω = H^k
  T^(n-k) where the first k flips are heads and the rest are tails.
  This is the sole outcome in the event E = A_1 ∩ ... ∩ A_k
  ∩ ¬A_{k+1} ∩ ... ∩ ¬A_n, so Pr[ω] = Pr[E]. We've
  already argued that Pr[A_i] = p and Pr[¬A_i] = 1-p. How do
  we compute the probability of the intersection Pr[E]?

  We know from the definition of conditional probability that Pr[B|A]
  = Pr[A ∩ B]/Pr[A], so Pr[A ∩ B] = Pr[B|A] Pr[A]. We can
  generalize this to a product rule for n events A_1, ..., A_n:
    Pr[A_1 ∩ ... ∩ A_n]
      = Pr[A_1] * Pr[A_2|A_1] * Pr[A_3|A_1 ∩ A_2] * ...
        * Pr[A_n | A_1 ∩ ... ∩ A_{n-1}].
  We prove this using induction over n.
  Base case: n = 1. Then Pr[A_1] = Pr[A_1] is trivially true.
  Inductive hypothesis: For some n >= 1, the above equality holds.
  Inductive step: Then
    Pr[A_1 ∩ ... ∩ A_{n+1}]
      = Pr[(A_1 ∩ ... ∩ A_n) ∩ A_{n+1}]
      = Pr[A_{n+1}|A_1 ∩ ... ∩ A_n]
        * Pr[A_1 ∩ ... ∩ A_n]        (by def. of cond. prob.)
      = Pr[A_{n+1}|A_1 ∩ ... ∩ A_n]
        * Pr[A_1] * Pr[A_2|A_1] * Pr[A_3|A_1 ∩ A_2] * ...
        * Pr[A_n|A_1 ∩ ... ∩ A_{n-1}]                (by IH)
    as required.

  We can also define "mutual independence" for events. Events A_1,
  ..., A_n are mutually independent if for any i in [1, n] and any
  subset I ⊆ {1, ..., n}\{i} (i.e. any subset I that does not
  contain i), we have
    Pr[A_i|∩_{j∈I} A_j] = Pr[A_i].
  In other words, knowing that any combination of other events
  happened gives us no information on whether or not A_i happened.
  Then it follows from the product rule that
    Pr[A_1 ∩ ... ∩ A_n] = Pr[A_1] Pr[A_2] ... Pr[A_n].

  Note that mutual independence is not the same as pairwise
  independence. Consider a roll of a red die and a blue die. Let A be
  the event that the red die shows 1, B be the event that the blue die
  shows 1, and C be the event that the sum of the two dice is 7. A and
  B are clearly independent, and we showed last time that A and C are
  independent. Similarly, B and C are independent. (If you don't
  believe these claims, calculate each of the conditional
  probabilities Pr[A|B], Pr[A|C], Pr[B|C].) But it is impossible for
  both dice to show 1 and sum to 7, so Pr[A ∩ B ∩ C] = 0 ≠
  Pr[A] Pr[B] Pr[C] = 1/216, and they are not mutually independent.
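
  The dice claims above can be checked by brute force. The following is
  a minimal sketch, assuming the 36 (red, blue) rolls are equally
  likely; the helper names pr and both are illustrative and not part of
  the lecture.

```python
from fractions import Fraction
from itertools import product

# Sample space: all 36 equally likely (red, blue) rolls (assumed uniform).
outcomes = list(product(range(1, 7), repeat=2))

def pr(event):
    """Probability of an event given as a predicate on a (red, blue) roll."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

def both(*events):
    """Intersection of events: a predicate that holds when all of them hold."""
    return lambda o: all(e(o) for e in events)

A = lambda o: o[0] == 1          # red die shows 1
B = lambda o: o[1] == 1          # blue die shows 1
C = lambda o: o[0] + o[1] == 7   # the two dice sum to 7

# Pairwise independence: Pr[X ∩ Y] = Pr[X] Pr[Y] for every pair.
pairwise = all(pr(both(X, Y)) == pr(X) * pr(Y)
               for X, Y in [(A, B), (A, C), (B, C)])

# Mutual independence would also need Pr[A ∩ B ∩ C] = Pr[A] Pr[B] Pr[C].
triple = pr(both(A, B, C))
product_of_three = pr(A) * pr(B) * pr(C)

print(pairwise, triple, product_of_three)
```

  Exact fractions are used so that the equality checks are not affected
  by floating-point rounding.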

  Now we can go back to flipping coins. We have that E = A_1 ∩ ...
  ∩ A_k ∩ ¬A_{k+1} ∩ ... ∩ ¬A_n, where ¬A_i is the event that
  the ith flip comes up tails, and the A_i are mutually independent, so
    Pr[E] = Pr[A_1] ... Pr[A_k] Pr[¬A_{k+1}] ... Pr[¬A_n]
      = p * ... * p * (1-p) * ... * (1-p)
      = p^k (1-p)^(n-k).
  Since Pr[ω] = Pr[E], Pr[ω] = p^k (1-p)^(n-k).
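
  As a sanity check on this formula, here is a small sketch that
  multiplies per-flip probabilities for one outcome and then sums over
  all 2^n outcomes. The values p = 1/3, n = 5, k = 2 and the function
  name outcome_probability are arbitrary choices for illustration.

```python
from fractions import Fraction
from itertools import product

def outcome_probability(outcome, p):
    """Multiply p for each head and 1 - p for each tail (independent flips)."""
    prob = Fraction(1)
    for flip in outcome:
        prob *= p if flip == "H" else 1 - p
    return prob

p = Fraction(1, 3)   # illustrative bias, not from the lecture
n, k = 5, 2          # illustrative sizes

# The outcome with k heads followed by n - k tails.
omega = "H" * k + "T" * (n - k)
pr_omega = outcome_probability(omega, p)

# The probabilities of all 2^n outcomes should sum to 1.
total = sum(outcome_probability(o, p) for o in product("HT", repeat=n))

print(pr_omega, p**k * (1 - p)**(n - k), total)
```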

Tree Diagrams
  Suppose I have two coins in my pocket, one that is a fair coin and
  one that has two heads. If I take a random coin out of my pocket and
  flip it twice, what are the possible outcomes and their probabilities?

  We can draw a tree diagram to compute this. We can either choose the
  fair coin or the biased coin, each with probability 1/2. Let F be
  the event that we pick the fair coin, Pr[F] = 1/2. Let H_1 be the
  event that we get heads in the first flip, H_2 that we get heads in
  the second flip. Then if we used the fair coin, we get heads with
  probability Pr[H_1|F] = 1/2, tails with probability
  Pr[¬H_1|F] = 1/2.
  Similarly, if we used the fair coin and got heads on the first flip,
  we get heads on the second flip with probability Pr[H_2|F ∩ H_1]
  = 1/2, tails with probability Pr[¬H_2|F ∩ H_1] = 1/2.
  We can continue in this way until we've completed our decision tree.

    Pr[F] = 1/2   Pr[H_1|F] = 1/2   Pr[H_2|F ∩ H_1] = 1/2     1/8
  +-------------+-----------------+-------------------------- (f, h, h)
  |             |                 |
  |             |                 | Pr[¬H_2|F ∩ H_1] = 1/2    1/8
  |             |                 `-------------------------- (f, h, t)
  |             |
  |             | Pr[¬H_1|F] = 1/2  Pr[H_2|F ∩ ¬H_1] = 1/2    1/8
  |             `-----------------+-------------------------- (f, t, h)
  |                               |
  |                               | Pr[¬H_2|F ∩ ¬H_1] = 1/2   1/8
  |                               `-------------------------- (f, t, t)
  | Pr[¬F] = 1/2  Pr[H_1|¬F] = 1    Pr[H_2|¬F ∩ H_1] = 1      1/2
  `-------------+-----------------+-------------------------- (b, h, h)

  What is the probability of the outcome (f, h, h) in which we pick
  the fair coin and get two heads? This is the sole outcome in the
  event F ∩ H_1 ∩ H_2. We can compute its probability by
  multiplying the conditional probabilities along the edges from the
  root of the tree to the leaf corresponding to that outcome:
    Pr[(f, h, h)] = Pr[F ∩ H_1 ∩ H_2]
      = Pr[F] Pr[H_1|F] Pr[H_2|F ∩ H_1]
      = 1/8.
  Note that this follows from the product rule. We can compute the
  probabilities of the remaining outcomes in the same way, as given in
  the diagram above.
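
  The multiply-along-the-path procedure can be sketched in code. This
  is one possible encoding, assuming each tree node is a list of
  (label, edge probability, subtree) triples; the names flip_subtree
  and leaf_probabilities are hypothetical helpers, not from the lecture.

```python
from fractions import Fraction

HALF = Fraction(1, 2)

def flip_subtree(p_heads, flips_left):
    """Subtree for the remaining flips of a coin with Pr[heads] = p_heads."""
    if flips_left == 0:
        return None  # a leaf: no more choices to make
    branches = [("h", p_heads, flip_subtree(p_heads, flips_left - 1))]
    if p_heads < 1:  # the two-headed coin has no tails branch
        branches.append(("t", 1 - p_heads, flip_subtree(p_heads, flips_left - 1)))
    return branches

# Root choice: fair coin (f) or two-headed coin (b), each with probability 1/2.
tree = [
    ("f", HALF, flip_subtree(HALF, 2)),
    ("b", HALF, flip_subtree(Fraction(1), 2)),
]

def leaf_probabilities(node, path=(), prob=Fraction(1)):
    """Map each root-to-leaf path to the product of its edge probabilities."""
    if node is None:
        return {path: prob}
    leaves = {}
    for label, edge_prob, subtree in node:
        leaves.update(leaf_probabilities(subtree, path + (label,), prob * edge_prob))
    return leaves

leaves = leaf_probabilities(tree)
for outcome, prob in leaves.items():
    print(outcome, prob)
```

  Each product along a path is an instance of the product rule, so the
  leaf probabilities should sum to 1.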

  Tree diagrams are not necessarily unique; there may be more than one
  that adequately represents the sample space. In this example, we
  could have defined events HH, HT, TH, and TT, which together cover
  all possible results from flipping the coins twice (i.e. they
  partition the sample space; more on that later). Then the following
  two-level tree also represents the sample space:

    Pr[F] = 1/2   Pr[HH|F] = 1/4     1/8
  +-------------+---------------- (f, h, h)
  |             |
  |             | Pr[HT|F] = 1/4     1/8
  |             +---------------- (f, h, t)
  |             |
  |             | Pr[TH|F] = 1/4     1/8
  |             +---------------- (f, t, h)
  |             |
  |             | Pr[TT|F] = 1/4     1/8
  |             `---------------- (f, t, t)
  | Pr[¬F] = 1/2  Pr[HH|¬F] = 1      1/2
  `-------------+---------------- (b, h, h)

  Probabilities of outcomes are again computed using the product rule.
  (We will formalize later, when we talk about conditional
  independence, how we came up with Pr[HH|F] = 1/4. For now, it should
  be obvious that the probability of getting two heads once we've
  chosen the fair coin is 1/4.)

  Along with the coin flipping example above, this illustrates how
  probability models are constructed. We reduce an experiment to a
  sequence of simple choices and then use the product rule, computing
  conditional probabilities or relying on independence, to determine
  the probabilities of each outcome.

Bayes' Rule
  Recall our motivating example from last time. A pharmaceutical
  company is marketing a new test for HIV that it claims is 99%
  effective, meaning that it will report positive for 99% of people
  who have HIV and negative for 99% of those who don't have HIV.
  Suppose a random person takes the test and gets a positive test
  result. What is the probability that the person has HIV?

  Let A be the event that the person has HIV, B be the event that he
  tests positive. We know that if he has HIV, he will test positive
  with probability 0.99, so Pr[B|A] = 0.99. Similarly, he tests
  negative with probability 0.99 if he doesn't have HIV, so
  Pr[¬B|¬A] = 0.99, where ¬ denotes the complement of an event.
  We can also compute
    Pr[¬B|A] = 1 - Pr[B|A] = 0.01,
  and similarly, Pr[B|¬A] = 1 - Pr[¬B|¬A] = 0.01.

  Now we want to compute Pr[A|B]. How can we do so given the
  information we have? We can do the following:
    Pr[A|B] = Pr[A ∩ B] / Pr[B]        (by def. of cond. prob.)
      = Pr[B|A] * Pr[A] / Pr[B].           (by def. of cond. prob.)
  This is called Bayes' Rule.

  A "partition" of an event B is a set of mutually disjoint events
  A_1, ..., A_n such that B = A_1 ∪ ... ∪ A_n. Then we get
  Pr[B] = Pr[A_1 ∪ ... ∪ A_n] = Pr[A_1] + ... + Pr[A_n] since
  A_1, ..., A_n are mutually disjoint.

  Now suppose that A_1, ..., A_n partition Ω, the sample space
  as a whole. Then Pr[A_1 ∪ ... ∪ A_n] = Pr[A_1] + ... +
  Pr[A_n] = 1 ≠ Pr[B]. How can we get an expression for Pr[B] from
  these events? From a Venn diagram, we can see that A_1 ∩ B, A_2
  ∩ B, ..., A_n ∩ B partition B. So Pr[B] = Pr[(A_1 ∩ B)
  ∪ ... ∪ (A_n ∩ B)] = Pr[A_1 ∩ B] + ... + Pr[A_n
  ∩ B].

  Finally consider a single event A and its complement ¬A. Then the
  events A ∩ B and ¬A ∩ B are a partition of B. A Venn diagram shows
  that this is the case, but intuitively, any outcome in B is either
  in A and therefore in A ∩ B or is in ¬A and therefore in ¬A ∩ B.
  Then it follows that
    Pr[B] = Pr[A ∩ B] + Pr[¬A ∩ B].
  Equivalently, by using the definition of conditional probability,
    Pr[B] = Pr[B|A] Pr[A] + Pr[B|¬A] Pr[¬A]
      = Pr[B|A] Pr[A] + Pr[B|¬A] (1 - Pr[A]).
  Both of the above are known as the Total Probability Rule.
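
  As a quick illustration, the rule can be applied to the two-coin
  example from the tree diagrams to get the probability of heads on the
  first flip. This is a sketch only; the variable names are
  illustrative, and the cross-check simply adds up the leaves of the
  tree whose first flip is heads.

```python
from fractions import Fraction

pr_F = Fraction(1, 2)              # Pr[F]: we pick the fair coin
pr_H1_given_F = Fraction(1, 2)     # Pr[H_1|F]
pr_H1_given_not_F = Fraction(1)    # Pr[H_1|not F]: the two-headed coin

# Total Probability Rule, conditioning on F and its complement.
pr_H1 = pr_H1_given_F * pr_F + pr_H1_given_not_F * (1 - pr_F)

# Cross-check: sum the outcomes (f, h, h), (f, h, t), (b, h, h).
pr_H1_from_tree = Fraction(1, 8) + Fraction(1, 8) + Fraction(1, 2)

print(pr_H1, pr_H1_from_tree)
```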

  Combining Bayes' Rule and the Total Probability Rule, we get
    Pr[A|B] = Pr[B|A]Pr[A] / (Pr[B|A]Pr[A]+Pr[B|¬A](1-Pr[A])).

  Now we have almost everything we need, except that we don't have
  Pr[A], the probability that a random person has HIV. This turns out
  to be (in the US) 250 out of every million people, so Pr[A] =
  0.00025. Plugging into the above, we get
    Pr[A|B] = 0.99 * 0.00025 / (0.99 * 0.00025 + 0.01 * 0.99975)
      ≈ 0.024.
  So the person only has a 2.4% chance of having HIV! This is much
  smaller than the claimed 99% accuracy.

  This demonstrates that Pr[A|B], which is what we care about, can be
  very different from Pr[B|A], which is what the manufacturer is
  telling us. Confusing the two is known as a "base rate fallacy."

  Here's some intuition on the result. Suppose 4000 random people come
  in to get tested. Around 1 of the 4000 people will actually have HIV
  and will most likely test positive. Around 3999 people won't have
  HIV, but around 40 of them will test positive. So of the 41 people
  who test positive, only 1 actually has HIV, so a random person who
  tests positive has about a 1/41 chance of having HIV.

  Note, however, that this doesn't mean the test is useless. If a
  particular person goes in to be tested whose specific risk factors
  substantially increase Pr[A], then Pr[A|B] would be much higher.
  Suppose the person is a member of a subpopulation in which 1 in 5
  people have HIV.
    Pr[A|B] = 0.99 * 0.2 / (0.99 * 0.2 + 0.01 * 0.8)
      ≈ 0.96.
  So if the base rate is much higher, the test is far more effective
  at detecting HIV.
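
  Both calculations above use the same formula with a different base
  rate, which a short function makes explicit. This is a minimal
  sketch; the function name posterior and its parameter names are
  chosen here for illustration.

```python
def posterior(prior, true_positive_rate, false_positive_rate):
    """Pr[A|B] from Bayes' Rule combined with the Total Probability Rule.

    prior               -- Pr[A], the base rate
    true_positive_rate  -- Pr[B|A]
    false_positive_rate -- Pr[B|not A]
    """
    # Pr[B] = Pr[B|A] Pr[A] + Pr[B|not A] (1 - Pr[A])
    pr_positive = true_positive_rate * prior + false_positive_rate * (1 - prior)
    return true_positive_rate * prior / pr_positive

general_population = posterior(0.00025, 0.99, 0.01)   # base rate 250 per million
high_risk_group = posterior(0.2, 0.99, 0.01)          # base rate 1 in 5

print(round(general_population, 3), round(high_risk_group, 2))
```

  Only the prior changes between the two calls; the test's error rates
  are the same in both.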

  The takeaway here is that we can't ignore the base rate when
  evaluating the effectiveness of a test. While it doesn't make sense
  to blanket test the entire population, since its base rate is quite
  low, it does make sense to test subpopulations with much higher base