PA2 due today
HW5 due tomorrow

Review
Recall that the conditional probability of event B given event A is
Pr[B|A] = Pr[A ∩ B]/Pr[A]. Also recall that events A and B are
independent if Pr[B|A] = Pr[B], or equivalently Pr[A ∩ B] = Pr[A] Pr[B].

Further recall the general product rule:
Pr[A_1 ∩ ... ∩ A_n]
= Pr[A_1] * Pr[A_2|A_1] * Pr[A_3|A_1 ∩ A_2] * ...
* Pr[A_n | A_1 ∩ ... ∩ A_{n-1}].

Events A_1, ..., A_n are mutually independent if for any i in [1, n]
and any subset I ⊆ {1, ..., n}\{i} (i.e. any subset I that does
not contain i), we have
Pr[A_i|∩_{j∈I} A_j] = Pr[A_i].
Then it follows from the product rule that
Pr[A_1 ∩ ... ∩ A_n] = Pr[A_1] Pr[A_2] ... Pr[A_n].

Recall Bayes' Rule:
Pr[A|B] = Pr[B|A] * Pr[A] / Pr[B].

Recall the variations of the Total Probability Rule, where Ā denotes
the complement of A:
Pr[B] = Pr[A ∩ B] + Pr[Ā ∩ B]
= Pr[B|A] Pr[A] + Pr[B|Ā] Pr[Ā]
= Pr[B|A] Pr[A] + Pr[B|Ā] (1 - Pr[A]).

Combining Bayes' Rule and the Total Probability Rule, we get
Pr[A|B] = Pr[B|A]Pr[A] / (Pr[B|A]Pr[A] + Pr[B|Ā](1-Pr[A])).

Base Rates
Recall the HIV test from last time. We defined A to be the event
that a random person has HIV, and B to be the event that the person
tests positive. Writing Ā and B̄ for the complements of A and B, we
computed Pr[B|A] = 0.99, Pr[B̄|Ā] = 0.99, and Pr[B|Ā] = 0.01.

We then computed that if we have a base rate of Pr[A] = 0.00025 in
the entire population, then
Pr[A|B] = 0.99 * 0.00025 / (0.99 * 0.00025 + 0.01 * 0.99975)
≈ 0.024.
This tells us that blanket testing the entire population is not a
good idea, since the test will produce far more false positives
than actual positives.

What if we only tested a subpopulation with a higher risk factor for
HIV, say in which 1 in 5 people are infected? That changes the base
rate to Pr[A] = 0.2, and we get
Pr[A|B] = 0.99 * 0.2 / (0.99 * 0.2 + 0.01 * 0.8)
≈ 0.96.
So if the base rate is much higher, the test is far more effective
at detecting HIV. And if you have a high risk factor, this is a test
you'd want to take.
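
To make the effect of the base rate concrete, here is a minimal Python
sketch of the combined Bayes' Rule / Total Probability Rule formula. The
function name posterior and its parameter names are our own choices for
illustration; the numbers are the ones used above.

```python
def posterior(sensitivity, false_positive_rate, base_rate):
    """Pr[A|B] from Pr[B|A], Pr[B|not A], and the base rate Pr[A]."""
    true_pos = sensitivity * base_rate                  # Pr[B|A] Pr[A]
    false_pos = false_positive_rate * (1 - base_rate)   # Pr[B|not A] (1 - Pr[A])
    return true_pos / (true_pos + false_pos)

# Whole population versus the higher-risk subpopulation.
general = posterior(0.99, 0.01, 0.00025)
high_risk = posterior(0.99, 0.01, 0.2)
print(f"general: {general:.3f}, high risk: {high_risk:.3f}")
```

The same test gives very different answers purely because the third
argument, the base rate, changes.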

The takeaway here is that we can't ignore the base rate when
evaluating the effectiveness of a test. While it doesn't make sense
to blanket test the entire population, since its base rate is quite
low, it does make sense to test subpopulations with much higher base
rates.

Inclusion/Exclusion
Recall the inclusion/exclusion principle for events A and B:
Pr[A ∪ B] = Pr[A] + Pr[B] - Pr[A ∩ B].
We count outcomes in A and in B, but that double counts outcomes in
both, so we adjust by subtracting them off.
What if we have three events? We get
Pr[A ∪ B ∪ C]
= Pr[A] + Pr[B] + Pr[C]
- Pr[A ∩ B] - Pr[A ∩ C] - Pr[B ∩ C]
+ Pr[A ∩ B ∩ C].
By counting outcomes in A, B, and C, we double count those that
appear in any pair of A, B, C, so we subtract those off. However, if
an outcome appears in all three of A, B, C, then we've added three
copies in the first line and subtracted three copies in the second
line, so we have to add one copy in the third line to include those
outcomes.
This generalizes to larger numbers of events, with alternating
additions and subtractions. (Can you see why it is called
inclusion/exclusion?) See the reader for the general formula.

EX: Recall the dice game from before. You pick a number from 1 to 6.
The casino rolls three dice, and if your number comes up, you
win. What is your probability of winning?
ANS: Let A be the event that your number comes up on the first die,
B on the second, and C on the third. Then you win for outcomes
that are in A ∪ B ∪ C. So by inclusion/exclusion,
Pr[A ∪ B ∪ C]
= Pr[A] + Pr[B] + Pr[C]
- Pr[A ∩ B] - Pr[A ∩ C] - Pr[B ∩ C]
+ Pr[A ∩ B ∩ C].
What is Pr[A]? Well, the probability that the first die has
your number is 1/6, so Pr[A] = 1/6, and similarly, Pr[B] =
Pr[C] = 1/6. What is Pr[A ∩ B]? The results on different
dice are independent, so Pr[A ∩ B] = Pr[A] Pr[B] = 1/36,
and similarly for Pr[A ∩ C] and Pr[B ∩ C]. By a similar
argument, Pr[A ∩ B ∩ C] = 1/216. Then
Pr[A ∪ B ∪ C]
= 1/6 + 1/6 + 1/6 - 1/36 - 1/36 - 1/36 + 1/216
= 1/2 - 1/12 + 1/216
= 108/216 - 18/216 + 1/216
= 91/216
≈ 0.42.
This is the same answer as before, but it took a lot more work
to get it.
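
As a sanity check, the sketch below enumerates all 216 rolls and
compares the count against the inclusion/exclusion formula. The function
names are hypothetical, and we assume fair six-sided dice as in the
example.

```python
from fractions import Fraction
from itertools import product

def win_probability(pick=1, sides=6, dice=3):
    """Count the rolls in which the picked number appears at least once."""
    rolls = list(product(range(1, sides + 1), repeat=dice))
    wins = sum(1 for roll in rolls if pick in roll)
    return Fraction(wins, len(rolls))

def inclusion_exclusion(sides=6):
    """Pr[A u B u C] for three independent dice, each hitting w.p. 1/sides."""
    p = Fraction(1, sides)
    return 3 * p - 3 * p**2 + p**3

brute = win_probability()
formula = inclusion_exclusion()
print(brute, formula, float(brute))
```

Using exact fractions avoids any rounding, so the two results can be
compared with ==.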

Union Bound
From our reasoning for the inclusion/exclusion principle, we see
that Pr[A_1] + ... + Pr[A_n] can overstate, but never understate,
Pr[A_1 ∪ ... ∪ A_n]. We can formalize this as the union bound:
Pr[A_1 ∪ ... ∪ A_n] <= Pr[A_1] + ... + Pr[A_n].

EX: Suppose for MT2, to prevent students from cheating, we place on
each desk in the lecture hall a random number from 1 to 1000. We
give one question that is parameterized by that number. If two
people sitting next to each other have the same number, then
they can copy off each other. What is the probability that any
of the 62 students will cheat?
ANS: Computing this exactly seems hard, so let's just compute an
upper bound. There are at most 61 pairs of students sitting
next to each other (think of them all sitting in one long row).
Let A_i be the event that the ith pair has the same number.
Then Pr[A_i] = 1/1000. Let B be the event that some pair has
the same number, B = A_1 ∪ ... ∪ A_61. Then
Pr[B] <= Pr[A_1] + ... + Pr[A_61]
= 61/1000.
So the probability of any pair sharing the same number is at
most 6.1%.
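
How loose is the bound 61/1000? Under the extra assumption that all 62
students sit in a single row and the numbers are independent and
uniform, the exact answer is easy: each student differs from the
previous one with probability 999/1000, so no adjacent pair matches with
probability (999/1000)^61. The sketch below compares the two; the
function names are ours.

```python
def union_bound(pairs, numbers):
    """Upper bound: sum of Pr[A_i] over all adjacent pairs."""
    return pairs / numbers

def exact_single_row(pairs, numbers):
    """Exact Pr[some adjacent pair matches], assuming one long row."""
    return 1 - (1 - 1 / numbers) ** pairs

bound = union_bound(61, 1000)
exact = exact_single_row(61, 1000)
print(f"union bound: {bound:.4f}, exact (single row): {exact:.4f}")
```

When each Pr[A_i] is small, the overlap between events is tiny, so the
union bound gives away very little.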

Hashing
Now that we've seen many techniques for computing probabilities, let
us apply them to two problems of interest: hashing and coupon
collecting.

Recall the birthday paradox. We computed the probability that two
people share the same birthday given 365 days and m people. We found
that when m = 23, we have a slightly higher than even chance of two
people sharing a birthday.

Last week was Neptune's birthday! It was exactly one Neptunian year
after it was first discovered in 1846. A year on Neptune is 89,666
Neptunian days. Now how many Neptunians do we need so that we have a
better than even chance of two of them sharing the same birthday?

Let's redo the analysis in the general case, where we have n days
and m individuals. How many sample points are there? There are
|Ω| = n^m of them, since each individual has n days to choose from and
there are m individuals. Each of these is assumed to be equally
likely. Now let E be the event that no two individuals share the same
birthday. How many outcomes are in E?

Well, the first person has n choices of days, the second person has
n-1 choices that are different than the first, the third person has
n-2 choices that are different than the first two, and so on, until
the mth person has n-(m-1) = n-m+1 choices. Thus,
|E| = n * (n-1) * ... * (n-m+1),
and
Pr[E] = |E|/|Ω|
= n * (n-1) * ... * (n-m+1) / n^m
= n/n * (n-1)/n * ... * (n-m+1)/n.

We can compute Pr[E] another way using the product rule. Let E_i be
the event that the ith person's birthday is different than those of
persons 1, ..., i-1. Then
Pr[E] = Pr[E_1 ∩ E_2 ∩ ... ∩ E_m]
= Pr[E_1] *
Pr[E_2|E_1] *
Pr[E_3|E_1 ∩ E_2] *
... *
Pr[E_m|E_1 ∩ E_2 ∩ ... ∩ E_{m-1}].
Now we need to compute the probability
Pr[E_i|E_1 ∩ ... ∩ E_{i-1}],
the probability that the ith person's birthday is not the same as
persons 1, ..., i-1 given that all those people have different
birthdays. The ith person is left with n-(i-1) = n-i+1 choices of
distinct days out of n days total, so
Pr[E_i|E_1 ∩ ... ∩ E_{i-1}] = (n-i+1)/n.
Plugging into the product rule, we get
Pr[E] = (n-1+1)/n * (n-2+1)/n * ... * (n-m+1)/n
= n/n * (n-1)/n * ... * (n-m+1)/n,
as before.

Let us rewrite (n-i)/n as (1 - i/n) to get
Pr[E] = 1 * (1 - 1/n) * (1 - 2/n) * ... * (1 - (m-1)/n).

Before we continue, let's look at the Taylor series for e^{-x}:
e^{-x} = 1 - x + x^2/2! - x^3/3! + ...
If x is small, then x^2/2! is really small, x^3/3! is ridiculously
small, x^4/4! is ludicrously small, and so on. So we get
e^{-x} >= 1 - x
and if x is small, then they are very nearly equal.

Using this approximation, we get
Pr[E] = (1 - 1/n) * (1 - 2/n) * ... * (1 - (m-1)/n)
<= e^{-1/n} * e^{-2/n} * ... * e^{-(m-1)/n}
= exp(-(1/n + 2/n + ... + (m-1)/n))
= exp(-(1 + 2 + ... + (m-1))/n)
= exp(-(m-1)m/(2n))
≈ exp(-m^2/(2n)).
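
To see how tight these steps are, the following sketch evaluates the
exact product, the exponential upper bound, and the final rough
approximation for the ordinary birthday setting n = 365, m = 23. The
function names are hypothetical.

```python
import math

def pr_no_shared_birthday(n, m):
    """Exact Pr[E] = (1 - 0/n)(1 - 1/n)...(1 - (m-1)/n)."""
    p = 1.0
    for i in range(m):
        p *= 1 - i / n
    return p

def upper_bound(n, m):
    """exp(-m(m-1)/(2n)), obtained from 1 - x <= e^{-x}."""
    return math.exp(-m * (m - 1) / (2 * n))

def rough_approx(n, m):
    """exp(-m^2/(2n)), after replacing m(m-1) by m^2."""
    return math.exp(-m * m / (2 * n))

n, m = 365, 23
exact = pr_no_shared_birthday(n, m)
bound = upper_bound(n, m)
approx = rough_approx(n, m)
print(f"exact {exact:.4f}  bound {bound:.4f}  approx {approx:.4f}")
```

The exact value never exceeds the bound, and replacing m(m-1) by m^2
lowers the estimate slightly.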

Suppose we want to know when this probability is about 1/2. Then
Pr[E] ≈ exp(-m^2/(2n)) ≈ 1/2
-m^2/(2n) = -ln(2)
m^2 = 2n ln(2)
m = sqrt(2n ln(2)) ≈ 1.18 sqrt(n).

So when we have 1.18 sqrt(n) individuals, we have about an even
chance that two individuals share the same birthday.

In the case of Neptune, we plug in n = 89666 to get
m = 1.18 sqrt(89666)
≈ 353.
So we only need 353 Neptunians to make it likely that two of them
share a birthday!
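
We can check the 1.18 sqrt(n) estimate against the exact product. The
sketch below finds the smallest m for which a shared birthday is more
likely than not; smallest_group is a name we made up for this
illustration.

```python
import math

def smallest_group(n):
    """Smallest m with Pr[no two of m individuals share a day] < 1/2."""
    p_distinct = 1.0   # Pr[E] for the current m
    m = 0
    while p_distinct >= 0.5:
        p_distinct *= 1 - m / n   # the next individual must avoid m used days
        m += 1
    return m

for n in (365, 89666):
    estimate = 1.18 * math.sqrt(n)
    print(f"n = {n}: exact m = {smallest_group(n)}, estimate = {estimate:.1f}")
```

For n = 365 this recovers the familiar m = 23, and for Neptune the exact
answer lands very close to the estimate.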

This should make intuitive sense. When we have m people, there are
C(m, 2) ≈ m^2/2 pairs of people, each pair of which has a 1/n
chance of yielding a common birthday.

What does this have to do with hashing? A hash table is a data
structure for storing items. If it has n locations, then we use a
hash function h(x) to map an item x to a location 0 <= h(x) < n. At
each location, there is a linked list that stores all items that are
mapped to that location. The longer the list, the slower basic
operations on the hash table will be. Ideally, we want no two items
to be mapped to the same location, i.e. no "collisions." Then the
operations will take constant time.

Suppose we store m items into the hash table. How large can m be so
that the probability of a collision is less than 1/2?

Before we calculate, let's outline some assumptions we are making:
(1) For each item x, h(x) is uniformly random over [0, n-1], i.e.
all n locations are equally likely.
(2) The hash values for each item are mutually independent.

Then this is just the birthday paradox! The n locations are our n
days, and the m items are our m individuals, so we get
m ≈ 1.18 sqrt(n).
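
Here is a small simulation sketch of this claim. It does not use a real
hash function: following assumptions (1) and (2), it simply draws each
location uniformly and independently with Python's random module. The
table size, trial count, and seed are arbitrary choices of ours.

```python
import math
import random

def has_collision(n, m, rng):
    """Throw m items into n locations; report whether any two collide."""
    seen = set()
    for _ in range(m):
        slot = rng.randrange(n)   # stand-in for h(x), uniform over [0, n-1]
        if slot in seen:
            return True
        seen.add(slot)
    return False

def estimate_collision_probability(n, m, trials=2000, seed=0):
    """Fraction of trials in which at least one collision occurs."""
    rng = random.Random(seed)
    hits = sum(has_collision(n, m, rng) for _ in range(trials))
    return hits / trials

n = 10_000
m = round(1.18 * math.sqrt(n))
estimate = estimate_collision_probability(n, m)
print(f"n = {n}, m = {m}, estimated Pr[collision] = {estimate:.3f}")
```

With m near 1.18 sqrt(n), the estimate should come out close to 1/2, up
to sampling noise.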

Another way to express this problem is in terms of balls and bins,
where each location is a bin and each item is a ball. Then we are
randomly throwing balls into bins. This abstraction is very useful
in Computer Science.

Finally, note that we made some approximations in the above
analysis. In the reader, you can see a table that demonstrates that
these approximations are very good even for small n.

Coupon Collector's Problem
Let's analyze a somewhat different problem. Suppose a local cereal
manufacturer places a baseball card with a random Giants player in
each box of cereal. There are n players who appear on a card, and
each box contains a card chosen uniformly at random and
independently from all other boxes.

Now I am a big fan of the Kung Fu Panda, i.e. Pablo Sandoval. I
really want his baseball card. How many boxes of cereal do I have to
buy to make it more likely than not that I get his card?

Suppose I buy m boxes of cereal. Let E be the event that I don't get
a Panda card, E_i be the event that the ith box doesn't have his
card. What is Pr[E_i]? Well, there are n cards, and n-1 don't have
the Panda, so Pr[E_i] = (n-1)/n = (1 - 1/n). Then
Pr[E] = Pr[E_1 ∩ ... ∩ E_m]
= Pr[E_1] ... Pr[E_m]                   (mutual independence)
= (1 - 1/n)^m.
Using the Taylor expansion from before,
Pr[E] <= (exp(-1/n))^m
= exp(-m/n).
Setting this equal to 1/2 for an even chance of getting a Panda
card, we get
1/2 = exp(-m/n)
-ln(2) = -m/n
m = n ln(2) ≈ 0.69n.
So if I buy 0.69n boxes, I have about an even chance of getting the
Panda.
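
The sketch below compares the estimate n ln(2) with the exact smallest m
for which (1 - 1/n)^m drops to 1/2 or below. The roster size n = 25 is a
hypothetical value chosen only for illustration, as are the function
names.

```python
import math

def pr_no_panda(n, m):
    """Exact Pr[E]: none of the m boxes contains the Panda card."""
    return (1 - 1 / n) ** m

def boxes_for_even_chance(n):
    """Smallest m with Pr[E] <= 1/2."""
    m = 0
    while pr_no_panda(n, m) > 0.5:
        m += 1
    return m

n = 25  # hypothetical number of players
exact_m = boxes_for_even_chance(n)
estimate = n * math.log(2)
print(f"exact m = {exact_m}, n ln(2) = {estimate:.2f}")
```

Because (1 - 1/n)^m <= exp(-m/n), buying n ln(2) boxes (rounded up) is
always enough, and the exact requirement is never larger.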

Suppose I want all n players. (I like The Beard (Brian Wilson),
Buster Posey, and the rest of the Giants as well.) Now how many
boxes do I have to buy to have an even chance of getting all the
players?

Let F_j be the event that I don't get the jth player, F be the event
that I am missing some player. Then F = F_1 ∪ ... ∪ F_n.
Note that the F_j are not independent! Knowing that I didn't get a
Panda card makes it more likely that I got someone else's. However,
the single-player analysis above applies to each player individually,
so
Pr[F_j] <= exp(-m/n).
Then we have
Pr[F] = Pr[F_1 ∪ ... ∪ F_n]
<= Pr[F_1] + ... + Pr[F_n]                     (union bound)
<= n exp(-m/n).
Setting this to 1/2, we get
1/2 = n exp(-m/n)
1/(2n) = exp(-m/n)
-ln(2n) = -m/n
m = n ln(2n)
So n ln(2n) boxes are sufficient to guarantee at least an even chance
of getting all players.
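
Since the union bound is only an upper bound, it is natural to ask how
far it is from the truth. One way to check, sketched below, is to
compute Pr[F] exactly with the general inclusion/exclusion formula: any
k particular players are all missing with probability (1 - k/n)^m, and
there are C(n, k) ways to choose them. The roster size n = 25 and the
function names are our own assumptions for illustration.

```python
import math
from fractions import Fraction

def pr_missing_some_player(n, m):
    """Exact Pr[F] by inclusion/exclusion over sets of missing players."""
    total = Fraction(0)
    for k in range(1, n + 1):
        # Pr[k particular players are all missing from m boxes]
        term = math.comb(n, k) * Fraction(n - k, n) ** m
        total += term if k % 2 == 1 else -term   # alternate + and -
    return total

def union_bound(n, m):
    """The bound n exp(-m/n) derived above."""
    return n * math.exp(-m / n)

n = 25  # hypothetical number of players
m = math.ceil(n * math.log(2 * n))
exact = float(pr_missing_some_player(n, m))
bound = union_bound(n, m)
print(f"m = {m}: exact Pr[F] = {exact:.3f}, union bound = {bound:.3f}")
```

The exact probability of an incomplete collection sits below the bound,
so n ln(2n) boxes give somewhat better than an even chance.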

As you can see, we need many boxes to make it likely that we find
the player we like or assemble a full collection of all players. So
this is a great marketing ploy for the cereal manufacturer.

Why did we do these examples above? They illustrate how the
probability techniques we learned can be applied to solve real-world
problems.