HW7 out, due Monday
MT2 next Tuesday
Same location and policies as MT1
Covers material through polling/LLN (Wednesday)
Review session
Monday 6:30-8 in 320 Soda
Exams will be available for pickup from Soda front office

Review
Recall that we defined independence for random variables as follows.
Let X and Y be random variables, A be the set of values that
X can take on, B be the set of values Y can take on. Then X
and Y are independent if
∀a∈A ∀b∈B . Pr[X=a ∩ Y=b] = Pr[X=a] Pr[Y=b].
Equivalently, X and Y are independent if
∀a∈A ∀b∈B . Pr[X=a|Y=b] = Pr[X=a].

We can similarly define mutual independence for more than two random
variables.

Independent Random Variables
We have used the fact that Var(X+Y) = Var(X) + Var(Y) for
independent random variables X and Y. We now prove that fact.

For independent random variables X and Y, we have
E(XY) = E(X)E(Y).
Proof:
E(XY) = ∑_{a,b} ab Pr[X=a ∩ Y=b]
= ∑_{a,b} ab Pr[X=a] Pr[Y=b]
= ∑_{a} ∑_{b} ab Pr[X=a] Pr[Y=b]
= (∑_{a} a Pr[X=a]) * (∑_{b} b Pr[Y=b])
= E(X) * E(Y).

We have already claimed that for independent random variables X and
Y, Var(X+Y) = Var(X) + Var(Y). Now we prove it.
Var(X+Y) = E((X+Y)^2) - E(X+Y)^2
= E(X^2 + 2XY + Y^2) - (E(X) + E(Y))^2
= E(X^2) + 2E(XY) + E(Y^2) - E(X)^2 - 2E(X)E(Y) - E(Y)^2
= E(X^2) + 2E(X)E(Y) + E(Y^2) - E(X)^2 - 2E(X)E(Y) - E(Y)^2
= E(X^2) + E(Y^2) - E(X)^2 - E(Y)^2
= Var(X) + Var(Y).

What if X and Y are not independent? Let's consider the
extreme case Y = X. Then
E(XY) = E(XX) = E(X^2) ≠ E(X)^2 in general
Var(X+Y) = Var(2X) = 4Var(X) ≠ 2Var(X).
So the two facts above do not hold if X and Y are not independent.
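Both facts, and the Y = X counterexample, can be checked numerically. Here is a small sketch using two independent fair dice (my choice of example, not from the notes), computing expectations and variances directly from the distributions:

```python
from itertools import product

# Two independent fair dice: the joint distribution is the product of marginals.
px = {a: 1/6 for a in range(1, 7)}
py = {b: 1/6 for b in range(1, 7)}
joint = {(a, b): px[a] * py[b] for a, b in product(px, py)}

def E(f, dist):
    """Expectation of f(V) for a distribution given as {value: probability}."""
    return sum(f(v) * p for v, p in dist.items())

EX  = E(lambda a: a, px)
EY  = E(lambda b: b, py)
EXY = sum(a * b * p for (a, b), p in joint.items())
assert abs(EXY - EX * EY) < 1e-9          # E(XY) = E(X)E(Y)

def Var(dist):
    m = E(lambda v: v, dist)
    return E(lambda v: (v - m) ** 2, dist)

# Distribution of X+Y, obtained by re-bucketing the joint distribution.
psum = {}
for (a, b), p in joint.items():
    psum[a + b] = psum.get(a + b, 0) + p
assert abs(Var(psum) - (Var(px) + Var(py))) < 1e-9   # Var(X+Y) = Var(X)+Var(Y)

# Counterexample Y = X: then X+Y = 2X, and Var(2X) = 4 Var(X) ≠ 2 Var(X).
p2x = {2 * a: p for a, p in px.items()}
assert abs(Var(p2x) - 4 * Var(px)) < 1e-9
```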

Joint Distribution
If two random variables X and Y are independent, then their
individual distributions give us all the information we need in
order to perform calculations involving those two variables. We now
turn our attention to the general case, in which X and Y need not
be independent.

Recall that for non-independent events A and B, knowing Pr[A] and
Pr[B] alone is not enough to compute probabilities of the form
Pr[A ∩ B]
Pr[A ∪ B]
Pr[A|B];
we also needed to know Pr[A ∩ B]. This quantity encodes all the
information about how A and B are correlated.

For random variables, we need something similar. We need the values
Pr[X = a ∩ Y = b]
for all values a in the set of values that X can take on and all
values b in the set of values that Y can take on. This set of
probabilities is called the "joint distribution" of X and Y.

Since we now will be making quite heavy use of intersections and
writing ∩ everywhere is tedious, we use a comma instead to
denote intersection:
Pr[X = a, Y = b].

Before we continue with examples of joint distributions, we first
generalize our definition of random variables. Previously, we
defined a random variable X on a probability space Ω as a
function from Ω to the real numbers
X : Ω -> R,
so X(ω) is a real number for all ω ∈ Ω. We
can generalize the range of a random variable to be any set S:
X : Ω -> S.
For example, in PA3, we went to the trouble of defining a numerical
value for winning, drawing, and losing, and similarly for each type
of hand, in order to define the random variables W and O. Instead,
we could have defined the range of W to be the set
S = {Lose, Draw, Win}.
Then W would assign an element of S to each outcome ω.
∀ ω ∈ Ω . W(ω) ∈ S.
If the range of a random variable is not a subset of R, however, we
cannot talk about its expectation or variance. (In the case of PA3,
the members of S are ordered, so it made sense to use numerical
values so that we can compute expectation and variance, which give
us information about how much money we expect to win or lose.
Similarly, the types of hands are ordered as well.)
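Such a generalized random variable is just a function from outcomes to an arbitrary set S. As a sketch, here is W with range S = {Lose, Draw, Win}; the outcomes and the mapping below are made up for illustration, not the actual PA3 definitions:

```python
# A random variable as a function from outcomes ω ∈ Ω to a set S that
# need not be numerical. The sample space here is hypothetical.
omega = ["pair_vs_flush", "flush_vs_flush", "flush_vs_pair"]

def W(outcome):
    """W : Ω -> S with S = {"Lose", "Draw", "Win"}."""
    return {"pair_vs_flush":  "Lose",
            "flush_vs_flush": "Draw",
            "flush_vs_pair":  "Win"}[outcome]

# W assigns an element of S to every outcome, but since S is not a
# subset of R, expressions like E(W) are not defined.
assert all(W(o) in {"Lose", "Draw", "Win"} for o in omega)
```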

We now turn to examples of joint distributions using generalized
random variables.

Suppose we are trying to diagnose a rare ailment (say
neurocysticercosis or something else you would see on House,
M.D.) according to the severity of a particular symptom. Let X
be a random variable that is 1 if the patient has the disease, 0
otherwise. Let Y be a random variable that takes on one of the
values {none, moderate, severe}. Then the joint distribution of X
and Y might be the following:
Pr[X=a,Y=b]    Y    none   moderate   severe
   X
   0                0.72     0.18      0.00
   1                0.02     0.05      0.03
In other words, Pr[X = 0, Y = none] = 0.72, so 72% of patients have
neither the disease nor any symptoms. The table gives us all values
of Pr[X = a, Y = b], so it completely specifies the joint
distribution of X and Y.

The joint distribution gives us all of the information about X and
Y. Since the events Y = b, over all values b, partition the sample
space (and likewise the events X = a), we can use the total
probability rule to obtain "marginal" distributions for X and Y:
Pr[X = 0] = Pr[X = 0, Y = none] +
Pr[X = 0, Y = moderate] +
Pr[X = 0, Y = severe]
= 0.90.
We can do this for all values of X and Y by adding the values in
the appropriate row or column of the table:
Pr[X=a,Y=b]    Y    none   moderate   severe  |  Pr[X = a]
   X                                          |
   0                0.72     0.18      0.00   |    0.90
   1                0.02     0.05      0.03   |    0.10
   -------------------------------------------'
   Pr[Y = b]        0.74     0.23      0.03
This implies that 10% of all patients have the disease, and 3% of
all patients have severe symptoms.
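The row/column sums are easy to mechanize. This sketch stores the joint distribution from the table above as a dictionary and recovers both marginals:

```python
# Joint distribution of (X, Y) from the disease/symptom table.
joint = {(0, "none"): 0.72, (0, "moderate"): 0.18, (0, "severe"): 0.00,
         (1, "none"): 0.02, (1, "moderate"): 0.05, (1, "severe"): 0.03}

# Marginals via the total probability rule: sum each row (for X)
# and each column (for Y).
pX, pY = {}, {}
for (a, b), p in joint.items():
    pX[a] = pX.get(a, 0) + p
    pY[b] = pY.get(b, 0) + p

assert abs(pX[0] - 0.90) < 1e-9 and abs(pX[1] - 0.10) < 1e-9
assert abs(pY["none"] - 0.74) < 1e-9 and abs(pY["severe"] - 0.03) < 1e-9
```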

For independent random variables Q and R, we have
Pr[Q = q, R = r] = Pr[Q = q] Pr[R = r],
so the joint distribution is the product of the marginals when the
random variables are independent.

Recall that Y = b is just an event, so we can compute conditional
probabilities given the event Y = b:
Pr[X = 0 | Y = b] = Pr[X = 0, Y = b] / Pr[Y = b]
Pr[X = 1 | Y = b] = Pr[X = 1, Y = b] / Pr[Y = b]
This set of probabilities is called the "conditional distribution"
of X given Y = b. We can write the conditional distributions Pr[X =
a | Y = b] in table form as well:
Pr[X=a|Y=b]    Y    none   moderate   severe
   X
   0                0.97     0.78      0.00
   1                0.03     0.22      1.00
We can similarly compute the conditional distributions Pr[Y = b | X
= a]:
Pr[Y=b|X=a]    Y    none   moderate   severe
   X
   0                0.80     0.20      0.00
   1                0.20     0.50      0.30

Like unconditional distributions, conditional distributions must sum
to 1. Recall how we defined conditional probability: Pr[A|B] is the
probability of A in a new sample space Ω' given by Ω' =
B. So conditioning merely defines a new sample space, and all the
rules of probability must hold in this new sample space.
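A conditional distribution is just the corresponding column (or row) of the joint table renormalized by the marginal, so it automatically sums to 1. A quick check against the disease/symptom table:

```python
# Pr[X = a | Y = b] = Pr[X = a, Y = b] / Pr[Y = b].
joint = {(0, "none"): 0.72, (0, "moderate"): 0.18, (0, "severe"): 0.00,
         (1, "none"): 0.02, (1, "moderate"): 0.05, (1, "severe"): 0.03}

pY = {}
for (a, b), p in joint.items():
    pY[b] = pY.get(b, 0) + p

cond = {(a, b): p / pY[b] for (a, b), p in joint.items()}

# Every patient with severe symptoms has the disease.
assert abs(cond[(1, "severe")] - 1.0) < 1e-9

# Each conditional distribution is a distribution: it sums to 1.
for b in pY:
    assert abs(sum(cond[(a, b)] for a in (0, 1)) - 1) < 1e-9
```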

Let's do another example. Suppose Victor and I play
rock-paper-scissors in order to determine who gets to grade problem
1 on midterm 2. Let X be my choice of weapon, Y Victor's choice.
Then the joint distribution of X and Y might be:
Pr[X=a,Y=b]    Y    rock   paper   scissors  |  Pr[X = a]
   X                                         |
   rock             0.12    0.12     0.16    |     0.4
   paper            0.09    0.09     0.12    |     0.3
   scissors         0.09    0.09     0.12    |     0.3
   ------------------------------------------'
   Pr[Y = b]        0.3     0.3      0.4
As you can see, I am slightly biased towards rock, and Victor is
slightly biased towards scissors. What's my probability of beating
Victor and getting an easy problem to grade?

Let W be 1 if I win, 0 if we draw, and -1 if I lose. Then we can
compute the joint distribution of X and W. We note that if I choose
rock, I win when he chooses scissors, I lose when he chooses paper,
and we draw when he chooses rock. So
Pr[X = rock, W = 1] = Pr[X = rock, Y = scissors]
Pr[X = rock, W = -1] = Pr[X = rock, Y = paper]
Pr[X = rock, W = 0] = Pr[X = rock, Y = rock].
Repeating this for all values of X, we get
Pr[X=a,W=c]    W    +1      0       -1       |  Pr[X = a]
   X                                         |
   rock             0.16    0.12     0.12    |     0.4
   paper            0.09    0.09     0.12    |     0.3
   scissors         0.09    0.12     0.09    |     0.3
   ------------------------------------------'
   Pr[W = c]        0.34    0.33     0.33
So I have a slightly higher probability of winning than losing,
which is good news for me. (I get to go home early from grading!)
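Since W is a function of the pair (X, Y), the joint distribution of (X, W) is obtained by re-bucketing the probabilities of (X, Y), exactly as in the case analysis above. A sketch:

```python
# Derive the joint distribution of (X, W) from the joint of (X, Y).
R, P, S = "rock", "paper", "scissors"
joint_xy = {(R, R): 0.12, (R, P): 0.12, (R, S): 0.16,
            (P, R): 0.09, (P, P): 0.09, (P, S): 0.12,
            (S, R): 0.09, (S, P): 0.09, (S, S): 0.12}

beats = {(R, S), (P, R), (S, P)}   # (winner, loser) pairs

def w(a, b):
    """Outcome from my perspective: +1 win, 0 draw, -1 loss."""
    if a == b:
        return 0
    return 1 if (a, b) in beats else -1

joint_xw = {}
for (a, b), p in joint_xy.items():
    key = (a, w(a, b))
    joint_xw[key] = joint_xw.get(key, 0) + p

# Marginal of W: my winning probability edges out my losing probability.
pW = {}
for (a, c), p in joint_xw.items():
    pW[c] = pW.get(c, 0) + p
assert abs(pW[1] - 0.34) < 1e-9 and abs(pW[-1] - 0.33) < 1e-9
```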

We can also compute conditional distributions Pr[W=c|X=a], which
will tell me that I should choose rock (until he catches on, of
course).
Pr[W=c|X=a]    W    +1      0       -1
   X
   rock             0.4     0.3      0.3
   paper            0.3     0.3      0.4
   scissors         0.3     0.4      0.3
As expected, since Victor is biased towards scissors, I am more
likely to win if I choose rock and less likely if I choose paper.

Conditional Probability Spaces
Suppose that instead, we bet $1 on the outcome of each game of
rock-paper-scissors. Then W is exactly the amount of money I win,
and I would like to know how much I can expect to win if I play
Victor many times. We can compute this from the marginal
distribution of W:
E(W) = 1 * 0.34 + 0 * 0.33 - 1 * 0.33
= 0.01.
I would also like to know how much I can expect to win for each of
my choices. That way, I have a better idea of what I should choose
(again, until Victor catches on). So I want to know the "conditional
expectation" of W given X = a for each a.

Again, conditioning on the event X = a gives us a new sample space,
and anything we can do in an arbitrary sample space we can do in
this new sample space. We just have to replace all our
probabilities with conditional probabilities given by the new
sample space.

Doing so, we define conditional expectation as follows:
E(W | F) = ∑_{c ∈ C} c * Pr[W = c | F],
where F is any event and C is the set of all possible values
that W can take on. So in this case, we have
E(W | X = rock) = 1 * 0.4 + 0 * 0.3 - 1 * 0.3 = 0.1
E(W | X = paper) = 1 * 0.3 + 0 * 0.3 - 1 * 0.4 = -0.1
E(W | X = scissors) = 1 * 0.3 + 0 * 0.4 - 1 * 0.3 = 0
So I expect to win more if I choose rock.

We can obtain the unconditional expectation E(W) from the
conditional expectations as follows, where A is the set of
values that X can take on:
E(W) = ∑_{c∈C} c Pr[W = c]
= ∑_{c∈C} c (∑_{a∈A} Pr[X=a]Pr[W=c|X=a])
(total probability rule)
= ∑_{a∈A} Pr[X=a] (∑_{c∈C} c Pr[W=c|X=a])
= ∑_{a∈A} Pr[X=a] E(W|X=a).
This is called the "total expectation law." In this case, we get
E(W) = E(W | X = rock) Pr[X = rock] +
E(W | X = paper) Pr[X = paper] +
E(W | X = scissors) Pr[X = scissors]
= 0.1 * 0.4 + (-0.1) * 0.3 + 0 * 0.3
= 0.01,
which is the same as our previous answer.
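Both the conditional expectations and the total expectation law can be checked mechanically against the (X, W) table:

```python
# Joint distribution of (X, W) from the rock-paper-scissors table.
R, P, S = "rock", "paper", "scissors"
joint_xw = {(R, 1): 0.16, (R, 0): 0.12, (R, -1): 0.12,
            (P, 1): 0.09, (P, 0): 0.09, (P, -1): 0.12,
            (S, 1): 0.09, (S, 0): 0.12, (S, -1): 0.09}

# Marginal of X.
pX = {}
for (a, c), p in joint_xw.items():
    pX[a] = pX.get(a, 0) + p

def cond_exp(a):
    """E(W | X = a) = sum_c c * Pr[W = c | X = a]."""
    return sum(c * p / pX[a] for (x, c), p in joint_xw.items() if x == a)

assert abs(cond_exp(R) - 0.1) < 1e-9    # rock is my best choice
assert abs(cond_exp(P) + 0.1) < 1e-9
assert abs(cond_exp(S)) < 1e-9

# Total expectation law: E(W) = sum_a Pr[X = a] * E(W | X = a).
EW = sum(pX[a] * cond_exp(a) for a in pX)
assert abs(EW - 0.01) < 1e-9            # matches the marginal computation
```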

In the next discussion section, you will see a neat way of computing
the expectation of a geometric random variable using conditional
expectation and the total expectation law.

We can also define "conditional independence" for events A and B.
A and B are independent conditional on C if
Pr[A, B | C] = Pr[A|C] Pr[B|C].
Equivalently, A and B are independent given C if
Pr[A | B, C] = Pr[A|C].
This tells us that if we are given C, then knowing B occurred gives
us no information about whether or not A occurred.

Note that Pr[A|B|C] is meaningless, since what is to the right of
the bar determines our sample space. So in order to condition on two
events B and C, we condition on their intersection.

We can similarly define conditional independence for random
variables.

There are conditional versions of other probability rules as well.
(1) inclusion/exclusion
Pr[A∪B|C] = Pr[A|C] + Pr[B|C] - Pr[A,B|C]
(2) total probability rule
Pr[A|C] = Pr[A,B|C] + Pr[A,B̄|C]
= Pr[A|B,C] Pr[B|C] + Pr[A|B̄,C] Pr[B̄|C]
(here B̄ denotes the complement of B)
(3) Bayes' rule
Pr[A|B,C] = Pr[B|A,C] Pr[A|C] / Pr[B|C]
As an exercise, you may wish to prove some of these on your own.
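These conditional rules can also be verified exactly on a small example. Here A, B, C are arbitrarily chosen events on a fair six-sided die roll (my example, not from the notes), with exact rational arithmetic so the identities hold with no rounding:

```python
from fractions import Fraction

# Events on a fair die: A = "even", B = "at least 4", C = "at most 5".
omega = set(range(1, 7))
A = {2, 4, 6}
B = {4, 5, 6}
C = {1, 2, 3, 4, 5}

def cond(E, F):
    """Pr[E | F] on a uniform space: |E ∩ F| / |F|."""
    return Fraction(len(E & F), len(F))

# (1) conditional inclusion/exclusion
assert cond(A | B, C) == cond(A, C) + cond(B, C) - cond(A & B, C)

# (2) conditional total probability rule (Bc is the complement of B)
Bc = omega - B
assert cond(A, C) == cond(A, B & C) * cond(B, C) + cond(A, Bc & C) * cond(Bc, C)

# (3) conditional Bayes' rule
assert cond(A, B & C) == cond(B, A & C) * cond(A, C) / cond(B, C)
```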