```Administrative info
HW7 out, due Monday
MT2 next Tuesday
Same location and policies as MT1
Cover through polling/LLN (Wednesday)

Review
We have now seen three important distributions. The first is the
binomial distribution. A random variable X ~ Bin(n, p) has the
distribution
Pr[X = i] = C(n, i) p^i (1-p)^(n-i)
for integer i, 0 <= i <= n. This distribution arises whenever we
have a fixed number of trials n, the trials are mutually
independent, the probability of success of any one trial is p, and
we are counting the number of successes. The expectation of X is
E(X) = np.

The second is the geometric distribution. A random variable Y ~
Geom(p) has the distribution
Pr[Y = i] = p(1-p)^(i-1)
for i ∈ Z^+. This distribution arises whenever we have
independent trials, the probability of success of any one trial is
p, and we are counting the number of trials up to and including the
first success. The expectation of Y is
E(Y) = 1/p.

The third is the Poisson distribution. A random variable Z ~
Poiss(λ) has the distribution
Pr[Z = i] = (λ^i)/i! e^{-λ}
for i ∈ N. This distribution is the limit of the binomial
distribution as n grows large and p shrinks with np = λ held fixed.
It is used to model the occurrence of rare events. The expectation
of Z is
E(Z) = λ.

Poisson Distribution
The Poisson distribution is widely used for modeling rare events. It
is a good approximation of the binomial distribution when n >= 20
and p <= 0.05, and a very good approximation when n >= 100 and np
<= 10.

EX: Suppose a web server gets an average of 100K requests a day.
Each request takes 1 second to handle. How many servers are
needed to handle requests?
ANS: The website has an unknown number of customers n, and there is
a tiny probability p of each person making a request in any 1
second time period. Thus, the rare event is a person choosing
to make a request, and we can use the Poisson distribution to
model this situation. (We don't actually know n or p, so we
couldn't use the binomial distribution even if we wanted to.)

Since there are 100K requests a day on average, the average
number of requests in a 1 second time period is
λ = 100000/(24*3600) ≈ 1.2.
Unlike n and p, this can be measured directly, allowing us to
use the Poisson distribution. Let R be the number of requests
in a 1 second period. Then R ~ Poiss(1.2), and Pr[R = i] =
(λ^i)/i! e^{-λ}.

Plugging in λ = 1.2, we get the following values:
i     Pr[R = i]     Pr[R <= i]
0       0.301          0.301
1       0.361          0.663
2       0.217          0.879
3       0.087          0.966
4       0.026          0.992
5       0.006          0.998
So if we have 5 servers, we can handle all requests without
delay in about 99.8% of 1 second periods.

(Note that we assumed a uniform distribution of requests over
the entire day. If this is not the case, we can measure the
average number of requests in the busiest 1 second time period
and use this as λ. The rest of our analysis will be the
same.)

Variance
Consider a random walk: I flip a (fair) coin, and if it is heads, I
take a step to the right, but if it is tails, I take a step to the
left. (This models many situations: a drunken sailor, the value of
my stock account, our coin flipping game from a previous lecture.)
How far from the starting point can I expect to be after n flips?

Let X_i be a random variable (not an indicator r.v.!) that is +1 if
the ith flip is heads and -1 if it is tails. Let Y be my position
after n flips. Then
X_i = {1 with pr. 1/2, -1 with pr. 1/2}
Y = X_1 + ... + X_n
What is E(Y)? We have
E(X_i) = 0
E(Y) = E(X_1) + ... + E(X_n)
= 0.
So I can expect to be back where I started.

This isn't, however, exactly what the question asked. We wanted to
know our distance from the starting point, not where we end
up. What we actually want to know is E(|Y|).

Unfortunately, the random variable |Y| is difficult to work with.
So let's work with Y^2 instead, which is never negative. Then
we will take a square root at the end to learn something about how
far we typically are from the starting point.

(Note that it is not true that √{E(Z^2)} = E(|Z|). As a simple
counterexample, consider an indicator random variable Z with Pr[Z =
1] = p, where 0 < p < 1. Then E(|Z|) = E(Z) = p, and E(Z^2) = p as
well, so √{E(Z^2)} = √{p} ≠ p = E(|Z|). We will see later how to
relate |Z| and Z^2.)

We have
E(Y^2) = E((X_1 + ... + X_n)^2)
= E(∑_{i,j} X_i X_j)
= ∑_{i,j} E(X_i X_j).
In the above summations, i,j are in the range 1 <= i,j <= n, so
there are n^2 terms.

What is E(X_i X_j)? There are two cases:
(1) i = j
Then E(X_i X_j) = E(X_i^2) = 1, since X_i^2 is always 1.
(2) i ≠ j
Let's enumerate the possibilities for X_i X_j:
X_i    X_j    X_i X_j    prob.
 1      1        1        1/4
 1     -1       -1        1/4
-1      1       -1        1/4
-1     -1        1        1/4
In the last column, we used the fact that different coin flips
are independent, so the events X_i = a and X_j = b are
independent.

Putting this together, we get
Pr[X_i X_j = 1] = 1/2
Pr[X_i X_j = -1] = 1/2
E(X_i X_j) = 0.

In our summation, there are n terms that fall under case (1) and
n^2 - n that fall under case (2), so we get
E(Y^2) = n * 1 + (n^2 - n) * 0
= n.

Since E(Y) = 0, the quantity E(Y^2) is the "variance" of Y, and it
tells us something about the spread of the random variable Y.

More generally, for a random variable Z with arbitrary expectation
E(Z) = μ, we define the variance to be
Var(Z) = E((Z - μ)^2).
It tells us something about the deviation of Z from its mean.

The "standard deviation" of Z is
σ(Z) = √{Var(Z)},
which in some sense undoes the square in the variance.

(Why do we have both variance and standard deviation? Variance is
easier to work with, but standard deviation is on the same scale as
the random variable, so it gives us a better idea about the typical
deviation from the mean.)

In the random walk, σ(Y) = √{n}.

An alternative expression for variance is
Var(X) = E(X^2) - μ^2.
Proof:
Var(X) = E((X - μ)^2)
= E(X^2 - 2Xμ + μ^2)
= E(X^2) - 2μE(X) + μ^2
= E(X^2) - 2μ^2 + μ^2
= E(X^2) - μ^2.
In the third step above, we used linearity of expectation.

Let's do some more examples.

Uniform distribution
Let X be a random variable with uniform distribution in 1,...,n.
Then
μ = E(X) = (1/n)(1 + ... + n) = (1/n) n(n+1)/2 = (n+1)/2
μ^2 = (n+1)^2/4 = 3(n+1)^2/12
E(X^2) = (1/n)(1 + 4 + ... + n^2)
= (1/n) ∑_{i=1}^n i^2
= (1/n) n(n+1)(2n+1)/6
= (n+1)(2n+1)/6 = 2(n+1)(2n+1)/12
Var(X) = E(X^2) - μ^2
= 2(n+1)(2n+1)/12 - 3(n+1)^2/12
= ((n+1)/12)(4n + 2 - 3n - 3)
= (n+1)(n-1)/12
= (n^2-1)/12.

Compare this variance to that of the random walk: the uniform
variance is on the order of n^2, while that of the random walk was
on the order of n. This should make sense, since the random walk is
much more likely to end up close to its mean than far from it, unlike
a uniform distribution. (The probability "mass" is concentrated
near the mean, while in a uniform distribution, it is spread out.)

EX: Let X be the result of a roll of a fair die. What is Var(X)?
ANS: Var(X) = (6^2-1)/12 = 35/12
σ(X) ≈ 1.7.

Binomial distribution
Let X ~ Bin(n, p). Then we proceed as in the random walk. Let X_i
be an indicator random variable that is 1 if the ith trial
succeeds and 0 otherwise. Then
X = X_1 + ... + X_n
E(X^2) = E(∑_{i,j} X_i X_j)
= ∑_{i,j} E(X_i X_j).
For E(X_i X_j), we have two cases.
(1) i = j
Then E(X_i X_j) = E(X_i^2) = p, since Pr[X_i^2 = 1] = p.
(2) i ≠ j
Let's enumerate the possibilities for X_i X_j:
X_i    X_j    X_i X_j    prob.
1      1        1        p^2
1      0        0       p(1-p)
0      1        0       p(1-p)
0      0        0      (1-p)^2
In the last column, we used the fact that different trials
are independent. Thus, Pr[X_i X_j = 1] = p^2, and E(X_i X_j) =
p^2.
There are n terms in the summation that fall under case (1), n^2 -
n that fall under case (2), so we get
E(X^2) = np + (n^2-n)p^2
= np + n^2 p^2 - np^2
= n^2 p^2 + np(1-p).
Then
Var(X) = E(X^2) - E(X)^2
= n^2 p^2 + np(1-p) - n^2 p^2
= np(1-p).

Geometric distribution
Let X ~ Geom(p). Then
E(X^2) = p + 4p(1-p) + 9p(1-p)^2 + 16p(1-p)^3 + ...
Multiplying this by (1-p), we get
(1-p)E(X^2) =      p(1-p) + 4p(1-p)^2 +  9p(1-p)^3 + ...
Subtracting, we get
pE(X^2) = p + 3p(1-p) + 5p(1-p)^2 +  7p(1-p)^3 + ...
= 2[p + 2p(1-p) + 3p(1-p)^2 +  4p(1-p)^3 + ...]
- [p +  p(1-p) +  p(1-p)^2 +   p(1-p)^3 + ...]
The first sum is just E(X), and the second is the sum of the
probabilities of each of the outcomes, so it is 1. Thus,
pE(X^2) = 2E(X) - 1
= 2/p - 1
= (2-p)/p
E(X^2) = (2-p)/p^2.
Then
Var(X) = E(X^2) - E(X)^2
= (2-p)/p^2 - 1/p^2
= (1-p)/p^2.

Here are some useful facts about variance.
(1) Var(cX) = c^2 Var(X), where c is a constant.
(2) Var(X+c) = Var(X), where c is a constant.
EX: In the random walk, we have Y = 2H - n, where H is the
number of heads. (We showed this in a previous lecture.) So
by (1) and (2), Var(Y) = Var(2H) = 4Var(H). We've computed
Var(H) = np(1-p), so Var(Y) = 4np(1-p). For a fair coin, p =
1/2, and we get Var(Y) = 4n(1/2)(1/2) = n, as before.
(3) Var(X+Y) = Var(X) + Var(Y) if X and Y are independent.
What does it mean for two random variables X and Y to be
independent? Let A be the set of values that X can take on, B be
the set of values Y can take on. Then X and Y are independent if
∀a∈A ∀b∈B . Pr[X=a ∩ Y=b] = Pr[X=a] Pr[Y=b].
EX: Let X ~ Bin(n, p) and define indicator random variables X_i
as before. Then E(X_i^2) = p, so Var(X_i) = p - p^2 =
p(1-p). The X_i are mutually independent, so applying (3)
repeatedly gives Var(X) = Var(X_1) + ... + Var(X_n) =
np(1-p), as before.
The proofs of (1) and (2) are straightforward from the definition
of variance. We will come back to (3) later.

```
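The table in the web server example can be reproduced directly from the Poisson formula Pr[R = i] = (λ^i)/i! e^{-λ}. The sketch below is one way to do that with only the Python standard library. The helper names `poisson_pmf` and `poisson_table` are made up for illustration and are not part of the notes.

```python
import math


def poisson_pmf(lam, i):
    """Pr[Z = i] for Z ~ Poiss(lam), straight from the formula in the notes."""
    return lam**i / math.factorial(i) * math.exp(-lam)


def poisson_table(lam, max_i):
    """Return rows (i, Pr[R = i], Pr[R <= i]) for i = 0..max_i."""
    rows = []
    cumulative = 0.0
    for i in range(max_i + 1):
        p = poisson_pmf(lam, i)
        cumulative += p  # running sum gives the cumulative column
        rows.append((i, p, cumulative))
    return rows


if __name__ == "__main__":
    # lambda = 1.2 is the rounded request rate used in the example.
    print("i   Pr[R = i]   Pr[R <= i]")
    for i, p, c in poisson_table(1.2, 5):
        print(f"{i}     {p:.3f}       {c:.3f}")
```

Changing the first argument of `poisson_table` to a measured peak rate gives the corresponding table for the non-uniform case mentioned at the end of the example.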
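For the random walk, the claims E(Y) = 0 and E(Y^2) = n, and the warning that √{E(Y^2)} is not the same as E(|Y|), can be checked exactly for small n. The sketch below assumes the identity Y = 2H - n with H ~ Bin(n, 1/2) that the notes use in the variance facts. The function name `random_walk_moments` is hypothetical.

```python
from fractions import Fraction
from math import comb, sqrt


def random_walk_moments(n):
    """Return exact (E(Y), E(Y^2), E(|Y|)) for a fair random walk of n steps.

    Uses Y = 2H - n, where H (the number of heads) is Bin(n, 1/2).
    """
    total = 2**n  # number of equally likely flip sequences
    e_y = Fraction(0)
    e_y2 = Fraction(0)
    e_abs = Fraction(0)
    for h in range(n + 1):
        prob = Fraction(comb(n, h), total)  # Pr[H = h]
        y = 2 * h - n                       # position when there are h heads
        e_y += prob * y
        e_y2 += prob * y * y
        e_abs += prob * abs(y)
    return e_y, e_y2, e_abs


if __name__ == "__main__":
    n = 10
    e_y, e_y2, e_abs = random_walk_moments(n)
    print("E(Y)   =", e_y)
    print("E(Y^2) =", e_y2)
    print("E(|Y|) =", float(e_abs), " versus sqrt(n) =", sqrt(n))
```

The exact value of E(|Y|) comes out below √{n}, which is consistent with treating σ(Y) = √{n} as a measure of typical distance rather than as the expected distance itself.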
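The three closed forms derived above, (n^2-1)/12 for the uniform distribution, np(1-p) for the binomial, and (1-p)/p^2 for the geometric, can be sanity-checked numerically with Var(X) = E(X^2) - μ^2. The sketch below is one possible check. All function names are invented for illustration, and the geometric distribution is truncated at a finite `cutoff`, which is an assumption made only so that the sum is finite.

```python
from fractions import Fraction
from math import comb


def variance(pmf):
    """Var(X) = E(X^2) - E(X)^2 for a pmf given as {value: probability}."""
    mean = sum(x * p for x, p in pmf.items())
    second_moment = sum(x * x * p for x, p in pmf.items())
    return second_moment - mean**2


def uniform_pmf(n):
    """Uniform distribution on 1, ..., n."""
    return {i: Fraction(1, n) for i in range(1, n + 1)}


def binomial_pmf(n, p):
    """Bin(n, p): Pr[X = i] = C(n, i) p^i (1-p)^(n-i)."""
    return {i: comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)}


def geometric_pmf(p, cutoff):
    """Geom(p) truncated to 1..cutoff (the dropped tail is assumed negligible)."""
    return {i: p * (1 - p) ** (i - 1) for i in range(1, cutoff + 1)}


if __name__ == "__main__":
    print(variance(uniform_pmf(6)), "vs", Fraction(6**2 - 1, 12))
    p = Fraction(1, 3)
    print(variance(binomial_pmf(10, p)), "vs", 10 * p * (1 - p))
    print(variance(geometric_pmf(0.25, 500)), "vs", (1 - 0.25) / 0.25**2)
```

Using `Fraction` for the uniform and binomial cases keeps the comparison exact. The geometric case uses floats, so it should only be expected to agree approximately.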
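The three variance facts can also be illustrated on a concrete case. The sketch below uses a fair die, with helper names (`transform`, `sum_of_independent`) that are assumptions of this example and not notation from the notes. It builds the distribution of X + Y for independent X and Y by multiplying probabilities, exactly as in the definition of independence.

```python
from fractions import Fraction
from itertools import product


def variance(pmf):
    """Var(X) = E(X^2) - E(X)^2 for a pmf given as {value: probability}."""
    mean = sum(x * p for x, p in pmf.items())
    second_moment = sum(x * x * p for x, p in pmf.items())
    return second_moment - mean**2


def transform(pmf, f):
    """Distribution of f(X) when X has the given pmf."""
    out = {}
    for x, p in pmf.items():
        out[f(x)] = out.get(f(x), 0) + p
    return out


def sum_of_independent(pmf_x, pmf_y):
    """Distribution of X + Y, using Pr[X=a and Y=b] = Pr[X=a] Pr[Y=b]."""
    out = {}
    for (x, px), (y, py) in product(pmf_x.items(), pmf_y.items()):
        out[x + y] = out.get(x + y, 0) + px * py
    return out


die = {i: Fraction(1, 6) for i in range(1, 7)}

if __name__ == "__main__":
    v = variance(die)
    print("Var(X)    =", v)
    print("Var(3X)   =", variance(transform(die, lambda x: 3 * x)))
    print("Var(X+5)  =", variance(transform(die, lambda x: x + 5)))
    print("Var(X+Y)  =", variance(sum_of_independent(die, die)))
    # X + X is not a sum of independent variables, so fact (3) does not apply.
    print("Var(X+X)  =", variance(transform(die, lambda x: x + x)))
```

The last line shows why the independence condition in (3) matters: X + X = 2X has variance 4Var(X) by fact (1), not 2Var(X).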