```
Administrative info
HW8 due Wednesday
Final exam Thursday 5-8pm in 10 Evans
No regrades for HW8 (not enough time) or final exam (UCB policy)
Review session tomorrow 3-5pm in 306 Soda

Review
We can describe a continuous random variable X in two ways.
(1) The cumulative distribution function (cdf):
F(x) = Pr[X <= x].
(2) The probability density function (pdf):
f(x) = d/dx F(x).

The cdf is defined for all random variables, discrete or continuous.
In HW8 Q12, if you choose to do it, you show that the cdf contains
all the information about a random variable.

The exponential distribution X ~ Exp(λ) has pdf
f(x) = { λ e^{-λx}   if x >= 0
       { 0           if x < 0
and cdf
F(x) = { 1 - e^{-λx}   if x >= 0
       { 0             if x < 0.
It tells us how long until the first success when the rate of
success per unit time is λ. The expectation and variance are
E(X) = 1/λ
Var(X) = 1/λ^2.
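
As a quick sanity check, here is a minimal Python sketch that estimates
E(X) and Var(X) by simulation. The rate λ = 2 and the sample size are
illustrative choices, not from the notes.

```python
import random

# Sanity check: for X ~ Exp(λ), E(X) = 1/λ and Var(X) = 1/λ².
# λ = 2.0 and the sample size are illustrative choices.
random.seed(0)
lam = 2.0
n = 200_000
samples = [random.expovariate(lam) for _ in range(n)]

mean = sum(samples) / n
var = sum((x - mean) ** 2 for x in samples) / n

print(mean)  # close to 1/λ = 0.5
print(var)   # close to 1/λ² = 0.25
```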

The normal or Gaussian distribution Y ~ N(μ, σ^2) has pdf
f(y) = 1/√{2πσ^2} e^{-(y-μ)^2/(2σ^2)}
and expectation and variance
E(Y) = μ
Var(Y) = σ^2.

The pdf of a normal distribution is a symmetric bell-shaped curve
centered at μ, with a width determined by σ.

The cdf of a normal distribution does not have a simple, closed
form.

The standard normal distribution has parameters μ = 0, σ
= 1. So if Z is a standard normal, then
Z ~ N(0, 1),
and the pdf of Z is
g(z) = 1/√{2π} e^{-z^2/2}.
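
Although the cdf has no elementary closed form, it can be evaluated
numerically via the error function. A small sketch, using the
standard-library math.erf:

```python
import math

# Standard normal pdf g(z), and cdf Φ(z) = Pr[Z <= z] via the error
# function, since the normal cdf has no elementary closed form.
def g(z):
    return math.exp(-z * z / 2) / math.sqrt(2 * math.pi)

def Phi(z):
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

print(g(0))    # 1/√(2π) ≈ 0.3989
print(Phi(0))  # 0.5, by symmetry
```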

Normal Distribution (cont.)
We can turn any normal distribution into a standard normal by
translating and scaling. If X ~ N(μ, σ^2), then let
Z = (X-μ)/σ.
Then by linearity of expectation,
E(Z) = 1/σ (E(X) - μ)
= 1/σ (μ - μ)
= 0.
Similarly, using our variance facts, we have
Var(Z) = Var((X-μ)/σ)
= 1/σ^2 Var(X - μ)
= 1/σ^2 Var(X)
= 1/σ^2 σ^2
= 1.
So we have shown that Z has the right expectation and variance. We
need to show that Z is normal. Since Z = (X-μ)/σ, we have
that X = σZ + μ, so
Pr[a <= Z <= b] = Pr[σa+μ <= X <= σb+μ]
= 1/√{2πσ^2} ∫_{σa+μ}^{σb+μ} e^{-(x-μ)^2/(2σ^2)} dx.
We can do a change of variable from x to z, where z =
(x-μ)/σ, or x = zσ+μ. So the bounds of the
integral become
((σa+μ)-μ)/σ = a
((σb+μ)-μ)/σ = b,
the (x-μ)^2/σ^2 in the exponent becomes z^2, and the dx
becomes σdz, giving us
Pr[a <= Z <= b] = 1/√{2π} ∫_a^b e^{-z^2/2} dz.
Thus, Z ~ N(0, 1).

Thus, we can turn any normal into a standard normal, so if we have a
table of probabilities for the standard normal, we can determine
probabilities for any normal. Often, probabilities for a standard
normal are given in a "z-score" table, which tabulates Pr[Z <= z]
for various values of z, where Z ~ N(0, 1).

EX: Suppose a set of exam scores follow a normal distribution with a
mean of 70 and a standard deviation of 10. What is the
probability that a random student scores at least 90?

Let X be the student's score. We have X ~ N(70, 100), and we
want Pr[X >= 90]. Let Z ~ N(0, 1). We get
Pr[X >= 90] = Pr[(X-70)/10 >= 2]
= Pr[Z >= 2]
= Pr[Z <= -2]  (since a normal is symmetric around its mean)
≈ 0.02.
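
We can check this value numerically; a sketch using the error function
for the standard normal cdf:

```python
import math

# Check the exam example: X ~ N(70, 10²), so Pr[X >= 90] = Pr[Z >= 2].
def Phi(z):
    # Pr[Z <= z] for Z ~ N(0, 1), via the error function
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

p = 1 - Phi(2)  # Pr[Z >= 2]
print(p)        # ≈ 0.0228, which rounds to the 0.02 above
print(Phi(-2))  # the same value, by symmetry
```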

Some features of the normal distribution X ~ N(μ, σ^2):
(1) The value of X falls within σ of its mean μ with probability ≈ 0.68.
(2) The value of X falls within 2σ of μ with probability ≈ 0.95.
(3) The value of X falls within 3σ of μ with probability ≈ 0.997.
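
These three probabilities follow directly from the standard normal cdf;
a quick numeric check, with Φ computed via math.erf:

```python
import math

# The 68/95/99.7 rule: Pr[|X - μ| <= kσ] = Φ(k) - Φ(-k) for k = 1, 2, 3.
def Phi(z):
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

for k in (1, 2, 3):
    print(k, Phi(k) - Phi(-k))
```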

Useful normal tricks:
(1) Pr[Z >= z] = Pr[Z <= -z]
(2) Pr[Z >= z] = 1 - Pr[Z <= z].

The sum of two independent normally distributed random variables X_1
~ N(μ_1, σ_1^2) and X_2 ~ N(μ_2, σ_2^2), Y = X_1 +
X_2, is also normally distributed, Y ~ N(μ_1+μ_2,
σ_1^2+σ_2^2). (Of course, you already knew its
expectation and variance; the important fact is that it is normal.)
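
A simulation sketch of this fact, with illustrative parameters
N(1, 2²) and N(3, 1²). Note this only spot-checks the mean and
variance; the normality of the sum is the deeper fact.

```python
import random

# Empirical check: X1 ~ N(1, 2²) and X2 ~ N(3, 1²) independent, so
# X1 + X2 ~ N(1+3, 2²+1²) = N(4, 5). Parameters are illustrative.
random.seed(0)
n = 200_000
ys = [random.gauss(1, 2) + random.gauss(3, 1) for _ in range(n)]

mean = sum(ys) / n
var = sum((y - mean) ** 2 for y in ys) / n
print(mean)  # close to 4
print(var)   # close to 5
```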

The normal distribution models aggregate results from many
independent observations of the same random variable, as we will see
next.

The Central Limit Theorem
Recall the law of large numbers. Given i.i.d. random variables X_i
with common mean μ and variance σ^2, we defined the sample
average as
A_n = 1/n ∑_{i=1}^n X_i.
Then A_n has mean μ and variance σ^2/n. This implies, by
Chebyshev's inequality, that the probability of any deviation
α from the mean goes to 0 as n->∞:
Pr[|A_n - μ| >= α] <= Var(A_n)/α^2
= σ^2/(nα^2)
-> 0 as n->∞.
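
A simulation sketch of this shrinking deviation probability, with X_i
uniform on [0, 1] (so μ = 1/2) and α = 0.05, both illustrative choices:

```python
import random

# Estimate Pr[|A_n - μ| >= α] for X_i uniform on [0, 1], μ = 1/2.
random.seed(0)

def deviation_freq(n, trials=2_000, alpha=0.05):
    hits = 0
    for _ in range(trials):
        a_n = sum(random.random() for _ in range(n)) / n
        if abs(a_n - 0.5) >= alpha:
            hits += 1
    return hits / trials

print(deviation_freq(10))     # sizable
print(deviation_freq(1_000))  # essentially 0
```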

We can actually say something much stronger than the law of large
numbers: the distribution of A_n tends to the normal distribution
with mean μ and variance σ^2/n as n becomes large.

To state this precisely, so that we get a convergence to a single
distribution, we first scale A_n so that its mean is 0 and variance
is 1:
Z_n = (A_n - μ) √n / σ
= n (A_n - μ) / (σ √n)
= n (1/n ∑_{i=1}^n X_i - μ) / (σ √n)
= (∑_{i=1}^n X_i - nμ) / (σ√n).
Then the distribution of Z_n tends to that of the standard normal
Z as n->∞, meaning
∀α∈R Pr[Z_n <= α] -> Pr[Z <= α] as n->∞.

Since the sample mean A_n is just a scaling and translation of Z_n,
it too has an approximately normal distribution for large n, but
with mean μ and variance σ^2/n. Finally, the sample sum
S_n = ∑_{i=1}^n X_i
also has a normal distribution, with parameters nμ and
nσ^2, since it is just a scaling of the sample mean. (Note
that as we saw in discussing LLN, the probability of any deviation
of S_n from its mean does not tend to 0. Its distribution, however,
does tend to a normal distribution, but with increasing variance as
n->∞.)
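
A simulation sketch of the convergence, taking X_i ~ Exp(1) (so
μ = σ = 1); the choice of distribution, n, and trial count are all
illustrative:

```python
import math
import random

# Z_n = (Σ X_i - nμ)/(σ√n) for X_i ~ Exp(1) should be approximately
# standard normal for large n, even though Exp(1) is very skewed.
random.seed(0)

def Phi(z):
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

n, trials = 200, 20_000
zs = [(sum(random.expovariate(1.0) for _ in range(n)) - n) / math.sqrt(n)
      for _ in range(trials)]

emp = sum(z <= 1 for z in zs) / trials
print(emp, Phi(1))  # empirical Pr[Z_n <= 1] vs. Pr[Z <= 1] ≈ 0.8413
```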

The central limit theorem tells us that if we take n i.i.d.
observations X_1, ..., X_n of a random variable X, no matter what
distribution X has (as long as its mean and variance are finite, and
its variance is nonzero),
then the distribution of the sample mean or sum tends to that of the
normal distribution. The sample mean tends to a normal distribution
with parameters μ and σ^2/n, where μ = E(X_i) and
σ^2 = Var(X_i), and the sample sum tends to a normal
distribution with parameters nμ and nσ^2. This explains the
prevalence of the normal distribution, and it allows us to
approximate distributions that are the sum of i.i.d. random
variables.

The simplest example of the CLT in action is the binomial
distribution. A binomial random variable X ~ Bin(n, p) is the sum of
n i.i.d. indicator random variables
X = X_1 + ... + X_n,
where
X_i = { 1  w.p. p
      { 0  w.p. 1-p.
This explains why the binomial distribution is bell-shaped. It also
allows us to approximate the binomial distribution using a normal
distribution with parameters np and np(1-p).

A standard rule of thumb is that the normal approximation is a
reasonable approximation if np >= 5 and n(1-p) >= 5.

EX: Suppose you flip a biased coin with probability p = 0.2 of heads
100 times. What is the probability that you get more than 30 heads?

Let X be the number of heads. Then X ~ Bin(100, 0.2), and np =
20 > 5 and n(1-p) = 80 > 5. Thus, we can approximate X as a
normally distributed random variable Y ~ N(20, 16). Then we want
Pr[X > 30] ≈ Pr[Y > 30]
= Pr[(Y-20)/4 > 2.5]
= Pr[Z > 2.5]          (where Z ~ N(0, 1))
= 1 - Pr[Z < 2.5]
≈ 0.006.
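
We can compare this approximation against the exact binomial tail, with
and without a continuity correction; a sketch:

```python
import math

# Compare: exact Pr[X > 30] for X ~ Bin(100, 0.2) vs. the N(20, 16)
# approximation, with and without a continuity correction.
def Phi(z):
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

n, p = 100, 0.2
exact = sum(math.comb(n, k) * p**k * (1 - p)**(n - k)
            for k in range(31, n + 1))
approx = 1 - Phi((30 - 20) / 4)       # ≈ 0.0062
corrected = 1 - Phi((30.5 - 20) / 4)  # continuity-corrected
print(exact, approx, corrected)
```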

Since the binomial distribution is discrete while the normal
distribution is continuous, we can get a better approximation by
applying a "continuity correction." However, we do not require you
to use a continuity correction in this class.

Illustration of CLT
Let's do another simple example that illustrates the central limit
theorem. Consider the case where the X_i are i.i.d. and have the
uniform distribution
      { 0  w.p. 1/3
X_i = { 1  w.p. 1/3
      { 2  w.p. 1/3.

     1/3 | *  *  *
         `---------
           0  1  2
Let Z_n be the sum of X_1, ..., X_n. For Z_2, we get that Pr[Z_2 = k]
is just 1/9 times the number of ways that
X_1 + X_2 = k.
For k ∈ {0, 1, 2}, this is just pirate coins/stars and bars, so it
is
C(2+k-1, 2-1) = k+1.
Then the distribution is symmetric around the mean, so we get
      { 0  w.p. 1/9
      { 1  w.p. 2/9
Z_2 = { 2  w.p. 3/9
      { 3  w.p. 2/9
      { 4  w.p. 1/9.

     3/9 |       *
     2/9 |    *  *  *
     1/9 | *  *  *  *  *
         `---------------
           0  1  2  3  4
For Z_3, it is a little more complicated, but we get
      { 0  w.p. 1/27
      { 1  w.p. 3/27
      { 2  w.p. 6/27
Z_3 = { 3  w.p. 7/27
      { 4  w.p. 6/27
      { 5  w.p. 3/27
      { 6  w.p. 1/27.

    7/27 |          *
         |       *  *  *
    5/27 |       *  *  *
         |       *  *  *
    3/27 |    *  *  *  *  *
         |    *  *  *  *  *
    1/27 | *  *  *  *  *  *  *
         `---------------------
           0  1  2  3  4  5  6
We can already see the beginnings of a bell-shaped curve, with the
sum of just three i.i.d. random variables.
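
These distributions can be checked by brute-force enumeration over all
3^n equally likely outcomes; a sketch:

```python
from itertools import product

# Enumerate the distribution of Z_n = X_1 + ... + X_n by brute force,
# where each X_i is uniform on {0, 1, 2}. Returns counts out of 3^n.
def dist(n):
    counts = {}
    for outcome in product((0, 1, 2), repeat=n):
        s = sum(outcome)
        counts[s] = counts.get(s, 0) + 1
    return counts

print(dist(2))  # {0: 1, 1: 2, 2: 3, 3: 2, 4: 1}, out of 3² = 9
print(dist(3))  # {0: 1, 1: 3, 2: 6, 3: 7, 4: 6, 5: 3, 6: 1}, out of 27
```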

Proof of CLT (Optional)
The following is an overview of the proof of the central limit
theorem. It is optional, was not covered in lecture, and will not be
on the exam, so feel free to skip this section if you are not
interested.

We start by defining the "characteristic function" of a random
variable X as the function
φ_X(t) = E(e^{itX}),
i.e. the value of φ_X(t) is the expectation of e^{itX}. Recall
that a random variable is a function from outcomes to another set,
so e^{itX} is another random variable defined as
(e^{itX})(ω) = e^{itX(ω)}.
This random variable is a function from outcomes to the complex
numbers, so it has an expectation.

Like the cdf, the characteristic function encodes all the
information about a random variable. Also like the cdf, it always
exists, even when the pdf does not or when the mean and variance do
not.

If the pdf does exist, then the characteristic function is its
(unscaled) Fourier transform:
φ_X(t) = E(e^{itX}) = ∫_{-∞}^{+∞} e^{itx} f(x) dx,
where f(x) is the pdf of X.

In particular, we can compute the characteristic function of a
normal random variable Y ~ N(μ, σ^2):
φ_Y(t) = e^{i t μ - 1/2 σ^2 t^2}.
The characteristic function of a standard normal Z ~ N(0, 1) then is
φ_Z(t) = e^{- 1/2 t^2}.
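
A Monte Carlo sketch of the standard normal case; the sample size and
the choice t = 1.5 are illustrative:

```python
import cmath
import math
import random

# Estimate φ_Z(t) = E(e^{itZ}) for Z ~ N(0, 1) by Monte Carlo and
# compare with the closed form e^{-t²/2}.
random.seed(0)
n = 100_000
zs = [random.gauss(0, 1) for _ in range(n)]

t = 1.5
phi_mc = sum(cmath.exp(1j * t * z) for z in zs) / n
phi_exact = math.exp(-t * t / 2)
print(phi_mc.real, phi_exact)  # close; the imaginary part is ≈ 0
```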

The characteristic function of the sum of two independent random
variables X and Y is the product of their characteristic functions:
φ_{X+Y}(t) = E(e^{it(X+Y)})
= E(e^{itX} e^{itY})
= E(e^{itX}) E(e^{itY})            (since X, Y independent)
= φ_X(t) φ_Y(t).
In the third line, we used the fact that since X and Y are
independent, e^{itX} and e^{itY} are independent.

The characteristic function of a scaled random variable cX, where c
is a constant, is
φ_{cX}(t) = E(e^{it(cX)})
= E(e^{i(ct)X})
= φ_X(ct)
by a simple change of variable.

If the mean and variance of a random variable exist and are finite,
we can use the Taylor expansion of e^x to approximate the
characteristic function of X/√n:
φ_{X/√n}(t) = E(e^{itX/√n})
≈ E(1 + itX/√n - t^2 X^2 / 2n)
= 1 + (it/√n) E(X) - (t^2/2n) E(X^2).
As n->∞, the higher-order terms vanish, so this becomes a good
approximation. If the mean is 0 and the variance is 1, then E(X) = 0
and E(X^2) = Var(X) + E(X)^2 = 1, so we get
φ_{X/√n}(t) ≈ 1 - t^2/2n.

Finally, Lévy's continuity theorem tells us that if the
characteristic functions of a sequence of random variables Z_1, Z_2,
... converge to the characteristic function of another random
variable Z, then so too do the cdfs of Z_1, Z_2, ... converge to the
cdf of Z. This means that they "converge in distribution" to the
distribution of Z.

We are now ready to prove the CLT. We will restrict ourselves to the
case that the individual random variables are i.i.d.

Consider a set of i.i.d. random variables X_1, ..., X_n with common
mean μ and variance σ^2, both finite and the variance
nonzero. Let
Y_i = (X_i - μ) / σ
for i = 1, ..., n. Then the Y_i have common mean 0 and variance 1.
Let
Z_n = (∑_{i=1}^n X_i - n μ) / (σ √n).
Then we see that
Z_n = ∑_{i=1}^n (Y_i/√n).
Since the Y_i have a mean of 0 and a variance of 1, the
characteristic function of Y_i/√n is
φ_{Y_i/√n}(t) ≈ 1 - t^2/2n.
Then the characteristic function of Z_n is
φ_{Z_n}(t) = φ_{∑ Y_i/√n}(t)
= φ_{Y_1/√n}(t) ... φ_{Y_n/√n}(t)
≈ (1 - t^2/2n)^n
≈ e^{(-t^2/2n) n}                          (as n->∞)
= e^{- 1/2 t^2}.
Thus, we see that the characteristic functions of the Z_n converge
to that of the standard normal as n->∞, so by Lévy's
continuity theorem, the distributions of the Z_n converge to that of
the standard normal.
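
The limit (1 - t²/2n)^n -> e^{-t²/2} used in the last step can be
checked numerically; here t = 2 is an illustrative choice:

```python
import math

# The limit used in the last step of the proof: (1 - t²/2n)^n -> e^{-t²/2}.
t = 2.0
for n in (10, 100, 10_000):
    print(n, (1 - t * t / (2 * n)) ** n)
print("limit:", math.exp(-t * t / 2))  # e^{-2} ≈ 0.1353
```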

```