#import "@youwen/zen:0.1.0": *

#show: zen.with(
  title: "PSTAT120A Course Notes",
  author: "Youwen Wu",
  date: "Winter 2025",
  subtitle: "Taught by Brian Wainwright",
)

= Introduction
These are lecture notes from when PSTAT120A (Probability and Statistics) was
taught in Winter 2025 by Dr. Wainwright. Any errors contained within are the
scribe's, not the instructor's.

= Lecture #datetime(day: 6, month: 1, year: 2025).display()
== Set theory for dummies
A *set* is a collection of elements.
#example[Examples of sets][
  + Trivial set: ${1}$
  + Empty set: $emptyset$
  + $A = {a,b,c}$
]

We can construct sets using set-builder notation (also sometimes called set
comprehension):

$ {"expression with" x | "conditions on" x} $

#example("Set builder notation")[
  + The set of all even integers: ${2n | n in ZZ}$
  + The set of all perfect squares in $RR$: ${x^2 | x in NN}$
]

We also have notation for working with sets:
With arbitrary sets $A$, $B$:

+ $a in A$ ($a$ is a member of the set $A$)
+ $a in.not A$ ($a$ is not a member of the set $A$)
+ $A subset.eq B$ (Set theory: $A$ is a subset of $B$) (Stats: $A$ is an event in the sample space $B$)
+ $A subset B$ (Proper subset: $A subset.eq B$ and $A != B$)
+ $A^c$ or $A'$ (read "complement of $A$," and introduced later)
+ $A union B$ (Union of $A$ and $B$. Gives the set of elements that are in $A$, in $B$, or in both)
+ $A sect B$ (Intersection of $A$ and $B$. Gives a set consisting of the elements in *both* $A$ and $B$)
+ $A \\ B$ (Set difference. The set of all elements of $A$ that are not also in $B$)
+ $A times B$ (Cartesian product. The set of all ordered pairs $(a,b)$ with $a in A$ and $b in B$)

We can also write a few of these operations precisely as set comprehensions.

+ $A subset.eq B <==> A = {a | a in A and a in B}$
+ $A union B = {x | x in A or x in B}$ (here $or$ is the logical OR)
+ $A sect B = {x | x in A and x in B}$ (here $and$ is the logical AND)
+ $A \\ B = {a | a in A and a in.not B}$
+ $A times B = {(a,b) | a in A and b in B}$

Take a moment and convince yourself that these definitions are equivalent to
the previous ones.
The universal set $Omega$ is the set of all objects in a given set
theoretical universe.
With the above definition, we can now introduce the set complement.
The set complement $A'$ is given by
$ A' = Omega \\ A $
where $Omega$ is the _universal set_.

#example[The real plane][
  The real plane $RR^2$ can be defined as a Cartesian product of $RR$ with
  itself.

  $ RR^2 = RR times RR $

  Check your intuition that this makes sense. Why do you think $RR^n$ was chosen
  as the notation for $n$ dimensional spaces in $RR$?
]

#definition[Disjoint sets][
  If $A sect B = emptyset$, then we say that $A$ and $B$ are *disjoint*.
]

For any sets $A$ and $B$, we have DeMorgan's Laws:
+ $(A union B)' = A' sect B'$
+ $(A sect B)' = A' union B'$
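
As a quick sanity check, both laws can be tested on small finite sets. This is only a sketch: the universal set `omega` and the sets `a` and `b` below are made up for illustration, and a check on one example is of course not a proof.

```python
# Hypothetical universal set and events, chosen only for illustration.
omega = set(range(1, 11))
a = {1, 2, 3, 4}
b = {3, 4, 5, 6}


def complement(s, universe=omega):
    """Return the complement of s relative to the universal set."""
    return universe - s


# (A union B)' = A' intersect B'
law_1 = complement(a | b) == complement(a) & complement(b)
# (A intersect B)' = A' union B'
law_2 = complement(a & b) == complement(a) | complement(b)

print(law_1, law_2)
```

Python's `|`, `&`, and `-` on sets are union, intersection, and set difference, so the code reads almost exactly like the notation above.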

#fact[Generalized DeMorgan's][
  + $(union.big_i A_i)' = sect.big_i A_i '$
  + $(sect.big_i A_i)' = union.big_i A_i '$
]

== Sizes of infinity
Let $N(A)$ be the number of elements in $A$. $N(A)$ is called the _cardinality_ of $A$.
We say a set is finite if it has finite cardinality, or infinite if it has an
infinite cardinality.
Infinite sets can be either _countably infinite_ or _uncountably infinite_.
When a set is countably infinite, its cardinality is $aleph_0$ (here $aleph$ is
the Hebrew letter aleph and read "aleph null").
When a set is uncountably infinite, its cardinality is greater than $aleph_0$.
#example("Countable sets")[
  + The natural numbers $NN$.
  + The rationals $QQ$.
  + The integers $ZZ$.
  + The set of all logical tautologies.
]

#example("Uncountable sets")[
  + The real numbers $RR$.
  + The real numbers in the interval $[0,1]$.
  + The _power set_ of $ZZ$, which is the set of all subsets of $ZZ$.
]

All the uncountable sets above have cardinality $2^(aleph_0)$, also written
$frak(c)$ or $beth_1$. This is the _cardinality of the continuum_, also
called "beth 1". Whether it also equals $aleph_1$ ("aleph 1") is the continuum
hypothesis, which cannot be decided from the standard axioms of set theory.

However, in general uncountably infinite sets do not have the same
cardinality.

If a set is countably infinite, then it has a bijection with $ZZ$. This means
every set with cardinality $aleph_0$ has a bijection to $ZZ$. More generally,
any sets with the same cardinality have a bijection between them.
This gives us the following equivalent statement:
Two sets have the same cardinality if and only if there exists a bijective
function between them. In symbols,
$ N(A) = N(B) <==> exists F : A <-> B $
= Lecture #datetime(day: 8, month: 1, year: 2025).display()
== Probability
A *random experiment* is one in which the set of all possible outcomes is known in advance, but one can't predict which outcome will occur on a given trial of the experiment.
#example("Finite sample spaces")[
  Toss a coin:
  $ Omega = {H,T} $

  Roll a pair of dice:
  $ Omega = {1,2,3,4,5,6} times {1,2,3,4,5,6} $
]

#example("Countably infinite sample spaces")[
  Shoot a basket until you make one:
  $ Omega = {M, F M, F F M, F F F M, dots} $
]

#example("Uncountably infinite sample space")[
  Waiting time for a bus:
  $ Omega = {t : t >= 0} $
]

Elements of $Omega$ are called sample points.
Any properly defined subset of $Omega$ is called an *event*.
Rolling a fair die twice, let $A$ be the event that the combined score of both dice is 10.
$ A = {(4,6), (5,5), (6,4)} $

Probabilistic concepts in the parlance of set theory:
- Universal set ($Omega$) $<->$ sample space
- Element $<->$ outcome / sample point ($omega$)
- Disjoint sets $<->$ mutually exclusive events
== Classical approach
Classical approach:
$ P(A) = (hash A) / (hash Omega) $
Requires equally likely outcomes and finite sample spaces.
With an infinite sample space, this ratio is undefined (or 0 for every finite event), which is often wrong.
#example("Dice again")[
  Rolling a fair die twice, let $A$ be the event that the combined score of both dice is 10.
  $
    A &= {(4,6), (5,5), (6,4)} \
    P(A) &= 3 / 36 = 1 / 12
  $
]
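
The classical approach is easy to check by brute force, since it only asks us to count. A minimal sketch that enumerates the sample space for two dice and counts the outcomes summing to 10 (the variable names are our own):

```python
from fractions import Fraction
from itertools import product

faces = range(1, 7)
# Sample space: all 36 ordered pairs (first roll, second roll).
omega = list(product(faces, repeat=2))

# Event A: the combined score is 10.
event_a = [outcome for outcome in omega if sum(outcome) == 10]

# Classical probability: favorable outcomes over total outcomes.
p_a = Fraction(len(event_a), len(omega))
print(event_a, p_a)
```

Using `Fraction` keeps the result exact rather than a rounded float.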
== Relative frequency approach
An approach done commonly by applied statisticians who work in the disgusting
real world. This is where we are generally concerned with irrelevant concerns
like accurate sampling and $p$-values and such.

$
  P(A) = (hash "of times" A "occurs in large number of trials") / (hash "of trials")
$
Flipping a coin to determine the probability of it landing heads.
== Subjective approach
Personal definition of probability. Not "real" probability, merely co-opting
its parlance to lend credibility to subjective judgements of confidence.
== Axiomatic approach
Consider a random experiment. Then:
The *sample space* $Omega$ is the set of all possible outcomes of the
experiment.

Elements of $Omega$ are called *sample points*.

Subsets of $Omega$ are called *events*. The collection of events (here, the
power set of $Omega$) is denoted by $cal(F)$.

The *probability measure*, or probability distribution, or simply probability, is a function $P$.
Let $P : cal(F) -> RR$ be a function satisfying the following axioms (properties).

+ $P(A) >= 0, forall A in cal(F)$
+ $P(Omega) = 1$
+ If $A_1, A_2, dots$ are pairwise disjoint, that is $A_i sect A_j = emptyset, forall i != j$, then
  $ P(union.big_(i=1)^infinity A_i) = sum_(i=1)^infinity P(A_i) $

The 3-tuple $(Omega, cal(F), P)$ is called a *probability space*.
In more advanced texts you will see $cal(F)$ introduced as a so-called
$sigma$-algebra. A $sigma$-algebra on a set $Omega$ is a nonempty collection
$Sigma$ of subsets of $Omega$ that is closed under set complement, countable
unions, and as a corollary, countable intersections.

Now let us show various results with $P$.
$ P(emptyset) = 0 $
By axiom 3, taking every set in the sequence to be empty,
$
  A_1 = emptyset, A_2 = emptyset, A_3 = emptyset, dots \
  P(emptyset) = P(union.big_(i=1)^infinity A_i) = sum^infinity_(i=1) P(A_i) = sum^infinity_(i=1) P(emptyset)
$
Suppose $P(emptyset) != 0$. Then $P(emptyset) > 0$ by axiom 1, so the sum on the right diverges to $infinity$, while the left side $P(emptyset)$ is a finite real number. This is a contradiction, so $P(emptyset) = 0$.
If $A_1, A_2, ..., A_n$ are disjoint, then
$ P(union.big^n_(i=1) A_i) = sum^n_(i= 1) P(A_i) $
This is mostly a formal manipulation to derive the obviously true proposition from our axioms.

Write any finite sequence $(A_1, A_2, ..., A_n)$ as an infinite sequence $(A_1, A_2, ..., A_n, emptyset, emptyset, ...)$. Then
$
  P(union.big_(i=1)^infinity A_i) = sum^n_(i=1) P(A_i) + sum^infinity_(i=n+1) P(emptyset) = sum^n_(i=1) P(A_i)
$
And because all of the sets after $A_n$ are $emptyset$, they add no additional elements to the union of all $A_i$, so
$
  P(union.big_(i=1)^infinity A_i) = P(union.big_(i=1)^n A_i) = sum_(i=1)^n P(A_i)
$

$ P(A') = 1 - P(A) $
$
  A' union A &= Omega \
  A' sect A &= emptyset \
  P(A' union A) &= P(A') + P(A) &"(by axiom 3)"\
  = P(Omega) &= 1 &"(by axiom 2)" \
  therefore P(A') &= 1 - P(A)
$

$ A subset.eq B => P(A) <= P(B) $
$ B = A union (A' sect B) $
but $A$ and ($A' sect B$) are disjoint, so
$
  P(B) &= P(A union (A' sect B)) \
  &= P(A) + P(A' sect B) \
  &therefore P(B) >= P(A)
$

#proposition("Inclusion-exclusion principle")[
2025-01-08 15:43:40 -08:00
$ P(A union B) = P(A) + P(B) - P(A sect B) $
2025-02-03 15:15:35 -08:00
2025-01-08 15:43:40 -08:00
A = (A sect B) union (A sect B') \
=> P(A) = P(A sect B) + P(A sect B') \
=> P(B) = P(B sect A) + P(B sect A') \
P(A) + P(B) = P(A sect B) + P(A sect B) + P(A sect B') + P(A' sect B) \
=> P(A) + P(B) - P(A sect B) = P(A sect B) + P(A sect B') + P(A' sect B) \
2025-01-08 20:40:52 -08:00
This strengthens the finite additivity that follows from axiom 3: it holds for all events $A$ and $B$, regardless of whether they're disjoint.

These are mostly intuitively true statements (think about the probabilistic
concepts represented by the sets) in classical probability that we derive
rigorously from our axiomatic probability function $P$.
Now let us consider some trivial concepts in classical probability written in
the parlance of combinatorial probability.
Select one card from a deck of 52 cards.
Then the following is true:
$
  Omega = {1,2,...,52} \
  A = "card is a heart" = {H 2, H 3, H 4, ..., H"Ace"} \
  B = "card is an Ace" = {H"Ace", C"Ace", D"Ace", S"Ace"} \
  C = "card is black" = {C 2, C 3, ..., C"Ace", S 2, S 3, ..., S"Ace"} \
  P(A) = 13 / 52,
  P(B) = 4 / 52,
  P(C) = 26 / 52 \
  P(A sect B) = 1 / 52 \
  P(A sect C) = 0 \
  P(B sect C) = 2 / 52 \
  P(A union B) = P(A) + P(B) - P(A sect B) = 16 / 52 \
  P(B') = 1 - P(B) = 48 / 52 \
  P(A sect B') = P(A) - P(A sect B) = 13 / 52 - 1 / 52 = 12 / 52 \
  P((A sect B') union (A' sect B)) = P(A sect B') + P(A' sect B) = 15 / 52 \
  P(A' sect B') = P((A union B)') = 1 - P(A union B) = 36 / 52
$
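
The card computations above can be double-checked by building the deck explicitly. This is a sketch under our own encoding: each card is a (suit, rank) pair, and the helper `prob` is a name we made up for the classical ratio.

```python
from fractions import Fraction
from itertools import product

ranks = ["2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K", "Ace"]
suits = ["H", "D", "C", "S"]
# Sample space: all 52 (suit, rank) pairs.
omega = set(product(suits, ranks))


def prob(event):
    """Classical probability: size of the event over size of the sample space."""
    return Fraction(len(event), len(omega))


hearts = {card for card in omega if card[0] == "H"}          # event A
aces = {card for card in omega if card[1] == "Ace"}          # event B
black = {card for card in omega if card[0] in ("C", "S")}    # event C

print(prob(hearts | aces))             # P(A union B)
print(prob(hearts - aces))             # P(A intersect B')
print(prob(omega - (hearts | aces)))   # P(A' intersect B')
```

Each printed value is reduced to lowest terms, so $16 / 52$ appears as `4/13`.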
== Countable sample spaces
A sample space $Omega$ is said to be *countable* if it's finite or countably infinite.
In such a case, one can list the elements of $Omega$.
$ Omega = {omega_1, omega_2, omega_3, ...} $
with associated probabilities, $p_1, p_2, p_3,...$, where
$
  p_i = P(omega_i) >= 0 \
  1 = P(Omega) = sum_i P(omega_i)
$
#example[Fair die, again][
  All outcomes are equally likely,
  $ p_1 = p_2 = ... = p_6 = 1 / 6 $
  Let $A$ be the event that the score is odd = ${1,3,5}$
  $ P(A) = 3 / 6 $
]

#example[Loaded die][
  Consider a die where the probability of rolling each odd side is double the probability of rolling each even side.
  $
    p_2 = p_4 = p_6, p_1 = p_3 = p_5 = 2p_2 \
    6p_2 + 3p_2 = 9p_2 = 1 \
    p_2 = 1 / 9, p_1 = 2 / 9
  $
]

Toss a fair coin until you get the first head.
$
  Omega = {H, T H, T T H, ...} "(countably infinite)" \
  P(H) = 1 / 2 \
  P(T T H) = (1 / 2)^3 \
  P(Omega) = sum_(n=1)^infinity (1 / 2)^n = 1 / (1 - 1 / 2) - 1 = 1
$

What is the probability two people share the same birthday?
$
  Omega = {1, 2, ..., 365} times {1, 2, ..., 365} \
  P(A) = 365 / 365^2 = 1 / 365
$
== Continuous sample spaces
A *continuous sample space* contains an interval in $RR$ and is uncountably infinite.
A probability density function (#smallcaps[pdf]) $f$ gives the probability density at the point
$s$; the probability of a set of outcomes is found by integrating $f$ over that set.
Properties of the #smallcaps[pdf] on a sample space $S$:
- $f(s) >= 0, forall s in S$
- $integral_S f(s) dif s = 1$
Waiting time for bus: $Omega = {s : s >= 0}$.

= Notes on counting
The cardinality of $A$ is given by $hash A$. Let us develop methods for finding
$hash A$ from a description of the set $A$ (in other words, methods for
counting).
== General multiplication principle
Let $A$ and $B$ be finite sets, $k in ZZ^+$. Then let $f : A -> B$ be a
function such that each element in $B$ is the image of exactly $k$ elements
in $A$ (such a function is called _$k$-to-one_). Then $hash A = k dot hash
B$.
Four fully loaded 10-seater vans transported people to the picnic. How many
people were transported?
Let $A$ be the set of people and $B$ the set of vans, and let $f : A -> B$ map a person to the van they ride in. Then $f$ is a 10-to-one function and $hash B = 4$, so by @ktoone the answer is $hash A = 10 dot 4 = 40$.
An $n$-tuple is an ordered sequence of $n$ elements.
Many of our methods in probability rely on multiplying together the numbers of
outcomes of several choices to obtain the number of combined outcomes. We make this explicit below in @tuplemultiplication.
Suppose a set of $n$-tuples $(a_1, ..., a_n)$ obeys these rules:
+ There are $r_1$ choices for the first entry $a_1$.
+ Once the first $k$ entries $a_1, ..., a_k$ have been chosen, the number of alternatives for the next entry $a_(k+1)$ is $r_(k+1)$, regardless of the previous choices.
Then the total number of $n$-tuples is the product $r_1 dot r_2 dot r_3 dots.c r_n$.
It is trivially true for $n = 1$ since you have $r_1$ choices of $a_1$ for a
1-tuple $(a_1)$.
Let $A$ be the set of all possible $n$-tuples and $B$ be the set of all
possible $(n+1)$-tuples. Assume, as the induction hypothesis, that the
statement is true for $A$. For the inductive step, note that each $n$-tuple
$(a_1, ..., a_n)$ in $A$ extends to exactly $r_(n+1)$ tuples in $B$.
Let $f : B -> A$ be a function which takes each $(n+1)$-tuple and truncates the $a_(n+1)$ term, leaving us with just an $n$-tuple of the form $(a_1, a_2, ..., a_n)$.
$ f((a_1, ..., a_n, a_(n + 1))) = (a_1, ..., a_n) $
Now notice that $f$ is precisely an $r_(n+1)$-to-one function! Recall by
our assumption that @tuplemultiplication is true for $n$-tuples, so $A$ has $r_1 dot
r_2 dot ... dot r_n$ elements, or $hash A = r_1 dot ... dot r_n$. Then by
@ktoone, we have $hash B = hash A dot r_(n+1) = r_1 dot r_2 dot
... dot r_(n+1)$. Our induction is complete and we have proved @tuplemultiplication.
@tuplemultiplication is sometimes called the _general multiplication principle_.
We can use @tuplemultiplication to derive counting formulas for various
situations. Let $A_1, A_2, ..., A_n$ be finite sets. Then as a corollary of
@tuplemultiplication, we can count the number of $n$-tuples in a finite
Cartesian product of $A_1, A_2, ..., A_n$.
Let $A_1, A_2, ..., A_n$ be finite sets. Then
$
  hash (A_1 times A_2 times ... times A_n) = (hash A_1) dot (hash A_2) dot ... dot (hash A_n) = product^n_(i=1) (hash A_i)
$
How many distinct subsets does a set of size $n$ have?
The answer is $2^n$. Each subset can be encoded as an $n$-tuple with entries 0
or 1, where the $i$th entry is 1 if the $i$th element of the set is in the
subset and 0 if it is not.
Thus the number of subsets is the same as the cardinality of
$ {0,1} times ... times {0,1} = {0,1}^n $
which is $2^n$.
This is why given a set $X$ with cardinality $aleph$, we write the
cardinality of the power set of $X$ as $2^aleph$.
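
The 0/1 encoding of subsets translates directly into code. A minimal sketch (the function name `subsets` is our own) that builds every subset from an $n$-tuple of bits:

```python
from itertools import product


def subsets(elements):
    """Build every subset of `elements` from a 0/1 tuple, one entry per element."""
    elements = list(elements)
    result = []
    for bits in product((0, 1), repeat=len(elements)):
        # Keep the i-th element exactly when the i-th entry of the tuple is 1.
        result.append({x for x, bit in zip(elements, bits) if bit == 1})
    return result


all_subsets = subsets(["a", "b", "c"])
print(len(all_subsets))
```

The loop runs once per element of ${0,1}^n$, which is why the count comes out to $2^n$.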
== Permutations
Now we can use the multiplication principle to count permutations.
Consider all $k$-tuples $(a_1, ..., a_k)$ that can be constructed from a set $A$ of size $n, n>= k$ without repetition. The total number of these $k$-tuples is
$ (n)_k = n dot (n - 1) ... (n - k + 1) = n! / (n-k)! $
In particular, with $k=n$, each $n$-tuple is an ordering or _permutation_ of $A$. So the total number of permutations of a set of $n$ elements is $n!$.
We construct the $k$-tuples sequentially. For the first element, we choose
one element from $A$ with $n$ alternatives. The next element has $n - 1$
alternatives. In general, after $j$ elements are chosen, there are $n - j$
alternatives for the next one, so the $k$th element has $n - k + 1$ alternatives.
Then clearly after choosing $k$ elements for our $k$-tuple we have by
@tuplemultiplication the number of $k$-tuples being $n dot (n - 1) dot ...
dot (n - k + 1) = (n)_k$.
Consider a round table with 8 seats.
+ In how many ways can we seat 8 guests around the table?
+ In how many ways can we do this if we do not differentiate between seating arrangements that are rotations of each other?
For (1), we easily see that we're simply asking for permutations of an
8-tuple, so $8!$ is the answer.
For (2), we number each person and each seat from 1-8, then always place person 1 in seat 1, and count the permutations of the other 7 people in the other 7 seats. Then the answer is $7!$.
Alternatively, notice that each arrangement has 8 equivalent arrangements under rotation. So the answer is $8!/8 = 7!$.
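
Both arguments for (2) can be checked by brute force. The sketch below (with a made-up helper `count_up_to_rotation`) uses the first idea: rotate every seating so that guest 0 sits in seat 0, and count the distinct results.

```python
from itertools import permutations
from math import factorial


def count_up_to_rotation(n):
    """Count seatings of n guests at a round table, treating rotations as equal."""
    seen = set()
    for seating in permutations(range(n)):
        # Rotate so guest 0 is in seat 0: one representative per rotation class.
        start = seating.index(0)
        canonical = seating[start:] + seating[:start]
        seen.add(canonical)
    return len(seen)


print(count_up_to_rotation(8), factorial(8) // 8)
```

The two printed numbers agree, matching $8!/8 = 7!$.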
== Counting from sets
We turn our attention to sets, which unlike tuples are unordered collections.
Let $n,k in NN$ with $0 <= k <= n$. The number of distinct subsets of size $k$ that a set of size $n$ has is given by the *binomial coefficient*
$ vec(n,k) = n! / (k! (n-k)!) $
Let $A$ be a set of size $n$. By @permutation, $n!/(n-k)!$ unique ordered
$k$-tuples can be constructed from elements of $A$. Each subset of $A$ of
size $k$ has exactly $k!$ different orderings, and hence appears exactly $k!$
times among the ordered $k$-tuples. Thus the number of subsets of size $k$ is
$n! / (k! (n-k)!)$.
In a class there are 12 boys and 14 girls. How many different teams of 7 pupils
with 3 boys and 4 girls can be created?
First let us compute how many subsets of size 3 we can choose from the 12 boys and how many subsets of size 4 we can choose from the 14 girls.
$
  "boys" &= vec(12,3) \
  "girls" &= vec(14,4)
$
Then let us consider the entire team as a 2-tuple of (boys, girls). Then
there are $vec(12,3)$ alternatives for the choice of boys, and $vec(14,4)$ alternatives for
the choice of girls, so by the multiplication principle, we have the total being
$ vec(12,3) vec(14,4) $
Let $A = {1, 2, ..., 6}$. Color the numbers 1, 2 red, the numbers 3, 4 green, and the numbers 5, 6
yellow. How many different two-element subsets of $A$ are there that have two
different colors?
First choose 2 colors, $vec(3,2) = 3$. Then from each color, choose one. Altogether it's
$ vec(3,2) vec(2,1) vec(2,1) = 3 dot 2 dot 2 = 12 $
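
Since the set is tiny, we can also just list the subsets. A sketch, with the coloring stored in a dictionary of our own choosing:

```python
from itertools import combinations

# Coloring of the numbers 1..6 as described in the example.
color = {1: "red", 2: "red", 3: "green", 4: "green", 5: "yellow", 6: "yellow"}

# All two-element subsets whose members have different colors.
two_color_pairs = [
    pair for pair in combinations(color, 2) if color[pair[0]] != color[pair[1]]
]
print(len(two_color_pairs))
```

Equivalently, there are 15 two-element subsets in total and exactly 3 of them are monochromatic.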
One way to view $vec(n,k)$ is as the number of ways of painting $n$ elements
with two colors, red and yellow, with $k$ red and $n - k$ yellow elements. Let
us generalize to more than two colors.
Let $n$ and $r$ be positive integers and $k_1, ..., k_r$ nonnegative integers
such that $k_1 + dots.c + k_r = n$. The number of ways of assigning labels
$1,2, ..., r$ to $n$ items so that for each $i = 1, 2, ..., r$, exactly $k_i$
items receive label $i$, is the *multinomial coefficient*
$ vec(n, (k_1, k_2, ..., k_r)) = n! / (k_1 ! k_2 ! dots.c k_r !) $
Order the $n$ items in some manner, and assign labels like this: for the
first $k_1$ items, assign the label 1, then for the next $k_2$ items,
assign the label 2, and so on. The $i$th label will be assigned to all the
items between positions $k_1 + dots.c + k_(i-1) + 1$ and $k_1 + dots.c +
k_i$.

Then notice that all possible orderings (permutations) of the items give
every possible way to label the items. However, we overcount by some
amount. How much? The order of the items with a given label doesn't matter,
so we need to deduplicate those.
Each labeling is duplicated once for each way we can order all of the
elements with the same label. For label $i$, there are $k_i$ elements with
that label, so $k_i !$ ways to order those. By @tuplemultiplication, the
number of orderings that produce the same labeling is
$k_1 ! k_2 ! k_3 ! dots.c k_r !$.
So by @ktoone, we can account for the duplicates and the answer is
$ n! / (k_1 ! k_2 ! k_3 ! dots.c k_r !) $
@multinomial-coefficient gives us a way to count how many ways there are to
fit $n$ distinguishable objects into $r$ distinguishable containers of
varying capacity.
To find the number of ways to fit $n$ indistinguishable objects into $k$
distinguishable containers of _any_ capacity, use the "ball-and-urn"
technique.

How many different ways can six people be divided into three pairs?
First we use the multinomial coefficient to count the number of ways to assign specific labels to pairs of elements:
$ vec(6, (2,2,2)) $
But notice that the actual labels themselves are irrelevant. Our multinomial
coefficient counts how many ways there are to assign 3 distinguishable
labels, say Pair 1, Pair 2, Pair 3, to our 6 elements.
To make this more explicit, say we had a 3-tuple where the position encoded
the label, where position 1 corresponds to Pair 1, and so on. Then the values
are the actual pairs of people (numbered 1-6). For instance
$ ((1,2), (3,4), (5,6)) $
corresponds to assigning the label Pair 1 to (1,2), Pair 2 to (3,4) and Pair
3 to (5,6). Our multinomial coefficient counts this,
as well as any other orderings of this tuple. For instance
$ ((3,4), (1,2), (5,6)) $
is also counted. However since in our case the actual labels are irrelevant,
the two examples shown above should really be counted only once.
How many extra times is each case counted? We can think of
our multinomial coefficient as also counting every permutation of the labels
across our pairs. There are $3! = 6$ ways to order 3 labels, so by @ktoone
our answer is
$ vec(6, (2,2,2)) / 3! = 15 $
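
To see the overcounting concretely, we can compare the formula with a direct enumeration. This is a sketch: `multinomial` and `pairings` are helper names we introduce here, not anything from the course.

```python
from math import factorial


def multinomial(n, ks):
    """Return n! / (k_1! k_2! ... k_r!) for group sizes ks summing to n."""
    if sum(ks) != n:
        raise ValueError("group sizes must sum to n")
    result = factorial(n)
    for k in ks:
        result //= factorial(k)
    return result


def pairings(people):
    """List all ways to split `people` into unordered pairs."""
    if not people:
        return [[]]
    first, rest = people[0], people[1:]
    result = []
    for i, partner in enumerate(rest):
        remaining = rest[:i] + rest[i + 1:]
        for tail in pairings(remaining):
            result.append([(first, partner)] + tail)
    return result


labeled = multinomial(6, [2, 2, 2])
unlabeled = len(pairings([1, 2, 3, 4, 5, 6]))
print(labeled, unlabeled, labeled // factorial(3))
```

The enumeration always pairs the first remaining person with someone, so each division into pairs is produced exactly once, with no labels attached.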
How many poker hands are in the category _one pair_?
A one pair is a hand with two cards of the same rank and three cards with ranks
different from each other and the pair.
We can count in two ways: we count all the ordered hands, then divide by $5!$
to remove overcounting, or we can build the unordered hands directly.
When finding the ordered hands, the key is to figure out how we can encode
our information in a tuple of the form described in @tuplemultiplication, and
then use @tuplemultiplication to compute the solution.
In this case, the first element encodes the two slots, out of the 5 in the
hand, that our pair occupies, the second element encodes the first card of the pair, the
third element encodes the second card of the pair, and the fourth, fifth, and
sixth elements represent the 3 cards that are not of the same rank.
Now it is clear that the number of alternatives in each position of the
6-tuple does not depend on any of the others, so @tuplemultiplication
applies. Then we can determine the number of alternatives for each position
in the 6-tuple and multiply them to determine the total number of ways the
6-tuple can be constructed, giving us the total number of ways to construct
ordered poker hands with one pair.
First we choose 2 slots out of 5 positions (in the hand) so there are
$vec(5,2)$ alternatives. Then we choose any of the 52 cards for our first
pair card, so there are 52 alternatives. Then we choose any card with the
same rank for the second card in the pair, where there are 3 possible
alternatives. Then we choose the third card which must not be the same rank
as the first two, where there are 48 alternatives. The fourth card must not
be the same rank as the others, so there are 44 alternatives. Likewise, the
final card has 40 alternatives.
So the final answer is, remembering to divide by $5!$ because we don't care
about order,
$ (vec(5,2) dot 52 dot 3 dot 48 dot 44 dot 40) / 5! $
Alternatively, we can find a way to build an unordered hand with the
requirements. First we choose the rank of the pair, then we choose two suits
for that rank, then we choose the remaining 3 different ranks, and finally a
suit for each of the ranks. Then, noting that we will now omit constructing
the tuple and explicitly listing alternatives for brevity, we have
$ 13 dot vec(4,2) dot vec(12, 3) dot 4^3 $
Both approaches give the same answer, $1098240$.
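
The arithmetic for both counting strategies is easy to get wrong by hand, so here is a short check. It simply evaluates the two products described above; the variable names are our own.

```python
from math import comb, factorial

# Ordered hands: slots for the pair, first pair card, second pair card,
# then three cards of new, distinct ranks.
ordered_hands = comb(5, 2) * 52 * 3 * 48 * 44 * 40
via_ordered = ordered_hands // factorial(5)

# Unordered hands: rank of the pair, two suits for it,
# three other ranks, one suit for each of them.
direct = 13 * comb(4, 2) * comb(12, 3) * 4**3

print(via_ordered, direct)
```

Note that the pair's suits are chosen from 4 suits, so the relevant coefficient is `comb(4, 2)`.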
= Bayes' theorem and conditional probability
== Conditional probability, partitions, law of total probability
Sometimes we want to analyze the probability of events in a sample space given
that we already know another event has occurred. That is, we want the probability
of $A subset.eq Omega$ conditional on the event $B subset.eq Omega$.
For two events $A, B subset.eq Omega$, the probability of $A$ given $B$ is written
$ P(A | B) $
To calculate the conditional probability, use the following formula, valid when $P(B) > 0$:
$ P(A | B) = (P(A sect B)) / (P(B)) $
Oftentimes we don't know $P(B)$, but we do know $P(B)$ given some events in
$Omega$. That is, we know the probability of $B$ conditional on some events.
For example, if we have a 50% chance of choosing a rigged (6-sided) die and a
50% chance of choosing a fair die, we know the probability of getting side $n$
given that we have the rigged die, and the probability of side $n$ given that
we have the fair die. Also note that we know the probability of both events
we're conditioning on (50% each), and they're disjoint events.
In these situations, the following law is useful:
#theorem[Law of total probability][
  Given a _partition_ of $Omega$ into pairwise disjoint subsets $A_1, A_2, A_3, ..., A_n subset.eq Omega$, such that
  $
    union.big_(i=1)^n A_i = Omega \
    A_i sect A_j = emptyset "for all" i != j
  $
  the probability of an event $B subset.eq Omega$ is given by
  $
    P(B) = P(B | A_1) P(A_1) + P(B | A_2) P(A_2) + dots.c + P(B | A_n) P(A_n)
  $
]
This is easy to show by writing the definition of the conditional probability
and simplifying.
== Bayes' theorem
Finally let's discuss a rule for inverting conditional probabilities, that is,
getting $P(B | A)$ from $P(A | B)$.
#theorem[Bayes' theorem][
  Given two events $A,B subset.eq Omega$,
  $
    P(A | B) = (P(B | A)P(A)) / (P(B | A)P(A) + P(B | A^c)P(A^c))
  $
]
Apply the definition of conditional probability, then apply @law-total-prob
noting that $A$ and $A^c$ are a partitioning of $Omega$.
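
To make the formula concrete, here is a sketch based on the rigged-die setup mentioned earlier. The specific number is an assumption of ours, not from lecture: we suppose the rigged die shows a six with probability $1 / 2$. The function names are also our own.

```python
from fractions import Fraction


def total_probability(likelihoods, priors):
    """Law of total probability: sum of P(B | A_i) P(A_i) over a partition."""
    return sum(l * p for l, p in zip(likelihoods, priors))


def bayes(likelihood_a, prior_a, likelihood_not_a):
    """Return P(A | B) given P(B | A), P(A), and P(B | A^c)."""
    prior_not_a = 1 - prior_a
    evidence = total_probability(
        [likelihood_a, likelihood_not_a], [prior_a, prior_not_a]
    )
    return likelihood_a * prior_a / evidence


# Assumed numbers: A = "we picked the rigged die", B = "we rolled a six".
p_six_given_rigged = Fraction(1, 2)   # hypothetical rigging
p_six_given_fair = Fraction(1, 6)
p_rigged = Fraction(1, 2)

posterior = bayes(p_six_given_rigged, p_rigged, p_six_given_fair)
print(posterior)
```

Under these assumed numbers, seeing a six raises the probability that we hold the rigged die from $1 / 2$ to $3 / 4$.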
= Lecture #datetime(day: 23, month: 1, year: 2025).display()
== Independence
Two events $A subset Omega$ and $B subset Omega$ are independent if and only if
$ P(B sect A) = P(B)P(A) $
"Joint probability is equal to product of their marginal probabilities."
This definition must be used to show the independence of two events.
If $A$ and $B$ are independent, then,
$
  P(A | B) = underbrace((P(A sect B)) / P(B), "conditional probability") = (P(A) P(B)) / P(B) = P(A)
$
Flip a fair coin 3 times. Let the events:
- $A$ = we have exactly one tails among the first 2 flips
- $B$ = we have exactly one tails among the last 2 flips
- $D$ = we get exactly one tails among all 3 flips
Show that $A$ and $B$ are independent.
What about $B$ and $D$?
Enumerate all of the possible outcomes; then we see that
$
  P(A sect B) = (hash (A sect B)) / (hash Omega) = 2 / 8 = 4 / 8 dot 4 / 8 = P(A) P(B)
$
So they are independent.
Repeating the same reasoning for $B$ and $D$, we see that they are not independent.
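
The enumeration for the three coin flips can be carried out mechanically. A minimal sketch, where `prob` and `independent` are helper names of our own:

```python
from fractions import Fraction
from itertools import product

# Sample space: all 8 sequences of three flips.
omega = list(product("HT", repeat=3))


def prob(event):
    """Classical probability over the 8 equally likely outcomes."""
    return Fraction(len(event), len(omega))


def independent(x, y):
    """Check the definition P(X and Y) = P(X) P(Y)."""
    return prob(x & y) == prob(x) * prob(y)


a = {w for w in omega if w[:2].count("T") == 1}  # one tails in first 2 flips
b = {w for w in omega if w[1:].count("T") == 1}  # one tails in last 2 flips
d = {w for w in omega if w.count("T") == 1}      # one tails in all 3 flips

print(independent(a, b), independent(b, d))
```

Here $P(B sect D) = 2 / 8$ while $P(B) P(D) = 4 / 8 dot 3 / 8 = 3 / 16$, which is why the second check fails.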
Suppose we have 4 red and 7 green balls in an urn. We choose two balls with replacement. Let
- $A$ = the first ball is red
- $B$ = the second ball is green
Are $A$ and $B$ independent?
$
  hash Omega = 11 dot 11 = 121 \
  hash A = 4 dot 11 = 44 \
  hash B = 11 dot 7 = 77 \
  hash (A sect B) = 4 dot 7 = 28
$
Since $P(A sect B) = 28 / 121 = 44 / 121 dot 77 / 121 = P(A) P(B)$, the events are independent.
Events $A_1, ..., A_n$ are independent (mutually independent) if for every collection $A_(i_1), ..., A_(i_k)$, where $2 <= k <= n$ and $1 <= i_1 < i_2 < dots.c < i_k <= n$,
$
  P(A_(i_1) sect A_(i_2) sect dots.c sect A_(i_k)) = P(A_(i_1)) P(A_(i_2)) dots.c P(A_(i_k))
$
We say that the events $A_1, ..., A_n$ are *pairwise independent* if any two
different events $A_i$ and $A_j$ are independent for any $i != j$.
= A bit of review on random variables
== Random variables, discrete random variables
First, some brief exposition on random variables. Counterintuitively, a random
variable is actually a function.
Standard notation: $Omega$ is a sample space, $omega in Omega$ is an outcome.
A *random variable* $X$ is a function $X : Omega -> RR$ that takes the set of
possible outcomes in a sample space, and maps it to a
#link("https://en.wikipedia.org/wiki/Measurable_space")[measurable space],
typically (as in our case) a subset of $RR$.
The *state space* or *support* of a random variable $X$ is all of the values $X$ can take.
Let $X$ be a random variable that takes on the values ${0,1,2,3}$. Then the
state space of $X$ is the set ${0,1,2,3}$.
The probability distribution of
$X$ gives its important probabilistic information. The probability distribution
is a description of the probabilities $P(X in B)$ for subsets $B subset.eq RR$. We
describe the probability density function and the cumulative distribution
function.
A random variable $X$ is discrete if there is a countable set $A$ such that $P(X in
A) = 1$. A value $k$ is a possible value if $P(X = k) > 0$.
A discrete random variable has probability distribution entirely determined by its
p.m.f. $p(k) = P(X = k)$. The p.m.f. is a function from the set of possible
values of $X$ into $[0,1]$. Labeling the p.m.f. with the random variable is
done by $p_X (k)$.
By the axioms of probability,
$ sum_k p_X (k) = sum_k P(X=k) = 1 $
For a subset $B subset RR$,
$ P(X in B) = sum_(k in B) p_X (k) $
== Continuous random variables
Now we introduce another major class of random variables.
Let $X$ be a random variable. If $f$ satisfies
$ P(X <= b) = integral^b_(-infinity) f(x) dif x $
for all $b in RR$, then $f$ is the *probability density function* of $X$.
The probability that $X in (-infinity, b]$ is equal to the area under the graph
of $f$ from $-infinity$ to $b$.
A corollary is the following.
$ P(X in B) = integral_B f(x) dif x $
for any $B subset RR$ where integration makes sense.
The set can be bounded or unbounded, or any collection of intervals.
$ P(a <= X <= b) = integral_a^b f(x) dif x $
$ P(X > a) = integral_a^infinity f(x) dif x $
If a random variable $X$ has density function $f$ then individual point
values have probability zero:
$ P(X = c) = integral_c^c f(x) dif x = 0, forall c in RR $
It follows that a random variable with a density function is not discrete. Also,
the probabilities of intervals are not changed by including or excluding
endpoints.
How do we determine which functions are p.d.f.s? Since $P(-infinity < X <
infinity) = 1$, a p.d.f. $f$ must satisfy
$
  f(x) >= 0 quad forall x in RR \
  integral^infinity_(-infinity) f(x) dif x = 1
$
Random variables with density functions are called _continuous_ random
variables. This does not imply that the random variable is a continuous
function on $Omega$ but it is standard terminology.
Named distributions of continuous random variables are introduced in the
following chapters.
= Lecture #datetime(day: 27, year: 2025, month: 1).display()
== Bernoulli trials
The setup: the experiment has exactly two outcomes:
- Success -- $S$ or 1
- Failure -- $F$ or 0
$
  P(S) = p, quad (0 < p < 1) \
  P(F) = 1 - p = q
$
Let $X$ be 1 on success and 0 on failure, and construct the probability mass function:
$
  P(X = 1) = p \
  P(X = 0) = 1 - p
$
Write it as:
$ p_X (k) = p^k (1-p)^(1-k) $
for $k in {0, 1}$.
== Binomial distribution
The setup is very similar to Bernoulli: each trial has exactly 2 outcomes, and
we run several Bernoulli trials in a row.
Importantly: $p$ and $q$ are defined exactly the same in all trials.
This ties the binomial distribution to the sampling with replacement model,
since each trial does not affect the next.
We conduct $n$ *independent* trials of this experiment. Example with coins: each
flip independently has a $1/2$ chance of heads or tails (the same holds for a die,
a rigged coin, etc.).
$n$ is fixed, i.e. known ahead of time.
== Binomial random variable
Let $X = hash$ of successes in $n$ independent trials. The sample space is
$Omega = {omega}$, where each outcome $omega$ is a particular sequence of $n$
trials such as $omega = S F F dots.c F$, of length $n$.
Then $X(omega)$ can take the $n + 1$ possible values $0,1,2,...,n$. The
probability of any particular sequence is given by the product of the
individual trial probabilities.
$ P(S F F S F dots.c S) = p q q p q dots.c p $
So $P(X = 0) = P(F F F dots.c F) = q dot q dot dots.c dot q = q^n$.
$
  P(X = 1) &= P(S F F dots.c F) + P(F S F F dots.c F) + dots.c + P(F F F dots.c F S) \
  &= underbrace(n, "possible outcomes") dot p^1 dot q^(n-1) \
  &= vec(n, 1) dot p^1 dot q^(n-1) \
  &= n dot p^1 dot q^(n-1)
$
Now we can generalize
$ P(X = 2) = vec(n,2) p^2 q^(n-2) $
How about all successes?
$ P(X = n) = P(S S dots.c S) = p^n $
We see that for all failures we have $q^n$ and all successes we have $p^n$.
Otherwise we use our method above.
In general, here is the probability mass function for the binomial random variable
$ P(X = k) = vec(n, k) p^k q^(n-k), quad "for" k = 0,1,2,...,n $
The binomial distribution is very powerful: whenever each trial is a choice between two outcomes, it gives the probability of each possible number of successes.
To summarize the characterization of the binomial random variable:
- $n$ independent trials
- each trial results in binary success or failure
- with probability of success $p$, identically across trials
with $X = hash$ successes in *fixed* $n$ trials.
$ X ~ "Bin"(n,p) $
with probability mass function
$ P(X = x) = vec(n,x) p^x (1 - p)^(n-x) = p(x) quad "for" x = 0,1,2,...,n $
These are exactly the terms of the binomial theorem, which confirms that $p(x)$ is a valid p.m.f.:
$ p(x) >= 0, quad sum^n_(x=0) p(x) = sum^n_(x=0) vec(n,x) p^x q^(n-x) = (p + q)^n $
In fact,
$ (p + q)^n = (p + (1 - p))^n = 1 $
A family has 5 children. What is the probability that exactly 2 of them are male, if we
assume births are independent and the probability of a male is 0.5?
First we check the binomial criteria: $n$ independent trials, well formed
$S$/$F$, probability the same across trials. Let's say male is $S$ and
otherwise $F$.
We have $n=5$ and $p = 0.5$. We just need $P(X = 2)$.
$
  P(X = 2) &= vec(5,2) (0.5)^2 (0.5)^3 \
  &= (5 dot 4) / (2 dot 1) (1 / 2)^5 = 10 / 32
$
What is the probability of getting exactly three aces (1's) out of 10 throws
of a fair die?
This seems a little trickier, but we can still frame it as a well defined $S$/$F$ experiment.
Let $S$ be getting an ace and $F$ be anything else.
Then $p = 1/6$ and $n = 10$. We want $P(X=3)$. So
$
  P(X=3) &= vec(10,3) p^3 q^7 = vec(10,3) (1 / 6)^3 (5 / 6)^7 \
  &approx 0.15505
$
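As a quick numerical check of the two binomial examples, here is a minimal Python sketch of the binomial p.m.f. The helper name `binomial_pmf` is our own choice (it is not from the lecture), and it relies only on `math.comb` from the standard library.

```python
from math import comb


def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(X = k) for X ~ Bin(n, p): choose which k of the n trials succeed."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Two males among five children, p = 0.5
print(binomial_pmf(2, 5, 0.5))  # 0.3125, i.e. 10/32

# Exactly three aces in ten throws of a fair die
print(round(binomial_pmf(3, 10, 1 / 6), 5))  # 0.15505

# The p.m.f. sums to 1 over k = 0, ..., n (binomial theorem)
total = sum(binomial_pmf(k, 10, 1 / 6) for k in range(11))
print(total)  # 1.0 up to floating-point rounding
```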
Suppose we have two types of candy, $a$ red and $b$ black. Select $n$ candies. Let $X$
be the number of red candies among the $n$ selected.
There are 2 cases.
- case 1: with replacement: binomial distribution with parameters $n$ and $p = a/(a + b)$. For example,
  $ P(X = 2) = vec(n,2) (a / (a+b))^2 (b / (a+b))^(n-2) $
- case 2: without replacement: then use counting
  $ P(X = x) = (vec(a,x) vec(b,n-x)) / vec(a+b,n) = p(x) $
We've done case 2 before, but now we introduce a random variable to represent
it. The distribution
$ P(X = x) = (vec(a,x) vec(b,n-x)) / vec(a+b,n) = p(x) $
is known as a *hypergeometric distribution*.
== Hypergeometric distribution
There are different characterizations of the parameters, but
$ X ~ "Hypergeom"(hash "total", hash "successes", "sample size") $
For example,
$ X ~ "Hypergeom"(N, a, n) "where" N = a+b $
In the textbook, it's
$ X ~ "Hypergeom"(N, N_a, n) $
If $n$ is very small relative to $a + b$, then both cases give similar (approximately
the same) answers.
For instance, if we're sampling for blood types from UCSB, and we take a
student out without replacement, we don't really change the population
substantially. So both answers give a similar result.
Suppose we have two types of items, type $A$ and type $B$. Let $N_A$ be the $hash$ of
type $A$ and $N_B$ the $hash$ of type $B$. $N = N_A + N_B$ is the total number of
items.
We sample $n$ items *without replacement* ($n <= N$) with order not mattering.
Denote by $X$ the number of type $A$ objects in our sample.
Let $0 <= N_A <= N$ and $1 <= n <= N$ be integers. A random variable $X$ has the *hypergeometric distribution* with parameters $(N, N_A, n)$ if $X$ takes values in the set ${0,1,...,n}$ and has p.m.f.
$ P(X = k) = (vec(N_A,k) vec(N-N_A,n-k)) / vec(N,n) = p(k) $
Let $N_A = 10$ defectives. Let $N_B = 90$ non-defectives. We select $n=5$ without replacement. What is the probability that 2 of the 5 selected are defective?
$ X ~ "Hypergeom" (N = 100, N_A = 10, n = 5) $
We want $P(X=2)$.
$ P(X=2) = (vec(10,2) vec(90,3)) / vec(100,5) approx 0.0702 $
Make sure you can distinguish when a problem is binomial or when it is
hypergeometric. This is very important on exams.
Recall that both ask about the number of successes in a fixed number of trials.
But binomial is sampling with replacement (each trial is independent), while
sampling without replacement is hypergeometric.
A cat gives birth to 6 kittens: 2 are male and 4 are female. Your neighbor comes and picks up 3 kittens at random to take home.
How do we define the random variable? What is its p.m.f.?
Let $X$ be the number of male cats in the neighbor's selection.
$ X ~ "Hypergeom"(N = 6, N_A = 2, n = 3) $
and $X$ takes values in ${0,1,2}$. Find the p.m.f. by finding probabilities for these values.
$
  &P(X = 0) = (vec(2,0) vec(4,3)) / vec(6,3) = 4 / 20 \
  &P(X = 1) = (vec(2,1) vec(4,2)) / vec(6,3) = 12 / 20 \
  &P(X = 2) = (vec(2,2) vec(4,1)) / vec(6,3) = 4 / 20 \
  &P(X = 3) = (vec(2,3) vec(4,0)) / vec(6,3) = 0
$
Note that for $P(X=3)$, we are asking for 3 successes (drawing males) where
there are only 2 males, so it must be 0.
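The hypergeometric examples can be reproduced with a short Python sketch. The function name `hypergeom_pmf` and its argument order $(N, N_A, n)$ are our own choices, mirroring the parameterization used in these notes; exact fractions are used so the kitten probabilities can be compared directly.

```python
from fractions import Fraction
from math import comb


def hypergeom_pmf(k: int, N: int, N_A: int, n: int) -> Fraction:
    """P(X = k) for X ~ Hypergeom(N, N_A, n), as an exact fraction."""
    # math.comb(a, b) returns 0 when b > a, which handles impossible counts
    # such as drawing 3 males when only 2 exist.
    return Fraction(comb(N_A, k) * comb(N - N_A, n - k), comb(N, n))


# Kittens: N = 6, N_A = 2 males, n = 3 picked
kitten_pmf = [hypergeom_pmf(k, 6, 2, 3) for k in range(4)]
print(kitten_pmf)  # the fractions 1/5, 3/5, 1/5 and 0

# Defectives: N = 100, N_A = 10, n = 5
print(float(hypergeom_pmf(2, 100, 10, 5)))  # about 0.0702
```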
== Geometric distribution
Consider an infinite sequence of independent trials, e.g. the number of attempts until I make a basket.
Let $X_i$ denote the outcome of the $i^"th"$ trial, where success is 1 and failure is 0. Let $N$ be the number of trials needed to observe the first success in a sequence of independent trials with probability of success $p$.
To have $N = k$, we fail $k-1$ times and succeed on the $k^"th"$ try. Then:
$ P(N = k) = P(X_1 = 0, X_2 = 0, ..., X_(k-1) = 0, X_k = 1) = (1 - p)^(k-1) p $
This is the probability of failure raised to the number of failures, times the
probability of success.
The key characteristic of these trials is that we keep going until we succeed. There's
no $n$ choose $k$ in front like the binomial distribution because there's
exactly one sequence that gives us the first success on trial $k$.
Let $0 < p <= 1$. A random variable $X$ has the geometric distribution with
success parameter $p$ if the possible values of $X$ are ${1,2,3,...}$ and $X$
has p.m.f.
$ P(X=k) = (1-p)^(k-1) p $
for positive integers $k$. Abbreviate this by $X ~ "Geom"(p)$.
What is the probability it takes more than seven rolls of a fair die to roll a
six?
Let $X$ be the number of rolls of a fair die until the first six. Then $X ~
"Geom"(1/6)$. Now we just want $P(X > 7)$.
$ P(X > 7) = sum^infinity_(k=8) P(X=k) = sum^infinity_(k=8) (5 / 6)^(k-1) 1 / 6 $
Re-indexing with $j = k - 8$ pulls out the common factor:
$ sum^infinity_(k=8) (5 / 6)^(k-1) 1 / 6 = 1 / 6 (5 / 6)^7 sum^infinity_(j=0) (5 / 6)^j $
Now we calculate by standard methods (a geometric series):
$
  1 / 6 (5 / 6)^7 sum^infinity_(j=0) (5 / 6)^j = 1 / 6 (5 / 6)^7 dot 1 / (1-5 / 6) =
  (5 / 6)^7
$
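The tail probability can be checked numerically in two ways: by subtracting the first seven p.m.f. values from 1, and by the closed form $(5/6)^7$. This is a minimal Python sketch; the name `geometric_pmf` is our own.

```python
def geometric_pmf(k: int, p: float) -> float:
    """P(X = k) for X ~ Geom(p): k - 1 failures, then one success."""
    return (1 - p) ** (k - 1) * p


p = 1 / 6
# P(X > 7) = 1 - P(X <= 7), summing the p.m.f. over k = 1, ..., 7
tail_by_sum = 1 - sum(geometric_pmf(k, p) for k in range(1, 8))
# Closed form from the geometric series: the first 7 rolls all fail
tail_closed_form = (1 - p) ** 7

print(tail_by_sum, tail_closed_form)  # both about 0.279
```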
= Some more discrete distributions
== Negative binomial
Consider a sequence of Bernoulli trials with the following characteristics:
- Each trial results in success or failure
- Prob. of success $p$ is same on each trial
- Trials are independent (notice the number of trials is not fixed in advance)
- Experiment continues until $r$ successes are observed, where $r$ is a given parameter
Then if $X$ is the number of trials necessary until $r$ successes are observed,
we say $X$ is a *negative binomial* random variable.
Let $k in ZZ^+$ and $0 < p <= 1$. A random variable $X$ has the negative
binomial distribution with parameters $(k,p)$ if the possible values of $X$
are the integers ${k,k+1, k+2, ...}$ and the p.m.f. is
$ P(X = n) = vec(n-1, k-1) p^k (1-p)^(n-k) quad "for" n >= k $
Here $k$ plays the role of $r$ above. Abbreviate this by $X ~ "Negbin"(k,p)$.
Steph Curry has a three point percentage of approx. $43%$. What is the
probability that Steph makes his third three-point basket on his $5^"th"$
attempt?
Let $X$ be the number of attempts required to observe the 3rd success. Then,
$ X ~ "Negbin"(k = 3, p = 0.43) $
$
  P(X = 5) &= vec(5-1,3-1)(0.43)^3 (1 - 0.43)^(5-3) \
  &= vec(4,2) (0.43)^3 (0.57)^2 \
  &approx 0.155
$
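The same calculation as a minimal Python sketch. The function name `negbin_pmf` and its argument order are our own; it follows the "number of trials until the $k^"th"$ success" convention used in these notes, which differs from the "number of failures" convention found in some libraries.

```python
from math import comb


def negbin_pmf(n: int, k: int, p: float) -> float:
    """P(X = n): the k-th success arrives exactly on trial n."""
    if n < k:
        return 0.0  # cannot see k successes in fewer than k trials
    # k - 1 successes somewhere in the first n - 1 trials, then a success.
    return comb(n - 1, k - 1) * p**k * (1 - p) ** (n - k)


print(round(negbin_pmf(5, 3, 0.43), 3))  # 0.155
```

With $k = 1$ this reduces to the geometric p.m.f. $(1-p)^(n-1) p$.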
== Poisson distribution
For an integer valued random variable $X$, we say $X ~ "Poisson"(lambda)$ with $lambda > 0$ if it has p.m.f.
$ P(X = k) = e^(-lambda) lambda^k / k! $
for $k in {0,1,2,...}$. That this is a valid p.m.f. follows from the Taylor expansion
$ e^lambda = sum_(k=0)^infinity lambda^k / k! $
which implies that
$ sum_(k = 0)^infinity P(X=k) = sum_(k=0)^infinity e^(-lambda) lambda^k / k! = e^(-lambda) e^lambda = 1 $
The Poisson arises from the binomial. It applies in the binomial context when
$n$ is very large ($n >= 100$) and $p$ is very small ($p <= 0.05$), such that $n
p$ is a moderate number ($n p < 10$).
Then $X$ approximately follows a Poisson distribution with $lambda = n p$.
$ P("Bin"(n,p) = k) approx P("Poisson"(lambda = n p) = k) $
for $k = 0,1,...,n$.
The Poisson distribution is useful for finding the probabilities of rare events
over a continuous interval of time. By knowing $lambda = n p$ for large $n$ and
small $p$, we can calculate many probabilities.
Consider the number of typing errors $X$ on a page of a textbook. Let
- $n$ be the number of letters or symbols per page (large)
- $p$ be the probability of error, small enough such that
  $lim_(n -> infinity) lim_(p -> 0) n p = lambda = 0.1$
What is the probability of exactly 1 error?
We can approximate the distribution of $X$ with a $"Poisson"(lambda = 0.1)$ distribution:
$ P(X = 1) = (e^(-0.1) (0.1)^1) / 1! approx 0.09048 $
Consider the number of reported auto accidents $X$ in a big city on any given day. Let
- $n$ be the number of autos on the road
- $p$ be the probability of an accident for any individual auto, small enough such that
  $lim_(n->infinity) lim_(p->0) n p = lambda = 2$
What is the probability of no accidents today?
We can approximate $X$ by $"Poisson"(lambda = 2)$:
$ P(X = 0) = (e^(-2) (2)^0) / 0! approx 0.1353 $
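To see the approximation at work, the sketch below evaluates the two Poisson examples and then compares binomial and Poisson probabilities side by side. The values $n = 1000$ and $p = 0.002$ are hypothetical, chosen only so that $n p = 2$; the helper names are our own.

```python
from math import comb, exp, factorial


def poisson_pmf(k: int, lam: float) -> float:
    """P(X = k) for X ~ Poisson(lam)."""
    return exp(-lam) * lam**k / factorial(k)


def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(X = k) for X ~ Bin(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


print(round(poisson_pmf(1, 0.1), 5))  # 0.09048 (typing errors)
print(round(poisson_pmf(0, 2), 4))    # 0.1353 (auto accidents)

# Hypothetical large n and small p with n * p = 2
n, p = 1000, 0.002
for k in range(4):
    print(k, binomial_pmf(k, n, p), poisson_pmf(k, n * p))
```

For these values the two columns agree to about three decimal places.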
A discrete example:
Suppose we have an election with candidates $B$ and $W$. A total of 10,000
ballots were cast such that
$ "10,000 votes" cases(5005 space B, 4995 space W) $
But 15 ballots had irregularities and were disqualified. What is the
probability that the election results will change?
There are three combinations of disqualified ballots that would result in a
different election outcome: 13 $B$ and 2 $W$, 14 $B$ and 1 $W$, and 15 $B$
and 0 $W$. What is the probability of these?
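One way to attack this numerically is sketched below. It rests on an assumption that the notes do not state: that the 15 disqualified ballots are a uniformly random subset of the 10,000 cast, so the number of disqualified $B$ ballots is hypergeometric with parameters $(10000, 5005, 15)$. The helper name `hypergeom_pmf` is our own.

```python
from fractions import Fraction
from math import comb


def hypergeom_pmf(k: int, N: int, N_A: int, n: int) -> Fraction:
    """P(X = k) for X ~ Hypergeom(N, N_A, n), as an exact fraction."""
    return Fraction(comb(N_A, k) * comb(N - N_A, n - k), comb(N, n))


# Assumption: disqualified ballots are a uniform random sample of all ballots.
N, N_B, n = 10_000, 5_005, 15

# The outcome flips when B loses at least 13 of the 15 disqualified ballots.
p_change = sum(hypergeom_pmf(k, N, N_B, n) for k in (13, 14, 15))
print(float(p_change))
```

Under that assumption the probability comes out well under 1%.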
= Lecture #datetime(day: 3, month: 2, year: 2025).display()
== CDFs, PMFs, PDFs
Let $X$ be a random variable. If we have a function $f$ such that
$ P(X <= b) = integral^b_(-infinity) f(x) dif x $
for all $b in RR$, then $f$ is the *probability density function* of $X$.
The probability that the value of $X$ lies in $(-infinity, b]$ equals the area
under the curve of $f$ from $-infinity$ to $b$.
If $f$ satisfies this definition, then for any $B subset RR$ for which integration makes sense,
$ P(X in B) = integral_B f(x) dif x $
Properties of a CDF:
Any CDF $F(x) = P(X <= x)$ satisfies
1. $F(-infinity) = 0$, $F(infinity) = 1$
2. $F(x)$ is non-decreasing in $x$ (monotone, but not necessarily strictly increasing)
$ s < t => F(s) <= F(t) $
3. $P(a < X <= b) = P(X <= b) - P(X <= a) = F(b) - F(a)$
Let $X$ be a continuous random variable with density (pdf)
$
  f(x) = cases(
    c x^2 &"for" 0 < x < 2,
    0 &"otherwise"
  )
$
1. What is $c$?
$c$ is such that
$ 1 = integral^infinity_(-infinity) f(x) dif x = integral_0^2 c x^2 dif x = (8 c) / 3 $
so $c = 3/8$.
2. Find the probability that $X$ is between 1 and 1.4.
Integrate the density between 1 and 1.4.
$
  integral_1^1.4 3 / 8 x^2 dif x &= (x^3 / 8) |_1^1.4 \
  &= 0.218
$
This is the probability that $X$ lies between 1 and 1.4.
3. Find the probability that $X$ is between 1 and 3.
Idea: integrate between 1 and 3, being careful after 2, where the density is 0.
$ integral^2_1 3 / 8 x^2 dif x + integral_2^3 0 dif x = 7 / 8 $
4. What is the CDF $F(x) = P(X <= x)$? Integrate the density up to $x$. For $0 < x < 2$,
$
  F(x) = P(X <= x) &= integral_(-infinity)^x f(t) dif t \
  &= integral_0^x 3 / 8 t^2 dif t \
  &= x^3 / 8
$
Important: include the range!
$
  F(x) = cases(
    0 &"for" x <= 0,
    x^3/8 &"for" 0 < x < 2,
    1 &"for" x >= 2
  )
$
5. Find the point $a$ such that integrating the density up to $a$ gives exactly $1/2$
the area.
We want to find $1/2 = P(X <= a)$.
$ 1 / 2 = P(X <= a) = F(a) = a^3 / 8 => a = root(3, 4) $
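The five parts of this example can be sanity-checked numerically. The sketch below uses a simple midpoint rule (our own helper `integrate`, not a library routine) together with the density with $c = 3/8$ and the CDF $x^3 / 8$ on $(0, 2)$.

```python
def density(x: float) -> float:
    """f(x) = (3/8) x^2 on (0, 2), and 0 elsewhere."""
    return 3 / 8 * x**2 if 0 < x < 2 else 0.0


def cdf(x: float) -> float:
    """F(x) = P(X <= x), written out on all three ranges."""
    if x <= 0:
        return 0.0
    if x < 2:
        return x**3 / 8
    return 1.0


def integrate(g, a: float, b: float, steps: int = 10_000) -> float:
    """Midpoint-rule approximation of the integral of g over [a, b]."""
    h = (b - a) / steps
    return h * sum(g(a + (i + 0.5) * h) for i in range(steps))


print(integrate(density, 0, 2))  # close to 1, so c = 3/8 normalizes f
print(cdf(1.4) - cdf(1))         # about 0.218
print(integrate(density, 1, 3))  # close to 7/8; no mass beyond x = 2
print(cdf(4 ** (1 / 3)))         # about 0.5, so a is the cube root of 4
```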
== The (continuous) uniform distribution
The simplest, and the best, of the named distributions!
Let $[a,b]$ be a bounded interval on the real line. A random variable $X$ has the uniform distribution on the interval $[a,b]$ if $X$ has the density function
$
  f(x) = cases(
    1/(b-a) &"for" x in [a,b],
    0 &"for" x in.not [a,b]
  )
$
Abbreviate this by $X ~ "Unif" [a,b]$.
The graph of $"Unif" [a,b]$ is a constant line at height $1/(b-a)$ defined
across $[a,b]$. The integral is just the area of a rectangle, and we can check
it is 1.
For $X ~ "Unif" [a,b]$, its cumulative distribution function (CDF) is given by:
$
  F_X (x) = cases(
    0 &"for" x < a,
    (x-a)/(b-a) &"for" x in [a,b],
    1 &"for" x > b
  )
$
If $X ~ "Unif" [a,b]$, and $[c,d] subset [a,b]$, then
$ P(c <= X <= d) = integral_c^d 1 / (b-a) dif x = (d-c) / (b-a) $
Let $Y$ be a uniform random variable on $[-2,5]$. Find the probability that its
absolute value is at least 1.
$Y$ takes values in the interval $[-2,5]$, so the absolute value is at least 1 iff. $Y in [-2,-1] union [1,5]$.
The density function of $Y$ is $f(x) = 1/(5- (-2)) = 1/7$ on $[-2,5]$ and 0 everywhere else.
$
  P(|Y| >= 1) &= P(Y in [-2,-1] union [1,5]) \
  &= P(-2 <= Y <= -1) + P(1 <= Y <= 5) \
  &= 1 / 7 + 4 / 7 = 5 / 7
$
== The exponential distribution
The geometric distribution can be viewed as modeling waiting times, in a discrete setting, i.e. we wait for $n - 1$ failures to arrive at the $n^"th"$ success.
The exponential distribution is the continuous analogue of the geometric
distribution, in that we often use it to model waiting times in the continuous
sense. For example, the time until the first customer enters the barber shop.
Let $0 < lambda < infinity$. A random variable $X$ has the exponential distribution with parameter $lambda$ if $X$ has PDF
$
  f(x) = cases(
    lambda e^(-lambda x) &"for" x >= 0,
    0 &"for" x < 0
  )
$
Abbreviate this by $X ~ "Exp"(lambda)$, the exponential distribution with rate $lambda$.
The CDF of the $"Exp"(lambda)$ distribution is given by:
$
  F(t) = cases(
    0 &"if" t < 0,
    1 - e^(-lambda t) &"if" t >= 0
  )
$
Suppose the length of a phone call, in minutes, is well modeled by an exponential random variable with a rate $lambda = 1/10$.
1. What is the probability that a call takes more than 8 minutes?
2. What is the probability that a call takes between 8 and 22 minutes?
Let $X$ be the length of the phone call, so that $X ~ "Exp"(1/10)$. Then we can find the first probability by:
$
  P(X > 8) &= 1 - P(X <= 8) \
  &= 1 - F_X (8) \
  &= 1 - (1 - e^(-(1 / 10) dot 8)) \
  &= e^(-8 / 10) approx 0.4493
$
Now to find $P(8 < X < 22)$, we can take the difference of the tail probabilities (equivalently, of the CDF values):
$
  P(8 < X < 22) &= P(X > 8) - P(X >= 22) \
  &= e^(-8 / 10) - e^(-22 / 10) \
  &approx 0.3385
$
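Both answers follow directly from the exponential CDF, as this minimal Python sketch shows. The function name `exp_cdf` is our own.

```python
from math import exp


def exp_cdf(t: float, lam: float) -> float:
    """F(t) = P(X <= t) for X ~ Exp(lam)."""
    return 1 - exp(-lam * t) if t >= 0 else 0.0


lam = 1 / 10  # rate of the phone-call length, per minute

p_more_than_8 = 1 - exp_cdf(8, lam)
p_between_8_and_22 = exp_cdf(22, lam) - exp_cdf(8, lam)

print(round(p_more_than_8, 4))       # 0.4493
print(round(p_between_8_and_22, 4))  # 0.3385
```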
#fact("Memoryless property of the exponential distribution")[
  Suppose that $X ~ "Exp"(lambda)$. Then for any $s,t > 0$, we have
  $ P(X > t + s | X > t) = P(X > s) $
]
This is like saying: if I've already waited 5 minutes for the
bus, what is the probability that I'm gonna wait more than 5 + 3 minutes in total, given
that I've already waited 5 minutes? And that's precisely equal to just the
probability I'm gonna wait more than 3 minutes.
To see this, note that ${X > t + s} subset {X > t}$, so
$
  P(X > t + s | X > t) &= (P(X > t + s sect X > t)) / (P(X > t)) \
  &= (P(X > t + s)) / (P(X > t)) \
  &= e^(-lambda (t+ s)) / (e^(-lambda t)) = e^(-lambda s) \
  &= P(X > s)
$
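The memoryless property is easy to confirm numerically. In this sketch the rate and the waiting times are hypothetical values picked for illustration, and `exp_tail` is our own name for the tail probability $P(X > t) = e^(-lambda t)$.

```python
from math import exp, isclose


def exp_tail(t: float, lam: float) -> float:
    """P(X > t) for X ~ Exp(lam) and t >= 0."""
    return exp(-lam * t)


# Hypothetical numbers: rate 0.5 per minute, waited t = 5, asking about s = 3 more
lam, t, s = 0.5, 5.0, 3.0

conditional = exp_tail(t + s, lam) / exp_tail(t, lam)  # P(X > t + s | X > t)
print(isclose(conditional, exp_tail(s, lam)))  # True
```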
== Gamma distribution
Let $r, lambda > 0$. A random variable $X$ has the *gamma distribution* with parameters $(r, lambda)$ if $X$ is nonnegative and has probability density function
$
  f(x) = cases(
    (lambda^r x^(r-1))/(Gamma(r)) e^(-lambda x) &"for" x >= 0,
    0 &"for" x < 0
  )
$
Abbreviate this by $X ~ "Gamma"(r, lambda)$.
The gamma function $Gamma(r)$ generalizes the factorial function and is defined as
$ Gamma(r) = integral_0^infinity x^(r-1) e^(-x) dif x, quad "for" r > 0 $
Special case: $Gamma(n) = (n - 1)!$ if $n in ZZ^+$.
The $"Exp"(lambda)$ distribution is a special case of the gamma distribution,
with parameter $r = 1$.
== The normal (Gaussian) distribution
A random variable $Z$ has the *standard normal distribution* if $Z$ has
density function
$ phi(x) = 1 / sqrt(2 pi) e^(-x^2 / 2) $
on the real line. Abbreviate this by $Z ~ N(0,1)$.
#fact("CDF of a standard normal random variable")[
  Let $Z~N(0,1)$ be normally distributed. Then its CDF is given by
  $ Phi(x) = integral_(-infinity)^x phi(s) dif s = integral_(-infinity)^x 1 / sqrt(2 pi) e^(-s^2 / 2) dif s $
]
The normal distribution is so important that, instead of the standard $f_Z (x)$ and
$F_Z (x)$, we use the special $phi(x)$ and $Phi(x)$.
The normalizing constant $1 / sqrt(2 pi)$ comes from the Gaussian integral
$ integral_(-infinity)^infinity e^(-s^2 / 2) dif s = sqrt(2 pi) $
No closed form of the standard normal CDF $Phi$ exists, so we are left to either:
- approximate
- use technology (calculator)
- use the standard normal probability table in the textbook
To evaluate negative values, we can use the symmetry of the normal distribution
to apply the following identity:
$ Phi(-x) = 1 - Phi(x) $
== General normal distributions
The general family of normal distributions is obtained by linear or affine
transformations of $Z$. Let $mu$ be real, and $sigma > 0$, then
$ X = sigma Z + mu $
is also a normally distributed random variable with parameters $(mu, sigma^2)$.
The CDF of $X$ in terms of $Phi(dot)$ can be expressed as
$
  F_X (x) &= P(X <= x) \
  &= P(sigma Z + mu <= x) \
  &= P(Z <= (x - mu) / sigma) \
  &= Phi((x-mu)/sigma)
$
Differentiating the CDF gives the density of $X$:
$ f(x) = F'(x) = dif / (dif x) [Phi((x-mu)/sigma)] = 1 / sigma phi((x-mu)/sigma) = 1 / sqrt(2 pi sigma^2) e^(-((x-mu)^2) / (2sigma^2)) $
Let $mu$ be real and $sigma > 0$. A random variable $X$ has the _normal distribution_ with mean $mu$ and variance $sigma^2$ if $X$ has density function
$ f(x) = 1 / sqrt(2 pi sigma^2) e^(-((x-mu)^2) / (2sigma^2)) $
on the real line. Abbreviate this by $X ~ N(mu, sigma^2)$.
Let $X ~ N(mu, sigma^2)$ and $Y = a X + b$ with $a != 0$. Then
$ Y ~ N(a mu + b, a^2 sigma^2) $
That is, $Y$ is normally distributed with parameters $(a mu + b, a^2 sigma^2)$.
In particular,
$ Z = (X - mu) / sigma ~ N(0,1) $
is a standard normal variable.
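For the "use technology" route, $Phi$ can be evaluated with the error function from Python's standard library. The identity $Phi(x) = 1/2 (1 + "erf"(x / sqrt(2)))$ is a standard fact that is not derived in these notes, and the numbers in the last line ($mu = 100$, $sigma = 15$) are hypothetical. A minimal sketch:

```python
from math import erf, sqrt


def Phi(x: float) -> float:
    """Standard normal CDF, written in terms of the error function."""
    return 0.5 * (1 + erf(x / sqrt(2)))


def normal_cdf(x: float, mu: float, sigma: float) -> float:
    """CDF of N(mu, sigma^2), obtained by standardizing: Phi((x - mu) / sigma)."""
    return Phi((x - mu) / sigma)


print(round(Phi(1.96), 3))       # 0.975
print(Phi(-1.0) + Phi(1.0))      # 1.0 up to rounding, by symmetry
print(normal_cdf(110, 100, 15))  # hypothetical mu = 100, sigma = 15
```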
= Lecture #datetime(day: 11, year: 2025, month: 2).display()
== Expectation
The expectation or mean of a discrete random variable $X$ is the weighted
average, with weights assigned by the corresponding probabilities.
$ E[X] = sum_("all" x_i) x_i dot p(x_i) $
Find the expected value of a single roll of a fair die.
- $X = "score (number of dots)"$
- $x = 1,2,3,4,5,6$
- $p(x) = 1 / 6, 1 / 6,1 / 6,1 / 6,1 / 6,1 / 6$
$ E[X] = 1 dot 1 / 6 + 2 dot 1 / 6 + dots.c + 6 dot 1 / 6 = 3.5 $
== Binomial expected value
For $X ~ "Bin"(n,p)$,
$ E[X] = n p $
== Bernoulli expected value
Bernoulli is just binomial with one trial.
Recall that $P(X=1) = p$ and $P(X=0) = 1 - p$.
$ E[X] = 1 dot P(X=1) + 0 dot P(X=0) = p $
Let $A$ be an event on $Omega$. Its _indicator random variable_ $I_A$ is defined
for $omega in Omega$ by
$ I_A (omega) = cases(1", if " &omega in A, 0", if " &omega in.not A) $
Its expectation is the probability of $A$:
$ E[I_A] = 1 dot P(A) + 0 dot P(A^c) = P(A) $
== Geometric expected value
Let $p in (0,1]$ and $X ~ "Geom"(p)$ be a geometric RV with probability of
success $p$. Recall that the p.m.f. is $p q^(k-1)$, where prob. of failure is defined by $q := 1-p$.
$
  E[X] &= sum_(k=1)^infinity k p q^(k-1) \
  &= p dot sum_(k=1)^infinity k dot q^(k-1)
$
Now recall from calculus that you can differentiate a power series term by term inside its radius of convergence. So for $|t| < 1$,
$
  sum_(k=1)^infinity k t^(k-1) =
  sum_(k=1)^infinity dif / (dif t) t^k = dif / (dif t) sum_(k=1)^infinity t^k = dif / (dif t) (t / (1-t)) = 1 / (1-t)^2 \
  therefore E[X] = sum^infinity_(k=1) k p q^(k-1) = p sum^infinity_(k=1) k q^(k-1) = p (1 / (1 - q)^2) = 1 / p
$
== Expected value of a continuous RV
The expectation or mean of a continuous random variable $X$ with density
function $f$ is
$ E[X] = integral_(-infinity)^infinity x dot f(x) dif x $
An alternative symbol is $mu = E[X]$.
$mu$ is the "first moment" of $X$; by analogy with physics, it's the "center of
gravity" of $X$.
In general when moving between discrete and continuous RV, replace sums with
integrals, p.m.f. with p.d.f., and vice versa.
Suppose $X$ is a continuous RV with p.d.f.
$ f_X (x) = cases(2x", " &0 < x < 1, 0"," &"elsewhere") $
Then
$ E[X] = integral_(-infinity)^infinity x dot f(x) dif x = integral^1_0 x dot 2x dif x = 2 / 3 $
#example("Uniform expectation")[
  Let $X$ be a uniform random variable on the interval $[a,b]$ with $X ~
  "Unif"[a,b]$. Find the expected value of $X$.
  $
    E[X] &= integral^infinity_(-infinity) x dot f(x) dif x = integral_a^b x / (b-a) dif x \
    &= 1 / (b-a) integral_a^b x dif x = 1 / (b-a) dot (b^2 - a^2) / 2 = underbrace((b+a) / 2, "midpoint formula")
  $
]
#example("Exponential expectation")[
  Find the expected value of an exponential RV, with p.d.f.
  $ f_X (x) = cases(lambda e^(-lambda x)", " &x > 0, 0"," &"elsewhere") $
  Integrating by parts,
  $
    E[X] &= integral_(-infinity)^infinity x dot f(x) dif x = integral_0^infinity x dot lambda e^(-lambda x) dif x \
    &= lambda dot integral_0^infinity x dot e^(-lambda x) dif x \
    &= lambda dot [lr(-x 1 / lambda e^(-lambda x) |)_(x=0)^(x=infinity) - integral_0^infinity -1 / lambda e^(-lambda x) dif x] \
    &= lambda dot [0 + 1 / lambda^2] \
    &= 1 / lambda
  $
]
#example("Uniform dartboard")[
  Our dartboard is a disk of radius $r_0$ and the dart lands uniformly at
  random on the disk when thrown. Let $R$ be the distance of the dart from the
  center of the disk. Find $E[R]$ given density function
  $ f_R (t) = cases((2t)/(r_0 ^2)", " &0 <= t <= r_0, 0", " &t < 0 "or" t > r_0) $
  $
    E[R] &= integral_(-infinity)^infinity t f_R (t) dif t \
    &= integral^(r_0)_0 t dot (2t) / (r_0^2) dif t \
    &= 2 / 3 r_0
  $
]
== Expectation of derived values
If we can find the expected value of $X$, can we find the expected value of
$X^2$? More precisely, can we find $E[X^2]$?
If the distribution is easy to see, then this is trivial. Otherwise we have the
following useful property:
$ E[X^2] = integral_("all" x) x^2 f_X (x) dif x $
(for continuous RVs).
And in the discrete case,
$ E[X^2] = sum_("all" x) x^2 p_X (x) $
In fact $E[X^2]$ is so important that we call it the *mean square*.
More generally, a real valued function $g(X)$ defined on the range of $X$ is
itself a random variable (with its own distribution).
We can find the expected value of $g(X)$ by
$ E[g(X)] = integral_(-infinity)^infinity g(x) f_X (x) dif x $
in the continuous case, and
$ E[g(X)] = sum_("all" x) g(x) p_X (x) $
in the discrete case.
You roll a fair die to determine the winnings (or losses) $W$ of a player as
follows:
$ W = cases(-1", if the roll is 1, 2, or 3", 1", if the roll is a 4", 3", if the roll is 5 or 6") $
What is the expected winnings/losses for the player during 1 roll of the die?
Let $X$ denote the outcome of the roll of the die. Then we can define our
random variable as $W = g(X)$ where the function $g$ is defined by $g(1) =
g(2) = g(3) = -1$ and so on.
Note that $P(W = -1) = P(X = 1 union X = 2 union X = 3) = 1/2$. Likewise $P(W=1)
= P(X=4) = 1/6$, and $P(W=3) = P(X=5 union X=6) = 1/3$.
$
  E[g(X)] = E[W] &= (-1) dot P(W=-1) + (1) dot P(W=1) + (3) dot P(W=3) \
  &= -1 / 2 + 1 / 6 + 1 = 2 / 3
$
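The formula for $E[g(X)]$ translates almost verbatim into code: sum $g(x)$ times the p.m.f. over all values. The sketch below uses exact fractions; the helper names `expectation` and `winnings` are our own.

```python
from fractions import Fraction

# p.m.f. of a fair die: each face has probability 1/6
die = {face: Fraction(1, 6) for face in range(1, 7)}


def expectation(pmf, g=lambda x: x):
    """E[g(X)] = sum of g(x) * p(x) over all values x."""
    return sum(g(x) * p for x, p in pmf.items())


def winnings(roll: int) -> int:
    """The payoff W = g(X) from the example."""
    if roll <= 3:
        return -1
    if roll == 4:
        return 1
    return 3


print(expectation(die))                   # 7/2
print(expectation(die, winnings))         # 2/3
print(expectation(die, lambda x: x**2))   # 91/6, the mean square
```

Note that we never needed the distribution of $W$ itself, only that of $X$.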
A stick of length $l$ is broken at a uniformly chosen random location. What is
the expected length of the longer piece?
Idea: if you break it before the halfway point, then the longer piece has length
given by $l - x$. If you break it after the halfway point, the longer piece
has length $x$.
Let the interval $[0,l]$ represent the stick and let $X ~ "Unif"[0,l]$ be the
location where the stick is broken. Then $X$ has density $f(x) = 1/l$ on
$[0,l]$ and 0 elsewhere.
Let $g(x)$ be the length of the longer piece when the stick is broken at $x$,
$ g(x) = cases(l-x", " &0 <= x < l/2, x", " &l/2 <= x <= l) $
Then
$
  E[g(X)] &= integral_(-infinity)^infinity g(x) f(x) dif x = integral_0^(l / 2) (l-x) / l dif x + integral_(l / 2)^l x / l dif x \
  &= 3 / 8 l + 3 / 8 l = 3 / 4 l
$
So we expect the longer piece to be $3/4$ of the total length, which is a bit
== Moments of a random variable
We continue discussing expectation but we introduce new terminology.
The $n^"th"$ moment (or $n^"th"$ raw moment) of a discrete random variable $X$
with p.m.f. $p_X (x)$ is the expectation
$ E[X^n] = sum_k k^n p_X (k) = mu_n $
If $X$ is continuous, then we have analogously
$ E[X^n] = integral_(-infinity)^infinity x^n f_X (x) dif x = mu_n $
The *standard deviation* is denoted by $sigma$ and the *variance* by $sigma^2$, where
$ sigma^2 = mu_2 - (mu_1)^2 $
$mu_3$ is used to measure "skewness" / asymmetry of a distribution. For
example, the normal distribution is very symmetric.
$mu_4$ is used to measure kurtosis/peakedness of a distribution.
== Central moments
Previously we discussed "raw moments." Be careful not to confuse them with
_central moments_.
The $n^"th"$ central moment of a discrete random variable $X$ with p.m.f. $p_X
(x)$ is the expected value of the deviation from the mean raised to the
$n^"th"$ power
$ E[(X-mu)^n] = sum_k (k - mu)^n p_X (k) = mu'_n $
And of course in the continuous case,
$ E[(X-mu)^n] = integral_(-infinity)^infinity (x - mu)^n f_X (x) dif x = mu'_n $
In particular,
$
  mu'_1 &= E[(X-mu)^1] = integral_(-infinity)^infinity (x-mu)^1 f_X (x) dif x \
  &= integral_(-infinity)^infinity x f_X (x) dif x - integral_(-infinity)^infinity mu f_X (x) dif x = mu - mu dot 1 = 0 \
  mu'_2 &= E[(X-mu)^2] = sigma^2_X = "Var"(X)
$
Effectively we're centering our distribution first.
Let $Y$ be a uniformly chosen integer from ${0,1,2,...,m}$. Find the first and
second moment of $Y$.
The p.m.f. of $Y$ is $p_Y (k) = 1/(m+1)$ for $k in {0,1,...,m}$. Thus,
$
  E[Y] &= sum_(k=0)^m k 1 / (m+1) = 1 / (m+1) sum_(k=0)^m k \
  &= 1 / (m+1) dot (m(m+1)) / 2 = m / 2
$
$ E[Y^2] = sum_(k=0)^m k^2 1 / (m+1) = 1 / (m+1) dot (m(m+1)(2m+1)) / 6 = (m(2m+1)) / 6 $
Let $c > 0$ and let $U$ be a uniform random variable on the interval $[0,c]$.
Find the $n^"th"$ moment for $U$ for all positive integers $n$.
The density function of $U$ is
$ f(x) = cases(1/c", if " &x in [0,c], 0", " &"otherwise") $
Therefore the $n^"th"$ moment of $U$ is,
$ E[U^n] = integral_(-infinity)^infinity x^n f(x) dif x = 1 / c integral_0^c x^n dif x = c^n / (n+1) $
Suppose the random variable $X ~ "Exp"(lambda)$. Find the second moment of $X$.
Substituting $u = lambda x$,
$
  E[X^2] &= integral_0^infinity x^2 lambda e^(-lambda x) dif x \
  &= 1 / (lambda^2) integral_0^infinity u^2 e^(-u) dif u \
  &= 1 / (lambda^2) Gamma(2 + 1) = 2! / lambda^2
$
In general, to find the $n^"th"$ moment of $X ~ "Exp"(lambda)$,
$ E[X^n] = integral^infinity_0 x^n lambda e^(-lambda x) dif x = n! / lambda^n $
== Median and quartiles
When a random variable has rare (abnormal) values, its expectation may be a bad
indicator of where the center of the distribution lies.
The *median* of a random variable $X$ is any real value $m$ that satisfies
$ P(X >= m) >= 1 / 2 "and" P(X <= m) >= 1 / 2 $
With at least half the probability on both ${X <= m}$ and ${X >= m}$, the median is
representative of the midpoint of the distribution. We say that the median is
more _robust_ because it is less affected by outliers. It is not necessarily
unique.
Let $X$ be uniformly distributed on the set ${-100, 1, 2, 3, ..., 9}$, so $X$ has probability mass function
$ p_X (-100) = p_X (1) = dots.c = p_X (9) = 1 / 10 $
Find the expected value and median of $X$.
$ E[X] = (-100) dot 1 / 10 + (1) dot 1 / 10 + dots.c + (9) dot 1 / 10 = -5.5 $
Meanwhile, the median is any number $m in [4,5]$.
The median reflects the fact that 90% of the values and of the probability lie in the
range $1,2,...,9$ while the mean is heavily influenced by the $-100$ value.
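The definition of the median can be tested directly against candidate values of $m$. This is a minimal sketch with exact fractions; the function name `is_median` is our own.

```python
from fractions import Fraction

values = [-100] + list(range(1, 10))
pmf = {x: Fraction(1, 10) for x in values}  # uniform on the ten values

mean = sum(x * p for x, p in pmf.items())


def is_median(m, pmf) -> bool:
    """Check P(X <= m) >= 1/2 and P(X >= m) >= 1/2."""
    at_most = sum(p for x, p in pmf.items() if x <= m)
    at_least = sum(p for x, p in pmf.items() if x >= m)
    return at_most >= Fraction(1, 2) and at_least >= Fraction(1, 2)


print(mean)  # -11/2
print([m for m in (3.9, 4, 4.5, 5, 5.1) if is_median(m, pmf)])  # [4, 4.5, 5]
```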
= President's Day lecture
== Quantiles
For $p in (0,1)$, the *$p^"th"$ quantile* of a random variable $X$ is any $x in RR$ satisfying
$ P(X >= x) >= 1 - p "and" P(X <= x) >= p $
We see that the median is the $0.5^"th"$ quantile. $p = 0.25$ is called the
"first quartile" (Q1). $p = 0.75$ is called the "third quartile" (Q3).
$Q 3 - Q 1$ is called the $I Q R$, the interquartile range.
== Variance
Variance is a measure of spread or _variation_ from the mean. Variance is the
*expected squared deviation* about the mean.
Let $X$ be a random variable with mean $mu$. The variance of $X$ is given by
$ "Var"(X) = E[(X-mu)^2] = sigma_X^2 $
If $X$ is discrete with PMF $p_X (x)$, then the variance is
$ "Var"(X) = sum_x (x-mu)^2 p_X (x) $
If $X$ is continuous with PDF $f_X (x)$, then the variance is
$ "Var"(X) = integral^infinity_(-infinity) (x-mu)^2 f_X (x) dif x $
Variance is the same as the second central moment.
$sigma_X = sqrt("Var"(X))$ is the "standard deviation" of $X$.
These tell us about how far spread out the points are.
#example[Fair die][
  Find the variance for the value of a single roll of a fair die.
  $
    sigma_X^2 = "Var"(X) &= E[(X-3.5)^2] \
    &= sum_("all" x) (x-3.5)^2 dot p_X (x) \
    &= 35 / 12
  $
]
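The variance of a fair die can be computed exactly from the definition. As a cross-check, the sketch below also uses the identity $"Var"(X) = E[X^2] - mu^2$; for the die, $E[X^2] = 91 / 6$ is the second raw moment, while the variance itself is $35 / 12$. Variable names are our own.

```python
from fractions import Fraction

die = {face: Fraction(1, 6) for face in range(1, 7)}

mu = sum(x * p for x, p in die.items())  # 7/2

# Definition: expected squared deviation about the mean
var_definition = sum((x - mu) ** 2 * p for x, p in die.items())

# Cross-check: second raw moment minus the squared mean
second_moment = sum(x**2 * p for x, p in die.items())  # 91/6
var_shortcut = second_moment - mu**2

print(var_definition, var_shortcut)  # 35/12 35/12
```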
#example[Continuous $X$][
  Let $X$ be a continuous RV with PDF $f_X (x) = cases(1 &"for" 0 < x < 1, 0 &"otherwise")$
  Find $E[X]$:
  $ integral_0^1 x dot f_X (x) dif x = 1 / 2 $
  Find $"Var"(X)$:
  $
    E[(X- 1 / 2)^2] &= integral_0^1 (x- 1 / 2)^2 dot f_X (x) dif x \
    &= 1 / 12
  $
]
An easier formulation of variance is given by
$ "Var"(X) equiv E[(X-mu)^2] = E[X^2] - mu^2 $
= Lecture #datetime(day: 19, month: 2, year: 2025).display()
== Moment generating functions
Like the CDF, the moment generating function also completely characterizes the
distribution. That is, if you can find the MGF, it tells you all of the
information about the distribution. So it is an alternative way to characterize
a random variable.
They are "easy" to use for finding the distributions of:
- sums of independent random variables
- the distribution of the limit of a sequence of random variables
Let $X$ be a random variable with all finite moments
$ E[X^k] = mu_k, quad k = 1,2,... $
Then the *moment generating function* of a random variable $X$ is defined by
$M_X (t) = E[e^(t X)]$, for the real variable $t$.
All of the moments must be defined for the MGF to exist. The MGF looks like
sum_("all" x) e^(t x) p(x)
in the discrete case, and
integral^infinity_(-infinity) e^(t x) f(x) dif x
in the continuous case.
It holds that the $n^"th"$ derivative of $M_X$ evaluated at 0 gives the
$n^"th"$ moment:
M_X^((n)) (0) = E[X^n]
To see this, expand the exponential as a power series:
M_X (t) &equiv E[e^(t X)] = E[1 + (t X) + (t X)^2 / 2! + dots.c] \
&= E[1] + E[t X] + E[(t^2 X^2) / 2!] + dots.c \
&= E[1] + t E[X] + t^2 / 2! E[X^2] + dots.c \
&= 1 + t / 1! mu_1 + t^2 / 2! mu_2 + dots.c
The coefficient of $t^k/k!$ in the Taylor series expansion of $M_X (t)$ is the
$k^"th"$ moment. So an alternative way to get $mu_k$ is
mu_k = lr(((dif^k M(t))/(dif t^k)) |)_(t=0) = "coefficient of" t^k / k!
Let $X ~ "Bin"(n,p)$. Then the MGF of $X$ is given by
M_X (t) = sum_(k=0)^n e^(t k) vec(n,k) p^k q^(n-k) = sum_(k=0)^n vec(n,k) underbrace((p e^t)^k, a^k) underbrace(q^(n-k), b^(n-k))
Applying the binomial theorem
(a + b)^n = sum_(k=0)^n vec(n,k) a^k b^(n-k)
So we have
M_X (t) = (q + p e^t)^n
Let's find the first moment
mu_1 = lr((dif M(t))/(dif t) |)_(t=0) \
= n p
The second moment:
mu_2 = lr((dif^2 M(t))/(dif t^2) |)_(t=0) \
= n(n-1) p^2 + n p
For example, if $X$ has MGF $(1/3 + 2/3 e^t)^10$, then $X ~ "Bin"(10,2/3)$.
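The two binomial moments can be checked numerically by differentiating the closed-form MGF at $t = 0$. The sketch below uses central finite differences, which is an approximation chosen here for illustration (not the method used in lecture); the function names and the step size `h` are assumptions.

```python
import math

def binomial_mgf(t, n, p):
    """Closed-form MGF of Bin(n, p): (q + p e^t)^n with q = 1 - p."""
    return (1 - p + p * math.exp(t)) ** n

def derivative_at_zero(f, order, h=1e-3):
    """Central finite-difference estimate of f'(0) or f''(0)."""
    if order == 1:
        return (f(h) - f(-h)) / (2 * h)
    if order == 2:
        return (f(h) - 2 * f(0.0) + f(-h)) / h**2
    raise ValueError("only orders 1 and 2 are supported")

n, p = 10, 2 / 3  # illustrative parameters

def mgf(t):
    return binomial_mgf(t, n, p)

mu_1 = derivative_at_zero(mgf, 1)
mu_2 = derivative_at_zero(mgf, 2)
print(mu_1, n * p)                           # estimate vs. n p
print(mu_2, n * (n - 1) * p**2 + n * p)      # estimate vs. n(n-1)p^2 + n p
```

The estimates agree with the formulas only up to the finite-difference error, so a small discrepancy in the later decimal places is expected.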
Let $X ~ "Pois"(lambda)$. Then the MGF of $X$ is given by
M_X (t) = E[e^(t X)] \
= sum^infinity_(x=0) e^(t x) e^(-lambda) lambda^x / x! \
= e^(-lambda) sum^infinity_(x=0) (lambda e^t)^x / x!
Note: $e^a = sum_(x=0) ^infinity a^x / x!$
= e^(-lambda) e^(lambda e^t) \
= e^(-lambda (1 - e^t))
Then, the first moment can be found by,
mu_1 = lr(e^(-lambda (1 - e^t)) (-lambda) (-e^t) |)_(t=0) = lambda
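To see the Poisson derivation numerically, the following sketch compares a truncated version of the defining sum with the closed form $e^(-lambda (1 - e^t))$. The truncation length `terms` and the sample values of $lambda$ and $t$ are arbitrary choices made for illustration.

```python
import math

def poisson_mgf_series(t, lam, terms=200):
    """Truncated sum of e^(t x) * e^(-lam) * lam^x / x! over x = 0..terms-1."""
    total = 0.0
    term = math.exp(-lam)  # x = 0 term: e^(-lam) * (lam e^t)^0 / 0!
    for x in range(terms):
        total += term
        term *= lam * math.exp(t) / (x + 1)  # step from the x term to the x + 1 term
    return total

def poisson_mgf_closed(t, lam):
    """Closed form e^(-lam (1 - e^t))."""
    return math.exp(-lam * (1 - math.exp(t)))

lam, t = 3.0, 0.5  # illustrative values
series_value = poisson_mgf_series(t, lam)
closed_value = poisson_mgf_closed(t, lam)
print(series_value, closed_value)
```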
Let $X ~"Exp"(lambda)$ with PDF
f(x) = cases(lambda e^(-lambda x) &"for" x > 0, 0 &"otherwise")
Find the MGF of $X$
M_X (t) &= integral^infinity_(-infinity) e^(t x) dot lambda e^(-lambda x) dif x \
&= lambda integral_0^infinity e^((t-lambda) x) dif x \
&= lambda lim_(b->infinity) integral_0^b e^((t - lambda) x) dif x \
This integral depends on $t$, so we should consider three cases. If $t =
lambda$, then the integral diverges.
If $t != lambda$,
E[e^(t X)] = lambda lim_(b->infinity) integral_0^b e^((t - lambda) x) dif x = lambda lim_(b -> infinity) [e^((t - lambda) x) / (t - lambda)]^(x=b)_(x=0) \
= lambda lim_(b -> infinity) (e^((t - lambda) b) - 1) / (t - lambda) = cases(infinity &"if" t > lambda, lambda/(lambda - t) &"if" t < lambda)
Combining with the $lambda = t$ case,
lambda lim_(b -> infinity) (e^((t - lambda) b) - 1) / (t - lambda) = cases(infinity &"if" t >= lambda, lambda/(lambda - t) &"if" t < lambda)
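For a value of $t$ below $lambda$, the closed form $lambda/(lambda - t)$ can be compared against a direct numerical integration. This is a rough sketch using the midpoint rule on a truncated interval; the cutoff `upper`, the number of steps, and the sample parameters are assumptions picked so that the neglected tail is negligible.

```python
import math

def exponential_mgf_numeric(t, lam, upper=60.0, steps=200_000):
    """Midpoint-rule estimate of the integral of e^(t x) * lam * e^(-lam x) on [0, upper]."""
    dx = upper / steps
    total = 0.0
    for i in range(steps):
        x = (i + 0.5) * dx
        total += lam * math.exp((t - lam) * x) * dx
    return total

lam, t = 2.0, 0.5  # illustrative values with t below lam
numeric_value = exponential_mgf_numeric(t, lam)
closed_value = lam / (lam - t)
print(numeric_value, closed_value)
```

For $t >= lambda$ the integrand does not decay, so the same routine would simply grow with `upper`, matching the divergent case.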
#example[Alternative parameterization of the exponential][
Consider $X ~ "Exp"(beta)$ with PDF
f(x) = cases(1/beta e^(-x/beta) &"for" x > 0, 0 &"otherwise")
and proceed as usual
M_X (t) = integral_0^infinity e^(t x) dot 1 / beta e^(-x / beta) dif x = 1 / beta lim_(b -> infinity) [e^((t - 1 / beta) x) / (t - 1 / beta)]_(x=0)^(x=b) = 1 / (1 - beta t), quad t < 1 / beta
For $|beta t| < 1$ this is the sum of a geometric series:
1 + beta t + (beta t)^2 + dots.c \
Multiply and divide each $n^"th"$ term by $n!$:
= 1 + beta t + 2 beta^2 (t^2 / 2!) + 6 beta^3 (t^3 / 3!) + dots.c
Recall that the coefficient of $t^k/k!$ is $mu_k$. So
- $E[X] = beta$
- $E[X^2] = 2 beta^2$
- $E[X^3] = 6 beta^3$
"Var"(X) = E[X^2] - (E[X])^2 = beta^2
#example[Uniform on $[0,1]$][
Let $X ~ U(0,1)$, then
M_X (t) &= integral_0^1 e^(t x) dot 1 dif x \
&= lr(e^(t x)/t |)_(x=0)^(x=1) = (e^t - 1) / t \
&= (cancel(1) + t + t^2 / 2! + t^3 / 3! + dots.c - cancel(1)) / t \
&= 1 + t / 2! + t^2 / 3! + t^3 / 4! + dots.c \
&= 1 + 1 / 2 t + 1 / 3 (t^2 / 2!) + 1 / 4(t^3 / 3!) + dots.c
- $E[X] = 1 / 2$
- $E[X^2] = 1 / 3$
- $E[X^n] = 1 / (n + 1)$
== Properties of the MGF
Random variables $X$ and $Y$ are equal in distribution if $P(X in B) = P(Y in
B)$ for all subsets $B$ of $RR$.
Abbreviate this by $X eq.delta Y$.
#example[Normal distribution][
Let $Z ~ N(0,1)$. Then
E[e^(t Z)] = 1 / sqrt(2 pi) integral^infinity_(-infinity) e^(-1 / 2 z^2 + t z -1 / 2 t^2 + 1 / 2 t^2) dif z \
= e^(t^2 / 2) 1 / sqrt(2 pi) integral_(-infinity)^infinity e^(-1 / 2 (z-t)^2) dif z = e^(t^2 / 2)
To get the MGF of a general normal RV $X ~ N(mu, sigma^2)$, write
X = sigma Z + mu
so that
E[e^(t (sigma Z + mu))] = e^(t mu) E[e^(t sigma Z)] = e^(t mu) dot e^((t^2 sigma^2) / 2) = exp(mu t + (sigma^2 t^2) / 2)
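The normal MGF can be sanity-checked by integrating $e^(t (sigma z + mu))$ against the standard normal density. The sketch below uses a midpoint rule on a wide but finite interval; the interval half-width, step count, and parameter values are illustrative assumptions.

```python
import math

def normal_mgf_numeric(t, mu, sigma, half_width=12.0, steps=100_000):
    """Midpoint-rule estimate of E[e^(t X)] for X = sigma Z + mu, Z ~ N(0, 1)."""
    dz = 2 * half_width / steps
    total = 0.0
    for i in range(steps):
        z = -half_width + (i + 0.5) * dz
        density = math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)
        total += math.exp(t * (sigma * z + mu)) * density * dz
    return total

mu, sigma, t = 1.0, 2.0, 0.3  # illustrative values
numeric_value = normal_mgf_numeric(t, mu, sigma)
closed_value = math.exp(mu * t + sigma**2 * t**2 / 2)
print(numeric_value, closed_value)
```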
== Joint distributions of RV
Looking at multiple random variables jointly. If $X$ and $Y$ are both random
variables defined on $Omega$, treat them as coordinates of a 2 dimensional
random vector. It's a vector valued function on $Omega$,
Omega -> RR^2
Valid both discretely and continuously
1. Poker hand: $X$ is num of face cards, $Y$ is num of red cards.
2. Demographic info: $X$ = height, $Y$ = weight
In general, $n$ random variables
X_1, X_2, ..., X_n
defined on $Omega$ are the coordinates of an $n$-dimensional random vector that
maps outcomes to $RR^n$.
The probability distribution of $(X_1, dots.c, X_n)$ is now $P((X_1, dots.c,
X_n) in B)$ where $B$ ranges over subsets of $RR^n$ (elements of the power set of $RR^n$).
The probability distribution of the random vector is called the _joint
distribution_.
Let $X$ and $Y$ both be discrete random variables defined on the same $Omega$.
Then, the joint PMF is
P(X = x, Y = y) = p_(X,Y) (x,y)
where $p_(X,Y) (x,y) >= 0$ for all possible values $x,y$ of $X$ and $Y$.
] And,
sum_("all" x) sum_("all" y) p_(X,Y) (x,y) = 1
If $X_1, X_2, ..., X_n$ are discrete random variables defined on $Omega$,
their joint PMF is given by
p(k_1, k_2, ..., k_n) = P(X_1 = k_1, X_2 = k_2, ..., X_n = k_n)
for all possible $k_1, ..., k_n$ of $X_1, ..., X_n$.
The joint probability in set notation:
P(X_1 = k_1, X_2 = k_2, ..., X_n = k_n) = P({X_1 = k_1} sect {X_2 = k_2} sect dots.c sect {X_n = k_n})
The joint PMF has the same properties as the single-variable PMF:
p_(X_1,X_2,...,X_n) (k_1,k_2,...,k_n) >= 0
= Joint distributions
== Introduction
Looking at 2 or more random variables at the same time. Treat $n$ random
variables as the coordinates of an $n$ dimensional *random vector*. In fact, like how a random variable is a function from $Omega -> RR$, the joint random vector is a vector-valued function
vec(X,Y) : Omega -> RR^2
The probability distribution of $(X_1,X_2,...,X_n)$ is now represented by
P((X_1,X_2,...,X_n) in B)
where $B$ ranges over subsets of $RR^n$. The probability distribution of the random
vector is the *joint distribution*. The probability distributions of the individual
coordinates $X_j$ are the *marginal distributions*.
== Discrete joint distributions
Let $X$ and $Y$ both be discrete random variables defined on a common $Omega$. Then the joint PMF is given by
P(X=x, Y=y) equiv p_(X,Y) (x,y)
with the property that
sum_("all" x) sum_("all" y) p_(X,Y) (x,y) = 1
Let $X_1,X_2,...,X_n$ be discrete random variables defined on a common $Omega$, then their *joint probability mass function* is given by:
p(k_1,k_2,...,k_n) = P(X_1 = k_1, X_2 = k_2, ..., X_n = k_n)
for all possible values $k_1,k_2,...,k_n$ of $X_1,X_2,...,X_n$.
The joint probability in set notation looks like
P(X_1 = k_1, X_2 = k_2, ..., X_n = k_n) = P({X_1=k_1} sect {X_2 = k_2} sect dots.c sect {X_n=k_n})
The joint PMF has the same properties as the PMF of a single random variable, namely
p_(X_1,X_2,...,X_n) (k_1,k_2,...,k_n) >= 0 \
sum_(k_1,k_2,...,k_n) p_(X_1,X_2,...,X_n) (k_1,k_2,...,k_n) = 1
Let $g : RR^n -> RR$ be a real-valued function on an $n$-vector. If $X_1,X_2,...,X_n$ are discrete random variables with joint PMF $p$ then
E[g(X_1,X_2,...,X_n)] = sum_(k_1,k_2,...,k_n) g(k_1,k_2,...,k_n) p(k_1,k_2,...,k_n)
provided the sum is well defined.
Flip a fair coin three times. Let $X$ be the number of tails in the first flip and $Y$ the total number of tails observed across all flips. Then the support of each variable is $S_X = {0,1}$ and $S_Y = {0,1,2,3}$.
1. Find the joint PMF of $(X,Y)$, $p_(X,Y) (x,y)$.
Just record the probability of the respective events.
For example, the probability that $X = 0$ and $Y = 1$ is $p_(X,Y) (0,1) = 2/8$.
2. Suppose we are playing a game where each tails earns 3 dollars and if the first flip is tails, each reward is doubled. What is the expected reward?
Note that we can record this reward function by $g(x,y) = 3(1+x)y$. Then the expected reward is just
EE[g(X,Y)] = sum_(k=0)^1 sum_(l=0)^3 g(k,l) p_(X,Y) (k,l) = sum_(k=0)^1 sum_(l=0)^3 3(1+k) l p_(X,Y) (k,l) \
= sum_(l=0)^3 3 l p_(X,Y) (0,l) + sum_(l=0)^3 6 l p_(X,Y) (1,l) = 7 + 1 / 2
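The coin-flip example is small enough to enumerate directly. The sketch below builds the joint PMF from the 8 equally likely outcomes and evaluates the expected reward with $g(x,y) = 3(1+x)y$; the names `joint_pmf` and `reward` are chosen here for illustration.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product

# Enumerate the 8 equally likely outcomes of three fair flips.
joint_pmf = defaultdict(Fraction)
for flips in product("HT", repeat=3):
    x = 1 if flips[0] == "T" else 0   # number of tails on the first flip
    y = flips.count("T")              # total number of tails
    joint_pmf[(x, y)] += Fraction(1, 8)

def reward(x, y):
    """Reward g(x, y) = 3 (1 + x) y from the example."""
    return 3 * (1 + x) * y

# E[g(X, Y)] is the sum of g(k, l) * p(k, l) over the support.
expected_reward = sum(reward(x, y) * p for (x, y), p in joint_pmf.items())
print(dict(joint_pmf))
print(expected_reward)
```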
== Discrete marginal distributions
Consider the joint PMF $p(x,y)$ for a random vector $vec(X,Y)$. The marginal PMF for $X$ is found by
P(X=x) &= p_X (x) = P(union.big_("all" y) {X=x,Y=y}) \
&= sum_("all" y) P(X = x, Y = y) \
&= sum_("all" y) p(x,y)
We sum for all possible values across all the other random variables.
In general, for an $n$-vector, we do the same thing summing over all possible
combinations of values of the other variables.
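As an illustration of marginalizing, the sketch below sums a joint PMF over the other coordinate. The joint PMF used is the one for the three-coin-flip example ($X$ = tails on the first flip, $Y$ = total tails), with values obtained by counting the 8 outcomes; the helper name `marginal` is an assumption.

```python
from collections import defaultdict
from fractions import Fraction

# Joint PMF of (X, Y) for three fair flips, obtained by counting outcomes.
joint_pmf = {
    (0, 0): Fraction(1, 8), (0, 1): Fraction(2, 8), (0, 2): Fraction(1, 8),
    (1, 1): Fraction(1, 8), (1, 2): Fraction(2, 8), (1, 3): Fraction(1, 8),
}

def marginal(joint, axis):
    """Sum the joint PMF over every coordinate except `axis`."""
    result = defaultdict(Fraction)
    for values, p in joint.items():
        result[values[axis]] += p
    return dict(result)

p_X = marginal(joint_pmf, 0)
p_Y = marginal(joint_pmf, 1)
print(p_X)
print(p_Y)
```

Because `marginal` indexes into a tuple of values, the same helper works unchanged for an $n$-vector.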
== Multinomial distribution
Let $n$ and $r$ be positive integers and let $p_1, p_2, ..., p_r$ be positive reals such that $p_1 + p_2 + dots.c + p_r = 1$. Then the random vector $(X_1, ..., X_r)$ has the *multinomial distribution* with parameters $n$, $r$, and $p_1,p_2,...,p_r$ if the possible values are integer vectors $(k_1,k_2,...,k_r)$ such that $k_j >= 0$ and $k_1 + k_2 + dots.c + k_r = n$ and the joint PMF is given by
P(X_1 = k_1, X_2 = k_2, ..., X_r = k_r) = vec(n,k_1,k_2,...,k_r) p_1^(k_1) p_2^(k_2) dots.c p_r^(k_r)
Abbreviate this by $(X_1,...,X_r) ~ "Mult"(n,r,p_1,...,p_r)$.
Suppose an urn contains 1 green, 2 red, and 3 yellow balls. We sample a ball
with replacement 10 times. Find the probability that green appeared 3 times,
red twice, and yellow 5 times.
Let $X_g, X_r, X_y$ be the number of green, red and yellow balls in the
sample. Then
(X_g,X_r,X_y) ~ "Mult"(n=10,r=3,1 / 6,2 / 6,3 / 6)
PP(X_g = 3, X_r = 2, X_y = 5) = 10! / (3!2!5!) (1 / 6)^3 (2 / 6)^2 (3 / 6)^5 \
approx 0.0405
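The urn computation can be reproduced with a few lines of Python. This is a small sketch; the function name `multinomial_pmf` is made up for illustration and is not a library call.

```python
import math

def multinomial_pmf(counts, probs):
    """P(X_1 = k_1, ..., X_r = k_r) for a multinomial with n = sum(counts)."""
    n = sum(counts)
    coefficient = math.factorial(n)
    for k in counts:
        coefficient //= math.factorial(k)  # exact: builds n! / (k_1! ... k_r!)
    probability = float(coefficient)
    for k, p in zip(counts, probs):
        probability *= p**k
    return probability

# Urn example: 3 green, 2 red, 5 yellow in 10 draws with replacement.
urn_probability = multinomial_pmf([3, 2, 5], [1 / 6, 2 / 6, 3 / 6])
print(round(urn_probability, 4))
```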
== Jointly continuous
Let the pair of continuous random variables $(X,Y)$ defined on common $Omega$ have the joint PDF $f_(X,Y) (x,y)$ where $f$ is a function on $RR^2$ such that for any subset $B subset.eq RR^2$,
P((X,Y) in B) = integral.double_B f_(X,Y) (x,y) dif x dif y
where $f_(X,Y) (x,y) >= 0$ for all possible $(x,y)$ and $integral_(-infinity)^infinity integral_(-infinity)^infinity f_(X,Y) (x,y) dif x dif y = 1$.
In general, for random variables $X_1, X_2, ..., X_n$, they are jointly continuous if there exists a joint density function $f$ on $RR^n$ such that for subsets $B subset.eq RR^n$,
P((X_1,...,X_n) in B) = integral underbrace(dots.c,B) integral f(x_1,dots,x_n) dif x_1 dots.c dif x_n
where $f(x_1,...,x_n) >= 0$ everywhere and $f$ integrates to 1 over all of $RR^n$.
= Lecture #datetime(day: 3, month: 3, year: 2025).display()
== Conditioning on an event
Take the event $A = {X=k}$ for a discrete random variable $X$ and condition it on another event $B$.
Let $X$ be a discrete random variable and $B$ an event with $P(B) > 0$. Then
the *conditional probability mass function* of $X$, given $B$, is the
function $p_(X | B)$ defined as follows for all possible values $k$ of $X$:
p_(X|B) (k) = P(X = k | B) = P({X=k} sect B) / P(B)
Let $X$ be a discrete random variable and $B$ an event with $P(B) > 0$. Then
the *conditional expectation* of $X$, given the event $B$ is given by
$EE[X|B]$ and defined as:
EE[X | B] = sum_k k p_(X|B) (k) = sum_k k P(X=k | B)
where the sum ranges over all possible values $k$ of $X$.
== Law of total probability
Let $Omega$ be a sample space, $X$ a discrete random variable on $Omega$, and
$B_1,...,B_n$ a partition on $Omega$ such that each $P(B_i) > 0$. Then the
(unconditional) probability mass function of $X$ can be calculated by
averaging the conditional probability mass functions,
p_X (k) = sum_(i=1)^n p_(X|B_i) (k) P(B_i)
Let $X$ denote the number of customers that arrive in my store tomorrow. If the day is rainy, $X$ is $"Poisson"(lambda)$ and if the day is dry, $X$ is $"Poisson"(mu)$. Suppose the probability it rains tomorrow is 0.10. Find the probability mass function and expectation of $X$.
Let $B$ be the event that it rains tomorrow. Then $P(B) = 0.10$. The conditional PMFs and conditional expectations are given by
p_(X|B) (k) = e^(-lambda) lambda^k / k!, p_(X|B^c) (k) = e^(-mu) mu^k / k!
EE[X|B] = lambda, EE[X|B^c] = mu
The unconditional PMF is given by
p_X (k) &= P(B) p_(X|B) (k) + P(B^c) p_(X|B^c) (k) \
&= 1 / 10 e^(-lambda) lambda^k / k! + 9 / 10 e^(-mu) mu^k / k!
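Averaging the conditional expectations in the same way gives $EE[X] = P(B) EE[X|B] + P(B^c) EE[X|B^c] = lambda / 10 + 9 mu / 10$. The sketch below checks both the mixture PMF and this expectation numerically. The rates `lam = 4.0` and `mu = 10.0` are hypothetical, since the example leaves them symbolic, and the sums are truncated at 100 customers.

```python
import math

def poisson_pmf(k, rate):
    return math.exp(-rate) * rate**k / math.factorial(k)

def customers_pmf(k, lam, mu, p_rain=0.10):
    """Unconditional PMF: average the rainy and dry conditional PMFs."""
    return p_rain * poisson_pmf(k, lam) + (1 - p_rain) * poisson_pmf(k, mu)

lam, mu = 4.0, 10.0  # hypothetical rates, not from the notes
ks = range(100)      # truncation point; the neglected tail is negligible here
total_mass = sum(customers_pmf(k, lam, mu) for k in ks)
mean = sum(k * customers_pmf(k, lam, mu) for k in ks)
print(total_mass, mean, 0.10 * lam + 0.90 * mu)
```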
== Conditioning on a random variable
Let $X$ and $Y$ be discrete random variables. Then the *conditional probability mass function* of $X$ given $Y=y$ is the following:
p_(X|Y) (x|y) = P(X=x | Y = y) = P(X = x, Y = y) / P(Y=y) = (p_(X,Y) (x,y)) / (p_Y (y))
The conditional expectation of $X$ given $Y = y$ is
EE[X | Y=y] = sum_x x p_(X|Y) (x|y)
These definitions are valid for all $y$ such that $p_Y (y) > 0$.
Suppose an insect lays some number of eggs, $X$. Each egg hatches with
probability $p$, independently of the others. Let $Y$ be the number of the
$X$ eggs that hatch.
X ~ "Pois"(lambda) \
Y | X = x ~ "Bin"(x,p)
What is the marginal distribution of $Y$, $p_Y (y)$?
p_(X,Y) (x,y) &= p_X (x) dot p_(Y|X=x) (y) \
&= e^(-lambda) lambda^x / x! dot vec(x,y) p^y q^(x-y) \
&= e^(-lambda) lambda^x / cancel(x!) dot cancel(x!) / (y! (x-y)!) p^y q^(x-y) \
p_Y (y) &= sum_x p_(X,Y) (x,y) = sum_(x=0)^infinity p_(X,Y) (x,y) \
&= e^(-lambda) p^y / y! sum_(x=y)^infinity (lambda^y dot lambda^(x-y) q^(x-y)) / (x-y)! = e^(-lambda) (lambda p)^y / y! e^(lambda q) = e^(-lambda (1-q)) (lambda p)^y / y! = e^(-lambda p) (lambda p)^y / y! \
&= "Pois"(lambda p)
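The conclusion $Y ~ "Pois"(lambda p)$ can be checked by summing the joint PMF over $x$ numerically. In this sketch the values of $lambda$ and $p$ and the truncation point `max_eggs` are illustrative assumptions.

```python
import math

def poisson_pmf(k, rate):
    return math.exp(-rate) * rate**k / math.factorial(k)

def binomial_pmf(y, x, p):
    return math.comb(x, y) * p**y * (1 - p) ** (x - y)

def hatched_pmf(y, lam, p, max_eggs=150):
    """Marginal PMF of Y: sum p_X(x) * p_(Y|X=x)(y) over x from y up to max_eggs - 1."""
    return sum(poisson_pmf(x, lam) * binomial_pmf(y, x, p)
               for x in range(y, max_eggs))

lam, p = 6.0, 0.25  # illustrative values
for y in range(5):
    print(y, hatched_pmf(y, lam, p), poisson_pmf(y, lam * p))
```

The two printed columns should agree up to floating-point error, since terms with $x$ beyond the truncation point contribute almost nothing.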
== Continuous marginal/conditional distributions
The conditional probability density function of $X$ given $Y = y$ is given by
X | Y = y ~ f_(X|Y) (x|y) = (f_(X,Y) (x,y)) / (f_Y (y))
and the corresponding probability density function of $Y$ given $X = x$,
Y | X = x ~ f_(Y|X) (y|x) = (f_(X,Y) (x,y)) / (f_X (x))
For example, let $X$ be height and $Y$ be weight. Fixing the weight at
$y = 150$ lbs, $f_(X|Y) (x|150)$ describes the distribution of heights among people of that weight.
The conditional probability that $X in A$ given $Y = y$, is
PP(X in A | Y = y) = integral_A f_(X|Y) (x|y) dif x
The conditional expectation of $g(X)$, given $Y = y$, is
EE[g(X) | Y = y] = integral_(-infinity)^infinity g(x) f_(X|Y) (x|y) dif x
Let $X$ and $Y$ be jointly continuous. Then
f_X (x) = integral_(-infinity)^infinity f_(X|Y) (x|y) f_Y (y) dif y
For any function $g$ for which the expectation makes sense,
EE[g(X)] = integral^infinity_(-infinity) EE[g(X) | Y = y] f_Y (y) dif y
Let $X$ and $Y$ be discrete or jointly continuous random variables. The
*conditional expectation* of $X$ given $Y$, denoted $EE[X|Y]$, is by
definition the RV $v(Y)$, where the function $v$ is defined by
$v(y) = EE[X | Y = y]$.
Note the distinction between $EE[X | Y=y]$ and $EE[X|Y]$. The first is a number, the second is a random variable. The possible values (support) of $EE[X|Y]$ are precisely the numbers $EE[X | Y = y]$ as $y$ varies. The terminology:
- $EE[X | Y = y]$ is the expectation of $X$ given $Y = y$
- $EE[X|Y]$ is the expectation of $X$ given $Y$
Suppose $X$ and $Y$ are ${0,1}$-valued random variables with joint PMF
#table(
  columns: 3,
  rows: 3,
  [$X \\ Y$], [0], [1],
  [0], [$3 / 10$], [$2 / 10$],
  [1], [$1 / 10$], [$4 / 10$],
)
Find the conditional PMF and conditional expectation of $X$ given $Y = y$.
The marginal PMFs come from summing respective rows and columns
p_Y (0) = 4 / 10 "and" p_Y (1) = 6 / 10
p_X (0) = 5 / 10 "and" p_X (1) = 5 / 10
The conditional PMF of $X$ given $Y = 0$ is
p_(X|Y) (0|0) = (p_(X,Y) (0,0)) / (p_Y (0)) = 3 / 4 \
p_(X|Y) (1|0) = (p_(X,Y) (1,0)) / (p_Y (0)) = 1 / 4
Similarly, the conditional PMF of $X$ given $Y = 1$ is
p_(X|Y) (0|1) = 1 / 3 \
p_(X|Y) (1|1) = 2 / 3
The conditional expectations of $X$ come from computing expectations with the
conditional PMF
EE[X | Y = 0] = 0 dot p_(X|Y) (0|0) + 1 dot p_(X|Y) (1|0) = 0 dot 3 / 4 + 1 / 4 = 1 / 4 \
EE[X | Y = 1] = 0 dot p_(X|Y) (0|1) + 1 dot p_(X|Y) (1|1) = 0 dot 1 / 3 + 2 / 3 = 2 / 3
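The table computation can be reproduced exactly with fractions. In this sketch the joint PMF is keyed as `(x, y)` following the table (rows are $x$, columns are $y$); the helper names are illustrative.

```python
from fractions import Fraction

# Joint PMF p_(X,Y)(x, y) from the table: rows are x, columns are y.
joint = {
    (0, 0): Fraction(3, 10), (0, 1): Fraction(2, 10),
    (1, 0): Fraction(1, 10), (1, 1): Fraction(4, 10),
}

def conditional_pmf_x_given_y(y):
    """Return {x: p_(X|Y)(x | y)} by dividing the joint PMF by p_Y(y)."""
    p_y = sum(p for (_, yy), p in joint.items() if yy == y)
    return {x: p / p_y for (x, yy), p in joint.items() if yy == y}

def conditional_expectation_x_given_y(y):
    """E[X | Y = y] as the mean of the conditional PMF."""
    return sum(x * p for x, p in conditional_pmf_x_given_y(y).items())

for y in (0, 1):
    print(y, conditional_pmf_x_given_y(y), conditional_expectation_x_given_y(y))
```

Viewed as a function of $y$, `conditional_expectation_x_given_y` plays the role of $v(y)$, so applying it to the random value of $Y$ gives the random variable $EE[X|Y]$.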
== Sums of independent random variables
We derive distributions for sums of independent random variables. We show how
symmetry can help simplify calculations. If we know the joint distribution of
any two $X$ and $Y$, then we know everything about them and can describe any
random variable of the form $g(X,Y)$.
In particular, we focus on $g(X,Y) = X + Y$ for both the discrete and
continuous case.
Suppose $X$ and $Y$ are discrete with joint PMF $p_(X,Y)$. Then $X + Y$ is also discrete and its PMF can be computed by breaking up the event ${X+Y = n}$,
{X+Y = n} = union.big_("all possible" k) {X = k, Y = n - k}
into the disjoint union of the events ${X=k,Y=n-k}$:
p_(X+Y) (n) = P(X + Y = n) = sum_k PP(X=k, Y = n - k) \
= sum_k p_(X,Y) (k,n-k)
If $X$ and $Y$ are independent, then we can rewrite
p_(X+Y) (n) = sum_k p_X (k) p_Y (n-k) \
= sum_l p_X (n-l) p_Y (l) \
= (p_X convolve p_Y) (n)
where $p_X convolve p_Y$ is the _convolution_ of the PMFs $p_X$ and $p_Y$.
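The convolution sum translates directly into code. The sketch below represents each PMF as a dictionary from values to probabilities (a representation chosen here for convenience) and uses the sum of two independent fair dice as an illustrative case.

```python
from fractions import Fraction

def convolve_pmfs(p_x, p_y):
    """PMF of X + Y for independent X, Y given as {value: probability} dicts."""
    result = {}
    for k, pk in p_x.items():
        for l, pl in p_y.items():
            # The event {X = k, Y = l} contributes p_X(k) p_Y(l) to the value k + l.
            result[k + l] = result.get(k + l, 0) + pk * pl
    return result

die = {face: Fraction(1, 6) for face in range(1, 7)}
two_dice = convolve_pmfs(die, die)
print(two_dice[7], sum(two_dice.values()))
```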