#import "@youwen/zen:0.1.0": *
#import "@preview/ctheorems:1.1.3": *

#show: zen.with(
  title: "PSTAT120A Course Notes",
  author: "Youwen Wu",
  date: "Winter 2025",
  subtitle: "Taught by Brian Wainwright",
)

#outline()

= Introduction

PSTAT 120A is an introductory course on probability and statistics. However, it
is a theoretical course rather than an applied statistics course. You will not learn
how to read or conduct real-world statistical studies. Leave your $p$-values at
home; this ain't your momma's AP Stats.

= Lecture #datetime(day: 6, month: 1, year: 2025).display()

== Preliminaries

#definition[
  Statistics is the science dealing with the collection, summarization,
  analysis, and interpretation of data.
]

== Set theory for dummies

A terse introduction to elementary naive set theory and the basic operations
on sets.

#remark[
  Keep in mind that without $cal(Z F C)$ or another axiomatization of set theory that
  resolves fundamental issues, our set theory is subject to paradoxes like
  Russell's. Whoops, the universe doesn't exist.
]

#definition[
  A *set* is a collection of elements.
]

#example[Examples of sets][
  + Trivial set: ${1}$
  + Empty set: $emptyset$
  + $A = {a,b,c}$
]

We can construct sets using set-builder notation (also sometimes called set
comprehension).

$ {"expression with" x | "conditions on" x} $

#example("Set builder notation")[
  + The set of all even integers: ${2n | n in ZZ}$
  + The set of all perfect squares in $RR$: ${x^2 | x in NN}$
]

We also have notation for working with sets:

With arbitrary sets $A$, $B$:

+ $a in A$ ($a$ is a member of the set $A$)
+ $a in.not A$ ($a$ is not a member of the set $A$)
+ $A subset.eq B$ (Set theory: $A$ is a subset of $B$) (Stats: $A$ is an event in the sample space $B$)
+ $A subset B$ (Proper subset: $A != B$)
+ $A^c$ or $A'$ (read "complement of $A$," and introduced later)
+ $A union B$ (Union of $A$ and $B$. Gives a set consisting of the elements in $A$ or $B$, or both)
+ $A sect B$ (Intersection of $A$ and $B$. Gives a set consisting of the elements in *both* $A$ and $B$)
+ $A \\ B$ (Set difference. The set of all elements of $A$ that are not also in $B$)
+ $A times B$ (Cartesian product. Ordered pairs of $(a,b)$ $forall a in A$, $forall b in B$)

We can also write a few of these operations precisely as set comprehensions.

+ $A subset.eq B <==> forall a in A, a in B$ (a relation between sets rather than a new set, so it is stated as a condition)
+ $A union B = {x | x in A or x in B}$ (here $or$ is the logical OR)
+ $A sect B = {x | x in A and x in B}$ (here $and$ is the logical AND)
+ $A \\ B = {a | a in A and a in.not B}$
+ $A times B = {(a,b) | a in A and b in B}$

Take a moment and convince yourself that these definitions are equivalent to
the previous ones.
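
As a quick sanity check, the operations above map directly onto Python's built-in `set` type. The sets `A` and `B` below are arbitrary examples chosen for illustration; they are not from the lecture.

```python
from itertools import product

# Arbitrary example sets (an assumption for illustration).
A = {1, 2, 3}
B = {3, 4}

union = A | B                    # {x | x in A or x in B}
intersection = A & B             # {x | x in A and x in B}
difference = A - B               # {a | a in A and a not in B}
cartesian = set(product(A, B))   # {(a, b) | a in A and b in B}

# The comprehension forms agree with the built-in operators.
assert union == {x for x in range(10) if x in A or x in B}
assert difference == {a for a in A if a not in B}
print(union, intersection, difference, len(cartesian))
```

Note that the Cartesian product has $3 dot 2 = 6$ ordered pairs here, one for each choice of an element of `A` and an element of `B`.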

#definition[
  The universal set $Omega$ is the set of all objects in a given set
  theoretical universe.
]

With the above definition, we can now introduce the set complement.

#definition[
  The set complement $A'$ is given by
  $
    A' = Omega \\ A
  $
  where $Omega$ is the _universal set_.
]

#example[The real plane][
  The real plane $RR^2$ can be defined as a Cartesian product of $RR$ with
  itself.

  $ RR^2 = RR times RR $
]

Check your intuition that this makes sense. Why do you think $RR^n$ was chosen
as the notation for $n$-dimensional spaces in $RR$?

#definition[Disjoint sets][
  If $A sect B = emptyset$, then we say that $A$ and $B$ are *disjoint*.
]

#fact[
  For any sets $A$ and $B$, we have DeMorgan's Laws:
  + $(A union B)' = A' sect B'$
  + $(A sect B)' = A' union B'$
]

#fact[Generalized DeMorgan's][
  + $(union.big_i A_i)' = sect.big_i A_i '$
  + $(sect.big_i A_i)' = union.big_i A_i '$
]
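
DeMorgan's Laws can be checked by brute force on a small example. The sketch below assumes a hypothetical finite universal set `Omega` and two arbitrary subsets; it illustrates the identities and is not a proof.

```python
# Hypothetical finite universe and subsets, chosen only for illustration.
Omega = set(range(10))
A = {0, 1, 2, 3}
B = {2, 3, 4, 5}

def complement(S):
    """Complement relative to the universal set Omega."""
    return Omega - S

lhs_union = complement(A | B)
rhs_union = complement(A) & complement(B)
lhs_sect = complement(A & B)
rhs_sect = complement(A) | complement(B)
print(lhs_union == rhs_union, lhs_sect == rhs_sect)
```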

== Sizes of infinity

#definition[
  Let $N(A)$ be the number of elements in $A$. $N(A)$ is called the _cardinality_ of $A$.
]

We say a set is finite if it has finite cardinality, or infinite if it has an
infinite cardinality.

Infinite sets can be either _countably infinite_ or _uncountably infinite_.

When a set is countably infinite, its cardinality is $aleph_0$ (here $aleph$ is
the Hebrew letter aleph and read "aleph null").

When a set is uncountably infinite, its cardinality is greater than $aleph_0$.

#example("Countable sets")[
  + The natural numbers $NN$.
  + The rationals $QQ$.
  + The integers $ZZ$.
  + The set of all logical tautologies.
]

#example("Uncountable sets")[
  + The real numbers $RR$.
  + The real numbers in the interval $[0,1]$.
  + The _power set_ of $ZZ$, which is the set of all subsets of $ZZ$.
]

#remark[
  All the uncountable sets above have cardinality $2^(aleph_0)$, also written
  $frak(c)$ or $beth_1$. This is the _cardinality of the continuum_. Whether it
  equals $aleph_1$ is the continuum hypothesis, which $cal(Z F C)$ can neither prove nor disprove.

  However, in general uncountably infinite sets do not have the same
  cardinality.
]

#fact[
  If a set is countably infinite, then it has a bijection with $ZZ$. This means
  every set with cardinality $aleph_0$ has a bijection to $ZZ$. More generally,
  any sets with the same cardinality have a bijection between them.
]

This gives us the following equivalent statement:

#fact[
  Two sets have the same cardinality if and only if there exists a bijective
  function between them. In symbols,

  $ N(A) = N(B) <==> exists F : A <-> B $
]

= Lecture #datetime(day: 8, month: 1, year: 2025).display()

== Probability

#definition[
  A *random experiment* is one in which the set of all possible outcomes is known in advance, but one can't predict which outcome will occur on a given trial of the experiment.
]

#example("Finite sample spaces")[
  Toss a coin:
  $ Omega = {H,T} $

  Roll a pair of dice:
  $ Omega = {1,2,3,4,5,6} times {1,2,3,4,5,6} $
]

#example("Countably infinite sample spaces")[
  Shoot a basket until you make one:
  $ Omega = {M, F M, F F M, F F F M, dots} $
]

#example("Uncountably infinite sample space")[
  Waiting time for a bus:
  $ Omega = {t : t >= 0} $
]

#fact[
  Elements of $Omega$ are called sample points.
]

#definition[
  Any properly defined subset of $Omega$ is called an *event*.
]

#example[Dice][
  Rolling a fair die twice, let $A$ be the event that the combined score of both dice is 10.

  $ A = {(4,6), (5,5), (6,4)} $
]

Probabilistic concepts in the parlance of set theory:

- Universal set ($Omega$) $<->$ sample space
- Element $<->$ outcome / sample point ($omega$)
- Disjoint sets $<->$ mutually exclusive events

== Classical approach

Classical approach:

$ P(A) = (hash A) / (hash Omega) $

Requires equally likely outcomes and finite sample spaces.

#remark[
  With an infinite sample space, the probability becomes 0, which is often wrong.
]

#example("Dice again")[
  Rolling a fair die twice, let $A$ be the event that the combined score of both dice is 10.

  $
    A &= {(4,6), (5,5), (6,4)} \
    P(A) &= 3 / 36 = 1 / 12
  $
]
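
The classical formula can be checked by enumerating the sample space. This is a minimal sketch; the names `Omega`, `A`, and `P_A` are ours, chosen to match the notation above.

```python
from fractions import Fraction
from itertools import product

faces = range(1, 7)
Omega = list(product(faces, faces))      # 36 equally likely outcomes
A = [(i, j) for (i, j) in Omega if i + j == 10]

P_A = Fraction(len(A), len(Omega))       # classical probability: #A / #Omega
print(A, P_A)
```

Using `Fraction` keeps the answer exact rather than a rounded float.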

== Relative frequency approach

An approach done commonly by applied statisticians who work in the disgusting
real world. This is where we are generally concerned with irrelevant concerns
like accurate sampling and $p$-values and such. I am told this is covered in
PSTAT 120B, so hopefully I can avoid ever taking that class (as a pure math
major).

$
  P(A) = (hash "of times" A "occurs in large number of trials") / (hash "of trials")
$

#example[
  Flipping a coin to determine the probability of it landing heads.
]
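
A simulated version of the coin example, as a rough sketch. It assumes a fair coin modeled by a pseudorandom generator, and the function name `relative_frequency` is hypothetical.

```python
import random

def relative_frequency(trials, seed=0):
    """Estimate P(heads) as (# heads) / (# trials) for a simulated fair coin."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    heads = sum(rng.random() < 0.5 for _ in range(trials))
    return heads / trials

for n in (10, 1_000, 100_000):
    print(n, relative_frequency(n))
```

The estimates should settle near $1 / 2$ as the number of trials grows, which is the intuition behind this approach.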

== Subjective approach

Personal definition of probability. Not "real" probability, merely co-opting
its parlance to lend credibility to subjective judgements of confidence.

== Axiomatic approach

Consider a random experiment. Then:

#definition[
  The *sample space* $Omega$ is the set of all possible outcomes of the
  experiment.
]

#definition[
  Elements of $Omega$ are called *sample points*.
]

#definition[
  Subsets of $Omega$ are called *events*. The collection of events in $Omega$
  (for us, the power set of $Omega$) is denoted by $cal(F)$.
]

#definition[
  The *probability measure*, or probability distribution, or simply probability, is a function $P$.

  Let $P : cal(F) -> RR$ be a function satisfying the following axioms (properties).

  + $P(A) >= 0, forall A$
  + $P(Omega) = 1$
  + If $A_i sect A_j = emptyset, forall i != j$, then
    $ P(union.big_(i=1)^infinity A_i) = sum_(i=1)^infinity P(A_i) $
]

The 3-tuple $(Omega, cal(F), P)$ is called a *probability space*.

#remark[
  In more advanced texts you will see $cal(F)$ introduced as a so-called
  $sigma$-algebra. A $sigma$-algebra on a set $Omega$ is a nonempty collection
  $Sigma$ of subsets of $Omega$ that is closed under set complement, countable
  unions, and as a corollary, countable intersections.
]

Now let us show various results with $P$.

#proposition[
  $ P(emptyset) = 0 $
]

#proof[
  Take $A_i = emptyset$ for every $i$. These sets are pairwise disjoint and
  their union is $emptyset$, so by axiom 3,

  $
    P(emptyset) = sum^infinity_(i=1) P(A_i) = sum^infinity_(i=1) P(emptyset)
  $
  Suppose $P(emptyset) != 0$. Then $P(emptyset) > 0$ by axiom 1, so the sum on the right diverges to $infinity$, while the left side is the real number $P(emptyset)$, a contradiction. So $P(emptyset) = 0$.
]

#proposition[
  If $A_1, A_2, ..., A_n$ are disjoint, then
  $ P(union.big^n_(i=1) A_i) = sum^n_(i= 1) P(A_i) $
]

This is mostly a formal manipulation to derive the obviously true proposition from our axioms.

#proof[
  Extend the finite sequence $A_1, A_2, ..., A_n$ to the infinite sequence $A_1, A_2, ..., A_n, emptyset, emptyset, ...$, which is still pairwise disjoint. Then by axiom 3 and $P(emptyset) = 0$,
  $
    P(union.big_(i=1)^infinity A_i) = sum^n_(i=1) P(A_i) + sum^infinity_(i=n+1) P(emptyset) = sum^n_(i=1) P(A_i)
  $
  And because all of the elements after $A_n$ are $emptyset$, their union adds no additional elements to the resultant union set of all $A_i$, so
  $
    P(union.big_(i=1)^infinity A_i) = P(union.big_(i=1)^n A_i) = sum_(i=1)^n P(A_i)
  $
]

#proposition[Complement][
  $ P(A') = 1 - P(A) $
]

#proof[
  $
    A' union A &= Omega \
    A' sect A &= emptyset \
    P(A' union A) &= P(A') + P(A) &"(by axiom 3)"\
    = P(Omega) &= 1 &"(by axiom 2)" \
    therefore P(A') &= 1 - P(A)
  $
]

#proposition[
  $ A subset.eq B => P(A) <= P(B) $
]

#proof[
  $ B = A union (A' sect B) $

  but $A$ and ($A' sect B$) are disjoint, so

  $
    P(B) &= P(A union (A' sect B)) \
    &= P(A) + P(A' sect B) \
    &>= P(A)
  $
  where the last step uses $P(A' sect B) >= 0$ (axiom 1).
]

#proposition[
  $ P(A union B) = P(A) + P(B) - P(A sect B) $
]

#proof[
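  $
    A &= (A sect B) union (A sect B') \
    => P(A) &= P(A sect B) + P(A sect B') \
    P(B) &= P(B sect A) + P(B sect A') quad "(similarly)" \
    P(A) + P(B) &= P(A sect B) + P(A sect B) + P(A sect B') + P(A' sect B) \
    => P(A) + P(B) - P(A sect B) &= P(A sect B) + P(A sect B') + P(A' sect B)
  $
  The three events on the right are pairwise disjoint and their union is $A union B$, so the right-hand side equals $P(A union B)$.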
]

#remark[
  This strengthens the finite additivity that follows from axiom 3: it holds for all events $A$ and $B$, regardless of whether they're disjoint.
]

#remark[
  These are mostly intuitively true statements (think about the probabilistic
  concepts represented by the sets) in classical probability that we derive
  rigorously from our axiomatic probability function $P$.
]

#example[
  Now let us consider some trivial concepts in classical probability written in
  the parlance of combinatorial probability.

  Select one card from a deck of 52 cards.
  Then the following is true:

  $
    Omega = {1,2,...,52} \
    A = "card is a heart" = {H 2, H 3, H 4, ..., H"Ace"} \
    B = "card is an Ace" = {H"Ace", C"Ace", D"Ace", S"Ace"} \
    C = "card is black" = {C 2, C 3, ..., C"Ace", S 2, S 3, ..., S"Ace"} \
    P(A) = 13 / 52,
    P(B) = 4 / 52,
    P(C) = 26 / 52 \
    P(A sect B) = 1 / 52 \
    P(A sect C) = 0 \
    P(B sect C) = 2 / 52 \
    P(A union B) = P(A) + P(B) - P(A sect B) = 16 / 52 \
    P(B') = 1 - P(B) = 48 / 52 \
    P(A sect B') = P(A) - P(A sect B) = 13 / 52 - 1 / 52 = 12 / 52 \
    P((A sect B') union (A' sect B)) = P(A sect B') + P(A' sect B) = 15 / 52 \
    P(A' sect B') = P((A union B)') = 1 - P(A union B) = 36 / 52
  $
]
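
The card computations above can be verified by modeling the deck as a set of pairs. This is a sketch; the encoding of cards as `(suit, rank)` tuples and the helper `P` are our assumptions, not part of the lecture.

```python
from fractions import Fraction
from itertools import product

ranks = ["2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K", "Ace"]
suits = ["H", "D", "C", "S"]
Omega = set(product(suits, ranks))             # 52 cards as (suit, rank)

A = {c for c in Omega if c[0] == "H"}          # card is a heart
B = {c for c in Omega if c[1] == "Ace"}        # card is an Ace
C = {c for c in Omega if c[0] in ("C", "S")}   # card is black

def P(E):
    """Classical probability of the event E (equally likely cards)."""
    return Fraction(len(E), len(Omega))

assert P(A | B) == P(A) + P(B) - P(A & B) == Fraction(16, 52)
assert P(Omega - B) == 1 - P(B)
assert P((A - B) | (B - A)) == Fraction(15, 52)
assert P(Omega - (A | B)) == Fraction(36, 52)
```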

== Countable sample spaces

#definition[
  A sample space $Omega$ is said to be *countable* if it's finite or countably infinite.
]

In such a case, one can list the elements of $Omega$.

$ Omega = {omega_1, omega_2, omega_3, ...} $
with associated probabilities, $p_1, p_2, p_3,...$, where
$
  p_i = P(omega_i) >= 0 \
  1 = P(Omega) = sum P(omega_i)
$

#example[Fair die, again][
  All outcomes are equally likely,
  $ p_1 = p_2 = ... = p_6 = 1 / 6 $
  Let $A$ be the event that the score is odd = ${1,3,5}$
  $ P(A) = 3 / 6 $
]

#example[Loaded die][
  Consider a die where the probability of rolling each odd side is double the probability of rolling each even side.
  $
    p_2 = p_4 = p_6, p_1 = p_3 = p_5 = 2p_2 \
    6p_2 + 3p_2 = 9p_2 = 1 \
    p_2 = 1 / 9, p_1 = 2 / 9
  $
]
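
The same normalization can be done mechanically: assign each face a relative weight and divide by the total. A small sketch, with the dictionary names `weights` and `p` chosen by us:

```python
from fractions import Fraction

# Relative weights: odd faces count double, even faces count once.
weights = {face: (2 if face % 2 == 1 else 1) for face in range(1, 7)}
total = sum(weights.values())   # 3 * 2 + 3 * 1 = 9
p = {face: Fraction(w, total) for face, w in weights.items()}

P_odd = sum(p[face] for face in (1, 3, 5))
print(p[2], p[1], P_odd)
```

The probability of an odd score comes out to $2 / 3$, compared with $1 / 2$ for the fair die.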

#example[Coins][
  Toss a fair coin until you get the first head.
  $
    Omega = {H, T H, T T H, ...} "(countably infinite)" \
    P(H) = 1 / 2 \
    P(T T H) = (1 / 2)^3 \
    P(Omega) = sum_(n=1)^infinity (1 / 2)^n = 1 / (1 - 1 / 2) - 1 = 1
  $
]
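
To see the series converge, we can compute partial sums exactly. This is only an illustration of the geometric series above; `partial_sum` is a hypothetical helper name.

```python
from fractions import Fraction

def partial_sum(N):
    """Sum of P(first head on toss n) = (1/2)^n for n = 1..N."""
    return sum(Fraction(1, 2) ** n for n in range(1, N + 1))

for N in (1, 5, 10, 20):
    # The gap to 1 is exactly (1/2)^N, the probability of N tails in a row.
    print(N, partial_sum(N), float(1 - partial_sum(N)))
```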

#example[Birthdays][
  What is the probability two people share the same birthday?

  $
    Omega = {1, ..., 365} times {1, ..., 365} \
    P(A) = 365 / 365^2 = 1 / 365
  $
]
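
A brute-force count confirms the answer, under the assumptions that there are 365 equally likely birthdays and that the two people's birthdays are chosen independently.

```python
from fractions import Fraction
from itertools import product

days = range(1, 366)
Omega_size = 365 * 365
# Count the pairs (a, b) in which both people have the same birthday.
shared = sum(1 for (a, b) in product(days, days) if a == b)
P_shared = Fraction(shared, Omega_size)
print(shared, P_shared)
```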

== Continuous sample spaces

#definition[
  A *continuous sample space* contains an interval in $RR$ and is uncountably infinite.
]

#definition[
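  A probability density function (#smallcaps[pdf]) is a function $f$ on $Omega$ whose
  integral over an event gives the probability of that event. The value $f(s)$ is a density at the point $s$, not the probability of $s$.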
]

Properties of the #smallcaps[pdf]:

- $f(s) >= 0, forall s in Omega$
- $integral_Omega f(s) dif s = 1$

#example[
  Waiting time for bus: $Omega = {s : s >= 0}$.
]

= Notes on counting

The cardinality of $A$ is given by $hash A$. Let us develop methods for finding
$hash A$ from a description of the set $A$ (in other words, methods for
counting).

== General multiplication principle

#fact[
  Let $A$ and $B$ be finite sets, $k in ZZ^+$. Then let $f : A -> B$ be a
  function such that each element in $B$ is the image of exactly $k$ elements
  in $A$ (such a function is called _$k$-to-one_). Then $hash A = k dot hash
  B$.
]<ktoone>

#example[
  Four fully loaded 10-seater vans transported people to the picnic. How many
  people were transported?

  To apply @ktoone, let $A$ be the set of people, $B$ the set of vans, and $f : A -> B$ the map sending each person to the van they ride in. Then $f$ is a 10-to-one function and $hash B = 4$, so $hash A = 10 dot 4 = 40$.
]

#definition[
  An $n$-tuple is an ordered sequence of $n$ elements.
]

Many of our methods in probability rely on multiplying together the numbers of
outcomes of several choices to obtain the number of combined outcomes. We make this explicit below in @tuplemultiplication.

#fact[
  Suppose a set of $n$-tuples $(a_1, ..., a_n)$ obeys these rules:

  + There are $r_1$ choices for the first entry $a_1$.
  + Once the first $k$ entries $a_1, ..., a_k$ have been chosen, the number of alternatives for the next entry $a_(k+1)$ is $r_(k+1)$, regardless of the previous choices.

  Then the total number of $n$-tuples is the product $r_1 dot r_2 dot dots.c dot r_n$.
]<tuplemultiplication>

#proof[
  It is trivially true for $n = 1$ since you have $r_1$ choices of $a_1$ for a
  1-tuple $(a_1)$.

  Let $A$ be the set of all possible $n$-tuples and $B$ be the set of all
  possible $(n+1)$-tuples. Assume the statement is true for $A$ (the induction
  hypothesis). We show it for $B$, noting that each $n$-tuple $(a_1, ..., a_n)$
  in $A$ extends to exactly $r_(n+1)$ tuples in $B$.

  Let $f : B -> A$ be a function which takes each $(n+1)$-tuple and truncates the $a_(n+1)$ term, leaving us with just an $n$-tuple of the form $(a_1, a_2, ..., a_n)$.
  $ f((a_1, ..., a_n, a_(n + 1))) = (a_1, ..., a_n) $
  Now notice that $f$ is precisely a $r_(n+1)$-to-one function! Recall by
  our assumption that @tuplemultiplication is true for $n$-tuples, so $A$ has $r_1 dot
  r_2 dot ... dot r_n$ elements, or $hash A = r_1 dot ... dot r_n$. Then by
  @ktoone, we have $hash B = hash A dot r_(n+1) = r_1 dot r_2 dot
  ... dot r_(n+1)$. Our induction is complete and we have proved @tuplemultiplication.
]

@tuplemultiplication is sometimes called the _general multiplication principle_.

We can use @tuplemultiplication to derive counting formulas for various
situations. Let $A_1, A_2, ..., A_n$ be finite sets. Then as a corollary of
@tuplemultiplication, we can count the number of $n$-tuples in a finite
Cartesian product of $A_1, A_2, ..., A_n$.

#fact[
  Let $A_1, A_2, ..., A_n$ be finite sets. Then

  $
    hash (A_1 times A_2 times ... times A_n) = (hash A_1) dot (hash A_2) dot ... dot (hash A_n) = product^n_(i=1) (hash A_i)
  $
]

#example[
  How many distinct subsets does a set of size $n$ have?

  The answer is $2^n$. Each subset can be encoded as an $n$-tuple with entries 0
  or 1, where the $i$th entry is 1 if the $i$th element of the set is in the
  subset and 0 if it is not.

  Thus the number of subsets is the same as the cardinality of
  $ {0,1} times ... times {0,1} = {0,1}^n $
  which is $2^n$.

  This is why given a set $X$ with cardinality $aleph$, we write the
  cardinality of the power set of $X$ as $2^aleph$.
]
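
The encoding of subsets as 0/1 tuples can be written out directly. This is a minimal sketch; the generator name `subsets` and the example set are ours.

```python
from itertools import product

def subsets(items):
    """Yield every subset of items, one per 0/1 tuple of length len(items)."""
    items = list(items)
    for bits in product((0, 1), repeat=len(items)):
        # The i-th bit says whether the i-th element is in the subset.
        yield {x for x, bit in zip(items, bits) if bit == 1}

all_subsets = list(subsets(["a", "b", "c"]))
print(len(all_subsets))
```

Each tuple in `product((0, 1), repeat=n)` corresponds to exactly one subset, which is the bijection the argument above relies on.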

== Permutations

Now we can use the multiplication principle to count permutations.

#fact[
  Consider all $k$-tuples $(a_1, ..., a_k)$ that can be constructed from a set $A$ of size $n$, with $n >= k$, without repetition. The total number of these $k$-tuples is
  $ (n)_k = n dot (n - 1) ... (n - k + 1) = n! / (n-k)! $

  In particular, with $k=n$, each $n$-tuple is an ordering or _permutation_ of $A$. So the total number of permutations of a set of $n$ elements is $n!$.
]<permutation>

#proof[
  We construct the $k$-tuples sequentially. For the first element, we choose
  one element from $A$ with $n$ alternatives. The next element has $n - 1$
  alternatives. In general, after $j$ elements are chosen, there are $n - j$
  alternatives for the next one, so the $k$th element has $n - k + 1$ alternatives.

  Then clearly after choosing $k$ elements for our $k$-tuple we have by
  @tuplemultiplication the number of $k$-tuples being $n dot (n - 1) dot ...
  dot (n - k + 1) = (n)_k$.
]
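
As a check on the formula, we can compare a direct enumeration against the product $n dot (n - 1) dots.c (n - k + 1)$. The values of `n` and `k` are arbitrary, and `falling_factorial` is a hypothetical helper; `math.perm` is the standard library's version of the same count.

```python
from itertools import permutations
from math import factorial, perm

def falling_factorial(n, k):
    """(n)_k = n * (n-1) * ... * (n-k+1)."""
    result = 1
    for j in range(k):
        result *= n - j   # after j choices, n - j alternatives remain
    return result

n, k = 5, 3
count = sum(1 for _ in permutations(range(n), k))   # k-tuples without repetition
assert count == falling_factorial(n, k) == perm(n, k)
assert count == factorial(n) // factorial(n - k)
print(count)
```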

#example[
  Consider a round table with 8 seats.

  + In how many ways can we seat 8 guests around the table?
  + In how many ways can we do this if we do not differentiate between seating arrangements that are rotations of each other?

  For (1), we easily see that we're simply asking for permutations of an
  8-tuple, so $8!$ is the answer.

  For (2), we number each person and each seat from 1-8, then always place person 1 in seat 1, and count the permutations of the other 7 people in the other 7 seats. Then the answer is $7!$.

  Alternatively, notice that each arrangement has 8 equivalent arrangements under rotation. So the answer is $8!/8 = 7!$.
]
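
Both arguments for part (2) can be checked by brute force. The sketch below rotates every seating so that guest 0 sits in seat 0, which mirrors the "fix person 1 in seat 1" argument; the function name is hypothetical.

```python
from itertools import permutations
from math import factorial

def seatings_up_to_rotation(n):
    """Count seatings of n guests at a round table, identifying rotations."""
    seen = set()
    for seating in permutations(range(n)):
        # Canonical form: rotate so that guest 0 sits in seat 0.
        i = seating.index(0)
        seen.add(seating[i:] + seating[:i])
    return len(seen)

print(seatings_up_to_rotation(8), factorial(7))
```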

== Counting from sets

We turn our attention to sets, which unlike tuples are unordered collections.

#fact[
  Let $n,k in NN$ with $0 <= k <= n$. The number of distinct subsets of size $k$ that a set of size $n$ has is given by the *binomial coefficient*
  $ vec(n,k) = n! / (k! (n-k)!) $
]

#proof[
  Let $A$ be a set of size $n$. By @permutation, $n!/(n-k)!$ unique ordered
  $k$-tuples can be constructed from elements of $A$. Each subset of $A$ of
  size $k$ has exactly $k!$ different orderings, and hence appears exactly $k!$
  times among the ordered $k$-tuples. Thus the number of subsets of size $k$ is
  $n! / (k! (n-k)!)$.
]
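
The overcounting argument in the proof can be observed numerically: ordered $k$-tuples outnumber $k$-element subsets by exactly a factor of $k!$. The values of `n` and `k` below are arbitrary.

```python
from itertools import combinations, permutations
from math import comb, factorial

n, k = 6, 3
ordered = sum(1 for _ in permutations(range(n), k))     # n! / (n-k)! tuples
unordered = sum(1 for _ in combinations(range(n), k))   # subsets of size k

# Each subset of size k appears k! times among the ordered tuples.
assert ordered == unordered * factorial(k)
assert unordered == comb(n, k)
print(ordered, unordered)
```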

#example[
  In a class there are 12 boys and 14 girls. How many different teams of 7 pupils
  with 3 boys and 4 girls can be created?

  First let us compute how many subsets of size 3 we can choose from the 12 boys and how many subsets of size 4 we can choose from the 14 girls.

  $
    "boys" &= vec(12,3) \
    "girls" &= vec(14,4)
  $

  Then let us consider the entire team as a 2-tuple of (boys, girls). Then
  there are $vec(12,3)$ alternatives for the choice of boys, and $vec(14,4)$ alternatives for
  the choice of girls, so by the multiplication principle, we have the total being

  $ vec(12,3) vec(14,4) $
]
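
For a concrete number, the product can be evaluated with the standard library's `math.comb`. The variable names are ours.

```python
from math import comb

boys = comb(12, 3)     # ways to choose 3 of the 12 boys
girls = comb(14, 4)    # ways to choose 4 of the 14 girls
teams = boys * girls   # multiplication principle on the 2-tuple (boys, girls)
print(boys, girls, teams)
```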

#example[
  Let $A = {1, 2, 3, 4, 5, 6}$. Color the numbers 1, 2 red, the numbers 3, 4
  green, and the numbers 5, 6 yellow. How many different two-element subsets of
  $A$ are there that have two different colors?

  First choose 2 colors, $vec(3,2) = 3$. Then from each color, choose one. Altogether it's
  $ vec(3,2) vec(2,1) vec(2,1) = 3 dot 2 dot 2 = 12 $
]
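
The count is small enough to enumerate. A sketch, with the coloring stored in a dictionary named `color` (our choice of representation):

```python
from itertools import combinations

color = {1: "red", 2: "red", 3: "green", 4: "green", 5: "yellow", 6: "yellow"}

# All two-element subsets whose members have different colors.
two_color_subsets = [
    {a, b} for a, b in combinations(color, 2) if color[a] != color[b]
]
print(len(two_color_subsets))
```

Equivalently, there are $15$ two-element subsets in total and $3$ of them are monochromatic, leaving $12$.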

One way to view $vec(n,k)$ is as the number of ways of painting $n$ elements
with two colors, red and yellow, with $k$ red and $n - k$ yellow elements. Let
us generalize to more than two colors.

#fact[
  Let $n$ and $r$ be positive integers and $k_1, ..., k_r$ nonnegative integers
  such that $k_1 + dots.c + k_r = n$. The number of ways of assigning labels
  $1,2, ..., r$ to $n$ items so that for each $i = 1, 2, ..., r$, exactly $k_i$
  items receive label $i$, is the *multinomial coefficient*

  $ vec(n, (k_1, k_2, ..., k_r)) = n! / (k_1 ! k_2 ! dots.c k_r !) $
]<multinomial-coefficient>

#proof[
  Order the $n$ integers in some manner, and assign labels like this: for the
  first $k_1$ integers, assign the label 1, then for the next $k_2$ integers,
  assign the label 2, and so on. The $i$th label will be assigned to all the
  integers between positions $k_1 + dots.c + k_(i-1) + 1$ and $k_1 + dots.c +
  k_i$.

  Then notice that all possible orderings (permutations) of the integers give
  every possible way to label the integers. However, we overcount by some
  amount. How much? The order of the integers with a given label doesn't matter,
  so we need to deduplicate those.

  Each labeling is duplicated once for each way we can order the elements
  sharing a label. For label $i$, there are $k_i$ elements with that label, so
  $k_i !$ ways to order those. By @tuplemultiplication, the number of orderings
  that produce the same labeling is $k_1 ! k_2 ! k_3 ! dots.c k_r !$.

  So by @ktoone, we can account for the duplicates and the answer is
  $ n! / (k_1 ! k_2 ! k_3 ! dots.c k_r !) $
]
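
The formula can be compared against a brute-force count of labelings for small inputs. Both function names below are hypothetical, and the brute force is only practical for small $n$ since it tries all $r^n$ labelings.

```python
from itertools import product
from math import factorial

def multinomial(n, ks):
    """n! / (k_1! k_2! ... k_r!), assuming sum(ks) == n."""
    assert sum(ks) == n
    denom = 1
    for k in ks:
        denom *= factorial(k)
    return factorial(n) // denom

def count_labelings(n, ks):
    """Brute force: label each of n items with 0..r-1 and keep valid ones."""
    r = len(ks)
    return sum(
        1
        for labels in product(range(r), repeat=n)
        if all(labels.count(i) == ks[i] for i in range(r))
    )

ks = (2, 1, 3)
print(multinomial(6, ks), count_labelings(6, ks))
```

With two labels the multinomial coefficient reduces to the binomial coefficient, for example `multinomial(4, (2, 2))` is 6.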

#remark[
  @multinomial-coefficient gives us a way to count how many ways there are to
  fit $n$ distinguishable objects into $r$ distinguishable containers of
  varying capacity.

  To find the number of ways to distribute $n$ indistinguishable objects among
  $k$ distinguishable containers, use the "ball-and-urn" technique (also known
  as stars and bars).
]