#import "@youwen/zen:0.1.0": *
|
|
#import "@preview/ctheorems:1.1.3": *
|
|
|
|
#show: zen.with(
|
|
title: "PSTAT120A Course Notes",
|
|
author: "Youwen Wu",
|
|
date: "Winter 2025",
|
|
subtitle: "Taught by Brian Wainwright",
|
|
)
|
|
|
|
#outline()

= Introduction

PSTAT 120A is an introductory course on probability and statistics. However, it
is a theoretical course rather than an applied statistics course. You will not
learn how to read or conduct real-world statistical studies. Leave your
$p$-values at home, this ain't your momma's AP Stats.

= Lecture #datetime(day: 6, month: 1, year: 2025).display()

== Preliminaries

#definition[
Statistics is the science dealing with the collection, summarization,
analysis, and interpretation of data.
]

== Set theory for dummies

A terse introduction to elementary naive set theory and the basic operations
upon them.

#remark[
Keep in mind that without $cal(Z F C)$ or another model of set theory that
resolves fundamental issues, our set theory is subject to paradoxes like
Russell's. Whoops, the universe doesn't exist.
]

#definition[
A *set* is a collection of elements.
]

#example[Examples of sets][
+ Trivial set: ${1}$
+ Empty set: $emptyset$
+ $A = {a,b,c}$
]

We can construct sets using set-builder notation (also sometimes called set
comprehension).

$ {"expression with" x | "conditions on" x} $

#example("Set builder notation")[
+ The set of all even integers: ${2n | n in ZZ}$
+ The set of all perfect squares: ${x^2 | x in NN}$
]

We also have notation for working with sets:

With arbitrary sets $A$, $B$:

+ $a in A$ ($a$ is a member of the set $A$)
+ $a in.not A$ ($a$ is not a member of the set $A$)
+ $A subset.eq B$ (Set theory: $A$ is a subset of $B$) (Stats: $A$ is an event in the sample space $B$)
+ $A subset B$ (Proper subset: $A subset.eq B$ and $A != B$)
+ $A^c$ or $A'$ (read "complement of $A$," and introduced later)
+ $A union B$ (Union of $A$ and $B$. Gives a set with both the elements of $A$ and $B$)
+ $A sect B$ (Intersection of $A$ and $B$. Gives a set consisting of the elements in *both* $A$ and $B$)
+ $A \\ B$ (Set difference. The set of all elements of $A$ that are not also in $B$)
+ $A times B$ (Cartesian product. The set of all ordered pairs $(a,b)$ with $a in A$ and $b in B$)

We can also write a few of these operations precisely as set comprehensions.

+ $A subset.eq B <==> (forall a in A, a in B)$
+ $A union B = {x | x in A or x in B}$ (here $or$ is the logical OR)
+ $A sect B = {x | x in A and x in B}$ (here $and$ is the logical AND)
+ $A \\ B = {a | a in A and a in.not B}$
+ $A times B = {(a,b) | a in A, b in B}$

Take a moment and convince yourself that these definitions are equivalent to
the previous ones.

#definition[
The universal set $Omega$ is the set of all objects in a given set
theoretical universe.
]

With the above definition, we can now introduce the set complement.

#definition[
The set complement $A'$ is given by
$
A' = Omega \\ A
$
where $Omega$ is the _universal set_.
]

#example[The real plane][
The real plane $RR^2$ can be defined as a Cartesian product of $RR$ with
itself.

$ RR^2 = RR times RR $
]

Check your intuition that this makes sense. Why do you think $RR^n$ was chosen
as the notation for $n$ dimensional spaces in $RR$?

#definition[Disjoint sets][
If $A sect B = emptyset$, then we say that $A$ and $B$ are *disjoint*.
]

#fact[
For any sets $A$ and $B$, we have DeMorgan's Laws:
+ $(A union B)' = A' sect B'$
+ $(A sect B)' = A' union B'$
]

#fact[Generalized DeMorgan's][
+ $(union.big_i A_i)' = sect.big_i A_i '$
+ $(sect.big_i A_i)' = union.big_i A_i '$
]
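
These identities are easy to sanity-check by brute force. A quick illustration
(mine, not from lecture) on a small finite universe, using Python's built-in
set type:

```python
# Verify DeMorgan's laws on a small finite universe.
omega = set(range(10))            # universal set
a = {1, 2, 3, 4}
b = {3, 4, 5, 6}

complement = lambda s: omega - s  # set difference against omega

# (A ∪ B)' == A' ∩ B'
assert complement(a | b) == complement(a) & complement(b)
# (A ∩ B)' == A' ∪ B'
assert complement(a & b) == complement(a) | complement(b)
print("DeMorgan's laws hold on this universe")
```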

== Sizes of infinity

#definition[
Let $N(A)$ be the number of elements in $A$. $N(A)$ is called the _cardinality_ of $A$.
]

We say a set is finite if it has finite cardinality, or infinite if it has an
infinite cardinality.

Infinite sets can be either _countably infinite_ or _uncountably infinite_.

When a set is countably infinite, its cardinality is $aleph_0$ (here $aleph$ is
the Hebrew letter aleph and read "aleph null").

When a set is uncountably infinite, its cardinality is greater than $aleph_0$.

#example("Countable sets")[
+ The natural numbers $NN$.
+ The rationals $QQ$.
+ The integers $ZZ$.
+ The set of all logical tautologies.
]

#example("Uncountable sets")[
+ The real numbers $RR$.
+ The real numbers in the interval $[0,1]$.
+ The _power set_ of $ZZ$, which is the set of all subsets of $ZZ$.
]

#remark[
All the uncountable sets above have cardinality $2^(aleph_0)$, also written
$frak(c)$ or $beth_1$ (read "beth 1"). This is the _cardinality of the
continuum_. Whether it equals $aleph_1$, the smallest cardinality above
$aleph_0$, is precisely the continuum hypothesis.

However, in general uncountably infinite sets do not have the same
cardinality.
]

#fact[
If a set is countably infinite, then it has a bijection with $ZZ$. This means
every set with cardinality $aleph_0$ has a bijection to $ZZ$. More generally,
any sets with the same cardinality have a bijection between them.
]

This gives us the following equivalent statement:

#fact[
Two sets have the same cardinality if and only if there exists a bijective
function between them. In symbols,

$ N(A) = N(B) <==> exists F : A <-> B $
]
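
For a concrete instance of such a bijection (a standard textbook example, not
from lecture), interleave the negative and nonnegative integers:

$ f : NN -> ZZ, quad f(n) = cases(n / 2 "if" n "is even", -(n+1) / 2 "if" n "is odd") $

so $0, 1, 2, 3, 4, dots |-> 0, -1, 1, -2, 2, dots$, witnessing $N(NN) = N(ZZ)$.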

= Lecture #datetime(day: 8, month: 1, year: 2025).display()

== Probability

#definition[
A *random experiment* is one in which the set of all possible outcomes is known in advance, but one can't predict which outcome will occur on a given trial of the experiment.
]

#example("Finite sample spaces")[
Toss a coin:
$ Omega = {H,T} $

Roll a pair of dice:
$ Omega = {1,2,3,4,5,6} times {1,2,3,4,5,6} $
]

#example("Countably infinite sample spaces")[
Shoot a basket until you make one:
$ Omega = {M, F M, F F M, F F F M, dots} $
]

#example("Uncountably infinite sample space")[
Waiting time for a bus:
$ Omega = {t : t >= 0} $
]

#fact[
Elements of $Omega$ are called sample points.
]

#definition[
Any properly defined subset of $Omega$ is called an *event*.
]

#example[Dice][
Rolling a fair die twice, let $A$ be the event that the combined score of both dice is 10.

$ A = {(4,6), (5,5), (6,4)} $
]

Probabilistic concepts in the parlance of set theory:

- Superset ($Omega$) $<->$ sample space
- Element $<->$ outcome / sample point ($omega$)
- Disjoint sets $<->$ mutually exclusive events

== Classical approach

Classical approach:

$ P(A) = (hash A) / (hash Omega) $

Requires equally likely outcomes and finite sample spaces.

#remark[
With an infinite sample space, $hash Omega$ is infinite and this formula
assigns probability 0 to every finite event, which is often wrong.
]

#example("Dice again")[
Rolling a fair die twice, let $A$ be the event that the combined score of both dice is 10.

$
A &= {(4,6), (5,5), (6,4)} \
P(A) &= 3 / 36 = 1 / 12
$
]
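
This classical computation is small enough to verify by exhaustive
enumeration; a throwaway Python check (illustrative only):

```python
from itertools import product
from fractions import Fraction

# Sample space: all ordered pairs of two die rolls.
omega = list(product(range(1, 7), repeat=2))
# Event A: the combined score is 10.
a = [w for w in omega if sum(w) == 10]

print(a)                             # [(4, 6), (5, 5), (6, 4)]
print(Fraction(len(a), len(omega)))  # 1/12
```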

== Relative frequency approach

An approach done commonly by applied statisticians who work in the disgusting
real world. This is where we are generally concerned with irrelevant concerns
like accurate sampling and $p$-values and such. I am told this is covered in
PSTAT 120B, so hopefully I can avoid ever taking that class (as a pure math
major).

$
P(A) = (hash "of times" A "occurs in large number of trials") / (hash "of trials")
$

#example[
Flipping a coin to determine the probability of it landing heads.
]

== Subjective approach

Personal definition of probability. Not "real" probability, merely co-opting
its parlance to lend credibility to subjective judgements of confidence.

== Axiomatic approach

Consider a random experiment. Then:

#definition[
The *sample space* $Omega$ is the set of all possible outcomes of the
experiment.
]

#definition[
Elements of $Omega$ are called *sample points*.
]

#definition[
Subsets of $Omega$ are called *events*. The collection of events (in other
terms, the power set of $Omega$) in $Omega$ is denoted by $cal(F)$.
]

#definition[
The *probability measure*, or probability distribution, or simply probability, is a function $P$.

Let $P : cal(F) -> RR$ be a function satisfying the following axioms (properties).

+ $P(A) >= 0, forall A$
+ $P(Omega) = 1$
+ If $A_i sect A_j = emptyset, forall i != j$, then
  $ P(union.big_(i=1)^infinity A_i) = sum_(i=1)^infinity P(A_i) $
]

The 3-tuple $(Omega, cal(F), P)$ is called a *probability space*.

#remark[
In more advanced texts you will see $cal(F)$ introduced as a so-called
$sigma$-algebra. A $sigma$-algebra on a set $Omega$ is a nonempty collection
$Sigma$ of subsets of $Omega$ that is closed under set complement, countable
unions, and as a corollary, countable intersections.
]

Now let us show various results with $P$.

#proposition[
$ P(emptyset) = 0 $
]

#proof[
Apply axiom 3 to the disjoint sequence

$
A_1 = emptyset, A_2 = emptyset, A_3 = emptyset, dots \
P(emptyset) = sum^infinity_(i=1) P(A_i) = sum^infinity_(i=1) P(emptyset)
$

Suppose $P(emptyset) != 0$. Then $P(emptyset) > 0$ by axiom 1, so the sum on the right diverges to $infinity$. But $P(emptyset) <= P(Omega) = 1$ by axiom 2, a contradiction. So $P(emptyset) = 0$.
]

#proposition[
If $A_1, A_2, ..., A_n$ are disjoint, then
$ P(union.big^n_(i=1) A_i) = sum^n_(i= 1) P(A_i) $
]

This is mostly a formal manipulation to derive the obviously true proposition from our axioms.

#proof[
Write any finite sequence $(A_1, A_2, ..., A_n)$ as an infinite sequence $(A_1, A_2, ..., A_n, emptyset, emptyset, ...)$, which is still disjoint. Then
$
P(union.big_(i=1)^infinity A_i) = sum^n_(i=1) P(A_i) + sum^infinity_(i=n+1) P(emptyset) = sum^n_(i=1) P(A_i)
$
And because all of the elements after $A_n$ are $emptyset$, their union adds no additional elements to the resultant union set of all $A_i$, so
$
P(union.big_(i=1)^infinity A_i) = P(union.big_(i=1)^n A_i) = sum_(i=1)^n P(A_i)
$
]

#proposition[Complement][
$ P(A') = 1 - P(A) $
]

#proof[
$
A' union A &= Omega \
A' sect A &= emptyset \
P(A' union A) &= P(A') + P(A) &"(by axiom 3)" \
P(A' union A) = P(Omega) &= 1 &"(by axiom 2)" \
therefore P(A') &= 1 - P(A)
$
]

#proposition[
$ A subset.eq B => P(A) <= P(B) $
]

#proof[
$ B = A union (A' sect B) $

but $A$ and ($A' sect B$) are disjoint, so

$
P(B) &= P(A union (A' sect B)) \
&= P(A) + P(A' sect B) \
&>= P(A) &"(since" P(A' sect B) >= 0 "by axiom 1)"
$
]

#proposition[
$ P(A union B) = P(A) + P(B) - P(A sect B) $
]

#proof[
Decompose $A$, $B$, and $A union B$ into disjoint pieces:
$
P(A) = P(A sect B) + P(A sect B') \
P(B) = P(A sect B) + P(A' sect B) \
P(A union B) = P(A sect B) + P(A sect B') + P(A' sect B)
$
Adding the first two lines and subtracting $P(A sect B)$,
$
P(A) + P(B) - P(A sect B) = P(A sect B) + P(A sect B') + P(A' sect B) = P(A union B)
$
]

#remark[
This is a stronger version of axiom 3 (for two events): it holds for all sets $A$ and $B$, regardless of whether they're disjoint.
]

#remark[
These are mostly intuitively true statements (think about the probabilistic
concepts represented by the sets) in classical probability that we derive
rigorously from our axiomatic probability function $P$.
]

#example[
Now let us consider some trivial concepts in classical probability written in
the parlance of combinatorial probability.

Select one card from a deck of 52 cards.
Then the following is true:

$
Omega = {1,2,...,52} \
A = "card is a heart" = {H 2, H 3, H 4, ..., H"Ace"} \
B = "card is an Ace" = {H"Ace", C"Ace", D"Ace", S"Ace"} \
C = "card is black" = {C 2, C 3, ..., C"Ace", S 2, S 3, ..., S"Ace"} \
P(A) = 13 / 52,
P(B) = 4 / 52,
P(C) = 26 / 52 \
P(A sect B) = 1 / 52 \
P(A sect C) = 0 \
P(B sect C) = 2 / 52 \
P(A union B) = P(A) + P(B) - P(A sect B) = 16 / 52 \
P(B') = 1 - P(B) = 48 / 52 \
P(A sect B') = P(A) - P(A sect B) = 13 / 52 - 1 / 52 = 12 / 52 \
P((A sect B') union (A' sect B)) = P(A sect B') + P(A' sect B) = 15 / 52 \
P(A' sect B') = P((A union B)') = 1 - P(A union B) = 36 / 52
$
]
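
All of these identities can be confirmed by enumerating the deck directly. A
sketch (my own encoding of the deck, not the lecture's), with `Fraction` used
to keep the arithmetic exact:

```python
from fractions import Fraction
from itertools import product

# Deck as (rank, suit) pairs; ranks 2..14 with 14 standing for Ace.
deck = list(product(range(2, 15), "HCDS"))

hearts = {c for c in deck if c[1] == "H"}  # event A
aces = {c for c in deck if c[0] == 14}     # event B

p = lambda event: Fraction(len(event), len(deck))  # classical probability

assert p(hearts & aces) == Fraction(1, 52)
assert p(hearts | aces) == p(hearts) + p(aces) - p(hearts & aces) == Fraction(16, 52)
assert p(set(deck) - aces) == 1 - p(aces) == Fraction(48, 52)
print("all card identities check out")
```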

== Countable sample spaces

#definition[
A sample space $Omega$ is said to be *countable* if it's finite or countably infinite.
]

In such a case, one can list the elements of $Omega$.

$ Omega = {omega_1, omega_2, omega_3, ...} $
with associated probabilities, $p_1, p_2, p_3,...$, where
$
p_i = P(omega_i) >= 0 \
1 = P(Omega) = sum P(omega_i)
$

#example[Fair die, again][
All outcomes are equally likely,
$ p_1 = p_2 = ... = p_6 = 1 / 6 $
Let $A$ be the event that the score is odd, so $A = {1,3,5}$ and
$ P(A) = 3 / 6 $
]

#example[Loaded die][
Consider a die where the probability of rolling each odd side is double the probability of rolling each even side.
$
p_2 = p_4 = p_6, quad p_1 = p_3 = p_5 = 2 p_2 \
6 p_2 + 3 p_2 = 9 p_2 = 1 \
p_2 = 1 / 9, quad p_1 = 2 / 9
$
]

#example[Coins][
Toss a fair coin until you get the first head.
$
Omega = {H, T H, T T H, ...} "(countably infinite)" \
P(H) = 1 / 2 \
P(T T H) = (1 / 2)^3 \
P(Omega) = sum_(n=1)^infinity (1 / 2)^n = 1 / (1 - 1 / 2) - 1 = 1
$
]

#example[Birthdays][
What is the probability two people share the same birthday?

$
Omega = {1, ..., 365} times {1, ..., 365} \
P(A) = 365 / 365^2 = 1 / 365
$
]

== Continuous sample spaces

#definition[
A *continuous sample space* contains an interval in $RR$ and is uncountably infinite.
]

#definition[
A probability density function (#smallcaps[pdf]) $f$ assigns a density $f(s)$
to each point $s$; probabilities are obtained by integrating $f$ over regions
of the sample space (the probability of any single point is 0).
]

Properties of the #smallcaps[pdf]:

- $f(s) >= 0$ for all $s$
- $integral_Omega f(s) dif s = 1$

#example[
Waiting time for bus: $Omega = {s : s >= 0}$.
]

= Notes on counting

The cardinality of $A$ is given by $hash A$. Let us develop methods for finding
$hash A$ from a description of the set $A$ (in other words, methods for
counting).

== General multiplication principle

#fact[
Let $A$ and $B$ be finite sets, $k in ZZ^+$. Then let $f : A -> B$ be a
function such that each element in $B$ is the image of exactly $k$ elements
in $A$ (such a function is called _$k$-to-one_). Then $hash A = k dot hash B$.
]<ktoone>

#example[
Four fully loaded 10-seater vans transported people to the picnic. How many
people were transported?

By @ktoone, take $A$ to be the set of people, $B$ the set of vans, and $f : A -> B$ the map sending a person to the van they ride in. Then $f$ is a 10-to-one function and $hash B = 4$, so $hash A = 10 dot 4 = 40$.
]

#definition[
An $n$-tuple is an ordered sequence of $n$ elements.
]

Many of our methods in probability rely on multiplying together multiple
outcomes to obtain their combined amount of outcomes. We make this explicit below in @tuplemultiplication.

#fact[
Suppose a set of $n$-tuples $(a_1, ..., a_n)$ obeys these rules:

+ There are $r_1$ choices for the first entry $a_1$.
+ Once the first $k$ entries $a_1, ..., a_k$ have been chosen, the number of alternatives for the next entry $a_(k+1)$ is $r_(k+1)$, regardless of the previous choices.

Then the total number of $n$-tuples is the product $r_1 dot r_2 dot dots.c dot r_n$.
]<tuplemultiplication>

#proof[
It is trivially true for $n = 1$ since you have $r_1$ choices of $a_1$ for a
1-tuple $(a_1)$.

Let $A$ be the set of all possible $n$-tuples and $B$ be the set of all
possible $(n+1)$-tuples. Now let us assume the statement is true for $A$.
Proceed by induction on $B$, noting that for each $n$-tuple in $A$, $(a_1,
..., a_n)$, we have $r_(n+1)$ tuples in $B$.

Let $f : B -> A$ be a function which takes each $(n+1)$-tuple and truncates the $a_(n+1)$ term, leaving us with just an $n$-tuple of the form $(a_1, a_2, ..., a_n)$.
$ f((a_1, ..., a_n, a_(n + 1))) = (a_1, ..., a_n) $
Now notice that $f$ is precisely a $r_(n+1)$-to-one function! Recall by
our assumption that @tuplemultiplication is true for $n$-tuples, so $A$ has $r_1 dot
r_2 dot ... dot r_n$ elements, or $hash A = r_1 dot ... dot r_n$. Then by
@ktoone, we have $hash B = hash A dot r_(n+1) = r_1 dot r_2 dot
... dot r_(n+1)$. Our induction is complete and we have proved @tuplemultiplication.
]

@tuplemultiplication is sometimes called the _general multiplication principle_.

We can use @tuplemultiplication to derive counting formulas for various
situations. Let $A_1, A_2, ..., A_n$ be finite sets. Then as a corollary of
@tuplemultiplication, we can count the number of $n$-tuples in a finite
Cartesian product of $A_1, A_2, ..., A_n$.

#fact[
Let $A_1, A_2, ..., A_n$ be finite sets. Then

$
hash (A_1 times A_2 times ... times A_n) = (hash A_1) dot (hash A_2) dot ... dot (hash A_n) = product^n_(i=1) (hash A_i)
$
]

#example[
How many distinct subsets does a set of size $n$ have?

The answer is $2^n$. Each subset can be encoded as an $n$-tuple with entries 0
or 1, where the $i$th entry is 1 if the $i$th element of the set is in the
subset and 0 if it is not.

Thus the number of subsets is the same as the cardinality of
$ {0,1} times ... times {0,1} = {0,1}^n $
which is $2^n$.

This is why given a set $X$ with cardinality $kappa$, we write the
cardinality of the power set of $X$ as $2^kappa$.
]
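
The 0/1-tuple encoding is exactly how subsets are enumerated as bitmasks in
code; a small Python sketch of the idea (not from lecture):

```python
def subsets(xs):
    """Enumerate all subsets of xs via the 0/1-tuple (bitmask) encoding."""
    n = len(xs)
    for mask in range(2 ** n):  # each mask is an n-tuple of bits
        yield {xs[i] for i in range(n) if mask >> i & 1}

s = ["a", "b", "c"]
all_subs = list(subsets(s))
print(len(all_subs))  # 8 == 2**3
```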

== Permutations

Now we can use the multiplication principle to count permutations.

#fact[
Consider all $k$-tuples $(a_1, ..., a_k)$ that can be constructed from a set $A$ of size $n, n >= k$ without repetition. The total number of these $k$-tuples is
$ (n)_k = n dot (n - 1) ... (n - k + 1) = n! / (n-k)! $

In particular, with $k=n$, each $n$-tuple is an ordering or _permutation_ of $A$. So the total number of permutations of a set of $n$ elements is $n!$.
]<permutation>

#proof[
We construct the $k$-tuples sequentially. For the first element, we choose
one element from $A$ with $n$ alternatives. The next element has $n - 1$
alternatives. In general, when choosing the $j$th element, $j - 1$ elements
have already been used, so there are $n - j + 1$ alternatives.

Then clearly after choosing $k$ elements for our $k$-tuple we have by
@tuplemultiplication the number of $k$-tuples being $n dot (n - 1) dot ...
dot (n - k + 1) = (n)_k$.
]

#example[
Consider a round table with 8 seats.

+ In how many ways can we seat 8 guests around the table?
+ In how many ways can we do this if we do not differentiate between seating arrangements that are rotations of each other?

For (1), we easily see that we're simply asking for permutations of an
8-tuple, so $8!$ is the answer.

For (2), we number each person and each seat from 1-8, then always place person 1 in seat 1, and count the permutations of the other 7 people in the other 7 seats. Then the answer is $7!$.

Alternatively, notice that each arrangement has 8 equivalent arrangements under rotation. So the answer is $8!/8 = 7!$.
]
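
Part (2) is also checkable by brute force: generate all $8!$ seatings and
collapse each orbit under rotation to a canonical representative. A quick
sketch (mine, illustrative only):

```python
from itertools import permutations

def canonical(seating):
    """Represent a circular arrangement by its lexicographically least rotation."""
    n = len(seating)
    return min(seating[i:] + seating[:i] for i in range(n))

distinct = {canonical(p) for p in permutations(range(8))}
print(len(distinct))  # 5040 == 7!
```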

== Counting from sets

We turn our attention to sets, which unlike tuples are unordered collections.

#fact[
Let $n,k in NN$ with $0 <= k <= n$. The number of distinct subsets of size $k$ that a set of size $n$ has is given by the *binomial coefficient*
$ vec(n,k) = n! / (k! (n-k)!) $
]

#proof[
Let $A$ be a set of size $n$. By @permutation, $n!/(n-k)!$ unique ordered
$k$-tuples can be constructed from elements of $A$. Each subset of $A$ of
size $k$ has exactly $k!$ different orderings, and hence appears exactly $k!$
times among the ordered $k$-tuples. Thus the number of subsets of size $k$ is
$n! / (k! (n-k)!)$.
]

#example[
In a class there are 12 boys and 14 girls. How many different teams of 7 pupils
with 3 boys and 4 girls can be created?

First let us compute how many subsets of size 3 we can choose from the 12 boys and how many subsets of size 4 we can choose from the 14 girls.

$
"boys" &= vec(12,3) \
"girls" &= vec(14,4)
$

Then let us consider the entire team as a 2-tuple of (boys, girls). Then
there are $vec(12,3)$ alternatives for the choice of boys, and $vec(14,4)$ alternatives for
the choice of girls, so by the multiplication principle, we have the total being

$ vec(12,3) vec(14,4) $
]
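
Numerically, using Python's `math.comb` (a sanity check, not part of the
notes):

```python
from math import comb

boys = comb(12, 3)   # 220 ways to choose the boys
girls = comb(14, 4)  # 1001 ways to choose the girls
print(boys * girls)  # 220220 possible teams
```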

#example[
Let $A = {1,2,3,4,5,6}$. Color the numbers 1, 2 red, the numbers 3, 4 green, and the numbers 5, 6
yellow. How many different two-element subsets of $A$ are there that have two
different colors?

First choose 2 colors, $vec(3,2) = 3$. Then from each color, choose one. Altogether it's
$ vec(3,2) vec(2,1) vec(2,1) = 3 dot 2 dot 2 = 12 $
]

One way to view $vec(n,k)$ is as the number of ways of painting $n$ elements
with two colors, red and yellow, with $k$ red and $n - k$ yellow elements. Let
us generalize to more than two colors.

#fact[
Let $n$ and $r$ be positive integers and $k_1, ..., k_r$ nonnegative integers
such that $k_1 + dots.c + k_r = n$. The number of ways of assigning labels
$1,2, ..., r$ to $n$ items so that for each $i = 1, 2, ..., r$, exactly $k_i$
items receive label $i$, is the *multinomial coefficient*

$ vec(n, (k_1, k_2, ..., k_r)) = n! / (k_1 ! k_2 ! dots.c k_r !) $
]<multinomial-coefficient>

#proof[
Order the $n$ items in some manner, and assign labels like this: for the
first $k_1$ items, assign the label 1, then for the next $k_2$ items,
assign the label 2, and so on. The $i$th label will be assigned to all the
items between positions $k_1 + dots.c + k_(i-1) + 1$ and $k_1 + dots.c + k_i$.

Then notice that all possible orderings (permutations) of the items give
every possible way to label the items. However, we overcount by some
amount. How much? The order of the items with a given label doesn't matter,
so we need to deduplicate those.

Each set of labels is duplicated once for each way we can order all of the
elements with the same label. For label $i$, there are $k_i$ elements with
that label, so $k_i !$ ways to order those. By @tuplemultiplication, we know
that we can express the combined amount of ways each group of $k_1, ..., k_r$
items are labeled as $k_1 ! k_2 ! k_3 ! dots.c k_r !$.

So by @ktoone, we can account for the duplicates and the answer is
$ n! / (k_1 ! k_2 ! k_3 ! dots.c k_r !) $
]

#remark[
@multinomial-coefficient gives us a way to count how many ways there are to
fit $n$ distinguishable objects into $r$ distinguishable containers of
varying capacity.

To find the amount of ways to fit $n$ indistinguishable objects into $k$
distinguishable containers of _any_ capacity, use the "stars and bars"
(ball-and-urn) technique.
]

#example[
How many different ways can six people be divided into three pairs?

First we use the multinomial coefficient to count the amount of ways to assign specific labels to pairs of elements:
$ vec(6, (2,2,2)) $
But notice that the actual labels themselves are irrelevant. Our multinomial
coefficient counts how many ways there are to assign 3 distinguishable
labels, say Pair 1, Pair 2, Pair 3, to our 6 elements.

To make this more explicit, say we had a 3-tuple where the position encoded
the label, where position 1 corresponds to Pair 1, and so on. Then the values
are the actual pairs of people (numbered 1-6). For instance
$ ((1,2), (3,4), (5,6)) $
corresponds to assigning the label Pair 1 to (1,2), Pair 2 to (3,4) and Pair
3 to (5,6). What our multinomial coefficient is doing is it's counting this,
as well as any other orderings of this tuple. For instance
$ ((3,4), (1,2), (5,6)) $
is also counted. However since in our case the actual labels are irrelevant,
the two examples shown above should really be counted only once.

How many extra times is each case counted? It turns out that we can think of
our multinomial coefficient as permuting the labels across our pairs. So in
this case it's permuting all the ways we can order 3 labels, which is $3! =
6$. That means by @ktoone our answer is

$ vec(6, (2,2,2)) / 3! = 15 $
]
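
The answer 15 is small enough to confirm by enumeration: generate all
orderings of the six people and quotient out both kinds of irrelevant order.
A throwaway sketch (my encoding):

```python
from itertools import permutations

# Count distinct ways to split {0,...,5} into three unordered pairs.
pairings = set()
for p in permutations(range(6)):
    pairs = [tuple(sorted(p[i:i + 2])) for i in (0, 2, 4)]  # kill order within a pair
    pairings.add(frozenset(pairs))                          # kill order of the pairs
print(len(pairings))  # 15
```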

#example("Poker")[
How many poker hands are in the category _one pair_?

A one pair is a hand with two cards of the same rank and three cards with ranks
different from each other and the pair.

We can count in two ways: we count all the ordered hands, then divide by $5!$
to remove overcounting, or we can build the unordered hands directly.

When finding the ordered hands, the key is to figure out how we can encode
our information in a tuple of the form described in @tuplemultiplication, and
then use @tuplemultiplication to compute the solution.

In this case, the first element encodes the two slots in the hand of 5 our
pair occupies, the second element encodes the first card of the pair, the
third element encodes the second card of the pair, and the fourth, fifth, and
sixth elements represent the 3 cards that are not of the same rank.

Now it is clear that the number of alternatives in each position of the
6-tuple does not depend on any of the others, so @tuplemultiplication
applies. Then we can determine the amount of alternatives for each position
in the 6-tuple and multiply them to determine the total amount of ways the
6-tuple can be constructed, giving us the total amount of ways to construct
ordered poker hands with one pair.

First we choose 2 slots out of 5 positions (in the hand) so there are
$vec(5,2)$ alternatives. Then we choose any of the 52 cards for our first
pair card, so there are 52 alternatives. Then we choose any card with the
same rank for the second card in the pair, where there are 3 possible
alternatives. Then we choose the third card which must not be the same rank
as the first two, where there are 48 alternatives. The fourth card must not
be the same rank as the others, so there are 44 alternatives. Likewise, the
final card has 40 alternatives.

So the final answer is, remembering to divide by $5!$ because we don't care
about order,
$ (vec(5,2) dot 52 dot 3 dot 48 dot 44 dot 40) / 5! $

Alternatively, we can find a way to build an unordered hand with the
requirements. First we choose the rank of the pair, then we choose two suits
for that rank, then we choose the remaining 3 different ranks, and finally a
suit for each of the ranks. Then, noting that we will now omit constructing
the tuple and explicitly listing alternatives for brevity, we have
$ 13 dot vec(4,2) dot vec(12, 3) dot 4^3 $

Both approaches give the same answer.
]
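
Both formulas are easy to compare numerically (a quick check; 1,098,240 is
the standard one-pair count):

```python
from math import comb, factorial

ordered = comb(5, 2) * 52 * 3 * 48 * 44 * 40  # ordered 6-tuple construction
direct = 13 * comb(4, 2) * comb(12, 3) * 4 ** 3  # unordered construction

assert ordered // factorial(5) == direct == 1098240
print(direct)  # 1098240
```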

= Discussion section #datetime(day: 22, month: 1, year: 2025).display()

= Lecture #datetime(day: 23, month: 1, year: 2025).display()

== Independence

#definition("Independence")[
Two events $A subset Omega$ and $B subset Omega$ are independent if and only if
$ P(B sect A) = P(B)P(A) $
"Joint probability is equal to product of their marginal probabilities."
]

#fact[This definition must be used to show the independence of two events.]

#fact[
If $A$ and $B$ are independent, then,
$
P(A | B) = underbrace((P(A sect B)) / P(B), "conditional probability") = (P(A) P(B)) / P(B) = P(A)
$
]

#example[
Flip a fair coin 3 times. Let the events:

- $A$ = we have exactly one tails among the first 2 flips
- $B$ = we have exactly one tails among the last 2 flips
- $D$ = we get exactly one tails among all 3 flips

Show that $A$ and $B$ are independent.
What about $B$ and $D$?

Compute all of the possible events, then we see that

$
P(A sect B) = (hash (A sect B)) / (hash Omega) = 2 / 8 = 4 / 8 dot 4 / 8 = P(A) P(B)
$

So they are independent.

Repeat the same reasoning for $B$ and $D$, and we see that they are not independent.
]
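
The sample space has only 8 outcomes, so both independence claims can be
verified mechanically; a Python sketch (my event encodings):

```python
from fractions import Fraction
from itertools import product

omega = list(product("HT", repeat=3))
p = lambda ev: Fraction(len(ev), len(omega))

a = {w for w in omega if w[:2].count("T") == 1}  # one tails in first 2 flips
b = {w for w in omega if w[1:].count("T") == 1}  # one tails in last 2 flips
d = {w for w in omega if w.count("T") == 1}      # one tails in all 3 flips

print(p(a & b) == p(a) * p(b))  # True:  A, B independent
print(p(b & d) == p(b) * p(d))  # False: B, D not independent
```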

#example[
Suppose we have 4 red and 7 green balls in an urn. We choose two balls with replacement. Let

- $A$ = the first ball is red
- $B$ = the second ball is green

Are $A$ and $B$ independent?

$
hash Omega = 11 times 11 = 121 \
hash A = 4 dot 11 = 44 \
hash B = 11 dot 7 = 77 \
hash (A sect B) = 4 dot 7 = 28
$

So $P(A sect B) = 28 / 121 = 44 / 121 dot 77 / 121 = P(A) P(B)$, and the events are independent.
]

#definition[
Events $A_1, ..., A_n$ are independent (mutually independent) if for every collection $A_(i_1), ..., A_(i_k)$, where $2 <= k <= n$ and $1 <= i_1 < i_2 < dots.c < i_k <= n$,

$
P(A_(i_1) sect A_(i_2) sect dots.c sect A_(i_k)) = P(A_(i_1)) P(A_(i_2)) dots.c P(A_(i_k))
$
]

#definition[
We say that the events $A_1, ..., A_n$ are *pairwise independent* if any two
different events $A_i$ and $A_j$ are independent for any $i != j$.
]

= Lecture #datetime(day: 27, year: 2025, month: 1).display()

== Bernoulli trials

The setup: the experiment has exactly two outcomes:
- Success -- $S$ or 1
- Failure -- $F$ or 0

Additionally:
$
P(S) = p, (0 < p < 1) \
P(F) = 1 - p = q
$

Construct the probability mass function:

$
P(X = 1) = p \
P(X = 0) = 1 - p
$

Write it as:

$ p_X (k) = p^k (1-p)^(1-k) $

for $k = 1$ and $k = 0$.

== Binomial distribution

The setup: very similar to Bernoulli, trials have exactly 2 outcomes. A bunch
of Bernoulli trials in a row.

Importantly: $p$ and $q$ are defined exactly the same in all trials.

This ties the binomial distribution to the sampling with replacement model,
since each trial does not affect the next.

We conduct $n$ *independent* trials of this experiment. Example with coins: each
flip independently has a $1/2$ chance of heads or tails (holds same for die,
rigged coin, etc).

$n$ is fixed, i.e. known ahead of time.

== Binomial random variable

Let $X = hash$ of successes in $n$ independent trials. Each outcome is a
sequence of $n$ trials, i.e. an $omega$ of the form $omega = S F F dots.c F$
of length $n$, and $Omega$ is the set of all such sequences.

Then $X(omega) = 0,1,2,...,n$ can take $n + 1$ possible values. The
probability of any particular sequence is given by the product of the
individual trial probabilities.

#example[
$ omega = S F F S F dots.c S = (p q q p q dots.c p) $
]

So $P(X = 0) = P(F F F dots.c F) = q dot q dot dots.c dot q = q^n$.

And
$
P(X = 1) = P(S F F dots.c F) + P(F S F F dots.c F) + dots.c + P(F F F dots.c F S) \
= underbrace(n, "possible outcomes") dot p^1 dot q^(n-1) \
= vec(n, 1) dot p^1 dot q^(n-1)
$

Now we can generalize

$
P(X = 2) = vec(n,2) p^2 q^(n-2)
$

How about all successes?

$
P(X = n) = P(S S dots.c S) = p^n
$

We see that for all failures we have $q^n$ and all successes we have $p^n$.
Otherwise we use our method above.

In general, here is the probability mass function for the binomial random variable

$
P(X = k) = vec(n, k) p^k q^(n-k), "for" k = 0,1,2,...,n
$

The binomial distribution is very powerful: whenever we repeat an independent
binary choice, it gives the probability of each possible number of successes.

To summarize the characterization of the binomial random variable:

- $n$ independent trials
- each trial results in binary success or failure
- with probability of success $p$, identically across trials

with $X = hash$ successes in *fixed* $n$ trials.

$ X ~ "Bin"(n,p) $

with probability mass function

$
P(X = x) = vec(n,x) p^x (1 - p)^(n-x) = p(x) "for" x = 0,1,2,...,n
$

We see this is in fact the binomial theorem!

$
p(x) >= 0, quad sum^n_(x=0) p(x) = sum^n_(x=0) vec(n,x) p^x q^(n-x) = (p + q)^n
$

In fact,
$
(p + q)^n = (p + (1 - p))^n = 1
$
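
A numeric check of this normalization claim (illustrative; any $n$ and $p$
work):

```python
from math import comb, isclose

def binom_pmf(k, n, p):
    """P(X = k) for X ~ Bin(n, p)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

n, p = 10, 1 / 6
total = sum(binom_pmf(k, n, p) for k in range(n + 1))
assert isclose(total, 1.0)  # the pmf sums to (p + q)^n = 1
print(total)
```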

#example[
A family has 5 children. What is the probability that the number of males is 2, if we
assume births are independent and the probability of a male is 0.5?

First we check binomial criteria: $n$ independent trials, well formed
$S$/$F$, probability the same across trials. Let's say male is $S$ and
otherwise $F$.

We have $n=5$ and $p = 0.5$. We just need $P(X = 2)$.

$
P(X = 2) = vec(5,2) (0.5)^2 (0.5)^3 \
= (5 dot 4) / (2 dot 1) (1 / 2)^5 = 10 / 32
$
]

#example[
What is the probability of getting exactly three aces (1's) out of 10 throws
of a fair die?

Seems a little trickier but we can still write this as well defined $S$/$F$.
Let $S$ be getting an ace and $F$ be anything else.

Then $p = 1/6$ and $n = 10$. We want $P(X=3)$. So

$
P(X=3) = vec(10,3) p^3 q^7 = vec(10,3) (1 / 6)^3 (5 / 6)^7 \
approx 0.15505
$
]

#example[
Suppose we have two types of candy, red and black, with $a$ red and $b$ black. Select $n$ candies. Let $X$
be the number of red candies among the $n$ selected.

2 cases.

- case 1: with replacement: Binomial distribution, $n$, $p = a/(a + b)$.
  $ P(X = 2) = vec(n,2) (a / (a+b))^2 (b / (a+b))^(n-2) $
- case 2: without replacement: then use counting
  $ P(X = x) = (vec(a,x) vec(b,n-x)) / vec(a+b,n) = p(x) $
]

We've done case 2 before, but now we introduce a random variable to represent
it.

$ P(X = x) = (vec(a,x) vec(b,n-x)) / vec(a+b,n) = p(x) $

is known as a *Hypergeometric distribution*.

== Hypergeometric distribution

There are different characterizations of the parameters, but

$ X ~ "Hypergeom"(hash "total", hash "successes", "sample size") $

For example,
$ X ~ "Hypergeom"(N, a, n) "where" N = a+b $

In the textbook, it's
$ X ~ "Hypergeom"(N, N_a, n) $

#remark[
If the sample size $n$ is very small relative to $a + b$, then both cases give
similar (approximately the same) answers.
]

For instance, if we're sampling for blood types from UCSB, and we take a
student out without replacement, we don't really change the population
substantially. So both answers give a similar result.

Suppose we have two types of items, type $A$ and type $B$. Let $N_A$ be $hash$
type $A$, $N_B$ $hash$ type $B$. $N = N_A + N_B$ is the total number of
objects.

We sample $n$ items *without replacement* ($n <= N$) with order not mattering.
Denote by $X$ the number of type $A$ objects in our sample.

#definition[
Let $0 <= N_A <= N$ and $1 <= n <= N$ be integers. A random variable $X$ has the *hypergeometric distribution* with parameters $(N, N_A, n)$ if $X$ takes values in the set ${0,1,...,n}$ and has p.m.f.

$ P(X = k) = (vec(N_A,k) vec(N-N_A,n-k)) / vec(N,n) = p(k) $
]

#example[
Let $N_A = 10$ defectives. Let $N_B = 90$ non-defectives. We select $n=5$ without replacement. What is the probability that 2 of the 5 selected are defective?

$
X ~ "Hypergeom" (N = 100, N_A = 10, n = 5)
$

We want $P(X=2)$.

$
P(X=2) = (vec(10,2) vec(90,3)) / vec(100,5) approx 0.0702
$
]
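
The same arithmetic in Python (sanity check only):

```python
from math import comb

def hypergeom_pmf(k, N, N_A, n):
    """P(X = k) for X ~ Hypergeom(N, N_A, n)."""
    return comb(N_A, k) * comb(N - N_A, n - k) / comb(N, n)

print(round(hypergeom_pmf(2, 100, 10, 5), 4))  # 0.0702
```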

#remark[
Make sure you can distinguish when a problem is binomial or when it is
hypergeometric. This is very important on exams.

Recall that both ask about the number of successes, in a fixed number of trials.
But binomial is sampling with replacement (each trial is independent) and
sampling without replacement is hypergeometric.
]

#example[
A cat gives birth to 6 kittens. 2 are male, 4 are female. Your neighbor comes and picks up 3 kittens randomly to take home with them.

How do we define the random variable? What is the p.m.f.?

Let $X$ be the number of male cats in the neighbor's selection.

$ X ~ "Hypergeom"(N = 6, N_A = 2, n = 3) $
and $X$ takes values in ${0,1,2}$. Find the p.m.f. by finding probabilities for these values.

$
&P(X = 0) = (vec(2,0) vec(4,3)) / vec(6,3) = 4 / 20 \
&P(X = 1) = (vec(2,1) vec(4,2)) / vec(6,3) = 12 / 20 \
&P(X = 2) = (vec(2,2) vec(4,1)) / vec(6,3) = 4 / 20 \
&P(X = 3) = (vec(2,3) vec(4,0)) / vec(6,3) = 0
$

Note that for $P(X=3)$, we are asking for 3 successes (drawing males) where
there are only 2 males, so it must be 0.
]

== Geometric distribution

Consider an infinite sequence of independent trials, e.g. the number of attempts
until I make a basket.

Let $X_i$ denote the outcome of the $i^"th"$ trial, where success is 1 and failure is 0. Let $N$ be the number of trials needed to observe the first success in a sequence of independent trials with probability of success $p$.

We fail $k-1$ times and succeed on the $k^"th"$ try. Then:

$
P(N = k) = P(X_1 = 0, X_2 = 0, ..., X_(k-1) = 0, X_k = 1) = (1 - p)^(k-1) p
$

This is the probability of failure raised to the number of failures, times the
probability of success.

The key characteristic in these trials: we keep going until we succeed. There's
no $n$ choose $k$ in front like the binomial distribution because there's
exactly one sequence that gives us success.

#definition[
Let $0 < p <= 1$. A random variable $X$ has the geometric distribution with
success parameter $p$ if the possible values of $X$ are ${1,2,3,...}$ and $X$
satisfies

$
P(X=k) = (1-p)^(k-1) p
$

for positive integers $k$. Abbreviate this by $X ~ "Geom"(p)$.
]

#example[
What is the probability it takes more than seven rolls of a fair die to roll a
six?

Let $X$ be the number of rolls of a fair die until the first six. Then $X ~
"Geom"(1/6)$. Now we just want $P(X > 7)$.

$
P(X > 7) = sum^infinity_(k=8) P(X=k) = sum^infinity_(k=8) (5 / 6)^(k-1) 1 / 6
$

Re-indexing,

$
sum^infinity_(k=8) (5 / 6)^(k-1) 1 / 6 = 1 / 6 (5 / 6)^7 sum^infinity_(j=0) (5 / 6)^j
$

Now we calculate by standard methods:

$
1 / 6 (5 / 6)^7 sum^infinity_(j=0) (5 / 6)^j = 1 / 6 (5 / 6)^7 dot 1 / (1-5 / 6) =
(5 / 6)^7
$
]
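
As a check on the closed form, one can also simulate the experiment directly
(a rough Monte Carlo sketch, mine; the estimate fluctuates around $(5/6)^7 approx 0.279$):

```python
import random

def first_six():
    """Roll a fair die until a six appears; return the number of rolls."""
    rolls = 1
    while random.randint(1, 6) != 6:
        rolls += 1
    return rolls

trials = 200_000
estimate = sum(first_six() > 7 for _ in range(trials)) / trials
print(estimate, (5 / 6) ** 7)  # both close to 0.279
```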