#import "@youwen/zen:0.1.0": *
|
|
|
|
#show: zen.with(
|
|
title: "PSTAT120A Course Notes",
|
|
author: "Youwen Wu",
|
|
date: "Winter 2025",
|
|
subtitle: "Taught by Brian Wainwright",
|
|
)
|
|
|
|
#outline()
|
|
|
|
= Introduction
|
|
|
|
These are lecture notes from when PSTAT120A (Probability and Statistics) was
|
|
taught in Winter 2025 by Dr. Wainwright. Any errors contained within are the
|
|
scribe's, not the instructor's.
|
|
|
|
= Lecture #datetime(day: 6, month: 1, year: 2025).display()
|
|
|
|
== Preliminaries
|
|
|
|
#definition[
|
|
Statistics is the science dealing with the collection, summarization,
|
|
analysis, and interpretation of data.
|
|
]
|
|
|
|
== Set theory for dummies
|
|
|
|
A terse introduction to elementary naive set theory and the basic operations
|
|
upon them.
|
|
|
|
#remark[
|
|
Keep in mind that without $cal(Z F C)$ or another model of set theory that
|
|
resolves fundamental issues, our set theory is subject to paradoxes like
|
|
Russell's. Whoops, the universe doesn't exist.
|
|
]
|
|
|
|
#definition[
|
|
A *set* is a collection of elements.
|
|
]
|
|
|
|
#example[Examples of sets][
|
|
+ Trivial set: ${1}$
|
|
+ Empty set: $emptyset$
|
|
+ $A = {a,b,c}$
|
|
]
|
|
|
|
We can construct sets using set-builder notation (also sometimes called set
|
|
comprehension).
|
|
|
|
$ {"expression with" x | "conditions on" x} $
|
|
|
|
#example("Set builder notation")[
|
|
+ The set of all even integers: ${2n | n in ZZ}$
|
|
+ The set of all perfect squares: ${x^2 | x in NN}$
|
|
]
|
|
|
|
We also have notation for working with sets:
|
|
|
|
With arbitrary sets $A$, $B$:
|
|
|
|
+ $a in A$ ($a$ is a member of the set $A$)
|
|
+ $a in.not A$ ($a$ is not a member of the set $A$)
|
|
+ $A subset.eq B$ (Set theory: $A$ is a subset of $B$) (Stats: $A$ is an event in the sample space $B$)
|
|
+ $A subset B$ (Proper subset: $A != B$)
|
|
+ $A^c$ or $A'$ (read "complement of $A$," and introduced later)
|
|
+ $A union B$ (Union of $A$ and $B$. Gives a set with both the elements of $A$ and $B$)
|
|
+ $A sect B$ (Intersection of $A$ and $B$. Gives a set consisting of the elements in *both* $A$ and $B$)
|
|
+ $A \\ B$ (Set difference. The set of all elements of $A$ that are not also in $B$)
|
|
+ $A times B$ (Cartesian product. Ordered pairs of $(a,b)$ $forall a in A$, $forall b in B$)
|
|
|
|
We can also write a few of these operations precisely as set comprehensions.
|
|
|
|
+ $A subset.eq B <=> forall a in A, a in B$ (every element of $A$ is also an element of $B$)
|
|
+ $A union B = {x | x in A or x in B}$ (here $or$ is the logical OR)
|
|
+ $A sect B = {x | x in A and x in B}$ (here $and$ is the logical AND)
|
|
+ $A \\ B = {a | a in A and a in.not B}$
|
|
+ $A times B = {(a,b) | a in A, b in B}$
|
|
|
|
Take a moment and convince yourself that these definitions are equivalent to
|
|
the previous ones.
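
As a quick sanity check (scribe's addition, not from lecture), these operations map directly onto Python's built-in `set` type:

```python
from itertools import product

A = {1, 2, 3}
B = {3, 4}

print(A | B)               # union: {1, 2, 3, 4}
print(A & B)               # intersection: {3}
print(A - B)               # set difference A \ B: {1, 2}
print(set(product(A, B)))  # Cartesian product: all ordered pairs (a, b)
print(A <= {1, 2, 3, 4})   # subset test: True
```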
|
|
|
|
#definition[
|
|
The universal set $Omega$ is the set of all objects in a given set
|
|
theoretical universe.
|
|
]
|
|
|
|
With the above definition, we can now introduce the set complement.
|
|
|
|
#definition[
|
|
The set complement $A'$ is given by
|
|
$
|
|
A' = Omega \\ A
|
|
$
|
|
where $Omega$ is the _universal set_.
|
|
]
|
|
|
|
#example[The real plane][
|
|
The real plane $RR^2$ can be defined as a Cartesian product of $RR$ with
|
|
itself.
|
|
|
|
$ RR^2 = RR times RR $
|
|
]
|
|
|
|
Check your intuition that this makes sense. Why do you think $RR^n$ was chosen
|
|
as the notation for $n$ dimensional spaces in $RR$?
|
|
|
|
#definition[Disjoint sets][
|
|
If $A sect B$ = $emptyset$, then we say that $A$ and $B$ are *disjoint*.
|
|
]
|
|
|
|
#fact[
|
|
For any sets $A$ and $B$, we have DeMorgan's Laws:
|
|
+ $(A union B)' = A' sect B'$
|
|
+ $(A sect B)' = A' union B'$
|
|
]
|
|
|
|
#fact[Generalized DeMorgan's][
|
|
+ $(union.big_i A_i)' = sect.big_i A_i '$
|
|
+ $(sect.big_i A_i)' = union.big_i A_i '$
|
|
]
|
|
|
|
== Sizes of infinity
|
|
|
|
#definition[
|
|
Let $N(A)$ be the number of elements in $A$. $N(A)$ is called the _cardinality_ of $A$.
|
|
]
|
|
|
|
We say a set is finite if it has finite cardinality, or infinite if it has an
|
|
infinite cardinality.
|
|
|
|
Infinite sets can be either _countably infinite_ or _uncountably infinite_.
|
|
|
|
When a set is countably infinite, its cardinality is $aleph_0$ (here $aleph$ is
|
|
the Hebrew letter aleph and read "aleph null").
|
|
|
|
When a set is uncountably infinite, its cardinality is greater than $aleph_0$.
|
|
|
|
#example("Countable sets")[
|
|
+ The natural numbers $NN$.
|
|
+ The rationals $QQ$.
|
|
+ The integers $ZZ$.
|
|
+ The set of all logical tautologies.
|
|
]
|
|
|
|
#example("Uncountable sets")[
|
|
+ The real numbers $RR$.
|
|
+ The real numbers in the interval $[0,1]$.
|
|
+ The _power set_ of $ZZ$, which is the set of all subsets of $ZZ$.
|
|
]
|
|
|
|
#remark[
|
|
All the uncountable sets above have cardinality $2^(aleph_0)$, also written
$frak(c)$ or $beth_1$ (read "beth 1"). This is the _cardinality of the
continuum_; it equals $aleph_1$ only if one assumes the continuum hypothesis.
|
|
|
|
However, in general uncountably infinite sets do not have the same
|
|
cardinality.
|
|
]
|
|
|
|
#fact[
|
|
If a set is countably infinite, then it has a bijection with $ZZ$. This means
|
|
every set with cardinality $aleph_0$ has a bijection to $ZZ$. More generally,
|
|
any sets with the same cardinality have a bijection between them.
|
|
]
|
|
|
|
This gives us the following equivalent statement:
|
|
|
|
#fact[
|
|
Two sets have the same cardinality if and only if there exists a bijective
|
|
function between them. In symbols,
|
|
|
|
$ N(A) = N(B) <==> exists F : A <-> B $
|
|
]
|
|
|
|
= Lecture #datetime(day: 8, month: 1, year: 2025).display()
|
|
|
|
== Probability
|
|
|
|
#definition[
|
|
A *random experiment* is one in which the set of all possible outcomes is known in advance, but one can't predict which outcome will occur on a given trial of the experiment.
|
|
]
|
|
|
|
#example("Finite sample spaces")[
|
|
Toss a coin:
|
|
$ Omega = {H,T} $
|
|
|
|
Roll a pair of dice:
|
|
$ Omega = {1,2,3,4,5,6} times {1,2,3,4,5,6} $
|
|
]
|
|
|
|
#example("Countably infinite sample spaces")[
|
|
Shoot a basket until you make one:
|
|
$ Omega = {M, F M, F F M, F F F M, dots} $
|
|
]
|
|
|
|
#example("Uncountably infinite sample space")[
|
|
Waiting time for a bus:
|
|
$ Omega = {t : t >= 0} $
|
|
]
|
|
|
|
#fact[
|
|
Elements of $Omega$ are called sample points.
|
|
]
|
|
|
|
#definition[
|
|
Any properly defined subset of $Omega$ is called an *event*.
|
|
]
|
|
|
|
#example[Dice][
|
|
Rolling a fair die twice, let $A$ be the event that the combined score of both dice is 10.
|
|
|
|
$ A = {(4,6), (5,5), (6,4)} $
|
|
]
|
|
|
|
Probabilistic concepts in the parlance of set theory:
|
|
|
|
- Universal set ($Omega$) $<->$ sample space
|
|
- Element $<->$ outcome / sample point ($omega$)
|
|
- Disjoint sets $<->$ mutually exclusive events
|
|
|
|
== Classical approach
|
|
|
|
Classical approach:
|
|
|
|
$ P(A) = (hash A) / (hash Omega) $
|
|
|
|
Requires equally likely outcomes and finite sample spaces.
|
|
|
|
#remark[
|
|
With an infinite sample space, this ratio assigns probability 0 to every finite event, which is often wrong.
|
|
]
|
|
|
|
#example("Dice again")[
|
|
Rolling a fair die twice, let $A$ be the event that the combined score of both dice is 10.
|
|
|
|
$
|
|
A &= {(4,6), (5,5), (6,4)} \
|
|
P(A) &= 3 / 36 = 1 / 12
|
|
$
|
|
]
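
A brute-force verification of this count (scribe's addition, not from lecture): enumerate all 36 equally likely outcomes and apply $P(A) = (hash A) / (hash Omega)$.

```python
from fractions import Fraction
from itertools import product

omega = list(product(range(1, 7), repeat=2))  # all 36 ordered rolls
A = [w for w in omega if sum(w) == 10]        # combined score of 10

print(A)                              # [(4, 6), (5, 5), (6, 4)]
print(Fraction(len(A), len(omega)))   # 1/12
```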
|
|
|
|
== Relative frequency approach
|
|
|
|
An approach done commonly by applied statisticians who work in the disgusting
|
|
real world. This is where we are generally concerned with irrelevant concerns
|
|
like accurate sampling and $p$-values and such.
|
|
$
|
|
P(A) = (hash "of times" A "occurs in large number of trials") / (hash "of trials")
|
|
$
|
|
|
|
#example[
|
|
Flipping a coin to determine the probability of it landing heads.
|
|
]
|
|
|
|
== Subjective approach
|
|
|
|
Personal definition of probability. Not "real" probability, merely co-opting
|
|
its parlance to lend credibility to subjective judgements of confidence.
|
|
|
|
== Axiomatic approach
|
|
|
|
Consider a random experiment. Then:
|
|
|
|
#definition[
|
|
The *sample space* $Omega$ is the set of all possible outcomes of the
|
|
experiment.
|
|
]
|
|
|
|
#definition[
|
|
Elements of $Omega$ are called *sample points*.
|
|
]
|
|
|
|
#definition[
|
|
Subsets of $Omega$ are called *events*. The collection of events (in other
|
|
terms, the power set of $Omega$) in $Omega$ is denoted by $cal(F)$.
|
|
]
|
|
|
|
#definition[
|
|
The *probability measure* (also called the probability distribution, or simply the probability) is a function $P$.
|
|
|
|
Let $P : cal(F) -> RR$ be a function satisfying the following axioms (properties).
|
|
|
|
+ $P(A) >= 0, forall A$
|
|
+ $P(Omega) = 1$
|
|
+ If $A_i sect A_j = emptyset, forall i != j$, then
|
|
$ P(union.big_(i=1)^infinity A_i) = sum_(i=1)^infinity P(A_i) $
|
|
]
|
|
|
|
The 3-tuple $(Omega, cal(F), P)$ is called a *probability space*.
|
|
|
|
#remark[
|
|
In more advanced texts you will see $cal(F)$ introduced as a so-called
$sigma$-algebra. A $sigma$-algebra on a set $Omega$ is a nonempty collection
$Sigma$ of subsets of $Omega$ that is closed under set complement and countable
unions, and, as a corollary, under countable intersections.
|
|
]
|
|
|
|
Now let us show various results with $P$.
|
|
|
|
#proposition[
|
|
$ P(emptyset) = 0 $
|
|
]
|
|
|
|
#proof[
|
|
Apply axiom 3 with $A_i = emptyset$ for every $i$. These sets are pairwise disjoint and their union is $emptyset$, so

$
P(emptyset) = P(union.big_(i=1)^infinity A_i) = sum^infinity_(i=1) P(A_i) = sum^infinity_(i=1) P(emptyset)
$

Suppose $P(emptyset) != 0$. Then $P(emptyset) > 0$ by axiom 1, so the right-hand sum diverges to infinity, while the left-hand side is at most $P(Omega) = 1$ by axiom 2. This is a contradiction, so $P(emptyset) = 0$.
|
|
]
|
|
|
|
#proposition[
|
|
If $A_1, A_2, ..., A_n$ are disjoint, then
|
|
$ P(union.big^n_(i=1) A_i) = sum^n_(i= 1) P(A_i) $
|
|
]
|
|
|
|
This is mostly a formal manipulation to derive the obviously true proposition from our axioms.
|
|
|
|
#proof[
|
|
Write any finite collection $(A_1, A_2, ..., A_n)$ as an infinite sequence $(A_1, A_2, ..., A_n, emptyset, emptyset, ...)$. Then by axiom 3,
|
|
$
|
|
P(union.big_(i=1)^infinity A_i) = sum^n_(i=1) P(A_i) + sum^infinity_(i=n+1) P(emptyset) = sum^n_(i=1) P(A_i)
|
|
$
|
|
And because all of the elements after $A_n$ are $emptyset$, their union adds no additional elements to the resultant union set of all $A_i$, so
|
|
$
|
|
P(union.big_(i=1)^infinity A_i) = P(union.big_(i=1)^n A_i) = sum_(i=1)^n P(A_i)
|
|
$
|
|
]
|
|
|
|
#proposition[Complement][
|
|
$ P(A') = 1 - P(A) $
|
|
]
|
|
|
|
#proof[
|
|
$
|
|
A' union A &= Omega \
|
|
A' sect A &= emptyset \
|
|
P(A' union A) &= P(A') + P(A) &"(by axiom 3)"\
|
|
P(A' union A) &= P(Omega) = 1 &"(by axiom 2)" \
|
|
therefore P(A') &= 1 - P(A)
|
|
$
|
|
]
|
|
|
|
#proposition[
|
|
$ A subset.eq B => P(A) <= P(B) $
|
|
]
|
|
|
|
#proof[
|
|
$ B = A union (A' sect B) $
|
|
|
|
but $A$ and ($A' sect B$) are disjoint, so
|
|
|
|
$
|
|
P(B) &= P(A union (A' sect B)) \
|
|
&= P(A) + P(A' sect B) \
|
|
&therefore P(B) >= P(A) "since" P(A' sect B) >= 0
|
|
$
|
|
]
|
|
|
|
#proposition("Inclusion-exclusion principle")[
|
|
$ P(A union B) = P(A) + P(B) - P(A sect B) $
|
|
]<inclusion-exclusion>
|
|
|
|
#proof[
Decompose $A$ and $B$ into disjoint pieces:
$
A = (A sect B) union (A sect B') => P(A) = P(A sect B) + P(A sect B') \
B = (A sect B) union (A' sect B) => P(B) = P(A sect B) + P(A' sect B) \
=> P(A) + P(B) - P(A sect B) = P(A sect B) + P(A sect B') + P(A' sect B)
$
The three events on the right are pairwise disjoint and their union is $A union B$, so by axiom 3 the right-hand side equals $P(A union B)$.
]
|
|
|
|
#remark[
|
|
This is a stronger result of axiom 3, which generalizes for all sets $A$ and $B$ regardless of whether they're disjoint.
|
|
]
|
|
|
|
#remark[
|
|
These are mostly intuitively true statements (think about the probabilistic
|
|
concepts represented by the sets) in classical probability that we derive
|
|
rigorously from our axiomatic probability function $P$.
|
|
]
|
|
|
|
#example[
|
|
Now let us consider some trivial concepts in classical probability written in
|
|
the parlance of combinatorial probability.
|
|
|
|
Select one card from a deck of 52 cards.
|
|
Then the following is true:
|
|
|
|
$
|
|
Omega = {1,2,...,52} \
|
|
A = "card is a heart" = {H 2, H 3, H 4, ..., H"Ace"} \
|
|
B = "card is an Ace" = {H"Ace", C"Ace", D"Ace", S"Ace"} \
|
|
C = "card is black" = {C 2, C 3, ..., C"Ace", S 2, S 3, ..., S"Ace"} \
|
|
P(A) = 13 / 52,
|
|
P(B) = 4 / 52,
|
|
P(C) = 26 / 52 \
|
|
P(A sect B) = 1 / 52 \
|
|
P(A sect C) = 0 \
|
|
P(B sect C) = 2 / 52 \
|
|
P(A union B) = P(A) + P(B) - P(A sect B) = 16 / 52 \
|
|
P(B') = 1 - P(B) = 48 / 52 \
|
|
P(A sect B') = P(A) - P(A sect B) = 13 / 52 - 1 / 52 = 12 / 52 \
|
|
P((A sect B') union (A' sect B)) = P(A sect B') + P(A' sect B) = 15 / 52 \
|
|
P(A' sect B') = P((A union B)') = 1 - P(A union B) = 36 / 52
|
|
$
|
|
]
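
All of these card probabilities can be verified by enumeration. A small Python sketch (scribe's addition; the deck encoding below is made up just for the check):

```python
from fractions import Fraction
from itertools import product

suits = ["H", "C", "D", "S"]  # hearts, clubs, diamonds, spades
ranks = ["2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K", "Ace"]
deck = list(product(suits, ranks))  # 52 cards

def P(event):
    return Fraction(len(event), len(deck))

A = [c for c in deck if c[0] == "H"]          # card is a heart
B = [c for c in deck if c[1] == "Ace"]        # card is an ace
C = [c for c in deck if c[0] in ("C", "S")]   # card is black

print(P(A), P(B), P(C))                           # 1/4 1/13 1/2
print(P([c for c in deck if c in A and c in B]))  # P(A intersect B) = 1/52
print(P(A) + P(B) - Fraction(1, 52))              # P(A union B) = 16/52 = 4/13
```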
|
|
|
|
== Countable sample spaces
|
|
|
|
#definition[
|
|
A sample space $Omega$ is said to be *countable* if it's finite or countably infinite.
|
|
]
|
|
|
|
In such a case, one can list the elements of $Omega$.
|
|
|
|
$ Omega = {omega_1, omega_2, omega_3, ...} $
|
|
with associated probabilities, $p_1, p_2, p_3,...$, where
|
|
$
|
|
p_i = P(omega_i) >= 0 \
|
|
1 = P(Omega) = sum P(omega_i)
|
|
$
|
|
|
|
#example[Fair die, again][
|
|
All outcomes are equally likely,
|
|
$ p_1 = p_2 = ... = p_6 = 1 / 6 $
|
|
Let $A$ be the event that the score is odd = ${1,3,5}$
|
|
$ P(A) = 3 / 6 $
|
|
]
|
|
|
|
#example[Loaded die][
|
|
Consider a die where the probabilities of rolling odd sides is double the probability of rolling an even side.
|
|
$
|
|
p_2 = p_4 = p_6, p_1 = p_3 = p_5 = 2p_2 \
|
|
6p_2 + 3p_2 = 9p_2 = 1 \
|
|
p_2 = 1 / 9, p_1 = 2 / 9
|
|
$
|
|
]
|
|
|
|
#example[Coins][
|
|
Toss a fair coin until you get the first head.
|
|
$
|
|
Omega = {H, T H, T T H, ...} "(countably infinite)" \
|
|
P(H) = 1 / 2 \
|
|
P(T T H) = (1 / 2)^3 \
|
|
P(Omega) = sum_(n=1)^infinity (1 / 2)^n = 1 / (1 - 1 / 2) - 1 = 1
|
|
$
|
|
]
|
|
|
|
#example[Birthdays][
|
|
What is the probability two people share the same birthday?
|
|
|
|
$
|
|
Omega = {1, 2, ..., 365} times {1, 2, ..., 365} \
|
|
P(A) = 365 / 365^2 = 1 / 365
|
|
$
|
|
]
|
|
|
|
== Continuous sample spaces
|
|
|
|
#definition[
|
|
A *continuous sample space* contains an interval in $RR$ and is uncountably infinite.
|
|
]
|
|
|
|
#definition[
|
|
A probability density function (#smallcaps[pdf]) $f$ assigns a probability _density_ $f(s)$ to each point
$s$; probabilities of events are obtained by integrating $f$ over them.
|
|
]
|
|
|
|
Properties of the #smallcaps[pdf]:
|
|
|
|
- $f(s) >= 0$ for all $s$
- $integral_S f(s) dif s = 1$, where the integral runs over the whole sample space
|
|
|
|
#example[
|
|
Waiting time for bus: $Omega = {s : s >= 0}$.
|
|
]
|
|
|
|
= Notes on counting
|
|
|
|
The cardinality of $A$ is given by $hash A$. Let us develop methods for finding
|
|
$hash A$ from a description of the set $A$ (in other words, methods for
|
|
counting).
|
|
|
|
== General multiplication principle
|
|
|
|
#fact[
|
|
Let $A$ and $B$ be finite sets, $k in ZZ^+$. Then let $f : A -> B$ be a
|
|
function such that each element in $B$ is the image of exactly $k$ elements
|
|
in $A$ (such a function is called _$k$-to-one_). Then $hash A = k dot hash
|
|
B$.
|
|
]<ktoone>
|
|
|
|
#example[
|
|
Four fully loaded 10-seater vans transported people to the picnic. How many
|
|
people were transported?
|
|
|
|
Here $A$ is the set of people, $B$ is the set of vans, and $f : A -> B$ maps a person to the van they ride in. Then $f$ is a 10-to-one function with $hash B = 4$, so by @ktoone the answer is $hash A = 10 dot 4 = 40$.
|
|
]
|
|
|
|
#definition[
|
|
An $n$-tuple is an ordered sequence of $n$ elements.
|
|
]
|
|
|
|
Many of our methods in probability rely on multiplying together multiple
|
|
outcomes to obtain their combined amount of outcomes. We make this explicit below in @tuplemultiplication.
|
|
|
|
#fact[
|
|
Suppose a set of $n$-tuples $(a_1, ..., a_n)$ obeys these rules:
|
|
|
|
+ There are $r_1$ choices for the first entry $a_1$.
|
|
+ Once the first $k$ entries $a_1, ..., a_k$ have been chosen, the number of alternatives for the next entry $a_(k+1)$ is $r_(k+1)$, regardless of the previous choices.
|
|
|
|
Then the total number of $n$-tuples is the product $r_1 dot r_2 dot dots dot r_n$.
|
|
]<tuplemultiplication>
|
|
|
|
#proof[
|
|
It is trivially true for $n = 1$ since you have $r_1$ choices of $a_1$ for a
|
|
1-tuple $(a_1)$.
|
|
|
|
Let $A$ be the set of all possible $n$-tuples and $B$ be the set of all
|
|
possible $(n+1)$-tuples. Now let us assume the statement is true for $A$.
|
|
Proceed by induction, noting that for each $n$-tuple $(a_1,
..., a_n)$ in $A$, there are $r_(n+1)$ corresponding $(n+1)$-tuples in $B$.
|
|
|
|
Let $f : B -> A$ be a function which takes each $(n+1)$-tuple and truncates the $a_(n+1)$ term, leaving us with just an $n$-tuple of the form $(a_1, a_2, ..., a_n)$.
|
|
$ f((a_1, ..., a_n, a_(n + 1))) = (a_1, ..., a_n) $
|
|
Now notice that $f$ is precisely a $r_(n+1)$-to-one function! Recall by
|
|
our assumption that @tuplemultiplication is true for $n$-tuples, so $A$ has $r_1 dot
|
|
r_2 dot ... dot r_n$ elements, or $hash A = r_1 dot ... dot r_n$. Then by
|
|
@ktoone, we have $hash B = hash A dot r_(n+1) = r_1 dot r_2 dot
|
|
... dot r_(n+1)$. Our induction is complete and we have proved @tuplemultiplication.
|
|
]
|
|
|
|
@tuplemultiplication is sometimes called the _general multiplication principle_.
|
|
|
|
We can use @tuplemultiplication to derive counting formulas for various
|
|
situations. Let $A_1, A_2, ..., A_n$ be finite sets. Then as a corollary of
@tuplemultiplication, we can count the number of $n$-tuples in the finite
Cartesian product of $A_1, A_2, ..., A_n$.
|
|
|
|
#fact[
|
|
Let $A_1, A_2, ..., A_n$ be finite sets. Then
|
|
|
|
$
|
|
hash (A_1 times A_2 times ... times A_n) = (hash A_1) dot (hash A_2) dot ... dot (hash A_n) = product^n_(i=1) (hash A_i)
|
|
$
|
|
]
|
|
|
|
#example[
|
|
How many distinct subsets does a set of size $n$ have?
|
|
|
|
The answer is $2^n$. Each subset can be encoded as an $n$-tuple with entries 0
|
|
or 1, where the $i$th entry is 1 if the $i$th element of the set is in the
|
|
subset and 0 if it is not.
|
|
|
|
Thus the number of subsets is the same as the cardinality of
|
|
$ {0,1} times ... times {0,1} = {0,1}^n $
|
|
which is $2^n$.
|
|
|
|
This is why given a set $X$ with cardinality $aleph$, we write the
|
|
cardinality of the power set of $X$ as $2^aleph$.
|
|
]
|
|
|
|
== Permutations
|
|
|
|
Now we can use the multiplication principle to count permutations.
|
|
|
|
#fact[
|
|
Consider all $k$-tuples $(a_1, ..., a_k)$ that can be constructed from a set $A$ of size $n, n>= k$ without repetition. The total number of these $k$-tuples is
|
|
$ (n)_k = n dot (n - 1) ... (n - k + 1) = n! / (n-k)! $
|
|
|
|
In particular, with $k=n$, each $n$-tuple is an ordering or _permutation_ of $A$. So the total number of permutations of a set of $n$ elements is $n!$.
|
|
]<permutation>
|
|
|
|
#proof[
|
|
We construct the $k$-tuples sequentially. For the first element, we choose
|
|
one element from $A$ with $n$ alternatives. The next element has $n - 1$
|
|
alternatives. In general, after $j$ elements are chosen, there are $n - j +
|
|
1$ alternatives.
|
|
|
|
Then clearly after choosing $k$ elements for our $k$-tuple we have by
|
|
@tuplemultiplication the number of $k$-tuples being $n dot (n - 1) dot ...
|
|
dot (n - k + 1) = (n)_k$.
|
|
]
|
|
|
|
#example[
|
|
Consider a round table with 8 seats.
|
|
|
|
+ In how many ways can we seat 8 guests around the table?
|
|
+ In how many ways can we do this if we do not differentiate between seating arrangements that are rotations of each other?
|
|
|
|
For (1), we easily see that we're simply asking for permutations of an
|
|
8-tuple, so $8!$ is the answer.
|
|
|
|
For (2), we number each person and each seat from 1-8, then always place person 1 in seat 1, and count the permutations of the other 7 people in the other 7 seats. Then the answer is $7!$.
|
|
|
|
Alternatively, notice that each arrangement has 8 equivalent arrangements under rotation. So the answer is $8!/8 = 7!$.
|
|
]
|
|
|
|
== Counting from sets
|
|
|
|
We turn our attention to sets, which unlike tuples are unordered collections.
|
|
|
|
#fact[
|
|
Let $n,k in NN$ with $0 <= k <= n$. The number of distinct subsets of size $k$ that a set of size $n$ has is given by the *binomial coefficient*
|
|
$ vec(n,k) = n! / (k! (n-k)!) $
|
|
]
|
|
|
|
#proof[
|
|
Let $A$ be a set of size $n$. By @permutation, $n!/(n-k)!$ unique ordered
|
|
$k$-tuples can be constructed from elements of $A$. Each subset of $A$ of
|
|
size $k$ has exactly $k!$ different orderings, and hence appears exactly $k!$
|
|
times among the ordered $k$-tuples. Thus the number of subsets of size $k$ is
|
|
$n! / (k! (n-k)!)$.
|
|
]
|
|
|
|
#example[
|
|
In a class there are 12 boys and 14 girls. How many different teams of 7 pupils
|
|
with 3 boys and 4 girls can be created?
|
|
|
|
First let us compute how many subsets of size 3 we can choose from the 12 boys and how many subsets of size 4 we can choose from the 14 girls.
|
|
|
|
$
|
|
"boys" &= vec(12,3) \
|
|
"girls" &= vec(14,4)
|
|
$
|
|
|
|
Then let us consider the entire team as a 2-tuple of (boys, girls). Then
|
|
there are $vec(12,3)$ alternatives for the choice of boys, and $vec(14,4)$ alternatives for
|
|
the choice of girls, so by the multiplication principle, we have the total being
|
|
|
|
$ vec(12,3) vec(14,4) $
|
|
]
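
For reference, `math.comb` evaluates these binomial coefficients directly (scribe's check, not from lecture):

```python
import math

boys = math.comb(12, 3)   # ways to choose 3 of the 12 boys
girls = math.comb(14, 4)  # ways to choose 4 of the 14 girls

print(boys, girls, boys * girls)  # 220 1001 220220
```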
|
|
|
|
#example[
|
|
Let $A = {1,2,3,4,5,6}$. Color the numbers 1, 2 red, the numbers 3, 4 green, and
the numbers 5, 6 yellow. How many different two-element subsets of $A$ are there
that have two different colors?
|
|
|
|
First choose 2 colors, $vec(3,2) = 3$. Then from each color, choose one. Altogether it's
|
|
$ vec(3,2) vec(2,1) vec(2,1) = 3 dot 2 dot 2 = 12 $
|
|
]
|
|
|
|
One way to view $vec(n,k)$ is as the number of ways of painting $n$ elements
|
|
with two colors, red and yellow, with $k$ red and $n - k$ yellow elements. Let
|
|
us generalize to more than two colors.
|
|
|
|
#fact[
|
|
Let $n$ and $r$ be positive integers and $k_1, ..., k_r$ nonnegative integers
|
|
such that $k_1 + dots.c + k_r = n$. The number of ways of assigning labels
|
|
$1,2, ..., r$ to $n$ items so that for each $i = 1, 2, ..., r$, exactly $k_i$
|
|
items receive label $i$, is the *multinomial coefficient*
|
|
|
|
$ vec(n, (k_1, k_2, ..., k_r)) = n! / (k_1 ! k_2 ! dots.c k_r !) $
|
|
]<multinomial-coefficient>
|
|
|
|
#proof[
|
|
Order the $n$ integers in some manner, and assign labels like this: for the
|
|
first $k_1$ integers, assign the label 1, then for the next $k_2$ integers,
|
|
assign the label 2, and so on. The $i$th label will be assigned to all the
|
|
integers between positions $k_1 + dots.c + k_(i-1) + 1$ and $k_1 + dots.c +
|
|
k_i$.
|
|
|
|
Then notice that all possible orderings (permutations) of the integers gives
|
|
every possible way to label the integers. However, we overcount by some
|
|
amount. How much? The order of the integers with a given label don't matter,
|
|
so we need to deduplicate those.
|
|
|
|
Each set of labels is duplicated once for each way we can order all of the
|
|
elements with the same label. For label $i$, there are $k_i$ elements with
|
|
that label, so $k_i !$ ways to order those. By @tuplemultiplication, we know
|
|
that we can express the combined amount of ways each group of $k_1, ..., k_i$
|
|
numbers are labeled as $k_1 ! k_2 ! k_3 ! dots.c k_r !$.
|
|
|
|
So by @ktoone, we can account for the duplicates and the answer is
|
|
$ n! / (k_1 ! k_2 ! k_3 ! dots.c k_r !) $
|
|
]
|
|
|
|
#remark[
|
|
@multinomial-coefficient gives us a way to count how many ways there are to
|
|
fit $n$ distinguishable objects into $r$ distinguishable containers of
|
|
varying capacity.
|
|
|
|
To find the amount of ways to fit $n$ distinguishable objects into $k$
|
|
indistinguishable containers of _any_ capacity, use the "ball-and-urn"
|
|
technique.
|
|
]
|
|
|
|
#example[
|
|
How many different ways can six people be divided into three pairs?
|
|
|
|
First we use the multinomial coefficient to count the number of ways to assign specific labels to pairs of elements:
|
|
$ vec(6, (2,2,2)) $
|
|
But notice that the actual labels themselves are irrelevant. Our multinomial
|
|
coefficient counts how many ways there are to assign 3 distinguishable
|
|
labels, say Pair 1, Pair 2, Pair 3, to our 6 elements.
|
|
|
|
To make this more explicit, say we had a 3-tuple where the position encoded
|
|
the label, where position 1 corresponds to Pair 1, and so on. Then the values
|
|
are the actual pairs of people (numbered 1-6). For instance
|
|
$ ((1,2), (3,4), (5,6)) $
|
|
corresponds to assigning the label Pair 1 to (1,2), Pair 2 to (3,4) and Pair
|
|
3 to (5,6). What our multinomial coefficient is doing is counting this,
|
|
as well as any other orderings of this tuple. For instance
|
|
$ ((3,4), (1,2), (5,6)) $
|
|
is also counted. However since in our case the actual labels are irrelevant,
|
|
the two examples shown above should really be counted only once.
|
|
|
|
How many extra times is each case counted? It turns out that we can think of
|
|
our multinomial coefficient as permuting the labels across our pairs. So in
|
|
this case it's permuting all the ways we can order 3 labels, which is $3! =
|
|
6$. That means by @ktoone our answer is
|
|
|
|
$ vec(6, (2,2,2)) / 3! = 15 $
|
|
]
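
The answer 15 is small enough to confirm by brute force. The following sketch (scribe's addition) enumerates every ordering of the 6 people, collapses each into an unlabeled set of pairs, and compares against the formula:

```python
from itertools import permutations
from math import factorial

people = range(6)
pairings = set()
for order in permutations(people):
    # pair up consecutive elements, then forget both the order within pairs
    # and the order (labels) of the pairs themselves
    pairs = frozenset(frozenset(order[i:i + 2]) for i in range(0, 6, 2))
    pairings.add(pairs)

formula = factorial(6) // (factorial(2) ** 3) // factorial(3)
print(len(pairings), formula)  # 15 15
```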
|
|
|
|
#example("Poker")[
|
|
How many poker hands are in the category _one pair_?
|
|
|
|
A one pair is a hand with two cards of the same rank and three cards with ranks
|
|
different from each other and the pair.
|
|
|
|
We can count in two ways: we count all the ordered hands, then divide by $5!$
|
|
to remove overcounting, or we can build the unordered hands directly.
|
|
|
|
When finding the ordered hands, the key is to figure out how we can encode
|
|
our information in a tuple of the form described in @tuplemultiplication, and
|
|
then use @tuplemultiplication to compute the solution.
|
|
|
|
In this case, the first element encodes the two slots in the hand of 5 our
|
|
pair occupies, the second element encodes the first card of the pair, the
|
|
third element encodes the second card of the pair, and the fourth, fifth, and
|
|
sixth elements represent the 3 cards that are not of the same rank.
|
|
|
|
Now it is clear that the number of alternatives in each position of the
|
|
6-tuple does not depend on any of the others, so @tuplemultiplication
|
|
applies. Then we can determine the amount of alternatives for each position
|
|
in the 6-tuple and multiply them to determine the total amount of ways the
|
|
6-tuple can be constructed, giving us the total amount of ways to construct
|
|
ordered poker hands with one pair.
|
|
|
|
First we choose 2 slots out of 5 positions (in the hand) so there are
|
|
$vec(5,2)$ alternatives. Then we choose any of the 52 cards for our first
|
|
pair card, so there are 52 alternatives. Then we choose any card with the
|
|
same rank for the second card in the pair, where there are 3 possible
|
|
alternatives. Then we choose the third card which must not be the same rank
|
|
as the first two, where there are 48 alternatives. The fourth card must not
|
|
be the same rank as the others, so there are 44 alternatives. Likewise, the
|
|
final card has 40 alternatives.
|
|
|
|
So the final answer is, remembering to divide by $5!$ because we don't care
|
|
about order,
|
|
$ (vec(5,2) dot 52 dot 3 dot 48 dot 44 dot 40) / 5! $
|
|
|
|
Alternatively, we can find a way to build an unordered hand with the
|
|
requirements. First we choose the rank of the pair, then we choose two suits
|
|
for that rank, then we choose the remaining 3 different ranks, and finally a
|
|
suit for each of the ranks. Then, noting that we will now omit constructing
|
|
the tuple and explicitly listing alternatives for brevity, we have
|
|
$ 13 dot vec(4,2) dot vec(12, 3) dot 4^3 $
|
|
|
|
Both approaches give the same answer.
|
|
]
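
Both counts are easy to check numerically (scribe's addition, not from lecture):

```python
import math

ordered = math.comb(5, 2) * 52 * 3 * 48 * 44 * 40   # ordered one-pair hands
unordered = ordered // math.factorial(5)             # remove the 5! orderings

direct = 13 * math.comb(4, 2) * math.comb(12, 3) * 4 ** 3  # build unordered hands

print(unordered, direct)    # 1098240 1098240
print(unordered == direct)  # True
```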
|
|
|
|
= Bayes' theorem and conditional probability
|
|
|
|
== Conditional probability, partitions, law of total probability
|
|
|
|
Sometimes we want to analyze the probability of events in a sample space given
|
|
that we already know another event has occurred. Ergo, we want the probability
|
|
of an event $A subset.eq Omega$ conditional on another event $B subset.eq Omega$.
|
|
|
|
#definition[
|
|
For two events $A, B subset.eq Omega$, the probability of $A$ given $B$ is written
|
|
$
|
|
P(A | B)
|
|
$
|
|
]
|
|
|
|
#fact[
|
|
To calculate the conditional probability, use the following formula:
|
|
$
|
|
P(A | B) = (P(A sect B)) / (P(B)), quad "provided" P(B) > 0
|
|
$
|
|
]
|
|
|
|
Oftentimes we don't know $P(B)$, but we do know $P(B)$ given some events in
|
|
$Omega$. That is, we know the probability of $B$ conditional on some events.
|
|
For example, if we have a 50% chance of choosing a rigged (6-sided) die and a
|
|
50% chance of choosing a fair die, we know the probability of getting side $n$
|
|
given that we have the rigged die, and the probability of side $n$ given that
|
|
we have the fair die. Also note that we know the probability of both events
|
|
we're conditioning on (50% each), and they're disjoint events.
|
|
|
|
In these situations, the following law is useful:
|
|
|
|
#theorem[Law of total probability][
|
|
Given a _partition_ of $Omega$ into pairwise disjoint events $A_1, A_2, A_3, ..., A_n subset.eq Omega$, such that
$
union.big_(i=1)^n A_i = Omega \
A_i sect A_j = emptyset "for all" i != j
$
|
|
The probability of an event $B subset.eq Omega$ is given by
|
|
$
|
|
P(B) = P(B | A_1) P(A_1) + P(B | A_2) P(A_2) + dots.c + P(B | A_n) P(A_n)
|
|
$
|
|
]<law-total-prob>
|
|
|
|
#proof[
|
|
Write $B$ as the disjoint union $B = union.big_(i=1)^n (B sect A_i)$. Then, by axiom 3 and the definition of conditional probability,
$
P(B) = sum_(i=1)^n P(B sect A_i) = sum_(i=1)^n P(B | A_i) P(A_i)
$
|
|
]
|
|
|
|
== Bayes' theorem
|
|
|
|
Finally let's discuss a rule for inverting conditional probabilities, that is,
|
|
getting $P(B | A)$ from $P(A | B)$.
|
|
|
|
#theorem[Bayes' theorem][
|
|
Given two events $A, B subset.eq Omega$,
|
|
$
|
|
P(A | B) = (P(B | A)P(A)) / (P(B | A)P(A) + P(B | A^c)P(A^c))
|
|
$
|
|
]
|
|
|
|
#proof[
|
|
Apply the definition of conditional probability, then apply @law-total-prob
|
|
noting that $A$ and $A^c$ are a partitioning of $Omega$.
|
|
]
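
To make the fair/rigged die story from earlier concrete, here is a numerical sketch (scribe's addition). The rigged die's chance of rolling a six is a hypothetical number chosen only to illustrate the two formulas:

```python
# P(rigged) = P(fair) = 0.5; suppose (hypothetically) the rigged die rolls a
# six with probability 0.5, while the fair die does so with probability 1/6.
p_rigged, p_fair = 0.5, 0.5
p_six_given_rigged = 0.5
p_six_given_fair = 1 / 6

# law of total probability
p_six = p_six_given_rigged * p_rigged + p_six_given_fair * p_fair

# Bayes' theorem: probability we hold the rigged die, given we rolled a six
p_rigged_given_six = p_six_given_rigged * p_rigged / p_six

print(round(p_six, 4), round(p_rigged_given_six, 4))  # 0.3333 0.75
```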
|
|
|
|
= Lecture #datetime(day: 23, month: 1, year: 2025).display()
|
|
|
|
== Independence
|
|
#definition("Independence")[
|
|
Two events $A subset Omega$ and $B subset Omega$ are independent if and only if
|
|
$ P(B sect A) = P(B)P(A) $
|
|
"Joint probability is equal to product of their marginal probabilities."
|
|
]
|
|
|
|
#fact[
|
|
This definition must be used to show the independence of two events.
|
|
]
|
|
|
|
#fact[
|
|
If $A$ and $B$ are independent, then,
|
|
$
|
|
P(A | B) = underbrace((P(A sect B)) / P(B), "conditional probability") = (P(A) P(B)) / P(B) = P(A)
|
|
$
|
|
]
|
|
|
|
#example[
|
|
Flip a fair coin 3 times. Let the events:
|
|
|
|
- $A$ = we have exactly one tails among the first 2 flips
|
|
- $B$ = we have exactly one tails among the last 2 flips
|
|
- $D$ = we get exactly one tails among all 3 flips
|
|
|
|
Show that $A$ and $B$ are independent.
|
|
What about $B$ and $D$?
|
|
|
|
Compute all of the possible events, then we see that
|
|
|
|
$
|
|
P(A sect B) = (hash (A sect B)) / (hash Omega) = 2 / 8 = 4 / 8 dot 4 / 8 = P(A) P(B)
|
|
$
|
|
|
|
So they are independent.
|
|
|
|
Repeat the same reasoning for $B$ and $D$, we see that they are not independent.
|
|
]
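
Enumerating the 8 equally likely outcomes makes this check mechanical (scribe's addition):

```python
from fractions import Fraction
from itertools import product

omega = list(product("HT", repeat=3))

def P(event):
    return Fraction(len(event), len(omega))

A = [w for w in omega if w[:2].count("T") == 1]  # one tails in first two flips
B = [w for w in omega if w[1:].count("T") == 1]  # one tails in last two flips
D = [w for w in omega if w.count("T") == 1]      # one tails among all three

AB = [w for w in A if w in B]
BD = [w for w in B if w in D]

print(P(AB), P(A) * P(B))  # 1/4 1/4  -> A, B independent
print(P(BD), P(B) * P(D))  # 1/4 3/16 -> B, D not independent
```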
|
|
|
|
#example[
|
|
Suppose we have 4 red and 7 green balls in an urn. We choose two balls with replacement. Let
|
|
|
|
- $A$ = the first ball is red
|
|
- $B$ = the second ball is green
|
|
|
|
Are $A$ and $B$ independent?
|
|
|
|
$
|
|
hash Omega = 11 times 11 = 121 \
|
|
hash A = 4 dot 11 = 44 \
|
|
hash B = 11 dot 7 = 77 \
|
|
hash (A sect B) = 4 dot 7 = 28
$

So $P(A sect B) = 28 / 121 = 44 / 121 dot 77 / 121 = P(A) P(B)$, and therefore $A$ and $B$ are independent.
]
|
|
|
|
#definition[
|
|
Events $A_1, ..., A_n$ are independent (mutually independent) if for every collection $A_i_1, ..., A_i_k$, where $2 <= k <= n$ and $1 <= i_1 < i_2 < dots.c < i_k <= n$,
|
|
|
|
$
|
|
P(A_i_1 sect A_i_2 sect dots.c sect A_i_k) = P(A_i_1) P(A_i_2) dots.c P(A_i_k)
|
|
$
|
|
]
|
|
|
|
#definition[
|
|
We say that the events $A_1, ..., A_n$ are *pairwise independent* if any two
|
|
different events $A_i$ and $A_j$ are independent for any $i != j$.
|
|
]
|
|
|
|
= A bit of review on random variables
|
|
|
|
== Random variables, discrete random variables
|
|
|
|
First, some brief exposition on random variables. Quixotically, a random
|
|
variable is actually a function.
|
|
|
|
Standard notation: $Omega$ is a sample space, $omega in Omega$ is an outcome (sample point).
|
|
|
|
#definition[
|
|
A *random variable* $X$ is a function $X : Omega -> RR$ that takes the set of
|
|
possible outcomes in a sample space, and maps it to a
|
|
#link("https://en.wikipedia.org/wiki/Measurable_space")[measurable space],
|
|
typically (as in our case) a subset of $RR$.
|
|
]
|
|
|
|
#definition[
|
|
The *state space* or *support* of a random variable $X$ is all of the values $X$ can take.
|
|
]
|
|
|
|
#example[
|
|
Let $X$ be a random variable that takes on the values ${0,1,2,3}$. Then the
|
|
state space of $X$ is the set ${0,1,2,3}$.
|
|
]
|
|
|
|
The probability distribution of $X$ carries its important probabilistic
information: it describes the probabilities $P(X in B)$ for subsets $B subset.eq RR$. We
describe it using the probability mass (or density) function and the cumulative
distribution function.
|
|
|
|
A random variable $X$ is discrete if there is a countable set $A$ such that $P(X in
|
|
A) = 1$. $k$ is a possible value if $P(X = k) > 0$.
|
|
|
|
A discrete random variable has its probability distribution entirely determined by
its p.m.f. $p(k) = P(X = k)$. The p.m.f. is a function from the set of possible
values of $X$ into $[0,1]$. When we need to indicate which random variable the
p.m.f. belongs to, we write $p_X (k)$.
|
|
|
|
By the axioms of probability,
|
|
|
|
$
|
|
sum_k p_X (k) = sum_k P(X=k) = 1
|
|
$
|
|
|
|
For a subset $B subset RR$,
|
|
|
|
$
|
|
P(X in B) = sum_(k in B) p_X (k)
|
|
$
|
|
|
|
== Continuous random variables
|
|
|
|
Now we introduce another major class of random variables.
|
|
|
|
#definition[
|
|
Let $X$ be a random variable. If $f$ satisfies
|
|
|
|
$
|
|
P(X <= b) = integral^b_(-infinity) f(x) dif x
|
|
$
|
|
|
|
for all $b in RR$, then $f$ is the *probability density function* of $X$.
|
|
]
|
|
|
|
The probability that $X in (-infinity, b]$ is equal to the area under the graph
|
|
of $f$ from $-infinity$ to $b$.
|
|
|
|
A corollary is the following.
|
|
|
|
#fact[
|
|
$ P(X in B) = integral_B f(x) dif x $
for any $B subset RR$ where integration makes sense.
]
|
|
|
|
The set can be bounded or unbounded, or any collection of intervals.
|
|
|
|
#fact[
|
|
$ P(a <= X <= b) = integral_a^b f(x) dif x $
|
|
$ P(X > a) = integral_a^infinity f(x) dif x $
|
|
]
|
|
|
|
#fact[
|
|
If a random variable $X$ has density function $f$ then individual point
|
|
values have probability zero:
|
|
|
|
$ P(X = c) = integral_c^c f(x) dif x = 0, forall c in RR $
|
|
]
|
|
|
|
#remark[
|
|
It follows that a random variable with a density function is not discrete. Also
|
|
the probabilities of intervals are not changed by including or excluding
|
|
endpoints.
|
|
]
|
|
|
|
How to determine which functions are p.d.f.s? Since $P(-infinity < X <
|
|
infinity) = 1$, a p.d.f. $f$ must satisfy
|
|
|
|
$
|
|
f(x) >= 0 forall x in RR \
|
|
integral^infinity_(-infinity) f(x) dif x = 1
|
|
$
|
|
|
|
#fact[
|
|
Random variables with density functions are called _continuous_ random
|
|
variables. This does not imply that the random variable is a continuous
|
|
function on $Omega$ but it is standard terminology.
|
|
]
|
|
|
|
Named distributions of continuous random variables are introduced in the
|
|
following chapters.
|
|
|
|
= Lecture #datetime(day: 27, year: 2025, month: 1).display()
|
|
|
|
== Bernoulli trials
|
|
|
|
The setup: the experiment has exactly two outcomes:
|
|
- Success -- $S$ or 1
|
|
- Failure -- $F$ or 0
|
|
|
|
Additionally:
|
|
$
|
|
P(S) = p, (0 < p < 1) \
|
|
P(F) = 1 - p = q
|
|
$
|
|
|
|
Construct the probability mass function:
|
|
|
|
$
|
|
P(X = 1) = p \
|
|
P(X = 0) = 1 - p
|
|
$
|
|
|
|
Write it as:
|
|
|
|
$ p_X (k) = p^k (1-p)^(1-k) $
|
|
|
|
for $k = 1$ and $k = 0$.
|
|
|
|
== Binomial distribution
|
|
|
|
The setup: very similar to Bernoulli, trials have exactly 2 outcomes. A bunch
|
|
of Bernoulli trials in a row.
|
|
|
|
Importantly: $p$ and $q$ are defined exactly the same in all trials.
|
|
|
|
This ties the binomial distribution to the sampling with replacement model,
|
|
since each trial does not affect the next.
|
|
|
|
We conduct $n$ *independent* trials of this experiment. Example with coins: each
|
|
flip independently has a $1/2$ chance of heads or tails (holds same for die,
|
|
rigged coin, etc).
|
|
|
|
$n$ is fixed, i.e. known ahead of time.
|
|
|
|
== Binomial random variable
|
|
|
|
Let $X = hash$ of successes in $n$ independent trials. The sample space consists
of all sequences $omega$ of successes and failures of the form $omega = S
F F dots.c F$, each of length $n$.
|
|
|
|
Then $X(omega) = 0,1,2,...,n$ can take $n + 1$ possible values. The
|
|
probability of any particular sequence is given by the product of the
|
|
individual trial probabilities.
|
|
|
|
#example[
|
|
$ omega = S F F S F dots.c S => P({omega}) = p q q p q dots.c p $
|
|
]
|
|
|
|
So $P(X = 0) = P(F F F dots.c F) = q dot q dot dots.c dot q = q^n$.
|
|
|
|
And
|
|
$
|
|
P(X = 1) = P(S F F dots.c F) + P(F S F F dots.c F) + dots.c + P(F F F dots.c F S) \
|
|
= underbrace(n, "possible outcomes") dot p^1 dot q^(n-1) \
= vec(n, 1) dot p^1 dot q^(n-1) \
= n dot p^1 dot q^(n-1)
|
|
$
|
|
|
|
Now we can generalize
|
|
|
|
$
|
|
P(X = 2) = vec(n,2) p^2 q^(n-2)
|
|
$
|
|
|
|
How about all successes?
|
|
|
|
$
|
|
P(X = n) = P(S S dots.c S) = p^n
|
|
$
|
|
|
|
We see that for all failures we have $q^n$ and all successes we have $p^n$.
|
|
Otherwise we use our method above.
|
|
|
|
In general, here is the probability mass function for the binomial random variable
|
|
|
|
$
|
|
P(X = k) = vec(n, k) p^k q^(n-k), "for" k = 0,1,2,...,n
|
|
$
|
|
|
|
|
|
Binomial distribution is very powerful. Choosing between two things, what are the probabilities?
|
|
|
|
To summarize the characterization of the binomial random variable:
|
|
|
|
- $n$ independent trials
|
|
- each trial results in binary success or failure
|
|
- with probability of success $p$, identically across trials
|
|
|
|
with $X = hash$ successes in *fixed* $n$ trials.
|
|
|
|
$ X ~ "Bin"(n,p) $
|
|
|
|
with probability mass function
|
|
|
|
$
|
|
P(X = x) = vec(n,x) p^x (1 - p)^(n-x) = p(x) "for" x = 0,1,2,...,n
|
|
$
|
|
|
|
We see this is in fact the binomial theorem!
|
|
|
|
$
|
|
p(x) >= 0, sum^n_(x=0) p(x) = sum^n_(x=0) vec(n,x) p^x q^(n-x) = (p + q)^n
|
|
$
|
|
|
|
In fact,
|
|
$
|
|
(p + q)^n = (p + (1 - p))^n = 1
|
|
$
|
|
|
|
#example[
|
|
A family has 5 children. What is the probability that exactly 2 are male, if we
assume births are independent and the probability of a male is 0.5?
|
|
|
|
First we check binomial criteria: $n$ independent trials, well formed
|
|
$S$/$F$, probability the same across trials. Let's say male is $S$ and
|
|
otherwise $F$.
|
|
|
|
We have $n=5$ and $p = 0.5$. We just need $P(X = 2)$.
|
|
|
|
$
|
|
P(X = 2) = vec(5,2) (0.5)^2 (0.5)^3 \
|
|
= (5 dot 4) / (2 dot 1) (1 / 2)^5 = 10 / 32
|
|
$
|
|
]
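
The same computation with `math.comb` (scribe's check, not from lecture):

```python
import math

def binom_pmf(k, n, p):
    # P(X = k) for X ~ Bin(n, p)
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

print(binom_pmf(2, 5, 0.5))  # 0.3125 == 10/32
```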
|
|
|
|
#example[
|
|
What is the probability of getting exactly three aces (1's) out of 10 throws
|
|
of a fair die?
|
|
|
|
Seems a little trickier but we can still write this as well defined $S$/$F$.
|
|
Let $S$ be getting an ace and $F$ being anything else.
|
|
|
|
Then $p = 1/6$ and $n = 10$. We want $P(X=3)$. So
|
|
|
|
$
|
|
P(X=3) = vec(10,3) p^3 q^7 = vec(10,3) (1 / 6)^3 (5 / 6)^7 \
|
|
approx 0.15505
|
|
$
|
|
]
|
|
|
|
#example[
|
|
Suppose we have two types of candy, red and black: $a$ red and $b$ black. Select $n$ candies. Let $X$
be the number of red candies among the $n$ selected.
|
|
|
|
2 cases.
|
|
|
|
- case 1: with replacement: Binomial Distribution, $n$, $p = a/(a + b)$.
|
|
$ P(X = 2) = vec(n,2) (a / (a+b))^2 (b / (a+b))^(n-2) $
|
|
- case 2: without replacement: then use counting
|
|
$ P(X = x) = (vec(a,x) vec(b,n-x)) / vec(a+b,n) = p(x) $
|
|
]
|
|
|
|
We've done case 2 before, but now we introduce a random variable to represent
|
|
it.
|
|
|
|
$ P(X = x) = (vec(a,x) vec(b,n-x)) / vec(a+b,n) = p(x) $
|
|
|
|
is known as a *Hypergeometric distribution*.
|
|
|
|
== Hypergeometric distribution
|
|
|
|
There are different characterizations of the parameters, but
|
|
|
|
$ X ~ "Hypergeom"(hash "total", hash "successes", "sample size") $
|
|
|
|
For example,
|
|
$ X ~ "Hypergeom"(N, a, n) "where" N = a+b $
|
|
|
|
In the textbook, it's
|
|
$ X ~ "Hypergeom"(N, N_a, n) $
|
|
|
|
#remark[
|
|
If the sample size $n$ is very small relative to $a + b$, then both cases give similar (approx.
the same) answers.
|
|
]
|
|
|
|
For instance, if we're sampling for blood types from UCSB, and we take a
|
|
student out without replacement, we don't really change the remaining population
|
|
substantially. So both answers give a similar result.
|
|
|
|
Suppose we have two types of items, type $A$ and type $B$. Let $N_A$ be $hash$
|
|
type $A$, $N_B$ $hash$ type $B$. $N = N_A + N_B$ is the total number of
|
|
objects.
|
|
|
|
We sample $n$ items *without replacement* ($n <= N$) with order not mattering.
|
|
Denote by $X$ the number of type $A$ objects in our sample.
|
|
|
|
#definition[
|
|
Let $0 <= N_A <= N$ and $1 <= n <= N$ be integers. A random variable $X$ has the *hypergeometric distribution* with parameters $(N, N_A, n)$ if $X$ takes values in the set ${0,1,...,n}$ and has p.m.f.
|
|
|
|
$ P(X = k) = (vec(N_A,k) vec(N-N_A,n-k)) / vec(N,n) = p(k) $
|
|
]
|
|
|
|
#example[
|
|
Let $N_A = 10$ defectives. Let $N_B = 90$ non-defectives. We select $n=5$ without replacement. What is the probability that 2 of the 5 selected are defective?
|
|
|
|
$
|
|
X ~ "Hypergeom" (N = 100, N_A = 10, n = 5)
|
|
$
|
|
|
|
We want $P(X=2)$.
|
|
|
|
$
|
|
P(X=2) = (vec(10,2) vec(90,3)) / vec(100,5) approx 0.0702
|
|
$
|
|
]
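
A direct evaluation of the hypergeometric p.m.f. (scribe's check):

```python
import math

def hypergeom_pmf(k, N, N_A, n):
    # P(X = k) for X ~ Hypergeom(N, N_A, n)
    return math.comb(N_A, k) * math.comb(N - N_A, n - k) / math.comb(N, n)

print(round(hypergeom_pmf(2, 100, 10, 5), 4))  # ~0.0702
```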
|
|
|
|
#remark[
|
|
Make sure you can distinguish when a problem is binomial or when it is
|
|
hypergeometric. This is very important on exams.
|
|
|
|
Recall that both ask about number of successes, in a fixed number of trials.
|
|
But binomial is sample with replacement (each trial is independent) and
|
|
sampling without replacement is hypergeometric.
|
|
]
|
|
|
|
#example[
|
|
A cat gives birth to 6 kittens: 2 are male and 4 are female. Your neighbor comes and picks up 3 kittens randomly to take home with them.
|
|
|
|
How to define random variable? What is p.m.f.?
|
|
|
|
Let $X$ be the number of male cats in the neighbor's selection.
|
|
|
|
$ X ~ "Hypergeom"(N = 6, N_A = 2, n = 3) $
|
|
and $X$ takes values in ${0,1,2}$. Find the p.m.f. by finding probabilities for these values.
|
|
|
|
$
|
|
&P(X = 0) = (vec(2,0) vec(4,3)) / vec(6,3) = 4 / 20 \
|
|
&P(X = 1) = (vec(2,1) vec(4,2)) / vec(6,3) = 12 / 20 \
|
|
&P(X = 2) = (vec(2,2) vec(4,1)) / vec(6,3) = 4 / 20 \
|
|
&P(X = 3) = (vec(2,3) vec(4,0)) / vec(6,3) = 0
|
|
$
|
|
|
|
Note that for $P(X=3)$, we are asking for 3 successes (drawing males) where
|
|
there are only 2 males, so it must be 0.
|
|
]
|
|
|
|
== Geometric distribution
|
|
|
|
Consider an infinite sequence of independent trials. e.g. number of attempts until I make a basket.
|
|
|
|
Let $X_i$ denote the outcome of the $i^"th"$ trial, where success is 1 and failure is 0. Let $N$ be the number of trials needed to observe the first success in a sequence of independent trials with probability of success $p$.
|
|
|
|
We fail $k-1$ times and succeed on the $k^"th"$ try. Then:
|
|
|
|
$
|
|
P(N = k) = P(X_1 = 0, X_2 = 0, ..., X_(k-1) = 0, X_k = 1) = (1 - p)^(k-1) p
|
|
$
|
|
|
|
This is the probability of failure raised to the number of failures, times the
probability of success.
|
|
|
|
The key characteristic of these trials is that we keep going until we succeed. There's
|
|
no $n$ choose $k$ in front like the binomial distribution because there's
|
|
exactly one sequence that gives us success.
|
|
|
|
#definition[
|
|
Let $0 < p <= 1$. A random variable $X$ has the geometric distribution with
|
|
success parameter $p$ if the possible values of $X$ are ${1,2,3,...}$ and $X$
|
|
satisfies
|
|
|
|
$
|
|
P(X=k) = (1-p)^(k-1) p
|
|
$
|
|
|
|
for positive integers $k$. Abbreviate this by $X ~ "Geom"(p)$.
|
|
]
|
|
|
|
#example[
|
|
What is the probability it takes more than seven rolls of a fair die to roll a
|
|
six?
|
|
|
|
Let $X$ be the number of rolls of a fair die until the first six. Then $X ~
|
|
"Geom"(1/6)$. Now we just want $P(X > 7)$.
|
|
|
|
$
|
|
P(X > 7) = sum^infinity_(k=8) P(X=k) = sum^infinity_(k=8) (5 / 6)^(k-1) 1 / 6
|
|
$
|
|
|
|
Re-indexing,
|
|
|
|
$
|
|
sum^infinity_(k=8) (5 / 6)^(k-1) 1 / 6 = 1 / 6 (5 / 6)^7 sum^infinity_(j=0) (5 / 6)^j
|
|
$
|
|
|
|
Now we calculate by standard methods:
|
|
|
|
$
|
|
1 / 6 (5 / 6)^7 sum^infinity_(j=0) (5 / 6)^j = 1 / 6 (5 / 6)^7 dot 1 / (1-5 / 6) =
|
|
(5 / 6)^7
|
|
$
|
|
]
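
Numerically (scribe's check), a truncated partial sum of the p.m.f. agrees with the closed form $(5/6)^7$:

```python
p, q = 1 / 6, 5 / 6

closed_form = q ** 7
partial_sum = sum(q ** (k - 1) * p for k in range(8, 2000))  # truncate the tail

print(round(closed_form, 6), round(partial_sum, 6))  # both ~0.279082
```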
|
|
|
|
= Some more discrete distributions
|
|
|
|
== Negative binomial
|
|
|
|
Consider a sequence of Bernoulli trials with the following characteristics:
|
|
|
|
- Each trial success or failure
|
|
- Prob. of success $p$ is same on each trial
|
|
- Trials are independent (notice they are not fixed to specific number)
|
|
- Experiment continues until $r$ successes are observed, where $r$ is a given parameter
|
|
|
|
Then if $X$ is the number of trials necessary until $r$ successes are observed,
|
|
we say $X$ is a *negative binomial* random variable.
|
|
|
|
#definition[
|
|
Let $k in ZZ^+$ and $0 < p <= 1$. A random variable $X$ has the negative
|
|
binomial distribution with parameters ${k,p}$ if the possible values of $X$
|
|
are the integers ${k,k+1, k+2, ...}$ and the p.m.f. is
|
|
|
|
$
|
|
P(X = n) = vec(n-1, k-1) p^k (1-p)^(n-k) "for" n >= k
|
|
$
|
|
|
|
Abbreviate this by $X ~ "Negbin"(k,p)$.
|
|
]
|
|
|
|
#example[
|
|
Steph Curry has a three point percentage of approx. $43%$. What is the
|
|
probability that Steph makes his third three-point basket on his $5^"th"$
|
|
attempt?
|
|
|
|
Let $X$ be number of attempts required to observe the 3rd success. Then,
|
|
|
|
$
|
|
X ~ "Negbin"(k = 3, p = 0.43)
|
|
$
|
|
|
|
So,
|
|
$
|
|
P(X = 5) &= vec(5-1,3-1)(0.43)^3 (1 - 0.43)^(5-3) \
|
|
&= vec(4,2) (0.43)^3 (0.57)^2 \
|
|
&approx 0.155
|
|
$
|
|
]
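
Checking the arithmetic (scribe's addition):

```python
import math

def negbin_pmf(n, k, p):
    # probability that the k-th success occurs on trial n
    return math.comb(n - 1, k - 1) * p ** k * (1 - p) ** (n - k)

print(round(negbin_pmf(5, 3, 0.43), 3))  # ~0.155
```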
|
|
|
|
== Poisson distribution
|
|
|
|
This p.m.f. follows from the Taylor expansion
|
|
|
|
$
|
|
e^lambda = sum_(k=0)^infinity lambda^k / k!
|
|
$
|
|
|
|
which implies that
|
|
|
|
$
|
|
sum_(k=0)^infinity e^(-lambda) lambda^k / k! = e^(-lambda) e^lambda = 1
|
|
$
|
|
|
|
#definition[
|
|
For an integer valued random variable $X$, we say $X ~ "Poisson"(lambda)$ if it has p.m.f.
|
|
|
|
$ P(X = k) = e^(-lambda) lambda^k / k! $
|
|
|
|
for $k in {0,1,2,...}$ for $lambda > 0$ and
|
|
|
|
$
|
|
sum_(k = 0)^infinity P(X=k) = 1
|
|
$
|
|
]
|
|
|
|
The Poisson arises from the Binomial. It applies in the binomial context when
|
|
$n$ is very large ($n >= 100$) and $p$ is very small $p <= 0.05$, such that $n
|
|
p$ is a moderate number ($n p < 10$).
|
|
|
|
Then $X$ follows a Poisson distribution with $lambda = n p$.
|
|
|
|
$
|
|
P("Bin"(n,p) = k) approx P("Poisson"(lambda = n p) = k)
|
|
$
|
|
|
|
for $k = 0,1,...,n$.
|
|
|
|
The Poisson distribution is useful for finding the probabilities of rare events
|
|
over a continuous interval of time. By knowing $lambda = n p$ (for large $n$ and
small $p$), we can calculate many probabilities.
|
|
|
|
#example[
|
|
The number of typing errors in the page of a textbook.
|
|
|
|
Let
|
|
|
|
- $n$ be the number of letters or symbols per page (large)
|
|
- $p$ be the probability of error, small enough such that
|
|
- $lim_(n -> infinity) lim_(p -> 0) n p = lambda = 0.1$
|
|
|
|
What is the probability of exactly 1 error?
|
|
|
|
We can approximate the distribution of $X$ with a $"Poisson"(lambda = 0.1)$
|
|
distribution
|
|
|
|
$
|
|
P(X = 1) = (e^(-0.1) (0.1)^1) / 1! = 0.09048
|
|
$
|
|
]
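
To see the binomial-to-Poisson approximation numerically, compare the two p.m.f.s for a hypothetical page with $n = 1000$ symbols and error probability $p = 0.0001$ (so $n p = 0.1$); these particular numbers are the scribe's, not the lecture's:

```python
import math

n, p = 1000, 0.0001  # hypothetical page: np = 0.1
lam = n * p

binom = math.comb(n, 1) * p ** 1 * (1 - p) ** (n - 1)
poisson = math.exp(-lam) * lam ** 1 / math.factorial(1)

print(round(binom, 5), round(poisson, 5))  # ~0.09049 vs ~0.09048
```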
|
|
|
|
#example[
|
|
The number of reported auto accidents in a big city on any given day
|
|
|
|
Let
|
|
|
|
- $n$ be the number of autos on the road
|
|
- $p$ be the probability of an accident for any individual is small such that
|
|
$lim_(n->infinity) lim_(p->0) n p = lambda = 2$
|
|
|
|
What is the probability of no accidents today?
|
|
|
|
We can approximate $X$ by $"Poisson"(lambda = 2)$
|
|
|
|
$
|
|
P(X = 0) = (e^(-2) (2)^0) / 0! = 0.1353
|
|
$
|
|
]
|
|
|
|
A discrete example:
|
|
|
|
#example[
|
|
Suppose we have an election with candidates $B$ and $W$. A total of 10,000
|
|
ballots were cast such that
|
|
|
|
$
|
|
10,000 "votes" cases(5005 space B, 4995 space W)
|
|
$
|
|
|
|
But 15 ballots had irregularities and were disqualified. What is the
|
|
probability that the election results will change?
|
|
|
|
There are three combinations of disqualified ballots that would result in a
|
|
different election outcome: 13 $B$ and 2 $W$, 14 $B$ and 1 $W$, and 15 $B$
|
|
and 0 $W$. What is the probability of these?
|
|
]
|
|
|
|
= Lecture #datetime(day: 3, month: 2, year: 2025).display()
|
|
|
|
== CDFs, PMFs, PDFs
|
|
|
|
#definition[
|
|
Let $X$ be a random variable. If we have a function $f$ such that
|
|
|
|
$
|
|
P(X <= b) = integral^b_(-infinity) f(x) dif x
|
|
$
|
|
for all $b in RR$, then $f$ is the *probability density function* of $X$.
|
|
]
|
|
|
|
The probability that the value of $X$ lies in $(-infinity, b]$ equals the area
|
|
under the curve of $f$ from $-infinity$ to $b$.
|
|
|
|
If $f$ satisfies this definition, then for any $B subset RR$ for which integration makes sense,
|
|
|
|
$
|
|
P(X in B) = integral_B f(x) dif x
|
|
$
|
|
|
|
Properties of a CDF:
|
|
|
|
Any CDF $F(x) = P(X <= x)$ satisfies
|
|
|
|
1. $F(-infinity) = 0$, $F(infinity) = 1$
|
|
2. $F(x)$ is non-decreasing in $x$ (monotonically non-decreasing)
|
|
$ s < t => F(s) <= F(t) $
|
|
3. $P(a < X <= b) = P(X <= b) - P(X <= a) = F(b) - F(a)$
|
|
|
|
#example[
|
|
Let $X$ be a continuous random variable with density (pdf)
|
|
|
|
$
|
|
f(x) = cases(
|
|
c x^2 &"for" 0 < x < 2,
|
|
0 &"otherwise"
|
|
)
|
|
$
|
|
|
|
1. What is $c$?
|
|
|
|
$c$ is such that
|
|
$
|
|
1 = integral^infinity_(-infinity) f(x) dif x = integral_0^2 c x^2 dif x = (8 c) / 3, quad "so" c = 3 / 8
|
|
$
|
|
|
|
2. Find the probability that $X$ is between 1 and 1.4.
|
|
|
|
Integrate the curve between 1 and 1.4.
|
|
|
|
$
|
|
integral_1^1.4 3 / 8 x^2 dif x = (x^3 / 8) |_1^1.4 \
|
|
= 0.218
|
|
$
|
|
|
|
This is the probability that $X$ lies between 1 and 1.4.
|
|
|
|
3. Find the probability that $X$ is between 1 and 3.
|
|
|
|
Idea: integrate between 1 and 3, be careful after 2.
|
|
|
|
$ integral^2_1 3 / 8 x^2 dif x + integral_2^3 0 dif x = 7 / 8 $
|
|
|
|
4. What is the CDF for $P(X <= x)$? Integrate the curve to $x$.
|
|
|
|
$
|
|
F(x) = P(X <= x) = integral_(-infinity)^x f(t) dif t \
|
|
= integral_0^x 3 / 8 t^2 dif t \
|
|
= x^3 / 8
|
|
$
|
|
|
|
Important: include the range!
|
|
|
|
$
|
|
F(x) = cases(
|
|
0 &"for" x <= 0,
|
|
x^3/8 &"for" 0 < x < 2,
|
|
1 &"for" x >= 2
|
|
)
|
|
$
|
|
|
|
5. Find a point $a$ such that you integrate up to the point to find exactly $1/2$
|
|
the area.
|
|
|
|
We want to find $1/2 = P(X <= a)$.
|
|
|
|
$ 1 / 2 = P(X <= a) = F(a) = a^3 / 8 => a = root(3, 4) $
|
|
]
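
A crude numerical cross-check of these answers (scribe's addition), using a midpoint Riemann sum in place of exact integration:

```python
def f(x):
    return 3 / 8 * x ** 2 if 0 < x < 2 else 0.0

def integrate(a, b, steps=10_000):
    # simple midpoint Riemann sum, accurate enough for a sanity check
    h = (b - a) / steps
    return sum(f(a + (i + 0.5) * h) for i in range(steps)) * h

print(round(integrate(0, 2), 4))          # total mass ~1.0, so c = 3/8 works
print(round(integrate(1, 1.4), 4))        # ~0.218
print(round(integrate(1, 3), 4))          # ~0.875 = 7/8
print(round((4 ** (1 / 3)) ** 3 / 8, 4))  # F(a) at a = 4^(1/3) is 0.5
```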
|
|
|
|
== The (continuous) uniform distribution
|
|
|
|
The simplest and the best of the named distributions!
|
|
|
|
#definition[
|
|
Let $[a,b]$ be a bounded interval on the real line. A random variable $X$ has the uniform distribution on the interval $[a,b]$ if $X$ has the density function
|
|
|
|
$
|
|
f(x) = cases(
|
|
1/(b-a) &"for" x in [a,b],
|
|
0 &"for" x in.not [a,b]
|
|
)
|
|
$
|
|
|
|
Abbreviate this by $X ~ "Unif" [a,b]$.
|
|
]<continuous-uniform>
|
|
|
|
The graph of $"Unif" [a,b]$ is a constant line at height $1/(b-a)$ defined
|
|
across $[a,b]$. The integral is just the area of a rectangle, and we can check
|
|
it is 1.
|
|
|
|
#fact[
|
|
For $X ~ "Unif" [a,b]$, its cumulative distribution function (CDF) is given by:
|
|
|
|
$
|
|
F_x (x) = cases(
|
|
0 &"for" x < a,
|
|
(x-a)/(b-a) &"for" x in [a,b],
|
|
1 &"for" x > b
|
|
)
|
|
$
|
|
]
|
|
|
|
#fact[
|
|
If $X ~ "Unif" [a,b]$, and $[c,d] subset [a,b]$, then
|
|
$
|
|
P(c <= X <= d) = integral_c^d 1 / (b-a) dif x = (d-c) / (b-a)
|
|
$
|
|
]
|
|
|
|
#example[
|
|
Let $Y$ be a uniform random variable on $[-2,5]$. Find the probability that its
|
|
absolute value is at least 1.
|
|
|
|
$Y$ takes values in the interval $[-2,5]$, so the absolute value is at least 1 iff. $Y in [-2,-1] union [1,5]$.
|
|
|
|
The density function of $Y$ is $f(x) = 1/(5- (-2)) = 1/7$ on $[-2,5]$ and 0 everywhere else.
|
|
|
|
So,
|
|
|
|
$
|
|
P(|Y| >= 1) &= P(Y in [-2,-1] union [1,5]) \
|
|
&= P(-2 <= Y <= -1) + P(1 <= Y <= 5) \
|
|
&= 5 / 7
|
|
$
|
|
]
|
|
|
|
== The exponential distribution
|
|
|
|
The geometric distribution can be viewed as modeling waiting times in a discrete setting: we wait through $n - 1$ failures to arrive at the first success on the $n^"th"$ trial.
|
|
|
|
The exponential distribution is the continuous analogue to the geometric
|
|
distribution, in that we often use it to model waiting times in the continuous
|
|
sense. For example, the waiting time until the first customer enters the barber shop.
|
|
|
|
#definition[
|
|
Let $0 < lambda < infinity$. A random variable $X$ has the exponential distribution with parameter $lambda$ if $X$ has PDF
|
|
|
|
$
|
|
f(x) = cases(
|
|
lambda e^(-lambda x) &"for" x >= 0,
|
|
0 &"for" x < 0
|
|
)
|
|
$
|
|
|
|
Abbreviate this by $X ~ "Exp"(lambda)$, the exponential distribution with rate $lambda$.
|
|
|
|
The CDF of the $"Exp"(lambda)$ distribution is given by:
|
|
|
|
$
|
|
F(t) = cases(
|
|
0 &"if" t <0,
|
|
1 - e^(-lambda t) &"if" t>= 0
|
|
)
|
|
$
|
|
]
|
|
|
|
#example[
|
|
Suppose the length of a phone call, in minutes, is well modeled by an exponential random variable with a rate $lambda = 1/10$.
|
|
|
|
1. What is the probability that a call takes more than 8 minutes?
|
|
2. What is the probability that a call takes between 8 and 22 minutes?
|
|
|
|
Let $X$ be the length of the phone call, so that $X ~ "Exp"(1/10)$. Then we can find the desired probability by:
|
|
|
|
$
|
|
P(X > 8) &= 1 - P(X <= 8) \
|
|
&= 1 - F_x (8) \
|
|
&= 1 - (1 - e^(-(1 / 10) dot 8)) \
|
|
&= e^(-8 / 10) approx 0.4493
|
|
$
|
|
|
|
Now to find $P(8 < X < 22)$, we can take the difference in CDFs:
|
|
|
|
$
|
|
&P(X > 8) - P(X >= 22) \
|
|
&= e^(-8 / 10) - e^(-22 / 10) \
|
|
&approx 0.3385
|
|
$
|
|
]
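
The same two probabilities via the closed-form CDF (scribe's check):

```python
import math

lam = 1 / 10

def exp_cdf(t, lam):
    # CDF of Exp(lam): 1 - e^(-lam t) for t >= 0
    return 1 - math.exp(-lam * t) if t >= 0 else 0.0

print(round(1 - exp_cdf(8, lam), 4))                 # P(X > 8)      ~0.4493
print(round(exp_cdf(22, lam) - exp_cdf(8, lam), 4))  # P(8 < X < 22) ~0.3385
```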
|
|
|
|
#fact("Memoryless property of the exponential distribution")[
|
|
Suppose that $X ~ "Exp"(lambda)$. Then for any $s,t > 0$, we have
|
|
$
|
|
P(X > t + s | X > t) = P(X > s)
|
|
$
|
|
]<memoryless>
|
|
|
|
This is like saying if I've been waiting 5 minutes and then 3 minutes for the
|
|
bus, what is the probability that I'm gonna wait more than 5 + 3 minutes, given
|
|
that I've already waited 5 minutes? And that's precisely equal to just the
|
|
probability I'm gonna wait more than 3 minutes.
|
|
|
|
#proof[
|
|
$
|
|
P(X > t + s | X > t) = (P(X > t + s sect X > t)) / (P(X > t)) \
|
|
= P(X > t + s) / P(X > t)
|
|
= e^(-lambda (t+ s)) / (e^(-lambda t)) = e^(-lambda s) \
|
|
equiv P(X > s)
|
|
$
|
|
]
|
|
|
|
== Gamma distribution
|
|
|
|
#definition[
|
|
Let $r, lambda > 0$. A random variable $X$ has the *gamma distribution* with parameters $(r, lambda)$ if $X$ is nonnegative and has probability density function
|
|
|
|
$
|
|
f(x) = cases(
|
|
(lambda^r x^(r-1))/(Gamma(r)) e^(-lambda x) &"for" x >= 0,
|
|
0 &"for" x < 0
|
|
)
|
|
$
|
|
|
|
Abbreviate this by $X ~ "Gamma"(r, lambda)$.
|
|
]
|
|
|
|
The gamma function $Gamma(r)$ generalizes the factorial function and is defined as
|
|
|
|
$
|
|
Gamma(r) = integral_0^infinity x^(r-1) e^(-x) dif x, "for" r > 0
|
|
$
|
|
|
|
Special case: $Gamma(n) = (n - 1)!$ if $n in ZZ^+$.
|
|
|
|
#remark[
|
|
The $"Exp"(lambda)$ distribution is a special case of the gamma distribution,
|
|
with parameter $r = 1$.
|
|
]
|
|
|
|
== The normal (Gaussian) distribution
|
|
|
|
#definition[
|
|
A random variable $Z$ has the *standard normal distribution* if $Z$ has
|
|
density function
|
|
|
|
$
|
|
phi(x) = 1 / sqrt(2 pi) e^(-x^2 / 2)
|
|
$
|
|
on the real line. Abbreviate this by $Z ~ N(0,1)$.
|
|
]<normal-dist>
|
|
|
|
#fact("CDF of a standard normal random variable")[
|
|
Let $Z~N(0,1)$ be normally distributed. Then its CDF is given by
|
|
$
|
|
Phi(x) = integral_(-infinity)^x phi(s) dif s = integral_(-infinity)^x 1 / sqrt(2 pi) e^(-s^2 / 2) dif s
|
|
$
|
|
]
|
|
|
|
The normal distribution is so important that, instead of the standard $f_Z (x)$ and
$F_Z (x)$, we use the special notation $phi(x)$ and $Phi(x)$.
|
|
|
|
#fact[
|
|
$
|
|
integral_(-infinity)^infinity e^(-s^2 / 2) dif s = sqrt(2 pi)
|
|
$
|
|
|
|
No closed form of the standard normal CDF $Phi$ exists, so we are left to either:
|
|
- approximate
|
|
- use technology (calculator)
|
|
- use the standard normal probability table in the textbook
|
|
]
|
|
|
|
To evaluate negative values, we can use the symmetry of the normal distribution
|
|
to apply the following identity:
|
|
|
|
$
|
|
Phi(-x) = 1 - Phi(x)
|
|
$
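Although $Phi$ has no elementary closed form, it can be written in terms of the error function as $Phi(x) = (1 + "erf"(x / sqrt(2))) / 2$, which is essentially what calculators evaluate. A small sketch (an aside, standard library only) that also checks the symmetry identity:

```python
from math import erf, sqrt

def Phi(x: float) -> float:
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + erf(x / sqrt(2)))

print(Phi(1.0))                  # ~0.8413, matches the normal table
print(Phi(-1.0), 1 - Phi(1.0))   # symmetry: Phi(-x) = 1 - Phi(x)
```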
== General normal distributions

The general family of normal distributions is obtained by linear or affine
transformations of $Z$. Let $mu$ be real, and $sigma > 0$, then

$
X = sigma Z + mu
$
is also a normally distributed random variable with parameters $(mu, sigma^2)$.
The CDF of $X$ in terms of $Phi(dot)$ can be expressed as

$
F_X (x) &= P(X <= x) \
&= P(sigma Z + mu <= x) \
&= P(Z <= (x - mu) / sigma) \
&= Phi((x-mu)/sigma)
$

Also,

$
f(x) = F'(x) = dif / (dif x) [Phi((x-mu)/sigma)] = 1 / sigma phi((x-mu)/sigma) = 1 / sqrt(2 pi sigma^2) e^(-((x-mu)^2) / (2sigma^2))
$

#definition[
Let $mu$ be real and $sigma > 0$. A random variable $X$ has the _normal distribution_ with mean $mu$ and variance $sigma^2$ if $X$ has density function

$
f(x) = 1 / sqrt(2 pi sigma^2) e^(-((x-mu)^2) / (2sigma^2))
$

on the real line. Abbreviate this by $X ~ N(mu, sigma^2)$.
]

#fact[
Let $X ~ N(mu, sigma^2)$ and $Y = a X + b$. Then
$
Y ~ N(a mu + b, a^2 sigma^2)
$

That is, $Y$ is normally distributed with parameters $(a mu + b, a^2 sigma^2)$.
In particular,
$
Z = (X - mu) / sigma ~ N(0,1)
$
is a standard normal variable.
]
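A brief numerical illustration (an aside, assuming SciPy) of the standardization identity $F_X (x) = Phi((x-mu)/sigma)$:

```python
from scipy.stats import norm

mu, sigma = 3.0, 2.0
X = norm(loc=mu, scale=sigma)   # X ~ N(mu, sigma^2)
Z = norm()                      # Z ~ N(0, 1)

x = 4.5
print(X.cdf(x))                 # P(X <= x)
print(Z.cdf((x - mu) / sigma))  # Phi((x - mu) / sigma), same value
```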
= Lecture #datetime(day: 11, year: 2025, month: 2).display()

== Expectation

#definition[
The expectation or mean of a discrete random variable $X$ is the weighted
average, with weights assigned by the corresponding probabilities.

$
E(X) = sum_("all" x_i) x_i dot p(x_i)
$
]

#example[
Find the expected value of a single roll of a fair die.

- $X = "score" / "dots"$
- $x = 1,2,3,4,5,6$
- $p(x) = 1 / 6, 1 / 6,1 / 6,1 / 6,1 / 6,1 / 6$

$
E[X] = 1 dot 1 / 6 + 2 dot 1 / 6 + dots.c + 6 dot 1 / 6 = 21 / 6 = 3.5
$
]
== Binomial expected value

$
E[X] = n p
$

== Bernoulli expected value

Bernoulli is just binomial with one trial.

Recall that $P(X=1) = p$ and $P(X=0) = 1 - p$.

$
E[X] = 1 dot P(X=1) + 0 dot P(X=0) = p
$

Let $A$ be an event on $Omega$. Its _indicator random variable_ $I_A$ is defined
for $omega in Omega$ by

$
I_A (omega) = cases(1", if " &omega in A, 0", if" &omega in.not A)
$

$
E[I_A] = 1 dot P(A) = P(A)
$
== Geometric expected value

Let $p in [0,1]$ and $X ~ "Geom"(p)$ be a geometric RV with probability of
success $p$. Recall that the p.m.f. is $p q^(k-1)$, where prob. of failure is defined by $q := 1-p$.

Then

$
E[X] &= sum_(k=1)^infinity k p q^(k-1) \
&= p dot sum_(k=1)^infinity k dot q^(k-1)
$

Now recall from calculus that you can differentiate a power series term by term inside its radius of convergence. So for $|t| < 1$,

$
sum_(k=1)^infinity k t^(k-1) =
sum_(k=1)^infinity dif / (dif t) t^k = dif / (dif t) sum_(k=1)^infinity t^k = dif / (dif t) (t / (1-t)) = 1 / (1-t)^2 \
therefore E[X] = sum^infinity_(k=1) k p q^(k-1) = p sum^infinity_(k=1) k q^(k-1) = p (1 / (1 - q)^2) = 1 / p
$
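A quick numerical check of $E[X] = 1 / p$ (an aside; the infinite series is truncated at a large cutoff):

```python
p = 0.3
q = 1 - p

# truncate sum_{k >= 1} k * p * q^(k-1) at a large cutoff
approx = sum(k * p * q ** (k - 1) for k in range(1, 10_000))
print(approx, 1 / p)  # both ~3.3333
```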
== Expected value of a continuous RV

#definition[
The expectation or mean of a continuous random variable $X$ with density
function $f$ is

$
E[X] = integral_(-infinity)^infinity x dot f(x) dif x
$

An alternative symbol is $mu = E[X]$.
]

$mu$ is the "first moment" of $X$; by analogy with physics, it is the "center of
gravity" of $X$.

#remark[
In general when moving between discrete and continuous RV, replace sums with
integrals, p.m.f. with p.d.f., and vice versa.
]
#example[
Suppose $X$ is a continuous RV with p.d.f.

$
f_X (x) = cases(2x", " &0 < x < 1, 0"," &"elsewhere")
$

$
E[X] = integral_(-infinity)^infinity x dot f(x) dif x = integral^1_0 x dot 2x dif x = 2 / 3
$
]

#example("Uniform expectation")[
Let $X$ be a uniform random variable on the interval $[a,b]$ with $X ~
"Unif"[a,b]$. Find the expected value of $X$.

$
E[X] = integral^infinity_(-infinity) x dot f(x) dif x = integral_a^b x / (b-a) dif x \
= 1 / (b-a) integral_a^b x dif x = 1 / (b-a) dot (b^2 - a^2) / 2 = underbrace((b+a) / 2, "midpoint formula")
$
]
#example("Exponential expectation")[
Find the expected value of an exponential RV, with p.d.f.

$
f_X (x) = cases(lambda e^(-lambda x)", " &x > 0, 0"," &"elsewhere")
$

$
E[X] = integral_(-infinity)^infinity x dot f(x) dif x = integral_0^infinity x dot lambda e^(-lambda x) dif x \
= lambda dot integral_0^infinity x dot e^(-lambda x) dif x \
= lambda dot [lr(-x 1 / lambda e^(-lambda x) |)_(x=0)^(x=infinity) - integral_0^infinity -1 / lambda e^(-lambda x) dif x] \
= 1 / lambda
$
]

#example("Uniform dartboard")[
Our dartboard is a disk of radius $r_0$ and the dart lands uniformly at
random on the disk when thrown. Let $R$ be the distance of the dart from the
center of the disk. Find $E[R]$ given density function

$
f_R (t) = cases((2t)/(r_0 ^2)", " &0 <= t <= r_0, 0", " &t < 0 "or" t > r_0)
$

$
E[R] = integral_(-infinity)^infinity t f_R (t) dif t \
= integral^(r_0)_0 t dot (2t) / (r_0^2) dif t \
= 2 / 3 r_0
$
]
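As an aside, the dartboard answer $E[R] = 2 / 3 r_0$ is easy to confirm by simulation: sample points uniformly on the disk (by rejection from the enclosing square) and average their distances from the center. A sketch with NumPy, taking $r_0 = 1$:

```python
import numpy as np

rng = np.random.default_rng(1)
r0 = 1.0

# rejection sampling: uniform points in the square, keep those inside the disk
pts = rng.uniform(-r0, r0, size=(2_000_000, 2))
dist = np.hypot(pts[:, 0], pts[:, 1])
inside = dist[dist <= r0]

print(inside.mean())  # ~2/3
```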
== Expectation of derived values

If we can find the expected value of $X$, can we find the expected value of
$X^2$? More precisely, can we find $E[X^2]$?

If the distribution is easy to see, then this is trivial. Otherwise we have the
following useful property:

$
E[X^2] = integral_("all" x) x^2 f_X (x) dif x
$

(for continuous RVs).

And in the discrete case,

$
E[X^2] = sum_("all" x) x^2 p_X (x)
$

In fact $E[X^2]$ is so important that we call it the *mean square*.

#fact[
More generally, a real valued function $g(X)$ defined on the range of $X$ is
itself a random variable (with its own distribution).
]

We can find the expected value of $g(X)$ by

$
E[g(X)] = integral_(-infinity)^infinity g(x) f(x) dif x
$

or

$
E[g(X)] = sum_("all" x) g(x) p(x)
$
#example[
You roll a fair die to determine the winnings (or losses) $W$ of a player as
follows:

$
W = cases(-1", if the roll is 1, 2, or 3", 1", if the roll is a 4", 3", if the roll is 5 or 6")
$

What is the expected winnings/losses for the player during 1 roll of the die?

Let $X$ denote the outcome of the roll of the die. Then we can define our
random variable as $W = g(X)$ where the function $g$ is defined by $g(1) =
g(2) = g(3) = -1$ and so on.

Note that $P(W = -1) = P(X = 1 union X = 2 union X = 3) = 1/2$. Likewise $P(W=1)
= P(X=4) = 1/6$, and $P(W=3) = P(X=5 union X=6) = 1/3$.

Then
$
E[g(X)] = E[W] = (-1) dot P(W=-1) + (1) dot P(W=1) + (3) dot P(W=3) \
= -1 / 2 + 1 / 6 + 1 = 2 / 3
$
]
#example[
A stick of length $l$ is broken at a uniformly chosen random location. What is
the expected length of the longer piece?

Idea: if you break it before the halfway point, then the longer piece has length
given by $l - x$. If you break it after the halfway point, the longer piece
has length $x$.

Let the interval $[0,l]$ represent the stick and let $X ~ "Unif"[0,l]$ be the
location where the stick is broken. Then $X$ has density $f(x) = 1/l$ on
$[0,l]$ and 0 elsewhere.

Let $g(x)$ be the length of the longer piece when the stick is broken at $x$,

$
g(x) = cases(l-x", " &0 <= x < l/2, x", " &l/2 <= x <= l)
$

Then
$
E[g(X)] = integral_(-infinity)^infinity g(x) f(x) dif x = integral_0^(l / 2) (l-x) / l dif x + integral_(l / 2)^l x / l dif x \
= 3 / 4 l
$

So we expect the longer piece to be $3/4$ of the total length, which is perhaps
a little surprising at first.
]
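A simulation sketch (an aside, with NumPy) confirming the $3/4$ answer for a stick of length $l = 1$:

```python
import numpy as np

rng = np.random.default_rng(2)
l = 1.0
breaks = rng.uniform(0, l, size=1_000_000)

longer = np.maximum(breaks, l - breaks)  # length of the longer piece
print(longer.mean())  # ~0.75 = (3/4) * l
```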
== Moments of a random variable

We continue discussing expectation but we introduce new terminology.

#fact[
The $n^"th"$ moment (or $n^"th"$ raw moment) of a discrete random variable $X$
with p.m.f. $p_X (x)$ is the expectation

$
E[X^n] = sum_k k^n p_X (k) = mu_n
$

If $X$ is continuous, then we have analogously

$
E[X^n] = integral_(-infinity)^infinity x^n f_X (x) dif x = mu_n
$
]

The *standard deviation* is given by $sigma$ and the *variance* is given by $sigma^2$ and

$
sigma^2 = mu_2 - (mu_1)^2
$

$mu_3$ is used to measure "skewness" / asymmetry of a distribution. For
example, the normal distribution is very symmetric.

$mu_4$ is used to measure kurtosis/peakedness of a distribution.
== Central moments

Previously we discussed "raw moments." Be careful not to confuse them with
_central moments_.

#fact[
The $n^"th"$ central moment of a discrete random variable $X$ with p.m.f. $p_X
(x)$ is the expected value of the difference about the mean raised to the
$n^"th"$ power

$
E[(X-mu)^n] = sum_k (k - mu)^n p_X (k) = mu'_n
$

And of course in the continuous case,

$
E[(X-mu)^n] = integral_(-infinity)^infinity (x - mu)^n f_X (x) dif x = mu'_n
$
]

In particular,

$
mu'_1 = E[(X-mu)^1] = integral_(-infinity)^infinity (x-mu)^1 f_X (x) dif x \
= integral_(-infinity)^infinity x f_X (x) dif x - integral_(-infinity)^infinity mu f_X (x) dif x = mu - mu dot 1 = 0 \
mu'_2 = E[(X-mu)^2] = sigma^2_X = "Var"(X)
$

Effectively we're centering our distribution first.
#example[
Let $Y$ be a uniformly chosen integer from ${0,1,2,...,m}$. Find the first and
second moment of $Y$.

The p.m.f. of $Y$ is $p_Y (k) = 1/(m+1)$ for $k in {0,1,...,m}$. Thus,

$
E[Y] = sum_(k=0)^m k 1 / (m+1) = 1 / (m+1) sum_(k=0)^m k \
= m / 2
$

Then,

$
E[Y^2] = sum_(k=0)^m k^2 1 / (m+1) = 1 / (m+1) dot (m(m+1)(2m+1)) / 6 = (m(2m+1)) / 6
$
]
#example[
Let $c > 0$ and let $U$ be a uniform random variable on the interval $[0,c]$.
Find the $n^"th"$ moment for $U$ for all positive integers $n$.

The density function of $U$ is

$
f(x) = cases(1/c", if" &x in [0,c], 0", " &"otherwise")
$

Therefore the $n^"th"$ moment of $U$ is,

$
E[U^n] = integral_(-infinity)^infinity x^n f(x) dif x = integral_0^c x^n / c dif x = c^n / (n+1)
$
]
#example[
Suppose the random variable $X ~ "Exp"(lambda)$. Find the second moment of $X$.

$
E[X^2] = integral_0^infinity x^2 lambda e^(-lambda x) dif x \
= 1 / (lambda^2) integral_0^infinity u^2 e^(-u) dif u \
= 1 / (lambda^2) Gamma(2 + 1) = 2! / lambda^2
$
]

#fact[
In general, to find the $n^"th"$ moment of $X ~ "Exp"(lambda)$,
$
E[X^n] = integral^infinity_0 x^n lambda e^(-lambda x) dif x = n! / lambda^n
$
]
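A numerical aside (a sketch assuming SciPy) that checks $E[X^n] = n! / lambda^n$ by integrating the defining formula directly:

```python
import math
import numpy as np
from scipy.integrate import quad

lam = 2.0
for n in (1, 2, 3):
    # integrate x^n * lam * exp(-lam x) over [0, infinity)
    integral, _ = quad(lambda x: x**n * lam * np.exp(-lam * x), 0, np.inf)
    print(integral, math.factorial(n) / lam**n)
```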
== Median and quartiles

When a random variable has rare (abnormal) values, its expectation may be a bad
indicator of where the center of the distribution lies.

#definition[
The *median* of a random variable $X$ is any real value $m$ that satisfies

$
P(X >= m) >= 1 / 2 "and" P(X <= m) >= 1 / 2
$

With half the probability on both ${X <= m}$ and ${X >= m}$, the median is
representative of the midpoint of the distribution. We say that the median is
more _robust_ because it is less affected by outliers. It is not necessarily
unique.
]
#example[
Let $X$ be uniformly distributed on the set ${-100, 1, 2, 3, ..., 9}$, so $X$ has probability mass function
$
p_X (-100) = p_X (1) = dots.c = p_X (9) = 1 / 10
$

Find the expected value and median of $X$.

$
E[X] = (-100) dot 1 / 10 + (1) dot 1 / 10 + dots.c + (9) dot 1 / 10 = -5.5
$

While the median is any number $m in [4,5]$.

The median reflects the fact that 90% of the values (and of the probability)
lie in the range $1, 2, ..., 9$, while the mean is heavily influenced by the
$-100$ value.
]
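As an aside, this robustness is easy to see numerically (a sketch with NumPy; `np.median` returns the midpoint $4.5$ of the two central values, which lies in the interval $[4,5]$ above):

```python
import numpy as np

values = np.array([-100, 1, 2, 3, 4, 5, 6, 7, 8, 9])
print(values.mean())      # -5.5, dragged down by the outlier
print(np.median(values))  # 4.5, unaffected by how extreme -100 is
```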
= President's Day lecture

...

= Lecture #datetime(day: 19, month: 2, year: 2025).display()
== Moment generating functions

Like the CDF, the moment generating function also completely characterizes the
distribution. That is, if you can find the MGF, it tells you all of the
information about the distribution. So it is an alternative way to characterize
a random variable.

They are "easy" to use for finding the distributions of:

- sums of independent random variables
- the distribution of the limit of a sequence of random variables

#definition[
Let $X$ be a random variable with all finite moments
$
E[X^k] = mu_k, k = 1,2,...
$
Then the *moment generating function* of a random variable $X$ is defined by
$M_X (t) = E[e^(t X)]$, for the real variable $t$.
]
All of the moments must be defined for the MGF to exist. The MGF looks like

$
sum_("all" x) e^(t x) p(x)
$
in the discrete case, and
$
integral^infinity_(-infinity) e^(t x) f(x) dif x
$
in the continuous case.

#proposition[
It holds that the $n^"th"$ derivative of $M$ evaluated at 0 gives the
$n^"th"$ moment.
$
M_X^((n)) (0) = E[X^n]
$
]

#proof[
$
M_X (t) &equiv E[e^(t X)] = E[1 + (t X) + (t X)^2 / 2! + dots.c] \
&= E[1] + E[t X] + E[(t^2 X^2) / 2!] + dots.c \
&= E[1] + t E[X] + t^2 / 2! E[X^2] + dots.c \
&= 1 + t / 1! mu_1 + t^2 / 2! mu_2 + dots.c
$

The coefficient of $t^k/k!$ in the Taylor series expansion of $M_X (t)$ is the
$k^"th"$ moment. So an alternative way to get $mu_k$ is

$
mu_k = lr(((dif^k M(t))/(dif t^k)) |)_(t=0) = "coefficient of" t^k / k!
$
]
#example[Binomial][

Let $X ~ "Bin"(n,p)$. Then the MGF of $X$ is given by

$
sum_(k=0)^n e^(t k) vec(n,k) p^k q^(n-k) = sum_(k=0)^n vec(n,k) underbrace((p e^t)^k, a^k) underbrace(q^(n-k), b^(n-k))
$

Applying the binomial theorem

$
(a + b)^n = sum_(k=0)^n vec(n,k) a^k b^(n-k)
$

So we have

$
(q + p e^t)^n
$

Let's find the first moment

$
mu_1 = lr((dif M(t))/(dif t) |)_(t=0) \
= n p
$

The second moment:

$
mu_2 = lr((dif^2 M(t))/(dif t^2) |)_(t=0) \
= n(n-1) p^2 + n p
$

For example, if $X$ has MGF $(1/3 + 2/3 e^t)^(10)$, then $X ~ "Bin"(10,2/3)$.
]
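The derivative computation here is mechanical, so it is a natural spot to let a computer algebra system do the work. A sketch (an aside, assuming SymPy) differentiating $M_X (t) = (q + p e^t)^n$ at $t = 0$:

```python
import sympy as sp

t, n, p = sp.symbols("t n p", positive=True)
q = 1 - p
M = (q + p * sp.exp(t)) ** n  # MGF of Bin(n, p)

mu1 = sp.simplify(sp.diff(M, t, 1).subs(t, 0))
mu2 = sp.simplify(sp.diff(M, t, 2).subs(t, 0))
print(mu1)  # n*p
print(mu2)  # equal to n*(n-1)*p**2 + n*p (printed form may differ)
```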
#example[Poisson][
Let $X ~ "Pois"(lambda)$. Then the MGF of $X$ is given by
$
M_X (t) = E[e^(t X)] \
= sum^infinity_(x=0) e^(t x) e^(-lambda) lambda^x / x! \
= e^(-lambda) sum^infinity_(x=0) (lambda e^t)^x / x!
$
Note: $e^a = sum_(x=0) ^infinity a^x / x!$
$
= e^(-lambda) e^(lambda e^t) \
= e^(-lambda (1 - e^t))
$
Then, the first moment can be found by,
$
mu_1 = lr(e^(-lambda (1 - e^t)) (-lambda) (-e^t) |)_(t=0) = lambda
$
]
#example[Exponential][
Let $X ~"Exp"(lambda)$ with PDF
$
f(x) = cases(lambda e^(-lambda x) &"for" x > 0, 0 &"otherwise")
$
Find the MGF of $X$.

$
M_X (t) &= integral^infinity_(-infinity) e^(t x) dot lambda e^(-lambda x) dif x \
&= lambda integral_0^infinity e^((t-lambda) x) dif x \
&= lambda lim_(b->infinity) integral_0^b e^((t - lambda) x) dif x
$
This integral depends on $t$, so we should consider three cases. If $t =
lambda$, then the integral diverges.

If $t != lambda$,
$
E[e^(t X)] = lambda lim_(b->infinity) integral_0^b e^((t - lambda) x) dif x = lambda lim_(b -> infinity) [(e^((t - lambda) x) - 1) / (t - lambda)]^(x=b)_(x=0) \
lambda lim_(b -> infinity) (e^((t - lambda) b) - 1) / (t - lambda) = cases(infinity &"if" t > lambda, lambda/(lambda - t) &"if" t < lambda)
$
Combining with the $t = lambda$ case,

$
M_X (t) = cases(infinity &"if" t >= lambda, lambda/(lambda - t) &"if" t < lambda)
$
]
#example[Alternative parameterization of the exponential][
Consider $X ~ "Exp"(beta)$ with PDF
$
f(x) = cases(1/beta e^(-x/beta) &"for" x > 0, 0 &"otherwise")
$
and proceed as usual (for $t < 1 / beta$)
$
M_X (t) = integral_(-infinity)^infinity e^(t x) dot 1 / beta e^(-x / beta) dif x = 1 / beta lim_(b->infinity) [e^((t - 1 / beta) x) / (t - 1 / beta)]_(x=0)^(x=b) = 1 / (1 - beta t)
$
So it's a geometric series
$
1 + beta t + (beta t)^2 + dots.c
$
Multiply and divide each $n^"th"$ term by $n!$
$
= 1 + beta t + 2 beta^2 (t^2 / 2!) + 6 beta^3 (t^3 / 3!) + dots.c
$
Recall that the coefficient of $t^k / k!$ is $mu_k$. So
- $E[X] = beta$
- $E[X^2] = 2 beta^2$
- $E[X^3] = 6 beta^3$
$
"Var"(X) = E[X^2] - (E[X])^2 = beta^2
$
]
#example[Uniform on $[0,1]$][
Let $X ~ U(0,1)$, then
$
M_X (t) &= integral_0^1 e^(t x) dot 1 dif x \
&= lr(e^(t x)/t |)_(x=0)^(x=1) = (e^t - 1) / t \
&= (cancel(1) + t + t^2 / 2! + t^3 / 3! + dots.c - cancel(1)) / t \
&= 1 + t / 2! + t^2 / 3! + t^3 / 4! + dots.c \
&= 1 + 1 / 2 t + 1 / 3 (t^2 / 2!) + 1 / 4(t^3 / 3!) + dots.c
$
So
- $E[X] = 1 / 2$
- $E[X^2] = 1 / 3$
- $E[X^n] = 1 / (n + 1)$
]
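These moments agree with the earlier direct computation $E[U^n] = c^n / (n+1)$ with $c = 1$; a brief numerical aside (a sketch with NumPy):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0, 1, size=2_000_000)

for n in (1, 2, 5):
    print(n, (x**n).mean(), 1 / (n + 1))  # sample moment vs 1/(n+1)
```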
== Properties of the MGF

#definition[
Random variables $X$ and $Y$ are equal in distribution if $P(X in B) = P(Y in
B)$ for all subsets $B$ of $RR$.
]

#abuse[
Abbreviate this by $X eq.delta Y$.
]
#example[Normal distribution][
Let $Z ~ N(0,1)$. Then
$
E[e^(t Z)] = 1 / sqrt(2 pi) integral^infinity_(-infinity) e^(-1 / 2 z^2 + t z -1 / 2 t^2 + 1 / 2 t^2) dif z \
= e^(t^2 / 2) 1 / sqrt(2 pi) integral_(-infinity)^infinity e^(-1 / 2 (z-t)^2) dif z = e^(t^2 / 2)
$

To get the MGF for a general normal RV $X ~ N(mu, sigma^2)$, write
$
X = sigma Z + mu
$
so that we get
$
E[e^(t (sigma Z + mu))] = e^(t mu) E[e^(t sigma Z)] = e^(t mu) dot e^((t^2 sigma^2) / 2) = exp(mu t + (sigma^2 t^2) / 2)
$
]
== Joint distributions of RV

Looking at multiple random variables jointly. If $X$ and $Y$ are both random
variables defined on $Omega$, treat them as coordinates of a 2 dimensional
random vector. It's a vector valued function on $Omega$,

$
Omega -> RR^2
$

This holds for both discrete and continuous random variables.

#example[
$
(X,Y)
$
1. Poker hand: $X$ is the number of face cards, $Y$ is the number of red cards.
2. Demographic info: $X$ = height, $Y$ = weight
]
In general, we may consider $n$ random variables jointly, where

$
X_1, X_2, ..., X_n
$
defined on $Omega$ are the coordinates of an $n$-dimensional random vector that
maps outcomes to $RR^n$.

The probability distribution of $(X_1, dots.c, X_n)$ is now $P((X_1, dots.c,
X_n) in B)$ where $B$ ranges over subsets of $RR^n$ (elements of the power set of $RR^n$).

The probability distribution of the random vector is called the _joint
distribution_.
#fact[
Let $X$ and $Y$ both be discrete random variables defined on the same $Omega$.
Then, the joint PMF is
$
P(X = x, Y = y) = p_(X,Y) (x,y)
$
where $p_(X,Y) (x,y) >= 0$ for all possible values $x,y$ of $X$ and $Y$
respectively, and

$
sum_(x in X) sum_(y in Y) p_(X,Y) (x,y) = 1
$
]
#definition[
Let $X_1, X_2, ..., X_n$ be discrete random variables defined on $Omega$;
their joint PMF is given by

$
p(k_1, k_2, ..., k_n) = P(X_1 = k_1, X_2 = k_2, ..., X_n = k_n)
$
for all possible values $k_1, ..., k_n$ of $X_1, ..., X_n$.
]

#fact[
The joint probability in set notation:
$
P(X_1 = k_1, X_2 = k_2, ..., X_n = k_n) = P({X_1 = k_1} sect dots.c sect {X_n = k_n})
$
The joint PMF has the same properties as the single variable PMF
$
p_(X_1,X_2,...,X_n) (k_1,k_2,...,k_n) >= 0
$
]