
Random variables, distributions, and probability theory

An overview of discrete and continuous random variables and their distributions and moment generating functions

2025-02-16

These are some notes I’ve been collecting on random variables, their distributions, expected values, and moment generating functions. I thought I’d write them down somewhere useful.

These are almost extracted verbatim from my in-class notes, which I take in real time using Typst. I simply wrote a tiny compatibility shim to allow Pandoc to render them to the web.


Random variables

First, some brief exposition on random variables. Perhaps counterintuitively, a random variable is actually a function.

Standard notation: $\Omega$ is a sample space, and $\omega \in \Omega$ is an outcome.

Definition.

A random variable $X$ is a function $X : \Omega \to \mathbb{R}$ that takes the set of possible outcomes in a sample space and maps it to a measurable space, typically (as in our case) a subset of $\mathbb{R}$.

Definition.

The state space of a random variable $X$ is the set of all values $X$ can take.

Example.

Let $X$ be a random variable that takes on the values $\{0, 1, 2, 3\}$. Then the state space of $X$ is the set $\{0, 1, 2, 3\}$.

Discrete random variables

A random variable $X$ is discrete if there is a countable set $A$ such that $P(X \in A) = 1$. A value $k$ is a possible value if $P(X = k) > 0$. We discuss continuous random variables later.

The probability distribution of $X$ gives its important probabilistic information: a description of the probabilities $P(X \in B)$ for subsets $B \subset \mathbb{R}$. We describe it through the probability mass function and the cumulative distribution function.

A discrete random variable has a probability distribution entirely determined by its probability mass function (hereafter abbreviated p.m.f. or PMF), $p(k) = P(X = k)$. The p.m.f. is a function from the set of possible values of $X$ into $[0,1]$. We label the p.m.f. with its random variable by writing $p_X(k)$.

$$p_X : \text{state space of } X \to [0,1]$$

By the axioms of probability,

$$\sum_{k} p_X(k) = \sum_{k} P(X = k) = 1$$

For a subset $B \subset \mathbb{R}$,

$$P(X \in B) = \sum_{k \in B} p_X(k)$$

Continuous random variables

Now as promised we introduce another major class of random variables.

Definition.

Let $X$ be a random variable. If $f$ satisfies

$$P(X \leq b) = \int_{-\infty}^{b} f(x)\,dx$$

for all $b \in \mathbb{R}$, then $f$ is the probability density function (hereafter abbreviated p.d.f. or PDF) of $X$.

We immediately see that the p.d.f. is analogous to the p.m.f. of the discrete case.

The probability that $X \in (-\infty, b]$ is equal to the area under the graph of $f$ from $-\infty$ to $b$.

A corollary is the following.

Fact.

$$P(X \in B) = \int_{B} f(x)\,dx$$

for any $B \subset \mathbb{R}$ where integration makes sense.

The set can be bounded or unbounded, or any collection of intervals.

Fact.

$$P(a \leq X \leq b) = \int_{a}^{b} f(x)\,dx, \qquad P(X > a) = \int_{a}^{\infty} f(x)\,dx$$

Fact.

If a random variable $X$ has density function $f$, then individual point values have probability zero:

$$P(X = c) = \int_{c}^{c} f(x)\,dx = 0, \quad \forall c \in \mathbb{R}$$

Remark.

It follows that a random variable with a density function is not discrete. An immediate corollary is that the probabilities of intervals are not changed by including or excluding endpoints, so $P(X \leq k)$ and $P(X < k)$ are equivalent.

How do we determine which functions are p.d.f.s? Since $P(-\infty < X < \infty) = 1$, a p.d.f. $f$ must satisfy

$$f(x) \geq 0 \ \forall x \in \mathbb{R}, \qquad \int_{-\infty}^{\infty} f(x)\,dx = 1$$

Fact.

Random variables with density functions are called continuous random variables. This does not imply that the random variable is a continuous function on $\Omega$, but it is standard terminology.

Discrete distributions

Recall that the probability distribution of $X$ gives its important probabilistic information. Let us discuss some of these distributions.

In general we first consider the experiment’s properties and theorize about the distribution that its random variable takes. We can then apply the distribution to find out various pieces of probabilistic information.

Bernoulli trials

A Bernoulli trial is the original “experiment.” It’s simply a single trial with a binary “success” or “failure” outcome. Encode this T/F, 0 or 1, or however you’d like. It becomes immediately useful in defining more complex distributions, so let’s analyze its properties.

The setup: the experiment has exactly two outcomes:

  • Success – $S$ or 1

  • Failure – $F$ or 0

Additionally:

$$P(S) = p \ (0 < p < 1), \qquad P(F) = 1 - p = q$$

Construct the probability mass function:

$$P(X = 1) = p, \qquad P(X = 0) = 1 - p$$

Write it as:

$$p_X(k) = p^{k}(1 - p)^{1 - k}$$

for $k = 1$ and $k = 0$.

Binomial distribution

The setup: very similar to Bernoulli; each trial has exactly 2 outcomes. A binomial experiment is a bunch of Bernoulli trials in a row.

Importantly: $p$ and $q$ are defined exactly the same in all trials.

This ties the binomial distribution to the sampling with replacement model, since each trial does not affect the next.

We conduct $n$ independent trials of this experiment. Example with coins: each flip independently has a $\frac{1}{2}$ chance of heads or tails (the same setup holds for a die, a rigged coin, etc., just with a different $p$).

$n$ is fixed, i.e. known ahead of time.

Binomial random variable

Let’s consider the random variable characterized by the binomial distribution now.

Let $X = \#$ of successes in $n$ independent trials. For any particular sequence of $n$ trials, it takes the form $\Omega = \{\omega\}$ where $\omega = SFF\cdots F$ and is of length $n$.

Then $X(\omega) = 0, 1, 2, \ldots, n$, so $X$ can take $n + 1$ possible values. The probability of any particular sequence is given by the product of the individual trial probabilities.

Example.

$$\omega = SFFSF\cdots S \implies P(\{\omega\}) = pqqpq\cdots p$$

So $P(X = 0) = P(FFF\cdots F) = q \cdot q \cdots q = q^{n}$.

And

$$P(X = 1) = P(SFF\cdots F) + P(FSF\cdots F) + \cdots + P(FFF\cdots FS) = \underbrace{n}_{\text{possible positions}} \cdot p^{1} q^{n-1} = \binom{n}{1} p^{1} q^{n-1} = n p q^{n-1}$$

Now we can generalize

$$P(X = 2) = \binom{n}{2} p^{2} q^{n-2}$$

How about all successes?

$$P(X = n) = P(SS\cdots S) = p^{n}$$

We see that for all failures we have $q^{n}$ and for all successes we have $p^{n}$. Otherwise we use our method above.

In general, here is the probability mass function for the binomial random variable:

$$P(X = k) = \binom{n}{k} p^{k} q^{n-k}, \quad \text{for } k = 0, 1, 2, \ldots, n$$

The binomial distribution is very powerful: whenever we repeatedly choose between two outcomes, it gives the probabilities of each possible number of successes.

To summarize the characterization of the binomial random variable:

  • $n$ independent trials

  • each trial results in binary success or failure

  • with probability of success $p$, identically across trials

with $X = \#$ of successes in a fixed number $n$ of trials.

$$X \sim \text{Bin}(n, p)$$

with probability mass function

$$P(X = x) = \binom{n}{x} p^{x}(1 - p)^{n-x} = p(x), \quad \text{for } x = 0, 1, 2, \ldots, n$$

We see this is in fact the binomial theorem!

$$p(x) \geq 0, \qquad \sum_{x=0}^{n} p(x) = \sum_{x=0}^{n} \binom{n}{x} p^{x} q^{n-x} = (p + q)^{n}$$

In fact, $(p + q)^{n} = (p + (1 - p))^{n} = 1$.

Example.

What is the probability of getting exactly three aces (1’s) out of 10 throws of a fair die?

This seems a little trickier, but we can still write it as a well-defined $S$/$F$ experiment. Let $S$ be rolling an ace and $F$ anything else.

Then $p = \frac{1}{6}$ and $n = 10$. We want $P(X = 3)$. So

$$P(X = 3) = \binom{10}{3} p^{3} q^{7} = \binom{10}{3} \left(\frac{1}{6}\right)^{3} \left(\frac{5}{6}\right)^{7} \approx 0.15505$$
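These notes contain no code, but the arithmetic is easy to double-check. Here is a minimal Python sketch using only the standard library (the helper name binom_pmf is mine, not from the notes):

```python
# Quick numerical check of the binomial example above.
from math import comb

def binom_pmf(k: int, n: int, p: float) -> float:
    """P(X = k) for X ~ Bin(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

print(binom_pmf(3, 10, 1 / 6))  # ~0.15505, three aces in ten rolls
```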

With or without replacement?

I place particular emphasis on the fact that the binomial distribution generally applies to cases where you're sampling with replacement. Consider the following example.

Example.

Suppose we have two types of candy, red and black, with $a$ red and $b$ black candies in a jar. Select $n$ candies. Let $X$ be the number of red candies among the $n$ selected.

There are 2 cases.

  • case 1: with replacement: binomial distribution with parameters $n$ and $p = \frac{a}{a+b}$.

$$P(X = 2) = \binom{n}{2} \left(\frac{a}{a+b}\right)^{2} \left(\frac{b}{a+b}\right)^{n-2}$$

  • case 2: without replacement: then use counting.

$$P(X = x) = \frac{\binom{a}{x}\binom{b}{n-x}}{\binom{a+b}{n}} = p(x)$$

In case 2, we used the elementary counting techniques we are already familiar with. Immediately we see a distinct case similar to the binomial but when sampling without replacement. Let’s formalize this as a random variable!

Hypergeometric distribution

Let’s introduce a random variable to represent a situation like case 2 above.

Definition.

$$P(X = x) = \frac{\binom{a}{x}\binom{b}{n-x}}{\binom{a+b}{n}} = p(x)$$

is known as the hypergeometric distribution.

Abbreviate this by:

$$X \sim \text{Hypergeom}(\#\text{ total}, \#\text{ of success-type items}, \text{sample size})$$

For example,

$$X \sim \text{Hypergeom}(N, N_A, n)$$

Remark.

If the sample size $n$ is very small relative to $a + b$, then both cases give similar (approximately the same) answers.

For instance, if we're sampling for blood types at UCSB and we take one student out without replacement, we don't really change the population substantially. So both approaches give a similar result.
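To make the remark concrete, here is a small Python comparison of $P(X = 2)$ with and without replacement for a large population; the candy counts below are made up purely for illustration:

```python
# Compare the binomial (with replacement) and counting (without replacement)
# answers when the population is much larger than the sample.
from math import comb

a, b, n = 400, 600, 5          # 400 red, 600 black (made-up numbers), draw 5
p = a / (a + b)

with_repl = comb(n, 2) * p**2 * (1 - p) ** (n - 2)           # binomial
without_repl = comb(a, 2) * comb(b, n - 2) / comb(a + b, n)  # counting

print(with_repl, without_repl)  # ~0.3456 vs ~0.3465: nearly identical
```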

Suppose we have two types of items, type $A$ and type $B$. Let $N_A$ be the number of type $A$ items and $N_B$ the number of type $B$ items, so $N = N_A + N_B$ is the total number of objects.

We sample $n$ items without replacement ($n \leq N$), with order not mattering. Denote by $X$ the number of type $A$ objects in our sample.

Definition.

Let $0 \leq N_A \leq N$ and $1 \leq n \leq N$ be integers. A random variable $X$ has the hypergeometric distribution with parameters $(N, N_A, n)$ if $X$ takes values in the set $\{0, 1, \ldots, n\}$ and has p.m.f.

$$P(X = k) = \frac{\binom{N_A}{k}\binom{N - N_A}{n-k}}{\binom{N}{n}} = p(k)$$

Example.

Let $N_A = 10$ be the number of defectives and $N_B = 90$ the number of non-defectives. We select $n = 5$ without replacement. What is the probability that 2 of the 5 selected are defective?

$$X \sim \text{Hypergeom}(N = 100, N_A = 10, n = 5)$$

We want $P(X = 2)$.

$$P(X = 2) = \frac{\binom{10}{2}\binom{90}{3}}{\binom{100}{5}} \approx 0.0702$$
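A quick numerical check of the defectives example, again with only the standard library (the helper name hypergeom_pmf is mine):

```python
from math import comb

def hypergeom_pmf(k: int, N: int, N_A: int, n: int) -> float:
    """P(X = k) for X ~ Hypergeom(N, N_A, n)."""
    return comb(N_A, k) * comb(N - N_A, n - k) / comb(N, n)

print(hypergeom_pmf(2, 100, 10, 5))  # ~0.0702
```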

Remark.

Make sure you can distinguish when a problem is binomial or when it is hypergeometric. This is very important on exams.

Recall that both ask about the number of successes in a fixed number of draws. But the binomial samples with replacement (each trial is independent), while the hypergeometric samples without replacement.

Geometric distribution

Consider an infinite sequence of independent trials, e.g. the number of attempts until I make a basket.

In fact we can think of this as a variation on the binomial distribution. But in this case we don't sample $n$ times and ask how many successes we have; we sample as many times as we need for one success. Later on we'll see this is really a specific case of another distribution, the negative binomial.

Let $X_i$ denote the outcome of the $i^{\text{th}}$ trial, where success is 1 and failure is 0. Let $N$ be the number of trials needed to observe the first success in a sequence of independent trials with probability of success $p$. Then we fail $k - 1$ times and succeed on the $k^{\text{th}}$ try, so:

$$P(N = k) = P\left(X_1 = 0, X_2 = 0, \ldots, X_{k-1} = 0, X_k = 1\right) = (1 - p)^{k-1} p$$

This is the probability of failure raised to the number of failures, times the probability of success.

The key characteristic of these trials is that we keep going until we succeed. There's no $n$ choose $k$ in front like the binomial distribution, because there's exactly one sequence that gives us the first success on trial $k$.

Definition.

Let $0 < p \leq 1$. A random variable $X$ has the geometric distribution with success parameter $p$ if the possible values of $X$ are $\{1, 2, 3, \ldots\}$ and $X$ satisfies

$$P(X = k) = (1 - p)^{k-1} p$$

for positive integers $k$. Abbreviate this by $X \sim \text{Geom}(p)$.

Example.

What is the probability it takes more than seven rolls of a fair die to roll a six?

Let $X$ be the number of rolls of a fair die until the first six. Then $X \sim \text{Geom}\left(\frac{1}{6}\right)$. Now we just want $P(X > 7)$.

$$P(X > 7) = \sum_{k=8}^{\infty} P(X = k) = \sum_{k=8}^{\infty} \left(\frac{5}{6}\right)^{k-1} \frac{1}{6}$$

Re-indexing,

$$\sum_{k=8}^{\infty} \left(\frac{5}{6}\right)^{k-1} \frac{1}{6} = \frac{1}{6}\left(\frac{5}{6}\right)^{7} \sum_{j=0}^{\infty} \left(\frac{5}{6}\right)^{j}$$

Now we calculate by standard methods:

$$\frac{1}{6}\left(\frac{5}{6}\right)^{7} \sum_{j=0}^{\infty} \left(\frac{5}{6}\right)^{j} = \frac{1}{6}\left(\frac{5}{6}\right)^{7} \cdot \frac{1}{1 - \frac{5}{6}} = \left(\frac{5}{6}\right)^{7}$$
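As a sanity check, the closed form $\left(\frac{5}{6}\right)^{7}$ agrees with a brute-force (truncated) sum of the p.m.f.; a short Python sketch:

```python
# P(X > 7) for X ~ Geom(1/6): closed form vs. truncated series.
p = 1 / 6
closed_form = (1 - p) ** 7
partial_sum = sum((1 - p) ** (k - 1) * p for k in range(8, 2000))

print(closed_form, partial_sum)  # both ~0.2791
```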

Negative binomial

As promised, here’s the negative binomial.

Consider a sequence of Bernoulli trials with the following characteristics:

  • Each trial results in success or failure

  • The probability of success $p$ is the same on each trial

  • Trials are independent (notice the number of trials is not fixed)

  • The experiment continues until $r$ successes are observed, where $r$ is a given parameter

Then if $X$ is the number of trials necessary until $r$ successes are observed, we say $X$ is a negative binomial random variable.

Immediately we see that the geometric distribution is just the negative binomial with $r = 1$.

Definition.

Let $k \in \mathbb{Z}^{+}$ and $0 < p \leq 1$. A random variable $X$ has the negative binomial distribution with parameters $(k, p)$ if the possible values of $X$ are the integers $\{k, k+1, k+2, \ldots\}$ and the p.m.f. is

$$P(X = n) = \binom{n-1}{k-1} p^{k} (1 - p)^{n-k} \quad \text{for } n \geq k$$

Abbreviate this by $X \sim \text{Negbin}(k, p)$.

Example.

Steph Curry has a three point percentage of approximately $43\%$. What is the probability that Steph makes his third three-point basket on his $5^{\text{th}}$ attempt?

Let $X$ be the number of attempts required to observe the 3rd success. Then

$$X \sim \text{Negbin}(k = 3, p = 0.43)$$

So,

$$P(X = 5) = \binom{5-1}{3-1}(0.43)^{3}(1 - 0.43)^{5-3} = \binom{4}{2}(0.43)^{3}(0.57)^{2} \approx 0.155$$
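The same number drops out of a direct evaluation of the p.m.f. in Python (helper name negbin_pmf is mine):

```python
from math import comb

def negbin_pmf(n: int, k: int, p: float) -> float:
    """P(X = n) for X ~ Negbin(k, p): the k-th success occurs on trial n."""
    return comb(n - 1, k - 1) * p**k * (1 - p) ** (n - k)

print(negbin_pmf(5, 3, 0.43))  # ~0.155
```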

Poisson distribution

This p.m.f. follows from the Taylor expansion

$$e^{\lambda} = \sum_{k=0}^{\infty} \frac{\lambda^{k}}{k!}$$

which implies that

$$\sum_{k=0}^{\infty} e^{-\lambda} \frac{\lambda^{k}}{k!} = e^{-\lambda} e^{\lambda} = 1$$

Definition.

For an integer valued random variable $X$, we say $X \sim \text{Poisson}(\lambda)$ if it has p.m.f.

$$P(X = k) = e^{-\lambda} \frac{\lambda^{k}}{k!}$$

for $k \in \{0, 1, 2, \ldots\}$ and $\lambda > 0$, so that

$$\sum_{k=0}^{\infty} P(X = k) = 1$$

The Poisson arises from the binomial. It applies in the binomial context when $n$ is very large ($n \geq 100$) and $p$ is very small ($p \leq 0.05$), such that $np$ is a moderate number ($np < 10$).

Then $X$ approximately follows a Poisson distribution with $\lambda = np$:

$$P\left(\text{Bin}(n, p) = k\right) \approx P\left(\text{Poisson}(\lambda = np) = k\right)$$

for $k = 0, 1, \ldots, n$.

The Poisson distribution is useful for finding the probabilities of rare events over a continuous interval of time. By knowing $\lambda = np$ for large $n$ and small $p$, we can calculate many probabilities.

Example.

The number of typing errors on a page of a textbook.

Let

  • $n$ be the number of letters or symbols per page (large)

  • $p$ be the probability of an error, small enough such that

  • $np = \lambda = 0.1$ in the limit $n \to \infty$, $p \to 0$

What is the probability of exactly 1 error?

We can approximate the distribution of $X$ with a $\text{Poisson}(\lambda = 0.1)$ distribution:

$$P(X = 1) = \frac{e^{-0.1}(0.1)^{1}}{1!} \approx 0.09048$$
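To see the approximation in action, the sketch below compares the exact binomial probabilities with the Poisson ones for a page with $n = 1000$ symbols and error probability $p = 0.0001$ (these particular numbers are made up for illustration; they just satisfy $np = 0.1$):

```python
from math import comb, exp, factorial

n, p = 1000, 0.0001   # assumed page size and error rate, np = 0.1
lam = n * p

for k in range(3):
    binom = comb(n, k) * p**k * (1 - p) ** (n - k)
    poisson = exp(-lam) * lam**k / factorial(k)
    print(k, round(binom, 6), round(poisson, 6))  # the two columns agree closely
```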

Continuous distributions

All of the distributions we've been analyzing have been discrete; that is, they apply to random variables with a countable state space. Even when the state space is infinite, as in the negative binomial, it is countable. We can think of it as indexing each trial with a natural number $0, 1, 2, 3, \ldots$.

Now we turn our attention to continuous random variables that operate on uncountably infinite state spaces. For example, if we sample uniformly inside of the interval $[0,1]$, there are uncountably many possible values we could obtain. We cannot index these values by the natural numbers; indeed, by standard results of set theory the interval $[0,1]$ is in bijection with $\mathbb{R}$ and has the cardinality of the continuum.

Additionally, we notice that asking for the probability that we pick one particular point in the interval $[0,1]$ makes no sense: there are infinitely many sample points! Intuitively, the probability of choosing any particular point should be 0. However, we should still be able to make statements about whether we choose a point that lies within a subset, like $[0, 0.5]$.

Let’s formalize these ideas.

Definition.

Let $X$ be a random variable. If we have a function $f$ such that

$$P(X \leq b) = \int_{-\infty}^{b} f(x)\,dx$$

for all $b \in \mathbb{R}$, then $f$ is the probability density function of $X$.

The probability that the value of $X$ lies in $(-\infty, b]$ equals the area under the curve of $f$ from $-\infty$ to $b$.

If $f$ satisfies this definition, then for any $B \subset \mathbb{R}$ for which integration makes sense,

$$P(X \in B) = \int_{B} f(x)\,dx$$

Remark.

Recall from our previous discussion of random variables that the PDF is the analogue of the PMF for discrete random variables.

Properties of a CDF:

Any CDF $F(x) = P(X \leq x)$ satisfies

  1. Limits: $F(-\infty) = 0$, $F(\infty) = 1$

  2. $F(x)$ is non-decreasing in $x$:

$$s < t \Rightarrow F(s) \leq F(t)$$

  3. $P(a < X \leq b) = P(X \leq b) - P(X \leq a) = F(b) - F(a)$

As we mentioned before, for a continuous random variable only questions like $P(X \leq k)$ carry probability; in fact $P(X = k) = 0$ for all $k$. An immediate corollary is that we can freely interchange $\leq$ and $<$, and likewise $\geq$ and $>$, since $P(X \leq k) = P(X < k)$ when $P(X = k) = 0$.

Example.

Let $X$ be a continuous random variable with density (p.d.f.)

$$f(x) = \begin{cases} cx^{2} & \text{for } 0 < x < 2 \\ 0 & \text{otherwise} \end{cases}$$

  1. What is $c$?

$c$ is such that $1 = \int_{-\infty}^{\infty} f(x)\,dx = \int_{0}^{2} cx^{2}\,dx = \frac{8c}{3}$, so $c = \frac{3}{8}$.

  2. Find the probability that $X$ is between 1 and 1.4.

Integrate the curve between 1 and 1.4:

$$\int_{1}^{1.4} \frac{3}{8}x^{2}\,dx = \left. \frac{x^{3}}{8} \right|_{1}^{1.4} = 0.218$$

This is the probability that $X$ lies between 1 and 1.4.

  3. Find the probability that $X$ is between 1 and 3.

Idea: integrate between 1 and 3, but be careful after 2, where the density is 0:

$$\int_{1}^{2} \frac{3}{8}x^{2}\,dx + \int_{2}^{3} 0\,dx = \frac{7}{8}$$

  4. What is the CDF $F(x) = P(X \leq x)$? Integrate the curve up to $x$:

$$F(x) = P(X \leq x) = \int_{-\infty}^{x} f(t)\,dt = \int_{0}^{x} \frac{3}{8}t^{2}\,dt = \frac{x^{3}}{8}$$

Important: include the range!

$$F(x) = \begin{cases} 0 & \text{for } x \leq 0 \\ \frac{x^{3}}{8} & \text{for } 0 < x < 2 \\ 1 & \text{for } x \geq 2 \end{cases}$$

  5. Find a point $a$ such that integrating up to $a$ gives exactly $\frac{1}{2}$ of the area.

We want $\frac{1}{2} = P(X \leq a)$:

$$\frac{1}{2} = P(X \leq a) = F(a) = \frac{a^{3}}{8} \Rightarrow a = \sqrt[3]{4}$$
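A short numerical sanity check of this worked example, using the CDF $F(x) = x^{3}/8$ derived above:

```python
# Checks P(1 < X < 1.4), P(1 < X < 3), and the median found above.
def F(x: float) -> float:
    """CDF of the density f(x) = (3/8) x^2 on (0, 2)."""
    if x <= 0:
        return 0.0
    if x >= 2:
        return 1.0
    return x**3 / 8

print(F(1.4) - F(1))    # ~0.218
print(F(3) - F(1))      # 0.875 = 7/8
print(F(4 ** (1 / 3)))  # 0.5, so the median is the cube root of 4
```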

Now let us discuss some named continuous distributions.

The (continuous) uniform distribution

The simplest and the best of the named distributions!

Definition.

Let $[a,b]$ be a bounded interval on the real line. A random variable $X$ has the uniform distribution on the interval $[a,b]$ if $X$ has the density function

$$f(x) = \begin{cases} \frac{1}{b-a} & \text{for } x \in [a,b] \\ 0 & \text{for } x \notin [a,b] \end{cases}$$

Abbreviate this by $X \sim \text{Unif}[a,b]$.

The graph of $\text{Unif}[a,b]$ is a constant line at height $\frac{1}{b-a}$ over $[a,b]$. The integral is just the area of a rectangle, and we can check it is 1.

Fact.

For $X \sim \text{Unif}[a,b]$, its cumulative distribution function (CDF) is given by:

$$F_X(x) = \begin{cases} 0 & \text{for } x < a \\ \frac{x-a}{b-a} & \text{for } x \in [a,b] \\ 1 & \text{for } x > b \end{cases}$$

Fact.

If $X \sim \text{Unif}[a,b]$ and $[c,d] \subset [a,b]$, then

$$P(c \leq X \leq d) = \int_{c}^{d} \frac{1}{b-a}\,dx = \frac{d-c}{b-a}$$

Example.

Let $Y$ be a uniform random variable on $[-2,5]$. Find the probability that its absolute value is at least 1.

$Y$ takes values in the interval $[-2,5]$, so the absolute value is at least 1 iff $Y \in [-2,-1] \cup [1,5]$.

The density function of $Y$ is $f(x) = \frac{1}{5-(-2)} = \frac{1}{7}$ on $[-2,5]$ and 0 everywhere else.

So,

$$P(|Y| \geq 1) = P\left(Y \in [-2,-1] \cup [1,5]\right) = P(-2 \leq Y \leq -1) + P(1 \leq Y \leq 5) = \frac{1}{7} + \frac{4}{7} = \frac{5}{7}$$

The exponential distribution

The geometric distribution can be viewed as modeling waiting times in a discrete setting, i.e. we wait for $n - 1$ failures to arrive at the $n^{\text{th}}$ success.

The exponential distribution is the continuous analogue of the geometric distribution, in that we often use it to model waiting times in the continuous sense: for example, the arrival of the first customer at a barber shop.

Definition.

Let $0 < \lambda < \infty$. A random variable $X$ has the exponential distribution with parameter $\lambda$ if $X$ has PDF

$$f(x) = \begin{cases} \lambda e^{-\lambda x} & \text{for } x \geq 0 \\ 0 & \text{for } x < 0 \end{cases}$$

Abbreviate this by $X \sim \text{Exp}(\lambda)$, the exponential distribution with rate $\lambda$.

The CDF of the $\text{Exp}(\lambda)$ distribution is given by:

$$F(t) = \begin{cases} 0 & \text{if } t < 0 \\ 1 - e^{-\lambda t} & \text{if } t \geq 0 \end{cases}$$

Example.

Suppose the length of a phone call, in minutes, is well modeled by an exponential random variable with rate $\lambda = \frac{1}{10}$.

  1. What is the probability that a call takes more than 8 minutes?

  2. What is the probability that a call takes between 8 and 22 minutes?

Let $X$ be the length of the phone call, so that $X \sim \text{Exp}\left(\frac{1}{10}\right)$. Then we can find the desired probability by:

$$P(X > 8) = 1 - P(X \leq 8) = 1 - F_X(8) = 1 - \left(1 - e^{-\frac{1}{10} \cdot 8}\right) = e^{-\frac{8}{10}} \approx 0.4493$$

Now to find $P(8 < X < 22)$, we can take the difference of the tail probabilities:

$$P(X > 8) - P(X > 22) = e^{-\frac{8}{10}} - e^{-\frac{22}{10}} \approx 0.3385$$
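Both phone-call probabilities fall straight out of the exponential tail $P(X > t) = e^{-\lambda t}$; a quick check:

```python
from math import exp

lam = 1 / 10

def tail(t: float) -> float:
    """P(X > t) for X ~ Exp(lam)."""
    return exp(-lam * t)

print(tail(8))             # ~0.4493
print(tail(8) - tail(22))  # ~0.3385
```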

Fact (Memoryless property of the exponential distribution).

Suppose that $X \sim \text{Exp}(\lambda)$. Then for any $s, t > 0$, we have

$$P(X > t + s \mid X > t) = P(X > s)$$

This says, for example: given that I've already waited $t = 5$ minutes for the bus, the probability that I end up waiting more than $5 + 3$ minutes in total is exactly the probability that I wait more than 3 minutes starting from scratch.

Proof.

$$P(X > t + s \mid X > t) = \frac{P(\{X > t + s\} \cap \{X > t\})}{P(X > t)} = \frac{P(X > t + s)}{P(X > t)} = \frac{e^{-\lambda(t+s)}}{e^{-\lambda t}} = e^{-\lambda s} = P(X > s)$$
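A tiny numerical illustration of the memoryless property; the rate and times below are arbitrary values chosen only for the demonstration:

```python
from math import exp

lam, s, t = 0.7, 3.0, 5.0                  # assumed illustrative values
lhs = exp(-lam * (t + s)) / exp(-lam * t)  # P(X > t + s | X > t)
rhs = exp(-lam * s)                        # P(X > s)
print(lhs, rhs)  # identical up to floating point error
```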

Gamma distribution

Definition.

Let $r, \lambda > 0$. A random variable $X$ has the gamma distribution with parameters $(r, \lambda)$ if $X$ is nonnegative and has probability density function

$$f(x) = \begin{cases} \frac{\lambda^{r} x^{r-1}}{\Gamma(r)} e^{-\lambda x} & \text{for } x \geq 0 \\ 0 & \text{for } x < 0 \end{cases}$$

Abbreviate this by $X \sim \text{Gamma}(r, \lambda)$.

The gamma function $\Gamma(r)$ generalizes the factorial function and is defined as

$$\Gamma(r) = \int_{0}^{\infty} x^{r-1} e^{-x}\,dx, \quad \text{for } r > 0$$

Special case: $\Gamma(n) = (n-1)!$ if $n \in \mathbb{Z}^{+}$.

Remark.

The $\text{Exp}(\lambda)$ distribution is a special case of the gamma distribution, with parameter $r = 1$.

The normal distribution

Also known as the Gaussian distribution, this is so important it gets its own section.

Definition.

A random variable $Z$ has the standard normal distribution if $Z$ has density function

$$\varphi(x) = \frac{1}{\sqrt{2\pi}} e^{-\frac{x^{2}}{2}}$$

on the real line. Abbreviate this by $Z \sim N(0,1)$.

Fact (CDF of a standard normal random variable).

Let $Z \sim N(0,1)$ be normally distributed. Then its CDF is given by

$$\Phi(x) = \int_{-\infty}^{x} \varphi(s)\,ds = \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi}} e^{-\frac{s^{2}}{2}}\,ds$$

The normal distribution is so important that, instead of the usual $f_Z(x)$ and $F_Z(x)$, we use the special notation $\varphi(x)$ and $\Phi(x)$.

Fact.

$$\int_{-\infty}^{\infty} e^{-\frac{s^{2}}{2}}\,ds = \sqrt{2\pi}$$

No closed form of the standard normal CDF Φ\Phi exists, so we are left to either:

  • approximate

  • use technology (calculator)

  • use the standard normal probability table in the textbook

To evaluate negative values, we can use the symmetry of the normal distribution to apply the following identity:

$$\Phi(-x) = 1 - \Phi(x)$$

General normal distributions

We can compute probabilities for a normal distribution with any parameters using the standard normal.

The general family of normal distributions is obtained by linear (affine) transformations of $Z$. Let $\mu$ be real and $\sigma > 0$; then

$$X = \sigma Z + \mu$$

is also a normally distributed random variable, with parameters $(\mu, \sigma^{2})$. The CDF of $X$ in terms of $\Phi(\cdot)$ can be expressed as

$$F_X(x) = P(X \leq x) = P(\sigma Z + \mu \leq x) = P\left(Z \leq \frac{x - \mu}{\sigma}\right) = \Phi\left(\frac{x - \mu}{\sigma}\right)$$

Also,

$$f(x) = F'(x) = \frac{d}{dx}\left[\Phi\left(\frac{x - \mu}{\sigma}\right)\right] = \frac{1}{\sigma}\varphi\left(\frac{x - \mu}{\sigma}\right) = \frac{1}{\sqrt{2\pi\sigma^{2}}} e^{-\frac{(x - \mu)^{2}}{2\sigma^{2}}}$$

Definition.

Let $\mu$ be real and $\sigma > 0$. A random variable $X$ has the normal distribution with mean $\mu$ and variance $\sigma^{2}$ if $X$ has density function

$$f(x) = \frac{1}{\sqrt{2\pi\sigma^{2}}} e^{-\frac{(x - \mu)^{2}}{2\sigma^{2}}}$$

on the real line. Abbreviate this by $X \sim N(\mu, \sigma^{2})$.

Fact.

Let $X \sim N(\mu, \sigma^{2})$ and $Y = aX + b$. Then

$$Y \sim N\left(a\mu + b, a^{2}\sigma^{2}\right)$$

That is, $Y$ is normally distributed with parameters $(a\mu + b, a^{2}\sigma^{2})$. In particular, $Z = \frac{X - \mu}{\sigma} \sim N(0,1)$ is a standard normal variable.
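Since $\Phi$ has no closed form, in practice we evaluate it numerically; the identity $\Phi(z) = \frac{1}{2}\left(1 + \operatorname{erf}\left(\frac{z}{\sqrt{2}}\right)\right)$ lets us do this with the standard library. The sketch below standardizes and computes a probability for a normal variable whose parameters I've made up for illustration:

```python
from math import erf, sqrt

def Phi(z: float) -> float:
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def normal_cdf(x: float, mu: float, sigma: float) -> float:
    """P(X <= x) for X ~ N(mu, sigma^2), via standardization."""
    return Phi((x - mu) / sigma)

# Assumed example: X ~ N(10, 4), so sigma = 2. Then P(8 < X <= 13):
print(normal_cdf(13, 10, 2) - normal_cdf(8, 10, 2))  # ~0.7745
```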

Expectation

Let’s discuss the expectation of a random variable, which is a similar idea to the basic concept of mean.

Definition.

The expectation or mean of a discrete random variable $X$ is the weighted average of its values, with weights assigned by the corresponding probabilities:

$$E(X) = \sum_{\text{all } x_i} x_i \cdot p(x_i)$$

Example.

Find the expected value of a single roll of a fair die.

  • $X =$ the number of dots showing

  • $x = 1, 2, 3, 4, 5, 6$

  • $p(x) = \frac{1}{6}$ for each value

$$E[X] = 1 \cdot \frac{1}{6} + 2 \cdot \frac{1}{6} + \cdots + 6 \cdot \frac{1}{6} = \frac{7}{2}$$

Binomial expected value

$$E[X] = np$$

Bernoulli expected value

Bernoulli is just binomial with one trial.

Recall that $P(X = 1) = p$ and $P(X = 0) = 1 - p$.

$$E[X] = 1 \cdot P(X = 1) + 0 \cdot P(X = 0) = p$$

Let $A$ be an event on $\Omega$. Its indicator random variable $I_A$ is defined for $\omega \in \Omega$ by

$$I_A(\omega) = \begin{cases} 1, & \text{if } \omega \in A \\ 0, & \text{if } \omega \notin A \end{cases}$$

$$E[I_A] = 1 \cdot P(A) = P(A)$$

Geometric expected value

Let $p \in (0,1]$ and $X \sim \text{Geom}(p)$ be a geometric RV with probability of success $p$. Recall that the p.m.f. is $p q^{k-1}$, where the probability of failure is $q := 1 - p$.

Then

$$E[X] = \sum_{k=1}^{\infty} k p q^{k-1} = p \sum_{k=1}^{\infty} k q^{k-1}$$

Now recall from calculus that you can differentiate a power series term by term inside its radius of convergence. So for $|t| < 1$,

$$\sum_{k=1}^{\infty} k t^{k-1} = \sum_{k=1}^{\infty} \frac{d}{dt} t^{k} = \frac{d}{dt} \sum_{k=1}^{\infty} t^{k} = \frac{d}{dt}\left(\frac{t}{1-t}\right) = \frac{1}{(1-t)^{2}}$$

$$\therefore \quad E[X] = p \sum_{k=1}^{\infty} k q^{k-1} = p \cdot \frac{1}{(1-q)^{2}} = \frac{1}{p}$$
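A truncated version of the series confirms $E[X] = \frac{1}{p}$ numerically (here with an arbitrary $p = 0.25$):

```python
# Partial sum of E[X] = sum_k k * p * q^(k-1) for X ~ Geom(p).
p = 0.25
q = 1 - p
approx = sum(k * p * q ** (k - 1) for k in range(1, 500))
print(approx, 1 / p)  # both ~4.0
```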

Expected value of a continuous RV

Definition.

The expectation or mean of a continuous random variable $X$ with density function $f$ is

$$E[X] = \int_{-\infty}^{\infty} x \cdot f(x)\,dx$$

An alternative symbol is $\mu = E[X]$.

$\mu$ is the "first moment" of $X$; in analogy with physics, it is the "center of gravity" of $X$.

Remark.

In general when moving between discrete and continuous RV, replace sums with integrals, p.m.f. with p.d.f., and vice versa.

Example.

Suppose $X$ is a continuous RV with p.d.f.

$$f_X(x) = \begin{cases} 2x, & 0 < x < 1 \\ 0, & \text{elsewhere} \end{cases}$$

$$E[X] = \int_{-\infty}^{\infty} x \cdot f(x)\,dx = \int_{0}^{1} x \cdot 2x\,dx = \frac{2}{3}$$

Example (Uniform expectation).

Let $X$ be a uniform random variable on the interval $[a,b]$, $X \sim \text{Unif}[a,b]$. Find the expected value of $X$.

$$E[X] = \int_{-\infty}^{\infty} x \cdot f(x)\,dx = \int_{a}^{b} \frac{x}{b-a}\,dx = \frac{1}{b-a} \cdot \frac{b^{2} - a^{2}}{2} = \underbrace{\frac{b+a}{2}}_{\text{midpoint formula}}$$

Example (Exponential expectation).

Find the expected value of an exponential RV, with p.d.f.

$$f_X(x) = \begin{cases} \lambda e^{-\lambda x}, & x > 0 \\ 0, & \text{elsewhere} \end{cases}$$

Integrating by parts,

$$E[X] = \int_{0}^{\infty} x \cdot \lambda e^{-\lambda x}\,dx = \lambda \left[ \left. -x \frac{1}{\lambda} e^{-\lambda x} \right|_{x=0}^{x=\infty} - \int_{0}^{\infty} -\frac{1}{\lambda} e^{-\lambda x}\,dx \right] = \frac{1}{\lambda}$$

Example (Uniform dartboard).

Our dartboard is a disk of radius $r_0$ and the dart lands uniformly at random on the disk when thrown. Let $R$ be the distance of the dart from the center of the disk. Find $E[R]$, given the density function

$$f_R(t) = \begin{cases} \frac{2t}{r_0^{2}}, & 0 \leq t \leq r_0 \\ 0, & t < 0 \text{ or } t > r_0 \end{cases}$$

$$E[R] = \int_{-\infty}^{\infty} t f_R(t)\,dt = \int_{0}^{r_0} t \cdot \frac{2t}{r_0^{2}}\,dt = \frac{2}{3} r_0$$

Expectation of derived values

If we can find the expected value of $X$, can we find the expected value of $X^{2}$? More precisely, can we find $E[X^{2}]$?

If the distribution is easy to see, then this is trivial. Otherwise we have the following useful property. For continuous RVs,

$$E[X^{2}] = \int_{\text{all } x} x^{2} f_X(x)\,dx$$

And in the discrete case,

$$E[X^{2}] = \sum_{\text{all } x} x^{2} p_X(x)$$

In fact $E[X^{2}]$ is so important that we call it the mean square.

Fact.

More generally, a real valued function $g(X)$ defined on the range of $X$ is itself a random variable (with its own distribution).

We can find the expected value of $g(X)$ by

$$E[g(X)] = \int_{-\infty}^{\infty} g(x) f(x)\,dx$$

in the continuous case, or

$$E[g(X)] = \sum_{\text{all } x} g(x) p_X(x)$$

in the discrete case.

Example.

You roll a fair die to determine the winnings (or losses) $W$ of a player as follows:

$$W = \begin{cases} -1, & \text{if the roll is 1, 2, or 3} \\ 1, & \text{if the roll is a 4} \\ 3, & \text{if the roll is 5 or 6} \end{cases}$$

What are the expected winnings/losses for the player during 1 roll of the die?

Let $X$ denote the outcome of the roll of the die. Then we can define our random variable as $W = g(X)$, where the function $g$ is defined by $g(1) = g(2) = g(3) = -1$ and so on.

Note that $P(W = -1) = P(X = 1 \cup X = 2 \cup X = 3) = \frac{1}{2}$. Likewise $P(W = 1) = P(X = 4) = \frac{1}{6}$, and $P(W = 3) = P(X = 5 \cup X = 6) = \frac{1}{3}$.

Then

$$E[g(X)] = E[W] = (-1) \cdot P(W = -1) + (1) \cdot P(W = 1) + (3) \cdot P(W = 3) = -\frac{1}{2} + \frac{1}{6} + 1 = \frac{2}{3}$$

Example.

A stick of length $l$ is broken at a uniformly chosen random location. What is the expected length of the longer piece?

Idea: if you break it before the halfway point, then the longer piece has length $l - x$. If you break it after the halfway point, the longer piece has length $x$.

Let the interval $[0,l]$ represent the stick and let $X \sim \text{Unif}[0,l]$ be the location where the stick is broken. Then $X$ has density $f(x) = \frac{1}{l}$ on $[0,l]$ and 0 elsewhere.

Let $g(x)$ be the length of the longer piece when the stick is broken at $x$:

$$g(x) = \begin{cases} l - x, & 0 \leq x < \frac{l}{2} \\ x, & \frac{l}{2} \leq x \leq l \end{cases}$$

Then

$$E[g(X)] = \int_{-\infty}^{\infty} g(x) f(x)\,dx = \int_{0}^{\frac{l}{2}} \frac{l - x}{l}\,dx + \int_{\frac{l}{2}}^{l} \frac{x}{l}\,dx = \frac{3}{4}l$$

So we expect the longer piece to be $\frac{3}{4}$ of the total length, which is perhaps surprising.
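A quick Monte Carlo simulation agrees with the $\frac{3}{4}l$ answer (here with $l = 1$; the trial count is arbitrary):

```python
import random

l = 1.0
trials = 200_000
total = sum(max(x, l - x) for x in (random.uniform(0, l) for _ in range(trials)))
print(total / trials)  # ~0.75
```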

Moments of a random variable

We continue discussing expectation but we introduce new terminology.

Fact.

The $n^{\text{th}}$ moment (or $n^{\text{th}}$ raw moment) of a discrete random variable $X$ with p.m.f. $p_X(x)$ is the expectation

$$E[X^{n}] = \sum_{k} k^{n} p_X(k) = \mu_n$$

If $X$ is continuous, then analogously

$$E[X^{n}] = \int_{-\infty}^{\infty} x^{n} f_X(x)\,dx = \mu_n$$

The standard deviation is $\sigma$, the variance is $\sigma^{2}$, and

$$\sigma^{2} = \mu_2 - \left(\mu_1\right)^{2}$$

$\mu_3$ is used to measure the skewness (asymmetry) of a distribution. For example, the normal distribution is perfectly symmetric.

$\mu_4$ is used to measure the kurtosis (peakedness) of a distribution.

Central moments

Previously we discussed “raw moments.” Be careful not to confuse them with central moments.

Fact.

The $n^{\text{th}}$ central moment of a discrete random variable $X$ with p.m.f. $p_X(x)$ is the expected value of the difference about the mean raised to the $n^{\text{th}}$ power:

$$E\left[(X - \mu)^{n}\right] = \sum_{k} (k - \mu)^{n} p_X(k) = \mu_n'$$

And of course in the continuous case,

$$E\left[(X - \mu)^{n}\right] = \int_{-\infty}^{\infty} (x - \mu)^{n} f_X(x)\,dx = \mu_n'$$

In particular,

$$\mu_1' = E\left[(X - \mu)\right] = \int_{-\infty}^{\infty} (x - \mu) f_X(x)\,dx = \int_{-\infty}^{\infty} x f_X(x)\,dx - \int_{-\infty}^{\infty} \mu f_X(x)\,dx = \mu - \mu \cdot 1 = 0$$

$$\mu_2' = E\left[(X - \mu)^{2}\right] = \sigma_X^{2} = \text{Var}(X)$$

Example.

Let $Y$ be a uniformly chosen integer from $\{0, 1, 2, \ldots, m\}$. Find the first and second moments of $Y$.

The p.m.f. of $Y$ is $p_Y(k) = \frac{1}{m+1}$ for $k \in \{0, 1, \ldots, m\}$. Thus,

$$E[Y] = \sum_{k=0}^{m} k \cdot \frac{1}{m+1} = \frac{1}{m+1} \sum_{k=0}^{m} k = \frac{1}{m+1} \cdot \frac{m(m+1)}{2} = \frac{m}{2}$$

Then,

$$E[Y^{2}] = \sum_{k=0}^{m} k^{2} \cdot \frac{1}{m+1} = \frac{1}{m+1} \cdot \frac{m(m+1)(2m+1)}{6} = \frac{m(2m+1)}{6}$$

Example.

Let $c > 0$ and let $U$ be a uniform random variable on the interval $[0,c]$. Find the $n^{\text{th}}$ moment of $U$ for all positive integers $n$.

The density function of $U$ is

$$f(x) = \begin{cases} \frac{1}{c}, & \text{if } x \in [0,c] \\ 0, & \text{otherwise} \end{cases}$$

Therefore the $n^{\text{th}}$ moment of $U$ is

$$E[U^{n}] = \int_{-\infty}^{\infty} x^{n} f(x)\,dx = \int_{0}^{c} \frac{x^{n}}{c}\,dx = \frac{c^{n}}{n+1}$$

Example.

Suppose the random variable $X \sim \text{Exp}(\lambda)$. Find the second moment of $X$.

Substituting $u = \lambda x$,

$$E[X^{2}] = \int_{0}^{\infty} x^{2} \lambda e^{-\lambda x}\,dx = \frac{1}{\lambda^{2}} \int_{0}^{\infty} u^{2} e^{-u}\,du = \frac{1}{\lambda^{2}} \Gamma(2 + 1) = \frac{2!}{\lambda^{2}}$$

Fact.

In general, the $n^{\text{th}}$ moment of $X \sim \text{Exp}(\lambda)$ is

$$E[X^{n}] = \int_{0}^{\infty} x^{n} \lambda e^{-\lambda x}\,dx = \frac{n!}{\lambda^{n}}$$

Median and quartiles

When a random variable has rare (abnormal) values, its expectation may be a bad indicator of where the center of the distribution lies.

Definition.

The median of a random variable $X$ is any real value $m$ that satisfies

$$P(X \geq m) \geq \frac{1}{2} \quad \text{and} \quad P(X \leq m) \geq \frac{1}{2}$$

With at least half the probability on both $\{X \leq m\}$ and $\{X \geq m\}$, the median is representative of the midpoint of the distribution. We say that the median is more robust because it is less affected by outliers. It is not necessarily unique.

Example.

Let $X$ be uniformly distributed on the set $\{-100, 1, 2, 3, \ldots, 9\}$, so $X$ has probability mass function

$$p_X(-100) = p_X(1) = \cdots = p_X(9) = \frac{1}{10}$$

Find the expected value and median of $X$.

$$E[X] = (-100) \cdot \frac{1}{10} + (1) \cdot \frac{1}{10} + \cdots + (9) \cdot \frac{1}{10} = -5.5$$

while the median is any number $m \in [4,5]$.

The median reflects the fact that 90% of the values (and of the probability) lie in the range $1, 2, \ldots, 9$, while the mean is heavily influenced by the $-100$ value.


An assortment of preliminaries on linear algebra

and also a test for pandoc

2025-02-15

This entire document was written entirely in Typst and directly translated to this file by Pandoc. It serves as a proof of concept of a way to do static site generation from Typst files instead of Markdown.


I figured I should write this stuff down before I forgot it.

Basic Notions

Vector spaces

Before we can understand vectors, we need to first discuss vector spaces. Thus far, you have likely encountered vectors primarily in physics classes, generally in the two-dimensional plane. You may conceptualize them as arrows in space. For vectors of size $> 3$, a hand waving argument is made that they are essentially just arrows in higher dimensional spaces.

It is helpful to take a step back from this primitive geometric understanding of the vector. Let us build up a rigorous idea of vectors from first principles.

Vector axioms

The so-called axioms of a vector space (which we'll call the vector space $V$) are as follows:

  1. Commutativity: $u + v = v + u, \ \forall u, v \in V$

  2. Associativity: $(u + v) + w = u + (v + w), \ \forall u, v, w \in V$

  3. Zero vector: there exists a special vector, denoted $0$, such that $v + 0 = v, \ \forall v \in V$

  4. Additive inverse: $\forall v \in V, \ \exists w \in V$ such that $v + w = 0$. Such an additive inverse is generally denoted $-v$

  5. Multiplicative identity: $1v = v, \ \forall v \in V$

  6. Multiplicative associativity: $(\alpha\beta)v = \alpha(\beta v), \ \forall v \in V$ and scalars $\alpha, \beta$

  7. Distributive property for vectors: $\alpha(u + v) = \alpha u + \alpha v, \ \forall u, v \in V$ and scalars $\alpha$

  8. Distributive property for scalars: $(\alpha + \beta)v = \alpha v + \beta v, \ \forall v \in V$ and scalars $\alpha, \beta$

It is easy to show that the zero vector $0$ and the additive inverse $-v$ are unique. We leave the proof of this fact as an exercise.

These may seem difficult to memorize, but they are essentially the same familiar algebraic properties of numbers you know from high school. The important thing to remember is which operations are valid for what objects. For example, you cannot add a vector and scalar, as it does not make sense.

Remark. For those of you versed in computer science, you may recognize this as essentially saying that you must ensure your operations are type-safe. Adding a vector and a scalar is not just "wrong" in the same sense that $1 + 1 = 3$ is wrong; it is an invalid question entirely, because vectors and scalars are different types of mathematical objects. See [@chen2024digression] for more.

Vectors big and small

In order to begin your descent into what mathematicians colloquially recognize as abstract vapid nonsense, let's discuss which fields constitute a vector space. We have the familiar field $\mathbb{R}$ where all scalars are real numbers, with corresponding vector spaces $\mathbb{R}^{n}$, where $n$ is the length of the vector. We generally discuss 2D or 3D vectors, corresponding to vectors of length 2 or 3; in our case, $\mathbb{R}^{2}$ and $\mathbb{R}^{3}$.

However, vectors in $\mathbb{R}^{n}$ can really be of any length. Vectors can be viewed as arbitrary length lists of numbers (for the computer science folk: think C++ std::vector).

Example.

$$\begin{pmatrix} 1 \\ 2 \\ 3 \\ 4 \\ 5 \\ 6 \\ 7 \\ 8 \\ 9 \end{pmatrix} \in \mathbb{R}^{9}$$

Keep in mind that vectors need not be in $\mathbb{R}^{n}$ at all. Recall that a vector space need only satisfy the aforementioned axioms of a vector space.

Example. The vector space $\mathbb{C}^{n}$ is similar to $\mathbb{R}^{n}$, except it includes complex numbers. All complex vector spaces are real vector spaces (as you can simply restrict them to only use the real numbers), but not the other way around.

From now on, let us refer to the vector spaces $\mathbb{R}^{n}$ and $\mathbb{C}^{n}$ as $\mathbb{F}^{n}$.

In general, we can have a vector space where the scalars are in an arbitrary field, as long as the axioms are satisfied.

Example. The vector space of all polynomials of degree at most 3, or $\mathbb{P}^{3}$. It is not yet clear what these vectors may look like. We shall return to this example once we discuss bases.

Vector addition and multiplication

Vector addition, represented by $+$, is done entrywise. Entrywise multiplication, denoted $\cdot$ here, works the same way.

Example.

$$\begin{pmatrix} 1 \\ 2 \\ 3 \end{pmatrix} + \begin{pmatrix} 4 \\ 5 \\ 6 \end{pmatrix} = \begin{pmatrix} 1 + 4 \\ 2 + 5 \\ 3 + 6 \end{pmatrix} = \begin{pmatrix} 5 \\ 7 \\ 9 \end{pmatrix}, \qquad \begin{pmatrix} 1 \\ 2 \\ 3 \end{pmatrix} \cdot \begin{pmatrix} 4 \\ 5 \\ 6 \end{pmatrix} = \begin{pmatrix} 1 \cdot 4 \\ 2 \cdot 5 \\ 3 \cdot 6 \end{pmatrix} = \begin{pmatrix} 4 \\ 10 \\ 18 \end{pmatrix}$$

This is simple enough to understand. Again, the difficulty is simply ensuring that you always perform operations with the correct types. For example, once we introduce matrices, it doesn't make sense to multiply or add vectors and matrices in this fashion.

Vector-scalar multiplication

Multiplying a vector by a scalar simply results in each entry of the vector being multiplied by the scalar.

Example.

$$\beta \begin{pmatrix} a \\ b \\ c \end{pmatrix} = \begin{pmatrix} \beta \cdot a \\ \beta \cdot b \\ \beta \cdot c \end{pmatrix}$$

Linear combinations

Given a vector space $V$, vectors $v, w \in V$, and scalars $\alpha, \beta$, the vector $\alpha v + \beta w$ is a linear combination of $v$ and $w$.

Spanning systems

We say that a set of vectors $v_1, v_2, \ldots, v_n \in V$ spans $V$ if every vector $v \in V$ can be represented as a linear combination of them.

Precisely, for every $v \in V$ there exist scalars $\alpha_1, \alpha_2, \ldots, \alpha_n$ such that

$$\alpha_1 v_1 + \alpha_2 v_2 + \ldots + \alpha_n v_n = v$$

Note that any scalar $\alpha_k$ could be 0. Therefore, it is possible for a subset of a spanning system to also be a spanning system. The proof of this fact is left as an exercise.

Intuition for linear independence and dependence

We say that $v$ and $w$ are linearly independent if $v$ cannot be represented as a scaling of $w$, and $w$ cannot be represented as a scaling of $v$. Otherwise, they are linearly dependent.

You may intuitively visualize linear dependence in the 2D plane as two vectors both pointing in the same direction. Clearly, scaling one vector will allow us to reach the other vector. Linear independence is therefore two vectors pointing in different directions.

Of course, this definition applies to vectors in any $\mathbb{F}^{n}$.

Formal definition of linear dependence and independence

Let us formally define linear independence for arbitrary vectors in $\mathbb{F}^{n}$. Given a set of vectors

$$v_1, v_2, \ldots, v_n \in V$$

we say they are linearly independent iff the equation

$$\alpha_1 v_1 + \alpha_2 v_2 + \ldots + \alpha_n v_n = 0$$

has only the trivial solution, in which all of the $\alpha_i$ are zero.

Equivalently,

$$\left|\alpha_1\right| + \left|\alpha_2\right| + \ldots + \left|\alpha_n\right| = 0$$

More precisely,

$$\sum_{i=1}^{n} \left|\alpha_i\right| = 0$$

Therefore, a set of vectors $v_1, v_2, \ldots, v_m$ is linearly dependent if the opposite is true, that is, there exists a solution $\alpha_1, \alpha_2, \ldots, \alpha_m$ to the equation

$$\alpha_1 v_1 + \alpha_2 v_2 + \ldots + \alpha_m v_m = 0$$

such that

$$\sum_{i=1}^{m} \left|\alpha_i\right| \neq 0$$

Basis

We say a system of vectors $v_1, v_2, \ldots, v_n \in V$ is a basis of $V$ if the system is both linearly independent and spanning. That is, the system must be able to represent any vector in $V$ as well as satisfy our requirements for linear independence.

Equivalently, we may say that a system of vectors in $V$ is a basis of $V$ if any vector $v \in V$ admits a unique representation as a linear combination of vectors in the system. This is equivalent to our previous statement, that the system must be spanning and linearly independent.

Standard basis

We may define a standard basis for a vector space. By convention, the standard basis in $\mathbb{R}^{2}$ is

$$\begin{pmatrix} 1 \\ 0 \end{pmatrix}, \begin{pmatrix} 0 \\ 1 \end{pmatrix}$$

Verify that the above is in fact a basis (that is, linearly independent and generating).

Recalling the definition of a basis, we can represent any vector in $\mathbb{R}^{2}$ as a linear combination of the standard basis.

Therefore, for any arbitrary vector $v \in \mathbb{R}^{2}$, we can represent it as

$$v = \alpha_1 \begin{pmatrix} 1 \\ 0 \end{pmatrix} + \alpha_2 \begin{pmatrix} 0 \\ 1 \end{pmatrix}$$

Let us call $\alpha_1$ and $\alpha_2$ the coordinates of the vector. Then, we can write $v$ as

$$v = \begin{pmatrix} \alpha_1 \\ \alpha_2 \end{pmatrix}$$

For example, the vector

$$\begin{pmatrix} 1 \\ 2 \end{pmatrix}$$

represents

$$1 \cdot \begin{pmatrix} 1 \\ 0 \end{pmatrix} + 2 \cdot \begin{pmatrix} 0 \\ 1 \end{pmatrix}$$

Verify that this aligns with your previous intuition of vectors.

You may recognize the standard basis in $\mathbb{R}^{2}$ as the familiar unit vectors

$$\hat{i}, \hat{j}$$

This aligns with the fact that

$$\begin{pmatrix} \alpha \\ \beta \end{pmatrix} = \alpha\hat{i} + \beta\hat{j}$$

However, we may define a standard basis in any arbitrary vector space. So, let

$$e_1, e_2, \ldots, e_n$$

be a standard basis in $\mathbb{F}^{n}$. Then, the coordinates $\alpha_1, \alpha_2, \ldots, \alpha_n$ of a vector $v \in \mathbb{F}^{n}$ represent the following:

$$\begin{pmatrix} \alpha_1 \\ \alpha_2 \\ \vdots \\ \alpha_n \end{pmatrix} = \alpha_1 e_1 + \alpha_2 e_2 + \cdots + \alpha_n e_n$$

Using our new notation, the standard basis in $\mathbb{R}^{2}$ is

$$e_1 = \begin{pmatrix} 1 \\ 0 \end{pmatrix}, \quad e_2 = \begin{pmatrix} 0 \\ 1 \end{pmatrix}$$

Matrices

Before discussing any properties of matrices, let's simply reiterate what we learned in class about their notation. We say a matrix with $m$ rows and $n$ columns (in less precise terms, a matrix of height $m$ and width $n$) is an $m \times n$ matrix.

Given a matrix

$$A = \begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{pmatrix}$$

we refer to the entry in row $j$ and column $k$ as $A_{j,k}$.

Matrix transpose

A formalism that is useful later on is called the transpose, which we obtain from a matrix $A$ by switching all the rows and columns. More precisely, each row becomes a column instead. We use the notation $A^{T}$ to represent the transpose of $A$.

$$\begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{pmatrix}^{T} = \begin{pmatrix} 1 & 4 \\ 2 & 5 \\ 3 & 6 \end{pmatrix}$$

Formally, we can say $\left(A^{T}\right)_{j,k} = A_{k,j}$.

Linear transformations

A linear transformation $T : V \to W$ is a mapping between two vector spaces $V$ and $W$ such that the following axioms are satisfied:

  1. $T(v + w) = T(v) + T(w), \ \forall v, w \in V$

  2. $T(\alpha v) = \alpha T(v), \ \forall v \in V$ and all scalars $\alpha$

Definition. $T$ is a linear transformation iff

$$T(\alpha v + \beta w) = \alpha T(v) + \beta T(w)$$

for all $v, w \in V$ and scalars $\alpha, \beta$.

Abuse of notation. From now on, we may elide the parentheses and write $T(v) = Tv, \ \forall v \in V$.

Remark. A phrase that you may commonly hear is that linear transformations preserve linearity. Essentially, straight lines remain straight, parallel lines remain parallel, and the origin remains fixed at 0. Take a moment to think about why this is true (at least, in lower dimensional spaces you can visualize).

Examples.

  1. Rotation for $V = W = \mathbb{R}^{2}$ (i.e. rotation in 2 dimensions). Given $v, w \in \mathbb{R}^{2}$ and their linear combination $v + w$, a rotation by $\gamma$ radians of $v + w$ is equivalent to first rotating $v$ and $w$ individually by $\gamma$ and then taking their linear combination.

  2. Differentiation of polynomials. In this case $V = \mathbb{P}^{n}$ and $W = \mathbb{P}^{n-1}$, where $\mathbb{P}^{n}$ is the vector space of all polynomials of degree at most $n$:

$$\frac{d}{dx}(\alpha v + \beta w) = \alpha \frac{d}{dx} v + \beta \frac{d}{dx} w, \quad \forall v, w \in V, \ \forall \text{ scalars } \alpha, \beta$$

Matrices represent linear transformations

Suppose we wanted to represent a linear transformation T:𝔽n𝔽mT:{\mathbb{F}}^{n} \rightarrow {\mathbb{F}}^{m}. I propose that we need encode how TT acts on the standard basis of 𝔽n{\mathbb{F}}^{n}.

Using our intuition from lower dimensional vector spaces, we know that the standard basis in 2{\mathbb{R}}^{2} is the unit vectors î\hat{i} and ĵ\hat{j}. Because linear transformations preserve linearity (i.e. all straight lines remain straight and parallel lines remain parallel), we can encode any transformation as simply changing î\hat{i} and ĵ\hat{j}. And indeed, if any vector v2v \in {\mathbb{R}}^{2} can be represented as the linear combination of î\hat{i} and ĵ\hat{j} (this is the definition of a basis), it makes sense both symbolically and geometrically that we can represent all linear transformations as the transformations of the basis vectors.

Example. To reflect all vectors v2v \in {\mathbb{R}}^{2} across the yy-axis, we can simply change the standard basis to

(10)(01)\begin{pmatrix} - 1 \\ 0 \end{pmatrix}\begin{pmatrix} 0 \\ 1 \end{pmatrix}

Then, any vector in 2{\mathbb{R}}^{2} using this new basis will be reflected across the yy-axis. Take a moment to justify this geometrically.

Writing a linear transformation as matrix

For any linear transformation T:𝔽m𝔽nT:{\mathbb{F}}^{m} \rightarrow {\mathbb{F}}^{n}, we can write it as an n×mn \times m matrix AA. That is, every linear transformation from 𝔽m𝔽n{\mathbb{F}}^{m} \rightarrow {\mathbb{F}}^{n} can be represented by a matrix AA with nn rows and mm columns.

How should we write this matrix? Naturally, from our previous discussion, we should write a matrix with each column being one of our new transformed basis vectors.

Example. Our yy-axis reflection transformation from earlier. We write the transformed basis vectors as the columns of a matrix

(1001)\begin{pmatrix} - 1 & 0 \\ 0 & 1 \end{pmatrix}

Matrix-vector multiplication

Perhaps you now see why the so-called matrix-vector multiplication is defined the way it is. Recalling our definition of a basis, given a basis in VV, any vector vVv \in V can be written as the linear combination of the vectors in the basis. Then, given a linear transformation represented by the matrix containing the new basis, we simply write the linear combination with the new basis instead.

Example. Let us first write a vector in the standard basis in 2{\mathbb{R}}^{2} and then show how our matrix-vector multiplication naturally corresponds to the definition of the linear transformation.

(12)2\begin{pmatrix} 1 \\ 2 \end{pmatrix} \in {\mathbb{R}}^{2}

is the same as

1(10)+2(01)1 \cdot \begin{pmatrix} 1 \\ 0 \end{pmatrix} + 2 \cdot \begin{pmatrix} 0 \\ 1 \end{pmatrix}

Then, to perform our reflection, we need only replace the basis vector (10)\begin{pmatrix} 1 \\ 0 \end{pmatrix} with (10)\begin{pmatrix} - 1 \\ 0 \end{pmatrix}.

Then, the reflected vector is given by

1(10)+2(01)=(12)1 \cdot \begin{pmatrix} - 1 \\ 0 \end{pmatrix} + 2 \cdot \begin{pmatrix} 0 \\ 1 \end{pmatrix} = \begin{pmatrix} - 1 \\ 2 \end{pmatrix}

We can clearly see that this is exactly how the matrix multiplication

\begin{pmatrix} - 1 & 0 \\ 0 & 1 \end{pmatrix} \cdot \begin{pmatrix} 1 \\ 2 \end{pmatrix}

is defined! The column-by-coordinate rule for matrix-vector multiplication says that we multiply the n-th entry of the vector by the corresponding n-th column of the matrix and sum them all up (take their linear combination). This algorithm intuitively follows from our definition of matrices.
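
In general, if A has columns a_1, …, a_m and the vector x has entries x_1, …, x_m, the rule reads:

Ax = x_{1}a_{1} + x_{2}a_{2} + \cdots + x_{m}a_{m}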

Matrix-matrix multiplication

As you may have noticed, a very similar natural definition arises for the matrix-matrix multiplication. Multiplying two matrices ABA \cdot B is essentially just taking each column of BB, and applying the linear transformation defined by the matrix AA!
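
For example, applying our reflection matrix to each column of another matrix (an illustrative computation):

\begin{pmatrix} - 1 & 0 \\ 0 & 1 \end{pmatrix}\begin{pmatrix} 1 & 3 \\ 2 & 4 \end{pmatrix} = \begin{pmatrix} - 1 & - 3 \\ 2 & 4 \end{pmatrix}

The first column of the result is the reflection applied to (1, 2), and the second is the reflection applied to (3, 4).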

]]>
Nix automatic hash updates made easy https://blog.youwen.dev/nix-automatic-hash-updates-made-easy.html 2024-12-28T00:00:00Z 2024-12-28T00:00:00Z

Nix automatic hash updates made easy

keep your flakes up to date

2024-12-28

Nix users often create flakes to package software out of tree, like this Zen Browser flake I’ve been maintaining. Keeping them up to date is a hassle though, since you have to update the Subresource Integrity (SRI) hashes that Nix uses to ensure reproducibility.

Here’s a neat method I’ve been using to cleanly handle automatic hash updates. I use Nushell to easily work with data, prefetch some hashes, and put it all in a JSON file that can be read by Nix at build time.

First, let’s create a file called update.nu. At the top, place this shebang:

#!/usr/bin/env -S nix shell nixpkgs#nushell --command nu

This will execute the script in a Nushell environment, which is fetched by Nix.

Get the up to date URLs

We need to obtain the latest version of whatever software we want to update. In this case, I’ll use GitHub releases as my source of truth.

You can use the GitHub API to fetch metadata about all the releases of a repository.

https://api.github.com/repos/($repo)/releases

Roughly speaking, the raw JSON returned by the GitHub releases API looks something like:

[
   {tag_name: "foo", prerelease: false, ...},
   {tag_name: "bar", prerelease: true, ...},
   {tag_name: "foobar", prerelease: false, ...},
]

Note that the objects in the array are ordered from newest to oldest.

Even if you aren’t using GitHub releases, as long as there is a reliable way to programmatically fetch the latest download URLs of whatever software you’re packaging, you can adapt this approach for your specific case.

We use Nushell’s http get to make a network request. Nushell will automatically detect and parse the JSON response into a Nushell table.

In my case, Zen Browser frequently publishes prerelease “twilight” builds which we don’t want to update to. So, we ignore any releases tagged “twilight” or marked “prerelease” by filtering them out with the where selector.

Finally, we retrieve the tag name of the item at the first index, which is the latest release (since the array is sorted newest first).

#!/usr/bin/env -S nix shell nixpkgs#nushell --command nu

# get the latest tag of the latest release that isn't a prerelease
def get_latest_release [repo: string] {
  try {
	http get $"https://api.github.com/repos/($repo)/releases"
	  | where prerelease == false
	  | where tag_name != "twilight"
	  | get tag_name
	  | get 0
  } catch { |err|
	print $"Failed to fetch latest release, aborting: ($err.msg)"
	exit 1
  }
}

Prefetching SRI hashes

Now that we have the latest tags, we can easily obtain the latest download URLs, which are of the form:

https://github.com/zen-browser/desktop/releases/download/$tag/zen.linux-x86_64.tar.bz2
https://github.com/zen-browser/desktop/releases/download/$tag/zen.linux-aarch64.tar.bz2

However, we still need the corresponding SRI hashes to pass to Nix.

src = fetchurl {
   url = "https://github.com/zen-browser/desktop/releases/download/1.0.2-b.5/zen.linux-x86_64.tar.bz2";
   hash = "sha256-00000000000000000000000000000000000000000000";
};

The easiest way to obtain these new hashes is to update the URL and then set the hash property to an empty string (""). Nix will spit out a hash mismatch error with the correct hash. However, this is inconvenient for automated command line scripting.
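
For instance, a minimal sketch of that manual workflow (reusing the URL from above):

src = fetchurl {
   url = "https://github.com/zen-browser/desktop/releases/download/1.0.2-b.5/zen.linux-x86_64.tar.bz2";
   # leave the hash empty; the build fails with a hash mismatch error
   # that prints the correct SRI hash to paste back in
   hash = "";
};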

The Nix documentation mentions nix-prefetch-url as a way to obtain these hashes, but it prints hashes in the older base32 format rather than SRI, and it has since been superseded by a more powerful but underdocumented experimental command.

The nix store prefetch-file command does what nix-prefetch-url is supposed to do, but handles the caveats that lead to the wrong hash being produced automatically.

Let’s write a Nushell function that outputs the SRI hash of the given URL. We tell prefetch-file to output structured JSON that we can parse.

Since Nushell is a shell, we can directly invoke shell commands like usual, and then process their output with pipes.

def get_nix_hash [url: string] {
  nix store prefetch-file --hash-type sha256 --json $url | from json | get hash
}

Cool! Now get_nix_hash can give us SRI hashes that look like this:

sha256-K3zTCLdvg/VYQNsfeohw65Ghk8FAjhOl8hXU6REO4/s=

Putting it all together

Now that we’re able to fetch the latest release, obtain the download URLs, and compute their SRI hashes, we have all the information we need to make an automated update. However, these URLs are typically hardcoded in our Nix expressions. The question remains as to how to update these values.

A common way I’ve seen updates performed is using something like sed to modify the Nix expressions in place. However, there’s a more maintainable and easier to understand approach.

Let’s have our Nushell script generate the URLs and hashes and place them in a JSON file! Then, we’ll be able to read the JSON file from Nix and obtain the URL and hash.

def generate_sources [] {
  let tag = get_latest_release "zen-browser/desktop"
  let prev_sources = open ./sources.json

  if $tag == $prev_sources.version {
	# everything up to date
	return $tag
  }

  # generate the download URLs with the new tag
  let x86_64_url = $"https://github.com/zen-browser/desktop/releases/download/($tag)/zen.linux-x86_64.tar.bz2"
  let aarch64_url = $"https://github.com/zen-browser/desktop/releases/download/($tag)/zen.linux-aarch64.tar.bz2"

  # create a Nushell record that maps cleanly to JSON
  let sources = {
    # add a version field as well for convenience
	version: $tag

	x86_64-linux: {
	  url:  $x86_64_url
	  hash: (get_nix_hash $x86_64_url)
	}
	aarch64-linux: {
	  url: $aarch64_url
	  hash: (get_nix_hash $aarch64_url)
	}
  }

  echo $sources | save --force "sources.json"

  return $tag
}

Running this script with

chmod +x ./update.nu
./update.nu

gives us the file sources.json:

{
  "version": "1.0.2-b.5",
  "x86_64-linux": {
    "url": "https://github.com/zen-browser/desktop/releases/download/1.0.2-b.5/zen.linux-x86_64.tar.bz2",
    "hash": "sha256-K3zTCLdvg/VYQNsfeohw65Ghk8FAjhOl8hXU6REO4/s="
  },
  "aarch64-linux": {
    "url": "https://github.com/zen-browser/desktop/releases/download/1.0.2-b.5/zen.linux-aarch64.tar.bz2",
    "hash": "sha256-NwIYylGal2QoWhWKtMhMkAAJQ6iNHfQOBZaxTXgvxAk="
  }
}

Now, let’s read this from Nix. My file organization looks like the following:

./
| flake.nix
| zen-browser-unwrapped.nix
| ...other files...

zen-browser-unwrapped.nix contains the derivation for Zen Browser. Let’s add version, url, and hash to its inputs:

{
  stdenv,
  fetchurl,
  # add these below
  version,
  url,
  hash,
  ...
}:
stdenv.mkDerivation {
   # inherit version from inputs
  inherit version;
  pname = "zen-browser-unwrapped";

  src = fetchurl {
    # inherit the URL and hash we obtain from the inputs
    inherit url hash;
  };
}

Then in flake.nix, let’s provide the derivation with the data from sources.json:

let
   supportedSystems = [
     "x86_64-linux"
     "aarch64-linux"
   ];
   forAllSystems = nixpkgs.lib.genAttrs supportedSystems;
in
{
   # rest of file omitted for simplicity
   packages = forAllSystems (
     system:
     let
       pkgs = import nixpkgs { inherit system; };
       # parse sources.json into a Nix attrset
       sources = builtins.fromJSON (builtins.readFile ./sources.json);
     in
     rec {
       zen-browser-unwrapped = pkgs.callPackage ./zen-browser-unwrapped.nix {
         inherit (sources.${system}) hash url;
         inherit (sources) version;

          # the inherit lines above are equivalent to writing out each attribute:
          #   hash = sources.${system}.hash;
          #   url = sources.${system}.url;
          #   version = sources.version;
       };
      }
    );
}

Now, running nix build .#zen-browser-unwrapped will be able to use the hashes and URLs from sources.json to build the package!

Automating it in CI

We now have a script that can automatically fetch releases and generate hashes and URLs, as well as a way for Nix to use the outputted JSON to build derivations. All that’s left is to fully automate it using CI!

We are going to use GitHub actions for this, as it’s free and easy and you’re probably already hosting on GitHub.

Ensure you’ve set up actions for your repo and given it sufficient permissions.

We’re gonna run it on a cron timer that checks for updates at 8 PM PST every day.

We use DeterminateSystems’ actions to help set up Nix. Then, we simply run our update script. Since we made the script return the tag it fetched, we can store it in a variable and then use it in our commit message.

name: Update to latest version, and update flake inputs

on:
  schedule:
    - cron: "0 4 * * *"
  workflow_dispatch:

jobs:
  update:
    name: Update flake inputs and browser
    runs-on: ubuntu-latest
    steps:
      - name: Checkout Repository
        uses: actions/checkout@v4

      - name: Check flake inputs
        uses: DeterminateSystems/flake-checker-action@v4

      - name: Install Nix
        uses: DeterminateSystems/nix-installer-action@main

      - name: Set up magic Nix cache
        uses: DeterminateSystems/magic-nix-cache-action@main

      - name: Check for update and perform update
        run: |
          git config --global user.name "github-actions[bot]"
          git config --global user.email "github-actions[bot]@users.noreply.github.com"

          chmod +x ./update.nu
          export ZEN_LATEST_VER="$(./update.nu)"

          git add -A
          git commit -m "github-actions: update to $ZEN_LATEST_VER" || echo "Latest version is $ZEN_LATEST_VER, no updates found"

          nix flake update --commit-lock-file

          git push

Now, our repository will automatically check for and perform updates every day!

]]>
a haskellian blog https://blog.youwen.dev/a-haskellian-blog.html 2024-05-25T00:00:00Z 2024-05-25T12:00:00Z

a haskellian blog

a purely functional...blog?

2024-05-25
(last updated: 2024-05-25T12:00:00Z)

Welcome! This is the first post on The Involution and also one that tests all of the features.

A monad is just a monoid in the category of endofunctors, what’s the problem?

haskell?

This entire blog is generated with hakyll. It’s a library for generating static sites, written in Haskell, a purely functional programming language. It’s a library (rather than a full framework) because it doesn’t come with as many batteries included as tools like Hugo or Astro. You set up most of the site yourself by calling the library from Haskell.

Here’s a brief excerpt:

main :: IO ()
main = hakyllWith config $ do
    forM_
        [ "CNAME"
        , "favicon.ico"
        , "robots.txt"
        , "_config.yml"
        , "images/*"
        , "out/*"
        , "fonts/*"
        ]
        $ \f -> match f $ do
            route idRoute
            compile copyFileCompiler

The code highlighting is also generated by hakyll.


why?

Haskell is a purely functional language with no mutable state. Its syntax actually makes it pretty elegant for declaring routes and “rendering” pipelines.
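
For instance, a typical rule for rendering posts looks roughly like this (a generic Hakyll sketch, not necessarily this site’s exact rules; the template path is illustrative):

match "posts/*" $ do
    route $ setExtension "html"
    compile $
        pandocCompiler
            >>= loadAndApplyTemplate "templates/post.html" defaultContext
            >>= relativizeUrls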

  1. Haskell is cool.
  2. It comes with enough features that I don’t feel like I have to build everything from scratch.
  3. It comes with Pandoc, a Haskell library for converting between markup formats. It’s probably more powerful than anything you could do in nodejs. It renders all of the markdown to HTML as well as the math.
    1. It supports KaTeX as well as MathML. I’m a little disappointed with the KaTeX option, though: it doesn’t render the math at build time, but simply injects the KaTeX assets and has the math rendered client-side.

speaking of math

We can have math inline, like so: ex2dx=π\int_{-\infty}^\infty \, e^{-x^2}\,dx = \sqrt{\pi}. This site ships semantic MathML math with its HTML, and the MathJax script to the client.

It’d be nice if MathML could just be used and supported across all browsers, but unfortunately we still aren’t quite there yet. Firefox is the only one where everything looks 80% of the way to LaTeX. On Safari and Chrome, even simple equations like π\sqrt{\pi} render improperly.

Pros of MathML:

  • A little more accessible
  • Can be rendered without additional stylesheets. I just installed the Latin Modern font, but this isn’t even really necessary
  • Built-in to most browsers (#UseThePlatform)

Cons:

  • Isn’t fully standardized. Might look different on different browsers
  • Rendering quality isn’t as good as KaTeX

This site has MathJax render all of the math so it looks nice and standardized across browsers, but the math still displays regardless (like say if MathJax couldn’t load due to slow network) because of MathML. Best of both worlds.

Let’s try it now. Here’s a simple theorem:

an+bncn{a,b,c}n3 a^n + b^n \ne c^n \, \forall\,\left\{ a,\,b,\,c \right\} \in \mathbb{Z} \land n \ge 3

The proof is trivial and will be left as an exercise to the reader.

seems a little overengineered

Probably is. Not as much as the old one, though.

]]>