Random variables, distributions, and probability theory
An overview of discrete and continuous random variables and their distributions and moment generating functions
These are some notes I’ve been collecting on random variables, their distributions, expected values, and moment generating functions. I thought I’d write them down somewhere useful.
These are almost extracted verbatim from my in-class notes, which I take in real time using Typst. I simply wrote a tiny compatibility shim to allow Pandoc to render them to the web.
Random variables
First, some brief exposition on random variables. Counterintuitively, a random variable is actually a function.
Standard notation: $\Omega$ is a sample space, $A \subseteq \Omega$ is an event.
Definition.
A random variable $X$ is a function that takes the set of possible outcomes in a sample space $\Omega$ and maps it to a measurable space, typically (as in our case) a subset of $\mathbb{R}$. That is, $X : \Omega \to \mathbb{R}$.
Definition.
The state space of a random variable $X$ is the set of all values $X$ can take.
Example.
Let $X$ be a random variable that takes on the values $x_1, x_2, \dots, x_n$. Then the state space of $X$ is the set $\{x_1, x_2, \dots, x_n\}$.
Discrete random variables
A random variable $X$ is discrete if there is a countable set $S$ such that $P(X \in S) = 1$. A value $k$ is a possible value if $P(X = k) > 0$. We discuss continuous random variables later.
The probability distribution of $X$ gives its important probabilistic information. The probability distribution is a description of the probabilities $P(X \in B)$ for subsets $B \subseteq \mathbb{R}$. We describe the probability mass function and the cumulative distribution function.
A discrete random variable $X$ has its probability distribution entirely determined by its probability mass function (hereafter abbreviated p.m.f. or PMF) $p(k) = P(X = k)$. The p.m.f. is a function from the set of possible values of $X$ into $[0, 1]$. Labeling the p.m.f. with the random variable is done by writing $p_X(k) = P(X = k)$.
By the axioms of probability, $$\sum_{k} p_X(k) = \sum_{k} P(X = k) = 1.$$
For a subset $B \subseteq \mathbb{R}$, $$P(X \in B) = \sum_{k \in B} p_X(k).$$
Continuous random variables
Now as promised we introduce another major class of random variables.
Definition.
Let $X$ be a random variable. If a function $f$ satisfies $$P(X \le b) = \int_{-\infty}^{b} f(x)\,dx$$
for all $b \in \mathbb{R}$, then $f$ is the probability density function (hereafter abbreviated p.d.f. or PDF) of $X$.
We immediately see that the p.d.f. is analogous to the p.m.f. of the discrete case.
The probability that $X \in [a, b]$ is equal to the area under the graph of $f$ from $a$ to $b$.
A corollary is the following.
Fact.
$$P(X \in B) = \int_B f(x)\,dx$$ for any $B \subseteq \mathbb{R}$ where integration makes sense.
The set $B$ can be bounded or unbounded, or any collection of intervals.
Fact.
If a random variable $X$ has density function $f$, then individual point values have probability zero: $$P(X = c) = \int_c^c f(x)\,dx = 0 \quad \text{for all } c \in \mathbb{R}.$$
Remark.
It follows that a random variable with a density function is not discrete. An immediate corollary is that the probabilities of intervals are not changed by including or excluding endpoints, so $P(a \le X \le b)$, $P(a < X \le b)$, $P(a \le X < b)$, and $P(a < X < b)$ are all equal.
How do we determine which functions are p.d.f.s? Since $P(-\infty < X < \infty) = 1$, a p.d.f. must satisfy $$f(x) \ge 0 \quad \text{for all } x, \qquad \int_{-\infty}^{\infty} f(x)\,dx = 1.$$
Fact.
Random variables with density functions are called continuous random variables. This does not imply that the random variable is a continuous function on $\Omega$, but it is standard terminology.
Discrete distributions
Recall that the probability distribution of gives its important probabilistic information. Let us discuss some of these distributions.
In general we first consider the experiment’s properties and theorize about the distribution that its random variable takes. We can then apply the distribution to find out various pieces of probabilistic information.
Bernoulli trials
A Bernoulli trial is the original “experiment.” It’s simply a single trial with a binary “success” or “failure” outcome. Encode this T/F, 0 or 1, or however you’d like. It becomes immediately useful in defining more complex distributions, so let’s analyze its properties.
The setup: the experiment has exactly two outcomes:
Success – $S$, or 1
Failure – $F$, or 0
Additionally, $P(S) = p$ and $P(F) = 1 - p$ for some $0 \le p \le 1$.
Construct the probability mass function: $P(X = 1) = p$ and $P(X = 0) = 1 - p$.
Write it as:
$$p_X(k) = p^k (1 - p)^{1 - k}$$
for $k \in \{0, 1\}$ and $0 \le p \le 1$.
Binomial distribution
The setup: very similar to Bernoulli, trials have exactly 2 outcomes. A bunch of Bernoulli trials in a row.
Importantly: $P(S) = p$ and $P(F) = 1 - p$ are defined exactly the same in all trials.
This ties the binomial distribution to the sampling with replacement model, since each trial does not affect the next.
We conduct $n$ independent trials of this experiment. Example with coins: each flip independently has the same chance of heads or tails (the same holds for a die, a rigged coin, etc.).
$n$ is fixed, i.e. known ahead of time.
Binomial random variable
Let’s consider the random variable characterized by the binomial distribution now.
Let $X = $ the number of successes in $n$ independent trials. Any particular sequence of trials takes the form $SSFF \cdots FS$ and is of length $n$.
Then $X$ can take the $n + 1$ possible values $0, 1, \dots, n$. The probability of any particular sequence is given by the product of the individual trial probabilities.
Example.
So the probability of any single sequence containing $k$ successes and $n - k$ failures is $p^k (1 - p)^{n - k}$.
And there are $\binom{n}{k}$ such sequences.
Now we can generalize.
How about all successes, or all failures?
We see that for all failures we have $P(X = 0) = (1 - p)^n$, and for all successes we have $P(X = n) = p^n$. Otherwise we use our method above.
In general, here is the probability mass function for the binomial random variable: $$P(X = k) = \binom{n}{k} p^k (1 - p)^{n - k}, \quad k = 0, 1, \dots, n.$$
The binomial distribution is very powerful: whenever we repeatedly choose between two outcomes, it gives the probability of each possible number of successes.
To summarize the characterization of the binomial random variable:
$n$ independent trials
each trial results in binary success or failure
with probability of success $p$, identically across trials
with $X$ counting the number of successes in the $n$ fixed trials
with probability mass function $$P(X = k) = \binom{n}{k} p^k (1 - p)^{n - k}, \quad k = 0, 1, \dots, n.$$
We see this is in fact the binomial theorem!
In fact, $$\sum_{k=0}^{n} \binom{n}{k} p^k (1 - p)^{n - k} = \big(p + (1 - p)\big)^n = 1.$$
Example.
What is the probability of getting exactly three aces (1’s) out of 10 throws of a fair die?
Seems a little trickier, but we can still write this as a well-defined success/failure experiment. Let success be getting an ace and failure be anything else.
Then $p = \frac{1}{6}$, $1 - p = \frac{5}{6}$, and $n = 10$. We want $P(X = 3)$. So $$P(X = 3) = \binom{10}{3} \left(\frac{1}{6}\right)^3 \left(\frac{5}{6}\right)^7 \approx 0.155.$$
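As a quick numerical sanity check, here is a minimal Python sketch (standard library only; `binom_pmf` is a helper written for these notes, not a library function) that reproduces the dice computation above and verifies that the p.m.f. sums to 1:

```python
from math import comb

def binom_pmf(k: int, n: int, p: float) -> float:
    """P(X = k) for X ~ Bin(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Exactly three aces in 10 rolls of a fair die (p = 1/6).
print(binom_pmf(3, 10, 1/6))                          # ~0.155
# Sanity check: the p.m.f. sums to 1 over k = 0, ..., n.
print(sum(binom_pmf(k, 10, 1/6) for k in range(11)))  # ~1.0
```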
With or without replacement?
I place particular emphasis on the fact that the binomial distribution generally applies to cases where you're sampling with replacement. Consider the following:
Example.
Suppose we have two types of candy, red and black. Select $n$ candies. Let $X$ be the number of red candies among the $n$ selected.
2 cases.
- case 1: with replacement: binomial distribution, $X \sim \operatorname{Bin}(n, p)$, where $p$ is the fraction of red candies.
- case 2: without replacement: then use counting
In case 2, we used the elementary counting techniques we are already familiar with. Immediately we see a distinct case similar to the binomial but when sampling without replacement. Let’s formalize this as a random variable!
Hypergeometric distribution
Let’s introduce a random variable to represent a situation like case 2 above.
Definition.
A random variable $X$ counting the number of successes when we sample without replacement, as in case 2 above, is known as a hypergeometric random variable.
Abbreviate this by: $X \sim \operatorname{Hypergeom}(N, N_A, n)$, where $N$ is the total number of objects, $N_A$ the number of objects of the type we count, and $n$ the sample size.
Remark.
If $n$ is very small relative to $N$, then both cases give similar (approximately the same) answers.
For instance, if we're sampling for blood types from UCSB, and we take a student out without replacement, we don't really change the population substantially. So both answers give a similar result.
Suppose we have two types of items, type $A$ and type $B$. Let $N_A$ be the number of type $A$ objects and $N_B$ the number of type $B$ objects. $N = N_A + N_B$ is the total number of objects.
We sample $n$ items without replacement ($n \le N$) with order not mattering. Denote by $X$ the number of type $A$ objects in our sample.
Definition.
Let $0 \le N_A \le N$ and $1 \le n \le N$ be integers. A random variable $X$ has the hypergeometric distribution with parameters $(N, N_A, n)$ if $X$ takes values in the set $\{0, 1, \dots, n\}$ and has p.m.f. $$P(X = k) = \frac{\binom{N_A}{k} \binom{N - N_A}{n - k}}{\binom{N}{n}}.$$
Example.
Let $N_A$ be the number of defectives and $N - N_A$ the number of non-defectives among $N$ items. We select $n = 5$ without replacement. What is the probability that 2 of the 5 selected are defective?
We want $$P(X = 2) = \frac{\binom{N_A}{2} \binom{N - N_A}{3}}{\binom{N}{5}}.$$
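Since the exact counts of this example aren't reproduced above, here is a hedged Python sketch of the same computation with made-up numbers (6 defectives, 14 non-defectives); `hypergeom_pmf` is a helper defined here, not a library call:

```python
from math import comb

def hypergeom_pmf(k: int, N: int, N_A: int, n: int) -> float:
    """P(X = k) for X ~ Hypergeom(N, N_A, n): k type-A items in a sample
    of n drawn without replacement from N items, N_A of which are type A."""
    return comb(N_A, k) * comb(N - N_A, n - k) / comb(N, n)

# Hypothetical counts (not from the original example): 6 defectives,
# 14 non-defectives, sample n = 5, ask for exactly 2 defectives.
N_A, N_B, n = 6, 14, 5
print(hypergeom_pmf(2, N_A + N_B, N_A, n))   # ~0.352
```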
Remark.
Make sure you can distinguish when a problem is binomial or when it is hypergeometric. This is very important on exams.
Recall that both ask about the number of successes among a fixed number of draws. But the binomial samples with replacement (each trial is independent), while sampling without replacement gives the hypergeometric.
Geometric distribution
Consider an infinite sequence of independent trials. e.g. number of attempts until I make a basket.
In fact we can think of this as a variation on the binomial distribution. But in this case we don't sample $n$ times and ask how many successes we have; we sample as many times as we need for one success. Later on we'll see this is really a specific case of another distribution, the negative binomial.
Let $X_j$ denote the outcome of the $j$th trial, where success is 1 and failure is 0. Let $N$ be the number of trials needed to observe the first success in a sequence of independent trials with probability of success $p$. Then $$\{N = k\} = \{X_1 = 0, X_2 = 0, \dots, X_{k-1} = 0, X_k = 1\}.$$
We fail $k - 1$ times and succeed on the $k$th try. Then: $$P(N = k) = (1 - p)^{k - 1} p.$$
This is the probability of failure raised to the number of failures, times the probability of success.
The key characteristic of these trials is that we keep going until we succeed. There's no binomial coefficient in front, unlike the binomial distribution, because there's exactly one sequence that gives us the first success on the $k$th trial.
Definition.
Let $0 < p \le 1$. A random variable $X$ has the geometric distribution with success parameter $p$ if the possible values of $X$ are $\{1, 2, 3, \dots\}$ and $X$ satisfies $$P(X = k) = (1 - p)^{k - 1} p$$
for positive integers $k$. Abbreviate this by $X \sim \operatorname{Geom}(p)$.
Example.
What is the probability it takes more than seven rolls of a fair die to roll a six?
Let $X$ be the number of rolls of a fair die until the first six. Then $X \sim \operatorname{Geom}(1/6)$. Now we just want $P(X > 7)$.
Re-indexing, $$P(X > 7) = \sum_{k=8}^{\infty} (1 - p)^{k - 1} p = p (1 - p)^7 \sum_{j=0}^{\infty} (1 - p)^j.$$
Now we calculate by standard methods (geometric series): $$P(X > 7) = p (1 - p)^7 \cdot \frac{1}{1 - (1 - p)} = (1 - p)^7 = \left(\frac{5}{6}\right)^7 \approx 0.279.$$
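A short Python check of this calculation; the event $\{X > 7\}$ is the same as the first seven rolls all failing:

```python
# P(X > 7) for X ~ Geom(1/6): the first six takes more than seven rolls
# exactly when the first seven rolls are all non-sixes.
p = 1/6
direct = (1 - p)**7
# Numerical check by summing the p.m.f. over a long stretch of the tail.
tail = sum((1 - p)**(k - 1) * p for k in range(8, 1000))
print(direct, tail)   # both ~0.279
```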
Negative binomial
As promised, here’s the negative binomial.
Consider a sequence of Bernoulli trials with the following characteristics:
Each trial results in a success or a failure.
The probability of success $p$ is the same on each trial.
Trials are independent (notice the number of trials is not fixed in advance).
The experiment continues until $r$ successes are observed, where $r$ is a given parameter.
Then if $X$ is the number of trials necessary until $r$ successes are observed, we say $X$ is a negative binomial random variable.
Immediately we see that the geometric distribution is just the negative binomial with $r = 1$.
Definition.
Let $r$ be a positive integer and $0 < p \le 1$. A random variable $X$ has the negative binomial distribution with parameters $(r, p)$ if the possible values of $X$ are the integers $\{r, r + 1, r + 2, \dots\}$ and the p.m.f. is $$P(X = n) = \binom{n - 1}{r - 1} p^r (1 - p)^{n - r} \quad \text{for } n = r, r + 1, r + 2, \dots$$
Abbreviate this by $X \sim \operatorname{Negbin}(r, p)$.
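A minimal Python sketch of this p.m.f. (the parameter values below are illustrative assumptions, not from the notes):

```python
from math import comb

def negbin_pmf(n: int, r: int, p: float) -> float:
    """P(X = n) for X ~ Negbin(r, p): the r-th success occurs on trial n."""
    if n < r:
        return 0.0
    return comb(n - 1, r - 1) * p**r * (1 - p)**(n - r)

# Illustrative numbers only: p = 0.4, third success on the 7th trial.
print(negbin_pmf(7, 3, 0.4))
# The p.m.f. sums to 1 over n = r, r + 1, ... (truncated here).
print(sum(negbin_pmf(n, 3, 0.4) for n in range(3, 500)))  # ~1.0
```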
Example.
Steph Curry has a three-point percentage of approximately $p$. What is the probability that Steph makes his third three-point basket on his $n$th attempt?
Let $X$ be the number of attempts required to observe the 3rd success. Then $X \sim \operatorname{Negbin}(3, p)$.
So, $$P(X = n) = \binom{n - 1}{2} p^3 (1 - p)^{n - 3}.$$
Poisson distribution
This p.m.f. follows from the Taylor expansion $$e^{\lambda} = \sum_{k=0}^{\infty} \frac{\lambda^k}{k!},$$
which implies that $$\sum_{k=0}^{\infty} e^{-\lambda} \frac{\lambda^k}{k!} = 1.$$
Definition.
For an integer-valued random variable $X$, we say $X \sim \operatorname{Poisson}(\lambda)$ if it has p.m.f. $$P(X = k) = e^{-\lambda} \frac{\lambda^k}{k!}$$
for $k = 0, 1, 2, \dots$ and $\lambda > 0$.
The Poisson arises from the binomial. It applies in the binomial context when $n$ is very large and $p$ is very small, such that $np$ is a moderate number.
Then $X$ approximately follows a Poisson distribution with $\lambda = np$:
$$P(X = k) \approx e^{-np} \frac{(np)^k}{k!} \quad \text{for } k = 0, 1, \dots, n.$$
The Poisson distribution is useful for finding the probabilities of rare events over a continuous interval of time. By knowing just $\lambda = np$ for small $p$ and large $n$, we can calculate many probabilities.
Example.
The number of typing errors on a page of a textbook.
Let
$n$ be the number of letters or symbols per page (large),
$p$ be the probability of an error, small enough such that $\lambda = np$ is moderate.
What is the probability of exactly 1 error?
Let $X$ be the number of errors on the page. We can approximate the distribution of $X$ with a $\operatorname{Poisson}(np)$ distribution, so $$P(X = 1) \approx e^{-np}\, np.$$
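To see the approximation numerically, here is a hedged Python sketch with made-up page statistics ($n = 1000$ symbols, $p = 0.002$, so $\lambda = np = 2$) comparing the exact binomial answer to the Poisson approximation:

```python
from math import comb, exp, factorial

def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k, lam):
    return exp(-lam) * lam**k / factorial(k)

# Hypothetical page: n = 1000 symbols, error probability p = 0.002.
n, p = 1000, 0.002
lam = n * p
# Probability of exactly one error: exact binomial vs. Poisson approximation.
print(binom_pmf(1, n, p))   # ~0.271
print(poisson_pmf(1, lam))  # ~0.271
```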
Continuous distributions
All of the distributions we've been analyzing have been discrete, that is, they apply to random variables with a countable state space. Even when the state space is infinite, as in the negative binomial, it is countable. We can think of it as indexing each trial with a natural number $n \in \mathbb{N}$.
Now we turn our attention to continuous random variables, which operate on uncountably infinite state spaces. For example, if we sample uniformly inside of the interval $[0, 1]$, there are uncountably many possible values we could obtain. We cannot index these values by the natural numbers; by some theorems of set theory we in fact know that the interval $[0, 1]$ has a bijection to $\mathbb{R}$ and has cardinality $\mathfrak{c}$, the cardinality of the continuum.
Additionally, we notice that asking for the probability that we pick one exact point in the interval makes little sense: there are infinitely many sample points! Intuitively, we should think that the probability of choosing any particular point is 0. However, we should be able to make statements about whether we choose a point that lies within a subset, like a subinterval of $[0, 1]$.
Let’s formalize these ideas.
Definition.
Let $X$ be a random variable. If we have a function $f$ such that $$P(X \le b) = \int_{-\infty}^{b} f(x)\,dx$$
for all $b \in \mathbb{R}$, then $f$ is the probability density function of $X$.
The probability that the value of $X$ lies in $[a, b]$ equals the area under the curve of $f$ from $a$ to $b$.
If $f$ satisfies this definition, then for any $B \subseteq \mathbb{R}$ for which integration makes sense, $$P(X \in B) = \int_B f(x)\,dx.$$
Remark.
Recall from our previous discussion of random variables that the PDF is the analogue of the PMF for discrete random variables.
Properties of a CDF:
Any CDF $F(x) = P(X \le x)$ satisfies
$\lim_{x \to -\infty} F(x) = 0$ and $\lim_{x \to \infty} F(x) = 1$ (the density integrates to unity),
$F$ is non-decreasing in $x$ (monotonically non-decreasing).
Like we mentioned before, we can only ask about things like $P(a \le X \le b)$, but not $P(X = c)$. In fact $P(X = c) = 0$ for all $c$. An immediate corollary of this is that we can freely interchange $<$ and $\le$, and likewise $>$ and $\ge$, since $P(X \le b) = P(X < b)$ if $P(X = b) = 0$.
Example.
Let $X$ be a continuous random variable with density (p.d.f.) $f$.
- What is the value of the constant in $f$?
It is such that $\int_{-\infty}^{\infty} f(x)\,dx = 1$.
- Find the probability that $X$ is between 1 and 1.4.
Integrate the curve between 1 and 1.4: $\int_1^{1.4} f(x)\,dx$.
This is the probability that $X$ lies between 1 and 1.4.
- Find the probability that $X$ is between 1 and 3.
Idea: integrate between 1 and 3, being careful after 2.
- What is the CDF for $X$? Integrate the curve up to $x$: $F(x) = \int_{-\infty}^{x} f(t)\,dt$.
Important: include the range!
- Find a point $a$ such that you integrate up to $a$ to find exactly
half the area.
We want to find $a$ with $F(a) = \frac{1}{2}$.
Now let us discuss some named continuous distributions.
The (continuous) uniform distribution
The simplest and the best of the named distributions!
Definition.
Let $[a, b]$ be a bounded interval on the real line. A random variable $X$ has the uniform distribution on the interval $[a, b]$ if $X$ has the density function $$f(x) = \begin{cases} \frac{1}{b - a} & \text{if } x \in [a, b], \\ 0 & \text{otherwise.} \end{cases}$$
Abbreviate this by $X \sim \operatorname{Unif}[a, b]$.
The graph of $f$ is a constant line at height $\frac{1}{b - a}$ across $[a, b]$. The integral is just the area of a rectangle, and we can check it is 1.
Fact.
For $X \sim \operatorname{Unif}[a, b]$, its cumulative distribution function (CDF) is given by: $$F_X(x) = \begin{cases} 0 & \text{if } x < a, \\ \frac{x - a}{b - a} & \text{if } a \le x \le b, \\ 1 & \text{if } x > b. \end{cases}$$
Fact.
If $X \sim \operatorname{Unif}[a, b]$ and $[c, d] \subseteq [a, b]$, then $$P(c \le X \le d) = \frac{d - c}{b - a}.$$
Example.
Let $X$ be a uniform random variable on the interval $[a, b]$. Find the probability that its absolute value is at least 1.
$X$ takes values in the interval $[a, b]$, so the absolute value is at least 1 iff $X \le -1$ or $X \ge 1$.
The density function of $X$ is $\frac{1}{b - a}$ on $[a, b]$ and 0 everywhere else.
So, $$P(|X| \ge 1) = P(X \le -1) + P(X \ge 1).$$
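Because the specific interval of the example isn't reproduced above, here is a hedged Python sketch with an assumed interval $[-2, 3]$, comparing the exact uniform probability with a Monte Carlo estimate:

```python
import random

# Assumed interval (not from the original example).
a, b = -2.0, 3.0

def unif_prob(c, d, a=a, b=b):
    """P(c <= X <= d) for X ~ Unif[a, b], assuming [c, d] lies inside [a, b]."""
    return (d - c) / (b - a)

# P(|X| >= 1) = P(X <= -1) + P(X >= 1)
exact = unif_prob(a, -1.0) + unif_prob(1.0, b)
# Monte Carlo check
trials = 100_000
estimate = sum(abs(random.uniform(a, b)) >= 1 for _ in range(trials)) / trials
print(exact, estimate)   # both ~0.6 for this choice of interval
```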
The exponential distribution
The geometric distribution can be viewed as modeling waiting times in a discrete setting, i.e. we wait through some number of failures to arrive at the first success.
The exponential distribution is the continuous analogue of the geometric distribution, in that we often use it to model waiting times in the continuous sense. For example, the time until the first customer enters the barber shop.
Definition.
Let $0 < \lambda < \infty$. A random variable $X$ has the exponential distribution with parameter $\lambda$ if $X$ has PDF $$f(x) = \begin{cases} \lambda e^{-\lambda x} & \text{if } x \ge 0, \\ 0 & \text{if } x < 0. \end{cases}$$
Abbreviate this by $X \sim \operatorname{Exp}(\lambda)$, the exponential distribution with rate $\lambda$.
The CDF of the $\operatorname{Exp}(\lambda)$ distribution is given by: $$F(t) = \begin{cases} 0 & \text{if } t < 0, \\ 1 - e^{-\lambda t} & \text{if } t \ge 0. \end{cases}$$
Example.
Suppose the length of a phone call, in minutes, is well modeled by an exponential random variable with rate $\lambda$.
What is the probability that a call takes more than 8 minutes?
What is the probability that a call takes between 8 and 22 minutes?
Let $X$ be the length of the phone call, so that $X \sim \operatorname{Exp}(\lambda)$. Then we can find the desired probability by: $$P(X > 8) = 1 - F(8) = 1 - (1 - e^{-8\lambda}) = e^{-8\lambda}.$$
Now to find $P(8 < X < 22)$, we can take the difference of CDFs: $$P(8 < X < 22) = F(22) - F(8) = e^{-8\lambda} - e^{-22\lambda}.$$
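Since the rate from the example isn't reproduced above, here is a minimal Python sketch with an assumed rate $\lambda = 0.2$ per minute that evaluates both probabilities from the CDF:

```python
from math import exp

lam = 0.2   # assumed rate, not from the original example

def exp_cdf(t: float, lam: float = lam) -> float:
    """F(t) = P(X <= t) for X ~ Exp(lam)."""
    return 1 - exp(-lam * t) if t >= 0 else 0.0

print(1 - exp_cdf(8))            # P(X > 8)      = e^{-8 lam}
print(exp_cdf(22) - exp_cdf(8))  # P(8 < X < 22) = e^{-8 lam} - e^{-22 lam}
```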
Fact (Memoryless property of the exponential distribution).
Suppose that $X \sim \operatorname{Exp}(\lambda)$. Then for any $s, t > 0$, we have $$P(X > t + s \mid X > t) = P(X > s).$$
This is like saying: if I've already been waiting 5 minutes for the bus, what is the probability that I'm gonna wait more than 5 + 3 minutes in total? And that's precisely equal to just the probability that I'm gonna wait more than 3 minutes from scratch.
Proof. $$P(X > t + s \mid X > t) = \frac{P(X > t + s)}{P(X > t)} = \frac{e^{-\lambda(t + s)}}{e^{-\lambda t}} = e^{-\lambda s} = P(X > s).$$
Gamma distribution
Definition.
Let $r, \lambda > 0$. A random variable $X$ has the gamma distribution with parameters $(r, \lambda)$ if $X$ is nonnegative and has probability density function $$f(x) = \frac{\lambda^r x^{r - 1} e^{-\lambda x}}{\Gamma(r)} \quad \text{for } x \ge 0.$$
Abbreviate this by $X \sim \operatorname{Gamma}(r, \lambda)$.
The gamma function generalizes the factorial function and is defined as $$\Gamma(r) = \int_0^{\infty} x^{r - 1} e^{-x}\,dx.$$
Special case: $\Gamma(n) = (n - 1)!$ if $n$ is a positive integer.
Remark.
The $\operatorname{Exp}(\lambda)$ distribution is a special case of the gamma distribution, with parameter $r = 1$.
The normal distribution
Also known as the Gaussian distribution, this is so important it gets its own section.
Definition.
A random variable $Z$ has the standard normal distribution if $Z$ has density function $$\varphi(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$$
on the real line. Abbreviate this by $Z \sim \mathcal{N}(0, 1)$.
Fact (CDF of a standard normal random variable).
Let $Z \sim \mathcal{N}(0, 1)$ be standard normally distributed. Then its CDF is given by $$\Phi(x) = \int_{-\infty}^{x} \varphi(s)\,ds = \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi}} e^{-s^2/2}\,ds.$$
The normal distribution is so important that, instead of the standard $f$ and $F$, we use the special notation $\varphi$ and $\Phi$ for its density and CDF.
Fact.
No closed form of the standard normal CDF exists, so we are left to either:
approximate
use technology (calculator)
use the standard normal probability table in the textbook
To evaluate negative values, we can use the symmetry of the normal distribution to apply the following identity: $$\Phi(-x) = 1 - \Phi(x).$$
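In Python, one way to "use technology" is the error function from the standard library, which gives the standard normal CDF without any tables (a small sketch, not part of the original notes):

```python
from math import erf, sqrt

def phi_cdf(x: float) -> float:
    """Standard normal CDF: Phi(x) = (1 + erf(x / sqrt(2))) / 2."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

print(phi_cdf(1.0))                      # ~0.8413
print(phi_cdf(-1.0), 1 - phi_cdf(1.0))   # both ~0.1587: Phi(-x) = 1 - Phi(x)
```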
General normal distributions
We can compute probabilities for a normal distribution with any other parameters using the standard normal.
The general family of normal distributions is obtained by linear or affine transformations of $Z$. Let $\mu$ be real and $\sigma > 0$; then $$X = \sigma Z + \mu$$
is also a normally distributed random variable, with parameters $(\mu, \sigma^2)$. The CDF of $X$ in terms of $\Phi$ can be expressed as $$F_X(x) = P(X \le x) = P(\sigma Z + \mu \le x) = P\!\left(Z \le \frac{x - \mu}{\sigma}\right) = \Phi\!\left(\frac{x - \mu}{\sigma}\right).$$
Also, $$f_X(x) = F_X'(x) = \frac{1}{\sigma}\,\varphi\!\left(\frac{x - \mu}{\sigma}\right) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{(x - \mu)^2}{2\sigma^2}}.$$
Definition.
Let $\mu$ be real and $\sigma > 0$. A random variable $X$ has the normal distribution with mean $\mu$ and variance $\sigma^2$ if $X$ has density function $$f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x - \mu)^2}{2\sigma^2}}$$
on the real line. Abbreviate this by $X \sim \mathcal{N}(\mu, \sigma^2)$.
Fact.
Let $X \sim \mathcal{N}(\mu, \sigma^2)$ and $Y = aX + b$ with $a \ne 0$. Then $$Y \sim \mathcal{N}(a\mu + b,\, a^2\sigma^2).$$
That is, $Y$ is normally distributed with parameters $(a\mu + b, a^2\sigma^2)$. In particular, $Z = \frac{X - \mu}{\sigma} \sim \mathcal{N}(0, 1)$ is a standard normal variable.
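A small Python sketch of standardization in action, with made-up parameters $\mu = 10$, $\sigma = 2$ (the `phi_cdf` helper from the previous sketch is redefined so this snippet stands alone):

```python
from math import erf, sqrt

def phi_cdf(x: float) -> float:
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def normal_cdf(x: float, mu: float, sigma: float) -> float:
    """P(X <= x) for X ~ N(mu, sigma^2), by standardizing: Phi((x - mu) / sigma)."""
    return phi_cdf((x - mu) / sigma)

# Assumed parameters: X ~ N(10, 2^2); compute P(9 < X <= 13).
mu, sigma = 10.0, 2.0
print(normal_cdf(13, mu, sigma) - normal_cdf(9, mu, sigma))   # ~0.625
```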
Expectation
Let’s discuss the expectation of a random variable, which is a similar idea to the basic concept of mean.
Definition.
The expectation or mean of a discrete random variable $X$ is the weighted average of its possible values, with weights assigned by the corresponding probabilities: $$E[X] = \sum_{k} k\, P(X = k).$$
Example.
Find the expected value of a single roll of a fair die. $$E[X] = 1 \cdot \frac{1}{6} + 2 \cdot \frac{1}{6} + \dots + 6 \cdot \frac{1}{6} = \frac{21}{6} = 3.5.$$
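The arithmetic, as a one-line Python check:

```python
# E[X] for one roll of a fair die: each face k = 1, ..., 6 has probability 1/6.
print(sum(k * (1/6) for k in range(1, 7)))   # 3.5
```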
Binomial expected value
For $X \sim \operatorname{Bin}(n, p)$, $$E[X] = \sum_{k=0}^{n} k \binom{n}{k} p^k (1 - p)^{n - k} = np.$$
Bernoulli expected value
Bernoulli is just binomial with one trial.
Recall that $P(X = 1) = p$ and $P(X = 0) = 1 - p$, so $E[X] = 1 \cdot p + 0 \cdot (1 - p) = p$.
Let $A$ be an event on $\Omega$. Its indicator random variable $I_A$ is defined for $\omega \in \Omega$ by $$I_A(\omega) = \begin{cases} 1 & \text{if } \omega \in A, \\ 0 & \text{if } \omega \notin A, \end{cases}$$ so that $E[I_A] = P(A)$.
Geometric expected value
Let $0 < p \le 1$ and let $X \sim \operatorname{Geom}(p)$ be a geometric RV with probability of success $p$. Recall that the p.m.f. is $p q^{k - 1}$ for $k = 1, 2, 3, \dots$, where the probability of failure is defined by $q = 1 - p$.
Then $$E[X] = \sum_{k=1}^{\infty} k\, p q^{k - 1}.$$
Now recall from calculus that you can differentiate a power series term by term inside its radius of convergence. So for $|t| < 1$, $$\sum_{k=1}^{\infty} k t^{k - 1} = \frac{d}{dt} \sum_{k=0}^{\infty} t^k = \frac{d}{dt}\left(\frac{1}{1 - t}\right) = \frac{1}{(1 - t)^2},$$ and therefore $$E[X] = p \sum_{k=1}^{\infty} k q^{k - 1} = \frac{p}{(1 - q)^2} = \frac{1}{p}.$$
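A quick numerical check of $E[X] = 1/p$ by truncating the series ($p = 0.3$ is an arbitrary choice):

```python
# Truncated series for E[X] when X ~ Geom(p), compared against 1/p.
p = 0.3
q = 1 - p
series = sum(k * p * q**(k - 1) for k in range(1, 10_000))
print(series, 1 / p)   # both ~3.333
```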
Expected value of a continuous RV
Definition.
The expectation or mean of a continuous random variable $X$ with density function $f$ is $$E[X] = \int_{-\infty}^{\infty} x f(x)\,dx.$$
An alternative symbol is $\mu$.
$\mu$ is the "first moment" of $X$; by analogy with physics, it's the "center of gravity" of $X$.
Remark.
In general when moving between discrete and continuous RV, replace sums with integrals, p.m.f. with p.d.f., and vice versa.
Example.
Suppose $X$ is a continuous RV with p.d.f. $f_X(x)$.
Example (Uniform expectation).
Let $X$ be a uniform random variable on the interval $[a, b]$ with $a < b$. Find the expected value of $X$. $$E[X] = \int_a^b \frac{x}{b - a}\,dx = \frac{b^2 - a^2}{2(b - a)} = \frac{a + b}{2},$$ the midpoint of the interval.
Example (Exponential expectation).
Find the expected value of an exponential RV with p.d.f. $f(x) = \lambda e^{-\lambda x}$ for $x \ge 0$. Integrating by parts, $$E[X] = \int_0^{\infty} x \lambda e^{-\lambda x}\,dx = \frac{1}{\lambda}.$$
Example (Uniform dartboard).
Our dartboard is a disk of radius $r_0$ and the dart lands uniformly at random on the disk when thrown. Let $R$ be the distance of the dart from the center of the disk. Find $E[R]$, given the density function $$f_R(t) = \frac{2t}{r_0^2} \quad \text{for } 0 \le t \le r_0,$$ and $0$ elsewhere. Then $$E[R] = \int_0^{r_0} t \cdot \frac{2t}{r_0^2}\,dt = \frac{2 r_0}{3}.$$
Expectation of derived values
If we can find the expected value of $X$, can we find the expected value of a quantity derived from $X$, such as $X^2$? More precisely, can we find $E[X^2]$?
If the distribution of $X^2$ is easy to see, then this is trivial. Otherwise we have the following useful property:
$$E[X^2] = \int_{-\infty}^{\infty} x^2 f_X(x)\,dx \quad \text{(for continuous RVs)}.$$
And in the discrete case, $$E[X^2] = \sum_{k} k^2 p_X(k).$$
In fact $E[X^2]$ is so important that we call it the mean square.
Fact.
More generally, a real-valued function $g(X)$ defined on the range of $X$ is itself a random variable (with its own distribution).
We can find the expected value of $g(X)$ by $$E[g(X)] = \sum_{k} g(k) p_X(k) \quad \text{(discrete case)}$$
or $$E[g(X)] = \int_{-\infty}^{\infty} g(x) f_X(x)\,dx \quad \text{(continuous case)}.$$
Example.
You roll a fair die to determine the winnings (or losses) of a player as follows:
What is the expected winnings/losses for the player during 1 roll of the die?
Let $X$ denote the outcome of the roll of the die. Then we can define the winnings as $g(X)$, where the function $g$ assigns the payoff (or loss) to each face of the die.
Note that each face $k = 1, \dots, 6$ appears with probability $\frac{1}{6}$.
Then $$E[g(X)] = \sum_{k=1}^{6} g(k)\, P(X = k) = \frac{1}{6} \sum_{k=1}^{6} g(k).$$
Example.
A stick of length $\ell$ is broken at a uniformly chosen random location. What is the expected length of the longer piece?
Idea: if you break it before the halfway point, then the longer piece has length $\ell - X$. If you break it after the halfway point, the longer piece has length $X$.
Let the interval $[0, \ell]$ represent the stick and let $X$ be the location where the stick is broken. Then $X$ has density $f(x) = \frac{1}{\ell}$ on $[0, \ell]$ and 0 elsewhere.
Let $g(x)$ be the length of the longer piece when the stick is broken at $x$, $$g(x) = \begin{cases} \ell - x & \text{if } 0 \le x < \frac{\ell}{2}, \\ x & \text{if } \frac{\ell}{2} \le x \le \ell. \end{cases}$$
Then $$E[g(X)] = \int_0^{\ell} g(x) \frac{1}{\ell}\,dx = \int_0^{\ell/2} \frac{\ell - x}{\ell}\,dx + \int_{\ell/2}^{\ell} \frac{x}{\ell}\,dx = \frac{3\ell}{8} + \frac{3\ell}{8} = \frac{3\ell}{4}.$$
So we expect the longer piece to be $\frac{3}{4}$ of the total length, which is perhaps a bit surprising.
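A Monte Carlo sketch of the stick example (unit-length stick, break point uniform on $[0, 1]$), whose output should hover near $3/4$:

```python
import random

# Break a unit-length stick at a Unif[0, 1] point and record the longer piece.
trials = 200_000
total = sum(max(u, 1 - u) for u in (random.random() for _ in range(trials)))
print(total / trials)   # ~0.75, i.e. 3/4 of the stick length
```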
Moments of a random variable
We continue discussing expectation but we introduce new terminology.
Fact.
The $n$th moment (or $n$th raw moment) of a discrete random variable $X$ with p.m.f. $p_X(k)$ is the expectation $$E[X^n] = \sum_{k} k^n p_X(k).$$
If $X$ is continuous with density $f$, then we have analogously $$E[X^n] = \int_{-\infty}^{\infty} x^n f(x)\,dx.$$
The deviation from the mean is given by $X - \mu$, and the variance is given by the second central moment: $$\operatorname{Var}(X) = E[(X - \mu)^2] = E[X^2] - (E[X])^2.$$
The third (central) moment is used to measure the "skewness", or asymmetry, of a distribution. For example, the normal distribution is very symmetric.
The fourth (central) moment is used to measure the kurtosis, or peakedness, of a distribution.
Central moments
Previously we discussed “raw moments.” Be careful not to confuse them with central moments.
Fact.
The $n$th central moment of a discrete random variable $X$ with p.m.f. $p_X(k)$ is the expected value of the difference about the mean raised to the $n$th power: $$E[(X - \mu)^n] = \sum_{k} (k - \mu)^n p_X(k).$$
And of course in the continuous case, $$E[(X - \mu)^n] = \int_{-\infty}^{\infty} (x - \mu)^n f(x)\,dx.$$
In particular, the second central moment is the variance: $E[(X - \mu)^2] = \operatorname{Var}(X)$.
Example.
Let $X$ be a uniformly chosen integer from $\{1, 2, \dots, n\}$. Find the first and second moments of $X$.
The p.m.f. of $X$ is $p_X(k) = \frac{1}{n}$ for $k \in \{1, \dots, n\}$. Thus, $$E[X] = \sum_{k=1}^{n} k \cdot \frac{1}{n} = \frac{n + 1}{2}.$$
Then, $$E[X^2] = \sum_{k=1}^{n} k^2 \cdot \frac{1}{n} = \frac{(n + 1)(2n + 1)}{6}.$$
Example.
Let $k$ be a positive integer and let $X$ be a uniform random variable on the interval $[a, b]$. Find the $k$th moment $E[X^k]$ for all positive integers $k$.
The density function of $X$ is $$f(x) = \frac{1}{b - a} \quad \text{for } x \in [a, b].$$
Therefore the $k$th moment of $X$ is $$E[X^k] = \int_a^b \frac{x^k}{b - a}\,dx = \frac{b^{k+1} - a^{k+1}}{(k + 1)(b - a)}.$$
Example.
Suppose the random variable . Find the second moment of .
Fact.
In general, to find the $n$th moment of $X$, compute $E[X^n]$ directly from the p.m.f. or p.d.f., as above.
Median and quartiles
When a random variable has rare (abnormal) values, its expectation may be a bad indicator of where the center of the distribution lies.
Definition.
The median of a random variable $X$ is any real value $m$ that satisfies $$P(X \ge m) \ge \frac{1}{2} \quad \text{and} \quad P(X \le m) \ge \frac{1}{2}.$$
With at least half the probability on both $\{X \le m\}$ and $\{X \ge m\}$, the median is representative of the midpoint of the distribution. We say that the median is more robust because it is less affected by outliers. It is not necessarily unique.
Example.
Let $X$ be discretely uniformly distributed in a ten-element set containing one large outlying value, so $X$ has probability mass function $p_X(k) = \frac{1}{10}$ for each $k$ in the set.
Find the expected value and median of $X$.
The expected value is the average of the ten values, while the median is any number lying between the fifth and sixth values in increasing order.
The median reflects the fact that 90% of the values and of the probability is in the small range, while the mean is heavily influenced by the single outlying value.