These are some notes I’ve been collecting on random variables, their
distributions, expected values, and moment generating functions. I
thought I’d write them down somewhere useful.
These are almost extracted verbatim from my in-class notes, which I take
in real time using Typst. I simply wrote a tiny compatibility shim to
allow Pandoc to render them to the web.

Random variables
First, some brief exposition on random variables. Counterintuitively, a
random variable is actually a function.
Standard notation: $\Omega$ is a sample space, and an event is a subset
$A \subseteq \Omega$.
Definition.
A random variable is a function $X : \Omega \to \mathbb{R}$ that takes the
set of possible outcomes in a sample space, and maps it to a measurable
space, typically (as in our case) a subset of $\mathbb{R}$.
Definition.
The state space of a random variable $X$ is the set of all the values $X$
can take.
Example.
Let $X$ be a random variable that takes on finitely many values. Then the
state space of $X$ is the set of exactly those values.
Discrete random variables
A random variable $X$ is discrete if there is a countable set $S$ such that
$P(X \in S) = 1$. A value $k$ is a possible value if $P(X = k) > 0$. We
discuss continuous random variables later.
The probability distribution of $X$ gives its important probabilistic
information. The probability distribution is a description of the
probabilities $P(X \in B)$ for subsets $B \subseteq \mathbb{R}$. We describe
the probability mass function and the cumulative distribution function.
A discrete random variable has a probability distribution entirely
determined by its probability mass function (hereafter abbreviated p.m.f.
or PMF) $p(k) = P(X = k)$. The p.m.f. is a function from the set of possible
values of $X$ into $[0, 1]$. Labeling the p.m.f. with the random variable is
done by writing $p_X(k)$.

By the axioms of probability,
$$\sum_{k} p_X(k) = \sum_{k} P(X = k) = 1.$$
For a subset $B \subseteq \mathbb{R}$,
$$P(X \in B) = \sum_{k \in B} p_X(k).$$
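If it helps to make this concrete, here is a tiny Python sketch (the fair-die p.m.f. is just a stand-in example of my own) that checks the axiom and computes $P(X \in B)$ by summing over a subset:

```python
# p.m.f. of a fair six-sided die, stored as {value: probability}
pmf = {k: 1 / 6 for k in range(1, 7)}

# axiom check: the p.m.f. must sum to 1
assert abs(sum(pmf.values()) - 1.0) < 1e-12

# P(X in B) is the sum of the p.m.f. over B, e.g. B = "X is even"
B = {2, 4, 6}
print(sum(p for k, p in pmf.items() if k in B))  # 0.5
```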
Continuous random variables
Now, as promised, we introduce another major class of random variables.
Definition.
Let $X$ be a random variable. If a function $f$ satisfies
$$P(X \le b) = \int_{-\infty}^{b} f(x)\,dx$$
for all $b \in \mathbb{R}$, then $f$ is the probability density function
(hereafter abbreviated p.d.f. or PDF) of $X$.
We immediately see that the p.d.f. is analogous to the p.m.f. of the
discrete case.
The probability that $X \le b$ is equal to the area under the graph of $f$
from $-\infty$ to $b$.
A corollary is the following.
Fact.
$$P(X \in B) = \int_{B} f(x)\,dx$$
for any $B \subseteq \mathbb{R}$ where integration makes sense.
The set $B$ can be bounded or unbounded, or any collection of intervals.
Fact.
$$P(a \le X \le b) = \int_{a}^{b} f(x)\,dx.$$
Fact.
If a random variable $X$ has density function $f$, then individual point
values have probability zero:
$$P(X = c) = \int_{c}^{c} f(x)\,dx = 0 \quad \text{for all } c \in \mathbb{R}.$$
Remark.
It follows that a random variable with a density function is not discrete.
An immediate corollary of this is that the probabilities of intervals are
not changed by including or excluding endpoints. So $P(a \le X \le b)$ and
$P(a < X < b)$ are equivalent.
How do we determine which functions are p.d.f.s? Since
$P(-\infty < X < \infty) = 1$, a p.d.f. must satisfy
$$f(x) \ge 0 \text{ for all } x, \qquad \int_{-\infty}^{\infty} f(x)\,dx = 1.$$
Fact.
Random variables with density functions are called continuous random
variables. This does not imply that the random variable is a continuous
function on $\Omega$, but it is standard terminology.
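As a sketch of how you might check these conditions numerically (the density $f(x) = 2x$ on $[0, 1]$ is my own toy choice, not one from the notes):

```python
from scipy.integrate import quad

# toy density: f(x) = 2x on [0, 1], 0 elsewhere
def f(x):
    return 2 * x if 0 <= x <= 1 else 0.0

# a valid p.d.f. is nonnegative and integrates to 1 over the whole line
total, _ = quad(f, -1, 2)
print(round(total, 6))        # 1.0

# P(a <= X <= b) is the integral of f over [a, b]
prob, _ = quad(f, 0.25, 0.5)
print(prob)                   # 0.5^2 - 0.25^2 = 0.1875
```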
Discrete distributions
Recall that the probability distribution of $X$ gives its important
probabilistic information. Let us discuss some of these distributions.
In general we first consider the experiment’s properties and theorize about
the distribution that its random variable takes. We can then apply the
distribution to find out various pieces of probabilistic information.
Bernoulli trials
A Bernoulli trial is the original “experiment.” It’s simply a single trial
with a binary “success” or “failure” outcome. Encode this as T/F, 0 or 1, or
however you’d like. It becomes immediately useful in defining more complex
distributions, so let’s analyze its properties.
The setup: the experiment has exactly two outcomes:

Success – $s$ or 1
Failure – $f$ or 0

Additionally: $P(\text{success}) = p$ and $P(\text{failure}) = 1 - p$.
Construct the probability mass function:
$$P(X = 1) = p, \qquad P(X = 0) = 1 - p.$$
Write it as:
$$p_X(k) = p^{k}(1 - p)^{1 - k}$$
for $k \in \{0, 1\}$ and $0 \le p \le 1$.
Binomial distribution
The setup: very similar to Bernoulli, trials have exactly 2 outcomes. A
binomial experiment is a bunch of Bernoulli trials in a row.
Importantly: $p$ and $1 - p$ are defined exactly the same in all trials.
This ties the binomial distribution to the sampling with replacement model,
since each trial does not affect the next.
We conduct $n$ independent trials of this experiment. Example with coins:
each flip independently has the same chance of heads or tails (the same
holds for a die, a rigged coin, etc).
$n$ is fixed, i.e. known ahead of time.
Binomial random variable
Let’s consider the random variable characterized by the binomial
distribution now.
Let $X = \#$ of successes in $n$ independent trials. Any particular sequence
of trials takes the form $\omega = (s, f, f, \ldots, s)$ and is of length
$n$.
Then $X$ can take the $n + 1$ possible values $0, 1, \ldots, n$. The
probability of any particular sequence is given by the product of the
individual trial probabilities.
Example.
Take the sequence of all failures, $\omega = (f, f, \ldots, f)$.
So $P(\omega) = (1 - p)(1 - p) \cdots (1 - p) = (1 - p)^{n}$.
And since this is the only sequence with zero successes,
$P(X = 0) = (1 - p)^{n}$.
Now we can generalize: a particular sequence with $k$ successes and $n - k$
failures has probability $p^{k}(1 - p)^{n - k}$.
How about all successes?
$$P(X = n) = p^{n}.$$
We see that for all failures we have $(1 - p)^{n}$ and for all successes we
have $p^{n}$. Otherwise we use our method above, counting the
$\binom{n}{k}$ sequences that have exactly $k$ successes.
In general, here is the probability mass function for the binomial random
variable $X \sim \operatorname{Bin}(n, p)$:
$$P(X = k) = \binom{n}{k} p^{k}(1 - p)^{n - k}, \qquad k = 0, 1, \ldots, n.$$
The binomial distribution is very powerful. Choosing between two things,
what are the probabilities?
To summarize the characterization of the binomial random variable:

$n$ independent trials
each trial results in binary success or failure
with probability of success $p$, identically across trials

$X \sim \operatorname{Bin}(n, p)$, with $X = \#$ of successes in $n$ fixed
trials,

with probability mass function
$$P(X = k) = \binom{n}{k} p^{k}(1 - p)^{n - k}, \qquad k = 0, 1, \ldots, n.$$
We see this is in fact the binomial theorem!
$$\sum_{k=0}^{n} \binom{n}{k} p^{k}(1 - p)^{n - k}
= \big(p + (1 - p)\big)^{n} = 1.$$
In fact, this confirms that the p.m.f. sums to 1, as it must.
Example.
What is the probability of getting exactly three aces (1’s) out of 10
throws of a fair die?
This seems a little trickier, but we can still write it as a well defined
success/failure experiment. Let success be getting an ace and failure be
anything else. Then $p = \frac{1}{6}$, $n = 10$, and we want $P(X = 3)$. So
$$P(X = 3) = \binom{10}{3} \left(\frac{1}{6}\right)^{3}
\left(\frac{5}{6}\right)^{7} \approx 0.155.$$
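Here is a quick numerical check of this example with scipy; the parameters $n = 10$, $p = 1/6$, $k = 3$ come straight from the problem:

```python
from scipy.stats import binom

n, p = 10, 1 / 6
print(binom.pmf(3, n, p))                             # ~0.155

# the p.m.f. over k = 0..n sums to 1, as the binomial theorem promises
print(sum(binom.pmf(k, n, p) for k in range(n + 1)))  # ~1.0
```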
With or without replacement?
I place particular emphasis on the fact that the binomial distribution
generally applies to cases where you’re sampling with replacement.
Consider the following:
Example.
Suppose we have two types of candy, red and black, with $N_{\text{red}}$ red
and $N_{\text{black}}$ black candies out of $N$ total. Select $n$ candies.
Let $X$ be the number of red candies among the $n$ selected.
2 cases.

- case 1: with replacement: binomial distribution,
$X \sim \operatorname{Bin}(n, p)$ with $p = \frac{N_{\text{red}}}{N}$,
$$P(X = k) = \binom{n}{k} p^{k}(1 - p)^{n - k}.$$

- case 2: without replacement: then use counting,
$$P(X = k) = \frac{\binom{N_{\text{red}}}{k}\binom{N_{\text{black}}}{n - k}}
{\binom{N}{n}}.$$

In case 2, we used the elementary counting techniques we are already
familiar with. Immediately we see a distinct case, similar to the binomial
but when sampling without replacement. Let’s formalize this as a random
variable!
Hypergeometric distribution
Let’s introduce a random variable to represent a situation like case 2
above.
Definition.
$$P(X = k) = \frac{\binom{N_A}{k}\binom{N - N_A}{n - k}}{\binom{N}{n}}$$
is known as a hypergeometric distribution.
Abbreviate this by:
$$X \sim \operatorname{Hypergeom}(N, N_A, n).$$
For example, the candy setting above is
$\operatorname{Hypergeom}(N, N_{\text{red}}, n)$.
Remark.
If $n$ is very small relative to $N$, then both cases give similar (approx.
the same) answers.
For instance, if we’re sampling for blood types from UCSB, and we take a
student out without replacement, we don’t really change the composition of
the population substantially. So both answers give a similar result.
Suppose we have two types of items, type $A$ and type $B$. Let $N_A$ be the
number of type $A$ items and $N_B$ the number of type $B$ items, so
$N = N_A + N_B$ is the total number of objects.
We sample $n$ items without replacement ($n \le N$) with order not
mattering. Denote by $X$ the number of type $A$ objects in our sample.
Definition.
Let $0 \le N_A \le N$ and $1 \le n \le N$ be integers. A random variable
$X$ has the hypergeometric distribution with parameters $(N, N_A, n)$ if
$X$ takes values in the set $\{0, 1, \ldots, n\}$ and has p.m.f.
$$P(X = k) = \frac{\binom{N_A}{k}\binom{N - N_A}{n - k}}{\binom{N}{n}}.$$
Example.
Let $N_A$ be the number of defectives and $N_B = N - N_A$ the number of
non-defectives in a batch of $N$ items. We select $n = 5$ without
replacement. What is the probability that 2 of the 5 selected are
defective?
We want $P(X = 2)$:
$$P(X = 2) = \frac{\binom{N_A}{2}\binom{N - N_A}{3}}{\binom{N}{5}}.$$
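With concrete (made-up) numbers, say a batch of $N = 100$ items containing $N_A = 10$ defectives, scipy's hypergeometric distribution reproduces the counting formula:

```python
from scipy.stats import hypergeom
from math import comb

N, N_A, n = 100, 10, 5   # made-up batch: 100 items, 10 defective, draw 5

# scipy's parameter order: M = population size, n = successes in the
# population, N = number of draws
print(hypergeom(M=N, n=N_A, N=n).pmf(2))              # ~0.0702
print(comb(N_A, 2) * comb(N - N_A, 3) / comb(N, 5))   # same, by counting
```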
Remark.
Make sure you can distinguish when a problem is binomial and when it is
hypergeometric. This is very important on exams.
Recall that both ask about the number of successes in a fixed number of
trials. But the binomial samples with replacement (each trial is
independent), while sampling without replacement gives the hypergeometric.
Geometric distribution
Consider an infinite sequence of independent trials, e.g. the number of
attempts until I make a basket.
In fact we can think of this as a variation on the binomial distribution.
But in this case we don’t sample $n$ times and ask how many successes we
have; we sample as many times as we need for one success. Later on we’ll
see this is really a specific case of another distribution, the negative
binomial.
Let $X_i$ denote the outcome of the $i^{\text{th}}$ trial, where success is
1 and failure is 0. Let $N$ be the number of trials needed to observe the
first success in a sequence of independent trials with probability of
success $p$. Then we fail the first $k - 1$ times and succeed on the
$k^{\text{th}}$ try, so:
$$P(N = k) = (1 - p)^{k - 1} p.$$
This is the probability of failure raised to the number of failures, times
the probability of success.
The key characteristic in these trials: we keep going until we succeed.
There’s no $\binom{n}{k}$ choose term in front like in the binomial
distribution, because there’s exactly one sequence that gives us the first
success on the $k^{\text{th}}$ trial.
Definition.
Let $0 < p \le 1$. A random variable $X$ has the geometric distribution
with success parameter $p$ if the possible values of $X$ are
$\{1, 2, 3, \ldots\}$ and $X$ satisfies
$$P(X = k) = (1 - p)^{k - 1} p$$
for positive integers $k$. Abbreviate this by
$X \sim \operatorname{Geom}(p)$.
Example.
What is the probability it takes more than seven rolls of a fair die to
roll a six?
Let $X$ be the number of rolls of a fair die until the first six. Then
$X \sim \operatorname{Geom}(1/6)$. Now we just want
$$P(X > 7) = \sum_{k=8}^{\infty} P(X = k)
= \sum_{k=8}^{\infty} \left(\frac{5}{6}\right)^{k - 1} \frac{1}{6}.$$
Re-indexing,
$$P(X > 7) = \frac{1}{6}\left(\frac{5}{6}\right)^{7}
\sum_{j=0}^{\infty} \left(\frac{5}{6}\right)^{j}.$$
Now we calculate by standard methods (a geometric series):
$$P(X > 7) = \frac{1}{6}\left(\frac{5}{6}\right)^{7} \cdot
\frac{1}{1 - \frac{5}{6}} = \left(\frac{5}{6}\right)^{7} \approx 0.279.$$
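The same answer falls out of scipy, where the survival function sf(7) is exactly $P(X > 7)$:

```python
from scipy.stats import geom

p = 1 / 6                 # chance of rolling a six
print(geom.sf(7, p))      # P(X > 7) ~ 0.279
print((5 / 6) ** 7)       # the closed form from the series above
```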
Negative binomial
As promised, here’s the negative binomial.
Consider a sequence of Bernoulli trials with the following
characteristics:

Each trial is a success or failure
Prob. of success $p$ is the same on each trial
Trials are independent (notice they are not fixed to a specific number)
The experiment continues until $k$ successes are observed, where $k$ is a
given parameter

Then if $X$ is the number of trials necessary until $k$ successes are
observed, we say $X$ is a negative binomial random variable.
Immediately we see that the geometric distribution is just the negative
binomial with $k = 1$.
Definition.
Let $k \ge 1$ be an integer and $0 < p \le 1$. A random variable $X$ has
the negative binomial distribution with parameters $(k, p)$ if the possible
values of $X$ are the integers $\{k, k + 1, k + 2, \ldots\}$ and the p.m.f.
is
$$P(X = n) = \binom{n - 1}{k - 1} p^{k}(1 - p)^{n - k}
\qquad \text{for } n \ge k.$$
Abbreviate this by $X \sim \operatorname{Negbin}(k, p)$.
Example.
Steph Curry makes a three-point attempt with some fixed probability $p$.
What is the probability that Steph makes his third three-point basket on
his $n^{\text{th}}$ attempt?
Let $X$ be the number of attempts required to observe the 3rd success.
Then $X \sim \operatorname{Negbin}(k = 3, p)$.
So,
$$P(X = n) = \binom{n - 1}{2} p^{3}(1 - p)^{n - 3}.$$
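The notes don't pin down Curry's percentage or the attempt number here, so the numbers below are placeholders ($p = 0.4$, third make on the 8th attempt); the point is just how the p.m.f. gets evaluated. Note that scipy's nbinom counts failures rather than trials:

```python
from scipy.stats import nbinom
from math import comb

p, k, n = 0.4, 3, 8   # placeholder values: success prob, successes, trial

# scipy's nbinom counts the number of failures before the k-th success,
# so "k-th success on trial n" corresponds to n - k failures
print(nbinom.pmf(n - k, k, p))                         # ~0.105

# cross-check against the p.m.f. written above
print(comb(n - 1, k - 1) * p**k * (1 - p) ** (n - k))
```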
Poisson distribution
The Poisson p.m.f. (defined below) follows from the Taylor expansion
$$e^{\lambda} = \sum_{k=0}^{\infty} \frac{\lambda^{k}}{k!},$$
which implies that
$$\sum_{k=0}^{\infty} e^{-\lambda} \frac{\lambda^{k}}{k!}
= e^{-\lambda} e^{\lambda} = 1.$$
Definition.
For an integer valued random variable $X$, we say
$X \sim \operatorname{Poisson}(\lambda)$ if it has p.m.f.
$$P(X = k) = e^{-\lambda} \frac{\lambda^{k}}{k!}$$
for $k \in \{0, 1, 2, \ldots\}$, for $\lambda > 0$, and
$$\sum_{k=0}^{\infty} P(X = k) = 1.$$
The Poisson arises from the binomial. It applies in the binomial context
when $n$ is very large ($n \to \infty$) and $p$ is very small
($p \to 0$), such that $np = \lambda$ is a moderate number
($\lambda < \infty$).
Then $X$ approximately follows a Poisson distribution with $\lambda = np$:
$$P(X = k) \approx e^{-np} \frac{(np)^{k}}{k!}$$
for $k = 0, 1, \ldots, n$.
The Poisson distribution is useful for finding the probabilities of rare
events over a continuous interval of time. By knowing $\lambda = np$ for
small $p$ and large $n$, we can calculate many probabilities.
Example.
The number of typing errors on a page of a textbook.
Let

$n$ be the number of letters or symbols per page (large)
$p$ be the probability of an error, small enough such that $np = \lambda$
is moderate

What is the probability of exactly 1 error?
We can approximate the distribution of $X$ with a
$\operatorname{Poisson}(\lambda = np)$ distribution:
$$P(X = 1) = e^{-\lambda} \frac{\lambda^{1}}{1!} = \lambda e^{-\lambda}.$$
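To see the approximation in action, here is a sketch with made-up page numbers ($n = 1000$ symbols and $p = 0.002$, so $\lambda = np = 2$):

```python
from scipy.stats import binom, poisson

n, p = 1000, 0.002        # made-up page: 1000 symbols, error prob 0.002
lam = n * p               # lambda = 2

# P(exactly 1 error): exact binomial vs. the Poisson approximation
print(binom.pmf(1, n, p))     # ~0.271
print(poisson.pmf(1, lam))    # lambda * exp(-lambda) ~ 0.271
```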
Continuous distributions
All of the distributions we’ve been analyzing have been discrete, that is,
they apply to random variables with a countable state space.
Even when the state space is infinite, as in the negative binomial, it is
countable. We can think of it as indexing each trial with a natural number
$1, 2, 3, \ldots$
Now we turn our attention to continuous random variables, which operate on
uncountably infinite state spaces. For example, if we sample uniformly
inside of the interval $[0, 1]$, there are an uncountably infinite number
of possible values we could obtain. We cannot index these values by the
natural numbers; by some theorems of set theory we in fact know that the
interval $[0, 1]$ has a bijection to $\mathbb{R}$ and has cardinality
$2^{\aleph_0}$.
Additionally, we notice that asking for the probability that we pick a
certain point in the interval $[0, 1]$ makes no sense: there are an
infinite number of sample points! Intuitively, we should think that the
probability of choosing any particular point is 0. However, we should be
able to make statements about whether we can choose a point that lies
within a subset, like a subinterval of $[0, 1]$.
Let’s formalize these ideas.
Definition.
Let $X$ be a random variable. If we have a function $f$ such that
$$P(X \le b) = \int_{-\infty}^{b} f(x)\,dx$$
for all $b \in \mathbb{R}$, then $f$ is the probability density function
of $X$.
The probability that the value of $X$ lies in $(-\infty, b]$ equals the
area under the curve of $f$ from $-\infty$ to $b$.
If $f$ satisfies this definition, then for any $B \subseteq \mathbb{R}$
for which integration makes sense,
$$P(X \in B) = \int_{B} f(x)\,dx.$$
Remark.
Recall from our previous discussion of random variables that the PDF is
the continuous analogue of the PMF for discrete random variables.
Properties of a CDF:
Any CDF $F(x) = P(X \le x)$ satisfies

Integrates to unity: $\lim_{x \to -\infty} F(x) = 0$ and
$\lim_{x \to \infty} F(x) = 1$
$F(x)$ is non-decreasing in $x$ (monotonically increasing)
$P(a < X \le b) = F(b) - F(a)$

Like we mentioned before, we can only ask about things like
$P(X \le c)$, but not $P(X = c)$. In fact, $P(X = c) = 0$ for all $c$.
An immediate corollary of this is that we can freely interchange $<$
and $\le$, and likewise $>$ and $\ge$, since $P(X \le c) = P(X < c)$ if
$P(X = c) = 0$.
Example.
Let $X$ be a continuous random variable with a density (pdf) $f$ that is
given piecewise, contains an unknown constant $c$, and is 0 outside a
bounded range.

- What is $c$?

$c$ is such that the density integrates to 1:
$$\int_{-\infty}^{\infty} f(x)\,dx = 1.$$

- Find the probability that $X$ is between 1 and 1.4.

Integrate the curve between 1 and 1.4:
$$P(1 \le X \le 1.4) = \int_{1}^{1.4} f(x)\,dx.$$
This is the probability that $X$ lies between 1 and 1.4.

- Find the probability that $X$ is between 1 and 3.

Idea: integrate between 1 and 3, being careful after 2, where the formula
for the density changes:
$$P(1 \le X \le 3) = \int_{1}^{2} f(x)\,dx + \int_{2}^{3} f(x)\,dx.$$

- What is the CDF for $X$? Integrate the curve up to $x$:
$$F(x) = \int_{-\infty}^{x} f(t)\,dt.$$

Important: include the range! The CDF must be specified for every $x$: it
is 0 before the support of $f$ begins and 1 after it ends.

- Find a point $a$ such that you integrate up to the point to find exactly
a prescribed fraction of the area.

We want to find $a$ such that $F(a) = \int_{-\infty}^{a} f(x)\,dx$ equals
that fraction.
Now let us discuss some named continuous distributions.

The uniform distribution
The most simple and the best of the named distributions!
Definition.
Let $[a, b]$ be a bounded interval on the real line. A random variable $X$
has the uniform distribution on the interval $[a, b]$ if $X$ has the
density function
$$f(x) = \begin{cases} \dfrac{1}{b - a} & \text{for } x \in [a, b] \\
0 & \text{otherwise.} \end{cases}$$
Abbreviate this by $X \sim \operatorname{Unif}[a, b]$.
The graph of $f$ is a constant line at height $\frac{1}{b - a}$ defined
across $[a, b]$. The integral is just the area of a rectangle, and we can
check it is 1.
Fact.
For $X \sim \operatorname{Unif}[a, b]$, its cumulative distribution
function (CDF) is given by:
$$F(x) = \begin{cases} 0 & x < a \\ \dfrac{x - a}{b - a} & a \le x \le b \\
1 & x > b. \end{cases}$$
Fact.
If $X \sim \operatorname{Unif}[a, b]$, and $[c, d] \subseteq [a, b]$, then
$$P(c \le X \le d) = \int_{c}^{d} \frac{1}{b - a}\,dx = \frac{d - c}{b - a}.$$
Example.
Let $Y$ be a uniform random variable on an interval $[a, b]$ that contains
$[-1, 1]$. Find the probability that its absolute value is at least 1.
$Y$ takes values in the interval $[a, b]$, so the absolute value is at
least 1 iff $Y \in [a, -1] \cup [1, b]$.
The density function of $Y$ is $f(x) = \frac{1}{b - a}$ on $[a, b]$ and 0
everywhere else.
So,
$$P(|Y| \ge 1) = P(a \le Y \le -1) + P(1 \le Y \le b)
= \frac{-1 - a}{b - a} + \frac{b - 1}{b - a} = \frac{b - a - 2}{b - a}.$$
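Here is the same computation with a concrete interval chosen just for illustration, $Y \sim \operatorname{Unif}[-2, 2]$:

```python
from scipy.stats import uniform

a, b = -2.0, 2.0                  # illustrative interval
Y = uniform(loc=a, scale=b - a)   # scipy parametrizes by loc = a, scale = b - a

# P(|Y| >= 1) = P(Y <= -1) + P(Y >= 1)
print(Y.cdf(-1) + Y.sf(1))        # 0.25 + 0.25 = 0.5
print((b - a - 2) / (b - a))      # general formula from the example
```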
The exponential distribution
The geometric distribution can be viewed as modeling waiting times in a
discrete setting, i.e. we wait through $k - 1$ failures to arrive at the
$k^{\text{th}}$ success.
The exponential distribution is the continuous analogue to the geometric
distribution, in that we often use it to model waiting times in the
continuous sense. For example, the arrival of the first customer to enter
the barber shop.
Definition.
Let $0 < \lambda < \infty$. A random variable $X$ has the exponential
distribution with parameter $\lambda$ if $X$ has PDF
$$f(x) = \begin{cases} \lambda e^{-\lambda x} & x \ge 0 \\ 0 & x < 0.
\end{cases}$$
Abbreviate this by $X \sim \operatorname{Exp}(\lambda)$, the exponential
distribution with rate $\lambda$.
The CDF of the $\operatorname{Exp}(\lambda)$ distribution is given by:
$$F(t) = \begin{cases} 0 & t < 0 \\ 1 - e^{-\lambda t} & t \ge 0.
\end{cases}$$
Example.
Suppose the length of a phone call, in minutes, is well modeled by an
exponential random variable with rate $\lambda$.

What is the probability that a call takes more than 8 minutes?
What is the probability that a call takes between 8 and 22 minutes?

Let $X$ be the length of the phone call, so that
$X \sim \operatorname{Exp}(\lambda)$. Then we can find the desired
probability by:
$$P(X > 8) = 1 - P(X \le 8) = 1 - F(8) = e^{-8\lambda}.$$
Now to find $P(8 < X < 22)$, we can take the difference in CDFs:
$$P(8 < X < 22) = F(22) - F(8) = e^{-8\lambda} - e^{-22\lambda}.$$
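The rate isn't specified above, so here is the computation with a placeholder $\lambda = 0.1$ per minute:

```python
import numpy as np
from scipy.stats import expon

lam = 0.1                      # placeholder rate
X = expon(scale=1 / lam)       # scipy parametrizes by scale = 1/lambda

print(X.sf(8))                 # P(X > 8) = exp(-8*lam) ~ 0.449
print(X.cdf(22) - X.cdf(8))    # P(8 < X < 22) ~ 0.339
print(np.exp(-8 * lam) - np.exp(-22 * lam))  # closed form, same value
```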
Fact (Memoryless property of the exponential distribution).
Suppose that $X \sim \operatorname{Exp}(\lambda)$. Then for any
$s, t > 0$, we have
$$P(X > t + s \mid X > t) = P(X > s).$$
This is like saying: if I’ve been waiting 5 minutes for the bus, what is
the probability that I’m going to wait more than 5 + 3 minutes, given that
I’ve already waited 5 minutes? And that’s precisely equal to just the
probability I’m going to wait more than 3 minutes.
Proof.
$$P(X > t + s \mid X > t) = \frac{P(X > t + s,\ X > t)}{P(X > t)}
= \frac{P(X > t + s)}{P(X > t)}
= \frac{e^{-\lambda(t + s)}}{e^{-\lambda t}} = e^{-\lambda s} = P(X > s).$$
Gamma distribution
Definition.
Let $r, \lambda > 0$. A random variable $X$ has the gamma distribution
with parameters $(r, \lambda)$ if $X$ is nonnegative and has probability
density function
$$f(x) = \frac{\lambda^{r} x^{r - 1}}{\Gamma(r)}\, e^{-\lambda x}
\qquad \text{for } x \ge 0.$$
Abbreviate this by $X \sim \operatorname{Gamma}(r, \lambda)$.
The gamma function $\Gamma(r)$ generalizes the factorial function and is
defined as
$$\Gamma(r) = \int_{0}^{\infty} x^{r - 1} e^{-x}\,dx, \qquad r > 0.$$
Special case: $\Gamma(n) = (n - 1)!$ if $n$ is a positive integer.
Remark.
The $\operatorname{Exp}(\lambda)$ distribution is a special case of the
gamma distribution, with parameter $r = 1$.
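A one-line check of that remark: the $\operatorname{Gamma}(1, \lambda)$ density and the $\operatorname{Exp}(\lambda)$ density agree (the rate $\lambda = 0.5$ is an arbitrary pick):

```python
import numpy as np
from scipy.stats import gamma, expon

lam = 0.5
x = np.linspace(0, 10, 5)

# Gamma(r = 1, lambda) has the same density as Exp(lambda);
# scipy uses shape a = r and scale = 1/lambda
print(gamma.pdf(x, a=1, scale=1 / lam))
print(expon.pdf(x, scale=1 / lam))       # identical values
```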
The normal distribution
Also known as the Gaussian distribution, this is so important it gets its
own section.
Definition.
A random variable $Z$ has the standard normal distribution if $Z$ has
density function
$$\varphi(x) = \frac{1}{\sqrt{2\pi}}\, e^{-x^{2}/2}$$
on the real line. Abbreviate this by $Z \sim \mathcal{N}(0, 1)$.
Fact (CDF of a standard normal random variable).
Let $Z \sim \mathcal{N}(0, 1)$ be normally distributed. Then its CDF is
given by
$$\Phi(x) = \int_{-\infty}^{x} \varphi(s)\,ds
= \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi}}\, e^{-s^{2}/2}\,ds.$$
The normal distribution is so important that, instead of the standard $f$
and $F$, we use the special symbols $\varphi$ and $\Phi$ for its density
and CDF.
Fact.
No closed form of the standard normal CDF $\Phi$ exists, so we are left to
either:

look up values in a table of precomputed values of $\Phi$, or
evaluate the integral numerically.

To evaluate negative values, we can use the symmetry of the normal
distribution to apply the following identity:
$$\Phi(-x) = 1 - \Phi(x).$$
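In practice "numerically" usually means something like scipy's norm.cdf, which also makes the symmetry identity easy to check:

```python
from scipy.stats import norm

print(norm.cdf(1.0))                     # Phi(1) ~ 0.8413

# symmetry: Phi(-x) = 1 - Phi(x)
x = 1.0
print(norm.cdf(-x), 1 - norm.cdf(x))     # both ~0.1587
```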
General normal distributions
We can compute any other parameters of the normal distribution using the
standard normal.
The general family of normal distributions is obtained by linear or affine
transformations of $Z$. Let $\mu$ be real, and $\sigma > 0$; then
$$X = \sigma Z + \mu$$
is also a normally distributed random variable with parameters
$(\mu, \sigma^{2})$. The CDF of $X$ in terms of $\Phi$ can be expressed as
$$F_X(x) = P(X \le x) = P(\sigma Z + \mu \le x)
= P\!\left(Z \le \frac{x - \mu}{\sigma}\right)
= \Phi\!\left(\frac{x - \mu}{\sigma}\right).$$
Also,
$$f_X(x) = F_X'(x)
= \frac{1}{\sigma}\,\varphi\!\left(\frac{x - \mu}{\sigma}\right)
= \frac{1}{\sqrt{2\pi\sigma^{2}}}\, e^{-\frac{(x - \mu)^{2}}{2\sigma^{2}}}.$$
Definition.
Let $\mu$ be real and $\sigma > 0$. A random variable $X$ has the normal
distribution with mean $\mu$ and variance $\sigma^{2}$ if $X$ has density
function
$$f(x) = \frac{1}{\sqrt{2\pi\sigma^{2}}}\, e^{-\frac{(x - \mu)^{2}}{2\sigma^{2}}}$$
on the real line. Abbreviate this by $X \sim \mathcal{N}(\mu, \sigma^{2})$.
Fact.
Let $X \sim \mathcal{N}(\mu, \sigma^{2})$ and $Y = aX + b$ with $a \ne 0$.
Then
$$Y \sim \mathcal{N}(a\mu + b,\ a^{2}\sigma^{2}).$$
That is, $Y$ is normally distributed with parameters
$(a\mu + b,\ a^{2}\sigma^{2})$. In particular,
$$Z = \frac{X - \mu}{\sigma} \sim \mathcal{N}(0, 1)$$
is a standard normal variable.
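A small sketch of the standardization trick, with parameters picked arbitrarily ($\mu = 10$, $\sigma = 2$): computing $P(X \le 13)$ directly and via $\Phi$ gives the same number.

```python
from scipy.stats import norm

mu, sigma = 10.0, 2.0    # arbitrary parameters for illustration

# P(X <= 13) directly, and by standardizing Z = (X - mu) / sigma
print(norm.cdf(13, loc=mu, scale=sigma))   # ~0.9332
print(norm.cdf((13 - mu) / sigma))         # Phi(1.5), same value
```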
Expectation
Let’s discuss the expectation of a random variable, which is a similar
idea to the basic concept of the mean.
Definition.
The expectation or mean of a discrete random variable $X$ is the weighted
average of its possible values, with weights assigned by the corresponding
probabilities:
$$E[X] = \sum_{k} k\, P(X = k).$$
Example.
Find the expected value of a single roll of a fair die.
Here $X$ takes the values $1, \ldots, 6$, each with probability
$\frac{1}{6}$, so
$$E[X] = 1 \cdot \frac{1}{6} + 2 \cdot \frac{1}{6} + \cdots
+ 6 \cdot \frac{1}{6} = \frac{21}{6} = 3.5.$$
Binomial expected value
For $X \sim \operatorname{Bin}(n, p)$,
$$E[X] = np.$$
Bernoulli expected value
Bernoulli is just binomial with one trial.
Recall that $P(X = 1) = p$ and $P(X = 0) = 1 - p$.
$$E[X] = 1 \cdot p + 0 \cdot (1 - p) = p.$$
Let $A$ be an event on $\Omega$. Its indicator random variable $I_A$ is
defined for $\omega \in \Omega$ by
$$I_A(\omega) = \begin{cases} 1 & \omega \in A \\ 0 & \omega \notin A,
\end{cases}$$
so that $E[I_A] = P(A)$.
Geometric expected value
Let $0 < p \le 1$ and let $X \sim \operatorname{Geom}(p)$ be a geometric
RV with probability of success $p$. Recall that the p.m.f. is
$p_X(k) = p q^{k - 1}$ for $k = 1, 2, \ldots$, where the prob. of failure
is defined by $q = 1 - p$.
Then
$$E[X] = \sum_{k=1}^{\infty} k\, p q^{k - 1}
= p \sum_{k=1}^{\infty} k\, q^{k - 1}.$$
Now recall from calculus that you can differentiate a power series term
by term inside its radius of convergence. So for $|t| < 1$,
$$\sum_{k=1}^{\infty} k\, t^{k - 1}
= \frac{d}{dt} \sum_{k=0}^{\infty} t^{k}
= \frac{d}{dt}\, \frac{1}{1 - t} = \frac{1}{(1 - t)^{2}},$$
and therefore
$$E[X] = p \cdot \frac{1}{(1 - q)^{2}} = \frac{p}{p^{2}} = \frac{1}{p}.$$
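A quick numerical sanity check of $E[X] = 1/p$, truncating the infinite series (with $p = 1/6$ as an arbitrary pick):

```python
import numpy as np

p = 1 / 6
q = 1 - p
k = np.arange(1, 10_000)
print(np.sum(k * p * q ** (k - 1)))   # ~6.0
print(1 / p)                          # 6.0
```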
Expected value of a continuous RV
Definition.
The expectation or mean of a continuous random variable $X$ with density
function $f$ is
$$E[X] = \int_{-\infty}^{\infty} x f(x)\,dx.$$
An alternative symbol is $\mu = E[X]$.
$\mu$ is the “first moment” of $X$; analogous to physics, it’s the “center
of gravity” of $X$.
Remark.
In general when moving between discrete and continuous RVs, replace sums
with integrals, p.m.f. with p.d.f., and vice versa.
Example.
Suppose $X$ is a continuous RV with a given p.d.f. $f$. Then its expected
value is computed with exactly the integral above; the examples that
follow work this out for the named distributions.
Example (Uniform expectation).
Let $X$ be a uniform random variable on the interval $[a, b]$ with
$X \sim \operatorname{Unif}[a, b]$. Find the expected value of $X$.
$$E[X] = \int_{a}^{b} \frac{x}{b - a}\,dx
= \frac{1}{b - a} \cdot \frac{b^{2} - a^{2}}{2} = \frac{a + b}{2},$$
the midpoint of the interval.
Example (Exponential expectation).
Find the expected value of an exponential RV, with p.d.f.
$f(x) = \lambda e^{-\lambda x}$ for $x \ge 0$:
$$E[X] = \int_{0}^{\infty} x \lambda e^{-\lambda x}\,dx
= \left[-x e^{-\lambda x}\right]_{0}^{\infty}
+ \int_{0}^{\infty} e^{-\lambda x}\,dx = \frac{1}{\lambda}.$$
Example (Uniform dartboard).
Our dartboard is a disk of radius $r_0$ and the dart lands uniformly at
random on the disk when thrown. Let $R$ be the distance of the dart from
the center of the disk. Find $E[R]$ given density function
$$f_R(t) = \begin{cases} \dfrac{2t}{r_0^{2}} & 0 \le t \le r_0 \\
0 & \text{otherwise.} \end{cases}$$
$$E[R] = \int_{0}^{r_0} t \cdot \frac{2t}{r_0^{2}}\,dt
= \frac{2}{r_0^{2}} \cdot \frac{r_0^{3}}{3} = \frac{2}{3} r_0.$$
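The three expectations above are easy to confirm numerically; the parameter values here are arbitrary choices ($[a, b] = [0, 4]$, $\lambda = 0.5$, $r_0 = 3$), each picked so the expected value is 2:

```python
import numpy as np
from scipy.integrate import quad

a, b = 0.0, 4.0    # Unif[a, b]: mean (a + b)/2 = 2
lam = 0.5          # Exp(lambda): mean 1/lambda = 2
r0 = 3.0           # dartboard of radius r0: mean 2*r0/3 = 2

print(quad(lambda x: x / (b - a), a, b)[0])
print(quad(lambda x: x * lam * np.exp(-lam * x), 0, np.inf)[0])
print(quad(lambda t: t * 2 * t / r0**2, 0, r0)[0])
```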
Expectation of derived values
If we can find the expected value of $X$, can we find the expected value
of $X^{2}$? More precisely, can we find $E[X^{2}]$?
If the distribution of $X^{2}$ is easy to see, then this is trivial.
Otherwise we have the following useful property:
$$E[X^{2}] = \int_{-\infty}^{\infty} x^{2} f(x)\,dx$$
(for continuous RVs).
And in the discrete case,
$$E[X^{2}] = \sum_{k} k^{2}\, P(X = k).$$
In fact $E[X^{2}]$ is so important that we call it the mean square.
Fact.
More generally, a real valued function $g(X)$ defined on the range of $X$
is itself a random variable (with its own distribution).
We can find the expected value of $g(X)$ by
$$E[g(X)] = \int_{-\infty}^{\infty} g(x) f(x)\,dx$$
or
$$E[g(X)] = \sum_{k} g(k)\, P(X = k).$$
Example.
You roll a fair die to determine the winnings (or losses) $W$ of a player
according to a payout assigned to each face of the die.
What are the expected winnings/losses for the player during 1 roll of the
die?
Let $X$ denote the outcome of the roll of the die. Then we can define our
random variable as $W = g(X)$, where the function $g$ assigns the payout
to each outcome: $g(1)$ is the payout when the roll is 1, and so on.
Note that $P(X = 1) = \frac{1}{6}$. Likewise $P(X = 2) = \frac{1}{6}$, and
so on for each face.
Then
$$E[W] = E[g(X)] = \sum_{k=1}^{6} g(k)\, P(X = k)
= \frac{1}{6}\sum_{k=1}^{6} g(k).$$
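The payout table itself is omitted here, so take a made-up rule just to show the $E[g(X)]$ mechanics:

```python
from fractions import Fraction

# made-up payout rule: faces 1-6 pay -1, -1, 0, 0, 1, 3 respectively
payout = {1: -1, 2: -1, 3: 0, 4: 0, 5: 1, 6: 3}

# E[g(X)] = sum_k g(k) * P(X = k), with P(X = k) = 1/6 for a fair die
expected = sum(Fraction(g, 6) for g in payout.values())
print(expected)   # 1/3
```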
Example.
A stick of length $\ell$ is broken at a uniformly chosen random location.
What is the expected length of the longer piece?
Idea: if you break it before the halfway point, then the longer piece has
length given by $\ell - U$. If you break it after the halfway point, the
longer piece has length $U$.
Let the interval $[0, \ell]$ represent the stick and let
$U \sim \operatorname{Unif}[0, \ell]$ be the location where the stick is
broken. Then $U$ has density $f(x) = \frac{1}{\ell}$ on $[0, \ell]$ and 0
elsewhere.
Let $g(U)$ be the length of the longer piece when the stick is broken at
$U$,
$$g(u) = \begin{cases} \ell - u & 0 \le u < \frac{\ell}{2} \\
u & \frac{\ell}{2} \le u \le \ell. \end{cases}$$
Then
$$E[g(U)] = \int_{0}^{\ell} g(u)\,\frac{1}{\ell}\,du
= \int_{0}^{\ell/2} \frac{\ell - u}{\ell}\,du
+ \int_{\ell/2}^{\ell} \frac{u}{\ell}\,du
= \frac{3\ell}{8} + \frac{3\ell}{8} = \frac{3\ell}{4}.$$
So we expect the longer piece to be $\frac{3}{4}$ of the total length,
which is a bit counterintuitive.
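A Monte Carlo check of the $\frac{3}{4}$ answer (stick length 1, one million uniform break points):

```python
import numpy as np

rng = np.random.default_rng(0)
ell = 1.0
u = rng.uniform(0.0, ell, size=1_000_000)   # uniform break points
longer = np.maximum(u, ell - u)             # length of the longer piece
print(longer.mean())                        # ~0.75 = 3*ell/4
```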
Moments of a random variable
We continue discussing expectation, but we introduce new terminology.
Fact.
The $n^{\text{th}}$ moment (or $n^{\text{th}}$ raw moment) of a discrete
random variable $X$ with p.m.f. $p_X(k)$ is the expectation
$$E[X^{n}] = \sum_{k} k^{n}\, p_X(k).$$
If $X$ is continuous, then we have analogously
$$E[X^{n}] = \int_{-\infty}^{\infty} x^{n} f(x)\,dx.$$
The deviation is given by $\sigma$ and the variance is given by
$\sigma^{2}$, where
$$\sigma^{2} = E[(X - \mu)^{2}] = E[X^{2}] - (E[X])^{2}.$$
The third moment is used to measure the “skewness” / asymmetry of a
distribution. For example, the normal distribution is very symmetric.
The fourth moment is used to measure the kurtosis/peakedness of a
distribution.
Central moments
Previously we discussed “raw moments.” Be careful not to confuse them with
central moments.
Fact.
The $n^{\text{th}}$ central moment of a discrete random variable $X$ with
p.m.f. $p_X(k)$ is the expected value of the difference about the mean
raised to the $n^{\text{th}}$ power:
$$\mu_n = E[(X - \mu)^{n}] = \sum_{k} (k - \mu)^{n}\, p_X(k).$$
And of course in the continuous case,
$$\mu_n = E[(X - \mu)^{n}]
= \int_{-\infty}^{\infty} (x - \mu)^{n} f(x)\,dx.$$
In particular,
$$\mu_1 = 0 \qquad \text{and} \qquad
\mu_2 = \sigma^{2} = \operatorname{Var}(X).$$
Example.
Let $Y$ be a uniformly chosen integer from $\{0, 1, 2, \ldots, m\}$. Find
the first and second moment of $Y$.
The p.m.f. of $Y$ is $p_Y(k) = \frac{1}{m + 1}$ for
$k \in \{0, 1, \ldots, m\}$. Thus,
$$E[Y] = \sum_{k=0}^{m} k \cdot \frac{1}{m + 1}
= \frac{1}{m + 1} \cdot \frac{m(m + 1)}{2} = \frac{m}{2}.$$
Then,
$$E[Y^{2}] = \sum_{k=0}^{m} k^{2} \cdot \frac{1}{m + 1}
= \frac{1}{m + 1} \cdot \frac{m(m + 1)(2m + 1)}{6}
= \frac{m(2m + 1)}{6}.$$
Example.
Let $c > 0$ and let $U$ be a uniform random variable on the interval
$[0, c]$. Find the $n^{\text{th}}$ moment of $U$ for all positive
integers $n$.
The density function of $U$ is
$$f(x) = \begin{cases} \dfrac{1}{c} & 0 \le x \le c \\
0 & \text{otherwise.} \end{cases}$$
Therefore the $n^{\text{th}}$ moment of $U$ is
$$E[U^{n}] = \int_{0}^{c} x^{n} \frac{1}{c}\,dx = \frac{c^{n}}{n + 1}.$$
Example.
Suppose the random variable $X \sim \operatorname{Exp}(\lambda)$. Find the
second moment of $X$:
$$E[X^{2}] = \int_{0}^{\infty} x^{2} \lambda e^{-\lambda x}\,dx
= \frac{2}{\lambda^{2}}.$$
Fact.
In general, to find the $n^{\text{th}}$ moment of
$X \sim \operatorname{Exp}(\lambda)$,
$$E[X^{n}] = \int_{0}^{\infty} x^{n} \lambda e^{-\lambda x}\,dx
= \frac{n!}{\lambda^{n}}.$$
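Both moment formulas are easy to verify by direct integration; the values $c = 2$, $\lambda = 0.5$, $n = 3$ below are arbitrary:

```python
import numpy as np
from scipy.integrate import quad

c, lam, n = 2.0, 0.5, 3   # arbitrary parameters

# n-th moment of Unif[0, c] should be c^n / (n + 1)
print(quad(lambda x: x**n / c, 0, c)[0], c**n / (n + 1))

# second moment of Exp(lambda) should be 2 / lambda^2
print(quad(lambda x: x**2 * lam * np.exp(-lam * x), 0, np.inf)[0], 2 / lam**2)
```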
When a random variable has rare (abnormal) values, its expectation may be
a bad indicator of where the center of the distribution lies.
Definition.
The median of a random variable $X$ is any real value $m$ that satisfies
$$P(X \ge m) \ge \frac{1}{2} \quad \text{and} \quad
P(X \le m) \ge \frac{1}{2}.$$
With half the probability on both $\{X \le m\}$ and $\{X \ge m\}$, the
median is representative of the midpoint of the distribution. We say that
the median is more robust because it is less affected by outliers. It is
not necessarily unique.
Example.
Let $X$ be discretely uniformly distributed in the set
$\{-100, 1, 2, 3, \ldots, 9\}$, so $X$ has probability mass function
$$p_X(k) = \frac{1}{10} \quad \text{for } k \in \{-100, 1, 2, \ldots, 9\}.$$
Find the expected value and median of $X$.
$$E[X] = (-100) \cdot \frac{1}{10} + 1 \cdot \frac{1}{10} + \cdots
+ 9 \cdot \frac{1}{10} = -5.5,$$
while the median is any number $4 \le m \le 5$.
The median reflects the fact that 90% of the values and probability is in
the range $1, \ldots, 9$, while the mean is heavily influenced by the
$-100$ value.
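The mean/median contrast is easy to see numerically, using the values above:

```python
import numpy as np

values = np.array([-100, 1, 2, 3, 4, 5, 6, 7, 8, 9])
print(values.mean())       # -5.5, dragged down by the outlier
print(np.median(values))   # 4.5, inside the interval [4, 5] of medians
```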
+]]>