Statistics.
From probability to hypothesis testing — theory, flashcards, spaced repetition, and worked exercises, built to be read.
How to use this volume
- Begin with Theory. Every concept is introduced from first principles, with worked examples.
- Move to Flashcards for active recall. Click any card to flip, then mark HARD or GOT IT — your judgments persist.
- Attempt each Exercise on paper. Reveal the hint if stuck, then the full solution to check your reasoning.
- Navigate with ← →. Focus search with /. Mark a problem solved to track your progress.
Descriptive Statistics
From raw data to a clear summary: variables, tables, graphs, centre, spread, shape, and the relationship between two variables.
1 · What statistics is, and the basic vocabulary
"Statistics is the art of learning from data." The work splits in two: descriptive statistics — summarising the data you have with tables, graphs and numbers — and inferential statistics — drawing conclusions about a larger group from a sample. This chapter is the descriptive half, and it's the foundation for everything later.
Five words you'll use constantly:
- Population: the whole collection you care about (e.g. all LUISS students).
- Sample: the subset you actually observe (e.g. the students in one Statistics class).
- Units (or subjects): the members — people, countries, days, objects.
- Variable: the feature you record on each unit (age, height, grade).
- Modalities: the possible values a variable can take (for age: 18, 19, 20, …).
Think of the data as a table: one row per unit, one column per variable. Everything in this chapter is a way of squeezing such a table into something you can actually read.
2 · Types of variable — this decides what you can do
Before computing anything, classify the variable — it dictates which summaries make sense.
- Categorical (qualitative) — values are categories, not real magnitudes.
- Nominal: categories with no order — hair colour, favourite team, laptop OS.
- Ordinal: categories with a natural order — education level, hotel stars (1–5), film ratings (G, PG, …).
- Quantitative (numerical) — values are genuine numbers.
- Discrete: countable values — number of goals, exam grade.
- Continuous: any value in a range — height, time, the length of an episode.
Careful: a number used as a label is still categorical. Hotel stars run 1–5 but they're ordered categories, not a measured quantity. You wouldn't average shirt numbers on a football team.
Why it matters: for a categorical variable the useful summary is "what fraction falls in each category"; for a quantitative one it's "where is the centre and how spread out is it".
3 · Frequency tables
The first summary: count how often each value appears.
- Absolute frequency \(n_i\): how many units have value \(v_i\) (just count).
- Relative frequency \(f_i=n_i/n\): the proportion — these sum to 1.
- Cumulative relative frequency \(F_i=f_1+\dots+f_i\): the proportion of units up to value \(v_i\) (only for ordered data).
Course example — favourite Thanksgiving pie (2238 US adults, 2015): Pumpkin 729 (\(f=0.33\)), Apple 514 (0.23), Pecan 342 (0.15), … summing to 1. A wall of 2238 raw answers tells you nothing; the table tells you Pumpkin wins.
Cumulative example — 1000 customers' satisfaction (ordinal): Very unhappy 0.315, Quite unhappy 0.123 → \(F=0.438\) are unhappy; the rest are neutral-or-better. Cumulative frequencies answer "what proportion is at most this level?"
For a continuous variable every value is basically unique, so you group into classes (intervals) like \([0,10),[10,20),\dots\) and count per class (classes needn't be equal width).
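As a rough illustration, the three kinds of frequency can be built in a few lines of Python. This is only a sketch: the helper name `frequency_table` is made up for this example, and the data are the ten exam grades used later in this chapter.

```python
from collections import Counter
from fractions import Fraction

def frequency_table(values):
    """Return rows (value, n_i, f_i, F_i) for the sorted distinct values."""
    counts = Counter(values)
    n = len(values)
    rows = []
    cumulative = Fraction(0)
    for v in sorted(counts):
        f = Fraction(counts[v], n)   # relative frequency f_i = n_i / n
        cumulative += f              # cumulative relative frequency F_i
        rows.append((v, counts[v], f, cumulative))
    return rows

grades = [25, 16, 30, 24, 24, 21, 30, 27, 24, 30]
table = frequency_table(grades)
for v, n_i, f_i, F_i in table:
    print(f"{v:>3}  n={n_i}  f={float(f_i):.1f}  F={float(F_i):.1f}")
```

The last cumulative frequency should always come out as 1, which is a handy check that no unit was lost while counting.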
4 · Graphs: pie, bar, histogram
Pick the graph to match the variable type:
- Pie chart (categorical): each category is a slice; slice angle \(=360^\circ\cdot f_i\).
- Bar plot (categorical or discrete): each value is a bar whose height = its frequency.
- Histogram (continuous, grouped): each class is a bar whose area = its frequency. This is the key difference from a bar plot — with unequal class widths, area (not height) carries the count, so a wide class isn't made to look bigger than it is.
A histogram reveals the data's shape at a glance: where the values pile up, whether it's symmetric, gaps, and outliers. When comparing two groups of different sizes, switch to relative frequencies first so the comparison is fair.
5 · Where is the centre? Mode, median, mean
Three ways to say "typical value":
- Mode: the most frequent value (can be more than one). For grouped data, the modal class = tallest histogram bar. Works for any variable type.
- Median: the middle of the ordered data. Order the values; if \(n\) is odd it's the middle one, if even the average of the two middle ones. From a frequency table, it's the smallest value with \(F_i\ge0.5\) (if some \(F_i\) equals exactly 0.5, average that value with the next one). It ignores how extreme the values are and uses only their order.
- Mean \(\bar x=\dfrac1n\sum_i x_i\): the arithmetic average. From a frequency table it's the weighted form \(\bar x=\sum_j v_j f_j\); for grouped data, use class midpoints \(c_j\): \(\bar x\approx\sum_j c_j f_j\).
Mean vs median: the robustness lesson. The mean uses every value, so a few extreme ones drag it; the median doesn't budge. Pets per family in a building: values mostly 0–2 but a couple of 16–17 → mean \(\approx3.05\), median \(=1\). The median better reflects "a typical family". When the data is symmetric, mean and median coincide; a noticeable gap between them signals skew or outliers.
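To see the effect for yourself, here is a small sketch with Python's standard `statistics` module. The list `pets` is hypothetical data invented for this illustration (it is not the course dataset), chosen so that two extreme households sit far from the rest.

```python
import statistics

# Hypothetical data: pets per family, with two extreme households.
pets = [0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 16, 17]
typical = [x for x in pets if x <= 2]   # same building without the two extremes

mean_all, median_all = statistics.mean(pets), statistics.median(pets)
mean_typical, median_typical = statistics.mean(typical), statistics.median(typical)

print(f"with outliers:    mean={mean_all:.2f}, median={median_all}")
print(f"without outliers: mean={mean_typical:.2f}, median={median_typical}")
```

Dropping the two extreme families moves the mean a lot and leaves the median where it was.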
6 · Percentiles & quartiles
The 100p-th percentile is a value with at least a fraction \(p\) of the data at or below it and at least \(1-p\) at or above it. Rule to locate it in the ordered sample of size \(n\):
- if \(np\) is not an integer → take the value at position \(\lceil np\rceil\) (round up);
- if \(np\) is an integer → average the values at positions \(np\) and \(np+1\).
Quartiles are the 25th, 50th (=median) and 75th percentiles: \(Q_1,Q_2,Q_3\). They cut the ordered data into four parts of roughly equal size.
Worked (18 bowling scores, ordered). \(Q_1\): \(0.25\cdot18=4.5\to\) 5th value \(=145\). Median: \(0.5\cdot18=9\to\) average of 9th & 10th \(=159.5\). \(Q_3\): \(0.75\cdot18=13.5\to\) 14th value \(=177\). (Exercise 2 below walks this through.)
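The round-up / average rule can be sketched as a short function. The name `percentile` is an assumption of this example, and it takes \(p\) as an exact `Fraction` so that "is \(np\) an integer?" is decided without floating-point surprises. The data are the eighteen ordered scores from Exercise 2.

```python
import math
from fractions import Fraction

def percentile(sorted_values, p):
    """100p-th percentile of already-sorted data, using 1-based positions."""
    p = Fraction(p)
    if not 0 < p < 1:
        raise ValueError("p must lie strictly between 0 and 1")
    n = len(sorted_values)
    position = n * p
    if position.denominator == 1:
        # np is an integer: average the values at positions np and np + 1.
        k = int(position)
        return Fraction(sorted_values[k - 1] + sorted_values[k], 2)
    # np is not an integer: round the position up.
    return sorted_values[math.ceil(position) - 1]

scores = [122, 126, 133, 140, 145, 145, 149, 150, 157,
          162, 166, 175, 177, 177, 183, 188, 199, 212]
q1 = percentile(scores, Fraction(1, 4))
median = percentile(scores, Fraction(1, 2))
q3 = percentile(scores, Fraction(3, 4))
print(q1, float(median), q3)
```

Library routines often use other interpolation conventions, so their quartiles can differ slightly from this rule on small samples.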
7 · How spread out? Range, variance, standard deviation
Centre isn't enough — two datasets can share a mean yet look completely different. Measures of spread:
- Range \(=x_{(n)}-x_{(1)}\) (max − min). Interquartile range \(\mathrm{IQR}=Q_3-Q_1\) — the width holding the middle 50%, unaffected by outliers.
- Variance. Natural idea: average the deviations from the mean. But \(\sum_i(x_i-\bar x)=0\) always (positives cancel negatives), so we square the deviations first:
\[ s^2=\frac{1}{n-1}\sum_i (x_i-\bar x)^2=\frac{1}{n-1}\Big(\sum_i x_i^2-n\bar x^2\Big). \]The two forms are equal; the right one is faster by hand. The divisor is \(n-1\), not \(n\), "for technical reasons that become clear in the estimation chapter" (it makes \(s^2\) unbiased; you'll prove it later).
- Standard deviation \(s=\sqrt{s^2}\): same units as the data, so it's the interpretable one.
Worked (first 10 of the course's exam grades): 25,16,30,24,24,21,30,27,24,30. \(\sum x_i=251\Rightarrow\bar x=25.1\); \(\sum x_i^2=6479\); \(s^2=\dfrac{6479-10(25.1)^2}{9}=\dfrac{178.9}{9}\approx19.9\), so \(s\approx4.46\). (Exercise 1.)
Chebyshev's inequality ties spread to concentration for any shape: at most \(1/k^2\) of the data lies more than \(k\) standard deviations from the mean (so at most \(1/4\) lies outside \(\bar x\pm2s\)).
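A quick numerical check of the worked grades, as a sketch: it computes the sample variance both from the squared deviations and from the shortcut with \(\sum x_i^2\), then counts how many grades fall outside \(\bar x\pm2s\) to compare against Chebyshev's bound of \(1/4\). The variable names are chosen for this example only.

```python
import math

grades = [25, 16, 30, 24, 24, 21, 30, 27, 24, 30]
n = len(grades)
mean = sum(grades) / n

# Definition: squared deviations from the mean, divided by n - 1.
var_definition = sum((x - mean) ** 2 for x in grades) / (n - 1)
# Shortcut: sum of squares minus n * mean^2, divided by n - 1.
var_shortcut = (sum(x * x for x in grades) - n * mean ** 2) / (n - 1)
sd = math.sqrt(var_definition)

# Chebyshev with k = 2: share of grades more than 2 SDs from the mean.
outside = sum(1 for x in grades if abs(x - mean) > 2 * sd) / n

print(round(mean, 1), round(var_definition, 2), round(sd, 2), outside)
```

Only the grade 16 lies outside \(\bar x\pm2s\) here, so the observed share is well under the bound; Chebyshev is a worst-case guarantee, not a prediction.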
8 · Shape: symmetric, skewed, normal
The histogram's shape matters as much as its centre.
- Approximately normal: bell-shaped — tallest in the middle, symmetric, tapering on both sides. Lots of real data looks like this.
- Skewed: one long tail. Right-skewed = long tail to the right (income, pets-per-family); left-skewed = long tail to the left. A quick tell: if mean and median differ noticeably, the data is skewed (the mean is pulled toward the long tail).
A boxplot draws the five-number summary (min, \(Q_1\), median, \(Q_3\), max) — the box spans \(Q_1\) to \(Q_3\) with a line at the median — and is the easiest way to compare groups or spot skew (median off-centre in the box).
For approximately normal data, the Empirical Rule: about 68% of values lie within \(\bar x\pm s\), 95% within \(\bar x\pm2s\), 99.7% within \(\bar x\pm3s\). (This is the bridge to the Normal distribution chapter.)
9 · Two variables together: covariance & correlation
Often you want to know whether two variables move together. Plot each unit as a point \((x_i,y_i)\) — a scatterplot — and look.
Covariance measures the direction of a linear relationship:
\[ \mathrm{cov}_{x,y}=\frac{1}{n-1}\sum_i (x_i-\bar x)(y_i-\bar y). \]When big-\(x\) goes with big-\(y\), the products are positive → positive covariance; opposite movement → negative. But its size depends on the units, so you can't tell "strong" from "weak". Fix that by rescaling into the correlation coefficient:
\[ r_{x,y}=\frac{\mathrm{cov}_{x,y}}{s_x\,s_y}\in[-1,1]. \]\(r\) is unit-free. \(|r|\) near 1 = strong linear relationship (\(r=+1\) or \(-1\) means the points lie exactly on a line); \(r\) near 0 = weak linear relationship. Course example: LSD dose vs maths score gives \(r\approx-0.93\) — strong negative.
Two warnings. (1) \(r\) only sees linear structure — a perfect parabola can give \(r\approx0\), so "\(r\approx0\)" doesn't mean "unrelated". (2) Correlation is not causation: deaths from falling out of bed correlate \(0.96\) with the number of lawyers in Puerto Rico — a coincidence, or a lurking third variable, not cause and effect.
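Both formulas, and the first warning, can be tried out with a short sketch. The function names are assumptions of this example; the collinear points are the ones from Exercise 8, and the parabola points are hypothetical, picked to be symmetric about zero.

```python
import math

def sample_cov(xs, ys):
    """Sample covariance with divisor n - 1."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)

def sample_corr(xs, ys):
    """Correlation r = cov / (s_x * s_y); cov(x, x) is the variance of x."""
    return sample_cov(xs, ys) / math.sqrt(sample_cov(xs, xs) * sample_cov(ys, ys))

# Five points lying exactly on y = 2x + 1.
xs = [1, 3, 5, 4, 2]
line = [2 * x + 1 for x in xs]

# Hypothetical points on the parabola y = x^2, symmetric about 0.
px = [-2, -1, 0, 1, 2]
parabola = [x * x for x in px]

r_line = sample_corr(xs, line)
r_parabola = sample_corr(px, parabola)
print(r_line, r_parabola)
```

The parabola is a perfect (non-linear) relationship, yet its correlation is zero, because the positive and negative products cancel.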
10 · Where the data comes from: sampling design
Summaries are only as good as the sample. A simple random sample picks units so that every possible group of that size is equally likely — the best general safeguard for representativeness. (Sampling the first 100 people entering a library to estimate a city's average age fails: libraries over-represent students and retirees.)
A stratified random sample first splits the population into groups (strata) and draws a random sample from each, in proportion to its size. If customers are 70% teenagers and 30% adults, take 70% of the sample from teenagers and 30% from adults, which guarantees both groups appear. Stratification helps most when the strata genuinely differ on what you're measuring.
Finally, the distinction that drives all of inference: a parameter is a (usually unknown) number describing the whole population; a sample statistic (like \(\bar x\) or \(s\)) is what you compute from the sample to estimate it. Everything from chapter 8 onward is about going from statistic back to parameter.
11 · What the exam asks from this chapter
Descriptive questions on past papers cluster into three recognisable types:
| Wording | Type | Tool |
|---|---|---|
| group proportions/sizes + group means; overall mean given or asked; "back out a subgroup/size"; "add a new observation" | Weighted / grouped mean | \(\bar x=\sum w_g\bar x_g\); recover \(\sum x=n\bar x\), \(\sum(x-\bar x)^2=(n-1)s^2\) |
| several histograms, same mean, "match the standard deviations" | Dispersion reasoning | mass near centre = small SD; mass at extremes = large SD (don't compute) |
| few \((x,y)\) pairs, "correlation coefficient", "plot the points" | Correlation | points on a line → \(r=\pm1\); else the \(r\) formula |
Exercises 1–2 build the raw skills (mean, variance, quartiles); exercises 3–8 are the real exam problems of each type.
Compute the mean, variance and SD from scratch
Ten exam grades: \(25,16,30,24,24,21,30,27,24,30\). Compute the sample mean, sample variance, and standard deviation.
\(\bar x=\frac1n\sum x_i\). Then use the fast form \(s^2=\frac{1}{n-1}(\sum x_i^2-n\bar x^2)\), and \(s=\sqrt{s^2}\).
\(\sum x_i=251\), so \(\bar x=251/10=25.1\).
\(\sum x_i^2=625+256+900+576+576+441+900+729+576+900=6479\).
\(s^2=\dfrac{6479-10(25.1)^2}{10-1}=\dfrac{6479-6300.1}{9}=\dfrac{178.9}{9}\approx19.9\).
\(s=\sqrt{19.9}\approx4.46\) (same units as the grades).
Median and quartiles
Eighteen scores, already ordered: \(122,126,133,140,145,145,149,150,157,162,166,175,177,177,183,188,199,212\). Find \(Q_1\), the median, and \(Q_3\).
For each quartile compute \(np\) (\(p=0.25,0.5,0.75\), \(n=18\)); not integer → round the position up; integer → average two positions.
\(0.25\cdot18=4.5\) (not integer) → 5th value \(=145\).
\(0.5\cdot18=9\) (integer) → average of 9th and 10th \(=(157+162)/2=159.5\).
\(0.75\cdot18=13.5\) (not integer) → 14th value \(=177\).
Average salary across staff types (weighted mean)
Staff is 10% type A, 70% type B, 20% type C, with average monthly salaries 1000, 2000, 3000 respectively. What is the average salary across all staff?
Overall mean = weighted average of group means, weights = proportions.
\( \bar x=0.1(1000)+0.7(2000)+0.2(3000)=100+1400+600=2100 \).
Augment a 12-point sample
Twelve numbers have sample mean 10 and sample variance 12. A new observation \(x_{13}=10\) is added. Find the mean and variance of the 13-point sample.
Recover the totals \(\sum x_i=n\bar x\) and \(\sum(x_i-\bar x)^2=(n-1)s^2\), add the point, recompute. The new point equals the old mean.
\( \sum_{i=1}^{12}x_i=12\cdot10=120 \); \( \sum_{i=1}^{12}(x_i-10)^2=11\cdot12=132 \).
\( \bar x_{13}=\dfrac{120+10}{13}=10 \) (unchanged).
New point adds \((10-10)^2=0\) to the sum of squares, so \( s^2_{13}=\dfrac{132+0}{13-1}=\dfrac{132}{12}=11 \).
Female vs male average grade — a 2×2 system
A class of 100 has overall average grade 25.0. It is 60% female, 40% male, and the female–male difference in averages is 1.0. Find the female and male average grades.
Let \(\bar x_F=1+\bar x_M\) and \(0.6\bar x_F+0.4\bar x_M=25\); substitute.
\( \bar x_F=1+\bar x_M \); weighted mean \(0.6\bar x_F+0.4\bar x_M=25\).
\( 0.6(1+\bar x_M)+0.4\bar x_M=25\Rightarrow 0.6+\bar x_M=25\Rightarrow \bar x_M=24.4 \).
\( \bar x_F=1+24.4=25.4 \).
Back-solve class sizes from a pooled mean
Two classes take the same test. Class A averages 7.2, class B averages 6.7, and the 50 students together average 6.9. How many students are in each class?
Let \(n_A\) be class A's size, \(50-n_A\) class B's. Pooled mean = total grades / 50 = 6.9.
\( \dfrac{7.2\,n_A+6.7(50-n_A)}{50}=6.9 \).
\( 7.2n_A+335-6.7n_A=345\Rightarrow 0.5\,n_A=10\Rightarrow n_A=20 \), so class B has \(50-20=30\) students.
Match standard deviations to histograms
Three samples have the same mean \(\bar x=3.5\) over the range 0–7, with standard deviations (in some order) \(1.38,\,1.71,\,1.98\). Sample A is a flat/uniform histogram; Sample B is mound-shaped (mass near the centre); Sample C is U-shaped (mass at the extremes). Match each SD to its sample — without computing.
SD measures spread about the mean. Rank by how far the typical observation sits from 3.5.
Sample B (mound) keeps observations closest to \(\bar x\) → smallest SD \(=1.38\).
Sample C (U-shape) pushes mass to the extremes → largest \((x-\bar x)^2\) → largest SD \(=1.98\).
Sample A (uniform) is in between → \(1.71\).
Correlation of five collinear points
Five pairs: \((1,3),(3,7),(5,11),(4,9),(2,5)\). What is the sample correlation coefficient? (Hint: plot the points.)
Check whether they lie on a line \(y=a+bx\). If so, \(r=\operatorname{sign}(b)\) with no arithmetic.
Each pair satisfies \(y=2x+1\): \(1\!\to\!3,\,2\!\to\!5,\,3\!\to\!7,\,4\!\to\!9,\,5\!\to\!11\). All five are exactly collinear.
Positive slope \(b=2>0\) and a perfect line → \(r=+1\).
Probability Foundations
Starting from zero: what probability is, how to combine events, and how to reason with Bayes — built up with concrete examples before any exam problem.
1 · What probability actually is
Start with the picture, not the formula. An experiment is any situation whose result you can't predict for sure, but where you do know the list of possible results. Tossing a coin is an experiment; so is "which operating system will my next laptop run?" or "what will Apple's share price be on Monday?"
Each single result is an outcome. The full list of possible outcomes is the sample space, written \(S\).
- Coin toss: \(S=\{\text{Heads},\text{Tails}\}\).
- Roll a die: \(S=\{1,2,3,4,5,6\}\).
- Next laptop OS: \(S=\{\text{Windows},\text{MacOS},\text{Linux},\dots\}\).
An event is just a statement about the result — "the die shows an even number", "I get Heads". Formally it's a subset of \(S\): the even-number event is the set \(\{2,4,6\}\). We say the event occurs when the actual outcome is one of the outcomes inside it.
So what is a probability? The most useful picture is long-run frequency: if you repeated the experiment over and over, the probability of an event is the proportion of times it would happen. Flip a fair coin thousands of times and the fraction landing Tails settles near \(0.5\) — that limiting fraction is the probability. (Real example: across the world about 105 boys are born for every 100 girls, year after year — so \(P(\text{newborn is male})\approx0.51\).) A probability is always a number between 0 and 1: 0 = never, 1 = certain.
2 · Combining events: AND, OR, NOT
Events are sets, so we combine them like sets. Picture a rectangle for \(S\) and circles inside it for events (a Venn diagram).
- OR — union \(A\cup B\): the outcomes in \(A\), or in \(B\), or in both. It occurs if at least one of them happens.
- AND — intersection \(A\cap B\): the outcomes in both. It occurs only if they happen together.
- NOT — complement \(A^c\): everything in \(S\) that is not in \(A\). It occurs exactly when \(A\) doesn't.
Mutually exclusive (disjoint) events can't happen at the same time — their intersection is empty, \(A\cap B=\varnothing\). "The die shows 2" and "the die shows 5" are mutually exclusive. The empty set \(\varnothing\) is the impossible event.
One identity worth seeing now, because it drives a lot of exam problems: "in \(A\) but not in \(B\)" is \(A\cap B^c\), and it equals what's in \(A\) minus the overlap. We'll turn that into numbers next.
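Because events are sets, Python's built-in `set` type mirrors the three operations directly. A minimal sketch for one die roll; the event `B` ("at least 4") is a hypothetical second event added for illustration.

```python
S = set(range(1, 7))      # sample space of one die roll
A = {2, 4, 6}             # "the die shows an even number"
B = {4, 5, 6}             # hypothetical second event: "the die shows at least 4"

union = A | B             # OR:  at least one of A, B occurs
intersection = A & B      # AND: both occur
complement_A = S - A      # NOT: A does not occur
a_not_b = A & (S - B)     # in A but not in B

print(union, intersection, complement_A, a_not_b)
print(a_not_b == A - intersection)   # A minus the overlap gives the same set
print({2} & {5} == set())            # mutually exclusive: empty intersection
```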
3 · The rules of probability
Three basic rules (they just codify the frequency picture):
- \(0\le P(A)\le1\) — proportions live between 0 and 1.
- \(P(S)=1\) — something in the list always happens.
- If \(A,B\) are mutually exclusive, \(P(A\cup B)=P(A)+P(B)\) — non-overlapping chances just add.
From these you derive the two you'll actually use:
Complement rule. Since \(A\) and \(A^c\) split \(S\): \(P(A^c)=1-P(A)\). (Heads has probability 0.4 ⇒ Tails has 0.6.) This is the engine behind "at least one" problems — it's usually easier to compute the opposite and subtract.
Addition rule (when events can overlap): \[ P(A\cup B)=P(A)+P(B)-P(A\cap B). \] Why subtract? If you just add \(P(A)+P(B)\), the overlap \(A\cap B\) gets counted twice, so you remove it once.
Concrete example (Ross 4.3). A shop takes Amex or VISA. 22% of customers carry Amex, 58% carry VISA, 14% carry both. Probability a customer has at least one card: \(0.22+0.58-0.14=0.66\). And "VISA but not Amex" \(=0.58-0.14=0.44\) — exactly the \(A\cap B^c\) idea from section 2.
4 · Equally likely outcomes — just count
When every outcome in \(S\) is equally likely (a fair die, a well-shuffled deck, "a person chosen at random"), probability becomes pure counting:
\[ P(A)=\frac{\#\text{outcomes in }A}{\#\text{outcomes in }S}. \]The phrase "chosen at random" is your signal that outcomes are equally likely.
- Fair die: \(P(\text{even})=3/6=1/2\).
- European roulette, bet on odd: numbers \(\{0,1,\dots,36\}\), 18 are odd ⇒ \(18/37\).
- Retirement centre (Ross 4.4): 420 members, 144 smokers ⇒ \(P(\text{smoker})=144/420=12/35\).
This is why counting techniques (section 9) matter: to get a probability you often just need to count the favourable outcomes and divide by the total.
5 · Conditional probability — updating on information
Often you learn something partway through. Conditional probability \(P(B\mid A)\) is "the probability of \(B\) given that \(A\) has happened."
The intuition (no formula yet). Roll two dice; you're told the first die is a 4. That knowledge shrinks the world: only six outcomes are still possible, \((4,1),\dots,(4,6)\). Among those, only \((4,6)\) makes the sum 10, so the chance of a sum of 10 is now \(1/6\) (before the information it was \(3/36=1/12\)). You re-computed the probability inside a reduced sample space: \(A\) became your new \(S\).
Turning that into a formula — measure the overlap relative to the thing you now know:
\[ P(B\mid A)=\frac{P(A\cap B)}{P(A)}. \]The famous trap (Ross 4.10 / the two-children problem). A couple has two children; you learn at least one is a girl. Probability both are girls? Equally likely families are \(\{(g,g),(g,b),(b,g),(b,b)\}\). "At least one girl" rules out only \((b,b)\), leaving three equally likely cases; just one is \((g,g)\). So the answer is \(1/3\), not \(1/2\) — the extra information reshapes the sample space. (This exact reasoning powers exam exercise 5 below.)
Rearranging the formula gives the multiplication rule: \(P(A\cap B)=P(A)\,P(B\mid A)\) — useful for "draw two without replacement" problems, where the second draw's odds depend on the first.
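The "reduced sample space" idea translates almost word for word into code. In this sketch (the helper `conditional` is a name invented here, and it assumes equally likely outcomes), you first keep only the outcomes where \(A\) holds and then count \(B\) inside them.

```python
from fractions import Fraction
from itertools import product

def conditional(outcomes, event_b, event_a):
    """P(B | A) over equally likely outcomes: count inside the reduced space."""
    reduced = [o for o in outcomes if event_a(o)]
    favourable = [o for o in reduced if event_b(o)]
    return Fraction(len(favourable), len(reduced))

# Two dice: P(sum is 10 | first die is 4).
dice = list(product(range(1, 7), repeat=2))
p_sum10_given_first4 = conditional(
    dice, lambda o: sum(o) == 10, lambda o: o[0] == 4)

# Two children: P(both girls | at least one girl).
families = list(product("gb", repeat=2))
p_both_girls = conditional(
    families, lambda o: o == ("g", "g"), lambda o: "g" in o)

print(p_sum10_given_first4, p_both_girls)
```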
6 · Independence — when information doesn't help
Sometimes knowing \(A\) tells you nothing about \(B\). Then \(P(B\mid A)=P(B)\), and the multiplication rule simplifies to the test you'll use:
\[ A,B\text{ independent}\iff P(A\cap B)=P(A)\,P(B). \]Signals of independence: "with replacement", "i.i.d.", "each toss is fair" — separate trials that don't influence each other.
Why it's not automatic (Ross 4.13). Two fair dice. Let \(A=\)"first die is 3". Compare two events: \(B=\)"sum is 8" and \(C=\)"sum is 7". Knowing the first die is 3 changes the chance of an 8 (now you just need a 5 next) — so \(A,B\) are dependent. But the chance of a 7 stays \(1/6\) whatever the first die shows (there's always exactly one matching second die) — so \(A,C\) are independent. Same first event, opposite verdicts: independence is something you check, not assume.
For "at least one" across independent trials, lean on the complement: three children, \(P(\text{at least one girl})=1-P(\text{all boys})=1-(1/2)^3=7/8\).
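Checking independence is mechanical once you can count outcomes. The sketch below enumerates the 36 outcomes of two fair dice and tests the product rule exactly with fractions; the helper names `prob` and `independent` are assumptions of this example.

```python
from fractions import Fraction
from itertools import product

dice = list(product(range(1, 7), repeat=2))   # 36 equally likely outcomes

def prob(event):
    """Probability of an event (a predicate on outcomes) by counting."""
    return Fraction(sum(1 for o in dice if event(o)), len(dice))

def independent(event_a, event_b):
    """True when P(A and B) equals P(A) * P(B) exactly."""
    both = prob(lambda o: event_a(o) and event_b(o))
    return both == prob(event_a) * prob(event_b)

def first_is_3(o):
    return o[0] == 3

def sum_is_8(o):
    return sum(o) == 8

def sum_is_7(o):
    return sum(o) == 7

print(independent(first_is_3, sum_is_8))   # dependent pair
print(independent(first_is_3, sum_is_7))   # independent pair

# "At least one girl" among three children, via the complement.
p_at_least_one_girl = 1 - Fraction(1, 2) ** 3
print(p_at_least_one_girl)
```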
7 · Total probability — averaging over cases
Often the thing you want depends on a hidden "case" or "cause". Split the problem by the case, compute each piece, and combine. If \(B\) either happens or not, then for any \(A\):
\[ P(A)=P(A\mid B)\,P(B)+P(A\mid B^c)\,P(B^c). \]Read it as a weighted average: the chance of \(A\) in each case, weighted by how likely that case is. (It generalises to any cases \(B_1,\dots,B_k\) that are mutually exclusive and together cover every outcome: \(P(A)=\sum_i P(A\mid B_i)P(B_i)\).)
Concrete example (insurance). 30% of drivers are high-risk, 70% low-risk. A high-risk driver has an accident this year with probability 0.4, a low-risk one with probability 0.2. Overall chance a random driver has an accident: \[ P(\text{accident})=0.4(0.30)+0.2(0.70)=0.12+0.14=0.26. \] You couldn't answer without splitting by risk type — that's the whole move.
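The weighted-average reading fits in one small function. A sketch, with `total_probability` as a made-up helper name; it refuses case probabilities that don't sum to 1, since the formula only holds when the cases cover everything.

```python
from fractions import Fraction

def total_probability(cases):
    """cases: list of (P(case), P(A | case)) pairs covering all possibilities."""
    if sum(p for p, _ in cases) != 1:
        raise ValueError("case probabilities must sum to 1")
    return sum(p * likelihood for p, likelihood in cases)

# Insurance example: (share of drivers, accident probability) per risk type.
p_accident = total_probability([
    (Fraction(30, 100), Fraction(4, 10)),   # high-risk
    (Fraction(70, 100), Fraction(2, 10)),   # low-risk
])
print(p_accident, float(p_accident))
```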
8 · Bayes' theorem — reasoning backwards
Total probability runs cause → effect. Bayes runs it backwards: you observed the effect, and you want the probability of the cause. "It rained — how likely is it that the morning had been sunny?" "The test is positive — how likely is the disease?"
Start from the definition \(P(\text{cause}\mid\text{effect})=\dfrac{P(\text{cause}\cap\text{effect})}{P(\text{effect})}\), write the top as \(P(\text{effect}\mid\text{cause})P(\text{cause})\), and expand the bottom with total probability:
\[ P(H\mid E)=\frac{P(E\mid H)\,P(H)}{P(E\mid H)\,P(H)+P(E\mid H^c)\,P(H^c)}. \]The procedure: (1) name the causes \(H\) and the observed effect \(E\); (2) write the priors \(P(H)\) and the likelihoods \(P(E\mid H)\) straight from the text; (3) total probability gives the denominator; (4) divide.
The showcase example — why a positive test can still be reassuring (Ross 4.17). A blood test is 99% accurate when the disease is present (\(P(E\mid H)=0.99\)) and gives a false positive 2% of the time (\(P(E\mid H^c)=0.02\)). Only 0.5% of people have the disease (\(P(H)=0.005\)). You test positive — what's the chance you're actually sick? \[ P(H\mid E)=\frac{0.99(0.005)}{0.99(0.005)+0.02(0.995)}\approx0.199. \] About 20% — surprisingly low, because the huge healthy population produces many false positives that swamp the few true cases. This is exactly the engine behind exam exercises 3 and 4.
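The four-step procedure for a yes/no cause is short enough to sketch as a function. The name `posterior` and its parameter names are assumptions of this example; the numbers are those of the blood-test example.

```python
def posterior(prior, p_e_given_h, p_e_given_not_h):
    """P(H | E) for a two-case Bayes problem (H versus not-H)."""
    numerator = p_e_given_h * prior                      # P(E | H) P(H)
    evidence = numerator + p_e_given_not_h * (1 - prior)  # total probability of E
    return numerator / evidence

p_sick = posterior(prior=0.005, p_e_given_h=0.99, p_e_given_not_h=0.02)
print(round(p_sick, 3))
```

Try raising `prior` and re-running: the posterior climbs quickly, which shows that the low answer comes from the rarity of the disease rather than from a poor test.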
9 · Counting — tools for equally-likely problems
Section 4 said probability is often just counting favourable ÷ total. Here are the tools, introduced only because we need them.
Basic principle. If step 1 has \(n\) options and step 2 has \(m\), together there are \(n\cdot m\). (One man from 8, one woman from 12 → \(96\) pairs.)
Permutations / factorial. The number of orderings of \(n\) distinct objects is \(n!=n(n-1)\cdots2\cdot1\) (with \(0!=1\)). Order matters here.
Combinations. When order does not matter — choosing a group of \(k\) from \(n\):
\[ \binom{n}{k}=\frac{n!}{k!\,(n-k)!}. \]Two reflexes for the exam: "all different" → count the complement (the few repeats) and subtract; "these specific items are included" → fix them, then count the choices for the remaining slots. Watch for repeated elements — e.g. the two identical P's in APPLE: \(\binom{5}{2}=10\) selections, only one is the pair PP.
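Python's standard library covers all three tools, which makes it easy to double-check a count. In this sketch the two P's of APPLE are labelled `P1` and `P2` (labels invented here) so that the five tiles are distinct objects, as the formula \(\binom{5}{2}\) assumes.

```python
import math
from itertools import combinations

pairs_man_woman = 8 * 12            # basic principle: n * m
orderings = math.factorial(5)       # orderings of 5 distinct objects
groups = math.comb(5, 2)            # unordered choices of 2 from 5

# Label the two P's so the five tiles are distinct.
letters = ["A", "P1", "P2", "L", "E"]
pairs = list(combinations(letters, 2))
# A pair "repeats" when both tiles show the same letter (same first character).
repeated = [pair for pair in pairs if pair[0][0] == pair[1][0]]

print(pairs_man_woman, orderings, groups, len(pairs), repeated)
```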
10 · Which tool? — recognition guide
Once the concepts are in place, exam problems are about spotting which one applies.
| If the wording says… | It's a… | Do |
|---|---|---|
| "chosen at random" among options, then observe a result; "given [result], prob it was [cause]" | Bayes | priors × likelihoods, divide |
| "a cause is random, then count X", asked an unconditional \(P(X{=}k)\) | Total probability (mixture) | \(\sum P(X{=}k\mid \text{case})P(\text{case})\) |
| two overlapping groups + counts; "at least one", "one but not the other" | Addition / inclusion–exclusion | \(P(A\cup B)=P(A)+P(B)-P(A\cap B)\) |
| "choose \(k\) without replacement", "all different", "both included" | Counting | \(\binom{n}{k}\) + complement / fix-items |
| "knowing exactly \(m\) of \(n\) are…", asked about a specific draw | Conditional in a fair setup | reduced sample space; count favourable ÷ remaining |
Coin tossed a random number of times — P(X=0)
Let \(N\) be a number chosen from \(\{1,2,3\}\) with equal probability. Throw a fair coin \(N\) times, counting the number \(X\) of Heads obtained. Calculate \(P(X=0)\).
The hidden 'case' is how many times you tossed (\(N\)). No reversal is asked, so it's the law of total probability (section 7). Given \(N=n\), getting zero Heads means \(n\) Tails in a row: \((\tfrac12)^n\).
Cases: \(P(N=1)=P(N=2)=P(N=3)=\tfrac13\). The question asks a plain \(P(X=0)\) — no 'given' — so average over the cases (total probability), not Bayes.
\(n\) independent fair tosses, all Tails: \(P(X=0\mid N=n)=(\tfrac12)^n\). For \(n=1,2,3\) that's \(\tfrac12,\tfrac14,\tfrac18\).
\[ P(X=0)=\sum_{n=1}^{3}P(X=0\mid N=n)P(N=n)=\tfrac13\left(\tfrac12+\tfrac14+\tfrac18\right)=\tfrac13\cdot\tfrac78=\tfrac{7}{24}. \]
Same coin experiment — reverse it with Bayes
Same setup (\(N\) chosen from \(\{1,2,3\}\), fair coin tossed \(N\) times, \(X\) = Heads). Given that \(X=0\), calculate \(P(N=1\mid X=0)\).
Now you observe the effect \(X=0\) and want the hidden cause \(N=1\) → Bayes (section 8). The denominator is the \(P(X=0)=\tfrac{7}{24}\) you just built.
From exercise 1, \(P(X=0)=\tfrac{7}{24}\).
\(P(X=0\mid N=1)\,P(N=1)=\tfrac12\cdot\tfrac13=\tfrac16\).
\[ P(N=1\mid X=0)=\frac{1/6}{7/24}=\frac16\cdot\frac{24}{7}=\frac47. \]
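If you want to check both coin exercises at once, exact fractions keep the arithmetic transparent. This is a sketch; the dictionary names `prior` and `likelihood` are chosen for this example.

```python
from fractions import Fraction

# P(N = n) for n = 1, 2, 3, and P(X = 0 | N = n) = (1/2)^n.
prior = {n: Fraction(1, 3) for n in (1, 2, 3)}
likelihood = {n: Fraction(1, 2) ** n for n in prior}

# Total probability: average the likelihoods over the cases.
p_x0 = sum(likelihood[n] * prior[n] for n in prior)

# Bayes: one term of that sum divided by the whole sum.
p_n1_given_x0 = likelihood[1] * prior[1] / p_x0

print(p_x0, p_n1_given_x0)
```

The Bayes answer is just "one piece of the total-probability sum over the whole sum", which is why the two exercises come as a pair.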
Which coin was it? — Bayes with two coins
An urn has two coins: coin A with \(P(\text{Heads})=\tfrac14\), coin B with \(P(\text{Heads})=\tfrac12\). One coin is picked at random and thrown, giving "Heads". Given this result, what is the probability it was coin A?
Cause = which coin (priors \(\tfrac12,\tfrac12\)); effect = Heads (likelihoods \(\tfrac14,\tfrac12\)). You observed the effect, want the cause → Bayes. Same shape as the medical-test example.
\(P(A)=P(B)=\tfrac12\); \(P(H\mid A)=\tfrac14\), \(P(H\mid B)=\tfrac12\).
\(P(H)=\tfrac14\cdot\tfrac12+\tfrac12\cdot\tfrac12=\tfrac18+\tfrac14=\tfrac38\).
\[ P(A\mid H)=\frac{P(H\mid A)P(A)}{P(H)}=\frac{1/8}{3/8}=\frac13. \]
Was the morning sunny? — Bayes with rain
If the morning is sunny, the chance of rain that day is \(\tfrac16\). On non-sunny mornings, the chance of rain is \(\tfrac12\). 60% of mornings start sunny. Given that it rained, what is the probability the morning was sunny?
Cause = sunny vs not (priors \(0.6,0.4\)); effect = rain (likelihoods \(\tfrac16,\tfrac12\)). Observed the effect (rain), want the cause (sunny) → Bayes.
\(P(S)=0.6\), \(P(S^c)=0.4\); \(P(R\mid S)=\tfrac16\), \(P(R\mid S^c)=\tfrac12\).
\(P(R)=\tfrac16\cdot0.6+\tfrac12\cdot0.4=0.1+0.2=0.3\).
\[ P(S\mid R)=\frac{P(R\mid S)P(S)}{P(R)}=\frac{0.1}{0.3}=\frac13. \]
Exactly two women — conditioning in a fair setup
Three students are selected at random with replacement. Knowing that exactly two of the selected are women, what is the probability that the first selected is a woman?
This is the two-children trap (section 5) with three slots. 'With replacement' → independent trials, each woman with some probability \(p\). Let \(E_i\)='the \(i\)-th is a woman', \(W\)=number of women. Want \(P(E_1\mid W=2)\). Watch \(p\) cancel.
Number of women among 3 is Binomial: \(P(W=2)=\binom{3}{2}p^2(1-p)=3p^2(1-p)\).
With the first selected a woman, the arrangements with exactly two women are \(E_1E_2E_3^c\) and \(E_1E_2^cE_3\) (disjoint). By independence each has probability \(p\cdot p\cdot(1-p)\), so \(P(E_1\cap\{W=2\})=2p^2(1-p)\).
\[ P(E_1\mid W=2)=\frac{2p^2(1-p)}{3p^2(1-p)}=\frac23. \] The \(p^2(1-p)\) cancels, just like the count-the-cases reasoning in the two-children problem.
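To convince yourself that \(p\) really drops out, you can enumerate the eight woman/man patterns and weight each by its probability. A sketch: the function name and the three trial values of \(p\) are hypothetical choices for this illustration.

```python
from fractions import Fraction
from itertools import product

def p_first_given_two_women(p):
    """P(first is a woman | exactly two of three are women), independent draws."""
    joint = Fraction(0)   # P(first is a woman AND exactly two women)
    total = Fraction(0)   # P(exactly two women)
    for draw in product((True, False), repeat=3):   # True = woman
        if sum(draw) != 2:
            continue
        weight = Fraction(1)
        for is_woman in draw:
            weight *= p if is_woman else 1 - p
        total += weight
        if draw[0]:
            joint += weight
    return joint / total

# Hypothetical values of p; the answer should not depend on them.
results = {p: p_first_given_two_women(p)
           for p in (Fraction(1, 2), Fraction(1, 5), Fraction(9, 10))}
print(results)
```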
Biology not Chemistry — addition rule
A school has 300 students; every student takes at least one of Biology or Chemistry, possibly both. Biology has 250 students, Chemistry 150. Picking a student at random, what is the probability they take Biology and not Chemistry?
Two overlapping groups + 'and not' → addition rule (section 3). Everyone is in the union, so \(P(B\cup C)=1\); solve for the overlap, then use \(P(B\cap C^c)=P(B)-P(B\cap C)\).
\(P(B)=\tfrac{250}{300}\), \(P(C)=\tfrac{150}{300}\), \(P(B\cup C)=1\) (all take at least one).
\(1=\tfrac{250}{300}+\tfrac{150}{300}-P(B\cap C)\Rightarrow P(B\cap C)=\tfrac{100}{300}\).
\[ P(B\cap C^c)=P(B)-P(B\cap C)=\tfrac{250}{300}-\tfrac{100}{300}=\tfrac{150}{300}=\tfrac12. \]
Two letters from APPLE, all different — counting
Consider the word APPLE. Choose two letters at random without replacement. What is the probability that they are different?
Equally-likely selections → count (section 9). \(\binom{5}{2}=10\) pairs. 'All different' → count the complement (the pairs that repeat) and subtract.
Letters A, P, P, L, E → \(\binom{5}{2}=10\) unordered pairs.
The only pair of equal letters is PP — exactly 1 outcome.
Different pairs \(=10-1=9\), so \(P(\text{different})=\tfrac{9}{10}\).
Three letters from APPLE, both P's chosen — counting
From the word APPLE, choose three letters at random without replacement. What is the probability that both P's are among the chosen letters?
\(\binom{5}{3}=10\) selections. 'Both P's included' → fix the two P's, count choices for the one remaining slot.
\(\binom{5}{3}=10\) unordered selections of three letters.
If both P's are taken, the third letter is one of A, L, E → 3 favourable selections.
\( P(\text{both P's})=\tfrac{3}{10} \).
Discrete Random Variables
From outcomes to numbers: probability mass functions, expected value, variance, and how means and variances behave under transformations and sums — built from scratch with worked exam problems.
From outcomes to numbers: what a random variable is
Until now an experiment gave you outcomes — "Heads", "the die shows 3", "the customer is left-handed". A random variable is one simple move on top of that: attach a number to each outcome.
Formally, a random variable \(X\) is a rule that assigns a real number to every outcome of an experiment. We write it with a capital letter (\(X,Y,N\)); the actual value it takes after the experiment is a lowercase letter (\(x\)).
- Toss a coin three times, let \(X=\) number of Heads. Then \(X\) can be \(0,1,2,3\).
- Pick a student, let \(X=1\) if left-handed, \(0\) if not. Then \(X\in\{0,1\}\).
- Play a game, let \(X=\) euros you walk away with. \(X\) might be \(-5,\,0,\,+10,\dots\)
Why bother? Because once outcomes are numbers you can average them, measure how spread they are, and feed them into every formula in the rest of the course.
Discrete vs continuous. A random variable is discrete when its possible values are separate points you can list — finite (\(0,1,2,3\)) or an endless but listable sequence (\(1,2,3,\dots\)). It is continuous when it can be any value in an interval (a height like \(172.4\,\)cm). This whole chapter is about the discrete case; continuous ones come later.
The probability mass function (pmf): the full ID card of a discrete RV
To know a discrete random variable completely you need two columns: the list of values it can take, and the probability of each. That table is the probability mass function (pmf), written
\[ p(x)=P\{X=x\}. \]
Read it as "the probability that \(X\) lands exactly on \(x\)". Ross also writes it \(p_X(x)\) when several variables are around.
A list of numbers is a valid pmf exactly when:
- (1) every probability is \(\ge 0\): \(p(x)\ge 0\);
- (2) they add up to one: \(\sum_x p(x)=1\).
Concrete example. In a population \(10\%\) of people are left-handed. Pick two people independently and let \(X=\) how many of the two are left-handed. Working out the cases gives the pmf:
| \(x\) | 0 | 1 | 2 |
|---|---|---|---|
| \(p(x)\) | 0.81 | 0.18 | 0.01 |
Check it is valid: all entries \(\ge0\), and \(0.81+0.18+0.01=1\). Good. This one table is the variable — every question below ("what's the average? the spread?") is answered just by reading off these numbers.
Tip. If a problem gives you frequencies (counts) instead of probabilities, divide each count by the total to turn the frequency table into a pmf. That single step unlocks every formula in this chapter.
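The "working out the cases" step for the left-handed example can be spelled out in a few lines. This is a minimal sketch, assuming the two picks are independent as stated; the function name `is_valid_pmf` is a hypothetical helper, not something from the course material.

```python
from math import isclose

p_left = 0.10  # probability that one person is left-handed (from the example)

# X = number of left-handers among two independent picks.
pmf = {
    0: (1 - p_left) ** 2,          # neither is left-handed
    1: 2 * p_left * (1 - p_left),  # exactly one (two possible orders)
    2: p_left ** 2,                # both are left-handed
}

def is_valid_pmf(table):
    """Check the two pmf conditions: non-negative entries that sum to 1."""
    return all(v >= 0 for v in table.values()) and isclose(sum(table.values()), 1.0)

print({x: round(v, 2) for x, v in pmf.items()})  # {0: 0.81, 1: 0.18, 2: 0.01}
print(is_valid_pmf(pmf))  # True
```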
Expected value E[X]: the long-run average
The first number you summarise a random variable with is its expected value (also called the mean), written \(E[X]\) or \(\mu\). It is the weighted average of the possible values, each weighted by its probability:
\[ E[X]=\sum_i x_i\,p(x_i). \]
What it really means. \(E[X]\) is not "the value you should expect to see" on a single try — a fair die never shows \(3.5\). It is the average you would get over a long run of repeated experiments. Physically it is the balance point of the pmf: put weight \(p(x_i)\) at position \(x_i\) on a ruler, and \(E[X]\) is where the ruler balances.
Example — fair die. Values \(1,\dots,6\) each with probability \(\tfrac16\):
\[ E[X]=1\cdot\tfrac16+2\cdot\tfrac16+\dots+6\cdot\tfrac16=\frac{21}{6}=3.5. \]
Example — indicator. If \(X=1\) when an event \(A\) happens and \(0\) otherwise, then \(E[X]=1\cdot P(A)+0\cdot P(A^c)=P(A)\). The mean of a 0/1 variable is just the probability of the "1".
For the left-handed pmf above: \(E[X]=0(0.81)+1(0.18)+2(0.01)=0.20\). On average \(0.2\) of the two people are left-handed — a perfectly sensible number even though \(X\) itself is only ever \(0,1,\) or \(2\).
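To see the "long-run average" reading in action, one option is a quick simulation. The sketch below is illustrative only: the seed and the number of rolls are arbitrary choices, and the simulated average will be close to, but not exactly, \(3.5\).

```python
import random

random.seed(0)  # fixed seed so the run is repeatable (arbitrary choice)

# Simulate many rolls of a fair die and average them.
rolls = [random.randint(1, 6) for _ in range(100_000)]
long_run_average = sum(rolls) / len(rolls)

# Exact expected value from the pmf: each face has probability 1/6.
exact_mean = sum(face * (1 / 6) for face in range(1, 7))

print(exact_mean, long_run_average)
```

No single roll ever shows the mean, yet the average of many rolls settles near it.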
E[g(X)]: averaging a function of X (and why E[X²] ≠ (E[X])²)
Often you don't care about \(X\) itself but about some function of it — a payoff, a squared error, a transformed score. If \(Y=g(X)\), you do not need to build a new pmf for \(Y\). Just weight \(g\) by the original probabilities:
\[ E[g(X)]=\sum_i g(x_i)\,p(x_i). \]
(Sometimes called the "law of the unconscious statistician".)
Game example. Draw \(5\) balls; you win €1 per white ball and lose €1 per non-white. If \(X\) is the number of white balls, your balance is \(Y=g(X)=2X-5\). To get the expected balance you just plug \(2x_i-5\) into the sum — no separate table needed.
The crucial special case: \(E[X^2]\). Take \(g(x)=x^2\):
\[ E[X^2]=\sum_i x_i^2\,p(x_i). \]
For the left-handed pmf: \(E[X^2]=0^2(0.81)+1^2(0.18)+2^2(0.01)=0.18+0.04=0.22\).
Notice \(E[X^2]=0.22\) but \((E[X])^2=(0.20)^2=0.04\). They are not equal. For a non-linear \(g\), \(E[g(X)]\neq g(E[X])\) in general. You can only "push the average inside" for linear functions and sums — exactly the next two sections. This gap between \(E[X^2]\) and \((E[X])^2\) is precisely what variance measures.
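The rule \(E[g(X)]=\sum_i g(x_i)\,p(x_i)\) translates almost word for word into code. Here is a small sketch using the left-handed pmf with exact fractions; the helper name `expect` is a hypothetical choice.

```python
from fractions import Fraction

pmf = {0: Fraction(81, 100), 1: Fraction(18, 100), 2: Fraction(1, 100)}

def expect(g, pmf):
    """E[g(X)]: weight g(x) by the original probabilities p(x)."""
    return sum(g(x) * p for x, p in pmf.items())

mean = expect(lambda x: x, pmf)
second_moment = expect(lambda x: x ** 2, pmf)

print(second_moment)              # 11/50, i.e. 0.22
print(mean ** 2)                  # 1/25, i.e. 0.04
print(second_moment - mean ** 2)  # 9/50, i.e. 0.18
```

The same `expect` call works for any payoff function, for example `expect(lambda x: 2 * x - 5, pmf)`, without building a new table.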
Variance and standard deviation: measuring spread
The mean tells you the center; the variance tells you how far the values typically sit from that center. Definition:
\[ \operatorname{Var}(X)=E\big[(X-\mu)^2\big],\qquad \mu=E[X]. \]
You square the distance from the mean (so positives and negatives don't cancel) and average it. Bigger variance = more spread.
The formula you actually use. Expanding the square gives a much faster equivalent:
\[ \operatorname{Var}(X)=E[X^2]-(E[X])^2. \]
So the recipe is always: compute \(E[X]\), compute \(E[X^2]\), subtract the square of the first from the second.
Standard deviation. Variance is in squared units (euros², years²…), which is awkward. Take the square root to get back to the original units:
\[ \sigma=\operatorname{SD}(X)=\sqrt{\operatorname{Var}(X)}. \]
Example — fair die. We found \(E[X]=3.5\); the second moment is \(E[X^2]=\tfrac{1+4+9+16+25+36}{6}=\tfrac{91}{6}\). So
\[ \operatorname{Var}(X)=\frac{91}{6}-\left(\frac72\right)^2=\frac{91}{6}-\frac{49}{4}=\frac{35}{12}\approx2.917. \]
Example — 0/1 (Bernoulli) variable. If \(P(X=1)=p\), then \(E[X]=p\) and (since \(X^2=X\)) \(E[X^2]=p\), so
\[ \operatorname{Var}(X)=p-p^2=p(1-p). \]
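The recipe "compute \(E[X]\), compute \(E[X^2]\), subtract" can be sketched as one short function. The Bernoulli parameter below is an arbitrary value picked for illustration.

```python
from fractions import Fraction

def variance(pmf):
    """Var(X) = E[X^2] - (E[X])^2 for a pmf given as {value: probability}."""
    mean = sum(x * p for x, p in pmf.items())
    second_moment = sum(x ** 2 * p for x, p in pmf.items())
    return second_moment - mean ** 2

die = {face: Fraction(1, 6) for face in range(1, 7)}

p = Fraction(3, 10)  # arbitrary success probability, for illustration only
bernoulli = {1: p, 0: 1 - p}

print(variance(die))                        # 35/12
print(variance(bernoulli))                  # 21/100, which equals p * (1 - p)
print(round(float(variance(die)) ** 0.5, 3))  # 1.708 (standard deviation)
```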
Linear transformations: E[aX+b] and Var(aX+b)
Rescaling and shifting a variable — change of units, a fee added to a payoff — is a linear transformation \(Y=aX+b\). Two clean rules:
\[ E[aX+b]=a\,E[X]+b, \qquad \operatorname{Var}(aX+b)=a^2\,\operatorname{Var}(X). \]
Read the rules. For the mean, the constant \(b\) shifts it and the factor \(a\) scales it — average moves exactly the way the data moves. For the variance, adding \(b\) does nothing (shifting everything by the same amount doesn't change spread), and the scale factor comes out squared because variance is built from squared distances.
Worked example. Let \(X\) have \(E[X]=3\) and \(\operatorname{Var}(X)=2\). For \(Y=4+3X\) (so \(a=3,\,b=4\)):
\[ E[Y]=4+3(3)=13,\qquad \operatorname{Var}(Y)=3^2\cdot 2=18. \]
The standard deviation scales by \(|a|\) (not \(a^2\)): \(\operatorname{SD}(Y)=3\,\operatorname{SD}(X)\).
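The two rules can be checked on a concrete table. The pmf below is hypothetical, chosen only because it has \(E[X]=3\) and \(\operatorname{Var}(X)=2\) like the worked example; the code transforms every value by \(Y=4+3X\) and recomputes the moments from scratch.

```python
from fractions import Fraction

def mean_var(pmf):
    """Mean and variance computed directly from the definitions."""
    mean = sum(x * p for x, p in pmf.items())
    var = sum((x - mean) ** 2 * p for x, p in pmf.items())
    return mean, var

# Hypothetical pmf chosen so that E[X] = 3 and Var(X) = 2.
x_pmf = {1: Fraction(1, 4), 3: Fraction(1, 2), 5: Fraction(1, 4)}

a, b = 3, 4
# Y = aX + b takes the transformed values with the same probabilities.
y_pmf = {a * x + b: p for x, p in x_pmf.items()}

print(*mean_var(x_pmf))  # 3 2
print(*mean_var(y_pmf))  # 13 18
```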
Sums of random variables: when do means and variances add?
Real problems often add several random variables (total earnings of a couple, total Heads in many tosses). Two rules, and one important asymmetry between them.
Means always add. No conditions, ever:
\[ E[X+Y]=E[X]+E[Y], \qquad E\!\left[\sum_{i=1}^{k}X_i\right]=\sum_{i=1}^{k}E[X_i]. \]
Variances add only when there is no interaction. In general
\[ \operatorname{Var}(X+Y)=\operatorname{Var}(X)+\operatorname{Var}(Y)+2\operatorname{Cov}(X,Y), \]
where the extra term \(\operatorname{Cov}(X,Y)=E[(X-\mu_X)(Y-\mu_Y)]\) is the covariance — it captures whether the two move together. When \(X\) and \(Y\) are independent (knowing one tells you nothing about the other), the covariance is \(0\) and the variances simply add:
\[ \operatorname{Var}(X+Y)=\operatorname{Var}(X)+\operatorname{Var}(Y),\qquad \operatorname{Var}\!\left(\sum_{i=1}^{k}X_i\right)=\sum_{i=1}^{k}\operatorname{Var}(X_i)\quad(\text{independent}). \]
Also useful: for independent \(X,Y\), \(E[XY]=E[X]\,E[Y]\). (Covariance and joint behaviour get a full chapter later; here you only need: means add unconditionally, variances add under independence.)
Decision box — when can I push the average inside? \(E[g(X)]=g(E[X])\) is safe only for: linear functions \(aX+b\), sums, and products of independent variables. For anything non-linear (like \(X^2\)), it fails.
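The asymmetry between means and variances shows up clearly if you enumerate a sum by hand. The sketch below compares two independent fair dice with a fully dependent pair where the second value just copies the first; the setup and the helper name `moments` are illustrative assumptions.

```python
from fractions import Fraction
from itertools import product

faces = range(1, 7)
p = Fraction(1, 6)

def moments(pairs):
    """Mean and variance from an iterable of (value, probability) pairs."""
    pairs = list(pairs)
    mean = sum(v * q for v, q in pairs)
    var = sum(v * v * q for v, q in pairs) - mean ** 2
    return mean, var

mean_x, var_x = moments((x, p) for x in faces)

# Independent dice: every ordered pair (x, y) has probability 1/36.
mean_ind, var_ind = moments((x + y, p * p) for x, y in product(faces, faces))

# Fully dependent case: the second value copies the first, so the sum is 2X.
mean_dep, var_dep = moments((x + x, p) for x in faces)

print(mean_ind, var_ind)  # 7 35/6
print(mean_dep, var_dep)  # 7 35/3
```

The mean is \(7\) in both cases, but only the independent pair has variance \(2\operatorname{Var}(X)=\tfrac{35}{6}\); the dependent pair gives \(4\operatorname{Var}(X)=\tfrac{35}{3}\).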
Recap — the discrete-RV toolkit
Everything in this chapter is computed by reading a pmf table and summing.
- pmf valid: \(p(x)\ge0\) and \(\sum p(x)=1\). Frequencies → divide by total.
- Mean: \(E[X]=\sum x_i\,p(x_i)\) — long-run average / balance point.
- Function: \(E[g(X)]=\sum g(x_i)\,p(x_i)\); in particular \(E[X^2]=\sum x_i^2 p(x_i)\).
- Variance: \(\operatorname{Var}(X)=E[X^2]-(E[X])^2\); \(\operatorname{SD}=\sqrt{\operatorname{Var}}\).
- Linear: \(E[aX+b]=aE[X]+b\), \(\operatorname{Var}(aX+b)=a^2\operatorname{Var}(X)\).
- Sums: means always add; variances add when independent.
Exam reflex: a table of values with probabilities + a question about "average / expected / fair price" → \(E[X]\). About "spread / variability / standard deviation / risk" → \(\operatorname{Var}(X)=E[X^2]-(E[X])^2\). About "expected payoff of a deal / which option" → compute \(E[\cdot]\) of each option and compare.
Build and validate a pmf, then find E[X]
In a population \(10\%\) of people are left-handed. Two people are picked independently. Let \(X\) be the number of left-handers among them, with pmf \(p(0)=0.81,\ p(1)=0.18,\ p(2)=0.01\). (a) Check this is a valid pmf. (b) Find \(E[X]\).
Valid pmf = entries \(\ge0\) and sum \(=1\). Then \(E[X]=\sum x_i p(x_i)\).
All three probabilities are \(\ge0\), and \(0.81+0.18+0.01=1\). Valid pmf.
\(E[X]=0(0.81)+1(0.18)+2(0.01)=0.18+0.02=0.20.\)
E[X²] is not (E[X])²
For the same variable (\(p(0)=0.81,p(1)=0.18,p(2)=0.01\), \(E[X]=0.20\)), compute \(E[X^2]\) and compare it with \((E[X])^2\).
Use \(E[X^2]=\sum x_i^2 p(x_i)\). Then square the mean separately.
\(E[X^2]=0^2(0.81)+1^2(0.18)+2^2(0.01)=0+0.18+0.04=0.22.\)
\((E[X])^2=(0.20)^2=0.04\neq 0.22=E[X^2]\). The two differ — the squaring is non-linear.
Variance of a fair die
Let \(X\) be the result of one fair die roll. Find \(\operatorname{Var}(X)\) and \(\operatorname{SD}(X)\).
Get \(E[X]\) and \(E[X^2]\) from the six equally likely values, then \(\operatorname{Var}=E[X^2]-(E[X])^2\).
\(E[X]=\frac{1+2+3+4+5+6}{6}=\frac{21}{6}=3.5.\)
\(E[X^2]=\frac{1+4+9+16+25+36}{6}=\frac{91}{6}.\)
\(\operatorname{Var}(X)=\frac{91}{6}-\left(\frac72\right)^2=\frac{91}{6}-\frac{49}{4}=\frac{182-147}{12}=\frac{35}{12}.\)
\(\operatorname{SD}(X)=\sqrt{35/12}\approx1.708.\)
Employee tenure: E[X] and Var(X) from a frequency table
A company has \(50\) employees. The number of years (rounded up) they have worked there:
| Years | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| Count | 12 | 15 | 16 | 7 |
One employee is chosen at random; \(X=\) years of service. Find (a) \(E[X]\) and (b) \(\operatorname{Var}(X)\).
Turn counts into a pmf (divide by \(50\)), then apply \(E[X]\) and \(\operatorname{Var}=E[X^2]-(E[X])^2\).
\(p(1)=\tfrac{12}{50},\ p(2)=\tfrac{15}{50},\ p(3)=\tfrac{16}{50},\ p(4)=\tfrac{7}{50}\). (Sum \(=1\).)
\(E[X]=\frac{1(12)+2(15)+3(16)+4(7)}{50}=\frac{12+30+48+28}{50}=\frac{118}{50}=2.36.\)
\(E[X^2]=\frac{1(12)+4(15)+9(16)+16(7)}{50}=\frac{12+60+144+112}{50}=\frac{328}{50}=6.56.\)
\(\operatorname{Var}(X)=6.56-(2.36)^2=6.56-5.5696=0.9904.\) (\(\operatorname{SD}\approx0.995\).)
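The same steps can be run as a short script, which is a handy way to check the arithmetic. This is only a sketch of the frequency-table recipe, with variable names chosen for illustration.

```python
counts = {1: 12, 2: 15, 3: 16, 4: 7}  # years of service -> number of employees
total = sum(counts.values())          # 50 employees

# Divide each count by the total to turn frequencies into a pmf.
pmf = {years: n / total for years, n in counts.items()}

mean = sum(x * p for x, p in pmf.items())
second_moment = sum(x ** 2 * p for x, p in pmf.items())
variance = second_moment - mean ** 2

print(round(mean, 4), round(variance, 4), round(variance ** 0.5, 3))  # 2.36 0.9904 0.995
```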
Tomorrow's earnings: expected value and variance of a payoff
If it rains tomorrow you earn €200 tutoring; if it is dry you earn €300 in construction. The probability of rain is \(\tfrac14\). Find the expected amount you earn and its variance.
Define the payoff RV: \(X=200\) with prob \(\tfrac14\), \(X=300\) with prob \(\tfrac34\). Then \(E[X]\) and \(\operatorname{Var}=E[X^2]-(E[X])^2\).
\(P(X=200)=0.25,\ P(X=300)=0.75.\)
\(E[X]=200(0.25)+300(0.75)=50+225=275.\)
\(E[X^2]=200^2(0.25)+300^2(0.75)=10000+67500=77500.\)
\(\operatorname{Var}(X)=77500-275^2=77500-75625=1875.\) (\(\operatorname{SD}\approx43.3\).)
Family earnings: sum of independent variables
Matteo earns an amount with mean €30000 and SD €3000. His wife Alessia earns an amount with mean €32000 and SD €5000. Their earnings are independent. Find (a) the expected value and (b) the standard deviation of the family's total earnings.
Means always add. For the SD, add the variances (independence), then take the square root — you cannot add SDs directly.
\(E[M+A]=E[M]+E[A]=30000+32000=62000.\)
\(\operatorname{Var}(M+A)=\operatorname{Var}(M)+\operatorname{Var}(A)=3000^2+5000^2=9{,}000{,}000+25{,}000{,}000=34{,}000{,}000.\)
\(\operatorname{SD}=\sqrt{34{,}000{,}000}\approx 5830.95.\) Note this is NOT \(3000+5000=8000\).
Should the firm risk using the computers? (expected-loss decision)
There is a \(25\%\) chance the power will be shut off during the next working day. If employees do not use their computers, the firm loses €400 in revenue for sure. If they use them and the power is cut mid-use, it costs €1200 (and €0 if no cut). To minimise expected loss, should the firm risk using the computers?
Compute the expected loss of each option and pick the smaller one.
Do not use the computers: the loss is €400 with certainty, so \(E[\text{loss}]=400.\)
Use the computers: \(E[\text{loss}]=0.25(1200)+0.75(0)=300.\)
\(300<400\), so using the computers has the lower expected loss.
Size-biased sampling: E[X] vs E[Y] (the bus problem)
Four buses carry \(40,33,50,25\) students (\(148\) total). (a) A student is chosen at random; \(X=\) number on that student's bus. (b) A bus driver is chosen at random; \(Y=\) number on that driver's bus. Find \(E[X]\) and \(E[Y]\) and explain why they differ.
Picking a student makes big buses more likely (size bias): \(P(X=n)=n/148\). Picking a driver makes each bus equally likely: \(P(Y=n)=1/4\).
\(P(X=n)=\dfrac{n}{148}\) for \(n\in\{25,33,40,50\}\).
\(E[X]=\dfrac{25^2+33^2+40^2+50^2}{148}=\dfrac{625+1089+1600+2500}{148}=\dfrac{5814}{148}\approx39.28.\)
\(P(Y=n)=\tfrac14\) each, so \(E[Y]=\dfrac{25+33+40+50}{4}=\dfrac{148}{4}=37.\)
\(E[X]>E[Y]\) because a randomly chosen student is more likely to come from a crowded bus — size-biased sampling inflates the average.
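The two sampling schemes differ only in the weights, which a few lines of code make explicit. This is a minimal sketch with exact fractions; the variable names are illustrative.

```python
from fractions import Fraction

buses = [40, 33, 50, 25]
total = sum(buses)  # 148 students

# Random student: a bus carrying n students is picked with probability n / total.
e_x = sum(Fraction(n, total) * n for n in buses)

# Random driver: each of the four buses is picked with probability 1/4.
e_y = Fraction(sum(buses), len(buses))

print(e_x, round(float(e_x), 2))  # 2907/74 39.28
print(e_y)                        # 37
```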
Discrete Distributions
The named models — Bernoulli, Binomial, Geometric, Poisson (and Hypergeometric, out of scope): recognise the story, then read the pmf, mean, and variance off the shelf.
Why named distributions: recognise the story, pull the formula off the shelf
Chapter 3 gave you the general machinery (pmf, \(E[X]\), \(\operatorname{Var}\)). But a handful of situations come up again and again — counting successes, waiting for a first success, counting rare events. For each, statisticians have already worked out the pmf, mean, and variance once and for all. These are the named discrete distributions.
Your job in an exam is almost never to derive them. It is to recognise the story in the problem text and then read the right formula off the shelf. So for each model below, learn three things: the story (when it applies), the pmf, and \(E[X]\) and \(\operatorname{Var}(X)\).
The four you must know: Bernoulli (one yes/no trial), Binomial (count successes in a fixed number of trials), Geometric (how long until the first success), Poisson (count of rare events at a known rate). A fifth, Hypergeometric, is covered lightly — it is out of exam scope but appears in the slides.
Bernoulli: a single yes/no trial
Story. One trial with exactly two outcomes — success (\(1\)) with probability \(p\), failure (\(0\)) with probability \(1-p\). A single coin flip, one inspected item, one customer who either buys or not.
pmf: \[ p(1)=p,\qquad p(0)=1-p, \] compactly \(P(X=x)=p^x(1-p)^{1-x}\) for \(x\in\{0,1\}\).
Mean and variance: \[ E[X]=p,\qquad \operatorname{Var}(X)=p(1-p). \]
(Both shown in Chapter 3: \(E[X]=p\) because the mean of a 0/1 variable is the probability of the 1; \(\operatorname{Var}=p(1-p)\) because \(X^2=X\).) Bernoulli is the atom: every Binomial is a sum of independent Bernoullis.
Binomial: count successes in n independent trials
Story. You repeat the same Bernoulli trial \(n\) times, independently, each with success probability \(p\). \(X=\) total number of successes. Recognition cues: a fixed number of trials \(n\), each trial independent, same \(p\), and you count how many succeed. "Out of 10 items, how many defective?" "Out of 4 components, how many work?"
pmf: \[ P(X=i)=\binom{n}{i}p^i(1-p)^{n-i},\qquad i=0,1,\dots,n, \] where \(\binom{n}{i}=\dfrac{n!}{i!\,(n-i)!}\) counts the ways to choose which \(i\) trials succeed.
Mean and variance: \[ E[X]=np,\qquad \operatorname{Var}(X)=np(1-p). \]
Both follow instantly from writing \(X=X_1+\dots+X_n\) (independent Bernoullis): means add (\(np\)) and, by independence, variances add (\(np(1-p)\)).
Common asks. "At least one" is easiest via the complement: \(P(X\ge1)=1-P(X=0)=1-(1-p)^n\). "At most one": \(P(X\le1)=P(X=0)+P(X=1)\). Reverse-engineering \(n,p\) from a stated \(E[X]\) and \(\operatorname{Var}(X)\) is a classic exam twist — solve \(np\) and \(np(1-p)\) together.
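If you want to convince yourself that the shelf formulas agree with the pmf, you can sum the pmf directly. The sketch below uses Python's `math.comb`; the parameters \(n=10\), \(p=0.6\) are illustrative choices.

```python
from math import comb, isclose

def binomial_pmf(i, n, p):
    """P(X = i) for X ~ Binomial(n, p)."""
    return comb(n, i) * p ** i * (1 - p) ** (n - i)

n, p = 10, 0.6  # illustrative parameters
pmf = [binomial_pmf(i, n, p) for i in range(n + 1)]

# Moments computed from the pmf, to compare with np and np(1 - p).
mean = sum(i * q for i, q in enumerate(pmf))
variance = sum(i ** 2 * q for i, q in enumerate(pmf)) - mean ** 2

# "At least one" via the complement.
at_least_one = 1 - binomial_pmf(0, n, p)

print(round(mean, 6), round(variance, 6))       # 6.0 2.4
print(isclose(at_least_one, 1 - (1 - p) ** n))  # True
```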
Geometric: how many trials until the first success
Story. Repeat an independent Bernoulli trial (prob \(p\)) until the first success, and let \(X=\) the trial on which it happens. Recognition cue: "draw/try until the first…", "how many attempts needed?". Unlike Binomial, the number of trials is not fixed — it is what you are counting.
pmf: \[ P(X=k)=p\,(1-p)^{k-1},\qquad k=1,2,3,\dots \] (the first \(k-1\) trials fail, the \(k\)-th succeeds).
Mean and variance: \[ E[X]=\frac1p,\qquad \operatorname{Var}(X)=\frac{1-p}{p^2}. \]
The mean \(1/p\) matches intuition: if success has probability \(\tfrac15\), you wait about \(5\) trials on average. The pmf \(p(1-p)^{k-1}\) (sometimes written with parameter \(\theta\)) also shows up in estimation problems, where its parameter is estimated by \(\hat p=1/\bar X\).
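The Geometric formulas can be checked numerically by summing the pmf. Because the support is infinite, the sketch below truncates the sum at a point where the leftover tail is negligible; the cutoff and the value \(p=0.2\) are arbitrary illustrative choices.

```python
p = 0.2  # success probability on each trial (illustrative)

def geometric_pmf(k, p):
    """P(X = k): k - 1 failures followed by the first success on trial k."""
    return p * (1 - p) ** (k - 1)

# Infinite support, so truncate where the remaining tail is negligible.
ks = range(1, 500)
total = sum(geometric_pmf(k, p) for k in ks)
mean = sum(k * geometric_pmf(k, p) for k in ks)
variance = sum(k ** 2 * geometric_pmf(k, p) for k in ks) - mean ** 2

print(round(total, 6), round(mean, 6), round(variance, 6))  # 1.0 5.0 20.0
```

The results match \(1/p=5\) and \((1-p)/p^2=20\).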
Poisson: counting rare events at a known rate
Story. Count how many times a rare event happens in a fixed interval of time or space, when events occur independently at a constant average rate \(\lambda\). "On average \(3\) accidents per week — probability of at least one?" "\(5\) claims per day on average." It is the law of rare events: misprints per page, calls per minute, particles per second, customers per hour.
pmf: \[ P(X=i)=e^{-\lambda}\,\frac{\lambda^i}{i!},\qquad i=0,1,2,\dots \] where \(\lambda>0\) is the average count for the interval considered.
Mean and variance — both equal \(\lambda\): \[ E[X]=\lambda,\qquad \operatorname{Var}(X)=\lambda. \] (So \(\operatorname{SD}=\sqrt\lambda\).) If the mean and variance of a count are roughly equal, Poisson is a natural model.
Scaling the rate. If the rate is "\(1\) per \(2\) minutes" and you look at a \(5\)-minute window, set \(\lambda = 5/2 = 2.5\) for that window. Match \(\lambda\) to the interval in the question. The handy complement again: \(P(X\ge1)=1-e^{-\lambda}\).
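Matching \(\lambda\) to the window is the step most easily forgotten, so here is a small sketch of it. The function name `poisson_pmf` and the variable names are illustrative choices.

```python
from math import exp, factorial

def poisson_pmf(i, lam):
    """P(X = i) for X ~ Poisson(lam)."""
    return exp(-lam) * lam ** i / factorial(i)

rate_per_minute = 1 / 2                 # "1 per 2 minutes"
window_minutes = 5
lam = rate_per_minute * window_minutes  # 2.5 for the 5-minute window

p_none = poisson_pmf(0, lam)
p_at_least_one = 1 - p_none

print(lam, round(p_none, 4), round(p_at_least_one, 4))  # 2.5 0.0821 0.9179
```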
Poisson as an approximation to the Binomial
When you have many trials with a tiny success probability — \(n\) large, \(p\) small — the Binomial is awkward to compute but is almost exactly Poisson with
\[ \lambda = np. \]
\[ \binom{n}{i}p^i(1-p)^{n-i}\;\approx\;e^{-\lambda}\frac{\lambda^i}{i!},\qquad \lambda=np. \]
Rule of thumb: good when \(n\) is large (say \(\ge 20\)) and \(p\) small (say \(\le 0.05\)). Recognition cue in exams: a binomial setup with a huge \(n\) and a tiny \(p\) ("\(1000\) poker hands, each a full house with prob \(0.0014\)"). Switch to Poisson with \(\lambda=np\) and the arithmetic becomes trivial.
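How good is the approximation in the poker setting? One way to find out is to compute both sides. The sketch below evaluates \(P(X\ge 2)\) exactly with the Binomial pmf and approximately with Poisson(\(1.4\)); the two agree to about three decimal places.

```python
from math import comb, exp, factorial

n, p = 1000, 0.0014
lam = n * p  # 1.4

def binomial_pmf(i):
    """Exact P(X = i) for X ~ Binomial(n, p)."""
    return comb(n, i) * p ** i * (1 - p) ** (n - i)

def poisson_pmf(i):
    """Approximate P(X = i) using Poisson(lam) with lam = n * p."""
    return exp(-lam) * lam ** i / factorial(i)

exact = 1 - binomial_pmf(0) - binomial_pmf(1)
approx = 1 - poisson_pmf(0) - poisson_pmf(1)

print(round(exact, 4), round(approx, 4))
```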
Hypergeometric (out of exam scope): sampling without replacement
Marked OUT-OF-EXAM — taught in the slides but not tested. Know the story so you can tell it apart from Binomial.
Story. A finite population has \(N\) "good" and \(M\) "bad" items. You draw \(n\) without replacement; \(X=\) number of good ones drawn. Because draws are not replaced, the trials are not independent — that is exactly why it is not Binomial.
pmf: \[ P(X=i)=\frac{\binom{N}{i}\binom{M}{n-i}}{\binom{N+M}{n}}. \]
Mean: with \(p=\dfrac{N}{N+M}\), \(E[X]=np\) (same as Binomial), but the variance carries a finite-population correction \(\operatorname{Var}(X)=np(1-p)\big(1-\tfrac{n-1}{N+M-1}\big)\).
Key distinction. With replacement (or a huge population) → Binomial. Without replacement from a small population → Hypergeometric. When \(N+M\) is large relative to \(n\), the correction factor is close to \(1\) and the two models give nearly the same answers.
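The finite-population correction can be verified on a small case. The sketch below follows the convention above (\(N\) good, \(M\) bad, \(X\) counts good items); the numbers \(N=7\), \(M=3\), \(n=2\) are illustrative.

```python
from fractions import Fraction
from math import comb

def hypergeom_pmf(i, N, M, n):
    """P(X = i): i good items among n drawn without replacement
    from N good and M bad items."""
    return Fraction(comb(N, i) * comb(M, n - i), comb(N + M, n))

N, M, n = 7, 3, 2  # illustrative: 7 good, 3 bad, draw 2
pmf = {i: hypergeom_pmf(i, N, M, n) for i in range(n + 1)}

p = Fraction(N, N + M)
mean = sum(i * q for i, q in pmf.items())
variance = sum(i ** 2 * q for i, q in pmf.items()) - mean ** 2
formula_var = n * p * (1 - p) * (1 - Fraction(n - 1, N + M - 1))

print(mean == n * p, variance == formula_var)  # True True
print(variance, n * p * (1 - p))               # 28/75 21/50
```

The Hypergeometric variance \(\tfrac{28}{75}\) is smaller than the Binomial value \(\tfrac{21}{50}\), as the correction factor predicts.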
Decision box — which discrete model is it?
Read the problem and match the cue:
- One yes/no trial → Bernoulli(\(p\)): \(E=p,\ \operatorname{Var}=p(1-p)\).
- Fixed \(n\) independent trials, count successes → Binomial(\(n,p\)): \(P=\binom{n}{i}p^i(1-p)^{n-i}\), \(E=np,\ \operatorname{Var}=np(1-p)\).
- Trials until first success → Geometric(\(p\)): \(P=p(1-p)^{k-1}\), \(E=1/p,\ \operatorname{Var}=(1-p)/p^2\).
- Count of rare events at rate \(\lambda\) per interval → Poisson(\(\lambda\)): \(P=e^{-\lambda}\lambda^i/i!\), \(E=\operatorname{Var}=\lambda\).
- Big \(n\), tiny \(p\) → approximate Binomial by Poisson with \(\lambda=np\).
- Draw without replacement, count good → Hypergeometric (out of exam).
Reflexes: "at least one" → \(1-P(0)\). "with/without replacement" decides Binomial vs Hypergeometric. "fixed trials, count hits" = Binomial; "how long until" = Geometric.
Binomial: defective ball bearings
Each ball bearing is independently defective with probability \(0.05\). A sample of \(5\) is inspected. Find (a) \(P(\text{none defective})\) and (b) \(P(\text{two or more defective})\).
\(X\sim\text{Binomial}(5,0.05)\). Use the complement for (b).
Fixed \(n=5\) independent trials, \(p=0.05\), count defectives \(\Rightarrow X\sim\text{Binomial}(5,0.05)\).
\(P(X=0)=(0.95)^5\approx0.7738.\)
\(P(X\ge2)=1-P(0)-P(1)\). \(P(1)=\binom{5}{1}(0.05)(0.95)^4\approx0.2036\). So \(P(X\ge2)\approx1-0.7738-0.2036=0.0226.\)
Binomial: satellite reliability
A system has \(4\) components; it works if at least \(2\) function. Each works independently with probability \(0.8\). Find \(P(\text{system works})\).
\(Y\sim\text{Binomial}(4,0.8)\); \(P(Y\ge2)=1-P(0)-P(1)\).
\(Y=\)number working \(\sim\text{Binomial}(4,0.8)\).
\(P(Y=0)=(0.2)^4=0.0016\); \(P(Y=1)=\binom{4}{1}(0.8)(0.2)^3=0.0256\).
\(P(Y\ge2)=1-0.0016-0.0256=0.9728.\)
Binomial: recover n and p from E and Var
\(X\sim\text{Binomial}(n,p)\) with \(E[X]=6\) and \(\operatorname{Var}(X)=2.4\). Find \(n\), \(p\), and \(P(X=5)\).
Use \(np=E\) and \(np(1-p)=\operatorname{Var}\); divide to isolate \(1-p\).
\(1-p=\dfrac{\operatorname{Var}}{E}=\dfrac{2.4}{6}=0.4\Rightarrow p=0.6.\)
\(n=\dfrac{E}{p}=\dfrac{6}{0.6}=10.\) (Check: \(np(1-p)=10\cdot0.6\cdot0.4=2.4\) ✓.)
\(P(X=5)=\binom{10}{5}(0.6)^5(0.4)^5=252\cdot0.07776\cdot0.01024\approx0.2007.\)
Poisson: weekly accidents
The average number of accidents on a road per week is \(1.2\). Find the probability of at least one accident this week.
\(X\sim\text{Poisson}(1.2)\); \(P(X\ge1)=1-e^{-\lambda}\).
Rare events at rate \(\lambda=1.2\) per week \(\Rightarrow X\sim\text{Poisson}(1.2)\).
\(P(X\ge1)=1-P(X=0)=1-e^{-1.2}\approx1-0.3012=0.6988.\)
Poisson: radioactive emissions, at most 2
A gram of radioactive material emits on average \(3.2\) \(\alpha\)-particles per second. Find the probability of at most \(2\) emissions in one second.
\(X\sim\text{Poisson}(3.2)\); sum \(P(0)+P(1)+P(2)\).
\(X\sim\text{Poisson}(3.2)\), so \(P(X\le2)=e^{-3.2}\!\left(1+3.2+\dfrac{3.2^2}{2}\right).\)
\(1+3.2+5.12=9.32\); \(e^{-3.2}\approx0.04076\); product \(\approx0.380.\)
Poisson: casino arrivals
People enter a casino at an average rate of \(1\) every \(2\) minutes. For the window 12:00–12:05, find (a) \(P(\text{nobody enters})\) and (b) \(P(\text{at least }4\text{ enter})\).
Scale the rate to the 5-minute window: \(\lambda=5/2=2.5\).
\(\lambda=5/2=2.5\Rightarrow X\sim\text{Poisson}(2.5)\).
\(P(X=0)=e^{-2.5}\approx0.0821.\)
\(P(X\ge4)=1-\sum_{k=0}^{3}e^{-2.5}\dfrac{2.5^k}{k!}=1-0.7576\approx0.2424.\)
Poisson approximation to a Binomial: poker full houses
A poker hand is a full house with probability \(0.0014\). In \(1000\) independent hands, find the probability of at least \(2\) full houses.
\(X\sim\text{Binomial}(1000,0.0014)\): large \(n\), tiny \(p\). Approximate by Poisson with \(\lambda=np\).
\(\lambda=np=1000\cdot0.0014=1.4\Rightarrow X\approx\text{Poisson}(1.4)\).
\(P(X\ge2)=1-P(0)-P(1)=1-e^{-1.4}(1+1.4)=1-0.5918\approx0.4082.\)
Geometric: drawing until the first black ball
An urn has \(N\) white and \(M\) black balls. You draw with replacement until you get a black ball. Find \(P(\text{exactly }n\text{ draws})\) and the expected number of draws.
Each draw is black with probability \(p=M/(N+M)\); 'until first success' \(\Rightarrow\) Geometric.
\(p=P(\text{black})=\dfrac{M}{N+M}\), draws independent (replacement).
\(n-1\) whites then a black: \(P(X=n)=p(1-p)^{n-1}=\dfrac{M}{N+M}\left(\dfrac{N}{N+M}\right)^{n-1}.\)
\(E[X]=1/p=\dfrac{N+M}{M}.\)
Hypergeometric (out of exam): defective batteries
(Out of exam scope — for contrast with the Binomial.) From a bin of \(10\) batteries (\(7\) good, \(3\) defective), \(2\) are chosen without replacement. Let \(X=\) number of defectives. Give the pmf.
Without replacement from a small population \(\Rightarrow\) Hypergeometric. Use \(\dfrac{\binom{3}{i}\binom{7}{2-i}}{\binom{10}{2}}\).
\(P(X=0)=\dfrac{\binom{3}{0}\binom{7}{2}}{\binom{10}{2}}=\dfrac{21}{45}=\dfrac{7}{15}\approx0.467.\)
\(P(X=1)=\dfrac{\binom{3}{1}\binom{7}{1}}{\binom{10}{2}}=\dfrac{21}{45}=\dfrac{7}{15}\approx0.467.\)
\(P(X=2)=\dfrac{\binom{3}{2}\binom{7}{0}}{\binom{10}{2}}=\dfrac{3}{45}=\dfrac{1}{15}\approx0.067.\) (Sum \(=1\) ✓.)
Joint Distributions & Covariance
Two variables at once: joint and marginal pmfs, independence, covariance and correlation, and the variance of a sum — with the crucial warning that zero covariance is not independence.
Why one variable at a time isn't enough
So far each random variable lived alone. But real questions involve two at once: height and weight, a student's satisfaction and their year, today's return and tomorrow's. To study how two variables move together, their separate distributions are not enough.
Here's the catch that motivates the whole chapter: the individual (marginal) distributions of \(X\) and \(Y\) do not determine how they relate. Two completely different relationships — one where \(X\) and \(Y\) are unrelated, one where high \(X\) forces high \(Y\) — can have the same marginals. To see the relationship you need the joint distribution: the probability of each pair of values.
The joint pmf: probability of each pair
For two discrete random variables \(X\) and \(Y\), the joint probability mass function gives the probability they take a particular pair of values:
\[ p(x_i,y_j)=P\{X=x_i,\ Y=y_j\}. \]
You lay it out as a table: rows are the values of \(X\), columns the values of \(Y\), each cell a probability. As with any pmf, all cells are \(\ge0\) and
\[ \sum_i\sum_j p(x_i,y_j)=1. \]
Example (independence by construction). Daily stock changes are independent and identically distributed, with \(P(\text{change}=0)=0.30\), \(P(+1)=P(-1)=0.20\), \(P(+2)=P(-2)=0.10\), \(P(+3)=P(-3)=0.05\) (these sum to \(1\)). The probability of a specific 3-day path multiplies: \(P\{X_1=1,X_2=2,X_3=0\}=(0.20)(0.10)(0.30)=0.006\).
Marginals: recovering each variable from the table
Given the joint table you can always get each variable's own distribution back by summing across the other. The marginal pmf of \(X\) is the row sums; the marginal of \(Y\) is the column sums:
\[ p_X(x_i)=\sum_j p(x_i,y_j),\qquad p_Y(y_j)=\sum_i p(x_i,y_j). \]
(The name comes from writing these totals in the margins of the table.) Each marginal must itself sum to \(1\) — a handy check.
One-way street. You can always go joint → marginals. You cannot in general go marginals → joint: many different joint tables share the same row and column totals. That's exactly why the joint distribution carries information the marginals don't.
Independence of two random variables
\(X\) and \(Y\) are independent when knowing one tells you nothing about the other. For discrete variables this has a clean test: the joint factors into the product of the marginals, for every cell:
\[ p(x_i,y_j)=p_X(x_i)\,p_Y(y_j)\quad\text{for all }i,j. \]
How to check. Compute the marginals (row/column sums), then verify every cell equals row-total × column-total. If even one cell fails, the variables are dependent.
Under independence two shortcuts hold: \(E[XY]=E[X]\,E[Y]\), and (as we'll see) covariance is \(0\) so variances of sums simply add.
Expectation of functions of two variables
To average any function \(g(X,Y)\) — a combined score, a product, a sum — weight it by the joint probabilities:
\[ E[g(X,Y)]=\sum_i\sum_j g(x_i,y_j)\,p(x_i,y_j). \]
Two cases dominate:
- Sum: \(E[X+Y]=E[X]+E[Y]\) — always, no independence needed.
- Product: \(E[XY]=\sum_i\sum_j x_i y_j\,p(x_i,y_j)\). This equals \(E[X]\,E[Y]\) only when \(X,Y\) are independent.
The gap between \(E[XY]\) and \(E[X]E[Y]\) is precisely what the next idea — covariance — measures.
Covariance: do they move together?
Covariance measures the linear tendency of two variables to move together:
\[ \operatorname{Cov}(X,Y)=E[(X-\mu_X)(Y-\mu_Y)]. \]
When \(X\) is above its mean at the same time \(Y\) is above its mean (and vice versa), the product is positive on average → positive covariance. If one tends to be high when the other is low, it's negative.
Computational formula (the one you use):
\[ \operatorname{Cov}(X,Y)=E[XY]-E[X]\,E[Y]=E[XY]-\mu_X\mu_Y. \]
Sign tells the story: \(>0\) move together; \(<0\) move oppositely; \(=0\) no linear trend.
Properties: \(\operatorname{Cov}(X,Y)=\operatorname{Cov}(Y,X)\); \(\operatorname{Cov}(X,X)=\operatorname{Var}(X)\); \(\operatorname{Cov}(aX,Y)=a\operatorname{Cov}(X,Y)\); and if \(X,Y\) are independent then \(\operatorname{Cov}(X,Y)=0\).
Correlation: covariance on a fixed scale
Covariance has awkward units (units of \(X\) times units of \(Y\)) and its size depends on scale, so you can't tell "strong" from "weak". The correlation coefficient fixes this by dividing out both standard deviations:
\[ \operatorname{Corr}(X,Y)=\frac{\operatorname{Cov}(X,Y)}{\sqrt{\operatorname{Var}(X)\,\operatorname{Var}(Y)}}=\frac{\operatorname{Cov}(X,Y)}{\sigma_X\,\sigma_Y}. \]
It is always between \(-1\) and \(1\):
\[ -1\le\operatorname{Corr}(X,Y)\le 1. \]
- \(+1\): perfect increasing linear relation \(Y=a+bX\) with \(b>0\).
- \(-1\): perfect decreasing linear relation (\(b<0\)).
- \(0\): no linear relation (but a non-linear one can still exist).
The closer to \(\pm1\), the tighter the linear association; the sign matches the sign of the covariance.
Variance of a sum, with the covariance term
Chapter 3 promised the full rule; here it is. Adding two variables, the variance is not just the sum of variances — there is a covariance correction:
\[ \operatorname{Var}(X+Y)=\operatorname{Var}(X)+\operatorname{Var}(Y)+2\operatorname{Cov}(X,Y). \]
For many variables:
\[ \operatorname{Var}\!\left(\sum_{i=1}^n X_i\right)=\sum_{i=1}^n\operatorname{Var}(X_i)+\sum_{i\neq j}\operatorname{Cov}(X_i,X_j). \]
When the variables are independent, every covariance is \(0\) and the variances simply add. That is why, back in Chapters 3–4, "independent" was the magic word that let \(\operatorname{Var}(X)=np(1-p)\) and friends fall out so cleanly.
Warning: zero covariance does NOT mean independent
Independence \(\Rightarrow\) \(\operatorname{Cov}=0\). The reverse is false: two variables can have zero covariance yet be completely dependent, because covariance only sees linear association.
Classic counterexample. Let \(X\) take \(-1,0,1\) each with probability \(\tfrac13\), and set \(Y=X^2\). Then \(Y\) is totally determined by \(X\) (as dependent as can be), yet:
\[ E[X]=0,\quad E[XY]=E[X^3]=\tfrac{-1+0+1}{3}=0,\quad \operatorname{Cov}(X,Y)=0-0\cdot E[Y]=0. \]
Covariance is \(0\) but \(X\) and \(Y\) are dependent (e.g. \(p(0,1)=0\neq p_X(0)\,p_Y(1)=\tfrac13\cdot\tfrac23\)). Takeaway: use the factorisation test \(p(x,y)=p_X(x)p_Y(y)\) to decide independence — never infer it from \(\operatorname{Cov}=0\) alone.
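The counterexample is small enough to compute in full. The sketch below stores the joint pmf of \((X,Y)\) with \(Y=X^2\) as a dictionary, computes the covariance, and then runs the factorisation test on one cell; the helper name `expect` is a hypothetical choice.

```python
from fractions import Fraction

third = Fraction(1, 3)
# Joint pmf of (X, Y) with Y = X^2: only the pairs (x, x^2) carry mass.
joint = {(x, x ** 2): third for x in (-1, 0, 1)}

def expect(g):
    """E[g(X, Y)] from the joint pmf."""
    return sum(g(x, y) * p for (x, y), p in joint.items())

cov = expect(lambda x, y: x * y) - expect(lambda x, y: x) * expect(lambda x, y: y)

# Factorisation test on the cell (X = 0, Y = 1).
p_x0 = sum(p for (x, _), p in joint.items() if x == 0)
p_y1 = sum(p for (_, y), p in joint.items() if y == 1)
cell = joint.get((0, 1), Fraction(0))

print(cov)                # 0
print(cell, p_x0 * p_y1)  # 0 2/9
```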
Full pipeline from a joint table: marginals, covariance, correlation
Student satisfaction \(X\in\{1,2,3,4\}\) and university year \(Y\in\{1,2,3\}\) have joint pmf:
| \(X\backslash Y\) | 1 | 2 | 3 |
|---|---|---|---|
| 1 | 0.10 | 0 | 0 |
| 2 | 0.20 | 0 | 0 |
| 3 | 0.30 | 0.20 | 0 |
| 4 | 0 | 0 | 0.20 |
Find \(E[X],E[Y]\), \(\operatorname{Var}(X),\operatorname{Var}(Y)\), \(\operatorname{Cov}(X,Y)\), and \(\operatorname{Corr}(X,Y)\).
Row sums give \(p_X\), column sums give \(p_Y\). Then the usual moments, \(E[XY]\) from nonzero cells, and combine.
Rows: \(p_X=(0.10,0.20,0.50,0.20)\). Columns: \(p_Y=(0.60,0.20,0.20)\). Both sum to 1.
\(E[X]=0.1+0.4+1.5+0.8=2.8\); \(E[X^2]=0.1+0.8+4.5+3.2=8.6\Rightarrow\operatorname{Var}(X)=8.6-2.8^2=0.76\). \(E[Y]=0.6+0.4+0.6=1.6\); \(E[Y^2]=0.6+0.8+1.8=3.2\Rightarrow\operatorname{Var}(Y)=3.2-1.6^2=0.64\).
Nonzero cells: \(1\!\cdot\!1(0.10)+2\!\cdot\!1(0.20)+3\!\cdot\!1(0.30)+3\!\cdot\!2(0.20)+4\!\cdot\!3(0.20)=0.1+0.4+0.9+1.2+2.4=5.0\).
\(\operatorname{Cov}=5.0-2.8\cdot1.6=0.52\). \(\operatorname{Corr}=\dfrac{0.52}{\sqrt{0.76\cdot0.64}}=\dfrac{0.52}{0.6974}\approx0.746\) — strong positive.
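The whole pipeline fits in a short script, which is useful for checking each intermediate number. This is a sketch only: the joint table is stored as a dictionary of its nonzero cells, and the helper name `expect` is hypothetical.

```python
from math import sqrt

# Joint pmf as {(x, y): probability}; zero cells are omitted.
joint = {(1, 1): 0.10, (2, 1): 0.20, (3, 1): 0.30, (3, 2): 0.20, (4, 3): 0.20}

def expect(g):
    """E[g(X, Y)]: weight g by the joint probabilities."""
    return sum(g(x, y) * p for (x, y), p in joint.items())

e_x, e_y = expect(lambda x, y: x), expect(lambda x, y: y)
var_x = expect(lambda x, y: x ** 2) - e_x ** 2
var_y = expect(lambda x, y: y ** 2) - e_y ** 2
cov = expect(lambda x, y: x * y) - e_x * e_y
corr = cov / sqrt(var_x * var_y)

print(round(e_x, 2), round(e_y, 2))      # 2.8 1.6
print(round(var_x, 2), round(var_y, 2))  # 0.76 0.64
print(round(cov, 2), round(corr, 3))     # 0.52 0.746
```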
Test independence from a joint table
For \(X,Y\in\{0,1\}\) the joint pmf is \(p(0,0)=0.2,\ p(0,1)=0.2,\ p(1,0)=0.3,\ p(1,1)=0.3\). Are \(X\) and \(Y\) independent?
Get the marginals, then check whether every cell equals (row total)×(column total).
\(p_X(0)=0.2+0.2=0.4,\ p_X(1)=0.6\); \(p_Y(0)=0.2+0.3=0.5,\ p_Y(1)=0.5\).
\(p_X(0)p_Y(0)=0.4\cdot0.5=0.2=p(0,0)\) ✓. Checking all four cells, each equals row×column.
Every cell factors, so \(X\) and \(Y\) are independent.
Zero covariance but dependent
Let \(X\) take \(-1,0,1\), each with probability \(\tfrac13\), and let \(Y=X^2\). Show \(\operatorname{Cov}(X,Y)=0\), yet \(X\) and \(Y\) are not independent.
Compute \(E[X],E[XY]\) (note \(XY=X^3\)), then test independence on one cell.
\(E[X]=\tfrac{-1+0+1}{3}=0\); \(Y\in\{0,1\}\) with \(P(Y=0)=\tfrac13,P(Y=1)=\tfrac23\), so \(E[Y]=\tfrac23\).
\(E[XY]=E[X^3]=\tfrac{-1+0+1}{3}=0\), so \(\operatorname{Cov}=E[XY]-E[X]E[Y]=0-0\cdot\tfrac23=0\).
\(p(X{=}0,Y{=}1)=0\) while \(p_X(0)p_Y(1)=\tfrac13\cdot\tfrac23=\tfrac29\neq0\). The factorisation fails \(\Rightarrow\) dependent.
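Both the factorisation test and the covariance can be computed exactly with rational arithmetic. The sketch below is one way to do it, assuming joint pmfs are stored as dictionaries; the function names are illustrative. It runs the \(Y=X^2\) example and, for contrast, the \(2\times2\) table from the independence exercise above.

```python
from fractions import Fraction
from itertools import product

def marginals(pmf):
    """Marginal pmfs of X and Y from a joint pmf {(x, y): prob}."""
    px, py = {}, {}
    for (x, y), p in pmf.items():
        px[x] = px.get(x, 0) + p
        py[y] = py.get(y, 0) + p
    return px, py

def covariance(pmf):
    """Cov(X, Y) = E[XY] - E[X]E[Y]."""
    ex = sum(x * p for (x, y), p in pmf.items())
    ey = sum(y * p for (x, y), p in pmf.items())
    exy = sum(x * y * p for (x, y), p in pmf.items())
    return exy - ex * ey

def is_independent(pmf):
    """Factorisation test on every (x, y) pair, zero cells included."""
    px, py = marginals(pmf)
    return all(pmf.get((x, y), 0) == px[x] * py[y] for x, y in product(px, py))

third = Fraction(1, 3)
square_pmf = {(x, x * x): third for x in (-1, 0, 1)}   # Y = X^2
print(covariance(square_pmf), is_independent(square_pmf))

table = {(0, 0): Fraction(2, 10), (0, 1): Fraction(2, 10),
         (1, 0): Fraction(3, 10), (1, 1): Fraction(3, 10)}
print(covariance(table), is_independent(table))
```

Using `Fraction` avoids floating-point rounding, so the equality test in `is_independent` is exact.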
Variance of a sum with a covariance term
Suppose \(\operatorname{Var}(X)=4\), \(\operatorname{Var}(Y)=9\), and \(\operatorname{Cov}(X,Y)=-2\). Find \(\operatorname{Var}(X+Y)\). Compare with the value you'd get if \(X,Y\) were independent.
Use \(\operatorname{Var}(X+Y)=\operatorname{Var}(X)+\operatorname{Var}(Y)+2\operatorname{Cov}(X,Y)\).
\(\operatorname{Var}(X+Y)=4+9+2(-2)=13-4=9.\)
If \(X,Y\) were independent, then \(\operatorname{Cov}(X,Y)=0\) and \(\operatorname{Var}(X+Y)=4+9=13\). The negative covariance reduced the variance of the sum from \(13\) to \(9\).
Independence shortcuts: product expectation and sum of a path
Daily stock changes are independent with \(P(0)=0.30,P(\pm1)=0.20,P(\pm2)=0.10,P(\pm3)=0.05\). (a) Find \(P\{X_1=1,X_2=2,X_3=0\}\). (b) For two such independent days, argue \(E[X_1X_2]=E[X_1]E[X_2]\).
Independence makes the joint probability the product of marginals, and \(E[XY]=E[X]E[Y]\).
By independence the joint factors: \(P=(0.20)(0.10)(0.30)=0.006.\)
The change distribution is symmetric about 0, so \(E[X_1]=E[X_2]=0\). By independence \(E[X_1X_2]=E[X_1]E[X_2]=0\), hence \(\operatorname{Cov}(X_1,X_2)=0\) too.
Continuous Random Variables
When values fill an interval: density functions where probability is area, the P(X=a)=0 rule, expectation and variance by integration, and the two workhorse models — the continuous Uniform and the memoryless Exponential.
From mass to density: when values form a continuum
Discrete variables sat on separate points, each carrying a chunk of probability (the pmf). But many quantities — a waiting time, a height, a lifetime — can be any value in an interval. There are infinitely many possible values, so no single one can carry positive probability. We need a new tool: the probability density function \(f(x)\).
The shift in one sentence: for discrete variables, probability is the height of a bar; for continuous variables, probability is the area under a curve. Nothing else about expectation or variance changes in spirit — sums just become integrals.
The density function: probability is area
A continuous random variable \(X\) has a density \(f(x)\ge0\) such that the probability of landing in an interval is the area under \(f\) over that interval:
\[ P(a\le X\le b)=\int_a^b f(x)\,dx. \]
For \(f\) to be a valid density it must be nonnegative and enclose total area \(1\):
\[ f(x)\ge0,\qquad \int_{-\infty}^{\infty} f(x)\,dx=1. \]
A single point has zero probability. Since the area over a point is zero, \[ P(X=a)=\int_a^a f(x)\,dx=0. \] A practical consequence: for continuous variables \(\le\) and \(<\) (and \(\ge\) and \(>\)) give the same probability — endpoints never matter. \(f(x)\) itself is not a probability (it can exceed \(1\)); only areas are probabilities.
Finding the normalising constant
A very common exam setup gives a density "up to a constant" — \(f(x)=c\cdot(\text{shape})\) — and asks you to find \(c\). The rule is always the same: the total area must be \(1\), so
\[ \int_{-\infty}^{\infty} f(x)\,dx=1 \quad\Longrightarrow\quad c=\frac{1}{\int(\text{shape})\,dx}. \]
Example. Let \(f(x)=c(1-x^2)\) on \((-1,1)\), \(0\) elsewhere. Then
\[ \int_{-1}^{1} c(1-x^2)\,dx=c\Big[x-\tfrac{x^3}{3}\Big]_{-1}^{1}=c\cdot\tfrac43=1\ \Rightarrow\ c=\tfrac34. \]
Once \(c\) is known you compute any probability as an area, e.g. \(P(0<X<\tfrac12)=\tfrac34\int_0^{1/2}(1-x^2)\,dx=\tfrac{11}{32}\).
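The normalisation rule can also be checked numerically. The following is a rough sketch using a hand-written midpoint rule (the `integrate` helper is an assumption of this example, not a library function); it recovers \(c\) for the shape \(1-x^2\) on \((-1,1)\) and then one probability as an area.

```python
def integrate(g, a, b, n=100_000):
    """Midpoint-rule approximation of the integral of g over [a, b]."""
    h = (b - a) / n
    return h * sum(g(a + (i + 0.5) * h) for i in range(n))

shape = lambda x: 1 - x * x          # unnormalised shape on (-1, 1)
c = 1 / integrate(shape, -1, 1)      # total area must be 1
f = lambda x: c * shape(x)           # the normalised density

print(round(c, 4))                       # close to 3/4
print(round(integrate(f, 0, 0.5), 4))    # P(0 < X < 1/2), close to 11/32
```

On an exam the integral is done by hand; the numerical version is only a way to confirm the algebra.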
Expectation and variance by integration
Everything from Chapter 3 carries over with sums replaced by integrals:
\[ E[X]=\int_{-\infty}^{\infty} x\,f(x)\,dx,\qquad E[g(X)]=\int_{-\infty}^{\infty} g(x)\,f(x)\,dx, \]
\[ \operatorname{Var}(X)=E[X^2]-(E[X])^2=\int x^2 f(x)\,dx-(E[X])^2. \]
The linear-transformation rules are unchanged too: \(E[aX+b]=aE[X]+b\) and \(\operatorname{Var}(aX+b)=a^2\operatorname{Var}(X)\).
Example. For \(f(x)=x/2\) on \([0,2]\): \(E[X]=\int_0^2 x\cdot\tfrac{x}{2}\,dx=\tfrac12\cdot\tfrac{8}{3}=\tfrac43\), \(E[X^2]=\int_0^2 x^2\cdot\tfrac{x}{2}\,dx=2\), so \(\operatorname{Var}(X)=2-\tfrac{16}{9}=\tfrac29\).
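As a numerical cross-check of the example, here is a small sketch that integrates \(x f(x)\) and \(x^2 f(x)\) with a midpoint rule and then applies the linear-transformation rules. The `integrate` helper and the choice \(Y=2X-1\) are assumptions made for illustration.

```python
def integrate(g, a, b, n=100_000):
    """Midpoint-rule approximation of the integral of g over [a, b]."""
    h = (b - a) / n
    return h * sum(g(a + (i + 0.5) * h) for i in range(n))

f = lambda x: x / 2                                  # density on [0, 2]
mean = integrate(lambda x: x * f(x), 0, 2)           # E[X]
second = integrate(lambda x: x * x * f(x), 0, 2)     # E[X^2]
var = second - mean ** 2                             # Var(X)

a, b = 2, -1                                         # Y = aX + b
mean_y = a * mean + b                                # E[Y] = a E[X] + b
var_y = a * a * var                                  # Var(Y) = a^2 Var(X)

print(round(mean, 4), round(var, 4))
print(round(mean_y, 4), round(var_y, 4))
```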
The continuous Uniform on [α, β]
Story. Any value in \([\alpha,\beta]\) is equally likely — no part of the interval is favoured. The density is a flat line whose height makes the area \(1\):
\[ f(x)=\frac{1}{\beta-\alpha}\quad\text{for }x\in[\alpha,\beta]\ \ (0\text{ outside}). \]
(Base \(\times\) height \(=(\beta-\alpha)\cdot\frac{1}{\beta-\alpha}=1\).)
Probabilities are length ratios. For a sub-interval \([a,b]\subseteq[\alpha,\beta]\): \[ P(a<X<b)=\frac{b-a}{\beta-\alpha}. \] Mean and variance: \[ E[X]=\frac{\alpha+\beta}{2},\qquad \operatorname{Var}(X)=\frac{(\beta-\alpha)^2}{12}. \] The mean is just the midpoint. Recognition cue: "arrives at a time uniformly between …", "no information, equally likely anywhere in the interval".
The Exponential: waiting times with no memory
Story. The time you wait for the next event when events happen at a constant rate \(\lambda\) (the continuous-time cousin of the Poisson). Lifetimes of components that don't wear out, time to the next phone call, service times in a queue.
Density (rate \(\lambda>0\)): \[ f(x)=\lambda e^{-\lambda x}\quad\text{for }x\ge0\ \ (0\text{ for }x<0). \]
Tail probability (the handy one — no integral needed): \[ P(X>t)=e^{-\lambda t},\qquad P(X\le t)=1-e^{-\lambda t}. \]
Mean and variance: \[ E[X]=\frac1\lambda,\qquad \operatorname{Var}(X)=\frac1{\lambda^2}. \] So a higher rate \(\lambda\) means shorter expected wait. Careful: \(\lambda\) is the rate; the mean is its reciprocal \(1/\lambda\).
Memoryless property. The exponential forgets how long it has already waited: \[ P(X>s+t\mid X>s)=P(X>t). \] An old component that hasn't failed is "as good as new" — its remaining life has the same distribution as a fresh one. This is the defining feature of the exponential.
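The memoryless identity is easy to verify from the tail formula. The sketch below compares the conditional tail with the unconditional one for an Exponential, and repeats the comparison for a Uniform on \([0,30]\) to show the contrast. The rate \(\lambda=0.1\) and the times \(s=15\), \(t=10\) are hypothetical values chosen for illustration.

```python
from math import exp

def exp_tail(t, lam):
    """P(X > t) for X ~ Exponential(rate lam), t >= 0."""
    return exp(-lam * t)

def uniform_tail(t, alpha, beta):
    """P(X > t) for X ~ Uniform[alpha, beta]."""
    if t <= alpha:
        return 1.0
    if t >= beta:
        return 0.0
    return (beta - t) / (beta - alpha)

lam, s, t = 0.1, 15.0, 10.0   # hypothetical rate and waiting times

exp_cond = exp_tail(s + t, lam) / exp_tail(s, lam)               # P(X > s+t | X > s)
uni_cond = uniform_tail(s + t, 0, 30) / uniform_tail(s, 0, 30)

print(exp_cond, exp_tail(t, lam))          # equal up to rounding
print(uni_cond, uniform_tail(t, 0, 30))    # 1/3 versus 2/3
```

For the exponential the two numbers agree for any \(s\) and \(t\); for the uniform they do not, because the remaining window shrinks as time passes.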
Recap — the continuous toolkit
- Density: probability = area, \(P(a\le X\le b)=\int_a^b f\). Valid if \(f\ge0\) and \(\int f=1\). \(P(X=a)=0\), so \(\le\) and \(<\) coincide.
- Find a constant: set \(\int f=1\) and solve.
- Moments: \(E[X]=\int x f\), \(\operatorname{Var}=E[X^2]-(E[X])^2\); \(E[aX+b]=aE[X]+b\), \(\operatorname{Var}(aX+b)=a^2\operatorname{Var}(X)\).
- Uniform\([\alpha,\beta]\): \(f=\tfrac{1}{\beta-\alpha}\), \(E=\tfrac{\alpha+\beta}{2}\), \(\operatorname{Var}=\tfrac{(\beta-\alpha)^2}{12}\); sub-interval prob = length ratio.
- Exponential\((\lambda)\): \(f=\lambda e^{-\lambda x}\), \(P(X>t)=e^{-\lambda t}\), \(E=\tfrac1\lambda\), \(\operatorname{Var}=\tfrac1{\lambda^2}\), memoryless.
Exam reflex: "density given, find constant / a probability" → normalise then integrate. "equally likely in an interval" → Uniform, use length ratios. "waiting time / lifetime at constant rate / memoryless" → Exponential, use \(e^{-\lambda t}\).
Uniform waiting time, with a conditional twist
A bus arrives at a time uniformly distributed between 10:00 and 10:30; let \(X\) be your wait in minutes, \(X\sim U(0,30)\). (a) Find \(P(X>10)\). (b) If at 10:15 the bus still hasn't come, find the probability you wait at least 10 more minutes.
Uniform probabilities are length ratios. For (b) use conditional probability \(P(X>25\mid X>15)\).
\(P(X>10)=\dfrac{30-10}{30}=\dfrac{20}{30}=\dfrac23\approx0.667.\)
'10 more minutes after 10:15' means \(X>25\) given \(X>15\): \(P(X>25\mid X>15)=\dfrac{P(X>25)}{P(X>15)}\).
\(=\dfrac{(30-25)/30}{(30-15)/30}=\dfrac{5}{15}=\dfrac13\approx0.333.\) (Note: the uniform is NOT memoryless. The conditional answer \(1/3\) differs from the unconditional \(P(X>10)=2/3\).)
Find the constant of a density, then probabilities
\(X\) has density \(f(x)=c(1-x^2)\) for \(-1<x<1\) (0 elsewhere). (a) Find \(c\). (b) Find \(P(X<0)\). (c) Find \(P(0<X<\tfrac12)\).
Impose \(\int_{-1}^{1}f=1\) for \(c\). The density is symmetric about 0.
\(\int_{-1}^{1}c(1-x^2)\,dx=c\big[x-\tfrac{x^3}{3}\big]_{-1}^{1}=c\cdot\tfrac43=1\Rightarrow c=\tfrac34.\)
By symmetry about 0, \(P(X<0)=\tfrac12.\)
\(\tfrac34\int_0^{1/2}(1-x^2)\,dx=\tfrac34\big[x-\tfrac{x^3}{3}\big]_0^{1/2}=\tfrac34\big(\tfrac12-\tfrac1{24}\big)=\tfrac34\cdot\tfrac{11}{24}=\tfrac{11}{32}\approx0.344.\)
Expectation, variance, and a linear transform
\(X\) has density \(f(x)=x/2\) for \(0\le x\le2\) (0 elsewhere). (a) Find \(E[X]\) and \(\operatorname{Var}(X)\). (b) For \(Y=2X-1\), find \(E[Y]\) and \(\operatorname{Var}(Y)\).
Integrate for \(E[X]\) and \(E[X^2]\); then use the linear-transform rules.
\(E[X]=\int_0^2 x\cdot\tfrac{x}{2}\,dx=\tfrac12\cdot\tfrac{8}{3}=\tfrac43.\)
\(E[X^2]=\int_0^2 x^2\cdot\tfrac{x}{2}\,dx=\tfrac12\cdot4=2\), so \(\operatorname{Var}(X)=2-\big(\tfrac43\big)^2=2-\tfrac{16}{9}=\tfrac29.\)
\(E[Y]=2\cdot\tfrac43-1=\tfrac53\); \(\operatorname{Var}(Y)=2^2\cdot\tfrac29=\tfrac89.\)
Exponential lifetime: mean, variance, and a tail probability
The lifetime of an electronic tube is exponential, \(f(x)=\lambda e^{-\lambda x}\), \(x\ge0\). (a) Show \(E[X]=1/\lambda\) and \(\operatorname{Var}(X)=1/\lambda^2\). (b) A car battery has exponential life with mean \(10{,}000\) miles. Find \(P(\text{lasts a }5{,}000\text{-mile trip})\).
(a) integrate by parts (\(E[X^2]=2/\lambda^2\)). (b) mean \(=1/\lambda\Rightarrow\lambda\); use the tail \(P(X>t)=e^{-\lambda t}\).
\(E[X]=\int_0^\infty x\lambda e^{-\lambda x}\,dx=\tfrac1\lambda\) (by parts).
\(E[X^2]=\int_0^\infty x^2\lambda e^{-\lambda x}\,dx=\tfrac{2}{\lambda^2}\), so \(\operatorname{Var}(X)=\tfrac{2}{\lambda^2}-\tfrac{1}{\lambda^2}=\tfrac{1}{\lambda^2}.\)
Mean \(=1/\lambda=10000\Rightarrow\lambda=10^{-4}\). \(P(X>5000)=e^{-10^{-4}\cdot5000}=e^{-0.5}\approx0.6065.\)
Uniform meeting time
A friend arrives uniformly between 2:00 and 3:00; let \(X\sim U(0,60)\) be your wait in minutes. Find (a) \(P(X\ge30)\), (b) \(P(X<15)\), (c) \(P(10
All are length ratios over the interval of length 60.
\(P(X\ge30)=\dfrac{60-30}{60}=\dfrac12.\)
\(P(X<15)=\dfrac{15}{60}=\dfrac14.\)
\(P(10
\(P(X<45)=\dfrac{45}{60}=\dfrac34.\)
Uniform with a memoryless contrast (bus every 30 min)
You arrive uniformly in a 30-minute window before a bus. (a) Compute \(P(\text{wait}<5)\) over the whole window. (b) Why is the exponential's memoryless property a special feature the uniform does NOT share?
(a) is a length ratio. (b) compare the conditional from Exercise 1.
\(P(\text{wait}<5)=\dfrac{5}{30}=\dfrac16\approx0.167.\)
For a uniform, conditioning on 'already waited 15 min' changed the answer (Exercise 1: \(1/3\neq2/3\)): it remembers. Among continuous distributions, only the exponential satisfies \(P(X>s+t\mid X>s)=P(X>t)\).
Normal Distribution
The bell curve and the one skill that unlocks the rest of the course: standardise to Z, read the Φ table (with symmetry), invert for percentiles and critical values, and combine independent normals.
The normal: the bell curve that's everywhere
The normal (or Gaussian) distribution is the bell-shaped curve you've seen everywhere: heights, exam scores, measurement errors, and — crucially for this course — the distribution that sample means tend toward (next chapter). Master it now: it underlies the Central Limit Theorem, confidence intervals, and hypothesis tests that dominate the exam.
A normal variable is fixed by just two numbers, its mean and variance, written \(X\sim N(\mu,\sigma^2)\). The good news: every normal question reduces to looking up areas in a single table, once you learn the trick of standardisation.
Density, and what μ and σ do
\(X\sim N(\mu,\sigma^2)\) has density
\[ f(x)=\frac{1}{\sigma\sqrt{2\pi}}\,e^{-\frac{(x-\mu)^2}{2\sigma^2}},\qquad -\infty<x<\infty. \] You will essentially never integrate this by hand; that's what the table is for. What matters is the shape and the two knobs: \(\mu\) sets the centre of the bell (its peak and axis of symmetry), and \(\sigma\) sets its width (a larger \(\sigma\) gives a lower, wider bell). The two parameters are the mean and variance: \[ E[X]=\mu,\qquad \operatorname{Var}(X)=\sigma^2. \] (So \(\sigma=\sqrt{\sigma^2}\) is the standard deviation.) That's why the notation \(N(\mu,\sigma^2)\) is so natural.
The standard normal Z and standardisation
The special case \(\mu=0,\ \sigma^2=1\) is the standard normal \(Z\sim N(0,1)\). Its cumulative function is tabulated as
\[ \Phi(z)=P(Z\le z). \]
The key trick. Any normal can be turned into \(Z\) by subtracting the mean and dividing by the SD — standardising:
\[ Z=\frac{X-\mu}{\sigma}\sim N(0,1). \]
So a probability about \(X\) becomes a probability about \(Z\), which the table answers:
\[ P(X\le x)=P\!\left(Z\le\frac{x-\mu}{\sigma}\right)=\Phi\!\left(\frac{x-\mu}{\sigma}\right). \]
One table serves every normal — that's the whole point. The standardised value \(z=\frac{x-\mu}{\sigma}\) tells you "how many standard deviations above the mean" \(x\) sits.
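In software the table lookup is replaced by a cdf call. A minimal sketch using Python's standard-library `statistics.NormalDist`; the helper names `z_score` and `normal_cdf` and the example \(X\sim N(3,16)\) are choices made for this illustration.

```python
from statistics import NormalDist

def z_score(x, mu, sigma):
    """How many standard deviations x sits above the mean."""
    return (x - mu) / sigma

def normal_cdf(x, mu, sigma):
    """P(X <= x) for X ~ N(mu, sigma^2), computed as Phi of the z-score."""
    return NormalDist().cdf(z_score(x, mu, sigma))

mu, sigma = 3, 4      # X ~ N(3, 16), so sigma = 4
print(z_score(11, mu, sigma), round(normal_cdf(11, mu, sigma), 4))
print(z_score(-1, mu, sigma), round(normal_cdf(-1, mu, sigma), 4))
```

Standardising first and then calling the standard normal cdf mirrors the by-hand method; `NormalDist(mu, sigma).cdf(x)` would give the same number directly.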
Reading the Φ table: symmetry and intervals
The table gives \(\Phi(z)\) for \(z\ge0\) (rows = units+tenths, columns = hundredths). Some anchor values:
\[ \Phi(0)=0.5,\ \ \Phi(1)=0.8413,\ \ \Phi(1.645)=0.95,\ \ \Phi(1.96)=0.975,\ \ \Phi(2)=0.9772,\ \ \Phi(2.576)=0.995. \]
Negative z — use symmetry. The table stops at \(0\); for negatives use the bell's symmetry:
\[ \Phi(-z)=1-\Phi(z). \]
(E.g. \(\Phi(-1)=1-0.8413=0.1587\).)
Intervals. Standardise both ends and subtract:
\[ P(a\le X\le b)=\Phi\!\left(\frac{b-\mu}{\sigma}\right)-\Phi\!\left(\frac{a-\mu}{\sigma}\right). \]
"Greater than". \(P(X>x)=1-\Phi\!\big(\frac{x-\mu}{\sigma}\big)\).
Inverse problems: percentiles and critical values
Sometimes you're given a probability and asked for the value — "what score is the top 5%?", "find the 95th percentile". Reverse the steps:
- Translate to a cumulative probability \(p=P(X\le x)\).
- Find \(z\) with \(\Phi(z)=p\) by reading the table backwards.
- Un-standardise: \[ x=\mu+z\,\sigma. \]
Critical values \(z_\alpha\) (the \(z\) with \(P(Z>z_\alpha)=\alpha\), i.e. \(\Phi(z_\alpha)=1-\alpha\)) recur constantly later:
\[ z_{0.05}=1.645,\qquad z_{0.025}=1.96,\qquad z_{0.005}=2.576. \]
Memorise these three — they are the backbone of \(90\%,95\%,99\%\) confidence intervals and two-sided tests.
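Reading the table backwards corresponds to an inverse-cdf call. The sketch below uses `NormalDist.inv_cdf` from the standard library to recover the three critical values and one percentile; the function names are illustrative, and the \(N(510,92^2)\) parameters are borrowed from the GRE exercise in this chapter.

```python
from statistics import NormalDist

Z = NormalDist()   # standard normal

def critical_value(alpha):
    """z_alpha with P(Z > z_alpha) = alpha, i.e. Phi(z_alpha) = 1 - alpha."""
    return Z.inv_cdf(1 - alpha)

def percentile(p, mu, sigma):
    """x with P(X <= x) = p for X ~ N(mu, sigma^2): un-standardise z."""
    return mu + Z.inv_cdf(p) * sigma

for alpha in (0.05, 0.025, 0.005):
    print(alpha, round(critical_value(alpha), 3))

print(round(percentile(0.95, 510, 92), 1))   # top-5% cutoff
```

Software values carry more decimals than the table, so answers can differ from hand computations in the last digit.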
The 68–95–99.7 rule
A quick mental picture of how much probability sits within \(1,2,3\) standard deviations of the mean:
\[ P(|Z|\le1)\approx0.68,\quad P(|Z|\le2)\approx0.95,\quad P(|Z|\le3)\approx0.997. \]
Each follows from \(P(|Z|\le k)=2\Phi(k)-1\): \(2(0.8413)-1=0.6826\), \(2(0.9772)-1=0.9544\), \(2(0.99865)-1=0.9973\). Useful for sanity checks: a value \(3\sigma\) from the mean is rare (\(\sim0.3\%\) in the tails).
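The three percentages can be reproduced in a few lines. A small sketch, assuming only the standard-library `NormalDist`; the helper name `within_k_sd` is illustrative.

```python
from statistics import NormalDist

phi = NormalDist().cdf   # standard normal cdf

def within_k_sd(k):
    """P(|Z| <= k) = 2 * Phi(k) - 1 for the standard normal."""
    return 2 * phi(k) - 1

for k in (1, 2, 3):
    print(k, round(within_k_sd(k), 4))

# Symmetry check: Phi(-z) = 1 - Phi(z)
print(abs(phi(-1.5) - (1 - phi(1.5))) < 1e-12)
```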
Linear transforms and sums of independent normals
Normals stay normal under the operations you care about — you only need to track the new mean and variance.
Linear transform: if \(X\sim N(\mu,\sigma^2)\) then \[ aX+b\sim N\!\big(a\mu+b,\ a^2\sigma^2\big). \] (Standardisation itself is the case \(a=1/\sigma,\ b=-\mu/\sigma\).)
Sum/difference of independent normals: means add, and (independence!) variances add — even for a difference:
\[ X\pm Y\sim N\!\big(\mu_X\pm\mu_Y,\ \sigma_X^2+\sigma_Y^2\big). \]
Example. A woman's apple consumption \(W\sim N(19.9,3.2^2)\), a man's \(M\sim N(20.7,3.4^2)\), independent. Then \(D=W-M\sim N(19.9-20.7,\ 3.2^2+3.4^2)=N(-0.8,\ 21.8)\), \(\sigma_D=\sqrt{21.8}\approx4.67\). So \(P(W>M)=P(D>0)=P\!\big(Z>\frac{0-(-0.8)}{4.67}\big)=P(Z>0.17)\approx0.43\).
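The apple example can be reproduced as follows. This is a sketch under the stated independence assumption; the function name is illustrative.

```python
from math import sqrt
from statistics import NormalDist

def diff_of_independent_normals(mu_x, sd_x, mu_y, sd_y):
    """Distribution of X - Y for independent normals: means subtract, variances add."""
    return NormalDist(mu_x - mu_y, sqrt(sd_x ** 2 + sd_y ** 2))

D = diff_of_independent_normals(19.9, 3.2, 20.7, 3.4)   # D = W - M
p_w_beats_m = 1 - D.cdf(0)                               # P(D > 0)

print(round(D.mean, 1), round(D.variance, 1), round(p_w_beats_m, 2))
```

Note that the standard deviations are squared and added before taking the square root; adding the standard deviations themselves would be wrong.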
Recap — the normal toolkit
- Standardise: \(Z=\frac{X-\mu}{\sigma}\); then \(P(X\le x)=\Phi\big(\frac{x-\mu}{\sigma}\big)\).
- Symmetry: \(\Phi(-z)=1-\Phi(z)\); greater-than: \(P(X>x)=1-\Phi(\cdot)\); interval: difference of two \(\Phi\).
- Inverse: probability → \(z\) (table backwards) → \(x=\mu+z\sigma\).
- Critical values: \(z_{0.05}=1.645,\ z_{0.025}=1.96,\ z_{0.005}=2.576\).
- Transforms/sums: \(aX+b\sim N(a\mu+b,a^2\sigma^2)\); independent \(X\pm Y\sim N(\mu_X\pm\mu_Y,\sigma_X^2+\sigma_Y^2)\).
Exam reflex: "probability that X is below/above/between" → standardise + \(\Phi\). "value such that top/bottom p%" → inverse (table backwards, then \(x=\mu+z\sigma\)). "find \(\mu\) or \(\sigma\) given a probability/percentile" → set up the standardised equation and solve.
Basic normal probabilities: standardise and look up
\(X\sim N(3,16)\) (so \(\mu=3,\ \sigma=4\)). Find (a) \(P(X<11)\), (b) \(P(X>-1)\), (c) \(P(2<X<7)\).
Standardise each bound with \(z=\frac{x-3}{4}\); use symmetry for negatives.
\(z=\frac{11-3}{4}=2\Rightarrow P(X<11)=\Phi(2)=0.9772.\)
\(z=\frac{-1-3}{4}=-1\Rightarrow P(X>-1)=P(Z>-1)=\Phi(1)=0.8413.\)
\(z\) from \(-0.25\) to \(1\): \(\Phi(1)-\Phi(-0.25)=0.8413-(1-0.5987)=0.8413-0.4013=0.4400.\)
IQ scores: three probabilities
IQs are \(X\sim N(100,225)\) (so \(\sigma=15\)). Find the proportion of students with IQ (a) below 90, (b) above 145, (c) between 120 and 140. Use \(\Phi(0.67)=0.7486,\ \Phi(1.33)=0.9082,\ \Phi(2.67)=0.9962,\ \Phi(3)=0.9987\).
Standardise with \(z=\frac{x-100}{15}\); symmetry for the negative one.
\(z=\frac{90-100}{15}=-0.67\Rightarrow P=\Phi(-0.67)=1-0.7486=0.2514.\)
\(z=\frac{145-100}{15}=3\Rightarrow P(X>145)=1-\Phi(3)=1-0.9987=0.0013.\)
\(z\) from \(1.33\) to \(2.67\): \(\Phi(2.67)-\Phi(1.33)=0.9962-0.9082=0.0880.\)
Inverse normal: GRE cutoff scores for the top p%
GRE quantitative scores are \(X\sim N(510,92^2)\). What score puts you in the top (a) 10%, (b) 5%, (c) 1%? Use \(z_{0.10}=1.28,\ z_{0.05}=1.645,\ z_{0.01}=2.33\).
Top \(p\%\) means \(\Phi(z)=1-p\); then \(x=\mu+z\sigma=510+92z\).
\(z=1.28\Rightarrow x=510+1.28(92)=510+117.76=627.76.\)
\(z=1.645\Rightarrow x=510+1.645(92)=510+151.34=661.34.\)
\(z=2.33\Rightarrow x=510+2.33(92)=510+214.36=724.36.\)
Find σ from a probability interval
Heights of 12-year-olds are \(N(150,\sigma^2)\), and \(95\%\) lie between \(140\) and \(160\) cm. Find \(\sigma\).
The interval is symmetric about the mean 150, so \(2\Phi(10/\sigma)-1=0.95\).
\(P(140\le X\le160)=2\Phi\!\big(\tfrac{10}{\sigma}\big)-1=0.95\Rightarrow\Phi\!\big(\tfrac{10}{\sigma}\big)=0.975.\)
\(\tfrac{10}{\sigma}=1.96\Rightarrow\sigma=\tfrac{10}{1.96}\approx5.10\) cm.
Recover σ² from a moment, then a probability
\(X\) is normal with \(E[X]=2\) and \(E[X(X-1)]=6\). Find \(\operatorname{Var}(X)\) and \(P(X\le4)\). Use \(\Phi(1)=0.8413\).
Expand \(E[X(X-1)]=E[X^2]-E[X]\) to get \(E[X^2]\), then \(\operatorname{Var}=E[X^2]-(E[X])^2\).
\(E[X(X-1)]=E[X^2]-E[X]=6\Rightarrow E[X^2]=6+2=8.\)
\(\operatorname{Var}(X)=8-2^2=4\), so \(\sigma=2\) and \(X\sim N(2,4)\).
\(P(X\le4)=\Phi\!\big(\tfrac{4-2}{2}\big)=\Phi(1)=0.8413.\)
From two percentiles to μ and σ
A normal variable has 25th percentile \(=3.0\) and 75th percentile \(=7.0\). Find its mean and standard deviation. Use \(\Phi(0.674)=0.75\).
By symmetry the mean is the midpoint. Then use one percentile to get \(\sigma\).
Symmetric quartiles \(\Rightarrow\mu=\tfrac{3+7}{2}=5.\)
\(P(X\le7)=0.75\Rightarrow\Phi\!\big(\tfrac{7-5}{\sigma}\big)=0.75\Rightarrow\tfrac{2}{\sigma}=0.674\Rightarrow\sigma\approx2.97.\)
Comparing two normal strategies
Strategy A gives a return \(\sim N(100,100^2)\); strategy B gives \(\sim N(60,30^2)\). It matters that the return is at least 50. Which strategy maximises \(P(\text{return}\ge50)\)? Use \(\Phi(0.5)=0.6915,\ \Phi(0.33)=0.6293\).
Compute \(P(\text{return}\ge50)\) for each by standardising; bigger wins.
\(z=\frac{50-100}{100}=-0.5\Rightarrow P(R_A\ge50)=\Phi(0.5)=0.6915.\)
\(z=\frac{50-60}{30}=-\tfrac13\Rightarrow P(R_B\ge50)=\Phi(0.33)=0.6293.\)
\(0.6915>0.6293\Rightarrow\) choose Strategy A.
CLT & Sampling Distributions
Sums and counts of many i.i.d. units go Normal: standardize the total and read Φ.
The Central Limit Theorem
Take \(n\) independent, identically distributed units \(X_1,\dots,X_n\), each with mean \(\mu\) and variance \(\sigma^2\). For large \(n\) (rule of thumb \(n\ge30\)) their sum and mean are approximately Normal:
\[ S_n=\sum_{i=1}^n X_i \;\approx\; N\big(n\mu,\;n\sigma^2\big), \qquad \bar X \;\approx\; N\!\left(\mu,\;\frac{\sigma^2}{n}\right). \]The shape of the individual \(X_i\) does not matter — only \(\mu\), \(\sigma^2\), and \(n\). This is what lets you answer "probability that the total exceeds a threshold" without knowing the per-unit distribution.
The standardization template
Almost every CLT exam problem is: many i.i.d. units, per-unit \(\mu\) and \(\sigma\) given, asked the probability a total crosses a threshold \(s\). The recipe:
- Total parameters: mean \(n\mu\), variance \(n\sigma^2\), sd \(\sigma\sqrt n\).
- Standardize: \( Z=\dfrac{s-n\mu}{\sigma\sqrt n} \).
- Read \(\Phi\): \( P(S_n\le s)=\Phi(Z) \), \( P(S_n\ge s)=1-\Phi(Z) \).
Watch the direction ("sufficient to cover" / "exceed" → upper tail). A standardized value \(|Z|\gtrsim4\) means the probability is effectively 0 or 1.
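The three-step recipe translates directly into a short function. The sketch below is one possible way to write it (the name `clt_sum_upper_tail` is made up for this example); the inputs are the battery and paint numbers used in the exercises of this chapter.

```python
from math import sqrt
from statistics import NormalDist

def clt_sum_upper_tail(s, n, mu, sigma):
    """Approximate P(S_n >= s) for a sum of n i.i.d. units with mean mu and sd sigma."""
    z = (s - n * mu) / (sigma * sqrt(n))     # standardise the threshold
    return z, 1 - NormalDist().cdf(z)        # upper tail of the standard normal

# 36 units, per-unit mean 10 and sd 1, threshold 365.
z_bat, p_bat = clt_sum_upper_tail(365, 36, 10, 1)
print(round(z_bat, 2), round(p_bat, 4))

# 40 units, per-unit mean 52 and sd 3, threshold 2260.
z_paint, p_paint = clt_sum_upper_tail(2260, 40, 52, 3)
print(round(z_paint, 2), p_paint)
```

For a lower-tail question, return `NormalDist().cdf(z)` instead of its complement.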
Binomial → Normal approximation
A Binomial count is a sum of \(n\) Bernoulli's, so for large \(n\) the CLT gives
\[ X\sim\mathrm{Bin}(n,p)\;\approx\;N\big(np,\;np(1-p)\big). \]Standardize with \(\mu=np\), \(\sigma=\sqrt{np(1-p)}\): \( P(X\le k)\approx\Phi\!\big(\tfrac{k-np}{\sqrt{np(1-p)}}\big) \).
Continuity correction. Because \(X\) is discrete, the more accurate version replaces \(k\) with \(k+\tfrac12\) (for \(\le\)) or \(k-\tfrac12\) (for \(\ge\)). Exams often say "without (half) continuity correction" — then use \(k\) as-is. Always check the wording.
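To see what the continuity correction buys, one can compare both approximations with the exact Binomial sum. The sketch below does this for a fair coin thrown 100 times and \(k=60\); the function names are illustrative.

```python
from math import comb, sqrt
from statistics import NormalDist

def binom_cdf(k, n, p):
    """Exact P(X <= k) for X ~ Bin(n, p)."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k + 1))

def normal_approx_cdf(k, n, p, correction=False):
    """Normal approximation to P(X <= k), optionally with the +1/2 continuity correction."""
    mu, sd = n * p, sqrt(n * p * (1 - p))
    x = k + 0.5 if correction else k
    return NormalDist(mu, sd).cdf(x)

n, p, k = 100, 0.5, 60
exact = binom_cdf(k, n, p)
plain = normal_approx_cdf(k, n, p)
corrected = normal_approx_cdf(k, n, p, correction=True)

print(round(exact, 4), round(plain, 4), round(corrected, 4))
```

In this case the corrected value lands closer to the exact probability, which is the reason the correction exists; exam wording still decides which version to report.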
Recognition guide
| Wording | Tool |
|---|---|
| "\(n\) (≥30) i.i.d. units, per-unit μ and σ, prob the TOTAL exceeds …" | CLT on the sum: \(N(n\mu,n\sigma^2)\) |
| "average of \(n\) measurements is within … of μ" | CLT on the mean: \(N(\mu,\sigma^2/n)\) |
| "coin thrown \(n\) times / count of successes", large \(n\), "normal approximation" | Binomial→Normal: \(N(np,np(1-p))\) |
| "without half correction" | use \(k\) as-is, no \(\pm\tfrac12\) |
Will 40 cans of paint be enough?
A can of paint covers on average 52 m² with sd 3 m². You must paint 2260 m². What is the approximate probability that 40 cans suffice?
"Sufficient" means the total coverage \(S_{40}=\sum X_i\ge2260\). Use CLT on the sum: \(N(40\mu,40\sigma^2)\).
Per can \(\mu=52\), \(\sigma^2=9\). Sum of 40: mean \(40\cdot52=2080\), variance \(40\cdot9=360\), sd \(\sqrt{360}\approx18.97\).
\( Z=\dfrac{2260-2080}{\sqrt{360}}=\dfrac{180}{18.97}\approx9.49 \).
\( P(S_{40}\ge2260)=1-\Phi(9.49)\approx0 \).
Do 36 batteries last a year?
A battery lasts on average 10 days with sd 1 day; lifetimes are i.i.d. and each expired battery is replaced. Find the approximate probability that the total lifetime of 36 batteries exceeds one year (365 days).
Total lifetime \(L=\sum_{i=1}^{36}X_i\). CLT: \(L\approx N(36\mu,36\sigma^2)\). Want \(P(L>365)\).
Per battery \(\mu=10\), \(\sigma^2=1\). For 36: mean \(360\), variance \(36\), sd \(6\).
\( Z=\dfrac{365-360}{6}=\dfrac{5}{6}\approx0.83 \).
\( P(L>365)=1-\Phi(5/6)\approx1-0.7977=0.2023 \).
100 coin tosses — normal approximation
A fair coin is thrown 100 times; \(X\) = number of Heads. Using the normal approximation without half-correction, compute \(P(X\le60)\).
\(X\sim\mathrm{Bin}(100,\tfrac12)\). Approximate by \(N(np,np(1-p))\). "Without half correction" → use 60 directly.
\( np=50 \), \( np(1-p)=25 \), so \(X\approx N(50,25)\), sd \(=5\).
\( Z=\dfrac{60-50}{5}=2 \).
\( P(X\le60)\approx\Phi(2)=0.9772 \).
Estimator Theory
Bias, variance, MSE, efficiency, unbiasedness — plus MLE and method-of-moments. The estimator-algebra slice appears in 5 of 7 exams.
The estimator vocabulary
An estimator \(T\) for a parameter \(\theta\) is a function of the sample. Three numbers grade it:
\[ \mathrm{Bias}(T)=E[T]-\theta, \qquad \mathrm{Var}(T), \qquad \mathrm{MSE}(T)=E[(T-\theta)^2]=\mathrm{Var}(T)+\mathrm{Bias}(T)^2. \]Unbiased means \(E[T]=\theta\) (Bias \(=0\)), and then \(\mathrm{MSE}=\mathrm{Var}\). More efficient = smaller MSE (smaller variance, among unbiased estimators).
The algebra engine (from the discrete-RV chapter): \(E\) is linear always; for independent pieces \( \mathrm{Var}(aU+bW)=a^2\mathrm{Var}(U)+b^2\mathrm{Var}(W) \). Population moments you reuse: Bernoulli \(E=p,\mathrm{Var}=p(1-p)\); Poisson \(E=\mathrm{Var}=\lambda\); sample mean of \(m\) obs \(\mathrm{Var}(\bar X_m)=\sigma^2/m\); and \(\sigma^2=E[X^2]-\mu^2\).
Unbiasedness & finding the constant
Weights that sum to 1 → automatically unbiased for the mean. If \(T=c\bar X_1+(1-c)\bar X_2\) then \(E[T]=\mu\) for every \(c\).
Find a constant. When asked "for what \(a\) is \(T\) unbiased for \(\sigma^2\)", set \(E[T]=\theta\) and solve. The recurring move is \(E[X_i^2]=\sigma^2+\mu^2\), and for independent \(X_i,X_j\), \(E[X_iX_j]=\mu^2\). Example: \(T=\frac1n\sum X_i^2+a\) with \(\mu\) known → \(E[T]=\sigma^2+\mu^2+a=\sigma^2\Rightarrow a=-\mu^2\).
Efficiency & minimum-MSE weighting
Compare efficiency: if all candidates are unbiased, compute each variance and pick the smallest.
Optimal weight. For \(M_c=c\bar X_1+(1-c)\bar X_2\) (independent), \(\mathrm{MSE}=c^2\mathrm{Var}_1+(1-c)^2\mathrm{Var}_2\); differentiate and set to 0. The minimiser is inverse-variance weighting
\[ c^*=\frac{1/\mathrm{Var}_1}{1/\mathrm{Var}_1+1/\mathrm{Var}_2}. \]With sample means of sizes \(n_1,n_2\) (\(\mathrm{Var}_i=\sigma^2/n_i\)) this becomes \(c^*=\dfrac{n_1}{n_1+n_2}\) — weight each sample by its size. Put more weight on the more precise (larger / lower-variance) sample.
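A quick numerical check of the inverse-variance rule, as a sketch: the sample sizes 10 and 15 match the exercise later in this chapter, and \(\sigma^2=1\) is an arbitrary placeholder because it cancels in \(c^*\).

```python
def optimal_weight(var1, var2):
    """Weight c on estimator 1 minimising Var(c*T1 + (1-c)*T2), T1 and T2 independent."""
    return (1 / var1) / (1 / var1 + 1 / var2)

def combined_variance(c, var1, var2):
    """Variance of c*T1 + (1-c)*T2 for independent T1, T2."""
    return c ** 2 * var1 + (1 - c) ** 2 * var2

sigma2 = 1.0                     # placeholder population variance; it cancels in c*
n1, n2 = 10, 15
var1, var2 = sigma2 / n1, sigma2 / n2

c_star = optimal_weight(var1, var2)
print(round(c_star, 3), round(combined_variance(c_star, var1, var2), 3))
```

The minimised variance equals \(\sigma^2/(n_1+n_2)\), the variance of the mean of the pooled sample.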
Nonlinear transforms break unbiasedness (Jensen)
If \(T\) is unbiased for \(\sigma^2\), is \(\sqrt T\) unbiased for \(\sigma\)? No. For a nonlinear \(g\), \(E[g(T)]\neq g(E[T])\) in general (Jensen's inequality). Since \(\sqrt{\cdot}\) is concave, \(E[\sqrt T]\le\sqrt{E[T]}=\sigma\), with strict inequality unless \(T\) is constant → \(\sqrt T\) is biased low for \(\sigma\). Whenever an exam takes a square root, log, or reciprocal of an unbiased estimator, the answer to "still unbiased?" is no.
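A tiny exact example makes the bias visible. The setup is hypothetical and chosen only because it can be enumerated by hand: \(X_1,X_2\) are independent fair 0/1 coin flips (so \(\sigma^2=\tfrac14\)), and \(T=\tfrac12(X_1-X_2)^2\) is unbiased for \(\sigma^2\).

```python
from itertools import product
from math import sqrt

# Hypothetical setup: X1, X2 independent fair 0/1 coin flips, sigma^2 = 1/4.
# T = (X1 - X2)^2 / 2 is unbiased for sigma^2.
outcomes = list(product((0, 1), repeat=2))            # four equally likely pairs
t_values = [(x1 - x2) ** 2 / 2 for x1, x2 in outcomes]

mean_T = sum(t_values) / len(t_values)                          # E[T]
mean_sqrt_T = sum(sqrt(t) for t in t_values) / len(t_values)    # E[sqrt(T)]
sigma = sqrt(0.25)

print(mean_T)                 # equals sigma^2
print(mean_sqrt_T, sigma)     # E[sqrt(T)] falls below sigma
```

Taking the square root after averaging gives \(\sigma\); averaging the square roots gives something smaller, exactly as Jensen's inequality predicts for a concave function.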
Maximum likelihood (MLE)
Recipe: write the likelihood \(L(\theta)=\prod_i f(x_i\mid\theta)\), take \(\log\), differentiate, set to 0, solve.
\[ \ell(\theta)=\log L(\theta)=\sum_i \log f(x_i\mid\theta), \qquad \frac{d\ell}{d\theta}=0. \]Geometric example \(f(x\mid\theta)=\theta(1-\theta)^{x-1}\): \(L=\theta^n(1-\theta)^{\sum(x_i-1)}\), \(\ell=n\log\theta+\big(\sum(x_i-1)\big)\log(1-\theta)\), giving \(\hat\theta=\dfrac{n}{\sum x_i}=\dfrac1{\bar X}\). On the exam the pmf/pdf is given in the problem — you just run the recipe.
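The closed form can be sanity-checked against a brute-force search over the log-likelihood. A minimal sketch, using the sample \(3,5,1,2,4\) from the Geometric exercise in this chapter; the function names and the grid resolution are choices made for illustration.

```python
from math import log

def geometric_loglik(theta, xs):
    """Log-likelihood for f(x | theta) = theta * (1 - theta)^(x - 1), x = 1, 2, ..."""
    n = len(xs)
    return n * log(theta) + sum(x - 1 for x in xs) * log(1 - theta)

def geometric_mle(xs):
    """Closed-form maximiser: n / sum(x_i) = 1 / sample mean."""
    return len(xs) / sum(xs)

xs = [3, 5, 1, 2, 4]
theta_hat = geometric_mle(xs)

# Sanity check on a grid: no grid point beats the closed-form estimate.
grid = [i / 1000 for i in range(1, 1000)]
best_on_grid = max(grid, key=lambda th: geometric_loglik(th, xs))

print(round(theta_hat, 4), best_on_grid)
```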
Method-of-moments (MoM)
Equate the population moment to the sample moment and solve for the parameter. For one parameter, set \(E[X]=\bar X\) (using the theoretical mean as a function of \(\theta\)) and invert.
Example density \(f(x)=2x/\theta^2\) on \([0,\theta]\): \(E[X]=\int_0^\theta x\frac{2x}{\theta^2}dx=\frac{2\theta}{3}\). Set \(\bar X=\frac{2\theta}{3}\Rightarrow \hat\theta_M=\frac{3}{2}\bar X\). It is unbiased here (\(E[\hat\theta_M]=\theta\)) with \(\mathrm{MSE}=\theta^2/(8n)\to0\) — consistent. MoM and MLE can differ; MoM is usually the quicker algebra.
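The unbiasedness and the \(\theta^2/(8n)\) formula can be checked by simulation. The sketch below assumes a true value \(\theta=2\) and samples of size \(n=50\) (both hypothetical), and draws from the density by inverting its cdf \(F(x)=(x/\theta)^2\).

```python
import random
from math import sqrt
from statistics import fmean

def sample_density(theta, n, rng):
    """Draw n values from f(x) = 2x / theta^2 on [0, theta] via X = theta * sqrt(U)."""
    return [theta * sqrt(rng.random()) for _ in range(n)]

def mom_estimate(xs):
    """Method-of-moments estimate: solve mean = 2*theta/3 for theta."""
    return 1.5 * fmean(xs)

rng = random.Random(0)          # fixed seed so the run is repeatable
theta, n = 2.0, 50              # hypothetical true parameter and sample size
estimates = [mom_estimate(sample_density(theta, n, rng)) for _ in range(4000)]

avg = fmean(estimates)                                # should sit near theta
mse = fmean((e - theta) ** 2 for e in estimates)      # should sit near theta^2 / (8 n)

print(round(avg, 3), round(mse, 4), theta ** 2 / (8 * n))
```

A simulation only approximates these quantities, so expect small deviations from the exact values.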
Recognition guide
| Wording | Do |
|---|---|
| "which estimator is more efficient" (several unbiased) | compute each variance, smallest wins |
| "value of c that minimizes MSE" | differentiate MSE in c, or inverse-variance weight \(c^*\) |
| "determine constant a so T is unbiased for σ²/μ" | set \(E[T]=\theta\), use \(E[X^2]=\sigma^2+\mu^2\), solve |
| "is √T (or log, 1/T) unbiased" | No — Jensen, nonlinear ⇒ biased |
| "write the likelihood / find the MLE" | \(L=\prod f\) → log → differentiate → solve |
| "find the moments estimator" | \(E[X]=\bar X\) (as function of θ), invert |
| "Bias and MSE of T" | \(E[T]-\theta\); \(\mathrm{Var}(T)+\mathrm{Bias}^2\) |
Which of three unbiased estimators is most efficient?
\(\hat p\) is the success proportion in a Bernoulli sample of size \(n=9\); \(Y\) is one further independent observation. Consider \(T_1=\hat p\), \(T_2=\tfrac12\hat p+\tfrac12 Y\), \(T_3=\tfrac{9}{10}\hat p+\tfrac{1}{10}Y\). Which is most efficient?
Check all three are unbiased (they are), then compare variances. Use \(\mathrm{Var}(\hat p)=p(1-p)/9\), \(\mathrm{Var}(Y)=p(1-p)\).
Each is a weight-1 combination of unbiased pieces, so \(E[T_i]=p\) and MSE = Var.
\(\mathrm{Var}(T_1)=\tfrac19 p(1-p)\). \(\mathrm{Var}(T_2)=\tfrac14\cdot\tfrac{p(1-p)}{9}+\tfrac14 p(1-p)=\tfrac{10}{36}p(1-p)\). \(\mathrm{Var}(T_3)=\tfrac{81}{100}\cdot\tfrac{p(1-p)}{9}+\tfrac1{100}p(1-p)=\tfrac{1}{10}p(1-p)\).
\(\tfrac1{10}=0.100<\tfrac19\approx0.111<\tfrac{10}{36}\approx0.278\), so \(T_3\) is the most efficient. (\(T_3\) is just the sample mean of all 10 observations.)
Combine two sample means — unbiased c and optimal c
\(\bar X_{10}\) and \(\bar Y_{15}\) are means of independent samples of sizes 10 and 15 from a population with mean μ. Let \(M_c=c\bar X_{10}+(1-c)\bar Y_{15}\). (a) For which c is \(M_c\) unbiased? (b) Find the c minimizing MSE.
Weights sum to 1 → unbiased for all c. MSE \(=c^2\sigma^2/10+(1-c)^2\sigma^2/15\); differentiate.
\(E[M_c]=c\mu+(1-c)\mu=\mu\) for every c → unbiased for all c.
\(\mathrm{MSE}(M_c)=\mathrm{Var}(M_c)=\sigma^2\!\left(\tfrac{c^2}{10}+\tfrac{(1-c)^2}{15}\right)\).
\(\dfrac{d}{dc}=\sigma^2\!\left(\tfrac{2c}{10}-\tfrac{2(1-c)}{15}\right)=\dfrac{\sigma^2}{15}(5c-2)=0\Rightarrow c=\tfrac25\). (Matches \(c^*=\tfrac{n_1}{n_1+n_2}=\tfrac{10}{25}\).)
Constant for an unbiased variance estimator, and √T
Sample \((X_1,\dots,X_n)\) from a population with known mean \(\mu=3\) and unknown variance \(\sigma^2\). (a) Find \(a\) so that \(T=\frac1n\sum_{i=1}^n X_i^2+a\) is unbiased for \(\sigma^2\). (b) Is \(\sqrt T\) unbiased for \(\sigma\)?
(a) \(E[X_i^2]=\sigma^2+\mu^2=\sigma^2+9\). (b) Think about \(E[\sqrt T]\) vs \(\sqrt{E[T]}\).
\(E[T]=E[X_i^2]+a=(\sigma^2+9)+a\). Unbiased ⇒ \(\sigma^2+9+a=\sigma^2\Rightarrow a=-9\).
We'd need \(E[\sqrt T]=\sigma=\sqrt{E[T]}\). But \(\sqrt{\cdot}\) is concave, so by Jensen \(E[\sqrt T]<\sqrt{E[T]}=\sigma\) (strict unless T is constant). So \(\sqrt T\) is not unbiased: on average it underestimates \(\sigma\).
Make a(X₁−Xₙ)² unbiased for σ²
Sample from a population with known mean \(\mu=3\), unknown variance \(\sigma^2\). Find the constant \(a\) so that \(T=a(X_1-X_n)^2\) is unbiased for \(\sigma^2\).
Expand the square; use \(E[X_i^2]=\sigma^2+9\) and \(E[X_1X_n]=\mu^2=9\) (independence).
\(E[(X_1-X_n)^2]=E[X_1^2]+E[X_n^2]-2E[X_1X_n]\).
\(=(\sigma^2+9)+(\sigma^2+9)-2(9)=2\sigma^2\).
\(E[T]=a\cdot2\sigma^2=\sigma^2\Rightarrow a=\tfrac12\).
Bias and MSE of a Poisson estimator
Sample \((X_1,\dots,X_n)\) from Poisson(λ). For \(T=\tfrac12\!\left(\dfrac{X_1+\cdots+X_{n-1}}{n-1}+X_n\right)\), determine the bias and the MSE.
Write \(T=\tfrac12\bar X_{n-1}+\tfrac12 X_n\). Poisson: \(E=\mathrm{Var}=\lambda\); \(\mathrm{Var}(\bar X_{n-1})=\lambda/(n-1)\).
\(E[T]=\tfrac12\lambda+\tfrac12\lambda=\lambda\Rightarrow\mathrm{Bias}=0\).
\(\mathrm{Var}(T)=\tfrac14\cdot\dfrac{\lambda}{n-1}+\tfrac14\lambda=\dfrac{\lambda}{4}\!\left(\dfrac{1}{n-1}+1\right)=\dfrac{\lambda}{4}\cdot\dfrac{n}{n-1}\).
Unbiased ⇒ \(\mathrm{MSE}=\mathrm{Var}=\dfrac{n\lambda}{4(n-1)}\).
Geometric MLE
Sample of \(n\) observations from \(f(x\mid\theta)=\theta(1-\theta)^{x-1}\), \(x=1,2,\dots\), \(\theta\in(0,1)\). (a) Write the likelihood. (b) Find the MLE. (c) For the sample \(3,5,1,2,4\), give the estimate.
\(L=\prod f\), then log, differentiate, set 0. The sum of exponents is \(\sum(x_i-1)\).
\(L(\theta)=\prod_{i=1}^n\theta(1-\theta)^{x_i-1}=\theta^n(1-\theta)^{\sum(x_i-1)}\).
\(\ell=n\log\theta+\big(\sum(x_i-1)\big)\log(1-\theta)\); \(\dfrac{d\ell}{d\theta}=\dfrac{n}{\theta}-\dfrac{\sum(x_i-1)}{1-\theta}=0\Rightarrow\hat\theta=\dfrac{n}{\sum x_i}=\dfrac1{\bar X}\).
\(\bar X=(3+5+1+2+4)/5=3\Rightarrow\hat\theta=1/3\).
Method-of-moments for a triangular density
Sample from \(f(x;\theta)=\dfrac{2x}{\theta^2}\) for \(x\in[0,\theta]\) (0 otherwise), \(\theta>0\). (a) Find \(E[X]\) and \(\mathrm{Var}(X)\). (b) Find the method-of-moments estimator of \(\theta\). (c) Its bias and MSE; behaviour as \(n\to\infty\).
\(E[X]=\int_0^\theta x\frac{2x}{\theta^2}dx\). MoM: set \(\bar X=E[X]\) and invert.
\(E[X]=\dfrac{2}{\theta^2}\!\int_0^\theta x^2dx=\dfrac{2\theta}{3}\); \(E[X^2]=\dfrac{2}{\theta^2}\!\int_0^\theta x^3dx=\dfrac{\theta^2}{2}\); \(\mathrm{Var}(X)=\dfrac{\theta^2}{2}-\dfrac{4\theta^2}{9}=\dfrac{\theta^2}{18}\).
Set \(\bar X=E[X]=\dfrac{2\theta}{3}\Rightarrow\hat\theta_M=\dfrac32\bar X\).
\(E[\hat\theta_M]=\tfrac32\cdot\tfrac{2\theta}{3}=\theta\) → unbiased. \(\mathrm{Var}(\hat\theta_M)=\tfrac94\cdot\dfrac{\mathrm{Var}(X)}{n}=\tfrac94\cdot\dfrac{\theta^2}{18n}=\dfrac{\theta^2}{8n}\). So \(\mathrm{MSE}=\dfrac{\theta^2}{8n}\to0\).
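The unbiasedness and the variance \(\theta^2/(8n)\) can be sanity-checked by simulation. This is a rough sketch, not a proof: it assumes inverse-CDF sampling (\(F(x)=x^2/\theta^2\), so \(X=\theta\sqrt U\)), and the names `draw`, `mom_estimate`, `simulate` and the values \(\theta=2\), \(n=50\) are chosen here for illustration.

```python
import random
import statistics

def draw(theta, rng):
    """One draw from f(x) = 2x / theta**2 on [0, theta] by inverse CDF.

    F(x) = (x / theta)**2, so X = theta * sqrt(U) with U ~ Uniform(0, 1).
    """
    return theta * rng.random() ** 0.5

def mom_estimate(sample):
    """Method-of-moments estimator: theta_hat = (3/2) * sample mean."""
    return 1.5 * statistics.fmean(sample)

def simulate(theta=2.0, n=50, reps=4000, seed=0):
    """Empirical mean and variance of theta_hat over `reps` simulated samples."""
    rng = random.Random(seed)
    estimates = [mom_estimate([draw(theta, rng) for _ in range(n)])
                 for _ in range(reps)]
    return statistics.fmean(estimates), statistics.pvariance(estimates)

mean_hat, var_hat = simulate()
# Theory predicts values close to theta = 2 and theta**2 / (8 * n) = 0.01.
print(mean_hat, var_hat)
```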
Confidence Intervals & Sample Size
Build an interval, back-solve the sample size, recover n from a published CI, handle asymmetric tails — appears in all 7 past exams.
The CI machinery
Every confidence interval is estimate ± margin, the margin being a critical value times a standard error.
\[ \text{mean (known }\sigma\text{ / large }n): \quad \bar x\pm z_{\alpha/2}\,\frac{\sigma}{\sqrt n}\quad(\text{use }s\text{ if }\sigma\text{ unknown, large }n). \]\[ \text{proportion}: \quad \hat p\pm z_{\alpha/2}\,\sqrt{\frac{\hat p(1-\hat p)}{n}}. \]Critical values: \(z_{0.025}=1.96\) (95%), \(z_{0.005}=2.576\) (99%), \(z_{0.05}=1.645\) (90%). The total length of a symmetric CI is twice the margin, \(2z_{\alpha/2}\cdot\text{SE}\).
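A minimal Python sketch of the two intervals above, assuming the standard library's `statistics.NormalDist` for the critical value. The function names `z_crit`, `mean_ci` and `proportion_ci` are illustrative, not from the course material. Exact quantiles (1.95996 rather than 1.96) differ slightly from the rounded table values, so the last digits may not match a hand computation.

```python
from math import sqrt
from statistics import NormalDist

def z_crit(tail):
    """Upper-tail critical value z_tail, i.e. P(Z >= z_tail) = tail."""
    return NormalDist().inv_cdf(1 - tail)

def mean_ci(xbar, sd, n, conf=0.95):
    """z interval for a mean: xbar +/- z_{alpha/2} * sd / sqrt(n)."""
    margin = z_crit((1 - conf) / 2) * sd / sqrt(n)
    return xbar - margin, xbar + margin

def proportion_ci(p_hat, n, conf=0.95):
    """z interval for a proportion: p_hat +/- z_{alpha/2} * sqrt(p_hat(1-p_hat)/n)."""
    margin = z_crit((1 - conf) / 2) * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

# Illustrative inputs: x_bar = 28400, sigma = 3600, n = 81 at 95%.
print(mean_ci(28400, 3600, 81))  # roughly (27616, 29184)
```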
Sample size: back-solve from a target precision
Given a target on the interval (total length \(L\), or half-width / accuracy \(d=L/2\)), set the length condition and solve for \(n\), then round up.
\[ \text{mean}: \; 2z_{\alpha/2}\frac{\sigma}{\sqrt n}\le L \;\Rightarrow\; n\ge\left(\frac{2z_{\alpha/2}\sigma}{L}\right)^2. \]\[ \text{proportion}: \; n\ge\frac{z_{\alpha/2}^2\,p(1-p)}{d^2}, \quad\text{worst case }p(1-p)=\tfrac14. \]When \(\hat p\) is unknown (sample not yet taken), use the worst case \(p(1-p)\le\frac14\) — it guarantees the precision whatever the true \(p\).
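The two back-solve formulas can be sketched in Python as follows. The function names are illustrative, and the critical value `z` is passed in explicitly so that the rounded table values used in this volume (1.645, 1.96, 2.576) reproduce the hand answers.

```python
from math import ceil

def n_for_mean(z, sigma, length):
    """Smallest n whose CI has total length 2*z*sigma/sqrt(n) <= length."""
    return ceil((2 * z * sigma / length) ** 2)

def n_for_proportion(z, half_width, p=0.5):
    """Smallest n whose margin z*sqrt(p(1-p)/n) <= half_width.

    p = 0.5 is the worst case p(1-p) = 1/4, used when no estimate is available.
    """
    return ceil(z ** 2 * p * (1 - p) / half_width ** 2)

print(n_for_mean(2.576, 3600, 1568))   # 140
print(n_for_proportion(1.96, 0.05))    # 385
print(n_for_proportion(1.645, 0.02))   # 1692
```

Borderline cases are sensitive to the rounding of z: with the unrounded 90% quantile (about 1.64485) the last call would give 1691 instead of 1692, which is why the table value is kept as an explicit argument here.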
Recover n from a published interval
If a poll reports "with \(C\%\) confidence, support is between \(a\) and \(b\)", read off \(\hat p=\tfrac{a+b}{2}\) and half-width \(d=\tfrac{b-a}{2}\), then invert the margin:
\[ z_{\alpha/2}\sqrt{\frac{\hat p(1-\hat p)}{n}}=d \;\Rightarrow\; n=\frac{z_{\alpha/2}^2\,\hat p(1-\hat p)}{d^2}. \]E.g. \((48\%,52\%)\) at 90%: \(\hat p=0.5\), \(d=0.02\), \(n=1.645^2(0.25)/0.02^2\approx1691\).
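The inversion can be written as a short Python sketch. The function name `recover_n` is illustrative, and `z` is the table critical value for the stated confidence level.

```python
def recover_n(lower, upper, z):
    """Sample size implied by a published proportion CI (lower, upper).

    Reads p_hat as the midpoint and d as the half-width, then inverts
    z * sqrt(p_hat * (1 - p_hat) / n) = d for n.
    """
    p_hat = (lower + upper) / 2
    d = (upper - lower) / 2
    return z ** 2 * p_hat * (1 - p_hat) / d ** 2

print(round(recover_n(0.48, 0.52, 1.645)))  # 1691
```

Here the result is rounded to the nearest integer rather than up, because the aim is to estimate the sample size that was actually used, not to guarantee a precision.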
Asymmetric tails (non-standard split)
A "95% CI" normally splits \(\alpha=0.05\) as \(2.5\%\) per tail. If the problem asks for an unequal split — say \(3\%\) in the lower tail, \(2\%\) in the upper — use a different z on each side:
\[ \Big(\bar x - z_{0.03}\tfrac{\sigma}{\sqrt n}, \; \bar x + z_{0.02}\tfrac{\sigma}{\sqrt n}\Big), \quad z_{0.03}=1.88,\; z_{0.02}=2.055. \]The total tail probability still sums to \(\alpha\); only the per-side allocation changed, so the interval is no longer symmetric about \(\bar x\).
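A minimal Python sketch of the unequal split, following the convention in the formula above (the lower limit uses the lower-tail z, the upper limit the upper-tail z). The function name `asymmetric_ci` and the inputs (\(\bar x=3.5\), \(\sigma=1\), \(n=100\)) are illustrative. It uses exact normal quantiles from `statistics.NormalDist`, so the upper limit differs in the fourth decimal from a table-based answer.

```python
from math import sqrt
from statistics import NormalDist

def asymmetric_ci(xbar, sigma, n, lower_tail, upper_tail):
    """CI for a mean with known sigma and an unequal split of alpha.

    The lower limit uses z_{lower_tail}; the upper limit uses z_{upper_tail}.
    """
    std_normal = NormalDist()
    se = sigma / sqrt(n)
    lo = xbar - std_normal.inv_cdf(1 - lower_tail) * se
    hi = xbar + std_normal.inv_cdf(1 - upper_tail) * se
    return lo, hi

lo, hi = asymmetric_ci(3.5, 1.0, 100, lower_tail=0.03, upper_tail=0.02)
print(round(lo, 3), round(hi, 3))  # 3.312 3.705
```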
z or t?
Default to z: known σ, or large \(n\) (≥30) so the CLT applies and \(s\) is a good plug-in. Use t\(_{n-1}\) only for a small sample from a normal population with unknown σ. For \(n=100\) the two barely differ (e.g. \(z_{0.025}=1.96\) vs \(t_{99,0.025}\approx1.98\)) and either is acceptable — the exams lean on z.
Recognition guide
| Wording | Do |
|---|---|
| "compute a C% CI for the mean/proportion" | estimate ± \(z_{\alpha/2}\)·SE |
| "how large a sample", "length less than L", "accurate to within ±d" | back-solve \(n\), round up; prop → use ¼ |
| "support between a% and b%, find n" | \(\hat p\)=mid, \(d\)=half-width, invert |
| "lower tail 3%, upper tail 2%" | different z each side (asymmetric) |
| small n, normal, σ unknown | t\(_{n-1}\); else z |
99% CI for mean toothbrush purchases
In a sample of \(n=100\), the number of toothbrushes bought per year has mean \(\bar x=0.9\) and sd \(s=0.2\). Compute an approximate 99% confidence interval for the population mean.
Large n → z CI with s. 99% → \(z_{0.005}=2.576\). SE \(=0.2/\sqrt{100}=0.02\).
\(z_{0.005}\,s/\sqrt n=2.576\cdot0.02=0.05152\).
\(0.9\pm0.05152=(0.8485,\,0.9515)\).
Tire lifetimes — CI then required sample size
Tire lifetimes are normal with known \(\sigma=3600\) miles. A sample of \(n=81\) gave \(\bar x=28400\). (a) Build a 95% CI for the mean. (b) How large a sample gives a 99% CI shorter than the interval in (a)?
(a) \(z_{0.025}=1.96\), \(\sqrt{81}=9\). (b) Set the 99% length \(\le\) the (a) length, solve for n, round up.
\(1.96\cdot3600/9=1.96\cdot400=784\); CI \(=28400\pm784=(27616,29184)\), length \(1568\).
99% length \(=2\cdot2.576\cdot3600/\sqrt n=18547.2/\sqrt n\le1568\).
\(\sqrt n\ge18547.2/1568=11.83\Rightarrow n\ge139.9\Rightarrow n\ge140\).
Sample size for a proportion (length < 0.1)
Estimate the percentage of spaghetti eaters who use parmigiano. With \(\hat p\) unknown, what sample size gives a 95% CI of total length less than 0.1?
\(2\cdot1.96\sqrt{p(1-p)/n}<0.1\); use the worst case \(p(1-p)=\tfrac14\).
\(2\cdot1.96\sqrt{p(1-p)/n}<0.1\Rightarrow n>\dfrac{4\cdot1.96^2}{0.1^2}p(1-p)\).
Use \(p(1-p)=\tfrac14\): \(n>\dfrac{4\cdot3.8416}{0.01}\cdot\tfrac14=384.16\).
\(n\ge385\).
Super Bowl poll — sample size within ±0.02
How large a sample is needed to be 90% confident that the estimated proportion of households watching is accurate to within ±0.02?
Half-width \(d=0.02\), 90% → \(z=1.645\), worst case \(p(1-p)=\tfrac14\).
\(n\ge\dfrac{z_{0.05}^2\,p(1-p)}{d^2}=\dfrac{1.645^2\cdot\tfrac14}{0.02^2}\).
\(=\dfrac{2.706\cdot0.25}{0.0004}=1691.3\).
\(n\ge1692\).
Recover the sample size from a published CI
A poll states: "with 90% confidence, the minister's support is between 48% and 52%". Using the standard proportion-CI formula, how large was the sample?
\(\hat p=0.5\) (midpoint), half-width \(d=0.02\). Invert \(z\sqrt{\hat p(1-\hat p)/n}=d\).
\(\hat p=(0.48+0.52)/2=0.5\); margin \(=0.02\); 90% → \(z=1.645\).
\(1.645\sqrt{0.25/n}=0.02\Rightarrow n=\dfrac{1.645^2\cdot0.25}{0.02^2}=1691.3\), so the sample had about 1691 respondents.
Confidence interval with asymmetric tails
\(n=100\) observations, normal with known \(\sigma=1\), \(\bar x=3.5\). Build a 95% CI but with 3% probability in the lower tail and 2% in the upper tail.
Different z each side: \(z_{0.03}=1.88\) (lower), \(z_{0.02}=2.055\) (upper). \(\sigma/\sqrt n=0.1\).
\(3.5-1.88\cdot0.1=3.5-0.188=3.312\).
\(3.5+2.055\cdot0.1=3.5+0.2055=3.7055\), so the interval is \((3.312,\,3.7055)\).
Large-sample CI — z or t
\(n=100\) measurements give mean \(1.0\) and sample sd \(2.0\). Compute a 95% CI for the true value.
Large n → z is the standard choice (\(z_{0.025}=1.96\)); t with df 99 (≈1.98) gives almost the same answer. SE \(=2/\sqrt{100}=0.2\).
\(1\pm1.96\cdot0.2=1\pm0.392=(0.608,1.392)\).
Using \(t_{99,0.025}\approx1.98\): \(1\pm1.98\cdot0.2=1\pm0.396=(0.604,1.396)\).
Hypothesis Testing
One skeleton, five standard-error variants: one/two mean, one/two proportion, paired. Appears in all 7 past exams.
The test skeleton
Every test is the same four moves; only the standard error changes.
- State \(H_0,H_1\) — the direction comes from the wording (below).
- Pick the statistic and its null distribution: \( \text{TS}=\dfrac{\text{estimate}-\text{null}}{\text{SE}} \), \(\sim N(0,1)\) (large n / known σ) or \(t_{df}\) (small n, normal).
- Compute the observed value \(ts_{obs}\).
- Decide: reject if \(ts_{obs}\) falls in the rejection region — two-sided \(|ts|\ge z_{\alpha/2}\), one-sided \(ts\ge z_\alpha\) (or \(\le-z_\alpha\)). Or report the p-value and reject when \(\text{p-value}\le\alpha\).
Default to z (large samples / known σ); the proportion and Poisson tests of §8.6–8.7 are asymptotically normal too. Use t only for small-n normal data (and the paired test).
The standard-error table
| Test | Statistic | Null dist. |
|---|---|---|
| 1-mean, known σ / large n | \((\bar x-\mu_0)/(\sigma/\sqrt n)\) | N(0,1) |
| 1-mean, small n | \((\bar x-\mu_0)/(s/\sqrt n)\) | \(t_{n-1}\) |
| paired | \((\bar d-0)/(s_d/\sqrt n)\) on \(d_i=\)after−before (fix one sign convention and orient \(H_1\) to match) | \(t_{n-1}\) |
| 2-mean, large n | \((\bar x_1-\bar x_2)/\sqrt{s_1^2/n_1+s_2^2/n_2}\) | N(0,1) |
| 1-proportion | \((\hat p-p_0)/\sqrt{p_0(1-p_0)/n}\) | N(0,1) |
| 2-proportion (pooled) | \((\hat p_1-\hat p_2)/\sqrt{\hat p_p(1-\hat p_p)(1/n_1+1/n_2)}\) | N(0,1) |
Pooled proportion: \( \hat p_p=\dfrac{X_1+X_2}{n_1+n_2} \). Critical values: \(z_{0.025}=1.96\), \(z_{0.05}=1.645\); \(t_{9,0.05}=1.833\).
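The z rows of the table translate directly into code. The following Python sketch uses illustrative function names, with inputs taken from the worked exercises in this chapter; it only computes the statistic, leaving the comparison with the critical value to the reader.

```python
from math import sqrt

def one_mean(xbar, mu0, sd, n):
    """(xbar - mu0) / (sd / sqrt(n)); sd is sigma if known, else s for large n."""
    return (xbar - mu0) / (sd / sqrt(n))

def two_mean(x1, s1, n1, x2, s2, n2):
    """Large-sample two-mean statistic with unpooled variances."""
    return (x1 - x2) / sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)

def one_prop(successes, n, p0):
    """One-proportion statistic; the SE uses the null value p0."""
    return (successes / n - p0) / sqrt(p0 * (1 - p0) / n)

def two_prop_pooled(x1, n1, x2, n2):
    """Two-proportion statistic with the pooled estimate (x1 + x2) / (n1 + n2)."""
    pooled = (x1 + x2) / (n1 + n2)
    return (x1 / n1 - x2 / n2) / sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))

print(round(one_mean(745, 750, 5, 25), 2))               # -5.0
print(round(two_mean(105, 50, 140, 120, 60, 140), 2))    # -2.27
print(round(one_prop(66, 200, 0.30), 2))                 # 0.93
print(round(two_prop_pooled(430, 800, 400, 800), 2))     # 1.5
```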
One-sided or two-sided? Read the wording
- One-sided: "more than", "over 30%", "more effective", "improved", "larger mean" → \(H_1\) points one way; reject only in that tail with \(z_\alpha\) (1.645 at 5%).
- Two-sided: "changed", "differ significantly", "need to recalibrate", "is there a difference" → \(H_1\neq\); reject in both tails with \(z_{\alpha/2}\) (1.96 at 5%).
For a two-group test, set \(H_0:\) difference \(=0\). The direction of \(H_1\) decides which tail; e.g. "B more effective" with statistic \((\bar x_A-\bar x_B)/SE\) rejects for small (negative) values.
p-value & the threshold significance level
The p-value is the null-probability of a statistic at least as extreme as observed, in the direction of \(H_1\): one-sided \(P(Z\ge ts_{obs})\) for an upper-tail \(H_1\) (or \(P(Z\le ts_{obs})\) for a lower-tail one); two-sided \(2P(Z\ge|ts_{obs}|)\). Reject whenever \(\alpha\ge\) p-value.
"Find the significance levels at which \(H_0\) is rejected" is just the p-value: e.g. a one-sided \(ts=0.93\) gives \(P(Z\ge0.93)=1-\Phi(0.93)=1-0.8238=0.1762\), so reject for \(\alpha\ge17.62\%\). A \(ts=2.4\) gives p \(=1-\Phi(2.4)=0.0082\) → reject for \(\alpha\ge0.82\%\).
Recognition guide
| Wording | Test |
|---|---|
| one group vs a target μ₀, "recalibrate / changed" | 1-mean z (two-sided) |
| one group vs target proportion, "more than X%" | 1-prop z (one-sided) |
| two groups, "more effective / larger" | 2-mean z (one-sided) |
| two proportions, "significant difference" | 2-prop pooled z (two-sided) |
| same units measured before/after (Initial/Final) | paired t on differences |
| "at what significance levels reject" | compute the p-value |
Recalibrate the bottling machine? (1-mean, two-sided)
A machine should fill bottles to a mean of 750 g; content is normal with known \(\sigma=5\) g. A sample of \(n=25\) gives \(\bar x=745\) g. Is there reason to recalibrate (α = 0.05)?
"Recalibrate" = changed = two-sided. \(\text{TS}=(\bar x-750)/(\sigma/\sqrt n)\), reject if \(|ts|\ge1.96\).
\(H_0:\mu=750\) vs \(H_1:\mu\neq750\).
\(ts=\dfrac{745-750}{5/\sqrt{25}}=\dfrac{-5}{1}=-5\).
\(-5<-1.96\) → in the rejection region, so reject \(H_0\): there is reason to recalibrate.
Over 30% smokers? (1-proportion, one-sided + p-value)
66 of 200 adults are smokers. (a) At α = 0.05, can we conclude more than 30% are smokers? (b) At which significance levels would \(H_0\) be rejected?
One-sided: \(H_0:p\le0.30\) vs \(H_1:p>0.30\). \(\text{TS}=(\hat p-0.3)/\sqrt{0.3\cdot0.7/n}\); reject if \(ts\ge1.645\). (b) is the p-value.
\(\hat p=66/200=0.33\); \(ts=\dfrac{0.33-0.3}{\sqrt{0.21/200}}=\dfrac{0.03}{0.0324}=0.93\).
\(0.93<1.645\) → do not reject \(H_0\) at the 5% level.
Reject when \(z_\alpha\le0.93\): \(1-\alpha\le\Phi(0.93)=0.8238\Rightarrow\alpha\ge0.1762\).
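The Φ lookups behind these p-values can be reproduced with the standard library. This is a minimal sketch; the function name `p_value` and the `alternative` labels are illustrative choices, not notation from the course.

```python
from statistics import NormalDist

def p_value(ts, alternative="greater"):
    """p-value of a z statistic under N(0, 1).

    alternative: "greater" (upper tail), "less" (lower tail) or "two-sided".
    """
    phi = NormalDist().cdf
    if alternative == "greater":
        return 1 - phi(ts)
    if alternative == "less":
        return phi(ts)
    if alternative == "two-sided":
        return 2 * (1 - phi(abs(ts)))
    raise ValueError("alternative must be 'greater', 'less' or 'two-sided'")

print(round(p_value(0.93), 4))  # 0.1762
print(round(p_value(2.4), 4))   # 0.0082
```

The null hypothesis is rejected at every significance level \(\alpha\) at or above the returned value.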
Is treatment B more effective? (2-mean, one-sided)
Two groups of \(n=140\): A has \(\bar x_A=105\), \(s_A=50\); B has \(\bar x_B=120\), \(s_B=60\) (higher = more effective). Can we claim B is more effective (α = 0.05)?
\(H_1:\mu_B>\mu_A\), i.e. \(H_0:\mu_A-\mu_B\ge0\) vs \(H_1:\mu_A-\mu_B<0\). Statistic \((\bar x_A-\bar x_B)/\sqrt{s_A^2/n_A+s_B^2/n_B}\); reject if \(ts\le-1.645\).
\(\sqrt{2500/140+3600/140}=\sqrt{43.571}=6.601\).
\(ts=\dfrac{105-120}{6.601}=-2.27\).
\(-2.27<-1.645\) → reject.
Heavier bananas from provider 1? (2-mean, p-value)
Two providers, \(n=128\) each: provider 1 \(\bar x_1=155.0\), \(s_1=10\); provider 2 \(\bar x_2=152.0\), \(s_2=10\). Is the claim that provider 1's bananas are heavier justified?
\(H_0:\mu_1-\mu_2\le0\) vs \(H_1:\mu_1-\mu_2>0\). Compute \(ts\), then \(p=P(Z\ge ts)\).
\(ts=\dfrac{155-152}{\sqrt{100/128+100/128}}=\dfrac{3}{\sqrt{1.5625}}=\dfrac{3}{1.25}=2.4\).
\(P(Z\ge2.4)=1-\Phi(2.4)=0.0082\).
Reject for \(\alpha\ge0.82\%\); at 5% (and 1%) → reject.
Do two coins differ? (2-proportion, pooled)
Two coins are each thrown 800 times: A shows Heads 430 times, B shows Heads 400 times. Is there a significant difference in their Heads probabilities (α = 0.05)?
Two-sided 2-proportion test with pooled \(\hat p_p=(430+400)/1600\). Reject if \(|ts|\ge1.96\).
\(\hat p_p=830/1600=0.51875\); \(\hat p_A=0.5375\), \(\hat p_B=0.5\).
\(ts=\dfrac{0.5375-0.5}{\sqrt{0.51875\cdot0.48125\,(1/800+1/800)}}=\dfrac{0.0375}{0.02498}=1.50\).
\(1.50<1.96\) → do not reject.
Did training improve scores? (paired t)
Ten gamers' scores are recorded before (Initial) and after (Final) a week of training. The improvements (Final−Initial) are \(0.6,-0.7,0.4,-1.4,1.7,0.6,2.4,1.0,1.9,0.5\). Does the data support that training improved the average score (α = 0.05)?
Paired data → one-sample t on the differences. \(H_0:\mu\le0\) vs \(H_1:\mu>0\). Critical \(t_{9,0.05}=1.833\).
\(\bar d=0.7\), \(s_d=1.15\), \(n=10\).
\(ts=\dfrac{0.7-0}{1.15/\sqrt{10}}=\dfrac{0.7}{0.3637}=1.92\).
\(1.92>t_{9,0.05}=1.833\) → reject.
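The summary numbers \(\bar d\), \(s_d\) and the statistic can be recomputed from the raw differences. In this Python sketch the function name `paired_t` is illustrative, and the critical value 1.833 is the table value quoted in the text, since the standard library has no t quantile function.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(differences):
    """One-sample t statistic on paired differences, testing mean difference 0.

    Returns (statistic, degrees of freedom); stdev uses the n - 1 divisor.
    """
    n = len(differences)
    d_bar = mean(differences)
    s_d = stdev(differences)
    return d_bar / (s_d / sqrt(n)), n - 1

diffs = [0.6, -0.7, 0.4, -1.4, 1.7, 0.6, 2.4, 1.0, 1.9, 0.5]  # Final - Initial
ts, df = paired_t(diffs)
T_CRIT_9_05 = 1.833  # t_{9, 0.05} from the table, as quoted in the text
print(round(ts, 2), df, ts > T_CRIT_9_05)  # 1.92 9 True
```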
Has the average grade changed? (1-mean, two-sided)
Historical average grade is 23.5. A recent exam with \(n=100\) students had mean 25.0, sd 2.5. Can we conclude the average changed (α = 0.05)?
"Changed" = two-sided. Large n → z with s. Reject if \(|ts|\ge1.96\).
\(H_0:\mu=23.5\) vs \(H_1:\mu\neq23.5\).
\(ts=\dfrac{25.0-23.5}{2.5/\sqrt{100}}=\dfrac{1.5}{0.25}=6\).
\(6\gg1.96\) → reject.
Cheatsheet & Decision Map
The open-book weapon: which procedure → which formula. Print this and bring it to the exam.
Which procedure? — top-level routing
| The question is about… | Go to |
|---|---|
| "chosen at random then observe", "given the result, prob it was…" | Bayes / total probability → |
| two overlapping groups, "at least one", "not the other" | inclusion–exclusion |
| "choose k without replacement", "all different" | counting \(\binom{n}{k}\) |
| "normally distributed", a probability / percentile / unknown σ | standardize Z, Φ → |
| weighted/grouped mean, add an observation, back-solve a size | descriptive (Σx=nx̄) |
| many i.i.d. units, prob the TOTAL exceeds a threshold | CLT: \(N(n\mu,n\sigma^2)\) |
| symbolic estimator T: unbiased? MSE? efficient? find a / c? | estimator theory → |
| "write the likelihood / MLE"; "moments estimator" | MLE / method-of-moments |
| "confidence interval", "how large a sample" | CI / sample size → |
| "is there reason to conclude / more than / changed" | hypothesis test → |
Inference: which CI / which test?
Walk the questions in order:
- CI or test? "compute an interval / how large a sample" → CI. "is there reason / more than / changed" → test.
- Mean or proportion? averages/measurements → mean. percentages/counts of successes → proportion.
- One sample or two? one group vs a target → one-sample. two groups compared → two-sample. same units before/after → paired.
- z or t? known σ or large n (≥30) → z. small n, normal, unknown σ → t. (Proportions always z.)
- One- or two-sided? "more / over / improved" → one-sided \(z_\alpha\). "changed / differ" → two-sided \(z_{\alpha/2}\).
CI & test formula bank (T1 + T2 spine)
| Case | CI: estimate ± margin | Test statistic |
|---|---|---|
| 1-mean, σ known / large n | \(\bar x\pm z_{\alpha/2}\sigma/\sqrt n\) | \((\bar x-\mu_0)/(\sigma/\sqrt n)\) |
| 1-mean, small n | \(\bar x\pm t_{n-1,\alpha/2}s/\sqrt n\) | \((\bar x-\mu_0)/(s/\sqrt n)\sim t_{n-1}\) |
| paired | \(\bar d\pm t_{n-1,\alpha/2}s_d/\sqrt n\) | \(\bar d/(s_d/\sqrt n)\sim t_{n-1}\) |
| 2-mean, large n | — | \((\bar x_1-\bar x_2)/\sqrt{s_1^2/n_1+s_2^2/n_2}\) |
| 1-proportion | \(\hat p\pm z_{\alpha/2}\sqrt{\hat p(1-\hat p)/n}\) | \((\hat p-p_0)/\sqrt{p_0(1-p_0)/n}\) |
| 2-proportion (pooled) | — | \((\hat p_1-\hat p_2)/\sqrt{\hat p_p(1-\hat p_p)(1/n_1+1/n_2)}\) |
\(\hat p_p=\dfrac{X_1+X_2}{n_1+n_2}\). Reject: two-sided \(|ts|\ge z_{\alpha/2}\); one-sided \(ts\ge z_\alpha\) (or \(\le-z_\alpha\)). p-value: one-sided \(1-\Phi(ts)\), two-sided \(2(1-\Phi(|ts|))\); reject when \(\alpha\ge\) p-value.
Sample size (round UP)
| Target | Formula |
|---|---|
| mean, total length \(L\) | \(n\ge(2z_{\alpha/2}\sigma/L)^2\) |
| proportion, half-width \(d\) | \(n\ge z_{\alpha/2}^2\,p(1-p)/d^2\), worst case \(p(1-p)=\tfrac14\) |
| recover n from CI | \(\hat p\)=mid, \(d\)=half-width, \(n=z_{\alpha/2}^2\hat p(1-\hat p)/d^2\) |
Critical values & Φ shortcuts
| Confidence / tail | z |
|---|---|
| 90% (two-sided) / 0.05 tail | \(z_{0.05}=1.645\) |
| 95% / 0.025 tail | \(z_{0.025}=1.96\) |
| 99% / 0.005 tail | \(z_{0.005}=2.576\) |
| 0.03 / 0.02 tails (asymmetric) | \(z_{0.03}=1.88,\;z_{0.02}=2.055\) |
| small-n paired | \(t_{9,0.05}=1.833\) |
Φ values: \(\Phi(0.5)=0.6915\), \(\Phi(1)=0.8413\), \(\Phi(2)=0.9772\), \(\Phi(2.4)=0.9918\), \(\Phi(1/3)=0.6293\). Symmetry \(\Phi(-z)=1-\Phi(z)\). Central band \(2\Phi(d/\sigma)-1\). Inverse: \(\Phi^{-1}(0.975)=1.96\), \(\Phi^{-1}(0.75)=0.675\). Standardize \(Z=(X-\mu)/\sigma\).
Estimator-theory recipes (T3 / T10)
- MSE: \(\mathrm{MSE}=\mathrm{Var}+\mathrm{Bias}^2\); Bias \(=E[T]-\theta\). Unbiased ⇒ MSE=Var.
- Var of combo: \(\mathrm{Var}(aU+bW)=a^2\mathrm{Var}(U)+b^2\mathrm{Var}(W)\) (independent).
- Unbiased weights: any \(c\bar X_1+(1-c)\bar X_2\) is unbiased for μ (weights sum to 1).
- Min-MSE weight: inverse-variance, \(c^*=\dfrac{n_1}{n_1+n_2}\) for sample means.
- Find constant: set \(E[T]=\theta\); use \(E[X^2]=\sigma^2+\mu^2\), \(E[X_iX_j]=\mu^2\) (indep).
- Jensen: \(\sqrt T\) (or log, 1/T) of an unbiased T is biased.
- MLE: \(L=\prod f\) → \(\log\) → \(d/d\theta=0\). Geometric → \(\hat\theta=1/\bar X\).
- Method-of-moments: set \(E[X]=\bar X\) (as a function of θ), invert.
- Moments: Bernoulli \(E{=}p,V{=}p(1{-}p)\); Poisson \(E{=}V{=}\lambda\); \(\mathrm{Var}(\bar X_m)=\sigma^2/m\).
Probability templates (T4 / counting / CLT)
- Total probability: \(P(E)=\sum_i P(E\mid H_i)P(H_i)\).
- Bayes: \(P(H_j\mid E)=\dfrac{P(E\mid H_j)P(H_j)}{\sum_i P(E\mid H_i)P(H_i)}\).
- Coin mixture: \(P(X=0\mid N=n)=(\tfrac12)^n\).
- Conditional in Bernoulli: disjoint arrangements ÷ binomial; \(p\) cancels.
- Inclusion–exclusion: \(P(B\cup C)=P(B)+P(C)-P(B\cap C)\); \(P(B\cap C^c)=P(B)-P(B\cap C)\).
- Counting: \(P=\#\text{fav}/\binom{n}{k}\); complement for "different", fix items for "included".
- CLT (sum): \(S_n\approx N(n\mu,n\sigma^2)\), \(Z=\dfrac{s-n\mu}{\sigma\sqrt n}\) (SUM uses \(\sigma\sqrt n\), not \(\sigma/\sqrt n\)).
- Binomial→Normal: \(N(np,np(1-p))\); skip \(\pm\tfrac12\) if "no continuity correction".
- Normal models→prob: \(\mathrm{Var}=E[X^2]-(E[X])^2\), \(E[X(X-1)]=E[X^2]-E[X]\).
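The CLT (sum) template in the list above is the one where the scaling is most often mistaken, so here is a minimal Python sketch. The function name `prob_total_exceeds` and the numbers (100 units with \(\mu=2\), \(\sigma=0.5\), threshold 210) are hypothetical, chosen only to illustrate the \(\sigma\sqrt n\) scaling.

```python
from math import sqrt
from statistics import NormalDist

def prob_total_exceeds(threshold, n, mu, sigma):
    """Approximate P(S_n > threshold) for the sum of n i.i.d. variables.

    Uses S_n ~ N(n*mu, n*sigma**2), so the sd of the SUM is sigma * sqrt(n).
    """
    z = (threshold - n * mu) / (sigma * sqrt(n))
    return 1 - NormalDist().cdf(z)

# z = (210 - 200) / (0.5 * 10) = 2, so the answer is 1 - Phi(2).
print(round(prob_total_exceeds(210, 100, 2.0, 0.5), 4))  # 0.0228
```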