Statistics.
From probability to hypothesis testing — theory, flashcards, spaced repetition, and worked exercises, built to be read.
How to use this volume
- Begin with Theory. Every concept is introduced from first principles, with worked examples.
- Move to Flashcards for active recall. Click any card to flip, then mark HARD or GOT IT — your judgments persist.
- Attempt each Exercise on paper. Reveal the hint if stuck, then the full solution to check your reasoning.
- Navigate with ← →. Focus search with /. Mark a problem solved to track your progress.
Probability Foundations
Starting from zero: what probability is, how to combine events, and how to reason with Bayes — built up with concrete examples before any exam problem.
1 · What probability actually is
Start with the picture, not the formula. An experiment is any situation whose result you can't predict for sure, but where you do know the list of possible results. Tossing a coin is an experiment; so is "which operating system will my next laptop run?" or "what will Apple's share price be on Monday?"
Each single result is an outcome. The full list of possible outcomes is the sample space, written \(S\).
- Coin toss: \(S=\{\text{Heads},\text{Tails}\}\).
- Roll a die: \(S=\{1,2,3,4,5,6\}\).
- Next laptop OS: \(S=\{\text{Windows},\text{MacOS},\text{Linux},\dots\}\).
An event is just a statement about the result — "the die shows an even number", "I get Heads". Formally it's a subset of \(S\): the even-number event is the set \(\{2,4,6\}\). We say the event occurs when the actual outcome is one of the outcomes inside it.
So what is a probability? The most useful picture is long-run frequency: if you repeated the experiment over and over, the probability of an event is the proportion of times it would happen. Flip a fair coin thousands of times and the fraction landing Tails settles near \(0.5\) — that limiting fraction is the probability. (Real example: across the world about 105 boys are born for every 100 girls, year after year — so \(P(\text{newborn is male})\approx0.51\).) A probability is always a number between 0 and 1: 0 = never, 1 = certain.
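A quick simulation makes the long-run-frequency picture concrete. This is only a sketch: it assumes Python's standard `random` module as the "coin", and the seed and the numbers of flips are arbitrary choices, not part of the text.

```python
import random

def tails_fraction(n_flips, seed=0):
    """Fraction of Tails in n_flips simulated fair-coin tosses."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    tails = sum(rng.random() < 0.5 for _ in range(n_flips))
    return tails / n_flips

# The fraction drifts toward 0.5 as the number of flips grows.
for n in (10, 1_000, 100_000):
    print(n, tails_fraction(n))
```

With few flips the fraction can sit far from 0.5; with many it settles close to it, which is the limiting behaviour the definition relies on.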
2 · Combining events: AND, OR, NOT
Events are sets, so we combine them like sets. Picture a rectangle for \(S\) and circles inside it for events (a Venn diagram).
- OR — union \(A\cup B\): the outcomes in \(A\), or in \(B\), or in both. It occurs if at least one of them happens.
- AND — intersection \(A\cap B\): the outcomes in both. It occurs only if they happen together.
- NOT — complement \(A^c\): everything in \(S\) that is not in \(A\). It occurs exactly when \(A\) doesn't.
Mutually exclusive (disjoint) events can't happen at the same time — their intersection is empty, \(A\cap B=\varnothing\). "The die shows 2" and "the die shows 5" are mutually exclusive. The empty set \(\varnothing\) is the impossible event.
One identity worth seeing now, because it drives a lot of exam problems: "in \(A\) but not in \(B\)" is \(A\cap B^c\), and it equals what's in \(A\) minus the overlap. We'll turn that into numbers next.
3 · The rules of probability
Three basic rules (they just codify the frequency picture):
- \(0\le P(A)\le1\) — proportions live between 0 and 1.
- \(P(S)=1\) — something in the list always happens.
- If \(A,B\) are mutually exclusive, \(P(A\cup B)=P(A)+P(B)\) — non-overlapping chances just add.
From these you derive the two you'll actually use:
Complement rule. Since \(A\) and \(A^c\) split \(S\): \(P(A^c)=1-P(A)\). (Heads has probability 0.4 ⇒ Tails has 0.6.) This is the engine behind "at least one" problems — it's usually easier to compute the opposite and subtract.
Addition rule (when events can overlap): \[ P(A\cup B)=P(A)+P(B)-P(A\cap B). \] Why subtract? If you just add \(P(A)+P(B)\), the overlap \(A\cap B\) gets counted twice, so you remove it once.
Concrete example (Ross 4.3). A shop takes Amex or VISA. 22% of customers carry Amex, 58% carry VISA, 14% carry both. Probability a customer has at least one card: \(0.22+0.58-0.14=0.66\). And "VISA but not Amex" \(=0.58-0.14=0.44\) — exactly the \(A\cap B^c\) idea from section 2.
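The two card calculations can be checked with exact fractions. The helper names `union` and `only_first` are made up for this sketch.

```python
from fractions import Fraction

def union(p_a, p_b, p_both):
    """P(A or B) by the addition rule."""
    return p_a + p_b - p_both

def only_first(p_a, p_both):
    """P(A and not B): A minus the overlap."""
    return p_a - p_both

amex, visa, both = Fraction(22, 100), Fraction(58, 100), Fraction(14, 100)
at_least_one = union(amex, visa, both)    # 33/50 = 0.66
visa_not_amex = only_first(visa, both)    # 11/25 = 0.44
print(float(at_least_one), float(visa_not_amex))
```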
4 · Equally likely outcomes — just count
When every outcome in \(S\) is equally likely (a fair die, a well-shuffled deck, "a person chosen at random"), probability becomes pure counting:
\[ P(A)=\frac{\#\text{outcomes in }A}{\#\text{outcomes in }S}. \] The phrase "chosen at random" is your signal that outcomes are equally likely.
- Fair die: \(P(\text{even})=3/6=1/2\).
- European roulette, bet on odd: numbers \(\{0,1,\dots,36\}\), 18 are odd ⇒ \(18/37\).
- Retirement centre (Ross 4.4): 420 members, 144 smokers ⇒ \(P(\text{smoker})=144/420=12/35\).
This is why counting techniques (section 9) matter: to get a probability you often just need to count the favourable outcomes and divide by the total.
5 · Conditional probability — updating on information
Often you learn something partway through. Conditional probability \(P(B\mid A)\) is "the probability of \(B\) given that \(A\) has happened."
The intuition (no formula yet). Roll two dice; you're told the first die is a 4. That knowledge shrinks the world: only six outcomes are still possible — \((4,1),\dots,(4,6)\). Among those, only \((4,6)\) makes the sum 10, so the chance is \(1/6\). You re-computed the probability inside a reduced sample space: \(A\) became your new \(S\).
Turning that into a formula — measure the overlap relative to the thing you now know:
\[ P(B\mid A)=\frac{P(A\cap B)}{P(A)}. \]
The famous trap (Ross 4.10 / the two-children problem). A couple has two children; you learn at least one is a girl. Probability both are girls? Equally likely families are \(\{(g,g),(g,b),(b,g),(b,b)\}\). "At least one girl" rules out only \((b,b)\), leaving three equally likely cases; just one is \((g,g)\). So the answer is \(1/3\), not \(1/2\): the extra information reshapes the sample space. (This exact reasoning powers exam exercise 5 below.)
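The reduced-sample-space argument for the two-children problem can be reproduced by brute-force enumeration. A minimal sketch, assuming the four families are equally likely as stated:

```python
from fractions import Fraction
from itertools import product

# All equally likely two-child families: (first, second).
families = list(product("gb", repeat=2))

# Condition: at least one girl. This is the reduced sample space.
at_least_one_girl = [f for f in families if "g" in f]
both_girls = [f for f in at_least_one_girl if f == ("g", "g")]

p_both_given_one = Fraction(len(both_girls), len(at_least_one_girl))
print(p_both_given_one)  # 1/3
```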
Rearranging the formula gives the multiplication rule: \(P(A\cap B)=P(A)\,P(B\mid A)\) — useful for "draw two without replacement" problems, where the second draw's odds depend on the first.
6 · Independence — when information doesn't help
Sometimes knowing \(A\) tells you nothing about \(B\). Then \(P(B\mid A)=P(B)\), and the multiplication rule simplifies to the test you'll use:
\[ A,B\text{ independent}\iff P(A\cap B)=P(A)\,P(B). \] Signals of independence: "with replacement", "i.i.d.", "each toss is fair", all describing separate trials that don't influence each other.
Why it's not automatic (Ross 4.13). Two fair dice. Let \(A=\)"first die is 3". Compare two events: \(B=\)"sum is 8" and \(C=\)"sum is 7". Knowing the first die is 3 changes the chance of an 8 (now you just need a 5 next) — so \(A,B\) are dependent. But the chance of a 7 stays \(1/6\) whatever the first die shows (there's always exactly one matching second die) — so \(A,C\) are independent. Same first event, opposite verdicts: independence is something you check, not assume.
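The product test can be run mechanically over the 36 outcomes of two dice. In this sketch the helper names `prob` and `independent` are illustrative, not from the text.

```python
from fractions import Fraction
from itertools import product

rolls = list(product(range(1, 7), repeat=2))  # 36 equally likely outcomes

def prob(event):
    """Probability of an event given as a predicate on (first, second)."""
    return Fraction(sum(1 for r in rolls if event(r)), len(rolls))

def independent(e1, e2):
    """True when P(e1 and e2) equals P(e1) * P(e2)."""
    return prob(lambda r: e1(r) and e2(r)) == prob(e1) * prob(e2)

first_is_3 = lambda r: r[0] == 3
sum_is_8 = lambda r: r[0] + r[1] == 8
sum_is_7 = lambda r: r[0] + r[1] == 7

print(independent(first_is_3, sum_is_8))  # False
print(independent(first_is_3, sum_is_7))  # True
```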
For "at least one" across independent trials, lean on the complement: three children, \(P(\text{at least one girl})=1-P(\text{all boys})=1-(1/2)^3=7/8\).
7 · Total probability — averaging over cases
Often the thing you want depends on a hidden "case" or "cause". Split the problem by the case, compute each piece, and combine. If \(B\) either happens or not, then for any \(A\):
\[ P(A)=P(A\mid B)\,P(B)+P(A\mid B^c)\,P(B^c). \] Read it as a weighted average: the chance of \(A\) in each case, weighted by how likely that case is. (It generalises to any set of mutually exclusive cases \(B_1,\dots,B_k\) that together cover \(S\): \(P(A)=\sum_i P(A\mid B_i)P(B_i)\).)
Concrete example (insurance). 30% of drivers are high-risk, 70% low-risk. A high-risk driver has an accident this year with probability 0.4, a low-risk one with probability 0.2. Overall chance a random driver has an accident: \[ P(\text{accident})=0.4(0.30)+0.2(0.70)=0.12+0.14=0.26. \] You couldn't answer without splitting by risk type — that's the whole move.
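The "split by case, weight, add" move fits in a few lines. A sketch with a hypothetical helper `total_probability`, which also checks that the case probabilities sum to 1:

```python
def total_probability(cases):
    """Sum of P(A | case) * P(case) over mutually exclusive, exhaustive cases.

    cases: iterable of (p_case, p_a_given_case) pairs.
    """
    cases = list(cases)
    if abs(sum(p for p, _ in cases) - 1.0) > 1e-9:
        raise ValueError("case probabilities must sum to 1")
    return sum(p * p_a for p, p_a in cases)

# Insurance example: (share of drivers, accident probability).
p_accident = total_probability([(0.30, 0.4), (0.70, 0.2)])
print(round(p_accident, 2))  # 0.26
```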
8 · Bayes' theorem — reasoning backwards
Total probability runs cause → effect. Bayes runs it backwards: you observed the effect, and you want the probability of the cause. "It rained — how likely is it that the morning had been sunny?" "The test is positive — how likely is the disease?"
Start from the definition \(P(\text{cause}\mid\text{effect})=\dfrac{P(\text{cause}\cap\text{effect})}{P(\text{effect})}\), write the top as \(P(\text{effect}\mid\text{cause})P(\text{cause})\), and expand the bottom with total probability:
\[ P(H\mid E)=\frac{P(E\mid H)\,P(H)}{P(E\mid H)\,P(H)+P(E\mid H^c)\,P(H^c)}. \]
The procedure: (1) name the causes \(H\) and the observed effect \(E\); (2) write the priors \(P(H)\) and the likelihoods \(P(E\mid H)\) straight from the text; (3) total probability gives the denominator; (4) divide.
The showcase example — why a positive test can still be reassuring (Ross 4.17). A blood test is 99% accurate when the disease is present (\(P(E\mid H)=0.99\)) and gives a false positive 2% of the time (\(P(E\mid H^c)=0.02\)). Only 0.5% of people have the disease (\(P(H)=0.005\)). You test positive — what's the chance you're actually sick? \[ P(H\mid E)=\frac{0.99(0.005)}{0.99(0.005)+0.02(0.995)}\approx0.199. \] About 20% — surprisingly low, because the huge healthy population produces many false positives that swamp the few true cases. This is exactly the engine behind exam exercises 3 and 4.
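The four-step procedure translates directly into a small function. This is a sketch for the two-way split \(H\) / \(H^c\) only; the name `posterior` and its parameter names are assumptions made for illustration.

```python
def posterior(prior, p_e_given_h, p_e_given_not_h):
    """P(H | E) for a two-way split H / not-H."""
    numerator = p_e_given_h * prior
    evidence = numerator + p_e_given_not_h * (1 - prior)  # total probability
    return numerator / evidence

p_sick_given_positive = posterior(prior=0.005, p_e_given_h=0.99, p_e_given_not_h=0.02)
print(round(p_sick_given_positive, 3))  # 0.199
```

Raising `prior` while holding the test's error rates fixed shows how strongly the answer depends on how common the disease is.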
9 · Counting — tools for equally-likely problems
Section 4 said probability is often just counting favourable ÷ total. Here are the tools, introduced only because we need them.
Basic principle. If step 1 has \(n\) options and step 2 has \(m\), together there are \(n\cdot m\). (One man from 8, one woman from 12 → \(96\) pairs.)
Permutations / factorial. The number of orderings of \(n\) distinct objects is \(n!=n(n-1)\cdots2\cdot1\) (with \(0!=1\)). Order matters here.
Combinations. When order does not matter — choosing a group of \(k\) from \(n\):
\[ \binom{n}{k}=\frac{n!}{k!\,(n-k)!}. \]
Two reflexes for the exam: "all different" → count the complement (the few repeats) and subtract; "these specific items are included" → fix them, then count the choices for the remaining slots. Watch for repeated elements, e.g. the two identical P's in APPLE: of the \(\binom{5}{2}=10\) two-letter selections, only one is the pair PP.
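Python's standard `math` module provides these counts directly. One assumption beyond the text: `math.perm(n, k)` counts ordered selections of \(k\) from \(n\), a case this section does not spell out, and is shown only for contrast with `math.comb`.

```python
import math

pairs = 8 * 12                    # basic principle: one man, one woman
orderings = math.factorial(5)     # orderings of 5 distinct objects
groups = math.comb(5, 2)          # unordered pairs from 5 items
ordered_pairs = math.perm(5, 2)   # ordered pairs from 5 items

print(pairs, orderings, groups, ordered_pairs)  # 96 120 10 20
```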
10 · Which tool? — recognition guide
Once the concepts are in place, exam problems are about spotting which one applies.
| If the wording says… | It's a… | Do |
|---|---|---|
| "chosen at random" among options, then observe a result; "given [result], prob it was [cause]" | Bayes | priors × likelihoods, divide |
| "a cause is random, then count X", asked an unconditional \(P(X{=}k)\) | Total probability (mixture) | \(\sum P(X{=}k\mid \text{case})P(\text{case})\) |
| two overlapping groups + counts; "at least one", "one but not the other" | Addition / inclusion–exclusion | \(P(A\cup B)=P(A)+P(B)-P(A\cap B)\) |
| "choose \(k\) without replacement", "all different", "both included" | Counting | \(\binom{n}{k}\) + complement / fix-items |
| "knowing exactly \(m\) of \(n\) are…", asked about a specific draw | Conditional in a fair setup | reduced sample space; count favourable ÷ remaining |
Coin tossed a random number of times — P(X=0)
Let \(N\) be a number chosen from \(\{1,2,3\}\) with equal probability. Throw a fair coin \(N\) times, counting the number \(X\) of Heads obtained. Calculate \(P(X=0)\).
The hidden 'case' is how many times you tossed (\(N\)). No reversal is asked, so it's the law of total probability (section 7). Given \(N=n\), getting zero Heads means \(n\) Tails in a row: \((\tfrac12)^n\).
Cases: \(P(N=1)=P(N=2)=P(N=3)=\tfrac13\). The question asks a plain \(P(X=0)\) — no 'given' — so average over the cases (total probability), not Bayes.
\(n\) independent fair tosses, all Tails: \(P(X=0\mid N=n)=(\tfrac12)^n\). For \(n=1,2,3\) that's \(\tfrac12,\tfrac14,\tfrac18\).
\[ P(X=0)=\sum_{n=1}^{3}P(X=0\mid N=n)P(N=n)=\tfrac13\left(\tfrac12+\tfrac14+\tfrac18\right)=\tfrac13\cdot\tfrac78=\tfrac{7}{24}. \]
Same coin experiment — reverse it with Bayes
Same setup (\(N\) chosen from \(\{1,2,3\}\), fair coin tossed \(N\) times, \(X\) = Heads). Given that \(X=0\), calculate \(P(N=1\mid X=0)\).
Now you observe the effect \(X=0\) and want the hidden cause \(N=1\) → Bayes (section 8). The denominator is the \(P(X=0)=\tfrac{7}{24}\) you just built.
From exercise 1, \(P(X=0)=\tfrac{7}{24}\).
\(P(X=0\mid N=1)\,P(N=1)=\tfrac12\cdot\tfrac13=\tfrac16\).
\[ P(N=1\mid X=0)=\frac{1/6}{7/24}=\frac16\cdot\frac{24}{7}=\frac47. \]
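Both coin exercises can be verified with exact fractions: the first is the denominator of the second. A sketch, with dictionary names chosen for illustration:

```python
from fractions import Fraction

half = Fraction(1, 2)
prior = {n: Fraction(1, 3) for n in (1, 2, 3)}   # P(N = n)
likelihood = {n: half ** n for n in prior}        # P(X = 0 | N = n)

# Exercise 1: total probability.
p_x0 = sum(likelihood[n] * prior[n] for n in prior)

# Exercise 2: Bayes, reusing p_x0 as the denominator.
p_n1_given_x0 = likelihood[1] * prior[1] / p_x0

print(p_x0, p_n1_given_x0)  # 7/24 4/7
```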
Which coin was it? — Bayes with two coins
An urn has two coins: coin A with \(P(\text{Heads})=\tfrac14\), coin B with \(P(\text{Heads})=\tfrac12\). One coin is picked at random and thrown, giving "Heads". Given this result, what is the probability it was coin A?
Cause = which coin (priors \(\tfrac12,\tfrac12\)); effect = Heads (likelihoods \(\tfrac14,\tfrac12\)). You observed the effect, want the cause → Bayes. Same shape as the medical-test example.
\(P(A)=P(B)=\tfrac12\); \(P(H\mid A)=\tfrac14\), \(P(H\mid B)=\tfrac12\).
\(P(H)=\tfrac14\cdot\tfrac12+\tfrac12\cdot\tfrac12=\tfrac18+\tfrac14=\tfrac38\).
\[ P(A\mid H)=\frac{P(H\mid A)P(A)}{P(H)}=\frac{1/8}{3/8}=\frac13. \]
Was the morning sunny? — Bayes with rain
If the morning is sunny, the chance of rain that day is \(\tfrac16\). On non-sunny mornings, the chance of rain is \(\tfrac12\). 60% of mornings start sunny. Given that it rained, what is the probability the morning was sunny?
Cause = sunny vs not (priors \(0.6,0.4\)); effect = rain (likelihoods \(\tfrac16,\tfrac12\)). Observed the effect (rain), want the cause (sunny) → Bayes.
\(P(S)=0.6\), \(P(S^c)=0.4\); \(P(R\mid S)=\tfrac16\), \(P(R\mid S^c)=\tfrac12\).
\(P(R)=\tfrac16\cdot0.6+\tfrac12\cdot0.4=0.1+0.2=0.3\).
\[ P(S\mid R)=\frac{P(R\mid S)P(S)}{P(R)}=\frac{0.1}{0.3}=\frac13. \]
Exactly two women — conditioning in a fair setup
Three students are selected at random with replacement. Knowing that exactly two of the selected are women, what is the probability that the first selected is a woman?
This is the two-children trap (section 5) with three slots. 'With replacement' → independent trials, each woman with some probability \(p\). Let \(E_i\)='the \(i\)-th is a woman', \(W\)=number of women. Want \(P(E_1\mid W=2)\). Watch \(p\) cancel.
Number of women among 3 is Binomial: \(P(W=2)=\binom{3}{2}p^2(1-p)=3p^2(1-p)\).
The arrangements are \(E_1E_2E_3^c\) and \(E_1E_2^cE_3\) (disjoint). By independence each is \(p\cdot p\cdot(1-p)\), so \(P(E_1\cap\{W=2\})=2p^2(1-p)\).
\[ P(E_1\mid W=2)=\frac{2p^2(1-p)}{3p^2(1-p)}=\frac23. \] The \(p^2(1-p)\) cancels, just like the count-the-cases reasoning in the two-children problem.
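To see the cancellation numerically, one can enumerate the eight woman/not-woman patterns and weight each by its probability. The values of \(p\) tried below are arbitrary; the function name is made up for this sketch.

```python
from fractions import Fraction
from itertools import product

def p_first_woman_given_two(p):
    """P(first is a woman | exactly two of three are women), by enumeration."""
    num = den = Fraction(0)
    for slots in product((True, False), repeat=3):   # True = woman
        weight = Fraction(1)
        for is_woman in slots:
            weight *= p if is_woman else 1 - p
        if sum(slots) == 2:
            den += weight
            if slots[0]:
                num += weight
    return num / den

for p in (Fraction(1, 2), Fraction(3, 10), Fraction(9, 10)):
    print(p, p_first_woman_given_two(p))  # always 2/3
```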
Biology not Chemistry — addition rule
A school has 300 students; every student takes at least one of Biology or Chemistry, possibly both. Biology has 250 students, Chemistry 150. Picking a student at random, what is the probability they take Biology and not Chemistry?
Two overlapping groups + 'and not' → addition rule (section 3). Everyone is in the union, so \(P(B\cup C)=1\); solve for the overlap, then use \(P(B\cap C^c)=P(B)-P(B\cap C)\).
\(P(B)=\tfrac{250}{300}\), \(P(C)=\tfrac{150}{300}\), \(P(B\cup C)=1\) (all take at least one).
\(1=\tfrac{250}{300}+\tfrac{150}{300}-P(B\cap C)\Rightarrow P(B\cap C)=\tfrac{100}{300}\).
\[ P(B\cap C^c)=P(B)-P(B\cap C)=\tfrac{250}{300}-\tfrac{100}{300}=\tfrac{150}{300}=\tfrac12. \]
Two letters from APPLE, all different — counting
Consider the word APPLE. Choose two letters at random without replacement. What is the probability that they are different?
Equally-likely selections → count (section 9). \(\binom{5}{2}=10\) pairs. 'All different' → count the complement (the pairs that repeat) and subtract.
Letters A, P, P, L, E → \(\binom{5}{2}=10\) unordered pairs.
The only pair of equal letters is PP — exactly 1 outcome.
Different pairs \(=10-1=9\), so \(P(\text{different})=\tfrac{9}{10}\).
Three letters from APPLE, both P's chosen — counting
From the word APPLE, choose three letters at random without replacement. What is the probability that both P's are among the chosen letters?
\(\binom{5}{3}=10\) selections. 'Both P's included' → fix the two P's, count choices for the one remaining slot.
\(\binom{5}{3}=10\) unordered selections of three letters.
If both P's are taken, the third letter is one of A, L, E → 3 favourable selections.
\( P(\text{both P's})=\tfrac{3}{10} \).
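Both APPLE exercises can be checked by listing every selection. The one design choice in this sketch is to choose letter positions rather than letters, so the two P's stay distinguishable and all selections remain equally likely.

```python
from fractions import Fraction
from itertools import combinations

letters = "APPLE"
positions = range(len(letters))  # index the letters so the two P's stay distinct

pairs = list(combinations(positions, 2))
different = [c for c in pairs if letters[c[0]] != letters[c[1]]]
p_different = Fraction(len(different), len(pairs))

triples = list(combinations(positions, 3))
both_ps = [c for c in triples if sum(letters[i] == "P" for i in c) == 2]
p_both_ps = Fraction(len(both_ps), len(triples))

print(p_different, p_both_ps)  # 9/10 3/10
```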
Discrete Random Variables & Key Models
Expectation, variance, and the Bernoulli / Binomial / Poisson / Geometric toolkit that feeds every later chapter
pmf, expectation, variance
A discrete random variable \(X\) is described by its pmf \(p(x)=P(X=x)\), with \(\sum_x p(x)=1\). Two summaries do most of the work:
\[ E[X]=\sum_x x\,p(x), \qquad \mathrm{Var}(X)=E[X^2]-\big(E[X]\big)^2,\quad E[X^2]=\sum_x x^2 p(x). \] The variance shortcut \(E[X^2]-(E[X])^2\) is almost always faster than \(\sum (x-\mu)^2 p(x)\) on the exam: compute \(E[X]\) and \(E[X^2]\) in one pass over the table, then subtract.
Normalising trick. If a pmf is given up to a constant \(c\) (e.g. \(p(i)=c\,\lambda^i/i!\)), find \(c\) by forcing \(\sum_x p(x)=1\). Recognising the series (here \(\sum \lambda^i/i! = e^\lambda\)) names the distribution.
Expectation & variance of linear combinations
These rules are the engine of the estimator-theory chapter (bias/MSE) — learn them cold.
\[ E[aX+bY+c]=aE[X]+bE[Y]+c \quad(\text{always}). \] \[ \mathrm{Var}(aX+b)=a^2\mathrm{Var}(X), \qquad \mathrm{Var}(aX+bY)=a^2\mathrm{Var}(X)+b^2\mathrm{Var}(Y)+2ab\,\mathrm{Cov}(X,Y). \] If \(X,Y\) are independent then \(\mathrm{Cov}(X,Y)=0\), so \(\mathrm{Var}(aX+bY)=a^2\mathrm{Var}(X)+b^2\mathrm{Var}(Y)\). Expectation is linear whether or not \(X,Y\) are independent; variance needs the covariance term unless it is zero. An additive constant never changes the variance: \(\mathrm{Var}(Y+4)=\mathrm{Var}(Y)\).
The four key discrete models
| Model | When | pmf | Mean | Var |
|---|---|---|---|---|
| Bernoulli(p) | one yes/no trial | \(p,\,1-p\) | \(p\) | \(p(1-p)\) |
| Binomial(n,p) | # successes in \(n\) indep. trials | \(\binom{n}{k}p^k(1-p)^{n-k}\) | \(np\) | \(np(1-p)\) |
| Poisson(λ) | count of rare events / given a rate | \(e^{-\lambda}\lambda^k/k!\) | \(\lambda\) | \(\lambda\) |
| Geometric(p) | # trials until first success | \((1-p)^{k-1}p\) | \(1/p\) | \((1-p)/p^2\) |
| Discrete Unif. | \(m\) equally-likely values | \(1/m\) | midpoint | — |
Geometric on the exam: its pmf is usually given inside the problem (e.g. "draw with replacement until a black ball"); you just identify \(p\) and plug in. The MLE \(\hat\theta=1/\bar X\) is derived later (estimator chapter).
Poisson tell: "average number per week/page/area" and a count question → Poisson with \(\lambda\)=that average. \(P(X\ge1)=1-e^{-\lambda}\) is the most common ask.
Which model? — recognition guide
| If the wording says… | Model |
|---|---|
| "\(n\) independent trials, each succeeds w.p. \(p\); how many succeed" | Binomial\((n,p)\) |
| "average / mean number of [events] per [unit]", count question | Poisson\((\lambda)\) |
| "repeat until the first success", "number of draws needed" | Geometric\((p)\) |
| finitely many equally-likely outcomes | Discrete uniform |
| pmf given up to a constant \(c\) | normalise: \(\sum p(x)=1\) |
Expectation and variance from a frequency table
A company has 50 employees. The number of years \(X\) each has been with the company is distributed as: 12 employees → 1 year, 15 → 2 years, 16 → 3 years, 7 → 4 years. A worker is chosen at random. Find \(E[X]\) and \(\mathrm{Var}(X)\).
Turn frequencies into a pmf (divide by 50). Then \(E[X]=\sum x\,p(x)\), \(E[X^2]=\sum x^2 p(x)\), \(\mathrm{Var}=E[X^2]-(E[X])^2\).
\(P(X{=}1)=\tfrac{12}{50}\), \(P(X{=}2)=\tfrac{15}{50}\), \(P(X{=}3)=\tfrac{16}{50}\), \(P(X{=}4)=\tfrac{7}{50}\).
\( E[X]=\dfrac{12+30+48+28}{50}=\dfrac{118}{50}=2.36 \).
\( E[X^2]=\dfrac{1\cdot12+4\cdot15+9\cdot16+16\cdot7}{50}=\dfrac{12+60+144+112}{50}=\dfrac{328}{50}=6.56 \).
\( \mathrm{Var}(X)=6.56-2.36^2=6.56-5.5696=0.9904 \).
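The one-pass recipe (frequencies to pmf, then \(E[X]\), \(E[X^2]\), and the shortcut) looks like this in code. A sketch using exact fractions; the variable names are illustrative.

```python
from fractions import Fraction

counts = {1: 12, 2: 15, 3: 16, 4: 7}   # years -> number of employees
n = sum(counts.values())
pmf = {x: Fraction(c, n) for x, c in counts.items()}

mean = sum(x * p for x, p in pmf.items())
second_moment = sum(x * x * p for x, p in pmf.items())
variance = second_moment - mean ** 2

print(float(mean), float(second_moment), float(variance))  # 2.36 6.56 0.9904
```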
Defective bearings — Binomial
Each ball bearing is independently defective with probability 0.05. A sample of 5 is inspected. Find (a) the probability none are defective, (b) the probability two or more are defective.
\(X=\)#defective \(\sim\)Binomial\((5,0.05)\). For (b) use the complement \(P(X\ge2)=1-P(X{=}0)-P(X{=}1)\).
\( P(X=0)=\binom{5}{0}(0.05)^0(0.95)^5=0.95^5=0.7738 \).
\( P(X=1)=\binom{5}{1}(0.05)(0.95)^4=5\cdot0.05\cdot0.8145=0.2036 \).
\( P(X\ge2)=1-P(X{=}0)-P(X{=}1)=1-0.7738-0.2036=0.0226 \).
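The Binomial pmf needs only `math.comb`. A minimal sketch; `binom_pmf` is a name chosen here, not a library function.

```python
import math

def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

p_none = binom_pmf(0, 5, 0.05)
p_two_or_more = 1 - p_none - binom_pmf(1, 5, 0.05)
print(round(p_none, 4), round(p_two_or_more, 4))  # 0.7738 0.0226
```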
Highway accidents — Poisson
The average number of accidents per week on a highway is 1.2. Compute the probability that there is at least one accident this week.
"Average number per week" + count → Poisson\((\lambda=1.2)\). "At least one" → complement of zero.
\(X\sim\)Poisson\((1.2)\), \(P(X=k)=e^{-1.2}\,1.2^k/k!\).
\( P(X\ge1)=1-P(X=0)=1-e^{-1.2} \).
\( =1-0.3012=0.6988 \).
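The same complement trick in code, as a sketch with a hand-written `poisson_pmf` helper (an illustrative name):

```python
import math

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return math.exp(-lam) * lam**k / math.factorial(k)

p_at_least_one = 1 - poisson_pmf(0, 1.2)
print(round(p_at_least_one, 4))  # 0.6988
```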
Draw until a black ball — Geometric
An urn has \(N\) white and \(M\) black balls. Balls are drawn one at a time with replacement until a black ball appears. What is the probability that exactly \(n\) draws are needed?
Each draw is independent, success = black, \(p=M/(N+M)\). "Number of draws until first success" → Geometric: \(P(X=n)=(1-p)^{n-1}p\).
\( p=\dfrac{M}{N+M} \), so failure (white) has prob \(1-p=\dfrac{N}{N+M}\).
Need \(n-1\) whites then a black: \( P(X=n)=\left(\dfrac{N}{N+M}\right)^{n-1}\dfrac{M}{N+M} \).
\( P(X=n)=\dfrac{M\,N^{\,n-1}}{(N+M)^n},\quad n=1,2,3,\dots \)
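To try the formula on numbers, the sketch below uses a hypothetical urn with 3 white and 2 black balls; those counts are not from the exercise, which keeps \(N\) and \(M\) symbolic.

```python
from fractions import Fraction

def draws_until_black(n, whites, blacks):
    """P(exactly n draws are needed), drawing with replacement."""
    p = Fraction(blacks, whites + blacks)
    return (1 - p) ** (n - 1) * p

# Hypothetical urn: 3 white, 2 black.
probs = [draws_until_black(n, 3, 2) for n in range(1, 4)]
print(probs)  # [Fraction(2, 5), Fraction(6, 25), Fraction(18, 125)]
```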
Mean and variance of linear combinations
Given \(E[X]=5\), \(\mathrm{Var}(X)=2\), \(E[Y]=12\), \(\mathrm{Var}(Y)=1\), \(\mathrm{Cov}(X,Y)=3\), compute: (a) \(E[3X+4Y]\); (b) \(E[2+5Y+X]\); (c) \(\mathrm{Var}(3X+4Y)\); (d) \(\mathrm{Var}(2+5Y+X)\); (e) \(\mathrm{Var}(Y+4)\).
Expectation is linear (constants pass through). For variance use \(\mathrm{Var}(aX+bY)=a^2\mathrm{Var}(X)+b^2\mathrm{Var}(Y)+2ab\,\mathrm{Cov}(X,Y)\); additive constants vanish.
\(E[3X+4Y]=3(5)+4(12)=63\). \(E[2+5Y+X]=2+5(12)+5=67\).
\( \mathrm{Var}(3X+4Y)=9(2)+16(1)+2\cdot3\cdot4\cdot3=18+16+72=106 \).
\( \mathrm{Var}(2+5Y+X)=\mathrm{Var}(5Y+X)=25(1)+1(2)+2\cdot5\cdot1\cdot3=25+2+30=57 \).
\( \mathrm{Var}(Y+4)=\mathrm{Var}(Y)=1 \).
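The two rules are short enough to encode and reuse for parts (a) to (d). A sketch; the function names and argument order are choices made here.

```python
def var_linear(a, b, var_x, var_y, cov_xy):
    """Var(aX + bY + c); the additive constant c plays no role."""
    return a * a * var_x + b * b * var_y + 2 * a * b * cov_xy

def mean_linear(a, b, c, mean_x, mean_y):
    """E[aX + bY + c]."""
    return a * mean_x + b * mean_y + c

print(mean_linear(3, 4, 0, 5, 12), mean_linear(1, 5, 2, 5, 12))   # 63 67
print(var_linear(3, 4, 2, 1, 3), var_linear(1, 5, 2, 1, 3))       # 106 57
```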
Normalise a pmf given up to a constant
The pmf of \(X\) is \(p(i)=c\,\lambda^i/i!\) for \(i=0,1,2,\dots\) (\(\lambda>0\)). Find (a) \(c\) and \(P(X=0)\); (b) \(P(X>2)\). Hint: \(e^x=\sum_{i\ge0}x^i/i!\).
Make probabilities sum to 1; recognise the exponential series. The result is a named distribution.
\( \sum_{i\ge0}c\lambda^i/i!=c\,e^{\lambda}=1\Rightarrow c=e^{-\lambda} \). So \(X\sim\)Poisson\((\lambda)\).
\( P(X=0)=e^{-\lambda}\lambda^0/0!=e^{-\lambda} \).
\( P(X>2)=1-P(X{=}0)-P(X{=}1)-P(X{=}2)=1-e^{-\lambda}\left(1+\lambda+\tfrac{\lambda^2}{2}\right) \).
Normal Distribution
Standardize with Z, read Φ — tail probabilities, inverse-σ, percentiles, moments→probability
Standardize and read the table
If \(X\sim N(\mu,\sigma^2)\), convert to the standard normal \(Z\sim N(0,1)\) by
\[ Z=\frac{X-\mu}{\sigma}, \qquad P(X\le a)=\Phi\!\left(\frac{a-\mu}{\sigma}\right). \] Everything is one of: a tail probability, an inverse (recover \(\sigma\) or a percentile), or a comparison. Two identities you use every time:
- Symmetry: \( \Phi(-z)=1-\Phi(z) \), so \( P(X\ge a)=1-\Phi\!\big(\tfrac{a-\mu}{\sigma}\big)=\Phi\!\big(\tfrac{\mu-a}{\sigma}\big) \).
- Central band: \( P(\mu-d\le X\le \mu+d)=2\Phi\!\big(\tfrac{d}{\sigma}\big)-1 \).
Table values worth memorising: \(\Phi(0.5)=0.6915\), \(\Phi(1)=0.8413\), \(\Phi(0.33)=0.6293\) (the table entry used for \(z=\tfrac13\)); inverses \(\Phi^{-1}(0.975)=1.96\), \(\Phi^{-1}(0.75)=0.675\).
The four exam flavours
- Tail / comparison ("probability the return is at least 50", "which is more likely"): standardize, read \(\Phi\); for a comparison the bigger \(\Phi\big(\tfrac{\mu-a}{\sigma}\big)\) wins.
- Inverse-σ ("95% are between 140 and 160, find σ"): write the band as \(2\Phi(d/\sigma)-1=\text{coverage}\), solve \(d/\sigma=z\), then \(\sigma=d/z\).
- Percentiles ("25th pct = 3, 75th pct = 7"): \(\mu\) is the midpoint of symmetric percentiles; get \(\sigma\) from \(\Phi\big(\tfrac{x_p-\mu}{\sigma}\big)=p\).
- Moments→probability ("E[X]=2, E[X(X−1)]=6, find Var and P(X≤4)"): recover \(\sigma^2=E[X^2]-(E[X])^2\) using \(E[X(X-1)]=E[X^2]-E[X]\), then standardize.
Which flavour? — recognition guide
| Wording | Flavour | Key move |
|---|---|---|
| "probability ≥/≤ a", "which strategy/option is better" | tail / comparison | \(\Phi\big(\tfrac{\mu-a}{\sigma}\big)\), bigger wins |
| "X% are between a and b", find σ | inverse-σ | \(2\Phi(d/\sigma)-1=\) coverage |
| "the p-percentile is …", find μ and σ | percentiles | μ = midpoint; σ from \(\Phi^{-1}(p)\) |
| given E[X] and E[X²] or E[X(X−1)] | moments→prob | \(\sigma^2=E[X^2]-(E[X])^2\) |
Two strategies — compare normal tail probabilities
Strategy A yields a return \(\sim N(100,100^2)\); strategy B yields a return \(\sim N(60,30^2)\). If the return must be at least 50, which strategy should be chosen?
Compute \(P(R\ge50)=\Phi\big(\tfrac{\mu-50}{\sigma}\big)\) for each, then compare.
\( P(R_A\ge50)=P\big(Z\ge\tfrac{50-100}{100}\big)=P(Z\ge-\tfrac12)=\Phi(\tfrac12)=0.6915 \).
\( P(R_B\ge50)=P\big(Z\ge\tfrac{50-60}{30}\big)=P(Z\ge-\tfrac13)=\Phi(\tfrac13)\approx0.6293 \) (table entry at \(z=0.33\)).
\(0.6915>0.6293\), so A is more likely to clear the 50 threshold.
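Without a table, the comparison can be done with `statistics.NormalDist` from the Python standard library. A sketch; `p_at_least` is a helper name invented here.

```python
from statistics import NormalDist

def p_at_least(threshold, mu, sigma):
    """P(X >= threshold) for X ~ N(mu, sigma^2)."""
    return 1 - NormalDist(mu, sigma).cdf(threshold)

p_a = p_at_least(50, mu=100, sigma=100)
p_b = p_at_least(50, mu=60, sigma=30)
print(round(p_a, 4), round(p_b, 4))
```

The exact value for B comes out near 0.6306 rather than the table's 0.6293, because the table is read at \(z=0.33\) instead of \(z=\tfrac13\); the ranking of the two strategies is unaffected.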
Children's heights — recover σ (inverse problem)
Heights of 12-year-olds are \(N(150,\sigma^2)\) cm, and 95% lie between 140 and 160 cm. Find \(\sigma\).
The band 140–160 is symmetric about 150 with half-width \(d=10\). Use \(2\Phi(d/\sigma)-1=0.95\).
\( 0.95=P(140\le X\le160)=2\Phi\!\big(\tfrac{10}{\sigma}\big)-1 \).
\( \Phi(10/\sigma)=0.975\Rightarrow 10/\sigma=\Phi^{-1}(0.975)=1.96 \).
\( \sigma=10/1.96\approx5.10 \).
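The inverse step \(\Phi^{-1}\) is available as `NormalDist().inv_cdf` in the standard library. A sketch of the "coverage to \(\sigma\)" recipe, with a hypothetical helper name:

```python
from statistics import NormalDist

def sigma_from_central_band(half_width, coverage):
    """Solve 2*Phi(d/sigma) - 1 = coverage for sigma."""
    z = NormalDist().inv_cdf((1 + coverage) / 2)
    return half_width / z

print(round(sigma_from_central_band(10, 0.95), 2))  # 5.1
```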
From moments to a probability
A normal \(X\) has \(E[X]=2\) and \(E[X(X-1)]=6\). Determine \(\mathrm{Var}(X)\) and \(P(X\le4)\).
\(E[X(X-1)]=E[X^2]-E[X]\) gives \(E[X^2]\); then variance, then standardize.
\( 6=E[X(X-1)]=E[X^2]-E[X]=E[X^2]-2\Rightarrow E[X^2]=8 \).
\( \mathrm{Var}(X)=E[X^2]-(E[X])^2=8-4=4 \), so \(\sigma=2\).
\( P(X\le4)=P\big(Z\le\tfrac{4-2}{2}\big)=\Phi(1)=0.8413 \).
Quartiles → mean and standard deviation
A normal random variable has 25th percentile 3.0 and 75th percentile 7.0. Find its mean and standard deviation.
By symmetry μ is the midpoint of the two percentiles. For σ use \(\Phi\big(\tfrac{x_{0.75}-\mu}{\sigma}\big)=0.75\).
\( \mu=\tfrac{3+7}{2}=5 \) (midpoint of symmetric percentiles).
\( 0.75=P(X\le7)=\Phi\!\big(\tfrac{7-5}{\sigma}\big)=\Phi(2/\sigma)\Rightarrow 2/\sigma=\Phi^{-1}(0.75)=0.675 \).
\( \sigma=2/0.675\approx2.96 \).
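The midpoint-then-invert recipe for symmetric percentiles can be sketched as follows; the function name is made up, and it assumes the two percentiles are \(1-p\) and \(p\).

```python
from statistics import NormalDist

def normal_from_symmetric_percentiles(x_low, x_high, p_high):
    """Recover (mu, sigma) from the (1 - p_high) and p_high percentiles."""
    mu = (x_low + x_high) / 2
    sigma = (x_high - mu) / NormalDist().inv_cdf(p_high)
    return mu, sigma

mu, sigma = normal_from_symmetric_percentiles(3.0, 7.0, 0.75)
print(mu, round(sigma, 2))  # 5.0 2.97
```

The small gap from the hand answer (2.96) comes from using the unrounded \(\Phi^{-1}(0.75)\approx0.6745\) instead of the table value 0.675.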
Descriptive Statistics & Correlation
Weighted & grouped means, back-solving, augmenting a sample, reading dispersion, and the collinearity shortcut for correlation
Sample mean, variance, and the two sums you reconstruct
For a sample \(x_1,\dots,x_n\):
\[ \bar x=\frac1n\sum_i x_i, \qquad s^2=\frac{1}{n-1}\sum_i (x_i-\bar x)^2 \quad(\text{note the }n-1). \] Exam problems rarely give you the raw data; they give summaries and ask you to recover the two totals:
\[ \sum_i x_i = n\,\bar x, \qquad \sum_i (x_i-\bar x)^2 = (n-1)\,s^2. \] Once you hold those two sums you can add a point, merge groups, or back-solve a missing piece.
Weighted / grouped mean & back-solving
An overall mean is the weighted average of subgroup means, weights = subgroup proportions (or sizes):
\[ \bar x=\sum_g w_g\,\bar x_g, \qquad w_g=\frac{n_g}{\sum_h n_h}. \] Forward: proportions and group means given → plug in. Back-solve: the overall mean is given and a group mean / group size is unknown → set up one linear equation (or a 2×2 system if a difference between group means is also stated) and solve.
Typical back-solve: total grade \(=\) (class-A mean)·\(n_A+\)(class-B mean)·\((N-n_A)\), divided by \(N\), equals the stated overall mean → solve for \(n_A\).
Augmenting a sample with a new observation
Adding one point \(x_{n+1}\): recover \(\sum x_i=n\bar x\) and \(\sum(x_i-\bar x)^2=(n-1)s^2\), add the new value, recompute.
- New mean: \( \bar x_{new}=\dfrac{n\bar x + x_{n+1}}{n+1} \).
- If the new point equals the old mean, the mean is unchanged and the sum of squared deviations is unchanged; the variance still shifts because the divisor grows from \(n-1\) to \(n\). Example: \(n=12\), \(\bar x=10\), \(s^2=12\), add \(x_{13}=10\) → mean stays 10, SS stays \(132\), new \(s^2=132/12=11\).
Reading dispersion from a histogram (no computation)
When several histograms share the same mean and you must rank their standard deviations without computing: SD measures spread about the mean, i.e. the typical size of the deviation \(|x_i-\bar x|\) (formally, the square root of the average squared deviation).
- Mass concentrated near the centre (mound/bell shape) → smallest SD.
- Mass spread evenly (flat/uniform) → middle SD.
- Mass pushed to the extremes (U-shape / bimodal) → largest SD.
So for shapes {mound B, uniform A, U-shape C} with SDs {1.38, 1.71, 1.98}: B = 1.38 (smallest), A = 1.71, C = 1.98 (largest). No SD is computed from the histograms here; the argument is purely about where the mass sits.
Sample correlation — spot the line
The sample correlation coefficient is
\[ r=\frac{\sum_i (x_i-\bar x)(y_i-\bar y)}{\sqrt{\sum_i (x_i-\bar x)^2}\,\sqrt{\sum_i (y_i-\bar y)^2}}\in[-1,1]. \] Exam shortcut: if the points lie exactly on a straight line \(y=a+bx\) with \(b\ne0\), then \(r=\operatorname{sign}(b)\) (\(+1\) for positive slope, \(-1\) for negative) with no arithmetic. The hint "try to plot the points" is the giveaway. Only compute the full formula if the points are not perfectly collinear.
Which tool? — recognition guide
| Wording | Tool |
|---|---|
| subgroup proportions + group means → overall mean | weighted mean (forward) |
| overall mean given, a group mean or group size missing | back-solve linear eq / 2×2 system |
| "add a new observation", new mean/variance | recover \(\sum x\), \(\sum(x-\bar x)^2\); add; recompute |
| several histograms, same mean, "match the SDs" | dispersion reasoning (no computation) |
| few (x,y) pairs, "correlation", "plot the points" | collinear → \(r=\pm1\); else full formula |
Average salary across staff types
Staff is 10% type A, 70% type B, 20% type C, with average monthly salaries 1000, 2000, 3000 respectively. What is the average salary across all staff?
Weighted mean: proportions are the weights.
\( \bar x=0.1(1000)+0.7(2000)+0.2(3000)=100+1400+600=2100 \).
Augment a 12-point sample
Twelve numbers have sample mean 10 and sample variance 12. A new observation \(x_{13}=10\) is added. Find the mean and variance of the 13-point sample.
Recover \(\sum x_i=12\cdot10\) and \(\sum(x_i-10)^2=11\cdot12\). The new point equals the mean.
\( \sum_{i=1}^{12}x_i=12\cdot10=120 \); \( \sum_{i=1}^{12}(x_i-10)^2=11\cdot12=132 \).
\( \bar x_{13}=\dfrac{120+10}{13}=\dfrac{130}{13}=10 \) (unchanged).
New point adds \((10-10)^2=0\) to the SS, so \( s^2_{13}=\dfrac{132+0}{13-1}=\dfrac{132}{12}=11 \).
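The "recover the two sums, add, recompute" recipe works for any new value, not only one equal to the mean. The sketch below relies on the standard identity \(\sum(x_i-m')^2=\sum(x_i-m)^2+n(m-m')^2\) to move the sum of squares to the new mean \(m'\); that identity and the function name are additions not found in the text.

```python
def add_observation(n, mean, var, x_new):
    """Mean and sample variance (n-1 divisor) after appending x_new."""
    total = n * mean
    ss = (n - 1) * var                      # sum of squared deviations
    new_mean = (total + x_new) / (n + 1)
    # Shift the SS to the new mean, then add the new point's deviation.
    ss_new = ss + n * (mean - new_mean) ** 2 + (x_new - new_mean) ** 2
    return new_mean, ss_new / n             # new divisor is (n + 1) - 1 = n

print(add_observation(12, 10, 12, 10))  # (10.0, 11.0)
```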
Female vs male average grade — a 2×2 system
A class of 100 has overall average grade 25.0. It is 60% female, 40% male, and the female–male difference in averages is 1.0. Find the female and male average grades.
Let \(\bar x_F=1+\bar x_M\) and \(0.6\bar x_F+0.4\bar x_M=25\).
\( \bar x_F=1+\bar x_M \); weighted mean \(0.6\bar x_F+0.4\bar x_M=25\).
\( 0.6(1+\bar x_M)+0.4\bar x_M=25\Rightarrow 0.6+\bar x_M=25\Rightarrow \bar x_M=24.4 \).
\( \bar x_F=1+24.4=25.4 \).
Back-solve class sizes from a pooled mean
Two classes take the same test. Class A averages 7.2, class B averages 6.7, and the 50 students together average 6.9. How many students are in each class?
Let \(n_A\) be class A's size, \(50-n_A\) class B's. Pooled mean = total grades / 50 = 6.9.
\( \dfrac{7.2\,n_A+6.7(50-n_A)}{50}=6.9 \).
\( 7.2n_A+335-6.7n_A=345\Rightarrow 0.5\,n_A=10\Rightarrow n_A=20 \), so class B has \(50-20=30\) students.
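Solving the linear equation for \(n_A\) in general gives \(n_A=N(\bar x-\bar x_B)/(\bar x_A-\bar x_B)\). A sketch using exact decimals so the answer comes out as a whole number; `class_a_size` is a hypothetical helper.

```python
from fractions import Fraction

def class_a_size(mean_a, mean_b, pooled_mean, total_n):
    """Solve mean_a*n_a + mean_b*(total_n - n_a) = pooled_mean*total_n for n_a."""
    return total_n * (pooled_mean - mean_b) / (mean_a - mean_b)

# Exact decimals via Fraction to avoid floating-point noise.
n_a = class_a_size(Fraction("7.2"), Fraction("6.7"), Fraction("6.9"), 50)
print(n_a, 50 - n_a)  # 20 30
```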
Match standard deviations to histograms
Three samples have the same mean \(\bar x=3.5\) over the range 0–7, with standard deviations (in some order) \(1.38,\,1.71,\,1.98\). Sample A is a flat/uniform histogram; Sample B is mound-shaped (mass concentrated near the centre); Sample C is U-shaped (mass at the extremes). Match each SD to its sample — without computing.
SD measures spread about the mean. Rank by how far the typical observation sits from 3.5.
Sample B (mound) keeps observations closest to \(\bar x\) → smallest SD \(=1.38\).
Sample C (U-shape) pushes mass to the extremes → largest \((x-\bar x)^2\) → largest SD \(=1.98\).
Sample A (uniform) is in between → \(1.71\).
Correlation of five collinear points
Five pairs: \((1,3),(3,7),(5,11),(4,9),(2,5)\). What is the sample correlation coefficient? (Hint: plot the points.)
Check whether they lie on a line \(y=a+bx\). If so, \(r=\operatorname{sign}(b)\).
Each pair satisfies \(y=2x+1\): \(1\!\to\!3,\,2\!\to\!5,\,3\!\to\!7,\,4\!\to\!9,\,5\!\to\!11\). All five are exactly collinear.
Positive slope \(b=2>0\), perfect line → \(r=+1\). No computation needed.
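If you want to confirm the shortcut numerically, the correlation can be computed straight from its definition. A small Python sketch; the function name `pearson_r` is an illustrative choice:

```python
import math

def pearson_r(xs, ys):
    """Sample correlation coefficient computed from the definition."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

xs = [1, 3, 5, 4, 2]
ys = [3, 7, 11, 9, 5]
r = pearson_r(xs, ys)
print(r)
```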
CLT & Sampling Distributions
Sums and counts of many i.i.d. units go Normal — standardize the total and read Φ
The Central Limit Theorem
Take \(n\) independent, identically distributed units \(X_1,\dots,X_n\), each with mean \(\mu\) and variance \(\sigma^2\). For large \(n\) (rule of thumb \(n\ge30\)) their sum and mean are approximately Normal:
\[ S_n=\sum_{i=1}^n X_i \;\approx\; N\big(n\mu,\;n\sigma^2\big), \qquad \bar X \;\approx\; N\!\left(\mu,\;\frac{\sigma^2}{n}\right). \]The shape of the individual \(X_i\) does not matter — only \(\mu\), \(\sigma^2\), and \(n\). This is what lets you answer "probability that the total exceeds a threshold" without knowing the per-unit distribution.
The standardization template
Almost every CLT exam problem is: many i.i.d. units, per-unit \(\mu\) and \(\sigma\) given, asked the probability a total crosses a threshold \(s\). The recipe:
- Total parameters: mean \(n\mu\), variance \(n\sigma^2\), sd \(\sigma\sqrt n\).
- Standardize: \( Z=\dfrac{s-n\mu}{\sigma\sqrt n} \).
- Read \(\Phi\): \( P(S_n\le s)=\Phi(Z) \), \( P(S_n\ge s)=1-\Phi(Z) \).
Watch the direction ("sufficient to cover" / "exceed" → upper tail). A standardized value \(|Z|\gtrsim4\) means the probability is effectively 0 or 1.
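The three-step recipe can be sketched in a few lines of Python using the standard library's `statistics.NormalDist`. The function name `clt_upper_tail` and the inputs (36 units with per-unit mean 10 and sd 1, threshold 365) are illustrative:

```python
from math import sqrt
from statistics import NormalDist

def clt_upper_tail(n, mu, sigma, s):
    """Approximate P(S_n >= s) for a sum of n i.i.d. units via the CLT."""
    z = (s - n * mu) / (sigma * sqrt(n))   # standardize the total
    return z, 1 - NormalDist().cdf(z)      # upper tail: 1 - Phi(z)

z_val, p_val = clt_upper_tail(36, 10, 1, 365)
print(round(z_val, 2), round(p_val, 4))
```

For a lower-tail question, return `NormalDist().cdf(z)` instead; the standardization step is unchanged.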
Binomial → Normal approximation
A Binomial count is a sum of \(n\) independent Bernoulli variables, so for large \(n\) the CLT gives
\[ X\sim\mathrm{Bin}(n,p)\;\approx\;N\big(np,\;np(1-p)\big). \]Standardize with \(\mu=np\), \(\sigma=\sqrt{np(1-p)}\): \( P(X\le k)\approx\Phi\!\big(\tfrac{k-np}{\sqrt{np(1-p)}}\big) \).
Continuity correction. Because \(X\) is discrete, the more accurate version replaces \(k\) with \(k+\tfrac12\) (for \(\le\)) or \(k-\tfrac12\) (for \(\ge\)). Exams often say "without (half) continuity correction" — then use \(k\) as-is. Always check the wording.
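To see what the half correction buys, one can compare both normal approximations with the exact binomial probability. A hedged Python sketch for 100 fair-coin tosses and \(k=60\); the helper names are illustrative:

```python
from math import comb, sqrt
from statistics import NormalDist

def binom_cdf(k, n, p):
    """Exact P(X <= k) for X ~ Bin(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

def normal_approx_cdf(k, n, p, correction=False):
    """Normal approximation to P(X <= k), optionally with the +1/2 correction."""
    mean, sd = n * p, sqrt(n * p * (1 - p))
    cut = k + 0.5 if correction else k
    return NormalDist(mean, sd).cdf(cut)

exact = binom_cdf(60, 100, 0.5)
plain = normal_approx_cdf(60, 100, 0.5)
corrected = normal_approx_cdf(60, 100, 0.5, correction=True)
print(exact, plain, corrected)
```

In this case the corrected value lands closer to the exact probability than the uncorrected one, which is the point of the correction.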
Recognition guide
| Wording | Tool |
|---|---|
| "\(n\) (≥30) i.i.d. units, per-unit μ and σ, prob the TOTAL exceeds …" | CLT on the sum: \(N(n\mu,n\sigma^2)\) |
| "average of \(n\) measurements is within … of μ" | CLT on the mean: \(N(\mu,\sigma^2/n)\) |
| "coin thrown \(n\) times / count of successes", large \(n\), "normal approximation" | Binomial→Normal: \(N(np,np(1-p))\) |
| "without half correction" | use \(k\) as-is, no \(\pm\tfrac12\) |
Will 40 cans of paint be enough?
A can of paint covers on average 52 m² with sd 3 m². You must paint 2260 m². What is the approximate probability that 40 cans suffice?
"Sufficient" means the total coverage \(S_{40}=\sum X_i\ge2260\). Use CLT on the sum: \(N(40\mu,40\sigma^2)\).
Per can \(\mu=52\), \(\sigma^2=9\). Sum of 40: mean \(40\cdot52=2080\), variance \(40\cdot9=360\), sd \(\sqrt{360}\approx18.97\).
\( Z=\dfrac{2260-2080}{\sqrt{360}}=\dfrac{180}{18.97}\approx9.49 \).
\( P(S_{40}\ge2260)=1-\Phi(9.49)\approx0 \).
Do 36 batteries last a year?
A battery lasts on average 10 days with sd 1 day; lifetimes are i.i.d. and each expired battery is replaced. Find the approximate probability that the total lifetime of 36 batteries exceeds one year (365 days).
Total lifetime \(L=\sum_{i=1}^{36}X_i\). CLT: \(L\approx N(36\mu,36\sigma^2)\). Want \(P(L>365)\).
Per battery \(\mu=10\), \(\sigma^2=1\). For 36: mean \(360\), variance \(36\), sd \(6\).
\( Z=\dfrac{365-360}{6}=\dfrac{5}{6}\approx0.83 \).
\( P(L>365)=1-\Phi(5/6)\approx1-0.7977=0.2023 \).
100 coin tosses — normal approximation
A fair coin is thrown 100 times; \(X\) = number of Heads. Using the normal approximation without half-correction, compute \(P(X\le60)\).
\(X\sim\mathrm{Bin}(100,\tfrac12)\). Approximate by \(N(np,np(1-p))\). "Without half correction" → use 60 directly.
\( np=50 \), \( np(1-p)=25 \), so \(X\approx N(50,25)\), sd \(=5\).
\( Z=\dfrac{60-50}{5}=2 \).
\( P(X\le60)\approx\Phi(2)=0.9772 \).
Estimator Theory
Bias, variance, MSE, efficiency, unbiasedness — plus MLE and method-of-moments. The estimator-algebra slice appears in 5 of 7 exams.
The estimator vocabulary
An estimator \(T\) for a parameter \(\theta\) is a function of the sample. Three numbers grade it:
\[ \mathrm{Bias}(T)=E[T]-\theta, \qquad \mathrm{Var}(T), \qquad \mathrm{MSE}(T)=E[(T-\theta)^2]=\mathrm{Var}(T)+\mathrm{Bias}(T)^2. \]Unbiased means \(E[T]=\theta\) (Bias \(=0\)), and then \(\mathrm{MSE}=\mathrm{Var}\). More efficient = smaller MSE (smaller variance, among unbiased estimators).
The algebra engine (from the discrete-RV chapter): \(E\) is linear always; for independent pieces \( \mathrm{Var}(aU+bW)=a^2\mathrm{Var}(U)+b^2\mathrm{Var}(W) \). Population moments you reuse: Bernoulli \(E=p,\mathrm{Var}=p(1-p)\); Poisson \(E=\mathrm{Var}=\lambda\); sample mean of \(m\) obs \(\mathrm{Var}(\bar X_m)=\sigma^2/m\); and \(\sigma^2=E[X^2]-\mu^2\).
Unbiasedness & finding the constant
Weights that sum to 1 → automatically unbiased for the mean. If \(T=c\bar X_1+(1-c)\bar X_2\) then \(E[T]=\mu\) for every \(c\).
Find a constant. When asked "for what \(a\) is \(T\) unbiased for \(\sigma^2\)", set \(E[T]=\theta\) and solve. The recurring move is \(E[X_i^2]=\sigma^2+\mu^2\), and for independent \(X_i,X_j\), \(E[X_iX_j]=\mu^2\). Example: \(T=\frac1n\sum X_i^2+a\) with \(\mu\) known → \(E[T]=\sigma^2+\mu^2+a=\sigma^2\Rightarrow a=-\mu^2\).
Efficiency & minimum-MSE weighting
Compare efficiency: if all candidates are unbiased, compute each variance and pick the smallest.
Optimal weight. For \(M_c=c\bar X_1+(1-c)\bar X_2\) (independent), \(\mathrm{MSE}=c^2\mathrm{Var}_1+(1-c)^2\mathrm{Var}_2\); differentiate and set to 0. The minimiser is inverse-variance weighting
\[ c^*=\frac{1/\mathrm{Var}_1}{1/\mathrm{Var}_1+1/\mathrm{Var}_2}. \]With sample means of sizes \(n_1,n_2\) (\(\mathrm{Var}_i=\sigma^2/n_i\)) this becomes \(c^*=\dfrac{n_1}{n_1+n_2}\) — weight each sample by its size. Put more weight on the more precise (larger / lower-variance) sample.
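A quick numerical check of the inverse-variance rule, sketched in Python. The sample sizes 10 and 15 and the value \(\sigma^2=1\) are placeholders (the common \(\sigma^2\) cancels out of \(c^*\)), and the function names are illustrative:

```python
def mse_combined(c, var1, var2):
    """MSE of c*X1bar + (1-c)*X2bar for independent, unbiased pieces."""
    return c**2 * var1 + (1 - c) ** 2 * var2

def optimal_weight(var1, var2):
    """Inverse-variance weight on the first estimator."""
    return (1 / var1) / (1 / var1 + 1 / var2)

sigma2 = 1.0                               # placeholder value, cancels out of c*
var1, var2 = sigma2 / 10, sigma2 / 15
c_star = optimal_weight(var1, var2)

# Brute-force check on a grid of candidate weights in [0, 1]
grid = [i / 1000 for i in range(1001)]
c_grid = min(grid, key=lambda c: mse_combined(c, var1, var2))
print(c_star, c_grid)
```

Both routes should agree with \(n_1/(n_1+n_2)=10/25\).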
Nonlinear transforms break unbiasedness (Jensen)
If \(T\) is unbiased for \(\sigma^2\), is \(\sqrt T\) unbiased for \(\sigma\)? No. For a nonlinear \(g\), \(E[g(T)]\neq g(E[T])\) in general (Jensen's inequality). Since \(\sqrt{\cdot}\) is concave, \(E[\sqrt T]\le\sqrt{E[T]}=\sigma\), with strict inequality unless \(T\) is constant → \(\sqrt T\) is biased low for \(\sigma\). Whenever an exam takes a square root, log, or reciprocal of an unbiased estimator, the answer to "still unbiased?" is no.
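A small simulation makes the bias visible. This Python sketch assumes, purely for illustration, a normal population with \(\sigma=1\) and samples of size 5, and takes \(T=s^2\) (the usual unbiased sample variance); none of these choices come from the text:

```python
import math
import random

random.seed(0)
sigma, n, trials = 1.0, 5, 20000

total_var, total_sd = 0.0, 0.0
for _ in range(trials):
    xs = [random.gauss(0.0, sigma) for _ in range(n)]
    m = sum(xs) / n
    s2 = sum((x - m) ** 2 for x in xs) / (n - 1)   # unbiased for sigma^2
    total_var += s2
    total_sd += math.sqrt(s2)                      # square root of an unbiased estimator

mean_s2 = total_var / trials    # should sit near sigma^2 = 1
mean_s = total_sd / trials      # should sit visibly below sigma = 1
print(mean_s2, mean_s)
```

The average of \(s^2\) hovers around 1 while the average of \(s\) stays below 1, as Jensen's inequality predicts.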
Maximum likelihood (MLE)
Recipe: write the likelihood \(L(\theta)=\prod_i f(x_i\mid\theta)\), take \(\log\), differentiate, set to 0, solve.
\[ \ell(\theta)=\log L(\theta)=\sum_i \log f(x_i\mid\theta), \qquad \frac{d\ell}{d\theta}=0. \]Geometric example \(f(x\mid\theta)=\theta(1-\theta)^{x-1}\): \(L=\theta^n(1-\theta)^{\sum(x_i-1)}\), \(\ell=n\log\theta+\big(\sum(x_i-1)\big)\log(1-\theta)\), giving \(\hat\theta=\dfrac{n}{\sum x_i}=\dfrac1{\bar X}\). On the exam the pmf/pdf is given in the problem — you just run the recipe.
Method-of-moments (MoM)
Equate the population moment to the sample moment and solve for the parameter. For one parameter, set \(E[X]=\bar X\) (using the theoretical mean as a function of \(\theta\)) and invert.
Example density \(f(x)=2x/\theta^2\) on \([0,\theta]\): \(E[X]=\int_0^\theta x\frac{2x}{\theta^2}dx=\frac{2\theta}{3}\). Set \(\bar X=\frac{2\theta}{3}\Rightarrow \hat\theta_M=\frac{3}{2}\bar X\). It is unbiased here (\(E[\hat\theta_M]=\theta\)) with \(\mathrm{MSE}=\theta^2/(8n)\to0\) — consistent. MoM and MLE can differ; MoM is usually the quicker algebra.
Recognition guide
| Wording | Do |
|---|---|
| "which estimator is more efficient" (several unbiased) | compute each variance, smallest wins |
| "value of c that minimizes MSE" | differentiate MSE in c, or inverse-variance weight \(c^*\) |
| "determine constant a so T is unbiased for σ²/μ" | set \(E[T]=\theta\), use \(E[X^2]=\sigma^2+\mu^2\), solve |
| "is √T (or log, 1/T) unbiased" | No — Jensen, nonlinear ⇒ biased |
| "write the likelihood / find the MLE" | \(L=\prod f\) → log → differentiate → solve |
| "find the moments estimator" | \(E[X]=\bar X\) (as function of θ), invert |
| "Bias and MSE of T" | \(E[T]-\theta\); \(\mathrm{Var}(T)+\mathrm{Bias}^2\) |
Which of three unbiased estimators is most efficient?
\(\hat p\) is the success proportion in a Bernoulli sample of size \(n=9\); \(Y\) is one further independent observation. Consider \(T_1=\hat p\), \(T_2=\tfrac12\hat p+\tfrac12 Y\), \(T_3=\tfrac{9}{10}\hat p+\tfrac{1}{10}Y\). Which is most efficient?
Check all three are unbiased (they are), then compare variances. Use \(\mathrm{Var}(\hat p)=p(1-p)/9\), \(\mathrm{Var}(Y)=p(1-p)\).
Each is a combination of unbiased pieces with weights summing to 1, so \(E[T_i]=p\) and MSE = Var.
\(\mathrm{Var}(T_1)=\tfrac19 p(1-p)\). \(\mathrm{Var}(T_2)=\tfrac14\cdot\tfrac{p(1-p)}{9}+\tfrac14 p(1-p)=\tfrac{10}{36}p(1-p)\). \(\mathrm{Var}(T_3)=\tfrac{81}{100}\cdot\tfrac{p(1-p)}{9}+\tfrac1{100}p(1-p)=\tfrac{1}{10}p(1-p)\).
\(\tfrac1{10}=0.100<\tfrac19\approx0.111<\tfrac{10}{36}\approx0.278\), so \(T_3\) has the smallest variance and is the most efficient.
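The variance coefficients can be checked with exact fractions. A minimal Python sketch; `var_coeff` is an illustrative helper that returns the multiplier of \(p(1-p)\), assuming \(\hat p\) and \(Y\) are independent:

```python
from fractions import Fraction

def var_coeff(w_phat, w_y, n=9):
    """Coefficient of p(1-p) in Var(w_phat * p_hat + w_y * Y)."""
    return w_phat**2 * Fraction(1, n) + w_y**2

coeffs = {
    "T1": var_coeff(Fraction(1), Fraction(0)),
    "T2": var_coeff(Fraction(1, 2), Fraction(1, 2)),
    "T3": var_coeff(Fraction(9, 10), Fraction(1, 10)),
}
best = min(coeffs, key=coeffs.get)   # smallest variance among the unbiased candidates
print(coeffs, best)
```

\(T_3\) wins because its weights \(9/10\) and \(1/10\) are exactly the inverse-variance weights for a 9-observation mean and a single observation.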
Combine two sample means — unbiased c and optimal c
\(\bar X_{10}\) and \(\bar Y_{15}\) are means of independent samples of sizes 10 and 15 from a population with mean μ. Let \(M_c=c\bar X_{10}+(1-c)\bar Y_{15}\). (a) For which c is \(M_c\) unbiased? (b) Find the c minimizing MSE.
Weights sum to 1 → unbiased for all c. MSE \(=c^2\sigma^2/10+(1-c)^2\sigma^2/15\); differentiate.
\(E[M_c]=c\mu+(1-c)\mu=\mu\) for every c → unbiased for all c.
\(\mathrm{MSE}(M_c)=\mathrm{Var}(M_c)=\sigma^2\!\left(\tfrac{c^2}{10}+\tfrac{(1-c)^2}{15}\right)\).
\(\dfrac{d}{dc}=\sigma^2\!\left(\tfrac{2c}{10}-\tfrac{2(1-c)}{15}\right)=\dfrac{\sigma^2}{15}(5c-2)=0\Rightarrow c=\tfrac25\). (Matches \(c^*=\tfrac{n_1}{n_1+n_2}=\tfrac{10}{25}\).)
Constant for an unbiased variance estimator, and √T
Sample \((X_1,\dots,X_n)\) from a population with known mean \(\mu=3\) and unknown variance \(\sigma^2\). (a) Find \(a\) so that \(T=\frac1n\sum_{i=1}^n X_i^2+a\) is unbiased for \(\sigma^2\). (b) Is \(\sqrt T\) unbiased for \(\sigma\)?
(a) \(E[X_i^2]=\sigma^2+\mu^2=\sigma^2+9\). (b) Think about \(E[\sqrt T]\) vs \(\sqrt{E[T]}\).
\(E[T]=E[X_i^2]+a=(\sigma^2+9)+a\). Unbiased ⇒ \(\sigma^2+9+a=\sigma^2\Rightarrow a=-9\).
We'd need \(E[\sqrt T]=\sigma=\sqrt{E[T]}\). But \(\sqrt{\cdot}\) is concave, so by Jensen \(E[\sqrt T]<\sqrt{E[T]}=\sigma\) (strict unless T is constant). Hence \(\sqrt T\) is not unbiased: it underestimates \(\sigma\) on average.
Make a(X₁−Xₙ)² unbiased for σ²
Sample from a population with known mean \(\mu=3\), unknown variance \(\sigma^2\). Find the constant \(a\) so that \(T=a(X_1-X_n)^2\) is unbiased for \(\sigma^2\).
Expand the square; use \(E[X_i^2]=\sigma^2+9\) and \(E[X_1X_n]=\mu^2=9\) (independence).
\(E[(X_1-X_n)^2]=E[X_1^2]+E[X_n^2]-2E[X_1X_n]\).
\(=(\sigma^2+9)+(\sigma^2+9)-2(9)=2\sigma^2\).
\(E[T]=a\cdot2\sigma^2=\sigma^2\Rightarrow a=\tfrac12\).
Bias and MSE of a Poisson estimator
Sample \((X_1,\dots,X_n)\) from Poisson(λ). For \(T=\tfrac12\!\left(\dfrac{X_1+\cdots+X_{n-1}}{n-1}+X_n\right)\), determine the bias and the MSE.
Write \(T=\tfrac12\bar X_{n-1}+\tfrac12 X_n\). Poisson: \(E=\mathrm{Var}=\lambda\); \(\mathrm{Var}(\bar X_{n-1})=\lambda/(n-1)\).
\(E[T]=\tfrac12\lambda+\tfrac12\lambda=\lambda\Rightarrow\mathrm{Bias}=0\).
\(\mathrm{Var}(T)=\tfrac14\cdot\dfrac{\lambda}{n-1}+\tfrac14\lambda=\dfrac{\lambda}{4}\!\left(\dfrac{1}{n-1}+1\right)=\dfrac{\lambda}{4}\cdot\dfrac{n}{n-1}\).
Unbiased ⇒ \(\mathrm{MSE}=\mathrm{Var}=\dfrac{n\lambda}{4(n-1)}\).
Geometric MLE
Sample of \(n\) observations from \(f(x\mid\theta)=\theta(1-\theta)^{x-1}\), \(x=1,2,\dots\), \(\theta\in(0,1)\). (a) Write the likelihood. (b) Find the MLE. (c) For the sample \(3,5,1,2,4\), give the estimate.
\(L=\prod f\), then log, differentiate, set 0. The sum of exponents is \(\sum(x_i-1)\).
\(L(\theta)=\prod_{i=1}^n\theta(1-\theta)^{x_i-1}=\theta^n(1-\theta)^{\sum(x_i-1)}\).
\(\ell=n\log\theta+\big(\sum(x_i-1)\big)\log(1-\theta)\); \(\dfrac{d\ell}{d\theta}=\dfrac{n}{\theta}-\dfrac{\sum(x_i-1)}{1-\theta}=0\Rightarrow\hat\theta=\dfrac{n}{\sum x_i}=\dfrac1{\bar X}\).
\(\bar X=(3+5+1+2+4)/5=3\Rightarrow\hat\theta=1/3\).
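As a sanity check on the calculus, the closed-form estimate can be compared with a brute-force maximisation of the log-likelihood. A Python sketch; the grid search is only an illustrative check and not a recommended fitting method:

```python
import math

def log_likelihood(theta, xs):
    """Geometric log-likelihood: n*log(theta) + sum(x_i - 1)*log(1 - theta)."""
    n = len(xs)
    return n * math.log(theta) + sum(x - 1 for x in xs) * math.log(1 - theta)

xs = [3, 5, 1, 2, 4]
theta_closed = len(xs) / sum(xs)                  # n / sum(x_i) = 1 / x-bar

# Grid search over (0, 1) as an independent check
grid = [i / 1000 for i in range(1, 1000)]
theta_grid = max(grid, key=lambda t: log_likelihood(t, xs))
print(theta_closed, theta_grid)
```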
Method-of-moments for a triangular density
Sample from \(f(x;\theta)=\dfrac{2x}{\theta^2}\) for \(x\in[0,\theta]\) (0 otherwise), \(\theta>0\). (a) Find \(E[X]\) and \(\mathrm{Var}(X)\). (b) Find the method-of-moments estimator of \(\theta\). (c) Its bias and MSE; behaviour as \(n\to\infty\).
\(E[X]=\int_0^\theta x\frac{2x}{\theta^2}dx\). MoM: set \(\bar X=E[X]\) and invert.
\(E[X]=\dfrac{2}{\theta^2}\!\int_0^\theta x^2dx=\dfrac{2\theta}{3}\); \(E[X^2]=\dfrac{2}{\theta^2}\!\int_0^\theta x^3dx=\dfrac{\theta^2}{2}\); \(\mathrm{Var}(X)=\dfrac{\theta^2}{2}-\dfrac{4\theta^2}{9}=\dfrac{\theta^2}{18}\).
Set \(\bar X=E[X]=\dfrac{2\theta}{3}\Rightarrow\hat\theta_M=\dfrac32\bar X\).
\(E[\hat\theta_M]=\tfrac32\cdot\tfrac{2\theta}{3}=\theta\) → unbiased. \(\mathrm{Var}(\hat\theta_M)=\tfrac94\cdot\dfrac{\mathrm{Var}(X)}{n}=\tfrac94\cdot\dfrac{\theta^2}{18n}=\dfrac{\theta^2}{8n}\). So \(\mathrm{MSE}=\dfrac{\theta^2}{8n}\to0\).
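The unbiasedness and the \(\theta^2/(8n)\) formula can be checked by simulation. This Python sketch draws from the density by inverting its CDF \(F(x)=x^2/\theta^2\); the values \(\theta=2\), \(n=10\) and the number of replications are arbitrary illustrative choices:

```python
import math
import random

random.seed(1)
theta, n, reps = 2.0, 10, 20000

def draw(theta):
    """Inverse-CDF draw: F(x) = x^2 / theta^2 on [0, theta], so X = theta * sqrt(U)."""
    return theta * math.sqrt(random.random())

estimates = []
for _ in range(reps):
    xbar = sum(draw(theta) for _ in range(n)) / n
    estimates.append(1.5 * xbar)                  # method-of-moments estimate (3/2) * x-bar

mean_est = sum(estimates) / reps
mse_est = sum((e - theta) ** 2 for e in estimates) / reps
print(mean_est, mse_est, theta**2 / (8 * n))
```

The simulated mean should sit near \(\theta\) and the simulated MSE near \(\theta^2/(8n)=0.05\), up to Monte Carlo noise.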
Confidence Intervals & Sample Size
Build an interval, back-solve the sample size, recover n from a published CI, handle asymmetric tails — appears in all 7 past exams.
The CI machinery
Every confidence interval is estimate ± margin, the margin being a critical value times a standard error.
\[ \text{mean (known }\sigma\text{ / large }n): \quad \bar x\pm z_{\alpha/2}\,\frac{\sigma}{\sqrt n}\quad(\text{use }s\text{ if }\sigma\text{ unknown, large }n). \]\[ \text{proportion}: \quad \hat p\pm z_{\alpha/2}\,\sqrt{\frac{\hat p(1-\hat p)}{n}}. \]Critical values: \(z_{0.025}=1.96\) (95%), \(z_{0.005}=2.576\) (99%), \(z_{0.05}=1.645\) (90%). The total length of a symmetric CI is twice the margin, \(2z_{\alpha/2}\cdot\text{SE}\).
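Both intervals follow the same pattern, so they are easy to sketch in Python with `statistics.NormalDist` supplying the critical value. The function names and the inputs (mean 0.9, sd 0.2, \(n=100\), 99%) are illustrative:

```python
from math import sqrt
from statistics import NormalDist

def z_crit(conf):
    """Two-sided critical value z_{alpha/2} for confidence level conf."""
    return NormalDist().inv_cdf(1 - (1 - conf) / 2)

def mean_ci(xbar, sd, n, conf=0.95):
    """Large-sample (or known-sigma) z interval for a mean."""
    margin = z_crit(conf) * sd / sqrt(n)
    return xbar - margin, xbar + margin

def prop_ci(p_hat, n, conf=0.95):
    """Large-sample z interval for a proportion."""
    margin = z_crit(conf) * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

lo, hi = mean_ci(0.9, 0.2, 100, conf=0.99)
print(round(lo, 4), round(hi, 4))
```

`z_crit` returns unrounded values (for example about 1.95996 rather than 1.96), so results can differ from table-based hand calculations in the last digit.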
Sample size: back-solve from a target precision
Given a target on the interval (total length \(L\), or half-width / accuracy \(d=L/2\)), set the length condition and solve for \(n\), then round up.
\[ \text{mean}: \; 2z_{\alpha/2}\frac{\sigma}{\sqrt n}\le L \;\Rightarrow\; n\ge\left(\frac{2z_{\alpha/2}\sigma}{L}\right)^2. \]\[ \text{proportion}: \; n\ge\frac{z_{\alpha/2}^2\,p(1-p)}{d^2}, \quad\text{worst case }p(1-p)=\tfrac14. \]When \(\hat p\) is unknown (sample not yet taken), use the worst case \(p(1-p)\le\frac14\) — it guarantees the precision whatever the true \(p\).
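The two back-solve formulas translate directly into code. In this Python sketch the critical value is passed in as a table value so the result matches a hand calculation; the function names and the example inputs are illustrative:

```python
from math import ceil

def n_for_mean(z, sigma, length):
    """Smallest n with total CI length 2*z*sigma/sqrt(n) <= length."""
    return ceil((2 * z * sigma / length) ** 2)

def n_for_proportion(z, half_width, p=0.5):
    """Smallest n with margin z*sqrt(p(1-p)/n) <= half_width; p=0.5 is the worst case."""
    return ceil(z**2 * p * (1 - p) / half_width**2)

n_mean = n_for_mean(2.576, 3600, 1568)       # 99% interval, sigma = 3600, target length 1568
n_prop_95 = n_for_proportion(1.96, 0.05)     # 95%, total length 0.1 means half-width 0.05
n_prop_90 = n_for_proportion(1.645, 0.02)    # 90%, accurate to within 0.02
print(n_mean, n_prop_95, n_prop_90)
```

Because of the rounding up, swapping a table value such as 1.645 for an unrounded critical value can shift the answer by one, so state which \(z\) you used.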
Recover n from a published interval
If a poll reports "with \(C\%\) confidence, support is between \(a\) and \(b\)", read off \(\hat p=\tfrac{a+b}{2}\) and half-width \(d=\tfrac{b-a}{2}\), then invert the margin:
\[ z_{\alpha/2}\sqrt{\frac{\hat p(1-\hat p)}{n}}=d \;\Rightarrow\; n=\frac{z_{\alpha/2}^2\,\hat p(1-\hat p)}{d^2}. \]E.g. \((48\%,52\%)\) at 90%: \(\hat p=0.5\), \(d=0.02\), \(n=1.645^2(0.25)/0.02^2\approx1691\).
Asymmetric tails (non-standard split)
A "95% CI" normally splits \(\alpha=0.05\) as \(2.5\%\) per tail. If the problem asks for an unequal split — say \(3\%\) in the lower tail, \(2\%\) in the upper — use a different z on each side:
\[ \Big(\bar x - z_{0.03}\tfrac{\sigma}{\sqrt n}, \; \bar x + z_{0.02}\tfrac{\sigma}{\sqrt n}\Big), \quad z_{0.03}=1.88,\; z_{0.02}=2.055. \]The total tail probability still sums to \(\alpha\); only the per-side allocation changed, so the interval is no longer symmetric about \(\bar x\).
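An unequal split only changes which quantile is used on each side. A Python sketch under the known-\(\sigma\) assumption; the function name and the inputs (\(\bar x=3.5\), \(\sigma=1\), \(n=100\)) are illustrative:

```python
from math import sqrt
from statistics import NormalDist

def asymmetric_ci(xbar, sigma, n, lower_tail, upper_tail):
    """Known-sigma interval leaving lower_tail and upper_tail probability outside."""
    se = sigma / sqrt(n)
    z_low = NormalDist().inv_cdf(1 - lower_tail)   # z_{0.03} when lower_tail = 0.03
    z_up = NormalDist().inv_cdf(1 - upper_tail)    # z_{0.02} when upper_tail = 0.02
    return xbar - z_low * se, xbar + z_up * se

lo, hi = asymmetric_ci(3.5, 1.0, 100, lower_tail=0.03, upper_tail=0.02)
print(round(lo, 3), round(hi, 3))
```

Setting both tails to 0.025 recovers the usual symmetric 95% interval.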
z or t?
Default to z: known σ, or large \(n\) (≥30) so the CLT applies and \(s\) is a good plug-in. Use t\(_{n-1}\) only for a small sample from a normal population with unknown σ. For \(n=100\) the two barely differ (e.g. \(z_{0.025}=1.96\) vs \(t_{99,0.025}\approx1.98\)) and either is acceptable — the exams lean on z.
Recognition guide
| Wording | Do |
|---|---|
| "compute a C% CI for the mean/proportion" | estimate ± \(z_{\alpha/2}\)·SE |
| "how large a sample", "length less than L", "accurate to within ±d" | back-solve \(n\), round up; prop → use ¼ |
| "support between a% and b%, find n" | \(\hat p\)=mid, \(d\)=half-width, invert |
| "lower tail 3%, upper tail 2%" | different z each side (asymmetric) |
| small n, normal, σ unknown | t\(_{n-1}\); else z |
99% CI for mean toothbrush purchases
In a sample of \(n=100\), the number of toothbrushes bought per year has mean \(\bar x=0.9\) and sd \(s=0.2\). Compute an approximate 99% confidence interval for the population mean.
Large n → z CI with s. 99% → \(z_{0.005}=2.576\). SE \(=0.2/\sqrt{100}=0.02\).
\(z_{0.005}\,s/\sqrt n=2.576\cdot0.02=0.05152\).
\(0.9\pm0.05152\), i.e. approximately \((0.848,\,0.952)\).
Tire lifetimes — CI then required sample size
Tire lifetimes are normal with known \(\sigma=3600\) miles. A sample of \(n=81\) gave \(\bar x=28400\). (a) Build a 95% CI for the mean. (b) How large a sample gives a 99% CI shorter than the interval in (a)?
(a) \(z_{0.025}=1.96\), \(\sqrt{81}=9\). (b) Set the 99% length \(\le\) the (a) length, solve for n, round up.
\(1.96\cdot3600/9=1.96\cdot400=784\); CI \(=28400\pm784=(27616,29184)\), length \(1568\).
99% length \(=2\cdot2.576\cdot3600/\sqrt n=18547.2/\sqrt n\le1568\).
\(\sqrt n\ge18547.2/1568=11.83\Rightarrow n\ge139.9\Rightarrow n\ge140\).
Sample size for a proportion (length < 0.1)
Estimate the percentage of spaghetti eaters who use parmigiano. With \(\hat p\) unknown, what sample size gives a 95% CI of total length less than 0.1?
\(2\cdot1.96\sqrt{p(1-p)/n}<0.1\); use the worst case \(p(1-p)=\tfrac14\).
\(2\cdot1.96\sqrt{p(1-p)/n}<0.1\Rightarrow n>\dfrac{4\cdot1.96^2}{0.1^2}p(1-p)\).
Use \(p(1-p)=\tfrac14\): \(n>\dfrac{4\cdot3.8416}{0.01}\cdot\tfrac14=384.16\).
\(n\ge385\).
Super Bowl poll — sample size within ±0.02
How large a sample is needed to be 90% confident that the estimated proportion of households watching is accurate to within ±0.02?
Half-width \(d=0.02\), 90% → \(z=1.645\), worst case \(p(1-p)=\tfrac14\).
\(n\ge\dfrac{z_{0.05}^2\,p(1-p)}{d^2}=\dfrac{1.645^2\cdot\tfrac14}{0.02^2}\).
\(=\dfrac{2.706\cdot0.25}{0.0004}=1691.3\).
\(n\ge1692\).
Recover the sample size from a published CI
A poll states: "with 90% confidence, the minister's support is between 48% and 52%". Using the standard proportion-CI formula, how large was the sample?
\(\hat p=0.5\) (midpoint), half-width \(d=0.02\). Invert \(z\sqrt{\hat p(1-\hat p)/n}=d\).
\(\hat p=(0.48+0.52)/2=0.5\); margin \(=0.02\); 90% → \(z=1.645\).
\(1.645\sqrt{0.25/n}=0.02\Rightarrow n=\dfrac{1.645^2\cdot0.25}{0.02^2}=1691.3\), so the poll used about 1691 respondents.
Confidence interval with asymmetric tails
\(n=100\) observations, normal with known \(\sigma=1\), \(\bar x=3.5\). Build a 95% CI but with 3% probability in the lower tail and 2% in the upper tail.
Different z each side: \(z_{0.03}=1.88\) (lower), \(z_{0.02}=2.055\) (upper). \(\sigma/\sqrt n=0.1\).
\(3.5-1.88\cdot0.1=3.5-0.188=3.312\).
\(3.5+2.055\cdot0.1=3.5+0.2055=3.7055\).
Large-sample CI — z or t
\(n=100\) measurements give mean \(1.0\) and sample sd \(2.0\). Compute a 95% CI for the true value.
Large n → z is the standard choice (\(z_{0.025}=1.96\)); t with df 99 (≈1.98) gives almost the same answer. SE \(=2/\sqrt{100}=0.2\).
\(1\pm1.96\cdot0.2=1\pm0.392=(0.608,1.392)\).
Using \(t_{99,0.025}\approx1.98\): \(1\pm1.98\cdot0.2=1\pm0.396=(0.604,1.396)\).
Hypothesis Testing
One skeleton, five standard-error variants: one/two mean, one/two proportion, paired. Appears in all 7 past exams.
The test skeleton
Every test is the same four moves; only the standard error changes.
- State \(H_0,H_1\) — the direction comes from the wording (below).
- Pick the statistic and its null distribution: \( \text{TS}=\dfrac{\text{estimate}-\text{null}}{\text{SE}} \), \(\sim N(0,1)\) (large n / known σ) or \(t_{df}\) (small n, normal).
- Compute the observed value \(ts_{obs}\).
- Decide: reject if \(ts_{obs}\) falls in the rejection region — two-sided \(|ts|\ge z_{\alpha/2}\), one-sided \(ts\ge z_\alpha\) (or \(\le-z_\alpha\)). Or report the p-value and reject when \(\text{p-value}\le\alpha\).
Default to z (large samples / known σ); §8.6–8.7 proportion and Poisson tests are asymptotic-normal too. Use t only for small-n normal data (and the paired test).
The standard-error table
| Test | Statistic | Null dist. |
|---|---|---|
| 1-mean, known σ / large n | \((\bar x-\mu_0)/(\sigma/\sqrt n)\) | N(0,1) |
| 1-mean, small n | \((\bar x-\mu_0)/(s/\sqrt n)\) | \(t_{n-1}\) |
| paired | \((\bar d-0)/(s_d/\sqrt n)\) on the paired differences \(d_i\) (fix one sign convention, e.g. after−before, and state \(H_1\) to match) | \(t_{n-1}\) |
| 2-mean, large n | \((\bar x_1-\bar x_2)/\sqrt{s_1^2/n_1+s_2^2/n_2}\) | N(0,1) |
| 1-proportion | \((\hat p-p_0)/\sqrt{p_0(1-p_0)/n}\) | N(0,1) |
| 2-proportion (pooled) | \((\hat p_1-\hat p_2)/\sqrt{\hat p_p(1-\hat p_p)(1/n_1+1/n_2)}\) | N(0,1) |
Pooled proportion: \( \hat p_p=\dfrac{X_1+X_2}{n_1+n_2} \). Critical values: \(z_{0.025}=1.96\), \(z_{0.05}=1.645\); \(t_{9,0.05}=1.833\).
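Three of the z statistics from the table, sketched as Python functions. The names are illustrative, and the sample inputs are a 66-of-200 proportion against 0.30, two groups of 140, and two coins thrown 800 times each:

```python
from math import sqrt

def one_prop_z(x, n, p0):
    """1-proportion statistic, using the null standard error."""
    return (x / n - p0) / sqrt(p0 * (1 - p0) / n)

def two_mean_z(x1, s1, n1, x2, s2, n2):
    """Large-sample 2-mean statistic with unpooled variances."""
    return (x1 - x2) / sqrt(s1**2 / n1 + s2**2 / n2)

def two_prop_pooled_z(x1, n1, x2, n2):
    """2-proportion statistic using the pooled proportion under H0."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    return (p1 - p2) / sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))

z_prop = one_prop_z(66, 200, 0.30)
z_means = two_mean_z(105, 50, 140, 120, 60, 140)
z_coins = two_prop_pooled_z(430, 800, 400, 800)
print(round(z_prop, 2), round(z_means, 2), round(z_coins, 2))
```

Only the denominator changes from one function to the next, which is the sense in which the skeleton stays fixed while the standard error varies.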
One-sided or two-sided? Read the wording
- One-sided: "more than", "over 30%", "more effective", "improved", "larger mean" → \(H_1\) points one way; reject only in that tail with \(z_\alpha\) (1.645 at 5%).
- Two-sided: "changed", "differ significantly", "need to recalibrate", "is there a difference" → \(H_1\neq\); reject in both tails with \(z_{\alpha/2}\) (1.96 at 5%).
For a two-group test, set \(H_0:\) difference \(=0\). The direction of \(H_1\) decides which tail; e.g. "B more effective" with statistic \((\bar x_A-\bar x_B)/SE\) rejects for small (negative) values.
p-value & the threshold significance level
The p-value is the null-probability of a statistic at least as extreme as observed, in the direction of \(H_1\): one-sided \(P(Z\ge ts_{obs})\); two-sided \(2P(Z\ge|ts_{obs}|)\). Reject whenever \(\alpha\ge\) p-value.
"Find the significance levels at which \(H_0\) is rejected" is just the p-value: e.g. a one-sided \(ts=0.93\) gives \(P(Z\ge0.93)=1-\Phi(0.93)=1-0.8238=0.1762\), so reject for \(\alpha\ge17.62\%\). A \(ts=2.4\) gives p \(=1-\Phi(2.4)=0.0082\) → reject for \(\alpha\ge0.82\%\).
Recognition guide
| Wording | Test |
|---|---|
| one group vs a target μ₀, "recalibrate / changed" | 1-mean z (two-sided) |
| one group vs target proportion, "more than X%" | 1-prop z (one-sided) |
| two groups, "more effective / larger" | 2-mean z (one-sided) |
| two proportions, "significant difference" | 2-prop pooled z (two-sided) |
| same units measured before/after (Initial/Final) | paired t on differences |
| "at what significance levels reject" | compute the p-value |
Recalibrate the bottling machine? (1-mean, two-sided)
A machine should fill bottles to a mean of 750 g; content is normal with known \(\sigma=5\) g. A sample of \(n=25\) gives \(\bar x=745\) g. Is there reason to recalibrate (α = 0.05)?
"Recalibrate" = changed = two-sided. \(\text{TS}=(\bar x-750)/(\sigma/\sqrt n)\), reject if \(|ts|\ge1.96\).
\(H_0:\mu=750\) vs \(H_1:\mu\neq750\).
\(ts=\dfrac{745-750}{5/\sqrt{25}}=\dfrac{-5}{1}=-5\).
\(-5<-1.96\) → in the rejection region.
Over 30% smokers? (1-proportion, one-sided + p-value)
66 of 200 adults are smokers. (a) At α = 0.05, can we conclude more than 30% are smokers? (b) At which significance levels would \(H_0\) be rejected?
One-sided: \(H_0:p\le0.30\) vs \(H_1:p>0.30\). \(\text{TS}=(\hat p-0.3)/\sqrt{0.3\cdot0.7/n}\); reject if \(ts\ge1.645\). (b) is the p-value.
\(\hat p=66/200=0.33\); \(ts=\dfrac{0.33-0.3}{\sqrt{0.21/200}}=\dfrac{0.03}{0.0324}=0.93\).
\(0.93 < z_{0.05}=1.645\) → do not reject.
Reject when \(z_\alpha\le0.93\): \(1-\alpha\le\Phi(0.93)=0.8238\Rightarrow\alpha\ge0.1762\).
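The threshold significance level is just the p-value, which is simple to compute from \(\Phi\). A Python sketch; the function name `z_p_value` and its `alternative` argument are illustrative conventions:

```python
from statistics import NormalDist

def z_p_value(ts, alternative="greater"):
    """p-value of a z statistic; alternative is 'greater', 'less' or 'two-sided'."""
    phi = NormalDist().cdf
    if alternative == "greater":
        return 1 - phi(ts)
    if alternative == "less":
        return phi(ts)
    if alternative == "two-sided":
        return 2 * (1 - phi(abs(ts)))
    raise ValueError("alternative must be 'greater', 'less' or 'two-sided'")

p_smokers = z_p_value(0.93)                 # one-sided, upper tail
p_two_sided = z_p_value(-5, "two-sided")    # a far-out two-sided statistic
print(round(p_smokers, 4), p_two_sided)
```

\(H_0\) is rejected exactly for those \(\alpha\) at or above the returned value.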
Is treatment B more effective? (2-mean, one-sided)
Two groups of \(n=140\): A has \(\bar x_A=105\), \(s_A=50\); B has \(\bar x_B=120\), \(s_B=60\) (higher = more effective). Can we claim B is more effective (α = 0.05)?
\(H_1:\mu_B>\mu_A\), i.e. \(H_0:\mu_A-\mu_B\ge0\) vs \(H_1:\mu_A-\mu_B<0\). Statistic \((\bar x_A-\bar x_B)/\sqrt{s_A^2/n_A+s_B^2/n_B}\); reject if \(ts\le-1.645\).
\(\sqrt{2500/140+3600/140}=\sqrt{43.571}=6.601\).
\(ts=\dfrac{105-120}{6.601}=-2.27\).
\(-2.27<-1.645\) → reject.
Heavier bananas from provider 1? (2-mean, p-value)
Two providers, \(n=128\) each: provider 1 \(\bar x_1=155.0\), \(s_1=10\); provider 2 \(\bar x_2=152.0\), \(s_2=10\). Is the claim that provider 1's bananas are heavier justified?
\(H_0:\mu_1-\mu_2\le0\) vs \(H_1:\mu_1-\mu_2>0\). Compute \(ts\), then \(p=P(Z\ge ts)\).
\(ts=\dfrac{155-152}{\sqrt{100/128+100/128}}=\dfrac{3}{\sqrt{1.5625}}=\dfrac{3}{1.25}=2.4\).
\(P(Z\ge2.4)=1-\Phi(2.4)=0.0082\).
Reject for \(\alpha\ge0.82\%\); at 5% (and 1%) → reject.
Do two coins differ? (2-proportion, pooled)
Two coins are each thrown 800 times: A shows Heads 430 times, B shows Heads 400 times. Is there a significant difference in their Heads probabilities (α = 0.05)?
Two-sided 2-proportion test with pooled \(\hat p_p=(430+400)/1600\). Reject if \(|ts|\ge1.96\).
\(\hat p_p=830/1600=0.51875\); \(\hat p_A=0.5375\), \(\hat p_B=0.5\).
\(ts=\dfrac{0.5375-0.5}{\sqrt{0.51875\cdot0.48125\,(1/800+1/800)}}=\dfrac{0.0375}{0.02498}=1.50\).
\(1.50<1.96\) → do not reject.
Did training improve scores? (paired t)
Ten gamers' scores are recorded before (Initial) and after (Final) a week of training. The improvements (Final−Initial) are \(0.6,-0.7,0.4,-1.4,1.7,0.6,2.4,1.0,1.9,0.5\). Does the data support that training improved the average score (α = 0.05)?
Paired data → one-sample t on the differences. \(H_0:\mu\le0\) vs \(H_1:\mu>0\). Critical \(t_{9,0.05}=1.833\).
\(\bar d=0.7\), \(s_d=1.15\), \(n=10\).
\(ts=\dfrac{0.7-0}{1.15/\sqrt{10}}=\dfrac{0.7}{0.3637}=1.92\).
\(1.92>t_{9,0.05}=1.833\) → reject.
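The summary numbers \(\bar d\), \(s_d\) and the statistic can be reproduced from the ten differences. A Python sketch; the critical value is hard-coded from the table value quoted in the text, since the standard library has no t quantile function:

```python
from math import sqrt

diffs = [0.6, -0.7, 0.4, -1.4, 1.7, 0.6, 2.4, 1.0, 1.9, 0.5]   # Final - Initial
n = len(diffs)
d_bar = sum(diffs) / n
s_d = sqrt(sum((d - d_bar) ** 2 for d in diffs) / (n - 1))     # sample sd of the differences
ts = d_bar / (s_d / sqrt(n))

t_crit = 1.833            # t_{9, 0.05}, table value quoted in the text
reject = ts > t_crit      # one-sided test of H1: mean difference > 0
print(round(d_bar, 2), round(s_d, 2), round(ts, 2), reject)
```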
Has the average grade changed? (1-mean, two-sided)
Historical average grade is 23.5. A recent exam with \(n=100\) students had mean 25.0, sd 2.5. Can we conclude the average changed (α = 0.05)?
"Changed" = two-sided. Large n → z with s. Reject if \(|ts|\ge1.96\).
\(H_0:\mu=23.5\) vs \(H_1:\mu\neq23.5\).
\(ts=\dfrac{25.0-23.5}{2.5/\sqrt{100}}=\dfrac{1.5}{0.25}=6\).
\(6\gg1.96\) → reject.
Appendix — Low-ROI Topics
In the syllabus but never exam-tested: exponential, Chebyshev, the t/χ² distributions, variance inference — plus the out-of-scope chi-squared and F tests. Read once; don't over-invest.
Exponential distribution (memoryless)
In the program but its inference (§7.6 exp-CI) is skipped, and it never appeared on an exam. Know only the basics: \(f(x)=\lambda e^{-\lambda x}\) for \(x\ge0\), with \(E[X]=1/\lambda\), \(\mathrm{Var}(X)=1/\lambda^2\), and tail \(P(X>t)=e^{-\lambda t}\).
Memoryless property: \(P(X>s+t\mid X>s)=P(X>t)\) — a used component is as good as new. This is the one fact most likely to be asked.
Chebyshev's inequality & the Weak Law
Distribution-free bound on how far a variable strays from its mean:
\[ P\big(|X-\mu|\ge a\big)\le\frac{\sigma^2}{a^2}, \qquad\text{equivalently}\qquad P\big(|X-\mu|\ge k\sigma\big)\le\frac1{k^2}. \]It is loose (works for any distribution) but needs no normality. The Weak Law of Large Numbers follows: \(\bar X_n\to\mu\) in probability as \(n\to\infty\), since \(\mathrm{Var}(\bar X_n)=\sigma^2/n\to0\).
The t and χ² distributions
t\(_{df}\): bell-shaped, symmetric, heavier tails than the normal; \(df=n-1\). Used for the mean when \(\sigma\) is unknown and \(n\) is small; as \(df\to\infty\) it converges to \(N(0,1)\). On these exams it appears only once (the paired test).
χ²\(_{df}\): the distribution of a sum of squared standard normals; right-skewed, positive. It is the sampling distribution behind variance inference: \(\dfrac{(n-1)s^2}{\sigma^2}\sim\chi^2_{n-1}\).
Confidence interval & test for a variance
Author-light — plausibly examinable, never seen. Using \(\dfrac{(n-1)s^2}{\sigma^2}\sim\chi^2_{n-1}\):
\[ \text{CI for }\sigma^2:\quad \left(\frac{(n-1)s^2}{\chi^2_{n-1,\,\alpha/2}},\;\frac{(n-1)s^2}{\chi^2_{n-1,\,1-\alpha/2}}\right). \]Test \(H_0:\sigma^2=\sigma_0^2\): statistic \(\chi^2=\dfrac{(n-1)s^2}{\sigma_0^2}\), compared to \(\chi^2_{n-1}\) critical values. (Note the chi-square distribution is not symmetric, so you read two different critical values for the two tails. Here \(\chi^2_{n-1,\,\alpha/2}\) is the upper-tail critical value, matching the \(z_\alpha\) convention, which is why it gives the lower endpoint.)
Out of scope — recognise and move on
Chi-squared test of homogeneity / independence (contingency table). Appeared once (EBf1-P6, no solution provided) but it is Ross chapter 9, outside this program (ch4–8). If you ever see a contingency table: build expected counts \(E_{ij}=\dfrac{\text{row}_i\cdot\text{col}_j}{N}\), statistic \(\chi^2=\sum\dfrac{(O-E)^2}{E}\), \(df=(r-1)(c-1)\). Just a margin note — do not study deeply.
F-test (equality of two variances, §8.5.1). The F-distribution (§5.8.3) is explicitly skipped, so treat the F-test as out of scope — recognition only, never required on these papers.
A distribution-free bound (Chebyshev)
A variable has mean \(\mu=50\) and standard deviation \(\sigma=5\), with unknown distribution. Bound the probability that it differs from 50 by at least 15.
\(a=15=3\sigma\), so \(k=3\). Use \(P(|X-\mu|\ge k\sigma)\le1/k^2\).
\(a=15=3\sigma\Rightarrow k=3\).
\(P(|X-50|\ge15)\le\dfrac1{3^2}=\dfrac19\).
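To get a feel for how loose the bound is, compare it with the exact tail the variable would have if it happened to be normal. That normality is purely an assumption for comparison; the exercise itself makes none. A Python sketch with illustrative function names:

```python
from statistics import NormalDist

def chebyshev_bound(k):
    """Distribution-free upper bound on P(|X - mu| >= k*sigma)."""
    return 1 / k**2

def normal_two_sided_tail(k):
    """Exact P(|X - mu| >= k*sigma) if X were normal (assumption for comparison only)."""
    return 2 * (1 - NormalDist().cdf(k))

k = 15 / 5
print(chebyshev_bound(k), normal_two_sided_tail(k))
```

The bound of \(1/9\) is far above the normal tail at \(k=3\), which illustrates the price of assuming nothing about the distribution.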
Exponential tail & memorylessness
A component's lifetime is exponential with mean 10 hours. (a) Find \(P(X>20)\). (b) Given it has already lasted 30 hours, find \(P(X>50\mid X>30)\).
Mean \(=1/\lambda=10\Rightarrow\lambda=0.1\). Tail \(P(X>t)=e^{-\lambda t}\). Part (b) uses memorylessness.
\(P(X>20)=e^{-0.1\cdot20}=e^{-2}\approx0.1353\).
\(P(X>50\mid X>30)=P(X>20)=e^{-2}\approx0.1353\) — the extra 30 hours are forgotten.
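Memorylessness can be verified directly from the definition of conditional probability rather than taken on trust. A short Python sketch; `exp_tail` is an illustrative helper:

```python
from math import exp

def exp_tail(t, mean):
    """P(X > t) for an exponential lifetime with the given mean (rate = 1/mean)."""
    return exp(-t / mean)

p_20 = exp_tail(20, 10)
# P(X > 50 | X > 30) = P(X > 50) / P(X > 30), since {X > 50} is contained in {X > 30}
p_cond = exp_tail(50, 10) / exp_tail(30, 10)
print(p_20, p_cond)
```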
Cheatsheet & Decision Map
The open-book weapon: which procedure → which formula. Print this and bring it to the exam.
Which procedure? — top-level routing
| The question is about… | Go to |
|---|---|
| "chosen at random then observe", "given the result, prob it was…" | Bayes / total probability |
| two overlapping groups, "at least one", "not the other" | inclusion–exclusion |
| "choose k without replacement", "all different" | counting \(\binom{n}{k}\) |
| "normally distributed", a probability / percentile / unknown σ | standardize Z, Φ |
| weighted/grouped mean, add an observation, back-solve a size | descriptive (Σx=nx̄) |
| many i.i.d. units, prob the TOTAL exceeds a threshold | CLT: \(N(n\mu,n\sigma^2)\) |
| symbolic estimator T: unbiased? MSE? efficient? find a / c? | estimator theory |
| "write the likelihood / MLE"; "moments estimator" | MLE / method-of-moments |
| "confidence interval", "how large a sample" | CI / sample size |
| "is there reason to conclude / more than / changed" | hypothesis test |
Inference: which CI / which test?
Walk the questions in order:
- CI or test? "compute an interval / how large a sample" → CI. "is there reason / more than / changed" → test.
- Mean or proportion? averages/measurements → mean. percentages/counts of successes → proportion.
- One sample or two? one group vs a target → one-sample. two groups compared → two-sample. same units before/after → paired.
- z or t? known σ or large n (≥30) → z. small n, normal, unknown σ → t. (Proportions always z.)
- One- or two-sided? "more / over / improved" → one-sided \(z_\alpha\). "changed / differ" → two-sided \(z_{\alpha/2}\).
CI & test formula bank (T1 + T2 spine)
| Case | CI: estimate ± margin | Test statistic |
|---|---|---|
| 1-mean, σ known / large n | \(\bar x\pm z_{\alpha/2}\sigma/\sqrt n\) | \((\bar x-\mu_0)/(\sigma/\sqrt n)\) |
| 1-mean, small n | \(\bar x\pm t_{n-1,\alpha/2}s/\sqrt n\) | \((\bar x-\mu_0)/(s/\sqrt n)\sim t_{n-1}\) |
| paired | \(\bar d\pm t_{n-1,\alpha/2}s_d/\sqrt n\) | \(\bar d/(s_d/\sqrt n)\sim t_{n-1}\) |
| 2-mean, large n | — | \((\bar x_1-\bar x_2)/\sqrt{s_1^2/n_1+s_2^2/n_2}\) |
| 1-proportion | \(\hat p\pm z_{\alpha/2}\sqrt{\hat p(1-\hat p)/n}\) | \((\hat p-p_0)/\sqrt{p_0(1-p_0)/n}\) |
| 2-proportion (pooled) | — | \((\hat p_1-\hat p_2)/\sqrt{\hat p_p(1-\hat p_p)(1/n_1+1/n_2)}\) |
\(\hat p_p=\dfrac{X_1+X_2}{n_1+n_2}\) is the pooled proportion; \(ts\) is the observed test statistic. Reject: two-sided \(|ts|\ge z_{\alpha/2}\); one-sided \(ts\ge z_\alpha\) (or \(\le-z_\alpha\)). p-value: upper-tailed \(1-\Phi(ts)\), lower-tailed \(\Phi(ts)\), two-sided \(2(1-\Phi(|ts|))\); reject when p-value \(\le\alpha\).
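As a worked check of the 1-proportion row and the rejection rules, here is a minimal sketch using only Python's standard library (`statistics.NormalDist` supplies \(\Phi\) and its inverse). The function names and the data (60 successes in 100 trials, \(p_0=0.5\)) are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist

PHI = NormalDist().cdf  # standard normal CDF


def one_proportion_z(successes, n, p0, alpha=0.05, alternative="two-sided"):
    """Return (ts, p_value, reject) for H0: p = p0."""
    p_hat = successes / n
    # The test statistic uses p0 (not p_hat) in the standard error.
    ts = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
    if alternative == "two-sided":
        p_value = 2 * (1 - PHI(abs(ts)))
    elif alternative == "greater":
        p_value = 1 - PHI(ts)
    else:  # "less"
        p_value = PHI(ts)
    return ts, p_value, p_value <= alpha


def one_proportion_ci(successes, n, conf=0.95):
    """Return (lower, upper) for p_hat +/- z * sqrt(p_hat(1-p_hat)/n)."""
    p_hat = successes / n
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    margin = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin


ts, p_value, reject = one_proportion_z(60, 100, 0.5)
print(round(ts, 3), round(p_value, 4), reject)
print(one_proportion_ci(60, 100))
```

With these numbers \(ts=0.1/0.05=2\), the two-sided p-value is about 0.0455, and the 95% interval is roughly (0.504, 0.696). Note that the interval uses \(\hat p\) in the standard error while the test uses \(p_0\), exactly as the two columns of the table differ.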
Sample size (round UP)
| Target | Formula |
|---|---|
| mean, total length \(L\) | \(n\ge(2z_{\alpha/2}\sigma/L)^2\) |
| proportion, half-width \(d\) | \(n\ge z_{\alpha/2}^2\,p(1-p)/d^2\), worst case \(p(1-p)=\tfrac14\) |
| recover n from CI | \(\hat p\)=mid, \(d\)=half-width, \(n=z_{\alpha/2}^2\hat p(1-\hat p)/d^2\) |
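The first two rows of the sample-size table can be sketched as follows, assuming Python's standard library only. The helper names and the example inputs (\(\sigma=10\), total length 4, half-width 0.03) are hypothetical; `ceil` implements the "round UP" rule.

```python
from math import ceil
from statistics import NormalDist


def z_upper(tail):
    """Critical value z_tail, i.e. the point with `tail` area to its right."""
    return NormalDist().inv_cdf(1 - tail)


def n_for_mean(sigma, total_length, conf=0.95):
    """Smallest n with CI total length <= total_length (sigma known)."""
    z = z_upper((1 - conf) / 2)
    return ceil((2 * z * sigma / total_length) ** 2)


def n_for_proportion(half_width, conf=0.95, p=0.5):
    """Smallest n with half-width <= half_width; p=0.5 is the worst case."""
    z = z_upper((1 - conf) / 2)
    return ceil(z ** 2 * p * (1 - p) / half_width ** 2)


print(n_for_mean(sigma=10, total_length=4))   # 97
print(n_for_proportion(half_width=0.03))      # 1068
```

By hand with \(z_{0.025}=1.96\): \((2\cdot1.96\cdot10/4)^2=96.04\), rounded up to 97, and \(1.96^2\cdot0.25/0.03^2\approx1067.1\), rounded up to 1068.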
Critical values & Φ shortcuts
| Confidence / tail | z |
|---|---|
| 90% (two-sided) / 0.05 tail | \(z_{0.05}=1.645\) |
| 95% / 0.025 tail | \(z_{0.025}=1.96\) |
| 99% / 0.005 tail | \(z_{0.005}=2.576\) |
| 0.03 / 0.02 tails (asymmetric) | \(z_{0.03}=1.88,\;z_{0.02}=2.055\) |
| small-n paired | \(t_{9,0.05}=1.833\) |
Φ values: \(\Phi(0.5)=0.6915\), \(\Phi(1)=0.8413\), \(\Phi(2)=0.9772\), \(\Phi(2.4)=0.9918\), \(\Phi(1/3)\approx\Phi(0.33)=0.6293\) (table lookup). Symmetry \(\Phi(-z)=1-\Phi(z)\). Central band \(P(|X-\mu|\le d)=2\Phi(d/\sigma)-1\). Inverse: \(\Phi^{-1}(0.975)=1.96\), \(\Phi^{-1}(0.75)=0.675\). Standardize \(Z=(X-\mu)/\sigma\).
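If no table is at hand, the critical values and Φ shortcuts above can be reproduced with Python's standard library. This is a minimal sketch; `z_crit` and `central_band` are hypothetical helper names.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal: mean 0, standard deviation 1


def z_crit(tail):
    """Critical value with `tail` probability to its right."""
    return Z.inv_cdf(1 - tail)


def central_band(d, sigma):
    """P(|X - mu| <= d) for X ~ N(mu, sigma^2), via 2*Phi(d/sigma) - 1."""
    return 2 * Z.cdf(d / sigma) - 1


for tail in (0.05, 0.025, 0.005, 0.03, 0.02):
    print(f"z_{tail} = {z_crit(tail):.3f}")

print(f"Phi(1) = {Z.cdf(1):.4f}")
print(f"Phi(-1) = {Z.cdf(-1):.4f}  (equals 1 - Phi(1))")
print(f"P(|X - mu| <= sigma) = {central_band(1, 1):.4f}")
```

Computed values can differ from printed tables in the last digit: \(z_{0.03}\) and \(z_{0.02}\) come out near 1.881 and 2.054, while the sheet's 1.88 and 2.055 are table roundings.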
Estimator-theory recipes (T3 / T10)
- MSE: \(\mathrm{MSE}=\mathrm{Var}+\mathrm{Bias}^2\); Bias \(=E[T]-\theta\). Unbiased ⇒ MSE=Var.
- Var of combo: \(\mathrm{Var}(aU+bW)=a^2\mathrm{Var}(U)+b^2\mathrm{Var}(W)\) (independent).
- Unbiased weights: any \(c\bar X_1+(1-c)\bar X_2\) is unbiased for μ (weights sum to 1).
- Min-MSE weight: inverse-variance; for \(c\bar X_1+(1-c)\bar X_2\) with a common \(\sigma^2\), \(c^*=\dfrac{n_1}{n_1+n_2}\).
- Find constant: set \(E[T]=\theta\); use \(E[X^2]=\sigma^2+\mu^2\), \(E[X_iX_j]=\mu^2\) (independent, \(i\ne j\)).
- Jensen: \(\sqrt T\) (or log, 1/T) of an unbiased T is biased.
- MLE: \(L=\prod f\) → \(\log\) → \(d/d\theta=0\). Geometric (number of trials until the first success) → \(\hat\theta=1/\bar X\).
- Method-of-moments: set \(E[X]=\bar X\) (as a function of θ), invert.
- Moments: Bernoulli \(E{=}p,V{=}p(1{-}p)\); Poisson \(E{=}V{=}\lambda\); \(\mathrm{Var}(\bar X_m)=\sigma^2/m\).
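The variance-of-a-combination and min-MSE-weight recipes can be checked with exact fractions. This sketch assumes two independent sample means with a common \(\sigma^2\) (set to 1 here); the function names and the sample sizes 10 and 30 are illustrative only.

```python
from fractions import Fraction


def var_combo(c, n1, n2, sigma2=1):
    """Var(c*Xbar1 + (1-c)*Xbar2) for independent sample means.

    Uses Var(aU + bW) = a^2 Var(U) + b^2 Var(W) with Var(Xbar_m) = sigma2/m.
    """
    c = Fraction(c)
    s2 = Fraction(sigma2)
    return c ** 2 * s2 / n1 + (1 - c) ** 2 * s2 / n2


def best_weight(n1, n2):
    """Inverse-variance weight on Xbar1 when both samples share sigma^2."""
    return Fraction(n1, n1 + n2)


n1, n2 = 10, 30
c_star = best_weight(n1, n2)
print(c_star, var_combo(c_star, n1, n2))                  # 1/4 1/40
print(Fraction(1, 2), var_combo(Fraction(1, 2), n1, n2))  # 1/2 1/30
```

Because every weight \(c\) gives an unbiased estimator, MSE equals variance here, so the best weight is the one that minimizes the variance. At \(c^*\) the variance is \(\sigma^2/(n_1+n_2)\), the same as pooling both samples into one mean.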
Probability templates (T4 / counting / CLT)
- Total probability: \(P(E)=\sum_i P(E\mid H_i)P(H_i)\).
- Bayes: \(P(H_j\mid E)=\dfrac{P(E\mid H_j)P(H_j)}{\sum_i P(E\mid H_i)P(H_i)}\).
- Coin mixture: \(P(X=0\mid N=n)=(\tfrac12)^n\).
- Conditional in Bernoulli: disjoint arrangements ÷ binomial; \(p\) cancels.
- Inclusion–exclusion: \(P(B\cup C)=P(B)+P(C)-P(B\cap C)\); \(P(B\cap C^c)=P(B)-P(B\cap C)\).
- Counting: \(P=\#\text{fav}/\binom{n}{k}\); complement for "different", fix items for "included".
- CLT (sum): \(S_n\approx N(n\mu,n\sigma^2)\), \(Z=\dfrac{s-n\mu}{\sigma\sqrt n}\) (SUM uses \(\sigma\sqrt n\), not \(\sigma/\sqrt n\)).
- Binomial→Normal: \(N(np,np(1-p))\); skip \(\pm\tfrac12\) if "no continuity correction".
- Normal models→prob: \(\mathrm{Var}=E[X^2]-(E[X])^2\), \(E[X(X-1)]=E[X^2]-E[X]\).
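Two of the templates above, Bayes with total probability and the CLT for a sum, can be sketched with the standard library. The function names and the numbers (two equally likely hypotheses; 100 units with \(\mu=2\), \(\sigma=1\)) are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist


def bayes_posterior(priors, likelihoods):
    """Return P(H_j | E) for each j, given P(H_j) and P(E | H_j)."""
    # Denominator is the total probability P(E) = sum_i P(E|H_i) P(H_i).
    evidence = sum(p * l for p, l in zip(priors, likelihoods))
    return [p * l / evidence for p, l in zip(priors, likelihoods)]


def clt_sum_exceeds(threshold, n, mu, sigma):
    """Approximate P(S_n > threshold) with S_n ~ N(n*mu, n*sigma^2)."""
    # A SUM is standardized with sigma*sqrt(n), not sigma/sqrt(n).
    z = (threshold - n * mu) / (sigma * sqrt(n))
    return 1 - NormalDist().cdf(z)


print(bayes_posterior([0.5, 0.5], [0.9, 0.3]))
print(clt_sum_exceeds(threshold=210, n=100, mu=2, sigma=1))
```

In the first call \(P(E)=0.45+0.15=0.6\), so the posteriors are 0.75 and 0.25. In the second, \(Z=(210-200)/10=1\), so the probability is \(1-\Phi(1)\approx0.1587\).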