An exam companion

Statistics.

From probability to hypothesis testing — theory, flashcards, spaced repetition, and worked exercises, built to be read.


How to use this volume

  1. Begin with Theory. Every concept is introduced from first principles, with worked examples.
  2. Move to Flashcards for active recall. Click any card to flip, then mark HARD or GOT IT — your judgments persist.
  3. Attempt each Exercise on paper. Reveal the hint if stuck, then the full solution to check your reasoning.
  4. Navigate with . Focus search with /. Mark a problem solved to track your progress.
CH. 01 Statistics

Descriptive Statistics

From raw data to a clear summary: variables, tables, graphs, centre, spread, shape, and the relationship between two variables.

Sections 11
Flashcards 11
Exercises 8
Read time 7'
Sources Ross, Introductory Statistics 3E, ch. 1–3 · MACS lectures: Lezione Sett2 1–3, MACS01–02 · Past exams: MACSf1/2, MAIf3/5, EBf1, sampletest1
Key concepts
Population vs sample · Variable types · Frequency tables · Pie/bar/histogram · Mean/median/mode · Quartiles & percentiles · Variance & SD · Skewness & boxplots · Covariance & correlation · Sampling design

1 · What statistics is, and the basic vocabulary

"Statistics is the art of learning from data." The work splits in two: descriptive statistics — summarising the data you have with tables, graphs and numbers — and inferential statistics — drawing conclusions about a larger group from a sample. This chapter is the descriptive half, and it's the foundation for everything later.

Five words you'll use constantly:

  • Population: the whole collection you care about (e.g. all LUISS students).
  • Sample: the subset you actually observe (e.g. the students in one Statistics class).
  • Units (or subjects): the members — people, countries, days, objects.
  • Variable: the feature you record on each unit (age, height, grade).
  • Modalities: the possible values a variable can take (for age: 18, 19, 20, …).

Think of the data as a table: one row per unit, one column per variable. Everything in this chapter is a way of squeezing such a table into something you can actually read.

2 · Types of variable — this decides what you can do

Before computing anything, classify the variable — it dictates which summaries make sense.

  • Categorical (qualitative) — values are categories, not real magnitudes.
    • Nominal: categories with no order — hair colour, favourite team, laptop OS.
    • Ordinal: categories with a natural order — education level, hotel stars (1–5), film ratings (G, PG, …).
  • Quantitative (numerical) — values are genuine numbers.
    • Discrete: countable values — number of goals, exam grade.
    • Continuous: any value in a range — height, time, the length of an episode.

Careful: a number used as a label is still categorical. Hotel stars run 1–5 but they're ordered categories, not a measured quantity. You wouldn't average shirt numbers on a football team.

Why it matters: for a categorical variable the useful summary is "what fraction falls in each category"; for a quantitative one it's "where is the centre and how spread out is it".

3 · Frequency tables

The first summary: count how often each value appears.

  • Absolute frequency \(n_i\): how many units have value \(v_i\) (just count).
  • Relative frequency \(f_i=n_i/n\): the proportion — these sum to 1.
  • Cumulative relative frequency \(F_i=f_1+\dots+f_i\): the proportion of units up to value \(v_i\) (only for ordered data).

Course example — favourite Thanksgiving pie (2238 US adults, 2015): Pumpkin 729 (\(f=0.33\)), Apple 514 (0.23), Pecan 342 (0.15), … summing to 1. A wall of 2238 raw answers tells you nothing; the table tells you Pumpkin wins.

Cumulative example — 1000 customers' satisfaction (ordinal): Very unhappy 0.315, Quite unhappy 0.123 → \(F=0.438\) are unhappy; the rest are neutral-or-better. Cumulative frequencies answer "what proportion is at most this level?"

For a continuous variable every value is basically unique, so you group into classes (intervals) like \([0,10),[10,20),\dots\) and count per class (classes needn't be equal width).
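A minimal Python sketch of the three frequencies defined above. The ten answers are made up for illustration (they are not the course data), and the names `levels`, `answers` and `table` are assumptions of this sketch.

```python
from collections import Counter

# Hypothetical ordinal answers, not the course dataset.
levels = ["unhappy", "neutral", "happy"]      # modalities, in their natural order
answers = ["happy", "neutral", "unhappy", "happy", "happy",
           "neutral", "unhappy", "happy", "neutral", "happy"]

n = len(answers)
counts = Counter(answers)                      # absolute frequencies n_i

cumulative = 0.0
table = []
for level in levels:
    n_i = counts[level]
    f_i = n_i / n                              # relative frequency f_i = n_i / n
    cumulative += f_i                          # cumulative relative frequency F_i
    table.append((level, n_i, f_i, round(cumulative, 10)))

for row in table:
    print(row)
```

The loop walks the modalities in order, which is why the cumulative column only makes sense for ordered data.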

4 · Graphs: pie, bar, histogram

Pick the graph to match the variable type:

  • Pie chart (categorical): each category is a slice; slice angle \(=360^\circ\cdot f_i\).
  • Bar plot (categorical or discrete): each value is a bar whose height = its frequency.
  • Histogram (continuous, grouped): each class is a bar whose area = its frequency. This is the key difference from a bar plot — with unequal class widths, area (not height) carries the count, so a wide class isn't made to look bigger than it is.

A histogram reveals the data's shape at a glance: where the values pile up, whether it's symmetric, gaps, and outliers. When comparing two groups of different sizes, switch to relative frequencies first so the comparison is fair.

5 · Where is the centre? Mode, median, mean

Three ways to say "typical value":

  • Mode: the most frequent value (can be more than one). For grouped data, the modal class = tallest histogram bar. Works for any variable type.
  • Median: the middle of the ordered data. Order the values; if \(n\) is odd it's the middle one, if even the average of the two middle ones. From a frequency table, it is the smallest value with \(F_i\ge0.5\) (if some \(F_i\) equals exactly 0.5, average that value with the next one). It depends only on the order of the values, not on how extreme they are.
  • Mean \(\bar x=\dfrac1n\sum_i x_i\): the arithmetic average. From a frequency table it's the weighted form \(\bar x=\sum_j v_j f_j\); for grouped data, use class midpoints \(c_j\): \(\bar x\approx\sum_j c_j f_j\).

Mean vs median: the robustness lesson. The mean uses every value, so a few extreme ones drag it; the median doesn't budge. Pets per family in a building: values mostly 0–2 but a couple of 16–17 → mean \(\approx3.05\), median \(=1\). The median better reflects "a typical family". When the data is symmetric, mean and median coincide; a clear gap between them signals skew.
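To see the robustness lesson in numbers, here is a small sketch using Python's standard `statistics` module. The list `pets` is a hypothetical dataset invented for this example, not the building data quoted above.

```python
import statistics

# Hypothetical pets-per-family counts (not the course data): one extreme family.
pets = [0, 0, 0, 1, 1, 1, 1, 2, 2, 16]

print("mean  ", statistics.mean(pets))
print("median", statistics.median(pets))
print("mode  ", statistics.mode(pets))

# Drop the extreme family and recompute: the mean moves a lot, the median does not.
trimmed = [x for x in pets if x < 10]
print("mean without outlier  ", statistics.mean(trimmed))
print("median without outlier", statistics.median(trimmed))
```

One family with 16 pets is enough to pull the mean above every other value in the list, while the median stays at 1 with or without it.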

6 · Percentiles & quartiles

The 100p-th percentile is a value with at least a fraction \(p\) of the data at or below it and at least a fraction \(1-p\) at or above it. Rule to locate it in the ordered sample of size \(n\):

  • if \(np\) is not an integer → take the value at position \(\lceil np\rceil\) (round up);
  • if \(np\) is an integer → average the values at positions \(np\) and \(np+1\).

Quartiles are the 25th, 50th (=median) and 75th percentiles: \(Q_1,Q_2,Q_3\). They cut the data into four equal quarters.

Worked (18 bowling scores, ordered). \(Q_1\): \(0.25\cdot18=4.5\to\) 5th value \(=145\). Median: \(0.5\cdot18=9\to\) average of 9th & 10th \(=159.5\). \(Q_3\): \(0.75\cdot18=13.5\to\) 14th value \(=177\). (Exercise 2 below walks this through.)
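The locating rule translates almost line by line into code. This is a sketch, assuming \(0<p<1\) and an already sorted list; the function name `percentile` is my own. Library routines such as `statistics.quantiles` use interpolation conventions that can give slightly different values, so they are not a drop-in check for this rule.

```python
import math

def percentile(sorted_values, p):
    """100p-th percentile by the course rule (positions are 1-based), for 0 < p < 1."""
    n = len(sorted_values)
    position = n * p
    if position.is_integer():
        # np is an integer: average the values at positions np and np + 1.
        k = int(position)
        return (sorted_values[k - 1] + sorted_values[k]) / 2
    # np is not an integer: round the position up.
    k = math.ceil(position)
    return sorted_values[k - 1]

# The 18 ordered bowling scores from Exercise 2.
scores = [122, 126, 133, 140, 145, 145, 149, 150, 157,
          162, 166, 175, 177, 177, 183, 188, 199, 212]

q1, median, q3 = (percentile(scores, p) for p in (0.25, 0.5, 0.75))
print(q1, median, q3, "IQR =", q3 - q1)
```

Because `n * p` is a floating-point product, a value of \(p\) like 0.1 can occasionally land a hair off an integer; exact fractions would avoid that, at the cost of a slightly longer sketch.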

7 · How spread out? Range, variance, standard deviation

Centre isn't enough — two datasets can share a mean yet look completely different. Measures of spread:

  • Range \(=x_{(n)}-x_{(1)}\) (max − min). Interquartile range \(\mathrm{IQR}=Q_3-Q_1\) — the width holding the middle 50%, unaffected by outliers.
  • Variance. Natural idea: average the distances from the mean. But \(\sum_i(x_i-\bar x)=0\) always (positives cancel negatives), so we square the deviations first:
\[ s^2=\frac{1}{n-1}\sum_{i=1}^n (x_i-\bar x)^2 \quad=\quad \frac{1}{n-1}\Big(\sum_i x_i^2-n\bar x^2\Big). \]

The two forms are equal; the right one is faster by hand. The divisor is \(n-1\), not \(n\) — "for technical reasons that become clear in the estimation chapter" (it makes \(s^2\) unbiased; you'll prove it later).

  • Standard deviation \(s=\sqrt{s^2}\): same units as the data, so it's the interpretable one.

Worked (first 10 of the course's exam grades): 25,16,30,24,24,21,30,27,24,30. \(\sum x_i=251\Rightarrow\bar x=25.1\); \(\sum x_i^2=6479\); \(s^2=\dfrac{6479-10(25.1)^2}{9}=\dfrac{178.9}{9}\approx19.9\), so \(s\approx4.46\). (Exercise 1.)
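The same computation as a short Python check, showing that the definition and the fast form agree. Variable names are my own; `statistics.variance` is the standard-library sample variance, which also divides by \(n-1\).

```python
import math
import statistics

grades = [25, 16, 30, 24, 24, 21, 30, 27, 24, 30]
n = len(grades)
mean = sum(grades) / n

# Definition: squared deviations from the mean, divided by n - 1.
var_def = sum((x - mean) ** 2 for x in grades) / (n - 1)

# Fast form: sum of squares minus n * mean^2, divided by n - 1.
var_fast = (sum(x * x for x in grades) - n * mean ** 2) / (n - 1)

sd = math.sqrt(var_def)

print(mean, round(var_def, 3), round(var_fast, 3), round(sd, 3))
print(statistics.variance(grades))   # library cross-check
```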

Chebyshev's inequality ties spread to concentration for any shape: at most \(1/k^2\) of the data lies more than \(k\) standard deviations from the mean (so at most \(1/4\) lies outside \(\bar x\pm2s\)).
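A quick way to convince yourself: count the share of points outside \(\bar x\pm ks\) for the ten grades above and compare it with \(1/k^2\). This is only an illustration on one small dataset, not a proof; the helper `fraction_outside` is a name chosen for this sketch.

```python
import math

grades = [25, 16, 30, 24, 24, 21, 30, 27, 24, 30]
n = len(grades)
mean = sum(grades) / n
s = math.sqrt(sum((x - mean) ** 2 for x in grades) / (n - 1))

def fraction_outside(k):
    """Share of the data lying more than k standard deviations from the mean."""
    return sum(abs(x - mean) > k * s for x in grades) / n

for k in (1.5, 2, 3):
    print(k, fraction_outside(k), "bound:", round(1 / k ** 2, 3))
```

The observed shares sit well below the bound, which is typical: Chebyshev holds for any shape, so it is rarely tight.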

8 · Shape: symmetric, skewed, normal

The histogram's shape matters as much as its centre.

  • Approximately normal: bell-shaped — tallest in the middle, symmetric, tapering on both sides. Lots of real data looks like this.
  • Skewed: one long tail. Right-skewed = long tail to the right (income, pets-per-family); left-skewed = long tail to the left. A quick tell: if mean and median differ noticeably, the data is skewed (the mean is pulled toward the long tail).

A boxplot draws the five-number summary (min, \(Q_1\), median, \(Q_3\), max) — the box spans \(Q_1\) to \(Q_3\) with a line at the median — and is the easiest way to compare groups or spot skew (median off-centre in the box).

For approximately normal data, the Empirical Rule: about 68% of values lie within \(\bar x\pm s\), 95% within \(\bar x\pm2s\), 99.7% within \(\bar x\pm3s\). (This is the bridge to the Normal distribution chapter.)

9 · Two variables together: covariance & correlation

Often you want to know whether two variables move together. Plot each unit as a point \((x_i,y_i)\) — a scatterplot — and look.

Covariance measures the direction of a linear relationship:

\[ \mathrm{cov}_{x,y}=\frac{1}{n-1}\sum_i (x_i-\bar x)(y_i-\bar y). \]

When big-\(x\) goes with big-\(y\), the products are positive → positive covariance; opposite movement → negative. But its size depends on the units, so you can't tell "strong" from "weak". Fix that by rescaling into the correlation coefficient:

\[ r_{x,y}=\frac{\mathrm{cov}_{x,y}}{s_x\,s_y}\in[-1,1]. \]

\(r\) is unit-free. \(|r|\) near 1 = strong linear relationship (\(r=+1\) or \(-1\) means the points lie exactly on a line); \(r\) near 0 = weak linear relationship. Course example: LSD dose vs maths score gives \(r\approx-0.93\) — strong negative.

Two warnings. (1) \(r\) only sees linear structure — a perfect parabola can give \(r\approx0\), so "\(r\approx0\)" doesn't mean "unrelated". (2) Correlation is not causation: deaths from falling out of bed correlate \(0.96\) with the number of lawyers in Puerto Rico — a coincidence, or a lurking third variable, not cause and effect.
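Both formulas, and warning (1), fit in a few lines of Python. The first dataset is the five collinear pairs from Exercise 8; the parabola points are made up for the illustration, and the function name `correlation` is my own.

```python
import math

def correlation(xs, ys):
    """Sample correlation r = cov / (s_x * s_y), with n - 1 divisors throughout."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    s_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    return cov / (s_x * s_y)

# Points on the line y = 2x + 1 (Exercise 8).
line_x = [1, 3, 5, 4, 2]
line_y = [3, 7, 11, 9, 5]

# Hypothetical points on the parabola y = x^2, symmetric around 0.
par_x = [-2, -1, 0, 1, 2]
par_y = [x * x for x in par_x]

r_line = correlation(line_x, line_y)
r_parabola = correlation(par_x, par_y)
print(round(r_line, 6), round(r_parabola, 6))
```

The parabola is a perfect relationship, yet the positive and negative products cancel and \(r\) comes out at zero.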

10 · Where the data comes from: sampling design

Summaries are only as good as the sample. A simple random sample picks units so that every possible group of that size is equally likely — the best general safeguard for representativeness. (Sampling the first 100 people entering a library to estimate a city's average age fails: libraries over-represent students and retirees.)

A stratified random sample first splits the population into groups (strata) and samples each in proportion to its size. If customers are 70% teenagers and 30% adults, draw 70% of the sample from teenagers and 30% from adults, which guarantees both groups appear. Stratification helps most when the strata genuinely differ on what you're measuring.
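The two designs can be sketched with the standard `random` module. Everything here is hypothetical: a population of 700 teenagers and 300 adults, the function names, and the fixed seed (used only to make the run repeatable).

```python
import random

random.seed(0)  # assumed seed, only for reproducibility

# Hypothetical population: (stratum label, id) pairs in a 70%/30% split.
population = [("teen", i) for i in range(700)] + [("adult", i) for i in range(300)]

def simple_random_sample(units, size):
    """Every group of `size` units is equally likely."""
    return random.sample(units, size)

def stratified_sample(units, size):
    """Sample each stratum in proportion to its share of the population."""
    strata = {}
    for unit in units:
        strata.setdefault(unit[0], []).append(unit)
    sample = []
    for members in strata.values():
        share = round(size * len(members) / len(units))
        sample.extend(random.sample(members, share))
    return sample

srs = simple_random_sample(population, 100)
strat = stratified_sample(population, 100)
print("teens in simple random sample:", sum(u[0] == "teen" for u in srs))
print("teens in stratified sample:   ", sum(u[0] == "teen" for u in strat))
```

The simple random sample lands near 70 teenagers but varies from run to run; the stratified one hits 70 by construction. With awkward proportions the rounded shares may not add up exactly to the requested size, which this sketch does not handle.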

Finally, the distinction that drives all of inference: a parameter is a (usually unknown) number describing the whole population; a sample statistic (like \(\bar x\) or \(s\)) is what you compute from the sample to estimate it. Everything from chapter 8 onward is about going from statistic back to parameter.

11 · What the exam asks from this chapter

Descriptive questions on past papers cluster into three recognisable types:

Wording | Type | Tool
group proportions/sizes + group means; overall mean given or asked; "back out a subgroup/size"; "add a new observation" | Weighted / grouped mean | \(\bar x=\sum w_g\bar x_g\); recover \(\sum x=n\bar x\), \(\sum(x-\bar x)^2=(n-1)s^2\)
several histograms, same mean, "match the standard deviations" | Dispersion reasoning | mass near centre = small SD; mass at extremes = large SD (don't compute)
few \((x,y)\) pairs, "correlation coefficient", "plot the points" | Correlation | points on a line → \(r=\pm1\); else the \(r\) formula

Exercises 1–2 build the raw skills (mean, variance, quartiles); exercises 3–8 are the real exam problems of each type.

Click any card to flip. Rate it after to track what you need to revisit.
Card 1
Population, sample, unit, variable, modality — define each in one line.
Population = the whole group of interest. Sample = the observed subset. Unit/subject = a member. Variable = a feature recorded on each unit. Modality = a possible value of a variable.
Card 2
The four variable types, with the key caveat.
Categorical: nominal (no order — hair colour) / ordinal (ordered — hotel stars). Quantitative: discrete (countable — goals) / continuous (any value in a range — height). Caveat: numbers used as labels (hotel stars, shirt numbers) are still categorical.
Card 3
Absolute, relative, and cumulative relative frequency?
Absolute \(n_i\) = count of value \(v_i\). Relative \(f_i=n_i/n\) (sum to 1). Cumulative \(F_i=f_1+\dots+f_i\) = proportion up to \(v_i\) (ordered data only).
Card 4
Bar plot vs histogram — what's the crucial difference?
Bar plot (categorical/discrete): bar HEIGHT = frequency. Histogram (continuous, grouped): bar AREA = frequency. With unequal class widths, area keeps the comparison honest; height would mislead.
Card 5
Mode, median, mean — and when do mean and median differ?
Mode = most frequent value. Median = middle of ordered data (or smallest value with \(F_i\ge0.5\)). Mean = \(\frac1n\sum x_i=\sum v_j f_j\). They differ when data is skewed — the mean is pulled toward the long tail; they coincide when symmetric.
Card 6
How do you locate the 100p-th percentile in an ordered sample of size n?
If \(np\) is not an integer → value at position \(\lceil np\rceil\). If \(np\) is an integer → average of the values at positions \(np\) and \(np+1\). Quartiles = 25th, 50th (median), 75th percentiles.
Card 7
Sample variance: why square the deviations, the formula, and the fast form?
Because \(\sum(x_i-\bar x)=0\) (deviations cancel), so square them. \(s^2=\frac{1}{n-1}\sum(x_i-\bar x)^2=\frac{1}{n-1}(\sum x_i^2-n\bar x^2)\). SD \(s=\sqrt{s^2}\) is in the data's units. Divisor \(n-1\) (makes it unbiased — proven later).
Card 8
Chebyshev's inequality, in words and symbols.
For ANY distribution, at most \(1/k^2\) of the data lies more than \(k\) SDs from the mean: \(P(|X-\bar x|\ge ks)\le 1/k^2\). E.g. at most 1/4 lies outside \(\bar x\pm 2s\).
Card 9
Skewness: how do you tell, and what's the Empirical Rule?
Right-skew = long right tail; left-skew = long left tail. Tell-tale: mean noticeably ≠ median (mean pulled toward the tail). Empirical Rule (approx-normal data): ~68% within \(\bar x\pm s\), ~95% within \(\pm2s\), ~99.7% within \(\pm3s\).
Card 10
Covariance vs correlation, and the two big warnings.
\(\mathrm{cov}=\frac{1}{n-1}\sum(x_i-\bar x)(y_i-\bar y)\) gives direction but is scale-dependent. \(r=\mathrm{cov}/(s_x s_y)\in[-1,1]\) is unit-free strength. Warnings: (1) \(r\) sees only LINEAR structure (a curve can give \(r\approx0\)); (2) correlation ≠ causation.
Card 11
Simple random vs stratified sampling; parameter vs statistic.
Simple random: every group of size k equally likely. Stratified: split into strata, sample each in proportion (best when strata differ on the measured feature). Parameter = unknown population number; statistic = computed from the sample to estimate it.
EX. 01

Compute the mean, variance and SD from scratch

course dataset (first 10 exam grades) easy

Ten exam grades: \(25,16,30,24,24,21,30,27,24,30\). Compute the sample mean, sample variance, and standard deviation.

\(\bar x=\frac1n\sum x_i\). Then use the fast form \(s^2=\frac{1}{n-1}(\sum x_i^2-n\bar x^2)\), and \(s=\sqrt{s^2}\).

Step 1. Mean
\(\sum x_i=251\), so \(\bar x=251/10=25.1\).
Step 2. Sum of squares
\(\sum x_i^2=625+256+900+576+576+441+900+729+576+900=6479\).
Step 3. Variance (fast form)
\(s^2=\dfrac{6479-10(25.1)^2}{10-1}=\dfrac{6479-6300.1}{9}=\dfrac{178.9}{9}\approx19.9\).
Step 4. Standard deviation
\(s=\sqrt{19.9}\approx4.46\) (same units as the grades).
\(\bar x=25.1\), \(s^2\approx19.9\), \(s\approx4.46\).
EX. 02

Median and quartiles

Ross 3.11 (bowling scores) easy

Eighteen scores, already ordered: \(122,126,133,140,145,145,149,150,157,162,166,175,177,177,183,188,199,212\). Find \(Q_1\), the median, and \(Q_3\).

For each quartile compute \(np\) (\(p=0.25,0.5,0.75\), \(n=18\)); not integer → round the position up; integer → average two positions.

Step 1. First quartile
\(0.25\cdot18=4.5\) (not integer) → 5th value \(=145\).
Step 2. Median
\(0.5\cdot18=9\) (integer) → average of 9th and 10th \(=(157+162)/2=159.5\).
Step 3. Third quartile
\(0.75\cdot18=13.5\) (not integer) → 14th value \(=177\).
\(Q_1=145\), median \(=159.5\), \(Q_3=177\) (so \(\mathrm{IQR}=32\)).
EX. 03

Average salary across staff types (weighted mean)

MACSf2-P3 easy

Staff is 10% type A, 70% type B, 20% type C, with average monthly salaries 1000, 2000, 3000 respectively. What is the average salary across all staff?

Overall mean = weighted average of group means, weights = proportions.

Step 1. Apply the weighted mean
\( \bar x=0.1(1000)+0.7(2000)+0.2(3000)=100+1400+600 \).
\( \bar x=2100 \).
EX. 04

Augment a 12-point sample

MAIf3-P1 medium

Twelve numbers have sample mean 10 and sample variance 12. A new observation \(x_{13}=10\) is added. Find the mean and variance of the 13-point sample.

Recover the totals \(\sum x_i=n\bar x\) and \(\sum(x_i-\bar x)^2=(n-1)s^2\), add the point, recompute. The new point equals the old mean.

Step 1. Recover the totals
\( \sum_{i=1}^{12}x_i=12\cdot10=120 \); \( \sum_{i=1}^{12}(x_i-10)^2=11\cdot12=132 \).
Step 2. New mean
\( \bar x_{13}=\dfrac{120+10}{13}=10 \) (unchanged).
Step 3. New variance
New point adds \((10-10)^2=0\) to the sum of squares, so \( s^2_{13}=\dfrac{132+0}{13-1}=\dfrac{132}{12}=11 \).
Mean \(=10\), variance \(=11\) (it drops from 12 because the divisor grew from 11 to 12).
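If you want to sanity-check the answer numerically, one option is to build any 12-point sample with mean 10 and variance 12 and add the new point. The sample below is a hypothetical construction (six values at \(10-a\), six at \(10+a\), with \(12a^2=132\)); the exam does not give the actual numbers.

```python
import math
import statistics

# Hypothetical data built to have mean 10 and sample variance 12:
# six values at 10 - a and six at 10 + a, so that 12 * a^2 = 11 * 12 = 132.
a = math.sqrt(11)
sample12 = [10 - a] * 6 + [10 + a] * 6
sample13 = sample12 + [10]          # add the new observation x_13 = 10

print(statistics.mean(sample12), statistics.variance(sample12))
print(statistics.mean(sample13), statistics.variance(sample13))
```

Any other 12 numbers with the same mean and variance would give the same result, because the solution only uses the two totals.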
EX. 05

Female vs male average grade — a 2×2 system

MAIf5-P1 medium

A class of 100 has overall average grade 25.0. It is 60% female, 40% male, and the female–male difference in averages is 1.0. Find the female and male average grades.

Let \(\bar x_F=1+\bar x_M\) and \(0.6\bar x_F+0.4\bar x_M=25\); substitute.

Step 1. Set up
\( \bar x_F=1+\bar x_M \); weighted mean \(0.6\bar x_F+0.4\bar x_M=25\).
Step 2. Substitute and solve
\( 0.6(1+\bar x_M)+0.4\bar x_M=25\Rightarrow 0.6+\bar x_M=25\Rightarrow \bar x_M=24.4 \).
Step 3. Back out the other
\( \bar x_F=1+24.4=25.4 \).
\( \bar x_M=24.4 \), \( \bar x_F=25.4 \).
EX. 06

Back-solve class sizes from a pooled mean

EBf1-P1 medium

Two classes take the same test. Class A averages 7.2, class B averages 6.7, and the 50 students together average 6.9. How many students are in each class?

Let \(n_A\) be class A's size, \(50-n_A\) class B's. Pooled mean = total grades / 50 = 6.9.

Step 1. Equation
\( \dfrac{7.2\,n_A+6.7(50-n_A)}{50}=6.9 \).
Step 2. Clear and solve
\( 7.2n_A+335-6.7n_A=345\Rightarrow 0.5\,n_A=10\Rightarrow n_A=20 \).
Class A has 20 students, class B has 30.
EX. 07

Match standard deviations to histograms

MAIf2-P1 / sampletest1-P6 easy

Three samples have the same mean \(\bar x=3.5\) over the range 0–7, with standard deviations (in some order) \(1.38,\,1.71,\,1.98\). Sample A is a flat/uniform histogram; Sample B is mound-shaped (mass near the centre); Sample C is U-shaped (mass at the extremes). Match each SD to its sample — without computing.

SD measures spread about the mean. Rank by how far the typical observation sits from 3.5.

Step 1. Smallest spread
Sample B (mound) keeps observations closest to \(\bar x\) → smallest SD \(=1.38\).
Step 2. Largest spread
Sample C (U-shape) pushes mass to the extremes → largest \((x-\bar x)^2\) → largest SD \(=1.98\).
Step 3. Middle
Sample A (uniform) is in between → \(1.71\).
B = 1.38, A = 1.71, C = 1.98. (You can't read an SD off a histogram — the ranking comes from where the mass sits.)
EX. 08

Correlation of five collinear points

MACSf1-P1 easy

Five pairs: \((1,3),(3,7),(5,11),(4,9),(2,5)\). What is the sample correlation coefficient? (Hint: plot the points.)

Check whether they lie on a line \(y=a+bx\). If so, \(r=\operatorname{sign}(b)\) with no arithmetic.

Step 1. Spot the line
Each pair satisfies \(y=2x+1\): \(1\!\to\!3,\,2\!\to\!5,\,3\!\to\!7,\,4\!\to\!9,\,5\!\to\!11\). All five are exactly collinear.
Step 2. Read off r
Positive slope \(b=2>0\) and a perfect line → \(r=+1\).
\( r=1 \).
CH. 02 Statistics

Probability Foundations

Starting from zero: what probability is, how to combine events, and how to reason with Bayes — built up with concrete examples before any exam problem.

Sections 10
Flashcards 11
Exercises 8
Read time 7'
Sources Ross, Introductory Statistics 3E, ch. 4 · MACS lectures 02–05 (Perone Pacifico, LUISS) · Past exams: MACSf1/2, MAIf2/3/5, sampletest1, EBf1
Key concepts
Sample space & events · Set operations · Probability rules · Equally likely outcomes · Conditional probability · Independence · Total probability · Bayes' theorem · Counting

1 · What probability actually is

Start with the picture, not the formula. An experiment is any situation whose result you can't predict for sure, but where you do know the list of possible results. Tossing a coin is an experiment; so is "which operating system will my next laptop run?" or "what will Apple's share price be on Monday?"

Each single result is an outcome. The full list of possible outcomes is the sample space, written \(S\).

  • Coin toss: \(S=\{\text{Heads},\text{Tails}\}\).
  • Roll a die: \(S=\{1,2,3,4,5,6\}\).
  • Next laptop OS: \(S=\{\text{Windows},\text{MacOS},\text{Linux},\dots\}\).

An event is just a statement about the result — "the die shows an even number", "I get Heads". Formally it's a subset of \(S\): the even-number event is the set \(\{2,4,6\}\). We say the event occurs when the actual outcome is one of the outcomes inside it.

So what is a probability? The most useful picture is long-run frequency: if you repeated the experiment over and over, the probability of an event is the proportion of times it would happen. Flip a fair coin thousands of times and the fraction landing Tails settles near \(0.5\) — that limiting fraction is the probability. (Real example: across the world about 105 boys are born for every 100 girls, year after year — so \(P(\text{newborn is male})\approx0.51\).) A probability is always a number between 0 and 1: 0 = never, 1 = certain.
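The long-run frequency picture is easy to watch in a simulation. This is a sketch with Python's pseudo-random generator standing in for a fair coin; the seed and the function name `tails_fraction` are assumptions of the example.

```python
import random

random.seed(1)  # assumed seed, only so the run is repeatable

def tails_fraction(n_flips):
    """Fraction of simulated fair-coin flips that land Tails."""
    tails = sum(random.random() < 0.5 for _ in range(n_flips))
    return tails / n_flips

for n_flips in (10, 1_000, 100_000):
    print(n_flips, tails_fraction(n_flips))
```

With only 10 flips the fraction can be far from 0.5; as the number of flips grows it settles close to it.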

2 · Combining events: AND, OR, NOT

Events are sets, so we combine them like sets. Picture a rectangle for \(S\) and circles inside it for events (a Venn diagram).

  • OR — union \(A\cup B\): the outcomes in \(A\), or in \(B\), or in both. It occurs if at least one of them happens.
  • AND — intersection \(A\cap B\): the outcomes in both. It occurs only if they happen together.
  • NOT — complement \(A^c\): everything in \(S\) that is not in \(A\). It occurs exactly when \(A\) doesn't.

Mutually exclusive (disjoint) events can't happen at the same time — their intersection is empty, \(A\cap B=\varnothing\). "The die shows 2" and "the die shows 5" are mutually exclusive. The empty set \(\varnothing\) is the impossible event.

One identity worth seeing now, because it drives a lot of exam problems: "in \(A\) but not in \(B\)" is \(A\cap B^c\), and it equals what's in \(A\) minus the overlap. We'll turn that into numbers next.

3 · The rules of probability

Three basic rules (they just codify the frequency picture):

  1. \(0\le P(A)\le1\) — proportions live between 0 and 1.
  2. \(P(S)=1\) — something in the list always happens.
  3. If \(A,B\) are mutually exclusive, \(P(A\cup B)=P(A)+P(B)\) — non-overlapping chances just add.

From these you derive the two you'll actually use:

Complement rule. Since \(A\) and \(A^c\) split \(S\): \(P(A^c)=1-P(A)\). (Heads has probability 0.4 ⇒ Tails has 0.6.) This is the engine behind "at least one" problems — it's usually easier to compute the opposite and subtract.

Addition rule (when events can overlap): \[ P(A\cup B)=P(A)+P(B)-P(A\cap B). \] Why subtract? If you just add \(P(A)+P(B)\), the overlap \(A\cap B\) gets counted twice, so you remove it once.

Concrete example (Ross 4.3). A shop takes Amex or VISA. 22% of customers carry Amex, 58% carry VISA, 14% carry both. Probability a customer has at least one card: \(0.22+0.58-0.14=0.66\). And "VISA but not Amex" \(=0.58-0.14=0.44\) — exactly the \(A\cap B^c\) idea from section 2.
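Since events are sets, the card example can be replayed with Python sets. The 100 numbered customers are a hypothetical population arranged to match the stated percentages (22 Amex, 58 VISA, 14 both).

```python
# Hypothetical population of 100 customers, identified by number.
N = 100
amex = set(range(0, 22))     # 22 customers carry Amex
visa = set(range(8, 66))     # 58 carry VISA; customers 8..21 carry both

def p(event):
    """Probability of an event when each customer is equally likely."""
    return len(event) / N

print("both:         ", p(amex & visa))    # intersection
print("at least one: ", p(amex | visa))    # union
print("addition rule:", p(amex) + p(visa) - p(amex & visa))
print("VISA not Amex:", p(visa - amex))    # A intersect B-complement
```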

4 · Equally likely outcomes — just count

When every outcome in \(S\) is equally likely (a fair die, a well-shuffled deck, "a person chosen at random"), probability becomes pure counting:

\[ P(A)=\frac{\#\text{outcomes in }A}{\#\text{outcomes in }S}. \]

The phrase "chosen at random" is your signal that outcomes are equally likely.

  • Fair die: \(P(\text{even})=3/6=1/2\).
  • European roulette, bet on odd: numbers \(\{0,1,\dots,36\}\), 18 are odd ⇒ \(18/37\).
  • Retirement centre (Ross 4.4): 420 members, 144 smokers ⇒ \(P(\text{smoker})=144/420=12/35\).

This is why counting techniques (section 9) matter: to get a probability you often just need to count the favourable outcomes and divide by the total.
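The counting formula as a tiny Python helper, using exact fractions so the answers match the hand calculations. The function name `prob` is my own.

```python
from fractions import Fraction

def prob(event, sample_space):
    """P(A) = (# outcomes in A) / (# outcomes in S), for equally likely outcomes."""
    favourable = [outcome for outcome in sample_space if event(outcome)]
    return Fraction(len(favourable), len(sample_space))

die = range(1, 7)
roulette = range(0, 37)          # European wheel: 0, 1, ..., 36

p_even = prob(lambda o: o % 2 == 0, die)
p_odd = prob(lambda o: o % 2 == 1, roulette)
p_smoker = Fraction(144, 420)    # retirement-centre example

print(p_even, p_odd, p_smoker)
```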

5 · Conditional probability — updating on information

Often you learn something partway through. Conditional probability \(P(B\mid A)\) is "the probability of \(B\) given that \(A\) has happened."

The intuition (no formula yet). Roll two dice and ask for the chance that the sum is 10. You're told the first die is a 4. That knowledge shrinks the world: only six outcomes are still possible, \((4,1),\dots,(4,6)\). Among those, only \((4,6)\) makes the sum 10, so the chance is \(1/6\) (without the information it would be \(3/36=1/12\)). You re-computed the probability inside a reduced sample space: \(A\) became your new \(S\).

Turning that into a formula — measure the overlap relative to the thing you now know:

\[ P(B\mid A)=\frac{P(A\cap B)}{P(A)}. \]

The famous trap (Ross 4.10 / the two-children problem). A couple has two children; you learn at least one is a girl. Probability both are girls? Equally likely families are \(\{(g,g),(g,b),(b,g),(b,b)\}\). "At least one girl" rules out only \((b,b)\), leaving three equally likely cases; just one is \((g,g)\). So the answer is \(1/3\), not \(1/2\) — the extra information reshapes the sample space. (This exact reasoning powers exam exercise 5 below.)

Rearranging the formula gives the multiplication rule: \(P(A\cap B)=P(A)\,P(B\mid A)\) — useful for "draw two without replacement" problems, where the second draw's odds depend on the first.
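The reduced-sample-space idea can be checked by brute-force enumeration. A sketch, assuming equally likely outcomes; `cond_prob` is a helper name invented here.

```python
from fractions import Fraction
from itertools import product

def cond_prob(b, a, space):
    """P(B | A) by counting: keep only outcomes where A holds, then count B among them."""
    given = [o for o in space if a(o)]
    both = [o for o in given if b(o)]
    return Fraction(len(both), len(given))

# Two dice: P(sum is 10 | first die is 4).
dice = list(product(range(1, 7), repeat=2))
p_sum10 = cond_prob(lambda o: sum(o) == 10, lambda o: o[0] == 4, dice)

# Two children: P(both girls | at least one girl).
families = list(product("gb", repeat=2))
p_both_girls = cond_prob(lambda o: o == ("g", "g"), lambda o: "g" in o, families)

print(p_sum10, p_both_girls)
```

Filtering the outcomes by \(A\) first is exactly the "\(A\) becomes your new \(S\)" step.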

6 · Independence — when information doesn't help

Sometimes knowing \(A\) tells you nothing about \(B\). Then \(P(B\mid A)=P(B)\), and the multiplication rule simplifies to the test you'll use:

\[ A,B\text{ independent}\iff P(A\cap B)=P(A)\,P(B). \]

Signals of independence: "with replacement", "i.i.d.", "each toss is fair" — separate trials that don't influence each other.

Why it's not automatic (Ross 4.13). Two fair dice. Let \(A=\)"first die is 3". Compare two events: \(B=\)"sum is 8" and \(C=\)"sum is 7". Knowing the first die is 3 changes the chance of an 8 (now you just need a 5 next) — so \(A,B\) are dependent. But the chance of a 7 stays \(1/6\) whatever the first die shows (there's always exactly one matching second die) — so \(A,C\) are independent. Same first event, opposite verdicts: independence is something you check, not assume.

For "at least one" across independent trials, lean on the complement: three children, \(P(\text{at least one girl})=1-P(\text{all boys})=1-(1/2)^3=7/8\).
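Checking independence means comparing \(P(A\cap B)\) with \(P(A)P(B)\), which a short enumeration can do for the two-dice example. Function names are assumptions of this sketch.

```python
from fractions import Fraction
from itertools import product

dice = list(product(range(1, 7), repeat=2))   # 36 equally likely outcomes

def prob(event):
    return Fraction(sum(1 for o in dice if event(o)), len(dice))

def independent(a, b):
    """True when P(A and B) equals P(A) * P(B)."""
    return prob(lambda o: a(o) and b(o)) == prob(a) * prob(b)

def first_is_3(o):
    return o[0] == 3

def sum_is_8(o):
    return sum(o) == 8

def sum_is_7(o):
    return sum(o) == 7

print("A, B independent?", independent(first_is_3, sum_is_8))
print("A, C independent?", independent(first_is_3, sum_is_7))
```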

7 · Total probability — averaging over cases

Often the thing you want depends on a hidden "case" or "cause". Split the problem by the case, compute each piece, and combine. If \(B\) either happens or not, then for any \(A\):

\[ P(A)=P(A\mid B)\,P(B)+P(A\mid B^c)\,P(B^c). \]

Read it as a weighted average: the chance of \(A\) in each case, weighted by how likely that case is. (It generalises to any set of mutually exclusive cases \(B_1,\dots,B_k\) that together cover every possibility: \(P(A)=\sum_i P(A\mid B_i)P(B_i)\).)

Concrete example (insurance). 30% of drivers are high-risk, 70% low-risk. A high-risk driver has an accident this year with probability 0.4, a low-risk one with probability 0.2. Overall chance a random driver has an accident: \[ P(\text{accident})=0.4(0.30)+0.2(0.70)=0.12+0.14=0.26. \] You couldn't answer without splitting by risk type — that's the whole move.
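As a sketch, the weighted average is a one-line sum. The function name and the `(P(case), P(A | case))` pair layout are choices made for this example.

```python
def total_probability(cases):
    """cases: list of (P(case), P(A | case)) pairs whose P(case) values sum to 1."""
    assert abs(sum(p_case for p_case, _ in cases) - 1) < 1e-9, "cases must cover everything"
    return sum(p_case * p_a_given_case for p_case, p_a_given_case in cases)

# Insurance example: 30% high-risk (accident prob 0.4), 70% low-risk (0.2).
p_accident = total_probability([(0.30, 0.4), (0.70, 0.2)])
print(round(p_accident, 4))
```

The internal check that the case probabilities sum to 1 is the code version of "the cases must not overlap and must cover every possibility".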

8 · Bayes' theorem — reasoning backwards

Total probability runs cause → effect. Bayes runs it backwards: you observed the effect, and you want the probability of the cause. "It rained — how likely is it that the morning had been sunny?" "The test is positive — how likely is the disease?"

Start from the definition \(P(\text{cause}\mid\text{effect})=\dfrac{P(\text{cause}\cap\text{effect})}{P(\text{effect})}\), write the top as \(P(\text{effect}\mid\text{cause})P(\text{cause})\), and expand the bottom with total probability:

\[ P(H\mid E)=\frac{P(E\mid H)\,P(H)}{P(E\mid H)\,P(H)+P(E\mid H^c)\,P(H^c)}. \]

The procedure: (1) name the causes \(H\) and the observed effect \(E\); (2) write the priors \(P(H)\) and the likelihoods \(P(E\mid H)\) straight from the text; (3) total probability gives the denominator; (4) divide.

The showcase example — why a positive test can still be reassuring (Ross 4.17). A blood test is 99% accurate when the disease is present (\(P(E\mid H)=0.99\)) and gives a false positive 2% of the time (\(P(E\mid H^c)=0.02\)). Only 0.5% of people have the disease (\(P(H)=0.005\)). You test positive — what's the chance you're actually sick? \[ P(H\mid E)=\frac{0.99(0.005)}{0.99(0.005)+0.02(0.995)}\approx0.199. \] About 20% — surprisingly low, because the huge healthy population produces many false positives that swamp the few true cases. This is exactly the engine behind exam exercises 3 and 4.
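The four-step procedure in code, for the two-hypothesis case (\(H\) and \(H^c\)). A minimal sketch; the function and parameter names are mine.

```python
def posterior(prior, p_e_given_h, p_e_given_not_h):
    """P(H | E) by Bayes' theorem with two hypotheses, H and its complement."""
    numerator = p_e_given_h * prior                       # P(E | H) P(H)
    evidence = numerator + p_e_given_not_h * (1 - prior)  # total probability of E
    return numerator / evidence

# Blood test: prior 0.5%, detection rate 99%, false-positive rate 2%.
p_sick = posterior(prior=0.005, p_e_given_h=0.99, p_e_given_not_h=0.02)
print(round(p_sick, 3))
```

The same answer in head counts: of 100,000 people, 500 are sick and 495 of them test positive; of the 99,500 healthy, 1,990 test positive. So 495 of the 2,485 positives are truly sick, about 20%.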

9 · Counting — tools for equally-likely problems

Section 4 said probability is often just counting favourable ÷ total. Here are the tools, introduced only because we need them.

Basic principle. If step 1 has \(n\) options and step 2 has \(m\), together there are \(n\cdot m\). (One man from 8, one woman from 12 → \(96\) pairs.)

Permutations / factorial. The number of orderings of \(n\) distinct objects is \(n!=n(n-1)\cdots2\cdot1\) (with \(0!=1\)). Order matters here.

Combinations. When order does not matter — choosing a group of \(k\) from \(n\):

\[ \binom{n}{k}=\frac{n!}{k!\,(n-k)!}. \]

Two reflexes for the exam: "all different" → count the complement (the few repeats) and subtract; "these specific items are included" → fix them, then count the choices for the remaining slots. Watch for repeated elements — e.g. the two identical P's in APPLE: \(\binom{5}{2}=10\) selections, only one is the pair PP.
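The counting tools all exist in Python's standard library, which makes it easy to check small cases by listing them. A sketch; `math.comb` needs Python 3.8 or later, and the variable names are my own.

```python
import math
from itertools import combinations

pairs = 8 * 12                   # basic principle: one man from 8, one woman from 12
orderings = math.factorial(5)    # permutations of 5 distinct objects
groups = math.comb(5, 2)         # choose 2 from 5, order ignored

# APPLE: pick 2 of the 5 letter positions and look for the pair PP.
letters = "APPLE"
selections = list(combinations(range(len(letters)), 2))
pp_pairs = [s for s in selections if letters[s[0]] == "P" and letters[s[1]] == "P"]

print(pairs, orderings, groups, len(selections), len(pp_pairs))
```

Choosing positions rather than letters keeps the two P's distinguishable, which is what makes the 10 selections equally likely.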

10 · Which tool? — recognition guide

Once the concepts are in place, exam problems are about spotting which one applies.

If the wording says… | It's a… | Do
"chosen at random" among options, then observe a result; "given [result], prob it was [cause]" | Bayes | priors × likelihoods, divide
"a cause is random, then count X", asked an unconditional \(P(X{=}k)\) | Total probability (mixture) | \(\sum P(X{=}k\mid \text{case})P(\text{case})\)
two overlapping groups + counts; "at least one", "one but not the other" | Addition / inclusion–exclusion | \(P(A\cup B)=P(A)+P(B)-P(A\cap B)\)
"choose \(k\) without replacement", "all different", "both included" | Counting | \(\binom{n}{k}\) + complement / fix-items
"knowing exactly \(m\) of \(n\) are…", asked about a specific draw | Conditional in a fair setup | reduced sample space; count favourable ÷ remaining
Click any card to flip. Rate it after to track what you need to revisit.
Card 1
In plain words: what are an experiment, an outcome, the sample space S, and an event?
Experiment = a situation with an unpredictable result but a known list of possibilities. Outcome = one single result. Sample space \(S\) = the set of all possible outcomes. Event = a statement about the result, i.e. a subset of \(S\); it 'occurs' when the actual outcome is inside it.
Card 2
What does it mean to say 'the probability of Tails is 0.5' (the long-run frequency picture)?
If you repeated the experiment many times, the proportion of times the event happens settles toward that number. Probability = limiting relative frequency. Always between 0 (never) and 1 (certain).
Card 3
AND, OR, NOT for events — which set operation is each, and what does 'mutually exclusive' mean?
OR = union \(A\cup B\) (at least one). AND = intersection \(A\cap B\) (both). NOT = complement \(A^c\). Mutually exclusive (disjoint) = can't both happen, \(A\cap B=\varnothing\).
Card 4
Why does the addition rule subtract \(P(A\cap B)\)? State it.
\(P(A\cup B)=P(A)+P(B)-P(A\cap B)\). Adding \(P(A)+P(B)\) double-counts the overlap, so you remove it once. (Also \(P(A\cap B^c)=P(A)-P(A\cap B)\).)
Card 5
When outcomes are equally likely, how do you compute a probability? What phrase signals this?
\(P(A)=\#A/\#S\) — favourable outcomes divided by total. The phrase 'chosen at random' signals equally-likely outcomes.
Card 6
Conditional probability: the intuition (reduced sample space) and the formula.
Given \(A\) happened, \(A\) becomes your new sample space; re-measure \(B\) inside it. \(P(B\mid A)=\dfrac{P(A\cap B)}{P(A)}\). The two-children problem: 'at least one girl' leaves 3 equally likely cases, so P(both girls)=1/3, not 1/2.
Card 7
Independence: the test, and how to recognise it in a problem.
\(A,B\) independent \(\iff P(A\cap B)=P(A)P(B)\) \(\iff P(B\mid A)=P(B)\). Signalled by 'with replacement', 'i.i.d.', 'each trial is fair'. Don't assume it — check it.
Card 8
Law of total probability — state it and explain it in words.
\(P(A)=\sum_i P(A\mid B_i)P(B_i)\) over mutually-exclusive cases \(B_i\). It's a weighted average: the chance of \(A\) in each case, weighted by how likely that case is. Use it when the answer depends on a hidden case/cause.
Card 9
Bayes' theorem: what does it do, and what are the priors and likelihoods?
It reverses conditioning: from effect back to cause. \(P(H\mid E)=\dfrac{P(E\mid H)P(H)}{\sum_i P(E\mid H_i)P(H_i)}\). Priors \(P(H_i)\) = how likely each cause before evidence; likelihoods \(P(E\mid H_i)\) = how the cause produces the effect. The denominator is total probability.
Card 10
Why can a 99%-accurate positive test still mean only ~20% chance of disease?
If the disease is rare (e.g. 0.5%), the large healthy group produces many false positives that outnumber the few true cases. With a 99% detection rate and a 2% false-positive rate, Bayes weighs prior × likelihood: \(\frac{0.99\cdot0.005}{0.99\cdot0.005+0.02\cdot0.995}\approx0.20\). Rarity of the cause matters as much as test accuracy.
Card 11
Counting: factorial, combinations, and the two exam reflexes.
Orderings of \(n\): \(n!\). Choosing \(k\) of \(n\) (order irrelevant): \(\binom{n}{k}=\frac{n!}{k!(n-k)!}\). Reflexes: 'all different' → count the complement and subtract; 'specific items included' → fix them, count the rest.
EX. 01

Coin tossed a random number of times — P(X=0)

MACSf1-P2 medium

Let \(N\) be a number chosen from \(\{1,2,3\}\) with equal probability. Throw a fair coin \(N\) times, counting the number \(X\) of Heads obtained. Calculate \(P(X=0)\).

The hidden 'case' is how many times you tossed (\(N\)). No reversal is asked, so it's the law of total probability (section 7). Given \(N=n\), getting zero Heads means \(n\) Tails in a row: \((\tfrac12)^n\).

Step 1. Identify the structure
Cases: \(P(N=1)=P(N=2)=P(N=3)=\tfrac13\). The question asks a plain \(P(X=0)\) — no 'given' — so average over the cases (total probability), not Bayes.
Step 2. Chance of zero Heads in each case
\(n\) independent fair tosses, all Tails: \(P(X=0\mid N=n)=(\tfrac12)^n\). For \(n=1,2,3\) that's \(\tfrac12,\tfrac14,\tfrac18\).
Step 3. Weighted average
\[ P(X=0)=\sum_{n=1}^{3}P(X=0\mid N=n)P(N=n)=\tfrac13\left(\tfrac12+\tfrac14+\tfrac18\right)=\tfrac13\cdot\tfrac78. \]
\( P(X=0)=\dfrac{7}{24} \).
EX. 02

Same coin experiment — reverse it with Bayes

MACSf2-P2 medium

Same setup (\(N\) chosen from \(\{1,2,3\}\), fair coin tossed \(N\) times, \(X\) = Heads). Given that \(X=0\), calculate \(P(N=1\mid X=0)\).

Now you observe the effect \(X=0\) and want the hidden cause \(N=1\) → Bayes (section 8). The denominator is the \(P(X=0)=\tfrac{7}{24}\) you just built.

Step 1. Denominator = total probability
From exercise 1, \(P(X=0)=\tfrac{7}{24}\).
Step 2. Numerator = prior × likelihood for the cause N=1
\(P(X=0\mid N=1)\,P(N=1)=\tfrac12\cdot\tfrac13=\tfrac16\).
Step 3. Divide (Bayes)
\[ P(N=1\mid X=0)=\frac{1/6}{7/24}=\frac16\cdot\frac{24}{7}. \]
\( P(N=1\mid X=0)=\dfrac{4}{7} \).
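Exercises 1 and 2 can be checked with exact fractions. The following is a small sketch, not an official solution; the dictionary names `prior` and `likelihood` are assumptions chosen for readability.

```python
from fractions import Fraction

# Prior on N: uniform over {1, 2, 3}.
prior = {n: Fraction(1, 3) for n in (1, 2, 3)}

# Likelihood of zero Heads given N = n fair tosses: (1/2)^n.
likelihood = {n: Fraction(1, 2) ** n for n in prior}

# Law of total probability: P(X=0) = sum over n of P(X=0 | N=n) P(N=n).
p_x0 = sum(likelihood[n] * prior[n] for n in prior)

# Bayes: P(N=1 | X=0) = P(X=0 | N=1) P(N=1) / P(X=0).
posterior_n1 = likelihood[1] * prior[1] / p_x0

print(p_x0, posterior_n1)  # 7/24 4/7
```

The same denominator `p_x0` serves both exercises, which is exactly why the Bayes problem is quick once the total-probability one is done.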
EX. 03

Which coin was it? — Bayes with two coins

sampletest1-P1 easy

An urn has two coins: coin A with \(P(\text{Heads})=\tfrac14\), coin B with \(P(\text{Heads})=\tfrac12\). One coin is picked at random and thrown, giving "Heads". Given this result, what is the probability it was coin A?

Cause = which coin (priors \(\tfrac12,\tfrac12\)); effect = Heads (likelihoods \(\tfrac14,\tfrac12\)). You observed the effect, want the cause → Bayes. Same shape as the medical-test example.

Step 1. Priors and likelihoods
\(P(A)=P(B)=\tfrac12\); \(P(H\mid A)=\tfrac14\), \(P(H\mid B)=\tfrac12\).
Step 2. Total probability of Heads (denominator)
\(P(H)=\tfrac14\cdot\tfrac12+\tfrac12\cdot\tfrac12=\tfrac18+\tfrac14=\tfrac38\).
Step 3. Bayes
\[ P(A\mid H)=\frac{P(H\mid A)P(A)}{P(H)}=\frac{1/8}{3/8}. \]
\( P(A\mid H)=\dfrac13 \).
EX. 04

Was the morning sunny? — Bayes with rain

EBf1-P2 easy

If the morning is sunny, the chance of rain that day is \(\tfrac16\). On non-sunny mornings, the chance of rain is \(\tfrac12\). 60% of mornings start sunny. Given that it rained, what is the probability the morning was sunny?

Cause = sunny vs not (priors \(0.6,0.4\)); effect = rain (likelihoods \(\tfrac16,\tfrac12\)). Observed the effect (rain), want the cause (sunny) → Bayes.

Step 1. Priors and likelihoods
\(P(S)=0.6\), \(P(S^c)=0.4\); \(P(R\mid S)=\tfrac16\), \(P(R\mid S^c)=\tfrac12\).
Step 2. Total probability of rain
\(P(R)=\tfrac16\cdot0.6+\tfrac12\cdot0.4=0.1+0.2=0.3\).
Step 3. Bayes
\[ P(S\mid R)=\frac{P(R\mid S)P(S)}{P(R)}=\frac{0.1}{0.3}. \]
\( P(S\mid R)=\dfrac13 \).
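Exercises 3 and 4 share one shape, so a single helper covers both. This is a minimal sketch under the assumption that causes are given as dictionaries of priors and likelihoods; the function name `posterior` and the labels "A", "B", "S", "notS" are illustrative.

```python
from fractions import Fraction

def posterior(priors, likelihoods, cause):
    """Return P(cause | effect) from dicts of priors and likelihoods."""
    # Denominator: total probability of the observed effect.
    total = sum(priors[c] * likelihoods[c] for c in priors)
    return priors[cause] * likelihoods[cause] / total

# Exercise 3: coin A (P(H)=1/4) vs coin B (P(H)=1/2), picked with equal chance.
coin = posterior({"A": Fraction(1, 2), "B": Fraction(1, 2)},
                 {"A": Fraction(1, 4), "B": Fraction(1, 2)}, "A")

# Exercise 4: sunny (prior 3/5) vs not sunny (prior 2/5), effect = rain.
sunny = posterior({"S": Fraction(3, 5), "notS": Fraction(2, 5)},
                  {"S": Fraction(1, 6), "notS": Fraction(1, 2)}, "S")

print(coin, sunny)  # 1/3 1/3
```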
EX. 05

Exactly two women — conditioning in a fair setup

MAIf3-P2 hard

Three students are selected at random with replacement. Knowing that exactly two of the selected are women, what is the probability that the first selected is a woman?

This is the two-children trap (section 5) with three slots. 'With replacement' → independent trials, each woman with some probability \(p\). Let \(E_i\)='the \(i\)-th is a woman', \(W\)=number of women. Want \(P(E_1\mid W=2)\). Watch \(p\) cancel.

Step 1. The conditioning event
Number of women among 3 is Binomial: \(P(W=2)=\binom{3}{2}p^2(1-p)=3p^2(1-p)\).
Step 2. Favourable: first is a woman AND exactly two women
The arrangements are \(E_1E_2E_3^c\) and \(E_1E_2^cE_3\) (disjoint). By independence each is \(p\cdot p\cdot(1-p)\), so \(P(E_1\cap\{W=2\})=2p^2(1-p)\).
Step 3. Divide — and watch p cancel
\[ P(E_1\mid W=2)=\frac{2p^2(1-p)}{3p^2(1-p)}. \] The \(p^2(1-p)\) cancels, just like the count-the-cases reasoning in the two-children problem.
\( P(E_1\mid W=2)=\dfrac23 \) — independent of \(p\).
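To see the cancellation of \(p\) concretely, you can enumerate the eight woman/man patterns for a few values of \(p\). This is a brute-force sketch; the function name and the trial values of \(p\) are arbitrary choices for illustration.

```python
from fractions import Fraction
from itertools import product

def p_first_woman_given_two(p):
    """P(first is a woman | exactly two women) for three independent draws."""
    num = Fraction(0)
    den = Fraction(0)
    # Enumerate all 2^3 patterns; 1 = woman, 0 = man.
    for pattern in product((0, 1), repeat=3):
        prob = Fraction(1)
        for is_woman in pattern:
            prob *= p if is_woman else 1 - p
        if sum(pattern) == 2:
            den += prob          # conditioning event W = 2
            if pattern[0] == 1:
                num += prob      # first is a woman AND W = 2
    return num / den

for p in (Fraction(1, 2), Fraction(1, 5), Fraction(9, 10)):
    print(p, p_first_woman_given_two(p))
```

Every line prints 2/3, whatever the value of \(p\).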
EX. 06

Biology not Chemistry — addition rule

MAIf2-P2 easy

A school has 300 students; every student takes at least one of Biology or Chemistry, possibly both. Biology has 250 students, Chemistry 150. Picking a student at random, what is the probability they take Biology and not Chemistry?

Two overlapping groups + 'and not' → addition rule (section 3). Everyone is in the union, so \(P(B\cup C)=1\); solve for the overlap, then use \(P(B\cap C^c)=P(B)-P(B\cap C)\).

Step 1. Translate counts to probabilities
\(P(B)=\tfrac{250}{300}\), \(P(C)=\tfrac{150}{300}\), \(P(B\cup C)=1\) (all take at least one).
Step 2. Find the overlap via the addition rule
\(1=\tfrac{250}{300}+\tfrac{150}{300}-P(B\cap C)\Rightarrow P(B\cap C)=\tfrac{100}{300}\).
Step 3. 'And not'
\[ P(B\cap C^c)=P(B)-P(B\cap C)=\tfrac{250}{300}-\tfrac{100}{300}=\tfrac{150}{300}. \]
\( P(B\cap C^c)=\dfrac12 \) (150 of 300 take Biology only).
EX. 07

Two letters from APPLE, all different — counting

MAIf5-P2a easy

Consider the word APPLE. Choose two letters at random without replacement. What is the probability that they are different?

Equally-likely selections → count (section 9). \(\binom{5}{2}=10\) pairs. 'All different' → count the complement (the pairs that repeat) and subtract.

Step 1. Count the sample space
Letters A, P, P, L, E → \(\binom{5}{2}=10\) unordered pairs.
Step 2. Complement
The only pair of equal letters is PP — exactly 1 outcome.
Step 3. Subtract
Different pairs \(=10-1=9\), so \(P(\text{different})=\tfrac{9}{10}\).
\( P(\text{different})=\dfrac{9}{10} \).
EX. 08

Three letters from APPLE, both P's chosen — counting

MAIf5-P2b easy

From the word APPLE, choose three letters at random without replacement. What is the probability that both P's are among the chosen letters?

\(\binom{5}{3}=10\) selections. 'Both P's included' → fix the two P's, count choices for the one remaining slot.

Step 1. Count the sample space
\(\binom{5}{3}=10\) unordered selections of three letters.
Step 2. Fix the required items
If both P's are taken, the third letter is one of A, L, E → 3 favourable selections.
Step 3. Divide
\( P(\text{both P's})=\dfrac{\text{favourable}}{\text{total}}=\tfrac{3}{10} \).
\( P(\text{both P's})=\dfrac{3}{10} \).
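Both APPLE exercises can be verified by listing the selections. The sketch below treats the five letter positions as distinct (so the two P's are separate objects), which is the assumption behind the count \(\binom{5}{k}\); variable names are illustrative.

```python
from fractions import Fraction
from itertools import combinations

letters = "APPLE"
positions = range(len(letters))  # treat the two P's as distinct positions

# Exercise 7: two letters, probability they differ.
pairs = list(combinations(positions, 2))
different = [s for s in pairs if letters[s[0]] != letters[s[1]]]
p_different = Fraction(len(different), len(pairs))

# Exercise 8: three letters, probability both P's are included.
triples = list(combinations(positions, 3))
both_p = [s for s in triples if sum(letters[i] == "P" for i in s) == 2]
p_both_p = Fraction(len(both_p), len(triples))

print(p_different, p_both_p)  # 9/10 3/10
```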
CH. 03 Statistics

Discrete Random Variables

From outcomes to numbers: probability mass functions, expected value, variance, and how means and variances behave under transformations and sums — built from scratch with worked exam problems.

Sections 08
Flashcards 11
Exercises 8
Read time 6'
Sources Ross, Probability & Statistics for Engineers 5E, ch. 4 (discrete parts; CDF/MGF/conditional skipped) · MACS lectures 06–07 (Perone Pacifico, LUISS) · Past exams & practice: Esercizio 3 (nn. 9–12), MACS06 worked examples
Key concepts
Random variable Discrete vs continuous Probability mass function Expected value E[X] E[g(X)] and E[X²] Variance & standard deviation Linear transformations Sums & independence Expected-value decisions

From outcomes to numbers: what a random variable is

Until now an experiment gave you outcomes — "Heads", "the die shows 3", "the customer is left-handed". A random variable is one simple move on top of that: attach a number to each outcome.

Formally, a random variable \(X\) is a rule that assigns a real number to every outcome of an experiment. We write it with a capital letter (\(X,Y,N\)); the actual value it takes after the experiment is a lowercase letter (\(x\)).

  • Toss a coin three times, let \(X=\) number of Heads. Then \(X\) can be \(0,1,2,3\).
  • Pick a student, let \(X=1\) if left-handed, \(0\) if not. Then \(X\in\{0,1\}\).
  • Play a game, let \(X=\) euros you walk away with. \(X\) might be \(-5,\,0,\,+10,\dots\)

Why bother? Because once outcomes are numbers you can average them, measure how spread they are, and feed them into every formula in the rest of the course.

Discrete vs continuous. A random variable is discrete when its possible values are separate points you can list — finite (\(0,1,2,3\)) or an endless but listable sequence (\(1,2,3,\dots\)). It is continuous when it can be any value in an interval (a height like \(172.4\,\)cm). This whole chapter is about the discrete case; continuous ones come later.

The probability mass function (pmf): the full ID card of a discrete RV

To know a discrete random variable completely you need two columns: the list of values it can take, and the probability of each. That table is the probability mass function (pmf), written

\[ p(x)=P\{X=x\}. \]

Read it as "the probability that \(X\) lands exactly on \(x\)". Ross also writes it \(p_X(x)\) when several variables are around.

A list of numbers is a valid pmf exactly when:

  • (1) every probability is \(\ge 0\): \(p(x)\ge 0\);
  • (2) they add up to one: \(\sum_x p(x)=1\).

Concrete example. In a population \(10\%\) of people are left-handed. Pick two people independently and let \(X=\) how many of the two are left-handed. Working out the cases gives the pmf:

\(x\) | 0 | 1 | 2
\(p(x)\) | 0.81 | 0.18 | 0.01

Check it is valid: all entries \(\ge0\), and \(0.81+0.18+0.01=1\). Good. This one table is the variable — every question below ("what's the average? the spread?") is answered just by reading off these numbers.

Tip. If a problem gives you frequencies (counts) instead of probabilities, divide each count by the total to turn the frequency table into a pmf. That single step unlocks every formula in this chapter.
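A short Python sketch of the two pmf conditions and of the frequencies-to-pmf step. A pmf is represented here as a dictionary {value: probability}; that representation, the function names, and the small table of counts are assumptions made for illustration.

```python
from fractions import Fraction

def is_valid_pmf(pmf):
    """Check the two pmf conditions: non-negative entries that sum to 1."""
    return all(p >= 0 for p in pmf.values()) and sum(pmf.values()) == 1

# Left-handed example: two independent picks, each left-handed with prob 1/10.
q = Fraction(1, 10)
left_handed = {0: (1 - q) ** 2, 1: 2 * q * (1 - q), 2: q ** 2}

def pmf_from_counts(counts):
    """Turn a frequency table {value: count} into a pmf by dividing by the total."""
    total = sum(counts.values())
    return {x: Fraction(c, total) for x, c in counts.items()}

# Hypothetical counts, chosen only to illustrate the tip.
example = pmf_from_counts({1: 3, 2: 5, 3: 2})

print(is_valid_pmf(left_handed), is_valid_pmf(example))  # True True
```

Exact fractions are used so that the "sums to one" check is not disturbed by floating-point rounding.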

Expected value E[X]: the long-run average

The first number you summarise a random variable with is its expected value (also called the mean), written \(E[X]\) or \(\mu\). It is the weighted average of the possible values, each weighted by its probability:

\[ E[X]=\sum_i x_i\,p(x_i). \]

What it really means. \(E[X]\) is not "the value you should expect to see" on a single try — a fair die never shows \(3.5\). It is the average you would get over a long run of repeated experiments. Physically it is the balance point of the pmf: put weight \(p(x_i)\) at position \(x_i\) on a ruler, and \(E[X]\) is where the ruler balances.

Example — fair die. Values \(1,\dots,6\) each with probability \(\tfrac16\):

\[ E[X]=1\cdot\tfrac16+2\cdot\tfrac16+\dots+6\cdot\tfrac16=\frac{21}{6}=3.5. \]

Example — indicator. If \(X=1\) when an event \(A\) happens and \(0\) otherwise, then \(E[X]=1\cdot P(A)+0\cdot P(A^c)=P(A)\). The mean of a 0/1 variable is just the probability of the "1".

For the left-handed pmf above: \(E[X]=0(0.81)+1(0.18)+2(0.01)=0.20\). On average \(0.2\) of the two people are left-handed — a perfectly sensible number even though \(X\) itself is only ever \(0,1,\) or \(2\).
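The weighted-average formula is a one-line sum in code. A minimal sketch, assuming a pmf stored as a dictionary {value: probability}; the function name `expected_value` is illustrative.

```python
from fractions import Fraction

def expected_value(pmf):
    """E[X] = sum of x * p(x) over a pmf given as {value: probability}."""
    return sum(x * p for x, p in pmf.items())

die = {x: Fraction(1, 6) for x in range(1, 7)}
left_handed = {0: Fraction(81, 100), 1: Fraction(18, 100), 2: Fraction(1, 100)}

print(expected_value(die))          # 7/2
print(expected_value(left_handed))  # 1/5
```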

E[g(X)]: averaging a function of X (and why E[X²] ≠ (E[X])²)

Often you don't care about \(X\) itself but about some function of it — a payoff, a squared error, a transformed score. If \(Y=g(X)\), you do not need to build a new pmf for \(Y\). Just weight \(g\) by the original probabilities:

\[ E[g(X)]=\sum_i g(x_i)\,p(x_i). \]

(Sometimes called the "law of the unconscious statistician".)

Game example. Draw \(5\) balls; you win €1 per white ball and lose €1 per non-white. If \(X\) is the number of white balls, your balance is \(Y=g(X)=2X-5\). To get the expected balance you just plug \(2x_i-5\) into the sum — no separate table needed.

The crucial special case: \(E[X^2]\). Take \(g(x)=x^2\):

\[ E[X^2]=\sum_i x_i^2\,p(x_i). \]

For the left-handed pmf: \(E[X^2]=0^2(0.81)+1^2(0.18)+2^2(0.01)=0.18+0.04=0.22\).

Notice \(E[X^2]=0.22\) but \((E[X])^2=(0.20)^2=0.04\). They are not equal. For a non-linear \(g\), \(E[g(X)]\neq g(E[X])\) in general. You can only "push the average inside" for linear functions and sums — exactly the next two sections. This gap between \(E[X^2]\) and \((E[X])^2\) is precisely what variance measures.
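The same sum handles any function \(g\). In this sketch the helper `expect` takes \(g\) as an optional argument (an illustrative design, not notation from the course), and the linear function \(2x-5\) is applied to the left-handed pmf purely as a test case.

```python
from fractions import Fraction

def expect(pmf, g=lambda x: x):
    """E[g(X)] = sum of g(x) * p(x); with the default g this is E[X]."""
    return sum(g(x) * p for x, p in pmf.items())

left_handed = {0: Fraction(81, 100), 1: Fraction(18, 100), 2: Fraction(1, 100)}

second_moment = expect(left_handed, lambda x: x ** 2)  # E[X^2]
square_of_mean = expect(left_handed) ** 2              # (E[X])^2

# A linear g can be pushed through: E[2X - 5] equals 2 E[X] - 5.
linear_ok = expect(left_handed, lambda x: 2 * x - 5) == 2 * expect(left_handed) - 5

print(second_moment, square_of_mean, linear_ok)  # 11/50 1/25 True
```

Here 11/50 is 0.22 and 1/25 is 0.04, matching the hand computation.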

Variance and standard deviation: measuring spread

The mean tells you the center; the variance tells you how far the values typically sit from that center. Definition:

\[ \operatorname{Var}(X)=E\big[(X-\mu)^2\big],\qquad \mu=E[X]. \]

You square the distance from the mean (so positives and negatives don't cancel) and average it. Bigger variance = more spread.

The formula you actually use. Expanding the square gives a much faster equivalent:

\[ \operatorname{Var}(X)=E[X^2]-(E[X])^2. \]

So the recipe is always: compute \(E[X]\), compute \(E[X^2]\), subtract the square of the first from the second.

Standard deviation. Variance is in squared units (euros², years²…), which is awkward. Take the square root to get back to the original units:

\[ \sigma=\operatorname{SD}(X)=\sqrt{\operatorname{Var}(X)}. \]

Example — fair die. We found \(E[X]=3.5\); the second moment is \(E[X^2]=\tfrac{1+4+9+16+25+36}{6}=\tfrac{91}{6}\). So

\[ \operatorname{Var}(X)=\frac{91}{6}-\left(\frac72\right)^2=\frac{91}{6}-\frac{49}{4}=\frac{35}{12}\approx2.917. \]

Example — 0/1 (Bernoulli) variable. If \(P(X=1)=p\), then \(E[X]=p\) and (since \(X^2=X\)) \(E[X^2]=p\), so

\[ \operatorname{Var}(X)=p-p^2=p(1-p). \]
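The variance recipe translates directly into code. A minimal sketch, again with a pmf as a dictionary; the Bernoulli parameter \(3/10\) is an arbitrary illustrative value.

```python
from fractions import Fraction
import math

def variance(pmf):
    """Var(X) = E[X^2] - (E[X])^2 for a pmf given as {value: probability}."""
    mean = sum(x * p for x, p in pmf.items())
    second_moment = sum(x ** 2 * p for x, p in pmf.items())
    return second_moment - mean ** 2

die = {x: Fraction(1, 6) for x in range(1, 7)}
var_die = variance(die)
sd_die = math.sqrt(var_die)

# Bernoulli check with an arbitrary p (3/10 is just an illustrative choice).
p = Fraction(3, 10)
var_bernoulli = variance({1: p, 0: 1 - p})

print(var_die, round(sd_die, 3), var_bernoulli == p * (1 - p))  # 35/12 1.708 True
```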

Linear transformations: E[aX+b] and Var(aX+b)

Rescaling and shifting a variable — change of units, a fee added to a payoff — is a linear transformation \(Y=aX+b\). Two clean rules:

\[ E[aX+b]=a\,E[X]+b, \qquad \operatorname{Var}(aX+b)=a^2\,\operatorname{Var}(X). \]

Read the rules. For the mean, the constant \(b\) shifts it and the factor \(a\) scales it — average moves exactly the way the data moves. For the variance, adding \(b\) does nothing (shifting everything by the same amount doesn't change spread), and the scale factor comes out squared because variance is built from squared distances.

Worked example. Let \(X\) have \(E[X]=3\) and \(\operatorname{Var}(X)=2\). For \(Y=4+3X\) (so \(a=3,\,b=4\)):

\[ E[Y]=4+3(3)=13,\qquad \operatorname{Var}(Y)=3^2\cdot 2=18. \]

The standard deviation scales by \(|a|\) (not \(a^2\)): \(\operatorname{SD}(Y)=3\,\operatorname{SD}(X)\).
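You can confirm both rules on a concrete pmf by relabelling the values. The sketch below uses the fair die with \(a=3\), \(b=4\); the die is chosen only because its mean and variance are already known, and the helper name is an assumption.

```python
from fractions import Fraction

def mean_and_variance(pmf):
    """Return (E[X], Var(X)) for a pmf given as {value: probability}."""
    mean = sum(x * p for x, p in pmf.items())
    var = sum(x ** 2 * p for x, p in pmf.items()) - mean ** 2
    return mean, var

die = {x: Fraction(1, 6) for x in range(1, 7)}
a, b = 3, 4

# pmf of Y = aX + b: same probabilities, relabelled values.
transformed = {a * x + b: p for x, p in die.items()}

mean_x, var_x = mean_and_variance(die)
mean_y, var_y = mean_and_variance(transformed)

print(mean_y == a * mean_x + b, var_y == a ** 2 * var_x)  # True True
```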

Sums of random variables: when do means and variances add?

Real problems often add several random variables (total earnings of a couple, total Heads in many tosses). Two rules, and one important asymmetry between them.

Means always add. No conditions, ever:

\[ E[X+Y]=E[X]+E[Y], \qquad E\!\left[\sum_{i=1}^{k}X_i\right]=\sum_{i=1}^{k}E[X_i]. \]

Variances add only when there is no interaction. In general

\[ \operatorname{Var}(X+Y)=\operatorname{Var}(X)+\operatorname{Var}(Y)+2\operatorname{Cov}(X,Y), \]

where the extra term \(\operatorname{Cov}(X,Y)=E[(X-\mu_X)(Y-\mu_Y)]\) is the covariance — it captures whether the two move together. When \(X\) and \(Y\) are independent (knowing one tells you nothing about the other), the covariance is \(0\) and the variances simply add:

\[ \operatorname{Var}(X+Y)=\operatorname{Var}(X)+\operatorname{Var}(Y),\qquad \operatorname{Var}\!\left(\sum_{i=1}^{k}X_i\right)=\sum_{i=1}^{k}\operatorname{Var}(X_i)\quad(\text{independent}). \]

Also useful: for independent \(X,Y\), \(E[XY]=E[X]\,E[Y]\). (Covariance and joint behaviour get a full chapter later; here you only need: means add unconditionally, variances add under independence.)
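The asymmetry between means and variances shows up clearly with two dice. This sketch compares an independent pair with a fully dependent pair (\(Y\) is the same roll as \(X\)); both setups are illustrative choices, not examples from the lectures.

```python
from fractions import Fraction
from itertools import product

def variance(pmf):
    """Var(X) = E[X^2] - (E[X])^2 for a pmf given as {value: probability}."""
    mean = sum(x * p for x, p in pmf.items())
    return sum(x ** 2 * p for x, p in pmf.items()) - mean ** 2

die = {x: Fraction(1, 6) for x in range(1, 7)}

# Independent case: the joint probability factorises as p(x) * p(y).
sum_indep = {}
for (x, px), (y, py) in product(die.items(), repeat=2):
    sum_indep[x + y] = sum_indep.get(x + y, Fraction(0)) + px * py

# Fully dependent case: Y is the same roll as X, so X + Y = 2X.
sum_dep = {2 * x: p for x, p in die.items()}

print(variance(sum_indep) == 2 * variance(die))  # True: variances add
print(variance(sum_dep) == 4 * variance(die))    # True: the covariance term is not zero
```

In the dependent case \(\operatorname{Cov}(X,X)=\operatorname{Var}(X)\), so the general formula gives \(4\operatorname{Var}(X)\) rather than \(2\operatorname{Var}(X)\).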

Decision box — when can I push the average inside? \(E[g(X)]=g(E[X])\) is safe only for: linear functions \(aX+b\), sums, and products of independent variables. For anything non-linear (like \(X^2\)), it fails.

Recap — the discrete-RV toolkit

Everything in this chapter is computed by reading a pmf table and summing.

  • pmf valid: \(p(x)\ge0\) and \(\sum p(x)=1\). Frequencies → divide by total.
  • Mean: \(E[X]=\sum x_i\,p(x_i)\) — long-run average / balance point.
  • Function: \(E[g(X)]=\sum g(x_i)\,p(x_i)\); in particular \(E[X^2]=\sum x_i^2 p(x_i)\).
  • Variance: \(\operatorname{Var}(X)=E[X^2]-(E[X])^2\); \(\operatorname{SD}=\sqrt{\operatorname{Var}}\).
  • Linear: \(E[aX+b]=aE[X]+b\), \(\operatorname{Var}(aX+b)=a^2\operatorname{Var}(X)\).
  • Sums: means always add; variances add when independent.

Exam reflex: a table of values with probabilities + a question about "average / expected / fair price" → \(E[X]\). About "spread / variability / standard deviation / risk" → \(\operatorname{Var}(X)=E[X^2]-(E[X])^2\). About "expected payoff of a deal / which option" → compute \(E[\cdot]\) of each option and compare.

Click any card to flip. Rate it after to track what you need to revisit.
Card 1
When is a random variable called discrete?
When its possible values are separate points you can list — either finitely many (\(0,1,2,3\)) or an endless but listable sequence (\(1,2,3,\dots\)). Contrast: continuous = any value in an interval.
Card 2
What two conditions make a list \(p(x)\) a valid probability mass function?
(1) \(p(x)\ge 0\) for every value, and (2) \(\sum_x p(x)=1\). If you're given counts/frequencies, divide by the total first to get the \(p(x)\).
Card 3
Formula and meaning of the expected value \(E[X]\)?
\(E[X]=\sum_i x_i\,p(x_i)\): the probability-weighted average. It's the long-run average over many repeats / the balance point of the pmf — NOT the value to expect on a single try (a die never shows 3.5).
Card 4
How do you compute \(E[g(X)]\) for a function of a discrete RV?
Weight the function by the original probabilities: \(E[g(X)]=\sum_i g(x_i)\,p(x_i)\). No need to build a new pmf for \(g(X)\). Special case \(E[X^2]=\sum x_i^2 p(x_i)\).
Card 5
Is \(E[X^2]=(E[X])^2\)?
No, in general \(E[X^2]\neq (E[X])^2\). Pushing the average inside (\(E[g(X)]=g(E[X])\)) is valid only for linear functions, sums, and products of independent variables; for a non-linear \(g\) like the square it generally fails.
Card 6
Two formulas for the variance \(\operatorname{Var}(X)\)?
Definition \(\operatorname{Var}(X)=E[(X-\mu)^2]\); computational (the one you use) \(\operatorname{Var}(X)=E[X^2]-(E[X])^2\). Standard deviation \(\sigma=\sqrt{\operatorname{Var}(X)}\) restores the original units.
Card 7
You see a table of values \(x_i\) with probabilities \(p_i\) and a question about spread / variability / standard deviation. Procedure?
It's a variance problem. Compute \(E[X]=\sum x_i p_i\), then \(E[X^2]=\sum x_i^2 p_i\), then \(\operatorname{Var}=E[X^2]-(E[X])^2\), and \(\operatorname{SD}=\sqrt{\operatorname{Var}}\).
Card 8
Rules for a linear transformation \(Y=aX+b\)?
\(E[aX+b]=aE[X]+b\) and \(\operatorname{Var}(aX+b)=a^2\operatorname{Var}(X)\). The shift \(b\) doesn't affect variance; the scale comes out squared. \(\operatorname{SD}\) scales by \(|a|\).
Card 9
For Bernoulli (a 0/1 variable with \(P(X=1)=p\)), what are \(E[X]\) and \(\operatorname{Var}(X)\)?
\(E[X]=p\) (mean of a 0/1 variable is the probability of the 1), and \(\operatorname{Var}(X)=p(1-p)\) (because \(X^2=X\), so \(E[X^2]=p\)).
Card 10
Does \(E[X+Y]=E[X]+E[Y]\) always hold? Does \(\operatorname{Var}(X+Y)=\operatorname{Var}(X)+\operatorname{Var}(Y)\)?
Means always add, no conditions. Variances add when \(X,Y\) are independent, because then the covariance is \(0\). In general \(\operatorname{Var}(X+Y)=\operatorname{Var}(X)+\operatorname{Var}(Y)+2\operatorname{Cov}(X,Y)\).
Card 11
An exam asks the expected payoff of a deal / which of two options to take. What do you do?
Model the payoff as a discrete RV (value × probability table), compute \(E[\text{payoff}]\) for each option, and pick the better expected value. If 'fair price' is asked, it's the value making \(E[\text{net}]=0\).
EX. 01

Build and validate a pmf, then find E[X]

MACS06 (left-handed example) easy

In a population \(10\%\) of people are left-handed. Two people are picked independently. Let \(X\) be the number of left-handers among them, with pmf \(p(0)=0.81,\ p(1)=0.18,\ p(2)=0.01\). (a) Check this is a valid pmf. (b) Find \(E[X]\).

Valid pmf = entries \(\ge0\) and sum \(=1\). Then \(E[X]=\sum x_i p(x_i)\).

Step 1. Validity
All three probabilities are \(\ge0\), and \(0.81+0.18+0.01=1\). Valid pmf.
Step 2. Expected value
\(E[X]=0(0.81)+1(0.18)+2(0.01)=0.18+0.02=0.20.\)
\(E[X]=0.20\) left-handers on average.
EX. 02

E[X²] is not (E[X])²

MACS06 (left-handed example) easy

For the same variable (\(p(0)=0.81,p(1)=0.18,p(2)=0.01\), \(E[X]=0.20\)), compute \(E[X^2]\) and compare it with \((E[X])^2\).

Use \(E[X^2]=\sum x_i^2 p(x_i)\). Then square the mean separately.

Step 1. Apply E[g(X)] with g(x)=x²
\(E[X^2]=0^2(0.81)+1^2(0.18)+2^2(0.01)=0+0.18+0.04=0.22.\)
Step 2. Compare
\((E[X])^2=(0.20)^2=0.04\neq 0.22=E[X^2]\). The two differ — the squaring is non-linear.
\(E[X^2]=0.22\), while \((E[X])^2=0.04\): not equal.
EX. 03

Variance of a fair die

MACS06–07 easy

Let \(X\) be the result of one fair die roll. Find \(\operatorname{Var}(X)\) and \(\operatorname{SD}(X)\).

Get \(E[X]\) and \(E[X^2]\) from the six equally likely values, then \(\operatorname{Var}=E[X^2]-(E[X])^2\).

Step 1. Mean
\(E[X]=\frac{1+2+3+4+5+6}{6}=\frac{21}{6}=3.5.\)
Step 2. Second moment
\(E[X^2]=\frac{1+4+9+16+25+36}{6}=\frac{91}{6}.\)
Step 3. Variance
\(\operatorname{Var}(X)=\frac{91}{6}-\left(\frac72\right)^2=\frac{91}{6}-\frac{49}{4}=\frac{182-147}{12}=\frac{35}{12}.\)
Step 4. SD
\(\operatorname{SD}(X)=\sqrt{35/12}\approx1.708.\)
\(\operatorname{Var}(X)=\tfrac{35}{12}\approx2.917,\ \operatorname{SD}\approx1.708.\)
EX. 04

Employee tenure: E[X] and Var(X) from a frequency table

Esercizio 3, n.11 medium

A company has \(50\) employees. The number of years (rounded up) they have worked there:

Years | 1 | 2 | 3 | 4
Count | 12 | 15 | 16 | 7

One employee is chosen at random; \(X=\) years of service. Find (a) \(E[X]\) and (b) \(\operatorname{Var}(X)\).

Turn counts into a pmf (divide by \(50\)), then apply \(E[X]\) and \(\operatorname{Var}=E[X^2]-(E[X])^2\).

Step 1. Frequencies → pmf
\(p(1)=\tfrac{12}{50},\ p(2)=\tfrac{15}{50},\ p(3)=\tfrac{16}{50},\ p(4)=\tfrac{7}{50}\). (Sum \(=1\).)
Step 2. Mean
\(E[X]=\frac{1(12)+2(15)+3(16)+4(7)}{50}=\frac{12+30+48+28}{50}=\frac{118}{50}=2.36.\)
Step 3. Second moment
\(E[X^2]=\frac{1(12)+4(15)+9(16)+16(7)}{50}=\frac{12+60+144+112}{50}=\frac{328}{50}=6.56.\)
Step 4. Variance
\(\operatorname{Var}(X)=6.56-(2.36)^2=6.56-5.5696=0.9904.\) (\(\operatorname{SD}\approx0.995\).)
\(E[X]=2.36\) years, \(\operatorname{Var}(X)=0.9904\) (\(\operatorname{SD}\approx0.995\)).
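The whole exercise, from counts to variance, fits in a few lines. A minimal sketch with exact fractions; the variable names are illustrative.

```python
from fractions import Fraction

counts = {1: 12, 2: 15, 3: 16, 4: 7}   # years of service -> number of employees
total = sum(counts.values())

# Frequencies -> pmf: divide each count by the total.
pmf = {x: Fraction(c, total) for x, c in counts.items()}

mean = sum(x * p for x, p in pmf.items())
second_moment = sum(x ** 2 * p for x, p in pmf.items())
var = second_moment - mean ** 2

print(float(mean), float(second_moment), float(var))  # 2.36 6.56 0.9904
```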
EX. 05

Tomorrow's earnings: expected value and variance of a payoff

Esercizio 3, n.9 medium

If it rains tomorrow you earn €200 tutoring; if it is dry you earn €300 in construction. The probability of rain is \(\tfrac14\). Find the expected amount you earn and its variance.

Define the payoff RV: \(X=200\) with prob \(\tfrac14\), \(X=300\) with prob \(\tfrac34\). Then \(E[X]\) and \(\operatorname{Var}=E[X^2]-(E[X])^2\).

Step 1. Payoff pmf
\(P(X=200)=0.25,\ P(X=300)=0.75.\)
Step 2. Expected earnings
\(E[X]=200(0.25)+300(0.75)=50+225=275.\)
Step 3. Second moment
\(E[X^2]=200^2(0.25)+300^2(0.75)=10000+67500=77500.\)
Step 4. Variance
\(\operatorname{Var}(X)=77500-275^2=77500-75625=1875.\) (\(\operatorname{SD}\approx43.3\).)
\(E[X]=€275\), \(\operatorname{Var}(X)=1875\) (\(\operatorname{SD}\approx€43.3\)).
EX. 06

Family earnings: sum of independent variables

Esercizio 3, n.12 medium

Matteo earns an amount with mean €30000 and SD €3000. His wife Alessia earns an amount with mean €32000 and SD €5000. Their earnings are independent. Find (a) the expected value and (b) the standard deviation of the family's total earnings.

Means always add. For the SD, add the variances (independence), then take the square root — you cannot add SDs directly.

Step 1. Expected total
\(E[M+A]=E[M]+E[A]=30000+32000=62000.\)
Step 2. Variance of total (independent)
\(\operatorname{Var}(M+A)=\operatorname{Var}(M)+\operatorname{Var}(A)=3000^2+5000^2=9{,}000{,}000+25{,}000{,}000=34{,}000{,}000.\)
Step 3. Standard deviation
\(\operatorname{SD}=\sqrt{34{,}000{,}000}\approx 5830.95.\) Note this is NOT \(3000+5000=8000\).
\(E=€62000\), \(\operatorname{SD}\approx€5831\) (not €8000).
EX. 07

Should the firm risk using the computers? (expected-loss decision)

Esercizio 3, n.10 medium

There is a \(25\%\) chance the power will be shut off during the next working day. If employees do not use their computers, the firm loses €400 in revenue for sure. If they use them and the power is cut mid-use, it costs €1200 (and €0 if no cut). To minimise expected loss, should the firm risk using the computers?

Compute the expected loss of each option and pick the smaller one.

Step 1. Option A — don't use
Loss is €400 with certainty: \(E[\text{loss}]=400.\)
Step 2. Option B — use
\(E[\text{loss}]=0.25(1200)+0.75(0)=300.\)
Step 3. Compare
\(300<400\), so using the computers has the lower expected loss.
Yes — risk using them: expected loss €300 vs €400.
EX. 08

Size-biased sampling: E[X] vs E[Y] (the bus problem)

MACS06 hard

Four buses carry \(40,33,50,25\) students (\(148\) total). (a) A student is chosen at random; \(X=\) number on that student's bus. (b) A bus driver is chosen at random; \(Y=\) number on that driver's bus. Find \(E[X]\) and \(E[Y]\) and explain why they differ.

Picking a student makes big buses more likely (size bias): \(P(X=n)=n/148\). Picking a driver makes each bus equally likely: \(P(Y=n)=1/4\).

Step 1. pmf of X (student-weighted)
\(P(X=n)=\dfrac{n}{148}\) for \(n\in\{25,33,40,50\}\).
Step 2. E[X]
\(E[X]=\dfrac{25^2+33^2+40^2+50^2}{148}=\dfrac{625+1089+1600+2500}{148}=\dfrac{5814}{148}\approx39.28.\)
Step 3. E[Y] (driver-weighted)
\(P(Y=n)=\tfrac14\) each, so \(E[Y]=\dfrac{25+33+40+50}{4}=\dfrac{148}{4}=37.\)
Step 4. Why bigger
\(E[X]>E[Y]\) because a randomly chosen student is more likely to come from a crowded bus — size-biased sampling inflates the average.
\(E[X]=\tfrac{5814}{148}\approx39.28\), \(E[Y]=37\); \(E[X]\) is larger due to size bias.
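The two sampling schemes differ only in the weights, which a short sketch makes explicit. Variable names are illustrative.

```python
from fractions import Fraction

buses = [40, 33, 50, 25]
total = sum(buses)

# Student chosen at random: a bus of size n is picked with probability n / total.
e_x = sum(n * Fraction(n, total) for n in buses)

# Driver chosen at random: each bus is picked with probability 1 / 4.
e_y = Fraction(sum(buses), len(buses))

print(e_x, round(float(e_x), 2), e_y)
```

The first value prints in lowest terms (2907/74), which equals \(5814/148\).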
CH. 04 Statistics

Discrete Distributions

The named models — Bernoulli, Binomial, Geometric, Poisson (and Hypergeometric, out of scope): recognise the story, then read the pmf, mean, and variance off the shelf.

Sections 08
Flashcards 12
Exercises 9
Read time 4'
Sources Ross, Probability & Statistics for Engineers 5E, §5.1–5.3 (Bernoulli/Binomial, Poisson, Hypergeometric) · MACS lectures 06–07 (Perone Pacifico, LUISS) · Past exams & practice: Esercizio 3 (nn.2,5,8), Esercizio 4 (nn.1,5), Esercizio 5 (nn.6,8), MAIfinal5
Key concepts
Bernoulli Binomial Geometric Poisson Poisson approximation Hypergeometric (out of scope) Binomial coefficient Recognising the distribution

Why named distributions: recognise the story, pull the formula off the shelf

Chapter 3 gave you the general machinery (pmf, \(E[X]\), \(\operatorname{Var}\)). But a handful of situations come up again and again — counting successes, waiting for a first success, counting rare events. For each, statisticians have already worked out the pmf, mean, and variance once and for all. These are the named discrete distributions.

Your job in an exam is almost never to derive them. It is to recognise the story in the problem text and then read the right formula off the shelf. So for each model below, learn three things: the story (when it applies), the pmf, and \(E[X]\) and \(\operatorname{Var}(X)\).

The four you must know: Bernoulli (one yes/no trial), Binomial (count successes in a fixed number of trials), Geometric (how long until the first success), Poisson (count of rare events at a known rate). A fifth, Hypergeometric, is covered lightly — it is out of exam scope but appears in the slides.

Bernoulli: a single yes/no trial

Story. One trial with exactly two outcomes — success (\(1\)) with probability \(p\), failure (\(0\)) with probability \(1-p\). A single coin flip, one inspected item, one customer who either buys or not.

pmf: \[ p(1)=p,\qquad p(0)=1-p, \] compactly \(P(X=x)=p^x(1-p)^{1-x}\) for \(x\in\{0,1\}\).

Mean and variance: \[ E[X]=p,\qquad \operatorname{Var}(X)=p(1-p). \]

(Both shown in Chapter 3: \(E[X]=p\) because the mean of a 0/1 variable is the probability of the 1; \(\operatorname{Var}=p(1-p)\) because \(X^2=X\).) Bernoulli is the atom: every Binomial is a sum of independent Bernoullis.

Binomial: count successes in n independent trials

Story. You repeat the same Bernoulli trial \(n\) times, independently, each with success probability \(p\). \(X=\) total number of successes. Recognition cues: a fixed number of trials \(n\), each trial independent, same \(p\), and you count how many succeed. "Out of 10 items, how many defective?" "Out of 4 components, how many work?"

pmf: \[ P(X=i)=\binom{n}{i}p^i(1-p)^{n-i},\qquad i=0,1,\dots,n, \] where \(\binom{n}{i}=\dfrac{n!}{i!\,(n-i)!}\) counts the ways to choose which \(i\) trials succeed.

Mean and variance: \[ E[X]=np,\qquad \operatorname{Var}(X)=np(1-p). \]

Both follow instantly from writing \(X=X_1+\dots+X_n\) (independent Bernoullis): means add (\(np\)) and, by independence, variances add (\(np(1-p)\)).

Common asks. "At least one" is easiest via the complement: \(P(X\ge1)=1-P(X=0)=1-(1-p)^n\). "At most one": \(P(X\le1)=P(X=0)+P(X=1)\). Reverse-engineering \(n,p\) from a stated \(E[X]\) and \(\operatorname{Var}(X)\) is a classic exam twist — solve \(np\) and \(np(1-p)\) together.
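A minimal Python sketch of the Binomial pmf and the common asks, using `math.comb` from the standard library. The values \(n=10\), \(p=0.1\) are illustrative and not taken from a specific exercise.

```python
import math

def binomial_pmf(i, n, p):
    """P(X = i) for X ~ Binomial(n, p)."""
    return math.comb(n, i) * p ** i * (1 - p) ** (n - i)

n, p = 10, 0.1  # illustrative values, not taken from a specific exercise

pmf = [binomial_pmf(i, n, p) for i in range(n + 1)]
mean = sum(i * q for i, q in enumerate(pmf))
var = sum(i ** 2 * q for i, q in enumerate(pmf)) - mean ** 2

at_least_one = 1 - binomial_pmf(0, n, p)                     # complement of "no successes"
at_most_one = binomial_pmf(0, n, p) + binomial_pmf(1, n, p)  # P(X=0) + P(X=1)

print(round(sum(pmf), 10), round(mean, 10), round(var, 10))
print(round(at_least_one, 4), round(at_most_one, 4))
```

Up to floating-point rounding, the first line reproduces a total of 1, \(np\), and \(np(1-p)\).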

Geometric: how many trials until the first success

Story. Repeat an independent Bernoulli trial (prob \(p\)) until the first success, and let \(X=\) the trial on which it happens. Recognition cue: "draw/try until the first…", "how many attempts needed?". Unlike Binomial, the number of trials is not fixed — it is what you are counting.

pmf: \[ P(X=k)=p\,(1-p)^{k-1},\qquad k=1,2,3,\dots \] (the first \(k-1\) trials fail, the \(k\)-th succeeds).

Mean and variance: \[ E[X]=\frac1p,\qquad \operatorname{Var}(X)=\frac{1-p}{p^2}. \]

The mean \(1/p\) matches intuition: if success has probability \(\tfrac15\), you wait about \(5\) trials on average. The pmf \(p(1-p)^{k-1}\) (sometimes written with parameter \(\theta\)) also shows up in estimation problems, where its parameter is estimated by \(\hat p=1/\bar X\).
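To see \(E[X]=1/p\) and \(\operatorname{Var}(X)=(1-p)/p^2\) come out of the pmf, you can sum the series numerically. The sketch below truncates the infinite sum at an arbitrary cut-off `K` (an assumption that is harmless here because the tail is astronomically small); `geometric_pmf` is a hypothetical helper name.

```python
def geometric_pmf(k: int, p: float) -> float:
    """P(X = k): the first success happens on trial k (k = 1, 2, ...)."""
    return p * (1 - p) ** (k - 1)

p = 0.2    # success probability 1/5, as in the intuition above
K = 2000   # truncation point for the infinite sums (assumption)

ks = range(1, K + 1)
total = sum(geometric_pmf(k, p) for k in ks)
mean = sum(k * geometric_pmf(k, p) for k in ks)
var = sum(k**2 * geometric_pmf(k, p) for k in ks) - mean**2

print(round(total, 6), round(mean, 6), round(var, 6))
```

With \(p=\tfrac15\) the formulas give a mean of \(5\) and a variance of \(0.8/0.04=20\), which is what the truncated sums approach.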

Poisson: counting rare events at a known rate

Story. Count how many times a rare event happens in a fixed interval of time or space, when events occur independently at a constant average rate \(\lambda\). "On average \(3\) accidents per week — probability of at least one?" "\(5\) claims per day on average." It is the law of rare events: misprints per page, calls per minute, particles per second, customers per hour.

pmf: \[ P(X=i)=e^{-\lambda}\,\frac{\lambda^i}{i!},\qquad i=0,1,2,\dots \] where \(\lambda>0\) is the average count for the interval considered.

Mean and variance — both equal \(\lambda\): \[ E[X]=\lambda,\qquad \operatorname{Var}(X)=\lambda. \] (So \(\operatorname{SD}=\sqrt\lambda\).) If the mean and variance of a count are roughly equal, Poisson is a natural model.

Scaling the rate. If the rate is "\(1\) per \(2\) minutes" and you look at a \(5\)-minute window, set \(\lambda = 5/2 = 2.5\) for that window. Match \(\lambda\) to the interval in the question. The handy complement again: \(P(X\ge1)=1-e^{-\lambda}\).
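The rate-scaling step is easy to get wrong under time pressure, so here is a small sketch of it in Python. It uses only the standard library; `poisson_pmf` and the variable names are illustrative, and the numbers are the "1 per 2 minutes, 5-minute window" case from the paragraph above.

```python
from math import exp, factorial

def poisson_pmf(i: int, lam: float) -> float:
    """P(X = i) for X ~ Poisson(lam)."""
    return exp(-lam) * lam**i / factorial(i)

rate_per_minute = 1 / 2    # "1 event every 2 minutes"
window_minutes = 5
lam = rate_per_minute * window_minutes   # lambda for THIS window

p_none = poisson_pmf(0, lam)
p_at_least_one = 1 - p_none                                      # complement
p_at_least_four = 1 - sum(poisson_pmf(k, lam) for k in range(4))

print(lam, round(p_none, 4), round(p_at_least_one, 4), round(p_at_least_four, 4))
```

The design point is that \(\lambda\) is computed once, for the window in the question, and every probability afterwards uses that same value.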

Poisson as an approximation to the Binomial

When you have many trials with a tiny success probability — \(n\) large, \(p\) small — the Binomial is awkward to compute but is almost exactly Poisson with

\[ \lambda = np. \]

\[ \binom{n}{i}p^i(1-p)^{n-i}\;\approx\;e^{-\lambda}\frac{\lambda^i}{i!},\qquad \lambda=np. \]

Rule of thumb: good when \(n\) is large (say \(\ge 20\)) and \(p\) small (say \(\le 0.05\)). Recognition cue in exams: a binomial setup with a huge \(n\) and a tiny \(p\) ("\(1000\) poker hands, each a full house with prob \(0.0014\)"). Switch to Poisson with \(\lambda=np\) and the arithmetic becomes trivial.
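How good is the approximation in practice? The following sketch compares the exact Binomial with its Poisson stand-in for the poker example (\(n=1000\), \(p=0.0014\)). The helper names `binom` and `pois` are hypothetical; only the standard library is used.

```python
from math import comb, exp, factorial

n, p = 1000, 0.0014
lam = n * p   # Poisson parameter for the approximation

def binom(i: int) -> float:
    """Exact Binomial(n, p) probability of i successes."""
    return comb(n, i) * p**i * (1 - p) ** (n - i)

def pois(i: int) -> float:
    """Poisson(lam) probability of i events."""
    return exp(-lam) * lam**i / factorial(i)

for i in range(5):
    print(i, round(binom(i), 5), round(pois(i), 5))

exact = 1 - binom(0) - binom(1)    # P(X >= 2), exact
approx = 1 - pois(0) - pois(1)     # P(X >= 2), Poisson approximation
print(round(exact, 4), round(approx, 4))
```

The two columns should agree to about three decimal places, which is why the switch to Poisson costs essentially nothing in accuracy here.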

Hypergeometric (out of exam scope): sampling without replacement

Marked OUT-OF-EXAM — taught in the slides but not tested. Know the story so you can tell it apart from Binomial.

Story. A finite population has \(N\) "good" and \(M\) "bad" items. You draw \(n\) without replacement; \(X=\) number of good ones drawn. Because draws are not replaced, the trials are not independent — that is exactly why it is not Binomial.

pmf: \[ P(X=i)=\frac{\binom{N}{i}\binom{M}{n-i}}{\binom{N+M}{n}}. \]

Mean: with \(p=\dfrac{N}{N+M}\), \(E[X]=np\) (same as Binomial), but the variance carries a finite-population correction \(\operatorname{Var}(X)=np(1-p)\big(1-\tfrac{n-1}{N+M-1}\big)\).

Key distinction. With replacement (or a huge population) → Binomial. Without replacement from a small population → Hypergeometric. When \(N+M\) is large relative to \(n\), the finite-population correction is close to \(1\) and the two models nearly coincide.
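For contrast with the Binomial, the Hypergeometric pmf and the finite-population correction can be sketched in a few lines of Python. The function name `hypergeom_pmf` and the numbers (\(N=7\) good, \(M=3\) bad, \(n=2\) draws) are assumptions chosen for illustration.

```python
from math import comb

def hypergeom_pmf(i: int, n: int, N: int, M: int) -> float:
    """P(i good items among n draws without replacement from N good + M bad)."""
    return comb(N, i) * comb(M, n - i) / comb(N + M, n)

N, M, n = 7, 3, 2   # illustrative values (assumption)
pmf = [hypergeom_pmf(i, n, N, M) for i in range(n + 1)]

mean = sum(i * q for i, q in enumerate(pmf))
var = sum(i**2 * q for i, q in enumerate(pmf)) - mean**2

p = N / (N + M)
binomial_var = n * p * (1 - p)                  # variance if draws were replaced
correction = 1 - (n - 1) / (N + M - 1)          # finite-population correction

print(round(mean, 4), round(var, 4), round(binomial_var * correction, 4))
```

The mean matches \(np\), while the variance is the Binomial variance shrunk by the correction factor.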

Decision box — which discrete model is it?

Read the problem and match the cue:

  • One yes/no trial → Bernoulli(\(p\)): \(E=p,\ \operatorname{Var}=p(1-p)\).
  • Fixed \(n\) independent trials, count successes → Binomial(\(n,p\)): \(P=\binom{n}{i}p^i(1-p)^{n-i}\), \(E=np,\ \operatorname{Var}=np(1-p)\).
  • Trials until first success → Geometric(\(p\)): \(P=p(1-p)^{k-1}\), \(E=1/p,\ \operatorname{Var}=(1-p)/p^2\).
  • Count of rare events at rate \(\lambda\) per interval → Poisson(\(\lambda\)): \(P=e^{-\lambda}\lambda^i/i!\), \(E=\operatorname{Var}=\lambda\).
  • Big \(n\), tiny \(p\) → approximate Binomial by Poisson with \(\lambda=np\).
  • Draw without replacement, count good → Hypergeometric (out of exam).

Reflexes: "at least one" → \(1-P(0)\). "with/without replacement" decides Binomial vs Hypergeometric. "fixed trials, count hits" = Binomial; "how long until" = Geometric.

Click any card to flip. Rate it after to track what you need to revisit.
Card 1
You read: \(n\) fixed independent trials, same success probability \(p\), count the successes. Which distribution, and its pmf / \(E\) / \(\operatorname{Var}\)?
Binomial(\(n,p\)). \(P(X=i)=\binom{n}{i}p^i(1-p)^{n-i}\), \(E[X]=np\), \(\operatorname{Var}(X)=np(1-p)\).
Card 2
Distribution for a single yes/no trial? Mean and variance?
Bernoulli(\(p\)): \(P(1)=p,P(0)=1-p\). \(E[X]=p\), \(\operatorname{Var}(X)=p(1-p)\). It's the building block of the Binomial.
Card 3
You read: keep trying until the first success; how many attempts? Distribution, pmf, mean?
Geometric(\(p\)): \(P(X=k)=p(1-p)^{k-1}\) for \(k=1,2,\dots\). \(E[X]=1/p\), \(\operatorname{Var}(X)=(1-p)/p^2\). (Number of trials is not fixed.)
Card 4
You read: count of rare events in an interval, average rate \(\lambda\). Distribution, pmf, \(E\), \(\operatorname{Var}\)?
Poisson(\(\lambda\)): \(P(X=i)=e^{-\lambda}\lambda^i/i!\) for \(i=0,1,2,\dots\). \(E[X]=\operatorname{Var}(X)=\lambda\). The 'law of rare events'.
Card 5
When can you replace a Binomial(\(n,p\)) by a Poisson, and with what parameter?
When \(n\) is large and \(p\) small (rule of thumb \(n\ge20,\ p\le0.05\)). Use \(\lambda=np\): \(\binom{n}{i}p^i(1-p)^{n-i}\approx e^{-\lambda}\lambda^i/i!\).
Card 6
Binomial vs Hypergeometric — what's the deciding feature?
With replacement (or huge population) → trials independent → Binomial. Without replacement from a small population → trials dependent → Hypergeometric. (Hypergeometric is out of exam scope here.)
Card 7
Fastest way to compute 'at least one' for Binomial or Poisson?
Use the complement: \(P(X\ge1)=1-P(X=0)\). Binomial: \(1-(1-p)^n\). Poisson: \(1-e^{-\lambda}\).
Card 8
Given a Binomial with \(E[X]=6\) and \(\operatorname{Var}(X)=2.4\), how do you recover \(n\) and \(p\)?
\(np=6\) and \(np(1-p)=2.4\). Divide: \(1-p=2.4/6=0.4\Rightarrow p=0.6\), then \(n=6/0.6=10\). (Always: \(1-p=\operatorname{Var}/E\).)
Card 9
For a Poisson(\(\lambda\)), what is the standard deviation?
\(\operatorname{SD}=\sqrt{\operatorname{Var}}=\sqrt{\lambda}\) (since \(\operatorname{Var}=\lambda\)). E.g. Poisson(144) has SD \(=12\).
Card 10
The rate is '1 event every 2 minutes'. What \(\lambda\) do you use for a 5-minute window?
Scale the rate to the interval: \(\lambda = 5/2 = 2.5\). Always match \(\lambda\) to the time/space window in the question.
Card 11
What does the binomial coefficient \(\binom{n}{i}\) equal, and what does it count?
\(\binom{n}{i}=\dfrac{n!}{i!\,(n-i)!}\). It counts the number of ways to choose which \(i\) of the \(n\) trials are the successes.
Card 12
A pmf is given as \(p(i)=c\,\lambda^i/i!\). What distribution, and what is \(c\)?
Poisson(\(\lambda\)). Since \(\sum_i \lambda^i/i!=e^{\lambda}\), normalisation forces \(c=e^{-\lambda}\). Then \(P(X=0)=e^{-\lambda}\).
EX. 01

Binomial: defective ball bearings

Esercizio 4, n.4 easy

Each ball bearing is independently defective with probability \(0.05\). A sample of \(5\) is inspected. Find (a) \(P(\text{none defective})\) and (b) \(P(\text{two or more defective})\).

\(X\sim\text{Binomial}(5,0.05)\). Use the complement for (b).

Step 1. Identify
Fixed \(n=5\) independent trials, \(p=0.05\), count defectives \(\Rightarrow X\sim\text{Binomial}(5,0.05)\).
Step 2. (a) none
\(P(X=0)=(0.95)^5\approx0.7738.\)
Step 3. (b) two or more
\(P(X\ge2)=1-P(0)-P(1)\). \(P(1)=\binom{5}{1}(0.05)(0.95)^4\approx0.2036\). So \(P(X\ge2)\approx1-0.7738-0.2036=0.0226.\)
(a) \(\approx0.7738\); (b) \(\approx0.0226\).
EX. 02

Binomial: satellite reliability

Esercizio 3, n.5 easy

A system has \(4\) components; it works if at least \(2\) function. Each works independently with probability \(0.8\). Find \(P(\text{system works})\).

\(Y\sim\text{Binomial}(4,0.8)\); \(P(Y\ge2)=1-P(0)-P(1)\).

Step 1. Model
\(Y=\)number working \(\sim\text{Binomial}(4,0.8)\).
Step 2. Complement
\(P(Y=0)=(0.2)^4=0.0016\); \(P(Y=1)=\binom{4}{1}(0.8)(0.2)^3=0.0256\).
Step 3. Combine
\(P(Y\ge2)=1-0.0016-0.0256=0.9728.\)
\(P(\text{works})=0.9728.\)
EX. 03

Binomial: recover n and p from E and Var

Esercitazione 4, n.1 medium

\(X\sim\text{Binomial}(n,p)\) with \(E[X]=6\) and \(\operatorname{Var}(X)=2.4\). Find \(n\), \(p\), and \(P(X=5)\).

Use \(np=E\) and \(np(1-p)=\operatorname{Var}\); divide to isolate \(1-p\).

Step 1. Solve for p
\(1-p=\dfrac{\operatorname{Var}}{E}=\dfrac{2.4}{6}=0.4\Rightarrow p=0.6.\)
Step 2. Solve for n
\(n=\dfrac{E}{p}=\dfrac{6}{0.6}=10.\) (Check: \(np(1-p)=10\cdot0.6\cdot0.4=2.4\) ✓.)
Step 3. P(X=5)
\(P(X=5)=\binom{10}{5}(0.6)^5(0.4)^5=252\cdot0.07776\cdot0.01024\approx0.2007.\)
\(n=10,\ p=0.6,\ P(X=5)\approx0.2007.\)
EX. 04

Poisson: weekly accidents

Esercitazione 4, n.5 easy

The average number of accidents on a road per week is \(1.2\). Find the probability of at least one accident this week.

\(X\sim\text{Poisson}(1.2)\); \(P(X\ge1)=1-e^{-\lambda}\).

Step 1. Model
Rare events at rate \(\lambda=1.2\) per week \(\Rightarrow X\sim\text{Poisson}(1.2)\).
Step 2. Complement
\(P(X\ge1)=1-P(X=0)=1-e^{-1.2}\approx1-0.3012=0.6988.\)
\(P(X\ge1)\approx0.6988.\)
EX. 05

Poisson: radioactive emissions, at most 2

Ross 5E, Ex 5.2c medium

A gram of radioactive material emits on average \(3.2\) \(\alpha\)-particles per second. Find the probability of at most \(2\) emissions in one second.

\(X\sim\text{Poisson}(3.2)\); sum \(P(0)+P(1)+P(2)\).

Step 1. Set up
\(X\sim\text{Poisson}(3.2)\), so \(P(X\le2)=e^{-3.2}\!\left(1+3.2+\dfrac{3.2^2}{2}\right).\)
Step 2. Compute
\(1+3.2+5.12=9.32\); \(e^{-3.2}\approx0.04076\); product \(\approx0.380.\)
\(P(X\le2)\approx0.380.\)
EX. 06

Poisson: casino arrivals

Esercizio 5, n.8 medium

People enter a casino at an average rate of \(1\) every \(2\) minutes. For the window 12:00–12:05, find (a) \(P(\text{nobody enters})\) and (b) \(P(\text{at least }4\text{ enter})\).

Scale the rate to the 5-minute window: \(\lambda=5/2=2.5\).

Step 1. Rate for the window
\(\lambda=5/2=2.5\Rightarrow X\sim\text{Poisson}(2.5)\).
Step 2. (a) nobody
\(P(X=0)=e^{-2.5}\approx0.0821.\)
Step 3. (b) at least 4
\(P(X\ge4)=1-\sum_{k=0}^{3}e^{-2.5}\dfrac{2.5^k}{k!}=1-0.7576\approx0.2424.\)
(a) \(\approx0.0821\); (b) \(\approx0.2424\).
EX. 07

Poisson approximation to a Binomial: poker full houses

Esercizio 5, n.6 medium

A poker hand is a full house with probability \(0.0014\). In \(1000\) independent hands, find the probability of at least \(2\) full houses.

\(X\sim\text{Binomial}(1000,0.0014)\): large \(n\), tiny \(p\). Approximate by Poisson with \(\lambda=np\).

Step 1. Approximate
\(\lambda=np=1000\cdot0.0014=1.4\Rightarrow X\approx\text{Poisson}(1.4)\).
Step 2. At least 2
\(P(X\ge2)=1-P(0)-P(1)=1-e^{-1.4}(1+1.4)=1-0.5918\approx0.4082.\)
\(P(X\ge2)\approx0.4082.\)
EX. 08

Geometric: drawing until the first black ball

Esercizio 3, n.8 medium

An urn has \(N\) white and \(M\) black balls. You draw with replacement until you get a black ball. Find \(P(\text{exactly }n\text{ draws})\) and the expected number of draws.

Each draw is black with probability \(p=M/(N+M)\); 'until first success' \(\Rightarrow\) Geometric.

Step 1. Identify p
\(p=P(\text{black})=\dfrac{M}{N+M}\), draws independent (replacement).
Step 2. pmf
\(n-1\) whites then a black: \(P(X=n)=p(1-p)^{n-1}=\dfrac{M}{N+M}\left(\dfrac{N}{N+M}\right)^{n-1}.\)
Step 3. Expected draws
\(E[X]=1/p=\dfrac{N+M}{M}.\)
\(P(X=n)=p(1-p)^{n-1}\) with \(p=\tfrac{M}{N+M}\); \(E[X]=\tfrac{N+M}{M}.\)
EX. 09

Hypergeometric (out of exam): defective batteries

Esercizio 3, n.2 medium

(Out of exam scope — for contrast with the Binomial.) From a bin of \(10\) batteries (\(7\) good, \(3\) defective), \(2\) are chosen without replacement. Let \(X=\) number of defectives. Give the pmf.

Without replacement from a small population \(\Rightarrow\) Hypergeometric. Use \(\dfrac{\binom{3}{i}\binom{7}{2-i}}{\binom{10}{2}}\).

Step 1. P(X=0)
\(\dfrac{\binom{3}{0}\binom{7}{2}}{\binom{10}{2}}=\dfrac{21}{45}=\dfrac{7}{15}\approx0.467.\)
Step 2. P(X=1)
\(\dfrac{\binom{3}{1}\binom{7}{1}}{\binom{10}{2}}=\dfrac{21}{45}=\dfrac{7}{15}\approx0.467.\)
Step 3. P(X=2)
\(\dfrac{\binom{3}{2}\binom{7}{0}}{\binom{10}{2}}=\dfrac{3}{45}=\dfrac{1}{15}\approx0.067.\) (Sum \(=1\) ✓.)
\(P(0)=P(1)=\tfrac{7}{15},\ P(2)=\tfrac{1}{15}.\)
CH. 05 Statistics

Joint Distributions & Covariance

Two variables at once: joint and marginal pmfs, independence, covariance and correlation, and the variance of a sum — with the crucial warning that zero covariance is not independence.

Sections 09
Flashcards 11
Exercises 5
Read time 4'
Sources Ross, Probability & Statistics for Engineers 5E, §4.3–4.7 (CDF, §4.3.2 conditional, entropy skipped) · MACS lectures 08, 10 (Perone Pacifico, LUISS) · Worked examples: MACS08 satisfaction-vs-year table & Cov=0 counterexample
Key concepts
Joint pmf Marginal pmf Independence of RVs E[XY] Covariance Correlation coefficient Variance of a sum Cov=0 ≠ independence

Why one variable at a time isn't enough

So far each random variable lived alone. But real questions involve two at once: height and weight, a student's satisfaction and their year, today's return and tomorrow's. To study how two variables move together, their separate distributions are not enough.

Here's the catch that motivates the whole chapter: the individual (marginal) distributions of \(X\) and \(Y\) do not determine how they relate. Two completely different relationships — one where \(X\) and \(Y\) are unrelated, one where high \(X\) forces high \(Y\) — can have the same marginals. To see the relationship you need the joint distribution: the probability of each pair of values.

The joint pmf: probability of each pair

For two discrete random variables \(X\) and \(Y\), the joint probability mass function gives the probability they take a particular pair of values:

\[ p(x_i,y_j)=P\{X=x_i,\ Y=y_j\}. \]

You lay it out as a table: rows are the values of \(X\), columns the values of \(Y\), each cell a probability. As with any pmf, all cells are \(\ge0\) and

\[ \sum_i\sum_j p(x_i,y_j)=1. \]

Example (independence by construction). Daily stock changes are independent and identically distributed, with \(P(\text{change}=0)=0.30\) and, for each sign separately, \(P(+1)=P(-1)=0.20\), \(P(+2)=P(-2)=0.10\), \(P(+3)=P(-3)=0.05\) (so the total is \(0.30+2(0.20+0.10+0.05)=1\)). The probability of a specific 3-day path multiplies: \(P\{X_1=1,X_2=2,X_3=0\}=(0.20)(0.10)(0.30)=0.006\).

Marginals: recovering each variable from the table

Given the joint table you can always get each variable's own distribution back by summing across the other. The marginal pmf of \(X\) is the row sums; the marginal of \(Y\) is the column sums:

\[ p_X(x_i)=\sum_j p(x_i,y_j),\qquad p_Y(y_j)=\sum_i p(x_i,y_j). \]

(The name comes from writing these totals in the margins of the table.) Each marginal must itself sum to \(1\) — a handy check.

One-way street. You can always go joint → marginals. You cannot in general go marginals → joint: many different joint tables share the same row and column totals. That's exactly why the joint distribution carries information the marginals don't.
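The one-way street is easy to demonstrate in code. In this sketch a joint pmf is stored as a Python dictionary `{(x, y): probability}`; the function `marginals` and the two tables `unrelated` and `locked` are made up for illustration.

```python
from collections import defaultdict

def marginals(joint):
    """Return (p_X, p_Y) from a joint pmf given as {(x, y): probability}."""
    p_x, p_y = defaultdict(float), defaultdict(float)
    for (x, y), p in joint.items():
        p_x[x] += p   # row sums
        p_y[y] += p   # column sums
    return dict(p_x), dict(p_y)

# Two different joint tables (assumed for illustration):
unrelated = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}
locked    = {(0, 0): 0.50, (0, 1): 0.00, (1, 0): 0.00, (1, 1): 0.50}

print(marginals(unrelated))
print(marginals(locked))
```

Both tables produce the same marginals (each variable is a fair 0/1 coin), yet in `locked` the value of \(X\) fixes the value of \(Y\). The marginals alone cannot tell the two situations apart.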

Independence of two random variables

\(X\) and \(Y\) are independent when knowing one tells you nothing about the other. For discrete variables this has a clean test: the joint factors into the product of the marginals, for every cell:

\[ p(x_i,y_j)=p_X(x_i)\,p_Y(y_j)\quad\text{for all }i,j. \]

How to check. Compute the marginals (row/column sums), then verify every cell equals row-total × column-total. If even one cell fails, the variables are dependent.

Under independence two shortcuts hold: \(E[XY]=E[X]\,E[Y]\), and (as we'll see) covariance is \(0\) so variances of sums simply add.
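The cell-by-cell check can be written as a short routine. This is a sketch under the assumption that the joint pmf is stored as a dictionary `{(x, y): probability}` with missing cells meaning probability \(0\); `is_independent` and both example tables are illustrative names and values.

```python
from math import isclose

def is_independent(joint, tol=1e-9):
    """True if every cell equals (row total) x (column total), up to rounding."""
    xs = sorted({x for x, _ in joint})
    ys = sorted({y for _, y in joint})
    p_x = {x: sum(joint.get((x, y), 0.0) for y in ys) for x in xs}
    p_y = {y: sum(joint.get((x, y), 0.0) for x in xs) for y in ys}
    return all(
        isclose(joint.get((x, y), 0.0), p_x[x] * p_y[y], abs_tol=tol)
        for x in xs
        for y in ys
    )

factoring = {(0, 0): 0.2, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.3}
locked = {(0, 0): 0.5, (1, 1): 0.5}   # X always equals Y

print(is_independent(factoring), is_independent(locked))
```

A tolerance is used because decimal probabilities are not represented exactly in floating point; with exact fractions a plain equality test would do.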

Expectation of functions of two variables

To average any function \(g(X,Y)\) — a combined score, a product, a sum — weight it by the joint probabilities:

\[ E[g(X,Y)]=\sum_i\sum_j g(x_i,y_j)\,p(x_i,y_j). \]

Two cases dominate:

  • Sum: \(E[X+Y]=E[X]+E[Y]\) — always, no independence needed.
  • Product: \(E[XY]=\sum_i\sum_j x_i y_j\,p(x_i,y_j)\). This equals \(E[X]\,E[Y]\) whenever \(X,Y\) are independent; without independence the two can differ.

The gap between \(E[XY]\) and \(E[X]E[Y]\) is precisely what the next idea — covariance — measures.

Covariance: do they move together?

Covariance measures the linear tendency of two variables to move together:

\[ \operatorname{Cov}(X,Y)=E[(X-\mu_X)(Y-\mu_Y)]. \]

When \(X\) is above its mean at the same time \(Y\) is above its mean (and vice versa), the product is positive on average → positive covariance. If one tends to be high when the other is low, it's negative.

Computational formula (the one you use):

\[ \operatorname{Cov}(X,Y)=E[XY]-E[X]\,E[Y]=E[XY]-\mu_X\mu_Y. \]

Sign tells the story: \(>0\) move together; \(<0\) move oppositely; \(=0\) no linear trend.

Properties: \(\operatorname{Cov}(X,Y)=\operatorname{Cov}(Y,X)\); \(\operatorname{Cov}(X,X)=\operatorname{Var}(X)\); \(\operatorname{Cov}(aX,Y)=a\operatorname{Cov}(X,Y)\); and if \(X,Y\) are independent then \(\operatorname{Cov}(X,Y)=0\).

Correlation: covariance on a fixed scale

Covariance has awkward units (units of \(X\) times units of \(Y\)) and its size depends on scale, so you can't tell "strong" from "weak". The correlation coefficient fixes this by dividing out both standard deviations:

\[ \operatorname{Corr}(X,Y)=\frac{\operatorname{Cov}(X,Y)}{\sqrt{\operatorname{Var}(X)\,\operatorname{Var}(Y)}}=\frac{\operatorname{Cov}(X,Y)}{\sigma_X\,\sigma_Y}. \]

It is always between \(-1\) and \(1\):

\[ -1\le\operatorname{Corr}(X,Y)\le 1. \]

  • \(+1\): perfect increasing linear relation \(Y=a+bX\) with \(b>0\).
  • \(-1\): perfect decreasing linear relation (\(b<0\)).
  • \(0\): no linear relation (but a non-linear one can still exist).

The closer to \(\pm1\), the tighter the linear association; the sign matches the sign of the covariance.
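The \(\pm1\) cases are worth seeing once with numbers. The sketch below computes covariance and correlation from a joint pmf stored as `{(x, y): probability}`; the function name `cov_and_corr` and the two linear relations \(Y=1+2X\) and \(Y=5-3X\) are assumptions picked for illustration.

```python
from math import sqrt

def cov_and_corr(joint):
    """Return (Cov(X, Y), Corr(X, Y)) from a joint pmf {(x, y): probability}."""
    ex = sum(x * p for (x, y), p in joint.items())
    ey = sum(y * p for (x, y), p in joint.items())
    exy = sum(x * y * p for (x, y), p in joint.items())
    var_x = sum(x * x * p for (x, y), p in joint.items()) - ex**2
    var_y = sum(y * y * p for (x, y), p in joint.items()) - ey**2
    cov = exy - ex * ey                      # computational formula
    return cov, cov / sqrt(var_x * var_y)

# X is equally likely to be 0, 1 or 2; Y is an exact linear function of X.
up = {(x, 1 + 2 * x): 1 / 3 for x in range(3)}     # slope b = 2 > 0
down = {(x, 5 - 3 * x): 1 / 3 for x in range(3)}   # slope b = -3 < 0

print(cov_and_corr(up))
print(cov_and_corr(down))
```

The covariances differ in size (they scale with the slope), but the correlations come out as \(+1\) and \(-1\) up to rounding, which is exactly the scale-free behaviour described above.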

Variance of a sum, with the covariance term

Chapter 3 promised the full rule; here it is. Adding two variables, the variance is not just the sum of variances — there is a covariance correction:

\[ \operatorname{Var}(X+Y)=\operatorname{Var}(X)+\operatorname{Var}(Y)+2\operatorname{Cov}(X,Y). \]

For many variables:

\[ \operatorname{Var}\!\left(\sum_{i=1}^n X_i\right)=\sum_{i=1}^n\operatorname{Var}(X_i)+\sum_{i\neq j}\operatorname{Cov}(X_i,X_j). \]

When the variables are independent, every covariance is \(0\) and the variances simply add, which is why, back in Chapters 3–4, "independent" was the magic word that let \(\operatorname{Var}(X)=np(1-p)\) and friends fall out so cleanly.
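You can confirm the two-variable rule on any joint table by computing \(\operatorname{Var}(X+Y)\) twice: once directly from the values of \(X+Y\), once from the formula. The table below and the helper `expect` are invented for this sketch.

```python
# Joint pmf as {(x, y): probability}; an illustrative table with positive association.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

def expect(g):
    """E[g(X, Y)] as a probability-weighted sum over the joint table."""
    return sum(g(x, y) * p for (x, y), p in joint.items())

ex, ey = expect(lambda x, y: x), expect(lambda x, y: y)
var_x = expect(lambda x, y: x * x) - ex**2
var_y = expect(lambda x, y: y * y) - ey**2
cov = expect(lambda x, y: x * y) - ex * ey

# Route 1: treat S = X + Y as a single variable.
var_sum_direct = expect(lambda x, y: (x + y) ** 2) - expect(lambda x, y: x + y) ** 2
# Route 2: the formula with the covariance term.
var_sum_formula = var_x + var_y + 2 * cov

print(round(var_sum_direct, 6), round(var_sum_formula, 6), round(var_x + var_y, 6))
```

The first two numbers agree; the third (variances added with no covariance term) is smaller here because the covariance is positive.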

Warning: zero covariance does NOT mean independent

Independence \(\Rightarrow\) \(\operatorname{Cov}=0\). The reverse is false: two variables can have zero covariance yet be completely dependent, because covariance only sees linear association.

Classic counterexample. Let \(X\) take \(-1,0,1\) each with probability \(\tfrac13\), and set \(Y=X^2\). Then \(Y\) is totally determined by \(X\) (as dependent as can be), yet:

\[ E[X]=0,\quad E[XY]=E[X^3]=\tfrac{-1+0+1}{3}=0,\quad \operatorname{Cov}(X,Y)=0-0\cdot E[Y]=0. \]

Covariance is \(0\) but \(X\) and \(Y\) are dependent (e.g. \(p(0,1)=0\neq p_X(0)\,p_Y(1)=\tfrac13\cdot\tfrac23\)). Takeaway: use the factorisation test \(p(x,y)=p_X(x)p_Y(y)\) to decide independence — never infer it from \(\operatorname{Cov}=0\) alone.
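The counterexample can be verified with exact arithmetic. This sketch uses Python's `fractions.Fraction` so that "zero" really means zero rather than a rounding artefact; the variable names are illustrative.

```python
from fractions import Fraction

third = Fraction(1, 3)
# Joint pmf of (X, Y) with Y = X^2: only the pairs (-1, 1), (0, 0), (1, 1) occur.
joint = {(x, x * x): third for x in (-1, 0, 1)}

ex = sum(x * p for (x, y), p in joint.items())
ey = sum(y * p for (x, y), p in joint.items())
exy = sum(x * y * p for (x, y), p in joint.items())
cov = exy - ex * ey

# Factorisation test on the single cell (x, y) = (0, 1).
p_x0 = sum(p for (x, y), p in joint.items() if x == 0)
p_y1 = sum(p for (x, y), p in joint.items() if y == 1)
cell = joint.get((0, 1), Fraction(0))

print(cov, cell, p_x0 * p_y1)
```

The covariance is exactly \(0\), while the cell probability \(0\) differs from the product of marginals \(\tfrac29\), so the factorisation test fails.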

Click any card to flip. Rate it after to track what you need to revisit.
Card 1
What is the joint pmf \(p(x,y)\) of two discrete RVs, and what must it satisfy?
\(p(x_i,y_j)=P\{X=x_i,Y=y_j\}\) — probability of each pair. All cells \(\ge0\) and \(\sum_i\sum_j p(x_i,y_j)=1\).
Card 2
How do you get the marginal pmf of \(X\) (and of \(Y\)) from a joint table?
Sum across the other variable: \(p_X(x_i)=\sum_j p(x_i,y_j)\) (row sums), \(p_Y(y_j)=\sum_i p(x_i,y_j)\) (column sums).
Card 3
Can you reconstruct the joint distribution from the two marginals?
No. Joint → marginals always works, but many different joint tables share the same marginals. The joint carries relationship info the marginals don't.
Card 4
Discrete test for independence of \(X\) and \(Y\)?
\(p(x_i,y_j)=p_X(x_i)\,p_Y(y_j)\) for EVERY cell. If even one cell fails the product, they are dependent.
Card 5
Formula for \(E[XY]\), and when does \(E[XY]=E[X]E[Y]\)?
\(E[XY]=\sum_i\sum_j x_i y_j\,p(x_i,y_j)\). It equals \(E[X]E[Y]\) whenever \(X,Y\) are independent (more precisely, exactly when \(\operatorname{Cov}(X,Y)=0\)).
Card 6
Definition and computational formula for covariance?
\(\operatorname{Cov}(X,Y)=E[(X-\mu_X)(Y-\mu_Y)]=E[XY]-E[X]E[Y]\). Recipe: get \(E[X],E[Y],E[XY]\), then subtract the product of means.
Card 7
What do the sign and the zero of covariance mean?
\(>0\): move together; \(<0\): move oppositely; \(=0\): no LINEAR trend. Also \(\operatorname{Cov}(X,X)=\operatorname{Var}(X)\), and independence \(\Rightarrow\operatorname{Cov}=0\).
Card 8
Correlation coefficient: formula and range?
\(\operatorname{Corr}(X,Y)=\dfrac{\operatorname{Cov}(X,Y)}{\sigma_X\sigma_Y}\), always in \([-1,1]\). \(\pm1\) = perfect linear relation; \(0\) = no linear relation. Scale-free version of covariance.
Card 9
Full formula for \(\operatorname{Var}(X+Y)\), and when does it reduce to \(\operatorname{Var}(X)+\operatorname{Var}(Y)\)?
\(\operatorname{Var}(X+Y)=\operatorname{Var}(X)+\operatorname{Var}(Y)+2\operatorname{Cov}(X,Y)\). Reduces to the plain sum exactly when \(\operatorname{Cov}(X,Y)=0\), which holds in particular when \(X,Y\) are independent.
Card 10
Does \(\operatorname{Cov}(X,Y)=0\) imply \(X\) and \(Y\) are independent?
NO. Independence \(\Rightarrow\operatorname{Cov}=0\), but not the reverse — covariance sees only linear association. Counterexample: \(X\in\{-1,0,1\}\), \(Y=X^2\) has \(\operatorname{Cov}=0\) yet is fully dependent.
Card 11
You're given a full joint table and asked for the correlation. What's the procedure?
Marginals (row/col sums) → \(E[X],E[X^2],\operatorname{Var}(X)\) and same for \(Y\) → \(E[XY]\) from the table → \(\operatorname{Cov}=E[XY]-E[X]E[Y]\) → \(\operatorname{Corr}=\operatorname{Cov}/(\sigma_X\sigma_Y)\).
EX. 01

Full pipeline from a joint table: marginals, covariance, correlation

MACS08 (satisfaction vs year) hard

Student satisfaction \(X\in\{1,2,3,4\}\) and university year \(Y\in\{1,2,3\}\) have joint pmf:

\[
\begin{array}{c|ccc}
X\backslash Y & 1 & 2 & 3\\ \hline
1 & 0.10 & 0 & 0\\
2 & 0.20 & 0 & 0\\
3 & 0.30 & 0.20 & 0\\
4 & 0 & 0 & 0.20
\end{array}
\]

Find \(E[X],E[Y]\), \(\operatorname{Var}(X),\operatorname{Var}(Y)\), \(\operatorname{Cov}(X,Y)\), and \(\operatorname{Corr}(X,Y)\).

Row sums give \(p_X\), column sums give \(p_Y\). Then the usual moments, \(E[XY]\) from nonzero cells, and combine.

Step 1. Marginals
Rows: \(p_X=(0.10,0.20,0.50,0.20)\). Columns: \(p_Y=(0.60,0.20,0.20)\). Both sum to 1.
Step 2. Means & variances
\(E[X]=0.1+0.4+1.5+0.8=2.8\); \(E[X^2]=0.1+0.8+4.5+3.2=8.6\Rightarrow\operatorname{Var}(X)=8.6-2.8^2=0.76\). \(E[Y]=0.6+0.4+0.6=1.6\); \(E[Y^2]=0.6+0.8+1.8=3.2\Rightarrow\operatorname{Var}(Y)=3.2-1.6^2=0.64\).
Step 3. E[XY]
Nonzero cells: \(1\!\cdot\!1(0.10)+2\!\cdot\!1(0.20)+3\!\cdot\!1(0.30)+3\!\cdot\!2(0.20)+4\!\cdot\!3(0.20)=0.1+0.4+0.9+1.2+2.4=5.0\).
Step 4. Covariance & correlation
\(\operatorname{Cov}=5.0-2.8\cdot1.6=0.52\). \(\operatorname{Corr}=\dfrac{0.52}{\sqrt{0.76\cdot0.64}}=\dfrac{0.52}{0.6974}\approx0.746\) — strong positive.
\(E[X]=2.8,E[Y]=1.6,\operatorname{Var}(X)=0.76,\operatorname{Var}(Y)=0.64,\operatorname{Cov}=0.52,\operatorname{Corr}\approx0.746.\)
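If you want to double-check the arithmetic of this exercise, the whole pipeline fits in a few lines of Python. The joint pmf is entered as a dictionary of its nonzero cells (as used in Steps 1 and 3); the helper name `expect` is an illustrative choice.

```python
from math import sqrt

# Nonzero cells of the satisfaction-vs-year table, as {(x, y): probability}.
joint = {(1, 1): 0.10, (2, 1): 0.20, (3, 1): 0.30, (3, 2): 0.20, (4, 3): 0.20}

def expect(g):
    """E[g(X, Y)] as a probability-weighted sum over the table."""
    return sum(g(x, y) * p for (x, y), p in joint.items())

ex, ey = expect(lambda x, y: x), expect(lambda x, y: y)
var_x = expect(lambda x, y: x * x) - ex**2
var_y = expect(lambda x, y: y * y) - ey**2
cov = expect(lambda x, y: x * y) - ex * ey
corr = cov / sqrt(var_x * var_y)

print(round(ex, 2), round(ey, 2), round(var_x, 2), round(var_y, 2),
      round(cov, 2), round(corr, 3))
```

The printed values should match the answer line of the exercise up to rounding.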
EX. 02

Test independence from a joint table

Constructed (Ross §4.3.1 method) easy

For \(X,Y\in\{0,1\}\) the joint pmf is \(p(0,0)=0.2,\ p(0,1)=0.2,\ p(1,0)=0.3,\ p(1,1)=0.3\). Are \(X\) and \(Y\) independent?

Get the marginals, then check whether every cell equals (row total)×(column total).

Step 1. Marginals
\(p_X(0)=0.2+0.2=0.4,\ p_X(1)=0.6\); \(p_Y(0)=0.2+0.3=0.5,\ p_Y(1)=0.5\).
Step 2. Check the product
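\(p_X(0)p_Y(0)=0.4\cdot0.5=0.2=p(0,0)\) ✓; \(p_X(0)p_Y(1)=0.4\cdot0.5=0.2=p(0,1)\) ✓; \(p_X(1)p_Y(0)=0.6\cdot0.5=0.3=p(1,0)\) ✓; \(p_X(1)p_Y(1)=0.6\cdot0.5=0.3=p(1,1)\) ✓. All four cells equal row×column.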
Step 3. Conclude
Every cell factors, so \(X\) and \(Y\) are independent.
Independent — all cells satisfy \(p(x,y)=p_X(x)p_Y(y)\).
EX. 03

Zero covariance but dependent

MACS08 (counterexample) medium

Let \(X\) take \(-1,0,1\), each with probability \(\tfrac13\), and let \(Y=X^2\). Show \(\operatorname{Cov}(X,Y)=0\), yet \(X\) and \(Y\) are not independent.

Compute \(E[X],E[XY]\) (note \(XY=X^3\)), then test independence on one cell.

Step 1. Means
\(E[X]=\tfrac{-1+0+1}{3}=0\); \(Y\in\{0,1\}\) with \(P(Y=0)=\tfrac13,P(Y=1)=\tfrac23\), so \(E[Y]=\tfrac23\).
Step 2. Covariance
\(E[XY]=E[X^3]=\tfrac{-1+0+1}{3}=0\), so \(\operatorname{Cov}=E[XY]-E[X]E[Y]=0-0\cdot\tfrac23=0\).
Step 3. But dependent
\(p(X{=}0,Y{=}1)=0\) while \(p_X(0)p_Y(1)=\tfrac13\cdot\tfrac23=\tfrac29\neq0\). The factorisation fails \(\Rightarrow\) dependent.
\(\operatorname{Cov}=0\) but dependent — covariance only detects linear association.
EX. 04

Variance of a sum with a covariance term

Constructed (Ross §4.7) easy

Suppose \(\operatorname{Var}(X)=4\), \(\operatorname{Var}(Y)=9\), and \(\operatorname{Cov}(X,Y)=-2\). Find \(\operatorname{Var}(X+Y)\). Compare with the value you'd get if \(X,Y\) were independent.

Use \(\operatorname{Var}(X+Y)=\operatorname{Var}(X)+\operatorname{Var}(Y)+2\operatorname{Cov}(X,Y)\).

Step 1. Apply the formula
\(\operatorname{Var}(X+Y)=4+9+2(-2)=13-4=9.\)
Step 2. If independent
Then \(\operatorname{Cov}=0\) and \(\operatorname{Var}(X+Y)=4+9=13\). The negative covariance reduced the spread of the sum.
\(\operatorname{Var}(X+Y)=9\) (vs \(13\) if independent).
EX. 05

Independence shortcuts: product expectation and sum of a path

Ross §4.3 (stock changes) medium

Daily stock changes are independent with \(P(0)=0.30,P(\pm1)=0.20,P(\pm2)=0.10,P(\pm3)=0.05\). (a) Find \(P\{X_1=1,X_2=2,X_3=0\}\). (b) For two such independent days, argue \(E[X_1X_2]=E[X_1]E[X_2]\).

Independence makes the joint probability the product of marginals, and \(E[XY]=E[X]E[Y]\).

Step 1. (a) path probability
By independence the joint factors: \(P=(0.20)(0.10)(0.30)=0.006.\)
Step 2. (b) product of means
The change distribution is symmetric about 0, so \(E[X_1]=E[X_2]=0\). By independence \(E[X_1X_2]=E[X_1]E[X_2]=0\), hence \(\operatorname{Cov}(X_1,X_2)=0\) too.
(a) \(0.006\); (b) \(E[X_1X_2]=E[X_1]E[X_2]=0\) by independence.
CH. 06 Statistics

Continuous Random Variables

When values fill an interval: density functions where probability is area, the P(X=a)=0 rule, expectation and variance by integration, and the two workhorse models — the continuous Uniform and the memoryless Exponential.

Sections 07
Flashcards 11
Exercises 6
Read time 3'
Sources Ross, Probability & Statistics for Engineers 5E, §4.1–4.2, §5.4 (Uniform), §5.6 (Exponential) — CDF/Poisson-process/Pareto skipped · MACS lectures 09, 11 (Perone Pacifico, LUISS) · Past practice: Esercizio 5 (nn.1,2,3,5), Ross examples 5.4b/5.6a
Key concepts
Probability density function Probability as area P(X=a)=0 Normalising constant E and Var by integration Continuous Uniform Exponential Memoryless property

From mass to density: when values form a continuum

Discrete variables sat on separate points, each carrying a chunk of probability (the pmf). But many quantities — a waiting time, a height, a lifetime — can be any value in an interval. There are infinitely many possible values, so no single one can carry positive probability. We need a new tool: the probability density function \(f(x)\).

The shift in one sentence: for discrete variables, probability is the height of a bar; for continuous variables, probability is the area under a curve. Nothing else about expectation or variance changes in spirit — sums just become integrals.

The density function: probability is area

A continuous random variable \(X\) has a density \(f(x)\ge0\) such that the probability of landing in an interval is the area under \(f\) over that interval:

\[ P(a\le X\le b)=\int_a^b f(x)\,dx. \]

For \(f\) to be a valid density it must be nonnegative and enclose total area \(1\):

\[ f(x)\ge0,\qquad \int_{-\infty}^{\infty} f(x)\,dx=1. \]

A single point has zero probability. Since the area over a point is zero, \[ P(X=a)=\int_a^a f(x)\,dx=0. \] A practical consequence: for continuous variables \(\le\) and \(<\) (and \(\ge\) and \(>\)) give the same probability — endpoints never matter. \(f(x)\) itself is not a probability (it can exceed \(1\)); only areas are probabilities.

Finding the normalising constant

A very common exam setup gives a density "up to a constant" — \(f(x)=c\cdot(\text{shape})\) — and asks you to find \(c\). The rule is always the same: the total area must be \(1\), so

\[ \int_{-\infty}^{\infty} f(x)\,dx=1 \quad\Longrightarrow\quad c=\frac{1}{\int(\text{shape})\,dx}. \]

Example. Let \(f(x)=c(1-x^2)\) on \((-1,1)\), \(0\) elsewhere. Then

\[ \int_{-1}^{1} c(1-x^2)\,dx=c\Big[x-\tfrac{x^3}{3}\Big]_{-1}^{1}=c\cdot\tfrac43=1\ \Rightarrow\ c=\tfrac34. \]

Once \(c\) is known you compute any probability as an area, e.g. \(P(0<X<b)=\int_0^b \tfrac34(1-x^2)\,dx=\tfrac34\big(b-\tfrac{b^3}{3}\big)\) for \(0<b\le1\).
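The normalisation recipe can also be carried out numerically when the integral is awkward by hand. This is a minimal sketch that uses a midpoint rule written from scratch (no external libraries); `integrate`, `shape` and `density` are hypothetical helper names, and the density is the \(c(1-x^2)\) example from this section.

```python
def integrate(f, a, b, steps=100_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / steps
    return h * sum(f(a + (k + 0.5) * h) for k in range(steps))

def shape(x):
    """The density up to its constant: 1 - x^2 on (-1, 1)."""
    return 1 - x**2

c = 1 / integrate(shape, -1, 1)     # total area must equal 1

def density(x):
    return c * shape(x)

total_area = integrate(density, -1, 1)
p_right_half = integrate(density, 0, 1)   # P(0 < X < 1), an area under the density

print(round(c, 6), round(total_area, 6), round(p_right_half, 6))
```

The constant should come out close to \(\tfrac34\), and by symmetry the probability of the right half is close to \(\tfrac12\).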

Expectation and variance by integration

Everything from Chapter 3 carries over with sums replaced by integrals:

\[ E[X]=\int_{-\infty}^{\infty} x\,f(x)\,dx,\qquad E[g(X)]=\int_{-\infty}^{\infty} g(x)\,f(x)\,dx, \]

\[ \operatorname{Var}(X)=E[X^2]-(E[X])^2=\int x^2 f(x)\,dx-(E[X])^2. \]

The linear-transformation rules are unchanged too: \(E[aX+b]=aE[X]+b\) and \(\operatorname{Var}(aX+b)=a^2\operatorname{Var}(X)\).

Example. For \(f(x)=x/2\) on \([0,2]\): \(E[X]=\int_0^2 x\cdot\tfrac{x}{2}\,dx=\tfrac12\cdot\tfrac{8}{3}=\tfrac43\), \(E[X^2]=\int_0^2 x^2\cdot\tfrac{x}{2}\,dx=2\), so \(\operatorname{Var}(X)=2-\tfrac{16}{9}=\tfrac29\).
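The same example can be reproduced by numerical integration, which is a useful check on hand computation. The sketch below uses a home-made midpoint rule; `integrate` is a hypothetical helper, and the constants \(a=3\), \(b=1\) in the linear-transformation check are arbitrary illustrative values.

```python
def integrate(g, lo, hi, steps=100_000):
    """Midpoint-rule approximation of the integral of g over [lo, hi]."""
    h = (hi - lo) / steps
    return h * sum(g(lo + (k + 0.5) * h) for k in range(steps))

def f(x):
    """Density f(x) = x/2 on [0, 2]."""
    return x / 2

mean = integrate(lambda x: x * f(x), 0, 2)
second_moment = integrate(lambda x: x**2 * f(x), 0, 2)
var = second_moment - mean**2

# Linear rules, checked directly on Z = a*X + b (illustrative a, b).
a, b = 3, 1
mean_lin = integrate(lambda x: (a * x + b) * f(x), 0, 2)
var_lin = integrate(lambda x: (a * x + b) ** 2 * f(x), 0, 2) - mean_lin**2

print(round(mean, 6), round(var, 6), round(mean_lin, 6), round(var_lin, 6))
```

The last two numbers should agree with \(aE[X]+b\) and \(a^2\operatorname{Var}(X)\).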

The continuous Uniform on [α, β]

Story. Any value in \([\alpha,\beta]\) is equally likely — no part of the interval is favoured. The density is a flat line whose height makes the area \(1\):

\[ f(x)=\frac{1}{\beta-\alpha}\quad\text{for }x\in[\alpha,\beta]\ \ (0\text{ outside}). \]

(Base \(\times\) height \(=(\beta-\alpha)\cdot\frac{1}{\beta-\alpha}=1\).)

Probabilities are length ratios. For a sub-interval \([a,b]\subseteq[\alpha,\beta]\): \[ P(a<X<b)=\frac{b-a}{\beta-\alpha}. \]

Mean and variance: \[ E[X]=\frac{\alpha+\beta}{2},\qquad \operatorname{Var}(X)=\frac{(\beta-\alpha)^2}{12}. \]

The mean is just the midpoint. Recognition cue: "arrives at a time uniformly between …", "no information, equally likely anywhere in the interval".
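As a sketch of the Uniform toolkit, the following Python snippet computes a sub-interval probability as a length ratio and checks the \((\beta-\alpha)^2/12\) variance by numerical integration. The scenario (an arrival time uniform on \([0,30]\) minutes) and the name `uniform_prob` are assumptions made for illustration.

```python
def uniform_prob(a, b, alpha, beta):
    """P(a < X < b) for X ~ Uniform[alpha, beta]; the interval is clipped to the support."""
    lo, hi = max(a, alpha), min(b, beta)
    return max(hi - lo, 0.0) / (beta - alpha)

alpha, beta = 0.0, 30.0            # hypothetical: arrival time in minutes
mean = (alpha + beta) / 2          # midpoint
var = (beta - alpha) ** 2 / 12

# Midpoint-rule check: integrate (x - mean)^2 * f(x) over [alpha, beta].
steps = 100_000
h = (beta - alpha) / steps
var_numeric = sum(
    (alpha + (k + 0.5) * h - mean) ** 2 / (beta - alpha) * h for k in range(steps)
)

p = uniform_prob(10, 25, alpha, beta)   # length 15 out of 30
print(p, mean, var, round(var_numeric, 6))
```

Clipping the interval to \([\alpha,\beta]\) is a design choice so that a question like "before minute 10" (\(a<\alpha\)) still returns the right length ratio.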

The Exponential: waiting times with no memory

Story. The time you wait for the next event when events happen at a constant rate \(\lambda\) (the continuous-time cousin of the Poisson). Lifetimes of components that don't wear out, time to the next phone call, service times in a queue.

Density (rate \(\lambda>0\)): \[ f(x)=\lambda e^{-\lambda x}\quad\text{for }x\ge0\ \ (0\text{ for }x<0). \]

Tail probability (the handy one — no integral needed): \[ P(X>t)=e^{-\lambda t},\qquad P(X\le t)=1-e^{-\lambda t}. \]

Mean and variance: \[ E[X]=\frac1\lambda,\qquad \operatorname{Var}(X)=\frac1{\lambda^2}. \] So a higher rate \(\lambda\) means shorter expected wait. Careful: \(\lambda\) is the rate; the mean is its reciprocal \(1/\lambda\).

Memoryless property. The exponential forgets how long it has already waited: \[ P(X>s+t\mid X>s)=\frac{P(X>s+t)}{P(X>s)}=\frac{e^{-\lambda(s+t)}}{e^{-\lambda s}}=e^{-\lambda t}=P(X>t). \] An old component that hasn't failed is "as good as new": its remaining life has the same distribution as a fresh one. Among continuous distributions, this property characterises the exponential.
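The memoryless property is a one-line computation once you have the tail formula. In this sketch the rate and the two times are arbitrary illustrative values, and `tail` is a hypothetical helper name.

```python
from math import exp

def tail(t: float, lam: float) -> float:
    """P(X > t) for X ~ Exponential with rate lam, t >= 0."""
    return exp(-lam * t)

lam = 0.5          # rate (assumption): mean waiting time 1/lam = 2
s, t = 3.0, 2.0    # already waited s, ask about t more

conditional = tail(s + t, lam) / tail(s, lam)   # P(X > s + t | X > s)
fresh = tail(t, lam)                            # P(X > t)

mean = 1 / lam
var = 1 / lam**2
print(round(conditional, 6), round(fresh, 6), mean, var)
```

The conditional and the fresh probability coincide, whatever values of \(s\) and \(t\) you try.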

Recap — the continuous toolkit

  • Density: probability = area, \(P(a\le X\le b)=\int_a^b f\). Valid if \(f\ge0\) and \(\int f=1\). \(P(X=a)=0\), so \(\le\) and \(<\) coincide.
  • Find a constant: set \(\int f=1\) and solve.
  • Moments: \(E[X]=\int x f\), \(\operatorname{Var}=E[X^2]-(E[X])^2\); \(E[aX+b]=aE[X]+b\), \(\operatorname{Var}(aX+b)=a^2\operatorname{Var}(X)\).
  • Uniform\([\alpha,\beta]\): \(f=\tfrac{1}{\beta-\alpha}\), \(E=\tfrac{\alpha+\beta}{2}\), \(\operatorname{Var}=\tfrac{(\beta-\alpha)^2}{12}\); sub-interval prob = length ratio.
  • Exponential\((\lambda)\): \(f=\lambda e^{-\lambda x}\), \(P(X>t)=e^{-\lambda t}\), \(E=\tfrac1\lambda\), \(\operatorname{Var}=\tfrac1{\lambda^2}\), memoryless.

Exam reflex: "density given, find constant / a probability" → normalise then integrate. "equally likely in an interval" → Uniform, use length ratios. "waiting time / lifetime at constant rate / memoryless" → Exponential, use \(e^{-\lambda t}\).
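The "normalise, then integrate" reflex can be sanity-checked numerically. The sketch below uses a simple midpoint rule (an assumption chosen for brevity, not a method from the course) on two example densities: the shape \(1-x^2\) on \((-1,1)\) and \(f(x)=x/2\) on \([0,2]\).

```python
def integrate(f, a, b, n=100_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

# Find a constant: the total area must be 1, so c = 1 / (integral of the shape).
shape = lambda x: 1 - x * x
c = 1 / integrate(shape, -1, 1)

# Moments of the density f(x) = x / 2 on [0, 2].
f = lambda x: x / 2
mean = integrate(lambda x: x * f(x), 0, 2)
second = integrate(lambda x: x * x * f(x), 0, 2)
var = second - mean ** 2

print(c, mean, var)   # close to 3/4, 4/3 and 2/9
```

On paper you would still integrate by hand; the numerical version is only a way to catch arithmetic slips.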

Click any card to flip. Rate it after to track what you need to revisit.
Card 1
For a continuous RV, how is a probability obtained from the density \(f(x)\)?
As the AREA under the curve: \(P(a\le X\le b)=\int_a^b f(x)\,dx\). Valid density needs \(f(x)\ge0\) and \(\int_{-\infty}^{\infty}f(x)\,dx=1\). \(f\) itself isn't a probability (can exceed 1).
Card 2
What is \(P(X=a)\) for a continuous random variable, and what does that imply for \(\le\) vs \(<\)?
\(P(X=a)=0\) (area over a point is zero). So endpoints don't matter: \(P(a\le X\le b)=P(a<X<b)\).
Card 3
A density is given as \(f(x)=c\cdot(\text{shape})\). How do you find \(c\)?
Impose total area 1: \(\int f(x)\,dx=1\Rightarrow c=1/\!\int(\text{shape})\,dx\). E.g. \(c(1-x^2)\) on \((-1,1)\): \(c\cdot\tfrac43=1\Rightarrow c=\tfrac34\).
Card 4
\(E[X]\), \(E[g(X)]\), and \(\operatorname{Var}(X)\) for a continuous RV?
\(E[X]=\int x f(x)\,dx\), \(E[g(X)]=\int g(x)f(x)\,dx\), \(\operatorname{Var}(X)=E[X^2]-(E[X])^2\). Same as discrete with sums→integrals.
Card 5
Continuous Uniform on \([\alpha,\beta]\): density, mean, variance?
\(f(x)=\dfrac{1}{\beta-\alpha}\) on \([\alpha,\beta]\). \(E[X]=\dfrac{\alpha+\beta}{2}\), \(\operatorname{Var}(X)=\dfrac{(\beta-\alpha)^2}{12}\).
Card 6
For a Uniform on \([\alpha,\beta]\), how do you get \(P(a<X<b)\)?
It's the ratio of lengths: \(P(a<X<b)=\dfrac{b-a}{\beta-\alpha}\) for \([a,b]\subseteq[\alpha,\beta]\).
Card 7
Exponential with rate \(\lambda\): density, mean, variance?
\(f(x)=\lambda e^{-\lambda x}\) for \(x\ge0\). \(E[X]=1/\lambda\), \(\operatorname{Var}(X)=1/\lambda^2\). Note \(\lambda\) is the rate; the mean is its reciprocal.
Card 8
Fast way to get exponential tail/CDF probabilities?
\(P(X>t)=e^{-\lambda t}\) and \(P(X\le t)=1-e^{-\lambda t}\). No integration required.
Card 9
State and interpret the memoryless property of the exponential.
\(P(X>s+t\mid X>s)=P(X>t)\). An item that has survived to age \(s\) is 'as good as new' — remaining life has the same distribution as a fresh one. It characterises the exponential.
Card 10
How does a continuous-RV computation differ from the discrete version?
Replace the pmf \(p(x)\) by the density \(f(x)\) and replace sums \(\sum\) by integrals \(\int\). Probability becomes area; everything else (E, Var, linearity) is the same.
Card 11
You read 'lifetime is exponential with mean 10000'. What is \(\lambda\), and \(P(X>5000)\)?
Mean \(=1/\lambda=10000\Rightarrow\lambda=10^{-4}\). \(P(X>5000)=e^{-\lambda\cdot5000}=e^{-0.5}\approx0.6065\).
EX. 01

Uniform waiting time, with a conditional twist

Esercizio 5, n.1 easy

A bus arrives at a time uniformly distributed between 10:00 and 10:30; let \(X\) be your wait in minutes, \(X\sim U(0,30)\). (a) Find \(P(X>10)\). (b) If at 10:15 the bus still hasn't come, find the probability you wait at least 10 more minutes.

Uniform probabilities are length ratios. For (b) use conditional probability \(P(X>25\mid X>15)\).

Step 1. (a)
\(P(X>10)=\dfrac{30-10}{30}=\dfrac{20}{30}=\dfrac23\approx0.667.\)
Step 2. (b) set up conditional
'10 more minutes after 10:15' means \(X>25\) given \(X>15\): \(P(X>25\mid X>15)=\dfrac{P(X>25)}{P(X>15)}\).
Step 3. (b) compute
\(=\dfrac{(30-25)/30}{(30-15)/30}=\dfrac{5}{15}=\dfrac13\approx0.333.\) (Note: the uniform is NOT memoryless. The conditional answer \(1/3\) differs from the unconditional \(P(X>10)=2/3\).)
(a) \(2/3\); (b) \(1/3\).
EX. 02

Find the constant of a density, then probabilities

Esercizio 5, n.2 medium

\(X\) has density \(f(x)=c(1-x^2)\) for \(-1<x<1\) (0 elsewhere). (a) Find \(c\). (b) Find \(P(X<0)\) and \(P(0<X<\tfrac12)\).

Impose \(\int_{-1}^{1}f=1\) for \(c\). The density is symmetric about 0.

Step 1. (a) normalise
\(\int_{-1}^{1}c(1-x^2)\,dx=c\big[x-\tfrac{x^3}{3}\big]_{-1}^{1}=c\cdot\tfrac43=1\Rightarrow c=\tfrac34.\)
Step 2. (b) P(X<0)
By symmetry about 0, \(P(X<0)=\tfrac12.\)
Step 3. (b) P(0<X<1/2)
\(\tfrac34\int_0^{1/2}(1-x^2)\,dx=\tfrac34\big[x-\tfrac{x^3}{3}\big]_0^{1/2}=\tfrac34\big(\tfrac12-\tfrac1{24}\big)=\tfrac34\cdot\tfrac{11}{24}=\tfrac{11}{32}\approx0.344.\)
\(c=\tfrac34\); \(P(X<0)=\tfrac12\); \(P(0<X<\tfrac12)=\tfrac{11}{32}\approx0.344.\)
EX. 03

Expectation, variance, and a linear transform

Esercizio 5, n.5 medium

\(X\) has density \(f(x)=x/2\) for \(0\le x\le2\) (0 elsewhere). (a) Find \(E[X]\) and \(\operatorname{Var}(X)\). (b) For \(Y=2X-1\), find \(E[Y]\) and \(\operatorname{Var}(Y)\).

Integrate for \(E[X]\) and \(E[X^2]\); then use the linear-transform rules.

Step 1. E[X]
\(E[X]=\int_0^2 x\cdot\tfrac{x}{2}\,dx=\tfrac12\cdot\tfrac{8}{3}=\tfrac43.\)
Step 2. Var(X)
\(E[X^2]=\int_0^2 x^2\cdot\tfrac{x}{2}\,dx=\tfrac12\cdot4=2\), so \(\operatorname{Var}(X)=2-\big(\tfrac43\big)^2=2-\tfrac{16}{9}=\tfrac29.\)
Step 3. (b) transform
\(E[Y]=2\cdot\tfrac43-1=\tfrac53\); \(\operatorname{Var}(Y)=2^2\cdot\tfrac29=\tfrac89.\)
\(E[X]=\tfrac43,\ \operatorname{Var}(X)=\tfrac29;\ E[Y]=\tfrac53,\ \operatorname{Var}(Y)=\tfrac89.\)
EX. 04

Exponential lifetime: mean, variance, and a tail probability

Esercizio 5, n.3 + Ross 5.6a medium

The lifetime of an electronic tube is exponential, \(f(x)=\lambda e^{-\lambda x}\), \(x\ge0\). (a) Show \(E[X]=1/\lambda\) and \(\operatorname{Var}(X)=1/\lambda^2\). (b) A car battery has exponential life with mean \(10{,}000\) miles. Find \(P(\text{lasts a }5{,}000\text{-mile trip})\).

(a) integrate by parts (\(E[X^2]=2/\lambda^2\)). (b) mean \(=1/\lambda\Rightarrow\lambda\); use the tail \(P(X>t)=e^{-\lambda t}\).

Step 1. (a) mean
\(E[X]=\int_0^\infty x\lambda e^{-\lambda x}\,dx=\tfrac1\lambda\) (by parts).
Step 2. (a) variance
\(E[X^2]=\int_0^\infty x^2\lambda e^{-\lambda x}\,dx=\tfrac{2}{\lambda^2}\), so \(\operatorname{Var}(X)=\tfrac{2}{\lambda^2}-\tfrac{1}{\lambda^2}=\tfrac{1}{\lambda^2}.\)
Step 3. (b) battery
Mean \(=1/\lambda=10000\Rightarrow\lambda=10^{-4}\). \(P(X>5000)=e^{-10^{-4}\cdot5000}=e^{-0.5}\approx0.6065.\)
(a) \(E=1/\lambda,\ \operatorname{Var}=1/\lambda^2\); (b) \(e^{-0.5}\approx0.6065.\)
EX. 05

Uniform meeting time

Ross 5.4 / MACS09 easy

A friend arrives uniformly between 2:00 and 3:00; let \(X\sim U(0,60)\) be your wait in minutes. Find (a) \(P(X\ge30)\), (b) \(P(X<15)\), (c) \(P(10<X<35)\), (d) \(P(X<45)\).

All are length ratios over the interval of length 60.

Step 1. (a)
\(P(X\ge30)=\dfrac{60-30}{60}=\dfrac12.\)
Step 2. (b)
\(P(X<15)=\dfrac{15}{60}=\dfrac14.\)
Step 3. (c)
\(P(10<X<35)=\dfrac{35-10}{60}=\dfrac{25}{60}\approx0.417.\)
Step 4. (d)
\(P(X<45)=\dfrac{45}{60}=\dfrac34.\)
(a) \(1/2\); (b) \(1/4\); (c) \(25/60\approx0.417\); (d) \(3/4\).
EX. 06

Uniform with a memoryless contrast (bus every 30 min)

Ross 5.4b style medium

You arrive uniformly in a 30-minute window before a bus. (a) Compute \(P(\text{wait}<5)\) over the whole window. (b) Why is the exponential's memoryless property a special feature the uniform does NOT share?

(a) is a length ratio. (b) compare the conditional from Exercise 1.

Step 1. (a)
\(P(\text{wait}<5)=\dfrac{5}{30}=\dfrac16\approx0.167.\)
Step 2. (b)
For a uniform, conditioning on 'already waited 15 min' changed the answer (Exercise 1: \(1/3\neq2/3\)), so it remembers. Among continuous distributions, only the exponential satisfies \(P(X>s+t\mid X>s)=P(X>t)\).
(a) \(1/6\); (b) the uniform is not memoryless; among continuous distributions, only the exponential forgets elapsed time.
CH. 07 Statistics

Normal Distribution

The bell curve and the one skill that unlocks the rest of the course: standardise to Z, read the Φ table (with symmetry), invert for percentiles and critical values, and combine independent normals.

Sections 08
Flashcards 12
Exercises 7
Read time 3'
Sources Ross, Probability & Statistics for Engineers 5E, §5.5; Ross Introductory Statistics 3E §3.6; Ross standard-normal table · MACS lecture 12 (Perone Pacifico, LUISS) · Past exams: MACSfinal1/2, MAIfinal3/5, 'Distribuzione Normale' worksheet
Key concepts
Normal density N(μ,σ²) Standard normal Z Standardisation Φ table & symmetry Interval probabilities Inverse / percentiles Critical values z_α 68–95–99.7 rule Sums of independent normals

The normal: the bell curve that's everywhere

The normal (or Gaussian) distribution is the bell-shaped curve you've seen everywhere: heights, exam scores, measurement errors, and — crucially for this course — the distribution that sample means tend toward (next chapter). Master it now: it underlies the Central Limit Theorem, confidence intervals, and hypothesis tests that dominate the exam.

A normal variable is fixed by just two numbers, its mean and variance, written \(X\sim N(\mu,\sigma^2)\). The good news: every normal question reduces to looking up areas in a single table, once you learn the trick of standardisation.

Density, and what μ and σ do

\(X\sim N(\mu,\sigma^2)\) has density

\[ f(x)=\frac{1}{\sigma\sqrt{2\pi}}\,e^{-\frac{(x-\mu)^2}{2\sigma^2}},\qquad -\infty<x<\infty. \]

You will essentially never integrate this by hand — that's what the table is for. What matters is the shape and the two knobs:

  • The curve is a symmetric bell centred at \(\mu\). \(\mu\) slides it left/right (the centre).
  • \(\sigma\) sets the width: small \(\sigma\) → tall and narrow; large \(\sigma\) → short and wide.

The two parameters are the mean and variance: \[ E[X]=\mu,\qquad \operatorname{Var}(X)=\sigma^2. \] (So \(\sigma=\sqrt{\sigma^2}\) is the standard deviation.) That's why the notation \(N(\mu,\sigma^2)\) is so natural.

The standard normal Z and standardisation

The special case \(\mu=0,\ \sigma^2=1\) is the standard normal \(Z\sim N(0,1)\). Its cumulative function is tabulated as

\[ \Phi(z)=P(Z\le z). \]

The key trick. Any normal can be turned into \(Z\) by subtracting the mean and dividing by the SD — standardising:

\[ Z=\frac{X-\mu}{\sigma}\sim N(0,1). \]

So a probability about \(X\) becomes a probability about \(Z\), which the table answers:

\[ P(X\le x)=P\!\left(Z\le\frac{x-\mu}{\sigma}\right)=\Phi\!\left(\frac{x-\mu}{\sigma}\right). \]

One table serves every normal — that's the whole point. The standardised value \(z=\frac{x-\mu}{\sigma}\) tells you "how many standard deviations above the mean" \(x\) sits.

Reading the Φ table: symmetry and intervals

The table gives \(\Phi(z)\) for \(z\ge0\) (rows = units+tenths, columns = hundredths). Some anchor values:

\[ \Phi(0)=0.5,\ \ \Phi(1)=0.8413,\ \ \Phi(1.645)=0.95,\ \ \Phi(1.96)=0.975,\ \ \Phi(2)=0.9772,\ \ \Phi(2.576)=0.995. \]

Negative z — use symmetry. The table stops at \(0\); for negatives use the bell's symmetry:

\[ \Phi(-z)=1-\Phi(z). \]

(E.g. \(\Phi(-1)=1-0.8413=0.1587\).)

Intervals. Standardise both ends and subtract:

\[ P(a\le X\le b)=\Phi\!\left(\frac{b-\mu}{\sigma}\right)-\Phi\!\left(\frac{a-\mu}{\sigma}\right). \]
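If you want to check table lookups, \(\Phi\) can be computed from the error function via the standard identity \(\Phi(z)=\tfrac12\big(1+\operatorname{erf}(z/\sqrt2)\big)\). The Python sketch below is only a checking aid; the helper names are assumptions, and the example normal \(N(3,16)\) with the interval \([2,7]\) is chosen for illustration.

```python
import math

def phi(z):
    """Standard normal CDF, Phi(z) = P(Z <= z), via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def normal_cdf(x, mu, sigma):
    """P(X <= x) for X ~ N(mu, sigma^2): standardise, then apply Phi."""
    return phi((x - mu) / sigma)

def normal_interval(a, b, mu, sigma):
    """P(a <= X <= b): standardise both ends and subtract."""
    return normal_cdf(b, mu, sigma) - normal_cdf(a, mu, sigma)

print(round(phi(1.0), 4))                      # anchor value Phi(1)
print(round(phi(-1.0), 4))                     # equals 1 - Phi(1) by symmetry
print(round(normal_interval(2, 7, 3, 4), 4))   # interval probability for N(3, 16)
```

Small differences in the last digit against the printed table are expected, because the table rounds \(z\) to two decimals.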

"Greater than". \(P(X>x)=1-\Phi\!\big(\frac{x-\mu}{\sigma}\big)\).

Inverse problems: percentiles and critical values

Sometimes you're given a probability and asked for the value — "what score is the top 5%?", "find the 95th percentile". Reverse the steps:

  1. Translate to a cumulative probability \(p=P(X\le x)\).
  2. Find \(z\) with \(\Phi(z)=p\) by reading the table backwards.
  3. Un-standardise: \[ x=\mu+z\,\sigma. \]

Critical values \(z_\alpha\) (the \(z\) with \(P(Z>z_\alpha)=\alpha\), i.e. \(\Phi(z_\alpha)=1-\alpha\)) recur constantly later:

\[ z_{0.05}=1.645,\qquad z_{0.025}=1.96,\qquad z_{0.005}=2.576. \]

Memorise these three — they are the backbone of \(90\%,95\%,99\%\) confidence intervals and two-sided tests.
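Reading the table backwards corresponds to the inverse CDF. As a rough check, Python's standard library offers `statistics.NormalDist.inv_cdf`; the wrapper names `critical_value` and `percentile` below are illustrative assumptions, as is the example \(N(100,15^2)\).

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal: mu = 0, sigma = 1

def critical_value(alpha):
    """z_alpha with P(Z > z_alpha) = alpha, i.e. Phi(z_alpha) = 1 - alpha."""
    return Z.inv_cdf(1 - alpha)

def percentile(p, mu, sigma):
    """Value x with P(X <= x) = p for X ~ N(mu, sigma^2): x = mu + z * sigma."""
    return mu + Z.inv_cdf(p) * sigma

for alpha in (0.05, 0.025, 0.005):
    print(alpha, round(critical_value(alpha), 3))

# Hypothetical example: 95th percentile of N(100, 15^2)
print(round(percentile(0.95, 100, 15), 2))
```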

The 68–95–99.7 rule

A quick mental picture of how much probability sits within \(1,2,3\) standard deviations of the mean:

\[ P(|Z|\le1)\approx0.68,\quad P(|Z|\le2)\approx0.95,\quad P(|Z|\le3)\approx0.997. \]

Each follows from \(P(|Z|\le k)=2\Phi(k)-1\): \(2(0.8413)-1=0.6826\), \(2(0.9772)-1=0.9544\), \(2(0.99865)-1=0.9973\). Useful for sanity checks: a value \(3\sigma\) from the mean is rare (\(\sim0.3\%\) in the tails).
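The three percentages can be reproduced in a few lines. This is a minimal sketch using the standard library's `NormalDist`; the function name `within_k_sd` is an assumption.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal

def within_k_sd(k):
    """P(|Z| <= k) = 2 * Phi(k) - 1 for a standard normal Z."""
    return 2 * Z.cdf(k) - 1

for k in (1, 2, 3):
    print(k, round(within_k_sd(k), 4))
```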

Linear transforms and sums of independent normals

Normals stay normal under the operations you care about — you only need to track the new mean and variance.

Linear transform: if \(X\sim N(\mu,\sigma^2)\) then \[ aX+b\sim N\!\big(a\mu+b,\ a^2\sigma^2\big). \] (Standardisation itself is the case \(a=1/\sigma,\ b=-\mu/\sigma\).)

Sum/difference of independent normals: means add, and (independence!) variances add — even for a difference:

\[ X\pm Y\sim N\!\big(\mu_X\pm\mu_Y,\ \sigma_X^2+\sigma_Y^2\big). \]

Example. A woman's apple consumption \(W\sim N(19.9,3.2^2)\), a man's \(M\sim N(20.7,3.4^2)\), independent. Then \(D=W-M\sim N(19.9-20.7,\ 3.2^2+3.4^2)=N(-0.8,\ 21.8)\), \(\sigma_D=\sqrt{21.8}\approx4.67\). So \(P(W>M)=P(D>0)=P\!\big(Z>\frac{0-(-0.8)}{4.67}\big)=P(Z>0.17)\approx0.43\).
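The apple example can be reproduced with a short Python sketch. The function name is an assumption; the numbers are the ones from the example, and independence of the two variables is assumed exactly as in the text.

```python
import math
from statistics import NormalDist

def prob_x_greater_than_y(mu_x, sd_x, mu_y, sd_y):
    """P(X > Y) for independent normals X and Y.

    D = X - Y is normal with mean mu_x - mu_y and variance sd_x^2 + sd_y^2
    (variances add even for a difference).
    """
    mu_d = mu_x - mu_y
    sd_d = math.sqrt(sd_x ** 2 + sd_y ** 2)
    return 1 - NormalDist(mu_d, sd_d).cdf(0)

# Numbers from the apple-consumption example
p = prob_x_greater_than_y(19.9, 3.2, 20.7, 3.4)
print(round(p, 2))
```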

Recap — the normal toolkit

  • Standardise: \(Z=\frac{X-\mu}{\sigma}\); then \(P(X\le x)=\Phi\big(\frac{x-\mu}{\sigma}\big)\).
  • Symmetry: \(\Phi(-z)=1-\Phi(z)\); greater-than: \(P(X>x)=1-\Phi(\cdot)\); interval: difference of two \(\Phi\).
  • Inverse: probability → \(z\) (table backwards) → \(x=\mu+z\sigma\).
  • Critical values: \(z_{0.05}=1.645,\ z_{0.025}=1.96,\ z_{0.005}=2.576\).
  • Transforms/sums: \(aX+b\sim N(a\mu+b,a^2\sigma^2)\); independent \(X\pm Y\sim N(\mu_X\pm\mu_Y,\sigma_X^2+\sigma_Y^2)\).

Exam reflex: "probability that X is below/above/between" → standardise + \(\Phi\). "value such that top/bottom p%" → inverse (table backwards, then \(x=\mu+z\sigma\)). "find \(\mu\) or \(\sigma\) given a probability/percentile" → set up the standardised equation and solve.

Click any card to flip. Rate it after to track what you need to revisit.
Card 1
\(X\sim N(\mu,\sigma^2)\): what are \(E[X]\) and \(\operatorname{Var}(X)\), and what do \(\mu,\sigma\) control?
\(E[X]=\mu\), \(\operatorname{Var}(X)=\sigma^2\). \(\mu\) is the centre (slides the bell); \(\sigma\) the spread (small=tall/narrow, large=short/wide).
Card 2
How do you turn any \(X\sim N(\mu,\sigma^2)\) into a standard normal, and why?
Standardise: \(Z=\dfrac{X-\mu}{\sigma}\sim N(0,1)\). It lets one \(\Phi\) table answer every normal: \(P(X\le x)=\Phi\big(\frac{x-\mu}{\sigma}\big)\).
Card 3
What does \(\Phi(z)\) mean, and why does the table only list \(z\ge0\)?
\(\Phi(z)=P(Z\le z)\), the area to the left under the standard bell. Only \(z\ge0\) is listed because symmetry gives negatives: \(\Phi(-z)=1-\Phi(z)\).
Card 4
Compute \(P(a\le X\le b)\) for a normal.
Standardise both ends and subtract: \(P(a\le X\le b)=\Phi\big(\frac{b-\mu}{\sigma}\big)-\Phi\big(\frac{a-\mu}{\sigma}\big)\).
Card 5
How do you handle \(P(X>x)\) and a negative \(z\)?
\(P(X>x)=1-\Phi\big(\frac{x-\mu}{\sigma}\big)\). For negative \(z\): \(\Phi(-z)=1-\Phi(z)\).
Card 6
Inverse problem: find the value \(x\) with \(P(X\le x)=p\).
Read the table backwards to get \(z\) with \(\Phi(z)=p\), then un-standardise: \(x=\mu+z\sigma\).
Card 7
The three critical values \(z_{0.05},z_{0.025},z_{0.005}\)?
\(z_{0.05}=1.645\) (\(\Phi=0.95\)), \(z_{0.025}=1.96\) (\(\Phi=0.975\)), \(z_{0.005}=2.576\) (\(\Phi=0.995\)). Backbone of 90/95/99% CIs and tests.
Card 8
State the 68–95–99.7 rule.
\(P(|Z|\le1)\approx0.68\), \(P(|Z|\le2)\approx0.95\), \(P(|Z|\le3)\approx0.997\). From \(P(|Z|\le k)=2\Phi(k)-1\).
Card 9
Distribution of a linear transform \(aX+b\) of a normal?
\(aX+b\sim N(a\mu+b,\ a^2\sigma^2)\) — stays normal; mean shifts/scales, variance scales by \(a^2\).
Card 10
Distribution of \(X+Y\) and \(X-Y\) for independent normals?
Both normal: \(X\pm Y\sim N(\mu_X\pm\mu_Y,\ \sigma_X^2+\sigma_Y^2)\). Means add/subtract; variances ADD in both cases (independence).
Card 11
A problem asks for the cutoff score to be in the 'top 5%'. Which kind of problem, and the steps?
Inverse-normal. Top 5% means \(P(X>x)=0.05\Rightarrow\Phi(z)=0.95\Rightarrow z=1.645\), then \(x=\mu+1.645\sigma\).
Card 12
You're told \(95\%\) of a normal lies in \([\mu-c,\mu+c]\) and asked for \(\sigma\). Method?
\(2\Phi(c/\sigma)-1=0.95\Rightarrow\Phi(c/\sigma)=0.975\Rightarrow c/\sigma=1.96\Rightarrow\sigma=c/1.96\).
EX. 01

Basic normal probabilities: standardise and look up

Ross 5E, Ex 5.5a easy

\(X\sim N(3,16)\) (so \(\mu=3,\ \sigma=4\)). Find (a) \(P(X<11)\), (b) \(P(X>-1)\), (c) \(P(2<X<7)\).

Standardise each bound with \(z=\frac{x-3}{4}\); use symmetry for negatives.

Step 1. (a)
\(z=\frac{11-3}{4}=2\Rightarrow P(X<11)=\Phi(2)=0.9772.\)
Step 2. (b)
\(z=\frac{-1-3}{4}=-1\Rightarrow P(X>-1)=P(Z>-1)=\Phi(1)=0.8413.\)
Step 3. (c)
\(z\) from \(-0.25\) to \(1\): \(\Phi(1)-\Phi(-0.25)=0.8413-(1-0.5987)=0.8413-0.4013=0.4400.\)
(a) \(0.9772\); (b) \(0.8413\); (c) \(0.4400\).
EX. 02

IQ scores: three probabilities

Distribuzione Normale (Mar 22), Ex 4 medium

IQs are \(X\sim N(100,225)\) (so \(\sigma=15\)). Find the proportion of students with IQ (a) below 90, (b) above 145, (c) between 120 and 140. Use \(\Phi(0.67)=0.7486,\ \Phi(1.33)=0.9082,\ \Phi(2.67)=0.9962,\ \Phi(3)=0.9987\).

Standardise with \(z=\frac{x-100}{15}\); symmetry for the negative one.

Step 1. (a) below 90
\(z=\frac{90-100}{15}=-0.67\Rightarrow P=\Phi(-0.67)=1-0.7486=0.2514.\)
Step 2. (b) above 145
\(z=\frac{145-100}{15}=3\Rightarrow P(X>145)=1-\Phi(3)=1-0.9987=0.0013.\)
Step 3. (c) 120–140
\(z\) from \(1.33\) to \(2.67\): \(\Phi(2.67)-\Phi(1.33)=0.9962-0.9082=0.0880.\)
(a) \(\approx0.251\); (b) \(\approx0.0013\); (c) \(\approx0.088\).
EX. 03

Inverse normal: GRE cutoff scores for the top p%

Distribuzione Normale (Mar 22), Ex 2 medium

GRE quantitative scores are \(X\sim N(510,92^2)\). What score puts you in the top (a) 10%, (b) 5%, (c) 1%? Use \(z_{0.10}=1.28,\ z_{0.05}=1.645,\ z_{0.01}=2.33\).

Top \(p\%\) means \(\Phi(z)=1-p\); then \(x=\mu+z\sigma=510+92z\).

Step 1. (a) top 10%
\(z=1.28\Rightarrow x=510+1.28(92)=510+117.76=627.76.\)
Step 2. (b) top 5%
\(z=1.645\Rightarrow x=510+1.645(92)=510+151.34=661.34.\)
Step 3. (c) top 1%
\(z=2.33\Rightarrow x=510+2.33(92)=510+214.36=724.36.\)
(a) \(\approx627.8\); (b) \(\approx661.3\); (c) \(\approx724.4\).
EX. 04

Find σ from a probability interval

MACSfinal2, P1 medium

Heights of 12-year-olds are \(N(150,\sigma^2)\), and \(95\%\) lie between \(140\) and \(160\) cm. Find \(\sigma\).

The interval is symmetric about the mean 150, so \(2\Phi(10/\sigma)-1=0.95\).

Step 1. Set up
\(P(140\le X\le160)=2\Phi\!\big(\tfrac{10}{\sigma}\big)-1=0.95\Rightarrow\Phi\!\big(\tfrac{10}{\sigma}\big)=0.975.\)
Step 2. Solve
\(\tfrac{10}{\sigma}=1.96\Rightarrow\sigma=\tfrac{10}{1.96}\approx5.10\) cm.
\(\sigma\approx5.10\) cm.
EX. 05

Recover σ² from a moment, then a probability

MAIfinal3, P3 medium

\(X\) is normal with \(E[X]=2\) and \(E[X(X-1)]=6\). Find \(\operatorname{Var}(X)\) and \(P(X\le4)\). Use \(\Phi(1)=0.8413\).

Expand \(E[X(X-1)]=E[X^2]-E[X]\) to get \(E[X^2]\), then \(\operatorname{Var}=E[X^2]-(E[X])^2\).

Step 1. Second moment
\(E[X(X-1)]=E[X^2]-E[X]=6\Rightarrow E[X^2]=6+2=8.\)
Step 2. Variance
\(\operatorname{Var}(X)=8-2^2=4\), so \(\sigma=2\) and \(X\sim N(2,4)\).
Step 3. Probability
\(P(X\le4)=\Phi\!\big(\tfrac{4-2}{2}\big)=\Phi(1)=0.8413.\)
\(\operatorname{Var}(X)=4,\ P(X\le4)=0.8413.\)
EX. 06

From two percentiles to μ and σ

MAIfinal5, P3 medium

A normal variable has 25th percentile \(=3.0\) and 75th percentile \(=7.0\). Find its mean and standard deviation. Use \(\Phi(0.674)=0.75\).

By symmetry the mean is the midpoint. Then use one percentile to get \(\sigma\).

Step 1. Mean
Symmetric quartiles \(\Rightarrow\mu=\tfrac{3+7}{2}=5.\)
Step 2. Sigma
\(P(X\le7)=0.75\Rightarrow\Phi\!\big(\tfrac{7-5}{\sigma}\big)=0.75\Rightarrow\tfrac{2}{\sigma}=0.674\Rightarrow\sigma\approx2.97.\)
\(\mu=5,\ \sigma\approx2.97.\)
EX. 07

Comparing two normal strategies

MACSfinal1, P3 medium

Strategy A gives a return \(\sim N(100,100^2)\); strategy B gives \(\sim N(60,30^2)\). What matters is that the return is at least 50. Which strategy maximises \(P(\text{return}\ge50)\)? Use \(\Phi(0.5)=0.6915,\ \Phi(0.33)=0.6293\).

Compute \(P(\text{return}\ge50)\) for each by standardising; bigger wins.

Step 1. Strategy A
\(z=\frac{50-100}{100}=-0.5\Rightarrow P(R_A\ge50)=\Phi(0.5)=0.6915.\)
Step 2. Strategy B
\(z=\frac{50-60}{30}=-\tfrac13\Rightarrow P(R_B\ge50)=\Phi(0.33)=0.6293.\)
Step 3. Decide
\(0.6915>0.6293\Rightarrow\) choose Strategy A.
Strategy A (\(0.6915\) vs \(0.6293\)).
CH. 05 Statistics

CLT & Sampling Distributions

Sums and counts of many i.i.d. units go Normal — standardize the total and read Φ

Sections 04
Flashcards 5
Exercises 3
Read time 3'
Sources Ross, Prob & Stats for Engineers 5E, ch. 6 (§6.2 sample mean, §6.3 CLT) · Ross, Introductory Statistics 3E, §5.5 (binomial→normal) · Past exams: MAIf2, EBf1, sampletest1
Key concepts
Central Limit Theorem Sum S_n≈N(nμ,nσ²) Sample mean X̄≈N(μ,σ²/n) Standardize the total Binomial→normal Continuity correction

The Central Limit Theorem

Take \(n\) independent, identically distributed units \(X_1,\dots,X_n\), each with mean \(\mu\) and variance \(\sigma^2\). For large \(n\) (rule of thumb \(n\ge30\)) their sum and mean are approximately Normal:

\[ S_n=\sum_{i=1}^n X_i \;\approx\; N\big(n\mu,\;n\sigma^2\big), \qquad \bar X \;\approx\; N\!\left(\mu,\;\frac{\sigma^2}{n}\right). \]

The shape of the individual \(X_i\) does not matter — only \(\mu\), \(\sigma^2\), and \(n\). This is what lets you answer "probability that the total exceeds a threshold" without knowing the per-unit distribution.

The standardization template

Almost every CLT exam problem is: many i.i.d. units, per-unit \(\mu\) and \(\sigma\) given, asked the probability a total crosses a threshold \(s\). The recipe:

  1. Total parameters: mean \(n\mu\), variance \(n\sigma^2\), sd \(\sigma\sqrt n\).
  2. Standardize: \( Z=\dfrac{s-n\mu}{\sigma\sqrt n} \).
  3. Read \(\Phi\): \( P(S_n\le s)=\Phi(Z) \), \( P(S_n\ge s)=1-\Phi(Z) \).
\[ P(S_n\ge s)=1-\Phi\!\left(\frac{s-n\mu}{\sigma\sqrt n}\right). \]

Watch the direction ("sufficient to cover" / "exceed" → upper tail). A standardized value \(|Z|\gtrsim4\) means the probability is effectively 0 or 1.
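The three-step recipe translates directly into a small helper. This is a sketch under the CLT assumptions stated above (i.i.d. units, large \(n\)); the function name and the example numbers are hypothetical.

```python
import math
from statistics import NormalDist

def clt_sum_upper_tail(n, mu, sigma, s):
    """Approximate P(S_n >= s) for a sum of n i.i.d. units via the CLT.

    mu and sigma are the per-unit mean and standard deviation.
    Returns the standardized value z and the upper-tail probability.
    """
    z = (s - n * mu) / (sigma * math.sqrt(n))
    return z, 1 - NormalDist().cdf(z)

# Hypothetical example: 50 units with mean 4 and sd 1.5, threshold 210
z, p = clt_sum_upper_tail(50, 4.0, 1.5, 210.0)
print(round(z, 2), round(p, 4))
```

Returning \(z\) alongside the probability makes it easy to spot the extreme cases where the answer is effectively 0 or 1.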

Binomial → Normal approximation

A Binomial count is a sum of \(n\) Bernoulli variables, so for large \(n\) the CLT gives

\[ X\sim\mathrm{Bin}(n,p)\;\approx\;N\big(np,\;np(1-p)\big). \]

Standardize with \(\mu=np\), \(\sigma=\sqrt{np(1-p)}\): \( P(X\le k)\approx\Phi\!\big(\tfrac{k-np}{\sqrt{np(1-p)}}\big) \).

Continuity correction. Because \(X\) is discrete, the more accurate version replaces \(k\) with \(k+\tfrac12\) (for \(\le\)) or \(k-\tfrac12\) (for \(\ge\)). Exams often say "without (half) continuity correction" — then use \(k\) as-is. Always check the wording.
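To see what the half-unit correction buys, the sketch below compares the exact binomial probability with both versions of the normal approximation. The function names are assumptions; the example uses a fair coin thrown 100 times and the event \(X\le60\).

```python
import math
from statistics import NormalDist

def binom_cdf(k, n, p):
    """Exact P(X <= k) for X ~ Bin(n, p)."""
    return sum(math.comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k + 1))

def normal_approx_cdf(k, n, p, correction=True):
    """Normal approximation to P(X <= k), with or without the +1/2 correction."""
    mu = n * p
    sd = math.sqrt(n * p * (1 - p))
    x = k + 0.5 if correction else k
    return NormalDist(mu, sd).cdf(x)

n, p, k = 100, 0.5, 60
print(round(binom_cdf(k, n, p), 4))                            # exact
print(round(normal_approx_cdf(k, n, p, correction=False), 4))  # uses k as-is
print(round(normal_approx_cdf(k, n, p, correction=True), 4))   # uses k + 1/2
```

In this example the corrected value sits closer to the exact probability, which is why the correction is the default unless the exam wording rules it out.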

Recognition guide

Wording | Tool
"\(n\) (≥30) i.i.d. units, per-unit μ and σ, prob the TOTAL exceeds …" | CLT on the sum: \(N(n\mu,n\sigma^2)\)
"average of \(n\) measurements is within … of μ" | CLT on the mean: \(N(\mu,\sigma^2/n)\)
"coin thrown \(n\) times / count of successes", large \(n\), "normal approximation" | Binomial→Normal: \(N(np,np(1-p))\)
"without half correction" | use \(k\) as-is, no \(\pm\tfrac12\)
Click any card to flip. Rate it after to track what you need to revisit.
Card 1
State the CLT for the sum and the mean of n i.i.d. units with mean μ and variance σ².
\( S_n=\sum X_i\approx N(n\mu,\,n\sigma^2) \) and \( \bar X\approx N(\mu,\,\sigma^2/n) \), for large n (≥30). The per-unit distribution shape is irrelevant.
Card 2
Recognition cue: many i.i.d. units, per-unit μ and σ given, asked the probability the TOTAL exceeds a threshold s. Procedure?
Sum ≈ \(N(n\mu,n\sigma^2)\). Standardize \(Z=\dfrac{s-n\mu}{\sigma\sqrt n}\), then \(P(S_n\ge s)=1-\Phi(Z)\).
Card 3
Binomial(n,p) normal approximation — the approximating distribution, and the continuity correction?
\( \mathrm{Bin}(n,p)\approx N(np,\,np(1-p)) \). Continuity correction: use \(k+\tfrac12\) for \(P(X\le k)\), \(k-\tfrac12\) for \(P(X\ge k)\). Skip it if the problem says "without half correction".
Card 4
When standardizing a SUM of n units, what are the mean and standard deviation you divide by?
Mean \(n\mu\), standard deviation \(\sigma\sqrt n\) (variance \(n\sigma^2\)). Do NOT use \(\sigma/\sqrt n\) — that's for the mean \(\bar X\), not the sum.
Card 5
A CLT problem gives a standardized value of Z≈9.5. What's the probability the total exceeds the threshold?
Essentially 0: \(1-\Phi(9.5)\approx0\). \(|Z|\gtrsim4\) already pins the tail probability to 0 (or 1 on the other side).
EX. 01

Will 40 cans of paint be enough?

MAIf2-P3 medium

A can of paint covers on average 52 m² with sd 3 m². You must paint 2260 m². What is the approximate probability that 40 cans suffice?

"Sufficient" means the total coverage \(S_{40}=\sum X_i\ge2260\). Use CLT on the sum: \(N(40\mu,40\sigma^2)\).

Step 1. Total parameters
Per can \(\mu=52\), \(\sigma^2=9\). Sum of 40: mean \(40\cdot52=2080\), variance \(40\cdot9=360\), sd \(\sqrt{360}\approx18.97\).
Step 2. Standardize
\( Z=\dfrac{2260-2080}{\sqrt{360}}=\dfrac{180}{18.97}\approx9.49 \).
Step 3. Read the tail
\( P(S_{40}\ge2260)=1-\Phi(9.49)\approx0 \).
≈ 0 — it is essentially impossible for 40 cans to cover 2260 m² (you'd expect them to cover only ~2080).
EX. 02

Do 36 batteries last a year?

EBf1-P3 medium

A battery lasts on average 10 days with sd 1 day; lifetimes are i.i.d. and each expired battery is replaced. Find the approximate probability that the total lifetime of 36 batteries exceeds one year (365 days).

Total lifetime \(L=\sum_{i=1}^{36}X_i\). CLT: \(L\approx N(36\mu,36\sigma^2)\). Want \(P(L>365)\).

Step 1. Total parameters
Per battery \(\mu=10\), \(\sigma^2=1\). For 36: mean \(360\), variance \(36\), sd \(6\).
Step 2. Standardize
\( Z=\dfrac{365-360}{6}=\dfrac{5}{6}\approx0.83 \).
Step 3. Read the tail
\( P(L>365)=1-\Phi(5/6)\approx1-0.7977=0.2023 \).
≈ 0.2023 (about 20%). Note: using the coarse table value \(\Phi(0.83)=0.7967\) gives ≈0.2033; the official answer 0.2023 uses \(\Phi(0.833)\approx0.7977\).
EX. 03

100 coin tosses — normal approximation

sampletest1-P3 medium

A fair coin is thrown 100 times; \(X\) = number of Heads. Using the normal approximation without half-correction, compute \(P(X\le60)\).

\(X\sim\mathrm{Bin}(100,\tfrac12)\). Approximate by \(N(np,np(1-p))\). "Without half correction" → use 60 directly.

Step 1. Approximating normal
\( np=50 \), \( np(1-p)=25 \), so \(X\approx N(50,25)\), sd \(=5\).
Step 2. Standardize (no correction)
\( Z=\dfrac{60-50}{5}=2 \).
Step 3. Read
\( P(X\le60)\approx\Phi(2)=0.9772 \).
\( P(X\le60)\approx0.9772 \).
CH. 09 Statistics

Estimator Theory

Bias, variance, MSE, efficiency, unbiasedness — plus MLE and method-of-moments. The estimator-algebra slice appears in 5 of 7 exams.

Sections 07
Flashcards 8
Exercises 7
Read time 3'
Sources Ross, Prob & Stats for Engineers 5E, §7.2 (MLE), §7.7 (evaluating an estimator) · Bombelli, StimaPunti Proprietà Estimatori (point estimation notes) · Past exams: MACSf1/2, MAIf2/3/5, sampletest1
Key concepts
Bias Variance MSE = Var + Bias² Unbiasedness Efficiency Min-MSE weighting Jensen's inequality MLE Method-of-moments

The estimator vocabulary

An estimator \(T\) for a parameter \(\theta\) is a function of the sample. Three numbers grade it:

\[ \mathrm{Bias}(T)=E[T]-\theta, \qquad \mathrm{Var}(T), \qquad \mathrm{MSE}(T)=E[(T-\theta)^2]=\mathrm{Var}(T)+\mathrm{Bias}(T)^2. \]

Unbiased means \(E[T]=\theta\) (Bias \(=0\)), and then \(\mathrm{MSE}=\mathrm{Var}\). More efficient = smaller MSE (smaller variance, among unbiased estimators).

The algebra engine (from the discrete-RV chapter): \(E\) is linear always; for independent pieces \( \mathrm{Var}(aU+bW)=a^2\mathrm{Var}(U)+b^2\mathrm{Var}(W) \). Population moments you reuse: Bernoulli \(E=p,\mathrm{Var}=p(1-p)\); Poisson \(E=\mathrm{Var}=\lambda\); sample mean of \(m\) obs \(\mathrm{Var}(\bar X_m)=\sigma^2/m\); and \(\sigma^2=E[X^2]-\mu^2\).
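The decomposition \(\mathrm{MSE}=\mathrm{Var}+\mathrm{Bias}^2\) also holds for simulated values of an estimator, which gives a quick way to check it. The sketch below is a Monte Carlo illustration only; the Bernoulli setup with \(p=0.3\), the sample size \(n=9\), the seed, and all function names are assumptions made for the example.

```python
import random
from statistics import fmean, pvariance

def simulate_estimator(estimator, sampler, n, reps, seed=0):
    """Draw `reps` samples of size n and return the estimator's values."""
    rng = random.Random(seed)
    return [estimator([sampler(rng) for _ in range(n)]) for _ in range(reps)]

def bias_var_mse(values, theta):
    """Empirical bias, variance and MSE of estimator values around theta."""
    bias = fmean(values) - theta
    var = pvariance(values)
    mse = fmean([(v - theta) ** 2 for v in values])
    return bias, var, mse

# Hypothetical setup: Bernoulli(p = 0.3) data, estimator = sample proportion
p = 0.3
values = simulate_estimator(
    estimator=lambda xs: sum(xs) / len(xs),
    sampler=lambda rng: 1 if rng.random() < p else 0,
    n=9, reps=20_000,
)
bias, var, mse = bias_var_mse(values, p)
print(bias, var, mse, var + bias ** 2)  # last two agree up to rounding
```

The simulated variance should land near the theoretical \(p(1-p)/n\), and the bias near 0, since the sample proportion is unbiased.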

Unbiasedness & finding the constant

Weights that sum to 1 → automatically unbiased for the mean. If \(T=c\bar X_1+(1-c)\bar X_2\) then \(E[T]=\mu\) for every \(c\).

Find a constant. When asked "for what \(a\) is \(T\) unbiased for \(\sigma^2\)", set \(E[T]=\theta\) and solve. The recurring move is \(E[X_i^2]=\sigma^2+\mu^2\), and for independent \(X_i,X_j\), \(E[X_iX_j]=\mu^2\). Example: \(T=\frac1n\sum X_i^2+a\) with \(\mu\) known → \(E[T]=\sigma^2+\mu^2+a=\sigma^2\Rightarrow a=-\mu^2\).

Efficiency & minimum-MSE weighting

Compare efficiency: if all candidates are unbiased, compute each variance and pick the smallest.

Optimal weight. For \(M_c=c\bar X_1+(1-c)\bar X_2\) (independent), \(\mathrm{MSE}=c^2\mathrm{Var}_1+(1-c)^2\mathrm{Var}_2\); differentiate and set to 0. The minimiser is inverse-variance weighting

\[ c^*=\frac{1/\mathrm{Var}_1}{1/\mathrm{Var}_1+1/\mathrm{Var}_2}. \]

With sample means of sizes \(n_1,n_2\) (\(\mathrm{Var}_i=\sigma^2/n_i\)) this becomes \(c^*=\dfrac{n_1}{n_1+n_2}\) — weight each sample by its size. Put more weight on the more precise (larger / lower-variance) sample.
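A brute-force check of the optimal weight: evaluate the MSE on a grid of \(c\) values and compare the minimiser with the inverse-variance formula. The numbers (\(\sigma^2=4\), \(n_1=30\), \(n_2=10\)) and the function names are hypothetical choices for this sketch.

```python
def inverse_variance_weight(var1, var2):
    """Weight c* on the first estimator that minimises c^2*var1 + (1-c)^2*var2."""
    return (1 / var1) / (1 / var1 + 1 / var2)

def mse_of_weight(c, var1, var2):
    """MSE of c*T1 + (1-c)*T2 for independent unbiased T1, T2."""
    return c ** 2 * var1 + (1 - c) ** 2 * var2

# Hypothetical example: common sigma^2 = 4, sample sizes n1 = 30 and n2 = 10
sigma2, n1, n2 = 4.0, 30, 10
var1, var2 = sigma2 / n1, sigma2 / n2
c_star = inverse_variance_weight(var1, var2)

# Brute-force check on a grid of candidate weights
grid = [i / 1000 for i in range(1001)]
c_grid = min(grid, key=lambda c: mse_of_weight(c, var1, var2))
print(c_star, n1 / (n1 + n2), c_grid)
```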

Nonlinear transforms break unbiasedness (Jensen)

If \(T\) is unbiased for \(\sigma^2\), is \(\sqrt T\) unbiased for \(\sigma\)? No. For a nonlinear \(g\), \(E[g(T)]\neq g(E[T])\) in general (Jensen's inequality). Since \(\sqrt{\cdot}\) is concave, \(E[\sqrt T]\le\sqrt{E[T]}=\sigma\), with strict inequality unless \(T\) is constant → \(\sqrt T\) is biased low for \(\sigma\). Whenever an exam takes a square root, log, or reciprocal of an unbiased estimator, the answer to "still unbiased?" is no.
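A simulation shows the downward bias concretely. The sketch below takes \(T=S^2\), the usual sample variance with divisor \(n-1\) (a standard unbiased estimator of \(\sigma^2\), chosen here as an assumption since the text leaves \(T\) generic), on normal samples of size 5 with a hypothetical \(\sigma=2\).

```python
import math
import random
from statistics import fmean

def mean_of_sample_sd(sigma, n, reps, seed=0):
    """Monte Carlo estimate of E[sqrt(S^2)] for normal samples of size n.

    S^2 uses the divisor n - 1, so E[S^2] = sigma^2,
    but its square root is not unbiased for sigma.
    """
    rng = random.Random(seed)
    roots = []
    for _ in range(reps):
        sample = [rng.gauss(0.0, sigma) for _ in range(n)]
        m = sum(sample) / n
        s2 = sum((x - m) ** 2 for x in sample) / (n - 1)
        roots.append(math.sqrt(s2))
    return fmean(roots)

sigma = 2.0  # hypothetical true standard deviation
print(mean_of_sample_sd(sigma, n=5, reps=20_000), "vs sigma =", sigma)
```

The average of \(\sqrt{S^2}\) comes out visibly below \(\sigma\), as Jensen's inequality predicts; the gap shrinks as \(n\) grows.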

Maximum likelihood (MLE)

Recipe: write the likelihood \(L(\theta)=\prod_i f(x_i\mid\theta)\), take \(\log\), differentiate, set to 0, solve.

\[ \ell(\theta)=\log L(\theta)=\sum_i \log f(x_i\mid\theta), \qquad \frac{d\ell}{d\theta}=0. \]

Geometric example \(f(x\mid\theta)=\theta(1-\theta)^{x-1}\): \(L=\theta^n(1-\theta)^{\sum(x_i-1)}\), \(\ell=n\log\theta+\big(\sum(x_i-1)\big)\log(1-\theta)\), giving \(\hat\theta=\dfrac{n}{\sum x_i}=\dfrac1{\bar X}\). On the exam the pmf/pdf is given in the problem — you just run the recipe.
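The closed form \(\hat\theta=1/\bar X\) can be checked against a direct numerical maximisation of the log-likelihood. The data below are invented for illustration and the function names are assumptions; the grid search is only a checking device, not the exam method.

```python
import math

def geometric_log_likelihood(theta, xs):
    """Log-likelihood for f(x | theta) = theta * (1 - theta)^(x - 1), x = 1, 2, ..."""
    n = len(xs)
    return n * math.log(theta) + sum(x - 1 for x in xs) * math.log(1 - theta)

def geometric_mle(xs):
    """Closed-form MLE: theta_hat = n / sum(x_i) = 1 / sample mean."""
    return len(xs) / sum(xs)

# Hypothetical data: number of trials until the first success, observed 6 times
xs = [3, 1, 4, 2, 2, 8]
theta_hat = geometric_mle(xs)

# Numerical check: maximise the log-likelihood over a fine grid in (0, 1)
grid = [i / 1000 for i in range(1, 1000)]
theta_grid = max(grid, key=lambda th: geometric_log_likelihood(th, xs))
print(theta_hat, theta_grid)
```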

Method-of-moments (MoM)

Equate the population moment to the sample moment and solve for the parameter. For one parameter, set \(E[X]=\bar X\) (using the theoretical mean as a function of \(\theta\)) and invert.

Example density \(f(x)=2x/\theta^2\) on \([0,\theta]\): \(E[X]=\int_0^\theta x\frac{2x}{\theta^2}dx=\frac{2\theta}{3}\). Set \(\bar X=\frac{2\theta}{3}\Rightarrow \hat\theta_M=\frac{3}{2}\bar X\). It is unbiased here (\(E[\hat\theta_M]=\theta\)) with \(\mathrm{MSE}=\theta^2/(8n)\to0\) — consistent. MoM and MLE can differ; MoM is usually the quicker algebra.
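The method-of-moments estimator for this density can be tried on simulated data. The sketch draws from \(f(x)=2x/\theta^2\) by inverse-CDF sampling (since \(F(x)=x^2/\theta^2\), a draw is \(\theta\sqrt{U}\) with \(U\) uniform on \((0,1)\)); the true \(\theta=2\), the sample size, and the seed are hypothetical choices.

```python
import math
import random
from statistics import fmean

def sample_density(theta, rng):
    """One draw from f(x) = 2x / theta^2 on [0, theta] by inverse-CDF sampling.

    The CDF is F(x) = x^2 / theta^2, so x = theta * sqrt(u) for u ~ Uniform(0, 1).
    """
    return theta * math.sqrt(rng.random())

def mom_estimate(xs):
    """Method-of-moments estimator: solve mean = 2*theta/3 for theta."""
    return 1.5 * fmean(xs)

rng = random.Random(0)   # fixed seed so the run is reproducible
theta = 2.0              # hypothetical true parameter
xs = [sample_density(theta, rng) for _ in range(10_000)]
print(mom_estimate(xs))  # should land near theta
```

With \(n=10{,}000\) the standard error is about \(\theta/\sqrt{8n}\approx0.007\), so the estimate should sit very close to 2.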

Recognition guide

Wording | Do
"which estimator is more efficient" (several unbiased) | compute each variance, smallest wins
"value of c that minimizes MSE" | differentiate MSE in c, or inverse-variance weight \(c^*\)
"determine constant a so T is unbiased for σ²/μ" | set \(E[T]=\theta\), use \(E[X^2]=\sigma^2+\mu^2\), solve
"is √T (or log, 1/T) unbiased" | No: Jensen, nonlinear ⇒ biased
"write the likelihood / find the MLE" | \(L=\prod f\) → log → differentiate → solve
"find the moments estimator" | \(E[X]=\bar X\) (as function of θ), invert
"Bias and MSE of T" | \(E[T]-\theta\); \(\mathrm{Var}(T)+\mathrm{Bias}^2\)
Click any card to flip. Rate it after to track what you need to revisit.
Card 1
Define bias, variance, and MSE of an estimator T for θ, and the identity linking them.
\(\mathrm{Bias}(T)=E[T]-\theta\); \(\mathrm{Var}(T)=E[(T-E[T])^2]\); \(\mathrm{MSE}(T)=E[(T-\theta)^2]=\mathrm{Var}(T)+\mathrm{Bias}(T)^2\). Unbiased ⇒ MSE = Var.
Card 2
Several UNBIASED estimators are given; how do you pick the most efficient?
Most efficient = smallest MSE = smallest variance (since unbiased). Compute each \(\mathrm{Var}\) via \(\mathrm{Var}(aU+bW)=a^2\mathrm{Var}(U)+b^2\mathrm{Var}(W)\) (independence) and compare coefficients.
Card 3
For \(M_c=c\bar X_1+(1-c)\bar X_2\) with independent samples of sizes \(n_1,n_2\): is it unbiased, and what c minimizes MSE?
Unbiased for every c (weights sum to 1). The min-MSE weight is inverse-variance: with \(\mathrm{Var}_k=\mathrm{Var}(\bar X_k)=\sigma^2/n_k\) (common population variance), \(c^*=\dfrac{1/\mathrm{Var}_1}{1/\mathrm{Var}_1+1/\mathrm{Var}_2}=\dfrac{n_1}{n_1+n_2}\).
Card 4
How do you find the constant a making \(T=\frac1n\sum X_i^2+a\) unbiased for σ² when μ is known?
Use \(E[X_i^2]=\sigma^2+\mu^2\). Then \(E[T]=\sigma^2+\mu^2+a=\sigma^2\Rightarrow a=-\mu^2\).
Card 5
If T is unbiased for σ², is √T unbiased for σ? Why / why not?
No. Square root is concave, so by Jensen \(E[\sqrt T]\le\sqrt{E[T]}=\sigma\), strict unless T is degenerate → √T is biased low. Any nonlinear transform of an unbiased estimator is generally biased.
Card 6
MLE recipe in four moves?
1) Likelihood \(L(\theta)=\prod_i f(x_i\mid\theta)\). 2) Log-likelihood \(\ell=\sum\log f\). 3) \(d\ell/d\theta=0\). 4) Solve for \(\hat\theta\). Geometric \(\theta(1-\theta)^{x-1}\) → \(\hat\theta=1/\bar X\).
Card 7
Method-of-moments recipe for a one-parameter family?
Express the population mean as a function of θ, set it equal to the sample mean \(\bar X\), and invert. E.g. \(E[X]=2\theta/3\Rightarrow\hat\theta_M=\tfrac32\bar X\).
Card 8
Two key population-moment facts used constantly in estimator algebra?
\(E[X^2]=\sigma^2+\mu^2\) (from \(\sigma^2=E[X^2]-\mu^2\)); and for independent \(X_i,X_j\), \(E[X_iX_j]=E[X_i]E[X_j]=\mu^2\).
EX. 01

Which of three unbiased estimators is most efficient?

MACSf1-P6 hard

\(\hat p\) is the success proportion in a Bernoulli sample of size \(n=9\); \(Y\) is one further independent observation. Consider \(T_1=\hat p\), \(T_2=\tfrac12\hat p+\tfrac12 Y\), \(T_3=\tfrac{9}{10}\hat p+\tfrac{1}{10}Y\). Which is most efficient?

Check all three are unbiased (they are), then compare variances. Use \(\mathrm{Var}(\hat p)=p(1-p)/9\), \(\mathrm{Var}(Y)=p(1-p)\).

Step 1. All unbiased
Each is a weight-1 combination of unbiased pieces, so \(E[T_i]=p\) and MSE = Var.
Step 2. Variances
\(\mathrm{Var}(T_1)=\tfrac19 p(1-p)\). \(\mathrm{Var}(T_2)=\tfrac14\cdot\tfrac{p(1-p)}{9}+\tfrac14 p(1-p)=\tfrac{10}{36}p(1-p)\). \(\mathrm{Var}(T_3)=\tfrac{81}{100}\cdot\tfrac{p(1-p)}{9}+\tfrac1{100}p(1-p)=\tfrac{1}{10}p(1-p)\).
Step 3. Compare coefficients
\(\tfrac1{10}=0.100<\tfrac19\approx0.111<\tfrac{10}{36}\approx0.278\).
\(T_3\) is most efficient (smallest variance, coeff \(1/10\)); \(T_2\) is least efficient. Intuition: \(T_3\) leans on the bigger, more precise sample.
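
The variance comparison can be reproduced with exact fractions. In this sketch the helper `var_coeff` is a made-up name; it returns the coefficient of \(p(1-p)\) in the variance of a linear combination of independent pieces, which is all the exercise needs.

```python
from fractions import Fraction

def var_coeff(weights_and_vars):
    """Coefficient of p(1-p) in Var(sum a_k U_k) for independent U_k:
    sum of a_k^2 * (variance coefficient of U_k)."""
    return sum(a * a * v for a, v in weights_and_vars)

var_p_hat = Fraction(1, 9)   # Var(p_hat) = p(1-p)/9
var_y = Fraction(1)          # Var(Y) = p(1-p)

coeffs = {
    "T1": var_coeff([(Fraction(1), var_p_hat)]),
    "T2": var_coeff([(Fraction(1, 2), var_p_hat), (Fraction(1, 2), var_y)]),
    "T3": var_coeff([(Fraction(9, 10), var_p_hat), (Fraction(1, 10), var_y)]),
}
most_efficient = min(coeffs, key=coeffs.get)
print(coeffs, most_efficient)
```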
EX. 02

Combine two sample means — unbiased c and optimal c

MACSf2-P6 hard

\(\bar X_{10}\) and \(\bar Y_{15}\) are means of independent samples of sizes 10 and 15 from a population with mean μ. Let \(M_c=c\bar X_{10}+(1-c)\bar Y_{15}\). (a) For which c is \(M_c\) unbiased? (b) Find the c minimizing MSE.

Weights sum to 1 → unbiased for all c. MSE \(=c^2\sigma^2/10+(1-c)^2\sigma^2/15\); differentiate.

Step 1. (a) Unbiasedness
\(E[M_c]=c\mu+(1-c)\mu=\mu\) for every c → unbiased for all c.
Step 2. (b) MSE
\(\mathrm{MSE}(M_c)=\mathrm{Var}(M_c)=\sigma^2\!\left(\tfrac{c^2}{10}+\tfrac{(1-c)^2}{15}\right)\).
Step 3. Minimize
\(\dfrac{d}{dc}=\sigma^2\!\left(\tfrac{2c}{10}-\tfrac{2(1-c)}{15}\right)=\dfrac{\sigma^2}{15}(5c-2)=0\Rightarrow c=\tfrac25\). (Matches \(c^*=\tfrac{n_1}{n_1+n_2}=\tfrac{10}{25}\).)
(a) unbiased for every c; (b) \(c^*=\tfrac25\).
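
A short check of part (b), assuming both samples share the same variance \(\sigma^2\) so that it cancels. The function names are illustrative; the grid search over \(c=k/100\) is only a brute-force confirmation of the calculus.

```python
from fractions import Fraction

def mse_coeff(c, n1, n2):
    """Coefficient of sigma^2 in Var(c*Xbar_1 + (1-c)*Xbar_2), independent samples."""
    return c * c / n1 + (1 - c) * (1 - c) / n2

def inverse_variance_weight(n1, n2):
    """c* = (1/Var_1) / (1/Var_1 + 1/Var_2) with Var_k = sigma^2 / n_k; sigma^2 cancels."""
    return Fraction(n1, n1 + n2)

n1, n2 = 10, 15
c_star = inverse_variance_weight(n1, n2)
# Brute-force check on a grid of exact fractions c = k/100.
grid_best = min((Fraction(k, 100) for k in range(101)), key=lambda c: mse_coeff(c, n1, n2))
print(c_star, grid_best, mse_coeff(c_star, n1, n2))
```

At the optimum the variance coefficient equals \(1/(n_1+n_2)\), the same as pooling all 25 observations into one mean.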
EX. 03

Constant for an unbiased variance estimator, and √T

MAIf2-P6 hard

Sample \((X_1,\dots,X_n)\) from a population with known mean \(\mu=3\) and unknown variance \(\sigma^2\). (a) Find \(a\) so that \(T=\frac1n\sum_{i=1}^n X_i^2+a\) is unbiased for \(\sigma^2\). (b) Is \(\sqrt T\) unbiased for \(\sigma\)?

(a) \(E[X_i^2]=\sigma^2+\mu^2=\sigma^2+9\). (b) Think about \(E[\sqrt T]\) vs \(\sqrt{E[T]}\).

Step 1. (a) Expectation of T
\(E[T]=E[X_i^2]+a=(\sigma^2+9)+a\). Unbiased ⇒ \(\sigma^2+9+a=\sigma^2\Rightarrow a=-9\).
Step 2. (b) Square root
We'd need \(E[\sqrt T]=\sigma=\sqrt{E[T]}\). But \(\sqrt{\cdot}\) is concave, so by Jensen \(E[\sqrt T]<\sqrt{E[T]}=\sigma\) (strict unless T is constant).
(a) \(a=-9\); (b) No — \(\sqrt T\) is biased (low) for \(\sigma\) by Jensen's inequality.
EX. 04

Make a(X₁−Xₙ)² unbiased for σ²

MAIf3-P6 hard

Sample from a population with known mean \(\mu=3\), unknown variance \(\sigma^2\). Find the constant \(a\) so that \(T=a(X_1-X_n)^2\) is unbiased for \(\sigma^2\).

Expand the square; use \(E[X_i^2]=\sigma^2+9\) and \(E[X_1X_n]=\mu^2=9\) (independence).

Step 1. Expand
\(E[(X_1-X_n)^2]=E[X_1^2]+E[X_n^2]-2E[X_1X_n]\).
Step 2. Substitute moments
\(=(\sigma^2+9)+(\sigma^2+9)-2(9)=2\sigma^2\).
Step 3. Solve
\(E[T]=a\cdot2\sigma^2=\sigma^2\Rightarrow a=\tfrac12\).
\(a=\tfrac12\).
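
A quick simulation can confirm that \(a=\tfrac12\) works. The sketch assumes a normal population with \(\mu=3\) and a hypothetical \(\sigma=2\) (the exercise does not fix the distribution or \(\sigma\)); the average of \(T\) over many replications should land near \(\sigma^2=4\).

```python
import random
import statistics

rng = random.Random(1)                 # fixed seed for repeatability
mu, sigma, reps = 3.0, 2.0, 50_000     # sigma = 2 and normality are hypothetical choices

def t_stat(a):
    """One realisation of T = a * (X_1 - X_n)^2 with X_1, X_n independent."""
    x1 = rng.gauss(mu, sigma)
    xn = rng.gauss(mu, sigma)
    return a * (x1 - xn) ** 2

avg_t = statistics.fmean(t_stat(0.5) for _ in range(reps))
print(avg_t, sigma ** 2)
```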
EX. 05

Bias and MSE of a Poisson estimator

sampletest1-P4 hard

Sample \((X_1,\dots,X_n)\) from Poisson(λ). For \(T=\tfrac12\!\left(\dfrac{X_1+\cdots+X_{n-1}}{n-1}+X_n\right)\), determine the bias and the MSE.

Write \(T=\tfrac12\bar X_{n-1}+\tfrac12 X_n\). Poisson: \(E=\mathrm{Var}=\lambda\); \(\mathrm{Var}(\bar X_{n-1})=\lambda/(n-1)\).

Step 1. Expectation → bias
\(E[T]=\tfrac12\lambda+\tfrac12\lambda=\lambda\Rightarrow\mathrm{Bias}=0\).
Step 2. Variance
\(\mathrm{Var}(T)=\tfrac14\cdot\dfrac{\lambda}{n-1}+\tfrac14\lambda=\dfrac{\lambda}{4}\!\left(\dfrac{1}{n-1}+1\right)=\dfrac{\lambda}{4}\cdot\dfrac{n}{n-1}\).
Step 3. MSE
Unbiased ⇒ \(\mathrm{MSE}=\mathrm{Var}=\dfrac{n\lambda}{4(n-1)}\).
Bias \(=0\); \(\mathrm{MSE}(T)=\dfrac{n\lambda}{4(n-1)}\).
EX. 06

Geometric MLE

MAIf5-P6 medium

Sample of \(n\) observations from \(f(x\mid\theta)=\theta(1-\theta)^{x-1}\), \(x=1,2,\dots\), \(\theta\in(0,1)\). (a) Write the likelihood. (b) Find the MLE. (c) For the sample \(3,5,1,2,4\), give the estimate.

\(L=\prod f\), then log, differentiate, set 0. The sum of exponents is \(\sum(x_i-1)\).

Step 1. (a) Likelihood
\(L(\theta)=\prod_{i=1}^n\theta(1-\theta)^{x_i-1}=\theta^n(1-\theta)^{\sum(x_i-1)}\).
Step 2. (b) Maximize log-likelihood
\(\ell=n\log\theta+\big(\sum(x_i-1)\big)\log(1-\theta)\); \(\dfrac{d\ell}{d\theta}=\dfrac{n}{\theta}-\dfrac{\sum(x_i-1)}{1-\theta}=0\Rightarrow\hat\theta=\dfrac{n}{\sum x_i}=\dfrac1{\bar X}\).
Step 3. (c) Plug in data
\(\bar X=(3+5+1+2+4)/5=3\Rightarrow\hat\theta=1/3\).
(a) \(L=\theta^n(1-\theta)^{\sum(x_i-1)}\); (b) \(\hat\theta=1/\bar X\); (c) \(\hat\theta=1/3\).
EX. 07

Method-of-moments for a triangular density

StimaPunti (Bombelli), Ex. 3 hard

Sample from \(f(x;\theta)=\dfrac{2x}{\theta^2}\) for \(x\in[0,\theta]\) (0 otherwise), \(\theta>0\). (a) Find \(E[X]\) and \(\mathrm{Var}(X)\). (b) Find the method-of-moments estimator of \(\theta\). (c) Its bias and MSE; behaviour as \(n\to\infty\).

\(E[X]=\int_0^\theta x\frac{2x}{\theta^2}dx\). MoM: set \(\bar X=E[X]\) and invert.

Step 1. (a) Moments
\(E[X]=\dfrac{2}{\theta^2}\!\int_0^\theta x^2dx=\dfrac{2\theta}{3}\); \(E[X^2]=\dfrac{2}{\theta^2}\!\int_0^\theta x^3dx=\dfrac{\theta^2}{2}\); \(\mathrm{Var}(X)=\dfrac{\theta^2}{2}-\dfrac{4\theta^2}{9}=\dfrac{\theta^2}{18}\).
Step 2. (b) MoM estimator
Set \(\bar X=E[X]=\dfrac{2\theta}{3}\Rightarrow\hat\theta_M=\dfrac32\bar X\).
Step 3. (c) Bias and MSE
\(E[\hat\theta_M]=\tfrac32\cdot\tfrac{2\theta}{3}=\theta\) → unbiased. \(\mathrm{Var}(\hat\theta_M)=\tfrac94\cdot\dfrac{\mathrm{Var}(X)}{n}=\tfrac94\cdot\dfrac{\theta^2}{18n}=\dfrac{\theta^2}{8n}\). So \(\mathrm{MSE}=\dfrac{\theta^2}{8n}\to0\).
\(E[X]=\tfrac{2\theta}{3}\), \(\mathrm{Var}(X)=\tfrac{\theta^2}{18}\); \(\hat\theta_M=\tfrac32\bar X\); unbiased with \(\mathrm{MSE}=\theta^2/(8n)\to0\) (consistent).
CH. 07 Statistics

Confidence Intervals & Sample Size

Build an interval, back-solve the sample size, recover n from a published CI, handle asymmetric tails — appears in all 7 past exams.

Sections 06
Flashcards 7
Exercises 7
Read time 3'
Sources Ross, Prob & Stats for Engineers 5E, §7.3 (CI mean), §7.4 (two-mean), §7.5 (proportion), §6.3.2 (sample size) · Past exams: MACSf1/2, MAIf2/3/5, sampletest1, EBf1
Key concepts
estimate ± margin z critical values CI for the mean CI for a proportion Sample size back-solve Worst-case p(1−p)=¼ Asymmetric tails z vs t

The CI machinery

Every confidence interval is estimate ± margin, the margin being a critical value times a standard error.

\[ \text{mean (known }\sigma\text{ / large }n): \quad \bar x\pm z_{\alpha/2}\,\frac{\sigma}{\sqrt n}\quad(\text{use }s\text{ if }\sigma\text{ unknown, large }n). \]\[ \text{proportion}: \quad \hat p\pm z_{\alpha/2}\,\sqrt{\frac{\hat p(1-\hat p)}{n}}. \]

Critical values: \(z_{0.025}=1.96\) (95%), \(z_{0.005}=2.576\) (99%), \(z_{0.05}=1.645\) (90%). The total length of a symmetric CI is twice the margin, \(2z_{\alpha/2}\cdot\text{SE}\).
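
The two formulas translate directly into code. This is a sketch using `NormalDist` from Python's standard `statistics` module for the critical value; the function names `z_crit`, `mean_ci` and `proportion_ci` are made up here. The example call uses the toothbrush numbers from this chapter's first exercise (\(n=100\), \(\bar x=0.9\), \(s=0.2\), 99%).

```python
from math import sqrt
from statistics import NormalDist

def z_crit(tail):
    """Upper-tail critical value z_tail, i.e. P(Z > z_tail) = tail."""
    return NormalDist().inv_cdf(1 - tail)

def mean_ci(xbar, sd, n, conf):
    """z interval for a mean: xbar +/- z_{alpha/2} * sd / sqrt(n)."""
    margin = z_crit((1 - conf) / 2) * sd / sqrt(n)
    return xbar - margin, xbar + margin

def proportion_ci(p_hat, n, conf):
    """z interval for a proportion: p_hat +/- z_{alpha/2} * sqrt(p_hat(1-p_hat)/n)."""
    margin = z_crit((1 - conf) / 2) * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

lo, hi = mean_ci(xbar=0.9, sd=0.2, n=100, conf=0.99)
print(round(lo, 4), round(hi, 4))
```

Because the code uses the unrounded critical value, the last decimals may differ slightly from a hand calculation with table values.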

Sample size: back-solve from a target precision

Given a target on the interval (total length \(L\), or half-width / accuracy \(d=L/2\)), set the length condition and solve for \(n\), then round up.

\[ \text{mean}: \; 2z_{\alpha/2}\frac{\sigma}{\sqrt n}\le L \;\Rightarrow\; n\ge\left(\frac{2z_{\alpha/2}\sigma}{L}\right)^2. \]\[ \text{proportion}: \; n\ge\frac{z_{\alpha/2}^2\,p(1-p)}{d^2}, \quad\text{worst case }p(1-p)=\tfrac14. \]

When \(\hat p\) is unknown (sample not yet taken), use the worst case \(p(1-p)\le\frac14\) — it guarantees the precision whatever the true \(p\).
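
The back-solve can be sketched as two small functions. The names are illustrative, and \(z\) is passed in explicitly so that the table values quoted in this chapter (1.96, 2.576, 1.645) are used as on the exam. The three calls reuse the numbers from this chapter's sample-size exercises.

```python
from math import ceil

def n_for_mean(total_length, sigma, z):
    """Smallest n with 2 * z * sigma / sqrt(n) <= total_length."""
    return ceil((2 * z * sigma / total_length) ** 2)

def n_for_proportion(half_width, z, p=0.5):
    """Smallest n with z * sqrt(p(1-p)/n) <= half_width; p = 0.5 is the worst case."""
    return ceil(z * z * p * (1 - p) / half_width ** 2)

# z values are the table values quoted above (1.96, 2.576, 1.645).
n_tires = n_for_mean(total_length=1568, sigma=3600, z=2.576)   # tire exercise, part (b)
n_pasta = n_for_proportion(half_width=0.05, z=1.96)            # total length 0.1 at 95%
n_poll = n_for_proportion(half_width=0.02, z=1.645)            # within +/-0.02 at 90%
print(n_tires, n_pasta, n_poll)
```

Rounding up is done by `ceil`. Near a boundary the answer can move by one if a critical value with more decimals is used, so it is safest to stick with the table value the exam expects.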

Recover n from a published interval

If a poll reports "with \(C\%\) confidence, support is between \(a\) and \(b\)", read off \(\hat p=\tfrac{a+b}{2}\) and half-width \(d=\tfrac{b-a}{2}\), then invert the margin:

\[ z_{\alpha/2}\sqrt{\frac{\hat p(1-\hat p)}{n}}=d \;\Rightarrow\; n=\frac{z_{\alpha/2}^2\,\hat p(1-\hat p)}{d^2}. \]

E.g. \((48\%,52\%)\) at 90%: \(\hat p=0.5\), \(d=0.02\), \(n=1.645^2(0.25)/0.02^2\approx1691\).
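
The inversion is a one-liner once \(\hat p\) and \(d\) are read off. A minimal sketch, with `recover_n` as a made-up name and the table value \(z_{0.05}=1.645\) passed in by hand:

```python
def recover_n(lower, upper, z):
    """Invert z * sqrt(p_hat(1-p_hat)/n) = d for n, reading p_hat and d off the interval."""
    p_hat = (lower + upper) / 2      # midpoint of the published interval
    d = (upper - lower) / 2          # half-width (margin)
    return z * z * p_hat * (1 - p_hat) / d ** 2

n = recover_n(0.48, 0.52, z=1.645)   # z_{0.05} = 1.645 for 90% confidence
print(n)
```

The result is not rounded up here, since the goal is to recover the sample size that was actually used rather than to guarantee a precision.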

Asymmetric tails (non-standard split)

A "95% CI" normally splits \(\alpha=0.05\) as \(2.5\%\) per tail. If the problem asks for an unequal split — say \(3\%\) in the lower tail, \(2\%\) in the upper — use a different z on each side:

\[ \Big(\bar x - z_{0.03}\tfrac{\sigma}{\sqrt n}, \; \bar x + z_{0.02}\tfrac{\sigma}{\sqrt n}\Big), \quad z_{0.03}=1.88,\; z_{0.02}=2.055. \]

The total tail probability still sums to \(\alpha\); only the per-side allocation changed, so the interval is no longer symmetric about \(\bar x\).
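
A sketch of the unequal-tail interval, again using `NormalDist` from the standard library. The function name and argument names are illustrative. The example call uses the data from this chapter's asymmetric-tails exercise (\(n=100\), \(\sigma=1\), \(\bar x=3.5\)).

```python
from math import sqrt
from statistics import NormalDist

def asymmetric_ci(xbar, sigma, n, lower_tail, upper_tail):
    """Known-sigma interval with unequal tail probabilities.
    The lower bound uses z_{lower_tail}, the upper bound z_{upper_tail}."""
    se = sigma / sqrt(n)
    z_lower = NormalDist().inv_cdf(1 - lower_tail)
    z_upper = NormalDist().inv_cdf(1 - upper_tail)
    return xbar - z_lower * se, xbar + z_upper * se

lo, hi = asymmetric_ci(xbar=3.5, sigma=1, n=100, lower_tail=0.03, upper_tail=0.02)
print(round(lo, 3), round(hi, 3))
```

The computed quantiles carry more decimals than the table values 1.88 and 2.055, so the bounds can differ from the hand calculation in the third or fourth decimal.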

z or t?

Default to z: known σ, or large \(n\) (≥30) so the CLT applies and \(s\) is a good plug-in. Use t\(_{n-1}\) only for a small sample from a normal population with unknown σ. For \(n=100\) the two barely differ (e.g. \(z_{0.025}=1.96\) vs \(t_{99,0.025}\approx1.98\)) and either is acceptable — the exams lean on z.

Recognition guide

Wording | Do
"compute a C% CI for the mean/proportion" | estimate ± \(z_{\alpha/2}\)·SE
"how large a sample", "length less than L", "accurate to within ±d" | back-solve \(n\), round up; prop → use ¼
"support between a% and b%, find n" | \(\hat p\)=mid, \(d\)=half-width, invert
"lower tail 3%, upper tail 2%" | different z each side (asymmetric)
small n, normal, σ unknown | t\(_{n-1}\); else z
Click any card to flip. Rate it after to track what you need to revisit.
Card 1
CI for a mean (known σ or large n) and CI for a proportion — write both.
Mean: \(\bar x\pm z_{\alpha/2}\,\sigma/\sqrt n\) (use s if σ unknown, large n). Proportion: \(\hat p\pm z_{\alpha/2}\sqrt{\hat p(1-\hat p)/n}\). Critical z: 1.96 (95%), 2.576 (99%), 1.645 (90%).
Card 2
Sample-size formula for a mean given target total length L, and for a proportion given half-width d?
Mean: \(n\ge(2z_{\alpha/2}\sigma/L)^2\). Proportion: \(n\ge z_{\alpha/2}^2\,p(1-p)/d^2\), worst case \(p(1-p)=\tfrac14\). Always round UP.
Card 3
Why and when do you use the worst case p(1−p)=¼ in a sample-size calculation?
When p̂ is unknown because the sample hasn't been taken. \(p(1-p)\) is maximised at \(p=\tfrac12\) giving \(\tfrac14\); using it guarantees the target precision for any true p.
Card 4
A poll reports "90% confident, support between 48% and 52%". How do you recover the sample size?
\(\hat p=(0.48+0.52)/2=0.5\), half-width \(d=0.02\). Invert: \(n=z_{0.05}^2\hat p(1-\hat p)/d^2=1.645^2(0.25)/0.02^2\approx1691\).
Card 5
How do you build a 95% CI with 3% in the lower tail and 2% in the upper tail?
Use a different z each side: \((\bar x - z_{0.03}\sigma/\sqrt n,\; \bar x + z_{0.02}\sigma/\sqrt n)\), with \(z_{0.03}=1.88\), \(z_{0.02}=2.055\). Tails sum to α=5% but the interval is asymmetric.
Card 6
z or t for a confidence interval — decision rule?
z if σ known or n large (≥30). t\(_{n-1}\) only for small n from a normal population with unknown σ. For n=100 they nearly coincide (1.96 vs ≈1.98); exams default to z.
Card 7
What is the total length of a symmetric confidence interval, and how does it scale with n and with confidence?
Length \(=2z_{\alpha/2}\cdot\text{SE}\). It shrinks like \(1/\sqrt n\) (quadruple n to halve it) and grows with higher confidence (bigger \(z_{\alpha/2}\)).
EX. 01

99% CI for mean toothbrush purchases

MACSf1-P4 easy

In a sample of \(n=100\), the number of toothbrushes bought per year has mean \(\bar x=0.9\) and sd \(s=0.2\). Compute an approximate 99% confidence interval for the population mean.

Large n → z CI with s. 99% → \(z_{0.005}=2.576\). SE \(=0.2/\sqrt{100}=0.02\).

Step 1. Margin
\(z_{0.005}\,s/\sqrt n=2.576\cdot0.02=0.05152\).
Step 2. Interval
\(0.9\pm0.05152\).
\((0.84848,\,0.95152)\).
EX. 02

Tire lifetimes — CI then required sample size

MAIf3-P5 medium

Tire lifetimes are normal with known \(\sigma=3600\) miles. A sample of \(n=81\) gave \(\bar x=28400\). (a) Build a 95% CI for the mean. (b) How large a sample gives a 99% CI shorter than the interval in (a)?

(a) \(z_{0.025}=1.96\), \(\sqrt{81}=9\). (b) Set the 99% length \(\le\) the (a) length, solve for n, round up.

Step 1. (a) Margin and CI
\(1.96\cdot3600/9=1.96\cdot400=784\); CI \(=28400\pm784=(27616,29184)\), length \(1568\).
Step 2. (b) Length condition
99% length \(=2\cdot2.576\cdot3600/\sqrt n=18547.2/\sqrt n\le1568\).
Step 3. Solve for n
\(\sqrt n\ge18547.2/1568=11.83\Rightarrow n\ge139.9\Rightarrow n\ge140\).
(a) \((27616,\,29184)\); (b) at least \(140\) tires.
EX. 03

Sample size for a proportion (length < 0.1)

MACSf2-P4 medium

Estimate the percentage of spaghetti eaters who use parmigiano. With \(\hat p\) unknown, what sample size gives a 95% CI of total length less than 0.1?

\(2\cdot1.96\sqrt{p(1-p)/n}<0.1\); use the worst case \(p(1-p)=\tfrac14\).

Step 1. Length condition
\(2\cdot1.96\sqrt{p(1-p)/n}<0.1\Rightarrow n>\dfrac{4\cdot1.96^2}{0.1^2}p(1-p)\).
Step 2. Worst case
Use \(p(1-p)=\tfrac14\): \(n>\dfrac{4\cdot3.8416}{0.01}\cdot\tfrac14=384.16\).
Step 3. Round up
\(n\ge385\).
\(n\ge385\).
EX. 04

Super Bowl poll — sample size within ±0.02

EBf1-P5 medium

How large a sample is needed to be 90% confident that the estimated proportion of households watching is accurate to within ±0.02?

Half-width \(d=0.02\), 90% → \(z=1.645\), worst case \(p(1-p)=\tfrac14\).

Step 1. Formula
\(n\ge\dfrac{z_{0.05}^2\,p(1-p)}{d^2}=\dfrac{1.645^2\cdot\tfrac14}{0.02^2}\).
Step 2. Evaluate
\(=\dfrac{2.706\cdot0.25}{0.0004}=1691.3\).
Step 3. Round up
\(n\ge1692\).
\(n\ge1692\).
EX. 05

Recover the sample size from a published CI

MAIf2-P5 medium

A poll states: "with 90% confidence, the minister's support is between 48% and 52%". Using the standard proportion-CI formula, how large was the sample?

\(\hat p=0.5\) (midpoint), half-width \(d=0.02\). Invert \(z\sqrt{\hat p(1-\hat p)/n}=d\).

Step 1. Read off the CI
\(\hat p=(0.48+0.52)/2=0.5\); margin \(=0.02\); 90% → \(z=1.645\).
Step 2. Invert
\(1.645\sqrt{0.25/n}=0.02\Rightarrow n=\dfrac{1.645^2\cdot0.25}{0.02^2}=1691.3\).
\(n\approx1691\) (i.e. about 1691–1692).
EX. 06

Confidence interval with asymmetric tails

MAIf5-P4 medium

\(n=100\) observations, normal with known \(\sigma=1\), \(\bar x=3.5\). Build a 95% CI but with 3% probability in the lower tail and 2% in the upper tail.

Different z each side: \(z_{0.03}=1.88\) (lower), \(z_{0.02}=2.055\) (upper). \(\sigma/\sqrt n=0.1\).

Step 1. Lower bound
\(3.5-1.88\cdot0.1=3.5-0.188=3.312\).
Step 2. Upper bound
\(3.5+2.055\cdot0.1=3.5+0.2055=3.7055\).
\((3.312,\,3.7055)\) — note it is not symmetric about \(3.5\).
EX. 07

Large-sample CI — z or t

sampletest1-P5 easy

\(n=100\) measurements give mean \(1.0\) and sample sd \(2.0\). Compute a 95% CI for the true value.

Large n → z is the standard choice (\(z_{0.025}=1.96\)); t with df 99 (≈1.98) gives almost the same answer. SE \(=2/\sqrt{100}=0.2\).

Step 1. z interval
\(1\pm1.96\cdot0.2=1\pm0.392=(0.608,1.392)\).
Step 2. t interval (alternative)
Using \(t_{99,0.025}\approx1.98\): \(1\pm1.98\cdot0.2=1\pm0.396=(0.604,1.396)\).
z-CI \((0.608,1.392)\); t-CI \((0.604,1.396)\) — both legitimate, nearly identical at \(n=100\).
CH. 08 Statistics

Hypothesis Testing

One skeleton, five standard-error variants: one/two mean, one/two proportion, paired. Appears in all 7 past exams.

Sections 05
Flashcards 7
Exercises 7
Read time 3'
Sources Ross, Prob & Stats for Engineers 5E, ch. 8 (§8.2–8.4 means, §8.6 proportions, asymptotic) · Past exams: MACSf1/2, MAIf2/3/5, sampletest1, EBf1
Key concepts
H₀ / H₁ Test statistic z vs t One- vs two-sided Critical value p-value Pooled proportion Paired differences

The test skeleton

Every test is the same four moves; only the standard error changes.

  1. State \(H_0,H_1\) — the direction comes from the wording (below).
  2. Pick the statistic and its null distribution: \( \text{TS}=\dfrac{\text{estimate}-\text{null}}{\text{SE}} \), \(\sim N(0,1)\) (large n / known σ) or \(t_{df}\) (small n, normal).
  3. Compute the observed value \(ts_{obs}\).
  4. Decide: reject if \(ts_{obs}\) falls in the rejection region — two-sided \(|ts|\ge z_{\alpha/2}\), one-sided \(ts\ge z_\alpha\) (or \(\le-z_\alpha\)). Or report the p-value and reject when \(\text{p-value}\le\alpha\).

Default to z (large samples / known σ); §8.6–8.7 proportion and Poisson tests are asymptotic-normal too. Use t only for small-n normal data (and the paired test).
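
The four moves can be sketched as one function for the z case. Everything here is illustrative: the name `z_test` and the labels `'two-sided'`, `'greater'` and `'less'` are choices made for this sketch, not notation from the course. The two calls use the bottling-machine and smokers exercises from this chapter.

```python
from math import sqrt
from statistics import NormalDist

def z_test(estimate, null_value, se, alternative, alpha=0.05):
    """Return (ts, reject) for a z test.
    alternative is 'two-sided', 'greater' or 'less' (labels chosen for this sketch)."""
    ts = (estimate - null_value) / se
    std = NormalDist()
    if alternative == "two-sided":
        reject = abs(ts) >= std.inv_cdf(1 - alpha / 2)
    elif alternative == "greater":
        reject = ts >= std.inv_cdf(1 - alpha)
    elif alternative == "less":
        reject = ts <= -std.inv_cdf(1 - alpha)
    else:
        raise ValueError("alternative must be 'two-sided', 'greater' or 'less'")
    return ts, reject

# Bottling machine: H0 mu = 750 vs H1 mu != 750, sigma = 5, n = 25, xbar = 745.
bottles = z_test(745, 750, 5 / sqrt(25), "two-sided")
# Smokers: H0 p <= 0.30 vs H1 p > 0.30, 66 of 200.
smokers = z_test(66 / 200, 0.30, sqrt(0.30 * 0.70 / 200), "greater")
print(bottles, smokers)
```

Only the `se` argument changes from one test type to another, which is the point of the skeleton.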

The standard-error table

Test | Statistic | Null dist.
1-mean, known σ / large n | \((\bar x-\mu_0)/(\sigma/\sqrt n)\) | N(0,1)
1-mean, small n | \((\bar x-\mu_0)/(s/\sqrt n)\) | \(t_{n-1}\)
paired | \((\bar d-0)/(s_d/\sqrt n)\) on \(d_i=\)before−after | \(t_{n-1}\)
2-mean, large n | \((\bar x_1-\bar x_2)/\sqrt{s_1^2/n_1+s_2^2/n_2}\) | N(0,1)
1-proportion | \((\hat p-p_0)/\sqrt{p_0(1-p_0)/n}\) | N(0,1)
2-proportion (pooled) | \((\hat p_1-\hat p_2)/\sqrt{\hat p_p(1-\hat p_p)(1/n_1+1/n_2)}\) | N(0,1)

Pooled proportion: \( \hat p_p=\dfrac{X_1+X_2}{n_1+n_2} \). Critical values: \(z_{0.025}=1.96\), \(z_{0.05}=1.645\); \(t_{9,0.05}=1.833\).
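
The two two-sample statistics from the table can be written out as follows. The function names are made up for this sketch; the calls use the treatment A vs B and two-coins exercises from this chapter.

```python
from math import sqrt

def two_mean_z(x1, s1, n1, x2, s2, n2):
    """Large-sample statistic (x1 - x2) / sqrt(s1^2/n1 + s2^2/n2)."""
    return (x1 - x2) / sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)

def two_proportion_z(successes1, n1, successes2, n2):
    """Pooled two-proportion statistic under H0: p1 = p2."""
    p1, p2 = successes1 / n1, successes2 / n2
    pooled = (successes1 + successes2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

ts_treatment = two_mean_z(105, 50, 140, 120, 60, 140)   # treatment A vs B exercise
ts_coins = two_proportion_z(430, 800, 400, 800)         # two-coins exercise
print(round(ts_treatment, 2), round(ts_coins, 2))
```

The sign of the two-mean statistic depends on the order of subtraction, so it has to match the direction written in \(H_1\).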

One-sided or two-sided? Read the wording

  • One-sided: "more than", "over 30%", "more effective", "improved", "larger mean" → \(H_1\) points one way; reject only in that tail with \(z_\alpha\) (1.645 at 5%).
  • Two-sided: "changed", "differ significantly", "need to recalibrate", "is there a difference" → \(H_1\neq\); reject in both tails with \(z_{\alpha/2}\) (1.96 at 5%).

For a two-group test, set \(H_0:\) difference \(=0\). The direction of \(H_1\) decides which tail; e.g. "B more effective" with statistic \((\bar x_A-\bar x_B)/SE\) rejects for small (negative) values.

p-value & the threshold significance level

The p-value is the null-probability of a statistic at least as extreme as observed, in the direction of \(H_1\): one-sided \(P(Z\ge ts_{obs})\); two-sided \(2P(Z\ge|ts_{obs}|)\). Reject whenever \(\alpha\ge\) p-value.
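
These definitions map onto the standard normal CDF. A minimal sketch, where the function name and the labels for the alternative are illustrative; the two statistics 0.93 and 2.4 are the ones from this chapter's smokers and bananas exercises.

```python
from statistics import NormalDist

def p_value(ts, alternative):
    """p-value of a z statistic; alternative is 'greater', 'less' or 'two-sided'
    (labels chosen for this sketch)."""
    phi = NormalDist().cdf
    if alternative == "greater":
        return 1 - phi(ts)
    if alternative == "less":
        return phi(ts)
    if alternative == "two-sided":
        return 2 * (1 - phi(abs(ts)))
    raise ValueError("unknown alternative")

p_smokers = p_value(0.93, "greater")
p_bananas = p_value(2.4, "greater")
print(round(p_smokers, 4), round(p_bananas, 4))
```

For the same statistic, the two-sided p-value is twice the one-sided value in the direction of the observed sign.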

"Find the significance levels at which \(H_0\) is rejected" is just the p-value: e.g. a one-sided \(ts=0.93\) gives \(P(Z\ge0.93)=1-\Phi(0.93)=1-0.8238=0.1762\), so reject for \(\alpha\ge17.62\%\). A \(ts=2.4\) gives p \(=1-\Phi(2.4)=0.0082\) → reject for \(\alpha\ge0.82\%\).

Recognition guide

Wording | Test
one group vs a target μ₀, "recalibrate / changed" | 1-mean z (two-sided)
one group vs target proportion, "more than X%" | 1-prop z (one-sided)
two groups, "more effective / larger" | 2-mean z (one-sided)
two proportions, "significant difference" | 2-prop pooled z (two-sided)
same units measured before/after (Initial/Final) | paired t on differences
"at what significance levels reject" | compute the p-value
Click any card to flip. Rate it after to track what you need to revisit.
Card 1
The four moves of any hypothesis test?
1) State \(H_0,H_1\) (direction from wording). 2) Statistic \(\text{TS}=(\text{estimate}-\text{null})/\text{SE}\), null dist z or \(t_{df}\). 3) Compute \(ts_{obs}\). 4) Reject if in the rejection region (\(|ts|\ge z_{\alpha/2}\) two-sided, \(ts\ge z_\alpha\) one-sided), or if p-value ≤ α.
Card 2
Standard errors: 1-mean (known σ), 2-mean (large n), 1-proportion, 2-proportion pooled.
1-mean: \(\sigma/\sqrt n\). 2-mean: \(\sqrt{s_1^2/n_1+s_2^2/n_2}\). 1-prop: \(\sqrt{p_0(1-p_0)/n}\). 2-prop: \(\sqrt{\hat p_p(1-\hat p_p)(1/n_1+1/n_2)}\), \(\hat p_p=(X_1+X_2)/(n_1+n_2)\).
Card 3
Paired test: when, and what's the statistic?
When the same units are measured before/after (Initial/Final pairs). Form differences \(d_i\), then \(\text{TS}=\dfrac{\bar d-0}{s_d/\sqrt n}\sim t_{n-1}\). It's a one-sample t-test on the differences.
Card 4
Which words signal a ONE-sided vs a TWO-sided alternative?
One-sided: "more than", "over X%", "more effective", "improved", "larger". Two-sided: "changed", "differ", "significant difference", "need to recalibrate". One-sided uses \(z_\alpha\) (1.645), two-sided \(z_{\alpha/2}\) (1.96).
Card 5
How do you compute a p-value, and how do you answer "at what α is H₀ rejected?"
One-sided: \(P(Z\ge ts_{obs})=1-\Phi(ts_{obs})\); two-sided: \(2P(Z\ge|ts_{obs}|)\). Reject for every \(\alpha\ge\) p-value. So the p-value IS the threshold significance level.
Card 6
Two-group test: how do you set H₀, and how does the H₁ direction pick the tail?
\(H_0:\) difference \(=0\). If \(H_1:\mu_B>\mu_A\) and the statistic is \((\bar x_A-\bar x_B)/SE\), you reject for small (negative) values; if the statistic is \((\bar x_B-\bar x_A)/SE\), reject for large values. Keep the sign convention consistent.
Card 7
z or t for a hypothesis test?
z: known σ, or large n (proportions and Poisson tests are asymptotic-normal too). t\(_{n-1}\): small n from a normal population with unknown σ, and the paired test. Exams are mostly large-sample z; t appears once (paired).
EX. 01

Recalibrate the bottling machine? (1-mean, two-sided)

MACSf1-P5 easy

A machine should fill bottles to a mean of 750 g; content is normal with known \(\sigma=5\) g. A sample of \(n=25\) gives \(\bar x=745\) g. Is there reason to recalibrate (α = 0.05)?

"Recalibrate" = changed = two-sided. \(\text{TS}=(\bar x-750)/(\sigma/\sqrt n)\), reject if \(|ts|\ge1.96\).

Step 1. Hypotheses
\(H_0:\mu=750\) vs \(H_1:\mu\neq750\).
Step 2. Statistic
\(ts=\dfrac{745-750}{5/\sqrt{25}}=\dfrac{-5}{1}=-5\).
Step 3. Decide
\(-5<-1.96\) → in the rejection region.
Reject \(H_0\): the machine needs recalibrating.
EX. 02

Over 30% smokers? (1-proportion, one-sided + p-value)

MACSf2-P5 medium

66 of 200 adults are smokers. (a) At α = 0.05, can we conclude more than 30% are smokers? (b) At which significance levels would \(H_0\) be rejected?

One-sided: \(H_0:p\le0.30\) vs \(H_1:p>0.30\). \(\text{TS}=(\hat p-0.3)/\sqrt{0.3\cdot0.7/n}\); reject if \(ts\ge1.645\). (b) is the p-value.

Step 1. Statistic
\(\hat p=66/200=0.33\); \(ts=\dfrac{0.33-0.3}{\sqrt{0.21/200}}=\dfrac{0.03}{0.0324}=0.93\).
Step 2. (a) Decide
\(0.93<1.645\) → do not reject \(H_0\).
Step 3. (b) p-value
Reject when \(z_\alpha\le0.93\): \(1-\alpha\le\Phi(0.93)=0.8238\Rightarrow\alpha\ge0.1762\).
(a) Do not reject — can't conclude >30%. (b) Reject for \(\alpha\ge17.62\%\).
EX. 03

Is treatment B more effective? (2-mean, one-sided)

MAIf2-P4 medium

Two groups of \(n=140\): A has \(\bar x_A=105\), \(s_A=50\); B has \(\bar x_B=120\), \(s_B=60\) (higher = more effective). Can we claim B is more effective (α = 0.05)?

\(H_1:\mu_B>\mu_A\), i.e. \(H_0:\mu_A-\mu_B\ge0\) vs \(H_1:\mu_A-\mu_B<0\). Statistic \((\bar x_A-\bar x_B)/\sqrt{s_A^2/n_A+s_B^2/n_B}\); reject if \(ts\le-1.645\).

Step 1. SE
\(\sqrt{2500/140+3600/140}=\sqrt{43.571}=6.601\).
Step 2. Statistic
\(ts=\dfrac{105-120}{6.601}=-2.27\).
Step 3. Decide
\(-2.27<-1.645\) → reject.
Reject \(H_0\): treatment B is more effective.
EX. 04

Heavier bananas from provider 1? (2-mean, p-value)

MAIf3-P4 medium

Two providers, \(n=128\) each: provider 1 \(\bar x_1=155.0\), \(s_1=10\); provider 2 \(\bar x_2=152.0\), \(s_2=10\). Is the claim that provider 1's bananas are heavier justified?

\(H_0:\mu_1-\mu_2\le0\) vs \(H_1:\mu_1-\mu_2>0\). Compute \(ts\), then \(p=P(Z\ge ts)\).

Step 1. Statistic
\(ts=\dfrac{155-152}{\sqrt{100/128+100/128}}=\dfrac{3}{\sqrt{1.5625}}=\dfrac{3}{1.25}=2.4\).
Step 2. p-value
\(P(Z\ge2.4)=1-\Phi(2.4)=0.0082\).
Step 3. Decide
Reject for \(\alpha\ge0.82\%\); at 5% (and 1%) → reject.
\(ts=2.4\), p \(=0.0082\) → reject \(H_0\): the claim is justified.
EX. 05

Do two coins differ? (2-proportion, pooled)

MAIf5-P5 medium

Two coins are each thrown 800 times: A shows Heads 430 times, B shows Heads 400 times. Is there a significant difference in their Heads probabilities (α = 0.05)?

Two-sided 2-proportion test with pooled \(\hat p_p=(430+400)/1600\). Reject if \(|ts|\ge1.96\).

Step 1. Pooled proportion
\(\hat p_p=830/1600=0.51875\); \(\hat p_A=0.5375\), \(\hat p_B=0.5\).
Step 2. Statistic
\(ts=\dfrac{0.5375-0.5}{\sqrt{0.51875\cdot0.48125\,(1/800+1/800)}}=\dfrac{0.0375}{0.02498}=1.50\).
Step 3. Decide
\(1.50<1.96\) → do not reject.
Do not reject \(H_0\): the difference is not significant.
EX. 06

Did training improve scores? (paired t)

sampletest1-P2 hard

Ten gamers' scores are recorded before (Initial) and after (Final) a week of training. The improvements (Final−Initial) are \(0.6,-0.7,0.4,-1.4,1.7,0.6,2.4,1.0,1.9,0.5\). Does the data support that training improved the average score (α = 0.05)?

Paired data → one-sample t on the differences. \(H_0:\mu\le0\) vs \(H_1:\mu>0\). Critical \(t_{9,0.05}=1.833\).

Step 1. Difference summaries
\(\bar d=0.7\), \(s_d=1.15\), \(n=10\).
Step 2. Statistic
\(ts=\dfrac{0.7-0}{1.15/\sqrt{10}}=\dfrac{0.7}{0.3637}=1.92\).
Step 3. Decide
\(1.92>t_{9,0.05}=1.833\) → reject.
Reject \(H_0\): training improved the average score (this is the one exam where t, not z, is used).
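
The summaries in Step 1 can be recomputed from the raw differences. This sketch uses only the standard library, which has no t distribution, so the critical value \(t_{9,0.05}=1.833\) is typed in from the table quoted in the text rather than computed.

```python
from math import sqrt
from statistics import mean, stdev

improvements = [0.6, -0.7, 0.4, -1.4, 1.7, 0.6, 2.4, 1.0, 1.9, 0.5]  # Final - Initial

n = len(improvements)
d_bar = mean(improvements)
s_d = stdev(improvements)            # sample sd (divides by n - 1)
ts = d_bar / (s_d / sqrt(n))

T_CRIT = 1.833                       # t_{9,0.05}, table value quoted in the text
reject = ts > T_CRIT
print(round(d_bar, 2), round(s_d, 2), round(ts, 2), reject)
```

Note that `stdev` divides by \(n-1\), which is the \(s_d\) the t statistic requires; `pstdev` would divide by \(n\) and give a slightly larger statistic.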
EX. 07

Has the average grade changed? (1-mean, two-sided)

EBf1-P4 easy

Historical average grade is 23.5. A recent exam with \(n=100\) students had mean 25.0, sd 2.5. Can we conclude the average changed (α = 0.05)?

"Changed" = two-sided. Large n → z with s. Reject if \(|ts|\ge1.96\).

Step 1. Hypotheses
\(H_0:\mu=23.5\) vs \(H_1:\mu\neq23.5\).
Step 2. Statistic
\(ts=\dfrac{25.0-23.5}{2.5/\sqrt{100}}=\dfrac{1.5}{0.25}=6\).
Step 3. Decide
\(6\gg1.96\) → reject.
Reject \(H_0\): the average grade has significantly changed.
CH. 10 Statistics

Cheatsheet & Decision Map

The open-book weapon: which procedure → which formula. Print this and bring it to the exam.

Sections 07
Flashcards 0
Exercises 0
Read time 3'
Sources Synthesis of chapters 1–9 · exam-taxonomy.md
Key concepts
Decision tree CI & test formulas Critical values Estimator recipes Bayes template Normal shortcuts

Which procedure? — top-level routing

The question is about… | Go to
"chosen at random then observe", "given the result, prob it was…" | Bayes / total probability
two overlapping groups, "at least one", "not the other" | inclusion–exclusion
"choose k without replacement", "all different" | counting \(\binom{n}{k}\)
"normally distributed", a probability / percentile / unknown σ | standardize Z, Φ
weighted/grouped mean, add an observation, back-solve a size | descriptive (Σx=nx̄)
many i.i.d. units, prob the TOTAL exceeds a threshold | CLT: \(N(n\mu,n\sigma^2)\)
symbolic estimator T: unbiased? MSE? efficient? find a / c? | estimator theory
"write the likelihood / MLE"; "moments estimator" | MLE / method-of-moments
"confidence interval", "how large a sample" | CI / sample size
"is there reason to conclude / more than / changed" | hypothesis test

Inference: which CI / which test?

Walk the questions in order:

  1. CI or test? "compute an interval / how large a sample" → CI. "is there reason / more than / changed" → test.
  2. Mean or proportion? averages/measurements → mean. percentages/counts of successes → proportion.
  3. One sample or two? one group vs a target → one-sample. two groups compared → two-sample. same units before/after → paired.
  4. z or t? known σ or large n (≥30) → z. small n, normal, unknown σ → t. (Proportions always z.)
  5. One- or two-sided? "more / over / improved" → one-sided \(z_\alpha\). "changed / differ" → two-sided \(z_{\alpha/2}\).

CI & test formula bank (T1 + T2 spine)

Case | CI: estimate ± margin | Test statistic
1-mean, σ known / large n | \(\bar x\pm z_{\alpha/2}\sigma/\sqrt n\) | \((\bar x-\mu_0)/(\sigma/\sqrt n)\)
1-mean, small n | \(\bar x\pm t_{n-1,\alpha/2}s/\sqrt n\) | \((\bar x-\mu_0)/(s/\sqrt n)\sim t_{n-1}\)
paired | \(\bar d\pm t_{n-1,\alpha/2}s_d/\sqrt n\) | \(\bar d/(s_d/\sqrt n)\sim t_{n-1}\)
2-mean, large n |  | \((\bar x_1-\bar x_2)/\sqrt{s_1^2/n_1+s_2^2/n_2}\)
1-proportion | \(\hat p\pm z_{\alpha/2}\sqrt{\hat p(1-\hat p)/n}\) | \((\hat p-p_0)/\sqrt{p_0(1-p_0)/n}\)
2-proportion (pooled) |  | \((\hat p_1-\hat p_2)/\sqrt{\hat p_p(1-\hat p_p)(1/n_1+1/n_2)}\)

\(\hat p_p=\dfrac{X_1+X_2}{n_1+n_2}\). Reject: two-sided \(|ts|\ge z_{\alpha/2}\); one-sided \(ts\ge z_\alpha\) (or \(\le-z_\alpha\)). p-value: one-sided \(1-\Phi(ts)\), two-sided \(2(1-\Phi(|ts|))\); reject when \(\alpha\ge\) p-value.

Sample size (round UP)

Target | Formula
mean, total length \(L\) | \(n\ge(2z_{\alpha/2}\sigma/L)^2\)
proportion, half-width \(d\) | \(n\ge z_{\alpha/2}^2\,p(1-p)/d^2\), worst case \(p(1-p)=\tfrac14\)
recover n from CI | \(\hat p\)=mid, \(d\)=half-width, \(n=z_{\alpha/2}^2\hat p(1-\hat p)/d^2\)

Critical values & Φ shortcuts

Confidence / tail | z
90% (two-sided) / 0.05 tail | \(z_{0.05}=1.645\)
95% / 0.025 tail | \(z_{0.025}=1.96\)
99% / 0.005 tail | \(z_{0.005}=2.576\)
0.03 / 0.02 tails (asymmetric) | \(z_{0.03}=1.88,\;z_{0.02}=2.055\)
small-n paired | \(t_{9,0.05}=1.833\)

Φ values: \(\Phi(0.5)=0.6915\), \(\Phi(1)=0.8413\), \(\Phi(2)=0.9772\), \(\Phi(2.4)=0.9918\), \(\Phi(1/3)=0.6293\). Symmetry \(\Phi(-z)=1-\Phi(z)\). Central band \(2\Phi(d/\sigma)-1\). Inverse: \(\Phi^{-1}(0.975)=1.96\), \(\Phi^{-1}(0.75)=0.675\). Standardize \(Z=(X-\mu)/\sigma\).
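
When no table is at hand, the z critical values and Φ values can be regenerated with `NormalDist` from Python's standard `statistics` module. A minimal sketch; `z_upper` is a made-up helper name. Values computed this way carry more decimals than printed tables, so the last digit can differ from a table lookup.

```python
from statistics import NormalDist

std = NormalDist()          # standard normal, mean 0 and sd 1

def z_upper(tail):
    """Critical value z_tail with P(Z > z_tail) = tail."""
    return std.inv_cdf(1 - tail)

for tail in (0.05, 0.025, 0.005, 0.03, 0.02):
    print(tail, round(z_upper(tail), 3))

for z in (0.5, 1, 2, 2.4):
    print(z, round(std.cdf(z), 4))
```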

Estimator-theory recipes (T3 / T10)

  • MSE: \(\mathrm{MSE}=\mathrm{Var}+\mathrm{Bias}^2\); Bias \(=E[T]-\theta\). Unbiased ⇒ MSE=Var.
  • Var of combo: \(\mathrm{Var}(aU+bW)=a^2\mathrm{Var}(U)+b^2\mathrm{Var}(W)\) (independent).
  • Unbiased weights: any \(c\bar X_1+(1-c)\bar X_2\) is unbiased for μ (weights sum to 1).
  • Min-MSE weight: inverse-variance, \(c^*=\dfrac{n_1}{n_1+n_2}\) for sample means.
  • Find constant: set \(E[T]=\theta\); use \(E[X^2]=\sigma^2+\mu^2\), \(E[X_iX_j]=\mu^2\) (indep).
  • Jensen: \(\sqrt T\) (or log, 1/T) of an unbiased T is biased.
  • MLE: \(L=\prod f\) → \(\log\) → \(d/d\theta=0\). Geometric → \(\hat\theta=1/\bar X\).
  • Method-of-moments: set \(E[X]=\bar X\) (as a function of θ), invert.
  • Moments: Bernoulli \(E{=}p,V{=}p(1{-}p)\); Poisson \(E{=}V{=}\lambda\); \(\mathrm{Var}(\bar X_m)=\sigma^2/m\).

Probability templates (T4 / counting / CLT)

  • Total probability: \(P(E)=\sum_i P(E\mid H_i)P(H_i)\).
  • Bayes: \(P(H_j\mid E)=\dfrac{P(E\mid H_j)P(H_j)}{\sum_i P(E\mid H_i)P(H_i)}\).
  • Coin mixture: \(P(X=0\mid N=n)=(\tfrac12)^n\).
  • Conditional in Bernoulli: disjoint arrangements ÷ binomial; \(p\) cancels.
  • Inclusion–exclusion: \(P(B\cup C)=P(B)+P(C)-P(B\cap C)\); \(P(B\cap C^c)=P(B)-P(B\cap C)\).
  • Counting: \(P=\#\text{fav}/\binom{n}{k}\); complement for "different", fix items for "included".
  • CLT (sum): \(S_n\approx N(n\mu,n\sigma^2)\), \(Z=\dfrac{s-n\mu}{\sigma\sqrt n}\) (SUM uses \(\sigma\sqrt n\), not \(\sigma/\sqrt n\)).
  • Binomial→Normal: \(N(np,np(1-p))\); skip \(\pm\tfrac12\) if "no continuity correction".
  • Normal models→prob: \(\mathrm{Var}=E[X^2]-(E[X])^2\), \(E[X(X-1)]=E[X^2]-E[X]\).
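
The CLT-for-sums template from the list above can be sketched as follows. The numbers in the call (100 units, mean 2, sd 1, threshold 210) are hypothetical and chosen only so that the standardised value comes out as \(z=1\); the function name is illustrative.

```python
from math import sqrt
from statistics import NormalDist

def prob_sum_exceeds(threshold, n, mu, sigma):
    """CLT approximation to P(S_n > threshold) with S_n ~ N(n*mu, n*sigma^2)."""
    z = (threshold - n * mu) / (sigma * sqrt(n))   # sum uses sigma * sqrt(n)
    return 1 - NormalDist().cdf(z)

# Hypothetical numbers: 100 i.i.d. units with mean 2 and sd 1, threshold 210.
p_exceed = prob_sum_exceeds(210, n=100, mu=2, sigma=1)
print(round(p_exceed, 4))
```

Dividing by \(\sigma/\sqrt n\) instead would be the right scaling for the sample mean, not for the total.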