📚 Chapter 7 — Homework (CLT · Bootstrap · MLE)
What this homework is for: to see the Central Limit Theorem in action, learn a simple bootstrap CI for a mean,
and understand why OLS = MLE for a straight line when errors are Normal.
Start with the **Guided CLT experiment** below so you know exactly why you’re changing the controls.
✅ Before you start (1 min)
- Use Excel .xlsx. Calculation: Automatic. Keep CSV header in A1; paste numbers starting at A2.
- Controls: B1 = n (sample size) and B2 = #reps (how many sample means you simulate).
- Put the big random block at H2 so it never overwrites your summaries on the left.
- #SPILL!? Convert Tables→Range; unmerge cells; clear spill area; remove any leading
@ from formulas.
- Not on Excel 365? Use the “No-365 (drag)” formulas (given under each part).
🧭 CLT — Guided experiment (do this first!)
- Step 1 (rough view): Set B1 = n = 5 and B2 = #reps = 20.
You’re taking 20 small samples of size 5 from the population (N=5,000) and computing 20 sample means.
Why? With few reps, your histogram of means is noisy and your estimate of SD( X̄ ) is unstable.
- Step 2 (stabilize the picture): Keep B1 = 5, change B2 to 1000.
Why? More reps = better approximation of the sampling distribution. Your histogram of X̄ should look bell-ish and
your empirical SD( X̄ ) should be close to the theory s / √n.
- Step 3 (reduce noise per sample): Now change B1 to 30 (keep B2 = 1000).
Why? Larger n shrinks variability of X̄ by 1/√n. Compare
the empirical SD( X̄ ) for n=5 vs n=30 to the theory ratio √(5/30).
- Step 4 (report): Compare your coverage rate for “μ ± 1.96·(s/√n)” at n=5 vs n=30. Which is closer to 95% and why?
(Hint: larger n → sampling dist. of X̄ is more Normal and narrower.)
🧪 7.2 — Central Limit Theorem (CLT): Questions & Build
Goal: Build the sampling distribution of X̄ for n = 5 and n = 30, verify SD(X̄) ≈ s/√n, check ~95% coverage, and describe shape/center/spread.
Build this first (use CLT_Pop_Normal.csv in A2:A5001)
Set: B1 = n (first 5, then 30) · B2 = #reps (20 → then 1000).
Excel 365 (no drag) — put at H2 (creates n×#reps block):
=INDEX($A$2:$A$5001, RANDARRAY($B$1, $B$2, 1, ROWS($A$2:$A$5001), TRUE))
No-365 (drag) — put at H2, then fill down B1 rows and right B2 columns:
=INDEX($A$2:$A$5001, RANDBETWEEN(1, ROWS($A$2:$A$5001)))
Row of sample means (one mean per column):
=BYCOL(H2:INDEX(H:XFD, 1+$B$1, $B$2), LAMBDA(col, AVERAGE(col)))
=AVERAGE(H2:INDEX(H:H, 1+$B$1)) ← put in H1 and drag right across B2 columns (No-365)
Summaries for comparison (put these in B4:B9, left side)
=AVERAGE($A$2:$A$5001)
=STDEV.P($A$2:$A$5001)
=STDEV.S(H1:INDEX(1:1, COLUMN(H1)+$B$2-1))
=B5/SQRT($B$1)
=B6/B7
=LET(means, H1:INDEX(1:1, COLUMN(H1)+$B$2-1), low, B4-1.96*B5/SQRT($B$1), high, B4+1.96*B5/SQRT($B$1), COUNTIFS(means, ">="&low, means, "<="&high)/$B$2)
B4 = population mean; B5 = population σ (using STDEV.P since the file is the whole pop; STDEV.S is OK too for N=5000).
B6 = empirical SD of means; B7 = theory σ/√n; B8 = ratio (emp/theory); B9 = coverage % inside μ ± 1.96·(σ/√n).
Questions to answer (put in your PDF)
- A1 — Population baseline: report B4 and B5.
- A2 — n=5, 20 reps vs 1000 reps: how do histogram shape and B6 change when you increase B2?
- A3 — n=30 (1000 reps): report B6, B7, and B8. Compare shrink factor to √(5/30).
- A4 — 95% coverage: report B9 for n=5 and for n=30. Which is closer to 95%? Why?
- A5 — Histograms: paste two histograms (H1 for n=5; H1 for n=30). 2–3 bullets on shape/center/spread.
- A6 — Takeaway: one sentence on what CLT lets you assume about averages and why increasing n helps.
Tip: Histograms are made from the row H1 (the sample means). If #SPILL!, reduce B2 or clear cells right of H2.
🧪 7.3.4 — Bootstrap (one-sample mean): Questions & Build
Goal: Estimate SE and 95% CIs for a mean via bootstrap and compare with the t-interval.
Build this (use Bootstrap_One_Normal.csv in A2:A31; A1 = x)
Set: B1 = 30 samples; use 100 bootstrap resamples.
Original sample summaries (e.g., B10:B15):
=AVERAGE($A$2:$A$31)
=STDEV.S($A$2:$A$31)
=B11/SQRT(B1)
=T.INV.2T(0.05, B1-1)
=B10 - B13*B12
=B10 + B13*B12
Resamples at H2 → 30×100 block (365 vs No-365):
=INDEX($A$2:$A$31, RANDARRAY($B$1, 100, 1, $B$1, TRUE))
=INDEX($A$2:$A$31, RANDBETWEEN(1, $B$1))
Row of 100 bootstrap means at H1:
=BYCOL(H2:INDEX(H:XFD, 1+$B$1, 100), LAMBDA(col, AVERAGE(col)))
=AVERAGE(H2:INDEX(H:H, 1+$B$1)) ← put in H1 and drag right across 100 columns (No-365)
Report from H1 across 100 means:
=AVERAGE(H1:INDEX(1:1, COLUMN(H1)+100-1))
=STDEV.S(H1:INDEX(1:1, COLUMN(H1)+100-1))
=B10 - 1.96*STDEV.S(H1:INDEX(1:1, COLUMN(H1)+100-1))
=B10 + 1.96*STDEV.S(H1:INDEX(1:1, COLUMN(H1)+100-1))
=PERCENTILE.INC(H1:INDEX(1:1, COLUMN(H1)+100-1), 0.025)
=PERCENTILE.INC(H1:INDEX(1:1, COLUMN(H1)+100-1), 0.975)
Questions to answer
- B1 — Original sample: report
x̄, s, SE=s/√n, and the 95% t-interval.
- B2 — Bootstrap: report bootstrap mean, bootstrap SE, 95% normal-approx CI, and 95% percentile CI.
- B3 — Compare: which CI is wider? Are centers similar? One sentence why.
- B4 — Histogram: paste the histogram of the 100 bootstrap means (H1 row). If skewed, which CI would you trust more, and why?
🧪 7.4.2 — OLS = MLE (simple linear regression): Two Ways (Do Both)
Goal: Fit y = β₀ + β₁x. Show that minimizing SSE (Solver) matches the closed-form Excel result (INDEX+LINEST). Data: open OLS_Practice_Normal.csv, paste A2:B32 (A1:B1 = x,y).
Sanity check (this dataset): β̂₁ ≈ 4.8, β̂₀ ≈ −210, SSE ≈ ~2,000. (Tiny differences are okay.)
A) Solver (visual OLS)
- Set up parameters (cells to optimize):
Put guesses in:
=0 ← F2 (β0 guess)
=1 ← F3 (β1 guess)
Or type numbers directly into F2 and F3.
- Predictions & residuals:
=$F$2 + $F$3*A2 ← C2, fill to C32 (ŷ)
=B2 - C2 ← D2, fill to D32 (resid)
- Objective cell (sum of squared errors):
=SUMSQ(D2:D32) ← F5 (SSE)
365 one-cell variant (no residual column):
=SUMXMY2(B2:B32, $F$2 + $F$3*A2:A32) ← F5
- Run Solver: Data → Solver
- Set Objective: F5 To: Min
- By Changing Variable Cells: F2:F3
- Constraints: none • Method: GRG Nonlinear
Click Solve → Keep Solver Solution. Record β̂₀ (=F2), β̂₁ (=F3), SSE (=F5).
B) INDEX + LINEST (closed-form)
Excel’s LINEST(y,x,TRUE,FALSE) returns [slope, intercept].
- Get slope & intercept into cells:
=INDEX(LINEST(B2:B32, A2:A32, TRUE, FALSE), 1) ← H2 (β1 slope)
=INDEX(LINEST(B2:B32, A2:A32, TRUE, FALSE), 2) ← H3 (β0 intercept)
- Make predictions and SSE (using those cells):
=$H$3 + $H$2*A2 ← C2, fill to C32
=SUMSQ(B2:B32 - C2:C32) ← F6 (SSE)
365 one-cell version (no helper cells):
=SUMSQ(B2:B32 - (INDEX(LINEST(B2:B32,A2:A32,TRUE,FALSE),2) + INDEX(LINEST(B2:B32,A2:A32,TRUE,FALSE),1)*A2:A32))
✅ What to hand in (do both)
- From Solver: β̂₀ (F2), β̂₁ (F3), SSE (F5).
- From INDEX+LINEST: β̂₀ (H3), β̂₁ (H2), SSE (F6).
- Compare: Are the β̂’s and SSE nearly identical? (They should be.) One sentence: “Why does this prove OLS = MLE under Normal errors?”
⚠️ Common pitfalls
- $ signs matter: If SSE doesn’t change when you tweak F2/F3, fix $F$2/$F$3 in C2 and refill.
- Objective must be one number: F5 should be a single numeric cell.
- Older Excel: If array math misbehaves, keep the residuals column and use =SUMSQ(D2:D32).
📘 (Nice to know) The MLE logic
With Normal errors, the log-likelihood is const − (1/(2σ²))·SSE. For fixed σ², maximizing log-likelihood ⇔ minimizing SSE. So the Solver solution (min SSE) equals the closed-form OLS solution (LINEST).
📤 What to submit
- CLT: small table (B4–B9) for n=5 & n=30, two histograms (H1 row), answers A1–A6.
- Bootstrap: summary table (original + bootstrap), histogram of H1, answers B1–B4.
- OLS/MLE: β̂1, β̂0, SSE (+ optional plot), answers C1–C3.