Chapter 7 — Homework (CLT · Bootstrap · MLE) — Student Version (Excel-Ready)

✅ Before you start (1 min)

Use Excel .xlsx. Calculation: Automatic. Keep CSV header in A1; paste numbers starting at A2.
Controls: B1 = n (sample size) and B2 = #reps (how many sample means you simulate).
Put the big random block at H2 so it never overwrites your summaries on the left.
#SPILL!? Convert Tables→Range; unmerge cells; clear spill area; remove any leading @ from formulas.
Not on Excel 365? Use the “No-365 (drag)” formulas (given under each part).

🧭 CLT — Guided experiment (do this first!)

Step 1 (rough view): Set B1 = n = 5 and B2 = #reps = 20. You’re taking 20 small samples of size 5 from the population (N=5,000) and computing 20 sample means. Why? With few reps, your histogram of means is noisy and your estimate of SD( X̄ ) is unstable.
Step 2 (stabilize the picture): Keep B1 = 5, change B2 to 1000. Why? More reps = better approximation of the sampling distribution. Your histogram of X̄ should look bell-ish and your empirical SD( X̄ ) should be close to the theory s / √n.
Step 3 (reduce noise per sample): Now change B1 to 30 (keep B2 = 1000). Why? Larger n shrinks variability of X̄ by 1/√n. Compare the empirical SD( X̄ ) for n=5 vs n=30 to the theory ratio √(5/30).
Step 4 (report): Compare your coverage rate for “μ ± 1.96·(s/√n)” at n=5 vs n=30. Which is closer to 95% and why? (Hint: larger n → sampling dist. of X̄ is more Normal and narrower.)

🧪 7.2 — Central Limit Theorem (CLT): Questions & Build

Goal: Build the sampling distribution of X̄ for n = 5 and n = 30, verify SD(X̄) ≈ s/√n, check ~95% coverage, and describe shape/center/spread.

Build this first (use CLT_Pop_Normal.csv in A2:A5001)

Set: B1 = n (first 5, then 30) · B2 = #reps (20 → then 1000).

Excel 365 (no drag) — put at H2 (creates n×#reps block):

=INDEX($A$2:$A$5001, RANDARRAY($B$1, $B$2, 1, ROWS($A$2:$A$5001), TRUE))

No-365 (drag) — put at H2, then fill down B1 rows and right B2 columns:

=INDEX($A$2:$A$5001, RANDBETWEEN(1, ROWS($A$2:$A$5001)))

Row of sample means (one mean per column):

=BYCOL(H2:INDEX(H:XFD, 1+$B$1, $B$2), LAMBDA(col, AVERAGE(col)))

=AVERAGE(H2:INDEX(H:H, 1+$B$1)) ← put in H1 and drag right across B2 columns (No-365)

Summaries for comparison (put these in B4:B9, left side)

=AVERAGE($A$2:$A$5001)

=STDEV.P($A$2:$A$5001)

=STDEV.S(H1:INDEX(1:1, COLUMN(H1)+$B$2-1))

=B5/SQRT($B$1)

=B6/B7

=LET(means, H1:INDEX(1:1, COLUMN(H1)+$B$2-1), low, B4-1.96*B5/SQRT($B$1), high, B4+1.96*B5/SQRT($B$1), COUNTIFS(means, ">="&low, means, "<="&high)/$B$2)

B4 = population mean; B5 = population σ (using STDEV.P since the file is the whole pop; STDEV.S is OK too for N=5000).
B6 = empirical SD of means; B7 = theory σ/√n; B8 = ratio (emp/theory); B9 = coverage % inside μ ± 1.96·(σ/√n).

Questions to answer (put in your PDF)

A1 — Population baseline: report B4 and B5.
A2 — n=5, 20 reps vs 1000 reps: how do histogram shape and B6 change when you increase B2?
A3 — n=30 (1000 reps): report B6, B7, and B8. Compare shrink factor to √(5/30).
A4 — 95% coverage: report B9 for n=5 and for n=30. Which is closer to 95%? Why?
A5 — Histograms: paste two histograms (H1 for n=5; H1 for n=30). 2–3 bullets on shape/center/spread.
A6 — Takeaway: one sentence on what CLT lets you assume about averages and why increasing n helps.

Tip: Histograms are made from the row H1 (the sample means). If #SPILL!, reduce B2 or clear cells right of H2.

🧪 7.3.4 — Bootstrap (one-sample mean): Questions & Build

Goal: Estimate SE and 95% CIs for a mean via bootstrap and compare with the t-interval.

Build this (use Bootstrap_One_Normal.csv in A2:A31; A1 = x)

Set: B1 = 30 samples; use 100 bootstrap resamples.

Original sample summaries (e.g., B10:B15):

=AVERAGE($A$2:$A$31)

=STDEV.S($A$2:$A$31)

=B11/SQRT(B1)

=T.INV.2T(0.05, B1-1)

=B10 - B13*B12

=B10 + B13*B12

Resamples at H2 → 30×100 block (365 vs No-365):

=INDEX($A$2:$A$31, RANDARRAY($B$1, 100, 1, $B$1, TRUE))

=INDEX($A$2:$A$31, RANDBETWEEN(1, $B$1))

Row of 100 bootstrap means at H1:

=BYCOL(H2:INDEX(H:XFD, 1+$B$1, 100), LAMBDA(col, AVERAGE(col)))

=AVERAGE(H2:INDEX(H:H, 1+$B$1)) ← put in H1 and drag right across 100 columns (No-365)

Report from H1 across 100 means:

=AVERAGE(H1:INDEX(1:1, COLUMN(H1)+100-1))

=STDEV.S(H1:INDEX(1:1, COLUMN(H1)+100-1))

=B10 - 1.96*STDEV.S(H1:INDEX(1:1, COLUMN(H1)+100-1))

=B10 + 1.96*STDEV.S(H1:INDEX(1:1, COLUMN(H1)+100-1))

=PERCENTILE.INC(H1:INDEX(1:1, COLUMN(H1)+100-1), 0.025)

=PERCENTILE.INC(H1:INDEX(1:1, COLUMN(H1)+100-1), 0.975)

Questions to answer

B1 — Original sample: report x̄, s, SE=s/√n, and the 95% t-interval.
B2 — Bootstrap: report bootstrap mean, bootstrap SE, 95% normal-approx CI, and 95% percentile CI.
B3 — Compare: which CI is wider? Are centers similar? One sentence why.
B4 — Histogram: paste the histogram of the 100 bootstrap means (H1 row). If skewed, which CI would you trust more, and why?

🧪 7.4.2 — OLS = MLE (simple linear regression): Two Ways (Do Both)

Goal: Fit y = β₀ + β₁x. Show that minimizing SSE (Solver) matches the closed-form Excel result (INDEX+LINEST). Data: open OLS_Practice_Normal.csv, paste A2:B32 (A1:B1 = x,y).

Sanity check (this dataset): β̂₁ ≈ 4.8, β̂₀ ≈ −210, SSE ≈ ~2,000. (Tiny differences are okay.)

A) Solver (visual OLS)

Set up parameters (cells to optimize):
Put guesses in:

=0 ← F2 (β0 guess)

=1 ← F3 (β1 guess)

Or type numbers directly into F2 and F3.
Predictions & residuals:
=$F$2 + $F$3*A2 ← C2, fill to C32 (ŷ)

=B2 - C2 ← D2, fill to D32 (resid)
Objective cell (sum of squared errors):
=SUMSQ(D2:D32) ← F5 (SSE)

365 one-cell variant (no residual column):

=SUMXMY2(B2:B32, $F$2 + $F$3*A2:A32) ← F5
Run Solver: Data → Solver
- Set Objective: F5 To: Min
- By Changing Variable Cells: F2:F3
- Constraints: none • Method: GRG Nonlinear
Click Solve → Keep Solver Solution. Record β̂₀ (=F2), β̂₁ (=F3), SSE (=F5).

B) INDEX + LINEST (closed-form)

Excel’s LINEST(y,x,TRUE,FALSE) returns [slope, intercept].

Get slope & intercept into cells:
=INDEX(LINEST(B2:B32, A2:A32, TRUE, FALSE), 1) ← H2 (β1 slope)

=INDEX(LINEST(B2:B32, A2:A32, TRUE, FALSE), 2) ← H3 (β0 intercept)
Make predictions and SSE (using those cells):
=$H$3 + $H$2*A2 ← C2, fill to C32

=SUMSQ(B2:B32 - C2:C32) ← F6 (SSE)

365 one-cell version (no helper cells):

=SUMSQ(B2:B32 - (INDEX(LINEST(B2:B32,A2:A32,TRUE,FALSE),2) + INDEX(LINEST(B2:B32,A2:A32,TRUE,FALSE),1)*A2:A32))

✅ What to hand in (do both)

From Solver: β̂₀ (F2), β̂₁ (F3), SSE (F5).
From INDEX+LINEST: β̂₀ (H3), β̂₁ (H2), SSE (F6).
Compare: Are the β̂’s and SSE nearly identical? (They should be.) One sentence: “Why does this prove OLS = MLE under Normal errors?”

⚠️ Common pitfalls

$ signs matter: If SSE doesn’t change when you tweak F2/F3, fix $F$2/$F$3 in C2 and refill.
Objective must be one number: F5 should be a single numeric cell.
Older Excel: If array math misbehaves, keep the residuals column and use =SUMSQ(D2:D32).

📘 (Nice to know) The MLE logic

With Normal errors, the log-likelihood is const − (1/(2σ²))·SSE. For fixed σ², maximizing log-likelihood ⇔ minimizing SSE. So the Solver solution (min SSE) equals the closed-form OLS solution (LINEST).

📚 Chapter 7 — Homework (CLT · Bootstrap · MLE)