📚 Chapter 7 — Homework (CLT · Bootstrap · MLE)

✅ Before you start (1 min)

  1. Use Excel .xlsx. Calculation: Automatic. Keep CSV header in A1; paste numbers starting at A2.
  2. Controls: B1 = n (sample size) and B2 = #reps (how many sample means you simulate).
  3. Put the big random block at H2 so it never overwrites your summaries on the left.
  4. #SPILL!? Convert Tables→Range; unmerge cells; clear spill area; remove any leading @ from formulas.
  5. Not on Excel 365? Use the “No-365 (drag)” formulas (given under each part).

⬇️ Download data (fixed seeds; identical for everyone)

CLT_Pop_Normal.csv Bootstrap_One_Normal.csv OLS_Practice_Normal.csv
Headers: value (CLT); x (Bootstrap); x,y (OLS). Paste starting at A2.

🧭 CLT — Guided experiment (do this first!)

  1. Step 1 (rough view): Set B1 = n = 5 and B2 = #reps = 20. You’re taking 20 small samples of size 5 from the population (N=5,000) and computing 20 sample means. Why? With few reps, your histogram of means is noisy and your estimate of SD( X̄ ) is unstable.
  2. Step 2 (stabilize the picture): Keep B1 = 5, change B2 to 1000. Why? More reps = better approximation of the sampling distribution. Your histogram of should look bell-ish and your empirical SD( X̄ ) should be close to the theory s / √n.
  3. Step 3 (reduce noise per sample): Now change B1 to 30 (keep B2 = 1000). Why? Larger n shrinks variability of by 1/√n. Compare the empirical SD( X̄ ) for n=5 vs n=30 to the theory ratio √(5/30).
  4. Step 4 (report): Compare your coverage rate for “μ ± 1.96·(s/√n)” at n=5 vs n=30. Which is closer to 95% and why? (Hint: larger n → sampling dist. of is more Normal and narrower.)

🧪 7.2 — Central Limit Theorem (CLT): Questions & Build

Goal: Build the sampling distribution of for n = 5 and n = 30, verify SD(X̄) ≈ s/√n, check ~95% coverage, and describe shape/center/spread.

Build this first (use CLT_Pop_Normal.csv in A2:A5001)

Set: B1 = n (first 5, then 30) · B2 = #reps (20 → then 1000).

Excel 365 (no drag) — put at H2 (creates n×#reps block):

=INDEX($A$2:$A$5001, RANDARRAY($B$1, $B$2, 1, ROWS($A$2:$A$5001), TRUE))

No-365 (drag) — put at H2, then fill down B1 rows and right B2 columns:

=INDEX($A$2:$A$5001, RANDBETWEEN(1, ROWS($A$2:$A$5001)))

Row of sample means (one mean per column):

=BYCOL(H2:INDEX(H:XFD, 1+$B$1, $B$2), LAMBDA(col, AVERAGE(col)))
=AVERAGE(H2:INDEX(H:H, 1+$B$1)) ← put in H1 and drag right across B2 columns (No-365)

Summaries for comparison (put these in B4:B9, left side)

=AVERAGE($A$2:$A$5001)
=STDEV.P($A$2:$A$5001)
=STDEV.S(H1:INDEX(1:1, COLUMN(H1)+$B$2-1))
=B5/SQRT($B$1)
=B6/B7
=LET(means, H1:INDEX(1:1, COLUMN(H1)+$B$2-1), low, B4-1.96*B5/SQRT($B$1), high, B4+1.96*B5/SQRT($B$1), COUNTIFS(means, ">="&low, means, "<="&high)/$B$2)

B4 = population mean; B5 = population σ (using STDEV.P since the file is the whole pop; STDEV.S is OK too for N=5000).
B6 = empirical SD of means; B7 = theory σ/√n; B8 = ratio (emp/theory); B9 = coverage % inside μ ± 1.96·(σ/√n).

Questions to answer (put in your PDF)

  1. A1 — Population baseline: report B4 and B5.
  2. A2 — n=5, 20 reps vs 1000 reps: how do histogram shape and B6 change when you increase B2?
  3. A3 — n=30 (1000 reps): report B6, B7, and B8. Compare shrink factor to √(5/30).
  4. A4 — 95% coverage: report B9 for n=5 and for n=30. Which is closer to 95%? Why?
  5. A5 — Histograms: paste two histograms (H1 for n=5; H1 for n=30). 2–3 bullets on shape/center/spread.
  6. A6 — Takeaway: one sentence on what CLT lets you assume about averages and why increasing n helps.
Tip: Histograms are made from the row H1 (the sample means). If #SPILL!, reduce B2 or clear cells right of H2.

🧪 7.3.4 — Bootstrap (one-sample mean): Questions & Build

Goal: Estimate SE and 95% CIs for a mean via bootstrap and compare with the t-interval.

Build this (use Bootstrap_One_Normal.csv in A2:A31; A1 = x)

Set: B1 = 30 samples; use 100 bootstrap resamples.

Original sample summaries (e.g., B10:B15):

=AVERAGE($A$2:$A$31)
=STDEV.S($A$2:$A$31)
=B11/SQRT(B1)
=T.INV.2T(0.05, B1-1)
=B10 - B13*B12
=B10 + B13*B12

Resamples at H2 → 30×100 block (365 vs No-365):

=INDEX($A$2:$A$31, RANDARRAY($B$1, 100, 1, $B$1, TRUE))
=INDEX($A$2:$A$31, RANDBETWEEN(1, $B$1))

Row of 100 bootstrap means at H1:

=BYCOL(H2:INDEX(H:XFD, 1+$B$1, 100), LAMBDA(col, AVERAGE(col)))
=AVERAGE(H2:INDEX(H:H, 1+$B$1)) ← put in H1 and drag right across 100 columns (No-365)

Report from H1 across 100 means:

=AVERAGE(H1:INDEX(1:1, COLUMN(H1)+100-1))
=STDEV.S(H1:INDEX(1:1, COLUMN(H1)+100-1))
=B10 - 1.96*STDEV.S(H1:INDEX(1:1, COLUMN(H1)+100-1))
=B10 + 1.96*STDEV.S(H1:INDEX(1:1, COLUMN(H1)+100-1))
=PERCENTILE.INC(H1:INDEX(1:1, COLUMN(H1)+100-1), 0.025)
=PERCENTILE.INC(H1:INDEX(1:1, COLUMN(H1)+100-1), 0.975)

Questions to answer

  1. B1 — Original sample: report , s, SE=s/√n, and the 95% t-interval.
  2. B2 — Bootstrap: report bootstrap mean, bootstrap SE, 95% normal-approx CI, and 95% percentile CI.
  3. B3 — Compare: which CI is wider? Are centers similar? One sentence why.
  4. B4 — Histogram: paste the histogram of the 100 bootstrap means (H1 row). If skewed, which CI would you trust more, and why?

🧪 7.4.2 — OLS = MLE (simple linear regression): Two Ways (Do Both)

Goal: Fit y = β₀ + β₁x. Show that minimizing SSE (Solver) matches the closed-form Excel result (INDEX+LINEST). Data: open OLS_Practice_Normal.csv, paste A2:B32 (A1:B1 = x,y).

Sanity check (this dataset): β̂₁ ≈ 4.8, β̂₀ ≈ −210, SSE ≈ ~2,000. (Tiny differences are okay.)

A) Solver (visual OLS)

  1. Set up parameters (cells to optimize):
    Put guesses in:
    =0 ← F2 (β0 guess)
    =1 ← F3 (β1 guess)
    Or type numbers directly into F2 and F3.
  2. Predictions & residuals:
    =$F$2 + $F$3*A2 ← C2, fill to C32 (ŷ)
    =B2 - C2 ← D2, fill to D32 (resid)
  3. Objective cell (sum of squared errors):
    =SUMSQ(D2:D32) ← F5 (SSE)
    365 one-cell variant (no residual column):
    =SUMXMY2(B2:B32, $F$2 + $F$3*A2:A32) ← F5
  4. Run Solver: Data → Solver
    • Set Objective: F5   To: Min
    • By Changing Variable Cells: F2:F3
    • Constraints: none   •   Method: GRG Nonlinear
    Click SolveKeep Solver Solution. Record β̂₀ (=F2), β̂₁ (=F3), SSE (=F5).

B) INDEX + LINEST (closed-form)

Excel’s LINEST(y,x,TRUE,FALSE) returns [slope, intercept].

  1. Get slope & intercept into cells:
    =INDEX(LINEST(B2:B32, A2:A32, TRUE, FALSE), 1) ← H2 (β1 slope)
    =INDEX(LINEST(B2:B32, A2:A32, TRUE, FALSE), 2) ← H3 (β0 intercept)
  2. Make predictions and SSE (using those cells):
    =$H$3 + $H$2*A2 ← C2, fill to C32
    =SUMSQ(B2:B32 - C2:C32) ← F6 (SSE)
    365 one-cell version (no helper cells):
    =SUMSQ(B2:B32 - (INDEX(LINEST(B2:B32,A2:A32,TRUE,FALSE),2) + INDEX(LINEST(B2:B32,A2:A32,TRUE,FALSE),1)*A2:A32))

✅ What to hand in (do both)

⚠️ Common pitfalls
📘 (Nice to know) The MLE logic

With Normal errors, the log-likelihood is const − (1/(2σ²))·SSE. For fixed σ², maximizing log-likelihood ⇔ minimizing SSE. So the Solver solution (min SSE) equals the closed-form OLS solution (LINEST).

📤 What to submit