📊 Chapter 7: Sampling Distributions and Estimation


🎮 Interactive Apps (HTML/JS)

📽️ PPT Slides

Download Chapter 7 Slides

📝 Quiz

📚 Homework — Chapter 7 (Excel-Ready)

Complete in order. Use the exact datasets and formulas below (they mirror the apps). Submit one PDF with tables, histograms, CI endpoints, and short interpretations.

  1. 7.2 CLT (Normal source) — Use CLT_Pop_Normal.csv (N=5,000). Do n=5 and n=30 with 1,000 reps of \(\bar X\). Compare empirical SD(\(\bar X\)) to \(s/\sqrt{n}\). Include two histograms and a comment on shape/center/spread.
  2. 7.3.4 Bootstrap (100 resamples) — Use Bootstrap_One_Normal.csv (n=30). Create 100 bootstrap means, report Bootstrap SE, 95% percentile CI and normal-approx CI; include a histogram.
  3. 7.4.2 MLE via Solver (OLS) — Use OLS_Practice_Normal.csv (31 pairs). Fit \(y=\beta_0+\beta_1 x\) by minimizing SSE with Solver. Report \(\hat\beta_0,\hat\beta_1\), SSE, and a one-liner on why OLS=MLE under Normal errors.

Quick downloads:

⬇️ CLT_Pop_Normal.csv ⬇️ Bootstrap_One_Normal.csv ⬇️ OLS_Practice_Normal.csv

7.2 — Central Limit Theorem (Normal source, Excel-ready)

Use this data exactly: Download CLT_Pop_Normal.csv (N=5,000 from \( \mathcal{N}(100,20^2) \)). Paste values to Excel A2:A5001.

⬇️ CLT_Pop_Normal.csv
  1. Controls:
    B1 (n) = 5 ← then change to 30
    B2 (#reps) = 1000
  2. Resample with replacement (spill an \(n\times \#\text{reps}\) block):
    C2:
    =INDEX($A$2:$A$5001, RANDARRAY($B$1, $B$2, 1, ROWS($A$2:$A$5001), TRUE))
  3. Replicate means across columns:
    C1:
    =BYCOL(C2:INDEX(C:XFD, 1+$B$1, $B$2), LAMBDA(col, AVERAGE(col)))
    No BYCOL? Put =AVERAGE(C2:INDEX(C:C,1+$B$1)) in C1 and fill right to the number of reps.
  4. Compare SD(\(\bar X\)) to theory:
    Population mean : D2 = AVERAGE($A$2:$A$5001)
    Population SD : D3 = STDEV.S($A$2:$A$5001)
    Empirical SD of X̄: D4 = STDEV.S(C1:INDEX(1:1, $B$2+2))
    Theory SD of X̄ : D5 = D3/SQRT($B$1)
    Change B1 from 5 → 30 and observe that D4 ≈ D5; both shrink as \(n\) grows.
  5. (Optional) Insert → Histogram of the row of means (C1 across). Comment on shape/center/spread.
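
To cross-check the spreadsheet outside Excel, here is a minimal Python sketch of the same experiment. It regenerates a comparable population from \( \mathcal{N}(100,20^2) \) instead of reading CLT_Pop_Normal.csv, so its numbers will match only approximately; numpy is assumed to be available.

    import numpy as np

    rng = np.random.default_rng(7)            # fixed seed for reproducibility
    pop = rng.normal(100, 20, size=5000)      # stand-in for CLT_Pop_Normal.csv

    for n in (5, 30):
        # 1,000 replications of the sample mean, resampling with replacement
        idx = rng.integers(0, pop.size, size=(1000, n))
        means = pop[idx].mean(axis=1)
        print(f"n={n:2d}  empirical SD(xbar)={means.std(ddof=1):.3f}  "
              f"theory s/sqrt(n)={pop.std(ddof=1)/np.sqrt(n):.3f}")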

7.3.4 — Bootstrap (100 resamples, Normal data)

Use this data exactly: Download Bootstrap_One_Normal.csv (n=30 from \( \mathcal{N}(75, 12^2) \)). Paste to Excel A2:A31.

⬇️ Bootstrap_One_Normal.csv
  1. Original sample summaries:
    B1 (n) = COUNT($A$2:$A$31) ← 30
    B2 (x̄) = AVERAGE($A$2:$A$31)
    B3 (s) = STDEV.S($A$2:$A$31)
    B4 (SE) = B3/SQRT(B1)
    t* (df=n-1) = T.INV.2T(0.05, B1-1)
    95% t-Lower = B2 - t* * B4
    95% t-Upper = B2 + t* * B4
  2. 100 bootstrap resamples (spill; no drag):
    C2:
    =INDEX($A$2:$A$31, RANDARRAY($B$1, 100, 1, $B$1, TRUE))
    (Creates a 30×100 block; each column is one resample.)
  3. Bootstrap means (row spill of 100 values):
    C1:
    =BYCOL(C2:INDEX(C:XFD, 1+$B$1, 100), LAMBDA(col, AVERAGE(col)))
    No BYCOL? Put =AVERAGE(C2:INDEX(C:C,1+$B$1)) in C1 and fill right to 100.
  4. Bootstrap SE and CIs:
    Bootstrap mean = AVERAGE(C1:CX1)
    Bootstrap SE = STDEV.S(C1:CX1)
    Normal-approx 95% CI : lower = B2 - 1.96*(Bootstrap SE), upper = B2 + 1.96*(Bootstrap SE)
    Percentile 95% CI : lower = PERCENTILE.INC(C1:CX1, 0.025), upper = PERCENTILE.INC(C1:CX1, 0.975)
    Make a histogram of C1:CX1 (the 100 means spill from C1 across to CX1); add a short interpretation (avoid “probability the mean is in the interval” wording).
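
If you want to verify the Excel numbers independently, here is a minimal Python sketch of the same 100-resample bootstrap. It simulates a stand-in sample from \( \mathcal{N}(75,12^2) \) rather than reading Bootstrap_One_Normal.csv, so its endpoints will differ slightly; numpy is assumed.

    import numpy as np

    rng = np.random.default_rng(7)
    x = rng.normal(75, 12, size=30)           # stand-in for Bootstrap_One_Normal.csv

    B = 100                                   # number of bootstrap resamples
    boot_means = np.array([rng.choice(x, size=x.size, replace=True).mean()
                           for _ in range(B)])

    se_boot = boot_means.std(ddof=1)          # bootstrap SE
    lo_n, hi_n = x.mean() - 1.96 * se_boot, x.mean() + 1.96 * se_boot  # normal-approx CI
    lo_p, hi_p = np.percentile(boot_means, [2.5, 97.5])                # percentile CI
    print(f"Bootstrap SE = {se_boot:.3f}")
    print(f"Normal-approx 95% CI: ({lo_n:.2f}, {hi_n:.2f})")
    print(f"Percentile   95% CI: ({lo_p:.2f}, {hi_p:.2f})")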

7.4.2 — MLE in Simple Linear Regression (via Excel Solver)

Use this data exactly: Download OLS_Practice_Normal.csv with pairs \( (x_i,y_i) \) where \( x\approx 40\ldots 70 \) (tiny jitter) and \( y = -210 + 4.8x + \varepsilon \), \( \varepsilon\sim \mathcal{N}(0,8^2) \). Paste to Excel A2:B32 (31 rows).

⬇️ OLS_Practice_Normal.csv
  1. Set up columns (data in A:x, B:y, rows 2..32):
    F2 (β₀ guess) = 0 ← any start
    F3 (β₁ guess) = 1 ← any start

    C2 (Ŷ) = $F$2 + $F$3*A2 → fill down to C32
    D2 (Residual) = B2 - C2 → fill down to D32
    F5 (SSE) = SUMSQ(D2:D32)
  2. Solver (Data → Solver):
    • Set Objective: F5
    • To: Min
    • By Changing Cells: F2:F3
    • Method: GRG Nonlinear
    Report \( \hat\beta_0 \) (F2), \( \hat\beta_1 \) (F3), SSE (F5). You should see \( \hat\beta_1\approx 4.8 \) and \( \hat\beta_0\approx -210 \) (noise causes small variation).
  3. Why OLS = MLE here: with i.i.d. Normal errors and constant variance, maximizing likelihood ⇔ minimizing SSE, so the Solver OLS fit is also the MLE.
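
As an optional check on the Solver answer, here is a minimal Python sketch that minimizes the same SSE numerically and then compares it to the closed-form OLS solution (which is the MLE under i.i.d. Normal errors). It simulates data from the stated model instead of reading OLS_Practice_Normal.csv; numpy and scipy are assumed.

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(7)
    x = np.linspace(40, 70, 31)                         # stand-in for OLS_Practice_Normal.csv
    y = -210 + 4.8 * x + rng.normal(0, 8, size=x.size)

    def sse(beta):
        b0, b1 = beta
        return np.sum((y - (b0 + b1 * x)) ** 2)         # same objective Solver minimizes (F5)

    res = minimize(sse, x0=[0.0, 1.0], method="Nelder-Mead")   # start from beta0=0, beta1=1, like F2:F3
    b0_hat, b1_hat = res.x
    print(f"Solver-style fit: beta0={b0_hat:.2f}, beta1={b1_hat:.2f}, SSE={res.fun:.1f}")

    # Closed-form OLS agrees (and is the MLE when errors are i.i.d. Normal):
    b1_ols = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
    b0_ols = y.mean() - b1_ols * x.mean()
    print(f"Closed-form OLS:  beta0={b0_ols:.2f}, beta1={b1_ols:.2f}")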

⬇️ Data — generated locally (Normal only, fixed seed)

⬇️ CLT_Pop_Normal.csv ⬇️ Bootstrap_One_Normal.csv ⬇️ OLS_Practice_Normal.csv

📢 Student Q&A — Quick, plain-English answers (20)

Q1. What does the “mean” tell me in real life?
It’s the typical level. If quiz scores are 70, 80, 90, the mean is 80—roughly what you expect for a new student. In engineering, mean battery life (\(\bar x\)) is the average runtime you’d plan around.
Q2. Why does the average get less noisy with bigger samples?
Because \(\mathrm{SD}(\bar X)=\sigma/\sqrt{n}\). Doubling \(n\) divides the noise by \(\sqrt{2}\). Think of averaging many noisy sensor readings—the average is steadier.
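
A quick numerical illustration of that \(\sigma/\sqrt{n}\) scaling (a sketch, assuming numpy; the values are arbitrary):

    import numpy as np

    rng = np.random.default_rng(7)
    sigma = 10.0
    for n in (25, 50, 100):
        means = rng.normal(0, sigma, size=(20000, n)).mean(axis=1)
        print(f"n={n:3d}  SD(xbar)={means.std(ddof=1):.3f}  sigma/sqrt(n)={sigma/np.sqrt(n):.3f}")
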
Q3. What does the CLT actually do for me?
Even if the original data aren’t Normal, the mean \(\bar X\) is approximately Normal for moderate/large \(n\). That lets you use t/z tools for \(\bar X\). Example: commute times are skewed, but class-average commute time is close to Normal.
Q4. When do I use a t-interval instead of z?
Use t when population \(\sigma\) is unknown (typical) and you estimate it with \(s\). Excel: T.INV.2T(0.05, n-1) for 95%.
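
The same 95% t-interval in Python (a sketch, assuming scipy; the data values are hypothetical):

    import numpy as np
    from scipy import stats

    data = np.array([72.0, 75.5, 68.3, 80.1, 77.4, 74.2])  # hypothetical sample
    n, xbar, s = data.size, data.mean(), data.std(ddof=1)
    t_star = stats.t.ppf(0.975, df=n - 1)                   # same value as T.INV.2T(0.05, n-1)
    half = t_star * s / np.sqrt(n)
    print(f"95% t-interval: ({xbar - half:.2f}, {xbar + half:.2f})")
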
Q5. How do I word a 95% CI correctly?
“We are 95% confident the true mean lies between [lower, upper].” Don’t say “95% probability the mean is in the interval.”
Q6. What is a bootstrap sample in plain English?
It’s re-sampling your data with replacement, like drawing tickets from a hat and putting them back. Each resample has size \(n\), taken from your original \(n\).
Q7. How many bootstrap resamples do I need?
For class demos: 100 (fast). For more stable CI endpoints: 1,000+. Many use 5,000 when publishing.
Q8. Percentile CI vs Normal-approx CI for bootstrap—what’s the difference?
Percentile CI uses the 2.5th and 97.5th percentiles of the bootstrap estimates. Normal-approx uses mean ± 1.96×(bootstrap SE). If the bootstrap distribution is skewed, percentile CI is often more honest.
Q9. In a line \(y=\beta_0+\beta_1 x\), what does \(\beta_1\) mean?
Slope = expected change in \(y\) per 1 unit of \(x\). If \(x\)=study hours and \(y\)=score, \(\hat\beta_1=4.8\) means about +4.8 points per extra hour.
Q10. What does the intercept \(\beta_0\) mean?
It’s the predicted \(y\) when \(x=0\). Sometimes that point isn’t realistic (e.g., 0 hours of sleep), so interpret with care.
Q11. What are SSE, MSE, and RMSE?
SSE = \(\sum (y_i-\hat y_i)^2\). MSE = SSE/(n−k) (average squared error). RMSE = \(\sqrt{\text{MSE}}\) (error in original units). Excel: SSE=SUMSQ(residuals).
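
In code, once the fitted values are in hand (a sketch, assuming numpy; the numbers are hypothetical and k = 2 for a simple line):

    import numpy as np

    y     = np.array([10.0, 12.1, 14.3, 15.8, 18.2])  # hypothetical observed values
    y_hat = np.array([ 9.8, 12.5, 14.0, 16.1, 18.0])  # hypothetical fitted values
    k = 2                                              # parameters estimated (beta0, beta1)

    sse  = np.sum((y - y_hat) ** 2)   # Excel: SUMSQ(residuals)
    mse  = sse / (y.size - k)         # average squared error
    rmse = np.sqrt(mse)               # error back in the units of y
    print(f"SSE={sse:.3f}  MSE={mse:.3f}  RMSE={rmse:.3f}")
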
Q12. Why is OLS also MLE when errors are Normal?
Normal log-likelihood is \(\ell=-\frac{n}{2}\ln(2\pi\sigma^2)-\frac{1}{2\sigma^2}\sum (y-\beta_0-\beta_1x)^2\). For fixed \(\sigma^2\), maximizing \(\ell\) ⇔ minimizing SSE, so OLS = MLE.
Q13. What if errors aren’t Normal—does OLS break?
OLS still gives the least-squares line but may be sensitive to outliers/heavy tails. Alternatives: LAD (L1) regression, robust regression, or bootstrap CIs for more reliable inference.
Q14. Why is a t-interval wider than a z-interval?
Because we don’t know \(\sigma\) and estimate it with \(s\), adding extra uncertainty (especially for small \(n\)).
Q15. What’s Welch’s t-interval and when use it?
Two-sample mean difference when variances aren’t assumed equal. Use it for A/B tests with different spreads. It’s the default safe choice.
Q16. What does “label-preserving” bootstrap mean for two samples?
You resample within each group A and B separately (keep labels). This keeps group sizes and within-group structure the same in each resample.
Q17. How do outliers affect my line fit?
A single extreme \(x\) or \(y\) can swing the slope a lot. Check residual plots; consider transformations, winsorizing, or robust methods.
Q18. Is high \(R^2\) always good?
High \(R^2\) means the line explains more variation, but it doesn’t prove causation or guarantee good predictions outside your \(x\)-range (extrapolation risk!).
Q19. What are AIC and BIC in one line each?
They score fit with a penalty for model size \(k\). Lower is better. AIC \(=2k-2\ell(\hat\theta)\), BIC \(=k\ln n-2\ell(\hat\theta)\).
Q20. How many samples do I need for a target margin of error?
For mean with known \(\sigma\): \(n=\big(z^*\sigma/E\big)^2\). Example: want MOE \(E=2\) minutes for bus wait times, \(\sigma\approx 10\), 95% ⇒ \(z^*\approx1.96\): \(n\approx (1.96\cdot10/2)^2\approx 96.04\), so round up to \(n=97\).
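
The same calculation as a tiny helper (a sketch, assuming scipy for \(z^*\)):

    import math
    from scipy import stats

    def n_for_moe(sigma, E, conf=0.95):
        """Smallest n giving a margin of error of at most E for a mean (sigma known)."""
        z = stats.norm.ppf(0.5 + conf / 2)      # z* = 1.96 for 95%
        return math.ceil((z * sigma / E) ** 2)

    print(n_for_moe(sigma=10, E=2))             # bus-wait example -> 97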