9.1 Hypothesis Testing: Concepts and Errors

1. Key Definitions

Null Hypothesis (H₀): The default assumption (e.g., μ = 50).
Alternative Hypothesis (H₁): Competing claim (e.g., μ ≠ 50).
Type I Error (α): Rejecting H₀ when it is actually true. Controlled by significance level (e.g., α = 0.05).
Type II Error (β): Failing to reject H₀ when it is false. Computed from the overlap of the H₁ distribution with the H₀ acceptance region.
Power: Probability of correctly rejecting H₀ when H₁ is true: 1 − β.

2. Visualizing Hypothesis Testing

This chart compares two normal distributions: one under H₀ (μ = 50) and one under H₁ (μ = 52). The blue curve shows the distribution under the null hypothesis. The red curve shows the distribution under the alternative hypothesis. The critical region is set using z = ±1.96, centered at 50.

Understanding Type I and II Errors

We calculate the critical values using the significance level α = 0.05 and z = ±1.96.
With μ₀ = 50, σ = 2.5, and n = 10, the standard error (SE) is σ / √n = 2.5 / √10 ≈ 0.79.
Therefore, critical values are 50 ± 1.96 × 0.79 → [48.45, 51.55].
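
As a quick check on the arithmetic above, here is a minimal Python sketch that uses only the standard library. The function name `critical_values` is an illustrative choice rather than part of the lesson, and the sketch assumes σ is known so that the normal (z) distribution applies.

```python
from math import sqrt
from statistics import NormalDist

def critical_values(mu0, sigma, n, alpha=0.05):
    """Two-tailed cutoffs for the sample mean under H0 (sigma assumed known)."""
    se = sigma / sqrt(n)                          # standard error of the mean
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 when alpha = 0.05
    return mu0 - z_crit * se, mu0 + z_crit * se

lower, upper = critical_values(mu0=50, sigma=2.5, n=10)
print(f"[{lower:.2f}, {upper:.2f}]")  # [48.45, 51.55]
```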

Type I Error (α): The probability that we incorrectly reject H₀ when H₀ is true. This occurs when the sample mean falls outside [48.45, 51.55] under the blue curve. Since α = 0.05, the combined tail area outside this interval under H₀ totals 5% of the probability mass.

Type II Error (β): The probability that we fail to reject H₀ when the true mean is μ = 52 (H₁ is true). Even though 52 is above 51.55, the red distribution (centered at 52) has some probability mass inside the range [48.45, 51.55].
To find this probability, we calculate the area of the red curve that falls between 48.45 and 51.55.
Under H₁: the z-scores for 48.45 and 51.55 become:
z₁ = (48.45 - 52) / 0.79 ≈ -4.49
z₂ = (51.55 - 52) / 0.79 ≈ -0.57

Using standard normal distribution tables:
Φ(-0.57) ≈ 0.2843, Φ(-4.49) ≈ 0.
So β = Φ(-0.57) - Φ(-4.49) ≈ 0.2843.

Interpretation: There is about a 28.4% chance that we will fail to reject H₀ even though the true mean is 52 (i.e., we miss the real difference).
Power = 1 − β = 1 − 0.2843 = 0.7157: About 71.6% chance to correctly reject H₀ if μ = 52.
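
The same β and power can be reproduced numerically. The sketch below is one possible way to do it, assuming a two-tailed z-test with known σ; `beta_two_tailed` is a hypothetical helper name. It measures how much of the H₁ sampling distribution falls between the two H₀ cutoffs.

```python
from math import sqrt
from statistics import NormalDist

def beta_two_tailed(mu0, mu1, sigma, n, alpha=0.05):
    """P(fail to reject H0) when the true mean is mu1 (two-tailed z-test)."""
    se = sigma / sqrt(n)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    lower, upper = mu0 - z_crit * se, mu0 + z_crit * se
    # Sampling distribution of the mean if H1 is true (the "red curve")
    under_h1 = NormalDist(mu=mu1, sigma=se)
    return under_h1.cdf(upper) - under_h1.cdf(lower)

beta = beta_two_tailed(mu0=50, mu1=52, sigma=2.5, n=10)
power = 1 - beta
print(f"beta = {beta:.3f}, power = {power:.3f}")  # roughly 0.284 and 0.716
```

Because the code does not round SE or the z-scores, its result can differ from the table-based figures in the last decimal place.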

5. How to Interpret

If |z| > 1.96: reject H₀ → evidence against H₀.
If |z| ≤ 1.96: fail to reject H₀ → not enough evidence.
α is chosen in advance (commonly 0.05) and determines the rejection region.
β is computed based on the probability mass of H₁ falling inside the H₀ acceptance region.
Type I error (α): false positive → reject a true null.
Type II error (β): false negative → fail to reject a false null.
Power = 1 - β: ability of the test to detect real differences.
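
The decision rule above can be sketched as a small function. This is only an illustration under the lesson's assumptions (two-tailed z-test, known σ). The name `z_test_two_tailed` and the sample means 51.8 and 51.0 are hypothetical values chosen for the demo.

```python
from math import sqrt
from statistics import NormalDist

def z_test_two_tailed(sample_mean, mu0, sigma, n, alpha=0.05):
    """Return the z statistic and the decision for a two-tailed z-test."""
    z = (sample_mean - mu0) / (sigma / sqrt(n))
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    decision = "reject H0" if abs(z) > z_crit else "fail to reject H0"
    return z, decision

# Hypothetical sample means, tested against mu0 = 50 with sigma = 2.5, n = 10
for xbar in (51.8, 51.0):
    z, decision = z_test_two_tailed(xbar, mu0=50, sigma=2.5, n=10)
    print(f"sample mean {xbar}: z = {z:.2f} -> {decision}")
```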

6. Practice Questions for Students

Practice 1 – Light Bulb Lifespan (Two-Tailed)

Question: A light bulb manufacturer claims the average lifespan of its bulbs is 1,000 hours. A sample of 36 bulbs shows a mean of 960 hours with a standard deviation of 80. Test the claim at the 0.05 significance level.

Claim: Mean lifespan is 1,000 hours. Sample mean = 960, σ = 80, n = 36.

H₀: μ = 1000, H₁: μ ≠ 1000

α = 0.05 (two-tailed) → Each tail gets α/2 = 0.025 → Critical z = ±1.96

SE = 80 / √36 = 13.33

z = (960 - 1000) / 13.33 = -3.00 → Reject H₀ → Lifespan differs from 1,000 hours.

Probability of a Type I error = α = 5%

If true mean = 980: Critical values = 1000 ± 1.96 × 13.33 = [973.87, 1026.13]

z₁ = (973.87 - 980) / 13.33 ≈ -0.46, z₂ = (1026.13 - 980) / 13.33 ≈ 3.46

β = Φ(3.46) - Φ(-0.46) ≈ 0.9997 - 0.3228 = 0.6769 → Power ≈ 32.3%
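
For readers who want to verify this solution without tables, the following sketch recomputes the test statistic, β, and power for the light bulb problem. It treats the sample standard deviation of 80 as σ, as the worked answer does. The variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

mu0, xbar, sigma, n, alpha = 1000, 960, 80, 36, 0.05
true_mean = 980  # the alternative considered in the solution

se = sigma / sqrt(n)
z = (xbar - mu0) / se
z_crit = NormalDist().inv_cdf(1 - alpha / 2)
reject = abs(z) > z_crit

# Acceptance region for the sample mean under H0
lower, upper = mu0 - z_crit * se, mu0 + z_crit * se

# Share of the true sampling distribution that falls inside that region
under_h1 = NormalDist(mu=true_mean, sigma=se)
beta = under_h1.cdf(upper) - under_h1.cdf(lower)

print(f"z = {z:.2f}, reject H0: {reject}")
print(f"beta = {beta:.3f}, power = {1 - beta:.3f}")  # roughly 0.677 and 0.323
```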

Practice 2 – Exam Scores (One-Tailed)

Question: A professor claims students average at least 75 points on an exam. A sample of 25 students has a mean of 72 with a standard deviation of 5. Test the claim at the 0.05 level.

Claim: Students average at least 75 points. Sample mean = 72, σ = 5, n = 25.

H₀: μ ≥ 75, H₁: μ < 75

α = 0.05 (one-tailed) → Critical z = -1.645

SE = 5 / √25 = 1

z = (72 - 75) / 1 = -3 → Reject H₀ → Students score lower than 75.

Probability of a Type I error = α = 5%

If true mean = 73: the cutoff for the sample mean is 75 − 1.645 × 1 = 73.355, so its z-score under μ = 73 is (73.355 − 73) / 1 = 0.355

β = P(sample mean > 73.355 | μ = 73) = 1 − Φ(0.355) ≈ 1 − 0.639 = 0.361 → Power = Φ(0.355) ≈ 63.9%
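
In a lower-tailed test, H₀ is retained when the sample mean lands above the single cutoff, so β is an upper-tail area of the true distribution. A minimal sketch of that calculation, assuming known σ and using the hypothetical helper name `beta_lower_tailed`:

```python
from math import sqrt
from statistics import NormalDist

def beta_lower_tailed(mu0, mu1, sigma, n, alpha=0.05):
    """P(fail to reject H0: mu >= mu0) when the true mean is mu1 < mu0."""
    se = sigma / sqrt(n)
    z_crit = NormalDist().inv_cdf(alpha)  # about -1.645 for alpha = 0.05
    cutoff = mu0 + z_crit * se            # reject H0 if sample mean < cutoff
    # We fail to reject whenever the sample mean is at or above the cutoff
    return 1 - NormalDist(mu=mu1, sigma=se).cdf(cutoff)

beta = beta_lower_tailed(mu0=75, mu1=73, sigma=5, n=25)
print(f"beta = {beta:.3f}, power = {1 - beta:.3f}")  # roughly 0.361 and 0.639
```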

Practice 3 – Manufacturing Defect Rate (Two-Tailed)

Question: A factory claims its defect rate is 3%. A sample of 16 items shows a defect rate of 2.5%, with a standard deviation of 0.6%. Test this at the 5% level.

H₀: μ = 3%, H₁: μ ≠ 3%, σ = 0.6%, n = 16, sample mean = 2.5%

SE = 0.6 / √16 = 0.15%

α = 0.05 (two-tailed) → Each tail gets α/2 = 0.025 → Critical z = ±1.96

z = (2.5 - 3) / 0.15 = -3.33 → Reject H₀ → Defect rate differs from 3%

Assume true mean = 2.8%:

Critical values in raw score = 3 ± 1.96 × 0.15 = [2.706, 3.294]

z₁ = (2.706 - 2.8)/0.15 = -0.63, z₂ = (3.294 - 2.8)/0.15 = 3.29

β = Φ(3.29) - Φ(-0.63) ≈ 0.9995 - 0.2643 ≈ 0.735 → Power ≈ 26.5%
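
To see how the power in this problem depends on the assumed true mean, the sketch below evaluates it at 2.8% (the value used in the solution) and at two further values, 2.6% and 2.4%, which are hypothetical additions for comparison. `power_two_tailed` is an illustrative name, and all quantities are in percentage points.

```python
from math import sqrt
from statistics import NormalDist

def power_two_tailed(mu0, mu1, sigma, n, alpha=0.05):
    """P(reject H0) in a two-tailed z-test when the true mean is mu1."""
    se = sigma / sqrt(n)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    lower, upper = mu0 - z_crit * se, mu0 + z_crit * se
    under_h1 = NormalDist(mu=mu1, sigma=se)
    beta = under_h1.cdf(upper) - under_h1.cdf(lower)
    return 1 - beta

# 2.8 comes from the worked answer; 2.6 and 2.4 are hypothetical extras
for true_mean in (2.8, 2.6, 2.4):
    p = power_two_tailed(mu0=3.0, mu1=true_mean, sigma=0.6, n=16)
    print(f"true mean {true_mean}%: power = {p:.3f}")
```

The further the true mean lies from 3%, the higher the power, which matches the effect-size discussion later in the lesson.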

🧠 How to Use α (Type I) and β (Type II) in Real Decisions

1. What They Really Mean

α (Type I Error): The chance of making a false alarm.
β (Type II Error): The chance of making a false negative.

2. Math Recap

We assume:

H₀: μ = 50
H₁: μ = 52
σ = 2.5, n = 10 → SE = 2.5 / √10 ≈ 0.79
Critical z = ±1.96 → cutoff range: [48.45, 51.55]

Type I Error (α): Probability that we say "μ ≠ 50" even though μ = 50 → false alarm
Controlled by your chosen significance level. Usually α = 0.05 (5%).

Type II Error (β): Probability that we say "μ = 50" even though μ = 52 → missed effect
Calculated using overlap from the H₁ curve into the H₀ zone.

In our case: β ≈ 28.4%, so Power = 1 - β ≈ 71.6%

3. If α is high or low...

High α (e.g., 0.10): You're taking more risk of false alarms (bad if false alarms are costly)
Low α (e.g., 0.01): You're being strict — fewer false alarms, but it may increase β (miss real effects)
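
The trade-off between α and β can be made concrete with the running example (μ₀ = 50, μ₁ = 52, σ = 2.5, n = 10). This is a rough sketch under the same known-σ, two-tailed assumptions, and the helper name `beta_for_alpha` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def beta_for_alpha(alpha, mu0=50, mu1=52, sigma=2.5, n=10):
    """Type II error probability of a two-tailed z-test at a given alpha."""
    se = sigma / sqrt(n)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    lower, upper = mu0 - z_crit * se, mu0 + z_crit * se
    under_h1 = NormalDist(mu=mu1, sigma=se)
    return under_h1.cdf(upper) - under_h1.cdf(lower)

for alpha in (0.10, 0.05, 0.01):
    print(f"alpha = {alpha:.2f} -> beta = {beta_for_alpha(alpha):.3f}")
```

With everything else held fixed, each step to a stricter α widens the acceptance region, so β rises.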

4. What’s the Damage?

Error Type	What You Say	Reality	Damage Example
Type I (α)	“It works!”	Actually doesn’t work	You approve a bad drug, waste money, risk safety
Type II (β)	“It doesn’t work.”	Actually does work	You reject a useful medicine, miss a breakthrough

5. When Do You Care Most About α vs. β?

In medical trials or criminal justice, Type I is very serious → set α very low (0.01 or less)
In early research or exploration, Type II may be more serious → allow higher α to reduce β

6. Conclusion

α and β are about risk of being wrong. You pick α to control false positives. You estimate β to check if your test is strong enough to detect the real effect.

Think of it like this:

α: How often will I sound a false alarm?
β: How often will I miss the truth?
Power: How likely am I to find the truth if it's there?

💡 Final Tip: Good tests have low α, low β, and high power. But there’s a trade-off. You must choose what mistake matters most in your situation.

7. What Affects Type I and Type II Errors?

Type I Error (α) is set by you — usually 0.05 — and defines how strict you are. But Type II Error (β) depends on:

1. Sample Size (n): Larger n means smaller standard error → curves get narrower and overlap less → β goes down → power goes up.
2. Effect Size (μ₁ − μ₀): Bigger gap between H₀ and H₁ → easier to detect → β goes down.
3. Standard Deviation (σ): More noise → curves are wider → harder to separate → β goes up.
4. Chosen α: Lower α (stricter) makes critical region smaller → β goes up (harder to detect real change).
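
The first three factors can be explored one at a time with a short script. The sketch below starts from the lesson's example (μ₀ = 50, μ₁ = 52, σ = 2.5, n = 10) and changes a single input per loop. The alternative values tried in each loop are hypothetical, and `beta` is an illustrative helper for a two-tailed z-test with known σ.

```python
from math import sqrt
from statistics import NormalDist

def beta(mu0=50, mu1=52, sigma=2.5, n=10, alpha=0.05):
    """Type II error probability for a two-tailed z-test."""
    se = sigma / sqrt(n)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    lower, upper = mu0 - z_crit * se, mu0 + z_crit * se
    under_h1 = NormalDist(mu=mu1, sigma=se)
    return under_h1.cdf(upper) - under_h1.cdf(lower)

# Vary one factor at a time; every other input keeps its default value
for n in (10, 20, 40):
    print(f"n = {n:>2}: beta = {beta(n=n):.3f}")
for mu1 in (51, 52, 53):
    print(f"effect size = {mu1 - 50}: beta = {beta(mu1=mu1):.3f}")
for sigma in (1.5, 2.5, 3.5):
    print(f"sigma = {sigma}: beta = {beta(sigma=sigma):.3f}")
```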

📉 Why Small Effects Are Hard to Detect

If μ₁ is very close to μ₀ (e.g., 50.0 vs 50.5), then unless the sample is very large, the red and blue curves still overlap a lot. That means:

It’s very hard to reject H₀
You’ll likely make a Type II Error (you say: "no change" even though there is a small one)

📊 Example:

Assume:

μ₀ = 50, μ₁ = 50.5 → very small difference
σ = 2.5, n = 10 → SE ≈ 0.79 → still wide curves
Result: β ≈ 0.90, so power is only about 10% → hard to detect that the true mean is 50.5
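
The size of β in this example can be computed directly. The sketch below uses the values above (μ₀ = 50, μ₁ = 50.5, σ = 2.5, n = 10) and, purely for comparison, a hypothetical much larger sample of n = 400. It assumes a two-tailed z-test at α = 0.05, and `beta_two_tailed` is an illustrative name.

```python
from math import sqrt
from statistics import NormalDist

def beta_two_tailed(mu0, mu1, sigma, n, alpha=0.05):
    """Type II error probability for a two-tailed z-test with known sigma."""
    se = sigma / sqrt(n)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    lower, upper = mu0 - z_crit * se, mu0 + z_crit * se
    under_h1 = NormalDist(mu=mu1, sigma=se)
    return under_h1.cdf(upper) - under_h1.cdf(lower)

beta_small_n = beta_two_tailed(50, 50.5, 2.5, n=10)   # the example above
beta_large_n = beta_two_tailed(50, 50.5, 2.5, n=400)  # hypothetical larger sample

print(f"n = 10:  beta = {beta_small_n:.3f}")  # roughly 0.90
print(f"n = 400: beta = {beta_large_n:.3f}")
```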

🛠️ Limitations of Hypothesis Testing

Can’t "prove" anything — only reject or fail to reject H₀
Low power means real effects are missed
Too much focus on α can blind you to β risk
If n is too small, power is low no matter how you set α

So: Always consider:

What size of effect matters?
Is your sample big enough to detect it?
Are you okay with missing something (β)? Or making a false claim (α)?
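
One way to approach the sample-size question is to search for the smallest n that reaches a target power. The sketch below does this by brute force for a two-tailed z-test with known σ. The 80% target is a common convention assumed here rather than something stated in the lesson, and `power` and `smallest_n` are hypothetical helper names.

```python
from math import sqrt
from statistics import NormalDist

def power(mu0, mu1, sigma, n, alpha=0.05):
    """P(reject H0) for a two-tailed z-test when the true mean is mu1."""
    se = sigma / sqrt(n)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    lower, upper = mu0 - z_crit * se, mu0 + z_crit * se
    under_h1 = NormalDist(mu=mu1, sigma=se)
    return 1 - (under_h1.cdf(upper) - under_h1.cdf(lower))

def smallest_n(mu0, mu1, sigma, target_power=0.80, alpha=0.05, max_n=100_000):
    """Smallest sample size whose power reaches target_power, or None."""
    for n in range(2, max_n + 1):
        if power(mu0, mu1, sigma, n, alpha) >= target_power:
            return n
    return None  # target not reachable within max_n

# Running example: effect of 2 units, then the small 0.5-unit effect
print(smallest_n(mu0=50, mu1=52, sigma=2.5))    # 13 under these assumptions
print(smallest_n(mu0=50, mu1=50.5, sigma=2.5))  # far larger for the small effect
```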

💡 Summary: Type II error (β) is like a hidden trap — it depends on sample size, effect size, and data variability. Always check your power before trusting a non-significant result.
