📈 Session 6.6 — Scatter Plot & Correlation Wine Quality

Goal: see how Wine Color / Color Density / pH / SO₂ relate to Quality and compute the sample correlation coefficient r. Data adapted from Table 6.5 (Chapter 6.6).

Why a Scatter Plot?

  • Shows two variables together (each dot is one wine): quick visual check for direction (up/down), form (linear vs. curved), and strength (tight vs. spread).
  • Sets up correlation (r) and the best-fit line (ŷ = a + bx) for prediction.
  • Catches patterns histograms/boxplots miss (e.g., clusters, outliers, nonlinearity).

What you should learn

  • Make and read a scatter plot; describe trend, strength, and outliers.
  • Compute r and interpret: sign (±) and size (|r| from 0 to 1).
  • Add a linear trendline and interpret slope/intercept; connect r and R².
  • Know limits: correlation ≠ causation, and r only measures linear association.

Quick Excel steps (click-by-click)

  1. Paste the table below into Excel (A1:E21).
  2. Insert → Scatter → Scatter (only markers).
  3. Right-click the points → Add Trendline → choose Linear. Check:
    • Display Equation on chart
    • Display R-squared value on chart
  4. Compute correlation: =CORREL(E2:E21, A2:A21) (Color vs Quality)
    Also works: =PEARSON(E2:E21, A2:A21)
  5. Optional: slope/intercept with functions:
    =SLOPE(A2:A21, E2:E21), =INTERCEPT(A2:A21, E2:E21)
    Or full regression: =LINEST(A2:A21, E2:E21, TRUE, TRUE)

To switch X, replace column E in the formulas above: use B for pH, C for SO₂, D for ColorDensity (E is Color). Column A (Quality) stays as Y.
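
For readers who want to check the Excel output outside the spreadsheet, here is a minimal Python sketch of what =SLOPE and =INTERCEPT compute. The function names and the argument order (known Y's first, then known X's) simply mirror the Excel calls. The numbers are made-up toy values chosen so the answer is easy to verify by hand; they are not the Table 6.5 data.

```python
from statistics import fmean

def slope(known_ys, known_xs):
    """Least-squares slope b, with arguments ordered like Excel's SLOPE."""
    x_bar, y_bar = fmean(known_xs), fmean(known_ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(known_xs, known_ys))
    sxx = sum((x - x_bar) ** 2 for x in known_xs)
    return sxy / sxx

def intercept(known_ys, known_xs):
    """Least-squares intercept a = y_bar - b * x_bar, like Excel's INTERCEPT."""
    return fmean(known_ys) - slope(known_ys, known_xs) * fmean(known_xs)

# Toy values (NOT Table 6.5): quality is exactly 1 + 2 * color.
color = [1.0, 2.0, 3.0, 4.0, 5.0]
quality = [3.0, 5.0, 7.0, 9.0, 11.0]

b = slope(quality, color)      # Y first, X second
a = intercept(quality, color)
print(f"quality_hat = {a:.2f} + {b:.2f} * color")
```

Swapping the two arguments fits a different line (X predicted from Y), which is why the argument order matters for SLOPE and INTERCEPT even though it does not matter for CORREL.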

Gallery — Common Scatter-Plot Relationship Types

Interactive Explorer

Reports n, sample r, R², slope b, and intercept a for the selected X variable.

How r is computed (first 5 rows preview)

Formula:

r = Σ[(xᵢ − x̄)(yᵢ − ȳ)] / √( Σ(xᵢ − x̄)² · Σ(yᵢ − ȳ)² )
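
The formula can be followed term by term in a short Python sketch. It prints each row's deviations from the means and their product, then combines the sums exactly as in the formula. The five (x, y) pairs are made-up toy values, not rows from Table 6.5, and the function name pearson_r is only an illustrative choice.

```python
import math

def pearson_r(xs, ys):
    """Sample correlation: sum of cross-deviations over sqrt(Sxx * Syy)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 points")
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Toy values (NOT Table 6.5).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)
print("x   y   x-xbar  y-ybar  product")
for x, y in zip(xs, ys):
    dx, dy = x - x_bar, y - y_bar
    print(f"{x:<3} {y:<3} {dx:>6.1f}  {dy:>6.1f}  {dx * dy:>7.1f}")

r = pearson_r(xs, ys)
print(f"r = {r:.4f}")
```

For these toy values the sums are Σ(x − x̄)(y − ȳ) = 6, Σ(x − x̄)² = 10, and Σ(y − ȳ)² = 6, so r = 6 / √60 ≈ 0.7746.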

Raw Data (Table 6.5 subset)

Columns: Quality (y), pH, SO₂, ColorDensity, Color (x candidates).

Interpretation Tips

  • Sign: if r > 0, as X increases, Y tends to increase; if r < 0, Y tends to decrease.
  • Size: |r| ≈ 0.1 weak, ≈ 0.5 moderate, ≈ 0.8 strong (rule of thumb, not a law).
  • R²: fraction of Y’s variability “explained” by X in a linear model (here R² = r²).
  • Outliers can strongly change r and the line—always inspect the plot first.
  • Curved patterns: r may be near 0 even with a clear relationship (try ColorDensity vs Color).
  • Units: r is unit-free; slope b uses Y-units per X-unit (interpret in context).
  • Correlation ≠ causation: quality and color may be related via chemistry, but this plot alone can’t prove cause.
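
Two of the tips above (curved patterns and outliers) are easy to see numerically. The sketch below uses small made-up data sets, not the wine data: a perfect U-shaped relationship whose r is 0, and a weak-looking cloud whose r jumps once a single far-away point is added. The helper name pearson_r is illustrative.

```python
import math

def pearson_r(xs, ys):
    """Sample correlation r from sums of deviations about the means."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Curved pattern: y is fully determined by x (y = x^2), yet r is 0
# because the relationship is not linear.
curve_x = [-2, -1, 0, 1, 2]
curve_y = [x ** 2 for x in curve_x]
r_curved = pearson_r(curve_x, curve_y)

# Outlier: the same five points with and without one extreme point.
base_x = [1, 2, 3, 4, 5]
base_y = [2, 1, 3, 2, 3]
r_base = pearson_r(base_x, base_y)
r_outlier = pearson_r(base_x + [20], base_y + [20])

print(f"curved pattern: r = {r_curved:.3f}")
print(f"without outlier: r = {r_base:.3f}")
print(f"with outlier:    r = {r_outlier:.3f}")
```

In both cases the number alone would mislead; the scatter plot shows the U shape and the lone extreme point immediately.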

Practice

  1. Switch X to pH. Is r stronger or weaker than with Color?
  2. Switch X to Total SO₂. Why is r negative?
  3. Back to Color. What does slope mean in words?

Student Q&A (fast answers)

Q1: Is a higher r always better?

No. “Better” depends on purpose. A high r helps linear prediction, but you can still have a misleading model if the shape is curved or outliers drive r.

Q2: Do I need Y on the vertical axis?

Yes—put the variable you want to predict/explain on the vertical (here, Quality). Put the “explanatory” variable (Color, pH, …) on the horizontal.

Q3: Is R² just r²?

With one X, yes: R² = r². With multiple X’s, R² comes from the full regression fit, not a single pairwise r.
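
A quick numerical check of the one-X case, as a sketch on made-up toy values (not Table 6.5): fit the least-squares line, compute R² as 1 − SSE/SST from the residuals, and compare it with the square of r.

```python
import math

# Toy values (NOT Table 6.5).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
sxx = sum((x - x_bar) ** 2 for x in xs)
sst = sum((y - y_bar) ** 2 for y in ys)   # total variability of Y

# Least-squares line y_hat = a + b * x.
b = sxy / sxx
a = y_bar - b * x_bar

# R^2 from the fit: 1 - (unexplained variability / total variability).
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
r_squared_fit = 1 - sse / sst

# r from the correlation formula.
r = sxy / math.sqrt(sxx * sst)

print(f"R^2 from fit = {r_squared_fit:.4f}")
print(f"r squared    = {r ** 2:.4f}")
```

Both values agree (0.6 for these toy numbers), which is the one-X identity R² = r².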

Q4: My Excel r doesn’t match—why?

Check ranges (no headers), same rows for X and Y, and no extra blanks/non-numbers. Also confirm you didn’t swap the columns.

Q5: I see a curve—what then?

Try transformations (log, square), add polynomial terms, or use a different model. A simple r measures linear association only.
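
As one illustration of the transformation idea, the sketch below assumes an exponential-looking curve and applies a log transform to Y. The data are generated for the example (y = e^x), not taken from the wine table, and a log transform is only one of the options listed above; which transform helps depends on the shape you see in the plot.

```python
import math

def pearson_r(xs, ys):
    """Sample correlation r from sums of deviations about the means."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Generated curved data (NOT Table 6.5): y grows exponentially with x.
xs = [1, 2, 3, 4, 5, 6]
ys = [math.exp(x) for x in xs]

r_raw = pearson_r(xs, ys)                         # curve, so r is below 1
r_log = pearson_r(xs, [math.log(y) for y in ys])  # log(y) is linear in x

print(f"r with raw y:  {r_raw:.3f}")
print(f"r with log(y): {r_log:.3f}")
```

After the transform the points fall on a straight line, so r describes the relationship well; on the raw scale it understates it.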

Source: Wine data (Quality, pH, Total SO₂, Color Density, Color) from Chapter 6, Section 6.6, Table 6.5.