📈 Session 6.6 — Scatter Plot & Correlation Wine Quality

Goal: see how Wine Color / Color Density / pH / SO₂ relate to Quality and compute the sample correlation coefficient r. Data adapted from Table 6.5 (Chapter 6, Section 6.6).

Why a Scatter Plot?

Shows two variables together (each dot is one wine): quick visual check for direction (up/down), form (linear vs. curved), and strength (tight vs. spread).
Sets up correlation (r) and the best-fit line (ŷ = a + bx) for prediction.
Catches patterns histograms/boxplots miss (e.g., clusters, outliers, nonlinearity).

What you should learn

Make and read a scatter plot; describe trend, strength, and outliers.
Compute r and interpret: sign (±) and size (|r| from 0 to 1).
Add a linear trendline and interpret slope/intercept; connect r and R².
Know limits: correlation ≠ causation, and r only measures linear association.

Quick Excel steps (click-by-click)

Paste the table below into Excel (A1:E21).
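Select the X column and the Quality column (Ctrl-click the second one), then Insert → Scatter → Scatter (only markers). Excel puts the leftmost selected column on the horizontal axis, so if Quality (column A) ends up there, use Select Data → Edit to swap the X and Y ranges.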
Right-click the points → Add Trendline → choose Linear. Check:
- Display Equation on chart
- Display R-squared value on chart
Compute correlation: =CORREL(E2:E21, A2:A21) (Color vs Quality)
Also works: =PEARSON(E2:E21, A2:A21)
Optional: slope/intercept with functions:
=SLOPE(A2:A21, E2:E21), =INTERCEPT(A2:A21, E2:E21)
Or full regression: =LINEST(A2:A21, E2:E21, TRUE, TRUE)

To switch X, replace column E in the formulas above: B for pH, C for SO₂, D for ColorDensity, E for Color. Column A (Quality) stays as Y.

[Figure gallery: common scatter-plot relationship types]

Interactive Explorer

Choose X (horizontal) and optionally show the best-fit line. The explorer then reports n, r (sample), R², slope b, and intercept a for the selected X.

How r is computed (first 5 rows preview)

Formula:

r = Σ[(xᵢ − x̄)(yᵢ − ȳ)] / √( Σ(xᵢ − x̄)² · Σ(yᵢ − ȳ)² )
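
The formula above can be traced row by row. The sketch below builds the deviation table (x − x̄, y − ȳ, their product, and the two squares) and then combines the column sums into r. The function name `r_with_table` and the data are illustrative assumptions; the values are toy numbers, not rows from Table 6.5.

```python
from math import sqrt

def r_with_table(xs, ys):
    """Return (rows, r); each row is (x, y, dx, dy, dx*dy, dx^2, dy^2)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    rows = []
    for x, y in zip(xs, ys):
        dx, dy = x - x_bar, y - y_bar
        rows.append((x, y, dx, dy, dx * dy, dx * dx, dy * dy))
    s_xy = sum(row[4] for row in rows)   # numerator of r
    s_xx = sum(row[5] for row in rows)   # sum of squared x deviations
    s_yy = sum(row[6] for row in rows)   # sum of squared y deviations
    return rows, s_xy / sqrt(s_xx * s_yy)

# Hypothetical toy values, NOT the Table 6.5 wine data.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

rows, r = r_with_table(xs, ys)
print("  x   y     dx     dy  dx*dy   dx^2   dy^2")
for x, y, dx, dy, prod, dx2, dy2 in rows:
    print(f"{x:>3} {y:>3} {dx:>6.1f} {dy:>6.1f} {prod:>6.1f} {dx2:>6.1f} {dy2:>6.1f}")
print(f"r = {r:.4f}")
```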


Raw Data (Table 6.5 subset)

Columns: Quality (y), pH, SO₂, ColorDensity, Color (x candidates).

Interpretation Tips

Sign: if r > 0, as X increases, Y tends to increase; if r < 0, Y tends to decrease.
Size: |r| ≈ 0.1 weak, ≈ 0.5 moderate, ≈ 0.8 strong (rule of thumb, not a law).
R² = fraction of Y’s variability “explained” by X in a linear model (here R² = r²).
Outliers can strongly change r and the line—always inspect the plot first.
Curved patterns: r may be near 0 even with a clear relationship (try ColorDensity vs Color).
Units: r is unit-free; slope b uses Y-units per X-unit (interpret in context).
Correlation ≠ causation: quality and color may be related via chemistry, but this plot alone can’t prove cause.
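
Two of the tips above (curved patterns and outliers) are easy to see numerically. The sketch below uses small made-up data sets, not the wine data: a perfect parabola where r comes out at essentially 0, and a patternless cloud where one added extreme point pushes r to about 0.87. The helper `pearson_r` is an illustrative name.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Sample correlation coefficient from the deviation formula."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)

# Curved pattern (toy data): y depends on x exactly, but not linearly.
x_curve = [-2, -1, 0, 1, 2]
y_curve = [x * x for x in x_curve]
r_curve = pearson_r(x_curve, y_curve)

# Outlier effect (toy data): no linear trend until one extreme point is added.
x_base = [1, 2, 3, 4, 5]
y_base = [2, 1, 3, 1, 2]
r_base = pearson_r(x_base, y_base)
r_outlier = pearson_r(x_base + [10], y_base + [10])

print(f"parabola:        r = {r_curve:.2f}")
print(f"without outlier: r = {r_base:.2f}")
print(f"with outlier:    r = {r_outlier:.2f}")
```

In both cases the number alone would mislead, which is why the plot comes first.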

Practice

Switch X to pH. Is r stronger or weaker than it was for Color?
Switch X to Total SO₂. Why is r negative?
Back to Color. What does slope mean in words?

Student Q&A (fast answers)

Q1: Is a higher r always better?

No. “Better” depends on purpose. A high r helps linear prediction, but you can still have a misleading model if the shape is curved or outliers drive r.

Q2: Do I need Y on the vertical axis?

Yes—put the variable you want to predict/explain on the vertical (here, Quality). Put the “explanatory” variable (Color, pH, …) on the horizontal.

Q3: Is R² just r²?

With one X, yes: R² = r². With multiple X’s, R² comes from the full regression fit, not a single pairwise r.
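
The one-X case can be checked directly: fit the least-squares line, compute R² as 1 − SSE/SST, and compare it with r². This is a sketch on hypothetical toy values (not the wine data), and `r_squared_two_ways` is an illustrative name.

```python
def r_squared_two_ways(xs, ys):
    """Return (R^2 from the fitted line, r^2 from the correlation formula)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)   # SST, total variation in y

    b = s_xy / s_xx            # least-squares slope
    a = y_bar - b * x_bar      # least-squares intercept
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

    r2_from_fit = 1 - sse / s_yy
    r2_from_r = s_xy ** 2 / (s_xx * s_yy)
    return r2_from_fit, r2_from_r

# Hypothetical toy values, NOT the Table 6.5 wine data.
r2_fit, r2_r = r_squared_two_ways([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(f"R^2 from fit = {r2_fit:.4f}, r^2 = {r2_r:.4f}")
```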

Q4: My Excel r doesn’t match—why?

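Check ranges (no headers), same rows for X and Y, and no extra blanks/non-numbers. Swapping the two ranges does not change CORREL, because r is symmetric, but it does change SLOPE and INTERCEPT, which expect the Y range first.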

Q5: I see a curve—what then?

Try transformations (log, square), add polynomial terms, or use a different model. A simple r measures linear association only.
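
As a small illustration of the transformation idea, the sketch below uses made-up data that grows by doubling (not the wine data). The raw values give a high but imperfect r, while taking the log of y makes the relationship exactly linear. The helper `pearson_r` is an illustrative name, and base-2 log is chosen only because it keeps the toy numbers tidy.

```python
from math import log2, sqrt

def pearson_r(xs, ys):
    """Sample correlation coefficient from the deviation formula."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)

# Toy data: y doubles with each step in x, so the scatter plot curves upward.
x = [1, 2, 3, 4, 5]
y = [2 ** v for v in x]            # 2, 4, 8, 16, 32

r_raw = pearson_r(x, y)                         # curved, so below 1
r_log = pearson_r(x, [log2(v) for v in y])      # log2(y) equals x here

print(f"r with raw y:     {r_raw:.3f}")
print(f"r with log2(y):   {r_log:.3f}")
```

A high r on the raw data can still hide a curve, so compare the plots before and after transforming.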

Source: Wine data (Quality, pH, Total SO₂, Color Density, Color) from Chapter 6, Section 6.6, Table 6.5.