📈 Session 6.6 — Scatter Plot & Correlation Wine Quality
Goal: see how Wine Color / Color Density / pH / SO₂ relate to Quality and compute the sample correlation coefficient r. Data adapted from Table 6.5 (Chapter 6.6).
Why a Scatter Plot?
- Shows two variables together (each dot is one wine): quick visual check for direction (up/down), form (linear vs. curved), and strength (tight vs. spread).
- Sets up correlation (r) and the best-fit line (ŷ = a + bx) for prediction.
- Catches patterns histograms/boxplots miss (e.g., clusters, outliers, nonlinearity).
What you should learn
- Make and read a scatter plot; describe trend, strength, and outliers.
- Compute r and interpret: sign (±) and size (|r| between 0 and 1).
- Add a linear trendline and interpret slope/intercept; connect r and R².
- Know limits: correlation ≠ causation, and r only measures linear association.
Quick Excel steps (click-by-click)
- Paste the table below into Excel (A1:E21).
- Insert → Scatter → Scatter (only markers).
- Right-click the points → Add Trendline → choose Linear. Check:
          - Display Equation on chart
          - Display R-squared value on chart
- Compute correlation: =CORREL(E2:E21, A2:A21) (Color vs Quality)
          Also works: =PEARSON(E2:E21, A2:A21)
- Optional: slope/intercept with functions: =SLOPE(A2:A21, E2:E21), =INTERCEPT(A2:A21, E2:E21)
          Or full regression: =LINEST(A2:A21, E2:E21, TRUE, TRUE)
Excel ranges to switch X: use B for pH, C for SO₂, D for ColorDensity, E for Color.
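As a cross-check on the SLOPE and INTERCEPT functions, here is a minimal Python sketch of the least-squares line ŷ = a + bx. The function name `fit_line` and the toy values are assumptions made for illustration; the numbers are made up and are not rows from Table 6.5.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least-squares line y-hat = a + b*x."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 values")
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # b = sum of cross-deviations / sum of squared x-deviations
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    # The fitted line always passes through (x_bar, y_bar)
    a = y_bar - b * x_bar
    return a, b

# Made-up toy values for illustration only (NOT from Table 6.5)
toy_x = [1.0, 2.0, 3.0, 4.0, 5.0]
toy_y = [2.0, 4.0, 5.0, 4.0, 5.0]

a, b = fit_line(toy_x, toy_y)
print(f"intercept a = {a:.3f}, slope b = {b:.3f}")
```

As in Excel, where SLOPE and INTERCEPT take the Y range first, the roles of X and Y matter here: swapping the two lists gives a different line.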
Gallery — Common Scatter-Plot Relationship Types
Interactive Explorer
How r is computed (first 5 rows preview)
Formula:
r = Σ[(xᵢ − x̄)(yᵢ − ȳ)] / √( Σ(xᵢ − x̄)² · Σ(yᵢ − ȳ)² )
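The formula can be followed term by term in a short Python sketch. The function name `pearson_r` and the toy values are assumptions for illustration; the numbers are made up and are not the Table 6.5 data.

```python
import math

def pearson_r(xs, ys):
    """Sample correlation coefficient r, computed directly from the formula."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 values")
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    # Numerator: sum of products of deviations from the means
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    # Denominator pieces: sums of squared deviations
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Made-up toy values for illustration only (NOT from Table 6.5)
toy_x = [1.0, 2.0, 3.0, 4.0, 5.0]
toy_y = [2.0, 4.0, 5.0, 4.0, 5.0]

r = pearson_r(toy_x, toy_y)
print(f"r = {r:.4f}")
```

For these toy values the sums are Σ(xᵢ − x̄)(yᵢ − ȳ) = 6, Σ(xᵢ − x̄)² = 10, and Σ(yᵢ − ȳ)² = 6, so r = 6 / √60 ≈ 0.775. The function divides by zero if either list is constant, which matches the fact that r is undefined when a variable has no spread.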
Raw Data (Table 6.5 subset)
Columns: Quality (y), pH, SO₂, ColorDensity, Color (x candidates).
Interpretation Tips
- Sign: if r > 0, as X increases, Y tends to increase; if r < 0, Y tends to decrease.
- Size: |r| ≈ 0.1 weak, ≈ 0.5 moderate, ≈ 0.8 strong (rule-of-thumb, not a law); the same scale applies to negative r.
- R² = fraction of Y’s variability “explained” by X in a linear model (here R² = r²).
- Outliers can strongly change r and the line—always inspect the plot first.
- Curved patterns: r may be near 0 even with a clear relationship (try ColorDensity vs Color).
- Units: r is unit-free; slope b uses Y-units per X-unit (interpret in context).
- Correlation ≠ causation: quality and color may be related via chemistry, but this plot alone can’t prove cause.
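Two of the tips above (curved patterns and outliers) are easy to see with small made-up examples. The sketch below is illustrative only: the helper `pearson_r` and all values are assumptions, not the wine data.

```python
import math

def pearson_r(xs, ys):
    """Sample correlation coefficient r from the deviation formula."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Curved pattern: y is an exact function of x (a parabola), yet r is 0
# because the relationship is not linear.
curve_x = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
curve_y = [x ** 2 for x in curve_x]
r_curve = pearson_r(curve_x, curve_y)

# Outlier: five loosely related made-up points, then one extreme point added.
base_x = [1.0, 2.0, 3.0, 4.0, 5.0]
base_y = [3.0, 1.0, 4.0, 2.0, 3.0]
r_without = pearson_r(base_x, base_y)
r_with = pearson_r(base_x + [20.0], base_y + [20.0])

print(f"parabola: r = {r_curve:.3f}")
print(f"without outlier: r = {r_without:.3f}")
print(f"with outlier:    r = {r_with:.3f}")
```

A single far-away point moves r from weak to very strong here, which is why the plot should be inspected before trusting the number.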
Practice
- Switch X to pH. Is r stronger or weaker vs. Color?
- Switch X to Total SO₂. Why is r negative?
- Back to Color. What does slope mean in words?
Student Q&A (fast answers)
Q1: Is a higher r always better?
No. “Better” depends on purpose. A high r helps linear prediction, but you can still have a misleading model if the shape is curved or outliers drive r.
Q2: Do I need Y on the vertical axis?
By convention, yes: put the variable you want to predict/explain on the vertical axis (here, Quality) and the "explanatory" variable (Color, pH, …) on the horizontal. The choice does not change r, but it does change the fitted line.
Q3: Is R² just r²?
With one X, yes: R² = r². With multiple X’s, R² comes from the full regression fit, not a single pairwise r.
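The one-X case can be checked numerically. This sketch computes R² from the fitted line as 1 − SSE/SST and compares it with r² from the correlation formula; the toy values and variable names are assumptions for illustration, not Table 6.5 data.

```python
import math

# Made-up toy values for illustration only (NOT from Table 6.5)
toy_x = [1.0, 2.0, 3.0, 4.0, 5.0]
toy_y = [2.0, 4.0, 5.0, 4.0, 5.0]

n = len(toy_x)
x_bar = sum(toy_x) / n
y_bar = sum(toy_y) / n
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(toy_x, toy_y))
sxx = sum((x - x_bar) ** 2 for x in toy_x)
syy = sum((y - y_bar) ** 2 for y in toy_y)

# Least-squares line y-hat = a + b*x
b = sxy / sxx
a = y_bar - b * x_bar

# R-squared from the regression: 1 - (unexplained variation / total variation)
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(toy_x, toy_y))
sst = syy
r_squared = 1 - sse / sst

# r from the correlation formula
r = sxy / math.sqrt(sxx * syy)

print(f"R^2 from the fit = {r_squared:.4f}")
print(f"r^2 from r       = {r ** 2:.4f}")
```

The two values agree up to rounding for any single-X least-squares fit with an intercept; the equality does not carry over to regressions with several X variables.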
Q4: My Excel r doesn’t match—why?
Check ranges (no headers), same rows for X and Y, and no extra blanks/non-numbers. Also confirm you selected the intended columns. Swapping X and Y does not change r, but it does change SLOPE and INTERCEPT.
Q5: I see a curve—what then?
Try transformations (log, square), add polynomial terms, or use a different model. A simple r measures linear association only.
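As one illustration of the transformation idea, the sketch below uses made-up values that grow exponentially. Taking the log of Y straightens the pattern, and r rises accordingly. The helper `pearson_r` and the data are assumptions, not the wine data, and a log transform is only one of several options.

```python
import math

def pearson_r(xs, ys):
    """Sample correlation coefficient r from the deviation formula."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Made-up curved data: y doubles with each step in x (NOT from Table 6.5)
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.0 ** x for x in xs]

r_raw = pearson_r(xs, ys)                         # curved: linear r understates the fit
r_log = pearson_r(xs, [math.log(y) for y in ys])  # log(y) is exactly linear in x

print(f"r on raw y    = {r_raw:.4f}")
print(f"r on log(y)   = {r_log:.4f}")
```

After a transformation, the slope is interpreted on the transformed scale (here, change in log(Y) per unit of X), so the plot of the transformed data should be checked again.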
Source: Wine data (Quality, pH, Total SO₂, Color Density, Color) from Chapter 6, Section 6.6, Table 6.5.