Chapter 5 . Session 5.4 — Covariance, Correlation, and OLS (EXCEL)
Data: ENGR200_Term_Project_Data_GPA • n = students. Choose a variable to compare with Statics_Grade. Download CSV
1) Explore with your data
Mean(Statics)–
Var(Statics)–
Std(Statics)–
Mean()–
Var()–
Std()–
Cov(, Statics)–
Corr(, Statics)–
OLS slope β₁–
OLS intercept β₀–
R² (simple OLS)–
ρ² (should = R²)–
2) What are Covariance and Correlation?
Covariance measures the direction of linear association. Units are the product of units of \(X\) and \(Y\).
Correlation rescales covariance to a unitless number in \([-1,1]\) indicating strength + direction of linear association.
From a joint distribution \(f_{X,Y}\): \(E[XY]=\sum_x\sum_y xy\,f_{X,Y}(x,y)\) (discrete) or \(E[XY]=\int\!\!\int xy\,f_{X,Y}(x,y)\,dx\,dy\) (continuous).
\( \textbf{Covariance:}\quad \sigma_{XY}=\operatorname{Cov}(X,Y)
= \mathbb{E}[(X-\mu_X)(Y-\mu_Y)] = \mathbb{E}[XY]-\mu_X\mu_Y \;\; \tag{5.15} \)
\( \textbf{Correlation:}\quad \rho_{XY} = \dfrac{\operatorname{Cov}(X,Y)}{\sqrt{V(X)\,V(Y)}} = \dfrac{\sigma_{XY}}{\sigma_X \sigma_Y} \;\; \tag{5.16} \)
\( \text{Because }\sigma_X>0,\ \sigma_Y>0,\ \text{sign}(\rho_{XY})=\text{sign}(\sigma_{XY}).\ \text{Also,}\ -1 \le \rho_{XY} \le +1. \;\; \tag{5.17} \)
\( \textbf{Correlation:}\quad \rho_{XY} = \dfrac{\operatorname{Cov}(X,Y)}{\sqrt{V(X)\,V(Y)}} = \dfrac{\sigma_{XY}}{\sigma_X \sigma_Y} \;\; \tag{5.16} \)
\( \text{Because }\sigma_X>0,\ \sigma_Y>0,\ \text{sign}(\rho_{XY})=\text{sign}(\sigma_{XY}).\ \text{Also,}\ -1 \le \rho_{XY} \le +1. \;\; \tag{5.17} \)
Correlation simply rescales covariance by the product of standard deviations, making it dimensionless for fair comparison across different units and scales.
3) OLS vs. Correlation — what's the difference?
Correlation (ρ)
- Symmetric: Corr(X,Y) = Corr(Y,X)
- Unitless in \([-1,1]\); scale-invariant
- Summarizes strength & direction of linear association only
OLS Regression (β₀, β₁)
- Asymmetric: regress Y on X → vertical residuals minimized in Y
- Units matter: β₁ in “units of Y per unit X”
- Best-fitting line: \( \hat Y = β_0 + β_1 X\) with \( β_1 = \dfrac{\operatorname{Cov}(X,Y)}{\operatorname{Var}(X)}\)
- In simple regression, \(R^2 = \rho^2\)
4) Excel — Step-by-step
A) Built-in functions (fast)
- Put Statics_Grade in column A (A2:A) and the other variable in column B (B2:B).
- Means:
=AVERAGE(A2:A)
,=AVERAGE(B2:B)
- Variance (population):
=VAR.P(A2:A)
,=VAR.P(B2:B)
- Std (population):
=STDEV.P(A2:A)
,=STDEV.P(B2:B)
- Covariance (population):
=COVARIANCE.P(A2:A, B2:B)
- Correlation:
=CORREL(A2:A, B2:B)
Use the .S versions (VAR.S, STDEV.S, COVARIANCE.S) for sample formulas.
B) By-hand (shows the math)
- Means: \( \bar x=\text{AVERAGE}(B),\ \bar y=\text{AVERAGE}(A)\)
- Deviations columns: \(x_i-\bar x\) and \(y_i-\bar y\)
- Product column: \((x_i-\bar x)(y_i-\bar y)\)
- Population covariance: average the products; sample covariance: sum/(n−1)
- Std devs from VAR/STD; then \( \rho = \text{cov}/(s_X s_Y)\)
5) Numeric table (live from data)
Metric | Statics_Grade | |
---|---|---|
Mean | ||
Variance | ||
Std Dev | ||
Covariance | ||
Correlation | ||
OLS: Statics = β₀ + β₁·X | ||
R² vs ρ² |