Chapter 5 . Session 5.4 — Covariance, Correlation, and OLS (EXCEL)

Data: ENGR200_Term_Project_Data_GPA • n = students. Choose a variable to compare with Statics_Grade. Download CSV

1) Explore with your data

Mean(Statics)
Var(Statics)
Std(Statics)
Mean()
Var()
Std()
Cov(, Statics)
Corr(, Statics)
OLS slope β₁
OLS intercept β₀
R² (simple OLS)
ρ² (should = R²)

2) What are Covariance and Correlation?

Covariance measures the direction of linear association. Units are the product of units of \(X\) and \(Y\).

Correlation rescales covariance to a unitless number in \([-1,1]\) indicating strength + direction of linear association.

From a joint distribution \(f_{X,Y}\): \(E[XY]=\sum_x\sum_y xy\,f_{X,Y}(x,y)\) (discrete) or \(E[XY]=\int\!\!\int xy\,f_{X,Y}(x,y)\,dx\,dy\) (continuous).

\( \textbf{Covariance:}\quad \sigma_{XY}=\operatorname{Cov}(X,Y) = \mathbb{E}[(X-\mu_X)(Y-\mu_Y)] = \mathbb{E}[XY]-\mu_X\mu_Y \;\; \tag{5.15} \)

\( \textbf{Correlation:}\quad \rho_{XY} = \dfrac{\operatorname{Cov}(X,Y)}{\sqrt{V(X)\,V(Y)}} = \dfrac{\sigma_{XY}}{\sigma_X \sigma_Y} \;\; \tag{5.16} \)

\( \text{Because }\sigma_X>0,\ \sigma_Y>0,\ \text{sign}(\rho_{XY})=\text{sign}(\sigma_{XY}).\ \text{Also,}\ -1 \le \rho_{XY} \le +1. \;\; \tag{5.17} \)

Correlation simply rescales covariance by the product of standard deviations, making it dimensionless for fair comparison across different units and scales.

3) OLS vs. Correlation — what's the difference?

Correlation (ρ)

  • Symmetric: Corr(X,Y) = Corr(Y,X)
  • Unitless in \([-1,1]\); scale-invariant
  • Summarizes strength & direction of linear association only

OLS Regression (β₀, β₁)

  • Asymmetric: regress Y on X → vertical residuals minimized in Y
  • Units matter: β₁ in “units of Y per unit X”
  • Best-fitting line: \( \hat Y = β_0 + β_1 X\) with \( β_1 = \dfrac{\operatorname{Cov}(X,Y)}{\operatorname{Var}(X)}\)
  • In simple regression, \(R^2 = \rho^2\)

4) Excel — Step-by-step

A) Built-in functions (fast)

  1. Put Statics_Grade in column A (A2:A) and the other variable in column B (B2:B).
  2. Means: =AVERAGE(A2:A), =AVERAGE(B2:B)
  3. Variance (population): =VAR.P(A2:A), =VAR.P(B2:B)
  4. Std (population): =STDEV.P(A2:A), =STDEV.P(B2:B)
  5. Covariance (population): =COVARIANCE.P(A2:A, B2:B)
  6. Correlation: =CORREL(A2:A, B2:B)

Use the .S versions (VAR.S, STDEV.S, COVARIANCE.S) for sample formulas.

B) By-hand (shows the math)

  1. Means: \( \bar x=\text{AVERAGE}(B),\ \bar y=\text{AVERAGE}(A)\)
  2. Deviations columns: \(x_i-\bar x\) and \(y_i-\bar y\)
  3. Product column: \((x_i-\bar x)(y_i-\bar y)\)
  4. Population covariance: average the products; sample covariance: sum/(n−1)
  5. Std devs from VAR/STD; then \( \rho = \text{cov}/(s_X s_Y)\)

5) Numeric table (live from data)

MetricStatics_Grade
Mean
Variance
Std Dev
Covariance
Correlation
OLS: Statics = β₀ + β₁·X
R² vs ρ²