Session 7.4.2 — Maximum Likelihood Estimation (MLE)

Math version that mirrors the textbook, with clear steps and just enough calculus.

1) Formal definition (as in the book)

We observe a random sample \(x_1,\dots,x_n\) from a distribution with density/pmf \(f(x;\theta)\). The likelihood is

$$L(\theta)=\prod_{i=1}^n f(x_i;\theta).$$

The maximum likelihood estimator (MLE) \(\hat\theta\) is the value of \(\theta\) that maximizes \(L(\theta)\) (equivalently, maximizes \(\ln L(\theta)\)).

Intuition (discrete case): it picks the parameter that makes the observed sample most probable.

2) Simple recipe you can always follow

  1. Write the likelihood \(L(\theta)=\prod f(x_i;\theta)\).
  2. Take logs: \(\ell(\theta)=\ln L(\theta)\) (sums are easier than products).
  3. Differentiate: set \(\dfrac{d\ell(\theta)}{d\theta}=0\) (or partials for multi-parameter cases).
  4. Solve for \(\hat\theta\). Check it’s a max (usually obvious from context/shape).
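The four-step recipe can also be followed numerically. As a minimal illustrative sketch (not from the book; the sample values are made up), instead of solving \(d\ell/d\theta=0\) by hand, evaluate \(\ell(\theta)\) on a fine grid and pick the maximizer. Here the recipe is applied to a small exponential sample:

```python
import math

def mle_grid(log_lik, grid):
    # Steps 2-4 of the recipe, done numerically: scan ell(theta)
    # over a grid of candidate values and return the maximizer.
    return max(grid, key=log_lik)

# Made-up exponential sample (its mean is exactly 1.0)
data = [0.5, 1.2, 0.3, 2.0]
n, s = len(data), sum(data)

def exp_log_lik(lam):
    # ell(lambda) = n ln(lambda) - lambda * sum(x_i)
    return n * math.log(lam) - lam * s

grid = [i / 1000 for i in range(1, 5001)]  # lambda in (0, 5]
lam_hat = mle_grid(exp_log_lik, grid)
print(lam_hat)  # agrees with the closed form 1 / x-bar = 1.0
```

Grid search is only a teaching stand-in; the point is that the numeric maximizer lands exactly where the calculus says it should.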

3) Canonical examples (worked like the book)

3.1 Bernoulli(\(p\)) — estimating a proportion \(p\)

\(x_i\in\{0,1\}\). Likelihood: \[ L(p)=\prod_{i=1}^n p^{x_i}(1-p)^{1-x_i} \quad\Longrightarrow\quad \ell(p)=\Big(\sum x_i\Big)\ln p+\Big(n-\sum x_i\Big)\ln(1-p). \] Differentiate and set to 0: \[ \frac{d\ell}{dp}=\frac{\sum x_i}{p}-\frac{n-\sum x_i}{1-p}=0 \ \Longrightarrow\ \hat p=\frac{1}{n}\sum_{i=1}^n x_i\quad(\text{the sample proportion}). \]

Result: \(\hat p=\bar x\).

Interpretation: if you code success=1 and failure=0, MLE just equals the sample average of 0/1’s.
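As a quick sanity check (illustrative code with a made-up 0/1 sample), the closed form \(\hat p=\bar x\) can be verified directly against the log-likelihood:

```python
import math

data = [1, 0, 1, 1, 0, 1, 0, 1, 1, 1]  # 7 successes in 10 trials
n, s = len(data), sum(data)

p_hat = s / n  # closed-form MLE: the sample proportion, 0.7

def bern_log_lik(p):
    # ell(p) = (sum x_i) ln p + (n - sum x_i) ln(1 - p)
    return s * math.log(p) + (n - s) * math.log(1 - p)

# The sample proportion beats nearby candidate values of p
assert bern_log_lik(p_hat) > bern_log_lik(0.6)
assert bern_log_lik(p_hat) > bern_log_lik(0.8)
```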

3.2 Exponential(\(\lambda\)) — estimating a rate \(\lambda\)

Density \(f(x;\lambda)=\lambda e^{-\lambda x}\) for \(x\ge0\). Likelihood: \[ L(\lambda)=\lambda^n \exp\!\Big(-\lambda\sum x_i\Big) \quad\Longrightarrow\quad \ell(\lambda)=n\ln\lambda-\lambda\sum x_i. \] Differentiate: \(\dfrac{d\ell}{d\lambda}=\dfrac{n}{\lambda}-\sum x_i=0\Rightarrow \hat\lambda=\dfrac{n}{\sum x_i}=\dfrac{1}{\bar x}.\)

Result: \(\hat\lambda=1/\bar x\).
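A quick simulation check (illustrative only, using Python's standard library; the true rate 2.0 is an arbitrary choice): with a large sample, \(1/\bar x\) should land close to the rate that generated the data.

```python
import random

random.seed(42)  # reproducible draw
true_rate = 2.0
data = [random.expovariate(true_rate) for _ in range(10_000)]

lam_hat = len(data) / sum(data)  # MLE: 1 / x-bar
print(round(lam_hat, 2))  # close to the true rate 2.0
```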

3.3 Normal(\(\mu,\sigma^2\)) — estimating mean and variance

For i.i.d. normal \(X_i\sim \mathcal N(\mu,\sigma^2)\), the log-likelihood is \[ \ell(\mu,\sigma^2)=-\frac{n}{2}\ln(2\pi\sigma^2)-\frac{1}{2\sigma^2}\sum_{i=1}^n (x_i-\mu)^2. \] Partial derivatives give \(\ \hat\mu=\bar x,\quad \hat\sigma^2=\dfrac{1}{n}\sum (x_i-\bar x)^2\) (note the \(1/n\) rather than \(1/(n-1)\)).

Note: \(\hat\sigma^2\) (MLE) is slightly biased low; the usual unbiased sample variance uses \(1/(n-1)\). Bias \(\to 0\) as \(n\) grows.
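The \(1/n\) versus \(1/(n-1)\) distinction is easy to see in code (a small illustration; the data are made up):

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(data)

mu_hat = statistics.mean(data)  # MLE of mu: the sample mean, 5.0
sigma2_mle = sum((x - mu_hat) ** 2 for x in data) / n  # 1/n divisor
s2_unbiased = statistics.variance(data)                # 1/(n-1) divisor

print(sigma2_mle, s2_unbiased)  # the MLE is the smaller of the two
```

The stdlib's `statistics.pvariance` computes the same \(1/n\) quantity directly.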

4) Properties you should remember (large \(n\))

Under mild regularity conditions, for large \(n\) the MLE is:

  1. Approximately unbiased (and consistent: \(\hat\theta\to\theta\)).
  2. Approximately efficient (variance near the smallest achievable).
  3. Approximately normally distributed.

That’s why MLEs are so popular. (Also: \(\hat\sigma^2\) for the Normal is biased, but the bias \(\to 0\) as \(n\) increases.)

5) Invariance property (super useful)

If \(\hat\theta\) is MLE for \(\theta\) and you want \(h(\theta)\), then the MLE is just \(h(\hat\theta)\). Example: for Normal, \(\hat\sigma=\sqrt{\hat\sigma^2}=\sqrt{\frac{1}{n}\sum (x_i-\bar x)^2}\) (note: not the sample SD with \(n-1\)).
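In code, invariance is just function application: compute \(\hat\sigma^2\), then take the square root (illustrative sketch with a made-up sample):

```python
import math

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up sample
n = len(data)
xbar = sum(data) / n

sigma2_hat = sum((x - xbar) ** 2 for x in data) / n  # MLE of sigma^2
sigma_hat = math.sqrt(sigma2_hat)                    # invariance: MLE of sigma

print(sigma_hat)  # sqrt of the 1/n variance, not of the 1/(n-1) version
```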

6) Why this explains OLS in Chapter 11

Under the usual linear model \(Y_i=\beta_0+\beta_1 X_i+\varepsilon_i\) with \(\varepsilon_i\sim\mathcal N(0,\sigma^2)\), the log-likelihood has the same \(-\frac{1}{2\sigma^2}\sum (Y_i-\beta_0-\beta_1X_i)^2\) structure as the Normal case above. Maximizing it in \(\beta_0,\beta_1\) is therefore the same as minimizing SSE, so OLS = MLE under Normal errors.
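A small numeric check of this equivalence (illustrative data, not from the book): the closed-form OLS coefficients minimize SSE, and therefore maximize the Normal log-likelihood, since the log-likelihood depends on \((\beta_0,\beta_1)\) only through \(-\mathrm{SSE}/(2\sigma^2)\).

```python
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]  # made-up roughly linear data

xbar = sum(xs) / len(xs)
ybar = sum(ys) / len(ys)

# Closed-form OLS slope and intercept (= MLE under Normal errors)
b1 = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) \
     / sum((x - xbar) ** 2 for x in xs)
b0 = ybar - b1 * xbar

def sse(a, b):
    # The quantity the Normal log-likelihood penalizes
    return sum((y - a - b * x) ** 2 for x, y in zip(xs, ys))

# Perturbing the coefficients can only increase SSE
assert sse(b0, b1) < sse(b0 + 0.1, b1)
assert sse(b0, b1) < sse(b0, b1 + 0.1)
```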

7) One-page checklist (students can memorize)

  1. Write \(L=\prod f(x_i;\theta)\); take \(\ell=\ln L\).
  2. Differentiate, set \(=0\), solve \(\to\ \hat\theta\).
  3. Large \(n\): MLE ≈ unbiased, efficient, normal.
  4. Need a function of parameters? Use invariance: \(h(\hat\theta)\).
  5. Normal errors model \(\Rightarrow\) OLS is MLE.