Session 7.4.2 — Maximum Likelihood Estimation (MLE)
Math version that mirrors the textbook, with clear steps and just enough calculus.
1) Formal definition (as in the book)
We observe a random sample \(x_1,\dots,x_n\) from a distribution with density/pmf \(f(x;\theta)\). The likelihood is
$$L(\theta)=\prod_{i=1}^n f(x_i;\theta).$$
The maximum likelihood estimator (MLE) \(\hat\theta\) is the value of \(\theta\) that maximizes \(L(\theta)\) (equivalently, maximizes \(\ln L(\theta)\)).
2) Simple recipe you can always follow
- Write the likelihood \(L(\theta)=\prod f(x_i;\theta)\).
- Take logs: \(\ell(\theta)=\ln L(\theta)\) (sums are easier than products).
- Differentiate: set \(\dfrac{d\ell(\theta)}{d\theta}=0\) (or partials for multi-parameter cases).
- Solve for \(\hat\theta\). Check it’s a max (usually obvious from context/shape). A small symbolic sketch of this recipe follows after the list.
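The recipe can also be carried out symbolically. Below is a minimal sketch (assuming SymPy is available; it is not part of the book), illustrated on the Exponential(\(\lambda\)) model from example 3.2: write \(\ell\), differentiate, set to zero, solve.

```python
# Minimal symbolic sketch of the recipe (assumes SymPy; not part of the textbook).
# Illustrated on the Exponential(lambda) model from example 3.2.
import sympy as sp

lam, n, S = sp.symbols("lambda n S", positive=True)  # S plays the role of sum(x_i)
ell = n * sp.log(lam) - lam * S                      # log-likelihood l(lambda)
mle = sp.solve(sp.Eq(sp.diff(ell, lam), 0), lam)     # differentiate, set = 0, solve
print(mle)                                           # [n/S], i.e. lambda_hat = n / sum(x_i)
```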
3) Canonical examples (worked like the book)
3.1 Bernoulli(\(p\)) — estimating a proportion \(p\)
\(x_i\in\{0,1\}\). Likelihood: \[ L(p)=\prod_{i=1}^n p^{x_i}(1-p)^{1-x_i} \quad\Longrightarrow\quad \ell(p)=\Big(\sum x_i\Big)\ln p+\Big(n-\sum x_i\Big)\ln(1-p). \] Differentiate and set to 0: \[ \frac{d\ell}{dp}=\frac{\sum x_i}{p}-\frac{n-\sum x_i}{1-p}=0 \ \Longrightarrow\ \hat p=\frac{1}{n}\sum_{i=1}^n x_i\quad(\text{the sample proportion}). \]
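As a quick numerical check (a minimal sketch assuming NumPy; the 0/1 sample is made up), the log-likelihood evaluated on a grid of \(p\) values peaks at the sample proportion:

```python
# Quick numerical check of the Bernoulli result (assumes NumPy; hypothetical data):
# the grid maximizer of l(p) matches the sample proportion.
import numpy as np

x = np.array([1, 0, 1, 1, 0, 1, 0, 1])   # hypothetical 0/1 sample
p_hat = x.mean()                          # sample proportion = MLE

def log_lik(p):
    return x.sum() * np.log(p) + (len(x) - x.sum()) * np.log(1 - p)

grid = np.linspace(0.01, 0.99, 99)
print(p_hat, grid[np.argmax(log_lik(grid))])   # both near 0.625
```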
3.2 Exponential(\(\lambda\)) — estimating a rate \(\lambda\)
Density \(f(x;\lambda)=\lambda e^{-\lambda x}\) for \(x\ge0\). Likelihood: \[ L(\lambda)=\lambda^n \exp\!\Big(-\lambda\sum x_i\Big) \quad\Longrightarrow\quad \ell(\lambda)=n\ln\lambda-\lambda\sum x_i. \] Differentiate: \(\dfrac{d\ell}{d\lambda}=\dfrac{n}{\lambda}-\sum x_i=0\Rightarrow \hat\lambda=\dfrac{n}{\sum x_i}=\dfrac{1}{\bar x}.\)
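A quick numerical check (a sketch assuming NumPy/SciPy; the waiting-time data are hypothetical): maximizing \(\ell(\lambda)\) directly gives the same answer as \(1/\bar x\).

```python
# Numerical check of lambda_hat = 1 / x_bar (assumes NumPy/SciPy; hypothetical data).
import numpy as np
from scipy.optimize import minimize_scalar

x = np.array([0.5, 2.1, 1.3, 0.7, 3.0])        # hypothetical waiting times
neg_ll = lambda lam: -(len(x) * np.log(lam) - lam * x.sum())
lam_numeric = minimize_scalar(neg_ll, bounds=(1e-6, 50.0), method="bounded").x
print(1 / x.mean(), lam_numeric)               # the two agree
```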
3.3 Normal(\(\mu,\sigma^2\)) — estimating mean and variance
For i.i.d. normal \(X_i\sim \mathcal N(\mu,\sigma^2)\), the log-likelihood is \[ \ell(\mu,\sigma^2)=-\frac{n}{2}\ln(2\pi\sigma^2)-\frac{1}{2\sigma^2}\sum_{i=1}^n (x_i-\mu)^2. \] Partial derivatives give \(\ \hat\mu=\bar x,\quad \hat\sigma^2=\dfrac{1}{n}\sum (x_i-\bar x)^2\) (note the \(1/n\) rather than \(1/(n-1)\)).
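For a numeric sanity check (a sketch assuming NumPy/SciPy; the data are hypothetical): `scipy.stats.norm.fit` fits by maximum likelihood, so it reproduces \(\bar x\) and the \(1/n\)-denominator variance.

```python
# Check of the Normal MLEs (assumes NumPy/SciPy; hypothetical data).
import numpy as np
from scipy.stats import norm

x = np.array([4.1, 5.3, 3.8, 6.0, 5.1, 4.7])
mu_hat = x.mean()
sigma2_hat = ((x - mu_hat) ** 2).mean()        # 1/n denominator, not 1/(n-1)

mu_fit, sigma_fit = norm.fit(x)                # scipy's fit is maximum likelihood
print(mu_hat, mu_fit)                          # equal
print(np.sqrt(sigma2_hat), sigma_fit)          # equal; both use the 1/n form
```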
4) Properties you should remember (large \(n\))
- Approx. unbiased: \(E(\hat\theta)\approx \theta\).
- Efficient: for large \(n\), its variance is about as small as any estimator can achieve (it approaches the Cramér–Rao lower bound).
- Approx. normal: \(\hat\theta\) is roughly normal for large \(n\). A small simulation after this list illustrates all three points.
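A small simulation sketch (assuming NumPy; the Exponential(\(\lambda=2\)) setup is only for illustration) for \(\hat\lambda=1/\bar x\): the estimates center near \(\lambda\), their spread is close to the large-\(n\) benchmark \(\lambda/\sqrt{n}\), and about 95% fall within 1.96 standard deviations of the mean, as a normal shape predicts.

```python
# Simulation sketch of the large-n properties of the MLE (assumes NumPy).
import numpy as np

rng = np.random.default_rng(0)
lam, n, reps = 2.0, 200, 5000
samples = rng.exponential(scale=1 / lam, size=(reps, n))
lam_hats = 1 / samples.mean(axis=1)                       # MLE in each replication

print(lam_hats.mean())                                    # ~ 2.0  (approx. unbiased)
print(lam_hats.std(), lam / np.sqrt(n))                   # ~ lambda / sqrt(n)  (efficiency benchmark)
print(np.mean(np.abs(lam_hats - lam_hats.mean()) < 1.96 * lam_hats.std()))  # ~ 0.95 (approx. normal)
```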
5) Invariance property (super useful)
If \(\hat\theta\) is MLE for \(\theta\) and you want \(h(\theta)\), then the MLE is just \(h(\hat\theta)\). Example: for Normal, \(\hat\sigma=\sqrt{\hat\sigma^2}=\sqrt{\frac{1}{n}\sum (x_i-\bar x)^2}\) (note: not the sample SD with \(n-1\)).
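A one-line numeric check (a sketch assuming NumPy; data hypothetical): `np.std` uses the \(1/n\) denominator by default, so it matches \(\sqrt{\hat\sigma^2}\), while `ddof=1` gives the larger sample SD.

```python
# Invariance check: MLE of sigma = sqrt of the MLE of sigma^2 (assumes NumPy).
import numpy as np

x = np.array([4.1, 5.3, 3.8, 6.0, 5.1])      # hypothetical data
sigma2_hat = ((x - x.mean()) ** 2).mean()    # MLE of sigma^2 (1/n denominator)
print(np.sqrt(sigma2_hat), np.std(x), np.std(x, ddof=1))  # first two match; third is larger
```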
6) Why this explains OLS in Chapter 11
Under the usual linear model \(Y_i=\beta_0+\beta_1 X_i+\varepsilon_i\) with \(\varepsilon_i\sim\mathcal N(0,\sigma^2)\), the log-likelihood has the same \(-\frac{1}{2\sigma^2}\sum (Y_i-\beta_0-\beta_1X_i)^2\) structure as the Normal case above. Maximizing it in \(\beta_0,\beta_1\) is therefore the same as minimizing SSE, so OLS = MLE under Normal errors.
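A minimal sketch (assuming NumPy/SciPy; the simulated data and starting values are arbitrary) confirming the point: the least-squares coefficients and the maximizers of the Normal log-likelihood coincide.

```python
# "OLS = MLE under Normal errors": compare least squares with direct likelihood
# maximization (assumes NumPy/SciPy; simulated data for illustration only).
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 50)
y = 1.5 + 0.8 * x + rng.normal(scale=2.0, size=x.size)

b1_ols, b0_ols = np.polyfit(x, y, 1)           # least squares (slope, intercept)

def neg_log_lik(theta):
    b0, b1, log_sigma = theta
    sigma2 = np.exp(2 * log_sigma)
    resid = y - b0 - b1 * x
    return 0.5 * len(y) * np.log(2 * np.pi * sigma2) + resid @ resid / (2 * sigma2)

b0_mle, b1_mle, _ = minimize(neg_log_lik, x0=[0.0, 0.0, 1.0]).x
print((b0_ols, b1_ols), (b0_mle, b1_mle))      # the two pairs agree closely
```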
7) One-page checklist (students can memorize)
- Write \(L=\prod f(x_i;\theta)\); take \(\ell=\ln L\).
- Differentiate, set \(=0\), solve \(\to\ \hat\theta\).
- Large \(n\): MLE ≈ unbiased, efficient, normal.
- Need a function of parameters? Use invariance: \(h(\hat\theta)\).
- Normal errors model \(\Rightarrow\) OLS is MLE.