Session 7.4.2 — Maximum Likelihood Estimation (MLE)

Math version that mirrors the textbook, with clear steps and just enough calculus.

1) Formal definition (as in the book)

We observe a random sample \(x_1,\dots,x_n\) from a distribution with density/pmf \(f(x;\theta)\). The likelihood is

$$L(\theta)=\prod_{i=1}^n f(x_i;\theta).$$

The maximum likelihood estimator (MLE) \(\hat\theta\) is the value of \(\theta\) that maximizes \(L(\theta)\) (equivalently, maximizes \(\ln L(\theta)\)).

Intuition (discrete case): it picks the parameter that makes the observed sample most probable.

2) Simple recipe you can always follow

  1. Write the likelihood \(L(\theta)=\prod f(x_i;\theta)\).
  2. Take logs: \(\ell(\theta)=\ln L(\theta)\) (sums are easier than products).
  3. Differentiate: set \(\dfrac{d\ell(\theta)}{d\theta}=0\) (or partials for multi-parameter cases).
  4. Solve for \(\hat\theta\). Check it’s a max (usually obvious from context/shape).
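The four-step recipe can also be followed numerically. As a minimal illustrative sketch (not from the book; the sample values are made up), instead of solving \(d\ell/d\theta=0\) by hand, evaluate \(\ell(\theta)\) on a fine grid and pick the maximizer. Here the recipe is applied to a small exponential sample:

```python
import math

def mle_grid(log_lik, grid):
    # Steps 2-4 of the recipe, done numerically: scan ell(theta)
    # over a grid of candidate values and return the maximizer.
    return max(grid, key=log_lik)

# Made-up exponential sample (its mean is exactly 1.0)
data = [0.5, 1.2, 0.3, 2.0]
n, s = len(data), sum(data)

def exp_log_lik(lam):
    # ell(lambda) = n ln(lambda) - lambda * sum(x_i)
    return n * math.log(lam) - lam * s

grid = [i / 1000 for i in range(1, 5001)]  # lambda in (0, 5]
lam_hat = mle_grid(exp_log_lik, grid)
print(lam_hat)  # agrees with the closed form 1 / x-bar = 1.0
```

Grid search is only a teaching stand-in; the point is that the numeric maximizer lands exactly where the calculus says it should.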

3) Canonical examples (worked like the book)

3.1 Bernoulli(\(p\)) — estimating a proportion \(p\)

\(x_i\in\{0,1\}\). Likelihood: \[ L(p)=\prod_{i=1}^n p^{x_i}(1-p)^{1-x_i} \quad\Longrightarrow\quad \ell(p)=\Big(\sum x_i\Big)\ln p+\Big(n-\sum x_i\Big)\ln(1-p). \] Differentiate and set to 0: \[ \frac{d\ell}{dp}=\frac{\sum x_i}{p}-\frac{n-\sum x_i}{1-p}=0 \ \Longrightarrow\ \hat p=\frac{1}{n}\sum_{i=1}^n x_i\quad(\text{the sample proportion}). \]

Result: \(\hat p=\bar x\).

Interpretation: if you code success=1 and failure=0, MLE just equals the sample average of 0/1’s.
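As a quick sanity check (illustrative code with a made-up 0/1 sample), the closed form \(\hat p=\bar x\) can be verified directly against the log-likelihood:

```python
import math

data = [1, 0, 1, 1, 0, 1, 0, 1, 1, 1]  # 7 successes in 10 trials
n, s = len(data), sum(data)

p_hat = s / n  # closed-form MLE: the sample proportion, 0.7

def bern_log_lik(p):
    # ell(p) = (sum x_i) ln p + (n - sum x_i) ln(1 - p)
    return s * math.log(p) + (n - s) * math.log(1 - p)

# The sample proportion beats nearby candidate values of p
assert bern_log_lik(p_hat) > bern_log_lik(0.6)
assert bern_log_lik(p_hat) > bern_log_lik(0.8)
```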

3.2 Exponential(\(\lambda\)) — estimating a rate \(\lambda\)

Density \(f(x;\lambda)=\lambda e^{-\lambda x}\) for \(x\ge0\). Likelihood: \[ L(\lambda)=\lambda^n \exp\!\Big(-\lambda\sum x_i\Big) \quad\Longrightarrow\quad \ell(\lambda)=n\ln\lambda-\lambda\sum x_i. \] Differentiate: \(\dfrac{d\ell}{d\lambda}=\dfrac{n}{\lambda}-\sum x_i=0\Rightarrow \hat\lambda=\dfrac{n}{\sum x_i}=\dfrac{1}{\bar x}.\)

Result: \(\hat\lambda=1/\bar x\).
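A quick simulation check (illustrative only, using Python's standard library; the true rate 2.0 is an arbitrary choice): with a large sample, \(1/\bar x\) should land close to the rate that generated the data.

```python
import random

random.seed(42)  # reproducible draw
true_rate = 2.0
data = [random.expovariate(true_rate) for _ in range(10_000)]

lam_hat = len(data) / sum(data)  # MLE: 1 / x-bar
print(round(lam_hat, 2))  # close to the true rate 2.0
```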

3.3 Normal(\(\mu,\sigma^2\)) — estimating mean and variance

For i.i.d. normal \(X_i\sim \mathcal N(\mu,\sigma^2)\), the log-likelihood is \[ \ell(\mu,\sigma^2)=-\frac{n}{2}\ln(2\pi\sigma^2)-\frac{1}{2\sigma^2}\sum_{i=1}^n (x_i-\mu)^2. \] Partial derivatives give \(\ \hat\mu=\bar x,\quad \hat\sigma^2=\dfrac{1}{n}\sum (x_i-\bar x)^2\) (note the \(1/n\) rather than \(1/(n-1)\)).

Note: \(\hat\sigma^2\) (MLE) is slightly biased low; the usual unbiased sample variance uses \(1/(n-1)\). Bias \(\to 0\) as \(n\) grows.
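The \(1/n\) versus \(1/(n-1)\) distinction is easy to see in code (a small illustration; the data are made up):

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(data)

mu_hat = statistics.mean(data)  # MLE of mu: the sample mean, 5.0
sigma2_mle = sum((x - mu_hat) ** 2 for x in data) / n  # 1/n divisor
s2_unbiased = statistics.variance(data)                # 1/(n-1) divisor

print(sigma2_mle, s2_unbiased)  # the MLE is the smaller of the two
```

The stdlib's `statistics.pvariance` computes the same \(1/n\) quantity directly.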

4) Properties you should remember (large \(n\))

Under mild regularity conditions, for large \(n\) the MLE is:

  1. Approximately unbiased (and consistent: \(\hat\theta\to\theta\)).
  2. Approximately efficient (variance near the smallest achievable).
  3. Approximately normally distributed.

That’s why MLEs are so popular. (Also: \(\hat\sigma^2\) for the Normal is biased, but the bias \(\to 0\) as \(n\) increases.)

5) Invariance property (super useful)

If \(\hat\theta\) is MLE for \(\theta\) and you want \(h(\theta)\), then the MLE is just \(h(\hat\theta)\). Example: for Normal, \(\hat\sigma=\sqrt{\hat\sigma^2}=\sqrt{\frac{1}{n}\sum (x_i-\bar x)^2}\) (note: not the sample SD with \(n-1\)).
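In code, invariance is just function application: compute \(\hat\sigma^2\), then take the square root (illustrative sketch with a made-up sample):

```python
import math

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up sample
n = len(data)
xbar = sum(data) / n

sigma2_hat = sum((x - xbar) ** 2 for x in data) / n  # MLE of sigma^2
sigma_hat = math.sqrt(sigma2_hat)                    # invariance: MLE of sigma

print(sigma_hat)  # sqrt of the 1/n variance, not of the 1/(n-1) version
```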

6) Why this explains OLS in Chapter 11

Under the usual linear model \(Y_i=\beta_0+\beta_1 X_i+\varepsilon_i\) with \(\varepsilon_i\sim\mathcal N(0,\sigma^2)\), the log-likelihood has the same \(-\frac{1}{2\sigma^2}\sum (Y_i-\beta_0-\beta_1X_i)^2\) structure as the Normal case above. Maximizing it in \(\beta_0,\beta_1\) is therefore the same as minimizing SSE, so OLS = MLE under Normal errors.
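A small numeric check of this equivalence (illustrative data, not from the book): the closed-form OLS coefficients minimize SSE, and therefore maximize the Normal log-likelihood, since the log-likelihood depends on \((\beta_0,\beta_1)\) only through \(-\mathrm{SSE}/(2\sigma^2)\).

```python
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]  # made-up roughly linear data

xbar = sum(xs) / len(xs)
ybar = sum(ys) / len(ys)

# Closed-form OLS slope and intercept (= MLE under Normal errors)
b1 = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) \
     / sum((x - xbar) ** 2 for x in xs)
b0 = ybar - b1 * xbar

def sse(a, b):
    # The quantity the Normal log-likelihood penalizes
    return sum((y - a - b * x) ** 2 for x, y in zip(xs, ys))

# Perturbing the coefficients can only increase SSE
assert sse(b0, b1) < sse(b0 + 0.1, b1)
assert sse(b0, b1) < sse(b0, b1 + 0.1)
```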

7) One-page checklist (students can memorize)

  1. Write \(L=\prod f(x_i;\theta)\); take \(\ell=\ln L\).
  2. Differentiate, set \(=0\), solve \(\to\ \hat\theta\).
  3. Large \(n\): MLE ≈ unbiased, efficient, normal.
  4. Need a function of parameters? Use invariance: \(h(\hat\theta)\).
  5. Normal errors model \(\Rightarrow\) OLS is MLE.