📖 Maximum likelihood estimation (MLE)#


References

Essentially, these notes follow chapters 12 and 13 in the advanced econometrics textbook by Wooldridge [2010]. They are based on lecture notes by Bertel Schjerning, University of Copenhagen.

A Nonlinear Regression Model#

Before we begin our study of the general class of estimators known as M-estimators, we introduce the nonlinear regression model as an example that motivates the general approach.

We are interested in modelling the conditional mean of some random variable \(y\) given explanatory variables \(x\).

Assume:

\[ E[y \mid x] = m(x;\theta_0) \tag{1} \]
  • \(y\) : Random variable

  • \(x\) : \(k\)-vector of explanatory variables

  • \(m(x;\theta)\) : Nonlinear parametric model for \(E(y \mid x)\)

    • \(m(\cdot)\) is known up to a set of parameters \(\theta\)

  • \(\theta\) : \(P \times 1\) vector that indexes the model for \(E[y \mid x]\)

    • \(\theta \in \Theta \subset \mathbb{R}^P\) where \(\Theta\) is the parameter space

Correct specification

If (1) holds for some \(\theta_0 \in \Theta\), we say that the model is correctly specified.

\(\theta_0\) is referred to as the true parameter vector.

Equivalent way of writing equation (1):

\[ y = m(x;\theta_0) + u, \qquad E[u \mid x] = 0 \tag{2} \]

We can refer to (2) as the error formulation of the model.

Note that \(E[u \mid x] = 0\) is not an additional assumption; it is simply a consequence of equation (1).

To see this, define the error term as \(u \equiv y - m(x;\theta_0)\). Then, taking conditional expectations:

\[ E[u \mid x] = E[y \mid x] - E[m(x;\theta_0) \mid x] = E[y \mid x] - m(x;\theta_0) = 0 \]

Remark 1
\(E(u \mid x) = 0\) does not mean that \(u\) and \(x\) are independent.

  • Perhaps \(\operatorname{Var}(y \mid x) \neq \operatorname{Var}(y \mid x')\) for \(x \neq x'\), i.e. heteroskedasticity is possible

  • Perhaps \(y\) is restricted to be positive, \(y \ge 0\), implying that \(u \ge -m(x;\theta_0)\), i.e. possible values of \(u\) depend on \(x\).

  • Other features of the conditional distribution of \(y\) given \(x\), such as higher conditional moments of \(y\), could depend on \(x\) too. If, for example, the skewness of \(y \mid x\) depends on \(x\), then \(u\) and \(x\) are dependent.

Remark 2
The error formulation in equation (2) results in a semi-parametric model for \(y\), since we have not specified a parametric distribution for \(u\). But \(m(x;\theta_0)\) is still a parametric model for \(E[y \mid x]\).

Identification#

Empirical analysis includes the following three steps:

  1. Identification of \(\theta_0\)

  2. Estimation of \(\theta_0\)

  3. Inference and hypothesis testing

Identification step
Would it be possible to find estimates of \(\theta_0\) if we knew the nature of the data available in the population?

This is a purely analytical exercise that has to be proven – does not involve any data or statistical analysis.

Estimation step

Find estimates of population distribution given the finite sample at hand. This is statistical analysis.

Inference
Given an estimated model we can make inferences and test relevant hypotheses.

  • How precise are our estimates?

  • Test hypotheses about parameters and functions of parameters.


The basis for identification in the nonlinear regression model is the following minimization problem:

\[ \theta_0 = \arg\min_{\theta \in \Theta} E[(y - m(x;\theta))^2] \tag{3} \]
  • \(\theta_0\) minimizes the expected squared error

  • the objective function is called the population criterion function as it is written for random variables \(x\) and \(y\) rather than some finite sample of observed data

  • the expectation is taken over the joint distribution of \((y,x)\)

Let’s check if \(\theta_0\) defined by (1) also solves (3)

\[ (y - m(x;\theta))^2 = ([y - m(x;\theta_0)] + [m(x;\theta_0) - m(x;\theta)])^2 \]
\[ = (y - m(x;\theta_0))^2 + 2(m(x;\theta_0) - m(x;\theta))u + (m(x;\theta_0) - m(x;\theta))^2, \]

where \(u \equiv y - m(x;\theta_0)\). Taking expectations and using \(E(u \mid x) = 0\) we have

\[ E[(y - m(x;\theta))^2] = E[(y - m(x;\theta_0))^2] + E[(m(x;\theta_0) - m(x;\theta))^2] \]

Finally, since \(E[(m(x;\theta_0) - m(x;\theta))^2] \ge 0\), we have for every \(\theta \in \Theta\):

\[ E[(y - m(x;\theta))^2] \ge E[(y - m(x;\theta_0))^2] \]

Thus, \(\theta_0\) solves the population problem (3).

What about uniqueness?

Identification

If \(\theta_0\) is a unique minimizer in (3), that is

\[ E[(m(x;\theta_0) - m(x;\theta))^2] > 0 \quad \text{for all } \theta \in \Theta,\ \theta \neq \theta_0 \]

we say that \(\theta_0\) is identifiable, or simply identified.

Note once more that identification is a property of the population problem (3) and does not depend on any data.

Example of lack of identification in NLS

Suppose that

\[ m(x;\theta) = \theta^{(1)} + \theta^{(2)} x^2 + \theta^{(3)} \exp(x^{\theta^{(4)}}), \]

but in the true model the exponential term is absent, i.e. \(\theta^{(3)}_{0} = 0\).

Then, for any value of \(\theta^{(4)}\), the population criterion is minimized when the other parameters are set to their true values.

The model is not identified: \(\theta^{(4)}\) cannot be pinned down.

Example of lack of identification in OLS

Consider the simple linear regression model \(m(x;\theta) = x'\theta\).

Imagine that some regressors in \(x\) are linearly dependent, or perfectly collinear, and in the extreme case there are two regressors which are exactly the same.

In this case, the true coefficients in \(\theta_0\) on these regressors can be changed arbitrarily as long as their sum remains the same (i.e., the combined effect of the two regressors on \(m(x;\theta)\) is unchanged). Thus, there exist multiple values of \(\theta\) that minimize the population criterion (3).

The model is not identified.

Estimation#

Assume we have a random sample \(\{y_i, x_i\}_{i=1,\ldots,N}\).

Relying on the analogy principle, replace the population criterion \(E[(y - m(x;\theta))^2]\) with its sample analogue to build an estimator:

\[ \frac{1}{N} \sum_{i=1}^N (y_i - m(x_i;\theta))^2 \]

We have the Nonlinear Least Squares (NLS) estimator:

\[ \hat{\theta}_{NLS} = \arg\min_{\theta \in \Theta} \frac{1}{N} \sum_{i=1}^N (y_i - m(x_i;\theta))^2 \]
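For concreteness, here is a minimal numerical sketch of computing \(\hat{\theta}_{NLS}\) by minimizing the sample criterion with a generic optimizer; the exponential mean function, the simulated data, and the function names are hypothetical choices made only for illustration.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Hypothetical data generated from an exponential conditional mean
N = 500
x = rng.normal(size=N)
theta_true = np.array([0.5, 0.5])
y = np.exp(theta_true[0] + theta_true[1] * x) + rng.normal(scale=0.3, size=N)

def m(x, theta):
    """Nonlinear parametric model for E[y | x]."""
    return np.exp(theta[0] + theta[1] * x)

def sample_criterion(theta, y, x):
    """Sample analogue of E[(y - m(x; theta))^2]."""
    return np.mean((y - m(x, theta)) ** 2)

# NLS estimator: the minimizer of the sample criterion over theta
res = minimize(sample_criterion, x0=np.zeros(2), args=(y, x), method="BFGS")
print("NLS estimate:", res.x)
```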

It turns out that \(\hat{\theta}_{NLS}\) is a consistent estimator of \(\theta_0\) if

  • \(\theta_0\) is identified

  • the sample criterion converges uniformly to its expectation

Let’s look at the notions of consistency and uniform convergence in the more abstract terms of M-estimators.

M-estimation#

M stands for minimization or maximization: M-estimators minimize or maximize a sample criterion function.

Let’s adopt the following notation:

  • \(w = (y,x)\) denotes the data vector which includes endogenous \(y\) and exogenous \(x\)

  • \(w_i\) denotes the \(i\)th observation of \(w\) in the random sample from a population

  • \(q(w;\theta)\) is a known criterion function of the data and parameters

Definition: M-estimators

Given a random sample \(\{w_i\}_{i=1,\ldots,N}\) from a population, an M-estimator is defined as the minimizer \(\hat{\theta}\) of the sample analogue

\[ \hat{\theta} = \arg\min_{\theta \in \Theta} \frac{1}{N} \sum_{i=1}^N q(w_i;\theta) \]

of the population problem

\[ \theta_0 = \arg\min_{\theta \in \Theta} E[q(w;\theta)] \]

where \(q(w;\theta)\) is a known parametric criterion function and \(\theta_0\) is the true value of the parameter.

M-estimators are a very general class of estimators that includes many well-known estimators as special cases.

The random sample \(\{w_i\}_{i=1,\ldots,N} = \{y_i, x_i\}_{i=1,\ldots,N}\) required in the definition accommodates many data structures: \(y_i\) can be a scalar or a vector, allowing for cross-section, panel, and multi-equation data.

Examples of M-estimators

  • Linear regression:

\[ q(y,x;\theta) = (y - x'\theta)^2 \]
  • Nonlinear regression:

\[ q(y,x;\theta) = (y - m(x;\theta))^2 \]
  • LAD:

\[ q(w;\theta) = |y - m(x;\theta)| \]
  • Maximum likelihood estimator: \(q(w;\theta)\) is the negative of the log-likelihood function

How to choose an M-estimator?#

There are many possible M-estimators for a given problem – many choices of \(q(w;\theta)\) criterion functions!

How to choose a “good” estimator?

Criteria include:

  1. Consistency: does the estimator recover the true parameter as sample size increases?

  2. Rate of convergence: how fast does a consistent estimator converge to the true parameter as the sample size increases?

  3. Known asymptotic distribution: can we derive the asymptotic distribution of the estimator for inference?

  4. Efficiency: how precise are the estimates relative to other estimators?

  5. Robustness: how sensitive is the estimator to violations of assumptions used to derive \(q(w;\theta)\)?

  6. Small-sample properties: how does the estimator perform in finite samples?

  7. Ease of use: how simple is it to understand, implement and compute the estimator?

How would you rank the selection criteria? Which one is most and least important?

We can study the properties of estimators against these criteria

  • theoretically, using asymptotic theory as the sample size \(N \to \infty\)

  • practically, using Monte Carlo simulations for finite samples of various sizes (see the sketch below)
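A minimal Monte Carlo sketch along these lines, assuming a hypothetical linear data-generating process with criterion \(q=(y-x\theta)^2\); the bias and the sampling standard deviation of the estimator should shrink as \(N\) grows.

```python
import numpy as np

rng = np.random.default_rng(1)
theta0 = 2.0  # true parameter of the hypothetical DGP

def simulate_estimator(N, reps=2000):
    """Monte Carlo draws of the M-estimator of theta in y = x * theta0 + u."""
    estimates = np.empty(reps)
    for r in range(reps):
        x = rng.normal(size=N)
        y = theta0 * x + rng.normal(size=N)
        estimates[r] = (x @ y) / (x @ x)  # minimizer of the sample criterion
    return estimates

for N in (25, 100, 400):
    est = simulate_estimator(N)
    print(f"N={N:4d}  bias={est.mean() - theta0:+.4f}  std={est.std():.4f}")
```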

Consistency of M-estimators#

Definition

A sequence of random variables \(\{X_n\}_{n\ge 1}\) converges in probability to a random variable \(X\), written \(X_n \xrightarrow{p} X\), if for every \(\varepsilon > 0\),

\[ \lim_{n \to \infty} \Pr\!\left( |X_n - X| > \varepsilon \right) = 0. \]

Recall that we are working with

  • population problem that the true value of parameter \(\theta_0\) solves

\[ \theta_0 = \arg\min_{\theta \in \Theta} E[q(w;\theta)] \]
  • sample problem that the M-estimator \(\hat{\theta}\) solves

\[ \hat{\theta} = \arg\min_{\theta \in \Theta} \frac{1}{N}\sum_{i=1}^N q(w_i;\theta) \]

Under weak regularity conditions (finite moments) the Law of Large Numbers states that the sample average converges in probability to its expected value:

\[ \frac{1}{N}\sum_{i=1}^N q(w_i;\theta) \xrightarrow{p} E[q(w;\theta)] \]

We wish to show consistency of the M-estimator, \(\hat{\theta} \xrightarrow{p} \theta_0\).

Using a diagram:

\[\begin{split} \begin{array}{ccc} \frac{1}{N}\sum_{i=1}^N q(w_i;\theta) & \xrightarrow{p} & E[q(w;\theta)] \\ \begin{array}{c} \downarrow \\ \arg\min_{\theta \in \Theta} \\ \downarrow \end{array} & & \begin{array}{c} \downarrow \\ \arg\min_{\theta \in \Theta} \\ \downarrow \end{array} \\ \hat{\theta} & {\xrightarrow{p}}^{???} & \theta_0 \end{array} \end{split}\]

Two requirements for consistency

  1. Identification of \(\theta_0\) – must be the unique solution to the population problem

Definition: identification

The model is identified if \(\theta_0\) is the unique minimizer of \(E[q(w;\theta)]\) over \(\theta \in \Theta\), i.e.

\[ E[q(w;\theta)] > E[q(w;\theta_0)] \text{ for all } \theta \in \Theta, \theta \neq \theta_0. \]
  2. Uniform convergence in probability of \(\frac{1}{N}\sum_{i=1}^N q(w_i;\theta)\) to its expected value, that is, convergence should hold not only pointwise for each fixed \(\theta\), but uniformly over all \(\theta \in \Theta\) in the worst case.

Definition: uniform convergence

A sequence of random functions \(\{g_N(\theta)\}\) converges uniformly in probability to a function \(g(\theta)\) on \(\Theta\) if for every \(\varepsilon > 0\)

\[ \max_{\theta \in \Theta} \left| g_N(\theta) - g(\theta) \right| \xrightarrow{p} 0 \]

In our case, we want uniform convergence of the sample criterion to its expectation:

\[ \max_{\theta \in \Theta} \left| \frac{1}{N}\sum_{i=1}^N q(w_i;\theta) - E[q(w;\theta)] \right| \xrightarrow{p} 0 \]

Sufficient conditions for uniform convergence

  1. \(q(w;\cdot)\) is continuous on \(\Theta\)

  2. Parameter space \(\Theta\) is a compact set (closed, bounded)

  3. Technical conditions

    • \(q(\cdot;\theta)\) is Borel measurable on \(\mathcal{W}\)

    • \(q(w;\theta)\) is bounded in absolute value

Collecting the above conditions, we have Theorem 12.2 in Wooldridge [2010]

Consistency of M-estimators

Under conditions of uniform convergence and provided that the model is identified, the M-estimator \(\hat{\theta}\) is consistent for \(\theta_0\):

\[ \hat{\theta} \xrightarrow{p} \theta_0 \]

Asymptotic distribution#

Let us now build the asymptotic theory of M-estimators.

We require some further assumptions.

Additional conditions

  1. \(\theta_0 \in \operatorname{int}(\Theta)\)

  2. \(q(w;\cdot)\) is twice continuously differentiable on \(\operatorname{int}(\Theta)\)

Note: the second additional condition implies condition 1 (continuity of \(q(w;\cdot)\)) required for uniform convergence.

Under these additional assumptions we can state the following (very central) theorem (Theorem 12.3 in Wooldridge [2010]):

Asymptotic Normality of M-estimators

Under the conditions for uniform convergence above and the two additional conditions just stated, and provided that the model parameters are identifiable, the M-estimator \(\hat{\theta}\) is asymptotically normal:

\[ \sqrt{N}(\hat{\theta}-\theta_0) \xrightarrow{d} \mathcal{N}\left(0, A_0^{-1} B_0 A_0^{-1}\right) \]

where

\[ A_0 \equiv E[H(w;\theta_0)] = E\!\left[\frac{\partial^2 q(w;\theta_0)}{\partial\theta\,\partial\theta'}\right] \]

and

\[ B_0 \equiv E\!\left[s(w;\theta_0)s(w;\theta_0)'\right] = \operatorname{Var}[s(w;\theta_0)] \]

with

\[ s(w;\theta) = \frac{\partial q(w;\theta)}{\partial\theta} \]

Recall that notation “\(\xrightarrow{d}\)” means convergence in distribution.

Definition: convergence in distribution

A sequence of random variables \(\{X_n\}_{n\ge 1}\) converges in distribution to a random variable \(X\), written \(X_n \xrightarrow{d} X\), if

\[ \lim_{n \to \infty} F_{X_n}(x) = F_X(x) \]

for all \(x\) at which the distribution function \(F_X(x)\) is continuous.

The quantities \(s(w;\theta) = \frac{\partial q(w;\theta)}{\partial\theta}\) are called the scores; they are \(P \times 1\) column vectors equal to the transposed gradients of the criterion function with respect to \(\theta\).

Matrices \(A_0\) and \(B_0\) are therefore both \(P \times P\).

Given the result on asymptotic normality, we can read off the asymptotic variance of M-estimate \(\hat{\theta}\) as

\[ \operatorname{Avar}(\hat{\theta}) = \frac{1}{N} A_0^{-1} B_0 A_0^{-1}. \]

and perform inference on \(\theta_0\) using the asymptotic distribution of \(\hat{\theta}\).

Proof of asymptotic normality of M-estimators#

Let’s sketch the proof of asymptotic normality.

First, recall the standard Central Limit Theorem (CLT) for iid random vectors:

Lindeberg–Lévy Central Limit Theorem

Let \(X_i\) be a sequence of iid random vectors with \(E[X_i]=\mu\) and \(\operatorname{Var}[X_i]=\Sigma\). Then, as \(N\to\infty\),

\[ \sqrt{N}\left(\frac{1}{N}\sum_{i=1}^N X_i - \mu\right) \xrightarrow{d} \mathcal{N}(0,\Sigma). \]
  1. Since \(q(w;\cdot)\) is twice continuously differentiable in \(\theta\), define the score vector

    \[ s(w;\theta) = \nabla_\theta q(w;\theta) \]

    with components

    \[ s_i(w;\theta) = \frac{\partial q(w;\theta)}{\partial \theta_i}, \quad i=1,\ldots,P, \]

    and the Hessian

    \[ H(w;\theta) = \frac{\partial^2 q(w;\theta)}{\partial\theta\,\partial\theta'} \quad (P\times P). \]
  2. Take a mean value expansion of \(s(w;\hat{\theta})\) about \(\theta_0\):

    \[ \sum_{i=1}^N s(w_i;\hat{\theta}) = \sum_{i=1}^N s(w_i;\theta_0) + \sum_{i=1}^N H(w_i;\theta^+)(\hat{\theta}-\theta_0), \]

    where \(\theta^+ \in [\theta_0,\hat{\theta}]\).

    Intuition (mean value theorem analogy):
    If a car travels 100 miles in one hour, its average speed is 100 mph. Even if its speed varies, at some point the car must travel at exactly the average speed. Similarly, there exists a \(\theta^+\) between \(\theta_0\) and \(\hat{\theta}\) at which the mean value expansion above holds exactly.

  3. Since \(\theta_0 \in \operatorname{int}(\Theta)\) and \(\hat{\theta}\xrightarrow{p}\theta_0\),

    \[ \Pr(\hat{\theta}\in\operatorname{int}(\Theta)) \to 1. \]

    Hence the first-order conditions for the sample problem hold with probability approaching one:

    \[ \Pr\left(\sum_{i=1}^N s(w_i;\hat{\theta})=0\right) \to 1. \]

    Therefore,

    \[ 0 = \sqrt{N}\frac{1}{N}\sum_{i=1}^N s(w_i;\theta_0) + \frac{1}{N}\sum_{i=1}^N H(w_i;\theta^+)\sqrt{N}(\hat{\theta}-\theta_0), \]

    implying

    \[ \sqrt{N}(\hat{\theta}-\theta_0) = -\left( \frac{1}{N}\sum_{i=1}^N H(w_i;\theta^+) \right)^{-1} \left( \frac{1}{\sqrt{N}}\sum_{i=1}^N s(w_i;\theta_0) \right). \]

    The first factor converges in probability to \(A_0^{-1}\) (since \(\theta^+\) lies between \(\theta_0\) and \(\hat{\theta}\) and \(\hat{\theta}\xrightarrow{p}\theta_0\)), while the second factor is a scaled iid sum with mean zero and variance \(B_0\).

  4. By the CLT and Slutsky's theorem, the RHS converges in distribution to \(\mathcal{N}(0,A_0^{-1}B_0A_0^{-1})\), proving the theorem.

Remarks:

  1. The M-estimator \(\hat{\theta}\) is asymptotically unbiased, that is, \(\sqrt{N}(\hat{\theta}-\theta_0)\) has zero mean asymptotically: \(\sqrt{N}\,E(\hat{\theta}-\theta_0) \to 0\) as \(N\to\infty\). This concept is related to consistency.

  2. \(\operatorname{Avar}(\hat{\theta})\), referred to as the asymptotic variance, converges to zero as \(N\to\infty\), yet \(\operatorname{Var}(\sqrt{N}(\hat{\theta}-\theta_0))\) converges to \(A_0^{-1}B_0A_0^{-1}\), which does not depend on \(N\). We use \(\operatorname{Avar}(\hat{\theta})\) for inference, i.e. to obtain standard errors, run tests, etc.

  3. \(\hat{\theta}\) is \(\sqrt{N}\)-consistent, like many other estimators, while some nonparametric estimators converge more slowly (more data is needed for the same precision).

  4. The asymptotic covariance depends on the choice of \(q(w;\cdot)\); we should therefore look for criterion functions (estimators) that deliver better precision.

  5. \(A_0\) must be positive definite. Otherwise we could, for example, get negative variances. Another example is when the objective function is flat around the minimum, i.e. \(\theta_0\) is not a unique minimizer. In that case \(A_0\) has zeros on the diagonal, is not positive definite, and has no inverse. This is a practical manifestation of non-identification.

Estimating the Asymptotic Variance of M-estimators#

To perform inference we would like to compute

\[ \operatorname{Avar}(\hat{\theta}) = \frac{1}{N} A_0^{-1} B_0 A_0^{-1}, \]

but \(A_0\) and \(B_0\) depend on \(\theta_0\), which is unknown!

We have to estimate \(A_0\) and \(B_0\) from sample data.

Estimator 1 (most structural)#

  • Solve analytically for \(H(w;\theta)\) and \(s(w;\theta)\)

  • Take expectations

  • Plug in \(\hat{\theta}\) for \(\theta_0\)

Problems

  • Full joint distribution of \(w\) may be unknown

  • Expectations may be difficult

  • Distribution may be misspecified

Estimator 2 (least structural)#

  • Use analogy principle (sample averages)

  • Plug in \(\hat{\theta}\)

Estimators:

\[ \hat{A} = \frac{1}{N}\sum_{i=1}^N H(w_i;\hat{\theta}), \qquad \hat{B} = \frac{1}{N}\sum_{i=1}^N s(w_i;\hat{\theta})s(w_i;\hat{\theta})'. \]

Properties:

  • \(\hat{A}\xrightarrow{p}A_0\), \(\hat{B}\xrightarrow{p}B_0\)

  • \(\hat{A}\) is not guaranteed to be positive definite

  • If \(\hat{\theta}\) strictly minimizes the sample criterion, \(\hat{A}\) is positive definite

Lack of positive definiteness usually indicates poor identification.
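For illustration, a minimal sketch of Estimator 2 for the linear regression criterion \(q=(y-x'\theta)^2\), where the scores and Hessians are available analytically; the simulated heteroskedastic data are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical data for the linear regression criterion q = (y - x'theta)^2
N = 300
X = np.column_stack([np.ones(N), rng.normal(size=N)])
theta0 = np.array([1.0, -0.5])
y = X @ theta0 + rng.normal(scale=np.abs(X[:, 1]) + 0.5, size=N)  # heteroskedastic errors

theta_hat = np.linalg.lstsq(X, y, rcond=None)[0]   # minimizer of the sample criterion
u_hat = y - X @ theta_hat

# Analytic scores and Hessians of q_i = (y_i - x_i'theta)^2 evaluated at theta_hat
S = -2.0 * u_hat[:, None] * X          # N x P matrix whose rows are s_i'
A_hat = 2.0 * (X.T @ X) / N            # (1/N) sum_i H_i
B_hat = (S.T @ S) / N                  # (1/N) sum_i s_i s_i'

Avar = np.linalg.inv(A_hat) @ B_hat @ np.linalg.inv(A_hat) / N   # sandwich formula
print("robust standard errors:", np.sqrt(np.diag(Avar)))
```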

Estimator 3 (intermediate case)#

Let \(w=\{y,x\}\) and define

\[ A(x;\theta_0) \equiv E[H(w;\theta_0)\mid x]. \]

Then

\[ E_x[A(x;\theta_0)] = A_0. \]

Estimator:

\[ \tilde{A} = \frac{1}{N}\sum_{i=1}^N A(x_i;\hat{\theta}). \]

More efficient than estimator 2 but less robust to misspecification.

Covariance estimators#

Using estimator 2, writing \(\hat{H}_i \equiv H(w_i;\hat{\theta})\) and \(\hat{s}_i \equiv s(w_i;\hat{\theta})\):

\[ \widehat{\operatorname{Avar}}_2(\hat{\theta}) = \left(\sum_{i=1}^N \hat{H}_i\right)^{-1} \left(\sum_{i=1}^N \hat{s}_i\hat{s}_i'\right) \left(\sum_{i=1}^N \hat{H}_i\right)^{-1}. \]

Using estimator 3, with \(\tilde{A}_i \equiv A(x_i;\hat{\theta})\):

\[ \widehat{\operatorname{Avar}}_3(\hat{\theta}) = \left(\sum_{i=1}^N \tilde{A}_i\right)^{-1} \left(\sum_{i=1}^N \hat{s}_i\hat{s}_i'\right) \left(\sum_{i=1}^N \tilde{A}_i\right)^{-1}. \]

Example: NLS using estimator 3

Recall the criterion (with a convenient factor of \(\tfrac{1}{2}\) that does not change the minimizer):

\[ q(y,x;\theta) = \frac{1}{2}(y-m(x;\theta))^2 = \frac{1}{2}u^2. \]

Scores:

\[ s(y,x;\theta) = -\nabla_\theta m(x;\theta)'(y-m(x;\theta)) = -\nabla_\theta m(x;\theta)'u. \]

Variance of scores

\[ B_0 = E\!\left[u^2\, \nabla_\theta m(x;\theta_0)'\nabla_\theta m(x;\theta_0)\right]. \]

Hessian

\[ H(y,x;\theta) = \nabla_\theta m(x;\theta)'\nabla_\theta m(x;\theta) - \nabla^2_\theta m(x;\theta)\,(y-m(x;\theta)). \]

Conditional expectation (at \(\theta_0\), using \(E[u \mid x] = 0\)):

\[ A(x;\theta) = \nabla_\theta m(x;\theta)'\nabla_\theta m(x;\theta). \]

Estimator:

\[ \tilde{A} = \frac{1}{N}\sum_{i=1}^N \nabla_\theta m(x_i;\hat{\theta})' \nabla_\theta m(x_i;\hat{\theta}). \]

NLS covariance estimator

\[ \widehat{\operatorname{Avar}}_{NLS}(\hat{\theta}) = \left(\sum_{i=1}^N \nabla_\theta m_i'\nabla_\theta m_i\right)^{-1} \left(\sum_{i=1}^N \hat{u}_i^2 \nabla_\theta m_i'\nabla_\theta m_i\right) \left(\sum_{i=1}^N \nabla_\theta m_i'\nabla_\theta m_i\right)^{-1}, \]

where \(\nabla_\theta m_i \equiv \nabla_\theta m(x_i;\hat{\theta})\) and \(\hat{u}_i \equiv y_i - m(x_i;\hat{\theta})\).

Remarks

  • This is White’s heteroskedasticity-consistent covariance estimator.

  • Robust to serial correlation in panel data.

  • If errors are homoskedastic, the estimator simplifies.

  • If \(m\) is linear in \(\theta\), this reduces to the White heteroskedasticity-robust covariance matrix for OLS (and to the usual OLS covariance matrix under homoskedasticity).
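A minimal numerical sketch of this NLS sandwich estimator, assuming a hypothetical exponential mean function with an analytic gradient:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)

# Hypothetical NLS example with m(x; theta) = exp(theta1 + theta2 * x)
N = 400
x = rng.normal(size=N)
y = np.exp(0.2 + 0.8 * x) + rng.normal(scale=0.3, size=N)

def m(theta, x):
    return np.exp(theta[0] + theta[1] * x)

def grad_m(theta, x):
    """Gradient of m with respect to theta, one row per observation (N x P)."""
    mx = m(theta, x)
    return np.column_stack([mx, mx * x])

theta_hat = minimize(lambda t: np.mean((y - m(t, x)) ** 2), x0=np.zeros(2), method="BFGS").x
u_hat = y - m(theta_hat, x)
G = grad_m(theta_hat, x)

# White-type sandwich covariance for NLS
bread = np.linalg.inv(G.T @ G)             # (sum grad_m' grad_m)^(-1)
meat = (G * u_hat[:, None] ** 2).T @ G     # sum u_i^2 grad_m' grad_m
Avar = bread @ meat @ bread
print("NLS robust standard errors:", np.sqrt(np.diag(Avar)))
```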

Optimization methods#

How to solve the sample minimization problem numerically?

\[ \hat{\theta} = \arg\min_{\theta \in \Theta} \frac{1}{N}\sum_{i=1}^N q(w_i;\theta) \]

In nonlinear models, the optimum rarely has a closed form and must be solved numerically.

Algorithms#

  • Newton–Raphson

  • BHHH (after its authors Berndt, Hall, Hall, and Hausman)

  • BFGS (Broyden-Fletcher-Goldfarb-Shanno method)

  • many others

The first three belong to the class of quasi-Newton algorithms, which have the form

\[ \theta^{(g+1)} = \theta^{(g)} - \lambda \left(\sum_i H_i(\theta^{(g)})\right)^{-1} \sum_i s_i(\theta^{(g)}) \]

where:

  • \(g\) is index of iteration

  • \(\lambda\) is step size

  • \(\sum_i H_i(\theta^{(g)})\) is the sample sum of the Hessians of \(q(w_i;\theta)\) evaluated at \(\theta^{(g)}\); it reflects the curvature of \(Q_N=\sum_i q(w_i;\theta)\)

  • \(\sum_i s_i(\theta^{(g)})\) is the sample sum of the scores evaluated at \(\theta^{(g)}\), measures the slope of \(Q_N\)

Newton–Raphson (NR)#

Here we use the flavour of the Newton-Raphson method adapted to optimization problems, namely finding the root of the system of first-order conditions, in multiple dimensions.

Second order Taylor expansion of the objective function around \(\theta^{(g)}\):

\[ \sum_{i=1}^N q_i(\theta^{(g+1)}) \approx \sum_{i=1}^N q_i(\theta^{(g)}) + (\theta^{(g+1)}-\theta^{(g)})' \sum_{i=1}^N s_i(\theta^{(g)}) + \frac{1}{2} (\theta^{(g+1)}-\theta^{(g)})' \sum_{i=1}^N H_i(\theta^{(g)}) (\theta^{(g+1)}-\theta^{(g)}) \]

FOC:

\[ \frac{\partial \sum_{i=1}^N q_i(\theta^{(g+1)})}{\partial \theta^{(g+1)}} = 0 + \sum_{i=1}^N s_i(\theta^{(g)}) + \sum_{i=1}^N H_i(\theta^{(g)}) (\theta^{(g+1)}-\theta^{(g)}) = 0 \]
\[ \theta^{(g+1)} = \theta^{(g)} - \left(\sum_{i=1}^N H_i(\theta^{(g)})\right)^{-1} \sum_{i=1}^N s_i(\theta^{(g)}) \]

This is the Newton–Raphson algorithm for solving the system of first order conditions.

Properties

  • Uses slope and curvature

  • Moves downhill if Hessian is positive definite

Step size \(\lambda\)#

In addition to the basic Newton-Raphson update, it is typical to include a step-size search. This helps ensure global convergence of the algorithm (convergence from arbitrary starting values), possibly at a slower rate.

  1. Start with full Newton step, \(\lambda=1\)

  2. If the step decreases the objective, accept it and start the next iteration with \(\lambda=1\)

  3. If the step does not decrease the objective, reduce \(\lambda\) to half its value

  4. Repeat steps 2–3 until an improving step is found

The trial step sizes therefore follow the sequence \(\lambda=1, 1/2, 1/4, 1/8,\ldots\), so the approach is often called step-halving line search.

This addition is computationally cheap because the Hessian and the scores do not have to be recomputed for different \(\lambda\) values, only the function values.
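Here is a minimal sketch of the Newton-Raphson update with step-halving line search, applied to a hypothetical one-parameter NLS problem; the data and function names are illustrative only.

```python
import numpy as np

def newton_raphson(theta, score, hessian, objective, tol=1e-8, max_iter=100):
    """Newton-Raphson minimization with step-halving line search."""
    for _ in range(max_iter):
        s = score(theta)
        if np.max(np.abs(s)) < tol:           # first-order conditions (approximately) hold
            break
        direction = -np.linalg.solve(hessian(theta), s)   # full Newton step (lambda = 1)
        lam = 1.0
        while objective(theta + lam * direction) >= objective(theta) and lam > 1e-10:
            lam *= 0.5                        # halve the step until the objective improves
        theta = theta + lam * direction
    return theta

# Hypothetical one-parameter NLS demo with m(x; theta) = exp(theta * x)
rng = np.random.default_rng(4)
x = np.linspace(-1.0, 1.0, 200)
y = np.exp(0.7 * x) + rng.normal(scale=0.1, size=200)

Q = lambda t: np.sum((y - np.exp(t * x)) ** 2)    # sum_i q_i(theta)
score = lambda t: np.atleast_1d(np.sum(-2 * (y - np.exp(t * x)) * np.exp(t * x) * x))
hessian = lambda t: np.atleast_2d(
    np.sum(2 * (np.exp(t * x) * x) ** 2 - 2 * (y - np.exp(t * x)) * np.exp(t * x) * x ** 2)
)

print("theta_hat =", newton_raphson(np.array([0.0]), score, hessian, Q))
```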

Newton-Raphson works best when the objective is:

  • Close to quadratic

  • Convex

Yet it may be that the Hessian is not positive definite far from the optimum, leading to uphill moves. The algorithm can be improved by ensuring that the curvature matrix is always positive definite.

BHHH#

BHHH approximates the Hessian by the outer product of scores:

\[ \theta^{(g+1)} = \theta^{(g)} - \lambda \left(\sum_i s_i s_i'\right)^{-1} \sum_i s_i \]

Motivation

  • Information identity: \(E[s_i s_i'] = E[H_i]\) at the true parameters of a correctly specified likelihood (see the information identity below)

  • Derived for the MLE criterion, but the algorithm can be applied to other M-estimation problems

Advantages

  • Always positive definite

  • No need to compute Hessian

Disadvantages

  • Approximation valid:

    • At true parameters

    • For large \(N\)

    • For well-specified models
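A single BHHH update can be sketched as follows; the `score_matrix` callable, which returns the \(N \times P\) matrix of scores evaluated at \(\theta\), is a hypothetical interface used only for illustration.

```python
import numpy as np

def bhhh_step(theta, score_matrix, lam=1.0):
    """One BHHH update using the outer product of scores in place of the Hessian."""
    S = score_matrix(theta)            # N x P matrix with rows s_i(theta)'
    g = S.sum(axis=0)                  # sum of scores
    H_approx = S.T @ S                 # outer-product approximation to the summed Hessian
    return theta - lam * np.linalg.solve(H_approx, g)
```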

BFGS and other quasi-Newton methods#

BFGS builds up an approximation to the inverse Hessian iteratively, ensuring positive definiteness.

There are other approaches to approximating the Hessian, such as the DFP (Davidon-Fletcher-Powell) method, that have been developed for general optimization problems.

BHHH, which is based on the statistical properties of the scores, is more specific to econometric M-estimation problems and may not be available in standard optimization packages.

Maximum likelihood estimation#

Maximum likelihood estimation (MLE) is a special case of M-estimation where the criterion function is the negative log-likelihood function.

Rather than focusing on conditional moments as in nonlinear regression, MLE focuses on the entire conditional distribution of the endogenous variable given the exogenous variables.

  • Not only conditional moments as in NLS, i.e. \(E(y \mid x)\), \(\operatorname{Var}(y \mid x)\)

  • But the entire conditional distribution of the random variable \(y\), i.e. \(f(y \mid x)\)

Why MLE?

  • Efficiency: MLE uses the information on the distribution of the endogenous variables in the most efficient way. If the distribution is correctly specified, MLE is asymptotically efficient and achieves the Cramér-Rao lower bound, that is, the lowest possible asymptotic variance among all consistent estimators

  • More structure gives more information: Able to estimate any feature of the conditional distribution, such as

    • \(P(y = 1 \mid x)\) (if \(y\) is discrete)

    • \(P(y \in (\underline{y}, \bar{y}) \mid x)\) (if \(y\) is continuous)

    • Conditional moments, \(E(y \mid x)\), \(\operatorname{Var}(y \mid x)\)

    • etc.

    • And how these objects change when \(x\) or \(\theta\) changes

Why not MLE?

  • Robustness: MLE is generally inconsistent if the distribution is misspecified

    • yet sometimes MLE is robust to misspecification

    • for example, in the nonlinear regression model with normal homoskedastic errors, the MLE of the mean parameters coincides with NLS

Example: Binary Choice

Microeconomic model

\[ P(y = 1 \mid x; \beta_0) = G(x'\beta_0) \in [0,1] \]
  • \(y\) : binary random variable, \(y \in \{0,1\}\)

  • \(\beta_0\) : \(P \times 1\) parameter vector

  • \(G(x'\beta_0)\) : model for the conditional probability of \(y = 1\)

  • Logit:
    \(G(x'\beta_0) = \Lambda(x'\beta_0) = \frac{\exp(x'\beta_0)}{1 + \exp(x'\beta_0)}\)

  • Probit:
    \(G(x'\beta_0) = \Phi(x'\beta_0), \quad \Phi(x'\beta_0) = \int_{-\infty}^{x'\beta_0} \frac{1}{\sqrt{2\pi}} \exp\!\left(-\frac{z^2}{2}\right) dz\)

Conditional probability of \(y\)

\[ f(y \mid x; \beta_0) = G(x'\beta_0)^y \left[1 - G(x'\beta_0)\right]^{1-y} \]

Note: \(\beta_0\) is defined as the vector that indexes \(f(y \mid x; \beta_0)\).
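As a quick illustration, a minimal sketch of the logit conditional probability \(f(y \mid x;\beta)\); the particular values of \(x\) and \(\beta\) are hypothetical.

```python
import numpy as np

def G_logit(z):
    """Logistic CDF."""
    return 1.0 / (1.0 + np.exp(-z))

def f_cond(y, x, beta):
    """Conditional probability f(y | x; beta) = G(x'beta)^y [1 - G(x'beta)]^(1-y)."""
    p = G_logit(x @ beta)
    return p ** y * (1.0 - p) ** (1 - y)

x = np.array([1.0, 0.3])                 # includes a constant
beta = np.array([-0.2, 0.8])
print(f_cond(1, x, beta) + f_cond(0, x, beta))   # probabilities sum to one
```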

Framework for Conditional MLE#

The estimator is called conditional MLE because we leave the marginal distribution of \(x\) unmodelled, take it as given, and focus on modelling the conditional distribution of \(y\) given \(x\).

Object of interest

\(p_0(y \mid x)\), i.e. the “true” conditional density of \(y_i\) given \(x_i = x\), which is defined with respect to some measure \(\nu\) on the corresponding probability space.

Parametric model

We model \(p_0(y \mid x)\) as

\[ p_0(y \mid x) = f(y \mid x; \theta_0) \]

Assumptions

  1. \(f(y \mid x; \theta) \ge 0\)

  2. \(\int_{\mathcal{Y}} f(y \mid x; \theta)\, \nu(dy) = 1\)
    for all \(x \in \mathcal{X}\) and each \(\theta \in \Theta\), where \(\mathcal{Y}\) and \(\mathcal{X}\) are the supports of \(y\) and \(x\)

  3. \(f(y \mid x; \theta_0)\) is a correctly specified model for \(p_0(y \mid x)\),
    i.e. \(p_0(y \mid x) = f(y \mid x; \theta_0)\) for some \(\theta_0 \in \Theta\)

Population problem#

Objective function

\[ Q(\theta) = E[\ln f(y \mid x; \theta)] \]

and (we will show)

\[ \theta_0 = \arg\max_{\theta \in \Theta} E[\ln f(y \mid x; \theta)] \]

or equivalently

\[ \theta_0 = \arg\min_{\theta \in \Theta} E[-\ln f(y \mid x; \theta)] \]

Sample problem#

Objective function

\[ Q_N(\theta) = \frac{1}{N}\sum_{i=1}^N \ln f(y_i \mid x_i; \theta) = \frac{1}{N}\sum_{i=1}^N \ell_i(\theta) \]

The value \(\hat{\theta}_{CMLE}\) that maximizes \(Q_N(\theta)\) is the conditional maximum likelihood estimator (CMLE):

\[ \hat{\theta}_{CMLE} = \arg\max_{\theta \in \Theta} Q_N(\theta) \]

\(\hat{\theta}_{CMLE}\) is clearly an M-estimator.
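For example, a minimal sketch of computing the CMLE for the logit model by minimizing \(-Q_N(\theta)\) numerically; the simulated data are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(5)

# Hypothetical logit data
N = 1000
X = np.column_stack([np.ones(N), rng.normal(size=N)])
beta0 = np.array([-0.5, 1.0])
p = 1.0 / (1.0 + np.exp(-X @ beta0))
y = (rng.uniform(size=N) < p).astype(float)

def neg_avg_loglik(beta, y, X):
    """-Q_N(beta) = -(1/N) sum_i log f(y_i | x_i; beta) for the logit model."""
    z = X @ beta
    # log G(z) = -log(1 + exp(-z));  log(1 - G(z)) = -z - log(1 + exp(-z))
    return -np.mean(y * (-np.logaddexp(0.0, -z)) + (1 - y) * (-z - np.logaddexp(0.0, -z)))

beta_hat = minimize(neg_avg_loglik, x0=np.zeros(2), args=(y, X), method="BFGS").x
print("CMLE:", beta_hat)
```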

Identification of \(\theta_0\)#

Strategy

  1. First prove identification for unconditional MLE

  2. Then show that this implies identification for CMLE

Identification, unconditional case

Jensen’s inequality

  • \(u\) : non-constant random variable

  • \(a(u)\) : strictly concave function

Then \(E[a(u)] < a(E[u])\).

Let

  • \(u = \frac{f(z;\theta)}{f(z;\theta_0)}\), \(\theta \ne \theta_0\)

  • \(a(u) = \ln(u)\)

Then

\[ E\!\left[ \ln\!\left( \frac{f(z;\theta)}{f(z;\theta_0)} \right) \right] < \ln\!\left( E\!\left[ \frac{f(z;\theta)}{f(z;\theta_0)} \right] \right) \]

Now

\[ E\!\left[ \frac{f(z;\theta)}{f(z;\theta_0)} \right] = \int \frac{f(z;\theta)}{f(z;\theta_0)} f(z;\theta_0)\,dz = \int f(z;\theta)\,dz = 1 \]

Hence

\[ E\!\left[ \ln\!\left( \frac{f(z;\theta)}{f(z;\theta_0)} \right) \right] < \ln(1) = 0 \]

Therefore, provided that \(f(z;\theta)\) differs from \(f(z;\theta_0)\) with positive probability (so that the ratio above is a non-constant random variable and Jensen's inequality is strict),

\[ E[\ln f(z;\theta)] < E[\ln f(z;\theta_0)] \quad \text{for all } \theta \ne \theta_0 \]

Thus \(E[\ln f(z;\theta)]\) is maximized at \(\theta=\theta_0\).

Identification, conditional case

The same argument applies to conditional densities. For each fixed \(x\), applying the result above to \(f(y\mid x;\theta)\) shows that

\[ E_{y\mid x}[\ln f(y\mid x;\theta)] \text{ is maximized at } \theta=\theta_0 \]

Averaging over the distribution of \(x\),

\[ E_x\!\left[ E_{y\mid x}[\ln f(y\mid x;\theta)] \right] = E[\ln f(y\mid x;\theta)] \]

is therefore also maximized at \(\theta=\theta_0\).
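A minimal numerical check of this identification result, assuming a normal location model with known variance: the simulated approximation to \(E[\ln f(z;\theta)]\) should peak near \(\theta_0\).

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(6)
theta0 = 1.5
z = rng.normal(loc=theta0, scale=1.0, size=100_000)   # large sample approximates the expectation

grid = np.linspace(0.0, 3.0, 61)
expected_loglik = [norm.logpdf(z, loc=t, scale=1.0).mean() for t in grid]
print("argmax of E[ln f(z; theta)] is approximately", grid[int(np.argmax(expected_loglik))])
```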

Asymptotic properties of MLE#

Consistency#

  • Can be established as a special case of Theorem 12.2 in Wooldridge [2010], see Theorem 13.1.

  • Under the following assumptions, \(\hat{\theta}_{CMLE}\) is consistent for \(\theta_0\), i.e. \(\hat{\theta}_{CMLE} \xrightarrow{p} \theta_0\)

Assumptions:

  1. Random sample

  2. \(\theta_0\) is identified (unique maximizer of \(Q(\theta)\))

  3. Model is correctly specified: \(p_0(y\mid x)=f(y\mid x;\theta_0)\) for some \(\theta_0 \in \Theta\)

  4. \(f(y\mid x;\theta)\) is a well-specified density for all \(\theta\) and \(x\): everywhere positive and integrates to 1

  5. Log-likelihood is continuous in \(\theta\)

  6. Parameter space \(\Theta\) is compact (closed, bounded)

  7. Technical conditions:

  • Borel measurability

  • Objective function bounded in absolute value

Asymptotic distribution#

Additional assumptions:

  1. \(\theta_0 \in \operatorname{int}(\Theta)\)

  2. Log-likelihood is twice continuously differentiable in \(\theta\)

We can apply Theorem 12.3 in Wooldridge [2010] to obtain the asymptotic distribution of CMLE.

Under the assumptions above,

\[ \sqrt{N}(\hat{\theta}-\theta_0) \xrightarrow{d} \mathcal{N}(0, A_0^{-1} B_0 A_0^{-1}) \]

where

\[ A_0 = E[H(w;\theta_0)], \qquad B_0 = E[s(w;\theta_0)s(w;\theta_0)'] \]

Thus

\[ \operatorname{Avar}(\hat{\theta}) = \frac{1}{N} A_0^{-1} B_0 A_0^{-1} \]

If we want to apply Theorem 12.3 directly, we must turn the maximization problem into a minimization problem. Define

\[ q(y\mid x;\theta) = -\ln f(y\mid x;\theta) \]

Hence \(H(w;\theta_0)\) is the Hessian of \(-\ln f(y\mid x;\theta_0)\). Keeping track of the minus sign, we have

\[ H(w;\theta_0) = -\nabla_\theta^2 \ln f(y\mid x;\theta_0) \]

so that

\[ A_0 = E[H(w;\theta_0)] = -E\!\left[\nabla_\theta^2 \ln f(y\mid x;\theta_0)\right] \]

Information identity#

Major result for MLE!

Conditional:

\[ E[H_i(\theta_0)\mid x_i] = E[s_i(\theta_0)s_i(\theta_0)'\mid x_i] \]

Unconditional:

\[ E[H_i(\theta_0)] = E[s_i(\theta_0)s_i(\theta_0)'] \]

Thus \(A_0 = B_0\).

Applying this gives the limiting distribution for MLE:

\[ \sqrt{N}(\hat{\theta}-\theta_0) \xrightarrow{d} \mathcal{N}(0, A_0^{-1}) \]

and

\[ \operatorname{Avar}(\hat{\theta}) = \frac{1}{N} A_0^{-1} = \frac{1}{N} B_0^{-1} \]

Note: The information identity only applies

  • In the population, i.e. asymptotically as \(N \to \infty\)

  • When the model is correctly specified

  • At the true parameters

Otherwise we must use

\[ \operatorname{Avar}(\hat{\theta}) = \frac{1}{N} A_0^{-1} B_0 A_0^{-1} \]

Estimating the asymptotic variance#

Estimating \(A_0\) and \(B_0\) is similar to the general M-estimation case.

Each of the following matrices converges to \(A_0=B_0\):

\[ \frac{1}{N}\sum_{i=1}^N H_i(\hat{\theta}) \]
\[ \frac{1}{N}\sum_{i=1}^N s_i(\hat{\theta})s_i(\hat{\theta})' \]
\[ \frac{1}{N}\sum_{i=1}^N A(x_i;\hat{\theta}), \quad A(x_i;\theta)=E[H(y_i,x_i;\theta)\mid x_i] \]

Pros and cons

  • (1) is always available, but not guaranteed to be positive definite

  • (2) easy and always positive definite, but poor small-sample properties

  • (3) best small-sample properties and positive definite, but hard to compute

Hence

\[ \widehat{\operatorname{Avar}}(\hat{\theta}) \in \left\{ \left(\sum_{i=1}^N H_i(\hat{\theta})\right)^{-1}, \left(\sum_{i=1}^N s_i(\hat{\theta})s_i(\hat{\theta})'\right)^{-1}, \left(\sum_{i=1}^N A(x_i;\hat{\theta})\right)^{-1} \right\} \]

Example: Binary Choice

Conditional density

\[ f(y_i\mid x_i;\beta) = G(x_i'\beta)^{y_i} \left[1-G(x_i'\beta)\right]^{1-y_i} \]

Conditional log-likelihood

\[ \ell_i(\beta) = y_i \ln G(x_i'\beta) + (1-y_i)\ln\!\left(1-G(x_i'\beta)\right) \]

We use the estimator \(\frac{1}{N}\sum_{i=1}^N A(x_i;\hat{\beta})\).

1: Compute Hessian

Score

\[ s_i(\beta) = \frac{\partial \ln f(y_i\mid x_i;\beta)}{\partial \beta} = \frac{y_i-G(x_i'\beta)}{G(x_i'\beta)(1-G(x_i'\beta))}\, g(x_i'\beta)\,x_i \]

where \(g(z)=\frac{\partial G(z)}{\partial z}\).

Hessian (of the negative log-likelihood \(q_i = -\ell_i\), consistent with the minimization formulation above)

\[ H_i(\beta) = -\frac{\partial^2 \ln f(y_i\mid x_i;\beta)}{\partial \beta \partial \beta'} = \frac{[g(x_i'\beta)]^2}{G(x_i'\beta)\left(1-G(x_i'\beta)\right)}\, x_i x_i' - \big(y_i-G(x_i'\beta)\big)\, h(x_i'\beta)\, x_i x_i' \]

where \(h(\cdot)\) collects the remaining derivative terms; its exact form is irrelevant for what follows.

2: Conditional expected Hessian

The second term has conditional expectation zero at the true parameter, since under correct specification

\[ E[y_i-G(x_i'\beta_0)\mid x_i]=0 \]

Hence

\[ A(x_i;\beta) = E[H_i(\beta)\mid x_i] = \frac{[g(x_i'\beta)]^2}{G(x_i'\beta)(1-G(x_i'\beta))}x_i x_i' \]

3: Sample analogue

\[ \hat{A} = \frac{1}{N}\sum_{i=1}^N \frac{[g(x_i'\hat{\beta})]^2}{G(x_i'\hat{\beta})(1-G(x_i'\hat{\beta}))}\, x_i x_i' \]

Asymptotic variance of \(\hat{\beta}\)

\[ \widehat{\operatorname{Avar}}(\hat{\beta}) = \left[ \sum_{i=1}^N \frac{[g(x_i'\hat{\beta})]^2}{G(x_i'\hat{\beta})(1-G(x_i'\hat{\beta}))} x_i x_i' \right]^{-1} \]
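Putting the pieces together, here is a minimal sketch of this variance estimator for the logit model, for which \(g(z)=G(z)(1-G(z))\) so the weight simplifies to \(G(1-G)\); the simulated data are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(7)

# Hypothetical logit data and its CMLE
N = 2000
X = np.column_stack([np.ones(N), rng.normal(size=N)])
beta0 = np.array([0.5, -1.0])
G = lambda z: 1.0 / (1.0 + np.exp(-z))
y = (rng.uniform(size=N) < G(X @ beta0)).astype(float)

negll = lambda b: np.mean(np.logaddexp(0.0, X @ b) - y * (X @ b))   # -(1/N) sum_i l_i(b)
beta_hat = minimize(negll, x0=np.zeros(2), method="BFGS").x

# Expected-Hessian weights g^2 / (G(1-G)), which equal G(1-G) for the logit
p = G(X @ beta_hat)
w = p * (1.0 - p)
A_sum = (X * w[:, None]).T @ X               # sum_i A(x_i; beta_hat)
Avar = np.linalg.inv(A_sum)                  # estimated Avar(beta_hat)
print("standard errors:", np.sqrt(np.diag(Avar)))
```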

References and Additional Resources

  • 📖 Wooldridge, J. M. (2010), “Econometric Analysis of Cross Section and Panel Data”, 2nd edition, MIT Press

    • Chapter 12

    • Chapter 13