📖 Maximum likelihood estimation (MLE)#
Essentially, these notes follow chapters 12 and 13 of Wooldridge [2010]. They are based on lecture notes by Bertel Schjerning, University of Copenhagen.
A Nonlinear Regression Model#
Before we begin our study of the general class of estimators, known as M-estimators, we introduce the example of the nonlinear regression model to motivate the general approach.
We are interested in modelling the conditional mean of some random variable \(y\) conditional on explanatory variables \(x\).
Assume:
\(y\) : Random variable
\(x\) : \(k\)-vector of explanatory variables
\(m(x;\theta)\) : Nonlinear parametric model for \(E(y \mid x)\)
\(m(\cdot)\) is known up to a set of parameters \(\theta\)
\(\theta\) : \(P \times 1\) vector that indexes the model for \(E[y \mid x]\)
\(\theta \in \Theta \subset \mathbb{R}^P\) where \(\Theta\) is the parameter space
Correct specification
If (1) holds for some \(\theta_0 \in \Theta\), we say that the model is correctly specified.
\(\theta_0\) is referred to as the true parameter vector.
Equivalent way of writing equation (1):
We can refer to (2) as the error formulation of the model.
Note that \(E[u \mid x] = 0\) is not an additional assumption; it is simply a consequence of equation (1).
To see this, define the error term as \(u \equiv y - m(x;\theta_0)\). Then, taking conditional expectations:
Remark 1
\(E(u \mid x) = 0\) does not mean that \(u\) and \(x\) are independent.
Perhaps \(\operatorname{Var}(y \mid x) \neq \operatorname{Var}(y \mid x')\) for \(x \neq x'\), i.e. heteroskedasticity is possible
Perhaps \(y\) is restricted to be positive, \(y \ge 0\), implying that \(u \ge -m(x;\theta_0)\), i.e. possible values of \(u\) depend on \(x\).
Other features of the conditional distribution of \(y\) given \(x\), such as higher conditional moments of \(y\), could be dependent too. If, for example, the skewness of \(y \mid x\) depends on \(x\), then \(y\) and \(x\) are dependent.
Remark 2
The error formulation in equation (2) results in a semi-parametric model for \(y\), since we have not specified a parametric distribution for \(u\).
But \(m(x;\theta_0)\) is still a parametric model for \(E[y \mid x]\).
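To make Remarks 1 and 2 concrete, here is a minimal simulation sketch. The exponential mean function and the variance specification are illustrative assumptions, not part of the notes; the point is that \(E(u \mid x) = 0\) can hold while the spread of \(u\) still depends on \(x\).

```python
import numpy as np

rng = np.random.default_rng(0)
N = 10_000
theta0 = np.array([0.5, 1.0])          # hypothetical true parameters

x = rng.uniform(0, 2, size=N)
m = np.exp(theta0[0] + theta0[1] * x)  # assumed mean function m(x; theta0)

# Heteroskedastic error: E(u | x) = 0 but sd(u | x) = 1 + x
u = (1 + x) * rng.standard_normal(N)
y = m + u                              # error formulation: y = m(x; theta0) + u

# The conditional mean of u is ~0 in every x-bin, but its spread grows with x
bins = np.digitize(x, np.linspace(0, 2, 5))
for b in np.unique(bins):
    ub = u[bins == b]
    print(f"bin {b}: mean(u|x) = {ub.mean():+.3f}, sd(u|x) = {ub.std():.3f}")
```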
Identification#
Empirical analysis includes the following three steps:
Identification of \(\theta_0\)
Estimation of \(\theta_0\)
Inference and hypothesis testing
Identification step
Would it be possible to find estimates of \(\theta_0\) if we knew the nature of the data available in the population?
This is a purely analytical exercise that has to be proven; it does not involve any data or statistical analysis.
Estimation step
Find estimates of population distribution given the finite sample at hand. This is statistical analysis.
Inference
Given an estimated model we can make inferences and test relevant hypotheses.
How precise are our estimates?
Test hypotheses about parameters and functions of parameters.
The grounds for identification in the nonlinear regression model is the following minimization problem:
\(\theta_0\) minimizes the expected squared error
the objective function is called the population criterion function as it is written for random variables \(x\) and \(y\) rather than some finite sample of observed data
the expectation is taken over the joint distribution of \((y,x)\)
Let’s check if \(\theta_0\) defined by (1) also solves (3)
where \(u \equiv y - m(x;\theta_0)\). Taking expectations and using \(E(u \mid x) = 0\) we have
Finally, since \(E[(m(x;\theta_0) - m(x;\theta))^2] \ge 0\), we have for every \(\theta \in \Theta\):
Thus, \(\theta_0\) solves the population problem (3).
What about uniqueness?
Identification
If \(\theta_0\) is a unique minimizer in (3), that is
we say that \(\theta_0\) is identifiable, or simply identified.
Note once more, that identification is a property of the population problem (3) and does not depend on any data.
Example of lack of identification in NLS
Suppose that
but the true model is linear such that \(\theta^{(3)}_{0} = 0\).
Then, for any value of \(\theta^{(4)}\) the population criterion is minimized when the other parameters are set to their true values.
The model is poorly identified.
Example of lack of identification in OLS
Consider the simple linear regression \(m(x;\theta) = x\theta\).
Imagine that some regressors in \(x\) are linearly dependent, or perfectly collinear, and in the extreme case there are two regressors which are exactly the same.
In this case, the true coefficients in \(\theta_0\) for these regressors can be changed arbitrarily as long as their sum remains the same (i.e., the combined effect of the two regressors on \(m(x;\theta)\) remains unchanged). Thus, there exist multiple values of \(\theta\) that minimize the population criterion (3).
The model is not identified.
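A quick numerical illustration of the perfect-collinearity case (with made-up data): duplicating a regressor makes \(X'X\) singular, and two different coefficient vectors deliver exactly the same fitted mean, so no unique minimizer exists.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 200
x1 = rng.standard_normal(N)
X = np.column_stack([np.ones(N), x1, x1])     # third column duplicates the second

print(np.linalg.matrix_rank(X.T @ X))          # 2 < 3: X'X is singular

# Two different parameter vectors whose collinear coefficients sum to the same value
theta_a = np.array([1.0, 2.0, 0.0])
theta_b = np.array([1.0, 0.5, 1.5])
print(np.allclose(X @ theta_a, X @ theta_b))   # True: identical fitted values
```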
Estimation#
Assume we have a random sample \(\{y_i, x_i\}_{i=1,\ldots,N}\).
Replace the population concept \(E[(y - m(x;\theta))^2]\) with its sample analogue, relying on the analogy principle to build an estimator:
We have the Nonlinear Least Squares (NLS) estimator:
It turns out that \(\hat{\theta}_{NLS}\) is a consistent estimator for \(\theta_0\) if
\(\theta_0\) is identified
the sample criterion converges uniformly to its expectation
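As a concrete illustration, here is a minimal numerical sketch of the NLS estimator, assuming a hypothetical exponential mean function \(m(x;\theta) = \exp(\theta_1 + \theta_2 x)\); the data-generating process and starting value are illustrative choices, not part of the notes.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
N = 1_000
theta0 = np.array([0.5, 1.0])                      # hypothetical true parameters

x = rng.uniform(0, 2, size=N)
y = np.exp(theta0[0] + theta0[1] * x) + rng.standard_normal(N)

def m(x, theta):
    """Assumed nonlinear mean function m(x; theta)."""
    return np.exp(theta[0] + theta[1] * x)

def sample_criterion(theta):
    """NLS sample criterion: (1/N) * sum_i (y_i - m(x_i; theta))^2."""
    return np.mean((y - m(x, theta)) ** 2)

theta_hat = minimize(sample_criterion, x0=np.zeros(2), method="BFGS").x
print("theta_hat_NLS:", theta_hat)                 # should be close to theta0
```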
Let’s look at the notions of consistency and uniform convergence in the more abstract terms of M-estimators.
M-estimation#
M stands for minimization or maximization, M-estimators minimize or maximize a sample criterion.
Let’s adopt the following notation:
\(w = (y,x)\) denotes the data vector which includes endogenous \(y\) and exogenous \(x\)
\(w_i\) denotes the \(i\)th observation of \(w\) in the random sample from a population
\(q(w;\theta)\) is a known criterion function of the data and parameters
Definition: M-estimators
Given a random sample \(\{w_i\}_{i=1,\ldots,N}\) from a population, an M-estimator is defined as minimizer \(\hat{\theta}\) of the sample analogue
of the population problem
where \(q(w;\theta)\) is a known parametric criterion function and \(\theta_0\) is the true value of parameter.
M-estimators are a very general class of estimators that includes many well-known estimators as special cases.
The random sample \(\{w_i\}_{i=1,\ldots,N} = \{y_i, x_i\}_{i=1,\ldots,N}\) required in the definition also accommodates many data structures: \(y_i\) can be a scalar or a vector, allowing for cross-section, panel, and multi-equation data.
Examples of M-estimators
Linear regression:
Nonlinear regression:
LAD:
Maximum likelihood estimator: \(q(w;\theta)\) is the negative of the log-likelihood function
How to choose an M-estimator?#
There are many possible M-estimators for a given problem – many choices of \(q(w;\theta)\) criterion functions!
How to choose a “good” estimator?
Criteria include:
Consistency: does the estimator recover the true parameter as sample size increases?
Rate of convergence: how fast does a consistent estimator converge to the true parameter as the sample size increases?
Known asymptotic distribution: can we derive the asymptotic distribution of the estimator for inference?
Efficiency: how precise are the estimates relative to other estimators?
Robustness: how sensitive is the estimator to violations of assumptions used to derive \(q(w;\theta)\)?
Small-sample properties: how does the estimator perform in finite samples?
Ease of use: how simple is it to understand, implement and compute the estimator?
How would you rank the selection criteria? Which one is most and least important?
We can study the properties of estimators against these criteria
theoretically, using asymptotic theory as the sample size \(N \to \infty\)
practically, using Monte Carlo simulations for finite samples of various sizes (a small sketch follows below)
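A compact Monte Carlo sketch of the last point, comparing two M-estimators of a location parameter (the sample mean, with \(q = (w-\theta)^2\), and the sample median, with \(q = |w-\theta|\)) under a normal population; the design is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
theta0, R = 1.0, 2_000                 # true location parameter, number of replications

for N in (50, 500):
    means, medians = np.empty(R), np.empty(R)
    for r in range(R):
        w = theta0 + rng.standard_normal(N)    # assumed normal population
        means[r] = w.mean()                    # M-estimator with q = (w - theta)^2
        medians[r] = np.median(w)              # M-estimator with q = |w - theta| (LAD)
    print(f"N={N:4d}  mean:   bias={means.mean()-theta0:+.4f}  sd={means.std():.4f}")
    print(f"N={N:4d}  median: bias={medians.mean()-theta0:+.4f}  sd={medians.std():.4f}")
```

Under normal data both estimators are consistent, but the mean has the smaller standard deviation (efficiency); with heavier-tailed data the ranking can reverse (robustness).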
Consistency of M-estimators#
Definition
A sequence of random variables \(\{X_n\}_{n\ge 1}\) converges in probability to a random variable \(X\), written \(X_n \xrightarrow{p} X\), if for every \(\varepsilon > 0\),
Recall that we are working with
population problem that the true value of parameter \(\theta_0\) solves
sample problem that the M-estimator \(\hat{\theta}\) solves
Under weak regularity conditions (finite moments) the Law of Large Numbers states that the sample average converges in probability to its expected value:
We wish to show consistency of the M-estimator, \(\hat{\theta} \xrightarrow{p} \theta_0\).
Using a diagram:
Two requirements for consistency
Identification of \(\theta_0\) – must be the unique solution to the population problem
Definition: identification
The model is identified if \(\theta_0\) is the unique minimizer of \(E[q(w;\theta)]\) over \(\theta \in \Theta\), i.e.
Uniform convergence in probability of \(\frac{1}{N}\sum_{i=1}^N q(w_i;\theta)\) to its expected value, that is convergence should hold not only pointwise for each fixed \(\theta\), but uniformly over all \(\theta \in \Theta\) in the worst case.
Definition: uniform convergence
A sequence of random functions \(\{g_N(\theta)\}\) converges uniformly in probability to a function \(g(\theta)\) on \(\Theta\) if for every \(\varepsilon > 0\)
In our case, we want uniform convergence of the sample criterion to its expectation:
Sufficient conditions for uniform convergence
\(q(w;\cdot)\) is continuous on \(\Theta\)
Parameter space \(\Theta\) is a compact set (closed, bounded)
Technical conditions
\(q(\cdot;\theta)\) is Borel measurable on \(\mathcal{W}\)
\(q(w;\theta)\) is bounded in absolute value
Collecting the above conditions, we have Theorem 12.2 in Wooldridge [2010]
Consistency of M-estimators
Under conditions of uniform convergence and provided that the model is identified, the M-estimator \(\hat{\theta}\) is consistent for \(\theta_0\):
Asymptotic distribution#
Let us now build the asymptotic theory of M-estimators.
We require some further assumptions.
Additional conditions
\(\theta_0 \in \operatorname{int}(\Theta)\)
\(q(w;\cdot)\) is twice continuously differentiable on \(\operatorname{int}(\Theta)\)
Note: the differentiability condition above implies the continuity condition required for uniform convergence.
Under these additional assumptions we can state the following (very central) theorem (Theorem 12.3 in Wooldridge [2010]):
Asymptotic Normality of M-estimators
Under the conditions for uniform convergence above and the two additional conditions, and provided that the model parameters are identified, the M-estimator \(\hat{\theta}\) is asymptotically normal:
where
and
with
Recall that notation “\(\xrightarrow{d}\)” means convergence in distribution.
Definition: convergence in distribution
A sequence of random variables \(\{X_n\}_{n\ge 1}\) converges in distribution to a random variable \(X\), written \(X_n \xrightarrow{d} X\), if
for all \(x\) at which the distribution function \(F_X(x)\) is continuous.
The quantities \(s(w;\theta) = \frac{\partial q(w;\theta)}{\partial\theta}\) are called the scores; they are \(P \times 1\) column vectors equal to the transposed gradients of the criterion function with respect to \(\theta\).
The matrices \(A_0\) and \(B_0\) are therefore both \(P \times P\).
Given the result on asymptotic normality, we can read off the asymptotic variance of M-estimate \(\hat{\theta}\) as
and perform inference on \(\theta_0\) using the asymptotic distribution of \(\hat{\theta}\).
Proof of asymptotic normality of M-estimators#
Let’s sketch the proof of asymptotic normality.
First, recall the standard Central Limit Theorem (CLT) for iid random vectors:
Lindeberg–Lévy Central Limit Theorem
Let \(X_i\) be a sequence of iid random vectors with \(E[X_i]=\mu\) and \(\operatorname{Var}[X_i]=\Sigma\). Then, as \(N\to\infty\),
Since \(q(w;\cdot)\) is twice continuously differentiable in \(\theta\), define the score vector
\[
s(w;\theta) = \nabla_\theta q(w;\theta)
\]

with components

\[
s_j(w;\theta) = \frac{\partial q(w;\theta)}{\partial \theta_j}, \quad j=1,\ldots,P,
\]

and the Hessian

\[
H(w;\theta) = \frac{\partial^2 q(w;\theta)}{\partial\theta\,\partial\theta'} \quad (P\times P).
\]

Take a mean value expansion of \(s(w;\hat{\theta})\) about \(\theta_0\):

\[
\sum_{i=1}^N s(w_i;\hat{\theta}) = \sum_{i=1}^N s(w_i;\theta_0) + \sum_{i=1}^N H(w_i;\theta^+)(\hat{\theta}-\theta_0),
\]

where \(\theta^+\) lies between \(\theta_0\) and \(\hat{\theta}\).

Intuition (mean value theorem analogy):

If a car travels 100 miles in one hour, its average speed is 100 mph. Even if its speed varies, at some point the car must travel exactly at the average speed. Similarly, there exists a \(\theta^+\) between \(\theta_0\) and \(\hat{\theta}\) at which the Hessian equals the average rate of change of the score.

Since \(\theta_0 \in \operatorname{int}(\Theta)\) and \(\hat{\theta}\xrightarrow{p}\theta_0\),

\[
\Pr\big(\hat{\theta}\in\operatorname{int}(\Theta)\big) \to 1.
\]

Hence the first-order conditions for the sample problem hold with probability approaching one:

\[
\Pr\left(\sum_{i=1}^N s(w_i;\hat{\theta})=0\right)\to 1.
\]

Therefore,

\[
0 = \sqrt{N}\,\frac{1}{N}\sum_{i=1}^N s(w_i;\theta_0) + \frac{1}{N}\sum_{i=1}^N H(w_i;\theta^+)\,\sqrt{N}(\hat{\theta}-\theta_0),
\]

implying

\[
\sqrt{N}(\hat{\theta}-\theta_0) = -\left( \frac{1}{N}\sum_{i=1}^N H(w_i;\theta^+) \right)^{-1} \left( \frac{1}{\sqrt{N}}\sum_{i=1}^N s(w_i;\theta_0) \right).
\]

The first factor converges in probability to \(A_0^{-1}\), while the second term is a normalized sum of iid scores with mean zero and variance \(B_0\).
By the CLT, the RHS converges in distribution to \(\mathcal{N}(0,A_0^{-1}B_0A_0^{-1})\), proving the theorem.
Remarks:
The M-estimator \(\hat{\theta}\) is asymptotically unbiased, that is, \(\sqrt{N}(\hat{\theta}-\theta_0)\) has zero mean asymptotically: \(\sqrt{N}\,E(\hat{\theta}-\theta_0) \to 0\) as \(N\to\infty\). This concept is related to consistency.
\(\operatorname{Avar}(\hat{\theta})\), referred to as the asymptotic variance, converges to zero as \(N\to\infty\), yet \(\operatorname{Var}(\sqrt{N}(\hat{\theta}-\theta_0))\) converges to \(A_0^{-1}B_0A_0^{-1}\), which does not depend on \(N\). We use \(\operatorname{Avar}(\hat{\theta})\) for inference, i.e. to obtain standard errors, run tests, etc.
\(\hat{\theta}\) is \(\sqrt{N}\)-consistent, similar to many other estimators, but some nonparametric estimators converge more slowly (more data is needed for the same precision).
The asymptotic covariance depends on the choice of \(q(w;\cdot)\), so we should find and use better criterion functions (estimators) to improve precision.
\(A_0\) must be positive definite; otherwise we could, for example, get negative variances. Another example is when the objective function is flat around the minimum, i.e. \(\theta_0\) is not a unique minimizer. In this case \(A_0\) has zeros on the diagonal, so it is not positive definite and has no inverse. This is a practical manifestation of non-identification.
Estimating the Asymptotic Variance of M-estimators#
To perform inference we would like to compute
but \(A_0\) and \(B_0\) depend on \(\theta_0\), which is unknown!
We have to estimate \(A_0\) and \(B_0\) from sample data.
Estimator 1 (most structural)#
Solve analytically for \(H(w;\theta)\) and \(s(w;\theta)\)
Take expectations
Plug in \(\hat{\theta}\) for \(\theta_0\)
Problems
Full joint distribution of \(w\) may be unknown
Expectations may be difficult
Distribution may be misspecified
Estimator 2 (least structural)#
Use analogy principle (sample averages)
Plug in \(\hat{\theta}\)
Estimators:
Properties:
\(\hat{A}\xrightarrow{p}A_0\), \(\hat{B}\xrightarrow{p}B_0\)
\(\hat{A}\) not guaranteed to be positive definite
If \(\hat{\theta}\) strictly minimizes the sample criterion, \(\hat{A}\) is positive definite
Lack of positive definiteness usually indicates poor identification.
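A sketch of Estimator 2 for a generic per-observation criterion, using finite-difference scores and Hessians; the exponential-mean NLS criterion is reused here purely as an illustrative \(q(w;\theta)\), and the numerical-derivative helpers are ad hoc choices rather than a prescribed implementation.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
N, theta_true = 500, np.array([0.5, 1.0])                  # illustrative NLS setup
x = rng.uniform(0, 2, size=N)
y = np.exp(theta_true[0] + theta_true[1] * x) + rng.standard_normal(N)

def q_i(theta, i):
    """Per-observation criterion q(w_i; theta): squared NLS residual."""
    return (y[i] - np.exp(theta[0] + theta[1] * x[i])) ** 2

theta_hat = minimize(lambda th: np.mean((y - np.exp(th[0] + th[1] * x)) ** 2),
                     x0=np.zeros(2), method="BFGS").x

def num_grad(f, th, h=1e-5):
    """Central finite-difference gradient of f at th."""
    g = np.zeros_like(th)
    for j in range(th.size):
        e = np.zeros_like(th); e[j] = h
        g[j] = (f(th + e) - f(th - e)) / (2 * h)
    return g

def num_hess(f, th, h=1e-4):
    """Finite-difference Hessian of f at th (differences of gradients)."""
    P = th.size
    H = np.zeros((P, P))
    for j in range(P):
        e = np.zeros(P); e[j] = h
        H[:, j] = (num_grad(f, th + e, h) - num_grad(f, th - e, h)) / (2 * h)
    return 0.5 * (H + H.T)

# Estimator 2: sample averages of per-observation Hessians and score outer products
A_hat = np.mean([num_hess(lambda th: q_i(th, i), theta_hat) for i in range(N)], axis=0)
scores = [num_grad(lambda th: q_i(th, i), theta_hat) for i in range(N)]
B_hat = np.mean([np.outer(s, s) for s in scores], axis=0)

Avar = np.linalg.inv(A_hat) @ B_hat @ np.linalg.inv(A_hat) / N
print("standard errors:", np.sqrt(np.diag(Avar)))
```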
Estimator 3 (intermediate case)#
Let \(w=\{y,x\}\) and define
Then
Estimator:
More efficient than estimator 2 but less robust to misspecification.
Covariance estimators#
Using estimator 2:
Using estimator 3:
Example: NLS using estimator 3
Recall:
Scores:
Variance of scores
Hessian
Conditional expectation:
Estimator:
NLS covariance estimator
Remarks
This is White’s heteroskedasticity-consistent covariance estimator.
Robust to serial correlation in panel data.
If errors are homoskedastic, the estimator simplifies.
If \(m\) is linear in \(\theta\), this reduces to the usual OLS covariance matrix.
Optimization methods#
How to solve the sample minimization problem numerically?
In nonlinear models, the optimum rarely has a closed form and must be solved numerically.
Algorithms#
Newton–Raphson
BHHH (named after its four authors, Berndt, Hall, Hall, and Hausman)
BFGS (Broyden-Fletcher-Goldfarb-Shanno method)
many others
The first three belong to the class of quasi-Newton algorithms, which have the form
where:
\(g\) is index of iteration
\(\lambda\) is step size
\(\sum_i H_i(\theta^{(g)})\) is the sample sum of the Hessians of \(q(w_i;\theta)\) evaluated at \(\theta^{(g)}\); it reflects the curvature of \(Q_N=\sum_i q(w_i;\theta)\)
\(\sum_i s_i(\theta^{(g)})\) is the sample sum of the scores evaluated at \(\theta^{(g)}\); it measures the slope of \(Q_N\)
Newton–Raphson (NR)#
Here we use the flavour of the Newton–Raphson method adapted to optimization problems: finding the root of the system of first-order conditions, here in multiple dimensions.
Second order Taylor expansion of the objective function around \(\theta^{(g)}\):
FOC:
This is the Newton–Raphson algorithm for solving the system of first order conditions.
Properties
Uses slope and curvature
Moves downhill if Hessian is positive definite
Step size \(\lambda\)#
In addition to the plain Newton-Raphson update, it is typical to include a step-size search. This ensures global convergence of the algorithm, albeit at a slower rate.
Start with full Newton step, \(\lambda=1\)
If the step decreases the objective, proceed with \(\lambda=1\)
If the step does not decrease the objective, reduce \(\lambda\) to half its value
Repeat steps 2–3 until an improving step is found
The step sizes are therefore in the sequence \(\lambda=1/2,1/4,1/8,\ldots\), so the approach is often called step-halving line search.
This addition is computationally cheap because the Hessian and the scores do not have to be recomputed for different \(\lambda\) values, only the function values.
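A compact sketch of the Newton-Raphson update with step-halving for the logit negative log-likelihood (chosen here because its Hessian is globally positive definite); the data-generating process, starting value, and tolerances are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
N, beta_true = 2_000, np.array([-0.5, 1.0])            # hypothetical logit DGP
X = np.column_stack([np.ones(N), rng.standard_normal(N)])
y = (rng.uniform(size=N) < 1 / (1 + np.exp(-X @ beta_true))).astype(float)

def Q(beta):
    """Sample objective sum_i q(w_i; beta): logit negative log-likelihood."""
    z = X @ beta
    return -np.sum(y * z - np.log1p(np.exp(z)))

def score_sum(beta):
    """sum_i s(w_i; beta), the gradient of Q."""
    G = 1 / (1 + np.exp(-X @ beta))
    return -X.T @ (y - G)

def hess_sum(beta):
    """sum_i H(w_i; beta) = sum_i G_i (1 - G_i) x_i x_i'."""
    G = 1 / (1 + np.exp(-X @ beta))
    return (X * (G * (1 - G))[:, None]).T @ X

beta = np.zeros(2)                                     # starting value
for _ in range(50):
    direction = np.linalg.solve(hess_sum(beta), score_sum(beta))
    lam = 1.0                                          # try the full Newton step first
    while Q(beta - lam * direction) >= Q(beta) and lam > 1e-8:
        lam /= 2                                       # step-halving line search
    beta_new = beta - lam * direction
    if np.max(np.abs(beta_new - beta)) < 1e-9:
        beta = beta_new
        break
    beta = beta_new

print("Newton-Raphson estimate:", beta)                # close to beta_true for large N
```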
Newton-Raphson works best when the objective is:
Close to quadratic
Convex
Yet it may be that the Hessian is not positive definite far from the optimum, leading to uphill moves. The algorithm can be improved by ensuring that the curvature matrix is always positive definite.
BHHH#
BHHH approximates the Hessian by the outer product of scores:
Motivation
Information identity: \(E(s_i s_i') = E(H_i)\) at the true parameters
Derived for the MLE estimator, but works more generally for M-estimators
Advantages
Always positive definite
No need to compute Hessian
Disadvantages
Approximation valid:
At true parameters
For large \(N\)
For well-specified models
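A minimal sketch of the BHHH Hessian approximation, reusing per-observation logit scores as the assumed example: the sum of Hessians in the quasi-Newton update is replaced by the outer product of the score matrix.

```python
import numpy as np

rng = np.random.default_rng(6)
N = 2_000
X = np.column_stack([np.ones(N), rng.standard_normal(N)])
y = (rng.uniform(size=N) < 1 / (1 + np.exp(-(X @ np.array([-0.5, 1.0]))))).astype(float)

def scores(beta):
    """N x P matrix of per-observation scores of the logit negative log-likelihood."""
    G = 1 / (1 + np.exp(-X @ beta))
    return -(y - G)[:, None] * X

beta = np.zeros(2)
S = scores(beta)
H_bhhh = S.T @ S                                 # outer-product-of-scores approximation
step = np.linalg.solve(H_bhhh, S.sum(axis=0))    # quasi-Newton step using H_bhhh
print("first BHHH update:", beta - step)
```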
BFGS and other quasi-Newton methods#
BFGS builds up an approximation to the inverse Hessian iteratively, ensuring positive definiteness.
There are other approaches for approximation of the Hessian, such as DFP (Davidon-Fletcher-Powell) method, that have been developed for general optimization problems.
BHHH, which is based on the statistical properties of the scores, is more specific to econometric M-estimation problems and may not be found in standard optimization packages.
Maximum likelihood estimation#
Maximum likelihood estimation (MLE) is a special case of M-estimation where the criterion function is the negative log-likelihood function.
Rather than focusing on conditional moments as in nonlinear regression, MLE focuses on the entire conditional distribution of the endogenous variable given the exogenous variables.
Not only conditional moments as in NLS, i.e. \(E(y \mid x)\), \(\operatorname{Var}(y \mid x)\)
But the entire conditional distribution of the random variable \(y\), i.e. \(f(y \mid x)\)
Why MLE?
Efficiency: Uses information on the distribution of endogenous variables in the most efficient way – if the distribution is correctly specified, MLE is asymptotically efficient and achieves the Cramér-Rao lower bound, the lowest possible asymptotic variance among all consistent estimators
More structure gives more information: Able to estimate any feature of the conditional distribution, such as
\(P(y = 1 \mid x)\) (if \(y\) is discrete)
\(P(y \in (\underline{y}, \bar{y}) \mid x)\) (if \(y\) is continuous)
Conditional moments, \(E(y \mid x)\), \(\operatorname{Var}(y \mid x)\)
etc.
And how these objects change when \(x\) or \(\theta\) changes
Why not MLE?
Robustness: MLE is generally inconsistent if the distribution is misspecified
yet sometimes MLE is actually robust to mis-specification
for example, in the nonlinear regression model, MLE under the assumption of normal errors coincides with NLS, which remains consistent even if normality fails
Example: Binary Choice
Microeconomic model
\(y\) : binary random variable, \(y \in \{0,1\}\)
\(\beta_0\) : \(P \times 1\) parameter vector
\(G(x'\beta_0)\) : model for the conditional probability of \(y = 1\) given \(x\)
Logit:
\[
G(x'\beta_0) = \Lambda(x'\beta_0) = \frac{\exp(x'\beta_0)}{1 + \exp(x'\beta_0)}
\]

Probit:

\[
G(x'\beta_0) = \Phi(x'\beta_0), \quad \Phi(x'\beta_0) = \int_{-\infty}^{x'\beta_0} \frac{1}{\sqrt{2\pi}} \exp\!\left(-\frac{z^2}{2}\right) dz
\]
Conditional probability of \(y\)
Note: \(\beta_0\) is defined as the vector that indexes \(f(y \mid x; \beta_0)\).
Framework for Conditional MLE#
The estimator is called the conditional MLE because we leave the marginal distribution of \(x\) unmodeled, take it as given, and focus on modeling the conditional distribution of \(y\) given \(x\).
Object of interest
\(p_0(y \mid x)\), i.e. the “true” conditional density of \(y_i\) given \(x_i = x\), which is defined with respect to some measure \(\nu\) on the corresponding probability space.
Parametric model
We model \(p_0(y \mid x)\) as
Assumptions
\(f(y \mid x; \theta) \ge 0\)
\(\int_{\mathcal{Y}} f(y \mid x; \theta)\, \nu(dy) = 1\)
for all \(x \in \mathcal{X}\) and each \(\theta \in \Theta\), where \(\mathcal{Y}\) and \(\mathcal{X}\) are the supports of \(y\) and \(x\)
\(f(y \mid x; \theta_0)\) is a correctly specified model for \(p_0(y \mid x)\),
i.e. \(p_0(y \mid x) = f(y \mid x; \theta_0)\) for some \(\theta_0 \in \Theta\)
Population problem#
Objective function
and (we will show)
or equivalently
Sample problem#
Objective function
The value \(\hat{\theta}_{CMLE}\) that maximizes \(Q_N(\theta)\) is the conditional maximum likelihood estimator (CMLE):
\(\hat{\theta}_{CMLE}\) is clearly an M-estimator.
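A minimal sketch of the conditional MLE for the logit example from above; the data-generating process and optimizer choice are illustrative assumptions. The sample criterion is written as the average negative log-likelihood, so the CMLE is obtained by minimization.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(7)
N, beta_true = 5_000, np.array([-0.5, 1.0])            # hypothetical true parameters
X = np.column_stack([np.ones(N), rng.standard_normal(N)])
y = (rng.uniform(size=N) < 1 / (1 + np.exp(-X @ beta_true))).astype(float)

def neg_avg_loglik(beta):
    """Q_N(beta) = -(1/N) sum_i ln f(y_i | x_i; beta) for the logit."""
    z = X @ beta
    return -np.mean(y * z - np.log1p(np.exp(z)))

beta_hat = minimize(neg_avg_loglik, x0=np.zeros(2), method="BFGS").x
print("beta_hat_CMLE:", beta_hat)                      # close to beta_true
```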
Identification of \(\theta_0\)#
Strategy
First prove identification for unconditional MLE
Then show that this implies identification for CMLE
Identification, unconditional case
Jensen’s inequality
\(u\) : non-constant random variable
\(a(u)\) : strictly concave function
Then \(E[a(u)] < a(E[u])\).
Let
\(u = \frac{f(z;\theta)}{f(z;\theta_0)}\), \(\theta \ne \theta_0\)
\(a(u) = \ln(u)\)
Then
Now
Hence
Therefore
Thus \(E[\ln f(z;\theta)]\) is maximized at \(\theta=\theta_0\).
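For reference, a compact version of the Jensen-inequality argument sketched above (expectations are taken with respect to the true density \(f(z;\theta_0)\), and the ratio is assumed non-constant so the inequality is strict):

\[
E\!\left[\ln \frac{f(z;\theta)}{f(z;\theta_0)}\right]
< \ln E\!\left[\frac{f(z;\theta)}{f(z;\theta_0)}\right]
= \ln \int \frac{f(z;\theta)}{f(z;\theta_0)}\, f(z;\theta_0)\, \nu(dz)
= \ln \int f(z;\theta)\, \nu(dz) = \ln 1 = 0,
\]

so that \(E[\ln f(z;\theta)] < E[\ln f(z;\theta_0)]\) for every such \(\theta \neq \theta_0\).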
Identification, conditional case
The argument also works for conditional densities:
implies
and therefore
is maximized at \(\theta=\theta_0\).
Asymptotic properties of MLE#
Consistency#
Can be established as a special case of Theorem 12.2 in Wooldridge [2010], see Theorem 13.1.
Under the following assumptions, \(\hat{\theta}_{CMLE}\) is consistent for \(\theta_0\), i.e. \(\hat{\theta}_{CMLE} \xrightarrow{p} \theta_0\)
Assumptions:
Random sample
\(\theta_0\) is identified (unique maximizer of \(Q(\theta)\))
Model is correctly specified: \(p_0(y\mid x)=f(y\mid x;\theta_0)\) for some \(\theta_0 \in \Theta\)
\(f(y\mid x;\theta)\) is a well-specified density for all \(\theta\) and \(x\): everywhere positive and integrates to 1
Log-likelihood is continuous in \(\theta\)
Parameter space \(\Theta\) is compact (closed, bounded)
Technical conditions:
Borel measurability
Objective function bounded in absolute value
Asymptotic distribution#
Additional assumptions:
\(\theta_0 \in \operatorname{int}(\Theta)\)
Log-likelihood is twice continuously differentiable in \(\theta\)
We can apply Theorem 12.3 in Wooldridge [2010] to obtain the asymptotic distribution of CMLE.
Under the assumptions above,
where
Thus
If we want to apply Theorem 12.3 directly, we must turn the maximization problem into a minimization problem. Define
Hence \(H(w;\theta_0)\) is the second derivative of \(-\ln f(y\mid x;\theta_0)\). Keeping track of the minus sign, let
so that
Information identity#
Major result for MLE!
Conditional:
Unconditional:
Thus \(A_0 = B_0\).
Applying this gives the limiting distribution for MLE:
and
Note: The information identity only applies
As \(N \to \infty\)
When the model is correctly specified
At the true parameters
Otherwise we must use
Estimating the asymptotic variance#
Estimating \(A_0\) and \(B_0\) is similar to the general M-estimation case.
Each of the following matrices converges to \(A_0=B_0\):
Pros and cons
(1) always available, but can be negative definite
(2) easy and always positive definite, but poor small-sample properties
(3) best small-sample properties and positive definite, but hard to compute
Hence
Example: Binary Choice
Conditional density
Conditional log-likelihood
We use the estimator \(\frac{1}{N}\sum_{i=1}^N A(x_i;\hat{\beta})\).
1: Compute Hessian
Score
where \(g(z)=\frac{\partial G(z)}{\partial z}\).
Hessian
2: Conditional expected Hessian
The second term has conditional expectation zero:
Hence
3: Sample analogue
Asymptotic variance of \(\hat{\beta}\)
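To close the example, a sketch of the variance computation for the logit, where \(g = G(1-G)\) so the expected Hessian simplifies to \(A(x;\beta) = G(x'\beta)\,[1-G(x'\beta)]\,xx'\); the data and estimation step are the same illustrative setup used earlier. The OPG estimator is computed alongside to show the information identity at work.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(8)
N, beta_true = 5_000, np.array([-0.5, 1.0])             # illustrative logit DGP
X = np.column_stack([np.ones(N), rng.standard_normal(N)])
y = (rng.uniform(size=N) < 1 / (1 + np.exp(-X @ beta_true))).astype(float)

def neg_avg_loglik(beta):
    z = X @ beta
    return -np.mean(y * z - np.log1p(np.exp(z)))

beta_hat = minimize(neg_avg_loglik, x0=np.zeros(2), method="BFGS").x
G = 1 / (1 + np.exp(-X @ beta_hat))

# Expected-Hessian estimator (estimator 3): for the logit, A(x; beta) = G(1-G) x x'
A_hat = (X * (G * (1 - G))[:, None]).T @ X / N

# Outer product of scores (estimator 2); close to A_hat by the information identity
S = (y - G)[:, None] * X
B_hat = S.T @ S / N

Avar = np.linalg.inv(A_hat) / N                          # Avar(beta_hat) = A_hat^{-1} / N
print("standard errors:", np.sqrt(np.diag(Avar)))
print("A_hat:\n", A_hat, "\nB_hat:\n", B_hat)
```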
References and Additional Resources
📖 Wooldridge [2010] “Econometric Analysis of Cross Section and Panel Data”, The New Palgrave Dictionary of Economics
Chapter 12
Chapter 13