Overidentification Tests

Overidentification tests are a staple of modern econometrics, used when researchers employ instruments to identify causal effects in models where the key explanatory variable is endogenous. In short, these tests check whether the extra instruments beyond the minimum needed for identification behave like valid instruments: uncorrelated with the structural error term and correctly excluded from the outcome equation. The classic tests you’ll encounter are the Sargan test and Hansen’s J test, both rooted in the broader framework of the Generalized method of moments. When the data pass these tests, researchers gain greater confidence in their causal interpretation; when they fail, they face questions about instrument validity and model specification.

Overview

Overidentification arises whenever there are more instruments than endogenous variables. Suppose a model uses m instruments to identify a system with r endogenous variables, with m > r. The m − r surplus instruments imply extra moment conditions, which are tested for consistency with the assumed model. If the sample moments are jointly close to zero, the instruments are considered plausibly exogenous and properly excluded. If not, at least one instrument may be invalid, or the model may be misspecified.

In practice, researchers implement either a two-stage approach or a full GMM estimator and report a J-statistic (called the Hansen J statistic when computed with a heteroskedasticity-robust weighting matrix) or its Sargan counterpart (in the traditional, homoskedastic setting). Under the null hypothesis of valid instruments, the statistic is asymptotically chi-square distributed, with degrees of freedom equal to the number of overidentifying restrictions (the number of instruments minus the number of endogenous variables).
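
To fix ideas, here is a minimal sketch of the two-stage computation and the resulting Sargan statistic on simulated data. Everything in it (the data-generating process, the variable names, and the NumPy-only implementation) is an illustrative assumption rather than a reference implementation:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 1000

    z = rng.normal(size=(n, 3))            # three instruments for one endogenous x
    v = rng.normal(size=n)                 # first-stage error
    u = 0.5 * v + rng.normal(size=n)       # structural error, correlated with x via v
    x = z @ np.array([1.0, 0.8, 0.6]) + v  # first stage: instruments are relevant
    y = 2.0 * x + u                        # structural equation, true beta = 2

    X = np.column_stack([np.ones(n), x])   # regressors: constant plus endogenous x
    Z = np.column_stack([np.ones(n), z])   # instruments: constant plus z

    # Two-stage least squares: project X onto Z, then regress y on the projection.
    P = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)
    beta_2sls = np.linalg.solve(P.T @ X, P.T @ y)
    u_hat = y - X @ beta_2sls

    # Sargan statistic: n times the R^2 from regressing the 2SLS residuals on Z.
    g = np.linalg.solve(Z.T @ Z, Z.T @ u_hat)
    r2 = 1.0 - np.sum((u_hat - Z @ g) ** 2) / np.sum((u_hat - u_hat.mean()) ** 2)
    sargan = n * r2
    print(f"Sargan statistic: {sargan:.2f} (df = 2)")

With three excluded instruments and one endogenous regressor, the test has 3 − 1 = 2 overidentifying restrictions, so the statistic is compared against a chi-square distribution with 2 degrees of freedom.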

Key terms you’ll see include Instrumental variables, Two-stage least squares, Exogeneity, Exclusion restriction, and Overidentifying restrictions.

Formal framework

  • The basic idea centers on moment conditions: if the instruments are valid, E[z_t u_t] = 0, where z_t denotes the instruments and u_t the structural error. The observed sample counterparts generate a measure of how far those moments are from zero.

  • The Sargan test, named after Denis Sargan, implements this in a setting with homoskedastic errors. When the data strongly contradict the moment conditions, the test rejects the null of instrument validity.

  • Hansen’s J test extends the idea to heteroskedastic environments and to the general GMM framework. The Hansen J statistic remains valid under heteroskedasticity when the estimator uses an appropriate weighting matrix, and it remains a primary diagnostic for exogeneity in complex models; a sketch of the two-step computation follows this list.

  • The practical upshot: a non-rejection lends plausibility to the instruments, while a rejection flags that at least one instrument fails the exclusion/orthogonality criteria, or that other model assumptions (such as linearity) are violated.
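
As a companion to the Sargan sketch above, the following sketch computes the two-step GMM J statistic with a heteroskedasticity-robust weighting matrix. It reuses the simulated arrays y, X, Z, and beta_2sls from the earlier example; the function name hansen_j and the two-step recipe shown here are illustrative assumptions, not a prescribed implementation:

    import numpy as np

    def hansen_j(y, X, Z, beta_init):
        """Two-step GMM: robust weighting matrix, efficient beta, J statistic."""
        n = len(y)
        # Step 1: moment covariance at an initial estimate (e.g., 2SLS).
        zu = Z * (y - X @ beta_init)[:, None]
        S = zu.T @ zu / n                 # heteroskedasticity-robust covariance
        W = np.linalg.inv(S)              # optimal weighting matrix
        # Step 2: re-estimate beta using the efficient weighting matrix.
        beta = np.linalg.solve(X.T @ Z @ W @ Z.T @ X, X.T @ Z @ W @ Z.T @ y)
        # J = n * g_bar' W g_bar, asymptotically chi-square(m - r) under the null.
        g_bar = Z.T @ (y - X @ beta) / n
        return n * g_bar @ W @ g_bar, beta

    j_stat, beta_gmm = hansen_j(y, X, Z, beta_2sls)
    print(f"Hansen J: {j_stat:.2f} (df = {Z.shape[1] - X.shape[1]})")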

Useful links for this framework include Generalized method of moments, Instrumental variables, Exogeneity, and Exclusion restriction.

Key tests and how to read them

  • Sargan test: Compares the sample moments against zero under the assumption of homoskedastic errors. A large test statistic (relative to the chi-square distribution with df = m − r) leads to rejection of the null of instrument validity.

  • Hansen J test: Uses the robust (heteroskedasticity-consistent) weighting matrix. It’s the preferred default in many applied settings because real-world data rarely exhibit perfectly homoskedastic errors. The interpretation remains the same: a failure to reject suggests instruments are plausibly valid, while rejection signals a problem with instrument validity or model specification.

  • J-statistic interpretation notes:

    • A non-rejected null does not prove the instruments are valid; it only supports their plausibility within the maintained model.
    • A rejection does not identify which instrument is at fault; it could be multiple instruments, weak instruments, or misspecification in the outcome equation.
  • Related diagnostics researchers often consider in tandem:

    • Weak instruments tests (e.g., Cragg-Donald, Kleibergen-Paap rk) to assess instrument strength, since weak instruments can distort inference and complicate overidentification testing; a first-stage sketch follows this list.
    • Instrument proliferation concerns, discussed below, which can affect the distribution of the J statistic even when a subset of instruments is valid.
    • Robustness checks across alternative instrument sets and estimation strategies such as LIML (limited-information maximum likelihood) or Fuller methods.
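
As a rough illustration of the strength diagnostics mentioned in the list above, the sketch below computes the first-stage F statistic for the excluded instruments, again reusing the simulated arrays from the earlier sketches. With a single endogenous regressor and homoskedastic errors this F statistic coincides with the Cragg-Donald statistic; the function name and column layout are assumptions carried over from those sketches:

    import numpy as np

    def first_stage_f(x, Z, n_excluded):
        """F statistic for the joint significance of the last n_excluded columns of Z."""
        n, m = Z.shape
        # Unrestricted first stage: x on the full instrument set.
        g_u, *_ = np.linalg.lstsq(Z, x, rcond=None)
        rss_u = np.sum((x - Z @ g_u) ** 2)
        # Restricted first stage: drop the excluded instruments.
        Zr = Z[:, : m - n_excluded]
        g_r, *_ = np.linalg.lstsq(Zr, x, rcond=None)
        rss_r = np.sum((x - Zr @ g_r) ** 2)
        return ((rss_r - rss_u) / n_excluded) / (rss_u / (n - m))

    # Compare against Stock-Yogo critical values rather than a rule of thumb alone.
    print(f"First-stage F: {first_stage_f(x, Z, 3):.1f}")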

Useful links: Sargan test, Hansen J test, Weak instrument, Cragg-Donald statistic, Kleibergen-Paap rk statistic, LIML, Fuller.

Practical considerations and best practices

  • Instrument strength matters: Stronger instruments improve the reliability of overidentification tests, but researchers should still assess exogeneity directly and consider complementary strength diagnostics.

  • Beware instrument proliferation: Using a large set of instruments can distort the finite-sample behavior of the J statistic and reduce the test’s reliability. Practitioners often trim the instrument set or use methods designed to mitigate this issue.

  • Robustness to misspecification: Overidentification tests are diagnostic tools, not panaceas. If the model is misspecified, or if the true relationship is nonlinear, the test can reject for reasons unrelated to exogeneity.

  • Robust vs classical tests: In the presence of heteroskedasticity or other irregularities, Hansen’s J with a robust weighting matrix is preferred to the classical Sargan approach.

  • Complementary procedures: In addition to the J test, researchers may report results from alternative estimators (e.g., LIML or Fuller modifications) and perform sensitivity analyses across different instrument sets. They may also check external validity by comparing findings across related datasets or natural experiments.

  • Practical references: Many applied studies discuss instrument validity in depth, and scholars frequently consult Stock-Yogo critical value tables for weak-instrument thresholds.

Encyclopedia entries related to these practices include Instrumental variables, Two-stage least squares, Exogeneity, and GMM.

Controversies and debates

  • What the test does and does not prove: A central debate is about how much weight to assign to a non-rejection or rejection of the null. Proponents emphasize that a non-rejection is a signal instruments pass a plausibility check given the model; opponents remind readers that the test cannot confirm correctness of the entire causal structure, as misspecification elsewhere can masquerade as exogeneity.

  • Misspecification vs exogeneity: Critics argue that overidentification tests can be overly sensitive to small misspecifications in the outcome equation, functional form, or measurement error. In such cases, a rejection might reflect these flaws rather than genuine exogeneity violations. Proponents respond that when combined with theory-based instrument selection and robustness checks, the tests remain a valuable safeguard.

  • Weak instruments and power: In settings with weak instruments, the J statistic can have distorted size properties, making it hard to distinguish truly invalid instruments from noisy data. The debate centers on how to balance instrument strength with the desire for credible causal identification. The standard response is to use alternative diagnostics (e.g., Cragg-Donald or Kleibergen-Paap tests) and to prefer better-justified instruments over quantity.

  • Instrument proliferation and policy conclusions: Some critics warn that piling on instruments to achieve a pass on overidentification tests can lure researchers into overfitting and spurious precision. Supporters counter that disciplined instrument selection, theoretical justification, and cross-study replication guard against such problems, and that these tests are part of a broader toolkit to avoid biased policy conclusions.

  • Relevance to political discourse: In public discussions of empirical policy evaluation, overidentification tests are sometimes invoked in debates about credibility of estimated effects. Proponents argue that rigorous diagnostics, including overidentification tests, are essential for credible policy analysis and resource allocation. Critics sometimes claim diagnostic routines can be wielded to cherry-pick results, but the standard defense is that transparency about methodologies and robustness checks mitigates such concerns.

See also discussions of Policy evaluation, Program evaluation, and related econometric methods in Econometrics and Macro-econometrics.

See also