Likelihood Ratio Test

The Likelihood Ratio Test (LRT) is a foundational tool in statistics for deciding between two competing models. It rests on the idea that a model’s fit to observed data can be summarized by its likelihood—a measure of how probable the data are under that model. When we compare a simpler, or null, model against a more complex, or alternative, model, the LRT looks at how much the maximum of the likelihood improves when we allow extra parameters. The test statistic is built from the ratio of those maximum likelihoods and is interpreted through its distribution under the null hypothesis.

What makes the LRT attractive is its principled grounding in the likelihood framework. Unlike ad hoc threshold rules, the LRT ties model comparison to the data-generating process assumed by the model. It also connects naturally to broader ideas in model selection and estimation, and it behaves in predictable ways as sample size grows. In practice, researchers in fields ranging from economics to biostatistics to engineering rely on the LRT to test hypotheses about effects, structures, and relationships encoded in their models.

The LRT is most straightforward when the models are nested: the null model is a special case of the alternative, obtained by fixing certain parameters. In that setting, one computes the maximum likelihood under the null, the maximum likelihood under the alternative, and forms a statistic that measures how much fit is lost when the simpler model's constraints are imposed. The classical form uses the deviance D = -2 log(L0/L1) = -2[log L0 − log L1], where L0 is the maximized likelihood under the null and L1 under the alternative. Under broad regularity conditions, and in large samples, D follows an approximate chi-square distribution with degrees of freedom equal to the number of extra free parameters in the alternative model. This result, often attributed to Wilks and known as Wilks' theorem, provides a calibrated way to convert the observed ratio into a p-value and a decision about the null hypothesis. See hypothesis testing and Wilks' theorem.
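As a minimal numeric sketch of this calculation, the snippet below converts two maximized log-likelihoods into a deviance and a Wilks p-value; the log-likelihood values and the parameter count are hypothetical, chosen only for illustration.

```python
# Minimal sketch: turn two (hypothetical) maximized log-likelihoods
# into a deviance and a chi-square p-value via Wilks' theorem.
from scipy.stats import chi2

log_L0 = -1324.6  # hypothetical maximized log-likelihood, null model
log_L1 = -1319.2  # hypothetical maximized log-likelihood, alternative
k = 2             # extra free parameters in the alternative (hypothetical)

D = -2 * (log_L0 - log_L1)   # deviance: -2 log(L0/L1)
p_value = chi2.sf(D, df=k)   # upper-tail probability of chi-square with k df

print(f"D = {D:.2f}, p = {p_value:.4f}")
```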

Theory and formulation

- Likelihoods and nested models: Let H0 be the null hypothesis with parameter vector θ0 and H1 the alternative with θ1, where θ0 is a constrained version of θ1. The likelihood L(θ) summarizes how well the model with parameter θ explains the data.
- Test statistic: The likelihood ratio statistic is often written as Λ = L0/L1, and the deviance as D = -2 log Λ = -2[log L0 − log L1]. A larger D indicates that the more complex model provides a substantially better fit.
- Null distribution: Under H0 and regularity conditions, D is approximately χ²_k in large samples, where k is the number of free parameters added in the alternative relative to the null; a simulation sketch follows this list.
- Interpretation: If the observed D exceeds the χ²_k quantile for a chosen significance level, the null hypothesis of the simpler model is rejected in favor of the more complex one. See chi-square distribution and p-value.
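The simulation below is a sketch of the null distribution in a case simple enough to check by hand: X_i ~ N(μ, 1) with known variance, testing H0: μ = 0 against a free mean. For this model the deviance reduces to the closed form D = n·x̄², which should behave like χ²_1; the sample size and seed are arbitrary choices.

```python
# Simulation sketch: check the chi-square calibration of the deviance
# for a normal-mean test. Model: X_i ~ N(mu, 1), sigma known.
# H0: mu = 0 vs H1: mu free; here D = -2[log L(0) - log L(xbar)] = n * xbar**2.
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(0)
n, n_sims = 50, 10_000

# Simulate many datasets under H0 and compute the deviance for each.
xbar = rng.normal(loc=0.0, scale=1.0, size=(n_sims, n)).mean(axis=1)
D = n * xbar**2

# The rejection rate at the 5% level should be close to the nominal 0.05.
crit = chi2.ppf(0.95, df=1)
print(f"rejection rate: {np.mean(D > crit):.3f} (nominal 0.05)")
```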

Computation and interpretation

- Steps: Estimate the null model via maximum likelihood to obtain L0, estimate the full model (the alternative) to obtain L1, compute D = -2 log(L0/L1), and compare D to the χ²_k distribution with k equal to the difference in parameter count between the models; a worked sketch follows this list.
- Practical notes: The LRT is especially natural for generalized linear models and many other likelihood-based frameworks. It remains a benchmark against which alternatives such as the Wald test or information-criterion-based methods (e.g., AIC or BIC) are evaluated.
- Non-nested cases: When models are not nested, the standard LRT is not applicable. In such cases, other tools such as the Vuong test or bootstrap-based methods may be used to assess relative fit. See model selection for broader context.
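The sketch below walks through the three steps for a pair of nested logistic regressions, using the statsmodels library; the simulated data, variable names, and effect sizes are illustrative assumptions, not part of any particular application.

```python
# Sketch of the computational steps for nested logistic regressions:
# fit the null model, fit the full model, compare D to chi-square(k).
import numpy as np
import statsmodels.api as sm
from scipy.stats import chi2

rng = np.random.default_rng(1)
n = 500
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = rng.binomial(1, 1 / (1 + np.exp(-0.5 * x1)))      # x2 has no true effect

X_null = sm.add_constant(np.column_stack([x1]))        # null: intercept + x1
X_full = sm.add_constant(np.column_stack([x1, x2]))    # full: adds x2

fit0 = sm.GLM(y, X_null, family=sm.families.Binomial()).fit()
fit1 = sm.GLM(y, X_full, family=sm.families.Binomial()).fit()

k = X_full.shape[1] - X_null.shape[1]                  # extra parameters
D = -2 * (fit0.llf - fit1.llf)                         # deviance
print(f"D = {D:.3f}, p = {chi2.sf(D, df=k):.3f}")
```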

Asymptotics, robustness, and extensions

- Regularity and identifiability: The chi-square null distribution relies on regularity conditions, including identifiability and differentiability of the likelihood. Violations can distort the distribution, especially in small samples.
- Small-sample alternatives: In modest samples, the chi-square approximation can be unreliable. Parametric bootstrap or permutation approaches can be employed to obtain more accurate p-values by simulating data under H0; a bootstrap sketch follows this list. See bootstrapping and permutation test.
- Wide applicability: The LRT extends beyond simple normal-linear models to many generalized linear models and beyond, making it a versatile tool for hypothesis testing in diverse settings.
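As a minimal sketch of the parametric bootstrap, the snippet below reuses the normal-mean example (H0: μ = 0, σ = 1 known, so D = n·x̄²) and calibrates the p-value by simulating datasets under H0 rather than invoking the chi-square approximation; the observed sample, its size, and the bootstrap replicate count are all illustrative.

```python
# Parametric-bootstrap sketch: calibrate the LRT p-value by simulating
# under H0 instead of relying on the large-sample chi-square result.
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(loc=0.3, scale=1.0, size=20)   # small "observed" sample
n = x.size
D_obs = n * x.mean() ** 2                     # observed deviance

B = 5000                                      # bootstrap replicates
xbar_boot = rng.normal(loc=0.0, scale=1.0, size=(B, n)).mean(axis=1)
D_boot = n * xbar_boot**2                     # deviances simulated under H0

# Bootstrap p-value: share of simulated deviances at least as extreme.
p_boot = (1 + np.sum(D_boot >= D_obs)) / (B + 1)
print(f"D = {D_obs:.3f}, bootstrap p = {p_boot:.4f}")
```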

Practical considerations and best practices

- Model specification matters: The LRT tests whether a more complex, nested model provides a substantially better explanation of the data. If the models are misspecified, the test can give misleading results; careful model checking and diagnostics are essential.
- Use the same data for both models: The LRT presumes that both models are fit to the same data sample. If the data are collected differently across models, or if model assumptions are violated, interpretation becomes difficult.
- Complementary information: The LRT provides a p-value for testing a specific nested hypothesis, but it should be interpreted alongside effect sizes, confidence intervals, and domain knowledge. Information criteria such as AIC or BIC offer complementary perspectives on model quality that balance fit against complexity; a short sketch follows this list.
- Controversies and debates: Critics sometimes argue that rigid thresholds and binary decisions based on p-values encourage mechanical thinking and poor replication. Proponents of the LRT respond that, when it is used with proper model specification, robust diagnostics, and complementary metrics, the test remains a principled, transparent basis for inference. In debates about statistical practice, some observers emphasize exploratory data analysis and Bayesian methods as alternatives or supplements; defenders of the LRT point to its long-run error-controlling properties and its direct connection to likelihood and estimation theory. Advocates of narrower or broader interpretations of statistical evidence often disagree about thresholds, but the LRT's core idea, assessing whether added parameters meaningfully improve fit, remains a stable reference point. From a practical standpoint, keeping the focus on model credibility, replicability, and clear reporting tends to yield sound conclusions.
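The sketch below shows how AIC and BIC can be reported alongside the LRT for the same nested pair, using the standard formulas AIC = 2k − 2 log L and BIC = k log n − 2 log L; the parameter counts and log-likelihoods are hypothetical.

```python
# Sketch: complement the LRT with AIC and BIC for the same nested pair.
# AIC = 2k - 2 log L; BIC = k log(n) - 2 log L, where k is the number of
# free parameters and n the sample size (all values here are hypothetical).
import math

n = 500
models = {
    "null (k=2)": (2, -301.4),   # (parameter count, maximized log-likelihood)
    "full (k=3)": (3, -298.9),
}
for name, (k, llf) in models.items():
    aic = 2 * k - 2 * llf
    bic = k * math.log(n) - 2 * llf
    print(f"{name}: AIC = {aic:.1f}, BIC = {bic:.1f}")
```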

See also

- Hypothesis testing
- Maximum likelihood estimation
- p-value
- Chi-squared distribution
- Wilks' theorem
- Generalized linear model
- Model selection
- Bootstrapping (statistics)
- Vuong test
- Nested model