Fisher Scoring
Fisher Scoring, also known as the scoring algorithm, is an iterative method for finding maximum likelihood estimates (MLEs) by exploiting the expected curvature of the log-likelihood. It replaces the Hessian (second-derivative matrix) with the Fisher information (the expected negative Hessian) to guide updates of the parameter vector. In practice, this leads to a stable, efficient procedure for a broad class of models, especially generalized linear models when the link and variance structures align with the exponential family.
In many applied settings, Fisher scoring is part of a family of procedures that includes Newton-Raphson and iteratively reweighted least squares schemes. The key distinction is that Fisher scoring uses the expected information rather than the observed information at the current iterate, which can improve stability when data are sparse or when the likelihood surface is irregular. The method is foundational in modern maximum likelihood estimation and underpins several widely used algorithms in applied statistics and econometrics.
History and origins
The idea of improving Newton-type iterations for maximum likelihood estimation by replacing the Hessian with the expected information traces back to early 20th century work on likelihood theory. Ronald A. Fisher introduced and popularized the concepts of likelihood-based inference, score functions, and information measures that underlie the Fisher information. The scoring approach—updating parameters by incorporating the expected curvature of the log-likelihood—became a standard technique in estimation theory and later found particular prominence in the analysis of generalized linear models and related frameworks. The connection to the broader development of likelihood-based methods is emphasized in discussions of Maximum likelihood theory and the role of information in statistical inference.
Theory and algorithm
Let θ denote a K-dimensional parameter vector and ℓ(θ) its log-likelihood given observed data. The score function is U(θ) = ∂ℓ(θ)/∂θ, and the Fisher information matrix is I(θ) = E[−∂²ℓ(θ)/∂θ∂θᵀ], where the expectation is taken with respect to the model at θ. The Fisher scoring update at iteration t is
θ^{(t+1)} = θ^{(t)} + I(θ^{(t)})^{-1} U(θ^{(t)}).
Key points to know about the method:
- It is a specialization of a broader Newton-type framework, where the Hessian is replaced by the expected information. This tends to stabilize updates in many common models. See Newton-Raphson method for a related approach that uses the observed information (the negative Hessian) instead.
- For generalized linear models (GLMs) with exponential-family responses, Fisher scoring is equivalent to Iteratively Reweighted Least Squares: each iteration reduces to solving a weighted least squares problem, with weights determined by the current fitted values and the variance function. With a canonical link, the observed and expected information coincide, so the Fisher scoring step also matches the Newton-Raphson step. This connection is often summarized by the shorthand that Fisher scoring "is IRLS in disguise" for GLMs.
- Convergence properties depend on the curvature of the log-likelihood and the conditioning of the information matrix. Good starting values and well-behaved data help ensure rapid convergence. If I(θ) is ill-conditioned or singular, regularization or alternative optimization strategies may be employed.
For practical use, practitioners write the update in terms of the log-likelihood’s gradient and its expected curvature and implement the iteration with standard numerical linear algebra, typically solving the linear system I(θ^{(t)}) δ = U(θ^{(t)}) for the step δ rather than explicitly inverting the information matrix.
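To make the iteration concrete, the following is a minimal sketch in Python with NumPy. The score and expected_information callables are hypothetical placeholders for model-specific gradient and information computations, and no convergence safeguards are included.

```python
import numpy as np

def fisher_scoring(theta0, score, expected_information, tol=1e-8, max_iter=50):
    """Generic Fisher scoring iteration (illustrative sketch).

    score(theta) should return the gradient U(theta) of the log-likelihood,
    and expected_information(theta) the Fisher information matrix I(theta).
    Both are user-supplied callables for the model at hand.
    """
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        U = score(theta)
        I = expected_information(theta)
        # Solve I(theta) @ step = U(theta) instead of forming the inverse.
        step = np.linalg.solve(I, U)
        theta = theta + step
        if np.max(np.abs(step)) < tol:
            break
    return theta
```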
Relationship to related methods
- Newton-Raphson method: Both Newton-Raphson and Fisher scoring are Newton-type methods for finding MLEs. Newton-Raphson uses the observed information J(θ) = −∂²ℓ(θ)/∂θ∂θᵀ, whereas Fisher scoring uses the Fisher information I(θ) = E[−∂²ℓ(θ)/∂θ∂θᵀ]. See Newton-Raphson method for a broader discussion of convergence and stability.
- Generalized linear models and IRLS: In GLMs with exponential-family distributions, Fisher scoring steps collapse to an Iteratively Reweighted Least Squares update; a short sketch follows this list. This makes the method particularly attractive in applied regression problems. See Generalized linear model for the broader modeling context.
- Observed vs. expected information: The choice between updating with I(θ) (Fisher scoring) or with J(θ) (Newton–Raphson) reflects a trade-off between stability and rapidity of convergence. See Fisher information and Observed information for the two viewpoints.
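As an illustration of the IRLS connection noted above, a minimal sketch for logistic regression (a GLM with the canonical logit link) might look as follows. The variable names are illustrative, and the loop omits the safeguards a production implementation would need.

```python
import numpy as np

def logistic_irls(X, y, tol=1e-8, max_iter=25):
    """Fisher scoring / IRLS for logistic regression (canonical logit link).

    Each iteration solves a weighted least squares problem with weights
    w_i = mu_i * (1 - mu_i) and working response z = X @ beta + (y - mu) / w,
    which is algebraically identical to the Fisher scoring update
    beta + (X^T W X)^{-1} X^T (y - mu).
    """
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        eta = X @ beta                      # linear predictor
        mu = 1.0 / (1.0 + np.exp(-eta))     # fitted probabilities
        w = mu * (1.0 - mu)                 # IRLS weights
        z = eta + (y - mu) / w              # working response
        XtW = X.T * w                       # X^T W via broadcasting
        beta_new = np.linalg.solve(XtW @ X, XtW @ z)
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if converged:
            break
    return beta
```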
Applications
- Logistic regression: A common setting where Fisher scoring is used to estimate coefficients in binary outcome models. The canonical link and the Bernoulli/binomial likelihood lead to convenient expressions for U(θ) and I(θ).
- Poisson regression and other GLMs: For count data and related outcomes, Poisson regression and more general GLMs with canonical links benefit from the structured updates provided by Fisher scoring.
- Econometrics and biostatistics: The method appears in standard textbooks and software implementations for a wide class of likelihood-based models, and it remains a workhorse for fitting GLMs in applied research.
- Software implementations often expose Fisher scoring or IRLS as the core optimization engine for GLMs, sometimes with adaptations to handle convergence safeguards, regularization, or quasi-likelihood forms.
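As one illustration of such software usage, a Poisson regression could be fit in Python with the statsmodels package, whose GLM fitting routine is IRLS-based by default; the simulated data and coefficient values below are purely for demonstration.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = sm.add_constant(rng.normal(size=(200, 2)))       # design matrix with intercept
y = rng.poisson(np.exp(X @ np.array([0.5, 0.3, -0.2])))  # simulated counts

# statsmodels fits GLMs with an IRLS-type algorithm by default.
result = sm.GLM(y, X, family=sm.families.Poisson()).fit()
print(result.params)
```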
See Maximum likelihood theory for the general estimation framework and Generalized linear model for model-specific details and link choices.
Convergence, robustness, and practice
- Convergence behavior: When the likelihood surface is well-behaved and the information matrix is well-conditioned, Fisher scoring tends to converge in a small number of iterations. Poor starting values or near-singular information can slow convergence or cause numerical difficulties, in which case practitioners may switch to alternative optimizers or apply regularization; a minimal sketch of such safeguards follows this list.
- Data sparsity and misspecification: In sparse data situations or when the model is mis-specified, the expected information can produce updates that do not reflect the local curvature of the observed data, potentially impacting finite-sample performance. Robust alternatives or diagnostic checks may be employed in such cases.
- Alternatives and hybrids: Some practitioners prefer quasi-Newton methods (e.g., BFGS) or stochastic gradient approaches for high-dimensional problems or streaming data. In GLMs, IRLS with appropriate safeguards often provides a reliable balance of speed and stability. See Quasi-Newton methods and Stochastic gradient descent for related approaches.
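As a hedged sketch of such safeguards, the basic scoring loop shown earlier can be extended with a small ridge term on the information matrix and a step-halving check; the loglik, score, and expected_information callables are again hypothetical placeholders.

```python
import numpy as np

def fisher_scoring_safeguarded(theta0, loglik, score, expected_information,
                               ridge=1e-8, tol=1e-8, max_iter=50):
    """Fisher scoring with two simple, illustrative safeguards:
    a small ridge term on I(theta) against near-singularity, and
    step halving when a full step does not increase the log-likelihood.
    """
    theta = np.asarray(theta0, dtype=float)
    ll = loglik(theta)
    for _ in range(max_iter):
        U = score(theta)
        I = expected_information(theta) + ridge * np.eye(theta.size)
        step = np.linalg.solve(I, U)
        # Halve the step until the log-likelihood improves (or give up).
        alpha = 1.0
        for _ in range(20):
            candidate = theta + alpha * step
            ll_new = loglik(candidate)
            if ll_new >= ll:
                break
            alpha *= 0.5
        if np.max(np.abs(alpha * step)) < tol:
            theta = candidate
            break
        theta, ll = candidate, ll_new
    return theta
```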