Ordinary Least Squares
Ordinary least squares (OLS) is a foundational tool in statistics and econometrics for estimating the parameters of a linear relationship between a dependent variable and one or more regressors. The core idea is to choose the parameter vector β that minimizes the sum of squared deviations between observed outcomes and those predicted by the linear model y = Xβ + ε. When X contains a constant column, this becomes the classic intercept-plus-slopes specification used in many fields, from economics to psychology. The estimator has the closed-form solution β̂ = (X′X)⁻¹X′y, provided X has full column rank, and it forms the backbone of many empirical analyses in business, public policy, and science. See linear regression and least squares.
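A minimal NumPy sketch of the closed-form estimator on simulated data; the sample size, coefficient values, and error scale below are illustrative assumptions, not taken from any real dataset:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated design: a constant column plus two regressors.
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
beta_true = np.array([1.0, 2.0, -0.5])   # illustrative "true" parameters
y = X @ beta_true + rng.normal(size=n)   # y = X beta + eps

# Closed form: beta_hat = (X'X)^{-1} X'y. Solving the normal equations
# is numerically preferable to forming the inverse explicitly.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)  # close to beta_true at this sample size
```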
Under standard assumptions about the error term ε and the design matrix X, OLS enjoys a number of attractive statistical properties. If the conditional mean of the error term given the regressors is zero (E[ε|X] = 0) and the errors are homoskedastic and uncorrelated across observations, the OLS estimator is unbiased and, among all linear unbiased estimators, has the smallest variance, a result formalized by the Gauss–Markov theorem. In large samples, the distribution of β̂ becomes approximately normal under mild regularity conditions, which underpins conventional inference using confidence intervals and hypothesis tests. See the Gauss–Markov theorem, the normal distribution, and the central limit theorem.
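A Monte Carlo sketch of these properties, with arbitrary illustrative choices of sample size, replication count, and a deliberately non-normal (uniform) error distribution; the slope estimates center on the true value, and their sampling distribution is approximately normal, as the large-sample theory suggests:

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps = 200, 5000
slope_hats = np.empty(reps)
for r in range(reps):
    x = rng.normal(size=n)
    X = np.column_stack([np.ones(n), x])
    eps = rng.uniform(-1.0, 1.0, size=n)   # mean-zero but non-normal errors
    y = X @ np.array([1.0, 2.0]) + eps
    slope_hats[r] = np.linalg.lstsq(X, y, rcond=None)[0][1]

# Mean near 2.0 (unbiasedness); a histogram of slope_hats looks
# approximately normal, consistent with the central limit theorem.
print(slope_hats.mean(), slope_hats.std())
```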
The practical usefulness of OLS rests on both estimation and inference. The standard error of β̂, typically derived under the homoskedasticity assumption, measures precision and drives t-tests and p-values. When the assumption of constant variance fails (a condition known as heteroskedasticity), the conventional standard error estimates are inconsistent and the resulting tests and p-values are unreliable. Robust alternatives, such as heteroskedasticity-robust covariance estimators, provide valid inference without altering the point estimates. This adaptability makes OLS a flexible starting point in empirical work, even when ideal assumptions do not hold perfectly. See robust standard errors and White's estimator.
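A sketch of a heteroskedasticity-robust "sandwich" covariance estimator in the spirit of White's estimator; the HC1 degrees-of-freedom scaling used here is one common convention among several:

```python
import numpy as np

def ols_with_hc1(X, y):
    """OLS point estimates with HC1 heteroskedasticity-robust standard errors.

    Sandwich form: (X'X)^{-1} [X' diag(e_i^2) X] (X'X)^{-1},
    scaled by n/(n-k) as a small-sample correction.
    """
    n, k = X.shape
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    resid = y - X @ beta_hat
    bread = np.linalg.inv(X.T @ X)
    meat = (X * (resid ** 2)[:, None]).T @ X   # X' diag(e^2) X
    cov = (n / (n - k)) * bread @ meat @ bread
    return beta_hat, np.sqrt(np.diag(cov))
```

Note that only the standard errors change; the point estimates are the same as ordinary OLS, matching the description above.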
The relevance of OLS extends beyond a single model; it underpins broader estimation frameworks and extensions. Generalized least squares (GLS) and feasible GLS (FGLS) adapt OLS to situations with correlated or non-constant error structures, while weighted least squares (WLS) handles known heteroskedasticity by weighting observations differently. For researchers addressing endogeneity, where regressors are correlated with the error term, instrumental variables (IV) and two-stage least squares (2SLS) provide alternatives that aim to recover causal effects under specific conditions. Related concepts include multicollinearity, which affects the precision of estimates, and model specification concerns that influence the reliability of inferences drawn from OLS estimates. See generalized least squares, weighted least squares, instrumental variables, two-stage least squares, endogeneity, and multicollinearity.
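A textbook 2SLS sketch under the usual identification assumptions (Z contains the exogenous columns of X, including any constant, plus at least one excluded instrument per endogenous regressor); this illustrates the estimator's algebra only and is no substitute for assessing instrument validity:

```python
import numpy as np

def two_stage_least_squares(y, X, Z):
    """2SLS: beta = (X' P_Z X)^{-1} X' P_Z y, with P_Z = Z (Z'Z)^{-1} Z'.

    Stage 1 projects the regressors onto the instruments; stage 2 runs
    OLS of y on those fitted values.
    """
    X_hat = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)   # first-stage fitted values
    return np.linalg.solve(X_hat.T @ X, X_hat.T @ y)
```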
Assumptions and limitations
- Linearity: The model posits a linear relationship between the dependent variable and the regressors. When the true relationship is nonlinear, OLS can mislead, though transformations or nonlinear extensions can help. See linear regression for broader context.
- Exogeneity: E[ε|X] = 0 ensures that the regressors are uncorrelated with the error term. Violations lead to biased and inconsistent estimates, a central concern in observational research. See endogeneity.
- No perfect multicollinearity: The columns of X must be linearly independent to identify β. When this fails, estimation becomes unstable, and researchers may drop or combine regressors; a diagnostic sketch follows this list. See multicollinearity.
- Homoskedasticity and no autocorrelation: Constant error variance and errors that are uncorrelated across observations justify the textbook standard-error formulas. Violations require robust inference or model adjustments; see heteroskedasticity and autocorrelation.
- Normality for inference: Normality of errors is not required for unbiasedness or consistency, but it simplifies exact inference in small samples. In large samples, the central limit theorem often provides the needed approximation for standard errors.
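As noted in the multicollinearity item above, identification requires a full-rank design matrix. A quick diagnostic sketch; the condition-number threshold of 30 is a common rule of thumb, not a hard cutoff:

```python
import numpy as np

def collinearity_check(X, cond_threshold=30.0):
    """Flag perfect and near-perfect multicollinearity in a design matrix.

    Rank deficiency means beta is not identified; a large condition
    number signals near-collinearity that inflates standard errors.
    """
    full_rank = np.linalg.matrix_rank(X) == X.shape[1]
    cond = np.linalg.cond(X)
    return full_rank, cond, cond > cond_threshold
```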
Applications and debates
- Empirical policy analysis and economics rely on OLS for estimating relationships such as the impact of education, experience, or policies on earnings, output, or participation. In many cases, OLS offers transparent interpretation and straightforward communication of results, which contributes to its enduring use in applied work. See econometrics.
- Causality versus correlation remains a central debate. While OLS can estimate associations, attributing causal effects requires careful consideration of endogeneity, omitted variables, measurement error, and model specification. Analysts routinely perform robustness checks, consider alternative specifications, and, when possible, employ methods that address endogeneity (e.g., instrumental variables or natural experiments). See causal inference.
- In some research traditions, a heavy reliance on OLS for policy evaluation is criticized when data are observational and may reflect selection effects or unobserved heterogeneity. Proponents emphasize the virtues of simplicity, transparency, and replicability, while skeptics call for additional tools and data to bolster causal claims. See policy evaluation.
- Extensions and alternatives enrich the toolbox. Regularization techniques like ridge regression and the lasso address issues of multicollinearity and model complexity, while GLS and WLS broaden applicability to non-ideal error structures. These methods are often discussed in relation to OLS as complementary approaches rather than replacements in all contexts; a minimal ridge sketch follows this list. See ridge regression, lasso, and generalized least squares.
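A minimal ridge regression sketch; as a simplification it penalizes every coefficient, including any intercept column, whereas in practice the intercept is usually left unpenalized:

```python
import numpy as np

def ridge(X, y, lam):
    """Ridge regression: beta = (X'X + lam I)^{-1} X'y.

    lam >= 0 shrinks coefficients toward zero, stabilizing estimates
    when X'X is ill-conditioned; lam = 0 recovers OLS.
    """
    k = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ y)
```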
See also
- linear regression
- least squares
- Gauss–Markov theorem
- normal distribution
- central limit theorem
- robust standard errors
- White's estimator
- Generalized least squares
- Weighted least squares
- Instrumental variables
- Two-stage least squares
- Endogeneity
- multicollinearity
- heteroskedasticity
- autocorrelation
- econometrics