Principal Component Regression

Principal Component Regression (PCR) is a regression approach that uses principal component analysis (PCA) to reduce the dimensionality of a predictor set before performing a linear regression. By projecting the original predictors onto a smaller set of uncorrelated components, PCR aims to mitigate problems caused by multicollinearity and overfitting in settings with many correlated predictors. The resulting model tends to be simpler and more stable, especially when the number of predictors is large relative to the number of observations.

PCR is distinct from other regression-with-reduction approaches in that the reduction step is unsupervised: the components are chosen solely from the structure of the predictor data, not from the response variable. The usual workflow is to standardize the data, compute the principal components of the predictors, select a subset of those components, and then regress the outcome on the chosen components. Interpretation focuses on the component space rather than the original predictor space, and the method invites careful thinking about which components actually carry predictive signal. See principal component analysis for the underlying dimension-reduction step, and linear regression for the follow-on modeling stage.

Overview

  • Goal: predict a scalar response y from a potentially high-dimensional predictor matrix X, while controlling overfitting and multicollinearity.
  • Core idea: replace X with a smaller set of orthogonal components T = X V, where V contains the PCA loading vectors (eigenvectors of the predictor covariance matrix), and fit y on a subset of these components; a minimal end-to-end sketch appears after this list.
  • Typical decision: how many principal components to keep (k), often chosen by cross-validation to balance bias and variance.
  • Practical impact: PCR can stabilize coefficient estimates and improve predictive accuracy when predictors are highly correlated or when p is large, but it may sacrifice predictive power if components with large variance do not align well with the response.
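
A minimal end-to-end sketch of this workflow, assuming scikit-learn is available; the synthetic data, the standardization step, and the choice of three components are illustrative rather than prescriptive:

    # Minimal PCR sketch: PCA for dimension reduction, then OLS on the scores.
    # Assumes scikit-learn; the synthetic data and k = 3 are illustrative only.
    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LinearRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(0)
    n, p = 100, 20
    X = rng.normal(size=(n, p))
    X[:, 10:] = X[:, :10] + 0.05 * rng.normal(size=(n, 10))  # induce strong correlation among predictors
    y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=n)

    # Standardize, project onto the first k principal components, then regress.
    pcr = make_pipeline(StandardScaler(), PCA(n_components=3), LinearRegression())
    pcr.fit(X, y)
    print("training R^2:", pcr.score(X, y))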

Methodology

  • Data preparation: center X (and often standardize when predictor scales differ) before PCA. Centering ensures that PCA captures directions of maximum variance rather than mean levels.
  • Compute PCA: decompose the centered predictor matrix X into principal components. The scores T = X V represent projections of the observations onto the eigenvectors V, and the columns of T are orthogonal.
  • Component selection: choose the first k columns of T to form the reduced predictor set. The choice of k is crucial and commonly based on cross-validation or information criteria.
  • Regression on components: fit a linear model y = T_k β_k + ε, where T_k contains the first k PCs. The regression coefficients β_k are obtained via ordinary least squares; a NumPy sketch of the full pipeline follows this list.
  • Mapping back to original space: if desired, recover the regression coefficients in terms of the original predictors as β = V_k β_k (bearing in mind centering/scaling). Predictions are ŷ = X β, with X centered as in the training data.
  • Diagnostics: assess predictive performance on held-out data, examine residuals, and check whether adding more components improves generalization rather than merely fitting noise.
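
The steps above can be followed directly with basic linear algebra. The sketch below assumes NumPy; the function names pcr_fit and pcr_predict, the synthetic data, and the choice k = 2 are hypothetical, not part of any standard library:

    # Step-by-step PCR with NumPy, following the workflow above.
    # X (n x p) and y (n,) are assumed to be training arrays; k = 2 is illustrative.
    import numpy as np

    def pcr_fit(X, y, k):
        # 1. Center the predictors (standardize instead if their scales differ).
        x_mean = X.mean(axis=0)
        Xc = X - x_mean
        # 2. PCA via SVD of the centered matrix: Xc = U S V^T.
        U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
        V_k = Vt[:k].T                # first k principal directions (p x k)
        T_k = Xc @ V_k                # scores on the first k components (n x k)
        # 3. Ordinary least squares of y on the k component scores.
        beta_k, *_ = np.linalg.lstsq(T_k, y, rcond=None)
        # 4. Map the coefficients back to the original predictor space.
        beta = V_k @ beta_k           # p-vector of coefficients on the original predictors
        intercept = y.mean() - x_mean @ beta
        return beta, intercept

    def pcr_predict(X, beta, intercept):
        return X @ beta + intercept

    # Usage with synthetic data (illustrative only).
    rng = np.random.default_rng(1)
    X = rng.normal(size=(50, 6))
    y = X @ rng.normal(size=6) + rng.normal(scale=0.1, size=50)
    beta, intercept = pcr_fit(X, y, k=2)
    print(pcr_predict(X, beta, intercept)[:5])

Because the columns of T_k are orthogonal and mean-centered, the least-squares step amounts to k independent univariate regressions of y on each score vector, which is what lends PCR its numerical stability.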

Mathematical foundations

  • Let X be an n×p predictor matrix (n observations, p predictors) and y an n×1 response vector. After centering (and possibly scaling), perform a singular value decomposition X = U Σ V^T, where the columns of V are the principal directions and the columns of U Σ are the corresponding scores.
  • The principal components (scores) are T = X V, and T has orthogonal columns. Selecting the first k columns forms T_k.
  • The PCR model is y ≈ T_k β_k, with β_k = (T_k^T T_k)^{-1} T_k^T y when T_k has full rank; these identities are checked numerically in the sketch after this list.
  • If one wishes to express the model in terms of the original predictors, the coefficients on X can be written as β = V_k β_k (subject to centering/scaling adjustments). Predictions follow ŷ = X β.
  • Scale matters: standardization affects the PCA directions and, consequently, which components are selected. See Standardization (statistics) for related concepts and mean centering for data preprocessing steps.
  • PCR inherently addresses variance inflation among correlated predictors by working in an orthogonal basis, but it does not guarantee the smallest prediction error in all circumstances; the unsupervised nature means some predictive information can be discarded if it resides in components with small variance. For context, compare with partial least squares, which uses the response to guide component extraction.
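
A brief numerical check of these identities, assuming NumPy; the synthetic X and y are arbitrary, and the check confirms that the scores T = X V have orthogonal columns and that the closed-form β_k agrees with a generic least-squares solver:

    # Numerical check of the PCR identities above (NumPy; synthetic data is illustrative).
    import numpy as np

    rng = np.random.default_rng(2)
    n, p, k = 40, 5, 3
    X = rng.normal(size=(n, p))
    X -= X.mean(axis=0)                     # center the predictors
    y = rng.normal(size=n)

    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    T = X @ Vt.T                            # scores; equal to U @ np.diag(S)

    # Columns of T are orthogonal: T^T T is diagonal (the squared singular values).
    print(np.allclose(T.T @ T, np.diag(S**2)))

    # Closed-form OLS on the first k components vs. a generic solver.
    T_k, V_k = T[:, :k], Vt[:k].T
    beta_k = np.linalg.solve(T_k.T @ T_k, T_k.T @ y)
    beta_k_lstsq, *_ = np.linalg.lstsq(T_k, y, rcond=None)
    print(np.allclose(beta_k, beta_k_lstsq))

    # Coefficients in the original predictor space give the same fitted values.
    beta = V_k @ beta_k
    print(np.allclose(X @ beta, T_k @ beta_k))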

Practical considerations

  • When to use PCR: appropriate when the predictor set is large and multicollinearity is present, and when one prefers a simpler, more stable model with interpretable structure in the component space.
  • Component selection: the number of components to keep is a central tuning parameter. Too few components can underfit; too many can reintroduce overfitting. Cross-validation is a common, practical method for selecting k; a sketch appears after this list.
  • Interpretability: PCR shifts interpretation away from original predictors toward the principal components. If interpretation in terms of the original features is important, additional steps or alternative methods may be preferable.
  • Comparisons and alternatives: in many cases, Partial Least Squares (PLS) or regularized regressions (e.g., ridge, lasso, elastic net) offer superior predictive performance; PLS incorporates information from the response when forming its components, while shrinkage methods avoid an explicit reduction step altogether. See partial least squares and ridge regression for related approaches.
  • Extensions to nonlinear settings: kernel PCR and other nonlinear adaptations extend the idea beyond linear relationships, but they introduce additional choices and potential overfitting risks. See kernel methods and nonlinear regression for broader context.
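
A sketch of choosing k by cross-validation, assuming scikit-learn; the candidate grid of 1 to 10 components and the synthetic data are illustrative assumptions:

    # Choosing the number of components k by 5-fold cross-validation (scikit-learn; illustrative data).
    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import GridSearchCV
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(3)
    n, p = 120, 15
    X = rng.normal(size=(n, p))
    y = X[:, :3].sum(axis=1) + rng.normal(scale=0.5, size=n)

    pipe = Pipeline([
        ("scale", StandardScaler()),
        ("pca", PCA()),
        ("ols", LinearRegression()),
    ])
    # Score each candidate k by cross-validated R^2 and keep the best one.
    search = GridSearchCV(pipe, {"pca__n_components": list(range(1, 11))}, cv=5)
    search.fit(X, y)
    print("best k:", search.best_params_["pca__n_components"])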

Extensions and related methods

  • Sparse PCR: adds sparsity constraints to the component loadings, aiming for more interpretable models by zeroing out small contributions.
  • Kernel PCR: applies PCR in a transformed feature space induced by a kernel, enabling nonlinear relationships to be captured within the PCR framework (sketched after this list).
  • PCR with regularization: combines the PCR pipeline with regularized regression in the second stage to stabilize estimates further.
  • Partial Least Squares (PLS): a closely related method that, unlike PCR, selects components by considering their ability to predict the response, often yielding better predictive performance in practice. See partial least squares for a direct comparison.
  • Other regression techniques for high-dimensional data: ridge regression, LASSO, and elastic net address multicollinearity and overfitting with different bias-variance trade-offs and sparsity properties. See ridge regression, Lasso and elastic net for details.
  • Dimension reduction and model selection: PCR sits within broader themes of dimension reduction and regularized modeling. See dimension reduction for a survey of related ideas.
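
As a sketch of the kernel variant, assuming scikit-learn's KernelPCA; the RBF kernel, its bandwidth, the component count, and the synthetic data below are illustrative choices, not recommended settings:

    # Kernel PCR sketch: nonlinear feature extraction with KernelPCA, then OLS on the scores.
    # Assumes scikit-learn; the RBF kernel, gamma, and n_components = 5 are illustrative.
    import numpy as np
    from sklearn.decomposition import KernelPCA
    from sklearn.linear_model import LinearRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(4)
    X = rng.uniform(-2, 2, size=(150, 4))
    y = np.sin(X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.1, size=150)

    kpcr = make_pipeline(
        StandardScaler(),
        KernelPCA(n_components=5, kernel="rbf", gamma=0.5),
        LinearRegression(),
    )
    kpcr.fit(X, y)
    print("training R^2:", kpcr.score(X, y))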

See also