Log Marginal Likelihood

Log marginal likelihood, often referred to as the evidence, is a central quantity in Bayesian statistics used for comparing competing models based on observed data. It represents the probability of the data under a given model after integrating out the model’s parameters with respect to their prior distribution. In symbols, for a model M with parameter vector θ and data D, the log marginal likelihood is log p(D|M) = log ∫ p(D|θ,M) p(θ|M) dθ. This quantity serves as the normalizing constant (denominator) in Bayes’ theorem and as the building block of Bayes factors, which contrast how well different models explain the same data when prior knowledge is taken into account.

From a practical standpoint, the log marginal likelihood embodies a principled balance between model fit and model complexity. The fit term p(D|θ,M) rewards models that explain the data well, while the prior p(θ|M) and the integration over θ impose a built-in penalty on unnecessary complexity. The resulting evidence tends to favor models that achieve good predictive performance without overfitting, a feature aligned with a cautious, efficiency-minded approach to empirical reasoning.

Definition and intuition

In Bayesian model comparison, one writes the marginal likelihood as a marginalized version of the likelihood, integrating the likelihood against the prior so as to average over parameter uncertainty:

  • p(D|M) = ∫ p(D|θ,M) p(θ|M) dθ.

The log of this quantity, log p(D|M), makes the scale more manageable and helps when comparing models via Bayes factors, B12 = p(D|M1) / p(D|M2). A higher log marginal likelihood indicates that a model assigns substantial probability to the observed data while operating with a reasonable prior over its parameters. Because the integration averages over θ, models that are overly parameterized or that require fine-tuning to fit the data tend to be penalized relative to simpler, yet adequate, specifications.
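
As a concrete illustration, consider the Beta-Binomial model, one of the few cases where the integral has a closed form: with a Beta(a, b) prior on the success probability θ and k successes observed in n trials, p(D|M) = C(n,k) B(a+k, b+n−k) / B(a,b). The Python sketch below (the data and prior values are hypothetical, chosen only for illustration) evaluates the log evidence under two priors and the resulting log Bayes factor.

    from scipy.special import betaln, gammaln

    def log_marginal_binomial(k, n, a, b):
        """Exact log marginal likelihood of k successes in n trials
        under a Binomial likelihood with a Beta(a, b) prior."""
        log_binom = gammaln(n + 1) - gammaln(k + 1) - gammaln(n - k + 1)
        return log_binom + betaln(a + k, b + n - k) - betaln(a, b)

    # Hypothetical data: 36 successes in 50 trials.
    k, n = 36, 50

    # Model 1: uniform Beta(1, 1) prior; Model 2: prior concentrated near 0.5.
    log_z1 = log_marginal_binomial(k, n, a=1.0, b=1.0)
    log_z2 = log_marginal_binomial(k, n, a=20.0, b=20.0)

    print(f"log p(D|M1) = {log_z1:.3f}")
    print(f"log p(D|M2) = {log_z2:.3f}")
    # log Bayes factor in favor of M1: log B12 = log p(D|M1) - log p(D|M2).
    print(f"log B12     = {log_z1 - log_z2:.3f}")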

This perspective sits well with an emphasis on robust, transparent decision-making: it requires specifying a prior and then evaluating how well the model, with that prior, can produce the observed data. The approach also delivers a coherent notion of evidence that is compatible with decision theory and predictive assessment, since the same quantity underlies the predictive distribution when one integrates θ out.

In many common settings, the log marginal likelihood naturally embodies Occam’s razor: a simple model that concentrates its prior predictive probability on datasets like the one observed tends to achieve a larger marginal likelihood than a flexible model that spreads its probability over many possible datasets, only a narrow slice of which resemble the data.

Computation and methods

Exact calculation of the log marginal likelihood is feasible only for a narrow class of models or priors (for example, conjugate priors with analytic solutions). In most practical problems, one relies on approximation or numerical integration methods:

  • Conjugate priors and analytic solutions: For some models with conjugate priors, the integral can be computed in closed form, yielding a direct evaluation of log p(D|M), as in the Beta-Binomial example above. See conjugate prior and related likelihood-prior conjugacy concepts.

  • Laplace approximation: Around the posterior mode, one approximates the integrand with a Gaussian, leading to an approximation for log p(D|M). This method is fast and often accurate when the posterior is unimodal and roughly normal; a minimal sketch appears after this list.

  • Sampling-based methods:

    • Monte Carlo integration and importance sampling: Draw samples from a proposal distribution to approximate the integral; in the simplest case the prior itself serves as the proposal (see the sketch after this list).
    • Bridge sampling and thermodynamic integration: These advanced Monte Carlo techniques are designed to improve stability and accuracy when the posterior is difficult to sample from directly; a power-posterior sketch of thermodynamic integration appears at the end of this section.
    • Nested sampling: A specialized method that simultaneously estimates the marginal likelihood and explores the posterior, often used in high-dimensional or multimodal problems; a toy implementation appears after this list.

  • Variational approaches and approximations: While variational inference primarily targets the posterior, certain variational schemes can provide lower bounds on the log marginal likelihood that can be informative for model comparison.
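
To make the Laplace route concrete, the sketch below applies it to the same hypothetical Beta-Binomial setup used earlier (where the exact answer is available for comparison), approximating log p(D|M) by the log joint density at the posterior mode plus a Gaussian curvature correction.

    import numpy as np
    from scipy.special import betaln, gammaln

    k, n = 36, 50        # same hypothetical data as above
    a, b = 1.0, 1.0      # Beta(1, 1) prior

    LOG_BINOM = gammaln(n + 1) - gammaln(k + 1) - gammaln(n - k + 1)

    def log_joint(theta):
        # log p(D|theta) + log p(theta) for the Beta-Binomial model.
        return (LOG_BINOM
                + (k + a - 1) * np.log(theta)
                + (n - k + b - 1) * np.log1p(-theta)
                - betaln(a, b))

    # Posterior mode (valid for a, b >= 1) and curvature of the log joint there.
    mode = (k + a - 1) / (n + a + b - 2)
    curvature = -(k + a - 1) / mode**2 - (n - k + b - 1) / (1 - mode)**2

    # One-dimensional Laplace approximation:
    # log Z ~= log_joint(mode) + (1/2) log(2*pi) - (1/2) log(-curvature).
    log_z_laplace = (log_joint(mode)
                     + 0.5 * np.log(2 * np.pi)
                     - 0.5 * np.log(-curvature))

    log_z_exact = LOG_BINOM + betaln(a + k, b + n - k) - betaln(a, b)
    print(f"Laplace: {log_z_laplace:.3f}   exact: {log_z_exact:.3f}")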
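
The simplest sampling-based estimator, mentioned in the list above, treats the prior itself as the proposal distribution: draw θ from p(θ|M) and average the likelihood, here with a log-sum-exp reduction for numerical stability. Replacing the prior with a distribution closer to the posterior turns this into importance sampling proper.

    import numpy as np
    from scipy.special import gammaln, logsumexp

    rng = np.random.default_rng(0)
    k, n = 36, 50
    a, b = 1.0, 1.0
    LOG_BINOM = gammaln(n + 1) - gammaln(k + 1) - gammaln(n - k + 1)

    def log_like(theta):
        # Binomial log likelihood of the hypothetical data given theta.
        return LOG_BINOM + k * np.log(theta) + (n - k) * np.log1p(-theta)

    # Z = E_prior[p(D|theta)], estimated by averaging over prior draws.
    samples = rng.beta(a, b, size=100_000)
    log_z_mc = logsumexp(log_like(samples)) - np.log(samples.size)
    print(f"Simple Monte Carlo estimate: {log_z_mc:.3f}")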
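
Nested sampling can also be demonstrated on this toy model. The minimal implementation below replaces the worst live point by rejection sampling from the prior, which is workable only in low dimensions; production implementations use more sophisticated constrained-sampling schemes.

    import numpy as np
    from scipy.special import gammaln, logsumexp

    rng = np.random.default_rng(0)
    k, n = 36, 50
    a, b = 1.0, 1.0
    LOG_BINOM = gammaln(n + 1) - gammaln(k + 1) - gammaln(n - k + 1)

    def log_like(theta):
        return LOG_BINOM + k * np.log(theta) + (n - k) * np.log1p(-theta)

    n_live = 400
    live = rng.beta(a, b, size=n_live)   # live points drawn from the prior
    live_ll = log_like(live)
    log_z, log_x_prev = -np.inf, 0.0     # running evidence; log prior volume

    for i in range(1, 2001):
        worst = int(np.argmin(live_ll))
        log_x = -i / n_live                                         # expected shrinkage
        log_w = log_x_prev + np.log1p(-np.exp(log_x - log_x_prev))  # shell width
        log_z = np.logaddexp(log_z, live_ll[worst] + log_w)
        log_x_prev = log_x
        # Replace the worst point with a prior draw above the likelihood threshold.
        while True:
            theta = rng.beta(a, b)
            if log_like(theta) > live_ll[worst]:
                live[worst], live_ll[worst] = theta, log_like(theta)
                break

    # Fold in the contribution of the remaining live points.
    log_z = np.logaddexp(log_z, logsumexp(live_ll) + log_x_prev - np.log(n_live))
    print(f"Nested sampling estimate: {log_z:.3f}")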

Each method has trade-offs in terms of bias, variance, and computational cost. In practice, practitioners weigh the dimensionality of θ, the form of p(D|θ,M), and the availability of samplers or analytic results when choosing a method. See Monte Carlo method and thermodynamic integration for foundational techniques, and Laplace approximation for a common approximation route.
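
As an illustration of thermodynamic integration, the sketch below exploits a convenience specific to the conjugate toy model above: the power posterior p_t(θ) ∝ p(D|θ)^t p(θ|M) remains a Beta distribution, so direct draws can stand in for the per-temperature MCMC runs needed in general. The identity log p(D|M) = ∫₀¹ E_t[log p(D|θ,M)] dt is then approximated with the trapezoidal rule over a temperature ladder.

    import numpy as np
    from scipy.special import gammaln

    rng = np.random.default_rng(0)
    k, n = 36, 50
    a, b = 1.0, 1.0
    LOG_BINOM = gammaln(n + 1) - gammaln(k + 1) - gammaln(n - k + 1)

    def log_like(theta):
        return LOG_BINOM + k * np.log(theta) + (n - k) * np.log1p(-theta)

    # Temperature ladder from the prior (t = 0) to the posterior (t = 1).
    temps = np.linspace(0.0, 1.0, 32)
    mean_ll = []
    for t in temps:
        # Power posterior for this model: Beta(a + t*k, b + t*(n - k)).
        draws = rng.beta(a + t * k, b + t * (n - k), size=20_000)
        mean_ll.append(log_like(draws).mean())

    # Trapezoidal rule for log Z = integral of E_t[log p(D|theta)] over t.
    log_z_ti = sum(0.5 * (mean_ll[j] + mean_ll[j - 1]) * (temps[j] - temps[j - 1])
                   for j in range(1, len(temps)))
    print(f"Thermodynamic integration estimate: {log_z_ti:.3f}")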

Practical considerations

  • Prior choice and sensitivity: The log marginal likelihood depends on the prior p(θ|M). Different reasonable priors can lead to sizable changes in the evidence, especially in high-dimensional models or when data are limited. This sensitivity is a feature of explicit, principled reasoning about prior knowledge, but it also means that one should perform sensitivity analyses with respect to priors (a minimal sweep is sketched after this list) or consider robust priors when appropriate. See prior and sensitivity analysis for related discussions.

  • Improper priors: Using improper priors (priors that do not integrate to one) leaves the marginal likelihood defined only up to an arbitrary multiplicative constant, so Bayes factors computed from it are not meaningful. This is a practical constraint when applying log marginal likelihood-based model comparison, and it motivates careful prior construction or alternative criteria.

  • Model misspecification: If all competing models are misspecified, the relative ranking given by the log marginal likelihood may reflect model limitations as much as data fit. In such cases, complementary model-checking and predictive validation (e.g., out-of-sample predictions) remain important. See model misspecification and predictive distribution for related ideas.

  • Dimensionality and computation: As the parameter count grows, the integral becomes more challenging. Approximations like the Laplace method or sampling-based estimators can degrade in high dimensions, which has motivated more robust estimators such as thermodynamic integration and nested sampling.

  • Model comparison vs predictive criteria: Some practitioners supplement or substitute log marginal likelihood with cross-validation-based measures or other predictive criteria. This reflects a pragmatic stance that emphasizes out-of-sample performance and robustness, especially when prior specification is contentious or difficult. See Cross-validation for an overview of such alternatives.
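
A minimal version of the prior sensitivity analysis recommended above, reusing the hypothetical Beta-Binomial setup from earlier, is simply to recompute the exact log evidence over a grid of prior hyperparameters and inspect the spread:

    from scipy.special import betaln, gammaln

    k, n = 36, 50
    log_binom = gammaln(n + 1) - gammaln(k + 1) - gammaln(n - k + 1)

    # Exact log evidence under several plausible Beta(a, b) priors.
    for a, b in [(0.5, 0.5), (1, 1), (2, 2), (5, 5), (20, 20)]:
        log_z = log_binom + betaln(a + k, b + n - k) - betaln(a, b)
        print(f"Beta({a}, {b}) prior: log p(D|M) = {log_z:.3f}")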

Controversies and debates

  • Prior sensitivity and objectivity: A central debate concerns how much faith to place in the log marginal likelihood when priors are subjective or only loosely specified. Critics argue that Bayes factors can be swayed by prior choices, potentially privileging models whose priors align with the data-generating process in subtle ways. Proponents counter that priors are an honest statement of prior knowledge and that transparency about priors is preferable to the hidden assumptions embedded in some alternative criteria. See prior and Bayes factor for broader context.

  • Improper priors and coherence: The use of improper priors can undermine the legitimacy of the marginal likelihood as a comparison tool, since the evidence is then defined only up to an arbitrary constant. The controversy centers on whether model comparison should rely on a single metric at all when priors are not well grounded in measurable information. See Improper prior and Bayesian model comparison.

  • Preference for simplicity vs predictive performance: The Occam-like penalty in the log marginal likelihood favors models that are not needlessly complex, but some criticize this as overly punitive in domains where increased complexity is warranted by genuinely complex phenomena. Proponents argue that a disciplined, evidence-based balance reduces overfitting and yields more reproducible conclusions. See Occam's razor and Model selection.

  • Alternatives in practice: In many scientific and engineering settings, cross-validation, information criteria (such as AIC, BIC), or predictive posterior checks are used alongside or in place of the log marginal likelihood. This pragmatic stance emphasizes empirical performance and robustness over strict adherence to a single Bayesian evidence-based criterion. See Cross-validation and Model selection for related discussion.

See also