Cross Entropy Loss

Cross-entropy loss is a cornerstone objective in modern supervised learning, especially for classification tasks. It quantifies how far a model’s predicted probability distribution over classes is from the true distribution encoded in the labels. In practice, minimizing cross-entropy pushes the model to assign high probability to the correct class and low probability to the others, which aligns with the idea of learning from data in a probabilistic sense.

The concept sits at the intersection of probability theory and information theory. It is closely tied to the ideas of entropy and divergence: cross-entropy between the true distribution p and the model’s predicted distribution q can be viewed as the sum of the true entropy H(p) and the Kullback–Leibler divergence D_KL(p || q). This relationship explains why cross-entropy is a natural objective for maximum likelihood estimation and probabilistic modeling. For a broader mathematical grounding, see Entropy and Kullback–Leibler divergence as well as Probability.
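
Spelled out in the notation of this paragraph, the decomposition reads (a standard identity, shown here only for reference):

```latex
H(p, q) = -\sum_i p_i \log q_i
        = \underbrace{-\sum_i p_i \log p_i}_{H(p)}
        + \underbrace{\sum_i p_i \log \frac{p_i}{q_i}}_{D_{\mathrm{KL}}(p \,\|\, q)}
```

Because H(p) does not depend on the model, minimizing the cross-entropy over q is equivalent to minimizing the KL divergence from p to q.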

In practice, cross-entropy comes in two common flavors:

  • Binary cross-entropy (log loss) for binary classification, where the model outputs a probability p that the positive class is present. The loss for a single example is L = - [ y log p + (1 - y) log (1 - p) ], with y in {0, 1}.
  • Categorical cross-entropy for multi-class classification, where the model outputs a probability distribution p over classes (often via a Softmax layer). With a one-hot encoded target y, the loss is L = - sum_i y_i log p_i.

These forms tie directly to the underlying likelihoods: minimizing cross-entropy is equivalent to maximizing the likelihood of the observed labels under the model’s predicted distribution. For a dataset, the total loss is the sum (or mean) of per-example losses. See Log-likelihood for related concepts.
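
As a concrete illustration of the two forms above, the following is a minimal NumPy sketch; the function names and the small clipping constant eps are choices made for this example, not taken from any particular library:

```python
import numpy as np

def binary_cross_entropy(y, p, eps=1e-12):
    """Per-example binary cross-entropy: -[y log p + (1 - y) log(1 - p)]."""
    p = np.clip(p, eps, 1 - eps)          # avoid log(0)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def categorical_cross_entropy(y_onehot, p, eps=1e-12):
    """Per-example categorical cross-entropy: -sum_i y_i log p_i."""
    p = np.clip(p, eps, 1.0)
    return -np.sum(y_onehot * np.log(p), axis=-1)

# One positive binary example predicted at 0.9, and one 3-class example
# whose true class (index 2) is predicted with probability 0.7.
print(binary_cross_entropy(1, 0.9))                          # ~0.105
print(categorical_cross_entropy(np.array([0, 0, 1]),
                                np.array([0.2, 0.1, 0.7])))  # ~0.357
```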

Mathematical foundations

  • Definitions

    • Binary case: L = - [ y log p + (1 - y) log (1 - p) ], where p = σ(z) is the model’s predicted probability derived from a logit z.
    • Multi-class case: L = - sum_i y_i log p_i, with p_i = softmax(z)_i and y_i ∈ {0, 1}, sum_i y_i = 1.
    • Cross-entropy between distributions p and q: H(p, q) = - sum_i p_i log q_i.
  • Relationship to maximum likelihood

    • For labeled data (x, y), minimize L = - log p(y | x; θ). Across a dataset, this is the same as minimizing the cross-entropy between the empirical distribution and the model’s predicted distribution.
  • Gradients and optimization

    • With a softmax output, the gradient of the loss with respect to the logit z_i is p_i - y_i. With a sigmoid in the binary case, the gradient with respect to z is p - y.
    • These clean gradients are one reason cross-entropy works well with gradient-based optimizers like Stochastic gradient descent and its variants.
    • For numerical stability, practitioners often use the log-sum-exp trick and work directly with logits rather than probabilities in computation; a short sketch follows this list. See Numerical stability and LogSumExp.
  • Relationship to calibration and entropy

    • Cross-entropy measures predictive accuracy in a probabilistic sense, but it is not the only measure of a model’s reliability. Calibration (how well predicted probabilities reflect true frequencies) is a separate concern that may require additional techniques such as temperature scaling or isotonic regression. See Calibration (machine learning) for related discussions and Entropy for deeper connections.
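
As noted under gradients and numerical stability, implementations usually work from raw logits via the log-sum-exp trick rather than from already-normalized probabilities. The sketch below is illustrative plain NumPy, not a reference implementation; it also returns the gradient p - y discussed above:

```python
import numpy as np

def softmax_cross_entropy_with_logits(logits, y_onehot):
    """Numerically stable cross-entropy computed from raw logits.

    Uses log p_i = z_i - logsumexp(z), so probabilities are never
    exponentiated and then logged, which avoids overflow/underflow.
    Returns the per-example loss and the gradient dL/dz = p - y.
    """
    z = logits - np.max(logits, axis=-1, keepdims=True)   # shift for stability
    log_sum_exp = np.log(np.sum(np.exp(z), axis=-1, keepdims=True))
    log_probs = z - log_sum_exp                           # log softmax
    loss = -np.sum(y_onehot * log_probs, axis=-1)
    grad = np.exp(log_probs) - y_onehot                   # p - y
    return loss, grad

# Example: 3 classes, true class index 1.
logits = np.array([2.0, 1.0, 0.1])
y = np.array([0.0, 1.0, 0.0])
loss, grad = softmax_cross_entropy_with_logits(logits, y)
print(loss)   # per-example loss
print(grad)   # sums to zero; negative only at the true class
```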

Practical considerations in machine learning

  • Binary vs. multiclass settings

    • For binary problems, you can use the binary form directly with a single output probability. For multiclass problems, you typically use a vector of probabilities produced by a Softmax layer or equivalent, and compute the sum over classes as in the multiclass formula. See Softmax for how the probabilities are produced, and Classification for the broader task.
  • Label encoding and targets

    • Targets are commonly one-hot encoded for multiclass problems, though other encodings are possible (e.g., label smoothing to avoid overconfident predictions). See One-hot encoding and Label smoothing for related ideas.
  • Class imbalance and weighting

    • When classes are imbalanced, raw cross-entropy can bias the model toward the majority class. Practitioners address this with class weights, resampling, or alternative losses such as focal loss; a sketch combining class weights with label smoothing follows this list. See Focal loss and Class imbalance for related approaches.
  • Regularization and generalization

    • Cross-entropy loss by itself is not a generalization guarantee. It is typically combined with regularization techniques (L1/L2, dropout, data augmentation) and with architectural choices that help avoid overfitting. See Regularization (machine learning) and Dropout.
  • Alternatives and complements

    • Other losses exist that can be preferable in specific contexts, such as hinge loss in some margin-based classifiers or mean squared error in regression-like settings. However, cross-entropy remains the standard for probabilistic classification because of its statistical interpretation and empirical performance. See Hinge loss and Mean squared error.
  • Practical pitfalls

    • Overconfidence: models trained with unregularized cross-entropy can become overconfident, assigning probabilities close to 0 or 1 even on genuinely ambiguous inputs.
    • Label noise: mislabeled data can disproportionately affect cross-entropy due to its log-probability nature.
    • Calibration vs. accuracy: optimizing for accuracy with cross-entropy does not guarantee well-calibrated probabilities, which can be important in decision-making systems.
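
To make the class-weighting and label-smoothing points above concrete, here is a small illustrative NumPy sketch; the particular smoothing scheme (spreading the smoothing mass uniformly over the other classes) and the weighting-by-true-class rule are assumptions of this example rather than a fixed convention:

```python
import numpy as np

def weighted_smoothed_cross_entropy(log_probs, targets, class_weights,
                                    smoothing=0.1):
    """Cross-entropy with per-class weights and label smoothing.

    log_probs:     (batch, num_classes) log-probabilities from the model
    targets:       (batch,) integer class indices
    class_weights: (num_classes,) larger values up-weight rarer classes
    smoothing:     probability mass spread uniformly over the other classes
    """
    n, k = log_probs.shape
    # Smoothed targets: 1 - smoothing on the true class, rest spread uniformly.
    y = np.full((n, k), smoothing / (k - 1))
    y[np.arange(n), targets] = 1.0 - smoothing
    per_example = -np.sum(y * log_probs, axis=-1)
    # Weight each example by its true class's weight, then average.
    w = class_weights[targets]
    return np.sum(w * per_example) / np.sum(w)

# Example: 3 classes, where the rare class 2 gets a larger weight.
rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 3))
log_probs = logits - np.log(np.sum(np.exp(logits), axis=-1, keepdims=True))
print(weighted_smoothed_cross_entropy(log_probs, np.array([0, 1, 2, 0]),
                                      np.array([1.0, 1.0, 5.0])))
```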

History and development

Cross-entropy’s roots lie in information theory and statistical inference. It formalizes the idea of measuring how one probability distribution diverges from another, and its use as a training objective aligns learning with the probabilistic interpretation of data. Early connections between cross-entropy, log-likelihood, and maximum likelihood estimation were clarified as probabilistic models became central in machine learning. The practical implementation with softmax outputs in neural networks popularized cross-entropy as a default loss for large-scale classification tasks, while the broader family of entropy-based measures continues to influence both theory and practice. See Maximum likelihood and Noise contrastive estimation for related historical threads.

See also