4  Logistic Regression

Logistic regression is used to model binary outcomes, that is, cases where the response variable takes only two possible values. Consider, for example, modeling whether or not a patient responds to treatment. The response is discrete and takes only the values 0 (no) and 1 (yes). Linear regression (see Chapter 5) is therefore not appropriate: it can produce predicted values outside the \([0, 1]\) interval and does not respect the binary nature of the outcome.

Based on the Bernoulli distribution (see Section A.1), which describes binary outcomes, logistic regression models the conditional probability of the response variable \(Y\) being 1 given the covariates as \[ \begin{aligned} p = \mathbb{P}\left(Y = 1 \mid X\right) &= \frac{\operatorname{exp}\left(X \cdot \alpha\right)}{1 + \operatorname{exp}\left(X \cdot \alpha\right)} = \frac{1}{1 + \operatorname{exp}\left(-X \cdot \alpha\right)}, \\ \operatorname{log}\left(\frac{p}{1 - p}\right) &= X \cdot \alpha, \end{aligned} \] where \(X\) is the fixed effects design matrix, \(\alpha\) is the vector of fixed effects coefficients, and \(\frac{p}{1 - p}\) is called the odds. The logit link function \(\operatorname{log}\left(\frac{p}{1 - p}\right)\) maps probabilities onto the real line, and its inverse, the logistic function, ensures that the predicted probabilities remain between 0 and 1. The log-odds is linear in the covariates, which allows for straightforward interpretation: a one-unit increase in \(x_j\), with the other covariates held fixed, changes the log-odds by \(\alpha_j\), or equivalently, multiplies the odds by \(\operatorname{exp}\left(\alpha_j\right)\).
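As a minimal numerical sketch of this relationship, the following Python code checks that the log-odds recover the linear predictor and that a one-unit increase in a covariate multiplies the odds by \(\operatorname{exp}\left(\alpha_j\right)\). The helper names `inv_logit`, `logit`, `prob` and `odds`, as well as the coefficient values, are illustrative assumptions and do not come from a fitted model.

```python
import math

def inv_logit(eta):
    """Map a linear predictor eta = X . alpha to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))

def logit(p):
    """Log-odds of a probability p."""
    return math.log(p / (1.0 - p))

def odds(p):
    """Odds p / (1 - p) of a probability p."""
    return p / (1.0 - p)

# Hypothetical coefficients: an intercept and one slope (assumed values).
alpha = [-1.0, 0.5]

def prob(x):
    """P(Y = 1 | x) for a single covariate value x."""
    return inv_logit(alpha[0] + alpha[1] * x)

p0 = prob(2.0)  # linear predictor is -1.0 + 0.5 * 2.0 = 0.0
p1 = prob(3.0)  # covariate increased by one unit

# The ratio of the odds should equal exp(alpha_j) for the slope alpha_j.
odds_ratio = odds(p1) / odds(p0)
print(p0, p1, odds_ratio, math.exp(alpha[1]))
```

The odds ratio does not depend on the starting value of the covariate, whereas the change in probability does, which is why coefficients are usually reported on the odds scale.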

Shortcomings

Logistic regression, like other statistical models, has limitations. Firstly, the assumption of a linear relationship between the covariates and the log-odds may not hold for all data. When this assumption is violated, the model may provide poor predictions and misleading inference. Secondly, logistic regression does not handle missing data well, and preprocessing techniques might have to be applied to ensure reliable results. Moreover, when too many explanatory variables are included relative to the sample size, logistic regression is prone to overfitting, which reduces how well the fit generalizes to new data. Furthermore, logistic regression assumes independence of observations; when observations are clustered or repeated measurements are taken on the same individuals, mixed-effects logistic regression should be considered instead. Additionally, the presence of separation (see Section 4.1) can lead to infinite coefficient estimates and numerical instability, requiring special handling.

4.1 Separation

Separation occurs when a covariate or combination of covariates perfectly predicts the binary outcome. A common case is a categorical variable with a level in which all outcomes are 0 or all outcomes are 1. In such cases, the maximum likelihood estimation procedure fails to converge to finite values, as the estimated coefficients approach infinity. This is because the algorithm attempts to find a coefficient that makes the predicted probability exactly 0 or 1 for those observations, which requires an infinite coefficient value.
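A simple screening step for the categorical case is to tabulate the outcomes per level and flag levels in which only one outcome value occurs. The sketch below is one possible way to do this in Python; the function name `separated_levels` and the data are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

def separated_levels(levels, outcomes):
    """Return {level: outcome} for levels whose outcomes are all 0 or all 1."""
    seen = defaultdict(set)
    for level, y in zip(levels, outcomes):
        seen[level].add(y)
    # A level is flagged when exactly one distinct outcome value was observed.
    return {lvl: next(iter(ys)) for lvl, ys in seen.items() if len(ys) == 1}

# Hypothetical data: treatment arm and binary response per patient.
treatment = ["A", "A", "A", "B", "B", "B", "C", "C"]
response = [0, 1, 0, 1, 1, 1, 0, 1]

flagged = separated_levels(treatment, response)
print(flagged)
```

Here every patient in arm `B` responds, so that level is flagged. Note that a level with a single observation is always flagged by this check, and that separation caused by a combination of covariates is not detected by a per-variable tabulation.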

4.1.1 Complete Separation

Complete separation occurs when there exists a linear combination of the covariates that perfectly separates the outcomes. More formally, complete separation exists if there is a vector \(\beta\) such that \(X \cdot \beta > 0\) for all observations where \(Y = 1\) and \(X \cdot \beta < 0\) for all observations where \(Y = 0\). In this case, the likelihood function increases monotonically as the coefficients are scaled up along \(\beta\), so the maximum likelihood estimates do not exist. Complete separation often happens with small sample sizes, rare outcomes, or when all observations within a level of a categorical variable share the same outcome. When complete separation is present, standard logistic regression cannot be used, and alternatives should be considered.
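The monotone likelihood can be seen numerically on a small example. The sketch below assumes a model with a single covariate and no intercept, and uses made-up data in which the sign of the covariate determines the outcome; the function name `log_lik` is an illustrative choice.

```python
import math

def log_lik(b, xs, ys):
    """Bernoulli log-likelihood of a no-intercept logistic model with slope b."""
    total = 0.0
    for x, y in zip(xs, ys):
        # z > 0 means the observation lies on the correct side of the boundary.
        z = (2 * y - 1) * b * x
        # log(sigmoid(z)), written to avoid overflow in exp for either sign of z.
        if z >= 0:
            total += -math.log1p(math.exp(-z))
        else:
            total += z - math.log1p(math.exp(z))
    return total

# Hypothetical, completely separated data: x < 0 gives Y = 0, x > 0 gives Y = 1.
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0, 0, 1, 1]

slopes = [1.0, 5.0, 25.0, 125.0]
values = [log_lik(b, xs, ys) for b in slopes]
for b, v in zip(slopes, values):
    print(b, v)
```

The log-likelihood keeps increasing toward its upper bound of 0 as the slope grows, but never reaches it for any finite slope, which is why no finite maximizer exists.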

4.1.2 Quasi-Complete Separation

Quasi-complete separation occurs when the covariates separate the outcomes except for observations that lie exactly on the separating boundary. Unlike complete separation, at least one observation satisfies \(X \cdot \beta = 0\), so outcomes on the boundary are not determined by the separation rule. Formally, quasi-complete separation exists when the data are not completely separated but there is a non-zero vector \(\beta\) such that \(X \cdot \beta \geq 0\) for all observations where \(Y = 1\) and \(X \cdot \beta \leq 0\) for all observations where \(Y = 0\), with equality for at least one observation. As with complete separation, finite maximum likelihood estimates do not exist in this case: the likelihood keeps increasing as the coefficients grow along the separating direction and becomes nearly flat there, causing numerical instability in the estimation algorithm. Any estimates that are reported are extremely large and have inflated standard errors, making the estimates and inference unreliable. Quasi-complete separation is more common than complete separation and can be difficult to detect, as the algorithm may appear to converge but with suspiciously large coefficient estimates and standard errors.
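The two sets of inequalities can be checked directly for a given candidate vector \(\beta\). The following Python sketch does only that; it does not search for a separating vector, which is a separate problem not addressed here. The function name `classify_direction`, the tolerance `tol`, and the data are assumptions made for illustration.

```python
def classify_direction(X, y, beta, tol=1e-12):
    """Classify how a candidate vector beta separates the outcomes.

    Returns "complete", "quasi-complete", or "none" for this beta only.
    """
    scores = [sum(xj * bj for xj, bj in zip(row, beta)) for row in X]
    # Flip the sign for Y = 0 so that a correct side is always non-negative.
    signed = [s if yi == 1 else -s for s, yi in zip(scores, y)]
    if all(v > tol for v in signed):
        return "complete"
    if all(v >= -tol for v in signed) and any(v > tol for v in signed):
        return "quasi-complete"
    return "none"

# Hypothetical design matrix with an intercept column and one covariate.
X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 3.0], [1.0, 4.0], [1.0, 5.0]]
y = [0, 0, 0, 1, 1, 1]

# Boundary at x = 3: both outcomes occur there, so the separation is not strict.
at_boundary = classify_direction(X, y, [-3.0, 1.0])

# Without the two tied observations, a boundary at x = 3 separates strictly.
X_strict = [row for row in X if row[1] != 3.0]
y_strict = [0, 0, 1, 1]
strict = classify_direction(X_strict, y_strict, [-3.0, 1.0])

# A direction pointing the wrong way separates nothing.
wrong = classify_direction(X, y, [0.0, -1.0])
print(at_boundary, strict, wrong)
```

In the first call the two observations with a covariate value of 3 have different outcomes and sit exactly on the boundary, which is the situation the non-strict inequalities describe.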