5  Linear Regression

Linear regression models the relationship between a response variable and one or more explanatory variables. Let \(y\) be the response variable(s) and \(X\) the design matrix of the explanatory variables, with regression coefficients \(\alpha\), also called fixed effects. The model is \[ y = X \cdot \alpha + \varepsilon \] where \(X\) is the fixed effects design matrix and \(\varepsilon\) is the vector of zero-mean errors.
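As a minimal sketch of the notation, the snippet below builds a small design matrix \(X\) with `stats::model.matrix()` and simulates a response from it. The data frame `toy`, the coefficient values in `alpha`, and the error standard deviation are hypothetical choices made only for illustration.

```r
# Hypothetical toy data: two explanatory variables, five observations
set.seed(123)
toy <- data.frame(x1 = rnorm(5), x2 = rnorm(5))

# Fixed effects design matrix X (intercept column plus x1 and x2)
X <- stats::model.matrix(~ x1 + x2, data = toy)

# Assumed "true" fixed effects, chosen arbitrarily for this sketch
alpha <- c(1, 0.5, -2)

# Zero-mean Gaussian errors
epsilon <- rnorm(nrow(X), mean = 0, sd = 1)

# Response generated as y = X * alpha + epsilon
y <- as.numeric(X %*% alpha + epsilon)

# One row per observation, one column per fixed effect
dim(X)
colnames(X)
```

Each column of `X` corresponds to one entry of `alpha`, so the intercept is simply the fixed effect attached to a column of ones.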

Shortcomings

Linear regression has several shortcomings. Firstly, it assumes a linear relationship between the response and explanatory variable(s), which might not hold for some data. Even if the relationship is linear, outliers in the response variable can have a disproportionate influence on the fixed effects, which might then be misleading. Secondly, like many other statistical models, linear regression does not handle missing data well, and some preprocessing might be needed to ensure reliable results. Moreover, when too many explanatory variables are included relative to the sample size, linear regression is prone to overfitting, which reduces how well the fit generalises. Explanatory variables also enter the model only additively, meaning that the model cannot capture more complex relationships between the response and explanatory variable(s). Furthermore, the errors are assumed to be Gaussian (see Section A.9) with constant variance. Deviations from this can affect the confidence intervals, and thus the hypothesis tests that depend on this assumption.
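The influence of a single outlier can be illustrated with a small simulation. This is only a sketch on hypothetical data (the variables `x`, `y`, and the size of the perturbation are assumptions, not taken from the BCVA data): one response value at the edge of the `x` range is shifted, and the slope is compared before and after.

```r
# Hypothetical data following a linear relationship
set.seed(42)
n <- 50
x <- seq(1, 10, length.out = n)
y <- 1 + 2 * x + rnorm(n, sd = 1)

# Fit without the outlier
fit_clean <- stats::lm(y ~ x)

# Shift the last response far away from the line
y_out <- y
y_out[n] <- y_out[n] + 100

# Refit with the outlier included
fit_out <- stats::lm(y_out ~ x)

# Compare the estimated slopes
stats::coef(fit_clean)["x"]
stats::coef(fit_out)["x"]

# Cook's distance is one common diagnostic for influential observations
influential <- which.max(stats::cooks.distance(fit_out))
influential
```

Because the perturbed observation sits at the largest `x` value, it has high leverage, and the slope shifts noticeably even though the other 49 observations are unchanged.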

Example 5.1  

## Packages
library(dplyr)
library(mmrm)

# Data
mmrm::bcva_data %>%
  dplyr::rename_with(toupper) %>%
  dplyr::slice_head(n = 15)

## Packages
library(tidyverse)
library(mmrm)

# Data
data <- mmrm::bcva_data %>%
  dplyr::rename_with(toupper)

# Model
lm_fit <- stats::lm(BCVA_CHG ~ VISITN + BCVA_BL, data = data)

## Coefficients
broom::tidy(lm_fit, conf.int = TRUE, conf.level = 0.95)

# Plot
data %>%
  ggplot2::ggplot(aes(x = VISITN, y = BCVA_CHG)) +
  ggplot2::geom_point() +
  ggplot2::labs(x = "Visit Number", y = "Change in BCVA") +
  ggplot2::geom_line(
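    # Predict on the full data so the result has one value per row,
    # even if lm() dropped rows with missing values when fitting
    data = data %>%
      dplyr::mutate(predicted = stats::predict(lm_fit, newdata = data)),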
    aes(x = VISITN, y = predicted),
    color = "blue"
  ) +
  ggplot2::theme_minimal()

/* SAS: the same model fitted with PROC GLM */
PROC GLM data = data;
    model BCVA_CHG = VISITN BCVA_BL / solution;
run;

5.1 Estimation

The estimation of the fixed effects \(\alpha\) is commonly done using the method of least squares, which minimizes the sum of squared residuals (the differences between observed and predicted values).
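
When \(X\) has full column rank, the least squares estimate has the closed form \(\hat{\alpha} = (X^\top X)^{-1} X^\top y\). The sketch below checks this against `stats::lm()` on simulated data. The variables `x1`, `x2`, `y` and the coefficient values are hypothetical and are not the BCVA data used in Example 5.1.

```r
# Hypothetical simulated data
set.seed(1)
n <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- 2 + 0.5 * x1 - 1.5 * x2 + rnorm(n, sd = 0.3)

# Design matrix with an explicit intercept column
X <- cbind(Intercept = 1, x1 = x1, x2 = x2)

# Least squares estimate: solve the normal equations (X'X) alpha = X'y
alpha_hat <- solve(crossprod(X), crossprod(X, y))

# The same model fitted with stats::lm()
lm_fit_sim <- stats::lm(y ~ x1 + x2)

# The two sets of estimates should agree up to numerical tolerance
all.equal(as.numeric(alpha_hat), unname(stats::coef(lm_fit_sim)))

# Sum of squared residuals at the least squares solution
sum((y - X %*% alpha_hat)^2)
```

Solving the normal equations directly is fine for illustration. `stats::lm()` uses a QR decomposition instead, which is numerically more stable when the columns of \(X\) are nearly collinear.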