# Packages
library(dplyr)
library(survival)
library(survminer)
# Data
## Get data
data(cancer, package = "survival")
## Transform and select data
data_f <- lung %>%
  # Drop covariates that are not used in this example
  dplyr::select(-c(ph.ecog, ph.karno, pat.karno, meal.cal, wt.loss)) %>%
  # In the lung data, sex is coded 1 = male, 2 = female
  dplyr::mutate(sex = if_else(sex == 1, "Male", "Female")) %>%
  dplyr::mutate(dplyr::across(c(inst, status, sex), as.factor))
# Model
# * In the lung data, status == 2 marks an event (death) and status == 1 marks censoring
KM_fit <- survival::survfit(survival::Surv(time, status == 2) ~ 1, data = data_f)
## Survival table
summary(KM_fit)
# Plot
survminer::ggsurvplot(KM_fit, conf.int = TRUE, data = data_f)
12 Kaplan-Meier Estimator
The Kaplan-Meier estimator is a non-parametric statistic used to estimate the survival function (see Chapter 8) from time-to-event data. It has become one of the most widely used methods in survival analysis due to its ability to handle censored data (see Chapter 10) without making assumptions about the underlying survival distribution.
A key advantage of the Kaplan-Meier estimator is that it can incorporate right-censored observations (see Section 10.2), that is, individuals who had not yet experienced the event when their follow-up ended.
The Kaplan-Meier estimate of the survival function at time \(t\) is given by: \[ \begin{aligned} \hat{S}(t) &= \prod_{t_i \leq t} \left(1 - \frac{d_{t_i}}{n_{t_i}}\right) \end{aligned} \tag{12.1}\] where \(t_i\) ranges over the distinct observed event times, \(d_{t_i} > 0\) is the number of events that occurred at time \(t_i\), and \(n_{t_i} \geq d_{t_i}\) is the number of individuals at risk just prior to time \(t_i\).
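The estimator is constructed as a product of conditional probabilities of surviving each event time, given survival up to just before that time. At each event time \(t_i\), the probability of surviving that time point is estimated as \(1 - d_{t_i}/n_{t_i}\), that is, one minus the estimated discrete hazard \(d_{t_i}/n_{t_i}\) at that time. Censored individuals contribute to \(n_{t_i}\) for as long as they are under observation and simply leave the risk set afterwards.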
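To make the product in Equation 12.1 concrete, the following is a minimal base-R sketch that computes it by hand. The eight observations are hypothetical toy data chosen for illustration (they are not taken from the lung dataset), and the variable names are arbitrary.

```r
# Hypothetical toy data: follow-up times and event indicators
# (1 = event observed, 0 = right-censored)
time  <- c(2, 3, 3, 5, 7, 8, 8, 10)
event <- c(1, 1, 0, 1, 0, 1, 1, 0)

# Distinct times at which at least one event was observed
event_times <- sort(unique(time[event == 1]))

# d: number of events at each event time
d <- sapply(event_times, function(t) sum(time == t & event == 1))
# n: number at risk just before each event time; an observation censored
# at time t still counts as at risk at t
n <- sapply(event_times, function(t) sum(time >= t))

# Running product of the conditional survival probabilities (Equation 12.1)
S_hat <- cumprod(1 - d / n)

data.frame(time = event_times, n_risk = n, n_event = d, survival = S_hat)
# survival column: 0.875, 0.750, 0.600, 0.200
```

The censored observations at times 3, 7 and 10 never produce a drop in the curve; they only reduce the size of the risk set at later event times. If the survival package is available, `summary(survival::survfit(survival::Surv(time, event) ~ 1))` should report the same survival values.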
Properties
The Kaplan-Meier estimator has several important properties, most of which are also properties of the survival curve. Firstly, \(\hat{S}(0) = 1\): all individuals are event-free at the start. Secondly, \(\hat{S}(t)\) is a right-continuous step function that decreases only at observed event times. This implies that it is monotonically non-increasing: \(\hat{S}(t_2) \leq \hat{S}(t_1)\) for all \(t_2 > t_1\).
Under appropriate conditions, most importantly independent censoring (see Assumptions), \(\hat{S}(t)\) is a consistent estimator: it converges to the true survival function \(S(t)\) as the sample size increases.
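The shape properties (starting value, right-continuous steps, monotonicity) can be checked numerically. The sketch below assumes a small set of hypothetical event times and survival values, and uses base R's `stepfun()` to turn them into a function of \(t\); it does not check consistency, which is an asymptotic statement.

```r
# Hypothetical event times and Kaplan-Meier values, chosen for illustration
event_times <- c(2, 3, 5, 8)
S_hat       <- c(0.875, 0.75, 0.6, 0.2)

# stepfun() with its default right = FALSE is right-continuous: the value is
# 1 before the first event time and S_hat[i] on [t_i, t_{i+1})
S <- stepfun(event_times, c(1, S_hat))

S(0)    # 1: everyone is event-free at the start
S(4.9)  # 0.75: the curve is flat between event times
S(5)    # 0.6: the drop takes effect at the event time itself

# Evaluated on a grid, the curve never increases
grid <- seq(0, 10, by = 0.5)
all(diff(S(grid)) <= 0)  # TRUE
```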
Assumptions
The validity of the Kaplan-Meier estimator relies on several key assumptions:
12.0.1 Independent Censoring
The most critical assumption is that censoring is independent of survival time. Formally, the censoring time \(C\) and the event time \(T\) must be independent. This means that censored individuals should have the same survival prospects as those who remain under observation. Violations of this assumption can lead to biased estimates. For example, if patients with worse prognoses are more likely to drop out of a study, the Kaplan-Meier curve will overestimate survival.
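The dropout example can be illustrated with a small simulation. This is a sketch under stated assumptions, not a general result: the two risk groups, all rates, the evaluation time `t0`, and the helper `km_at()` are hypothetical choices made for illustration. Event and censoring times are exponential, and in the second scenario the high-risk group is censored much earlier than the low-risk group.

```r
# Hypothetical helper: Kaplan-Meier estimate of S(t0) from Equation 12.1
km_at <- function(time, event, t0) {
  is_event    <- as.logical(event)
  event_times <- sort(unique(time[is_event & time <= t0]))
  d <- sapply(event_times, function(t) sum(time == t & is_event))
  n <- sapply(event_times, function(t) sum(time >= t))
  prod(1 - d / n)
}

set.seed(1)  # arbitrary seed, for reproducibility only
n_obs <- 5000

# Assumed population: half high-risk (event rate 0.2), half low-risk (0.05)
high_risk  <- rbinom(n_obs, 1, 0.5) == 1
event_time <- rexp(n_obs, rate = ifelse(high_risk, 0.2, 0.05))

# Scenario A: censoring time is independent of the event time
cens_indep <- rexp(n_obs, rate = 0.1)
# Scenario B: high-risk individuals drop out sooner (dependent censoring)
cens_dep   <- rexp(n_obs, rate = ifelse(high_risk, 0.2, 0.01))

t0     <- 10
true_S <- 0.5 * exp(-0.2 * t0) + 0.5 * exp(-0.05 * t0)  # mixture of the two groups

km_indep <- km_at(pmin(event_time, cens_indep), event_time <= cens_indep, t0)
km_dep   <- km_at(pmin(event_time, cens_dep),   event_time <= cens_dep,   t0)

round(c(true = true_S, independent = km_indep, dependent = km_dep), 3)
```

Under these assumptions the estimate from scenario A should land close to the true value, while the estimate from scenario B should sit noticeably above it: the high-risk individuals leave the risk set early, so the remaining sample looks healthier than the population it came from.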
12.0.2 Non-informative Censoring
Related to independent censoring, the censoring mechanism should be non-informative about the event of interest. The probability of being censored at time \(t\) should not depend on the event time \(T\), given the observed covariates.
12.0.3 No Loss to Follow-up
Ideally, all individuals should be followed until they experience the event or reach the end of the study period. While the Kaplan-Meier method can handle censoring, excessive loss to follow-up can reduce precision and potentially introduce bias.
Example 12.1 (Kaplan-Meier estimator)