Objective
“Survival analysis concerns analyzing the time to the occurrence of an event” (Cleves et al. 2010)
Survival analysis examines the factors that drive a specific event, i.e. those that cause it to happen earlier or later.
Examples
- How long does it take to form a government after an election? How long do governments manage to stay in office? → termination
- Which factors determine the length of wars? → duration
Running example: termination of a cabinet (main variable: duration, in days), distinguishing technical from non-technical cabinet terminations.
Central Concepts
- $T$ is a random variable for the survival time (random in the sense of being unknown)
- $t$ is a specific time value
Overview:
| Function name | Abbreviation | Description | Example |
|---|---|---|---|
| Probability Density Function | $f(t)$ | Instantaneous likelihood of failure at a precise time $t$ | → 10% chance that a government collapses exactly after 5 years |
| Cumulative Distribution Function | $F(t)$ | How many failures are expected within an interval of time | → 30% have collapsed by year 5 |
| Survivor Function | $S(t)$ | Probability of no failure event up to time $t$ | → 70% of governments are still in power by year 5 |
| Hazard Function, Hazard Rate | $h(t)$ | Instantaneous failure rate at time $t$, given survival until $t$ | → For governments that have survived up to year 5, the chance that they'll collapse during the next year is 8% (meaning that 8% of governments fail in that year) |
| Cumulative Hazard Function | $H(t)$ | Accumulated risk of failure up to time $t$ | → 2 governments fail every 5 years |
Probability Density Function
Describes how likely the event is to happen at a precise point in time $t$.
Cumulative Distribution Function
The probability that the event has occurred by time $t$, i.e. the share of failures expected within the interval from 0 to $t$: $F(t) = P(T \le t)$.
Survival Function
The probability that the event hasn't occurred by time $t$ (i.e. the probability of survival past a certain time point).
The function is the complement of the cumulative distribution function: $S(t) = 1 - F(t)$.
The survival function starts at 1 (no failure has occurred yet) and never increases over time; it typically approaches 0 as time goes on. In other words, the chance of still surviving gets smaller as time goes on.
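The relationship between the cumulative distribution function and the survival function can be checked numerically. The following is a minimal sketch that assumes, purely for illustration, exponentially distributed survival times with a rate of 0.1 per year; neither the distribution nor the rate comes from the text above.

```r
# Assumption for illustration: survival times follow an exponential
# distribution with a constant failure rate of 0.1 per year.
rate <- 0.1
t <- c(0, 1, 5, 10, 50)

F_t <- pexp(t, rate = rate)  # cumulative distribution function F(t) = P(T <= t)
S_t <- 1 - F_t               # survivor function S(t) = P(T > t)

# S(t) starts at 1 and shrinks as t grows
print(data.frame(t = t, F_t = round(F_t, 3), S_t = round(S_t, 3)))
```

The two columns always add up to 1, which is what "complement" means here.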
Hazard Function
The hazard function indicates the immediate risk of the event happening at a particular point in time (assuming it hasn't happened yet). It can be thought of as the "rate of failure at time $t$".
While the survival function tells you the probability of "living longer than time $t$," the hazard function tells you the "intensity" of the risk right at that moment.
The function only considers cases that are still "at risk" (those who haven't experienced the event yet). It is measured in "events per unit of time".
Warning
The Hazard Function is not a probability, but a rate. Its value can range from zero to infinity.
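That the hazard is a rate rather than a probability can be seen from the identity $h(t) = f(t) / S(t)$. The sketch below uses a Weibull distribution with shape 2 and scale 1, chosen only as an illustrative assumption because its hazard grows with time and exceeds 1.

```r
# Assumption for illustration: Weibull-distributed survival times
# (shape = 2, scale = 1). For this choice the hazard equals 2 * t.
t <- c(0.25, 0.5, 1, 2)

f_t <- dweibull(t, shape = 2, scale = 1)                      # density f(t)
S_t <- pweibull(t, shape = 2, scale = 1, lower.tail = FALSE)  # survivor function S(t)
h_t <- f_t / S_t                                              # hazard h(t) = f(t) / S(t)

# The hazard is not bounded by 1, unlike a probability
print(data.frame(t = t, hazard = h_t))
```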
Baseline hazard
The value of the hazard function when all covariates are 0 (analogous to the intercept in regression models). Think of it as the baseline risk before considering any individual characteristics (like age, treatment, or other predictors).
Cumulative Hazard Function
- "Total amount of risk that has been accumulated up to time $t$"
- Equals the number of events that can be expected to happen until time point $t$
- Assumes that events are repeatable
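The cumulative hazard is tied to the hazard and the survivor function by standard identities. As a sketch, using the conventional symbols $h$, $H$ and $S$ (assumed notation) and the overview's example value of 70% of governments still in power by year 5:

$$
\begin{aligned}
H(t) &= \int_0^t h(u)\,du = -\log S(t) \\
S(t) &= \exp\{-H(t)\} \\
S(5) &= 0.7 \;\Rightarrow\; H(5) = -\log(0.7) \approx 0.36
\end{aligned}
$$

Read as an expected count, a value of about 0.36 would mean roughly 0.36 failures per government over the first five years if failures were repeatable.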
Example
In the example of cabinet failures, a rate of 0.25 per day would mean that, on average, a cabinet fails in one of the observed countries every fourth day.
Censoring
As the dataset used for survival analysis can’t cover an unlimited period of time, the failure of some cases happens outside of the observation period.
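In practice, censoring is usually recorded with two columns: the observed time and an indicator of whether the failure was actually seen. A minimal sketch for right censoring, with invented durations and an assumed observation window of 36 months:

```r
# Hypothetical true durations (in months) of five cabinets; illustrative values only.
true_duration <- c(8, 12, 24, 40, 55)
window_end <- 36  # assumed end of the observation period

# What the analyst can record: time is capped at the end of the window
time <- pmin(true_duration, window_end)
# 1 = failure observed inside the window, 0 = right-censored
event <- as.integer(true_duration <= window_end)

print(data.frame(time, event))
```

The last two cabinets are still in office when observation stops, so only a lower bound on their duration is known.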
Partial vs. full censoring:
Left vs. right censoring:
Example
Type of Models
There are different ways to model survival. The difference between the model types lies in the assumptions they make about (a) the baseline hazard and (b) the effect the covariates have on the hazard function.
| Model type | Assumption about baseline hazard $h_0(t)$ | Assumption about covariates |
|---|---|---|
| 1. Parametric models | ✅ | ✅ |
| 2. Non-parametric models | ❌ | ❌ (includes no covariates) |
| 3. Semi-parametric models (Cox Regression) | ❌ | ✅ |
1. Parametric Models
Parametric models make assumptions about both the baseline hazard and the effects of the covariates on the hazard rate.
2. Nonparametric Models
These models make no assumptions about the baseline hazard (for example that the function has a specific shape) and do not include any covariates. They are simply based on the existing data.
Kaplan-Meier Estimator
The Kaplan-Meier Estimator estimates the survivor function (the probability of “surviving” past certain time points).
Formula:

$$\hat{S}(t) = \prod_{j:\, t_j \le t} \frac{n_j - d_j}{n_j}$$

- $\hat{S}(t)$: Probability of surviving past time $t$. The hat on the $S$ indicates that this is an estimate
- $d_j$: Number of units that fail at time $t_j$ (technically a point in time, not an interval of time)
- $n_j$: Number of units that are at risk at time $t_j$
- The fractions for all event times are multiplied (indicated by the product symbol $\prod$)
- $j: t_j \le t$ means "multiply over all time points where an event occurred, up to and including time $t$"
- or more simply "multiply across all event times up to time $t$"
Characteristics:
- Accounts for censoring
- Step function (only changes at event times and stays flat between events)
Example
Our survival analysis looks at four governments over a time period of 36 months.
| Government | Observation | Reason |
|---|---|---|
| A | Fails in month 8 | no-confidence vote |
| B | Fails in month 12 | coalition breakdown |
| C | Fails in month 24 | corruption scandal |
| D | Censored at month 36 | didn't fail within the observation period |

After 8 months (when government A fails):

$$\hat{S}(8) = \frac{4 - 1}{4} \times 1 = 0.75$$
Calculation:
- Numerator: We take the number of units at risk (all four governments because none has failed before) and subtract the number of failures (only 1). This leaves us with 3 governments that have survived until this point.
- Denominator: The number of units at risk (so again all 4 governments)
- The fraction is then multiplied with the previous survival probability. Here it is 1 because the survival probability at the beginning of the observation period is always 1.
Interpretation: an estimated 75% of governments survive past month 8.
After 12 months (when government B fails):

$$\hat{S}(12) = \frac{3 - 1}{3} \times 0.75 = 0.5$$
Calculation:
- We take the same steps as above but adjust the numbers. Now we have 3 governments at risk (because only 3 are still "alive") and subtract the 1 that has failed.
- We then multiply the fraction by the previous survival probability (0.75)
Interpretation: an estimated 50% of governments survive past month 12; half have failed by then.
This can be repeated for the other steps.
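The remaining steps can be reproduced with a few lines of base R. This is a sketch of the Kaplan-Meier calculation by hand for the four governments in the example; the variable names are chosen here for illustration.

```r
# The four governments from the example: A, B, C fail; D is censored at month 36.
gov <- data.frame(
  name  = c("A", "B", "C", "D"),
  time  = c(8, 12, 24, 36),
  event = c(1, 1, 1, 0)   # 1 = failure, 0 = censored
)

# Distinct times at which at least one failure occurs
event_times <- sort(unique(gov$time[gov$event == 1]))

# Units still at risk just before each event time, and failures at that time
n_risk  <- sapply(event_times, function(tj) sum(gov$time >= tj))
n_event <- sapply(event_times, function(tj) sum(gov$time == tj & gov$event == 1))

# Kaplan-Meier estimate: running product of (n_j - d_j) / n_j
km <- cumprod((n_risk - n_event) / n_risk)

print(data.frame(time = event_times, n_risk, n_event, survival = km))
```

Government D never enters the numerator as a failure, but it counts as "at risk" at every event time, which is how the estimator accounts for censoring.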
Log-rank tests
3. Semiparametric Models (Cox Proportional Hazard Model)
Semiparametric models, like the Cox regression, make no assumptions about the baseline hazard, but do make assumptions about the effects of the covariates.
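Formula:

$$h_i(t) = h_0(t)\exp(\beta_1 x_{i1} + \dots + \beta_k x_{ik}) = h_0(t)\exp(x_i\beta)$$

- $h_i(t)$ is the hazard for individual $i$ at time $t$
- Assumption about the functional form of the covariates' influence on $h_i(t)$
- No intercept is estimated (already contained in $h_0(t)$)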
Proportional Hazard Assumption
At any time $t$, individual $i$'s risk relative to individual $j$'s risk is given by the ratio of their covariate effects, and that ratio does not change over time. So if the risk for $i$ is twice as high as the risk for $j$, it is twice as high at all times.

$$\frac{h_i(t)}{h_j(t)} = \frac{h_0(t)\exp(x_i\beta)}{h_0(t)\exp(x_j\beta)} = \frac{\exp(x_i\beta)}{\exp(x_j\beta)}$$

- On the left: Risk of the event at time $t$ for individual $i$ relative to the risk for individual $j$
- On the right: Ratio of the covariate effects for individuals $i$ and $j$ ($x$ are the covariate values, $\beta$ are the effect sizes)
- The baseline hazard $h_0(t)$ cancels out → time disappears from the equation → the relative risk between $i$ and $j$ is constant over time
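The cancellation of the baseline hazard can be illustrated numerically. In the sketch below the baseline hazard, the coefficients and the covariate values are all invented for illustration; the point is only that the ratio of the two hazards is the same at every time.

```r
# Hypothetical, time-varying baseline hazard (any positive function would do)
h0 <- function(t) 0.01 + 0.002 * t

# Hypothetical effect sizes and covariate values for two individuals i and j
beta <- c(0.7, -0.3)
x_i  <- c(1, 2)
x_j  <- c(0, 2)

# Cox-type hazard: baseline hazard times exp(linear predictor)
hazard <- function(t, x) h0(t) * exp(sum(x * beta))

t <- c(1, 10, 100)
ratio <- sapply(t, function(s) hazard(s, x_i) / hazard(s, x_j))

# The hazards change with t, but their ratio stays at exp(0.7)
print(data.frame(t = t, ratio = ratio))
```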
Testing the assumption
If the explanatory variables do not interact with time $t$ at a statistically significant level, the assumption is adequate.
Hazard Ratios
The exponentiated coefficients $\exp(\beta)$ are hazard ratios. They show by what factor (multiplicative change!) the hazard rate changes if the independent variable increases by one unit.
Interpretation:
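| Hazard ratio | Interpretation |
|---|---|
| $\exp(\beta) = 1$ | Hazard rate remains constant |
| $\exp(\beta) > 1$ | Hazard rate increases |
| $\exp(\beta) < 1$ | Hazard rate decreases |

As a percentage change: $(\exp(\beta) - 1) \times 100\,\%$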
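Converting coefficients into hazard ratios and percentage changes is a one-liner each. The sketch below uses three made-up coefficients (the names `covariate_a` to `covariate_c` are hypothetical) and the usual conversion $(\exp(\beta) - 1) \times 100$.

```r
# Hypothetical Cox coefficients, for illustration only
beta <- c(covariate_a = 0.4, covariate_b = -0.5, covariate_c = 0)

hazard_ratio <- exp(beta)                # multiplicative change in the hazard
pct_change   <- (hazard_ratio - 1) * 100 # the same change expressed in percent

print(round(cbind(beta, hazard_ratio, pct_change), 2))
```

A positive coefficient gives a ratio above 1 (higher hazard, shorter expected survival), a negative one a ratio below 1, and a coefficient of 0 leaves the hazard unchanged.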
Implementation in R
Non-parametric models
1. Preparation
- Declare the data by creating `surv_object`, which includes the arguments `time` and `event`
  - If `failure = 1`, the value of `time` corresponds to the time of failure
  - If `failure = 0` → censored / end of the observation window
- Estimate the survivor function
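As a sketch of these two steps, the four-government example from the Kaplan-Meier section can be run through the `survival` package. It is an assumption here that `surv_object` is built with `survival::Surv()`; the data frame `toy` and its column names are invented for illustration and are not the `saalfeld08` data.

```r
library(survival)

# Toy data: governments A, B, C fail; D is censored at month 36
toy <- data.frame(
  time    = c(8, 12, 24, 36),
  failure = c(1, 1, 1, 0)   # 1 = failure observed, 0 = censored
)

# Step 1: declare the survival data (time plus event indicator)
surv_object <- Surv(time = toy$time, event = toy$failure)

# Step 2: estimate the survivor function without covariates (Kaplan-Meier)
survfit_object <- survfit(surv_object ~ 1, data = toy)

print(summary(survfit_object))
```

The `survival` column of the summary should match the values calculated by hand for months 8, 12 and 24.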
2. Estimating the survivor function
- Summary function can be used for all estimations
```
> survfit_object <- survival::survfit(surv_object ~ 1, data=saalfeld08)
> summary(survfit_object, censored=TRUE)
Call: survfit(formula = surv_object ~ 1, data = saalfeld08)

 time n.risk n.event survival std.err lower 95% CI upper 95% CI
    7    424       1    0.998 0.00236        0.993        1.000
    9    423       1    0.995 0.00333        0.989        1.000
   11    422       2    0.991 0.00469        0.981        1.000
   12    420       2    0.986 0.00574        0.975        0.997
   14    418       1    0.983 0.00619        0.971        0.996
```

- `time` indicates the time of cabinet termination in days (or the moment of censoring)
- `n.risk` shows the number of cabinets that are at risk at that point in time (the time given in the first column)
- `n.event` shows the number of cabinets that fail at that point in time
- `survival` indicates the probability of survival according to the Kaplan-Meier estimator (i.e. the result of the formula)
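Example for calculating `survival` for the first line: $(424 - 1) / 424 \approx 0.998$

For the second line: $(423 - 1) / 423 \times 0.998 \approx 0.995$ (the fraction multiplied by the previous survival probability)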
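The whole `survival` column can be rebuilt from `n.risk` and `n.event` in the same way. A short check in base R, with the numbers copied from the five rows of the output above:

```r
# Values copied from the summary output above
n_risk  <- c(424, 423, 422, 420, 418)
n_event <- c(1, 1, 2, 2, 1)

# Kaplan-Meier: running product of the per-row survival fractions
survival_km <- cumprod((n_risk - n_event) / n_risk)

print(round(survival_km, 3))
```

Rounded to three digits, the result agrees with the `survival` column reported by `summary()`.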
Graphical interpretation including stratification
Survivor functions for different categories / log-rank test
Example: Difference between minimum-winning coalitions and other cabinets
Tests distinctiveness for the entire survivor function (i.e. no overlap for the entire line)
Null hypothesis: both hazard functions are equal
```
> survival::survdiff(surv_object ~ mwc, data=saalfeld08)
Call:
survival::survdiff(formula = surv_object ~ mwc, data = saalfeld08)

n=415, 9 observations deleted due to missingness.

        N Observed Expected (O-E)^2/E (O-E)^2/V
mwc=0 290      170    138.8       7.0      18.3
mwc=1 125       59     90.2      10.8      18.3

 Chisq= 18.3  on 1 degrees of freedom, p= 2e-05
```

- `Observed` indicates the empirically observed failures (i.e. cabinet terminations)
- `Expected` shows the number of failures that would be expected if the two groups were identical
- Interpretation of `p`?
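Parts of this output can be reproduced by hand, which may help with reading it. The sketch below recomputes the `(O-E)^2/E` column and the p-value from the reported chi-square statistic; all numbers are copied from the output above, and the vector names are chosen here for illustration.

```r
# Observed and expected failures as reported by survdiff()
observed <- c(mwc0 = 170, mwc1 = 59)
expected <- c(mwc0 = 138.8, mwc1 = 90.2)

# Per-group contribution shown in the (O-E)^2/E column
contrib <- (observed - expected)^2 / expected
print(round(contrib, 1))

# p-value for the reported chi-square statistic with 1 degree of freedom
chisq   <- 18.3
p_value <- pchisq(chisq, df = 1, lower.tail = FALSE)
print(signif(p_value, 1))
```

A p-value this small suggests rejecting the null hypothesis that the two groups share the same hazard function. Cabinets with `mwc=1` show fewer failures than expected (59 vs. 90.2), those with `mwc=0` more (170 vs. 138.8).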
Estimating the hazard function
- `sest` is the estimated survival probability
- `hest` is the estimated hazard…?