Survival Analysis

Objective

“Survival analysis concerns analyzing the time to the occurrence of an event” (Cleves et al. 2010)

Survival analysis examines the factors that drive a specific event, i.e. those that cause it to happen earlier or later.

Examples

How long does it take to form a government after an election? How long do governments manage to stay in office? → termination

Which factors determine the length of wars? → duration

Termination of a cabinet (main variable: duration, in days), distinguishing technical from non-technical cabinet terminations

Central Concepts

$T$ is a random variable for the survival time (random in the sense of being unknown)
$t$ is a specific time value

Overview:

Function name	Abbreviation	Description	Example
Probability Density Function	$f(t)$	Instantaneous likelihood of failure at a precise time $t$ (a density, not a probability)	$f(5) = 0.1$ → the failure density at year 5 is 0.1 per year, i.e. roughly a 10% chance that a government collapses within a one-year window around year 5
Cumulative Distribution Function	$F(t)$	Probability that the failure has occurred by time $t$	$F(5) = 0.3$ → 30% have collapsed by year 5
Survivor Function	$S(t)$	Probability of no failure event up to time $t$	$S(5) = 0.7$ → 70% of governments are still in power at year 5
Hazard Function, Hazard Rate	$h(t)$	Instantaneous failure rate at time $t$, given survival until $t$	$h(5) = 0.08$ → governments that have survived up to year 5 collapse at a rate of 0.08 per year, i.e. roughly 8% of them fail during the following year
Cumulative Hazard Function	$H(t)$	Accumulated risk of failure up to time $t$	$H(5) = 2$ → if failure were repeatable, 2 failures per government would be expected by year 5

Probability Density Function $f (t)$

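Describes how likely the event is to happen around a precise point in time $t$. Because $T$ is continuous, $f(t)$ is a density rather than a probability: probabilities are obtained by integrating it over an interval of time.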

Cumulative Distribution Function $F (t)$

The probability that the event has occurred by time $t$: $F(t) = \Pr(T \leq t)$. In a population, this is the share of units expected to have failed by $t$.

Survival Function $S (t)$

The probability that the event hasn’t occurred by time $t$ (i.e. the probability of survival past a certain time point).

$$S(t) = 1 - F(t) = \Pr(T > t)$$

The function is the complement of the cumulative distribution function $F (t)$ .

The survival function starts at 1 (meaning no failure event has occurred yet) and never increases over time; it approaches 0 if every unit eventually fails. In other words, the chance of surviving gets smaller as time goes on.
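
The complement relationship can be checked numerically. The following is a minimal sketch that assumes exponentially distributed survival times with a rate of 0.07 per year; both the distribution and the rate are illustrative assumptions, not values from these notes.

```r
# Assumption: exponential survival times, rate 0.07 per year (illustrative)
rate <- 0.07
t <- c(0, 1, 5, 10, 50)

F_t <- pexp(t, rate = rate)                      # F(t) = Pr(T <= t)
S_t <- pexp(t, rate = rate, lower.tail = FALSE)  # S(t) = Pr(T > t)

# S(t) is the complement of F(t), starts at 1 and never increases
data.frame(t, F_t, S_t)
```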

Hazard Function $h (t)$

The hazard function indicates the immediate risk of the event happening at a particular point in time (assuming it hasn’t happened yet). It can be thought of as the “rate of failure at time $t$ ”.

While the survival function $S (t)$ tells you the probability of “living longer than time $t$ ,” the hazard function tells you the “intensity” of the risk right at that moment.

$$h(t) = \frac{f(t)}{S(t)} = \frac{\text{probability density function}}{\text{survivor function}}$$

The function only considers cases that are still “at risk” (those that haven’t experienced the event yet). It is measured in “events per unit of time”.

Warning

The Hazard Function is not a probability, but a rate. Its value can range from zero to infinity.
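
That the hazard is a rate rather than a probability is easiest to see with a value above 1. The sketch below assumes exponential survival times with a hypothetical rate of 2 per year and computes the hazard as the ratio of density to survivor function.

```r
# Assumption: exponential survival times, rate 2 per year (hypothetical)
rate <- 2
t <- c(0.1, 0.5, 1, 2)

f_t <- dexp(t, rate = rate)                      # density f(t)
S_t <- pexp(t, rate = rate, lower.tail = FALSE)  # survivor function S(t)
h_t <- f_t / S_t                                 # hazard h(t) = f(t) / S(t)

h_t  # constant at 2 events per year: above 1, so it cannot be a probability
```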

Baseline hazard $h_{0} (t)$

The value of the hazard function when all covariates are 0 (analogous to the intercept in regression models). Think of it as the baseline risk before considering any individual characteristics (like age, treatment, or other predictors).

Cumulative Hazard Function $H (t)$

“Total amount of risk that has been accumulated up to time $t$ ”
Equals the number of events that would be expected by time $t$ if the event were repeatable

Example

In the example of cabinet failures, $H (t) = 0.25$ would mean that, if terminations were repeatable, 0.25 terminations per cabinet would be expected by time $t$ .
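
The cumulative hazard is also tied to the survivor function. A short derivation, using only the definitions of $h(t)$ and $S(t)$ given above together with $f(t) = -S'(t)$:

$$h(t) = \frac{f(t)}{S(t)} = -\frac{S'(t)}{S(t)} = -\frac{d}{dt} \ln S(t)$$

$$H(t) = \int_{0}^{t} h(u) \, du = -\ln S(t) \quad \Longleftrightarrow \quad S(t) = \exp(-H(t))$$

For instance, $H(t) = 0.25$ corresponds to a survival probability of $S(t) = e^{-0.25} \approx 0.78$.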

Censoring

As the dataset used for survival analysis can’t cover an unlimited period of time, the failure of some cases happens outside of the observation period.
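
In a dataset, right-censoring is usually encoded with two columns: the observed time and an indicator of whether the event was actually seen. The sketch below uses made-up durations and a hypothetical observation window of 1000 days to show the idea.

```r
# Hypothetical data: true termination day of five cabinets and an
# observation window that closes on day 1000
true_duration <- c(120, 450, 980, 1300, 2100)
window_end <- 1000

# Observed time is the earlier of termination and end of observation
time <- pmin(true_duration, window_end)
# event = 1: termination observed; event = 0: right-censored
event <- as.integer(true_duration <= window_end)

data.frame(time, event)
```

The last two cabinets are recorded with a time of 1000 and event = 0: all that is known is that they survived at least that long.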

Partial vs. full censoring:

Left vs. right censoring:

Example

Type of Models

There are different ways to model survival. The difference between the types of models lies in the kind of assumptions they make about (a) the baseline hazard and (b) the effect the covariates have on the hazard function.

Model type	Assumption about $h_{0}$	Assumption about covariates
1. Parametric models	✅	✅
2. Non-parametric models	❌	❌ (includes no covariates)
3. Semi-parametric models (Cox Regression)	❌	✅

1. Parametric Models

Parametric models make assumptions about both the baseline hazard and the effects of the covariates on the hazard rate.

2. Nonparametric Models

These models make no assumptions about the baseline hazard (for example that the function has a specific shape) and do not include any covariates. They are simply based on the existing data.

Kaplan-Meier Estimator

The Kaplan-Meier Estimator estimates the survivor function $S (t)$ (the probability of “surviving” past certain time points).

Formula:

$$\hat{S}(t) = \prod_{m \mid t_{m} \leq t} \frac{n_{m} - d_{m}}{n_{m}}$$

$\hat{S} (t)$ Probability of surviving past time $t$ . The hat on the $S$ indicates that this is an estimate
$d_{m}$ Number of units that fail at time $t_{m}$ (technically a point in time, not an interval of time)
$n_{m}$ Number of units that are at risk just before time $t_{m}$
The fractions for all event times are multiplied (indicated by the product symbol $\prod$ )
- “multiply over all time points $t_{m}$ where an event occurred, up to and including time $t$ ”
- or more simply “multiply across all event times up to time $t$ ”

Characteristics:

Accounts for censoring
Step function (only changes at event times and stays flat between events)

Example

Our survival analysis looks at four governments over a time period of 36 months.

Government	Observation	Reason
A	Fails in month 8	no-confidence vote
B	Fails in month 12	coalition breakdown
C	Fails in month 24	corruption scandal
D	Censored at month 36	didn’t fail within the observation period

After 8 months (when government A fails):
$\hat{S} (8) = 1 * \frac{4 - 1}{4} = 0.75$
Calculation:

Numerator: We take the number of units at risk (all four governments because none has failed before) and subtract the number of failures (only 1). This leaves us with 3 governments that have survived until this point.

Denominator: The number of units at risk (so again all 4 governments)

The fraction is then multiplied by the previous survival probability. Here it is 1 because the survival probability at the beginning of the observation period is always 1.

Interpretation: An estimated 75% of governments survive beyond month 8.

After 12 months (when government B fails):
$\hat{S} (12) = 0.75 * \frac{3 - 1}{3} = 0.75 * \frac{2}{3} = 0.50$
Calculation:

We take the same steps as above but adjust the numbers. Now we have 3 governments at risk (because there are only 3 still “alive”) and subtract the 1 that has failed.

We then multiply the fraction by the previous probability (0.75)

Interpretation: An estimated half of the governments survive beyond month 12.

This can be repeated for the other steps.
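
The remaining steps can be carried out with a few lines of base R. The sketch below applies the product formula to the four governments of the example; `km_by_hand` is a hypothetical helper written for illustration, not a function from any package.

```r
# The four governments from the example
time  <- c(8, 12, 24, 36)   # months
event <- c(1, 1, 1, 0)      # 1 = failed, 0 = censored

# Hypothetical helper: Kaplan-Meier estimate at each distinct event time
km_by_hand <- function(time, event) {
  event_times <- sort(unique(time[event == 1]))
  # n_m: units still at risk just before each event time
  n_m <- sapply(event_times, function(tm) sum(time >= tm))
  # d_m: units failing exactly at each event time
  d_m <- sapply(event_times, function(tm) sum(time == tm & event == 1))
  data.frame(time = event_times, n.risk = n_m, n.event = d_m,
             survival = cumprod((n_m - d_m) / n_m))
}

km <- km_by_hand(time, event)
km
```

The estimate drops to 0.25 when government C fails in month 24. Government D is censored in month 36, so it counts as at risk throughout but causes no further drop.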

Log-rank tests

3. Semiparametric Models (Cox Proportional Hazard Model)

Semiparametric models, like the Cox regression, make no assumptions about the baseline hazard, but they do make assumptions about the effects of covariates.

Formula:

$$h_{i}(t \mid x_{i}) = h_{0}(t) \exp(x_{ik} \beta_{k})$$

$h_{i} (t ∣ x_{i})$ is the hazard for individual $i$ at time $t$
The model makes an assumption about the functional form of the covariates’ influence on $h_{i}$ (they shift the hazard multiplicatively)
No intercept is estimated (already contained in $h_{0} (t)$ )

Proportional Hazard Assumption

At any time $t$ , individual $i$ ’s risk relative to individual $j$ ’s risk is given by the ratio of their covariate effects, and that ratio does not change over time. So if the risk for $i$ is twice as high as the risk for $j$ , the risk is twice as high at all times.

$$\frac{h_{i}(t \mid x_{i})}{h_{j}(t \mid x_{j})} = \frac{h_{0}(t) \exp(x_{ik} \beta_{k})}{h_{0}(t) \exp(x_{jk} \beta_{k})} = \frac{\exp(x_{ik} \beta_{k})}{\exp(x_{jk} \beta_{k})}$$

On the left: Risk of the event at time $t$ for individual $i$ relative to the risk for individual $j$
On the right: Ratio of the covariate effects for individuals $i$ and $j$ ( $x$ are the covariate values, $\beta$ are the effect sizes)
The baseline hazard $h_{0}(t)$ cancels out → time disappears from the equation → the relative risk between $i$ and $j$ is constant over time
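
The cancellation can be made concrete with numbers. In the sketch below the baseline hazard, the coefficient and the covariate values are all hypothetical; the point is only that the ratio of the two hazards stays the same although the baseline hazard itself changes over time.

```r
# Hypothetical baseline hazard (rising over time) and hypothetical coefficient
h0   <- function(t) 0.01 * t
beta <- 0.7
x_i  <- 1   # covariate value of individual i
x_j  <- 0   # covariate value of individual j

t <- c(1, 10, 100)
h_i <- h0(t) * exp(x_i * beta)
h_j <- h0(t) * exp(x_j * beta)

h_i / h_j   # identical at every t: exp((x_i - x_j) * beta)
```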

Testing the assumption

If the explanatory variables do not interact with $t$ on a statistically significant level, the assumption is adequate.
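
In R, a commonly used check is `cox.zph()` from the survival package. Note that it relies on scaled Schoenfeld residuals, which is a related but different technique from adding explicit interactions with time. The sketch below uses the `lung` dataset that ships with the survival package as a stand-in, since the cabinet data is not reproduced here; the choice of covariates is arbitrary.

```r
library(survival)

# Stand-in data: the lung dataset shipped with the survival package
fit <- coxph(Surv(time, status) ~ age + sex, data = lung)

# Test based on scaled Schoenfeld residuals: one row per covariate plus a
# GLOBAL row; small p-values suggest that an effect changes over time
zph <- cox.zph(fit)
zph$table
```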

Hazard Ratios

$$\exp(\beta_{1}) = \frac{h_{0}(t) \exp((x_{1} + 1) \beta_{1} + \dots)}{h_{0}(t) \exp(x_{1} \beta_{1} + \dots)}$$

The exponentiated coefficients are hazard ratios. They show by what factor (multiplicative change!) the Hazard Rate changes if the independent variable $x_{1}$ increases by one unit.

Interpretation:

$\beta$	$e^{\beta}$	Effect on the hazard rate
$0$	$1$	Hazard rate remains constant
$> 0$	$> 1$	Hazard rate increases
$< 0$	$< 1$	Hazard rate decreases

As a percentage change:

$$(e^{\beta_{k}} - 1) \times 100$$
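
As a small worked illustration, the conversion from coefficients to hazard ratios and percentage changes looks as follows. The three coefficients are hypothetical values chosen to cover the positive, zero and negative case.

```r
# Hypothetical Cox coefficients
beta <- c(x1 = 0.5, x2 = 0, x3 = -0.3)

hazard_ratio <- exp(beta)                 # multiplicative change in the hazard
pct_change   <- (hazard_ratio - 1) * 100  # the same change in percent

round(cbind(beta, hazard_ratio, pct_change), 2)
```

A coefficient of 0.5 corresponds to a hazard that is about 65% higher per one-unit increase, a coefficient of -0.3 to a hazard that is about 26% lower.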

Implementation in R

Non-parametric models

1. Preparation

Declare the data as a survival object (here called surv_object), created with survival::Surv() from the arguments time and event
- If failure = 1 the value of time corresponds to the time of failure
- If failure = 0 → censored / end of window of observation
Estimate the survivor function
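
A minimal sketch of the preparation step, assuming the survival package is installed. The data frame `cabinets` and its values are a made-up stand-in for the cabinet data, which is not reproduced here.

```r
library(survival)

# Hypothetical stand-in data: duration in days and a failure indicator
cabinets <- data.frame(time    = c(7, 9, 11, 400, 1200),
                       failure = c(1, 1, 1, 0, 1))

# Combine time and event indicator into one survival object
surv_object <- Surv(time = cabinets$time, event = cabinets$failure)
surv_object   # censored observations are printed with a trailing "+"
```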

2. Estimating the survivor function

Summary function can be used for all estimations

> survfit_object <- survival::survfit(surv_object ~ 1, data=saalfeld08)
> summary(survfit_object, censored=TRUE)
Call: survfit(formula = surv_object ~ 1, data = saalfeld08)
 
 time n.risk n.event survival std.err lower 95% CI upper 95% CI
    7    424       1    0.998 0.00236        0.993        1.000
    9    423       1    0.995 0.00333        0.989        1.000
   11    422       2    0.991 0.00469        0.981        1.000
   12    420       2    0.986 0.00574        0.975        0.997
   14    418       1    0.983 0.00619        0.971        0.996

time indicates the time of cabinet termination in days (or the moment of censoring)
n.risk shows the number of cabinets that are at risk at that point in time (indicated in the first column)
n.event shows the number of cabinets that fail at that point in time (not cumulated over earlier rows)
survival indicates the probability of survival according to the Kaplan-Meier estimator (i.e. the result of the formula)

Example calculation for the first line (day 7)

$\hat{S}(7) = \frac{424 - 1}{424} = 0.998$

For the second line (day 9), (423 - 1)/423 is multiplied by the previous survival probability

$\hat{S}(9) = \frac{423 - 1}{423} * \frac{424 - 1}{424} = 0.995$
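
The whole survival column can be reproduced in the same way. The sketch below copies n.risk and n.event from the five rows of the summary output and applies the cumulative product.

```r
# n.risk and n.event copied from the five rows of the summary output
n_risk  <- c(424, 423, 422, 420, 418)
n_event <- c(1, 1, 2, 2, 1)

# Kaplan-Meier: running product of the per-time survival fractions
survival_km <- cumprod((n_risk - n_event) / n_risk)
round(survival_km, 3)   # 0.998 0.995 0.991 0.986 0.983
```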

Graphical interpretation including stratification

Survivor functions for different categories / log-rank test

Example: Difference between minimal winning coalitions (mwc = 1) and all other cabinets (mwc = 0)

Tests whether the survivor functions differ over the entire observation period (i.e. it compares the whole curves, not a single point in time)

Null hypothesis: both hazard functions are equal

> survival::survdiff(surv_object ~ mwc, data=saalfeld08)
Call:
survival::survdiff(formula = surv_object ~ mwc, data = saalfeld08)
 
n=415, 9 observations deleted due to missingness.
 
        N Observed Expected (O-E)^2/E (O-E)^2/V
mwc=0 290      170    138.8       7.0      18.3
mwc=1 125       59     90.2      10.8      18.3
 
 Chisq= 18.3  on 1 degrees of freedom, p= 2e-05

Observed indicates the empirically observed failures (i.e. cabinet termination)
Expected shows the number of failures that would be expected if the two groups were identical
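p is the p-value of the chi-squared test: with p = 2e-05, the null hypothesis of equal hazard functions is rejected at conventional significance levels. Cabinets with mwc=0 terminate more often than expected (170 observed vs. 138.8 expected), those with mwc=1 less often (59 vs. 90.2).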
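
Parts of the output can be checked by hand. The sketch below copies the observed and expected counts and the test statistic from the survdiff output, recomputes the (O-E)^2/E column and derives the p-value from a chi-squared distribution with 1 degree of freedom.

```r
# Values copied from the survdiff output
observed <- c(170, 59)
expected <- c(138.8, 90.2)

# The (O-E)^2/E column of the output
oe_col <- (observed - expected)^2 / expected
round(oe_col, 1)

# p-value of the reported test statistic (chi-squared, 1 degree of freedom)
p_value <- pchisq(18.3, df = 1, lower.tail = FALSE)
p_value
```

The resulting p-value is of the order of 2e-05, in line with the output.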

Estimating the hazard function

sest is the estimated survival probability
hest is the estimated hazard…?

Cédric's notes
