Objective

“Survival analysis concerns analyzing the time to the occurrence of an event” (Cleves et al. 2010)

Survival analysis examines the factors that drive a specific event, i.e. those that cause it to happen earlier or later.

Examples

  • How long does it take to form a government after an election? How long do governments manage to stay in office? → termination
  • Which factors determine the length of wars? → duration

Example used here: termination of a cabinet (main variable: duration, in days), distinguishing technical from non-technical cabinet terminations.

Central Concepts

  • $T$ is a random variable for the survival time (random in the sense of being unknown)
  • $t$ is a specific time value

Overview:

| Function name | Abbreviation | Description | Example |
|---|---|---|---|
| Probability Density Function | $f(t)$ | Instantaneous likelihood of failure at a precise time $t$ | → 10% chance that a government collapses exactly after 5 years |
| Cumulative Distribution Function | $F(t)$ | Probability that failure has occurred by time $t$ | → 30% have collapsed by year 5 |
| Survivor Function | $S(t)$ | Probability of no failure event up to time $t$ | → 70% of governments are still in power by year 5 |
| Hazard Function, Hazard Rate | $h(t)$ | Instantaneous failure rate at time $t$, given survival until $t$ | → For governments that have survived up to year 5, the chance that they’ll collapse during the next year is about 8% (meaning that about 8% of the governments still in office fail in that year) |
| Cumulative Hazard Function | $H(t)$ | Accumulated risk of failure up to time $t$ | → $H(5) = 2$: two failures would be expected by year 5 if failure were repeatable |

Probability Density Function

Describes how likely the event is to happen at a precise point in time $t$.

Cumulative Distribution Function

The probability that the event has occurred by time $t$, i.e. the share of failures expected within the interval from 0 to $t$: $F(t) = P(T \le t)$.

Survival Function

The probability that the event hasn’t occurred by time $t$ (i.e. the probability of survival past a certain time point).

The function is the complement of the cumulative distribution function: $S(t) = 1 - F(t)$.

The survival function starts at 1 (meaning no failure event has happened yet) and never increases over time; it falls toward 0 as $t$ grows. In other words, the chance of still surviving gets smaller as time goes on.

Hazard Function

The hazard function indicates the immediate risk of the event happening at a particular point in time (assuming it hasn’t happened yet). It can be thought of as the “rate of failure at time $t$”.

While the survival function tells you the probability of “living longer than time $t$”, the hazard function tells you the “intensity” of the risk right at that moment.

The function only considers cases that are still “at risk” (those that haven’t experienced the event yet). It is measured in “events per unit of time”.
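Written out in standard notation, the hazard is a conditional probability divided by the length of a shrinking time interval, which is why it is a rate and not a probability. A sketch, with $f(t)$ the density and $S(t)$ the survivor function:

```latex
h(t) = \lim_{\Delta t \to 0} \frac{P(t \le T < t + \Delta t \mid T \ge t)}{\Delta t} = \frac{f(t)}{S(t)}
```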

Warning

The Hazard Function is not a probability, but a rate. Its value can range from zero to infinity.

Baseline hazard

The value of the hazard function when all variables are 0 (analogous to the intercept in regression models). Think of it as the baseline risk before considering any individual characteristics (like age, treatment, or other predictors).

Cumulative Hazard Function

  • Total amount of risk that has been accumulated up to time $t$
  • Equals the number of events that can be expected to happen up to time point $t$
  • Assumes that events are repeatable
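The five functions are linked to each other. A minimal sketch in base R, assuming (purely for illustration) exponentially distributed survival times with a made-up rate of 0.2 failures per year; the variable names are hypothetical:

```r
# Hypothetical example: exponential survival times with a constant rate.
rate <- 0.2                        # assumed failure rate per year
t <- c(1, 5, 10)                   # time points of interest (years)

pdf_t    <- dexp(t, rate = rate)   # probability density f(t)
cdf_t    <- pexp(t, rate = rate)   # cumulative distribution F(t) = P(T <= t)
surv_t   <- 1 - cdf_t              # survivor function S(t) = 1 - F(t)
haz_t    <- pdf_t / surv_t         # hazard function h(t) = f(t) / S(t)
cumhaz_t <- -log(surv_t)           # cumulative hazard H(t) = -log S(t)

print(data.frame(t, pdf_t, cdf_t, surv_t, haz_t, cumhaz_t))
```

In this special case the hazard is the same at every time point, and the cumulative hazard grows past 1, which shows that it is not a probability either.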

Example

In the example of cabinet failures, would mean that a cabinet fails in one of the observed countries on every fourth day.

Censoring

As the dataset used for survival analysis can’t cover an unlimited period of time, the failure of some cases happens outside of the observation period.
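How an observation window turns durations into censored data can be sketched in base R. The durations below are invented; in real data the true duration of a censored case is unknown, and the variable names are hypothetical:

```r
# Hypothetical durations in months. The value for D is made up only to
# show the mechanics; in practice it would never be observed.
true_duration <- c(A = 8, B = 12, C = 24, D = 50)
window_end <- 36                                    # end of observation period

time    <- pmin(true_duration, window_end)          # what is actually observed
failure <- as.integer(true_duration <= window_end)  # 1 = failed, 0 = censored

print(data.frame(time, failure))
```

Case D is recorded with time 36 and failure 0: all that is known is that it survived at least until the end of the window.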

Partial vs. full censoring:

Left vs. right censoring:

Example

Type of Models

There are different ways to model survival. The model types differ in the assumptions they make about (a) the baseline hazard and (b) the effect the covariates have on the hazard function.

| Model type | Assumption about $h_0(t)$ | Assumption about covariates |
|---|---|---|
| 1. Parametric models | ✅ | ✅ |
| 2. Non-parametric models | ❌ | ❌ (includes no covariates) |
| 3. Semi-parametric models (Cox Regression) | ❌ | ✅ |

1. Parametric Models

Parametric models make assumptions about both the baseline hazard and the effects of the covariates on the hazard rate.
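As an illustration of what an assumption about the baseline hazard looks like: the two distributions below are common textbook choices, not ones named in these notes, and the symbols $\lambda > 0$ and $p > 0$ are introduced here only for the example. An exponential model assumes a constant baseline hazard, while a Weibull model lets it rise ($p > 1$) or fall ($p < 1$) monotonically over time:

```latex
\text{Exponential: } h_0(t) = \lambda
\qquad
\text{Weibull: } h_0(t) = \lambda\, p\, t^{p-1}
```

With $p = 1$ the Weibull hazard reduces to the exponential one.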

2. Nonparametric Models

These models make no assumptions about the baseline hazard (for example that the function has a specific shape) and do not include any covariates. They are simply based on the existing data.

Kaplan-Meier Estimator

The Kaplan-Meier Estimator estimates the survivor function (the probability of “surviving” past certain time points).

Formula:

$$\hat{S}(t) = \prod_{j:\, t_j \le t} \frac{n_j - d_j}{n_j}$$

  • $\hat{S}(t)$: Probability of surviving past time $t$. The hat on the $S$ indicates that this is an estimate
  • $d_j$: Number of units that fail at time $t_j$ (technically a point in time, not an interval of time)
  • $n_j$: Number of units that are at risk at time $t_j$
  • The fractions for all event times are multiplied (indicated by the product symbol $\prod$)
    • “multiply over all time points $t_j$ where an event occurred, up to and including time $t$”
    • or more simply “multiply across all event times up to time $t$”

Characteristics:

  • Accounts for censoring
  • Step function (only changes at event times and stays flat between events)

Example

Our survival analysis looks at four governments over a time period of 36 months.

| Government | Observation | Reason |
|---|---|---|
| A | Fails in month 8 | no-confidence vote |
| B | Fails in month 12 | coalition breakdown |
| C | Fails in month 24 | corruption scandal |
| D | Censored at month 36 | didn’t fail within the observation period |

After 8 months (when government A fails):

Calculation: $\hat{S}(8) = 1 \times \frac{4 - 1}{4} = 0.75$

  • Numerator: We take the number of units at risk (all four governments because none has failed before) and subtract the number of failures (only 1). This leaves us with 3 governments that have survived until this point.
  • Denominator: The number of units at risk (so again all 4 governments)
  • The fraction is then multiplied by the previous survival probability. Here it is 1 because the survival probability at the beginning of the observation period is always 1.

Interpretation: An estimated 75% of governments survive beyond month 8.

After 12 months (when government B fails):

Calculation: $\hat{S}(12) = 0.75 \times \frac{3 - 1}{3} = 0.5$

  • We take the same steps as above but adjust the numbers. Now we have 3 governments at risk (because only 3 are still “alive”) and subtract the 1 that has failed.
  • We then multiply the fraction by the previous probability (0.75)

Interpretation: An estimated half of the governments survive beyond month 12 (the other half have failed by then).

This can be repeated for the other steps.
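The remaining steps can be sketched in base R, using the four governments from the example. The variable names are hypothetical:

```r
# Data from the example: time in months, failure = 1 if the government
# failed, 0 if it was censored.
time    <- c(A = 8, B = 12, C = 24, D = 36)
failure <- c(A = 1, B = 1,  C = 1,  D = 0)

event_times <- sort(unique(time[failure == 1]))

# n_j: units still at risk just before each event time
n_risk  <- sapply(event_times, function(tj) sum(time >= tj))
# d_j: units failing exactly at each event time
n_event <- sapply(event_times, function(tj) sum(time == tj & failure == 1))

# Kaplan-Meier: running product of (n_j - d_j) / n_j
km <- cumprod((n_risk - n_event) / n_risk)

print(data.frame(month = event_times, n_risk, n_event, survival = km))
```

Government D is censored at month 36, so it counts in the risk set at every event time but never triggers a step of its own.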

Log-rank tests

3. Semiparametric Models (Cox Proportional Hazard Model)

Semiparametric models, like the Cox regression, make no assumptions about the baseline hazard, but they do make assumptions about the effects of the covariates.

Formula:

$$h_i(t) = h_0(t) \exp(x_i \beta)$$

  • $h_i(t)$ is the hazard for individual $i$ at time $t$
  • $\exp(x_i \beta)$: Assumption about the functional form of the covariates’ influence on $h_i(t)$
  • No intercept is estimated (already contained in $h_0(t)$)

Proportional Hazard Assumption

At any time $t$, individual $i$’s risk relative to individual $j$’s risk is given by the ratio of their covariate effects, and that ratio does not change over time. So if the risk for $i$ is twice as high as the risk for $j$, it is twice as high at all times.

$$\frac{h_i(t)}{h_j(t)} = \frac{h_0(t) \exp(x_i \beta)}{h_0(t) \exp(x_j \beta)} = \frac{\exp(x_i \beta)}{\exp(x_j \beta)}$$

  • On the left: Risk of the event at time $t$ for individual $i$ relative to the risk for individual $j$
  • On the right: Ratio of the covariate effects for individuals $i$ and $j$ ($x_i$ and $x_j$ are the covariate values, $\beta$ are the effect sizes)
  • The baseline hazard $h_0(t)$ is canceled out → time disappears from the equation → the relative risk between $i$ and $j$ is constant over time
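A small numeric sketch in base R of why the ratio is constant. The baseline hazard, the coefficient and the covariate values are all invented for illustration:

```r
# Hypothetical ingredients, chosen only for illustration.
h0   <- function(t) 0.01 * t     # assumed baseline hazard, rising over time
beta <- log(2)                   # assumed coefficient
x_i  <- 1                        # covariate value for individual i
x_j  <- 0                        # covariate value for individual j

t <- c(1, 10, 100)
h_i <- h0(t) * exp(x_i * beta)   # hazard of i at each time point
h_j <- h0(t) * exp(x_j * beta)   # hazard of j at each time point

print(h_i / h_j)                 # the same ratio at every t
```

Both hazards grow with $t$, but their ratio stays at $\exp(\beta)$ because the baseline hazard appears in numerator and denominator.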

Testing the assumption

If the explanatory variables do not interact with time $t$ at a statistically significant level, the assumption is considered adequate.
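A related check in R, sketched under assumptions: survival::cox.zph() tests the assumption using scaled Schoenfeld residuals, so it is a different diagnostic from the interaction test described above, not an implementation of it. The lung dataset that ships with the survival package stands in for the notes' own data:

```r
library(survival)

# 'lung' ships with the survival package; it is only a stand-in dataset.
fit <- coxph(Surv(time, status) ~ age + sex, data = lung)

# Test of the proportional hazards assumption for each covariate and
# globally, based on scaled Schoenfeld residuals.
ph_test <- cox.zph(fit)
print(ph_test)
```

Small p-values in the output point to a violation of the assumption for the corresponding covariate.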

Hazard Ratios

The exponentiated coefficients $\exp(\beta)$ are hazard ratios. They show by what factor (multiplicative change!) the hazard rate changes if the independent variable increases by one unit.

Interpretation:

  • $\exp(\beta) = 1$: Hazard rate remains constant
  • $\exp(\beta) > 1$: Hazard rate increases
  • $\exp(\beta) < 1$: Hazard rate decreases

As a percentage change: $(\exp(\beta) - 1) \times 100\,\%$
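The conversion from coefficient to hazard ratio to percentage change, sketched in base R with invented coefficients (none of them are estimates from the data used here):

```r
# Hypothetical Cox coefficients, for illustration only.
beta <- c(no_effect = 0, raises_risk = 0.5, lowers_risk = -0.5)

hazard_ratio <- exp(beta)                  # multiplicative change in the hazard
pct_change   <- (hazard_ratio - 1) * 100   # the same effect as a percentage

print(round(data.frame(beta, hazard_ratio, pct_change), 2))
```

Coefficients of equal size and opposite sign do not give symmetric percentage changes, because the effect is multiplicative.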

Implementation in R

Non-parametric models

1. Preparation

  • Declare the data with survival::Surv() and store the result in surv_object; the function takes the arguments time and event
    • If failure = 1 the value of time corresponds to the time of failure
    • If failure = 0 → censored / end of window of observation
  • Estimate the survivor function
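A minimal sketch of both steps on a tiny invented dataset. The data frame cabinets and its column names are hypothetical stand-ins, since the column names of the real dataset are not shown here:

```r
library(survival)

# Hypothetical stand-in for the real dataset: four cabinets, time in months.
cabinets <- data.frame(time = c(8, 12, 24, 36), failure = c(1, 1, 1, 0))

# Declare the survival data: duration plus event indicator.
surv_object <- Surv(time = cabinets$time, event = cabinets$failure)
print(surv_object)   # censored observations are marked with "+"

# Kaplan-Meier estimate without covariates.
survfit_object <- survfit(surv_object ~ 1, data = cabinets)
print(summary(survfit_object, censored = TRUE))
```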

2. Estimating the survivor function

  • Summary function can be used for all estimations
> survfit_object <- survival::survfit(surv_object ~ 1, data=saalfeld08)
> summary(survfit_object, censored=TRUE)
Call: survfit(formula = surv_object ~ 1, data = saalfeld08)
 
 time n.risk n.event survival std.err lower 95% CI upper 95% CI
    7    424       1    0.998 0.00236        0.993        1.000
    9    423       1    0.995 0.00333        0.989        1.000
   11    422       2    0.991 0.00469        0.981        1.000
   12    420       2    0.986 0.00574        0.975        0.997
   14    418       1    0.983 0.00619        0.971        0.996
  • time indicates the time of cabinet termination in days (or the moment of censoring)
  • n.risk shows the number of cabinets that are still at risk just before that point in time (the time given in the first column)
  • n.event shows the number of cabinets that fail at that point in time (not cumulatively)
  • survival indicates the probability of survival according to the Kaplan-Meier estimator (i.e. the result of the formula)

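Example for calculating the first line: $\hat{S}(7) = \frac{424 - 1}{424} \approx 0.998$

For the second line, the new fraction is multiplied by the previous survival probability: $\hat{S}(9) = \frac{424 - 1}{424} \times \frac{423 - 1}{423} \approx 0.995$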
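The survival column can be recomputed from the n.risk and n.event columns. A sketch in base R, with the values copied from the first five rows of the output above:

```r
# Values copied from the first five rows of the summary() output.
n_risk  <- c(424, 423, 422, 420, 418)
n_event <- c(1, 1, 2, 2, 1)

# Running product of (n.risk - n.event) / n.risk
survival <- cumprod((n_risk - n_event) / n_risk)
print(round(survival, 3))
```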

Graphical interpretation including stratification

Survivor functions for different categories / log-rank test

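Example: Difference between minimal winning coalitions and other cabinets (variable mwc)

Tests whether the groups differ over the entire survivor function (i.e. the curves are compared along their whole length, not just at a single point in time)

Null hypothesis: the survivor functions (and hence the hazard functions) of both groups are equal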

> survival::survdiff(surv_object ~ mwc, data=saalfeld08)
Call:
survival::survdiff(formula = surv_object ~ mwc, data = saalfeld08)
 
n=415, 9 observations deleted due to missingness.
 
        N Observed Expected (O-E)^2/E (O-E)^2/V
mwc=0 290      170    138.8       7.0      18.3
mwc=1 125       59     90.2      10.8      18.3
 
 Chisq= 18.3  on 1 degrees of freedom, p= 2e-05 
  • Observed indicates the empirically observed failures (i.e. cabinet termination)

  • Expected shows the number of failures that would be expected if the two groups were identical

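  • Interpretation of p: p = 2e-05 is far below 0.05, so the null hypothesis that both groups share the same survivor function is rejected. Cabinets with mwc = 1 show fewer terminations than expected (59 observed vs. 90.2 expected), cabinets with mwc = 0 show more (170 vs. 138.8).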
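Two numbers in the survdiff() output can be checked by hand. A sketch in base R, with the values copied from the output above and hypothetical variable names:

```r
# Values copied from the survdiff() output.
observed <- c(mwc0 = 170, mwc1 = 59)
expected <- c(mwc0 = 138.8, mwc1 = 90.2)

# Per-group contribution shown in the (O-E)^2/E column.
print(round((observed - expected)^2 / expected, 1))

# p-value for the reported chi-square statistic with 1 degree of freedom.
p_value <- pchisq(18.3, df = 1, lower.tail = FALSE)
print(signif(p_value, 1))
```

The reported test statistic of 18.3 comes from the (O-E)^2/V column, which uses the variance of O-E, so it is not the sum of the two (O-E)^2/E values.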

Estimating the hazard function

  • sest is the estimated survival probability
  • hest is the estimated hazard…?
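How sest and hest are computed is not shown here. One common approach, sketched in base R as an assumption and not as the method used in these notes, takes the share of at-risk units failing at each event time as a discrete hazard and sums it up to the Nelson-Aalen cumulative hazard. The numbers come from the four-government example; the variable names are hypothetical:

```r
# Four-government example: event times in months, risk sets and failures.
event_times <- c(8, 12, 24)
n_risk      <- c(4, 3, 2)
n_event     <- c(1, 1, 1)

km_survival <- cumprod((n_risk - n_event) / n_risk)  # Kaplan-Meier survival
hazard_step <- n_event / n_risk                      # discrete hazard per event time
cum_hazard  <- cumsum(hazard_step)                   # Nelson-Aalen cumulative hazard

print(data.frame(event_times, km_survival, hazard_step, cum_hazard))
```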