Logistic regression models have a binary or categorical variable as their dependent variable.
Types of logistic regression models
| Model | Dependent variable | Independent variables | Example |
|---|---|---|---|
| Binary Logit | Binary → two outcomes | Individual-level | Does a person vote or not? |
| Multinomial Logit | Unordered categorical → three or more outcomes | Individual-level | What is the probability a person votes for the Greens (compared to other parties)? |
| Conditional Logit | Choice among alternatives → dependent on attributes of alternatives | Alternative-level (+ individual-level) | Why does a voter choose one candidate over another based on candidate attributes? |
Binary Logit
Binary logistic regression models have a binary dependent variable, coded 0 (no) and 1 (yes).
The goal is:
- to estimate how likely it is that the variable is true (i.e. takes the value 1).
- to see how each of the independent variables affects that probability
Example
- A person voted in the last election
- A person voted by mail in the last election
- A person voted for a specific party, e.g. the Greens
Why can’t you use a linear (OLS) regression for these kinds of questions?
Because it violates some of the OLS assumptions and therefore leads to incorrect results.
- OLS can predict probabilities that are below 0 and above 1 which doesn’t make sense given the dependent variable can only take two values
- The homoskedasticity assumption is violated (the variance of a binary outcome depends on its mean)
- Linear relationship is inappropriate for binary outcomes
Model formula:
$$\Pr(y_i = 1 \mid x_i) = \frac{\exp(x_i'\beta)}{1 + \exp(x_i'\beta)}$$
- $\Pr(y_i = 1 \mid x_i)$: The model predicts the probability for observation $i$ that the dependent variable equals 1. Example: the probability that a person voted in an election.
- $x_i'\beta = \beta_0 + \beta_1 x_{i1} + \dots + \beta_k x_{ik}$ is a linear predictor. It can take any real value.
| Logistic function | Log odds (logit) |
|---|---|
| Probability between 0 and 1 | Can take any real value (-∞ to +∞) |
| Non-linear | Linear in predictors |
| Interpreted as expected probability | Interpreted as ratio of success/failure odds |
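The two columns of the table are linked by a pair of inverse transformations. A minimal sketch in Python (the value 0.71 is just an illustrative linear predictor, not from any real model):

```python
import math

def logistic(eta):
    """Map a linear predictor (any real value) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))

def logit(p):
    """Map a probability in (0, 1) back to the log-odds scale."""
    return math.log(p / (1.0 - p))

# The two functions are inverses of each other:
eta = 0.71
p = logistic(eta)
print(round(p, 3))         # a probability between 0 and 1
print(round(logit(p), 2))  # recovers the original linear predictor, 0.71
```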
Multinomial Logit
Multinomial logistic regression models have a categorical dependent variable (e.g. which party someone votes for in an election).
Whereas the binary model only has two outcomes, the multinomial model can have more than two outcomes.
Model formula
Probability for the alternatives:
$$\Pr(y_i = m \mid x_i) = \frac{\exp(x_i'\beta_m)}{1 + \sum_{j \neq b} \exp(x_i'\beta_j)}$$
The model gives the probability for the different alternatives $m$ compared to a reference alternative $b$.
- $\Pr(y_i = m \mid x_i)$ = Probability that observation $i$ chooses category $m$, conditional on a set of predictors
- $m$ = Alternatives of the outcome category
- $b$ = Reference alternative
- $x_i$ = Vector of predictors for observation $i$
- $x_i'\beta_m$ = Linear predictor (one for each category, each time with its own coefficients, scaled in log-odds)
- The denominator normalises the probabilities. It sums the exponentiated linear predictors over all categories, including the reference category. This way the resulting probabilities are positive and sum up to 1.
- The 1 represents the reference category, since $\exp(0) = 1$.
- The sum $\sum_{j \neq b} \exp(x_i'\beta_j)$ covers the linear predictors of all other categories.
Probability for the reference category:
$$\Pr(y_i = b \mid x_i) = \frac{1}{1 + \sum_{j \neq b} \exp(x_i'\beta_j)}$$
For the numerator: the linear predictor of the reference category is fixed at 0. As it is exponentiated like the other predictors, it takes the value 1 ($\exp(0) = 1$).
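The normalisation can be made concrete with a short sketch. The linear-predictor values below are made up purely for illustration; the point is that the "1" in the denominator stands in for the reference category:

```python
import math

# Hypothetical linear predictors x_i'beta_m for two non-reference categories;
# the reference category's linear predictor is fixed at 0, so exp(0) = 1.
eta = {"Greens": 0.8, "SPD": 0.3}

denom = 1.0 + sum(math.exp(v) for v in eta.values())  # the "1" is the reference

probs = {m: math.exp(v) / denom for m, v in eta.items()}
probs["reference"] = 1.0 / denom

print(probs)
print(sum(probs.values()))  # all positive and summing to 1
```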
Conditional Logit
A conditional logit models choices as the result of utility maximisation.
The goal is to examine how individual-specific characteristics and alternative-specific characteristics influence the probability of choosing an alternative $j$ from a set of alternatives $J$.
$$U_{ij} = \alpha_j + x_i'\gamma_j + z_{ij}'\beta$$
- $U_{ij}$ = the expected utility for individual $i$ choosing alternative $j$
- $\alpha_j$ = an alternative-specific intercept (each choice can have its own baseline attractiveness)
- $x_i'\gamma_j$ = a vector of individual-specific characteristics (age, income, etc.) with their alternative-specific coefficients $\gamma_j$ → tells us how the individual characteristics affect the utility of alternative $j$
- $z_{ij}'\beta$ = a vector of alternative-specific characteristics that vary by individual. This term captures the effect of characteristics that depend both on the alternative and the individual
Example
Suppose a voter is choosing among three candidates in an election.
- Alternative-specific intercept: Things like the general popularity of the candidates
- Individual characteristics: Age and income of the voter
- Because each candidate might appeal differently to people of different ages and with different incomes, these characteristics have alternative-specific coefficients.
- Alternative-specific characteristics: Campaign-spending per voter
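The example above can be sketched numerically. All intercepts, coefficients, and attribute values below are invented for illustration; the sketch only shows how utilities for three candidates translate into choice probabilities:

```python
import math

# Conditional-logit sketch: U_ij = alpha_j + gamma_j * age_i + beta * spend_ij.
# Candidate "A" serves as the reference (intercept and age coefficient 0).
alpha = {"A": 0.0, "B": 0.5, "C": -0.2}        # alternative-specific intercepts
gamma_age = {"A": 0.0, "B": -0.02, "C": 0.01}  # alt-specific coefficients on voter age
beta_spend = 0.3                               # generic coefficient on campaign spending

age = 40                                       # individual-specific characteristic
spend = {"A": 1.0, "B": 2.0, "C": 0.5}         # spending per voter (varies by alternative)

U = {j: alpha[j] + gamma_age[j] * age + beta_spend * spend[j] for j in alpha}
denom = sum(math.exp(u) for u in U.values())
probs = {j: math.exp(U[j]) / denom for j in U}
print(probs)  # choice probabilities, summing to 1
```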
Difference between Multinomial and Conditional Logit
The core difference between the two models is that the Conditional Logit allows one to specify alternative-specific characteristics, such as ideological distances (which vary across alternatives and individuals) in addition to individual-specific characteristics such as age.
Interpretations of coefficients
- Odds scale → log-odds, odds, odds ratio
- Probability scale → predicted probabilities, marginal effects
Log odds (logits)
The coefficients of the linear predictor take the form of log odds (logarithmised odds).
Formula:
$$\ln\left(\frac{\Pr(y = 1)}{1 - \Pr(y = 1)}\right) = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k$$
Interpretation:
“If $x_k$ increases by 1 unit, the log-odds of $y$ change by $\beta_k$, holding all other variables constant”
Because they are logarithmised, they are hard to interpret substantively. Only the direction of the effect (positive / negative sign) can be interpreted directly.
The advantage is that their value does not depend on the levels of the independent variables.
Example
“For every unit increase in the year of schooling, the log-odds of going to vote (versus non-voting) increase by 0.71”
The log-odds are additive (as in a linear model).
Multinomial logit models
The interpretation of the log odds always has to be relative to the specified base alternative $b$.
Interpretation:
“For a one-unit change in $x_k$, the logit of outcome $m$ versus outcome $b$ is expected to change by $\beta_{mk}$ units, holding all other variables constant”
Conditional logit models
Odds
The odds are the ratio of two probabilities:
- the probability that $y$ is true (takes the value 1)
- the probability that $y$ is false (takes the value 0)
$$\text{odds} = \frac{\Pr(y = 1)}{\Pr(y = 0)} = \frac{\Pr(y = 1)}{1 - \Pr(y = 1)}$$
Percentage change in odds:
$$\%\Delta\,\text{odds} = 100 \cdot \left(e^{\beta_k \cdot \delta} - 1\right)$$
- $\beta_k$: Coefficient of the predictor (from the logistic regression output)
- $\delta$: Size of change in units of the predictor
Hint
Odds can only be determined for specific observations, as you need information on all covariates.
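A short numerical sketch of both formulas (the probability 0.8 and the coefficient 0.71 are invented for illustration):

```python
import math

# Odds as the ratio of the two probabilities:
p = 0.8                      # P(y = 1)
odds = p / (1 - p)           # 0.8 / 0.2 = 4: "yes" is four times as likely as "no"
print(odds)

# Percentage change in the odds for a change of delta units in a predictor
# with a hypothetical coefficient beta:
beta, delta = 0.71, 1
pct_change = 100 * (math.exp(beta * delta) - 1)
print(round(pct_change, 1))  # the odds roughly double
```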
Multinomial Models → Relative Risks
For multinomial models, the odds (also called relative risks) are the ratio of:
- the risk / probability of outcome $m$
- the risk / probability of the base outcome $b$
Odds Ratios
The odds ratio compares two odds at different levels. It usually refers to the change in odds when the value of the independent variable changes by 1 unit.
To get the odds ratios, the log odds (the coefficients) are exponentiated: $OR = e^{\beta}$. Odds ratios are multiplicative in the parameters.
Interpretation:
- $e^{\beta} > 1$: the odds increase (they are $e^{\beta}$ times larger)
- $e^{\beta} = 1$: the odds do not change
- $e^{\beta} < 1$: the odds decrease (they are $1/e^{\beta}$ times smaller)
Odds ratios can also be interpreted as percentage changes:
If $x$ changes by $\delta$ units, the odds change by a factor (important, because it indicates that the change is multiplicative instead of additive) of $e^{\beta \cdot \delta}$.
Example
We use a binary logistic regression model to predict the probability that a person goes to vote. The independent variable we use is age.
The logistic regression output gives 0.05 as the coefficient for age ($\beta_{\text{age}} = 0.05$).
The odds ratio gives us the increase in odds (not in probability!) for one additional unit of age: $e^{0.05} \approx 1.051$.
The same can directly be calculated as a percentage change: $100 \cdot (e^{0.05} - 1) \approx 5.1\,\%$.
If we are interested in the change in odds for multiple years, we can simply change the factor: for ten additional years, $e^{0.05 \cdot 10} = e^{0.5} \approx 1.649$.
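The example can be checked directly (the coefficient 0.05 is the one from the example above):

```python
import math

beta_age = 0.05  # coefficient for age from the example regression output

# Odds ratio for one additional year of age:
or_1 = math.exp(beta_age)
print(round(or_1, 3))        # ≈ 1.051 → odds increase by about 5.1 %

# The same as a percentage change:
print(round(100 * (math.exp(beta_age) - 1), 1))  # ≈ 5.1

# Change for ten additional years (multiplicative, not additive):
or_10 = math.exp(beta_age * 10)
print(round(or_10, 3))       # ≈ 1.649, not 10 * 1.051
```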
Multinomial Models → Relative Risk Ratio
For multinomial models, the exponentiated coefficients are called relative risk ratios; they are interpreted relative to the base outcome $b$.
Predicted probabilities
It is also possible to calculate $\Pr(y = 1)$ for specific values of the predictors using the probability formula.
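A minimal sketch of this calculation, using hypothetical coefficients (intercept and age) from a fitted binary logit:

```python
import math

# Hypothetical fitted coefficients for illustration only:
b0, b_age = -2.0, 0.05
age = 50  # the specific value we are interested in

eta = b0 + b_age * age            # linear predictor: -2.0 + 0.05 * 50 = 0.5
p = 1.0 / (1.0 + math.exp(-eta))  # logistic transformation into a probability
print(round(p, 3))                # predicted probability of voting at age 50
```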
Marginal effects
- A marginal effect shows how much the probability changes when a predictor changes slightly.
- In a non-linear model, the effect depends on the values of the predictors.
- Since each observation has different predictor values, the marginal effect can be different for each individual.
Example
In a linear regression model with the independent variable “age”, the effect of that variable on the dependent variable is the same whether a person is 20 or 50 years old. The coefficient stays the same across all values of $x$.
In a logistic regression model the effect of the coefficients is not linear. For example, the effect of age on whether a person votes or not is different at age 20 than it is at age 50.
Types of marginal effects:
| Type | Where the effect is evaluated | Interpretation |
|---|---|---|
| Average Marginal Effects | Average of the individual marginal effects over all observations | Population-average effect (on average for everyone) |
| Marginal Effects at the Means | Marginal effect at the mean values of all predictors (independent variables) | Typical-case effect (for a single, representative individual) |
| Marginal Effects at a specific (representative) value | Marginal effect at specified values for all predictors (you choose your own values that you think are representative) | Effect for selected cases (e.g. male / female or low / high income) |
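The difference between the first two types can be made concrete with a numerical sketch. The coefficients and the four ages are invented for illustration; the marginal effect of a predictor in a binary logit is $\beta_k \cdot p \cdot (1 - p)$:

```python
import math

def logistic(eta):
    return 1.0 / (1.0 + math.exp(-eta))

# Hypothetical fitted binary logit with a single predictor (age):
b0, b1 = -2.0, 0.05
ages = [20, 35, 50, 65]

def me(age):
    """Marginal effect of age at a given age: dP/dx = b1 * p * (1 - p)."""
    p = logistic(b0 + b1 * age)
    return b1 * p * (1 - p)

ame = sum(me(a) for a in ages) / len(ages)  # Average Marginal Effect
mem = me(sum(ages) / len(ages))             # Marginal Effect at the Mean
print(round(ame, 4), round(mem, 4))         # 0.0106 0.0125 — the two differ
```

Because the model is non-linear, averaging the individual effects (AME) and evaluating the effect at the average person (MEM) generally give different numbers.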
Margin plots
You basically plot the probabilities that the event occurs (e.g. that a person votes) for each level of one of the explanatory variables while holding the other variables in the model at a specific value.
Estimation method: Maximum likelihood estimation
Logistic regression models are estimated using maximum likelihood estimation. The estimates from this method are the values of the model parameters that have the highest likelihood of generating the observed sample of the data.
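The principle can be illustrated with a toy case. For a sample of 0/1 outcomes and no predictors, the parameter with the highest likelihood of generating the data is simply the sample share of 1s; a crude grid search (illustrative only, real software uses iterative methods such as Fisher scoring) finds exactly that:

```python
import math

# Toy sample of binary outcomes (7 ones, 3 zeros → sample mean 0.7):
y = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]

def loglik(p):
    """Log-likelihood of an intercept-only model with P(y = 1) = p."""
    return sum(math.log(p) if yi == 1 else math.log(1 - p) for yi in y)

# Grid search over candidate probabilities:
grid = [i / 1000 for i in range(1, 1000)]
p_hat = max(grid, key=loglik)
print(p_hat)  # 0.7 — the sample mean, as ML theory predicts
```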
Measures of model quality
McFadden's Pseudo-R²
$$R^2_{\text{McFadden}} = 1 - \frac{\ln L_1}{\ln L_0}$$
The resulting value (which can range from 0 to 1) indicates by how much the log-likelihood improves when including the independent variables.
It basically compares the estimated model with a null model (i.e. one without independent variables) and tells us how much more probable the data is compared to the empty model.
- $\ln L_1$: Log-likelihood of the estimated model
- $\ln L_0$: Log-likelihood of the model without independent variables (i.e. one in which all coefficients are 0, except the constant term)
Hint
The log-likelihood $\ln L$ is the natural logarithm of the likelihood function $L$.
The values of McFadden's Pseudo-R² range from 0 to 1. Values between 0.1 and 0.2 already indicate a relatively good fit (unlike with the R² for OLS models).
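A quick calculation using the deviances from the `glm()` output in the Application in R section of these notes (null deviance 1893.8, residual deviance 1779.9); the deviance reported by R is $-2 \ln L$, so it converts directly to log-likelihoods:

```python
# Deviance = -2 * log-likelihood, so divide by -2 to recover ln L:
ll_null = -1893.8 / 2   # from the null deviance in the glm() output
ll_full = -1779.9 / 2   # from the residual deviance in the glm() output

pseudo_r2 = 1 - ll_full / ll_null
print(round(pseudo_r2, 3))  # ≈ 0.06 → a modest improvement over the null model
```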
Likelihood-Ratio-Test
The Likelihood-Ratio-Test asks whether the model fits the data significantly better than a null model (i.e. a model without predictors).
$$G^2 = 2 \cdot (\ln L_1 - \ln L_0)$$
Again:
- $\ln L_1$: Log-likelihood of the estimated model
- $\ln L_0$: Log-likelihood of the model without independent variables (i.e. one in which all coefficients are 0, except the constant term)
Values:
- $G^2$ is measured in log-likelihood units
- $G^2$ close to 0 → the additional variables add little to nothing
- $G^2$ larger than 0 → the additional variables improve the fit
Test hypothesis:
- $H_0$: The additional parameters are (altogether) 0
- $H_1$: The additional parameters are (altogether) different from 0
$G^2$ follows a chi-square distribution. Small $p$-values (< 0.05) → $H_0$ can be rejected, which means that the additional variables (altogether) improve the fit of the model. The test is the equivalent of the F-test for OLS models.
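With the deviances from the `glm()` output in the Application in R section of these notes, the test statistic is simply the drop in deviance, since the deviance is $-2 \ln L$:

```python
# G² = 2 * (ln L1 - ln L0) = null deviance - residual deviance:
g2 = 1893.8 - 1779.9   # deviances from the glm() output below
df = 3                 # three added predictors: frau, alter, demo
print(round(g2, 1))    # 113.9

# The chi-square critical value with 3 degrees of freedom at the 5 % level
# is about 7.81, so G² = 113.9 lies far in the rejection region → reject H0.
```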
Application in R
Binomial Logit Model
Command:
`glm()` function (base R, stats package)
logit_model <- glm(enth ~ frau + alter + demo,
data = clean_data_btw09,
family = binomial(link="logit"))
summary(logit_model)
Output:
Call:
glm(formula = enth ~ frau + alter + demo, family = binomial(link = "logit"),
data = clean_data_btw09)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.600236 0.187318 -3.204 0.00135 **
frau 0.097672 0.120252 0.812 0.41666
alter -0.014744 0.003423 -4.307 1.66e-05 ***
demo -1.423570 0.169070 -8.420 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1893.8 on 2103 degrees of freedom
Residual deviance: 1779.9 on 2100 degrees of freedom
AIC: 1787.9
Number of Fisher Scoring iterations: 5
Marginal Effects
Plot:
Conditional Logit Model
Generic specification (can be seen from the distance variable and the estimated coefficients)
> summary(clogit_model_gles25)
Call:
mclogit(formula = cbind(chosen, ID) ~ asc_spd + asc_fdp + asc_gruene +
asc_linke + asc_afd + asc_bsw + disimmi + voter_satisfaction_demo_spd +
voter_satisfaction_demo_fdp + voter_satisfaction_demo_gruene +
voter_satisfaction_demo_linke + voter_satisfaction_demo_afd +
voter_satisfaction_demo_bsw + voter_age_spd + voter_age_fdp +
voter_age_gruene + voter_age_linke + voter_age_afd + voter_age_bsw,
data = gles_25_LONG)
Estimate Std. Error z value Pr(>|z|)
asc_spd -0.630705 0.451182 -1.398 0.162144
asc_fdp -1.434503 0.565712 -2.536 0.011221 *
asc_gruene 1.708457 0.405892 4.209 2.56e-05 ***
asc_linke -0.038676 0.474478 -0.082 0.935033
asc_afd -4.971085 0.607000 -8.190 2.62e-16 ***
asc_bsw -4.642590 0.703062 -6.603 4.02e-11 ***
disimmi -0.427625 0.016412 -26.056 < 2e-16 ***
voter_satisfaction_demo_spd -0.324406 0.142732 -2.273 0.023036 *
voter_satisfaction_demo_fdp 0.487301 0.173556 2.808 0.004989 **
voter_satisfaction_demo_gruene -0.301205 0.132570 -2.272 0.023084 *
voter_satisfaction_demo_linke 0.496719 0.146618 3.388 0.000704 ***
voter_satisfaction_demo_afd 1.889413 0.164997 11.451 < 2e-16 ***
voter_satisfaction_demo_bsw 1.232663 0.178160 6.919 4.55e-12 ***
voter_age_spd 0.012460 0.005740 2.171 0.029940 *
voter_age_fdp -0.024976 0.007393 -3.378 0.000729 ***
voter_age_gruene -0.020292 0.005177 -3.920 8.87e-05 ***
voter_age_linke -0.033104 0.006206 -5.334 9.62e-08 ***
voter_age_afd -0.022761 0.006947 -3.276 0.001052 **
voter_age_bsw -0.001857 0.008337 -0.223 0.823779
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Null Deviance: 6815
Residual Deviance: 4349
Number of Fisher Scoring iterations: 6
Number of observations: 1751
The logit for disimmi of -0.427625 means that the estimated logarithmic chance (log-odds) of voting for any party, as compared to voting for the CDU/CSU, decreases by 0.427625 when disimmi (the distance between the individual's position on immigration and a party's position on this issue) increases by one unit, c.p. Because disimmi is a generic distance measure, its coefficient does not change across alternatives.
Example with an alternative-specific constant: The logit for asc_gruene (1.708457) means that the estimated log-odds of voting for the Gruene are 1.708457 larger than those of voting for the CDU/CSU, c.p.
Or: The alternative-specific constant for Gruene (1.708457) means that, when all other variables in the model are zero (or at their reference levels), the log-odds of choosing Gruene over the reference category (CDU/CSU) are 1.708457 higher.